From 601dba768fbcdfe2c56727b432521cf71b22e5f7 Mon Sep 17 00:00:00 2001 From: hathawayj Date: Wed, 3 Jan 2024 21:58:59 +0000 Subject: [PATCH] deploy: 72f53f0eda6bdc102e0751d4e0fd5415dede51ce --- index.html | 2 +- sitemap.xml | 2 +- slides/index.html | 2 +- slides/introduction/day01/index.html | 2 +- slides/introduction/day02/index.html | 2 +- slides/introduction/index.html | 4 +- slides/p2/d1/index.html | 36 - slides/p2/d1/index.xml | 1 - slides/p2/d2/cars.html | 3464 -------------------------- slides/p2/d2/index.html | 54 - slides/p2/d2/index.xml | 1 - slides/p2/d3/index.html | 23 - slides/p2/d3/index.xml | 1 - slides/p2/d4/index.html | 108 - slides/p2/d4/index.xml | 1 - slides/p2/index.html | 7 - slides/p2/index.xml | 1 - slides/p3/d1/index.html | 4 +- slides/p3/d2/index.html | 2 +- slides/p3/d3/index.html | 2 +- slides/p3/d4/index.html | 2 +- slides/p3/index.html | 2 +- slides/p4/d1/index.html | 2 +- slides/p4/d2/index.html | 2 +- slides/p4/d3/index.html | 2 +- slides/p4/d4/index.html | 2 +- slides/p4/index.html | 2 +- slides/p5/d1/index.html | 2 +- slides/p5/d2/index.html | 2 +- slides/p5/d3/index.html | 2 +- slides/p5/d4/index.html | 2 +- slides/p5/index.html | 2 +- slides/p6/d2/index.html | 2 +- slides/p6/d3/index.html | 2 +- slides/p6/d4/index.html | 2 +- slides/p6/index.html | 2 +- 36 files changed, 27 insertions(+), 3724 deletions(-) delete mode 100644 slides/p2/d1/index.html delete mode 100644 slides/p2/d1/index.xml delete mode 100644 slides/p2/d2/cars.html delete mode 100644 slides/p2/d2/index.html delete mode 100644 slides/p2/d2/index.xml delete mode 100644 slides/p2/d3/index.html delete mode 100644 slides/p2/d3/index.xml delete mode 100644 slides/p2/d4/index.html delete mode 100644 slides/p2/d4/index.xml delete mode 100644 slides/p2/index.html delete mode 100644 slides/p2/index.xml diff --git a/index.html b/index.html index 1b83cadc..ce076aec 100644 --- a/index.html +++ b/index.html @@ -3,4 +3,4 @@

CSE 250: Data Science Programming

Using pandas, Altair, scikit-learn, and NumPy to program with data

-
\ No newline at end of file +
\ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml index 0b4ee895..37192a6a 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -1 +1 @@ -https://byuistats.github.io/DS250-Cannon/slides/introduction/day02/2020-09-17T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/d4/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/d4/2020-09-01T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p2/d4/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d4/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p3/d4/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/syllabus/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/introduction/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/introduction/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/git_github_ds/pull_merge/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/introduction/day01/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/d3/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p2/d3/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p3/d3/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/d3/2020-10-01T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d3/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/pandas_altair/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-1/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/python-for-data-science/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/d2/202
0-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/d2/2020-09-01T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p3/d2/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d2/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p2/d2/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/json_missing/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-2/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p2/d1/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/d1/2020-10-01T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p3/d1/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d1/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/machine-learning/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-3/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/relational_data/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/ml_sklearn/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-4/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-5/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/sql-for-data-science/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p3/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/munging/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-6/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/vs-code/2020-09-15T10:42:26+06:00https://byuistats.g
ithub.io/DS250-Cannon/slides/p2/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/altair/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/git_github/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/markdown/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/quarto-for-data-science/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/git_github_ds/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/introduction/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/faq/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slack/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/2020-10-06T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/final_coding_challenge/sp22/https://byuistats.github.io/DS250-Cannon/categories/https://byuistats.github.io/DS250-Cannon/final_coding_challenge/https://byuistats.github.io/DS250-Cannon/contact/https://byuistats.github.io/DS250-Cannon/tags/ \ No newline at end of file 
+https://byuistats.github.io/DS250-Cannon/slides/introduction/day02/2020-09-17T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/d4/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/d4/2020-09-01T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d4/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p3/d4/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/syllabus/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/introduction/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/introduction/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/git_github_ds/pull_merge/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/introduction/day01/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/d3/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p3/d3/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/d3/2020-10-01T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d3/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/pandas_altair/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-1/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/python-for-data-science/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/d2/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/d2/2020-09-01T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p3/d2/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d2/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill
_builders/json_missing/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-2/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/d1/2020-10-01T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p3/d1/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d1/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/machine-learning/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-3/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/relational_data/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/ml_sklearn/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-4/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-5/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/sql-for-data-science/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p3/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/munging/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-6/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/vs-code/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/altair/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/git_github/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/markdown/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/quarto-for-data-science/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/git_github_ds/2020-09-15T10:42:26+06:00http
s://byuistats.github.io/DS250-Cannon/slides/introduction/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/faq/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slack/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/2020-10-06T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/final_coding_challenge/sp22/https://byuistats.github.io/DS250-Cannon/categories/https://byuistats.github.io/DS250-Cannon/final_coding_challenge/https://byuistats.github.io/DS250-Cannon/contact/https://byuistats.github.io/DS250-Cannon/tags/ \ No newline at end of file diff --git a/slides/index.html b/slides/index.html index 807282ae..d6bc9ff1 100644 --- a/slides/index.html +++ b/slides/index.html @@ -3,5 +3,5 @@

Slides

Use the navigation pane on the left to review the class slides.

\ No newline at end of file diff --git a/slides/introduction/day01/index.html b/slides/introduction/day01/index.html index 095dc7e3..031452f2 100644 --- a/slides/introduction/day01/index.html +++ b/slides/introduction/day01/index.html @@ -2,7 +2,7 @@
\ No newline at end of file diff --git a/slides/introduction/day02/index.html b/slides/introduction/day02/index.html index 660e48dd..523d860c 100644 --- a/slides/introduction/day02/index.html +++ b/slides/introduction/day02/index.html @@ -2,6 +2,6 @@
\ No newline at end of file diff --git a/slides/introduction/index.html b/slides/introduction/index.html index 7f21f55c..669f3a2d 100644 --- a/slides/introduction/index.html +++ b/slides/introduction/index.html @@ -2,6 +2,6 @@

J. Hathaway and BYU-I ©

\ No newline at end of file diff --git a/slides/p2/d1/index.html b/slides/p2/d1/index.html deleted file mode 100644 index bf0f7412..00000000 --- a/slides/p2/d1/index.html +++ /dev/null @@ -1,36 +0,0 @@ -Day 1: Intro to Flights Data

Day 1: Intro to Flights Data

Welcome to class!

Spiritual Thought

Short

Link

Project 1 Comments

  1. Don’t include data as a table. Only include tables that add useful information. If I have to scroll up and down it isn’t useful.
  2. Reports should be readable by an intelligent, but non-technical audience (Meaningful titles and section names)
  3. Make it like something you’d like to read
  4. Clean out any code output or logs that distract from the message (“My Useless Chart”)
  5. Eliminate “warnings”

Project 2: Late Flights and Missing Data

JSON files (JavaScript Object Notation)

Today, JSON is the de-facto standard for exchanging data between web and mobile clients and back-end services. source


What is JSON?
[
-  {
-    "car": "Mazda RX4",
-    "mpg": 21,
-    "cyl": 6,
-    "disp": 160,
-    "hp": 110,
-    "drat": 3.9,
-    "wt": 2.62,
-    "qsec": 16.46,
-    "vs": 0,
-    "am": 1,
-    "gear": 4,
-    "carb": 4
-  },
-  {
-    "car": "Mazda RX4 Wag",
-    "mpg": 21,
-    "cyl": 6,
-    "disp": 160,
-    "hp": 110,
-    "drat": 3.9,
-    "wt": 2.875,
-    "qsec": 17.02,
-    "am": 1,
-    "gear": 4,
-    "carb": 4
-  }
-]
-
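A minimal sketch of how records like the two above turn into rows, assuming a trimmed-down copy of the slide's JSON typed in by hand (only four keys are kept, and the variable names are made up for illustration). The second record has no "vs" key, so pandas fills that cell with NaN.

```python
import json
import pandas as pd

# Hand-typed, shortened copy of the two records on the slide.
# The second record has no "vs" key at all.
json_text = """
[
  {"car": "Mazda RX4", "mpg": 21, "cyl": 6, "vs": 0},
  {"car": "Mazda RX4 Wag", "mpg": 21, "cyl": 6}
]
"""

records = json.loads(json_text)      # a list of Python dicts, one per record
cars_small = pd.DataFrame(records)   # one row per record, one column per key

print(cars_small)
print(cars_small.isnull().sum())     # counts the missing cell in "vs"
```

Each JSON object becomes a row and each key becomes a column, which is why a key that is absent in one record shows up as a missing value rather than an error.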

Introduce the data

Load the JSON file and spend a few minutes studying it. Can you learn enough about it to describe the columns and rows?

Hints:

  • You can use .describe() to learn about the distribution of a numeric variable.
  • You can use .value_counts() to learn about the distribution of a categorical variable.
  • pd.crosstab() creates a “cross tabulation” of two or more categorical variables.
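The hints above can be tried on a tiny stand-in table first. This is only a sketch: the rows and values below are invented, and only the column names are borrowed from the flights data. Note that cross tabulation is called as the top-level function pd.crosstab() rather than as a DataFrame method.

```python
import pandas as pd

# Invented toy rows; not the real flights file.
flights_toy = pd.DataFrame({
    "airport_code": ["ATL", "ATL", "SLC", "SLC", "DEN", "DEN"],
    "month": ["January", "February", "January", "February", "January", "January"],
    "num_of_delays_total": [120, 95, 40, 35, 80, 60],
})

# Distribution of a numeric variable (count, mean, quartiles, ...)
numeric_summary = flights_toy["num_of_delays_total"].describe()

# Distribution of a categorical variable
month_counts = flights_toy["month"].value_counts()

# Cross tabulation of two categorical variables
airport_by_month = pd.crosstab(flights_toy["airport_code"], flights_toy["month"])

print(numeric_summary)
print(month_counts)
print(airport_by_month)
```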

Can you trust the data?

Do you notice anything interesting about the flights data?


Question Brainstorming

In your group, try to answer the following questions about your assigned question:

  • What is our goal?
  • How can we get there?
  • What will the answer look like when we’re done?



Project 2 FAQs

Not all missing data is represented as np.nan. For an example, look at the column that counts delays due to late aircraft.

We will learn how to identify and deal with missing data next week. For now, we can drop rows we don’t want using square brackets [] or .query().

  • num_of_delays_weather
  • num_of_delays_late_aircraft
  • num_of_delays_nas

J. Hathaway and BYU-I ©

\ No newline at end of file diff --git a/slides/p2/d1/index.xml b/slides/p2/d1/index.xml deleted file mode 100644 index 64d83ce4..00000000 --- a/slides/p2/d1/index.xml +++ /dev/null @@ -1 +0,0 @@ -Day 1: Intro to Flights Data on DS250https://byuistats.github.io/DS250-Cannon/slides/p2/d1/Recent content in Day 1: Intro to Flights Data on DS250Hugo -- gohugo.ioen-usJ. Hathaway and BYU-I ©Fri, 01 May 2020 11:02:05 +0600 \ No newline at end of file diff --git a/slides/p2/d2/cars.html b/slides/p2/d2/cars.html deleted file mode 100644 index cf12a545..00000000 --- a/slides/p2/d2/cars.html +++ /dev/null @@ -1,3464 +0,0 @@ - - - - - - - - - - -Cars: Team Work - - - - - - - - - - - - - - - - - - - - - - -
-
-
-

Cars: Team Work

-

Course DS 250

-
-
- - -
- -
-
Author
-
-

Paul Cannon

-
-
- - -
- - -
- -
- - - - -
-
-Show the code -
import pandas as pd   # to load and transform data
-import numpy as np    # for math/stat calculations
-
-# %%
-# from url to pandas dataframe
-url = "https://github.com/byuidatascience/data4missing/raw/master/data-raw/mtcars_missing/mtcars_missing.json" 
-cars = pd.read_json(url)
-
-
-
-

Group 1

-
-
-Show the code -
cyl_6 = (cars
-    .query("cyl == 6")
-    .sort_values(by="mpg", ascending=False)
-    )
-cyl_6
-
-car = (cars[cars['car'].str.contains('Hornet')].sort_values(['wt']))
-display(car)
-
-
- -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    car                mpg   cyl  disp   hp     drat  wt    qsec   vs   am  gear  carb
4   Hornet Sportabout  18.7  8    360.0  175.0  3.15  3.44  17.02  0.0  0   3     2
3   Hornet 4 Drive     21.4  6    258.0  110.0  3.08  NaN   19.44  1.0  0   3     1
-
-
-
-
-
-

Group 2

-
-
-Show the code -
cars2 = cars.filter(['car', 'cyl', 'qsec']).assign(
-    cyl_x_qsec = lambda x: x.cyl*x.qsec
-)
-
-# Using filter() to specify the columns you want
-cars_important = cars2.filter(items=['car', 'mpg', 'cyl', 'cyl_x_qsec']) # You pass the columns as a list
-display(cars_important)
-
-
- -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    car                  cyl  cyl_x_qsec
0   Mazda RX4            6    98.76
1   Mazda RX4 Wag        6    102.12
2   Datsun 710           4    74.44
3   Hornet 4 Drive       6    116.64
4   Hornet Sportabout    8    136.16
5   Valiant              6    121.32
6                        8    126.72
7                        4    80.00
8   Merc 230             4    91.60
9   Merc 280             6    109.80
10  Merc 280C            6    113.40
11  Merc 450SE           8    139.20
12  Merc 450SL           8    140.80
13  Merc 450SLC          8    144.00
14  Cadillac Fleetwood   8    143.84
15  Lincoln Continental  8    142.56
16  Chrysler Imperial    8    139.36
17  Fiat 128             4    77.88
18                       4    74.08
19  Toyota Corolla       4    79.60
20  Toyota Corona        4    80.04
21  Dodge Challenger     8    134.96
22                       8    138.40
23  Camaro Z28           8    123.28
24  Pontiac Firebird     8    136.40
25  Fiat X1-9            4    75.60
26  Porsche 914-2        4    66.80
27  Lotus Europa         4    67.60
28  Ford Pantera L       8    116.00
29  Ferrari Dino         6    93.00
30  Maserati Bora        8    116.80
31  Volvo 142E           4    74.40
-
-
-
-
-
-

Group 3

-
-
-Show the code -
cars.query('cyl >= 4').value_counts()
-
-display(pd.crosstab(cars.mpg,cars.cyl).reset_index())
-
-
- -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
cyl  mpg   4  6  8
0    10.4  0  0  2
1    13.3  0  0  1
2    14.3  0  0  1
3    14.7  0  0  1
4    15.0  0  0  1
5    15.2  0  0  2
6    15.5  0  0  1
7    15.8  0  0  1
8    16.4  0  0  1
9    17.3  0  0  1
10   17.8  0  1  0
11   18.1  0  1  0
12   18.7  0  0  1
13   19.2  0  1  1
14   19.7  0  1  0
15   21.0  0  2  0
16   21.4  1  1  0
17   21.5  1  0  0
18   22.8  2  0  0
19   24.4  1  0  0
20   26.0  1  0  0
21   27.3  1  0  0
22   30.4  2  0  0
23   32.4  1  0  0
24   33.9  1  0  0
-
-
-
-
-
-

Group 4

-
-
-Show the code -
cars.groupby("car").sum().mean()
-cyl_summary = pd.DataFrame(cars.groupby("cyl").agg(["min", "median", "mean", "max"]))
-display(cyl_summary)
-
-
- -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     mpg                            disp                              hp             ...  am             gear                          carb
cyl  min   median  mean       max   min    median  mean        max    min    median  ...  mean      max  min  median  mean        max  min  median  mean      max
4    21.4  26.0    26.663636  33.9  71.1   108.0   105.136364  146.7  52.0   91.0    ...  0.727273  1    3    999.0   727.545455  999  1    2.0     1.545455  2
6    17.8  19.7    19.742857  21.4  145.0  167.6   183.314286  258.0  105.0  110.0   ...  0.428571  1    3    4.0     3.857143    5    1    4.0     3.428571  6
8    10.4  15.2    15.100000  19.2  275.8  350.5   353.100000  472.0  150.0  175.0   ...  0.142857  1    3    3.0     3.285714    5    2    3.5     3.500000  8
-

3 rows × 40 columns

-
-
-
-
- -
- - -
- - - - \ No newline at end of file diff --git a/slides/p2/d2/index.html b/slides/p2/d2/index.html deleted file mode 100644 index 4b55b5c0..00000000 --- a/slides/p2/d2/index.html +++ /dev/null @@ -1,54 +0,0 @@ -Day 2: Transforming Data

Day 2: Transforming Data

Welcome to class!

Spiritual Thought

Announcements

  1. Code chunk options:
    • Locally using -#| warning: false
    • Globally in the YAML using -execute: -warning: false

Flights Data Issues:

What are some of the data issues you discovered while getting to know your data?

Loading JSON files into pandas

Let’s load in some practice data! Data link.

Here’s a description of the data: Data Description.

import pandas as pd   # to load and transform data
-import numpy as np    # for math/stat calculations
-
-# from url to pandas dataframe
-url = "https://github.com/byuidatascience/data4missing/raw/master/data-raw/mtcars_missing/mtcars_missing.json" 
-cars = pd.read_json(url)
-
-# or from file to pandas dataframe
-cars = pd.read_json("mtcars_missing.json")
-

Look at the data for the first two cars. What is different about the format?

[
-  {
-    "car": "Mazda RX4",
-    "mpg": 21,
-    "cyl": 6,
-    "disp": 160,
-    "hp": 110,
-    "drat": 3.9,
-    "wt": 2.62,
-    "qsec": 16.46,
-    "vs": 0,
-    "am": 1,
-    "gear": 4,
-    "carb": 4
-  },
-  {
-    "car": "Mazda RX4 Wag",
-    "mpg": 21,
-    "cyl": 6,
-    "disp": 160,
-    "hp": 110,
-    "drat": 3.9,
-    "wt": 2.875,
-    "qsec": 17.02,
-    "am": 1,
-    "gear": 4,
-    "carb": 4
-  }
-]
-

Your Turn: Transforming Data

With your group, research these functions and create an example using the cars data. Post your example in Slack. Be prepared to teach the class about your functions.

You can use the Data Transformation textbook chapter and the pandas documentation to help you.

Recreate the following output to the best of your abilities:

Group 1: Working with rows
  • .query() allows you to subset observations (rows)
  • .sort_values() arranges rows in a particular order
Group 2: Working with columns
  • .filter() (as well as [] and .loc[]) allow you to select columns
  • .assign() is one way to add new columns to a dataframe
Group 3: Counting items
  • .value_counts() summarizes a column by counting the values inside
  • pd.crosstab() creates a “cross tabulation” of two or more variables
Group 4: Summarizing data
  • Using .groupby() and .agg() together allows you to calculate group summaries
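To get a feel for the four groups of functions before working on the full data, here is a small sketch. It assumes a four-row table typed in by hand (cars_toy is a made-up name, and the new column mpg_per_wt is invented for illustration), so it is a warm-up rather than a solution to the group exercise.

```python
import pandas as pd

# Hand-typed toy table; the real exercise uses the full cars data.
cars_toy = pd.DataFrame({
    "car": ["Mazda RX4", "Datsun 710", "Hornet 4 Drive", "Valiant"],
    "mpg": [21.0, 22.8, 21.4, 18.1],
    "cyl": [6, 4, 6, 6],
    "wt": [2.62, 2.32, 3.215, 3.46],
})

# Group 1: rows -> keep 6-cylinder cars, highest mpg first
six_cyl = cars_toy.query("cyl == 6").sort_values(by="mpg", ascending=False)

# Group 2: columns -> add a new column, then keep only two columns
with_ratio = (
    cars_toy
    .assign(mpg_per_wt=lambda x: x.mpg / x.wt)
    .filter(["car", "mpg_per_wt"])
)

# Group 3: counting -> how many cars per cylinder count
cyl_counts = cars_toy["cyl"].value_counts()

# Group 4: summaries -> mean mpg for each cylinder count
mean_by_cyl = cars_toy.groupby("cyl").agg(mean_mpg=("mpg", "mean")).reset_index()

print(six_cyl)
print(with_ratio)
print(cyl_counts)
print(mean_by_cyl)
```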

Your Turn: Summarizing the cars data

Write code to calculate the mean weight wt for each cylinder type cyl.

cars.groupby('cyl').agg(mean_weight = ('wt', 'mean')).reset_index()
-

Can you print the answer as a markdown table?

print(cars.groupby('cyl').agg(mean_weight = ('wt', 'mean')).reset_index().to_markdown(index = False))
-

Project 2 FAQs

One main reason:

You can create multiple columns within the same assign() where one of the columns depends on another one defined within the same assign. source: Documentation
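A short sketch of the behavior the quoted documentation describes, assuming a three-row toy table (cars_toy and the column name cyl_x_qsec_rounded are made up for illustration). The second new column is built from the first one, even though both are defined in the same assign() call.

```python
import pandas as pd

# Hand-typed toy rows for illustration only.
cars_toy = pd.DataFrame({"cyl": [6, 4, 8], "qsec": [16.46, 18.61, 17.02]})

result = cars_toy.assign(
    # first new column
    cyl_x_qsec=lambda x: x.cyl * x.qsec,
    # second new column depends on the one defined just above
    cyl_x_qsec_rounded=lambda x: x.cyl_x_qsec.round(0),
)

print(result)
```

The lambdas matter here: each one receives the dataframe as it looks at that point in the call, so the second lambda can already see cyl_x_qsec.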

Other resources:

Not related, but also fun: Should you use “dot notation” or “bracket notation” with pandas?

Two ways to define the same function:

def square(x):
-     return x**2
-
-square = lambda x:x**2
-

There are some differences between them, as listed below.

  1. lambda is a keyword that returns a function object and does not create a ‘name’, whereas def creates a name in the local namespace.
  2. lambda functions are good for situations where you want to minimize lines of code, as you can create a function in one line of Python code. This is not possible using def.
  3. lambda functions are somewhat less readable for most Python users.
  4. lambda functions can only be used once, unless assigned to a variable name.

source
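The naming difference in the list above can be checked directly. This is a small sketch; square_lambda is a name chosen here so that both versions can exist side by side.

```python
def square(x):
    return x ** 2

# Same behavior, written as a lambda and then bound to a variable name.
square_lambda = lambda x: x ** 2

print(square(3), square_lambda(3))   # both give the same result
print(square.__name__)               # the def version carries its own name
print(square_lambda.__name__)        # the lambda version is anonymous

# A lambda used once, without ever being given a name:
print(list(map(lambda x: x ** 2, [1, 2, 3])))
```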

What if you want to create a new column, whose values depend on another column? There are a lot of ways to accomplish this (see this stackoverflow answer). Some functions I use:
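One common option, offered here as an assumption rather than as the list from the original slide, is np.where() inside assign(). The sketch below uses a hand-typed toy table, and the column name size and the labels "large" and "small" are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hand-typed toy rows for illustration only.
cars_toy = pd.DataFrame({
    "car": ["Mazda RX4", "Datsun 710", "Hornet Sportabout"],
    "cyl": [6, 4, 8],
})

labeled = cars_toy.assign(
    # np.where(condition, value_if_true, value_if_false), applied row by row
    size=lambda x: np.where(x.cyl >= 6, "large", "small"),
)

print(labeled)
```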

We will learn how to identify and deal with missing data next week. For now, we can drop rows we don’t want using square brackets [] or .query().
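Both ways of dropping rows can be sketched on a toy table. The values are invented, and -999 is used only as a hypothetical placeholder for "missing"; check the real data to see what placeholder it actually uses.

```python
import pandas as pd

# Invented rows; -999 stands in for a hypothetical "missing" code.
toy = pd.DataFrame({
    "airport_code": ["ATL", "SLC", "DEN"],
    "num_of_delays_late_aircraft": [1109, -999, 928],
})

# Square brackets with a boolean condition
kept_brackets = toy[toy["num_of_delays_late_aircraft"] != -999]

# The same filter written with .query()
kept_query = toy.query("num_of_delays_late_aircraft != -999")

print(kept_brackets)
print(kept_query)
```

The two results are identical, so the choice mostly comes down to which one reads better in a longer method chain.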

APIs and JSON: A Primer

Application Programming Interfaces (APIs)

Representational State Transfer (REST APIs)

Over the course of the ’00s, another Web services technology, called Representational State Transfer, or REST, began to overtake [all other tools] for the purpose of transferring data. One of the big advantages of programming using REST APIs is that you can use multiple data formats — not just XML, but JSON and HTML as well. As web developers came to prefer JSON over XML, so too did they come to favor REST over SOAP. As Kostyantyn Kharchenko put it on the Svitla blog, “In many ways, the success of REST is due to the JSON format because of its easy use on various platforms.”
Today, JSON is the de-facto standard for exchanging data between web and mobile clients and back-end services. ref


JavaScript Object Notation

Well, when you’re writing frontend code in Javascript, getting JSON data back makes it easier to load that data into an object tree and work with it. And JSON formats data in a more succinct way, which saves bandwidth and improves response times when sending messages back and forth to a server.
In a world of APIs, cloud computing, and ever-growing data, JSON has a big role to play in greasing the wheels of a modern, open web. ref


Other Resources





J. Hathaway and BYU-I ©

\ No newline at end of file diff --git a/slides/p2/d2/index.xml b/slides/p2/d2/index.xml deleted file mode 100644 index 6ff15052..00000000 --- a/slides/p2/d2/index.xml +++ /dev/null @@ -1 +0,0 @@ -Day 2: Transforming Data on DS250https://byuistats.github.io/DS250-Cannon/slides/p2/d2/Recent content in Day 2: Transforming Data on DS250Hugo -- gohugo.ioen-usJ. Hathaway and BYU-I ©Fri, 01 May 2020 11:02:05 +0600 \ No newline at end of file diff --git a/slides/p2/d3/index.html b/slides/p2/d3/index.html deleted file mode 100644 index 3425ab8b..00000000 --- a/slides/p2/d3/index.html +++ /dev/null @@ -1,23 +0,0 @@ -Day 3: Missing Data

Day 3: Missing Data

Welcome to class!

Announcements


Questions 1 and 2

What issues are we still running into?


How to work with missing data

What counts as missing data?


How to identify missing data

  • df.isnull().sum()
  • df.describe()
  • df.column.value_counts(dropna=False)
  • pd.crosstab()

Option 1: Remove missing values

Be careful with .dropna(), and make sure you know what it is doing to your data!

Let’s use the pandas example:

df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
-                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
-                   "born": [pd.NaT, pd.Timestamp("1940-04-25"),
-                            pd.NaT]})
-
A: Almost never! Why do you think it is a bad idea? df.dropna()
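To see what .dropna() does to the example above, here is a sketch that rebuilds the same dataframe and compares three calls. The variable names on the left are made up; subset and how are parameters of DataFrame.dropna().

```python
import numpy as np
import pandas as pd

# Same dataframe as the pandas example above.
df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})

any_missing = df.dropna()              # drops a row if ANY value is missing
only_toy = df.dropna(subset=["toy"])   # only looks at the "toy" column
all_missing = df.dropna(how="all")     # drops a row only if EVERY value is missing

print(len(df), len(any_missing), len(only_toy), len(all_missing))
```

With the default settings, two of the three rows disappear, which is why it pays to check row counts before and after.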

Option 2: Replacing missing values

Again, let’s use the pandas example:

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
-                   [3, 4, np.nan, 1],
-                   [np.nan, np.nan, np.nan, 5],
-                   [np.nan, 3, np.nan, 4]],
-                  columns=list("ABCD"))
-
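A sketch of a few ways to fill the gaps in the dataframe above, rebuilt here so the block runs on its own. The variable names are made up for illustration, and which fill value makes sense depends on the data.

```python
import numpy as np
import pandas as pd

# Same dataframe as the pandas example above.
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list("ABCD"))

filled_zero = df.fillna(0)                                             # one value everywhere
filled_by_column = df.fillna(value={"A": 0, "B": 1, "C": 2, "D": 3})   # one value per column
filled_mean = df.fillna(df.mean())                                     # each column's own mean

print(filled_zero)
print(filled_by_column)
print(filled_mean)
```

Column C is entirely missing, so it has no mean to fill with and stays missing in the last result.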

Question 3

What columns do we need to use for question 3 (total number of flights delayed by weather)?

  • num_of_delays_weather
  • num_of_delays_late_aircraft
  • num_of_delays_nas
weather = flights.assign(
-    severe = #????,
-    mild_late = #????,
-    mild_nas = np.where(#????),
-    total_weather = # add up severe and mild,
-).filter(['airport_code','month','severe','mild_late','mild_nas',
-    'total_weather', 'num_of_delays_total'])
-

Other resources for question 3


J. Hathaway and BYU-I ©

\ No newline at end of file diff --git a/slides/p2/d3/index.xml b/slides/p2/d3/index.xml deleted file mode 100644 index e4ee1755..00000000 --- a/slides/p2/d3/index.xml +++ /dev/null @@ -1 +0,0 @@ -Day 3: Missing Data on DS250https://byuistats.github.io/DS250-Cannon/slides/p2/d3/Recent content in Day 3: Missing Data on DS250Hugo -- gohugo.ioen-usJ. Hathaway and BYU-I ©Fri, 01 May 2020 11:02:05 +0600 \ No newline at end of file diff --git a/slides/p2/d4/index.html b/slides/p2/d4/index.html deleted file mode 100644 index e82ee697..00000000 --- a/slides/p2/d4/index.html +++ /dev/null @@ -1,108 +0,0 @@ -Day 4: Exporting JSON

Day 4: Exporting JSON

Welcome to class!

Spiritual Thought

  • 3 Nephi 12:14 Verily, verily, I say unto you, I give unto you to be the light of this people.

Announcements

  • Hackathon Opening Social

Question 5

Let’s do an example of question 5 using the mtcars data.

Load packages and data

#%%
-import pandas as pd
-import numpy as np
-import json
-
-url_cars = "https://github.com/byuidatascience/data4missing/raw/master/data-raw/mtcars_missing/mtcars_missing.json"
-cars = pd.read_json(url_cars)
-

Find all the missing values

#%%
-# method 1: find "official" null values
-# hp, wt, and vs
-cars.isnull().sum()
-
-#%%
-# method 2: just look at the data
-# car, hp, wt, vs, gear
-cars.head(10)
-
-#%%
-# method 3: look at summaries
-# the values in 'gear' look funny
-cars.describe()
-
-#%%
-# method 4: count up categories
-# looks like 4 rows are blank
-cars.car.value_counts()
-

Reformat the missing values

Remember, you need to reformat your missing values to make them consistent!

Reading the examples in the replace documentation might give you some ideas.

#%% 
-# There are a lot of functions
-# we could use to give the missing values
-# a consistent format.
-
-# `replace()` is one of the easiest
-# let's change everything to np.nan
-cars_new = cars.replace(999, np.nan).replace("", np.nan)
-
-# or equivalently:
-cars_new = cars.replace([999, ""], np.nan)
-
-
-# did we get them all?
-cars_new.isnull().sum()
-

Saving JSON files from a pandas dataframe

You can save a DataFrame as a JSON file like this:

#%%
-# save the new data as a json
-cars_new.to_json("my_cars_data.json")
-

The df.to_json() documentation shows us how to change the way the JSON file is organized. (By row? By column? etc.)

This is the format we would like to see in the report:

[
-  {
-    "car": "Mazda RX4",
-    "mpg": 21,
-    "cyl": 6,
-    "disp": 160,
-    "hp": 110,
-    "drat": 3.9,
-    "wt": 2.62,
-    "qsec": 16.46,
-    "vs": 0,
-    "am": 1,
-    "gear": 4,
-    "carb": 4
-  }
-]
-

And here are the various options:

# %%
-# Question 5 wants us to "include one record example"
-# in our md report that "has a missing value"
-
-# you can print out a json file like this:
-json_data = cars_new.to_json()
-print(json_data)
-
-# but that won't look good in our report.
-# instead....
-
-#%%
-# you can do this.
-# in this format, the json file is
-# organized/printed by column
-json_data = cars_new.to_json()
-json_object = json.loads(json_data)
-json_formatted_str = json.dumps(json_object, indent = 4)
-print(json_formatted_str)
-
-# %%
-# we can change the format of the
-# json file using 'orient'
-json_data = cars.to_json(orient="split")
-json_object = json.loads(json_data)
-json_formatted_str = json.dumps(json_object, indent = 4)
-print(json_formatted_str)
-
-# %%
-# by table
-json_data = cars.to_json(orient="table")
-json_object = json.loads(json_data)
-json_formatted_str = json.dumps(json_object, indent = 4)
-print(json_formatted_str)
-
-# %%
-# by "record" or "row"
-json_data = cars.to_json(orient="records")
-json_object = json.loads(json_data)
-json_formatted_str = json.dumps(json_object, indent = 4)
-print(json_formatted_str)
-
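Building on the "records" orientation, one possible way to pull out a single record that has a missing value is sketched below. It assumes a two-row toy table typed in by hand; cars_toy and with_missing are made-up names.

```python
import json
import numpy as np
import pandas as pd

# Hand-typed toy rows; the second one has a missing weight.
cars_toy = pd.DataFrame({
    "car": ["Mazda RX4", "Hornet 4 Drive"],
    "mpg": [21.0, 21.4],
    "wt": [2.62, np.nan],
})

# NaN is written to JSON as null, which json.loads reads back as None.
records = json.loads(cars_toy.to_json(orient="records"))

# Keep only the records that contain at least one missing value.
with_missing = [r for r in records if any(v is None for v in r.values())]

# Pretty-print one such record for the report.
print(json.dumps(with_missing[0], indent=4))
```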

J. Hathaway and BYU-I ©

\ No newline at end of file diff --git a/slides/p2/d4/index.xml b/slides/p2/d4/index.xml deleted file mode 100644 index 522a3854..00000000 --- a/slides/p2/d4/index.xml +++ /dev/null @@ -1 +0,0 @@ -Day 4: Exporting JSON on DS250https://byuistats.github.io/DS250-Cannon/slides/p2/d4/Recent content in Day 4: Exporting JSON on DS250Hugo -- gohugo.ioen-usJ. Hathaway and BYU-I ©Fri, 01 May 2020 11:02:05 +0600 \ No newline at end of file diff --git a/slides/p2/index.html b/slides/p2/index.html deleted file mode 100644 index 364af8f6..00000000 --- a/slides/p2/index.html +++ /dev/null @@ -1,7 +0,0 @@ -Week 4-5: Project 2 - Flights


\ No newline at end of file diff --git a/slides/p2/index.xml b/slides/p2/index.xml deleted file mode 100644 index 90cf4071..00000000 --- a/slides/p2/index.xml +++ /dev/null @@ -1 +0,0 @@ -Week 4-5: Project 2 - Flights on DS250https://byuistats.github.io/DS250-Cannon/slides/p2/Recent content in Week 4-5: Project 2 - Flights on DS250Hugo -- gohugo.ioen-usJ. Hathaway and BYU-I ©Fri, 01 May 2020 11:02:05 +0600 \ No newline at end of file diff --git a/slides/p3/d1/index.html b/slides/p3/d1/index.html index 493a2e20..cba67f08 100644 --- a/slides/p3/d1/index.html +++ b/slides/p3/d1/index.html @@ -3,7 +3,7 @@

Day 1: Intro to Project 3

Welcome to class!

Spiritual Thought

Announcements

  1. Project 2 Highlights
  2. Project 2 comments
  • Turn them in
  • Clean up graphs (main titles, axis labels, legends)
  • Column headers on tables in your report (don’t include index number either)
  • Technically, report the proportion of all flights delayed by weather, not the proportion of delayed flights
  • JSON should look like a text example of a record, not a table
  1. Things for next project:
  • Be sure to give section headers meaningful titles (NOT “Question 1”)

What is Structured Query Language (SQL)?



    Ok, but how does it work?

    SQL uses keywords to pull (or “fetch”, “extract”) the data we want from a database. The computer reads those keywords in a specific order.

    From EverSQL we can get some more background:

    This is the logical order of operations, also known as the order of execution, for an SQL query:


    1. FROM, including JOINs
    2. WHERE
    3. GROUP BY
    4. HAVING
    5. WINDOW functions
    6. SELECT
    7. DISTINCT
    8. UNION
    9. ORDER BY
    10. LIMIT and OFFSET

    But the reality isn’t that easy nor straight forward. As we said, the SQL standard defines the order of execution for the different SQL query clauses. Said that, modern databases are already challenging that default order by applying some optimization tricks which might change the actual order of execution, though they must end up returning the same result as if they were running the query at the default execution order.

    For CSE 250: Don’t think too hard about optimization at this point. Let the database figure out the optimized routine.
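The logical order above can be illustrated with a small sketch. It assumes an in-memory SQLite database and a made-up `salaries` table (neither is the class database); the comments on each clause show where it falls in the logical order, even though the clauses are typed in a different order.

```python
import sqlite3

import pandas as pd

# Hypothetical toy table, not the course baseball database.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE salaries (playerID TEXT, yearID INTEGER, salary INTEGER);
INSERT INTO salaries VALUES
  ('a', 2000, 100), ('a', 2001, 300),
  ('b', 2000, 50),  ('b', 2001, 90),
  ('c', 1999, 900), ('d', 2000, 10);
""")

query = """
SELECT playerID, AVG(salary) AS avg_salary  -- 6. SELECT
FROM salaries                               -- 1. FROM
WHERE yearID >= 2000                        -- 2. WHERE filters rows
GROUP BY playerID                           -- 3. GROUP BY builds groups
HAVING AVG(salary) > 60                     -- 4. HAVING filters groups
ORDER BY avg_salary DESC                    -- 9. ORDER BY
LIMIT 5                                     -- 10. LIMIT
"""
result = pd.read_sql_query(query, con)
print(result)
```

Player `c` is dropped by `WHERE` before any grouping happens, while player `d` is dropped by `HAVING` only after the group average has been computed.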

    Most SQL queries are typed in the following pattern:

    SELECT -- <columns> and <column calculations>
     FROM -- <table name>
       JOIN -- <table name>
    @@ -34,4 +34,4 @@
     """, con)
     
     

    Understanding SQL queries

    Make sure you do the project readings!


    \ No newline at end of file +Week 1: Introduction
    \ No newline at end of file diff --git a/slides/p3/d2/index.html b/slides/p3/d2/index.html index 3ad23de5..7911fb15 100644 --- a/slides/p3/d2/index.html +++ b/slides/p3/d2/index.html @@ -3,7 +3,7 @@

    Day 2: SQL Calculations

    Welcome to class!

    Spiritual Thought

    Announcements

    1. Project 3 - SQL practice

    Class Activity in Slack

    Part 1

    Goal: Describe in words (NOT using code) how to get from your starting data to your ending data.

    Post your answer in your group’s Slack thread. You have 7 minutes, and are allowed to ask me 1 question.

    Part 2

    Goal: Now try to write a SQL query to get your ending data.

    Post your SQL query in your group’s Slack thread. You have 7 minutes, and are allowed to ask me 1 question.

    Here is the SQL template for your use.

    SELECT -- <columns> and <column calculations>
     FROM -- <table name>
       JOIN -- <table name>
       ON -- <columns to join>
    diff --git a/slides/p3/d3/index.html b/slides/p3/d3/index.html
    index 82f05ed9..04e427f3 100644
    --- a/slides/p3/d3/index.html
    +++ b/slides/p3/d3/index.html
    @@ -3,6 +3,6 @@
     

    Day 3: The end of baseball

    Welcome to class!

    Spiritual Thought

    Announcements

    1. Practice Coding Challenge
    2. Can I still get an “A”?
      • Profile of an “A” student
      • What if I fall behind?
    3. Reminders:
      • DS community assignment
      • Review and Request Letter

    Coding Challenge:

    How do I prepare? What would your coding challenge look like?

    Project 3 Questions

    1. Integer Division
    2. Career Batting Average
    3. What have you come up with for Q3? Metrics? Visualizations?

    Question 1

    Ask yourself:

    1. What do I want and expect the end table to look like?
    2. What table(s) and calculations do I need?
    3. What makes a row in my end table unique?
    4. What problems can I anticipate?

    Question 2

    Ask yourself:

    1. What do I want and expect the end table to look like?
    2. What table(s) and calculations do I need?
    3. What makes a row in my end table unique?
    4. What problems can I anticipate?

    Question 3

    What are some ideas for Grand Question 3? Ask yourself:

    1. What information will you use to compare the two baseball teams?
    2. What table(s) and calculations do I need?
    3. What makes a row in my end table unique?
    4. What problems can I anticipate?


    \ No newline at end of file diff --git a/slides/p3/d4/index.html b/slides/p3/d4/index.html index 60255c55..fae56e53 100644 --- a/slides/p3/d4/index.html +++ b/slides/p3/d4/index.html @@ -3,5 +3,5 @@


    \ No newline at end of file diff --git a/slides/p3/index.html b/slides/p3/index.html index 508dabc0..84ef7258 100644 --- a/slides/p3/index.html +++ b/slides/p3/index.html @@ -3,7 +3,7 @@

    Week 6-7: Project 3 - Baseball

    We will use a baseball relational database to explore SQL in Python for data science applications. Finding relationships in baseball

    Completed Readings: SQL for Data Science Readings (read all links) and Why SQL is beating NoSQL, and what this means for the future of data

    Use the data.world baseball url for the Data Connection. You can read the
    Connection Instructions for data.world here

    Grand Questions

    1. Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.

    2. This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)

      1. Write an SQL query that provides playerID, yearID, and batting average for players with at least one at bat. Sort the table from highest batting average to lowest, and show the top 5 results in your report.
      2. Use the same query as above, but only include players with more than 10 “at bats” that year. Print the top 5 results.
      3. Now calculate the batting average for players over their entire careers (all years combined). Only include players with more than 100 at bats, and print the top 5 results.
    3. Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc.). Write an SQL query to get the data you need. Use Python if additional data wrangling is needed, then make a graph in Altair to visualize the comparison. Provide the visualization and the compiled Vega script that would build the visualization.
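A common snag in Grand Question 2 is integer division: dividing one integer column by another can silently truncate to 0. The sketch below assumes an in-memory SQLite database and a made-up `batting` table with `H` (hits) and `AB` (at-bats) columns; the data.world SQL dialect may behave differently, so treat this as an illustration of the idea rather than the project answer.

```python
import sqlite3

import pandas as pd

# Hypothetical toy table; column names mirror the project wording only loosely.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE batting (playerID TEXT, yearID INTEGER, H INTEGER, AB INTEGER);
INSERT INTO batting VALUES
  ('p1', 2001, 30, 100), ('p1', 2002, 50, 100),
  ('p2', 2001, 1, 1),    ('p3', 2001, 0, 0);
""")

career = pd.read_sql_query("""
SELECT playerID,
       SUM(H)  AS hits,
       SUM(AB) AS at_bats,
       SUM(H) / SUM(AB)               AS int_avg,      -- integer division
       CAST(SUM(H) AS REAL) / SUM(AB) AS batting_avg   -- real division
FROM batting
GROUP BY playerID          -- one row per player across all years
HAVING SUM(AB) > 100       -- career at-bats filter is applied after grouping
ORDER BY batting_avg DESC
LIMIT 5
""", con)
print(career)
```

Filtering on career at-bats belongs in `HAVING` because the total only exists after the rows have been grouped by player.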


    \ No newline at end of file diff --git a/slides/p4/d1/index.html b/slides/p4/d1/index.html index aefee2cf..46fd23c4 100644 --- a/slides/p4/d1/index.html +++ b/slides/p4/d1/index.html @@ -3,7 +3,7 @@

    Day 1: Intro to ML

    Welcome to class!

    Announcements

    1. Project 3 - Getting pickier about good communication
      • Career batting average
      • Meaningful report name (Drop “Client Report”)
      • Meaningful section headers so the table of contents is useful (don’t call them “Question 1”)
      • Don’t include “My useless chart” from the template
    2. Coding Challenge - Table
    3. Ask for help!
      • Computing lab
      • Computing lab Slack channel (search)
      • Slack classmates or general channel

    Spiritual Thought

    Genesis 1:1 and Machine Learning
    Are facts true?


    Pictionary!



    From Sebastian Thrun:

    AI is able to learn ‘rules’ from highly repetitive data.


    The single most important thing for AI to accomplish in the next ten years is to free us from the burden of repetitive work.


    Your Turn: Student Classification Problem

    Can we predict if a student is from Utah?


    Your Turn: Features and Targets

    Import dwellings.csv. With a neighbor:

    1. Try to describe the data. Explain what each observation (row) is and what measurements we have on that observation (columns).
    2. Now try describing the modeling (machine learning) we are going to do in terms of “features” and “targets”. Watch out - are there any columns that are the target in disguise? (You may need to review the project goal.)
    3. What features do you expect to have a strong relationship with the target?

    Before Next Class

    Machine Learning Introduction

    • Step-by-step guide (mostly) for training a GaussianNB classifier. (The steps will be the same for any algorithm you use.)

    Visual Introduction to Machine Learning

    1. Machine learning identifies patterns using statistical learning and computers by unearthing boundaries in data sets. You can use it to make predictions.
    2. One method for making predictions is called a decision tree, which uses a series of if-then statements to identify boundaries and define patterns in the data.
    3. Overfitting happens when some boundaries are based on distinctions that don’t make a difference. You can see if a model overfits by having test data flow through the model.

    The goal of Question 1 is to help us with “feature selection”.

    • Remember: Overfitting happens when some boundaries are based on distinctions that don’t make a difference.
    • More data does not always lead to better models. (Occam’s Razor)
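One way to see overfitting is to compare accuracy on the training data with accuracy on held-out test data. The sketch below is an illustration under stated assumptions: it uses synthetic data from scikit-learn's `make_classification` (not dwellings.csv) with some deliberately noisy labels, and compares an unrestricted decision tree with a shallow one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 20% of labels randomly flipped (noise).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted tree can memorize the training rows, noise included.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# A shallow tree is forced to use only a few boundaries.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

train_acc_deep = deep.score(X_train, y_train)
test_acc_deep = deep.score(X_test, y_test)
train_acc_shallow = shallow.score(X_train, y_train)
test_acc_shallow = shallow.score(X_test, y_test)

print("deep tree    train/test:", train_acc_deep, test_acc_deep)
print("shallow tree train/test:", train_acc_shallow, test_acc_shallow)
```

A large gap between training and test accuracy is the signal to look for; the exact numbers depend on the data and the random seed.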

    Common questions:

    MaxRowsError: How can I plot Large Datasets?

    You may also save data to a local filesystem and reference the data by file path. Altair has a JSON data transformer that will do this transparently when enabled:

    alt.data_transformers.disable_max_rows()
     subset_data = denver.sample(n = 4999)
     


    \ No newline at end of file diff --git a/slides/p4/d2/index.html b/slides/p4/d2/index.html index 4ea393b9..a177f002 100644 --- a/slides/p4/d2/index.html +++ b/slides/p4/d2/index.html @@ -3,7 +3,7 @@

    Day 2: Intro to Machine Learning

    Welcome to class!

    Announcements

    Spiritual thought

    Are facts true?

    • How do you distinguish between truth and error?
    • Joshua and Caleb

    Building a Decision Tree

    Splitting the Data

    1. Start with packages and data set

    We’ll be using parts of the scikit-learn package and the seaborn package.

    # If you haven't already, install scikit-learn and seaborn
     pip install scikit-learn seaborn
     
    from types import GeneratorType
     import pandas as pd
    diff --git a/slides/p4/d3/index.html b/slides/p4/d3/index.html
    index 3880ccff..911f0f0f 100644
    --- a/slides/p4/d3/index.html
    +++ b/slides/p4/d3/index.html
    @@ -3,7 +3,7 @@
     

    Day 3: Training a Classifier, Part 2

    Welcome to class!

    Spiritual Thought

    Announcements

    1. Coding Challenge code posted

    Prepping data for the Machine

    alt text

    Building a Decision Tree

    import pandas as pd
     import altair as alt
     
     from sklearn.model_selection import train_test_split
    diff --git a/slides/p4/d4/index.html b/slides/p4/d4/index.html
    index 94a69ead..7b8c832c 100644
    --- a/slides/p4/d4/index.html
    +++ b/slides/p4/d4/index.html
    @@ -3,7 +3,7 @@
     

    Day 4: Evaluating Our Models, Part 2

    Announcements

    Today:

    1. Continue discussion about evaluating models
    2. Try to understand what models are doing

    Evaluating model performance cont

    Confusion Matrix

    Why isn’t accuracy enough?

    A confusion matrix is a quick way to see the strengths and weaknesses of your model. A confusion matrix is not a “metric”. A confusion matrix provides an easy way to calculate multiple metrics such as accuracy, precision, and recall.

    alt text


    Your Turn

    With your group, use the links above to find a definition for your assigned metric. Then try using the confusion matrix on the screen to calculate your metric for my model.

    • Group 1: Accuracy
    • Group 2: Sensitivity/Recall
    • Group 3: Precision
    • Group 4: Specificity
    • Group 5: F1 Score
    • Group 6: Balanced Accuracy
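Each metric in the list can be computed from the four cells of a binary confusion matrix. The sketch below uses made-up labels (`y_true` and `y_pred` are hypothetical, not the model shown in class) and works each formula out by hand so it can be compared with scikit-learn's built-in functions.

```python
from sklearn import metrics

# Hypothetical labels: 1 is the positive class, 0 the negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels, ravel() returns the cells in this order.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)    # share of all predictions that are right
recall = tp / (tp + fn)                       # sensitivity: share of actual positives found
precision = tp / (tp + fp)                    # share of predicted positives that are right
specificity = tn / (tn + fp)                  # share of actual negatives found
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

print(accuracy, recall, precision, specificity, f1, balanced_accuracy)
```

The hand-computed values should agree with `metrics.accuracy_score`, `metrics.recall_score`, `metrics.precision_score`, `metrics.f1_score`, and `metrics.balanced_accuracy_score` on the same labels.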

    Validation metrics


    #%%
     # a confusion matrix
     print(metrics.confusion_matrix(y_test, y_predicted_DT))
    diff --git a/slides/p4/index.html b/slides/p4/index.html
    index 874725f9..69da7642 100644
    --- a/slides/p4/index.html
    +++ b/slides/p4/index.html
    @@ -3,6 +3,6 @@
     


    \ No newline at end of file diff --git a/slides/p5/d1/index.html b/slides/p5/d1/index.html index 3d51ac38..b2bf734d 100644 --- a/slides/p5/d1/index.html +++ b/slides/p5/d1/index.html @@ -3,7 +3,7 @@

    Day 1: The war with Star Wars

    Welcome to class!

    Spiritual Thought

    Announcements

    1. Project 4 thoughts
      • Feature Importances - Sorted Bar Graph, not unsorted tables
      • Suppress warnings
      • And the winner is…

    The Star Wars data

    Load the Star Wars data

    # %%
     import pandas as pd 
     import altair as alt
     import numpy as np
    diff --git a/slides/p5/d2/index.html b/slides/p5/d2/index.html
    index e3770ff8..90727a1c 100644
    --- a/slides/p5/d2/index.html
    +++ b/slides/p5/d2/index.html
    @@ -3,7 +3,7 @@
     

    Day 2: Star Wars and strings

    Welcome to class!

    Announcements

    What’s something you’re grateful for today?


    The .str functions in pandas


    .str.strip()

    s = pd.Series(['1. Ant.  ', '2. Bee!\n', '3. Cat?\t', '4. Beat?\t', np.nan])
     
     s.str.strip()
     
    diff --git a/slides/p5/d3/index.html b/slides/p5/d3/index.html
    index c34c0ff9..d40cc180 100644
    --- a/slides/p5/d3/index.html
    +++ b/slides/p5/d3/index.html
    @@ -3,7 +3,7 @@
     

    Day 3: Validating data, cleaning columns

    Welcome to class!

    Announcements

    Spiritual Thought

    Let’s validate some data!

    Pick something from the Star Wars article you want to validate (“double check”).


    Moving from categories to values.

    1. Create an additional column(s) that converts the income ranges to a number.
    2. Create an additional column(s) that converts the age ranges to a number.
    3. Create an additional column(s) that converts the school groupings to a number.
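One possible approach to the first item is to pull the lower bound out of each range with the pandas `.str` functions. The sketch below assumes the income ranges look like `"$25,000 - $49,999"`; the exact strings in the survey may differ, so the sample values here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical income categories; check the real survey values before reusing this.
income = pd.Series(["$0 - $24,999", "$25,000 - $49,999", "$150,000+", np.nan])

# Drop the dollar signs, commas, and plus signs, leaving "low - high" or "low".
cleaned = income.str.replace(r"[$,+]", "", regex=True)

# Split on the separator and keep the lower bound as a number.
parts = cleaned.str.split(" - ", expand=True)
income_min = parts[0].astype(float)

print(income_min)
```

Missing responses stay missing (`NaN`), which leaves the decision about how to handle them for a later step. The same pattern can be adapted to the age ranges.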

    Validating visuals

    You’re going to make a lot of bar charts!


    Getting started on Question 3

    One-hot encoding

    Project 5 asks you to “one-hot encode all columns that have categories” and “convert all yes/no responses to 1/0 numeric”.

    The get_dummies method can be used to create one-hot encoded variables. The pd.get_dummies documentation is a great place to start.
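As a minimal sketch of both requirements, assuming a tiny made-up DataFrame (the column names and answers are illustrative, not the full survey), `pd.get_dummies` handles the category column and a simple `map` handles the yes/no column.

```python
import pandas as pd

# Hypothetical survey slice with one nominal column and one yes/no column.
df = pd.DataFrame({
    "shot_first": ["Han", "Greedo", "Han", "I don't understand this question"],
    "fan_star_wars": ["Yes", "No", "Yes", "Yes"],
})

# One-hot encode the nominal column: one 0/1 column per category.
one_hot = pd.get_dummies(df, columns=["shot_first"], dtype=int)

# Convert the yes/no responses to 1/0.
one_hot["fan_star_wars"] = one_hot["fan_star_wars"].map({"Yes": 1, "No": 0})

print(one_hot)
```

Each new column is named with the original column name as a prefix, for example `shot_first_Han`.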

    After reading the documentation, study the code below and get started on Grand Question #3.

    #%%
     # When we use machine learning to predict salary,
     # let's only look at people that have seen at least
     # one star wars film
    diff --git a/slides/p5/d4/index.html b/slides/p5/d4/index.html
    index 7d192c50..1172339c 100644
    --- a/slides/p5/d4/index.html
    +++ b/slides/p5/d4/index.html
    @@ -3,7 +3,7 @@
     

    Day 4: May the ML columns be with you

    Welcome to class!

    Spiritual Thought

    Announcements


    Getting the data ready for machine learning.


    What are machine learning algorithms expecting to see?

    We need to handle missing values and categorical features before feeding the data into a machine learning algorithm, because the mathematics underlying most machine learning models assumes that the data is numerical and contains no missing values. To reinforce this requirement, scikit-learn will return an error if you try to train a model using data that contain missing values or non-numeric values when working with models like linear regression and logistic regression. ref

    We have some options when converting categorical features (columns) to numeric.

    • If the category contains numeric information (like a range of numbers) we can convert it to a numeric variable by taking the minimum, average, or maximum of the range.
    • Factorization: If the category is an “ordinal” variable (meaning, there is an order to the categories) we can assign each category to an integer. (For example, good = 1, better = 2, best = 3.)
    • One-hot Encoding or Dummy Variables: If the category is a “nominal” variable (without an order) then we need to use one-hot encoding (sometimes called “dummy variable encoding”).
    • If the category is some version of True/False or Yes/No then we can simply convert the values to zeros and ones.
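For the ordinal case, one option is to spell out the category order explicitly and use the position in that order as the number. The sketch below assumes three made-up education levels; the labels in the survey are likely different.

```python
import pandas as pd

# Hypothetical ordinal categories, listed from lowest to highest.
order = ["High school degree", "Bachelor degree", "Graduate degree"]
education = pd.Series(
    ["High school degree", "Bachelor degree", "Graduate degree", "Bachelor degree"]
)

# An ordered Categorical assigns codes 0, 1, 2 in the order given; add 1 to start at 1.
education_code = pd.Categorical(education, categories=order, ordered=True).codes + 1

print(list(education_code))
```

Stating the order by hand matters because `pd.factorize` numbers categories in order of first appearance, which may not match the real ranking.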

    What’s our game plan for the Star Wars columns?

    1. Break into Groups

    Strategize + Code + Share

    • Group 1: How are you going to turn Age, Income and Education into numbers?
    • Group 2: How are you going to encode
      • Who Shot First
      • Gender
      • Location
      • All the Yes/No responses
    • Group 3: How are you going to deal with the character rankings?

    2. Combine all the factors into one big X dataframe

    3. Define Y as those making > $50k

    First: Limit the data to only people who answered “Yes” to the question “Have you seen any of the 6 films in the Star Wars franchise?”.

    Then: Use the table below as a guide to prepare your data for machine learning.

    Column              Original Format                      Convert To
    age                 category (ordinal, age ranges)       number
    income              category (ordinal, income ranges)    number
    education           category (ordinal, name of degree)   number
    shot_first          category (nominal)                   one-hot
    gender              category (nominal)                   one-hot
    location            category (nominal)                   one-hot
    fan_star_wars       Yes/No                               0/1
    expanded_universe   Yes/No                               0/1
    fan_exapanded       Yes/No                               0/1
    fan_star_trek       Yes/No                               0/1
    seen_i              Yes/No (name of movie/NaN)           0/1
    seen_ii             Yes/No (name of movie/NaN)           0/1
    seen_iii            Yes/No (name of movie/NaN)           0/1
    seen_iv             Yes/No (name of movie/NaN)           0/1
    seen_v              Yes/No (name of movie/NaN)           0/1
    seen_vi             Yes/No (name of movie/NaN)           0/1
    movie rankings      number                               -
    character rankings  category (ordinal)                   one-hot or factorize

    What functions can we use to convert the categorical columns to numeric?

    Question: When and why would we drop the first column when we convert a category using pd.get_dummies()?

    Answer: Whenever your algorithm needs to calculate a matrix inverse.

    The one-hot encoding creates one binary variable for each category.


    The problem is that this representation includes redundancy. For example, if we know that [1, 0, 0] represents “blue” and [0, 1, 0] represents “green” we don’t need another binary variable to represent “red”, instead we could use 0 values for both “blue” and “green” alone, e.g. [0, 0].


    This is called a dummy variable encoding, and always represents C categories with C-1 binary variables. In addition to being slightly less redundant, a dummy variable representation is required for some models.


    For example, in the case of a linear regression model (and other regression models that have a bias term), a one hot encoding will cause the matrix of input data to become singular, meaning it cannot be inverted and the linear regression coefficients cannot be calculated using linear algebra. For these types of models a dummy variable encoding must be used instead.

    Source
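The redundancy described in the quote can be checked numerically. The sketch below is an illustration with a made-up `color` column: it builds the input matrix with an intercept column of ones for both encodings and compares the matrix rank with the number of columns.

```python
import numpy as np
import pandas as pd

# Hypothetical nominal column with three categories.
colors = pd.Series(["blue", "green", "red", "green"], name="color")

full = pd.get_dummies(colors, dtype=int)                     # C columns
dummy = pd.get_dummies(colors, drop_first=True, dtype=int)   # C - 1 columns

# Add the intercept (bias) column that a linear regression would use.
X_full = np.column_stack([np.ones(len(full)), full.to_numpy()])
X_dummy = np.column_stack([np.ones(len(dummy)), dummy.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)
rank_dummy = np.linalg.matrix_rank(X_dummy)

print("one-hot:", X_full.shape[1], "columns, rank", rank_full)
print("dummy:  ", X_dummy.shape[1], "columns, rank", rank_dummy)
```

With full one-hot encoding the three category columns add up to the intercept column, so the rank is lower than the number of columns; dropping the first category removes that dependency.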


    Predicting income.

    Grand Question 4 wants us to “build a machine learning model that predicts whether a person makes more than $50k”.

    Aka, what is our “outcome” or “response” that we want to predict?

    dat_ml.income > 50000
     

    Remember not to include the answer (income) in your features!

    x = dat_ml.drop(['income'], axis = 1)
     

    The response needs to be saved as a 0/1 variable (at least, for binary classification algorithms).

    y = (dat_ml.income > 50000) / 1
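Dividing a boolean Series by 1 works, but it produces floats. An alternative sketch, assuming a tiny made-up `dat_ml` with a numeric `income` column, uses `astype(int)` to get integer 0/1 labels and shows that `income` stays out of the features.

```python
import pandas as pd

# Hypothetical cleaned data; the real dat_ml has many more columns.
dat_ml = pd.DataFrame({
    "income": [12500, 62500, 87500, 37500],
    "age": [24, 37, 52, 30],
})

# Features: everything except the answer.
x = dat_ml.drop(columns=["income"])

# Target: 1 if the person makes more than $50k, otherwise 0.
y = (dat_ml["income"] > 50000).astype(int)

print(x.columns.tolist(), y.tolist())
```

Either form of `y` is accepted by scikit-learn classifiers; the integer version is simply easier to read when printed.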
     

    Week 10-11: Project 5 - Star Wars

    A significant portion of a data scientist’s job is data cleaning. During these two weeks we will not hide the data munging from you. We will practice data cleaning using a Star Wars survey from FiveThirtyEight. Survey data is notoriously difficult to handle. Even when the data is recorded cleanly, the options for ‘write in questions’, ‘choose from multiple answers’, ‘pick all that are right’, and ‘multiple choice questions’ make storing the data in a tidy format difficult.


    J. Hathaway and BYU-I ©

    \ No newline at end of file diff --git a/slides/p6/d2/index.html b/slides/p6/d2/index.html index 58767ea2..8f5ffacb 100644 --- a/slides/p6/d2/index.html +++ b/slides/p6/d2/index.html @@ -3,5 +3,5 @@

    Day 1: Git and GitHub

    Welcome to class!

    Spiritual Thought

    Announcements

    1. Project 5 Comment

      • Feature Importance and Model discussion
    2. The last day of DSS is next Wednesday, Dec 6th at 6:00PM in STC 394

    3. Extra credit for creating and uploading cheat sheet (2 points for projects or checkpoints)

    4. Coding Challenge date?

    5. The technical aspects of Project 6 will be done mostly in class. Resume prep/MD outside


    Git and GitHub

    “Web developers’ social media platform”

    This is GitHub, the world’s largest code repository platform online. A platform used by some 50 million software developers to host their coding projects, most of them open-source — meaning others can access their codes and modify them to create better versions if they feel like.


    Most of the internet is produced or hosted on GitHub in the form of code. “What Gmail is to email, GitHub is to writing software,” says Kiran Jonnalagadda, cofounder of HasGeek, a platform to build and discover peer groups. Source

    • Don’t: post code for assignments that hundreds of other students have done.
    • Do: post unique code using skills from your classes.

    I would also recommend using private repos to manage your course work.


    Is it going to hurt?

    Answer: Yes.

    It feels weird at first but quickly becomes second nature. If you plan on taking more data science classes, you should know that DS 350 students are required to submit all coursework via GitHub. This is a major topic in class and office hours for the first two weeks. Then we practically never discuss it again.

    More bad news. Do you use GitHub to work with other people or to coordinate your own work from multiple computers? If so, after you recover from the initial setup, Git will crush you again with merge conflicts. And this is not one-time pain, this could be a dull ache for a long time.

    Managing a project via Git/GitHub is much like the Google Doc scenario and enjoys many of the same advantages. It is definitely more complicated than collaborating on a Google Doc, but this puts you in the right mindset. Source


    Step 1: Download and install

    Follow steps 1-4 of this tutorial.

    Then:

    1. Request access to the BYU-I Resumes page at Request Access
    2. Respond to the auto-generated email
    3. Wait a few minutes for authorization
    4. Join our GitHub organization - byuids-resumes.

    If you are on a Mac, you may need:

    Step 2: Create a repository from the resume template and connect to the BYUI


    Step 3: Publish your resume to GitHub Pages

    • Go to settings for your repo.
    • Scroll down to the GitHub Pages section.
    • Under source select the box which says None and pick master.
    • Now select the /docs folder and click save.
    • Copy your site URL at the top of the /settings/pages location.
    • Add your link to the About section of your repository.
    • Edit the readme.md in the base repo to not show the resume directions.

    Step 4: Clone repo into VS Code

    Analytics Vidhya reading


    Step 5: Make your resume look good

    Examples:

    You may also find these articles helpful:



    \ No newline at end of file diff --git a/slides/p6/d3/index.html b/slides/p6/d3/index.html index 4bbeea7b..43999033 100644 --- a/slides/p6/d3/index.html +++ b/slides/p6/d3/index.html @@ -3,6 +3,6 @@


    Day 2: Commit, push, fork, and merge

    Welcome to class!

    Announcements


    Practice with Git

    GQ3: add, commit, push and a little pull

    Let’s save the changes we’ve made to our resume.
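The pull, add, commit, push cycle can be sketched as a short Python helper that builds the Git commands as argument lists. The function names, the commit message, and the `origin` and `master` defaults are assumptions for illustration; `run_commands` should only be called with the path to your own cloned resume repository.

```python
import shlex
import subprocess

def save_commands(message, remote="origin", branch="master"):
    """Build the pull, add, commit, push sequence as argument lists."""
    return [
        ["git", "pull", remote, branch],   # bring in changes from GitHub first
        ["git", "add", "--all"],           # stage every changed file
        ["git", "commit", "-m", message],  # record the staged snapshot locally
        ["git", "push", remote, branch],   # send the new commit to GitHub
    ]

def run_commands(commands, repo_dir):
    """Run each command inside repo_dir, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=repo_dir, check=True)

# Show the commands without running them.
commands = save_commands("Update resume draft")
for command in commands:
    print(shlex.join(command))
```

Printing the commands first makes it easy to type them into the VS Code terminal by hand, which is likely how you will run them in class.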


    GQ4: Fork and merge

    Get into groups of 2 or 3. Then follow the steps below:

    1. Fork the other student’s resume repository.
    2. Now clone that forked repository to your computer.
    3. On your local version of the forked repository, do the following:
      A. Create a new file called feedback.md
      B. Make a few recommendations or notes in the feedback.md file that will help the other student improve his or her resume
      C. add, commit, push your edits
      D. Go to the forked repo on GitHub and check if the feedback.md file shows up online
    4. Now, create a pull request to get your edits into the other student’s original repo.

    Once you’ve given another student feedback, accept any pull requests submitted to your own repo. Continue to edit and improve your resume based on the feedback you received.


    GQ5: Fork into byuids-resumes

    Fork your own resume repository into the BYU-I Data Science Resumes group.

    If you change your resume after you create this fork, you will have to submit a pull request to make sure the final version of your resume shows up in the group.

    These instructions will help you create a pull request.



    \ No newline at end of file diff --git a/slides/p6/d4/index.html b/slides/p6/d4/index.html index d40c8e39..0a3e77a4 100644 --- a/slides/p6/d4/index.html +++ b/slides/p6/d4/index.html @@ -3,5 +3,5 @@

    Day 3: Resume Fork and Merge

    Remember from last class: pull, add, commit, push.


    Making edits in another user’s repo

    Breakout Room Activity

    Each student in the breakout room is going to provide feedback on another student’s resume. The breakout room should begin with a group discussion about the work you’ve each done on your resume and any questions the group has. Then follow the steps below.

    1. Fork the other student’s resume repository.
    2. Now clone that forked repository to your computer.
    3. On your local version of the forked repository, do the following:
      A. Create a new file called edits.md and save it in the main folder of the repository.
      B. Make a few recommendations or notes in the edits.md file that will help the other student improve his or her resume.
      C. add, commit, push your edits.
      D. Go to the forked repo on GitHub and check if the edits.md file shows up online.
    4. Now, create a pull request to get your edits into the other student’s original repo.
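Steps 3A and 3B can be sketched in Python as follows. A temporary folder stands in for the cloned fork so the sketch runs anywhere; the helper name, the heading text, and the sample notes are assumptions for illustration.

```python
import tempfile
from pathlib import Path

def write_edits_file(repo_dir, notes):
    """Create edits.md in the main folder of the cloned fork."""
    path = Path(repo_dir) / "edits.md"
    lines = ["# Resume feedback", ""] + [f"- {note}" for note in notes]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path

# Hypothetical feedback notes for the other student.
notes = [
    "Add a link to your GitHub profile.",
    "List the Python libraries used in each project.",
]

# The temporary directory is a stand-in for the real cloned repository.
with tempfile.TemporaryDirectory() as repo_dir:
    edits_path = write_edits_file(repo_dir, notes)
    edits_text = edits_path.read_text(encoding="utf-8")

print(edits_text)

# From inside the real clone, step C would then be (branch name assumed):
#   git add edits.md
#   git commit -m "Add resume feedback"
#   git push origin master
```

After the push, the file should appear in the forked repo on GitHub, and the pull request in step 4 proposes it to the original repo.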

    Once you’ve given another student feedback, accept any pull requests submitted to your own repo. Continue to edit and improve your resume based on the feedback you received.


    Creating a fork in byuids-resumes

    Fork your own resume repository into the BYU-I Data Science Resumes group.

    If you change your resume after you create this fork, you will have to submit a pull request to make sure the final version of your resume shows up in the group.

    These instructions will help you create a pull request.


    Open time to finalize your resume



    \ No newline at end of file diff --git a/slides/p6/index.html b/slides/p6/index.html index 76741762..79f9b63b 100644 --- a/slides/p6/index.html +++ b/slides/p6/index.html @@ -3,5 +3,5 @@

    Week 12-13: Project 6 - GitHub

    GitHub is the communication tool for Data Scientists and developers. As students, you will want to curate your creative work on GitHub using Git. GitHub is the place to share your original work, not your homework assignments. Many people store their personal websites, blogs, and project websites on GitHub. Our textbook and course are hosted on GitHub, and you can see J. Hathaway’s or Ryan Hafen’s personal Data Science websites that are hosted on GitHub as well. For this project, you will make a public resume that is hosted on GitHub.

    During this project, we will learn the Git workflow and the tools of GitHub. We will use the Git workflow to have others in our class edit our resumes. Take the process seriously (pick a suitable username and write a good resume), and you will have the beginning of your social presence in the DS/CS space.

    Grand Questions

    1. Join the BYUI Data Science Resumes GitHub organization and use the template repository to make a resume repository under your repositories. A good name might be LASTNAME-Resume.
    2. Clone your repository to your computer and build a first draft of your resume.
    3. Push your results to GitHub and have another student fork your repository to make edits.
    4. Accept the proposed changes from the student review and finish your final version.
    5. Make sure your resume is forked by BYU-I Data Science Resumes.



    \ No newline at end of file