diff --git a/index.html b/index.html index 5f32f50..436e70b 100644 --- a/index.html +++ b/index.html @@ -3,4 +3,4 @@

CSE 250: Data Science Programming

Using pandas, Altair, scikit-learn, and NumPy to program with data

-
\ No newline at end of file +
\ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml index 49a56f1..8e8947b 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -1 +1 @@ -https://byuistats.github.io/DS250-Cannon/slides/introduction/day02/2020-09-17T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/d4/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d4/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/syllabus/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/introduction/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/introduction/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/git_github_ds/pull_merge/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/introduction/day01/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/d3/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d3/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/pandas_altair/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-1/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/python-for-data-science/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/d2/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d2/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/json_missing/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-2/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d1/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/machine-learning/2020-09-15T10:42:26+06:00https://byuistats
.github.io/DS250-Cannon/projects/project-3/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/relational_data/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/ml_sklearn/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-4/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-5/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/sql-for-data-science/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/munging/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-6/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/vs-code/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/altair/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/git_github/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/markdown/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/quarto-for-data-science/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/git_github_ds/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/introduction/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/faq/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slack/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/2020-10-06T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/final_coding_challenge/sp22/https://byuistats.github.io/DS250-Cannon/categories
/https://byuistats.github.io/DS250-Cannon/final_coding_challenge/https://byuistats.github.io/DS250-Cannon/contact/https://byuistats.github.io/DS250-Cannon/tags/ \ No newline at end of file +https://byuistats.github.io/DS250-Cannon/slides/introduction/day02/2020-09-17T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/d4/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/syllabus/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/introduction/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/introduction/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/git_github_ds/pull_merge/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/introduction/day01/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/d3/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/pandas_altair/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-1/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/python-for-data-science/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/d2/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/json_missing/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-2/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/machine-learning/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-3/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/relational_data/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/ml_sklearn/2020-09-15T10:42:26+06:00https://
byuistats.github.io/DS250-Cannon/projects/project-4/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-5/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/sql-for-data-science/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/munging/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-6/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/vs-code/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/altair/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/git_github/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/markdown/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/quarto-for-data-science/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/git_github_ds/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/introduction/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/faq/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slack/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/2020-10-06T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/final_coding_challenge/sp22/https://byuistats.github.io/DS250-Cannon/categories/https://byuistats.github.io/DS250-Cannon/final_coding_challenge/https://byuistats.github.io/DS250-Cannon/contact/https://byuistats.github.io/DS250-Cannon/tags/ \ No newline at end of file diff --git a/slides/index.html b/slides/index.html index b84fcae..2d0868e 100644 --- a/slides/index.html +++ b/slides/index.html @@ -3,5 +3,5 @@
\ No newline at end of file diff --git a/slides/introduction/day01/index.html b/slides/introduction/day01/index.html index e816b5c..59caca7 100644 --- a/slides/introduction/day01/index.html +++ b/slides/introduction/day01/index.html @@ -2,7 +2,7 @@
\ No newline at end of file diff --git a/slides/introduction/day02/index.html b/slides/introduction/day02/index.html index c0ce4fd..468ec85 100644 --- a/slides/introduction/day02/index.html +++ b/slides/introduction/day02/index.html @@ -2,6 +2,6 @@
\ No newline at end of file diff --git a/slides/introduction/index.html b/slides/introduction/index.html index c4a6e90..d259dcb 100644 --- a/slides/introduction/index.html +++ b/slides/introduction/index.html @@ -2,6 +2,6 @@

J. Hathaway and BYU-I ©

\ No newline at end of file diff --git a/slides/p5/d1/clean_workflow.png b/slides/p5/d1/clean_workflow.png deleted file mode 100644 index ed2364c..0000000 Binary files a/slides/p5/d1/clean_workflow.png and /dev/null differ diff --git a/slides/p5/d1/index.html b/slides/p5/d1/index.html deleted file mode 100644 index 2ae5f57..0000000 --- a/slides/p5/d1/index.html +++ /dev/null @@ -1,16 +0,0 @@ -Day 1: The war with Star Wars

Day 1: The war with Star Wars

Welcome to class!

Spiritual Thought

Announcements

  1. Project 4 thoughts
    • Feature Importances - Sorted Bar Graph, not unsorted tables
    • Suppress warnings
    • And the winner is…

The Star Wars data

Load the Star Wars data

# %%
-import pandas as pd 
-import altair as alt
-import numpy as np
-
-url = 'https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv'
-
-dat = pd.read_csv(url)
-
-

???

What do the data look like?

Take the time to understand how the current data is organized.

First things first…

Each group should answer these questions:

  1. Where are the column names?
  2. What does each row represent?
  3. What does each column represent?

What do we want the data to look like?

Each group should answer these questions:

  1. What is the goal of this project, and how does that affect what we want from the data?
  2. What do we want each row to represent?
  3. What do we want each column to look like? Pick a few columns from the dataset and try creating an example in Excel.

Cleaning data takes time

Maybe not 80% of your time, but it does take time!

Data science is frequently about doing bespoke analysis which means creating and labelling unique datasets. No matter how cleanly formatted or standardized a dataset is, it likely needs some work.

I would argue that spending time working with data to transform, explore and understand it better is absolutely what data scientists should be doing. This is the medium they are working in. Understand the material better and you’ll get better insights. ref


Structure your project, structure your thinking

Tableau on tidying data

  1. Think about your data holistically
  2. Know the basic structure of your data
  3. Keep track of your steps
  4. Spot check throughout

Compartmentalize and organize your scripts and data


What are codecs and encodings?


The .str functions in pandas

J. Hathaway and BYU-I ©

\ No newline at end of file diff --git a/slides/p5/d1/index.xml b/slides/p5/d1/index.xml deleted file mode 100644 index ea67e10..0000000 --- a/slides/p5/d1/index.xml +++ /dev/null @@ -1 +0,0 @@ -Day 1: The war with Star Wars on DS250https://byuistats.github.io/DS250-Cannon/slides/p5/d1/Recent content in Day 1: The war with Star Wars on DS250Hugo -- gohugo.ioen-usJ. Hathaway and BYU-I ©Fri, 01 May 2020 11:02:05 +0600 \ No newline at end of file diff --git a/slides/p5/d2/clean_workflow.png b/slides/p5/d2/clean_workflow.png deleted file mode 100644 index ed2364c..0000000 Binary files a/slides/p5/d2/clean_workflow.png and /dev/null differ diff --git a/slides/p5/d2/index.html b/slides/p5/d2/index.html deleted file mode 100644 index 6cb64fd..0000000 --- a/slides/p5/d2/index.html +++ /dev/null @@ -1,48 +0,0 @@ -Day 2: Star Wars and strings

Day 2: Star Wars and strings

Welcome to class!

Announcements

What’s something you’re grateful for today?


The .str functions in pandas


.str.strip()

s = pd.Series(['1. Ant.  ', '2. Bee!\n', '3. Cat?\t', '4. Beat?\t', np.nan])
-
-s.str.strip()
-
-s.str.strip('123.!? \n\t')
-
-s.str.strip('1234.!? \n\t')
-
-
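As a rough illustration of what the three calls above return, here is a minimal sketch on the same Series. The variable names are made up for this example.

```python
import numpy as np
import pandas as pd

s = pd.Series(['1. Ant.  ', '2. Bee!\n', '3. Cat?\t', '4. Beat?\t', np.nan])

# Default: strips whitespace (spaces, tabs, newlines) from both ends only.
whitespace_stripped = s.str.strip()

# With an argument: strips any of the listed characters from both ends.
# '4' is not in the set, so the last string keeps its leading '4. '.
partly_stripped = s.str.strip('123.!? \n\t')

# Adding '4' to the set removes the numeric prefix from every string.
fully_stripped = s.str.strip('1234.!? \n\t')

print(fully_stripped)
```

Note that the argument is a set of characters, not a substring, and that missing values stay missing.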

.str.replace()

s.str.replace('Ant.', 'Man')
-s.str.replace('a', 8)    # raises TypeError: repl must be a string or callable
-s.str.replace('a', '8')
-s.str.replace('a', '8', case = False)
-s.str.replace('a|e', '8', case = False, regex = True)
-
-s.str.replace(r'\d', '', case = False, regex = True)
-
-
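Whether a pattern is treated as a regular expression by default has changed between pandas versions, so the sketch below passes `regex` explicitly. It reuses the Series from the strip example; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series(['1. Ant.  ', '2. Bee!\n', '3. Cat?\t', '4. Beat?\t', np.nan])

# Literal replacement: regex=False treats '.' as a plain period.
literal = s.str.replace('Ant.', 'Man', regex=False)

# Pattern replacement: 'a|e' matches either letter; case=False ignores case.
vowels = s.str.replace('a|e', '8', case=False, regex=True)

# A raw string avoids the invalid-escape warning for '\d' (any digit).
no_digits = s.str.replace(r'\d', '', regex=True)

print(vowels)
```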

.str.split()

s2 = pd.Series(['1-20', '21-50', '51-80', '81-100', np.nan])
-s3 = pd.Series(
-    [
-        "this is a regular sentence",
-        "https://docs.python.org/3/tutorial/index.html",
-        np.nan
-    ]
-)
-
-s2.str.split()
-s3.str.split()
-s2.str.split(pat="-")
-
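A small sketch of how the split calls above differ, using the same `s2` Series. The result names are invented for this example.

```python
import numpy as np
import pandas as pd

s2 = pd.Series(['1-20', '21-50', '51-80', '81-100', np.nan])

# Without a pattern, split() breaks on whitespace, so '1-20' stays whole.
on_whitespace = s2.str.split()

# With pat='-', each range becomes a two-element list.
on_dash = s2.str.split(pat='-')

# expand=True returns a DataFrame with one column per piece.
bounds = s2.str.split(pat='-', expand=True)

print(bounds)
```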

.str.join() or .str.cat()

two_columns = s2.str.split("-", expand = True).rename(
-   columns = {0: 'minimum', 1: 'maximum'})
-
-two_columns.fillna("").agg("__".join, axis = 1)
-
-two_columns.minimum.str.cat(two_columns.maximum, sep = "__")
-
-

Fixing the column names

Here is some code to get you started:

url = 'https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv'
-
-starwars_data = pd.read_csv(url, encoding = "ISO-8859-1", skiprows = 2, header = None)
-starwars_cols = pd.read_csv(url, encoding = "ISO-8859-1", nrows = 2, header = None)
-
-starwars_cols.iloc[0,:].str.upper().str.replace(" ", "!")
-
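One possible way to go from the two header rows to usable column names is sketched below. The `header_rows` frame is a hypothetical stand-in for `starwars_cols` so the example runs without downloading the file, and the naming scheme (lowercase, underscores) is an assumption, not the required answer.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the first two rows of the survey file.
header_rows = pd.DataFrame([
    ['RespondentID', 'Which films have you seen?', np.nan, 'Gender'],
    [np.nan, 'Episode I', 'Episode II', 'Response'],
])

# Row 0 holds the question; blanks repeat the question to their left.
questions = header_rows.iloc[0, :].ffill()
# Row 1 holds the sub-label; blanks become empty strings.
labels = header_rows.iloc[1, :].fillna('')

clean_names = (
    questions.str.cat(labels, sep='__')           # question + sub-label
    .str.lower()
    .str.replace(r'[^a-z0-9]+', '_', regex=True)  # runs of other characters -> '_'
    .str.strip('_')                               # no leading/trailing underscores
)

print(clean_names.tolist())
```

The resulting list could then be assigned to `starwars_data.columns`, assuming the two frames have the same number of columns.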

Validating statistical summaries

len(), .query(), and .value_counts() will be your friends.
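A minimal sketch of how these three tools fit together when checking a published number. The `survey` frame and its values are made up for illustration; only the column names follow the slides.

```python
import pandas as pd

# Hypothetical, already-cleaned slice of the survey.
survey = pd.DataFrame({
    'have_seen_any': ['Yes', 'Yes', 'No', 'Yes', 'Yes'],
    'shot_first': ['Han', 'Greedo', None, 'Han', "I don't understand this question"],
})

# .query() filters rows; len() gives the denominator to report.
seen = survey.query('have_seen_any == "Yes"')
n_seen = len(seen)

# normalize=True turns counts into proportions to compare with a published chart.
shot_first_share = seen.shot_first.value_counts(normalize=True)

print(n_seen)
print(shot_first_share)
```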


Validating visuals

You’re going to make a lot of bar charts!

J. Hathaway and BYU-I ©

\ No newline at end of file diff --git a/slides/p5/d2/index.xml b/slides/p5/d2/index.xml deleted file mode 100644 index cc82763..0000000 --- a/slides/p5/d2/index.xml +++ /dev/null @@ -1 +0,0 @@ -Day 2: Star Wars and strings on DS250https://byuistats.github.io/DS250-Cannon/slides/p5/d2/Recent content in Day 2: Star Wars and strings on DS250Hugo -- gohugo.ioen-usJ. Hathaway and BYU-I ©Fri, 01 May 2020 11:02:05 +0600 \ No newline at end of file diff --git a/slides/p5/d3/clean_workflow.png b/slides/p5/d3/clean_workflow.png deleted file mode 100644 index ed2364c..0000000 Binary files a/slides/p5/d3/clean_workflow.png and /dev/null differ diff --git a/slides/p5/d3/index.html b/slides/p5/d3/index.html deleted file mode 100644 index 41f9b78..0000000 --- a/slides/p5/d3/index.html +++ /dev/null @@ -1,36 +0,0 @@ -Day 3: Validating data, cleaning columns

Day 3: Validating data, cleaning columns

Welcome to class!

Announcements

Spiritual Thought

Let’s validate some data!

Pick something from the Star Wars article you want to validate (“double check”).


Moving from categories to values.

  1. Create an additional column(s) that converts the income ranges to a number.
  2. Create an additional column(s) that converts the age ranges to a number.
  3. Create an additional column(s) that converts the school groupings to a number.
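As one possible approach to the first item, the sketch below keeps the lower bound of each range. The income labels are hypothetical examples written in the style of the survey, and choosing the minimum rather than the midpoint is a design choice you may want to revisit.

```python
import numpy as np
import pandas as pd

# Hypothetical income labels in the style of the survey's ranges.
income = pd.Series(['$0 - $24,999', '$25,000 - $49,999', '$150,000+', np.nan])

income_min = (
    income.str.split('-', expand=True)[0]     # keep the lower bound
    .str.replace(r'[^0-9]', '', regex=True)   # drop '$', ',', '+', and spaces
    .astype('float')                          # float keeps NaN for missing answers
)

print(income_min)
```

The same pattern should carry over to the age ranges; the school groupings have no numbers in them and need an explicit mapping instead.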

Validating visuals

You’re going to make a lot of bar charts!


Getting started on Question 3

One-hot encoding

Project 5 asks you to “one-hot encode all columns that have categories” and “convert all yes/no responses to 1/0 numeric”.

The get_dummies method can be used to create one-hot encoded variables. The pd.get_dummies documentation is a great place to start.

After reading the documentation, study the code below and get started on Grand Question #3.

#%%
-# When we use machine learning to predict salary,
-# let's only look at people that have seen at least
-# one star wars film
-starwars = starwars.query('have_seen_any == "Yes"')
-
-# Discuss - what's a better way to filter out people 
-# who haven't seen star wars?
-
-# %%
-# Format columns for machine learning
-
-# Let's try this first: convert categories to "one-hot" encodings
-shot_first_onehot = pd.get_dummies(starwars.shot_first)
-shot_first_onehot
-
-# What's the difference between the code above,
-# and this? Which one is better?
-shot_first_onehot = pd.get_dummies(starwars.shot_first, drop_first=True)
-shot_first_onehot
-
-# %%
-# 'get_dummies()' can also be used to convert yes/no answers to 0/1
-
-episode_i = pd.get_dummies(starwars.seen_film_i__the_phantom_menace)
-episode_i
-
-# %%
-episode_i.value_counts()
-

J. Hathaway and BYU-I ©

\ No newline at end of file diff --git a/slides/p5/d3/index.xml b/slides/p5/d3/index.xml deleted file mode 100644 index 1c42823..0000000 --- a/slides/p5/d3/index.xml +++ /dev/null @@ -1 +0,0 @@ -Day 3: Validating data, cleaning columns on DS250https://byuistats.github.io/DS250-Cannon/slides/p5/d3/Recent content in Day 3: Validating data, cleaning columns on DS250Hugo -- gohugo.ioen-usJ. Hathaway and BYU-I ©Fri, 01 May 2020 11:02:05 +0600 \ No newline at end of file diff --git a/slides/p5/d4/index.html b/slides/p5/d4/index.html deleted file mode 100644 index 816ad57..0000000 --- a/slides/p5/d4/index.html +++ /dev/null @@ -1,10 +0,0 @@ -Day 4: May the ML columns be with you

Day 4: May the ML columns be with you

Welcome to class!

Spiritual Thought

Announcements


Getting the data ready for machine learning.


What are machine learning algorithms expecting to see?

We need to handle missing values and categorical features before feeding the data into a machine learning algorithm, because the mathematics underlying most machine learning models assumes that the data is numerical and contains no missing values. To reinforce this requirement, scikit-learn will return an error if you try to train a model using data that contain missing values or non-numeric values when working with models like linear regression and logistic regression. ref

We have some options when converting categorical features (columns) to numeric.

  • If the category contains numeric information (like a range of numbers) we can convert it to a numeric variable by taking the minimum, average, or maximum of the range.
  • Factorization: If the category is an “ordinal” variable (meaning, there is an order to the categories) we can assign each category to an integer. (For example, good = 1, better = 2, best = 3.)
  • One-hot Encoding or Dummy Variables: If the category is a “nominal” variable (without an order) then we need to use one-hot encoding (sometimes called “dummy variable encoding”).
  • If the category is some version of True/False or Yes/No then we can simply convert the values to zeros and ones.
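The factorization option can be sketched with an explicit mapping. `pd.factorize()` numbers categories in order of appearance, which does not necessarily match their real order, so this example spells the order out. The education labels and their ordering are assumptions; check them against the actual values in the data.

```python
import pandas as pd

education = pd.Series(['High school degree', 'Graduate degree', 'Bachelor degree', None])

# Assumed ordering from least to most schooling; adjust to the real labels.
levels = [
    'Less than high school degree',
    'High school degree',
    'Some college or Associate degree',
    'Bachelor degree',
    'Graduate degree',
]

# Build {label: rank} and apply it; unmapped or missing values become NaN.
education_rank = {label: rank for rank, label in enumerate(levels)}
education_num = education.map(education_rank)

print(education_num)
```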

What’s our game plan for the Star Wars columns?

1. Break into Groups

Strategize + Code + Share

  • Group 1: How are you going to turn Age, Income and Education into numbers?
  • Group 2: How are you going to encode
    • Who Shot First
    • Gender
    • Location
    • All the Yes/No responses
  • Group 3: How are you going to deal with the character rankings?

2. Combine all the factors into one big X dataframe

3. Define Y as those making > $50k

First: Limit the data to only people who answered “Yes” to the question “Have you seen any of the 6 films in the Star Wars franchise?”.

Then: Use the table below as a guide to prepare your data for machine learning.

Column | Original Format | Convert To
age | category (ordinal, age ranges) | number
income | category (ordinal, income ranges) | number
education | category (ordinal, name of degree) | number
shot_first | category (nominal) | one-hot
gender | category (nominal) | one-hot
location | category (nominal) | one-hot
fan_star_wars | Yes/No | 0/1
expanded_universe | Yes/No | 0/1
fan_exapanded | Yes/No | 0/1
fan_star_trek | Yes/No | 0/1
seen_i | Yes/No (name of movie/NaN) | 0/1
seen_ii | Yes/No (name of movie/NaN) | 0/1
seen_iii | Yes/No (name of movie/NaN) | 0/1
seen_iv | Yes/No (name of movie/NaN) | 0/1
seen_v | Yes/No (name of movie/NaN) | 0/1
seen_vi | Yes/No (name of movie/NaN) | 0/1
movie rankings | number | -
character rankings | category (ordinal) | one-hot or factorize

What functions can we use to convert the categorical columns to numeric?

Question: When and why would we drop the first column when we convert a category using pd.get_dummies()?

Answer: Whenever your algorithm needs to calculate a matrix inverse.

The one-hot encoding creates one binary variable for each category.


The problem is that this representation includes redundancy. For example, if we know that [1, 0, 0] represents “blue” and [0, 1, 0] represents “green” we don’t need another binary variable to represent “red”, instead we could use 0 values for both “blue” and “green” alone, e.g. [0, 0].


This is called a dummy variable encoding, and always represents C categories with C-1 binary variables. In addition to being slightly less redundant, a dummy variable representation is required for some models.


For example, in the case of a linear regression model (and other regression models that have a bias term), a one hot encoding will cause the matrix of input data to become singular, meaning it cannot be inverted and the linear regression coefficients cannot be calculated using linear algebra. For these types of models a dummy variable encoding must be used instead.

Source
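The redundancy described above can be checked numerically. This sketch uses the quoted blue/green/red example with a few made-up rows and compares the rank of the design matrix (intercept plus encoded columns) under each encoding.

```python
import numpy as np
import pandas as pd

colors = pd.Series(['blue', 'green', 'red', 'blue', 'red'])
intercept = np.ones((len(colors), 1))

# Full one-hot: the three indicator columns sum to the intercept column.
one_hot = pd.get_dummies(colors, dtype=float).to_numpy()
X_one_hot = np.hstack([intercept, one_hot])

# Dummy encoding: drop_first=True removes the redundant column.
dummy = pd.get_dummies(colors, drop_first=True, dtype=float).to_numpy()
X_dummy = np.hstack([intercept, dummy])

rank_one_hot = np.linalg.matrix_rank(X_one_hot)  # fewer than its 4 columns
rank_dummy = np.linalg.matrix_rank(X_dummy)      # equal to its 3 columns

print(X_one_hot.shape[1], rank_one_hot)
print(X_dummy.shape[1], rank_dummy)
```

When the rank is lower than the number of columns, the matrix that ordinary least squares needs to invert is singular, which is the situation the quote describes.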


Predicting income.

Grand Question 4 wants us to “build a machine learning model that predicts whether a person makes more than $50k”.

Aka, what is our “outcome” or “response” that we want to predict?

dat_ml.income > 50000
-

Remember not to include the answer (income) in your features!

x = dat_ml.drop(['income'], axis = 1)
-

The response needs to be saved as a 0/1 variable (at least, for binary classification algorithms).

y = (dat_ml.income > 50000) / 1
-
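Putting the last few snippets together, a minimal sketch might look like the following. The `dat_ml` rows are hypothetical, and `.astype(int)` is shown as one alternative that stores the target as integers instead of the floats produced by dividing by 1.

```python
import pandas as pd

# Hypothetical, already-numeric modelling table.
dat_ml = pd.DataFrame({
    'income': [12500, 75000, 50000, 125000],
    'age': [24, 37, 52, 60],
    'fan_star_wars': [1, 0, 1, 1],
})

# Features: everything except the column the target is derived from.
X = dat_ml.drop(columns=['income'])

# Target: 1 if income is above $50k, else 0 (stored as integers).
y = (dat_ml.income > 50000).astype(int)

print(X.columns.tolist())
print(y.tolist())
```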

J. Hathaway and BYU-I ©

\ No newline at end of file diff --git a/slides/p5/d4/index.xml b/slides/p5/d4/index.xml deleted file mode 100644 index 62b4a32..0000000 --- a/slides/p5/d4/index.xml +++ /dev/null @@ -1 +0,0 @@ -Day 4: May the ML columns be with you on DS250https://byuistats.github.io/DS250-Cannon/slides/p5/d4/Recent content in Day 4: May the ML columns be with you on DS250Hugo -- gohugo.ioen-usJ. Hathaway and BYU-I ©Fri, 01 May 2020 11:02:05 +0600 \ No newline at end of file diff --git a/slides/p5/index.html b/slides/p5/index.html deleted file mode 100644 index 0e97e11..0000000 --- a/slides/p5/index.html +++ /dev/null @@ -1,7 +0,0 @@ -Week 10-11: Project 5 - Star Wars

Week 10-11: Project 5 - Star Wars

A significant portion of a data scientist’s job is data cleaning. During these two weeks we will not hide the data munging from you. We will practice data cleaning using a Star Wars survey from FiveThirtyEight. Survey data is notoriously difficult to handle. Even when the data is recorded cleanly, the options for ‘write in questions’, ‘choose from multiple answers’, ‘pick all that are right’, and ‘multiple choice questions’ make storing the data in a tidy format difficult.

J. Hathaway and BYU-I ©

\ No newline at end of file diff --git a/slides/p5/index.xml b/slides/p5/index.xml deleted file mode 100644 index 354f1f4..0000000 --- a/slides/p5/index.xml +++ /dev/null @@ -1 +0,0 @@ -Week 10-11: Project 5 - Star Wars on DS250https://byuistats.github.io/DS250-Cannon/slides/p5/Recent content in Week 10-11: Project 5 - Star Wars on DS250Hugo -- gohugo.ioen-usJ. Hathaway and BYU-I ©Fri, 01 May 2020 11:02:05 +0600 \ No newline at end of file diff --git a/slides/p6/d2/index.html b/slides/p6/d2/index.html index 1896c0b..e4f8d92 100644 --- a/slides/p6/d2/index.html +++ b/slides/p6/d2/index.html @@ -3,5 +3,5 @@

Day 1: Git and Github

Welcome to class!

Spiritual Thought

Announcements

  1. Project 5 Comment

    • Feature Importance and Model discussion
  2. The last day of DSS is next Wednesday, Dec 6th at 6:00PM in STC 394

  3. Extra credit for creating and uploading cheat sheet (2 points for projects or checkpoints)

  4. Coding Challenge date?

  5. The technical aspects of Project 6 will be done mostly in class. Resume prep/MD outside


Git and GitHub

“Web developers’ social media platform”

This is GitHub, the world’s largest code repository platform online. A platform used by some 50 million software developers to host their coding projects, most of them open-source — meaning others can access their codes and modify them to create better versions if they feel like.


Most of the internet is produced or hosted on GitHub in the form of code. “What Gmail is to email, GitHub is to writing software,” says Kiran Jonnalagadda, cofounder of HasGeek, a platform to build and discover peer groups. Source

  • Don’t: post code for assignments that hundreds of other students have done.
  • Do: post unique code using skills from your classes.

I would also recommend using private repos to manage your course work.


Is it going to hurt?

Answer: Yes.

It feels weird at first but quickly becomes second nature. If you plan on taking more data science classes, you should know that DS 350 students are required to submit all coursework via GitHub. This is a major topic in class and office hours for the first two weeks. Then we practically never discuss it again.

More bad news. Do you use GitHub to work with other people or to coordinate your own work from multiple computers? If so, after you recover from the initial setup, Git will crush you again with merge conflicts. And this is not one-time pain, this could be a dull ache for a long time.

Managing a project via Git/GitHub is much like the Google Doc scenario and enjoys many of the same advantages. It is definitely more complicated than collaborating on a Google Doc, but this puts you in the right mindset. Source


Step 1: Download and install

Follow steps 1-4 of this tutorial.

Then:

  1. Request access to the BYU-I Resumes page at Request Access
  2. Respond to the auto-generated email
  3. Wait a few minutes for authorization
  4. Join our GitHub organization - byuids-resumes.

If you are on a Mac, you may need:

Step 2: Create a repository from the resume template and connect to the BYUI


Step 3: Publish your resume to GitHub Pages

  • Go to settings for your repo.
  • Scroll down to the GitHub Pages section.
  • Under source select the box which says None and pick master.
  • Now select the /docs folder and click save.
  • Copy your site URL at the top of the /settings/pages location.
  • Add your link to the About section of your repository.
  • Edit the readme.md in the base repo to not show the resume directions.

Step 4: Clone repo into VS Code

Analytics Vidhya reading


Step 5: Make your resume look good

Examples:

You may also find these articles helpful:

\ No newline at end of file +active">Day 1: Git and Github
  • Week 1: Introduction
  • Day 1: Git and Github

    Welcome to class!

    Spiritual Thought

    Announcements

    1. Project 5 Comment

      • Feature Importance and Model discussion
    2. The last day of DSS is next Wednesday, Dec 6th at 6:00PM in STC 394

    3. Extra credit for creating and uploading cheat sheet (2 points for projects or checkpoints)

    4. Coding Challenge date?

    5. The technical aspects of Project 6 will be done mostly in class. Resume prep/MD outside


    Git and GitHub

    “Web developers’ social media platform”

    This is GitHub, the world’s largest code repository platform online. A platform used by some 50 million software developers to host their coding projects, most of them open-source — meaning others can access their codes and modify them to create better versions if they feel like.


    Most of the internet is produced or hosted on GitHub in the form of code. “What Gmail is to email, GitHub is to writing software,” says Kiran Jonnalagadda, cofounder of HasGeek, a platform to build and discover peer groups. Source

    • Don’t: post code for assignments that hundreds of other students have done.
    • Do: post unique code using skills from your classes.

    I would also recommend using private repos to manage your course work.


    Is it going to hurt?

    Answer: Yes.

    It feels weird at first but quickly becomes second nature. If you plan on taking more data science classes, you should know that DS 350 students are required to submit all coursework via GitHub. This is a major topic in class and office hours for the first two weeks. Then we practically never discuss it again.

    More bad news. Do you use GitHub to work with other people or to coordinate your own work from multiple computers? If so, after you recover from the initial setup, Git will crush you again with merge conflicts. And this is not one-time pain, this could be a dull ache for a long time.

    Managing a project via Git/GitHub is much like the Google Doc scenario and enjoys many of the same advantages. It is definitely more complicated than collaborating on a Google Doc, but this puts you in the right mindset. Source


    Step 1: Download and install

    Follow steps 1-4 of this tutorial.

    Then:

    1. Request access to the BYU-I Resumes page at Request Access
    2. Respond to the auto-generated email
    3. Wait a few minutes for authorization
    4. Join our GitHub organization - byuids-resumes.

    If you are on a Mac, you may need:

    Step 2: Create a repository from the resume template and connect to the BYUI


    Step 3: Publish your resume to GitHub Pages

    • Go to settings for your repo.
    • Scroll down to the GitHub Pages section.
    • Under source select the box which says None and pick master.
    • Now select the /docs folder and click save.
    • Copy your site URL at the top of the /settings/pages location.
    • Add your link to the About section of your repository.
    • Edit the readme.md in the base repo to not show the resume directions.

    Step 4: Clone repo into VS Code

    Analytics Vidhya reading


    Step 5: Make your resume look good

    Examples:

    You may also find these articles helpful:

    \ No newline at end of file diff --git a/slides/p6/d3/index.html b/slides/p6/d3/index.html index bdccfd8..07d3f5d 100644 --- a/slides/p6/d3/index.html +++ b/slides/p6/d3/index.html @@ -3,6 +3,6 @@

    Day 2: Commit, push, fork, and merge

    Welcome to class!

    Announcements


    Practice with Git

    GQ3: add, commit, push and a little pull

    Let’s save the changes we’ve made to our resume.


    GQ4: Fork and merge

    Get into groups of 2 or 3. Then follow the steps below:

    1. fork the other student’s resume repository.
    2. Now clone that forked repository to your computer.
    3. On your local version of the forked repository, do the following:
      A. Create a new file called feedback.md +active">Day 2: Commit, push, fork, and merge
    4. Day 1: Git and Github
    5. Week 1: Introduction

    Day 2: Commit, push, fork, and merge

    Welcome to class!

    Announcements


    Practice with Git

    GQ3: add, commit, push and a little pull

    Let’s save the changes we’ve made to our resume.


    GQ4: Fork and merge

    Get into groups of 2 or 3. Then follow the steps below:

    1. fork the other student’s resume repository.
    2. Now clone that forked repository to your computer.
    3. On your local version of the forked repository, do the following:
      A. Create a new file called feedback.md
      B. Make a few recommendations or notes in the feedback.md file that will help the other student improve his or her resume
      C. add, commit, push your edits
      D. Go to the forked repo on GitHub and check if the feedback.md file shows up online
    4. Now, create a pull request to get your edits into the other student’s original repo.

    Once you’ve given another student feedback, accept any pull requests submitted to your own repo. Continue to edit and improve your resume based on the feedback you received.


    GQ5: Fork into byuids-resumes

    Fork your own resume repository into the BYU-I Data Science Resumes group.

    If you change your resume after you create this fork, you will have to submit a pull request to make sure the final version of your resume shows up in the group.

    These instructions will help you create a pull request.


    \ No newline at end of file diff --git a/slides/p6/d4/index.html b/slides/p6/d4/index.html index e6124f6..a0e7f31 100644 --- a/slides/p6/d4/index.html +++ b/slides/p6/d4/index.html @@ -3,5 +3,5 @@

    Day 3: Resume Fork and Merge

    Remember from last class: pull, add, commit, push.


    Making edits in another user’s repo

    Breakout Room Activity

    Each student in the breakout room is going to provide feedback on another student’s resume. The breakout room should begin with a group discussion about the work you’ve each done on your resume and any questions the group has. Then follow the steps below.

    1. Fork the other student’s resume repository.
    2. Clone that forked repository to your computer.
    3. On your local copy of the forked repository, do the following:
      A. Create a new file called edits.md and save it in the main folder of the repository.
      B. Make a few recommendations or notes in the edits.md file that will help the other student improve his or her resume.
      C. Add, commit, and push your edits.
      D. Go to the forked repo on GitHub and check that the edits.md file shows up online.
    4. Create a pull request to get your edits into the other student’s original repo.
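
Steps 3A and 3B can also be done from Python instead of a text editor. The sketch below writes a few notes to edits.md in the top-level folder of a repository. The function name `write_edits_file`, the heading text, the example notes, and the temporary folder standing in for the cloned fork are all hypothetical.

```python
import tempfile
from pathlib import Path


def write_edits_file(repo_dir, notes, filename="edits.md"):
    """Write the notes as a Markdown bullet list in the repository's main folder."""
    path = Path(repo_dir) / filename
    lines = ["# Resume feedback", ""] + [f"- {note}" for note in notes]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    # A temporary folder stands in for the cloned fork in this sketch.
    with tempfile.TemporaryDirectory() as repo_dir:
        edits = write_edits_file(
            repo_dir,
            [
                "Lead each bullet with an action verb.",
                "Add a link to your GitHub profile.",
            ],
        )
        print(edits.read_text(encoding="utf-8"))
```

In a real clone, pass the path of the cloned folder as `repo_dir`; the file then still needs to be added, committed, and pushed as in step 3C.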

    Once you’ve given another student feedback, accept any pull requests submitted to your own repo. Continue to edit and improve your resume based on the feedback you received.


    Creating a fork in byuids-resumes

    Fork your own resume repository into the BYU-I Data Science Resumes group.

    If you change your resume after you create this fork, you will have to submit a pull request to make sure the final version of your resume shows up in the group.

    These instructions will help you create a pull request.


    Open time to finalize your resume

    \ No newline at end of file diff --git a/slides/p6/index.html b/slides/p6/index.html index d2bc580..ff69f15 100644 --- a/slides/p6/index.html +++ b/slides/p6/index.html @@ -3,5 +3,5 @@

    Week 12-13: Project 6 - GitHub

    GitHub is the communication tool for data scientists and developers. As students, you will want to curate your creative work on GitHub using Git. GitHub is the place to share your original work, not your homework assignments. Many people store their personal websites, blogs, and project websites on GitHub. Our textbook and course are hosted on GitHub, and you can see J. Hathaway’s or Ryan Hafen’s personal data science websites, which are hosted on GitHub as well. For this project, you will build a public resume hosted on GitHub.

    In this project, we will learn the Git workflow and the tools of GitHub. We will use that workflow to have others in our class edit our resumes. Take the process seriously (pick a suitable username and write a good resume), and you will have the beginning of your social presence in the DS/CS space.

    Grand Questions

    1. Join the BYUI Data Science Resumes GitHub organization and use the template repository to make a resume repository under your own account. A good name might be LASTNAME-Resume.
    2. Clone your repository to your computer and build a first draft of your resume.
    3. Push your results to GitHub and have another student fork your repository to make edits.
    4. Accept the proposed changes from the student review and finish your final version.
    5. Make sure your resume is forked by BYU-I Data Science Resumes.

    \ No newline at end of file