diff --git a/index.html b/index.html index ce076ae..95ad720 100644 --- a/index.html +++ b/index.html @@ -3,4 +3,4 @@

CSE 250: Data Science Programming

Using pandas, Altair, scikit-learn, and NumPy to program with data

-
\ No newline at end of file +
\ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml index 37192a6..8d0d3d2 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -1 +1 @@ -https://byuistats.github.io/DS250-Cannon/slides/introduction/day02/2020-09-17T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/d4/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/d4/2020-09-01T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d4/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p3/d4/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/syllabus/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/introduction/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/introduction/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/git_github_ds/pull_merge/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/introduction/day01/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/d3/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p3/d3/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/d3/2020-10-01T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d3/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/pandas_altair/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-1/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/python-for-data-science/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/d2/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/d2/2020-09-01T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p3/d2/2020-
10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d2/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/json_missing/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-2/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/d1/2020-10-01T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p3/d1/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d1/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/machine-learning/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-3/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/relational_data/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/ml_sklearn/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-4/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-5/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/sql-for-data-science/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p3/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/munging/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-6/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/vs-code/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/altair/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/git_github/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/markdown/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/quart
o-for-data-science/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/git_github_ds/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/introduction/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/faq/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slack/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/2020-10-06T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/final_coding_challenge/sp22/https://byuistats.github.io/DS250-Cannon/categories/https://byuistats.github.io/DS250-Cannon/final_coding_challenge/https://byuistats.github.io/DS250-Cannon/contact/https://byuistats.github.io/DS250-Cannon/tags/ \ No newline at end of file +https://byuistats.github.io/DS250-Cannon/slides/introduction/day02/2020-09-17T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/d4/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/d4/2020-09-01T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d4/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/syllabus/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/introduction/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/introduction/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/git_github_ds/pull_merge/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/introduction/day01/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/d3/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/d3/2020-10-01T10:
42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d3/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/pandas_altair/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-1/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/python-for-data-science/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p6/d2/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/d2/2020-09-01T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d2/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/json_missing/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-2/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/d1/2020-10-01T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p5/d1/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/machine-learning/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-3/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/relational_data/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/p4/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/ml_sklearn/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-4/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-5/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/sql-for-data-science/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/munging/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/project-6/2020-09-15T10:42:
26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/vs-code/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/altair/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/git_github/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/markdown/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/quarto-for-data-science/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/course-materials/git_github_ds/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/introduction/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/2020-10-12T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/faq/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/projects/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/skill_builders/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slack/2020-09-15T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/slides/2020-10-06T10:42:26+06:00https://byuistats.github.io/DS250-Cannon/final_coding_challenge/sp22/https://byuistats.github.io/DS250-Cannon/categories/https://byuistats.github.io/DS250-Cannon/final_coding_challenge/https://byuistats.github.io/DS250-Cannon/contact/https://byuistats.github.io/DS250-Cannon/tags/ \ No newline at end of file diff --git a/slides/index.html b/slides/index.html index d6bc9ff..edbde93 100644 --- a/slides/index.html +++ b/slides/index.html @@ -3,5 +3,5 @@
\ No newline at end of file diff --git a/slides/introduction/day01/index.html b/slides/introduction/day01/index.html index 031452f..fa05b1b 100644 --- a/slides/introduction/day01/index.html +++ b/slides/introduction/day01/index.html @@ -2,7 +2,7 @@
\ No newline at end of file diff --git a/slides/introduction/day02/index.html b/slides/introduction/day02/index.html index 523d860..cadd59b 100644 --- a/slides/introduction/day02/index.html +++ b/slides/introduction/day02/index.html @@ -2,6 +2,6 @@
\ No newline at end of file diff --git a/slides/introduction/index.html b/slides/introduction/index.html index 669f3a2..6d449f2 100644 --- a/slides/introduction/index.html +++ b/slides/introduction/index.html @@ -2,6 +2,6 @@

J. Hathaway and BYU-I ©

\ No newline at end of file diff --git a/slides/p3/d1/index.html b/slides/p3/d1/index.html deleted file mode 100644 index cba67f0..0000000 --- a/slides/p3/d1/index.html +++ /dev/null @@ -1,37 +0,0 @@ -Day 1: Intro to Project 3

Day 1: Intro to Project 3

Welcome to class!

Spiritual Thought

Announcements

  1. Project 2 Highlights
  2. Project 2 comments
  • Turn them in
  • Clean up graphs (main titles, axis labels, legends)
  • Column headers on tables in your report (don’t include index number either)
  • Technically, it is the proportion of all flights delayed by weather, not the proportion of delayed flights
  • JSON should look like a text example of a record, not a table
  3. Things for next project:
  • Be sure to give section headers meaningful titles (NOT “Question 1”)

What is Structured Query Language (SQL)?

-

Ok, but how does it work?

SQL uses keywords to pull (or “fetch”, “extract”) the data we want from a database. The computer reads those keywords in a specific order.

From EverSQL we can get some more background:

This is the logical order of operations, also known as the order of execution, for an SQL query:


  1. FROM, including JOINs
  2. WHERE
  3. GROUP BY
  4. HAVING
  5. WINDOW functions
  6. SELECT
  7. DISTINCT
  8. UNION
  9. ORDER BY
  10. LIMIT and OFFSET

But the reality isn’t that easy nor straight forward. As we said, the SQL standard defines the order of execution for the different SQL query clauses. Said that, modern databases are already challenging that default order by applying some optimization tricks which might change the actual order of execution, though they must end up returning the same result as if they were running the query at the default execution order.

For CSE 250: Don’t think too hard about optimization at this point. Let the database figure out the optimized routine.

Most SQL queries are typed in the following pattern:

SELECT -- <columns> and <column calculations>
-FROM -- <table name>
-  JOIN -- <table name>
-  ON -- <columns to join>
-WHERE -- <filter condition>
-GROUP BY -- <subsets for column calculations>
-HAVING -- <grouped filter condition>
-ORDER BY -- <how the output is returned in sequence>
-LIMIT -- <number of rows to return>
-
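The template above can be tried out on a small in-memory database. The following is a minimal sketch, assuming a made-up `orders` table (the table, its columns, and its values are hypothetical and not part of the course data). It leaves out `JOIN ... ON` to stay with a single table, and the comments mark which clause filters rows before grouping and which filters groups afterwards.

```python
import sqlite3

# Hypothetical in-memory table; names and numbers are made up for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (customer TEXT, region TEXT, amount REAL);
INSERT INTO orders VALUES
  ('ann', 'west', 10), ('ann', 'west', 30),
  ('bob', 'west', 5),
  ('cal', 'east', 50), ('cal', 'east', 70),
  ('dee', 'west', 40), ('dee', 'west', 60);
""")

query = """
SELECT customer, SUM(amount) AS total   -- columns and column calculations
FROM orders                             -- table name
WHERE region = 'west'                   -- filters rows before grouping
GROUP BY customer                       -- one output row per customer
HAVING SUM(amount) > 20                 -- filters groups after aggregation
ORDER BY total DESC                     -- how the output is sorted
LIMIT 2                                 -- number of rows to return
"""
rows = con.execute(query).fetchall()
print(rows)
```

Here `WHERE` drops the east-region rows before any sums are computed, while `HAVING` drops the customer whose summed total is too small.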

Project 3 - what are our goals?

Do we understand the questions being asked in Project 3?


The baseball data

Let’s start exploring the baseball data!

import pandas as pd
-import sqlite3
-
-con = sqlite3.connect('lahmansbaseballdb.sqlite')
-
-df = pd.read_sql_query("SELECT * FROM fielding LIMIT 5", con)
-df
-

How can we see what tables are in the database?

import pandas as pd
-import sqlite3
-
-con = sqlite3.connect('lahmansbaseballdb.sqlite')
-
-pd.read_sql_query("""
-
-SELECT name 
-FROM sqlite_master 
-WHERE type='table'
-
-""", con)
-
-
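Once the table names are known, a natural next step is to look at the columns of one table. The sketch below uses SQLite's `PRAGMA table_info` on a hypothetical `fielding_demo` table created in memory; the table and its three columns are assumptions for illustration, not the real `fielding` table.

```python
import sqlite3

# Hypothetical stand-in table so the example runs without the baseball file.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fielding_demo (playerID TEXT, yearID INTEGER, G INTEGER)")

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [row[1] for row in con.execute("PRAGMA table_info(fielding_demo)")]
print(columns)
```

With the course database, the same statement could be pointed at a real table name from the `sqlite_master` listing.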

Understanding SQL queries

Make sure you do the project readings!

J. Hathaway and BYU-I ©

\ No newline at end of file diff --git a/slides/p3/d1/index.xml b/slides/p3/d1/index.xml deleted file mode 100644 index 13db683..0000000 --- a/slides/p3/d1/index.xml +++ /dev/null @@ -1 +0,0 @@ -Day 1: Intro to Project 3 on DS250https://byuistats.github.io/DS250-Cannon/slides/p3/d1/Recent content in Day 1: Intro to Project 3 on DS250Hugo -- gohugo.ioen-usJ. Hathaway and BYU-I ©Fri, 01 May 2020 11:02:05 +0600 \ No newline at end of file diff --git a/slides/p3/d2/index.html b/slides/p3/d2/index.html deleted file mode 100644 index 7911fb1..0000000 --- a/slides/p3/d2/index.html +++ /dev/null @@ -1,16 +0,0 @@ -Day 2: SQL Calculations

Day 2: SQL Calculations

Welcome to class!

Spiritual Thought

Announcements

  1. Project 3 - SQL practice

Class Activity in Slack

Part 1

Goal: Describe in words (NOT using code) how to get from your starting data to your ending data.

Post your answer in your group’s Slack thread. You have 7 minutes, and are allowed to ask me 1 question.

Part 2

Goal: Now try to write a SQL query to get your ending data.

Post your SQL query in your group’s Slack thread. You have 7 minutes, and are allowed to ask me 1 question.

Here is the SQL template for your use.

SELECT -- <columns> and <column calculations>
-FROM -- <table name>
-  JOIN -- <table name>
-  ON -- <columns to join>
-WHERE -- <filter condition>
-GROUP BY -- <subsets for column calculations>
-HAVING -- <grouped filter condition>
-ORDER BY -- <how the output is returned in sequence>
-LIMIT -- <number of rows to return>
-


Getting started

Question One: Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.

Think about:

  • What tables (data) do you need?
  • What SQL commands do you need?

Extra Practice

“I get SQL and want to be challenged.”

Do this Math 335 task with SQL commands in Python.

J. Hathaway and BYU-I ©

\ No newline at end of file diff --git a/slides/p3/d2/index.xml b/slides/p3/d2/index.xml deleted file mode 100644 index 5da7091..0000000 --- a/slides/p3/d2/index.xml +++ /dev/null @@ -1 +0,0 @@ -Day 2: SQL Calculations on DS250https://byuistats.github.io/DS250-Cannon/slides/p3/d2/Recent content in Day 2: SQL Calculations on DS250Hugo -- gohugo.ioen-usJ. Hathaway and BYU-I ©Fri, 01 May 2020 11:02:05 +0600 \ No newline at end of file diff --git a/slides/p3/d3/index.html b/slides/p3/d3/index.html deleted file mode 100644 index 04e427f..0000000 --- a/slides/p3/d3/index.html +++ /dev/null @@ -1,8 +0,0 @@ -Day 3: The end of baseball

Day 3: The end of baseball

Welcome to class!

Spiritual Thought

Announcements

  1. Practice Coding Challenge
  2. Can I still get an “A”?
    • Profile of an “A” student
    • What if I fall behind?
  3. Reminders:
    • DS community assignment
    • Review and Request Letter

Coding Challenge:

How do I prepare? -What would your coding challenge look like?

Project 3 Questions

  1. Integer Division
  2. Career Batting Average
  3. What have you come up with for Q3? Metrics? Visualizations?
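The integer division issue can be reproduced in a few lines. This is a minimal sketch using literal numbers rather than the baseball tables: when both operands are integers, SQLite returns an integer quotient, so one operand has to be made a real number first.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Both operands are integers, so SQLite performs integer division.
int_result = con.execute("SELECT 27 / 100").fetchone()[0]

# Casting one operand to REAL, or multiplying by 1.0, keeps the decimals.
cast_result = con.execute("SELECT CAST(27 AS REAL) / 100").fetchone()[0]
mult_result = con.execute("SELECT 1.0 * 27 / 100").fetchone()[0]

print(int_result, cast_result, mult_result)
```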

Question 1

Ask yourself:

  1. What do I want and expect the end table to look like?
  2. What table(s) and calculations do I need?
  3. What makes a row in my end table unique?
  4. What problems can I anticipate?

Question 2

Ask yourself:

  1. What do I want and expect the end table to look like?
  2. What table(s) and calculations do I need?
  3. What makes a row in my end table unique?
  4. What problems can I anticipate?

Question 3

What are some ideas for Grand Question 3? Ask yourself:

  1. What information will you use to compare the two baseball teams?
  2. What table(s) and calculations do I need?
  3. What makes a row in my end table unique?
  4. What problems can I anticipate?

J. Hathaway and BYU-I ©

\ No newline at end of file diff --git a/slides/p3/d3/index.xml b/slides/p3/d3/index.xml deleted file mode 100644 index 59a6fea..0000000 --- a/slides/p3/d3/index.xml +++ /dev/null @@ -1 +0,0 @@ -Day 3: The end of baseball on DS250https://byuistats.github.io/DS250-Cannon/slides/p3/d3/Recent content in Day 3: The end of baseball on DS250Hugo -- gohugo.ioen-usJ. Hathaway and BYU-I ©Fri, 01 May 2020 11:02:05 +0600 \ No newline at end of file diff --git a/slides/p3/d4/index.html b/slides/p3/d4/index.html deleted file mode 100644 index fae56e5..0000000 --- a/slides/p3/d4/index.html +++ /dev/null @@ -1,7 +0,0 @@ -Day 4: Practice Coding Challenge

J. Hathaway and BYU-I ©

\ No newline at end of file diff --git a/slides/p3/d4/index.xml b/slides/p3/d4/index.xml deleted file mode 100644 index 8c6d4cf..0000000 --- a/slides/p3/d4/index.xml +++ /dev/null @@ -1 +0,0 @@ -Day 4: Practice Coding Challenge on DS250https://byuistats.github.io/DS250-Cannon/slides/p3/d4/Recent content in Day 4: Practice Coding Challenge on DS250Hugo -- gohugo.ioen-usJ. Hathaway and BYU-I ©Fri, 01 May 2020 11:02:05 +0600 \ No newline at end of file diff --git a/slides/p3/index.html b/slides/p3/index.html deleted file mode 100644 index 84ef725..0000000 --- a/slides/p3/index.html +++ /dev/null @@ -1,9 +0,0 @@ -Week 6-7: Project 3 - Baseball

Week 6-7: Project 3 - Baseball

We will use a baseball relational database to explore SQL in Python for data science applications. Finding relationships in baseball

Completed Readings: SQL for Data Science Readings (read all links) -and Why SQL is beating NoSQL, and what this means for the future of data

Use the data.world baseball url for the -Data Connection. You can read the
Connection Instructions for data.world here

Grand Questions

  1. Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.

  2. This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)

    1. Write an SQL query that provides playerID, yearID, and batting average for players with at least one at bat. Sort the table from highest batting average to lowest, and show the top 5 results in your report.
    2. Use the same query as above, but only include players with more than 10 “at bats” that year. Print the top 5 results.
    3. Now calculate the batting average for players over their entire careers (all years combined). Only include players with more than 100 at bats, and print the top 5 results.
  3. Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc.). Write an SQL query to get the data you need. Use Python if additional data wrangling is needed, then make a graph in Altair to visualize the comparison. Provide the visualization and the compiled Vega script that would build the visualization.
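For the career part of the batting average question, it may help to see why total hits divided by total at bats differs from averaging the yearly averages. The sketch below assumes a tiny hypothetical `batting_demo` table with made-up rows; the column names `H` and `AB` are assumed to mirror the baseball database, and the query is one possible shape, not the required answer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE batting_demo (playerID TEXT, yearID INTEGER, H INTEGER, AB INTEGER);
INSERT INTO batting_demo VALUES
  ('aaa01', 2001, 1, 2),
  ('aaa01', 2002, 30, 150),
  ('bbb01', 2001, 40, 160);
""")

query = """
SELECT playerID,
       CAST(SUM(H) AS REAL) / SUM(AB) AS career_avg,    -- total hits / total at bats
       AVG(CAST(H AS REAL) / AB)      AS mean_of_years  -- average of yearly averages
FROM batting_demo
GROUP BY playerID
HAVING SUM(AB) > 100
ORDER BY career_avg DESC
"""
rows = con.execute(query).fetchall()
for row in rows:
    print(row)
```

For the made-up player `aaa01`, a 1-for-2 season pulls the mean of yearly averages up to 0.35, while the career figure of 31 hits in 152 at bats is about 0.204.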

J. Hathaway and BYU-I ©

\ No newline at end of file diff --git a/slides/p3/index.xml b/slides/p3/index.xml deleted file mode 100644 index 6848870..0000000 --- a/slides/p3/index.xml +++ /dev/null @@ -1 +0,0 @@ -Week 6-7: Project 3 - Baseball on DS250https://byuistats.github.io/DS250-Cannon/slides/p3/Recent content in Week 6-7: Project 3 - Baseball on DS250Hugo -- gohugo.ioen-usJ. Hathaway and BYU-I ©Fri, 01 May 2020 11:02:05 +0600 \ No newline at end of file diff --git a/slides/p4/d1/index.html b/slides/p4/d1/index.html index 46fd23c..d967058 100644 --- a/slides/p4/d1/index.html +++ b/slides/p4/d1/index.html @@ -3,7 +3,7 @@

Day 1: Intro to ML

Welcome to class!

Announcements

  1. Project 3 - Getting pickier about good communication
    • Career batting average
    • Meaningful report name (Drop “Client Report”)
    • Meaningful section headers so the table of contents is useful (don’t call them “Question 1”)
    • Don’t include “My useless chart” from the template
  2. Coding Challenge - Table
  3. Ask for help!
    • Computing lab
    • Computing lab Slack channel (search)
    • Slack classmates or general channel

Spiritual Thought

Genesis 1:1 and Machine Learning
Are facts true?


Pictionary!



From Sebastian Thrun:

AI is able to learn ‘rules’ from highly repetitive data.


The single most important thing for AI to accomplish in the next ten years is to free us from the burden of repetitive work.


Your Turn: Student Classification Problem

Can we predict if a student is from Utah?


Your Turn: Features and Targets

Import dwellings.csv. With a neighbor:

  1. Try to describe the data. Explain what each observation (row) is and what measurements we have on that observation (columns).
  2. Now try describing the modeling (machine learning) we are going to do in terms of “features” and “targets”. Watch out - are there any columns that are the target in disguise? (You may need to review the project goal.)
  3. What features do you expect to have a strong relationship with the target?

Before Next Class

Machine Learning Introduction

  • Step-by-step guide (mostly) for training a GaussianNB classifier. (The steps will be the same for any algorithm you use.)

Visual Introduction to Machine Learning

  1. Machine learning identifies patterns using statistical learning and computers by unearthing boundaries in data sets. You can use it to make predictions.
  2. One method for making predictions is called a decision tree, which uses a series of if-then statements to identify boundaries and define patterns in the data.
  3. Overfitting happens when some boundaries are based on distinctions that don’t make a difference. You can see if a model overfits by having test data flow through the model.

The goal of Question 1 is to help us with “feature selection”.

  • Remember: Overfitting happens when some boundaries are based on distinctions that don’t make a difference.
  • More data does not always lead to better models. (Occam’s Razor)

Common questions:

MaxRowsError: How can I plot Large Datasets?

You may also save data to a local filesystem and reference the data by file path. Altair has a JSON data transformer that will do this transparently when enabled:

alt.data_transformers.disable_max_rows()
+active">Day 1: Intro to ML
  • Week 1: Introduction
  • Day 1: Intro to ML

    Welcome to class!

    Announcements

    1. Project 3 - Getting pickier about good communication
      • Career batting average
      • Meaningful report name (Drop “Client Report”)
      • Meaningful section headers so the table of contents is useful (don’t call them “Question 1”)
      • Don’t include “My useless chart” from the template
    2. Coding Challenge - Table
    3. Ask for help!
      • Computing lab
      • Computing lab Slack channel (search)
      • Slack classmates or general channel

    Spiritual Thought

    Genesis 1:1 and Machine Learning
    Are facts true?


    Pictionary!



    From Sebastian Thrun:

    AI is able to learn ‘rules’ from highly repetitive data.


    The single most important thing for AI to accomplish in the next ten years is to free us from the burden of repetitive work.


    Your Turn: Student Classification Problem

    Can we predict if a student is from Utah?


    Your Turn: Features and Targets

    Import dwellings.csv. With a neighbor:

    1. Try to describe the data. Explain what each observation (row) is and what measurements we have on that observation (columns).
    2. Now try describing the modeling (machine learning) we are going to do in terms of “features” and “targets”. Watch out - are there any columns that are the target in disguise? (You may need to review the project goal.)
    3. What features do you expect to have a strong relationship with the target?

    Before Next Class

    Machine Learning Introduction

    • Step-by-step guide (mostly) for training a GaussianNB classifier. (The steps will be the same for any algorithm you use.)

    Visual Introduction to Machine Learning

    1. Machine learning identifies patterns using statistical learning and computers by unearthing boundaries in data sets. You can use it to make predictions.
    2. One method for making predictions is called a decision tree, which uses a series of if-then statements to identify boundaries and define patterns in the data.
    3. Overfitting happens when some boundaries are based on distinctions that don’t make a difference. You can see if a model overfits by having test data flow through the model.
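The idea of letting test data flow through the model can be sketched with scikit-learn. The example below is an illustration under assumptions: it uses synthetic data from `make_classification` with deliberate label noise rather than the class data set, and an unrestricted `DecisionTreeClassifier`, so the gap between training and test accuracy is easy to see.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y), purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An unrestricted tree keeps splitting until it memorizes the training rows.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = deep_tree.score(X_train, y_train)
test_acc = deep_tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

A large drop from training accuracy to test accuracy suggests the tree has learned boundaries that do not carry over to new rows.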

    The goal of Question 1 is to help us with “feature selection”.

    • Remember: Overfitting happens when some boundaries are based on distinctions that don’t make a difference.
    • More data does not always lead to better models. (Occam’s Razor)

    Common questions:

    MaxRowsError: How can I plot Large Datasets?

    You may also save data to a local filesystem and reference the data by file path. Altair has a JSON data transformer that will do this transparently when enabled:

    alt.data_transformers.disable_max_rows()
     subset_data = denver.sample(n = 4999)
     

    \ No newline at end of file +Week 1: Introduction
    \ No newline at end of file diff --git a/slides/p4/d2/index.html b/slides/p4/d2/index.html index a177f00..25ae203 100644 --- a/slides/p4/d2/index.html +++ b/slides/p4/d2/index.html @@ -3,7 +3,7 @@

    Day 2: Intro to Machine Learning

    Welcome to class!

    Announcements

    Spiritual thought

    Are facts true?

    • How do you distinguish between truth and error?
    • Joshua and Caleb

    Building a Decision Tree

    Splitting the Data

    1. Start with packages and data set

    We’ll be using parts of the scikit-learn package and the seaborn package.

    # If you haven't already, install scikit-learn and seaborn
    +active">Day 2: Intro to Machine Learning
  • Day 1: Intro to ML
  • Week 1: Introduction
  • Day 2: Intro to Machine Learning

    Welcome to class!

    Announcements

    Spiritual thought

    Are facts true?

    • How do you distinguish between truth and error?
    • Joshua and Caleb

    Building a Decision Tree

    Splitting the Data

    1. Start with packages and data set

    We’ll be using parts of the scikit-learn package and the seaborn package.

    # If you haven't already, install scikit-learn and seaborn
     pip install scikit-learn seaborn
     
    from types import GeneratorType
     import pandas as pd
    diff --git a/slides/p4/d3/index.html b/slides/p4/d3/index.html
    index 911f0f0..782af88 100644
    --- a/slides/p4/d3/index.html
    +++ b/slides/p4/d3/index.html
    @@ -3,7 +3,7 @@
     

    Day 3: Training a Classifier, Part 2

    Welcome to class!

    Spiritual Thought

    Announcements

    1. Coding Challenge code posted

    Prepping data for the Machine


    Building a Decision Tree

    Day 3: Training a Classifier, Part 2

    Welcome to class!

    Spiritual Thought

    Announcements

    1. Coding Challenge code posted

    Prepping data for the Machine


    Building a Decision Tree

    import pandas as pd
     import altair as alt
     
     from sklearn.model_selection import train_test_split
    diff --git a/slides/p4/d4/index.html b/slides/p4/d4/index.html
    index 7b8c832..941ec4b 100644
    --- a/slides/p4/d4/index.html
    +++ b/slides/p4/d4/index.html
    @@ -3,7 +3,7 @@
     

    Day 4: Evaluating Our Models, Part 2

    Announcements

    Today:

    1. Continue discussion about evaluating models
    2. Try to understand what models are doing

    Evaluating model performance (continued)

    Confusion Matrix

    Why isn’t accuracy enough?

    A confusion matrix is a quick way to see the strengths and weaknesses of your model. A confusion matrix is not a “metric”. A confusion matrix provides an easy way to calculate multiple metrics such as accuracy, precision, and recall.



    Your Turn

    With your group, use the links above to find a definition for your assigned metric. Then try using the confusion matrix on the screen to calculate your metric for my model.

    • Group 1: Accuracy
    • Group 2: Sensitivity/Recall
    • Group 3: Precision
    • Group 4: Specificity
    • Group 5: F1 Score
    • Group 6: Balanced Accuracy

    Validation metrics

  • Week 1: Introduction
  • Day 4: Evaluating Our Models, Part 2

    Announcements

    Today:

    1. Continue discussion about evaluating models
    2. Try to understand what models are doing

    Evaluating model performance (continued)

    Confusion Matrix

    Why isn’t accuracy enough?

    A confusion matrix is a quick way to see the strengths and weaknesses of your model. A confusion matrix is not a “metric”. A confusion matrix provides an easy way to calculate multiple metrics such as accuracy, precision, and recall.



    Your Turn

    With your group, use the links above to find a definition for your assigned metric. Then try using the confusion matrix on the screen to calculate your metric for my model.

    • Group 1: Accuracy
    • Group 2: Sensitivity/Recall
    • Group 3: Precision
    • Group 4: Specificity
    • Group 5: F1 Score
    • Group 6: Balanced Accuracy
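Each of the six metrics can be computed directly from the four cells of a binary confusion matrix. The counts below are hypothetical, not the matrix shown in class, and the variable names are chosen only for this sketch.

```python
# Hypothetical confusion-matrix counts (not the matrix shown in class).
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are right
recall = tp / (tp + fn)                      # sensitivity: actual positives found
precision = tp / (tp + fp)                   # predicted positives that are correct
specificity = tn / (tn + fp)                 # actual negatives found
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

for name, value in [("accuracy", accuracy), ("recall", recall),
                    ("precision", precision), ("specificity", specificity),
                    ("f1", f1), ("balanced accuracy", balanced_accuracy)]:
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 0.7 while recall is only about 0.667, which is one way to see why accuracy alone can hide a model's weaknesses.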

    Validation metrics


    #%%
     # a confusion matrix
     print(metrics.confusion_matrix(y_test, y_predicted_DT))
    diff --git a/slides/p4/index.html b/slides/p4/index.html
    index 69da764..80147ad 100644
    --- a/slides/p4/index.html
    +++ b/slides/p4/index.html
    @@ -3,6 +3,6 @@
     

    J. Hathaway and BYU-I ©

    \ No newline at end of file diff --git a/slides/p5/d1/index.html b/slides/p5/d1/index.html index b2bf734..13947ae 100644 --- a/slides/p5/d1/index.html +++ b/slides/p5/d1/index.html @@ -3,7 +3,7 @@

    Day 1: The war with Star Wars

    Welcome to class!

    Spiritual Thought

    Announcements

    1. Project 4 thoughts
      • Feature Importances - Sorted Bar Graph, not unsorted tables
      • Suppress warnings
      • And the winner is…

    The Star Wars data

    Load the Star Wars data

    Day 1: The war with Star Wars

    Welcome to class!

    Spiritual Thought

    Announcements

    1. Project 4 thoughts
      • Feature Importances - Sorted Bar Graph, not unsorted tables
      • Suppress warnings
      • And the winner is…

    The Star Wars data

    Load the Star Wars data

    # %%
     import pandas as pd 
     import altair as alt
     import numpy as np
    diff --git a/slides/p5/d2/index.html b/slides/p5/d2/index.html
    index 90727a1..a71ce10 100644
    --- a/slides/p5/d2/index.html
    +++ b/slides/p5/d2/index.html
    @@ -3,7 +3,7 @@
     

    Day 2: Star Wars and strings

    Welcome to class!

    Announcements

    What’s something you’re grateful for today?


    The .str functions in pandas


    .str.strip()

    s = pd.Series(['1. Ant.  ', '2. Bee!\n', '3. Cat?\t', '4. Beat?\t', np.nan])
     
     s.str.strip()
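
The snippet above can be run on its own with the imports added. This sketch also shows the optional argument to `.str.strip()`, which removes any of the listed characters from both ends of each string. The variable names `stripped` and `letters_only` are only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(['1. Ant.  ', '2. Bee!\n', '3. Cat?\t', '4. Beat?\t', np.nan])

# With no argument, strip() removes leading and trailing whitespace
# (spaces, tabs, newlines). Missing values stay missing.
stripped = s.str.strip()

# With an argument, strip() removes any of the given characters from both ends.
letters_only = s.str.strip('1234.!? \n\t')

print(stripped)
print(letters_only)
```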
     
    diff --git a/slides/p5/d3/index.html b/slides/p5/d3/index.html
    index d40cc18..7a6128c 100644
    --- a/slides/p5/d3/index.html
    +++ b/slides/p5/d3/index.html
    @@ -3,7 +3,7 @@
     

    Day 3: Validating data, cleaning columns

    Welcome to class!

    Announcements

    Spiritual Thought

    Let’s validate some data!

    Pick something from the Star Wars article you want to validate (“double check”).


    Moving from categories to values.

    1. Create one or more additional columns that convert the income ranges to a number.
    2. Create one or more additional columns that convert the age ranges to a number.
    3. Create one or more additional columns that convert the school groupings to a number.
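
One possible approach to the first two items is to pull the lower bound out of each range with a regular expression. This is only a sketch: the range labels below are hypothetical, and the labels in your copy of the survey may be written differently.

```python
import numpy as np
import pandas as pd

# Hypothetical survey answers; the real labels in your data may differ.
dat = pd.DataFrame({
    'income': ['$0 - $24,999', '$25,000 - $49,999', '$150,000+', np.nan],
    'age': ['18-29', '30-44', '> 60', '45-60'],
})

# Income: drop "$" and ",", then keep the first run of digits (the range minimum).
dat['income_min'] = (
    dat['income']
    .str.replace(r'[$,]', '', regex=True)
    .str.extract(r'(\d+)', expand=False)
    .astype(float)   # float so that missing answers can stay NaN
)

# Age: same idea, using the lower bound of each range.
dat['age_min'] = dat['age'].str.extract(r'(\d+)', expand=False).astype(float)

print(dat)
```

Using the minimum is a design choice; the midpoint or maximum of each range would work just as well as long as you apply it consistently.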

    Validating visuals

    You’re going to make a lot of bar charts!


    Getting started on Question 3

    One-hot encoding

    Project 5 asks you to “one-hot encode all columns that have categories” and “convert all yes/no responses to 1/0 numeric”.

    The pd.get_dummies function can be used to create one-hot encoded variables. The pd.get_dummies documentation is a great place to start.

    After reading the documentation, study the code below and get started on Grand Question #3.

    #%%
     # When we use machine learning to predict salary,
     # let's only look at people that have seen at least
     # one star wars film
    diff --git a/slides/p5/d4/index.html b/slides/p5/d4/index.html
    index 1172339..5cbd6dc 100644
    --- a/slides/p5/d4/index.html
    +++ b/slides/p5/d4/index.html
    @@ -3,7 +3,7 @@
     

    Day 4: May the ML columns be with you

    Welcome to class!

    Spiritual Thought

    Announcements


    Getting the data ready for machine learning.


    What are machine learning algorithms expecting to see?

    We need to handle missing values and categorical features before feeding the data into a machine learning algorithm, because the mathematics underlying most machine learning models assumes that the data is numerical and contains no missing values. To reinforce this requirement, scikit-learn will return an error if you try to train a model using data that contain missing values or non-numeric values when working with models like linear regression and logistic regression. ref

    We have some options when converting categorical features (columns) to numeric.

    • If the category contains numeric information (like a range of numbers) we can convert it to a numeric variable by taking the minimum, average, or maximum of the range.
    • Factorization: If the category is an “ordinal” variable (meaning, there is an order to the categories) we can assign each category to an integer. (For example, good = 1, better = 2, best = 3.)
    • One-hot Encoding or Dummy Variables: If the category is a “nominal” variable (without an order) then we need to use one-hot encoding or the closely related “dummy variable encoding”.
    • If the category is some version of True/False or Yes/No then we can simply convert the values to zeros and ones.
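
The last three options can be sketched in a few lines of pandas. The column names and values here are hypothetical and chosen only to mirror the examples in the list.

```python
import pandas as pd

# Hypothetical columns for illustration.
dat = pd.DataFrame({
    'rating': ['good', 'best', 'better', 'good'],   # ordinal
    'color': ['blue', 'green', 'red', 'blue'],      # nominal
    'fan': ['Yes', 'No', 'Yes', 'Yes'],             # Yes/No
})

# Ordinal: an explicit mapping keeps the order you intend.
dat['rating_num'] = dat['rating'].map({'good': 1, 'better': 2, 'best': 3})

# Yes/No: convert to 1/0.
dat['fan_num'] = dat['fan'].map({'Yes': 1, 'No': 0})

# Nominal: one binary column per category.
color_dummies = pd.get_dummies(dat['color'], prefix='color', dtype=int)

print(dat)
print(color_dummies)
```

An explicit dictionary is used for the ordinal column because `pd.factorize` numbers categories in order of first appearance, which is not necessarily the meaningful order.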

    What’s our game plan for the Star Wars columns?

    1. Break into Groups

    Strategize + Code + Share

    • Group 1: How are you going to turn Age, Income and Education into numbers?
    • Group 2: How are you going to encode the following?
      • Who Shot First
      • Gender
      • Location
      • All the Yes/No responses
    • Group 3: How are you going to deal with the character rankings?

    2. Combine all the factors into one big X dataframe

    3. Define Y as those making > $50k

    First: Limit the data to only people who answered “Yes” to the question “Have you seen any of the 6 films in the Star Wars franchise?”.

    Then: Use the table below as a guide to prepare your data for machine learning.

    Column | Original Format | Convert To
    age | category (ordinal, age ranges) | number
    income | category (ordinal, income ranges) | number
    education | category (ordinal, name of degree) | number
    shot_first | category (nominal) | one-hot
    gender | category (nominal) | one-hot
    location | category (nominal) | one-hot
    fan_star_wars | Yes/No | 0/1
    expanded_universe | Yes/No | 0/1
    fan_exapanded | Yes/No | 0/1
    fan_star_trek | Yes/No | 0/1
    seen_i | Yes/No (name of movie/NaN) | 0/1
    seen_ii | Yes/No (name of movie/NaN) | 0/1
    seen_iii | Yes/No (name of movie/NaN) | 0/1
    seen_iv | Yes/No (name of movie/NaN) | 0/1
    seen_v | Yes/No (name of movie/NaN) | 0/1
    seen_vi | Yes/No (name of movie/NaN) | 0/1
    movie rankings | number | -
    character rankings | category (ordinal) | one-hot or factorize
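
The filtering step and the Yes/No rows of the table can be sketched as follows. The data is a tiny made-up slice of the survey, and `seen_any` is an assumed name for the “Have you seen any of the 6 films” column; substitute whatever you named it.

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the survey; "seen_any" is an assumed column name.
dat = pd.DataFrame({
    'seen_any': ['Yes', 'No', 'Yes'],
    'seen_i': ['Star Wars: Episode I The Phantom Menace', np.nan, np.nan],
    'seen_ii': ['Star Wars: Episode II Attack of the Clones', np.nan,
                'Star Wars: Episode II Attack of the Clones'],
    'fan_star_wars': ['Yes', np.nan, 'No'],
})

# First: keep only respondents who have seen at least one film.
dat_seen = dat.query("seen_any == 'Yes'").copy()

# seen_* columns hold the movie name or NaN, so "not missing" means "seen".
seen_cols = ['seen_i', 'seen_ii']
dat_seen[seen_cols] = dat_seen[seen_cols].notna().astype(int)

# Plain Yes/No columns map directly to 1/0.
dat_seen['fan_star_wars'] = dat_seen['fan_star_wars'].map({'Yes': 1, 'No': 0})

print(dat_seen)
```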

    What functions can we use to convert the categorical columns to numeric?

    Question: When and why would we drop the first column when we convert a category using pd.get_dummies()?

    Answer: Whenever your algorithm needs to calculate a matrix inverse.

    The one-hot encoding creates one binary variable for each category.


    The problem is that this representation includes redundancy. For example, if we know that [1, 0, 0] represents “blue” and [0, 1, 0] represents “green” we don’t need another binary variable to represent “red”, instead we could use 0 values for both “blue” and “green” alone, e.g. [0, 0].


    This is called a dummy variable encoding, and always represents C categories with C-1 binary variables. In addition to being slightly less redundant, a dummy variable representation is required for some models.


    For example, in the case of a linear regression model (and other regression models that have a bias term), a one hot encoding will cause the matrix of input data to become singular, meaning it cannot be inverted and the linear regression coefficients cannot be calculated using linear algebra. For these types of models a dummy variable encoding must be used instead.

    Source
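
The redundancy described in the passage above can be checked numerically. This sketch uses a made-up `color` column and a hypothetical helper, `with_intercept`, that adds the column of ones a linear model with a bias term would use. It then compares the matrix rank under each encoding.

```python
import numpy as np
import pandas as pd

colors = pd.Series(['blue', 'green', 'red', 'blue', 'red'], name='color')

one_hot = pd.get_dummies(colors, dtype=int)                  # C columns
dummy = pd.get_dummies(colors, drop_first=True, dtype=int)   # C - 1 columns


def with_intercept(encoded):
    """Prepend a column of ones, as a linear model with a bias term does."""
    ones = np.ones((len(encoded), 1))
    return np.hstack([ones, encoded.to_numpy()])


# The one-hot columns add up to the intercept column, so one column is
# redundant: the matrix has 4 columns but only rank 3.
rank_one_hot = np.linalg.matrix_rank(with_intercept(one_hot))

# With the first category dropped, all 3 columns are independent.
rank_dummy = np.linalg.matrix_rank(with_intercept(dummy))

print(one_hot.shape[1] + 1, rank_one_hot)
print(dummy.shape[1] + 1, rank_dummy)
```

When the rank is lower than the number of columns, the matrix is singular in the sense the passage describes, which is why `drop_first=True` matters for models that invert it.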


    Predicting income.

    Grand Question 4 wants us to “build a machine learning model that predicts whether a person makes more than $50k”.

    In other words, what is the “outcome” or “response” that we want to predict?

    dat_ml.income > 50000
     

    Remember not to include the answer (income) in your features!

    x = dat_ml.drop(['income'], axis = 1)
     

    The response needs to be saved as a 0/1 variable (at least, for binary classification algorithms).

    y = (dat_ml.income > 50000).astype(int)
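
Putting the pieces together, a minimal end-to-end sketch might look like the following. The small `dat_ml` frame is made up, and the decision tree is an assumed model choice; any scikit-learn classifier could be swapped in.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical, already-numeric data standing in for dat_ml.
dat_ml = pd.DataFrame({
    'income': [0, 25000, 50000, 100000, 150000, 25000, 100000, 0],
    'age': [18, 30, 45, 60, 45, 18, 30, 60],
    'fan_star_wars': [1, 0, 1, 1, 0, 0, 1, 1],
})

x = dat_ml.drop(columns=['income'])           # features, without the answer
y = (dat_ml['income'] > 50000).astype(int)    # 1 if income > $50k, else 0

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

# Assumed model choice for illustration only.
clf = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
y_predicted = clf.predict(x_test)

print(y_predicted)
```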
     

    Week 10-11: Project 5 - Star Wars

    A significant portion of a data scientist’s job is data cleaning. During these two weeks we will not hide the data munging from you. We will practice data cleaning using a Star Wars survey from FiveThirtyEight. Survey data is notoriously difficult to handle. Even when the data is recorded cleanly, the options for ‘write in questions’, ‘choose from multiple answers’, ‘pick all that are right’, and ‘multiple choice questions’ make storing the data in a tidy format difficult.

    \ No newline at end of file diff --git a/slides/p6/d2/index.html b/slides/p6/d2/index.html index 8f5ffac..7fc56f5 100644 --- a/slides/p6/d2/index.html +++ b/slides/p6/d2/index.html @@ -3,5 +3,5 @@

    Day 1: Git and GitHub

    Welcome to class!

    Spiritual Thought

    Announcements

    1. Project 5 Comment

      • Feature Importance and Model discussion
    2. The last day of DSS is next Wednesday, Dec 6th at 6:00PM in STC 394

    3. Extra credit for creating and uploading cheat sheet (2 points for projects or checkpoints)

    4. Coding Challenge date?

    5. The technical aspects of Project 6 will be done mostly in class. Resume prep/MD will be done outside of class.


    Git and GitHub

    “Web developers’ social media platform”

    This is GitHub, the world’s largest code repository platform online. A platform used by some 50 million software developers to host their coding projects, most of them open-source — meaning others can access their codes and modify them to create better versions if they feel like.


    Most of the internet is produced or hosted on GitHub in the form of code. “What Gmail is to email, GitHub is to writing software,” says Kiran Jonnalagadda, cofounder of HasGeek, a platform to build and discover peer groups. Source

    • Don’t: post code for assignments that hundreds of other students have done.
    • Do: post unique code using skills from your classes.

    I would also recommend using private repos to manage your course work.


    Is it going to hurt?

    Answer: Yes.

    It feels weird at first but quickly becomes second nature. If you plan on taking more data science classes, you should know that DS 350 students are required to submit all coursework via GitHub. This is a major topic in class and office hours for the first two weeks. Then we practically never discuss it again.

    More bad news. Do you use GitHub to work with other people or to coordinate your own work from multiple computers? If so, after you recover from the initial setup, Git will crush you again with merge conflicts. And this is not one-time pain, this could be a dull ache for a long time.

    Managing a project via Git/GitHub is much like the Google Doc scenario and enjoys many of the same advantages. It is definitely more complicated than collaborating on a Google Doc, but this puts you in the right mindset. Source


    Step 1: Download and install

    Follow steps 1-4 of this tutorial.

    Then:

    1. Request access to the BYU-I Resumes page at Request Access
    2. Respond to the auto-generated email
    3. Wait a few minutes for authorization
    4. Join our GitHub organization - byuids-resumes.

    If you are on a Mac, you may need:

    Step 2: Create a repository from the resume template and connect to the BYUI


    Step 3: Publish your resume to GitHub Pages

    • Go to settings for your repo.
    • Scroll down to the GitHub Pages section.
    • Under source select the box which says None and pick master.
    • Now select the /docs folder and click save.
    • Copy your site URL at the top of the /settings/pages location.
    • Add your link to the About section of your repository.
    • Edit the readme.md in the base repo to not show the resume directions.

    Step 4: Clone repo into VS Code

    Analytics Vidhya reading


    Step 5: Make your resume look good

    Examples:

    You may also find these articles helpful:

    \ No newline at end of file diff --git a/slides/p6/d3/index.html b/slides/p6/d3/index.html index 4399903..886f5ed 100644 --- a/slides/p6/d3/index.html +++ b/slides/p6/d3/index.html @@ -3,6 +3,6 @@

    Day 2: Commit, push, fork, and merge

    Welcome to class!

    Announcements


    Practice with Git

    GQ3: add, commit, push and a little pull

    Let’s save the changes we’ve made to our resume.


    GQ4: Fork and merge

    Get into groups of 2 or 3. Then follow the steps below:

    1. Fork the other student’s resume repository.
    2. Now clone that forked repository to your computer.
    3. On your local version of the forked repository, do the following:
      A. Create a new file called feedback.md.
      B. Make a few recommendations or notes in the feedback.md file that will help the other student improve his or her resume.
      C. Add, commit, and push your edits.
      D. Go to the forked repo on GitHub and check if the feedback.md file shows up online.
    4. Now, create a pull request to get your edits into the other student’s original repo.

    Once you’ve given another student feedback, accept any pull requests submitted to your own repo. Continue to edit and improve your resume based on the feedback you received.


    GQ5: Fork into byuids-resumes

    Fork your own resume repository into the BYU-I Data Science Resumes group.

    If you change your resume after you create this fork, you will have to submit a pull request to make sure the final version of your resume shows up in the group.

    These instructions will help you create a pull request.


    \ No newline at end of file diff --git a/slides/p6/d4/index.html b/slides/p6/d4/index.html index 0a3e77a..ac0879a 100644 --- a/slides/p6/d4/index.html +++ b/slides/p6/d4/index.html @@ -3,5 +3,5 @@

    Day 3: Resume Fork and Merge

    Remember from last class: pull, add, commit, push.


    Making edits in another user’s repo

    Breakout Room Activity

    Each student in the breakout room is going to provide feedback on another student’s resume. The breakout room should begin with a group discussion about the work you’ve each done on your resume and any questions the group has. Then follow the steps below.

    1. Fork the other student’s resume repository.
    2. Now clone that forked repository to your computer.
    3. On your local version of the forked repository, do the following:
      A. Create a new file called edits.md and save it in the main folder of the repository.
      B. Make a few recommendations or notes in the edits.md file that will help the other student improve his or her resume.
      C. Add, commit, and push your edits.
      D. Go to the forked repo on GitHub and check if the edits.md file shows up online.
    4. Now, create a pull request to get your edits into the other student’s original repo.

    Once you’ve given another student feedback, accept any pull requests submitted to your own repo. Continue to edit and improve your resume based on the feedback you received.


    Creating a fork in byuids-resumes

    Fork your own resume repository into the BYU-I Data Science Resumes group.

    If you change your resume after you create this fork, you will have to submit a pull request to make sure the final version of your resume shows up in the group.

    These instructions will help you create a pull request.


    Open time to finalize your resume

    \ No newline at end of file diff --git a/slides/p6/index.html b/slides/p6/index.html index 79f9b63..a9eb13a 100644 --- a/slides/p6/index.html +++ b/slides/p6/index.html @@ -3,5 +3,5 @@

    Week 12-13: Project 6 - GitHub

    GitHub is the communication tool for Data Scientists and developers. As students, you will want to curate your creative work on GitHub using Git. GitHub is the place to share your original work, not your homework assignments. Many people store their personal websites, blogs, and project websites on GitHub. Our textbook and course are hosted on GitHub, and you can see J. Hathaway’s or Ryan Hafen’s personal Data Science websites that are hosted on GitHub as well. You will be making your public resume that will be hosted on GitHub for this project.

    Over the course of this project, we will learn the Git workflow and the tools of GitHub. We will use that workflow to have others in our class edit our resumes. Take the process seriously (pick a suitable username and write a good resume), and you will have the beginning of your social presence in the DS/CS space.

    Grand Questions

    1. Join the BYUI Data Science Resumes GitHub organization and use the template repository to make a resume repository under your repositories. A good name might be LASTNAME-Resume.
    2. Clone your repository to your computer and build a first draft of your resume.
    3. Push your results to GitHub and have another student fork your repository to make edits.
    4. Accept the proposed changes from the student review and finish your final version.
    5. Make sure your resume is forked by BYU-I Data Science Resumes.

    \ No newline at end of file