From 77b4089ffced22947d0ea5def7b945828ce3e39a Mon Sep 17 00:00:00 2001
From: hathawayj <hathawayj@gmail.com>
Date: Thu, 29 Feb 2024 16:25:57 +0000
Subject: [PATCH] deploy: c5e7875711029a098860069fc6ccd61684e54029

---
 index.html              |  2 +-
 slides/p4/d2/index.html | 21 ++++++++++++---------
 2 files changed, 13 insertions(+), 10 deletions(-)
diff --git a/index.html b/index.html
index 72206af..cca0a65 100644
--- a/index.html
+++ b/index.html
@@ -3,4 +3,4 @@
 <span class=navbar-toggler-icon></span></button><div class="collapse navbar-collapse text-center" id=navigation><ul class="navbar-nav ml-auto"><li class=nav-item><a class="nav-link text-dark" href=/DS250-Cannon>Home</a></li><li class=nav-item><a class="nav-link text-dark" href=/DS250-Cannon/projects>Projects</a></li><li class=nav-item><a class="nav-link text-dark" href=/DS250-Cannon/contact>Contact</a></li><li class=nav-item><a class="nav-link text-dark" href=/DS250-Cannon/course-materials>Materials</a></li><li class="nav-item dropdown"><a class="nav-link dropdown-toggle text-dark" href=# role=button data-toggle=dropdown aria-haspopup=true aria-expanded=false>Navigate</a><div class=dropdown-menu><a class=dropdown-item href=/DS250-Cannon/slides>Slides</a>
 <a class=dropdown-item href=/DS250-Cannon/course-materials/syllabus/>Syllabus</a>
 <a class=dropdown-item href=/DS250-Cannon/faq>FAQ</a></div></li></ul></div></div></nav><div class="container section"><div class=row><div class="col-lg-8 text-center mx-auto"><h1 class="text-white mb-3">CSE 250: Data Science Programming</h1><p class="text-white mb-4">Using pandas, Altiar, scikit-learn, and NumPy to program with data</p><div class=position-relative><input id=search class=form-control placeholder="Have a question? Just ask here or enter terms">
-<i class="ti-search search-icon"></i><script>$(function(){var projects=[{value:"Day 2: Project 0",label:"<p>Announcements  Devotional Computing Lab 4:30PM - 6:30PM all weekdays except Wednesday. Saturday from 10AM-12PM  Slack channel #tutoring_lab   Data Science Society - Wednesday\u0026rsquo;s at 6PM, STC 394 Math Department Opening Social - Thursday 11:30 RKS 229  Spiritual Thought Question  How is 1 Nephi like Genesis? \u0026ldquo;In the beginning, God created the heaven and the earth.\u0026rdquo;  Syllabus Questions?  A note about readings\u0026hellip; Tips for asking for help  Slack Google - acquired discernment   Quarto and tradeoffs Project Submissions: HTML  Are we all on the Slack channel? Follow the Slack invitation that is waiting in your student email. If you don\u0026rsquo;t see an invite, you can join through this link and then ask Brother Cannon to add you to the class channel.\nMethods Checkpoint All the answers will be in the assigned reading or in these slides.\nNotes on Project 0 Installing Packages and Extensions Learn how to install packages by reading the assigned material and by watching the video tutorial on this page.\nThe readings mention a lot of different packages. For Project 0, you need to install at least jupyter, pandas, plotly.express, numpy, and tabulate.\nThe readings will also mention two VS Code extensions you need to install.\nJupyter Notebooks vs. Interactive Python Window Should you decide to use Juypyter Notebooks this semester within VS Code, this is a great guide to get you started.\nOr you can choose to stick with the Python Interactive window like the textbook does.\nUse Your Resources!  Technical documentation Google searches ChatGPT Asking for help on Slack Don\u0026rsquo;t forget the data science lab! (Starts next week.) Question that cannot be answered by the textbook and documentation? Google it. A function you have never seen before? Google it. An error in your code? Google it.  Markdown What is Markdown?  
A clean, human readable way to make slick html and pdf documents Used widely among programmers for clean documentation Used widely by Data Scientists to publish results and communicate with stakeholders  Here\u0026rsquo;s a good summary\nQuarto Do your tinkering in interactive Python or Jupyter notebooks. Generate report with finished code, graphs, etc. in Quatro\nQuarto\nNow for some data! Let\u0026rsquo;s get this party started Your turn:  Read in the cars data set Work with you your teams to talk through interesting possibilities for a graph Work on Project 0 Questions and Tasks   Any issues with getting Python installed?     Python VS Code Altair in VS Code     Does everyone have pandas, altiar, numpy, scikit-learn installed?     Video tutorial: how to install packages.  One way to install packages:\npip install pandas altair Maybe a better way to do it: run this in an interactive window.\nimport sys !{sys.executable} -m pip install pandas altair    Does everyone have altair-saver working?     altair_saver Video tutorial     ---------------------------------------------------- Why are we using Altair?    It is built on the VEGA and D3 which are fast and web based.  Grammar of Graphics: Vega-Lite   Technical Paper Website Endorsment      What are we not learning in this course?    Indexing, .loc[] and .iloc[] I may not be experienced enough to understand why I should teach you these. I think they all add complexity to what we are learning in the course and we have elected to avoid it. We will use reset_index() a lot. I think MultiIndex features create complication. I have also elected to use .filter() instead of .loc[] because I like it.\nVirtual Environments Virtual Environments appear to be an important tool as you continue to use Python. 
We will not be teaching these or supporting these in our course.\nmatplotlib (and any tool leveraging it) It feels old, has a bad api, and isn\u0026rsquo;t declarative.\n   ----------------------------- What can Python Interactive do?    Let\u0026rsquo;s review the power of Python Interactive  # %% in my .py script is much better than Jupyter notebooks (.ipynb).  If we hope to have our code work in a production environment then Jupyter is problematic. Caching and code chunks are problematic https:\/\/medium.com\/@_orcaman\/jupyter-notebook-is-the-cancer-of-ml-engineering-70b98685ee71       Set-up your py script    Setting up your script A good data science .py script will have packages and data loaded at the top. Usually you have a few short commented sentences that descibe the script purpose.\n# %% # import pandas, altair, numpy import pandas as pd import altair as alt import numpy as np # %% # load data # handgrenade data https:\/\/github.com\/byuidatascience\/data4soils\/blob\/master\/data-raw\/cfbp_handgrenade\/cfbp_handgrenade.csv url = \u0026#39;https:\/\/github.com\/byuidatascience\/data4soils\/raw\/master\/data-raw\/cfbp_handgrenade\/cfbp_handgrenade.csv\u0026#39; dat = pd.read_csv(url)    Make a scatter plot with hmx on the x and rdx on the y    To get you started:\nalt.Chart(dat).encode()    Make a spatial plot with hmx colored     Encode the row and column to the axes. Color the hmx points using the \u0026lsquo;goldorange\u0026rsquo; color scheme. Use mark_square() and make the square sizes 500.     -------------------- Create a histogram of hmx     Encode the x-axis as binned. Encode the y-axis as counts. Configure the title to a fontSize of 20. Use properties to place the title.     ----------------------------- How can I get help?     Make sure you read the reading assignments once or twice or five times. Read the guides on the Course Materials page. Post questions in our #cse250_s21_larson slack channel (and try to help others!) 
Attend the Data Science Lab. Google is your best friend.     -------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/introduction\/day02\/"},{value:"Day 4: Exporting JSON",label:"<p>Welcome to class! Spiritual Thought    Announcements  Hackathon Opening Social  Question 5 Let\u0026rsquo;s do an example of question 5 using the mtcars data.\nLoad packages and data #%% import pandas as pd import numpy as np import json url_cars = \u0026#34;https:\/\/github.com\/byuidatascience\/data4missing\/raw\/master\/data-raw\/mtcars_missing\/mtcars_missing.json\u0026#34; cars = pd.read_json(url_cars) \nFind all the missing values #%% # method 1: find \u0026#34;official\u0026#34; null values # hp, wt, and vs cars.isnull().sum() #%% # method 2: just look at the data # car, hp, wt, vs, gear cars.head(10) #%% # method 3: look at summaries # the values in \u0026#39;gear\u0026#39; look funny cars.describe() #%% # method 4: count up categories # looks like 4 rows are blank cars.car.value_counts() \nReformat the missing values Remember, you need to reformat your missing values to make them consistent!\nReading the examples in the replace documentation might give you some ideas.\n#%%  # There are a lot of functions # we could use to give the missing values # a consistent format. # `replace()` is one of the easiest # let\u0026#39;s change everything to np.nan cars_new = cars.replace(999, np.nan).replace(\u0026#34;\u0026#34;, np.nan) # or equivalently: cars_new = cars.replace([999, \u0026#34;\u0026#34;], np.nan) # did we get them all? cars_new.isnull().sum() \nSaving JSON files from a pandas dataframe You can save a DataFrame as a JSON file like this:\n#%% # save the new data as a json cars_new.to_json(\u0026#34;my_cars_data.json\u0026#34;) The df.to_json() documentation shows us how to change the way the JSON file is organized. (By row? By column? 
etc.)\nThis is the format we would like to see in the report:\n[ { \u0026#34;car\u0026#34;: \u0026#34;Mazda RX4\u0026#34;, \u0026#34;mpg\u0026#34;: 21, \u0026#34;cyl\u0026#34;: 6, \u0026#34;disp\u0026#34;: 160, \u0026#34;hp\u0026#34;: 110, \u0026#34;drat\u0026#34;: 3.9, \u0026#34;wt\u0026#34;: 2.62, \u0026#34;qsec\u0026#34;: 16.46, \u0026#34;vs\u0026#34;: 0, \u0026#34;am\u0026#34;: 1, \u0026#34;gear\u0026#34;: 4, \u0026#34;carb\u0026#34;: 4 } ] And here are the various options:\n# %% # Question 5 wants us to \u0026#34;include one record example\u0026#34; # in our md report that \u0026#34;has a missing value\u0026#34; # you can print out a json file like this: json_data = cars_new.to_json() print(json_data) # but that won\u0026#39;t look good in our report. # instead.... #%% # you can do this. # in this format, the json file is # organized\/printed by column json_data = cars_new.to_json() json_object = json.loads(json_data) json_formatted_str = json.dumps(json_object, indent = 4) print(json_formatted_str) # %% # we can change the format of the # json file using \u0026#39;orient\u0026#39; json_data = cars.to_json(orient=\u0026#34;split\u0026#34;) json_object = json.loads(json_data) json_formatted_str = json.dumps(json_object, indent = 4) print(json_formatted_str) # %% # by table json_data = cars.to_json(orient=\u0026#34;table\u0026#34;) json_object = json.loads(json_data) json_formatted_str = json.dumps(json_object, indent = 4) print(json_formatted_str) # %% # by \u0026#34;record\u0026#34; or \u0026#34;row\u0026#34; json_data = cars.to_json(orient=\u0026#34;records\u0026#34;) json_object = json.loads(json_data) json_formatted_str = json.dumps(json_object, indent = 4) print(json_formatted_str) </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p2\/d4\/"},{value:"Day 4: Practice Coding Challenge",label:"<p>What table do we want to use?    
q = \u0026#39;\u0026#39;\u0026#39; SELECT * FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What columns do we want to select?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What calculation do we want to perform?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r, r\/ab FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What name do we give our calculated column?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r, r\/ab as runs_atbat FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    #### I want to join two tables to help in decision making The [data dictionary](https:\/\/data.world\/byuidss\/cse-250-baseball-database\/workspace\/file?filename=readme2014.txt) might help. - For seasons after 1999, which year had the most players selected as All Stars but didn\u0027t play in the All Star game? - Provide a summary of how many games, hits, and at bats all the players had in that year\u0027s post season. ```python import pandas as pd import altair as alt import numpy as np import datadotworld as dw baseball_url = \u0027byuidss\/cse-250-baseball-database\u0027 ``` What table do we want for All Star information?    # %% # allstar table dw.query(baseball_url, \u0026#39;\u0026#39;\u0026#39; SELECT * FROM AllstarFull WHERE --? AND --? LIMIT 5 \u0026#39;\u0026#39;\u0026#39;).dataframe    Can you use a groupby to get the counts of players per year?    dw.query(baseball_url, \u0026#39;\u0026#39;\u0026#39; SELECT yearid, -- \u0026lt;stuff to calculate\u0026gt; FROM AllstarFull WHERE yearid \u0026gt; 1999 AND gp != 1 GROUP BY --? ORDER BY --? 
\u0026#39;\u0026#39;\u0026#39;).dataframe    What table do we want for the post season at bats?    dw.query(baseball_url, \u0026#39;\u0026#39;\u0026#39; SELECT * FROM BattingPost as bp LIMIT 5 \u0026#39;\u0026#39;\u0026#39;).dataframe    Can you join the post season batting table and AllStar information?     For each player, keep only the at bats, hits, the all star gp, and gameid columns. Let\u0026rsquo;s only keep players with at least one at bat in the post season.  dw.query(baseball_url, \u0026#39;\u0026#39;\u0026#39; SELECT -- \u0026lt;columns to keep\u0026gt; FROM BattingPost as bp JOIN AllstarFull as asf ON -- \u0026lt;two columns for the join\u0026gt; WHERE bp.yearid \u0026gt; 1999 AND gp != 1 AND -- \u0026lt;at bat condition\u0026gt; LIMIT 15 \u0026#39;\u0026#39;\u0026#39; ).dataframe    Let\u0026rsquo;s build the final table     For seasons after 1999, which year had the most players selected as All Stars but didn\u0026rsquo;t play in the All Star game? Provide a summary of how many games, hits, and at bats all the players had in that year\u0026rsquo;s post season.  dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, \u0026#39;\u0026#39;\u0026#39; SELECT -- \u0026lt;lots of calculations\u0026gt; FROM BattingPost as bp JOIN AllstarFull as asf ON bp.playerid = asf.playerid AND bp.yearid = asf.yearid WHERE bp.yearid \u0026gt; 1999 AND gp != 1 AND ab \u0026gt; 0 GROUP BY -- \u0026lt;column\u0026gt; ORDER BY -- \u0026lt;column\u0026gt; \u0026#39;\u0026#39;\u0026#39; ).dataframe    ------------------------------------------------------------------------------- Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc.). Write an SQL query to get the data you need. Use Python if additional data wrangling is needed, then make a graph in Altair to visualize the comparison. __In your group, answer the following questions and be prepared to share your answers with the class.__ 1. 
What will you use to compare the two baseball teams? 2. What table(s) does this information come from? 3. Do you need to do any calculations? 4. Can you think of any problems you might run into? ### Open Programming Time --------------------------------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p3\/d4\/"},{value:"Introduction",label:"<p>A competent student should be able to finish the exercises within 60 minutes. You should work through it on your own. This serves as an assessment of your understanding of the assigned readings.\nBefore you start Make sure you have installed VS-code, pandas, and altair on your computer. You can install these package by typing this line in the terminal.\npip install pandas altair\nOR if you have more than one version of python\npip3.9 install pandas altair\npip3.9 indicates the version of python you are installing the packages to.\nPart 1 Get familiar with your tools Programming involves a lot of research. Unlike subjects like Mathematics or History, we are not required to remember every single function and its usage. It is natural for experienced programmers to look for answers on the internet, books, even from other people\u0026rsquo;s code. Programming will be extremely frustrating if we are not allowed to do web searches, so please get familiar with the tools you have and use them often.\nOffical Documentation This should be your first resort for understanding any code\/function. 
Scanning the documentation of a function will allow you to get an overview of its usage.\nHere is a link to the documentation of the assign() function:\n(https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.assign.html)\nExample of assign() (as shown in the documentation)\nimport pandas as pd df1 = pd.DataFrame({\u0026#39;temp_c\u0026#39;: [17.0, 25.0]}, index=[\u0026#39;Portland\u0026#39;, \u0026#39;Berkeley\u0026#39;]) df2 = df1.assign(temp_f=df1.temp_c * 9 \/ 5 \u002b 32) Exercise 1: After reading the documentation for assign(), write a short paragraph to explain assign() as if you were talking to someone with zero programming experience (use the example above to help you explain assign()).\n What is the difference between df1 and df2? How was df2 derived from df1?)  Online textbook It pains us to see students would rather be stuck at problems for hours yet they refuse to use the textbook. This is another very useful resource since this is designed for this class. link to the textbook: (https:\/\/byuidatascience.github.io\/python4ds\/)\nExercise 2: Locate the section where the textbook talks about query() and answer these questions.\n What function in R\u0026rsquo;s dplyr is equivalent or comparable to query() in pandas (You should include the section number in your answer)? What is the easiest mistake for python beginner to make that was shown in the text about query() (You should include the section number in your answer)?  The internet Google is a programmer\u0026rsquo;s friend. Get used to googling thing, in fact, you want to be an expert in googling\n Question that cannot be answered by the textbook and documentation? Google it. A function you have never seen before? Google it. An error in your code? Google it.  
Exercise 3: Provide at least 2 extra resources you could find about the pandas function drop() on the internet.\nTutor, TA (Through slack, zoom, or in-person) We want to help you with your work; we want to answer your questions; but most importantly, we want to help you succeed in this class. That will require you to put in the necessary time in understanding the readings, coding and debugging. When you ask us a question, we expect that you have read the documentation, searched the textbook, and done your own research. Then we can be most helpful and can provide insights on top of your understanding.\nExamples of bad questions  How does drop() work? We will ask you to read the documentation for drop(). How do you make a table in a markdown file? We will refer you to the textbook. I don\u0026rsquo;t want these columns in my data, how can I drop them? We will ask you if you have found any things on the internet.  Examples of good questions  I am still confused about the syntax of drop(). After reading the documentation, this is my understanding of the function\u0026hellip; . What am I missing? I tried making a table in markdown (show code), it is still not giving me what I want, how can I fix this? I am trying to drop these columns in my dataframe, I think drop() is what I am looking for. Am I in the right direction? If not, what keywords should I be googling?  Exercise 4:\nUsing the code and tools mentioned above, finish question 4 and 5 under 3.2.4 in the textbook.(use the data in mpg for your plot):\n# library import import pandas as pd import altair as alt # data import url = \u0026quot;https:\/\/github.com\/byuidatascience\/data4python4ds\/raw\/master\/data-raw\/mpg\/mpg.csv\u0026quot; mpg = pd.read_csv(url)   Question 4: Make a scatterplot of hwy vs cyl.\n  Question 5: What happens if you make a scatterplot of class vs drv? 
Why is the plot not useful?\n  After you have completed this skill builder with your team (or on your own) then compare your work to our script    See the script.   </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/skill_builders\/introduction\/"},{value:"Day 1: Welcome",label:"<p>Welcome to DS 250!  Teacher: Paul Cannon TA: Bracken Sant (san20050@byui.edu)  Announcements  Devotional Computing Lab 4:30PM - 6:30PM all weekdays except Wednesday. Saturday from 10AM-12PM  Slack channel #tutoring_lab   Data Science Society - Wednesday\u0026rsquo;s at 6PM, STC 394 Math Department Opening Social - Thursday 11:30 RKS 229  What is a Data Scientist? A Data Scientist has a C\u002b Talent Stack Class Structure  Problem Solving Improved coding skills Effective written\/visual communication Collaboration Timeliness and communication with \u0026ldquo;the boss\u0026rdquo;  Syllabus\nGot Slack? Are we all on the Slack channel? Follow the Slack invitation that is waiting in your student email. If you don\u0026rsquo;t see an invite, you can join through this link and then ask \u0026ldquo;@Paul Cannon\u0026rdquo; to add you to the class channel.\nWho are you?  Introduce yourself and learn the names\/majors\/origin story of your group members. Make a plan to get help this semester. How will you contact each other? Some ideas: Slack, I-Learn, emails, group texts, etc. If you were independently wealthy, what would you be doing right now? Would you change majors? Highlights of 2022  Problem Solving This is not a \u0026ldquo;see and repeat\u0026rdquo; programming class!\nHow would you go about fixing my motorcycle? 
Learn how to ask for help (1 hr rule)  Getting started on Project 0 Setting up your Programming Snvironment  Download Visual Studio Code Download Python v (3.10.8)  Be sure to select the \u0026ldquo;Add to Path\u0026rdquo; option during the install process  Mac Users be sure to click on \u0026ldquo;Install Certificates\u0026rdquo; at the end of the install   Install the Python packages and VS Code extensions you need (see this page)  pip install pandas pip install numpy pip install jupyter pip install tabulate pip install altair   Install Quarto CLI Quatro Instructions Start looking at Project 0 Complete the \u0026ldquo;Methods Checkpoint\u0026rdquo;  Installing Packages and Extensions Learn how to install packages by reading the assigned material and by watching the video tutorial on this page.\nThe readings mention a lot of different packages. For Project 0, you need to install at least pandas, altair, numpy, and jupyter.\nThe readings will also mention two VS Code extensions you need to install.\nA note on Jupyter Notebooks vs. Interactive Python Window The textbook will show you how to use VS Code\u0026rsquo;s interactive python windows and Quatro. Feel free to use Jupyter Notebooks.\nWe will do write-ups in Quarto, though, which can be rendered as a PDF or HTML\nIntroduction to Brother Cannon    What do you want to know?    What is a data scientist?    Brother Hathaway\u0026rsquo;s definition:\n A blend of programmer, statistician, and communicator that burns with curiosity.\n My definiton for DS 250:\n Someone who can extract insights from data and then communicate those insights with clarity.\n Learn more about the BYU-Idaho data science program here.\n   What is data science programming?    Data scientists write code as a means to an end, whereas software developers write code to build things. 
Data science is inherently different from software development in that data science is an analytic activity, whereas software development has much more in common with traditional engineering.\nData scientists tackle problems such as identifying fraudulent transactions, or predicting which employees are likely to leave a company. Software developers can take the data scientists models and turn them into fully functioning systems with production-quality code. Software developers tackle problems like getting an algorithm to run more efficiently, or building user interfaces.\n   Course Outcomes    Upon completing this course, you will be able to use data-driven programming in Python to handle, format, and visualize data. We will introduce you to data wrangling techniques (panadas), analytical methods (scikit-learn), and the grammar of graphics (Altair). Specifically, as a successful learner, you will be able to:\n Use functions, data structures, and other programming constructs efficiently to process and find meaning in data. Programmatically load data from various types of data sources, including files, databases, and remote services. Use data manipulation libraries to perform straightforward analysis, produce charts, and prepare data for machine learning algorithms. Use machine learning libraries to discover insights, make predictions, and interpret the success of these algorithms. Collaborate and share your work with industry-leading tools.     BYU-Idaho Mission Statement     Brigham Young University-Idaho was founded and is supported and guided by The Church of Jesus Christ of Latter-day Saints. Its mission is to develop disciples of Jesus Christ who are leaders in their homes, the Church, and their communities.\n  How would you describe a leader? What makes a leader powerful? What does a leader do with insights?  An example of a good leader.\nWhat (or who) is truth?\n   ## Course Format and Grading How hard is this class going to be?    
The reality of CSE 250:\n We have done all we can to ensure that this is a 2-credit course for the average student. That means that we expect 4-6 hours outside of class for the average student to achieve an A. You have to put in the time if you want to build skills. The course is necessarily creative in nature. That fact usually makes it feel more challenging. We will be asking you to learn to write creative data science python code. If you have any concerns, please talk with me!     What is the structure of CSE 250?    The class uses 7 projects to teach data science programming in Python using pandas, Altair, scikit-learn, and numpy.\n Projects Syllabus     How do I get the grade I want?     Specification Grading Grading structure Competency Elements  Introduction Project \u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026gt;\nWhat is the goal?    Completing the introduction project will set you up for success the rest of the semester. The workflow followed in the introduction project (loading packages, writing code, saving images, compiling a final report) will be the same for every other project . If you have questions about this project, you need to seek help.   What exactly do I need to submit?    Make sure you carefully read the project instructions.\nYou will submit a single .pdf file to I-Learn. This pdf file should contain an project summary, your answers to the grand questions (including the plot you saved with altair_saver), and an appendix where you copy and paste your commented Python code.\n   --------------------------------------------------------   ----------------------------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/introduction\/day01\/"},{value:"Day 2B: Missing Data",label:"<p>Welcome to class! 
Announcements Questions 1 and 2 What issues are we still running into?\nHow to work with missing data What counts as missing data? How to identify missing data  df.isnull().sum() df.describe() df.column.value_counts(dropna=False)   pd.crosstab()  Option 1: Remove missing values Be careful with .dropna(), and make sure you know what it is doing to your data!\nLet\u0026rsquo;s use the pandas example:\ndf = pd.DataFrame({\u0026#34;name\u0026#34;: [\u0026#39;Alfred\u0026#39;, \u0026#39;Batman\u0026#39;, \u0026#39;Catwoman\u0026#39;], \u0026#34;toy\u0026#34;: [np.nan, \u0026#39;Batmobile\u0026#39;, \u0026#39;Bullwhip\u0026#39;], \u0026#34;born\u0026#34;: [pd.NaT, pd.Timestamp(\u0026#34;1940-04-25\u0026#34;), pd.NaT]})  Q: When would we ever use dropna()?    A: Almost never! Why do you think it is a bad idea? df.dropna()   Q: What argument do we use to drop rows where all values are NA?    A: df.dropna(how=\u0027all\u0027) reference   Q: What if we want to drop NA rows based on one column?    A: df.dropna(subset=[\u0027toy\u0027]) reference   Option 2: Replacing missing values Again, let\u0026rsquo;s use the pandas example:\ndf = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1], [np.nan, np.nan, np.nan, 5], [np.nan, 3, np.nan, 4]], columns=list(\u0026#34;ABCD\u0026#34;))  Q: What if we want to replace all the NA in the wt column with the mean weight?    A: fillna() reference   Q: What if we want to replace all the 999 with a 4?    A: replace() reference   Q: What if we want to replace all the NAs with a linear interpolation?    A: interpolate() reference   Question 3 What columns do we need to use for question 3 (total number of flights delayed by weather)?  
num_of_delays_weather num_of_delays_late_aircraft num_of_delays_nas  weather = flights.assign( severe = #????, mild_late = #????, mild_nas = np.where(#????), total_weather = # add up severe and mild, ).filter([\u0026#39;airport_code\u0026#39;,\u0026#39;month\u0026#39;,\u0026#39;severe\u0026#39;,\u0026#39;mild_late\u0026#39;,\u0026#39;mild_nas\u0026#39;, \u0026#39;total_weather\u0026#39;, \u0026#39;num_of_delays_total\u0026#39;]) Other resources for question 3  isin() method where() method Adding new variables with assign() assign() method  </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p2\/d3\/"},{value:"Day 3: Making your name stand out",label:"<p>Welcome to class! Reminder about resources  \u0026ldquo;Potluck\u0026rdquo; prep assignment Work with peers Make your own cheat sheet  Anouncements  Always submit a halfway checkpoint even if you\u0026rsquo;re behind!  It\u0026rsquo;s the only hard due date Think of it as a check in with \u0026ldquo;the boss\u0026rdquo;    Thoughts on P1 Halfway Checkpoint  Do your work in .py or .ipynb file, write-up in .qmd Making\/submitting a video: Loom alt.Save() Quarto Graphics for Communication Plotly.Express Resources  Let\u0026rsquo;s practice! Explore the data\nimport plotly.express as px import pandas as pd import numpy as np url = \u0026#34;https:\/\/github.com\/byuidatascience\/data4names\/raw\/master\/data-raw\/names_year\/names_year.csv\u0026#34; names = pd.read_csv(\u0026#39;names_year.csv\u0026#39;) names.head() names.describe() What do you want the chart to look like?\nWhat types of charts are there?\nWhat data do you need to make that chart?\n# names[[\u0026#39;name\u0026#39;],[\u0026#39;year\u0026#39;]] vs. 
names.query() kobe = names.query(\u0026#34;name == \u0026#39;Kobe\u0026#39;\u0026#34;)[[\u0026#34;name\u0026#34;, \u0026#34;year\u0026#34;, \u0026#34;Total\u0026#34;]] kobe2 = names.query(\u0026#34;name == \u0026#39;Kobe\u0026#39;\u0026#34;).filter(items=[\u0026#34;name\u0026#34;, \u0026#34;year\u0026#34;, \u0026#34;Total\u0026#34;]) # method chaining with () \nWork with your partner to create a line chart that includes both of your names?      Can you include total and data for the state in which you were born? Work together to make the code as eloquent as possible. compound charts      What can you add to your chart to help tell a story?\nCan you modify your previous chart to include your birth state?     Can you include Total and your birth state? Is there a better metric than raw counts that you could calculate? Are there good labels that you could include on the chart (mark_text())?     Remember this advice from Edward Tufte.\n To be truthful and revealing, data graphics must bear on the question at the heart of quantitative thinking: \u0026ldquo;Compared to what?\u0026rdquo; The emaciated, data-thin design should always provoke suspicion, for graphics often lie by omission, leaving out data sufficient for comparisons.\n What are some charts types we could use to answer this question?    There is a clear first choice, but I think there are a few other choices that could provide insight.\n  Visualization Catalog Altair Example Gallery       Use the query() method and filter() method to get your name and years in the rows with and include the name, year, and Total columns     filter the data down to your names (query) select the pertinent columns (filter()) Create a new data object for your name.     Create a line chart with your name.    
base = (alt.Chart() .encode( x = alt.X(\u0026#39;\u0026#39;), y = alt.Y(\u0026#39;\u0026#39;) ) .mark_line() )    Create a new DataFrame with your birthday information in the row    Create a DataFrame with x, y, and label as columns. How to create a dataframe.   Add the vertical rule mark to show your birthday    These references can help:\n Using layered charts Altair Marks Add a horizontal line to an existent chart     Work with your partner to create a line chart that includes both of your names?      Can you include total and data for the state in which you were born? Work together to make the code as eloquent as possible.      Can you modify your previous chart to include your birth state?     Can you include Total and your birth state? Is there a better metric than raw counts that you could calculate? Are there good labels that you could include on the chart (mark_text())?      Now come up with a different chart than a line chart    Just use your state count or the Total count for your name.   \u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026gt;\n</p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p1\/d3\/"},{value:"Day 3: The end of baseball",label:"<p>Welcome to class! Spiritual Thought Announcements  Practice Coding Challenge Can I still get an \u0026ldquo;A\u0026rdquo;?  Profile of an \u0026ldquo;A\u0026rdquo; student What if I fall behind?   Reminders:  DS community assignment Review and Request Letter    Coding Challenge: How do I prepare? What would your coding challenge look like?\nProject 3 Questions  Integer Division Career Batting Average What have come up with for Q3? Metrics? Visualizations?  Question 1 Ask yourself:\n What do I want and expect the end table to look like? What table(s) and calculations do I need? What makes a row in my end table unique? What problems can I anticipate?  
Question 2 Ask yourself:\n What do I want and expect the end table to look like? What table(s) and calculations do I need? What makes a row in my end table unique? What problems can I anticipate?  Question 3 What are some ideas for Grand Question 3? Ask yourself:\n What information will you use to compare the two baseball teams? What table(s) and calculations do I need? What makes a row in my end table unique? What problems can I anticipate?  and FROM -- JOIN -- ON -- WHERE -- GROUP BY -- ORDER BY -- LIMIT -- ``` -------------------------------------------  ## Connecting to SQLite: [Lahman SQLite](https:\/\/byuistats.github.io\/CSE250-Course\/data\/lahmansbaseballdb.sqlite) __Download the sqlite file:__ [Lahman sqlite](https:\/\/byuistats.github.io\/CSE250-Course\/data\/lahmansbaseballdb.sqlite) ### What is SQLite?  - [Wikipedia](https:\/\/en.wikipedia.org\/wiki\/SQLite): SQLite is **a popular choice as embedded database software for local\/client storage in application software such as web browsers.** It is arguably the most widely deployed database engine, as it is used today by several widespread browsers, operating systems, and embedded systems (such as mobile phones), among others. SQLite has bindings to many programming languages.  - [SQLite.org](https:\/\/www.sqlite.org\/about.html): **SQLite is an in-process library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine.** The code for SQLite is in the public domain and is thus free for use for any purpose, commercial or private. SQLite is the most widely deployed database in the world with more applications than we can count, including several high-profile projects.  - [Codecademy](https:\/\/www.codecademy.com\/articles\/what-is-sqlite): SQLite is a database engine. It is software that allows users to interact with a relational database. In SQLite, a database is stored in a single file — a trait that distinguishes it from other database engines. 
This fact allows for a great deal of accessibility: copying a database is no more complicated than copying the file that stores the data, sharing a database can mean sending an email attachment. ### Working with SQLite files in Python ```python # %% import pandas as pd import altair as alt import numpy as np import sqlite3 # %% sqlite_file = \u0027lahmansbaseballdb.sqlite\u0027 con = sqlite3.connect(sqlite_file) # %% # See the tables in the database table = pd.read_sql_query( \u0022SELECT name FROM sqlite_master WHERE type=\u0027table\u0027\u0022, con) print(table) ``` ------------------------------------------------------ What table do we want to use?    q = \u0026#39;\u0026#39;\u0026#39; SELECT * FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What columns do we want to select?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What calculation do we want to perform?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r, r\/ab FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What name do we give our calculated column?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r, r\/ab as runs_atbat FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    #### I want to join two tables to help in decision making __For seasons after 1999, which year had the most players selected as All Stars but didn\u0027t play in the All Star game?__ - Provide a summary of how many games, hits, and at bats all the players had in that year\u0027s post season. 
- The [data dictionary](https:\/\/data.world\/byuidss\/cse-250-baseball-database\/workspace\/file?filename=readme2014.txt) might help. ```python import pandas as pd import altair as alt import numpy as np import datadotworld as dw baseball_url = \u0027byuidss\/cse-250-baseball-database\u0027 ``` What table do we want for All Star information?    # %% # allstar table dw.query(baseball_url, \u0026#39;\u0026#39;\u0026#39; SELECT * FROM AllstarFull WHERE --? AND --? LIMIT 5 \u0026#39;\u0026#39;\u0026#39;).dataframe    Can you use a groupby to get the counts of players per year?    dw.query(baseball_url, \u0026#39;\u0026#39;\u0026#39; SELECT yearid, -- \u0026lt;stuff to calculate\u0026gt; FROM AllstarFull WHERE yearid \u0026gt; 1999 AND gp != 1 GROUP BY --? ORDER BY --? \u0026#39;\u0026#39;\u0026#39;).dataframe    What table do we want for the post season at bats?    dw.query(baseball_url, \u0026#39;\u0026#39;\u0026#39; SELECT * FROM BattingPost as bp LIMIT 5 \u0026#39;\u0026#39;\u0026#39;).dataframe    Can you join the post season batting table and AllStar information?     For each player, keep only the at bats, hits, the all star gp, and gameid columns. Let\u0026rsquo;s only keep players with at least one at bat in the post season.  dw.query(baseball_url, \u0026#39;\u0026#39;\u0026#39; SELECT -- \u0026lt;columns to keep\u0026gt; FROM BattingPost as bp JOIN AllstarFull as asf ON -- \u0026lt;two columns for the join\u0026gt; WHERE bp.yearid \u0026gt; 1999 AND gp != 1 AND -- \u0026lt;at bat condition\u0026gt; LIMIT 15 \u0026#39;\u0026#39;\u0026#39; ).dataframe    Let\u0026rsquo;s build the final table    For seasons after 1999, which year had the most players selected as All Stars but didn\u0026rsquo;t play in the All Star game?\n Provide a summary of how many games, hits, and at bats all the players had in that year\u0026rsquo;s post season.  
dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, \u0026#39;\u0026#39;\u0026#39; SELECT -- \u0026lt;lots of calculations\u0026gt; FROM BattingPost as bp JOIN AllstarFull as asf ON bp.playerid = asf.playerid AND bp.yearid = asf.yearid WHERE bp.yearid \u0026gt; 1999 AND gp != 1 AND ab \u0026gt; 0 GROUP BY -- \u0026lt;column\u0026gt; ORDER BY -- \u0026lt;column\u0026gt; \u0026#39;\u0026#39;\u0026#39; ).dataframe    --------------------------------------------------------------------- I want to see how much each college player from schools in the west and mountain west has made over their professional career. I want to know the full school name attended and the the Given name of each player. _Is this query correct?_ ```SQL SELECT cp.playerID, nameGiven, birthYear ,cp.schoolID, name_full ,SUM(salary) as salary FROM salaries as sal JOIN people as p ON p.playerID = sal.playerID JOIN CollegePlaying as cp ON p.playerID = cp.playerID JOIN schools as sc ON sc.schoolID = cp.schoolID WHERE sc.state = \u0027ID\u0027 GROUP BY cp.playerID, cp.schoolID ORDER BY name_full ``` ```python pd.read_sql_query( \u0027\u0027\u0027 SELECT cp.playerID, nameGiven, birthYear ,cp.schoolID, name_full ,SUM(salary) as salary FROM salaries as sal JOIN people as p ON p.playerID = sal.playerID JOIN CollegePlaying as cp ON p.playerID = cp.playerID JOIN schools as sc ON sc.schoolID = cp.schoolID WHERE sc.state = \u0027ID\u0027 GROUP BY cp.playerID, cp.schoolID ORDER BY name_full \u0027\u0027\u0027, con) ``` #### Let\u0027s start here ```python schools = pd.read_sql_query( \u0027\u0027\u0027 SELECT * FROM schools WHERE state = \u0027ID\u0027 \u0027\u0027\u0027, con) ``` ----------------------------------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p3\/d3\/"},{value:"pandas and Altair",label:"<p>For this skill builder, we are exploring some important functions in the package of pandas and Altair. DS programming requires a lot of data wrangling. 
Using the proper functions, we can create concise and comprehensive codes. You should be exposed to a few functions through the readings this week.\nYou may want to at least scan the readings before beginning this task since this serves as an assessment of your understanding of the assigned readings. A prepared student should be able to finish the exercises within 60 minutes. You should work through it on your own.\nBefore you start Make sure you have installed VS-code, pandas, and Altair on your computer. You can install these packages by typing this line in the terminal:\npip install pandas altair\nOR if you have more than one version of python:\npip3.9 install pandas altair\npip3.9 indicates the version of python you are installing the packages to.\nData import Run the following code to import the data we need for this skill builder:\n# package import import numpy as np import pandas as pd import altair as al # data import dat = pd.read_csv(\u0026#34;https:\/\/vincentarelbundock.github.io\/Rdatasets\/csv\/AER\/Guns.csv\u0026#34;) Make sure the variable dat is correctly assigned in your environment and finish the following exercises. You can read the documentation of the data on this page - https:\/\/vincentarelbundock.github.io\/Rdatasets\/doc\/AER\/Guns.html\nExercise 1 One of the first things we can do to a freshly imported data is to check its columns. This will help us understand the basic structure of the dataframe(table).\n Using one line of code, select all the columns in dat, assign it to a variable called col_list.\n  Hint Every dataframe has an attribute \u0022columns\u0022. Accessing this attribute will give you a list of all column names  We often want to know the dimension of a dataframe. How many columns are in the dataset? How many rows are in the dataset?\n Using one line of code, show the number of columns and rows in dat.\n  Hint Every dataframe has an attribute \u0022shape\u0022. 
Accessing this attribute will give you the dimension of a dataframe  Now run dat.head(). It will print out the first 5 rows of data in dat.\n Just from looking at the output, what column(s) seem to be redundant with the row number?\n  Hint There is one column that serves as nothing but a row counter; that column is redundant.  Exercise 2 After a brief investigation of the data, we will clean up the data. By cleaning up, we are trying to filter down dat so it only holds the data we need. We will first get rid of the extra column we found in the previous exercise.\n Using one line of code, drop the redundant column using the variable col_list (created in exercise 1)\n  Hint Use `drop()`. Understand what \u0026ldquo;axis\u0026rdquo; is as a parameter of drop().\nYour function should look like this:\ndat.drop([col_list[_]], axis = _)\nFill the \u0026ldquo;_\u0026quot;\u0026rsquo;s with the correct values and assign the output to dat.\n Don\u0026rsquo;t forget to save the changes in dat. Run dat.head() to make sure the column is dropped in dat.\nExercise 3 We have filtered dat vertically by dropping a column. Now we will try to filter dat horizontally, meaning we will get rid of some of the rows.\nWe can do that by applying a condition to dat. A condition is an expression that can be evaluated as True\/False. For example, 8 \u0026gt; 5 is an expression that evaluates to True. This is trivial because 8 will always be greater than 5.\nRun the code below:\n what is the difference between exp1 and exp2?\n exp1 = 8 \u0026gt; 5 exp2 = dat.violent \u0026lt; 300  Hint Try type() on each variable OR calling each variable.  
Run the code below:\n By putting dat.violent \u0026lt; 300, and the violent column from dat into a dataframe, what is the relationship between the two columns?\n exp = pd.DataFrame({\u0026quot;dat.violent \u0026lt; 300\u0026quot; : exp2, \u0026quot;violent value from dat\u0026quot; : dat.violent}) exp  Hint Try computing `dat.violent[n]  Using query(), filter down dat so that it only contains the data for Idaho\n  Hint query() takes in expressions and filters down data.  Don\u0026rsquo;t forget to save the changes in dat. Check dat.shape to make sure there are 23 rows and 13 columns.\nExercise 4 Besides filtering, we can manipulate the data by adding new data to it. By adding a new column to the data, we assign a new value to each row.\n Using assign(), create a new column that shows the ratio between murder rate and violent rate.\n  Hint Use assign() You can get the ratio by computing this code:\ndat.murder\/dat.violent\n Exercise 5  Create a scatter plot that shows the relationship between murder rate and violent rate for the state of Idaho. Your chart should show murder rate as the x-axis, violent as the y-axis.\n  Hint Can you mimic this plot? (https:\/\/altair-viz.github.io\/gallery\/scatter_tooltips.html)\n  For an extra push Exercise 6  Using a line of code, filter down the data set so that it only shows the data in years between 1993 and 1997.\n Exercise 7  Create a line chart that shows prisoner numbers for the states of Idaho, Utah, and Oregon.\n Your chart should show year as the x-axis, prisoner as the y-axis, states as different colours, along with an appropriate title.\nExercise 8  Without using query(), finish the data wrangling in questions 2, 5, and 6.\n After you have completed this skill builder with your team (or on your own) then compare your work to our script    See the script.   
</p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/skill_builders\/pandas_altair\/"},{value:"Day 2: Intro to Machine Learning",label:"<p>Welcome to class! Announcements Spiritual thought Are facts true?  How do you distinguish between truth and error? Joshua and Caleb  Building a Decision Tree  Import packages    Splitting the Data 1. Start with packages and data set We\u0026rsquo;ll be using some parts of SKLEARN package and the Seaborn package.\n# If you haven\u0026#39;t already, install scikit-learn and seaborn pip install scikit-learn seaborn from types import GeneratorType import pandas as pd import altair as alt import numpy as np import seaborn as sns from sklearn.model_selection import train_test_split from sklearn.naive_bayes import GaussianNB from sklearn.ensemble import RandomForestClassifier from sklearn.ensemble import GradientBoostingClassifier from sklearn.tree import DecisionTreeClassifier from sklearn import metrics What is the difference between dwellings_denver.csv and dwellings_ml.csv?\n2. Choose which variables to use How do we know which variables to use out of dwellings_ml.csv?\nQuestion 1 will help you identify patterns (or lack of patterns) in the data.\n3. Separate into features and target Which Features? # %% h_subset = dwellings_ml.filter([\u0026#39;livearea\u0026#39;, \u0026#39;finbsmnt\u0026#39;, \u0026#39;basement\u0026#39;, \u0026#39;yearbuilt\u0026#39;, \u0026#39;nocars\u0026#39;, \u0026#39;numbdrm\u0026#39;, \u0026#39;numbaths\u0026#39;, \u0026#39;stories\u0026#39;, \u0026#39;yrbuilt\u0026#39;, \u0026#39;before1980\u0026#39;]).sample(500) sns.pairplot(h_subset, hue = \u0026#39;before1980\u0026#39;) corr = h_subset.drop(columns = \u0026#39;before1980\u0026#39;).corr() # %% sns.heatmap(corr) 4. Split into training and testing sets What does the \u0026ldquo;train_test_split()\u0026rdquo; function do? x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = #???, random_state = #???) 
Read the documentation and tell me what is returned?\nFunction documentation\n Why do we use \u0026ldquo;test_size\u0026rdquo; and \u0026ldquo;random_state\u0026rdquo;?\n  What is \u0026ldquo;x\u0026rdquo; and \u0026ldquo;y\u0026rdquo; in the above function example?\n We need to take our data and build the feature and target data objects.\n What columns should we remove from our features (X)?\n  What column should we use as our target (y)?\n x = dwellings_ml.filter([#what variables will you use as \u0026#34;features\u0026#34;?]) y = dwellings_ml[#what variable is the \u0026#34;target\u0026#34;?] \nTraining a Classifier Decision Tree Example # create the model classifier = DecisionTreeClassifier() # train the model classifier.fit(x_train, y_train) # make predictions y_predictions = classifier.predict(x_test) # test how accurate predictions are metrics.accuracy_score(y_test, y_predictions) How to Improve Accuracy To improve the accuracy of your model, you could:\n Change what variables are used in the features (x) data set Change what type of model you are using Tune (aka, \u0026ldquo;change\u0026rdquo; or \u0026ldquo;tweak\u0026rdquo;) the parameters of the model  Other Classification Models Here are some other models you could try.\nfrom sklearn.naive_bayes import GaussianNB from sklearn.ensemble import RandomForestClassifier from sklearn.ensemble import GradientBoostingClassifier \nMake Progress on Project 4 Do the project readings    Machine Learning Introduction\n Step-by-step guide (mostly) for training a GaussianNB classifier. (The steps will be the same for any algorithm you use.)  Visual Introduction to Machine Learning\n Machine learning identifies patterns using statistical learning and computers by unearthing boundaries in data sets. You can use it to make predictions. One method for making predictions is called a decision trees, which uses a series of if-then statements to identify boundaries and define patterns in the data. 
Overfitting happens when some boundaries are based on distinctions that don\u0026rsquo;t make a difference. You can see if a model overfits by having test data flow through the model.     Start working on Question 1    The goal of Grand Question 1 is to help us with \u0026ldquo;feature selection\u0026rdquo;.\n \u0026ldquo;Overfitting\u0026rdquo; happens when some boundaries are based on distinctions that don\u0026rsquo;t make a difference. More data does not always lead to better models. (Occam\u0026rsquo;s Razor)  Common questions:\n Why it may be better to have fewer predictors in Machine Learning models? What is Feature Selection and why do we need it in Machine Learning?     What is the 5000 rows error with Altair?    The best way around this is to look at a sub-sample of the data for exploratory purposes. For example, you can use \u0026ldquo;sample(500)\u0026rdquo;. But there are ways to expand VS Code\u0026rsquo;s limits.\nMaxRowsError: How can I plot Large Datasets?\nYou may also save data to a local filesystem and reference the data by file path. Altair allows you to disable the max rows:\nalt.data_transformers.disable_max_rows() subset_data = denver.sample(n = 4999)    scikit-learn resources     Home page Tutorials Getting Started: What do you notice about the header portion of each of the script chunks?  import vs from ... import       My favorite comic    xkcd\n   ## Searching for patterns What ideas do you have for charts? ## Understanding the data What differences do you notice between these two data sets? ```python dwellings = pd.read_csv() dwellings_ml = pd.read_csv() ``` ------------------------------------------------------------------- What is the 5000 rows error with Altair?    MaxRowsError: How can I plot Large Datasets?\nYou may also save data to a local filesystem and reference the data by file path. 
Altair has a JSON data transformer that will do this transparently when enabled:\nalt.data_transformers.disable_max_rows() subset_data = denver.sample(n = 4999)    What features of homes might have changed a bit over time?    Some ideas:\n square footage number of bathrooms basement size  Let\u0026rsquo;s create one chart using some of these variables.\n   ----------------------------------------- What is scikit-learn?     Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.\n About scikit-learn helps us see the history and funding. It should stay \u0026ldquo;king of the hill\u0026rdquo; for a long time.\n Simple and efficient tools for predictive data analysis Accessible to everybody, and reusable in various contexts Built on NumPy, SciPy, and matplotlib Open source, commercially usable - BSD license     Should I import scikit-learn?    scikit-learn is very large, with many submodules. To help the user of your .py script understand your code, the consensus is to use from .... import .....\nfrom sklearn.model_selection import train_test_split from sklearn.naive_bayes import GaussianNB from sklearn.tree import DecisionTreeClassifier from sklearn import metrics    After choosing a machine learning method, what do we do?     Fit (or \u0026ldquo;train\u0026rdquo;) the model using the features (also called \u0026ldquo;X\u0026rdquo;) Predict the target (also called \u0026ldquo;y\u0026rdquo;) Evaluate model performance (using many different metrics)     ## Train the model What does the train_test_split() function do?    
Your turn: Read the documentation and tell me what is returned from the train_test_split() function.\nHow to save the output: Use a destructuring assignment\nx_train, x_test, y_train, y_test = train_test_split( x, y, test_size = .3, random_state = 76) Your turn:\n Why would we want to use the test_size and random_state arguments? What is x and y in the above example? Why do we care about splitting our data?     The next step    We need to take our data and build the feature and target data objects. Think about:\n What column(s) should we remove from our features (x)? What column(s) should we use as our target (y)?     ## Predicting targets and evaluating model performance What metrics should we use?    Do your reading! Read How to evaluate your ML model and try googling other ideas.\nAccuracy Question 2 is looking for a model that has \u0026ldquo;at least 90% accuracy\u0026rdquo;.\nConfusion Matrix A confusion matrix is a quick way to see the strengths and weaknesses of your model.\nYour turn: Look at the confusion matrix for our GaussianNB model. Where the model is doing well and where it might be falling short?\nYour turn: Now look at the confusion matrix for our Decision Tree model. What differences do you notice?\n# a confusion matrix print(metrics.confusion_matrix(y_test, y_predicted_GNB)) # this one might be easier to read print(pd.crosstab(y_test.flatten(), y_predicted_GNB, rownames=[\u0026#39;True\u0026#39;], colnames=[\u0026#39;Predicted\u0026#39;], margins=True)) # visualize a confusion matrix # requires \u0026#39;matplotlib\u0026#39; to be installed metrics.plot_confusion_matrix(classifier_GNB, x_test, y_test)    ------------------------------------------------------------------------- AI is able to learn \u0027rules\u0027 from highly repetitive data. [Sebastian Thrun](https:\/\/www.youtube.com\/watch?v=ZJixNvx9BAc)  The single most important thing for AI to accomplish in the next ten years is to free us from the burden of repetitive work. 
[Sebastian Thrun](https:\/\/www.youtube.com\/watch?v=ZJixNvx9BAc)   ### [Visual Introduction to Machine Learning](http:\/\/www.r2d3.us\/visual-intro-to-machine-learning-part-1\/)  1. Machine learning identifies patterns using statistical learning and computers by unearthing boundaries in data sets. You can use it to make predictions.  2. One method for making predictions is called a decision trees, which uses a series of if-then statements to identify boundaries and define patterns in the data.  3. Overfitting happens when some boundaries are based on distinctions that don\u0027t make a difference. You can see if a model overfits by having test data flow through the model. #### [Bias-Variance Tradeoff](http:\/\/www.r2d3.us\/visual-intro-to-machine-learning-part-2\/)  1. Models approximate real-life situations using limited data.  2. In doing so, errors can arise due to assumptions that are overly simple (bias) or overly complex (variance).  3. Building models is about making sure there\u0027s a balance between the two. #### But what is the \u0027Pavlovian bell\u0027 in the machine learning model? ![](..\/..\/images\/ml\/test.png) Some mathematical penalty\/reward equation.  - __[Regression](https:\/\/setosa.io\/ev\/ordinary-least-squares-regression\/)__  - __[Variance, RMSE, SD](..\/..\/interactive\/threshold_histogram.html)__  - __proportions__ ## Using our project data to understand features, targets, and samples.  1. Import `dwellings_ml.csv` and write a short sentence describing your data. Remember to explain an observation and what measurements we have on that observation.  2. Now try describing the modeling (machine learning) we are going to do in terms of features and targets.  A. Are there any columns that are the target in disguise?  B. _Are the observational units unique in every row?_ ![](..\/..\/images\/ml\/iris_description.png) ### If your model is near perfect in its predictability, you might be cheating. 
### Watch out for [transactional data](http:\/\/localhost:1313\/CSE250-Course\/images\/ml\/iris_description.png)!  - Financial: orders, invoices, payments  - Work: plans, activity records  - School: Grades ### [scikit learn](https:\/\/scikit-learn.org\/stable\/)  - [Tutorials](https:\/\/scikit-learn.org\/stable\/tutorial\/index.html)  - [Getting Started](https:\/\/scikit-learn.org\/stable\/getting_started.html): _What do you notice about the header portion of each of the script chunks?_  - [`import` vs `from ... import`](https:\/\/scikit-learn.org\/stable\/getting_started.html) ## Setting up Live Share -----------------------------------    </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p4\/d2\/"},{value:"Day 2: Seeing names with Altair",label:"<p>Welcome to class! Announcements Project Submissions  Don\u0026rsquo;t leave example text\/documentation from the Template in your writeup Change the Project Title (don\u0026rsquo;t have to call it Client Report) Code can be adjacent to the relevant output as long as it\u0026rsquo;s not distracting, but please include your complete code in an Appendix Be sure to save the QMD file before rendering   Autosave  Project 0 Wrap-up  If you still cannot render a document in Quarto, let me know Is python, at least up and running? able to plot graphs and make tables? Finishing up a report   Markdown  Tables - want to have the printed table in Markdown area, not a code area   HTML submissions  Other hints:\n Tutoring Lab Slack Channel: #tutoring_lab  Back to Day 1 Slides Methods Checkpoint Loading the names data    Visit the Project 1 Instructions to download the data. #%% # load packages import pandas as pd import altair as alt #%% # load data from url url = \u0026quot;this_is_the_url_to_the_csv_file\u0026quot; names = pd.read_csv(url) #%% # or, you can load data from file names2 = pd.read_csv(\u0026quot;names_year.csv\u0026quot;)    Pandas and DataFrames    What is a Pandas DataFrame? 
DataFrames come with attributes and built-in functions that can help us get a feel for our data.\nRun the code below one line at a time (or use other functions of your choice) to explore the names data. What do you learn?\nnames.columns names.shape names.size names.head() names.describe()    Understanding your data    You should be able to introduce your data sets to people, the same way you introduce a friend.\n If you can\u0026rsquo;t describe what a row is in your data, then you don\u0026rsquo;t understand what groups you can analyze. If you can\u0026rsquo;t describe what a column is in your data, then you don\u0026rsquo;t understand what information you can evaluate for each group.  Being able to explain your data out loud to someone else follows the same principles as rubber duck debugging.\n   Let\u0026rsquo;s practice!    Understanding column values How many unique names does the names dataset contain? Work with a partner to find the answer. I recommend searching the Pandas cheat sheet.\n Pull the name column out as a series Use the pandas unique function pd.unique() Find the size of the series  What is the range of years in the names dataset? Again, work with a partner and use the Pandas cheat sheet.\n Pull the year column as a series Find the max Find the min     ----------------------------------------------------- How many unique years do we have for our name?    pd.unique(dat.query(\u0027name == \u0026quot;John\u0026quot;\u0027).year).min() pd.unique(dat.query(\u0027name == \u0026quot;John\u0026quot;\u0027).year).max() pd.unique(dat.query(\u0027name == \u0026quot;John\u0026quot;\u0027).year).size     Filtering rows of a DataFrame    Make sure to do the project readings!  P4DS: 5.2 Filter rows with .query() The query method     ## Getting started with Altair ### Why are we using Altair? #### It is built on Vega and D3, which are fast and web-based.  
#### Grammar of Graphics: Vega-Lite ![](altair_grammar_graphics.png)  - [Technical Paper](https:\/\/www.domoritz.de\/papers\/2017-VegaLite-InfoVis.pdf)  - [Website](https:\/\/vega.github.io\/vega-lite\/)  - [Endorsment](https:\/\/medium.com\/@robin.linacre\/why-im-backing-vega-lite-as-our-default-tool-for-data-visualisation-51c20970df39) ------------------------------------ Grand Grand Question 1 What does a chart need to look like to answer Question 1?\nWhat data do we need to build that chart?\nMaking our chart look good.  Size of chart Title and subtitle Size and color of line Axis formatting Reference marks  Extra Practice Altair (and Vega and Vega-Lite and D3!)    What is the difference between a \u0026ldquo;high-level\u0026rdquo; and \u0026ldquo;low-level\u0026rdquo; programming language or tool? Here\u0026rsquo;s what Google has to say.\n Altair is a Python library built on Vega and Vega-Lite Vega is a \u0026ldquo;higher-level visualization specification language on top of D3\u0026rdquo; that creates charts with json files D3 is a JavaScript library     Altair: Removing commas from years    Remember, Altair builds on Vega, which builds on D3. Sometimes to answer a question about Altair, you will have to read Vega or D3 documentation. For example:\n Altair\u0026rsquo;s guide for customizing axis labels. (Scroll down to the second code example.) D3 options for different axis formats.  (alt.Chart(my_data) .mark_line() .encode( x = alt.X(\u0027year\u0027, axis = alt.Axis(format = \u0027d\u0027, title = \u0026quot;Year\u0026quot;)), y = alt.Y(\u0027Total\u0027, axis = alt.Axis(title = \u0026quot;Children with Name\u0026quot;)) ) )    Altair: Adding a reference line    You may want to include a point or line of reference to help your chart answer the question \u0026ldquo;compared to what?\u0026rdquo;. Let\u0026rsquo;s say you have your chart for Grand Question 1 saved as question_1. 
The easiest way I have found to add a reference line is to create a new DataFrame with a single number:\nline_df = pd.DataFrame({\u0027year\u0027: [1990]}) line_df And use the new DataFrame to create a chart with a single line that has a specific value of x (for example, your birth year) but spans the entire y-axis.\nIn Altair, this is done with the mark_rule() geometry. You can then \u0026ldquo;layer\u0026rdquo; the two charts together.\nline = alt.Chart(line_df).mark_rule(color=\u0026quot;red\u0026quot;).encode(x = \u0026quot;year\u0026quot;) final_chart = question_1 \u002b line final_chart Additional references:\n Using layered charts Altair Marks Add a horizontal line to an existent chart     ------------------------------------ Look at the names data and write a short paragraph in your notes describing the data set    We have a row for each name-year. Excluding the name and year columns we have a column for each state and DC. Finally there is a Total column that sums over the other columns.\n  If you can\u0026rsquo;t describe what a row is in your table then you don\u0026rsquo;t understand what groups you can talk about with your data. The columns tell you what information you will be able to evaluate on each \u0026lsquo;group\u0026rsquo; or \u0026lsquo;observation\u0026rsquo; in your data.   We want tidy data.\n   ----------------------- Which name has been given the most and the least?      Sum all the years for each name (groupby()). Create a new DataFrame for the totals. Write a query that filters the total data to the max and min. Create a markdown table with the information. A. to_markdown() requires the tabulate package. B. to_markdown() with arguments showindex and floatformat C. 
Guidance on floatformat   dat_total = dat.groupby(\u0027name\u0027).agg(n = (\u0027Total\u0027, \u0027sum\u0027)).reset_index() print(dat_total .query(\u0027n in [@dat_total.n.max(), @dat_total.n.min()]\u0027) .to_markdown(showindex = False, floatfmt=\u0026quot;.0f\u0026quot;))    -------------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p1\/d2\/"},{value:"Day 2: SQL Calculations",label:"<p>Welcome to class! Spiritual Thought Announcements  Project 3 - SQL practice  Class Activity in Slack Part 1 Goal: Describe in words (NOT using code) how to get from your starting data to your ending data.\nPost your answer in your group\u0026rsquo;s Slack thread. You have 7 minutes, and are allowed to ask me 1 question.\nPart 2 Goal: Now try to write a SQL query to get your ending data.\nPost your SQL query in your group\u0026rsquo;s Slack thread. You have 7 minutes, and are allowed to ask me 1 question.\nHere is the SQL template for your use.\nSELECT -- \u0026lt;columns\u0026gt; and \u0026lt;column calculations\u0026gt; FROM -- \u0026lt;table name\u0026gt;  JOIN -- \u0026lt;table name\u0026gt;  ON -- \u0026lt;columns to join\u0026gt; WHERE -- \u0026lt;filter condition\u0026gt; GROUP BY -- \u0026lt;subsets for column calculations\u0026gt; HAVING -- \u0026lt;grouped filter condition\u0026gt; ORDER BY -- \u0026lt;how the output is returned in sequence\u0026gt; LIMIT -- \u0026lt;number of rows to return\u0026gt; \n- Group 1: [SELECT and FROM](https:\/\/docs.data.world\/documentation\/sql\/concepts\/basic\/SELECT_and_FROM.html) with the `people` table (called \u0022master\u0022 in the data dictionary). Include examples of `SELECT AS` and `SELECT DISTINCT`.  - Group 2: [WHERE](https:\/\/docs.data.world\/documentation\/sql\/concepts\/basic\/WHERE.html) with the `schools` table. Try using different types of comparison operators, or making multiple comparisons with `AND`.  
- Group 3: [ORDER BY](https:\/\/docs.data.world\/documentation\/sql\/concepts\/basic\/ORDER_BY.html) with the `salaries` table. Try sorting in different orders (ascending or descending) and with multiple columns.  - Group 4: [JOIN](https:\/\/docs.data.world\/documentation\/sql\/concepts\/intermediate\/Joins.html) with the `schools` and `collegeplaying` tables (focus on \u0022inner\u0022 joins).  - Group 5: [Aggregations](https:\/\/docs.data.world\/documentation\/sql\/concepts\/intermediate\/aggregations.html) with the `batting` table.  - Group 6: [GROUP BY](https:\/\/docs.data.world\/documentation\/sql\/concepts\/intermediate\/GROUP_BY.html) with the `batting` table. -------------------------- Getting started Question One: Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID\/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.\nThink about:\n What tables (data) do you need? What SQL commands do you need?  What table do we want to use?    q = \u0026#39;\u0026#39;\u0026#39; SELECT * FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What columns do we want to select?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What calculation do we want to perform?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r, ab\/r FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; batting_calc = dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What name do we give our calculated column?    
q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r, ab\/r as runs_atbat FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; batting_calc = dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    #### I want to join two tables to help in decision making __Which year had the most players selected as All Stars but didn\u0027t play in the All Star game after 1999?__ - __provide a summary of how many games, hits, and at bats those players had in that year\u0027s post season.__ ```python import pandas as pd import altair as alt import numpy as np import datadotworld as dw con_url = \u0027byuidss\/cse-250-baseball-database\u0027 ``` What table do we want for All Star information?    # %% # allstar table dw.query(con_url, \u0026#39;\u0026#39;\u0026#39; SELECT * FROM AllstarFull WHERE AND LIMIT 5 \u0026#39;\u0026#39;\u0026#39;).dataframe    Can you use a groupby to get the counts of players per year?    dw.query(con_url, \u0026#39;\u0026#39;\u0026#39; SELECT yearid, -- \u0026lt;stuff to calculate\u0026gt; FROM AllstarFull WHERE yearid \u0026gt; 1999 AND gp != 1 GROUP BY --? ORDER BY --? \u0026#39;\u0026#39;\u0026#39;).dataframe    What table do we want for the post season at bats?    dw.query(con_url, \u0026#39;\u0026#39;\u0026#39; SELECT * FROM BattingPost as bp LIMIT 5 \u0026#39;\u0026#39;\u0026#39;).dataframe    Can you join the batting table and AllStar information and keep only the at bats, hits with the all star gp and gameid columns?    
Let\u0026rsquo;s only keep players with at least one at bat in the post season\ndw.query(con_url, \u0026#39;\u0026#39;\u0026#39; SELECT -- \u0026lt;columns to keep\u0026gt; FROM BattingPost as bp JOIN AllstarFull as asf ON -- \u0026lt;two columns for the join\u0026gt; WHERE bp.yearid \u0026gt; 1999 AND gp != 1 AND -- \u0026lt;at bat condition\u0026gt; LIMIT 15 \u0026#39;\u0026#39;\u0026#39; ).dataframe    Let\u0026rsquo;s build the final table    Which year had the most players selected as All Stars but didn\u0026rsquo;t play in the All Star game after 1999?\n provide a summary of how many games, hits, and at bats those players had in that year\u0026rsquo;s post season.  dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, \u0026#39;\u0026#39;\u0026#39; SELECT -- \u0026lt;lots of calculations\u0026gt; FROM BattingPost as bp JOIN AllstarFull as asf ON bp.playerid = asf.playerid AND bp.yearid = asf.yearid WHERE bp.yearid \u0026gt; 1999 AND gp != 1 AND ab \u0026gt; 0 GROUP BY -- \u0026lt;column\u0026gt; ORDER BY -- \u0026lt;column\u0026gt; \u0026#39;\u0026#39;\u0026#39; ).dataframe    --------------------------------------------------------- Extra Practice \u0026ldquo;I get SQL and want to be challenged.\u0026rdquo; Do this Math 335 task with SQL commands in Python.\n</p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p3\/d2\/"},{value:"Day 2: Transforming Data",label:"<p>Welcome to class! Spiritual Thought Announcements  Code chunk options:  Locally using #| warning: false Globally in the YAML using execute: warning: false    Flights Data Issues: What are some of the data issues you discovered while getting to know your data?\nLoading JSON files into pandas Let\u0026rsquo;s load in some practice data! 
Data link.\nHere\u0026rsquo;s a description of the data: Data Description.\nimport pandas as pd # to load and transform data import numpy as np # for math\/stat calculations # from url to pandas dataframe url = \u0026#34;https:\/\/github.com\/byuidatascience\/data4missing\/raw\/master\/data-raw\/mtcars_missing\/mtcars_missing.json\u0026#34; cars = pd.read_json(url) # or from file to pandas dataframe cars = pd.read_json(\u0026#34;mtcars_missing.json\u0026#34;) Look at the data for the first two cars. What is different about the format?\n[ { \u0026#34;car\u0026#34;: \u0026#34;Mazda RX4\u0026#34;, \u0026#34;mpg\u0026#34;: 21, \u0026#34;cyl\u0026#34;: 6, \u0026#34;disp\u0026#34;: 160, \u0026#34;hp\u0026#34;: 110, \u0026#34;drat\u0026#34;: 3.9, \u0026#34;wt\u0026#34;: 2.62, \u0026#34;qsec\u0026#34;: 16.46, \u0026#34;vs\u0026#34;: 0, \u0026#34;am\u0026#34;: 1, \u0026#34;gear\u0026#34;: 4, \u0026#34;carb\u0026#34;: 4 }, { \u0026#34;car\u0026#34;: \u0026#34;Mazda RX4 Wag\u0026#34;, \u0026#34;mpg\u0026#34;: 21, \u0026#34;cyl\u0026#34;: 6, \u0026#34;disp\u0026#34;: 160, \u0026#34;hp\u0026#34;: 110, \u0026#34;drat\u0026#34;: 3.9, \u0026#34;wt\u0026#34;: 2.875, \u0026#34;qsec\u0026#34;: 17.02, \u0026#34;am\u0026#34;: 1, \u0026#34;gear\u0026#34;: 4, \u0026#34;carb\u0026#34;: 4 } ] \nYour Turn: Transforming Data With your group, research these functions and create an example using the cars data. Post your example in Slack. 
Be prepared to teach the class about your functions.\nYou can use the Data Transformation textbook chapter and the pandas documentation to help you.\nRecreate the following output to the best of your abilities: Group 1: Working with rows  .query() allows you to subset observations (rows) .sort_values() arranges rows in a particular order  Group 2: Working with columns  .filter() (as well as [] and .loc[]) allow you to select columns .assign() is one way to add new columns to a dataframe  Group 3: Counting items  .value_counts() summarizes a column by counting the values inside .crosstab() creates a \u0026ldquo;cross tabulation\u0026rdquo; of two or more variables  Group 4: Summarizing data  Using .groupby() and .agg() together allows you to calculate group summaries  Your Turn: Summarizing the cars data Write code to calculate the mean weight wt for each cylinder type cyl.\nAnswer 1    cars.groupby(\u0027cyl\u0027).agg(mean_weight = (\u0027wt\u0027, np.mean)).reset_index()    Can you print the answer as a markdown table?\nAnswer 2    print(cars.groupby(\u0027cyl\u0027).agg(mean_weight = (\u0027wt\u0027, np.mean)).reset_index().to_markdown(index = False))    Project 2 FAQs Why are we using assign()    One main reason:\n You can create multiple columns within the same assign() where one of the columns depends on another one defined within the same assign. source: Documentation\n Other resources:\n Why use pandas.assign rather than simply initialize new column? 3 Ways to Add New Columns to Pandas Dataframe  Not related, but also fun: Should you use \u0026ldquo;dot notation\u0026rdquo; or \u0026ldquo;bracket notation\u0026rdquo; with pandas?\n   Lambda functions    Two ways to define the same function:\ndef square(x): return x**2 square = lambda x:x**2 There are some difference between them as listed below.\n  lambda is a keyword that returns a function object and does not create a \u0026lsquo;name\u0026rsquo;. 
Whereas def creates name in the local namespace lambda functions are good for situations where you want to minimize lines of code as you can create function in one line of python code. It is not possible using def lambda functions are somewhat less readable for most Python users. lambda functions can only be used once, unless assigned to a variable name.  source\n    Conditional operations    What if you want to create a new column, whose values depend on another column? There are a lot of ways to accomplish this (see this stackoverflow answer). Some functions I use:\n isin() method where() method You can also use an if else statement inside a lambda function     Missing data    We will learn how to identify and deal with missing data next week. For now, we can drop rows we don\u0026rsquo;t want using square brackets [] or .query().   API\u0026rsquo;s and JSON: A Primer Application Programming Interfaces (APIs) Representational State Transfer (REST APIs)  Over the course of the ’00s, another Web services technology, called Representational State Transfer, or REST, began to overtake [all other tools] for the purpose of transferring data. One of the big advantages of programming using REST APIs is that you can use multiple data formats — not just XML, but JSON and HTML as well. As web developers came to prefer JSON over XML, so too did they come to favor REST over SOAP. As Kostyantyn Kharchenko put it on the Svitla blog, “In many ways, the success of REST is due to the JSON format because of its easy use on various platforms.”\nToday, JSON is the de-facto standard for exchanging data between web and mobile clients and back-end services. ref\n JavaScript Object Notation  Well, when you’re writing frontend code in Javascript, getting JSON data back makes it easier to load that data into an object tree and work with it. And JSON formats data in a more succinct way, which saves bandwidth and improves response times when sending messages back and forth to a server. 
In a world of APIs, cloud computing, and ever-growing data, JSON has a big role to play in greasing the wheels of a modern, open web. ref\n Other Resources  RESTful APIs in 100 Seconds (video) Python API Tutorial: Getting Started with APIs Big List of Free and Open Public APIs (No Auth Needed)  How could we leverage numpy\u0026rsquo;s where() to address the different month proportions in question 3?    reference   How many rows have missing months?    flights.month.value_counts()    Can we figure out any patterns in the missingness?     pd.crosstab() groupby     -------------------------------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p2\/d2\/"},{value:"JSONs \u0026 missing",label:"<p>UFO Sightings Data Link to json file\nExercise 1 Read in the json file as a pandas dataframe. After reading in the data, you\u0026rsquo;ll want to explore it and gain some intuition. Exploring data is a very important step — the more you know about your data the better! Answer the following questions to gain some insight into this dataset.\n How many rows are there? How many columns? What does a row represent in this dataset? What are the different ways missing values are encoded? How many np.nan in each column?  Some useful code for exploring data\n# Object\/Categorical Columns data.column_name.value_counts(dropna=False) data.column_name.unique() # Numeric Columns data.column_name.describe() # Counting missing values data.isna().sum() # Creates boolean dataframe and sums each column  Exercise 2 After learning different ways our data encodes missing values, now we will neatly manage them. There are many techniques we can use to handle missing values; for example, we can drop all rows that contain a missing value, impute with mean or median, or replace missing values with a new missing category. We will use some of these techniques in this exercise.\n shape_reported - replace missing values with missing string. 
distance_reported - change -999 values to np.nan. (-999 is a typical way of encoding missing values.) distance_reported - fill in missing values with the mean (imputation) were_you_abducted - replace - string with missing string.  The first 10 rows of your data should look like this after completion of the above steps.\n    city shape_reported distance_reported were_you_abducted estimated_size     0 Ithaca TRIANGLE 8521.9 yes 5033.9   1 Willingboro OTHER 7438.64 no 5781.03   2 Holyoke OVAL 7438.64 no 697203   3 Abilene DISK 7438.64 no 5384.61   4 New York Worlds Fair LIGHT 6615.78 missing 3417.58   5 Valley City DISK 7438.64 no 4280.1   6 Crater Lake CIRCLE 7377.89 no 528289   7 Alma DISK 7438.64 missing 4772.75   8 Eklutna CIGAR 5214.95 no 4534.03   9 Hubbard CYLINDER 8220.34 missing 4653.72    Some useful code for filling in missing data\ndata.column_name.replace(..., ..., inplace=True) data.column_name.fillna(..., inplace=True)  Exercise 3 Create a table that contains the following summary statistics.\n median estimated size by shape mean distance reported by shape count of reports belonging to each shape  Your table should look like this:\n   shape_reported median_est_size mean_distance_reported group_count     CIGAR 5899.68 6520.21 3   CIRCLE 266002 7408.26 2   CYLINDER 4550.58 8039.49 2   DISK 4581.8 7516.39 16   FIREBALL 5407.22 7097.78 3   FLASH 6108.34 7438.64 1   FORMATION 5104.4 8708.32 2   LIGHT 3850.25 7636.09 2   OTHER 4699.4 7473.98 4   OVAL 4943.63 7787.24 4   RECTANGLE 3668.1 6054.62 2   SPHERE 5076.78 7206.55 6   TRIANGLE 5033.9 8521.9 1   missing 250153 7438.64 2    Some useful code for grouping and getting summary statistics\n(data.groupby(...) .agg(..., ..., ...))  Exercise 4 The cities listed below reported their estimated size in square inches, not square feet. Create a new column named estimated_size_sqft in the dataframe, that has all the estimated sizes reported as sqft. 
(Hint: divide by 144 to go from sqin -\u0026gt; sqft)\n Holyoke Crater Lake Los Angeles San Diego Dallas  The head of your data should look like this.\n    city shape_reported distance_reported were_you_abducted estimated_size estimated_size_sqft     0 Ithaca TRIANGLE 8521.9 yes 5033.9 5033.9   1 Willingboro OTHER 7438.64 no 5781.03 5781.03   2 Holyoke OVAL 7438.64 no 697203 4841.69   3 Abilene DISK 7438.64 no 5384.61 5384.61   4 New York Worlds Fair LIGHT 6615.78 missing 3417.58 3417.58   5 Valley City DISK 7438.64 no 4280.1 4280.1   6 Crater Lake CIRCLE 7377.89 no 528289 3668.68   7 Alma DISK 7438.64 missing 4772.75 4772.75   8 Eklutna CIGAR 5214.95 no 4534.03 4534.03   9 Hubbard CYLINDER 8220.34 missing 4653.72 4653.72    Some useful code to fix the rows reported in sqin\nnp.where(..., # Condition ..., # If condition is true ...) # If condition is false  After you have completed this skill builder with your team (or on your own) then compare your work to our script    See the script.   </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/skill_builders\/json_missing\/"},{value:"Day 1: Exploring names with pandas",label:"<p>Welcome to Class! Announcements  Data Science Society Kickoff! Wednesday at 6 in the STC 394 The data science lab  Completing Last Week  Quarto - \u0026ldquo;out of the frying pan and into the fire\u0026rdquo; Finishing the Introduction Project  Use the QMD Template project template Render as HTML and upload in Canvas    What was that data science community portion of our grade?    The Syllabus has this section which says:\n Data science community\n To earn credit for the DS Community element you must complete two different tasks from the list below. At the end of the semester, you will be asked to report on which tasks you completed and what you learned from them.\nAttend Data Science Society at least once.\n Sign up for an email newsletter that will teach you more about data science. 
Data Science Weekly or Data Elixir are good options. Listen to a podcast episode about data science. Build a Career in Data Science has some excellent episodes. Watch a professional presentation on YouTube about data science. Be prepared to share the link and a summary of the video. Reach out to someone who works in a data-related field and ask them for 15 minutes of their time. Use this time to conduct an “informational interview” and learn more about their responsibilities and career path. Research and apply to at least 5 data-related jobs or internships.  Interview Question: How do you keep up with the current methods in data science?\nDon\u0026rsquo;t Say: Nothing\n   Let\u0026rsquo;s Code! DS 250 workflow     You are going to hit SHIFT \u002b ENTER thousands of times. We don\u0026rsquo;t usually source our scripts. Think of Python Interactive like a graphing calculator or Excel on steroids. You code in pieces. Rewrite for clarity!     Can you figure out the functions of pandas?    Pandas Cheat Sheet and Basics Blog Post\n # Pause: can you explain what this code is doing? df = pd.DataFrame( {\u0026#34;a\u0026#34; : [5, 4, 6, 2, 3], \u0026#34;b\u0026#34; : [7, 8, 9, 10, 11], \u0026#34;c\u0026#34; : [10, 11, 12, 101, 0]}) Use the cheat sheet to find the functions you would need to implement the following steps.\nGroup 1\n sort my table by column a then only use the first 2 rows then calculate the mean of column b.  Group 2\n rename column a to duck then subset to only have duck and b columns then keep all rows where b is less than 9 then find the min of duck     What is method chaining?    Pandas is built to allow for method chaining. Here is a great resource on how to use method chaining: How to write neat pandas code.\n plotly.express creates a chart object pandas creates a DataFrame object We usually include () around our entire method so we can show it in steps.     
Project 1 - Intro Understanding your data You should be able to introduce your data sets to people, the same way you introduce a friend!\n What does each row represent? If you don\u0026rsquo;t know, then you don\u0026rsquo;t understand what groups you can analyze. What does each column represent? If you don\u0026rsquo;t know, then you don\u0026rsquo;t understand what information you can evaluate for each group.  Being able to explain your data out loud to someone else follows the same principles as rubber duck debugging.\nIntroduction to pandas \u0026ldquo;DataFrame\u0026rdquo; What is a pandas DataFrame? We can read the official documentation. I also like the video in this tutorial.\nDataFrames come with attributes and built-in functions that can help us get a feel for our data.\nRun the code below one line at a time (or use other functions of your choice) to explore the names data. What do you learn?\nmy_data.columns my_data.shape my_data.size my_data.head() my_data.describe() Setup for Project 1 Create the folder and files to get prepared.  DS250 \u0026gt; project_1 \u0026gt;  names.py names.qmd data.csv (just in case the internet is down)    \u0026ldquo;How should we start each file?\u0026rdquo; I would do this process for every project.\n names.py: Every file starts with the same cells 1) import packages, 2) load data. names.qmd: Let\u0026rsquo;s start with the course template notes.qmd: Keep project notes on the readings and things you learn. my_cheat_sheet.qmd: Update your own cheat sheet  Read in the data.\n#%% # load packages import pandas as pd import plotly.express as px #%% # load data url = \u0026#34;https:\/\/github.com\/byuidatascience\/data4names\/raw\/master\/data-raw\/names_year\/names_year.csv\u0026#34; names = pd.read_csv(url) 1. How many unique names does the names dataframe contain? Work with a partner to find the answer. 
You might want to look at this pandas cheat sheet.\nHint     Pull the name column out as a series Use the pandas unique function pd.unique() Find the size of the series     2. What is the range of years in the names dataframe? Again, work with a partner and use the pandas cheat sheet.\nHint2     Pull the year column out as a series Find the max Find the min     </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p1\/d1\/"},{value:"Day 1: Intro to Flights Data",label:"<p>Welcome to class! Spiritual Thought Short Link\nProject 1 Comments  Don\u0026rsquo;t include data as a table. Only include tables that add useful information. If I have to scroll up and down it isn\u0026rsquo;t useful. Reports should be readable by an intelligent, but non-technical audience (Meaningful titles and section names) Make it like something you\u0026rsquo;d like to read Clean out any code output, logs, that distract from the message (\u0026ldquo;My Useless Chart\u0026rdquo;) Eliminate \u0026ldquo;warnings\u0026rdquo;  Project 2: Late Flights and Missing Data JSON files (JavaScript Object Notation)  Today, JSON is the de-facto standard for exchanging data between web and mobile clients and back-end services. source\n What is JSON? 
[ { \u0026#34;car\u0026#34;: \u0026#34;Mazda RX4\u0026#34;, \u0026#34;mpg\u0026#34;: 21, \u0026#34;cyl\u0026#34;: 6, \u0026#34;disp\u0026#34;: 160, \u0026#34;hp\u0026#34;: 110, \u0026#34;drat\u0026#34;: 3.9, \u0026#34;wt\u0026#34;: 2.62, \u0026#34;qsec\u0026#34;: 16.46, \u0026#34;vs\u0026#34;: 0, \u0026#34;am\u0026#34;: 1, \u0026#34;gear\u0026#34;: 4, \u0026#34;carb\u0026#34;: 4 }, { \u0026#34;car\u0026#34;: \u0026#34;Mazda RX4 Wag\u0026#34;, \u0026#34;mpg\u0026#34;: 21, \u0026#34;cyl\u0026#34;: 6, \u0026#34;disp\u0026#34;: 160, \u0026#34;hp\u0026#34;: 110, \u0026#34;drat\u0026#34;: 3.9, \u0026#34;wt\u0026#34;: 2.875, \u0026#34;qsec\u0026#34;: 17.02, \u0026#34;am\u0026#34;: 1, \u0026#34;gear\u0026#34;: 4, \u0026#34;carb\u0026#34;: 4 } ] Introduce the data Load the JSON file and spend a few minutes studying it. Can you learn enough about it to describe the columns and rows?\nHints:\n You can use .describe() to learn about the distribution of a numeric variable. You can use .value_counts() to learn about the distribution of a categorical variable. .crosstab() creates a \u0026ldquo;cross tabulation\u0026rdquo; of two or more categorical variables.  Can you trust the data? Do you notice anything interesting about the flights data?\nQuestion Brainstorming In your group, try to answer the following questions about your assigned question:\n What is our goal? How can we get there? What will the answer look like when we\u0026rsquo;re done?  Project 2 FAQs Missing data    Not all missing data is represented as np.nan. For an example, look at the column that counts delays due to late aircraft.\nWe will learn how to identify and deal with missing data next week. For now, we can drop rows we don\u0026rsquo;t want using square brackets [] or .query().\n   What columns do we need to use for question 3 (total number of flights delayed by weather)?      
num_of_delays_weather num_of_delays_late_aircraft num_of_delays_nas      Groups 1 and 5 - Working with rows  .query() allows you to subset observations (rows) .sort_values() arranges rows in a particular order  Groups 2 and 6 - Working with columns  .filter() (as well as [] and .loc[]) allow you to select columns .assign() is one way to add new columns to a dataframe  Groups 3 and 7 - Counting items  .value_counts() summarizes a column by counting the values inside .crosstab() creates a \u0026ldquo;cross tabulation\u0026rdquo; of two or more variables  Groups 4 and 8 - Summarizing data  Using .groupby() and .agg() together allows you to calculate group summaries   Your Turn: Summarizing the cars data Write the code to calculate the mean weight wt for each cylinder type cyl.\nAnswer 1    cars.groupby(\u0027cyl\u0027).agg(mean_weight = (\u0027wt\u0027, np.mean)).reset_index()    Can you print the answer as a markdown table?\nAnswer 2    cars.groupby(\u0027cyl\u0027).agg(mean_weight = (\u0027wt\u0027, np.mean)).reset_index().to_markdown(index = False)    -------------------------------------------------------------------------- The flights data How are we going to answer Question 1 and Question 2?\nWatch out for different forms of missing data!    Not all missing data is represented as np.nan. For an example, look at the column that counts delays due to late aircraft.   What columns do we need to use for question 3 (total number of flights delayed by weather)?      num_of_delays_weather num_of_delays_late_aircraft num_of_delays_nas      How could we leverage numpy\u0026rsquo;s where() to address the different month proportions in question 3?    reference   How many rows have missing months?    flights.month.value_counts()    Can we figure out any patterns in the missingness?     pd.crosstab() groupby     Project 1: Names In your groups, discuss:\n What did you learn about data and Altair? What questions do you still have?  
Connecting to Application Programming Interfaces (APIs) Representational State Transfer (REST APIs)  Over the course of the ’00s, another Web services technology, called Representational State Transfer, or REST, began to overtake [all other tools] for the purpose of transferring data. One of the big advantages of programming using REST APIs is that you can use multiple data formats — not just XML, but JSON and HTML as well. As web developers came to prefer JSON over XML, so too did they come to favor REST over SOAP. As Kostyantyn Kharchenko put it on the Svitla blog, “In many ways, the success of REST is due to the JSON format because of its easy use on various platforms.”\nToday, JSON is the de-facto standard for exchanging data between web and mobile clients and back-end services. ref\n JavaScript Object Notation  Well, when you’re writing frontend code in Javascript, getting JSON data back makes it easier to load that data into an object tree and work with it. And JSON formats data in a more succinct way, which saves bandwidth and improves response times when sending messages back and forth to a server. In a world of APIs, cloud computing, and ever-growing data, JSON has a big role to play in greasing the wheels of a modern, open web. ref \u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026gt;\n What does missing data look like?    How many missing values do you see in the first ten rows? (The mtcars documentation might help.)\ncars.head(10)    How many missing values are there?    
#%% cars.isna().sum() #%% cars.isin([\u0026#39;\u0026#39;]).sum() #%% cars.describe() reference 1 and reference 2\n   ### How Pandas handles missingness Read [\u0027Handling missing in pandas\u0027](https:\/\/pandas.pydata.org\/pandas-docs\/stable\/user_guide\/missing_data.html#calculations-with-missing-data) ```python import numpy as np df = (pd.DataFrame( np.random.randn(5, 3), index=[\u0027a\u0027, \u0027c\u0027, \u0027e\u0027, \u0027f\u0027, \u0027h\u0027], columns=[\u0027one\u0027, \u0027two\u0027, \u0027three\u0027]) .assign( four = \u0027bar\u0027, five = lambda x: x.one \u003e 0, six = [np.nan, np.nan, 2, 2, 1], seven = [4, 5, 5, np.nan, np.nan]) ) ``` What happens when you add two pandas objects with missing values?    df.seven \u002b df.six reference\n   What happens when you sum within a column?    df.seven.sum() reference\n   How could I add two columns treating NaN like zeros?    df.seven.fillna(0) \u002b df.six.fillna(0) reference\n   ----------------------------------------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p2\/d1\/"},{value:"Day 1: Intro to ML",label:"<p>Welcome to class! Announcements  Project 3 - Getting pickier about good communication  Tables in reports should be as concise as possible (no duplicate information) Career batting average Meaningful report name (Drop \u0026ldquo;Client Report\u0026rdquo;) Meaningful section headers so the table of contents is useful (don\u0026rsquo;t call them \u0026ldquo;Question 1\u0026rdquo;) Don\u0026rsquo;t include \u0026ldquo;My useless chart\u0026rdquo; from the template   Ask for help!  Computing lab Computing lab Slack channel (search) Slack classmates or general channel    Spiritual Thought Genesis 1:1 and Machine Learning Are facts true? Pictionary!   
----------------------  From Sebastian Thrun:\n AI is able to learn \u0026lsquo;rules\u0026rsquo; from highly repetitive data.\nThe single most important thing for AI to accomplish in the next ten years is to free us from the burden of repetitive work.\n Your Turn: Student Classification Problem Can we predict if a student is from Utah?\nYour Turn: Features and Targets Import dwellings.csv. With a neighbor:\n Try to describe the data. Explain what each observation (row) is and what measurements we have on that observation (columns). Now try describing the modeling (machine learning) we are going to do in terms of \u0026ldquo;features\u0026rdquo; and \u0026ldquo;targets\u0026rdquo;. Watch out - are there any columns that are the target in disguise? (You may need to review the project goal.) What features do you expect to have a strong relationship with the target?  Before Next Class Start working on Question 1    The goal of Question 1 is to help us with \u0026ldquo;feature selection\u0026rdquo;.\n Remember: Overfitting happens when some boundaries are based on distinctions that don\u0026rsquo;t make a difference. More data does not always lead to better models. (Occam\u0026rsquo;s Razor)  Common questions:\n Why it may be better to have fewer predictors in Machine Learning models? What is Feature Selection and why do we need it in Machine Learning?     Do the project readings    Machine Learning Introduction\n Step-by-step guide (mostly) for training a GaussianNB classifier. (The steps will be the same for any algorithm you use.)  Visual Introduction to Machine Learning\n Machine learning identifies patterns using statistical learning and computers by unearthing boundaries in data sets. You can use it to make predictions. One method for making predictions is called a decision tree, which uses a series of if-then statements to identify boundaries and define patterns in the data. 
Overfitting happens when some boundaries are based on distinctions that don\u0026rsquo;t make a difference. You can see if a model overfits by having test data flow through the model.     What is the 5000 rows error with Altair?    MaxRowsError: How can I plot Large Datasets?\nYou may also save data to a local filesystem and reference the data by file path. Altair has a JSON data transformer that will do this transparently when enabled:\n# Try doing data exploration with: subset_data = denver.sample(n = 4999)    ---------------------- scikit-learn resources     Home page Tutorials Getting Started: What do you notice about the header portion of each of the script chunks?  import vs from ... import       1. Models approximate real-life situations using limited data.  2. In doing so, errors can arise due to assumptions that are overly simple (bias) or overly complex (variance).  3. Building models is about making sure there\u0027s a balance between the two. #### But what is the \u0027Pavlovian bell\u0027 in the machine learning model? ![](..\/..\/images\/ml\/test.png) Some mathematical penalty\/reward equation.  - __[Regression](https:\/\/setosa.io\/ev\/ordinary-least-squares-regression\/)__  - __[Variance, RMSE, SD](..\/..\/interactive\/threshold_histogram.html)__  - __proportions__ ## Using our project data to understand features, targets, and samples.  1. Import `dwellings_ml.csv` and write a short sentence describing your data. Remember to explain an observation and what measurements we have on that observation.  2. Now try describing the modeling (machine learning) we are going to do in terms of features and targets.  A. Are there any columns that are the target in disguise?  B. 
_Are the observational units unique in every row?_ ![](..\/..\/images\/ml\/iris_description.png) --------------- - Financial: orders, invoices, payments  - Work: plans, activity records  - School: Grades ------------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p4\/d1\/"},{value:"Day 1: Intro to Project 3",label:"<p>Welcome to class! Spiritual Thought Announcements  Project 2 Highlights Project 2 comments   Turn them in Clean up graphs (main titles, axis labels, legends) Column headers on tables in your report (don\u0026rsquo;t include index number either) Technically Proportion of all flights delayed by weather, not the proportion of delayed flights JSON should look like a text example of a record, not a table  Things for next project:   Be sure to give section headers meaningful titles (NOT \u0026ldquo;Question 1\u0026rdquo;) Drop \u0026ldquo;my useless chart\u0026rdquo; from your graphs  What is Structured Query Language (SQL)?   Ray and I were impressed by how compactly Codd’s languages could represent complex queries. However, at the same time, we believed that it should be possible to design a relational language that would be more accessible to users without formal training in mathematics or computer programming. We believed that barriers to widespread acceptance of Codd’s languages existed on two levels.  .   1. The first barrier came from the mathematical notation, which was hard to enter at a keyboard. This barrier was superficial and could be easily dealt with by replacing symbols with keywords.   2. The more difficult barrier was at the semantic level. The basic concepts of Codd’s languages were adapted from set theory and symbolic logic. 
This was natural given Codd’s background as a mathematician, _but Ray and I hoped to design a relational language based on concepts that would be familiar to a wider population of users._ We also hoped to extend the language to encompass database updates and administrative tasks such as the creation of new tables and views, which had traditionally been outside the scope of a query language. SQL is \u0022a relational language based on concepts that would be familiar to a wider population of users.\u0022  When we moved to the San Jose Research Laboratory in 1973 to join the System R project, we began work on another new language that we called Sequel. Sequel allowed the well-paid-employee query to be represented in a readable form free from mathematical concepts and symbols. ... In 1977, because of a trademark issue, the name Sequel was shortened to SQL.  ------------------------------------- Ok, but how does it work? SQL uses keywords to pull (or \u0026ldquo;fetch\u0026rdquo;, \u0026ldquo;extract\u0026rdquo;) the data we want from a database. The computer reads those keywords in a specific order.\nFrom EverSQL we can get some more background:\n This is the logical order of operations, also known as the order of execution, for an SQL query:\n  FROM, including JOINs WHERE GROUP BY HAVING WINDOW functions SELECT DISTINCT UNION ORDER BY LIMIT and OFFSET   But the reality isn\u0026rsquo;t that easy nor straight forward. As we said, the SQL standard defines the order of execution for the different SQL query clauses. Said that, modern databases are already challenging that default order by applying some optimization tricks which might change the actual order of execution, though they must end up returning the same result as if they were running the query at the default execution order.\n For CSE 250: Don\u0026rsquo;t think too hard about optimization at this point. 
Let the database figure out the optimized routine.\nMost SQL queries are typed in the following pattern:\nSELECT -- \u0026lt;columns\u0026gt; and \u0026lt;column calculations\u0026gt; FROM -- \u0026lt;table name\u0026gt;  JOIN -- \u0026lt;table name\u0026gt;  ON -- \u0026lt;columns to join\u0026gt; WHERE -- \u0026lt;filter condition\u0026gt; GROUP BY -- \u0026lt;subsets for column calculations\u0026gt; HAVING -- \u0026lt;grouped filter condition\u0026gt; ORDER BY -- \u0026lt;how the output is returned in sequence\u0026gt; LIMIT -- \u0026lt;number of rows to return\u0026gt; \nProject 3 - what are our goals? Do we understand the questions being asked in Project 3?\nThe baseball data Let\u0026rsquo;s start exploring the baseball data!\n You\u0026rsquo;ll need to download the SQLite Database And review the data dictionary  import pandas as pd import sqlite3 con = sqlite3.connect(\u0026#39;lahmansbaseballdb.sqlite\u0026#39;) df = pd.read_sql_query(\u0026#34;SELECT * FROM fielding LIMIT 5\u0026#34;, con) df How can we see what tables are in the database?\nimport pandas as pd import sqlite3 con = sqlite3.connect(\u0026#39;lahmansbaseballdb.sqlite\u0026#39;) pd.read_sql_query(\u0026#34;\u0026#34;\u0026#34; SELECT name FROM sqlite_master WHERE type=\u0026#39;table\u0026#39; \u0026#34;\u0026#34;\u0026#34;, con) Understanding SQL queries Make sure you do the project readings!\nWhat table do we want to use?    q = \u0026#39;\u0026#39;\u0026#39; SELECT * FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What columns do we want to select?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What calculation do we want to perform?    
q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r, ab\/r FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; batting_calc = dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What name do we give our calculated column?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r, ab\/r as runs_atbat FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; batting_calc = dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    -------------------------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p3\/d1\/"},{value:"SQL \u0026 databases",label:"<p>Skill builder (relational database) For this skill builder, we are exploring some important topics in relational databases. This exercise will require you to create SQL queries through Python. You may want to at least scan the readings before beginning this task since this serves as an assessment of your understanding of the assigned readings.\nA competent student should be able to finish the exercises within 75 minutes.\nBefore you start Make sure you have installed VS Code, pandas, and Altair on your computer.\nAlso make sure you have gone through the tutorial under course materials called SQL for Data Science: we assume that you have a connection to your data.\nExercise 1 Readme file A database can consist of more than one table\/data set. A relational database consists of tables\/data sets that share columns. These shared columns then establish the relationship between the tables, thus the name relational database. 
The relations are sometimes not easily found and they require careful investigation.\nTo understand what is in a relational database, we can start with understanding the tables and the columns within.\nHere is a link to the readme file of the baseball database.\n What is the name of the table that records data about pitchers in the regular seasons?\n  What do the HR and HBP columns mean in that table respectively?\n Exercise 2 SELECT and FROM The simplest SQL query is a query with SELECT and FROM. These are the keywords you will see again and again in SQL. Usually, when constructing a more complex query, it is easier to identify what goes into these two clauses first.\n Create a query that shows all columns from the table you found in Exercise 1, save the dataframe in a variable \u0026ldquo;pitch\u0026rdquo;\n Your script should look something like:\nresult = pd.read_sql_query( \u0027SELECT _______ FROM _______\u0027, con) result Exercise 2 WHERE The WHERE keyword allows us to filter down the table horizontally (fewer rows).\nIt goes after SELECT and FROM.\n Using a SQL query, select all rows in the same table where HR is less than 10 and gs is greater than 25.\n  Find out what the columns mean and explain your query in words\n Exercise 3 ORDER BY ORDER BY sorts the table you select by one or more columns and goes after WHERE\n Using the same query in exercise 2, edit it so that the table is ordered by the year of the season (most recent first) and the player ID (alphabetically).\n Exercise 4 Joins Joins are used when you wish to create a new table through two different tables. 
Keep in mind that you have to identify the relationship between two tables before you can correctly join them.\nJOIN goes between FROM and WHERE.\n Identify the shared columns (keys) and join the table in exercise 2 with the salaries table, then filter the data so that it shows only pitchers in the year 1986.\n You should get a dataframe with 306 rows.\nExercise 5 Group by Group by is a keyword we use to lower the level of granularity of a table, meaning we combine rows into one by the given column(s).\nCreate a query that captures the number of pitchers the Washington Nationals used in each year, then sort the table by year\nYou should get a dataframe with 23 rows.\nFor the overachievers Exercise 6 Research the order of operations for SQL and put the following keywords in that order.\n SELECT FROM JOIN WHERE HAVING ORDER BY GROUP BY LIMIT  After you have completed this skill builder with your team (or on your own) then compare your work to our script    See the script.   </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/skill_builders\/relational_data\/"},{value:"Week 8-9: Project 4 - Homes",label:"<p></p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p4\/"},{value:"Machine Learning",label:"<p>Intro to Titanic Machine Learning Skill Builder Link to data\nFor this skill builder, we\u0026rsquo;ll be putting our machine learning hats on. We\u0026rsquo;ll be creating a model that predicts whether a passenger survived. With machine learning, there is a lot of jargon! It can be quite overwhelming at times. This skill builder attempts to keep things basic and simple. With that being said, there are some terms that are important to understand. 
Let\u0026rsquo;s look at the first few rows of our dataset before proceeding with the definitions.\nThe titanic dataset will be used for examples of each definition.\n   survived pclass sex age siblings_spouses_aboard parents_children_aboard fare     0 3 1 22 1 0 7.25   1 1 0 38 1 0 71.2833   1 3 0 26 0 0 7.925   1 1 0 35 1 0 53.1   0 3 1 35 0 0 8.05    Important Terms:  features: measurable property of the object you\u0026rsquo;re trying to predict. We use this information to predict our target of interest.  Example: pclass, sex, age, siblings_spouses_aboard , parents_children_aboard, fare columns are all examples of different features. Synonyms: attributes, explanatory variables, independent variables, variables, X\u0026rsquo;s, covariates   target: the feature that you are wanting to gain more insight into. The thing you are trying to predict.  Example: in the titanic dataset our target is survived Synonyms: label, dependent variable, y   train set: Usually 70% of the rows from the original dataset are randomly sampled to create this training data. It\u0026rsquo;s used by the algorithm, to determine, or learn, the optimal combinations of variables that will generate a good predictive model  Example: Random sample of 70% of the original titanic dataset rows Synonyms: training data, train data, X_train, y_train   test set: Usually the remaining 30% of the rows in the original dataset are used to create this dataset. The testing data is a set of rows used only to assess the performance (i.e. generalization) of a model. To do this, the final model is used to predict classifications of examples in the test set. Those predictions are compared to the examples\u0026rsquo; true classifications to assess the model\u0026rsquo;s accuracy.  Example: Random sample of 30% of the original titanic dataset rows Synonyms: testing data, test data, X_test, y_test   evaluation metrics: A statistic that tells you how well your predictions align with the actual values. 
In other words, it tells you how good your model is.  Example: Accuracy, Precision, Recall, MSE, MAE, Rsquared Synonyms: performance metric    Again, this is a very light and oversimplified treatment of machine learning. The purpose of this project is to help you understand the main concepts of ML and walk you through the process of building a machine learning model. A simplified workflow of a machine learning project is shown below. Spend some time getting familiar with this flow, as you are about to code it\u0026hellip; Exciting!\nNote in order to do this skill builder you will need to have scikit-learn installed on your machine. Run the following command in your terminal if you haven\u0026rsquo;t already.\npip install scikit-learn\nData Link to csv file\nExercise 0 (Imports and Loading in data) # Loading in packages import pandas as pd import numpy as np import altair as alt from sklearn.model_selection import train_test_split from sklearn.ensemble import RandomForestClassifier from sklearn.metrics import accuracy_score # Loading in data data = pd.read_csv(___)  Exercise 1 Create a chart exploring the relationship between age and survived in the titanic dataset. A strip plot, density plot, or boxplot might be useful here. Below is an example of a density plot. Feel free to replicate this chart or create your own.\nThe purpose of making this chart is to explore the relationships between a feature and the target. We want to see if the feature contains predictive information about the target. This is a large part of machine learning called Exploratory Data Analysis that should never be skipped! Spend time getting to know your features and how they interact with other features and the target.\n Exercise 2 Build a random forest model that is able to predict whether a passenger survived. 
This exercise is the bulk of the skill builder and contains several steps.\nStep 0: Split the data into X and y variables The X variable will contain all your features\n# Removes the target and keeps all features X = data.drop(___, axis=1) The y variable will hold the target\n# Selects the target column y = data[\u0026#39;___\u0026#39;] Step 1: Split data into train and test sets The train_test_split function is useful for this task. Review the train_test_split function documentation\n# Splitting X and y variables into train and test sets using stratified sampling X_train, X_test, y_train, y_test = train_test_split(___, ___, test_size=0.3, random_state=24, stratify=y) Step 2: Train the model Explore the RandomForestClassifier documentation for the RandomForestClassifier. It\u0026rsquo;s not necessary to understand the inner workings of the Random Forest algorithm for this class - just learn the syntax of fitting the model.\n# Creating random forest object rf = RandomForestClassifier(random_state=24) # Fit with the training data rf.fit(___, ___) Step 3: Use test set to make predictions # Using the features in the test set to make predictions y_pred = rf.predict(___) Step 4: Compare test set predictions to actual values. Calculate the accuracy. # Comparing predictions to actual values accuracy_score(___, ___)  Exercise 3 What is the most important feature in making predictions? Why do you think this is?\nCreate a table that shows the feature importances in descending order. The random forest classifier has a feature importances attribute. It can be accessed by rf.feature_importances_. The table should look something like this.\n   feature names importances     fare 0.288051   sex 0.281853   age 0.266491   pclass 0.0814224   siblings_spouses_aboard 0.0475633   parents_children_aboard 0.034619    After you have completed this skill builder with your team (or on your own) then compare your work to our script    See the script.   
</p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/skill_builders\/ml_sklearn\/"},{value:"Week 6-7: Project 3 - Baseball",label:"<p> We will use a baseball relational database to explore SQL in Python for data science applications. Finding relationships in baseball\n Completed Readings: SQL for Data Science Readings (read all links) and Why SQL is beating NoSQL, and what this means for the future of data\nUse the data.world baseball url for the Data Connection. You can read the Connection Instructions for data.world here\nGrand Questions   Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID\/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.\n  This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)\n Write an SQL query that provides playerID, yearID, and batting average for players with at least one at bat. Sort the table from highest batting average to lowest, and show the top 5 results in your report. Use the same query as above, but only include players with more than 10 “at bats” that year. Print the top 5 results. Now calculate the batting average for players over their entire careers (all years combined). Only include players with more than 100 at bats, and print the top 5 results.    Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc.). Write an SQL query to get the data you need. Use Python if additional data wrangling is needed, then make a graph in Altair to visualize the comparison. 
Provide the visualization and the compiled Vega script that would build the visualization.\n   </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p3\/"},{value:"Munging data",label:"<p>Intro to cleaning movies data Link to the data\nThis skill builder focuses on munging (formatting) data into a machine learning ready dataset. We will be using an IMDB Ratings dataset. It contains columns that are categorical. Sklearn cannot handle columns that are strings, so we need to convert these into a numerical representation. We accomplish this by either one hot encoding, label encoding, or taking just one value of the range provided. There are many other ways to represent these columns as numbers, but they are beyond the scope of this course.\nOnce you\u0026rsquo;ve converted all columns to numeric, in an intelligent way, you will be asked to recreate a graph using altair. Here is the head of the data you will be working with. Enjoy!\n   star_rating content_rating genre duration box_office_rev major_hit     9.3 R Crime 142 €1924521976 - €1925521976 no   9.2 R Crime 175 €177034987 - €178034987 no   9.1 R Crime 200 €2617541398 - €2618541398 no   9 PG-13 Action 152 €996115723 - €997115723 no   8.9 R Crime 154 €1172054364 - €1173054364 no    Data Link to csv file: ...\n Exercise 0  Grab the high range value for each movie and put it into a new column called high_range_rev.  Make sure the data type of this new column is numeric!!   Remove the box_office_rev column from the dataset.  The .str.split() and .astype() methods might be of use! 
Also, to get the euro sign just copy it from here, €, and put it in your code.\nThe first 5 rows of the resulting dataframe should look like this\n   star_rating content_rating genre duration major_hit high_range_rev     9.3 R Crime 142 no 2345444803   9.2 R Crime 175 no 2182412593   9.1 R Crime 200 no 1604872807   9 PG-13 Action 152 no 284317976   8.9 R Crime 154 yes 1791932201     Exercise 1 Convert the major_hit column to 1\/0\u0026rsquo;s. yes -\u0026gt; 1 and no -\u0026gt; 0. Again, there are several ways to accomplish this. Using our old friend np.where is probably the easiest though.\nThe first 5 rows of the resulting dataframe should look like this\n   star_rating content_rating genre duration major_hit high_range_rev     9.3 R Crime 142 0 1925521976   9.2 R Crime 175 0 178034987   9.1 R Crime 200 0 2618541398   9 PG-13 Action 152 0 997115723   8.9 R Crime 154 0 1173054364     Exercise 2 Convert the content_rating column using label encoding. We\u0026rsquo;re using label encoding in this case because the movie ratings already have a natural ordering to them. We will replace each rating with a number in its natural ascending order.\nTo be more specific, here is how we will do it.\n G: 0 PG: 1 PG-13: 2 R: 3  A dictionary and the .map() method could be useful for this exercise. There are other ways of tackling this problem though. Be creative!\nThe first 5 rows of the resulting dataframe should look like\n   star_rating content_rating genre duration major_hit high_range_rev     9.3 3 Crime 142 0 1925521976   9.2 3 Crime 175 0 178034987   9.1 3 Crime 200 0 2618541398   9 2 Action 152 0 997115723   8.9 3 Crime 154 0 1173054364     Exercise 3 The last column that we need to take care of is genre. We will use one hot encoding for this. Make sure to ONLY one hot encode the genre column!\nA useful function for one hot encoding is pd.get_dummies(). 
I recommend checking out the documentation.\nThe resulting dataframe should look like the following example; don\u0026rsquo;t worry if your high_range_rev column turned into scientific notation—Pandas does this sometimes.\n    star_rating content_rating duration major_hit high_range_rev genre_Action genre_Adventure genre_Animation genre_Biography genre_Comedy genre_Crime genre_Drama genre_Family genre_Fantasy genre_Horror genre_Mystery genre_Sci-Fi genre_Thriller genre_Western     0 9.3 3 142 0 1.92552e\u002b09 0 0 0 0 0 1 0 0 0 0 0 0 0 0   1 9.2 3 175 0 1.78035e\u002b08 0 0 0 0 0 1 0 0 0 0 0 0 0 0   2 9.1 3 200 0 2.61854e\u002b09 0 0 0 0 0 1 0 0 0 0 0 0 0 0   3 9 2 152 0 9.97116e\u002b08 1 0 0 0 0 0 0 0 0 0 0 0 0 0   4 8.9 3 154 0 1.17305e\u002b09 0 0 0 0 0 1 0 0 0 0 0 0 0 0     Exercise 4 Recreate this graph as best you can. You\u0026rsquo;ll need to use the original data that specifies the actual rating.\nAfter you have completed this skill builder with your team (or on your own) then compare your work to our script    See the script.   </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/skill_builders\/munging\/"},{value:"Week 4-5: Project 2 - Flights",label:"<p>JSON files are the format of choice for sharing information and data between apps on the internet. When you hear someone explain that you can use an API to get the data, there is usually a JSON file involved. The history of JSON is worth reading. We will have another project analyzing data from JSON files that are missing values. Are we missing JSON on our flight?\n Completed Readings: P4DS: Chapter 5 Data tranformation, P4DS: Section 7.4 Missing Values, Python Data Science Handbook: Missing Data, How to Handle Missing Data, and Wikipedia Missing Data\n The flights JSON File\nand the Data Description\n ### Grand Questions  1. __Which airport has the worst delays? How did you choose to define \u0022worst\u0022? 
As part of your answer include a table that lists the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours, for each airport.__   2. __What is the worst month to fly if you want to avoid delays? Include one chart to help support your answer, with the x-axis ordered by month. You also need to explain and justify how you chose to handle the missing `Month` data.__   3. __According to the BTS website the Weather category only accounts for severe weather delays. Other “mild” weather delays are included as part of the NAS category and the Late-Arriving Aircraft category. Calculate the total number of flights delayed by weather (either severe or mild) using these two rules:__   1. __30% of all delayed flights in the Late-Arriving category are due to weather.__  2. __From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%.__   4. __Create a barplot showing the proportion of all flights that are delayed by weather at each airport. What do you learn from this graph (Careful to handle the missing `Late Aircraft` data correctly)?__   5. __Fix all of the varied `NA` types in the data and save the file back out in the same format that was provided. Provide one example from the file with the new `NA` values shown.__ --------------------------------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p2\/"},{value:"GitHub and git",label:"<p>Complete the Hello World GitHub Guide\n</p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/skill_builders\/git_github\/"},{value:"Week 2-3: Project 1 - Names",label:"<p>We are going to start learning the pandas package while we explore the names data for our project. 
What is in a name?\n Completed Readings: Python for Data Science (P4DS): Data Visualization, P4DS: Graphics for Communication, P4DS: Markdown, P4DS: 5.2 Filter rows with .query()\n https:\/\/github.com\/byuidatascience\/data4names\/raw\/master\/data-raw\/names_year\/names_year.csv\n ### Grand Questions  1. __How does your name at your birth year compare to its use historically?__  1. __If you talked to someone named Brittany on the phone, what is your guess of their age?__  1. __Mary, Martha, Peter, and Paul are all Christian names. From 1920 - 2000, compare the name usage of each of the four names.__  1. __Think of a unique name from a famous movie. Plot that name and see increases line up with the movie release.__ ------------------------------ </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p1\/"},{value:"Week 1: Introduction",label:"<p>  Introduction Project Syllabus   </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/introduction\/"},{value:"DS250",label:"<p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/"},{value:"Frequently Asked Questions",label:"<p> What do you mean by data science programming?    Most likely, you have had 1-2 courses of programming before you have taken CSE 250. Unlike traditional computer science courses, CSE 250 uses Python in an interactive mode instead of building programs. The data provider usually has some big questions that need answering; However, there are hundreds of little issues and responses along the way. We use programming to facilitate this investigation.\nThere are similarities with User Experience Designers. In our case, we don\u0026rsquo;t get to ask users about their experience. We use programming to ask data about its background, and each data set has its own history. We want our analysis to mold to that experience. You can think of data science programming like a first date with your data. 
You can\u0026rsquo;t write one long program naive of the issues and nuances each living data set provides.\n   How does CSE 250 compare to CSE 350 or Math 335?    The two courses have similarities. You could think of CSE 250 as an introduction to data wrangling and visualization. Both classes use real-world data and are built around data science projects. There are some critical differences between the two courses.\n In this course, we use Python, and CSE 350 uses R. We are introducing the principles of data science programming in CSE 250. The course is only 2 credits. CSE 250 is intended to introduce visualization, wrangling, and modeling.     How does CSE 250 prepare me for CSE 350, Math 335 and CSE 450?    You will be comfortable with interactive programming and have an introduction to the principles of data formats for data science applications. You will be introduced to principles related to machine learning, data wrangling, and data visualization.   What programming languages do we use in this course?    The course is done using Python. We focus on the pandas and Altair packages.   What are the prerequisites for this course?    Using the new courses at BYU-I, the prerequisite is CSE 110. However, if you have experience programming from other classes, you most likely are prepared for this course.   Why Python instead of R?    The computer science and software engineering programs at BYU-I use Python in their foundational courses. The standard student will have some experience with Python before CSE 250. Python is an essential programming language for data scientists, and we already have CSE 350\/Math 335, which is taught in R.   What is pandas?    pandas is the foundational data science package in Python. If you are using tabular data you will be in pandas.   Why are we using Altair instead of Seaborn or Matplotlib?    Matplotlib was the first visualization package to gain a following in Python. Seaborn is built on top of Matplotlib. 
Many data scientists use both in their work—neither leverage the grammar of graphics as developed by Leland Wilkinson. Altair is built on Vega-Lite, which uses the Vega visualization grammar. It is declarative and actively developed. We expect that it will become the predominant visualization package in Python (https:\/\/youtu.be\/FytuB8nFHPQ and https:\/\/youtu.be\/vTingdk_pVM).   </p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/faq\/"},{value:"Skill Builders",label:"<p>These short activites are provided for you to gain some additional skills to help with the class projects.\n</p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/skill_builders\/"},{value:"Slack",label:"<p>If you haven\u0026rsquo;t already, please join Slack. This will be a lifesaver.\nhttps:\/\/join.slack.com\/t\/byuidss\/signup\n</p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slack\/"},{value:"Slides",label:"<p>Use the navigation pane on the left to review the class slides.\n</p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/"},{value:"",label:"<p>Details Your coding challenge will help you demonstrate the skills you have developed this semester. Here are a few essential items.\n Your goal is to demonstrate your data science coding abilities. Get through as many items with a rough implementation as possible. Get your code to match our outputs as close as possible, but don\u0026rsquo;t stress over minute details. Keep most of the code you type. If you end up not using specific parts, comment them out and include them at the bottom. Use the entire hour and may not finish. Submit a .md and a .pdf report with your output and code for each challenge.  
Please use the challenge template to submit your work.\nimport pandas as pd import altair as alt import numpy as np from sklearn.model_selection import train_test_split from sklearn import tree from sklearn.ensemble import GradientBoostingClassifier from sklearn import metrics Challenge 1 Split Entry houses are a failed building experiment in the United States. Use the data from our Denver homes project, as shown below, to recreate the following graphic.\nurl = \u0026#39;https:\/\/github.com\/byuidatascience\/data4dwellings\/raw\/master\/data-raw\/dwellings_denver\/dwellings_denver.csv\u0026#39; dat_home = pd.read_csv(url).sample(n=4500, random_state=15) Challenge 2 Our computations can\u0026rsquo;t be done with missing values. Programmatically replace all the lost values with 125 and make a box-plot.\nmister = pd.Series([\u0026#34;lost\u0026#34;, 15, 22, 45, 31, \u0026#34;lost\u0026#34;, 85, 38, 129, 80, 21, 2]) Challenge 3 Our computations can\u0026rsquo;t be done with missing values. Programmatically replace all the lost values with 125 and report the mean rounded to two decimals.\nmister = pd.Series([\u0026#34;lost\u0026#34;, 15, 22, 45, 31, \u0026#34;lost\u0026#34;, 85, 38, 129, 80, 21, 2]) Challenge 4 Programmatically read in the following JSON file, keep only the cases column and return a markdown table that has country in the rows and cases for 1999 and 2000 in the columns. Your table will have six cells with values.\nurl = \u0026#39;https:\/\/github.com\/byuidatascience\/data4python4ds\/raw\/master\/data-raw\/table1\/table1.json\u0026#39; Challenge 5 Use our cleaned example of the star wars data from project 6 to predict the gender of the respondent to the survey. Report your precision and a feature importance plot.\n Use test_size = .20 and random_state = 2020 in train_test_split() Use the GradientBoostingClassifier() method.  
url = \u0026#34;http:\/\/byuistats.github.io\/CSE250-Course\/data\/clean_starwars.csv\u0026#34; dat = pd.read_csv(url) </p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/final_coding_challenge\/sp22\/"},{value:"Categories",label:"<p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/categories\/"},{value:"Final_coding_challenges",label:"<p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/final_coding_challenge\/"},{value:"Office Hours",label:"<p>Schedule a visit with Brother Cannon at an available time. https:\/\/calendly.com\/cannonp\n</p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/contact\/"},{value:"Tags",label:"<p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/tags\/"},];$("#search").autocomplete({source:projects}).data("ui-autocomplete")._renderItem=function(ul,item){return $("<li>").append("<a href="+item.url+" + \" &quot;\" +  >"+item.value+"</a>"+item.label).appendTo(ul);};});</script></div></div></div></div></header><section class=section><div class=container><div class="row justify-content-center"><div class="col-12 text-center"><h2 class=section-title></h2></div><div class="col-lg-4 col-sm-6 mb-4"><a href=https://byuistats.github.io/DS250-Cannon/skill_builders/ class="px-4 py-5 bg-white shadow text-center d-block match-height"><i class="ti-ruler-pencil icon text-primary d-block mb-4"></i><h3 class="mb-3 mt-0">Skill Builders</h3><p class=mb-0>Build skills for the projects.</p></a></div><div class="col-lg-4 col-sm-6 mb-4"><a href=https://byuistats.github.io/DS250-Cannon/slack/ class="px-4 py-5 bg-white shadow text-center d-block match-height"><i class="https://img.shields.io/badge/slack-@oresoftware/npp-yellow.svg?logo=slack icon text-primary d-block mb-4"></i><h3 class="mb-3 mt-0">Slack</h3><p class=mb-0>Link to Slack signup</p></a></div><div class="col-lg-4 col-sm-6 mb-4"><a href=https://byuistats.github.io/DS250-Cannon/slides/ class="px-4 py-5 bg-white shadow text-center d-block match-height"><i 
class="ti-layout-slider-alt icon text-primary d-block mb-4"></i><h3 class="mb-3 mt-0">Slides</h3><p class=mb-0>Class material for every day.</p></a></div></div></div></section><footer class="section pb-4"><div class=container><div class="row align-items-center"><div class="col-md-8 text-md-left text-center"><p class="mb-md-0 mb-4">J. Hathaway and BYU-I ©</p></div><div class="col-md-4 text-md-right text-center"><ul class=list-inline><li class=list-inline-item><a class="text-color d-inline-block p-2" href=https://github.com/byuidatascience><i class=ti-github></i></a></li><li class=list-inline-item><a class="text-color d-inline-block p-2" href=https://www.linkedin.com/groups/13537407/><i class=ti-linkedin></i></a></li></ul></div></div></div></footer><script src=https://byuistats.github.io/DS250-Cannon/js/script.min.js></script></body></html>
\ No newline at end of file
+<i class="ti-search search-icon"></i><script>$(function(){var projects=[{value:"Day 2: Project 0",label:"<p>Announcements  Devotional Computing Lab 4:30PM - 6:30PM all weekdays except Wednesday. Saturday from 10AM-12PM  Slack channel #tutoring_lab   Data Science Society - Wednesday\u0026rsquo;s at 6PM, STC 394 Math Department Opening Social - Thursday 11:30 RKS 229  Spiritual Thought Question  How is 1 Nephi like Genesis? \u0026ldquo;In the beginning, God created the heaven and the earth.\u0026rdquo;  Syllabus Questions?  A note about readings\u0026hellip; Tips for asking for help  Slack Google - acquired discernment   Quarto and tradeoffs Project Submissions: HTML  Are we all on the Slack channel? Follow the Slack invitation that is waiting in your student email. If you don\u0026rsquo;t see an invite, you can join through this link and then ask Brother Cannon to add you to the class channel.\nMethods Checkpoint All the answers will be in the assigned reading or in these slides.\nNotes on Project 0 Installing Packages and Extensions Learn how to install packages by reading the assigned material and by watching the video tutorial on this page.\nThe readings mention a lot of different packages. For Project 0, you need to install at least jupyter, pandas, plotly.express, numpy, and tabulate.\nThe readings will also mention two VS Code extensions you need to install.\nJupyter Notebooks vs. Interactive Python Window Should you decide to use Juypyter Notebooks this semester within VS Code, this is a great guide to get you started.\nOr you can choose to stick with the Python Interactive window like the textbook does.\nUse Your Resources!  Technical documentation Google searches ChatGPT Asking for help on Slack Don\u0026rsquo;t forget the data science lab! (Starts next week.) Question that cannot be answered by the textbook and documentation? Google it. A function you have never seen before? Google it. An error in your code? Google it.  Markdown What is Markdown?  
A clean, human readable way to make slick html and pdf documents Used widely among programmers for clean documentation Used widely by Data Scientists to publish results and communicate with stakeholders  Here\u0026rsquo;s a good summary\nQuarto Do your tinkering in interactive Python or Jupyter notebooks. Generate report with finished code, graphs, etc. in Quarto\nQuarto\nNow for some data! Let\u0026rsquo;s get this party started Your turn:  Read in the cars data set Work with your teams to talk through interesting possibilities for a graph Work on Project 0 Questions and Tasks   Any issues with getting Python installed?     Python VS Code Altair in VS Code     Does everyone have pandas, altair, numpy, scikit-learn installed?     Video tutorial: how to install packages.  One way to install packages:\npip install pandas altair Maybe a better way to do it: run this in an interactive window.\nimport sys !{sys.executable} -m pip install pandas altair    Does everyone have altair-saver working?     altair_saver Video tutorial     ---------------------------------------------------- Why are we using Altair?    It is built on Vega and D3, which are fast and web based.  Grammar of Graphics: Vega-Lite   Technical Paper Website Endorsement      What are we not learning in this course?    Indexing, .loc[] and .iloc[] I may not be experienced enough to understand why I should teach you these. I think they all add complexity to what we are learning in the course and we have elected to avoid it. We will use reset_index() a lot. I think MultiIndex features create complication. I have also elected to use .filter() instead of .loc[] because I like it.\nVirtual Environments Virtual Environments appear to be an important tool as you continue to use Python. 
We will not be teaching these or supporting these in our course.\nmatplotlib (and any tool leveraging it) It feels old, has a bad api, and isn\u0026rsquo;t declarative.\n   ----------------------------- What can Python Interactive do?    Let\u0026rsquo;s review the power of Python Interactive  # %% in my .py script is much better than Jupyter notebooks (.ipynb).  If we hope to have our code work in a production environment then Jupyter is problematic. Caching and code chunks are problematic https:\/\/medium.com\/@_orcaman\/jupyter-notebook-is-the-cancer-of-ml-engineering-70b98685ee71       Set-up your py script    Setting up your script A good data science .py script will have packages and data loaded at the top. Usually you have a few short commented sentences that descibe the script purpose.\n# %% # import pandas, altair, numpy import pandas as pd import altair as alt import numpy as np # %% # load data # handgrenade data https:\/\/github.com\/byuidatascience\/data4soils\/blob\/master\/data-raw\/cfbp_handgrenade\/cfbp_handgrenade.csv url = \u0026#39;https:\/\/github.com\/byuidatascience\/data4soils\/raw\/master\/data-raw\/cfbp_handgrenade\/cfbp_handgrenade.csv\u0026#39; dat = pd.read_csv(url)    Make a scatter plot with hmx on the x and rdx on the y    To get you started:\nalt.Chart(dat).encode()    Make a spatial plot with hmx colored     Encode the row and column to the axes. Color the hmx points using the \u0026lsquo;goldorange\u0026rsquo; color scheme. Use mark_square() and make the square sizes 500.     -------------------- Create a histogram of hmx     Encode the x-axis as binned. Encode the y-axis as counts. Configure the title to a fontSize of 20. Use properties to place the title.     ----------------------------- How can I get help?     Make sure you read the reading assignments once or twice or five times. Read the guides on the Course Materials page. Post questions in our #cse250_s21_larson slack channel (and try to help others!) 
Attend the Data Science Lab. Google is your best friend.     -------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/introduction\/day02\/"},{value:"Day 4: Exporting JSON",label:"<p>Welcome to class! Spiritual Thought    Announcements  Hackathon Opening Social  Question 5 Let\u0026rsquo;s do an example of question 5 using the mtcars data.\nLoad packages and data #%% import pandas as pd import numpy as np import json url_cars = \u0026#34;https:\/\/github.com\/byuidatascience\/data4missing\/raw\/master\/data-raw\/mtcars_missing\/mtcars_missing.json\u0026#34; cars = pd.read_json(url_cars) \nFind all the missing values #%% # method 1: find \u0026#34;official\u0026#34; null values # hp, wt, and vs cars.isnull().sum() #%% # method 2: just look at the data # car, hp, wt, vs, gear cars.head(10) #%% # method 3: look at summaries # the values in \u0026#39;gear\u0026#39; look funny cars.describe() #%% # method 4: count up categories # looks like 4 rows are blank cars.car.value_counts() \nReformat the missing values Remember, you need to reformat your missing values to make them consistent!\nReading the examples in the replace documentation might give you some ideas.\n#%%  # There are a lot of functions # we could use to give the missing values # a consistent format. # `replace()` is one of the easiest # let\u0026#39;s change everything to np.nan cars_new = cars.replace(999, np.nan).replace(\u0026#34;\u0026#34;, np.nan) # or equivalently: cars_new = cars.replace([999, \u0026#34;\u0026#34;], np.nan) # did we get them all? cars_new.isnull().sum() \nSaving JSON files from a pandas dataframe You can save a DataFrame as a JSON file like this:\n#%% # save the new data as a json cars_new.to_json(\u0026#34;my_cars_data.json\u0026#34;) The df.to_json() documentation shows us how to change the way the JSON file is organized. (By row? By column? 
etc.)\nThis is the format we would like to see in the report:\n[ { \u0026#34;car\u0026#34;: \u0026#34;Mazda RX4\u0026#34;, \u0026#34;mpg\u0026#34;: 21, \u0026#34;cyl\u0026#34;: 6, \u0026#34;disp\u0026#34;: 160, \u0026#34;hp\u0026#34;: 110, \u0026#34;drat\u0026#34;: 3.9, \u0026#34;wt\u0026#34;: 2.62, \u0026#34;qsec\u0026#34;: 16.46, \u0026#34;vs\u0026#34;: 0, \u0026#34;am\u0026#34;: 1, \u0026#34;gear\u0026#34;: 4, \u0026#34;carb\u0026#34;: 4 } ] And here are the various options:\n# %% # Question 5 wants us to \u0026#34;include one record example\u0026#34; # in our md report that \u0026#34;has a missing value\u0026#34; # you can print out a json file like this: json_data = cars_new.to_json() print(json_data) # but that won\u0026#39;t look good in our report. # instead.... #%% # you can do this. # in this format, the json file is # organized\/printed by column json_data = cars_new.to_json() json_object = json.loads(json_data) json_formatted_str = json.dumps(json_object, indent = 4) print(json_formatted_str) # %% # we can change the format of the # json file using \u0026#39;orient\u0026#39; json_data = cars.to_json(orient=\u0026#34;split\u0026#34;) json_object = json.loads(json_data) json_formatted_str = json.dumps(json_object, indent = 4) print(json_formatted_str) # %% # by table json_data = cars.to_json(orient=\u0026#34;table\u0026#34;) json_object = json.loads(json_data) json_formatted_str = json.dumps(json_object, indent = 4) print(json_formatted_str) # %% # by \u0026#34;record\u0026#34; or \u0026#34;row\u0026#34; json_data = cars.to_json(orient=\u0026#34;records\u0026#34;) json_object = json.loads(json_data) json_formatted_str = json.dumps(json_object, indent = 4) print(json_formatted_str) </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p2\/d4\/"},{value:"Day 4: Practice Coding Challenge",label:"<p>What table do we want to use?    
q = \u0026#39;\u0026#39;\u0026#39; SELECT * FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What columns do we want to select?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What calculation do we want to perform?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r, r\/ab FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What name do we give our calculated column?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r, r\/ab as runs_atbat FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    #### I want to join two tables to help in decision making The [data dictionary](https:\/\/data.world\/byuidss\/cse-250-baseball-database\/workspace\/file?filename=readme2014.txt) might help. - For seasons after 1999, which year had the most players selected as All Stars but didn\u0027t play in the All Star game? - Provide a summary of how many games, hits, and at bats all the players had in that year\u0027s post season. ```python import pandas as pd import altair as alt import numpy as np import datadotworld as dw baseball_url = \u0027byuidss\/cse-250-baseball-database\u0027 ``` What table do we want for All Star information?    # %% # allstar table dw.query(baseball_url, \u0026#39;\u0026#39;\u0026#39; SELECT * FROM AllstarFull WHERE --? AND --? LIMIT 5 \u0026#39;\u0026#39;\u0026#39;).dataframe    Can you use a groupby to get the counts of players per year?    dw.query(baseball_url, \u0026#39;\u0026#39;\u0026#39; SELECT yearid, -- \u0026lt;stuff to calculate\u0026gt; FROM AllstarFull WHERE yearid \u0026gt; 1999 AND gp != 1 GROUP BY --? ORDER BY --? 
\u0026#39;\u0026#39;\u0026#39;).dataframe    What table do we want for the post season at bats?    dw.query(baseball_url, \u0026#39;\u0026#39;\u0026#39; SELECT * FROM BattingPost as bp LIMIT 5 \u0026#39;\u0026#39;\u0026#39;).dataframe    Can you join the post season batting table and AllStar information?     For each player, keep only the at bats, hits, the all star gp, and gameid columns. Let\u0026rsquo;s only keep players with at least one at bat in the post season.  dw.query(baseball_url, \u0026#39;\u0026#39;\u0026#39; SELECT -- \u0026lt;columns to keep\u0026gt; FROM BattingPost as bp JOIN AllstarFull as asf ON -- \u0026lt;two columns for the join\u0026gt; WHERE bp.yearid \u0026gt; 1999 AND gp != 1 AND -- \u0026lt;at bat condition\u0026gt; LIMIT 15 \u0026#39;\u0026#39;\u0026#39; ).dataframe    Let\u0026rsquo;s build the final table     For seasons after 1999, which year had the most players selected as All Stars but didn\u0026rsquo;t play in the All Star game? Provide a summary of how many games, hits, and at bats all the players had in that year\u0026rsquo;s post season.  dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, \u0026#39;\u0026#39;\u0026#39; SELECT -- \u0026lt;lots of calculations\u0026gt; FROM BattingPost as bp JOIN AllstarFull as asf ON bp.playerid = asf.playerid AND bp.yearid = asf.yearid WHERE bp.yearid \u0026gt; 1999 AND gp != 1 AND ab \u0026gt; 0 GROUP BY -- \u0026lt;column\u0026gt; ORDER BY -- \u0026lt;column\u0026gt; \u0026#39;\u0026#39;\u0026#39; ).dataframe    ------------------------------------------------------------------------------- Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc.). Write an SQL query to get the data you need. Use Python if additional data wrangling is needed, then make a graph in Altair to visualize the comparison. __In your group, answer the following questions and be prepared to share your answers with the class.__ 1. 
What will you use to compare the two baseball teams? 2. What table(s) does this information come from? 3. Do you need to do any calculations? 4. Can you think of any problems you might run into? ### Open Programming Time --------------------------------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p3\/d4\/"},{value:"Introduction",label:"<p>A competent student should be able to finish the exercises within 60 minutes. You should work through it on your own. This serves as an assessment of your understanding of the assigned readings.\nBefore you start Make sure you have installed VS Code, pandas, and altair on your computer. You can install these packages by typing this line in the terminal.\npip install pandas altair\nOR if you have more than one version of python\npip3.9 install pandas altair\npip3.9 indicates the version of python you are installing the packages to.\nPart 1 Get familiar with your tools Programming involves a lot of research. Unlike subjects like Mathematics or History, we are not required to remember every single function and its usage. It is natural for experienced programmers to look for answers on the internet, books, even from other people\u0026rsquo;s code. Programming will be extremely frustrating if we are not allowed to do web searches, so please get familiar with the tools you have and use them often.\nOfficial Documentation This should be your first resort for understanding any code\/function. 
Scanning the documentation of a function will allow you to get an overview of its usage.\nHere is a link to the documentation of the assign() function:\n(https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.assign.html)\nExample of assign() (as shown in the documentation)\nimport pandas as pd df1 = pd.DataFrame({\u0026#39;temp_c\u0026#39;: [17.0, 25.0]}, index=[\u0026#39;Portland\u0026#39;, \u0026#39;Berkeley\u0026#39;]) df2 = df1.assign(temp_f=df1.temp_c * 9 \/ 5 \u002b 32) Exercise 1: After reading the documentation for assign(), write a short paragraph to explain assign() as if you were talking to someone with zero programming experience (use the example above to help you explain assign()).\n What is the difference between df1 and df2? How was df2 derived from df1?)  Online textbook It pains us to see students would rather be stuck at problems for hours yet they refuse to use the textbook. This is another very useful resource since this is designed for this class. link to the textbook: (https:\/\/byuidatascience.github.io\/python4ds\/)\nExercise 2: Locate the section where the textbook talks about query() and answer these questions.\n What function in R\u0026rsquo;s dplyr is equivalent or comparable to query() in pandas (You should include the section number in your answer)? What is the easiest mistake for python beginner to make that was shown in the text about query() (You should include the section number in your answer)?  The internet Google is a programmer\u0026rsquo;s friend. Get used to googling thing, in fact, you want to be an expert in googling\n Question that cannot be answered by the textbook and documentation? Google it. A function you have never seen before? Google it. An error in your code? Google it.  
Exercise 3: Provide at least 2 extra resources you could find about the pandas function drop() on the internet.\nTutor, TA (Through slack, zoom, or in-person) We want to help you with your work; we want to answer your questions; but most importantly, we want to help you succeed in this class. That will require you to put in the necessary time in understanding the readings, coding and debugging. When you ask us a question, we expect that you have read the documentation, searched the textbook, and done your own research. Then we can be most helpful and can provide insights on top of your understanding.\nExamples of bad questions  How does drop() work? We will ask you to read the documentation for drop(). How do you make a table in a markdown file? We will refer you to the textbook. I don\u0026rsquo;t want these columns in my data, how can I drop them? We will ask you if you have found any things on the internet.  Examples of good questions  I am still confused about the syntax of drop(). After reading the documentation, this is my understanding of the function\u0026hellip; . What am I missing? I tried making a table in markdown (show code), it is still not giving me what I want, how can I fix this? I am trying to drop these columns in my dataframe, I think drop() is what I am looking for. Am I in the right direction? If not, what keywords should I be googling?  Exercise 4:\nUsing the code and tools mentioned above, finish question 4 and 5 under 3.2.4 in the textbook.(use the data in mpg for your plot):\n# library import import pandas as pd import altair as alt # data import url = \u0026quot;https:\/\/github.com\/byuidatascience\/data4python4ds\/raw\/master\/data-raw\/mpg\/mpg.csv\u0026quot; mpg = pd.read_csv(url)   Question 4: Make a scatterplot of hwy vs cyl.\n  Question 5: What happens if you make a scatterplot of class vs drv? 
Why is the plot not useful?\n  After you have completed this skill builder with your team (or on your own) then compare your work to our script    See the script.   </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/skill_builders\/introduction\/"},{value:"Day 1: Welcome",label:"<p>Welcome to DS 250!  Teacher: Paul Cannon TA: Bracken Sant (san20050@byui.edu)  Announcements  Devotional Computing Lab 4:30PM - 6:30PM all weekdays except Wednesday. Saturday from 10AM-12PM  Slack channel #tutoring_lab   Data Science Society - Wednesday\u0026rsquo;s at 6PM, STC 394 Math Department Opening Social - Thursday 11:30 RKS 229  What is a Data Scientist? A Data Scientist has a C\u002b Talent Stack Class Structure  Problem Solving Improved coding skills Effective written\/visual communication Collaboration Timeliness and communication with \u0026ldquo;the boss\u0026rdquo;  Syllabus\nGot Slack? Are we all on the Slack channel? Follow the Slack invitation that is waiting in your student email. If you don\u0026rsquo;t see an invite, you can join through this link and then ask \u0026ldquo;@Paul Cannon\u0026rdquo; to add you to the class channel.\nWho are you?  Introduce yourself and learn the names\/majors\/origin story of your group members. Make a plan to get help this semester. How will you contact each other? Some ideas: Slack, I-Learn, emails, group texts, etc. If you were independently wealthy, what would you be doing right now? Would you change majors? Highlights of 2022  Problem Solving This is not a \u0026ldquo;see and repeat\u0026rdquo; programming class!\nHow would you go about fixing my motorcycle? 
Learn how to ask for help (1 hr rule)  Getting started on Project 0 Setting up your Programming Environment  Download Visual Studio Code Download Python v (3.10.8)  Be sure to select the \u0026ldquo;Add to Path\u0026rdquo; option during the install process  Mac Users be sure to click on \u0026ldquo;Install Certificates\u0026rdquo; at the end of the install   Install the Python packages and VS Code extensions you need (see this page)  pip install pandas pip install numpy pip install jupyter pip install tabulate pip install altair   Install Quarto CLI Quarto Instructions Start looking at Project 0 Complete the \u0026ldquo;Methods Checkpoint\u0026rdquo;  Installing Packages and Extensions Learn how to install packages by reading the assigned material and by watching the video tutorial on this page.\nThe readings mention a lot of different packages. For Project 0, you need to install at least pandas, altair, numpy, and jupyter.\nThe readings will also mention two VS Code extensions you need to install.\nA note on Jupyter Notebooks vs. Interactive Python Window The textbook will show you how to use VS Code\u0026rsquo;s interactive python windows and Quarto. Feel free to use Jupyter Notebooks.\nWe will do write-ups in Quarto, though, which can be rendered as a PDF or HTML\nIntroduction to Brother Cannon    What do you want to know?    What is a data scientist?    Brother Hathaway\u0026rsquo;s definition:\n A blend of programmer, statistician, and communicator that burns with curiosity.\n My definition for DS 250:\n Someone who can extract insights from data and then communicate those insights with clarity.\n Learn more about the BYU-Idaho data science program here.\n   What is data science programming?    Data scientists write code as a means to an end, whereas software developers write code to build things. 
Data science is inherently different from software development in that data science is an analytic activity, whereas software development has much more in common with traditional engineering.\nData scientists tackle problems such as identifying fraudulent transactions, or predicting which employees are likely to leave a company. Software developers can take the data scientists\u0026rsquo; models and turn them into fully functioning systems with production-quality code. Software developers tackle problems like getting an algorithm to run more efficiently, or building user interfaces.\n   Course Outcomes    Upon completing this course, you will be able to use data-driven programming in Python to handle, format, and visualize data. We will introduce you to data wrangling techniques (pandas), analytical methods (scikit-learn), and the grammar of graphics (Altair). Specifically, as a successful learner, you will be able to:\n Use functions, data structures, and other programming constructs efficiently to process and find meaning in data. Programmatically load data from various types of data sources, including files, databases, and remote services. Use data manipulation libraries to perform straightforward analysis, produce charts, and prepare data for machine learning algorithms. Use machine learning libraries to discover insights, make predictions, and interpret the success of these algorithms. Collaborate and share your work with industry-leading tools.     BYU-Idaho Mission Statement     Brigham Young University-Idaho was founded and is supported and guided by The Church of Jesus Christ of Latter-day Saints. Its mission is to develop disciples of Jesus Christ who are leaders in their homes, the Church, and their communities.\n  How would you describe a leader? What makes a leader powerful? What does a leader do with insights?  An example of a good leader.\nWhat (or who) is truth?\n   ## Course Format and Grading How hard is this class going to be?    
The reality of CSE 250:\n We have done all we can to ensure that this is a 2-credit course for the average student. That means that we expect 4-6 hours outside of class for the average student to achieve an A. You have to put in the time if you want to build skills. The course is necessarily creative in nature. That fact usually makes it feel more challenging. We will be asking you to learn to write creative data science python code. If you have any concerns, please talk with me!     What is the structure of CSE 250?    The class uses 7 projects to teach data science programming in Python using pandas, Altair, scikit-learn, and numpy.\n Projects Syllabus     How do I get the grade I want?     Specification Grading Grading structure Competency Elements  Introduction Project \u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026gt;\nWhat is the goal?    Completing the introduction project will set you up for success the rest of the semester. The workflow followed in the introduction project (loading packages, writing code, saving images, compiling a final report) will be the same for every other project . If you have questions about this project, you need to seek help.   What exactly do I need to submit?    Make sure you carefully read the project instructions.\nYou will submit a single .pdf file to I-Learn. This pdf file should contain an project summary, your answers to the grand questions (including the plot you saved with altair_saver), and an appendix where you copy and paste your commented Python code.\n   --------------------------------------------------------   ----------------------------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/introduction\/day01\/"},{value:"Day 2B: Missing Data",label:"<p>Welcome to class! 
Announcements Questions 1 and 2 What issues are we still running into?\nHow to work with missing data What counts as missing data? How to identify missing data  df.isnull().sum() df.describe() df.column.value_counts(dropna=False)   pd.crosstab()  Option 1: Remove missing values Be careful with .dropna(), and make sure you know what it is doing to your data!\nLet\u0026rsquo;s use the pandas example:\ndf = pd.DataFrame({\u0026#34;name\u0026#34;: [\u0026#39;Alfred\u0026#39;, \u0026#39;Batman\u0026#39;, \u0026#39;Catwoman\u0026#39;], \u0026#34;toy\u0026#34;: [np.nan, \u0026#39;Batmobile\u0026#39;, \u0026#39;Bullwhip\u0026#39;], \u0026#34;born\u0026#34;: [pd.NaT, pd.Timestamp(\u0026#34;1940-04-25\u0026#34;), pd.NaT]})  Q: When would we ever use dropna()?    A: Almost never! Why do you think it is a bad idea? df.dropna()   Q: What argument do we use to drop rows where all values are NA?    A: df.dropna(how=\u0027all\u0027) reference   Q: What if we want to drop NA rows based on one column?    A: df.dropna(subset=[\u0027toy\u0027]) reference   Option 2: Replacing missing values Again, let\u0026rsquo;s use the pandas example:\ndf = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1], [np.nan, np.nan, np.nan, 5], [np.nan, 3, np.nan, 4]], columns=list(\u0026#34;ABCD\u0026#34;))  Q: What if we want to replace all the NA in the wt column with the mean weight?    A: fillna() reference   Q: What if we want to replace all the 999 with a 4?    A: replace() reference   Q: What if we want to replace all the NAs with a linear interpolation?    A: interpolate() reference   Question 3 What columns do we need to use for question 3 (total number of flights delayed by weather)?  
num_of_delays_weather num_of_delays_late_aircraft num_of_delays_nas  weather = flights.assign( severe = #????, mild_late = #????, mild_nas = np.where(#????), total_weather = # add up severe and mild, ).filter([\u0026#39;airport_code\u0026#39;,\u0026#39;month\u0026#39;,\u0026#39;severe\u0026#39;,\u0026#39;mild_late\u0026#39;,\u0026#39;mild_nas\u0026#39;, \u0026#39;total_weather\u0026#39;, \u0026#39;num_of_delays_total\u0026#39;]) Other resources for question 3  isin() method where() method Adding new variables with assign() assign() method  </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p2\/d3\/"},{value:"Day 3: Making your name stand out",label:"<p>Welcome to class! Reminder about resources  \u0026ldquo;Potluck\u0026rdquo; prep assignment Work with peers Make your own cheat sheet  Announcements  Always submit a halfway checkpoint even if you\u0026rsquo;re behind!  It\u0026rsquo;s the only hard due date Think of it as a check in with \u0026ldquo;the boss\u0026rdquo;    Thoughts on P1 Halfway Checkpoint  Do your work in .py or .ipynb file, write-up in .qmd Making\/submitting a video: Loom alt.Save() Quarto Graphics for Communication Plotly.Express Resources  Let\u0026rsquo;s practice! Explore the data\nimport plotly.express as px import pandas as pd import numpy as np url = \u0026#34;https:\/\/github.com\/byuidatascience\/data4names\/raw\/master\/data-raw\/names_year\/names_year.csv\u0026#34; names = pd.read_csv(url) names.head() names.describe() What do you want the chart to look like?\nWhat types of charts are there?\nWhat data do you need to make that chart?\n# names[[\u0026#39;name\u0026#39;],[\u0026#39;year\u0026#39;]] vs. 
names.query() kobe = names.query(\u0026#34;name == \u0026#39;Kobe\u0026#39;\u0026#34;)[[\u0026#34;name\u0026#34;, \u0026#34;year\u0026#34;, \u0026#34;Total\u0026#34;]] kobe2 = names.query(\u0026#34;name == \u0026#39;Kobe\u0026#39;\u0026#34;).filter(items=[\u0026#34;name\u0026#34;, \u0026#34;year\u0026#34;, \u0026#34;Total\u0026#34;]) # method chaining with () \nWork with your partner to create a line chart that includes both of your names?      Can you include total and data for the state in which you were born? Work together to make the code as eloquent as possible. compound charts      What can you add to your chart to help tell a story?\nCan you modify your previous chart to include your birth state?     Can you include Total and your birth state? Is there a better metric than raw counts that you could calculate? Are there good labels that you could include on the chart (mark_text())?     Remember this advice from Edward Tufte.\n To be truthful and revealing, data graphics must bear on the question at the heart of quantitative thinking: \u0026ldquo;Compared to what?\u0026rdquo; The emaciated, data-thin design should always provoke suspicion, for graphics often lie by omission, leaving out data sufficient for comparisons.\n What are some charts types we could use to answer this question?    There is a clear first choice, but I think there are a few other choices that could provide insight.\n  Visualization Catalog Altair Example Gallery       Use the query() method and filter() method to get your name and years in the rows with and include the name, year, and Total columns     filter the data down to your names (query) select the pertinent columns (filter()) Create a new data object for your name.     Create a line chart with your name.    
base = (alt.Chart() .encode( x = alt.X(\u0026#39;\u0026#39;), y = alt.Y(\u0026#39;\u0026#39;) ) .mark_line() )    Create a new DataFrame with your birthday information in the row    Create a DataFrame with x, y, and label as columns. How to create a dataframe.   Add the vertical rule mark to show your birthday    These references can help:\n Using layered charts Altair Marks Add a horizontal line to an existent chart     Work with your partner to create a line chart that includes both of your names?      Can you include total and data for the state in which you were born? Work together to make the code as eloquent as possible.      Can you modify your previous chart to include your birth state?     Can you include Total and your birth state? Is there a better metric than raw counts that you could calculate? Are there good labels that you could include on the chart (mark_text())?      Now come up with a different chart than a line chart    Just use your state count or the Total count for your name.   \u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026gt;\n</p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p1\/d3\/"},{value:"Day 3: The end of baseball",label:"<p>Welcome to class! Spiritual Thought Announcements  Practice Coding Challenge Can I still get an \u0026ldquo;A\u0026rdquo;?  Profile of an \u0026ldquo;A\u0026rdquo; student What if I fall behind?   Reminders:  DS community assignment Review and Request Letter    Coding Challenge: How do I prepare? What would your coding challenge look like?\nProject 3 Questions  Integer Division Career Batting Average What have you come up with for Q3? Metrics? Visualizations?  Question 1 Ask yourself:\n What do I want and expect the end table to look like? What table(s) and calculations do I need? What makes a row in my end table unique? What problems can I anticipate?  
Question 2 Ask yourself:\n What do I want and expect the end table to look like? What table(s) and calculations do I need? What makes a row in my end table unique? What problems can I anticipate?  Question 3 What are some ideas for Grand Question 3? Ask yourself:\n What information will you use to compare the two baseball teams? What table(s) and calculations do I need? What makes a row in my end table unique? What problems can I anticipate?  and FROM -- JOIN -- ON -- WHERE -- GROUP BY -- ORDER BY -- LIMIT -- ``` -------------------------------------------  ## Connecting to SQLite: [Lahman SQLite](https:\/\/byuistats.github.io\/CSE250-Course\/data\/lahmansbaseballdb.sqlite) __Download the sqlite file:__ [Lahman sqlite](https:\/\/byuistats.github.io\/CSE250-Course\/data\/lahmansbaseballdb.sqlite) ### What is SQLite?  - [Wikipedia](https:\/\/en.wikipedia.org\/wiki\/SQLite): SQLite is **a popular choice as embedded database software for local\/client storage in application software such as web browsers.** It is arguably the most widely deployed database engine, as it is used today by several widespread browsers, operating systems, and embedded systems (such as mobile phones), among others. SQLite has bindings to many programming languages.  - [SQLite.org](https:\/\/www.sqlite.org\/about.html): **SQLite is an in-process library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine.** The code for SQLite is in the public domain and is thus free for use for any purpose, commercial or private. SQLite is the most widely deployed database in the world with more applications than we can count, including several high-profile projects.  - [Codecademy](https:\/\/www.codecademy.com\/articles\/what-is-sqlite): SQLite is a database engine. It is software that allows users to interact with a relational database. In SQLite, a database is stored in a single file — a trait that distinguishes it from other database engines. 
This fact allows for a great deal of accessibility: copying a database is no more complicated than copying the file that stores the data, sharing a database can mean sending an email attachment. ### Working with SQLite files in Python ```python # %% import pandas as pd import altair as alt import numpy as np import sqlite3 # %% sqlite_file = \u0027lahmansbaseballdb.sqlite\u0027 con = sqlite3.connect(sqlite_file) # %% # See the tables in the database table = pd.read_sql_query( \u0022SELECT name FROM sqlite_master WHERE type=\u0027table\u0027\u0022, con) print(table) ``` ------------------------------------------------------ What table do we want to use?    q = \u0026#39;\u0026#39;\u0026#39; SELECT * FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What columns do we want to select?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What calculation do we want to perform?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r, r\/ab FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What name do we give our calculated column?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r, r\/ab as runs_atbat FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    #### I want to join two tables to help in decision making __For seasons after 1999, which year had the most players selected as All Stars but didn\u0027t play in the All Star game?__ - Provide a summary of how many games, hits, and at bats all the players had in that year\u0027s post season. 
- The [data dictionary](https:\/\/data.world\/byuidss\/cse-250-baseball-database\/workspace\/file?filename=readme2014.txt) might help. ```python import pandas as pd import altair as alt import numpy as np import datadotworld as dw baseball_url = \u0027byuidss\/cse-250-baseball-database\u0027 ``` What table do we want for All Star information?    # %% # allstar table dw.query(baseball_url, \u0026#39;\u0026#39;\u0026#39; SELECT * FROM AllstarFull WHERE --? AND --? LIMIT 5 \u0026#39;\u0026#39;\u0026#39;).dataframe    Can you use a groupby to get the counts of players per year?    dw.query(baseball_url, \u0026#39;\u0026#39;\u0026#39; SELECT yearid, -- \u0026lt;stuff to calculate\u0026gt; FROM AllstarFull WHERE yearid \u0026gt; 1999 AND gp != 1 GROUP BY --? ORDER BY --? \u0026#39;\u0026#39;\u0026#39;).dataframe    What table do we want for the post season at bats?    dw.query(baseball_url, \u0026#39;\u0026#39;\u0026#39; SELECT * FROM BattingPost as bp LIMIT 5 \u0026#39;\u0026#39;\u0026#39;).dataframe    Can you join the post season batting table and AllStar information?     For each player, keep only the at bats, hits, the all star gp, and gameid columns. Let\u0026rsquo;s only keep players with at least one at bat in the post season.  dw.query(baseball_url, \u0026#39;\u0026#39;\u0026#39; SELECT -- \u0026lt;columns to keep\u0026gt; FROM BattingPost as bp JOIN AllstarFull as asf ON -- \u0026lt;two columns for the join\u0026gt; WHERE bp.yearid \u0026gt; 1999 AND gp != 1 AND -- \u0026lt;at bat condition\u0026gt; LIMIT 15 \u0026#39;\u0026#39;\u0026#39; ).dataframe    Let\u0026rsquo;s build the final table    For seasons after 1999, which year had the most players selected as All Stars but didn\u0026rsquo;t play in the All Star game?\n Provide a summary of how many games, hits, and at bats all the players had in that year\u0026rsquo;s post season.  
dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, \u0026#39;\u0026#39;\u0026#39; SELECT -- \u0026lt;lots of calculations\u0026gt; FROM BattingPost as bp JOIN AllstarFull as asf ON bp.playerid = asf.playerid AND bp.yearid = asf.yearid WHERE bp.yearid \u0026gt; 1999 AND gp != 1 AND ab \u0026gt; 0 GROUP BY -- \u0026lt;column\u0026gt; ORDER BY -- \u0026lt;column\u0026gt; \u0026#39;\u0026#39;\u0026#39; ).dataframe    --------------------------------------------------------------------- I want to see how much each college player from schools in the west and mountain west has made over their professional career. I want to know the full school name attended and the the Given name of each player. _Is this query correct?_ ```SQL SELECT cp.playerID, nameGiven, birthYear ,cp.schoolID, name_full ,SUM(salary) as salary FROM salaries as sal JOIN people as p ON p.playerID = sal.playerID JOIN CollegePlaying as cp ON p.playerID = cp.playerID JOIN schools as sc ON sc.schoolID = cp.schoolID WHERE sc.state = \u0027ID\u0027 GROUP BY cp.playerID, cp.schoolID ORDER BY name_full ``` ```python pd.read_sql_query( \u0027\u0027\u0027 SELECT cp.playerID, nameGiven, birthYear ,cp.schoolID, name_full ,SUM(salary) as salary FROM salaries as sal JOIN people as p ON p.playerID = sal.playerID JOIN CollegePlaying as cp ON p.playerID = cp.playerID JOIN schools as sc ON sc.schoolID = cp.schoolID WHERE sc.state = \u0027ID\u0027 GROUP BY cp.playerID, cp.schoolID ORDER BY name_full \u0027\u0027\u0027, con) ``` #### Let\u0027s start here ```python schools = pd.read_sql_query( \u0027\u0027\u0027 SELECT * FROM schools WHERE state = \u0027ID\u0027 \u0027\u0027\u0027, con) ``` ----------------------------------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p3\/d3\/"},{value:"pandas and Altair",label:"<p>For this skill builder, we are exploring some important functions in the package of pandas and Altair. DS programming requires a lot of data wrangling. 
Using the proper functions, we can create concise and comprehensive codes. You should be exposed to a few functions through the readings this week.\nYou may want to at least scan the readings before beginning this task since this serves as an assessment of your understanding of the assigned readings. A prepared student should be able to finish the exercises within 60 minutes. You should work through it on your own.\nBefore you start Make sure you have installed VS-code, pandas, and Altair on your computer. You can install these packages by typing this line in the terminal:\npip install pandas altair\nOR if you have more than one version of python:\npip3.9 install pandas altair\npip3.9 indicates the version of python you are installing the packages to.\nData import Run the following code to import the data we need for this skill builder:\n# package import import numpy as np import pandas as pd import altair as al # data import dat = pd.read_csv(\u0026#34;https:\/\/vincentarelbundock.github.io\/Rdatasets\/csv\/AER\/Guns.csv\u0026#34;) Make sure the variable dat is correctly assigned in your environment and finish the following exercises. You can read the documentation of the data on this page - https:\/\/vincentarelbundock.github.io\/Rdatasets\/doc\/AER\/Guns.html\nExercise 1 One of the first things we can do to a freshly imported data is to check its columns. This will help us understand the basic structure of the dataframe(table).\n Using one line of code, select all the columns in dat, assign it to a variable called col_list.\n  Hint Every dataframe has an attribute \u0022columns\u0022. Accessing this attribute will give you a list of all column names  We often want to know the dimension of a dataframe. How many columns are in the dataset? How many rows are in the dataset?\n Using one line of code, show the number of columns and rows in dat.\n  Hint Every dataframe has an attribute \u0022shape\u0022. 
Accessing this attribute will give you the dimension of a dataframe  Now run dat.head(). It will print out the first 5 rows of data in dat.\n Just from looking at the output, which column(s) seem to be redundant with the row number?\n  Hint There is one column that serves as nothing but a row counter; that column is redundant.  Exercise 2 After a brief investigation of the data, we will clean up the data. By cleaning up, we are trying to filter down dat so that it only holds the data we need. We will first get rid of the extra column we found in the previous exercise.\n Using one line of code, drop the redundant column using the variable col_list (created in exercise 1)\n  Hint Use `drop()`. Understand what \u0026ldquo;axis\u0026rdquo; is as a parameter of drop().\nYour function should look like this:\ndat.drop([col_list[_]], axis = _)\nFill the \u0026ldquo;_\u0026quot;\u0026rsquo;s with the correct values and assign the output to dat.\n Don\u0026rsquo;t forget to save the changes in dat. Run dat.head() to make sure the column is dropped in dat.\nExercise 3 We have filtered dat vertically by dropping a column. Now we will try to filter dat horizontally, meaning we will get rid of some of the rows.\nWe can do that by applying a condition to dat. A condition is an expression that can be evaluated as True\/False. For example, 8 \u0026gt; 5 is an expression that evaluates to True. This is trivial because 8 will always be greater than 5.\nRun the code below:\n What is the difference between exp1 and exp2?\n exp1 = 8 \u0026gt; 5 exp2 = dat.violent \u0026lt; 300  Hint Try type() on each variable OR calling each variable.  
Run the code below:\n By putting dat.violent \u0026lt; 300, and the violent column from dat into a dataframe, what is the relationship between the two columns?\n exp = pd.DataFrame({\u0026quot;dat.violent \u0026lt; 300\u0026quot; : exp2, \u0026quot;violent value from dat\u0026quot; : dat.violent}) exp  Hint Try computing `dat.violent[n]  Using query(), filter down dat so that it only contains the data for Idaho\n  Hint query() takes in expressions and filters down data.  Don\u0026rsquo;t forget to save the changes in dat. Check dat.shape (an attribute, not a method) to make sure there are 23 rows and 13 columns.\nExercise 4 Besides filtering, we can manipulate the data by adding new data to it. By adding a new column to the data, we assign a new value to each row.\n Using assign(), create a new column that shows the ratio between murder rate and violent rate.\n  Hint Use assign() You can get the ratio by computing this code:\ndat.murder\/dat.violent\n Exercise 5  Create a scatter plot that shows the relationship between murder rate and violent rate for the state of Idaho. Your chart should show murder rate as the x-axis, violent as the y-axis.\n  Hint Can you mimic this plot? (https:\/\/altair-viz.github.io\/gallery\/scatter_tooltips.html)\n  For an extra push Exercise 6  Using a line of code, filter down the data set so that it only shows the data in years between 1993 and 1997.\n Exercise 7  Create a line chart that shows prisoner numbers for the states of Idaho, Utah, and Oregon.\n Your chart should show year as the x-axis, prisoner as the y-axis, states as different colours, along with an appropriate title.\nExercise 8  Without using query(), finish the data wrangling in questions 2, 5, and 6.\n After you have completed this skill builder with your team (or on your own) then compare your work to our script    See the script.   
</p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/skill_builders\/pandas_altair\/"},{value:"Day 2: Intro to Machine Learning",label:"<p>Welcome to class! Shire Reckoning\nAnnouncements  Coding Challenge Practice - Thursday, March 7  Spiritual thought Are facts true?  How do you distinguish between truth and error? Joshua and Caleb  Building a Decision Tree  Import packages    Splitting the Data 1. Start with packages and data set We\u0026rsquo;ll be using some parts of SKLEARN package and the Seaborn package.\n# If you haven\u0026#39;t already, install scikit-learn and seaborn pip install scikit-learn seaborn from types import GeneratorType import pandas as pd import altair as alt import numpy as np import seaborn as sns from sklearn.model_selection import train_test_split from sklearn.naive_bayes import GaussianNB from sklearn.ensemble import RandomForestClassifier from sklearn.ensemble import GradientBoostingClassifier from sklearn.tree import DecisionTreeClassifier from sklearn import metrics What is the difference between dwellings_denver.csv and dwellings_ml.csv?\n2. Choose which variables to use How do we know which variables to use out of dwellings_ml.csv?\nQuestion 1 will help you identify patterns (or lack of patterns) in the data.\n3. Separate into features and target Which Features? # %% h_subset = dwellings_ml.filter([\u0026#39;livearea\u0026#39;, \u0026#39;finbsmnt\u0026#39;, \u0026#39;basement\u0026#39;, \u0026#39;yearbuilt\u0026#39;, \u0026#39;nocars\u0026#39;, \u0026#39;numbdrm\u0026#39;, \u0026#39;numbaths\u0026#39;, \u0026#39;stories\u0026#39;, \u0026#39;yrbuilt\u0026#39;, \u0026#39;before1980\u0026#39;]).sample(500) sns.pairplot(h_subset, hue = \u0026#39;before1980\u0026#39;) corr = h_subset.drop(columns = \u0026#39;before1980\u0026#39;).corr() # %% sns.heatmap(corr) 4. Split into training and testing sets What does the \u0026ldquo;train_test_split()\u0026rdquo; function do? 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = #???, random_state = #???) Read the documentation and tell me what is returned?\nFunction documentation\n Why do we use \u0026ldquo;test_size\u0026rdquo; and \u0026ldquo;random_state\u0026rdquo;?\n  What is \u0026ldquo;x\u0026rdquo; and \u0026ldquo;y\u0026rdquo; in the above function example?\n We need to take our data and build the feature and target data objects.\n What columns should we remove from our features (X)?\n  What column should we use as our target (y)?\n x = dwellings_ml.filter([#what variables will you use as \u0026#34;features\u0026#34;?]) y = dwellings_ml[#what variable is the \u0026#34;target\u0026#34;?] \nTraining a Classifier Decision Tree Example #%% # Create a decision tree classifier_DT = DecisionTreeClassifier(max_depth = 4) # Fit the decision tree classifier_DT.fit(x_train, y_train) # Test the decision tree (make predictions) y_predicted_DT = classifier_DT.predict(x_test) # Evaluate the decision tree print(\u0026#34;Accuracy:\u0026#34;, metrics.accuracy_score(y_test, y_predicted_DT)) How to Improve Accuracy To improve the accuracy of your model, you could:\n Change what variables are used in the features (x) data set Change what type of model you are using Tune (aka, \u0026ldquo;change\u0026rdquo; or \u0026ldquo;tweak\u0026rdquo;) the parameters of the model  Other Classification Models Here are some other models you could try.\nfrom sklearn.naive_bayes import GaussianNB from sklearn.ensemble import RandomForestClassifier from sklearn.ensemble import GradientBoostingClassifier \nMake Progress on Project 4 Do the project readings    Machine Learning Introduction\n Step-by-step guide (mostly) for training a GaussianNB classifier. (The steps will be the same for any algorithm you use.)  Visual Introduction to Machine Learning\n Machine learning identifies patterns using statistical learning and computers by unearthing boundaries in data sets. 
You can use it to make predictions. One method for making predictions is called a decision tree, which uses a series of if-then statements to identify boundaries and define patterns in the data. Overfitting happens when some boundaries are based on distinctions that don\u0026rsquo;t make a difference. You can see if a model overfits by having test data flow through the model.     Start working on Question 1    The goal of Grand Question 1 is to help us with \u0026ldquo;feature selection\u0026rdquo;.\n \u0026ldquo;Overfitting\u0026rdquo; happens when some boundaries are based on distinctions that don\u0026rsquo;t make a difference. More data does not always lead to better models. (Occam\u0026rsquo;s Razor)  Common questions:\n Why it may be better to have fewer predictors in Machine Learning models? What is Feature Selection and why do we need it in Machine Learning?     What is the 5000 rows error with Altair?    The best way around this is to look at a sub-sample of the data for exploratory purposes. For example, you can use \u0026ldquo;sample(500)\u0026rdquo;. But there are ways to expand VS Code\u0026rsquo;s limits.\nMaxRowsError: How can I plot Large Datasets?\nYou may also save data to a local filesystem and reference the data by file path. Altair allows you to disable the max rows:\nalt.data_transformers.disable_max_rows() subset_data = denver.sample(n = 4999)    scikit-learn resources     Home page Tutorials Getting Started: What do you notice about the header portion of each of the script chunks?  import vs from ... import       My favorite comic    xkcd\n   ## Searching for patterns What ideas do you have for charts? ## Understanding the data What differences do you notice between these two data sets? ```python dwellings = pd.read_csv() dwellings_ml = pd.read_csv() ``` ------------------------------------------------------------------- What is the 5000 rows error with Altair?    
MaxRowsError: How can I plot Large Datasets?\nYou may also save data to a local filesystem and reference the data by file path. Altair has a JSON data transformer that will do this transparently when enabled:\nalt.data_transformers.disable_max_rows() subset_data = denver.sample(n = 4999)    What features of homes might have changed a bit over time?    Some ideas:\n square footage number of bathrooms basement size  Let\u0026rsquo;s create one chart using some of these variables.\n   ----------------------------------------- What is scikit-learn?     Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.\n About scikit-learn helps us see the history and funding. It should stay \u0026ldquo;king of the hill\u0026rdquo; for a long time.\n Simple and efficient tools for predictive data analysis Accessible to everybody, and reusable in various contexts Built on NumPy, SciPy, and matplotlib Open source, commercially usable - BSD license     Should I import scikit-learn?    scikit-learn is very large, with many submodules. To help the user of your .py script understand your code, the consensus is to use from .... import .....\nfrom sklearn.model_selection import train_test_split from sklearn.naive_bayes import GaussianNB from sklearn.tree import DecisionTreeClassifier from sklearn import metrics    After choosing a machine learning method, what do we do?     Fit (or \u0026ldquo;train\u0026rdquo;) the model using the features (also called \u0026ldquo;X\u0026rdquo;) Predict the target (also called \u0026ldquo;y\u0026rdquo;) Evaluate model performance (using many different metrics)     ## Train the model What does the train_test_split() function do?    
Your turn: Read the documentation and tell me what is returned from the train_test_split() function.\nHow to save the output: Use a destructuring assignment\nx_train, x_test, y_train, y_test = train_test_split( x, y, test_size = .3, random_state = 76) Your turn:\n Why would we want to use the test_size and random_state arguments? What is x and y in the above example? Why do we care about splitting our data?     The next step    We need to take our data and build the feature and target data objects. Think about:\n What column(s) should we remove from our features (x)? What column(s) should we use as our target (y)?     ## Predicting targets and evaluating model performance What metrics should we use?    Do your reading! Read How to evaluate your ML model and try googling other ideas.\nAccuracy Question 2 is looking for a model that has \u0026ldquo;at least 90% accuracy\u0026rdquo;.\nConfusion Matrix A confusion matrix is a quick way to see the strengths and weaknesses of your model.\nYour turn: Look at the confusion matrix for our GaussianNB model. Where is the model doing well, and where might it be falling short?\nYour turn: Now look at the confusion matrix for our Decision Tree model. What differences do you notice?\n# a confusion matrix print(metrics.confusion_matrix(y_test, y_predicted_GNB)) # this one might be easier to read print(pd.crosstab(np.ravel(y_test), y_predicted_GNB, rownames=[\u0026#39;True\u0026#39;], colnames=[\u0026#39;Predicted\u0026#39;], margins=True)) # visualize a confusion matrix # requires \u0026#39;matplotlib\u0026#39; to be installed metrics.ConfusionMatrixDisplay.from_estimator(classifier_GNB, x_test, y_test)    ------------------------------------------------------------------------- AI is able to learn \u0027rules\u0027 from highly repetitive data. [Sebastian Thrun](https:\/\/www.youtube.com\/watch?v=ZJixNvx9BAc)  The single most important thing for AI to accomplish in the next ten years is to free us from the burden of repetitive work. 
[Sebastian Thrun](https:\/\/www.youtube.com\/watch?v=ZJixNvx9BAc)   ### [Visual Introduction to Machine Learning](http:\/\/www.r2d3.us\/visual-intro-to-machine-learning-part-1\/)  1. Machine learning identifies patterns using statistical learning and computers by unearthing boundaries in data sets. You can use it to make predictions.  2. One method for making predictions is called a decision tree, which uses a series of if-then statements to identify boundaries and define patterns in the data.  3. Overfitting happens when some boundaries are based on distinctions that don\u0027t make a difference. You can see if a model overfits by having test data flow through the model. #### [Bias-Variance Tradeoff](http:\/\/www.r2d3.us\/visual-intro-to-machine-learning-part-2\/)  1. Models approximate real-life situations using limited data.  2. In doing so, errors can arise due to assumptions that are overly simple (bias) or overly complex (variance).  3. Building models is about making sure there\u0027s a balance between the two. #### But what is the \u0027Pavlovian bell\u0027 in the machine learning model? ![](..\/..\/images\/ml\/test.png) Some mathematical penalty\/reward equation.  - __[Regression](https:\/\/setosa.io\/ev\/ordinary-least-squares-regression\/)__  - __[Variance, RMSE, SD](..\/..\/interactive\/threshold_histogram.html)__  - __proportions__ ## Using our project data to understand features, targets, and samples.  1. Import `dwellings_ml.csv` and write a short sentence describing your data. Remember to explain an observation and what measurements we have on that observation.  2. Now try describing the modeling (machine learning) we are going to do in terms of features and targets.  A. Are there any columns that are the target in disguise?  B. _Are the observational units unique in every row?_ ![](..\/..\/images\/ml\/iris_description.png) ### If your model is near perfect in its predictability, you might be cheating. 
### Watch out for [transactional data](http:\/\/localhost:1313\/CSE250-Course\/images\/ml\/iris_description.png)!  - Financial: orders, invoices, payments  - Work: plans, activity records  - School: Grades ### [scikit learn](https:\/\/scikit-learn.org\/stable\/)  - [Tutorials](https:\/\/scikit-learn.org\/stable\/tutorial\/index.html)  - [Getting Started](https:\/\/scikit-learn.org\/stable\/getting_started.html): _What do you notice about the header portion of each of the script chunks?_  - [`import` vs `from ... import`](https:\/\/scikit-learn.org\/stable\/getting_started.html) ## Setting up Live Share -----------------------------------    </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p4\/d2\/"},{value:"Day 2: Seeing names with Altair",label:"<p>Welcome to class! Announcements Project Submissions  Don\u0026rsquo;t leave example text\/documentation from the Template in your writeup Change the Project Title (don\u0026rsquo;t have to call it Client Report) Code can be adjacent to the relevant output as long as it\u0026rsquo;s not distracting, but please include your complete code in an Appendix Be sure to save the QMD file before rendering   Autosave  Project 0 Wrap-up  If you still cannot render a document in Quarto, let me know Is python, at least up and running? able to plot graphs and make tables? Finishing up a report   Markdown  Tables - want to have the printed table in Markdown area, not a code area   HTML submissions  Other hints:\n Tutoring Lab Slack Channel: #tutoring_lab  Back to Day 1 Slides Methods Checkpoint Loading the names data    Visit the Project 1 Instructions to download the data. #%% # load packages import pandas as pd import altair as alt #%% # load data from url url = \u0026quot;this_is_the_url_to_the_csv_file\u0026quot; names = pd.read_csv(url) #%% # or, you can load data from file names2 = pd.read_csv(\u0026quot;names_year.csv\u0026quot;)    Pandas and DataFrames    What is a Pandas DataFrame? 
DataFrames come with attributes and built-in functions that can help us get a feel for our data.\nRun the code below one line at a time (or use other functions of your choice) to explore the names data. What do you learn?\nnames.columns names.shape names.size names.head() names.describe()    Understanding your data    You should be able to introduce your data sets to people, the same way you introduce a friend.\n If you can\u0026rsquo;t describe what a row is in your data, then you don\u0026rsquo;t understand what groups you can analyze. If you can\u0026rsquo;t describe what a column is in your data, then you don\u0026rsquo;t understand what information you can evaluate for each group.  Being able to explain your data out loud to someone else follows the same principles as rubber duck debugging.\n   Let\u0026rsquo;s practice!    Understanding column values How many unique names does the names dataset contain? Work with a partner to find the answer. I recommend searching the Pandas cheat sheet.\n pull the name column out as a series Use the pandas unique function pd.unique() find the size of the series  What is the range of years in the names dataset? Again, work with a partner and use the Pandas cheat sheet.\n pull the year column as a series Find the max Find the min     ----------------------------------------------------- How many unique years do we have for our name?    pd.unique(dat.query(\u0027name == \u0026quot;John\u0026quot;\u0027).year).min() pd.unique(dat.query(\u0027name == \u0026quot;John\u0026quot;\u0027).year).max() pd.unique(dat.query(\u0027name == \u0026quot;John\u0026quot;\u0027).year).size     Filtering rows of a DataFrame    Make sure to do the project readings!  P4DS: 5.2 Filter rows with .query() The query method     ## Getting started with Altair ### Why are we using Altair? #### It is built on Vega and D3, which are fast and web based.  
#### Grammar of Graphics: Vega-Lite ![](altair_grammar_graphics.png)  - [Technical Paper](https:\/\/www.domoritz.de\/papers\/2017-VegaLite-InfoVis.pdf)  - [Website](https:\/\/vega.github.io\/vega-lite\/)  - [Endorsement](https:\/\/medium.com\/@robin.linacre\/why-im-backing-vega-lite-as-our-default-tool-for-data-visualisation-51c20970df39) ------------------------------------ Grand Grand Question 1 What does a chart need to look like to answer Question 1?\nWhat data do we need to build that chart?\nMaking our chart look good.  Size of chart Title and subtitle Size and color of line Axis formatting Reference marks  Extra Practice Altair (and Vega and Vega-Lite and D3!)    What is the difference between a \u0026ldquo;high-level\u0026rdquo; and \u0026ldquo;low-level\u0026rdquo; programming language or tool? Here\u0026rsquo;s what Google has to say.\n Altair is a Python library built on Vega and Vega-Lite Vega is a \u0026ldquo;higher-level visualization specification language on top of D3\u0026rdquo; that creates charts with json files D3 is a JavaScript library     Altair: Removing commas from years    Remember, Altair builds on Vega, which builds on D3. Sometimes to answer a question about Altair, you will have to read Vega or D3 documentation. For example:\n Altair\u0026rsquo;s guide for customizing axis labels. (Scroll down to the second code example.) D3 options for different axis formats.  (alt.Chart(my_data) .mark_line() .encode( x = alt.X(\u0027year\u0027, axis = alt.Axis(format = \u0027d\u0027, title = \u0026quot;Year\u0026quot;)), y = alt.Y(\u0027Total\u0027, axis = alt.Axis(title = \u0026quot;Children with Name\u0026quot;)) ) )    Altair: Adding a reference line    You may want to include a point or line of reference to help your chart answer the question \u0026ldquo;compared to what?\u0026rdquo;. Let\u0026rsquo;s say you have your chart for Grand Question 1 saved as question_1. 
The easiest way I have found to add a reference line is to create a new DataFrame with a single number:\nline_df = pd.DataFrame({\u0027year\u0027: [1990]}) line_df And use the new DataFrame to create a chart with a single line that has a specific value of x (for example, your birth year) but spans the entire y-axis.\nIn Altair, this is done with the mark_rule() geometry. You can then \u0026ldquo;layer\u0026rdquo; the two charts together.\nline = alt.Chart(line_df).mark_rule(color=\u0026quot;red\u0026quot;).encode(x = \u0026quot;year\u0026quot;) final_chart = question_1 \u002b line final_chart Additional references:\n Using layered charts Altair Marks Add a horizontal line to an existent chart     ------------------------------------ Look at the names data and write a short paragraph in your notes describing the data set    We have a row for each name-year. Excluding the name and year columns we have a column for each state and DC. Finally there is a Total column that sums over the other columns.\n  If you can\u0026rsquo;t describe what a row is in your table then you don\u0026rsquo;t understand what groups you can talk about with your data. The columns tell you what information you will be able to evaluate on each \u0026lsquo;group\u0026rsquo; or \u0026lsquo;observation\u0026rsquo; in your data.   We want tidy data.\n   ----------------------- Which name has been given the most and the least?      Sum all the years for each name (groupby()). Create a new DataFrame for the totals. Write a query that filters the total data to the max and min. Create a markdown table with the information. A. to_markdown() requires the tabulate package. B. to_markdown() with arguments showindex and floatformat C. 
Guidance on floatformat   dat_total = dat.groupby(\u0027name\u0027).agg(n = (\u0027Total\u0027, \u0027sum\u0027)).reset_index() print(dat_total .query(\u0027n in [@dat_total.n.max(), @dat_total.n.min()]\u0027) .to_markdown(index = False, floatfmt=\u0026quot;.0f\u0026quot;))    -------------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p1\/d2\/"},{value:"Day 2: SQL Calculations",label:"<p>Welcome to class! Spiritual Thought Announcements  Project 3 - SQL practice  Class Activity in Slack Part 1 Goal: Describe in words (NOT using code) how to get from your starting data to your ending data.\nPost your answer in your group\u0026rsquo;s Slack thread. You have 7 minutes, and are allowed to ask me 1 question.\nPart 2 Goal: Now try to write a SQL query to get your ending data.\nPost your SQL query in your group\u0026rsquo;s Slack thread. You have 7 minutes, and are allowed to ask me 1 question.\nHere is the SQL template for your use.\nSELECT -- \u0026lt;columns\u0026gt; and \u0026lt;column calculations\u0026gt; FROM -- \u0026lt;table name\u0026gt;  JOIN -- \u0026lt;table name\u0026gt;  ON -- \u0026lt;columns to join\u0026gt; WHERE -- \u0026lt;filter condition\u0026gt; GROUP BY -- \u0026lt;subsets for column calculations\u0026gt; HAVING -- \u0026lt;grouped filter condition\u0026gt; ORDER BY -- \u0026lt;how the output is returned in sequence\u0026gt; LIMIT -- \u0026lt;number of rows to return\u0026gt; \n- Group 1: [SELECT and FROM](https:\/\/docs.data.world\/documentation\/sql\/concepts\/basic\/SELECT_and_FROM.html) with the `people` table (called \u0022master\u0022 in the data dictionary). Include examples of `SELECT AS` and `SELECT DISTINCT`.  - Group 2: [WHERE](https:\/\/docs.data.world\/documentation\/sql\/concepts\/basic\/WHERE.html) with the `schools` table. Try using different types of comparison operators, or making multiple comparisons with `AND`.  
- Group 3: [ORDER BY](https:\/\/docs.data.world\/documentation\/sql\/concepts\/basic\/ORDER_BY.html) with the `salaries` table. Try sorting in different orders (ascending or descending) and with multiple columns.  - Group 4: [JOIN](https:\/\/docs.data.world\/documentation\/sql\/concepts\/intermediate\/Joins.html) with the `schools` and `collegeplaying` tables (focus on \u0022inner\u0022 joins).  - Group 5: [Aggregations](https:\/\/docs.data.world\/documentation\/sql\/concepts\/intermediate\/aggregations.html) with the `batting` table.  - Group 6: [GROUP BY](https:\/\/docs.data.world\/documentation\/sql\/concepts\/intermediate\/GROUP_BY.html) with the `batting` table. -------------------------- Getting started Question One: Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID\/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.\nThink about:\n What tables (data) do you need? What SQL commands do you need?  What table do we want to use?    q = \u0026#39;\u0026#39;\u0026#39; SELECT * FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What columns do we want to select?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What calculation do we want to perform?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r, ab\/r FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; batting_calc = dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What name do we give our calculated column?    
q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r, ab\/r as runs_atbat FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; batting_calc = dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    #### I want to join two tables to help in decision making __Which year after 1999 had the most players selected as All Stars who didn\u0027t play in the All Star game?__ - __Provide a summary of how many games, hits, and at bats those players had in that year\u0027s post season.__ ```python import pandas as pd import altair as alt import numpy as np import datadotworld as dw con_url = \u0027byuidss\/cse-250-baseball-database\u0027 ``` What table do we want for All Star information?    # %% # allstar table dw.query(con_url, \u0026#39;\u0026#39;\u0026#39; SELECT * FROM AllstarFull WHERE AND LIMIT 5 \u0026#39;\u0026#39;\u0026#39;).dataframe    Can you use a groupby to get the counts of players per year?    dw.query(con_url, \u0026#39;\u0026#39;\u0026#39; SELECT yearid, -- \u0026lt;stuff to calculate\u0026gt; FROM AllstarFull WHERE yearid \u0026gt; 1999 AND gp != 1 GROUP BY --? ORDER BY --? \u0026#39;\u0026#39;\u0026#39;).dataframe    What table do we want for the post season at bats?    dw.query(con_url, \u0026#39;\u0026#39;\u0026#39; SELECT * FROM BattingPost as bp LIMIT 5 \u0026#39;\u0026#39;\u0026#39;).dataframe    Can you join the batting table and AllStar information and keep only the at bats, hits with the all star gp and gameid columns?    
Let\u0026rsquo;s only keep players with at least one at bat in the post season\ndw.query(con_url, \u0026#39;\u0026#39;\u0026#39; SELECT -- \u0026lt;columns to keep\u0026gt; FROM BattingPost as bp JOIN AllstarFull as asf ON -- \u0026lt;two columns for the join\u0026gt; WHERE bp.yearid \u0026gt; 1999 AND gp != 1 AND -- \u0026lt;at bat condition\u0026gt; LIMIT 15 \u0026#39;\u0026#39;\u0026#39; ).dataframe    Let\u0026rsquo;s build the final table    Which year after 1999 had the most players selected as All Stars who didn\u0026rsquo;t play in the All Star game?\n Provide a summary of how many games, hits, and at bats those players had in that year\u0026rsquo;s post season.  dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, \u0026#39;\u0026#39;\u0026#39; SELECT -- \u0026lt;lots of calculations\u0026gt; FROM BattingPost as bp JOIN AllstarFull as asf ON bp.playerid = asf.playerid AND bp.yearid = asf.yearid WHERE bp.yearid \u0026gt; 1999 AND gp != 1 AND ab \u0026gt; 0 GROUP BY -- \u0026lt;column\u0026gt; ORDER BY -- \u0026lt;column\u0026gt; \u0026#39;\u0026#39;\u0026#39; ).dataframe    --------------------------------------------------------- Extra Practice \u0026ldquo;I get SQL and want to be challenged.\u0026rdquo; Do this Math 335 task with SQL commands in Python.\n</p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p3\/d2\/"},{value:"Day 2: Transforming Data",label:"<p>Welcome to class! Spiritual Thought Announcements  Code chunk options:  Locally using #| warning: false Globally in the YAML using execute: warning: false    Flights Data Issues: What are some of the data issues you discovered while getting to know your data?\nLoading JSON files into pandas Let\u0026rsquo;s load in some practice data! 
Data link.\nHere\u0026rsquo;s a description of the data: Data Description.\nimport pandas as pd # to load and transform data import numpy as np # for math\/stat calculations # from url to pandas dataframe url = \u0026#34;https:\/\/github.com\/byuidatascience\/data4missing\/raw\/master\/data-raw\/mtcars_missing\/mtcars_missing.json\u0026#34; cars = pd.read_json(url) # or from file to pandas dataframe cars = pd.read_json(\u0026#34;mtcars_missing.json\u0026#34;) Look at the data for the first two cars. What is different about the format?\n[ { \u0026#34;car\u0026#34;: \u0026#34;Mazda RX4\u0026#34;, \u0026#34;mpg\u0026#34;: 21, \u0026#34;cyl\u0026#34;: 6, \u0026#34;disp\u0026#34;: 160, \u0026#34;hp\u0026#34;: 110, \u0026#34;drat\u0026#34;: 3.9, \u0026#34;wt\u0026#34;: 2.62, \u0026#34;qsec\u0026#34;: 16.46, \u0026#34;vs\u0026#34;: 0, \u0026#34;am\u0026#34;: 1, \u0026#34;gear\u0026#34;: 4, \u0026#34;carb\u0026#34;: 4 }, { \u0026#34;car\u0026#34;: \u0026#34;Mazda RX4 Wag\u0026#34;, \u0026#34;mpg\u0026#34;: 21, \u0026#34;cyl\u0026#34;: 6, \u0026#34;disp\u0026#34;: 160, \u0026#34;hp\u0026#34;: 110, \u0026#34;drat\u0026#34;: 3.9, \u0026#34;wt\u0026#34;: 2.875, \u0026#34;qsec\u0026#34;: 17.02, \u0026#34;am\u0026#34;: 1, \u0026#34;gear\u0026#34;: 4, \u0026#34;carb\u0026#34;: 4 } ] \nYour Turn: Transforming Data With your group, research these functions and create an example using the cars data. Post your example in Slack. 
Be prepared to teach the class about your functions.\nYou can use the Data Transformation textbook chapter and the pandas documentation to help you.\nRecreate the following output to the best of your abilities: Group 1: Working with rows  .query() allows you to subset observations (rows) .sort_values() arranges rows in a particular order  Group 2: Working with columns  .filter() (as well as [] and .loc[]) allow you to select columns .assign() is one way to add new columns to a dataframe  Group 3: Counting items  .value_counts() summarizes a column by counting the values inside .crosstab() creates a \u0026ldquo;cross tabulation\u0026rdquo; of two or more variables  Group 4: Summarizing data  Using .groupby() and .agg() together allows you to calculate group summaries  Your Turn: Summarizing the cars data Write code to calculate the mean weight wt for each cylinder type cyl.\nAnswer 1    cars.groupby(\u0027cyl\u0027).agg(mean_weight = (\u0027wt\u0027, np.mean)).reset_index()    Can you print the answer as a markdown table?\nAnswer 2    print(cars.groupby(\u0027cyl\u0027).agg(mean_weight = (\u0027wt\u0027, np.mean)).reset_index().to_markdown(index = False))    Project 2 FAQs Why are we using assign()    One main reason:\n You can create multiple columns within the same assign() where one of the columns depends on another one defined within the same assign. source: Documentation\n Other resources:\n Why use pandas.assign rather than simply initialize new column? 3 Ways to Add New Columns to Pandas Dataframe  Not related, but also fun: Should you use \u0026ldquo;dot notation\u0026rdquo; or \u0026ldquo;bracket notation\u0026rdquo; with pandas?\n   Lambda functions    Two ways to define the same function:\ndef square(x): return x**2 square = lambda x:x**2 There are some differences between them as listed below.\n  lambda is a keyword that returns a function object and does not create a \u0026lsquo;name\u0026rsquo;. 
Whereas def creates a name in the local namespace. lambda functions are good for situations where you want to minimize lines of code, as you can create a function in one line of Python code. It is not possible using def. lambda functions are somewhat less readable for most Python users. lambda functions can only be used once, unless assigned to a variable name.  source\n    Conditional operations    What if you want to create a new column, whose values depend on another column? There are a lot of ways to accomplish this (see this stackoverflow answer). Some functions I use:\n isin() method where() method You can also use an if else statement inside a lambda function     Missing data    We will learn how to identify and deal with missing data next week. For now, we can drop rows we don\u0026rsquo;t want using square brackets [] or .query().   API\u0026rsquo;s and JSON: A Primer Application Programming Interfaces (APIs) Representational State Transfer (REST APIs)  Over the course of the ’00s, another Web services technology, called Representational State Transfer, or REST, began to overtake [all other tools] for the purpose of transferring data. One of the big advantages of programming using REST APIs is that you can use multiple data formats — not just XML, but JSON and HTML as well. As web developers came to prefer JSON over XML, so too did they come to favor REST over SOAP. As Kostyantyn Kharchenko put it on the Svitla blog, “In many ways, the success of REST is due to the JSON format because of its easy use on various platforms.”\nToday, JSON is the de-facto standard for exchanging data between web and mobile clients and back-end services. ref\n JavaScript Object Notation  Well, when you’re writing frontend code in Javascript, getting JSON data back makes it easier to load that data into an object tree and work with it. And JSON formats data in a more succinct way, which saves bandwidth and improves response times when sending messages back and forth to a server. 
In a world of APIs, cloud computing, and ever-growing data, JSON has a big role to play in greasing the wheels of a modern, open web. ref\n Other Resources  RESTful APIs in 100 Seconds (video) Python API Tutorial: Getting Started with APIs Big List of Free and Open Public APIs (No Auth Needed)  How could we leverage numpy\u0026rsquo;s where() to address the different month proportions in question 3?    reference   How many rows have missing months?    flights.month.value_counts()    Can we figure out any patterns in the missingness?     pd.crosstab() groupby     -------------------------------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p2\/d2\/"},{value:"JSONs \u0026 missing",label:"<p>UFO Sightings Data Link to json file\nExercise 1 Read in the json file as a pandas dataframe. After reading in the data, you\u0026rsquo;ll want to explore it and gain some intuition. Exploring data is a very important step — the more you know about your data the better! Answer the following questions to gain some insight into this dataset.\n How many rows are there? How many columns? What does a row represent in this dataset? What are the different ways missing values are encoded? How many np.nan in each column?  Some useful code for exploring data\n# Object\/Categorical Columns data.column_name.value_counts(dropna=False) data.column_name.unique() # Numeric Columns data.column_name.describe() # Counting missing values data.isna().sum() # Creates boolean dataframe and sums each column  Exercise 2 After learning different ways our data encodes missing values, now we will neatly manage them. There are many techniques we can use to handle missing values; for example, we can drop all rows that contain a missing value, impute with mean or median, or replace missing values with a new missing category. We will use some of these techniques in this exercise.\n shape_reported - replace missing values with missing string. 
distance_reported - change -999 values to np.nan. (-999 is a typical way of encoding missing values.) distance_reported - fill in missing values with the mean (imputation) were_you_abducted - replace - string with missing string.  The first 10 rows of your data should look like this after completion of the above steps.\n    city shape_reported distance_reported were_you_abducted estimated_size     0 Ithaca TRIANGLE 8521.9 yes 5033.9   1 Willingboro OTHER 7438.64 no 5781.03   2 Holyoke OVAL 7438.64 no 697203   3 Abilene DISK 7438.64 no 5384.61   4 New York Worlds Fair LIGHT 6615.78 missing 3417.58   5 Valley City DISK 7438.64 no 4280.1   6 Crater Lake CIRCLE 7377.89 no 528289   7 Alma DISK 7438.64 missing 4772.75   8 Eklutna CIGAR 5214.95 no 4534.03   9 Hubbard CYLINDER 8220.34 missing 4653.72    Some useful code for filling in missing data\ndata.column_name.replace(..., ..., inplace=True) data.column_name.fillna(..., inplace=True)  Exercise 3 Create a table that contains the following summary statistics.\n median estimated size by shape mean distance reported by shape count of reports belonging to each shape  Your table should look like this:\n   shape_reported median_est_size mean_distance_reported group_count     CIGAR 5899.68 6520.21 3   CIRCLE 266002 7408.26 2   CYLINDER 4550.58 8039.49 2   DISK 4581.8 7516.39 16   FIREBALL 5407.22 7097.78 3   FLASH 6108.34 7438.64 1   FORMATION 5104.4 8708.32 2   LIGHT 3850.25 7636.09 2   OTHER 4699.4 7473.98 4   OVAL 4943.63 7787.24 4   RECTANGLE 3668.1 6054.62 2   SPHERE 5076.78 7206.55 6   TRIANGLE 5033.9 8521.9 1   missing 250153 7438.64 2    Some useful code for grouping and getting summary statistics\n(data.groupby(...) .agg(..., ..., ...))  Exercise 4 The cities listed below reported their estimated size in square inches, not square feet. Create a new column named estimated_size_sqft in the dataframe, that has all the estimated sizes reported as sqft. 
(Hint: divide by 144 to go from sqin -\u0026gt; sqft)\n Holyoke Crater Lake Los Angeles San Diego Dallas  The head of your data should look like this.\n    city shape_reported distance_reported were_you_abducted estimated_size estimated_size_sqft     0 Ithaca TRIANGLE 8521.9 yes 5033.9 5033.9   1 Willingboro OTHER 7438.64 no 5781.03 5781.03   2 Holyoke OVAL 7438.64 no 697203 4841.69   3 Abilene DISK 7438.64 no 5384.61 5384.61   4 New York Worlds Fair LIGHT 6615.78 missing 3417.58 3417.58   5 Valley City DISK 7438.64 no 4280.1 4280.1   6 Crater Lake CIRCLE 7377.89 no 528289 3668.68   7 Alma DISK 7438.64 missing 4772.75 4772.75   8 Eklutna CIGAR 5214.95 no 4534.03 4534.03   9 Hubbard CYLINDER 8220.34 missing 4653.72 4653.72    Some useful code to fix the rows reported in sqin\nnp.where(..., # Condition ..., # If condition is true ...) # If condition is false  After you have completed this skill builder with your team (or on your own) then compare your work to our script    See the script.   </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/skill_builders\/json_missing\/"},{value:"Day 1: Exploring names with pandas",label:"<p>Welcome to Class! Announcements  Data Science Society Kickoff! Wednesday at 6 in the STC 394 The data science lab  Completing Last Week  Quarto - \u0026ldquo;out of the frying pan and into the fire\u0026rdquo; Finishing the Introduction Project  Use the QMD Template project template Render as HTML and upload in Canvas    What was that data science community portion of our grade?    The Syllabus has this section which says:\n Data science community\n To earn credit for the DS Community element you must complete two different tasks from the list below. At the end of the semester, you will be asked to report on which tasks you completed and what you learned from them.\nAttend Data Science Society at least once.\n Sign up for an email newsletter that will teach you more about data science. 
Data Science Weekly or Data Elixir are good options. Listen to a podcast episode about data science. Build a Career in Data Science has some excellent episodes. Watch a professional presentation on YouTube about data science. Be prepared to share the link and a summary of the video. Reach out to someone who works in a data-related field and ask them for 15 minutes of their time. Use this time to conduct an “informational interview” and learn more about their responsibilities and career path. Research and apply to at least 5 data-related jobs or internships.  Interview Question: How do you keep up with the current methods in data science?\nDon\u0026rsquo;t Say: Nothing\n   Let\u0026rsquo;s Code! DS 250 workflow     You are going to hit SHIFT \u002b ENTER thousands of times. We don\u0026rsquo;t usually source our scripts. Think of Python Interactive like a graphing calculator or Excel on steroids. You code in pieces. Rewrite for clarity!     Can you figure out the functions of pandas?    Pandas Cheat Sheet and Basics Blog Post\n # Pause: can you explain what this code is doing? df = pd.DataFrame( {\u0026#34;a\u0026#34; : [5, 4, 6, 2, 3], \u0026#34;b\u0026#34; : [7, 8, 9, 10, 11], \u0026#34;c\u0026#34; : [10, 11, 12, 101, 0]}) Use the cheat sheet to find the functions you would need to implement the following steps.\nGroup 1\n sort my table by column a then only use the first 2 rows then calculate the mean of column b.  Group 2\n rename column a to duck then subset to only have duck and b columns then keep all rows where b is less than 9 then find the min of duck     What is method chaining?    Pandas is built to allow for method chaining. Here is a great resource on how to use method chaining: How to write neat pandas code.\n plotly.express creates a chart object pandas creates a DataFrame object We usually include () around our entire method so we can show it in steps.     
Project 1 - Intro Understanding your data You should be able to introduce your data sets to people, the same way you introduce a friend!\n What does each row represent? If you don\u0026rsquo;t know, then you don\u0026rsquo;t understand what groups you can analyze. What does each column represent? If you don\u0026rsquo;t know, then you don\u0026rsquo;t understand what information you can evaluate for each group.  Being able to explain your data out loud to someone else follows the same principles as rubber duck debugging.\nIntroduction to pandas \u0026ldquo;DataFrame\u0026rdquo; What is a pandas DataFrame? We can read the official documentation. I also like the video in this tutorial.\nDataFrames come with attributes and built-in functions that can help us get a feel for our data.\nRun the code below one line at a time (or use other functions of your choice) to explore the names data. What do you learn?\nmy_data.columns my_data.shape my_data.size my_data.head() my_data.describe() Setup for Project 1 Create the folder and files to get prepared.  DS250 \u0026gt; project_1 \u0026gt;  names.py names.qmd data.csv (just in case the internet is down)    \u0026ldquo;How should we start each file?\u0026rdquo; I would do this process for every project.\n names.py: Every file starts with the same cells 1) import packages, 2) load data. names.qmd: Let\u0026rsquo;s start with the course template notes.qmd: Keep project notes on the readings and things you learn. my_cheat_sheet.qmd: Update your own cheat sheet  Read in the data.\n#%% # load packages import pandas as pd import plotly.express as px #%% # load data url = \u0026#34;https:\/\/github.com\/byuidatascience\/data4names\/raw\/master\/data-raw\/names_year\/names_year.csv\u0026#34; names = pd.read_csv(url) 1. How many unique names does the names dataframe contain? Work with a partner to find the answer. 
You might want to look at this pandas cheat sheet.\nHint     Pull the name column out as a series Use the pandas unique function pd.unique() Find the size of the series     2. What is the range of years in the names dataframe? Again, work with a partner and use the pandas cheat sheet.\nHint2     Pull the year column out as a series Find the max Find the min     </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p1\/d1\/"},{value:"Day 1: Intro to Flights Data",label:"<p>Welcome to class! Spiritual Thought Short Link\nProject 1 Comments  Don\u0026rsquo;t include data as a table. Only include tables that add useful information. If I have to scroll up and down it isn\u0026rsquo;t useful. Reports should be readable by an intelligent, but non-technical audience (Meaningful titles and section names) Make it like something you\u0026rsquo;d like to read Clean out any code output, logs, that distract from the message (\u0026ldquo;My Useless Chart\u0026rdquo;) Eliminate \u0026ldquo;warnings\u0026rdquo;  Project 2: Late Flights and Missing Data JSON files (JavaScript Object Notation)  Today, JSON is the de-facto standard for exchanging data between web and mobile clients and back-end services. source\n What is JSON? 
[ { \u0026#34;car\u0026#34;: \u0026#34;Mazda RX4\u0026#34;, \u0026#34;mpg\u0026#34;: 21, \u0026#34;cyl\u0026#34;: 6, \u0026#34;disp\u0026#34;: 160, \u0026#34;hp\u0026#34;: 110, \u0026#34;drat\u0026#34;: 3.9, \u0026#34;wt\u0026#34;: 2.62, \u0026#34;qsec\u0026#34;: 16.46, \u0026#34;vs\u0026#34;: 0, \u0026#34;am\u0026#34;: 1, \u0026#34;gear\u0026#34;: 4, \u0026#34;carb\u0026#34;: 4 }, { \u0026#34;car\u0026#34;: \u0026#34;Mazda RX4 Wag\u0026#34;, \u0026#34;mpg\u0026#34;: 21, \u0026#34;cyl\u0026#34;: 6, \u0026#34;disp\u0026#34;: 160, \u0026#34;hp\u0026#34;: 110, \u0026#34;drat\u0026#34;: 3.9, \u0026#34;wt\u0026#34;: 2.875, \u0026#34;qsec\u0026#34;: 17.02, \u0026#34;am\u0026#34;: 1, \u0026#34;gear\u0026#34;: 4, \u0026#34;carb\u0026#34;: 4 } ] Introduce the data Load the JSON file and spend a few minutes studying it. Can you learn enough about it to describe the columns and rows?\nHints:\n You can use .describe() to learn about the distribution of a numeric variable. You can use .value_counts() to learn about the distribution of a categorical variable. .crosstab() creates a \u0026ldquo;cross tabulation\u0026rdquo; of two or more categorical variables.  Can you trust the data? Do you notice anything interesting about the flights data?\nQuestion Brainstorming In your group, try to answer the following questions about your assigned question:\n What is our goal? How can we get there? What will the answer look like when we\u0026rsquo;re done?  Project 2 FAQs Missing data    Not all missing data is represented as np.nan. For an example, look at the column that counts delays due to late aircraft.\nWe will learn how to identify and deal with missing data next week. For now, we can drop rows we don\u0026rsquo;t want using square brackets [] or .query().\n   What columns do we need to use for question 3 (total number of flights delayed by weather)?      
num_of_delays_weather num_of_delays_late_aircraft num_of_delays_nas      Groups 1 and 5 - Working with rows  .query() allows you to subset observations (rows) .sort_values() arranges rows in a particular order  Groups 2 and 6 - Working with columns  .filter() (as well as [] and .loc[]) allow you to select columns .assign() is one way to add new columns to a dataframe  Groups 3 and 7 - Counting items  .value_counts() summarizes a column by counting the values inside .crosstab() creates a \u0026ldquo;cross tabulation\u0026rdquo; of two or more variables  Groups 4 and 8 - Summarizing data  Using .groupby() and .agg() together allows you to calculate group summaries   Your Turn: Summarizing the cars data Write the code to calculate the mean weight wt for each cylinder type cyl.\nAnswer 1    cars.groupby(\u0027cyl\u0027).agg(mean_weight = (\u0027wt\u0027, np.mean)).reset_index()    Can you print the answer as a markdown table?\nAnswer 2    cars.groupby(\u0027cyl\u0027).agg(mean_weight = (\u0027wt\u0027, np.mean)).reset_index().to_markdown(index = False)    -------------------------------------------------------------------------- The flights data How are we going to answer Question 1 and Question 2?\nWatch out for different forms of missing data!    Not all missing data is represented as np.nan. For an example, look at the column that counts delays due to late aircraft.   What columns do we need to use for question 3 (total number of flights delayed by weather)?      num_of_delays_weather num_of_delays_late_aircraft num_of_delays_nas      How could we leverage numpy\u0026rsquo;s where() to address the different month proportions in question 3?    reference   How many rows have missing months?    flights.month.value_counts()    Can we figure out any patterns in the missingness?     pd.crosstab() groupby     Project 1: Names In your groups, discuss:\n What did you learn about data and Altair? What questions do you still have?  
Connecting to Application Programming Interfaces (APIs) Representational State Transfer (REST APIs)  Over the course of the ’00s, another Web services technology, called Representational State Transfer, or REST, began to overtake [all other tools] for the purpose of transferring data. One of the big advantages of programming using REST APIs is that you can use multiple data formats — not just XML, but JSON and HTML as well. As web developers came to prefer JSON over XML, so too did they come to favor REST over SOAP. As Kostyantyn Kharchenko put it on the Svitla blog, “In many ways, the success of REST is due to the JSON format because of its easy use on various platforms.”\nToday, JSON is the de-facto standard for exchanging data between web and mobile clients and back-end services. ref\n JavaScript Object Notation  Well, when you’re writing frontend code in Javascript, getting JSON data back makes it easier to load that data into an object tree and work with it. And JSON formats data in a more succinct way, which saves bandwidth and improves response times when sending messages back and forth to a server. In a world of APIs, cloud computing, and ever-growing data, JSON has a big role to play in greasing the wheels of a modern, open web. ref \u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026gt;\n What does missing data look like?    How many missing values do you see in the first ten rows? (The mtcars documentation might help.)\ncars.head(10)    How many missing values are there?    
#%% cars.isna().sum() #%% cars.isin([\u0026#39;\u0026#39;]).sum() #%% cars.describe() reference 1 and reference 2\n   ### How Pandas handles missingness Read [\u0027Handling missing in pandas\u0027](https:\/\/pandas.pydata.org\/pandas-docs\/stable\/user_guide\/missing_data.html#calculations-with-missing-data) ```python import numpy as np df = (pd.DataFrame( np.random.randn(5, 3), index=[\u0027a\u0027, \u0027c\u0027, \u0027e\u0027, \u0027f\u0027, \u0027h\u0027], columns=[\u0027one\u0027, \u0027two\u0027, \u0027three\u0027]) .assign( four = \u0027bar\u0027, five = lambda x: x.one  0, six = [np.nan, np.nan, 2, 2, 1], seven = [4, 5, 5, np.nan, np.nan]) ) ``` What happens when you add two pandas objects with missing values?    df.seven \u002b df.six reference\n   What happens when you sum within a column?    df.seven.sum() reference\n   How could I add two columns treating NaN like zeros?    df.seven.fillna(0) \u002b df.six.fillna(0) reference\n   ----------------------------------------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p2\/d1\/"},{value:"Day 1: Intro to ML",label:"<p>Welcome to class! Announcements  Project 3 - Getting pickier about good communication  Tables in reports should be as concise as possible (no duplicate information) Career batting average Meaningful report name (Drop \u0026ldquo;Client Report\u0026rdquo;) Meaningful section headers so the table of contents is useful (don\u0026rsquo;t call them \u0026ldquo;Question 1\u0026rdquo;) Don\u0026rsquo;t include \u0026ldquo;My useless chart\u0026rdquo; from the template   Ask for help!  Computing lab Computing lab Slack channel (search) Slack classmates or general channel    Spiritual Thought Genesis 1:1 and Machine Learning Are facts true? Pictionary!   
----------------------  From Sebastian Thrun:\n AI is able to learn \u0026lsquo;rules\u0026rsquo; from highly repetitive data.\nThe single most important thing for AI to accomplish in the next ten years is to free us from the burden of repetitive work.\n Your Turn: Student Classification Problem Can we predict if a student is from Utah?\nYour Turn: Features and Targets Import dwellings.csv. With a neighbor:\n Try to describe the data. Explain what each observation (row) is and what measurements we have on that observation (columns). Now try describing the modeling (machine learning) we are going to do in terms of \u0026ldquo;features\u0026rdquo; and \u0026ldquo;targets\u0026rdquo;. Watch out - are there any columns that are the target in disguise? (You may need to review the project goal.) What features do you expect to have a strong relationship with the target?  Before Next Class Start working on Question 1    The goal of Question 1 is to help us with \u0026ldquo;feature selection\u0026rdquo;.\n Remember: Overfitting happens when some boundaries are based on distinctions that don\u0026rsquo;t make a difference. More data does not always lead to better models. (Occam\u0026rsquo;s Razor)  Common questions:\n Why it may be better to have fewer predictors in Machine Learning models? What is Feature Selection and why do we need it in Machine Learning?     Do the project readings    Machine Learning Introduction\n Step-by-step guide (mostly) for training a GaussianNB classifier. (The steps will be the same for any algorithm you use.)  Visual Introduction to Machine Learning\n Machine learning identifies patterns using statistical learning and computers by unearthing boundaries in data sets. You can use it to make predictions. One method for making predictions is called a decision tree, which uses a series of if-then statements to identify boundaries and define patterns in the data. 
Overfitting happens when some boundaries are based on distinctions that don\u0026rsquo;t make a difference. You can see if a model overfits by having test data flow through the model.     What is the 5000 rows error with Altair?    MaxRowsError: How can I plot Large Datasets?\nYou may also save data to a local filesystem and reference the data by file path. Altair has a JSON data transformer that will do this transparently when enabled:\n# Try doing data exploration with: subset_data = denver.sample(n = 4999)    ---------------------- scikit-learn resources     Home page Tutorials Getting Started: What do you notice about the header portion of each of the script chunks?  import vs from ... import       1. Models approximate real-life situations using limited data.  2. In doing so, errors can arise due to assumptions that are overly simple (bias) or overly complex (variance).  3. Building models is about making sure there\u0027s a balance between the two. #### But what is the \u0027Pavlovian bell\u0027 in the machine learning model? ![](..\/..\/images\/ml\/test.png) Some mathematical penalty\/reward equation.  - __[Regression](https:\/\/setosa.io\/ev\/ordinary-least-squares-regression\/)__  - __[Variance, RMSE, SD](..\/..\/interactive\/threshold_histogram.html)__  - __proportions__ ## Using our project data to understand features, targets, and samples.  1. Import `dwellings_ml.csv` and write a short sentence describing your data. Remember to explain an observation and what measurements we have on that observation.  2. Now try describing the modeling (machine learning) we are going to do in terms of features and targets.  A. Are there any columns that are the target in disguise?  B. 
_Are the observational units unique in every row?_ ![](..\/..\/images\/ml\/iris_description.png) --------------- - Financial: orders, invoices, payments  - Work: plans, activity records  - School: Grades ------------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p4\/d1\/"},{value:"Day 1: Intro to Project 3",label:"<p>Welcome to class! Spiritual Thought Announcements  Project 2 Highlights Project 2 comments   Turn them in Clean up graphs (main titles, axis labels, legends) Column headers on tables in your report (don\u0026rsquo;t include index number either) Technically Proportion of all flights delayed by weather, not the proportion of delayed flights JSON should look like a text example of a record, not a table  Things for next project:   Be sure to give section headers meaningful titles (NOT \u0026ldquo;Question 1\u0026rdquo;) Drop \u0026ldquo;my useless chart\u0026rdquo; from your graphs  What is Structured Query Language (SQL)?   Ray and I were impressed by how compactly Codd’s languages could represent complex queries. However, at the same time, we believed that it should be possible to design a relational language that would be more accessible to users without formal training in mathematics or computer programming. We believed that barriers to widespread acceptance of Codd’s languages existed on two levels.  .   1. The first barrier came from the mathematical notation, which was hard to enter at a keyboard. This barrier was superficial and could be easily dealt with by replacing symbols with keywords.   2. The more difficult barrier was at the semantic level. The basic concepts of Codd’s languages were adapted from set theory and symbolic logic. 
This was natural given Codd’s background as a mathematician, _but Ray and I hoped to design a relational language based on concepts that would be familiar to a wider population of users._ We also hoped to extend the language to encompass database updates and administrative tasks such as the creation of new tables and views, which had traditionally been outside the scope of a query language. SQL is \u0022a relational language based on concepts that would be familiar to a wider population of users.\u0022  When we moved to the San Jose Research Laboratory in 1973 to join the System R project, we began work on another new language that we called Sequel. Sequel allowed the well-paid-employee query to be represented in a readable form free from mathematical concepts and symbols. ... In 1977, because of a trademark issue, the name Sequel was shortened to SQL.  ------------------------------------- Ok, but how does it work? SQL uses keywords to pull (or \u0026ldquo;fetch\u0026rdquo;, \u0026ldquo;extract\u0026rdquo;) the data we want from a database. The computer reads those keywords in a specific order.\nFrom EverSQL we can get some more background:\n This is the logical order of operations, also known as the order of execution, for an SQL query:\n  FROM, including JOINs WHERE GROUP BY HAVING WINDOW functions SELECT DISTINCT UNION ORDER BY LIMIT and OFFSET   But the reality isn\u0026rsquo;t that easy nor straight forward. As we said, the SQL standard defines the order of execution for the different SQL query clauses. Said that, modern databases are already challenging that default order by applying some optimization tricks which might change the actual order of execution, though they must end up returning the same result as if they were running the query at the default execution order.\n For CSE 250: Don\u0026rsquo;t think too hard about optimization at this point. 
Let the database figure out the optimized routine.\nMost SQL queries are typed in the following pattern:\nSELECT -- \u0026lt;columns\u0026gt; and \u0026lt;column calculations\u0026gt; FROM -- \u0026lt;table name\u0026gt;  JOIN -- \u0026lt;table name\u0026gt;  ON -- \u0026lt;columns to join\u0026gt; WHERE -- \u0026lt;filter condition\u0026gt; GROUP BY -- \u0026lt;subsets for column calculations\u0026gt; HAVING -- \u0026lt;grouped filter condition\u0026gt; ORDER BY -- \u0026lt;how the output is returned in sequence\u0026gt; LIMIT -- \u0026lt;number of rows to return\u0026gt; \nProject 3 - what are our goals? Do we understand the questions being asked in Project 3?\nThe baseball data Let\u0026rsquo;s start exploring the baseball data!\n You\u0026rsquo;ll need to download the SQLite Database And review the data dictionary  import pandas as pd import sqlite3 con = sqlite3.connect(\u0026#39;lahmansbaseballdb.sqlite\u0026#39;) df = pd.read_sql_query(\u0026#34;SELECT * FROM fielding LIMIT 5\u0026#34;, con) df How can we see what tables are in the database?\nimport pandas as pd import sqlite3 con = sqlite3.connect(\u0026#39;lahmansbaseballdb.sqlite\u0026#39;) pd.read_sql_query(\u0026#34;\u0026#34;\u0026#34; SELECT name FROM sqlite_master WHERE type=\u0026#39;table\u0026#39; \u0026#34;\u0026#34;\u0026#34;, con) Understanding SQL queries Make sure you do the project readings!\nWhat table do we want to use?    q = \u0026#39;\u0026#39;\u0026#39; SELECT * FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What columns do we want to select?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What calculation do we want to perform?    
q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r, ab\/r FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; batting_calc = dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    What name do we give our calculated column?    q = \u0026#39;\u0026#39;\u0026#39; SELECT playerid, teamid, ab, r, ab\/r as runs_atbat FROM batting LIMIT 5 \u0026#39;\u0026#39;\u0026#39; batting_calc = dw.query(\u0026#39;byuidss\/cse-250-baseball-database\u0026#39;, q).dataframe    -------------------------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p3\/d1\/"},{value:"SQL \u0026 databases",label:"<p>Skill builder (relational database) For this skill builder, we are exploring some important topics in relational databases. This exercise will require you to create SQL queries through python. You may want to at least scan the readings before beginning this task since this serves as an assessment of your understanding of the assigned readings.\nA competent student should be able to finish the exercises within 75 minutes.\nBefore you start Make sure you have installed VS-code, pandas, and Altair on your computer.\nAlso make sure you have gone through the tutorial on under course materials called SQL for Data Science: we assume that you have a connection to your data.\nExercise 1 Readme file A database can consist of more than one table\/data set. A relational database consists of tables\/data sets that share columns. These shared columns then establish the relationship between the tables, thus the name relational database. 
The relations are sometimes not easily found and they require careful investigations.\nTo understand what is in a relational database, we can start with understanding the tables and the columns within.\nHere is a link to the readme file of the baseball database.\n What is the name of the table that records data about pitchers in the regular seasons?\n  What do the HR and HBP columns mean in that table respectively?\n Exercise 2 SELECT and FROM The simplest SQL query is a query with SELECT and FROM. These are the keywords you will see again and again in SQL. Usually, when constructing a more complex query, it is easier to identify what goes into these two clauses first.\n Create a query that shows all columns from the table you found in Exercise 1, save the dataframe in a variable \u0026ldquo;pitch\u0026rdquo;\n Your script should look something like:\nresult = pd.read_sql_query( \u0027SELECT _______ FROM _______\u0027, con) result Exercise 2 WHERE The WHERE keyword allows us to filter down the table horizontally (fewer rows).\nIt goes after SELECT and FROM.\n Using a SQL query, select all rows in the same table where HR is less than 10 and gs is greater than 25.\n  Find out what the columns mean and explain your query in words\n Exercise 3 ORDER BY ORDER BY sorts the table you select by one or more columns and goes after WHERE\n Using the same query in exercise 2, edit it so that the table is ordered by the year of the season (nearest to furthermost) and the player ID (alphabetically).\n Exercise 4 Joins Joins are used when you wish to create a new table through two different tables. 
Keep in mind that you have to identify the relationship between two tables before you can correctly join them.\nJOIN goes between FROM and WHERE.\n Identify the shared columns (keys) and join the table in exercise 2 with the salaries table, then filter the data so that it shows only pitchers in the year 1986.\n You should get a dataframe with 306 rows.\nExercise 5 Group by Group by is a keyword we use to lower the level of granularity of a table. Meaning we are combining rows into one by the given column(s).\nCreate a query that captures the number of pitchers the Washington Nationals used in each year, then sort the table by year\nYou should get a dataframe with 23 rows.\nFor the overachievers Excercise 6 Research the order of operations for SQL and put the following keywords in that order.\n SELECT FROM JOIN WHERE HAVING ORDER BY GROUP BY LIMIT  After you have completed this skill builder with your team (or on your own) then compare your work to our script    See the script.   </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/skill_builders\/relational_data\/"},{value:"Week 8-9: Project 4 - Homes",label:"<p></p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p4\/"},{value:"Machine Learning",label:"<p>Intro to Titanic Machine Learning Skill Builder Link to data\nFor this skill builder, we\u0026rsquo;ll be putting our machine learning hats on. We\u0026rsquo;ll be creating a model that predicts whether a passenger survived. With machine learning, there is a lot of jargon! It can be quite overwhelming at times. This skill builder attempts to keep things basic and simple. With that being said, there are some terms that are important to understand. 
Let\u0026rsquo;s look at the first few rows of our dataset before proceeding with the definitions.\nThe titanic dataset will be used for examples of each definition.\n   survived pclass sex age siblings_spouses_aboard parents_children_aboard fare     0 3 1 22 1 0 7.25   1 1 0 38 1 0 71.2833   1 3 0 26 0 0 7.925   1 1 0 35 1 0 53.1   0 3 1 35 0 0 8.05    Important Terms:  features: measurable property of the object you\u0026rsquo;re trying to predict. We use this information to predict our target of interest.  Example: pclass, sex, age, siblings_spouses_aboard , parents_children_aboard, fare columns are all examples of different features. Synonyms: attributes, explanatory variables, independent variables, variables, X\u0026rsquo;s, covariates   target: the feature that you are wanting to gain more insight into. The thing you are trying to predict.  Example: in the titanic dataset our target is survived Synonyms: label, dependent variable, y   train set: Usually 70% of the rows from the original dataset are randomly sampled to create this training data. It\u0026rsquo;s used by the algorithm, to determine, or learn, the optimal combinations of variables that will generate a good predictive model  Example: Random sample of 70% of the original titanic dataset rows Synonyms: training data, train data, X_train, y_train   test set: Usually the remaining 30% of the rows in the original dataset are used to create this dataset. The testing data is a set of rows used only to assess the performance (i.e. generalization) of a model. To do this, the final model is used to predict classifications of examples in the test set. Those predictions are compared to the examples\u0026rsquo; true classifications to assess the model\u0026rsquo;s accuracy.  Example: Random sample of 30% of the original titanic dataset rows Synonyms: testing data, test data, X_test, y_test   evaluation metrics: A statistic that tells you how well your predictions align with the actual values. 
In other words, it tells you how good your model is.  Example: Accuracy, Precision, Recall, MSE, MAE, Rsquared Synonyms: performance metric    Again, this is a very light and oversimplified treatment of machine learning. The purpose of this project is to help you understand the main concepts of ML and walk you through the process of building a machine learning model. A simplified work flow of a machine learning project is shown below. Spend some time getting familiar with this flow, as you are about to code it\u0026hellip; Exciting!\nNote: to do this skill builder you will need to have scikit-learn installed on your machine. Run the following command in your terminal if you haven\u0026rsquo;t already.\npip install scikit-learn\nData Link to csv file\nExercise 0 (Imports and Loading in data) # Loading in packages import pandas as pd import numpy as np import altair as alt from sklearn.model_selection import train_test_split from sklearn.ensemble import RandomForestClassifier from sklearn.metrics import accuracy_score # Loading in data data = pd.read_csv(___)  Exercise 1 Create a chart exploring the relationship between age and survived in the titanic dataset. A strip plot, density plot, or boxplot might be useful here. Below is an example of a density plot. Feel free to replicate this chart or create your own.\nThe purpose of making this chart is to explore the relationships between a feature and the target. We want to see if the feature contains predictive information about the target. This is a large part of machine learning called Exploratory Data Analysis that should never be skipped! Spend time getting to know your features and how they interact with other features and the target.\n Exercise 2 Build a random forest model that is able to predict whether a passenger survived. 
This exercise is the bulk of the skill builder and contains several steps.\nStep 0: Split the data into X and y variables The X variable will contain all your features\n# Removes the target and keeps all features X = data.drop(___, axis=1) The y variable will hold the target\n# Selects the target column y = data[\u0026#39;___\u0026#39;] Step 1: Split data into train and test sets The train_test_split function is useful for this task. Review the train_test_split function documentation\n# Splitting X and y variables into train and test sets using stratified sampling X_train, X_test, y_train, y_test = train_test_split(___, ___, test_size=0.3, random_state=24, stratify=y) Step 2: Train the model Explore the RandomForestClassifier documentation for the RandomForestClassifier. It\u0026rsquo;s not necessary to understand the inner workings of the Random Forest algorithm for this class - just learn the syntax of fitting the model.\n# Creating random forest object rf = RandomForestClassifier(random_state=24) # Fit with the training data rf.fit(___, ___) Step 3: Use test set to make predictions # Using the features in the test set to make predictions y_pred = rf.predict(___) Step 4: Compare test set predictions to actual values. Calculate the accuracy. # Comparing predictions to actual values accuracy_score(___, ___)  Exercise 3 What is the most important feature in making predictions? Why do you think this is?\nCreate a table that shows the feature importances in descending order. The random forest classifier has a feature importances attribute. It can be accessed by rf.feature_importances_. The table should look something like this.\n   feature names importances     fare 0.288051   sex 0.281853   age 0.266491   pclass 0.0814224   siblings_spouses_aboard 0.0475633   parents_children_aboard 0.034619    After you have completed this skill builder with your team (or on your own) then compare your work to our script    See the script.   
</p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/skill_builders\/ml_sklearn\/"},{value:"Week 6-7: Project 3 - Baseball",label:"<p> We will use a baseball relational database to explore SQL in Python for data science applications. Finding relationships in baseball\n Completed Readings: SQL for Data Science Readings (read all links) and Why SQL is beating NoSQL, and what this means for the future of data\nUse the data.world baseball url for the Data Connection. You can read the Connection Instructions for data.world here\nGrand Questions   Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID\/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.\n  This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)\n Write an SQL query that provides playerID, yearID, and batting average for players with at least one at bat. Sort the table from highest batting average to lowest, and show the top 5 results in your report. Use the same query as above, but only include players with more than 10 “at bats” that year. Print the top 5 results. Now calculate the batting average for players over their entire careers (all years combined). Only include players with more than 100 at bats, and print the top 5 results.    Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc.). Write an SQL query to get the data you need. Use Python if additional data wrangling is needed, then make a graph in Altair to visualize the comparison. 
Provide the visualization and the compiled Vega script that would build the visualization.\n   </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p3\/"},{value:"Munging data",label:"<p>Intro to cleaning movies data Link to the data\nThis skill builder focuses on munging (formatting) data into a machine learning ready dataset. We will be using an IMDB Ratings dataset. It contains columns that are categorical. Sklearn cannot handle columns that are strings, so we need to convert these into a numerical representation. We accomplish this by either one hot encoding, label encoding, or taking just one value of the range provided. There are many other ways to represent these columns as numbers, but they are beyond the scope of this course.\nOnce you\u0026rsquo;ve converted all columns to numeric, in an intelligent way, you will be asked to recreate a graph using altair. Here is the head of the data you will be working with. Enjoy!\n   star_rating content_rating genre duration box_office_rev major_hit     9.3 R Crime 142 €1924521976 - €1925521976 no   9.2 R Crime 175 €177034987 - €178034987 no   9.1 R Crime 200 €2617541398 - €2618541398 no   9 PG-13 Action 152 €996115723 - €997115723 no   8.9 R Crime 154 €1172054364 - €1173054364 no    Data Link to csv file: ...\n Exercise 0  Grab the high range value for each movie and put it into a new column called high_range_rev.  Make sure the data type of this new column is numeric!!   Remove the box_office_rev column from the dataset.  The .str.split() and .astype() methods might be of use! 
Also, to get the euro sign just copy it from here, €, and put it in your code.\nThe first 5 rows of the resulting dataframe should look like this\n   star_rating content_rating genre duration major_hit high_range_rev     9.3 R Crime 142 no 2345444803   9.2 R Crime 175 no 2182412593   9.1 R Crime 200 no 1604872807   9 PG-13 Action 152 no 284317976   8.9 R Crime 154 yes 1791932201     Exercise 1 Convert the major_hit column to 1\/0\u0026rsquo;s. yes -\u0026gt; 1 and no -\u0026gt; 0. Again, there are several ways to accomplish this. Using our old friend np.where is probably the easiest though.\nThe first 5 rows of the resulting dataframe should look like this\n   star_rating content_rating genre duration major_hit high_range_rev     9.3 R Crime 142 0 1925521976   9.2 R Crime 175 0 178034987   9.1 R Crime 200 0 2618541398   9 PG-13 Action 152 0 997115723   8.9 R Crime 154 0 1173054364     Exercise 2 Convert the content_rating column using label encoding. We\u0026rsquo;re using label encoding in this case because the movie ratings already have a natural ordering to them. We will replace each rating with a number in its natural ascending order.\nTo be more specific, here is how we will do it.\n G: 0 PG: 1 PG-13: 2 R: 3  A dictionary and the .map() method could be useful for this exercise. There are other ways of tackling this problem though. Be creative!\nThe first 5 rows of the resulting dataframe should look like\n   star_rating content_rating genre duration major_hit high_range_rev     9.3 3 Crime 142 0 1925521976   9.2 3 Crime 175 0 178034987   9.1 3 Crime 200 0 2618541398   9 2 Action 152 0 997115723   8.9 3 Crime 154 0 1173054364     Exercise 3 The last column that we need to take care of is genre. We will use one hot encoding for this. Make sure to ONLY one hot encode the genre column!\nA useful function for one hot encoding is pd.get_dummies(). 
I recommend checking out the documentation.\nThe resulting dataframe should look like the following example; don\u0026rsquo;t worry if your high_range_rev column turned into scientific notation—Pandas does this sometimes.\n    star_rating content_rating duration major_hit high_range_rev genre_Action genre_Adventure genre_Animation genre_Biography genre_Comedy genre_Crime genre_Drama genre_Family genre_Fantasy genre_Horror genre_Mystery genre_Sci-Fi genre_Thriller genre_Western     0 9.3 3 142 0 1.92552e\u002b09 0 0 0 0 0 1 0 0 0 0 0 0 0 0   1 9.2 3 175 0 1.78035e\u002b08 0 0 0 0 0 1 0 0 0 0 0 0 0 0   2 9.1 3 200 0 2.61854e\u002b09 0 0 0 0 0 1 0 0 0 0 0 0 0 0   3 9 2 152 0 9.97116e\u002b08 1 0 0 0 0 0 0 0 0 0 0 0 0 0   4 8.9 3 154 0 1.17305e\u002b09 0 0 0 0 0 1 0 0 0 0 0 0 0 0     Exercise 4 Recreate this graph as best you can. You\u0026rsquo;ll need to use the original data that specifies the actual rating.\nAfter you have completed this skill builder with your team (or on your own) then compare your work to our script    See the script.   </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/skill_builders\/munging\/"},{value:"Week 4-5: Project 2 - Flights",label:"<p>JSON files are the format of choice for sharing information and data between apps on the internet. When you hear someone explain that you can use an API to get the data, there is usually a JSON file involved. The history of JSON is worth reading. We will have another project analyzing data from JSON files that are missing values. Are we missing JSON on our flight?\n Completed Readings: P4DS: Chapter 5 Data tranformation, P4DS: Section 7.4 Missing Values, Python Data Science Handbook: Missing Data, How to Handle Missing Data, and Wikipedia Missing Data\n The flights JSON File\nand the Data Description\n ### Grand Questions  1. __Which airport has the worst delays? How did you choose to define \u0022worst\u0022? 
As part of your answer include a table that lists the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours, for each airport.__   2. __What is the worst month to fly if you want to avoid delays? Include one chart to help support your answer, with the x-axis ordered by month. You also need to explain and justify how you chose to handle the missing `Month` data.__   3. __According to the BTS website the Weather category only accounts for severe weather delays. Other “mild” weather delays are included as part of the NAS category and the Late-Arriving Aircraft category. Calculate the total number of flights delayed by weather (either severe or mild) using these two rules:__   1. __30% of all delayed flights in the Late-Arriving category are due to weather.__  2. __From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%.__   4. __Create a barplot showing the proportion of all flights that are delayed by weather at each airport. What do you learn from this graph (Careful to handle the missing `Late Aircraft` data correctly)?__   5. __Fix all of the varied `NA` types in the data and save the file back out in the same format that was provided. Provide one example from the file with the new `NA` values shown.__ --------------------------------------------------- </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p2\/"},{value:"GitHub and git",label:"<p>Complete the Hello World GitHub Guide\n</p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/skill_builders\/git_github\/"},{value:"Week 2-3: Project 1 - Names",label:"<p>We are going to start learning the pandas package while we explore the names data for our project. 
What is in a name?\n Completed Readings: Python for Data Science (P4DS): Data Visualization, P4DS: Graphics for Communication, P4DS: Markdown, P4DS: 5.2 Filter rows with .query()\n https:\/\/github.com\/byuidatascience\/data4names\/raw\/master\/data-raw\/names_year\/names_year.csv\n ### Grand Questions  1. __How does your name at your birth year compare to its use historically?__  1. __If you talked to someone named Brittany on the phone, what is your guess of their age?__  1. __Mary, Martha, Peter, and Paul are all Christian names. From 1920 - 2000, compare the name usage of each of the four names.__  1. __Think of a unique name from a famous movie. Plot that name and see whether increases line up with the movie release.__ ------------------------------ </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/p1\/"},{value:"Week 1: Introduction",label:"<p>  Introduction Project Syllabus   </p><p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/introduction\/"},{value:"DS250",label:"<p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/"},{value:"Frequently Asked Questions",label:"<p> What do you mean by data science programming?    Most likely, you have had 1-2 courses of programming before you have taken CSE 250. Unlike traditional computer science courses, CSE 250 uses Python in an interactive mode instead of building programs. The data provider usually has some big questions that need answering; however, there are hundreds of little issues and responses along the way. We use programming to facilitate this investigation.\nThere are similarities with User Experience Designers. In our case, we don\u0026rsquo;t get to ask users about their experience. We use programming to ask data about its background, and each data set has its own history. We want our analysis to mold to that experience. You can think of data science programming like a first date with your data. 
You can\u0026rsquo;t write one long program naive of the issues and nuances each living data set provides.\n   How does CSE 250 compare to CSE 350 or Math 335?    The two courses have similarities. You could think of CSE 250 as an introduction to data wrangling and visualization. Both classes use real-world data and are built around data science projects. There are some critical differences between the two courses.\n In this course, we use Python, and CSE 350 uses R. We are introducing the principles of data science programming in CSE 250. The course is only 2 credits. CSE 250 is intended to introduce visualization, wrangling, and modeling.     How does CSE 250 prepare me for CSE 350, Math 335 and CSE 450?    You will be comfortable with interactive programming and have an introduction to the principles of data formats for data science applications. You will be introduced to principles related to machine learning, data wrangling, and data visualization.   What programming languages do we use in this course?    The course is done using Python. We focus on the pandas and Altair packages.   What are the prerequisites for this course?    Using the new courses at BYU-I, the prerequisite is CSE 110. However, if you have experience programming from other classes, you most likely are prepared for this course.   Why Python instead of R?    The computer science and software engineering programs at BYU-I use Python in their foundational courses. The standard student will have some experience with Python before CSE 250. Python is an essential programming language for data scientists, and we already have CSE 350\/Math 335, which is taught in R.   What is pandas?    pandas is the foundational data science package in Python. If you are using tabular data you will be in pandas.   Why are we using Altair instead of Seaborn or Matplotlib?    Matplotlib was the first visualization package to gain a following in Python. Seaborn is built on top of Matplotlib. 
Many data scientists use both in their work, but neither leverages the grammar of graphics as developed by Leland Wilkinson. Altair is built on Vega-Lite, which uses the Vega visualization grammar. It is declarative and actively developed. We expect that it will become the predominant visualization package in Python (https:\/\/youtu.be\/FytuB8nFHPQ and https:\/\/youtu.be\/vTingdk_pVM).   </p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/faq\/"},{value:"Skill Builders",label:"<p>These short activities are provided for you to gain some additional skills to help with the class projects.\n</p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/skill_builders\/"},{value:"Slack",label:"<p>If you haven\u0026rsquo;t already, please join Slack. This will be a lifesaver.\nhttps:\/\/join.slack.com\/t\/byuidss\/signup\n</p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slack\/"},{value:"Slides",label:"<p>Use the navigation pane on the left to review the class slides.\n</p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/slides\/"},{value:"",label:"<p>Details Your coding challenge will help you demonstrate the skills you have developed this semester. Here are a few essential items.\n Your goal is to demonstrate your data science coding abilities. Get through as many items with a rough implementation as possible. Get your code to match our outputs as closely as possible, but don\u0026rsquo;t stress over minute details. Keep most of the code you type. If you end up not using specific parts, comment them out and include them at the bottom. Use the entire hour; you may not finish. Submit a .md and a .pdf report with your output and code for each challenge.  
Please use the challenge template to submit your work.\nimport pandas as pd import altair as alt import numpy as np from sklearn.model_selection import train_test_split from sklearn import tree from sklearn.ensemble import GradientBoostingClassifier from sklearn import metrics Challenge 1 Split Entry houses are a failed building experiment in the United States. Use the data from our Denver homes project, as shown below, to recreate the following graphic.\nurl = \u0026#39;https:\/\/github.com\/byuidatascience\/data4dwellings\/raw\/master\/data-raw\/dwellings_denver\/dwellings_denver.csv\u0026#39; dat_home = pd.read_csv(url).sample(n=4500, random_state=15) Challenge 2 Our computations can\u0026rsquo;t be done with missing values. Programmatically replace all the lost values with 125 and make a box-plot.\nmister = pd.Series([\u0026#34;lost\u0026#34;, 15, 22, 45, 31, \u0026#34;lost\u0026#34;, 85, 38, 129, 80, 21, 2]) Challenge 3 Our computations can\u0026rsquo;t be done with missing values. Programmatically replace all the lost values with 125 and report the mean rounded to two decimals.\nmister = pd.Series([\u0026#34;lost\u0026#34;, 15, 22, 45, 31, \u0026#34;lost\u0026#34;, 85, 38, 129, 80, 21, 2]) Challenge 4 Programmatically read in the following JSON file, keep only the cases column and return a markdown table that has country in the rows and cases for 1999 and 2000 in the columns. Your table will have six cells with values.\nurl = \u0026#39;https:\/\/github.com\/byuidatascience\/data4python4ds\/raw\/master\/data-raw\/table1\/table1.json\u0026#39; Challenge 5 Use our cleaned example of the star wars data from project 6 to predict the gender of the respondent to the survey. Report your precision and a feature importance plot.\n Use test_size = .20 and random_state = 2020 in train_test_split() Use the GradientBoostingClassifier() method.  
url = \u0026#34;http:\/\/byuistats.github.io\/CSE250-Course\/data\/clean_starwars.csv\u0026#34; dat = pd.read_csv(url) </p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/final_coding_challenge\/sp22\/"},{value:"Categories",label:"<p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/categories\/"},{value:"Final_coding_challenges",label:"<p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/final_coding_challenge\/"},{value:"Office Hours",label:"<p>Schedule a visit with Brother Cannon at an available time. https:\/\/calendly.com\/cannonp\n</p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/contact\/"},{value:"Tags",label:"<p></p>",url:"https:\/\/byuistats.github.io\/DS250-Cannon\/tags\/"},];$("#search").autocomplete({source:projects}).data("ui-autocomplete")._renderItem=function(ul,item){return $("<li>").append("<a href="+item.url+" + \" &quot;\" +  >"+item.value+"</a>"+item.label).appendTo(ul);};});</script></div></div></div></div></header><section class=section><div class=container><div class="row justify-content-center"><div class="col-12 text-center"><h2 class=section-title></h2></div><div class="col-lg-4 col-sm-6 mb-4"><a href=https://byuistats.github.io/DS250-Cannon/skill_builders/ class="px-4 py-5 bg-white shadow text-center d-block match-height"><i class="ti-ruler-pencil icon text-primary d-block mb-4"></i><h3 class="mb-3 mt-0">Skill Builders</h3><p class=mb-0>Build skills for the projects.</p></a></div><div class="col-lg-4 col-sm-6 mb-4"><a href=https://byuistats.github.io/DS250-Cannon/slack/ class="px-4 py-5 bg-white shadow text-center d-block match-height"><i class="https://img.shields.io/badge/slack-@oresoftware/npp-yellow.svg?logo=slack icon text-primary d-block mb-4"></i><h3 class="mb-3 mt-0">Slack</h3><p class=mb-0>Link to Slack signup</p></a></div><div class="col-lg-4 col-sm-6 mb-4"><a href=https://byuistats.github.io/DS250-Cannon/slides/ class="px-4 py-5 bg-white shadow text-center d-block match-height"><i 
class="ti-layout-slider-alt icon text-primary d-block mb-4"></i><h3 class="mb-3 mt-0">Slides</h3><p class=mb-0>Class material for every day.</p></a></div></div></div></section><footer class="section pb-4"><div class=container><div class="row align-items-center"><div class="col-md-8 text-md-left text-center"><p class="mb-md-0 mb-4">J. Hathaway and BYU-I ©</p></div><div class="col-md-4 text-md-right text-center"><ul class=list-inline><li class=list-inline-item><a class="text-color d-inline-block p-2" href=https://github.com/byuidatascience><i class=ti-github></i></a></li><li class=list-inline-item><a class="text-color d-inline-block p-2" href=https://www.linkedin.com/groups/13537407/><i class=ti-linkedin></i></a></li></ul></div></div></div></footer><script src=https://byuistats.github.io/DS250-Cannon/js/script.min.js></script></body></html>
\ No newline at end of file
diff --git a/slides/p4/d2/index.html b/slides/p4/d2/index.html
index cd5c9a4..a483aa3 100644
--- a/slides/p4/d2/index.html
+++ b/slides/p4/d2/index.html
@@ -3,7 +3,7 @@
 <span class=navbar-toggler-icon></span></button><div class="collapse navbar-collapse text-center" id=navigation><ul class="navbar-nav ml-auto"><li class=nav-item><a class="nav-link text-dark" href=/DS250-Cannon>Home</a></li><li class=nav-item><a class="nav-link text-dark" href=/DS250-Cannon/projects>Projects</a></li><li class=nav-item><a class="nav-link text-dark" href=/DS250-Cannon/contact>Contact</a></li><li class=nav-item><a class="nav-link text-dark" href=/DS250-Cannon/course-materials>Materials</a></li><li class="nav-item dropdown"><a class="nav-link dropdown-toggle text-dark" href=# role=button data-toggle=dropdown aria-haspopup=true aria-expanded=false>Navigate</a><div class=dropdown-menu><a class=dropdown-item href=/DS250-Cannon/slides>Slides</a>
 <a class=dropdown-item href=/DS250-Cannon/course-materials/syllabus/>Syllabus</a>
 <a class=dropdown-item href=/DS250-Cannon/faq>FAQ</a></div></li></ul></div></div></nav></header><section class="single section-sm pb-0"><div class=container><div class=row><div class=col-lg-3><div class=sidebar><ul class=list-styled><a class=back-btn href=/DS250-Cannon></a><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/ title=Slides class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/>Slides</a><ul><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p4/ title="Week 8-9: Project 4 - Homes" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p4/>Week 8-9: Project 4 - Homes</a><ul><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p4/d2/ title="Day 2: Intro to Machine Learning" class="sidelist
-active"><a href=https://byuistats.github.io/DS250-Cannon/slides/p4/d2/>Day 2: Intro to Machine Learning</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p4/d1/ title="Day 1: Intro to ML" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p4/d1/>Day 1: Intro to ML</a></li></ul></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p3/ title="Week 6-7: Project 3 - Baseball" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p3/>Week 6-7: Project 3 - Baseball</a><ul><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p3/d4/ title="Day 4: Practice Coding Challenge" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p3/d4/>Day 4: Practice Coding Challenge</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p3/d3/ title="Day 3: The end of baseball" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p3/d3/>Day 3: The end of baseball</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p3/d2/ title="Day 2: SQL Calculations" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p3/d2/>Day 2: SQL Calculations</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p3/d1/ title="Day 1: Intro to Project 3" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p3/d1/>Day 1: Intro to Project 3</a></li></ul></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p2/ title="Week 4-5: Project 2 - Flights" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p2/>Week 4-5: Project 2 - Flights</a><ul><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p2/d4/ title="Day 4: Exporting JSON" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p2/d4/>Day 4: Exporting JSON</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p2/d3/ title="Day 2B: Missing Data" class=sidelist><a 
href=https://byuistats.github.io/DS250-Cannon/slides/p2/d3/>Day 2B: Missing Data</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p2/d2/ title="Day 2: Transforming Data" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p2/d2/>Day 2: Transforming Data</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p2/d1/ title="Day 1: Intro to Flights Data" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p2/d1/>Day 1: Intro to Flights Data</a></li></ul></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p1/ title="Week 2-3: Project 1 - Names" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p1/>Week 2-3: Project 1 - Names</a><ul><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p1/d3/ title="Day 3: Making your name stand out" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p1/d3/>Day 3: Making your name stand out</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p1/d2/ title="Day 2: Seeing names with Altair" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p1/d2/>Day 2: Seeing names with Altair</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p1/d1/ title="Day 1: Exploring names with pandas" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p1/d1/>Day 1: Exploring names with pandas</a></li></ul></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/introduction/ title="Week 1: Introduction" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/introduction/>Week 1: Introduction</a><ul><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/introduction/day02/ title="Day 2: Project 0" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/introduction/day02/>Day 2: Project 0</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/introduction/day01/ title="Day 
1: Welcome" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/introduction/day01/>Day 1: Welcome</a></li></ul></li></ul></li></ul></div></div><div class=col-lg-9><div class="p-lg-5 p-4 bg-white"><h2 class=mb-5>Day 2: Intro to Machine Learning</h2><div class=content><h2 id=welcome-to-class>Welcome to class!</h2><h4 id=announcements>Announcements</h4><h4 id=spiritual-thought>Spiritual thought</h4><h5 id=are-facts-true>Are facts true?</h5><br><ul><li>How do you distinguish between truth and error?</li><li>Joshua and Caleb</li></ul><br><h2 id=building-a-decision-tree>Building a Decision Tree</h2><iframe width=560 height=315 src=https://www.youtube.com/embed/ZVR2Way4nwQ title="YouTube video player" frameborder=0 allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="card mb-4 rounded-0 shadow border-0"><div class="card-header rounded-0 bg-white border p-0 border-0"><a class="card-link h4 d-flex tex-dark mb-0 py-3 px-4 justify-content-between" data-toggle=collapse href=#import-packages><span>Import packages</span> <i class="ti-plus text-primary text-right"></i></a></div><div id=import-packages class=collapse data-parent=#accordion><div class="card-body font-secondary text-color"><h2 id=splitting-the-data>Splitting the Data</h2><h4 id=1-start-with-packages-and-data-set>1. Start with packages and data set</h4><p>We&rsquo;ll be using some parts of SKLEARN package and the Seaborn package.</p><div class=highlight><pre style=color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-python data-lang=python><span style=color:#75715e># If you haven&#39;t already, install scikit-learn and seaborn</span>
+active"><a href=https://byuistats.github.io/DS250-Cannon/slides/p4/d2/>Day 2: Intro to Machine Learning</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p4/d1/ title="Day 1: Intro to ML" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p4/d1/>Day 1: Intro to ML</a></li></ul></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p3/ title="Week 6-7: Project 3 - Baseball" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p3/>Week 6-7: Project 3 - Baseball</a><ul><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p3/d4/ title="Day 4: Practice Coding Challenge" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p3/d4/>Day 4: Practice Coding Challenge</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p3/d3/ title="Day 3: The end of baseball" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p3/d3/>Day 3: The end of baseball</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p3/d2/ title="Day 2: SQL Calculations" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p3/d2/>Day 2: SQL Calculations</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p3/d1/ title="Day 1: Intro to Project 3" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p3/d1/>Day 1: Intro to Project 3</a></li></ul></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p2/ title="Week 4-5: Project 2 - Flights" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p2/>Week 4-5: Project 2 - Flights</a><ul><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p2/d4/ title="Day 4: Exporting JSON" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p2/d4/>Day 4: Exporting JSON</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p2/d3/ title="Day 2B: Missing Data" class=sidelist><a 
href=https://byuistats.github.io/DS250-Cannon/slides/p2/d3/>Day 2B: Missing Data</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p2/d2/ title="Day 2: Transforming Data" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p2/d2/>Day 2: Transforming Data</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p2/d1/ title="Day 1: Intro to Flights Data" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p2/d1/>Day 1: Intro to Flights Data</a></li></ul></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p1/ title="Week 2-3: Project 1 - Names" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p1/>Week 2-3: Project 1 - Names</a><ul><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p1/d3/ title="Day 3: Making your name stand out" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p1/d3/>Day 3: Making your name stand out</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p1/d2/ title="Day 2: Seeing names with Altair" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p1/d2/>Day 2: Seeing names with Altair</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/p1/d1/ title="Day 1: Exploring names with pandas" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/p1/d1/>Day 1: Exploring names with pandas</a></li></ul></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/introduction/ title="Week 1: Introduction" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/introduction/>Week 1: Introduction</a><ul><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/introduction/day02/ title="Day 2: Project 0" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/introduction/day02/>Day 2: Project 0</a></li><li data-nav-id=https://byuistats.github.io/DS250-Cannon/slides/introduction/day01/ title="Day 
1: Welcome" class=sidelist><a href=https://byuistats.github.io/DS250-Cannon/slides/introduction/day01/>Day 1: Welcome</a></li></ul></li></ul></li></ul></div></div><div class=col-lg-9><div class="p-lg-5 p-4 bg-white"><h2 class=mb-5>Day 2: Intro to Machine Learning</h2><div class=content><h2 id=welcome-to-class>Welcome to class!</h2><p><img src=tropical-year-illustration.png alt="alt text"></p><p><a href=https://shire-reckoning.com/calendar.html>Shire Reckoning</a></p><h4 id=announcements>Announcements</h4><ol><li>Coding Challenge Practice - Thursday, March 7</li></ol><h4 id=spiritual-thought>Spiritual thought</h4><h5 id=are-facts-true>Are facts true?</h5><br><ul><li>How do you distinguish between truth and error?</li><li>Joshua and Caleb</li></ul><br><h2 id=building-a-decision-tree>Building a Decision Tree</h2><iframe width=560 height=315 src=https://www.youtube.com/embed/ZVR2Way4nwQ title="YouTube video player" frameborder=0 allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="card mb-4 rounded-0 shadow border-0"><div class="card-header rounded-0 bg-white border p-0 border-0"><a class="card-link h4 d-flex tex-dark mb-0 py-3 px-4 justify-content-between" data-toggle=collapse href=#import-packages><span>Import packages</span> <i class="ti-plus text-primary text-right"></i></a></div><div id=import-packages class=collapse data-parent=#accordion><div class="card-body font-secondary text-color"><h2 id=splitting-the-data>Splitting the Data</h2><h4 id=1-start-with-packages-and-data-set>1. Start with packages and data set</h4><p>We&rsquo;ll be using parts of the scikit-learn package and the Seaborn package.</p><div class=highlight><pre style=color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-python data-lang=python><span style=color:#75715e># If you haven&#39;t already, install scikit-learn and seaborn</span>
 pip install scikit<span style=color:#f92672>-</span>learn seaborn
 </code></pre></div><div class=highlight><pre style=color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-python data-lang=python><span style=color:#f92672>from</span> types <span style=color:#f92672>import</span> GeneratorType
 <span style=color:#f92672>import</span> pandas <span style=color:#f92672>as</span> pd
@@ -30,17 +30,20 @@
 </code></pre></div><h4 id=4-split-into-training-and-testing-sets>4. Split into training and testing sets</h4><h3 id=what-does-the-train_test_split-function-do>What does the &ldquo;train_test_split()&rdquo; function do?</h3><div class=highlight><pre style=color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-python data-lang=python>x_train, x_test, y_train, y_test <span style=color:#f92672>=</span> train_test_split(x, y, test_size <span style=color:#f92672>=</span> <span style=color:#75715e>#???, random_state = #???)</span>
 </code></pre></div><p><strong>Read the documentation and tell me what is returned?</strong></p><p><strong><a href=https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html>Function documentation</a></strong></p><blockquote><p>Why do we use &ldquo;test_size&rdquo; and &ldquo;random_state&rdquo;?</p></blockquote><blockquote><p>What is &ldquo;x&rdquo; and &ldquo;y&rdquo; in the above function example?</p></blockquote><p>We need to take our data and build the feature and target data objects.</p><blockquote><p>What columns should we remove from our features (X)?</p></blockquote><blockquote><p>What column should we use as our target (y)?</p></blockquote><div class=highlight><pre style=color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-python data-lang=python>x <span style=color:#f92672>=</span> dwellings_ml<span style=color:#f92672>.</span>filter([<span style=color:#75715e>#what variables will you use as &#34;features&#34;?])</span>
 y <span style=color:#f92672>=</span> dwellings_ml[<span style=color:#75715e>#what variable is the &#34;target&#34;?]</span>
-</code></pre></div><br><br><h2 id=training-a-classifier>Training a Classifier</h2><h4 id=decision-tree-example>Decision Tree Example</h4><div class=highlight><pre style=color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-python data-lang=python><span style=color:#75715e># create the model</span>
-classifier <span style=color:#f92672>=</span> DecisionTreeClassifier()
+</code></pre></div><br><br><h2 id=training-a-classifier>Training a Classifier</h2><h4 id=decision-tree-example>Decision Tree Example</h4><div class=highlight><pre style=color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-python data-lang=python>
+<span style=color:#75715e>#%%</span>
+<span style=color:#75715e># Create a decision tree</span>
+classifier_DT <span style=color:#f92672>=</span> DecisionTreeClassifier(max_depth <span style=color:#f92672>=</span> <span style=color:#ae81ff>4</span>)
 
-<span style=color:#75715e># train the model</span>
-classifier<span style=color:#f92672>.</span>fit(x_train, y_train)
+<span style=color:#75715e># Fit the decision tree</span>
+classifier_DT<span style=color:#f92672>.</span>fit(x_train, y_train)
 
-<span style=color:#75715e># make predictions</span>
-y_predictions <span style=color:#f92672>=</span> classifier<span style=color:#f92672>.</span>predict(x_test)
+<span style=color:#75715e># Test the decision tree (make predictions)</span>
+y_predicted_DT <span style=color:#f92672>=</span> classifier_DT<span style=color:#f92672>.</span>predict(x_test)
+
+<span style=color:#75715e># Evaluate the decision tree</span>
+<span style=color:#66d9ef>print</span>(<span style=color:#e6db74>&#34;Accuracy:&#34;</span>, metrics<span style=color:#f92672>.</span>accuracy_score(y_test, y_predicted_DT))
 
-<span style=color:#75715e># test how accurate predictions are</span>
-metrics<span style=color:#f92672>.</span>accuracy_score(y_test, y_predictions)
 </code></pre></div><h4 id=how-to-improve-accuracy>How to Improve Accuracy</h4><p>To improve the accuracy of your model, you could:</p><ul><li>Change what variables are used in the features (x) data set</li><li>Change what type of model you are using</li><li>Tune (aka, &ldquo;change&rdquo; or &ldquo;tweak&rdquo;) the parameters of the model</li></ul><h4 id=other-classification-models>Other Classification Models</h4><p>Here are some other models you could try.</p><div class=highlight><pre style=color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-python data-lang=python><span style=color:#f92672>from</span> sklearn.naive_bayes <span style=color:#f92672>import</span> GaussianNB
 <span style=color:#f92672>from</span> sklearn.ensemble <span style=color:#f92672>import</span> RandomForestClassifier
 <span style=color:#f92672>from</span> sklearn.ensemble <span style=color:#f92672>import</span> GradientBoostingClassifier