deploy: 073e102

rhodyprog4ds · Oct 18, 2024 · 0c188f6
1 parent 7a4ec98
commit 0c188f6
Showing 78 changed files with 3,792 additions and 204 deletions.
diff --git a/_images/06ac9d5a79ccb8fddbceee8abac82997c1eaa5ba2f1f02111769fb1db878af0d.png b/_images/06ac9d5a79ccb8fddbceee8abac82997c1eaa5ba2f1f02111769fb1db878af0d.png
diff --git a/_images/0fa4db0830ff26984c1fe8d1c44a6367452a0e82d4950c0267d70b36b3ffadea.png b/_images/0fa4db0830ff26984c1fe8d1c44a6367452a0e82d4950c0267d70b36b3ffadea.png
diff --git a/_images/2be9b0df877a3152c37a0f62e2a8217ffae6dac94848bfb4c85700af3deaad08.png b/_images/2be9b0df877a3152c37a0f62e2a8217ffae6dac94848bfb4c85700af3deaad08.png
diff --git a/_images/30da1c3bf87862db9810d755b93eb0e249caf08d9f39042e013a0fcdec3afb3e.png b/_images/30da1c3bf87862db9810d755b93eb0e249caf08d9f39042e013a0fcdec3afb3e.png
diff --git a/_images/577bba368e0b273b6e8a0c62e6586ec72b4a8c84afff863b70415202769f4582.png b/_images/577bba368e0b273b6e8a0c62e6586ec72b4a8c84afff863b70415202769f4582.png
diff --git a/_images/5f1a84ec55676e8362b3c9cfe6d203099c8ca065ba193dc99fc22f4c3fe15375.png b/_images/5f1a84ec55676e8362b3c9cfe6d203099c8ca065ba193dc99fc22f4c3fe15375.png
diff --git a/_images/78ccae0b5b4cf8743bb7d1e0fae7f8d86ccefc8b97066bf2f4abf3d1ca7401cb.png b/_images/78ccae0b5b4cf8743bb7d1e0fae7f8d86ccefc8b97066bf2f4abf3d1ca7401cb.png
diff --git a/_images/8919a66c65870f264d5390311b60d0736e28af55c6f717e0675cc22c0a74bddd.png b/_images/8919a66c65870f264d5390311b60d0736e28af55c6f717e0675cc22c0a74bddd.png
diff --git a/_images/a318d2119336d94c638413cedc2cb1f37d2e690f22748b0f6672e555b1fd25bf.png b/_images/a318d2119336d94c638413cedc2cb1f37d2e690f22748b0f6672e555b1fd25bf.png
diff --git a/_images/c1a996b89b35fe1a49ae51461e73f52489eaaae0d3e3c339e4e84fb5cf6e4480.png b/_images/c1a996b89b35fe1a49ae51461e73f52489eaaae0d3e3c339e4e84fb5cf6e4480.png
diff --git a/_images/cb4ba6dd979009882fa9449a570a15db32ac73a50d2d04cc56875d2c768d4afc.png b/_images/cb4ba6dd979009882fa9449a570a15db32ac73a50d2d04cc56875d2c768d4afc.png
diff --git a/_images/d19e1393f89f5806de6d1ac50487cb41ad1279f4e86fcbd9b791cd7d77fd5e9b.png b/_images/d19e1393f89f5806de6d1ac50487cb41ad1279f4e86fcbd9b791cd7d77fd5e9b.png
diff --git a/_images/e46da72d78ef33239fd09a8e58186baec3c7286dfc0be10dd51966609d3d0251.png b/_images/e46da72d78ef33239fd09a8e58186baec3c7286dfc0be10dd51966609d3d0251.png
diff --git a/_images/f873423b5a1eae89748911b893ba6b7b9f19f75c052aa2fc00a4b598783728c3.png b/_images/f873423b5a1eae89748911b893ba6b7b9f19f75c052aa2fc00a4b598783728c3.png
diff --git a/_sources/assignments/04-prepare.md b/_sources/assignments/04-prepare.md
@@ -5,7 +5,7 @@ __Due: 2023-10-03__
 
 Eligible skills: 
 - prepare 1
-- access 1
+- access 2
 - python 1,2
 
 

diff --git a/_sources/assignments/06-audit.md → _sources/assignments/06-evaluate.md b/_sources/assignments/06-audit.md → _sources/assignments/06-evaluate.md
@@ -1,7 +1,7 @@
 # Assignment 6: Auditing Algorithms
 
 
-__Due: 2023-10-18_
+__Due: 2023-10-21__
 
 Eligible skills: 
 - evaluate level 1
@@ -12,7 +12,7 @@ Eligible skills:
 
 ## Related notes
 
-- [](../notes/2023-10-12)
+- [](../notes/2024-10-10)
 <!-- - [](../notes/2023-03-02) -->
 
 

diff --git a/_sources/assignments/07-classification.md b/_sources/assignments/07-classification.md
@@ -0,0 +1,108 @@
+# Assignment 7
+
+[accept the assignment](https://classroom.github.com/a/q-cpZN-M)
+
+__Due: 2023-10-28__
+
+
+Eligible skills: 
+- evaluate level 2
+- classification level 1,2
+- summarize 1,2
+- visualize 1,2
+
+## Related notes
+
+- [](../notes/2024-10-17)
+<!-- - [](../notes/2023-10-19) -->
+
+::::{important}
+There is a large extra section in the notes that should be useful for this assignment. 
+
+You can use Gaussian Naive Bayes **or** a Decision Tree for the assignment. 
+::::
+
+## Dataset and EDA
+
+
+Choose a dataset that is well suited for classification and that has *all numerical features*.
+If you want to use a dataset with non-numerical features, you will have to convert
+the categorical features to numerical ones with one-hot encoding.  
+
+```{hint}
+Use the [UCI ML repository](https://archive.ics.uci.edu/datasets); it lets you filter datasets by the attributes you need. 
+```
+
+1. Include a basic description of the data (what the features are)
+1. Describe the classification task in your own words
+1. Use EDA to determine if you expect the classification to get a high accuracy or not. What types of mistakes do you think will happen most (think about the confusion matrix)? 
+1. Hypothesize which classifier from the notes will do better and why you think that. Does the data meet the assumptions of Naive Bayes? What is important about this classifier for this application? 
+
+```{important}
+ You will get to reuse the above, and this dataset, for the clustering assignment *and* optionally one or both of A10 and A11. 
+```
+
+## Basic Classification
+
+1. Fit your chosen classifier with the default parameters on 80% of the data
+1. Inspect the model to answer the questions appropriate to your model.
+
+    - Does this model make sense?
+    - (if DT) Are there any leaves that are very small?
+    - (if DT) Is this an interpretable number of levels?
+    - (if GNB) Do the parameters fit the data well? Or do the parameters generate similar synthetic data? (You can answer statistically only, or with synthetic data and a plot.)
+1. Test it on 20% held out data and generate a classification report
+2. Interpret the model and its performance in terms of the application in order to give a recommendation: "would you deploy this model?" Example questions to consider in your response include:
+
+  - do you think this model is good enough to use for real?
+  - is this a model you would trust?
+  - do you think that a more complex model should be used?
+  - do you think that maybe this task cannot be done with machine learning?
+
+:::{note}
+You need to give a thorough answer to the deployment question and these bulleted questions will help you create a thorough response. 
+:::
+
+## Exploring Problem Setups
+
+```{important}
+Understanding the impact of test/train size is a part of classification and helps with evaluation.  This exercise is *also* a chance at python level 2.
+```
+
+````{margin}
+```{tip}
+The summary statistics and visualization we used before are useful for helping to
+investigate the performance of our model.  We can try fitting a model  with different settings
+to create a new "dataset" for our experiments.
+The same skills apply.
+```
+
+
+```{hint}
+The most important thing about the max depth here is that it's the same across all of the models. If you get an error, try making it smaller.
+```
+
+````
+Do an experiment to compare test set size vs performance:
+1. Use a loop to train a model on 10%, 30%, ... , 90% of the data. Compute the {term}`training accuracy` and test accuracy for each training set size. Create a DataFrame with columns ['train_pct','n_train_samples','n_test_samples','train_acc','test_acc']
+2. Use EDA on this data frame to interpret the results of your experiment.  How does training vs test size impact the model's performance? Does it impact training and test accuracy the same way? 
+
+
+
+
+```{admonition} Thinking Ahead
+_ideas for level 3 evaluate, not required for A7_
+
+Repeat the problem setup experiment with multiple test/train splits at each size and plot with error bars.
+- What is the tradeoff to be made in choosing a test/train size?
+- What is the best test/train size for this dataset?
+
+or with variations:
+- allowing it to figure out the model depth for each training size, and recording the depth in the loop as well.  
+- repeating each size 10 times, then using summary statistics on that data
+
+Use the extensions above to experiment further with other model parameters.
+
+**some of this we'll learn how to automate in a few weeks, but getting the
+ideas by doing it yourself can help build understanding and intuition**
+```
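
Review note on the "Exploring Problem Setups" section of `07-classification.md` above: the experiment it describes can be sketched roughly as below. This is a minimal sketch and not part of the commit. It assumes scikit-learn's built-in iris data as a stand-in dataset, a decision tree with `max_depth=3` held fixed across sizes (as the hint suggests), and a fixed `random_state`; all of those choices are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption): all-numerical features, 150 samples, 3 classes
X, y = load_iris(return_X_y=True)

rows = []
for train_pct in [0.1, 0.3, 0.5, 0.7, 0.9]:
    # Split so that train_pct of the samples are used for training
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_pct, random_state=0, stratify=y)

    # Same max depth for every training size so the runs are comparable
    dt = DecisionTreeClassifier(max_depth=3, random_state=0)
    dt.fit(X_train, y_train)

    # One row of the experiment "dataset" per training size
    rows.append({
        'train_pct': train_pct,
        'n_train_samples': len(X_train),
        'n_test_samples': len(X_test),
        'train_acc': dt.score(X_train, y_train),
        'test_acc': dt.score(X_test, y_test),
    })

results_df = pd.DataFrame(rows)
print(results_df)
```

Collecting one dictionary per iteration and building the DataFrame once at the end keeps the loop simple, and the resulting frame can be summarized or plotted with the same EDA tools used earlier in the course.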
diff --git a/_sources/notes/2024-09-26.md b/_sources/notes/2024-09-26.md
@@ -435,7 +435,12 @@ pd.cut(coffee_df_bags['Number.of.Bags'],bins=3).sample(10)
 
 by default, it makes bins of equal size, meaning the range of values. This is not good based on what we noted above. Most will be in one label
 
+```{note}
+I would like to show a histogram here, but for some reason it broke. The output is hidden for now. 
+```
+
 ```{code-cell} ipython3
+:tags: ["hide-input"]
 pd.cut(coffee_df_bags['Number.of.Bags'],bins=3).hist()
 ```
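
Review note: the equal-width behavior described in this hunk can be seen on a small example. The sketch below is not part of the commit; it uses a made-up skewed series as a hypothetical stand-in for `coffee_df_bags['Number.of.Bags']`, and it contrasts `pd.cut` with `pd.qcut`, which is one possible alternative and not necessarily the approach the notes go on to take.

```python
import pandas as pd

# Hypothetical skewed stand-in for coffee_df_bags['Number.of.Bags']
bags = pd.Series([1, 2, 2, 3, 5, 8, 10, 12, 15, 20, 250, 300])

# pd.cut with an integer splits the value range into equal-width intervals,
# so a few large values leave most observations in the first bin
equal_width = pd.cut(bags, bins=3)

# pd.qcut instead chooses edges at quantiles, so each bin holds
# roughly the same number of observations
equal_count = pd.qcut(bags, q=3)

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

With this toy series, the equal-width bins put 10 of the 12 values in the lowest interval, while the quantile bins hold 4 values each.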
 

diff --git a/_sources/notes/2024-10-01.md b/_sources/notes/2024-10-01.md
@@ -118,12 +118,20 @@ here I suppressed the output in class by looking only at the first few character
 cs_people_html[:100]
 ```
 
+<!-- 
+```{code-cell} ipython3
+:tags: ["hide-cell"]
+# save html to a file to read it in as parts via notebook features
+with open('cs_people.html','w') as f:
+    f.write(cs_people_html)
+```
 
-```{literalinclude} https://web.uri.edu/cs/people/
+```{literalinclude} cs_people.html
 :start-at: Department Chair
 :end-before: Directors
 :lineno-match:
 ```
+ -->
 
 
 But we do not need to manually write search tools, that's what [`BeautifulSoup`](https://beautiful-soup-4.readthedocs.io/en/latest/) is for.
@@ -396,3 +404,11 @@ Technically you could manually edit a copy of it.
 Web scraping is *for* when the website is not in tabular form.  It should be structured, but the structure does not need to come from a single page.  It could be that there are many pages structured similarly and you build most of the columns from the other pages, not the starting page. 
 
 For example, from the [teams page of the nba](https://www.nba.com/teams) you can get to a page with info about each team that includes all-time records and the current rosters. On these individual pages, most info is an actual table, so you can use `pd.read_html` for those, but the crawling part from the first page would count. 
+
+
+```{code-cell} ipython3
+:tags: ["hide-cell"]
+# delete temp file
+import os
+# os.remove('cs_people.html')
+```
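
Review note: since the `os.remove` call is commented out here, one option for the save, excerpt, and clean-up steps is a temporary directory that deletes itself. The sketch below is not part of the commit and uses only the standard library. The HTML string is a hypothetical stand-in for `cs_people_html`, and the marker search only approximates what `literalinclude` does with `:start-at:` and `:end-before:`.

```python
import os
import tempfile

# Hypothetical stand-in for the scraped page text (cs_people_html in the notes)
cs_people_html = "\n".join([
    "<h2>Faculty</h2>",
    "<h3>Department Chair</h3>",
    "<p>Example Person</p>",
    "<h3>Directors</h3>",
    "<p>Another Person</p>",
])

# The temporary directory and everything in it are removed when the block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    html_path = os.path.join(tmp_dir, 'cs_people.html')
    with open(html_path, 'w') as f:
        f.write(cs_people_html)

    with open(html_path) as f:
        lines = f.read().splitlines()

    # Keep lines from the first one containing the start marker up to,
    # but not including, the first later line containing the end marker
    start = next(i for i, line in enumerate(lines) if 'Department Chair' in line)
    end = next(i for i, line in enumerate(lines) if i > start and 'Directors' in line)
    excerpt = lines[start:end]

# The file no longer exists once the with-block has exited
file_was_removed = not os.path.exists(html_path)
print(excerpt)
```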