add quantize and a5 tip

rhodyprog4ds · Oct 4, 2024 · dcfe0ea · dcfe0ea
1 parent 7497890
commit dcfe0ea

Showing 2 changed files with 93 additions and 2 deletions.
diff --git a/assignments/05-construct.md b/assignments/05-construct.md
@@ -22,7 +22,7 @@ Skills:
 ## Constructing Datasets
 
 
-Your goal is to programmatically construct a ready to analyze dataset that combines information from multiple pages. This can be in a crawing fashion like we did for the CS people or by combining two tables with a merge. If you use a merge to meet the multiple sources criterion, only one source must be scraped, the second can be provided as tabular data.  
+Your goal is to programmatically construct a ready-to-analyze dataset that combines information from multiple sources. This can be done in a crawling fashion, like we did for the CS people, or by combining two tables with a merge. If you use a merge to meet the multiple sources criterion, only one source must be scraped; the second can be provided as tabular data.  
 
 
 ````{margin}
@@ -35,6 +35,7 @@ The notebook you submit should include:
 - a motivating question for why your are building the dataset you are building
 - code and description of how you built and prepared your dataset. For each step,  describe what you're about to do, the code with output, interpretation that leads into the next step.
 - exploratory data analysis that shows why you built the data and confirms that is prepared enough to analyze. 
+- code that saves your final dataset to a CSV file
 
 
 For construct only, this can be very minimal EDA.
@@ -53,6 +54,8 @@ To earn level 2 for prepare, you must manipulate either a component table or the
 - add a new column by computing from others
 - handle NaN values by dropping or filling
 - drop a column, row, or duplicates in another way
+- convert a continuous variable to a categorical one (the notes have an added section on [quantizing](quantize) that we did not cover in class, but it should be easy to follow)
+
 
 ### Summarize and Visualize level 2
 To earn level 2 for summarize and/or visualize, include additional analyses after building the datasets.
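
A minimal sketch of the merge route described in the assignment text above, for reference alongside this diff. The two small tables, their column names, and the in-memory buffer are all hypothetical stand-ins: `scraped` plays the role of data collected from web pages and `provided` the role of a table supplied as tabular data.

```python
import io
import pandas as pd

# Hypothetical tables: 'scraped' stands in for data collected from web pages,
# 'provided' for a second source supplied as tabular data
scraped = pd.DataFrame({'name': ['Ada', 'Grace', 'Alan'],
                        'title': ['Professor', 'Lecturer', 'Professor']})
provided = pd.DataFrame({'name': ['Ada', 'Grace', 'Edsger'],
                         'office': ['101', '202', '303']})

# An inner merge keeps only the rows whose key appears in both tables
combined = pd.merge(scraped, provided, on='name', how='inner')

# Written to an in-memory buffer here so the sketch leaves no file behind;
# passing a filename such as 'dataset.csv' instead would write to disk
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Passing `index=False` leaves the row index out of the saved file, so reading it back does not produce an extra unnamed column.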

diff --git a/notes/2024-09-26.md b/notes/2024-09-26.md
@@ -395,12 +395,100 @@ type(int(a))
 bag_df.head()
 ```
 
-pandas ahd
+We can pass `pd.concat` an iterable of pandas objects (here a `list` of `DataFrame`s) and it will, by default, stack them vertically, or with `axis=1` stack them horizontally.
 
 ```{code-cell} ipython3
 pd.concat([coffee_df,bag_df],axis=1)
 ```
 
+(quantize)=
+## Quantizing a variable
+
+Sometimes a variable is recorded as continuous, or close to it, but we want to analyze it as if it were categorical. Age in years is one example: integers are technically discrete, but over a wide enough range they do not behave like categories.  
+
+We can add a new variable that is *calculated* from the original one.  
+
+Let's say we want to categorize coffees as small, medium, or large batch size based on
+the quantiles of the `'Number.of.Bags'` column. 
+
+First, we get an idea of the distribution with EDA to make our plan: 
+```{code-cell} ipython3
+coffee_df_bags['Number.of.Bags'].describe()
+```
+
+
+```{code-cell} ipython3
+coffee_df_bags['Number.of.Bags'].hist()
+```
+We see that most batches are small, but there is at least one major outlier: 75% are below 275, but the max is 1062. 
+
+
+
+We can use `pd.cut` to convert the continuous values into discrete bins:
+
+```{code-cell} ipython3
+pd.cut(coffee_df_bags['Number.of.Bags'],bins=3).sample(10)
+```
+
+
+
+By default, it makes bins of equal width, meaning each bin spans the same range of values. Given the skew we noted above, this is a poor fit: most rows will fall into one bin.
+
+```{code-cell} ipython3
+pd.cut(coffee_df_bags['Number.of.Bags'],bins=3).hist()
+```
+
+To improve this, we can specify the bin edges instead of only the number of bins:
+
+```{code-cell} ipython3
+min_bags = coffee_df_bags['Number.of.Bags'].min()
+sm_cutoff = coffee_df_bags['Number.of.Bags'].quantile(.33)
+md_cutoff = coffee_df_bags['Number.of.Bags'].quantile(.66)
+max_bags = coffee_df_bags['Number.of.Bags'].max()
+pd.cut(coffee_df_bags['Number.of.Bags'],include_lowest=True, # keep the minimum in the first bin
+        bins=[min_bags,sm_cutoff,md_cutoff,max_bags]).head()
+```
+
+Here, we computed the cutoffs individually and passed them as a list to `pd.cut`.
+
+This is okay for 3 bins, but if we change our mind, it's a lot of work to add more. 
+A better approach is to compute the bin edges programmatically: 
+
+```{code-cell} ipython3
+[coffee_df_bags['Number.of.Bags'].quantile(pct) for pct in np.linspace(0,1,4)]
+```
+
+`np.linspace` returns a NumPy array of evenly (linearly; there is also `np.logspace`) spaced
+numbers from the start value to the end value, with as many numbers as you specify. Here we asked for 4 evenly spaced numbers from 0 to 1. 
+
+This is nearly the same as what we had before; the only difference is that `np.linspace` uses exactly 1/3 and 2/3 where we rounded to .33 and .66.
+
+```{code-cell} ipython3
+[min_bags,sm_cutoff,md_cutoff,max_bags]
+```
+
+Now we can use these edges and, optionally, switch to text labels. The labels must then be updated too if we change the 4 to another number, but that is still less work than above.
+
+```{code-cell} ipython3
+bag_num_bins = [coffee_df_bags['Number.of.Bags'].quantile(pct) for pct in np.linspace(0,1,4)]
+pd.cut(coffee_df_bags['Number.of.Bags'],include_lowest=True, # keep the minimum in the first bin
+        bins=bag_num_bins,labels = ['small','medium','large']).head()
+```
+
+We could then add this to the DataFrame as a new column to work with it:
+```{code-cell} ipython3
+
+```
+
+```{code-cell} ipython3
+
+```
+
+
+```{code-cell} ipython3
+
+```
+
 ```{code-cell} ipython3
 
 ```
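
The cells added after "add this to the dataframe" are empty in this commit, so here is a hedged sketch of what that step might look like. It is not the author's code: `demo_df` is a small made-up stand-in for `coffee_df_bags`, and the column names `batch_size` and `batch_size_qcut` are hypothetical. It also shows why `include_lowest=True` matters when the first bin edge is the minimum, and that `pd.qcut` can compute the quantile edges itself.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for coffee_df_bags; these values are for illustration only
demo_df = pd.DataFrame(
    {'Number.of.Bags': [1, 5, 10, 20, 50, 100, 250, 275, 1062]})

# Quantile-based bin edges: 4 edges define 3 bins
bag_num_bins = [demo_df['Number.of.Bags'].quantile(pct)
                for pct in np.linspace(0, 1, 4)]

# Bins are open on the left by default, so the minimum value falls outside
# the first bin and becomes NaN unless include_lowest=True is passed
without_lowest = pd.cut(demo_df['Number.of.Bags'], bins=bag_num_bins)
n_missing = without_lowest.isna().sum()

# Assigning the result to a new column name adds it to the DataFrame
demo_df['batch_size'] = pd.cut(demo_df['Number.of.Bags'],
                               bins=bag_num_bins,
                               labels=['small', 'medium', 'large'],
                               include_lowest=True)

# pd.qcut computes the quantile edges internally from the number of bins
demo_df['batch_size_qcut'] = pd.qcut(demo_df['Number.of.Bags'], q=3,
                                     labels=['small', 'medium', 'large'])

print(demo_df['batch_size'].value_counts())
```

With `pd.qcut`, changing the number of bins only requires changing `q` and the labels, which may be simpler than building the edges by hand when quantile bins are all that is needed.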