add quantize and a5 tip
brownsarahm committed Oct 4, 2024
1 parent 7497890 commit dcfe0ea
Showing 2 changed files with 93 additions and 2 deletions.
5 changes: 4 additions & 1 deletion assignments/05-construct.md
@@ -22,7 +22,7 @@ Skills:
## Constructing Datasets


Your goal is to programmatically construct a ready-to-analyze dataset that combines information from multiple pages. This can be in a crawling fashion like we did for the CS people or by combining two tables with a merge. If you use a merge to meet the multiple sources criterion, only one source must be scraped; the second can be provided as tabular data.
Your goal is to programmatically construct a ready-to-analyze dataset that combines information from multiple sources. This can be in a crawling fashion like we did for the CS people or by combining two tables with a merge. If you use a merge to meet the multiple sources criterion, only one source must be scraped; the second can be provided as tabular data.


````{margin}
@@ -35,6 +35,7 @@ The notebook you submit should include:
- a motivating question for why you are building the dataset you are building
- code and description of how you built and prepared your dataset. For each step, describe what you're about to do, show the code with its output, and give an interpretation that leads into the next step.
- exploratory data analysis that shows why you built the data and confirms that it is prepared enough to analyze.
- also save your dataset to CSV


For construct only, this can be very minimal EDA.
@@ -53,6 +54,8 @@ To earn level 2 for prepare, you must manipulate either a component table or the
- add a new column by computing from others
- handle NaN values by dropping or filling
- drop a column, row, or duplicates in another way
- change a continuous value to categorical (there is an added section in the notes on [quantizing](quantize) that we did not do in class, but should be easy to follow)


### Summarize and Visualize level 2
To earn level 2 for summarize and/or visualize, include additional analyses after building the datasets.
90 changes: 89 additions & 1 deletion notes/2024-09-26.md
@@ -395,12 +395,100 @@ type(int(a))
bag_df.head()
```

We can pass `pd.concat` an iterable of pandas objects (here a `list` of `DataFrame`s) and it will, by default, stack them vertically, or with `axis=1` stack them horizontally.

```{code-cell} ipython3
pd.concat([coffee_df,bag_df],axis=1)
```
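To make the two stacking directions concrete, here is a minimal sketch on two tiny made-up tables (the names `left` and `right` and their values are illustrative, not from the coffee data):

```python
import pandas as pd

# Two small made-up tables with the same columns and the same index (0, 1)
left = pd.DataFrame({"country": ["Ethiopia", "Brazil"], "score": [89.9, 85.1]})
right = pd.DataFrame({"country": ["Kenya", "Peru"], "score": [88.0, 84.2]})

# Default (axis=0): rows of `right` are placed below the rows of `left`
stacked_rows = pd.concat([left, right])

# axis=1: columns of `right` are placed beside the columns of `left`,
# aligned on the index
stacked_cols = pd.concat([left, right], axis=1)

print(stacked_rows.shape)  # (4, 2)
print(stacked_cols.shape)  # (2, 4)
```

Note that the vertical stack keeps each table's original index, so the index values repeat unless you pass `ignore_index=True`.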

(quantize)=
## Quantizing a variable

Sometimes a variable is recorded as continuous, or close to it, but we want to analyze it as if it were categorical. Age in years is an example: integers are technically discrete, but over a wide enough range they do not behave like categories.

We can add a new variable that is *calculated* from the original one.

Let's say we want to categorize coffees as small, medium, or large batch size based on
the quantiles of the `'Number.of.Bags'` column.

First, we get an idea of the distribution with EDA to make our plan:
```{code-cell} ipython3
coffee_df_bags['Number.of.Bags'].describe()
```


```{code-cell} ipython3
coffee_df_bags['Number.of.Bags'].hist()
```
We see that most batches are small, but there is at least one major outlier: 75% of the values are below 275, but the max is 1062.



We can use `pd.cut` to convert the continuous values into discrete bins:

```{code-cell} ipython3
pd.cut(coffee_df_bags['Number.of.Bags'],bins=3).sample(10)
```



By default, `pd.cut` makes bins of equal width, meaning each bin spans the same range of values. This is not a good fit given what we noted above: most values will end up in one bin.

```{code-cell} ipython3
pd.cut(coffee_df_bags['Number.of.Bags'],bins=3).hist()
```
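The same effect can be seen on a small scale. The sketch below uses a made-up, right-skewed series as a stand-in for `'Number.of.Bags'` (the values are assumptions chosen only to mimic one large outlier):

```python
import pandas as pd

# Made-up, right-skewed values standing in for 'Number.of.Bags'
bags = pd.Series([1, 5, 10, 20, 50, 100, 150, 250, 275, 1062])

# bins=3 splits the range (1 to 1062) into three intervals of equal width
equal_width = pd.cut(bags, bins=3)

# Count how many values land in each interval, in interval order
counts = equal_width.value_counts(sort=False)
print(counts)
```

Nine of the ten values fall into the lowest interval, because the single outlier stretches the range that the three bins have to cover.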

To make it better, we can specify the bin edges instead of only the number of bins:

```{code-cell} ipython3
min_bags = coffee_df_bags['Number.of.Bags'].min()
sm_cutoff = coffee_df_bags['Number.of.Bags'].quantile(.33)
md_cutoff = coffee_df_bags['Number.of.Bags'].quantile(.66)
max_bags = coffee_df_bags['Number.of.Bags'].max()
# include_lowest=True keeps the minimum value inside the first bin
pd.cut(coffee_df_bags['Number.of.Bags'],
       bins=[min_bags,sm_cutoff,md_cutoff,max_bags],
       include_lowest=True).head()
```

Here, we computed the cutoffs individually and passed them as a list to `pd.cut`.
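One detail worth knowing when passing explicit edges: `pd.cut` intervals are closed on the right and open on the left by default, so a value exactly equal to the lowest edge falls outside the first bin unless `include_lowest=True` is set. A minimal sketch on made-up data (the series `bags` is an assumption, not the coffee data):

```python
import pandas as pd

# Made-up stand-in for the 'Number.of.Bags' column
bags = pd.Series([1, 5, 10, 20, 50, 100, 150, 250, 275, 1062])
edges = [bags.min(), bags.quantile(.33), bags.quantile(.66), bags.max()]

# Default: the first interval is open on the left, so the minimum value (1)
# is not inside it and becomes NaN
default_cut = pd.cut(bags, bins=edges)

# include_lowest=True closes the first interval on the left
inclusive_cut = pd.cut(bags, bins=edges, include_lowest=True)

print(default_cut.isna().sum())    # 1
print(inclusive_cut.isna().sum())  # 0
```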

This is okay for 3 bins, but if we change our mind, it's a lot of work to make more.
It is better to compute the bin edges programmatically:

```{code-cell} ipython3
[coffee_df_bags['Number.of.Bags'].quantile(pct) for pct in np.linspace(0,1,4)]
```

`np.linspace` returns a NumPy array of evenly (linearly; there is also `np.logspace`) spaced
numbers from the start value to the end value, with as many numbers as you specify. Here we asked for 4 evenly spaced values from 0 to 1.
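As a quick illustration of how the third argument controls the result (the variable names here are only for the example):

```python
import numpy as np

# 4 evenly spaced values from 0 to 1, endpoints included:
# these are the edges for 3 bins
thirds = np.linspace(0, 1, 4)
print(thirds)

# Asking for 5 values gives the edges for 4 bins (quartiles) instead
quarters = np.linspace(0, 1, 5)
print(quarters)
```

In general, `n` bins need `n + 1` edges, which is why 3 size categories use `np.linspace(0, 1, 4)`.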

This is the same as we had before (up to rounding error, since we used .33 and .66 rather than exact thirds):

```{code-cell} ipython3
[min_bags,sm_cutoff,md_cutoff,max_bags]
```

Now we can use these and, optionally, change to text labels. The labels then have to be updated too if we change the number 4 to another number, but that is still less work than above.

```{code-cell} ipython3
bag_num_bins = [coffee_df_bags['Number.of.Bags'].quantile(pct) for pct in np.linspace(0,1,4)]
# include_lowest=True keeps the minimum value inside the 'small' bin
pd.cut(coffee_df_bags['Number.of.Bags'],
       bins=bag_num_bins,labels = ['small','medium','large'],
       include_lowest=True).head()
```
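As an aside that was not covered in these notes, pandas also provides `pd.qcut`, which computes quantile-based edges itself. A minimal sketch on made-up data (the series `bags` is an assumption standing in for `'Number.of.Bags'`):

```python
import pandas as pd

# Made-up stand-in for the 'Number.of.Bags' column
bags = pd.Series([1, 5, 10, 20, 50, 100, 150, 250, 275, 1062])

# q=3 asks for three quantile-based bins; labels name them in order
sizes = pd.qcut(bags, q=3, labels=["small", "medium", "large"])

# Each label gets roughly a third of the values
print(sizes.value_counts(sort=False))
```

Computing the edges by hand, as above, is still useful when you want to inspect or reuse the cutoffs.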

We could then add this to the `DataFrame` as a new column to work with it.
```{code-cell} ipython3
```

```{code-cell} ipython3
```


```{code-cell} ipython3
```

```{code-cell} ipython3
```
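Adding the binned values as a new column could look like the sketch below. The small `DataFrame` and the column name `batch_size` are made up for illustration; with the real data the same assignment would be applied to `coffee_df_bags`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for coffee_df_bags
df = pd.DataFrame({"Number.of.Bags": [1, 5, 10, 20, 50, 100, 150, 250, 275, 1062]})

# Quantile-based edges for 3 bins, computed as in the notes
edges = [df["Number.of.Bags"].quantile(pct) for pct in np.linspace(0, 1, 4)]

# Store the categorical version alongside the original column
# ('batch_size' is a hypothetical column name)
df["batch_size"] = pd.cut(df["Number.of.Bags"], bins=edges,
                          labels=["small", "medium", "large"],
                          include_lowest=True)

# The new column can now be used like any other categorical variable
print(df.groupby("batch_size", observed=True)["Number.of.Bags"].mean())
```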