diff --git a/_toc.yml b/_toc.yml index 45b471c..a5fa518 100644 --- a/_toc.yml +++ b/_toc.yml @@ -68,4 +68,4 @@ parts: title: Advice from FA2020 Students - url: https://rhodyprog4ds.github.io/BrownFall21/letters/ title: Advice from FA2021 Students - - file: letters/index + # - file: letters/index diff --git a/assignments/04-prepare.md b/assignments/04-prepare.md index 4d2973e..a5c1d21 100644 --- a/assignments/04-prepare.md +++ b/assignments/04-prepare.md @@ -5,7 +5,7 @@ __Due: 2023-10-03__ Eligible skills: - prepare 1 -- access 1 +- access 2 - python 1,2 diff --git a/assignments/06-evaluate.md b/assignments/06-evaluate.md index 908f0fe..9ac6416 100644 --- a/assignments/06-evaluate.md +++ b/assignments/06-evaluate.md @@ -12,7 +12,7 @@ Eligible skills: ## Related notes -- [](../notes/2023-10-12) +- [](../notes/2024-10-10) diff --git a/notes/2024-09-26.md b/notes/2024-09-26.md index 47c87c8..bb9aee2 100644 --- a/notes/2024-09-26.md +++ b/notes/2024-09-26.md @@ -440,7 +440,7 @@ I would like to show a histogram here, but for somereason it broke. The output i ``` ```{code-cell} ipython3 -:tags:["hide-output"] +:tags: ["hide-input"] pd.cut(coffee_df_bags['Number.of.Bags'],bins=3).hist() ``` diff --git a/notes/2024-10-01.md b/notes/2024-10-01.md index 7c89eb8..07f459f 100644 --- a/notes/2024-10-01.md +++ b/notes/2024-10-01.md @@ -118,12 +118,20 @@ here I suppressed the output in class by looking only at the first few character cs_people_html[:100] ``` + But we do not need to manually write search tools, that's what [`BeautifulSoup`](https://beautiful-soup-4.readthedocs.io/en/latest/) is for. @@ -396,3 +404,11 @@ Technically you could manually edit a copy of it. Web scraping is *for* when the website is not in tabular form. It should be strucutred, but the structure does not need to come from a single page. It could be that there are many pages strucutred similarly and you build most of the columns from the other pages, not the starting page. For example from the [teams page of the nba](https://www.nba.com/teams) you can get to a page with info about each team that includes all time records and the current rosters. On these individual pages, most info is an actual table, so you can use `pd.read_html` for those, but the crawing part from the first page would count. + + +```{code-cell} ipython3 +:tags: ["hide-cell"] +# delete temp file +import os +# os.remove('cs_people.html') +``` \ No newline at end of file diff --git a/notes/cs_people.html b/notes/cs_people.html new file mode 100644 index 0000000..e69de29 diff --git a/resources/glossary.md b/resources/glossary.md index 64868bc..7d5843d 100644 --- a/resources/glossary.md +++ b/resources/glossary.md @@ -13,9 +13,11 @@ [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) a python library used to assist in web scraping, it pulls data from html and xml files that can be parsed in a variety of different ways using different methods. + conditional a logical control to do something, conditioned on something else, for example the `if`, `elif` `else` - + + corpus (NLP) a set of documents for analysis @@ -60,7 +62,7 @@ kernel in the jupyter environment, [the kernel](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html#kernel) is a language specific computational engine [lambda](https://docs.python.org/3.9/reference/expressions.html#lambda) - they keyword used to define an anonymous function; lambda functions are defined with a compact syntax ` = lambda : ` + they keyword used to define an anonymous function; lambda functions are defined with a compact syntax ` = lambda : ` numpy array a type provided by [numpy]() to represent matrices, used by `pd.DataFrame.values` [doc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.values.html) and accessed by `pd.DataFrame.to_numpy` [doc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html#pandas.DataFrame.to_numpy)