Skip to content

Commit

Permalink
notes
Browse files Browse the repository at this point in the history
  • Loading branch information
brownsarahm committed Apr 21, 2023
1 parent 14247b9 commit 0723a9a
Show file tree
Hide file tree
Showing 2 changed files with 218 additions and 0 deletions.
1 change: 1 addition & 0 deletions _toc.yml
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@ parts:
- file: notes/2023-04-06
- file: notes/2023-04-11
- file: notes/2023-04-18
- file: notes/2023-04-20
- caption: Assignments
numbered: True
chapters:
Expand Down
217 changes: 217 additions & 0 deletions notes/2023-04-20.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,217 @@
---
jupytext:
text_representation:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.14.1
kernelspec:
display_name: Python 3
language: python
name: python3
---

# ML with Text

```{code-cell} ipython3
from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import euclidean_distances
from sklearn import datasets
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
ng_X,ng_y = datasets.fetch_20newsgroups(categories =['comp.graphics','sci.crypt'],
return_X_y = True)
```

## News Data


```{code-cell} ipython3
datasets.fetch_20newsgroups()
```

Lables are the topic of the article 0 = computer graphics, 1 = cyrptography.

```{code-cell} ipython3
ng_y[:5]
```

the X is he actual text

```{code-cell} ipython3
ng_X[0][:100]
```


We're going to instantiate the object and fit it two the whole dataset.

```{code-cell} ipython3
count_vec = text.CountVectorizer()
ng_vec = count_vec.fit_transform(ng_X)
```

```{code-cell} ipython3
ng_vec_train, ng_vec_test, ng_y_train, ng_y_test = train_test_split(ng_vec,ng_y)
```

```{code-cell} ipython3
ng_vec.shape
```

```{code-cell} ipython3
clf = MultinomialNB()
```


```{code-cell} ipython3
clf.fit(ng_vec_train,ng_y_train).score(ng_vec_test,ng_y_test)
```


THis tells us that from the word counts we are able to distinguish between computer graphics articles and cryptography articles very well.

## TF-IDF


We wanted the TfidfVectorizer, not the transformer, so that it accepts documents not features. We will again, instantiate the object and then fit on the whole dataset.

The vectorizer can be applied directly to the raw text

```{code-cell} ipython3
help(text.TfidfVectorizer)
```

The transformer can be applied to the counts.

```{code-cell} ipython3
tfidf = text.TfidfTransformer()
```

```{code-cell} ipython3
ng_tfidf = tfidf.fit_transform(ng_vec)
```

```{code-cell} ipython3
ng_vec_train_tfidf, ng_vec_test_tfidf, ng_y_train_tfidf, ng_y_test_tfidf = train_test_split(ng_tfidf,ng_y)
```

```{code-cell} ipython3
from sklearn import naive_bayes
```

```{code-cell} ipython3
gnb = naive_bayes.GaussianNB()
```

```{code-cell} ipython3
gnb.fit(ng_vec_train_tfidf.toarray(),ng_y_train_tfidf).score(ng_vec_test_tfidf.toarray(),ng_y_test_tfidf)
```

```{code-cell} ipython3
```


## Comparing representations

+++

To start, we will look at one element from each in order to compare them.

```{code-cell} ipython3
ng_tfidf_train[0]
```

```{code-cell} ipython3
ng_vec_train[0]
```

To start they both have 84 elements, since it is two different representations of the same document, that makes sense. We can check a few others as well

```{code-cell} ipython3
ng_tfidf_train[1]
```

```{code-cell} ipython3
ng_vec_train[1]
```

```{code-cell} ipython3
(ng_vec_train[4]>0).sum() == (ng_tfidf_train[4]>0).sum()
```

Let's pick out a common word so that the calculation is meaningful and do the tfidf calucation. To find a common word in the dictionary, we'll first filter the vocabulary to keep only the words that occur at least 300 times in the training set. We sum along the columns of the matrix, transform it to an array, then iterate over the sum, enumerated (assigning the number to each element of the sum) and use that to get the word out, if its total is over 300. I saw that this is actually a sort of long list, so I chose to only print out the first 25. We print them out with the index so we can use it for the one we choose.

```{code-cell} ipython3
[(count_vec.get_feature_names()[i],i) for i, n in
enumerate(np.asarray(ng_vec_train.sum(axis=0))[0])
if n>300][:25]
```

Let's use computer.

```{code-cell} ipython3
computer_idx = 6786
count_vec.get_feature_names()[computer_idx]
```

```{code-cell} ipython3
ng_vec_train[:,computer_idx].toarray()[:10].T
```

```{code-cell} ipython3
ng_tfidf_train[:,computer_idx].toarray()[:10].T
```

So, we can see they have non zero elements in the same places, meaning that in both representations the column refers to the same thing.

We can compare the untransformed to the count vectorizer:

```{code-cell} ipython3
len(ng_X_train[0].split())
```

```{code-cell} ipython3
ng_vec_train[0].sum()
```

We see that it is just a count of the number of total words, not unique words, but total

```{code-cell} ipython3
ng_tfidf_train[0].sum()
```

The tf-idf matrix, however is normalized to make the sums smaller. Each row is not the same, but it is more similar.

```{code-cell} ipython3
sns.displot(ng_vec_train.sum(axis=1),bins=20)
```

```{code-cell} ipython3
sns.displot(ng_tfidf_train.sum(axis=1),bins=20)
```

We can see that the `tf-idf` makes the totals across documents more spread out.

+++

When we sum across words, we then get to see how

+++

From the documentation we see that the idf is not exactly the inverse of the number of documents, it's also rescaled some.

$$ idf = \log \frac{1 +n }{1 +df} + 1 $$

+++

In this implementation, each row is the normalized as well, to keep them small, so that documents of different sizes are more comparable. For example, if we return to what we did last week.


## Questions After Class

### What are unique tokens vs total vocabulary?

the total vocabulary is the unique tokens across all documents. THe number of values stored in the whole matrix is the sume of the number of unique tokens across each word. In that number, any word that appears in more than one document is counted more than once.

0 comments on commit 0723a9a

Please sign in to comment.