diff --git a/_toc.yml b/_toc.yml
index 730f710..68f107f 100644
--- a/_toc.yml
+++ b/_toc.yml
@@ -36,6 +36,7 @@ parts:
   - file: notes/2023-04-06
   - file: notes/2023-04-11
   - file: notes/2023-04-18
+  - file: notes/2023-04-20
 - caption: Assignments
   numbered: True
   chapters:
diff --git a/notes/2023-04-20.md b/notes/2023-04-20.md
new file mode 100644
index 0000000..a50dadf
--- /dev/null
+++ b/notes/2023-04-20.md
@@ -0,0 +1,217 @@
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
  jupytext_version: 1.14.1
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# ML with Text

```{code-cell} ipython3
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

ng_X, ng_y = datasets.fetch_20newsgroups(categories=['comp.graphics', 'sci.crypt'],
                                         return_X_y=True)
```

## News Data


```{code-cell} ipython3
datasets.fetch_20newsgroups()
```

The labels are the topic of the article: 0 = computer graphics, 1 = cryptography.

```{code-cell} ipython3
ng_y[:5]
```

The X is the actual text of the articles.

```{code-cell} ipython3
ng_X[0][:100]
```


We're going to instantiate the `CountVectorizer` object and fit it to the whole dataset.

```{code-cell} ipython3
count_vec = text.CountVectorizer()
ng_vec = count_vec.fit_transform(ng_X)
```

We split the raw text along with the count vectors so that we can refer back to the original documents later; fixing `random_state` also lets us line this split up with the tf-idf split below.

```{code-cell} ipython3
ng_X_train, ng_X_test, ng_vec_train, ng_vec_test, ng_y_train, ng_y_test = train_test_split(
    ng_X, ng_vec, ng_y, random_state=0)
```

```{code-cell} ipython3
ng_vec.shape
```

```{code-cell} ipython3
clf = MultinomialNB()
```


```{code-cell} ipython3
clf.fit(ng_vec_train, ng_y_train).score(ng_vec_test, ng_y_test)
```


This tells us that from the word counts alone we are able to distinguish computer graphics articles from cryptography articles very well.

## TF-IDF


The `TfidfVectorizer` accepts raw documents, while the `TfidfTransformer` accepts features (counts). We will again instantiate the object and then fit on the whole dataset.

The vectorizer can be applied directly to the raw text:

```{code-cell} ipython3
help(text.TfidfVectorizer)
```

The transformer can be applied to the counts, which we already have, so that is what we use here:

```{code-cell} ipython3
tfidf = text.TfidfTransformer()
```

```{code-cell} ipython3
ng_tfidf = tfidf.fit_transform(ng_vec)
```

```{code-cell} ipython3
# same random_state as above, so row i here is the same document as
# row i in the count-vector split
ng_tfidf_train, ng_tfidf_test, ng_y_train_tfidf, ng_y_test_tfidf = train_test_split(
    ng_tfidf, ng_y, random_state=0)
```

```{code-cell} ipython3
from sklearn import naive_bayes
```

```{code-cell} ipython3
gnb = naive_bayes.GaussianNB()
```

```{code-cell} ipython3
# GaussianNB does not accept sparse matrices, so we convert to dense arrays
gnb.fit(ng_tfidf_train.toarray(), ng_y_train_tfidf).score(
    ng_tfidf_test.toarray(), ng_y_test_tfidf)
```


## Comparing representations

+++

To start, we will look at one element from each in order to compare them. Because both splits used the same `random_state`, row 0 of each is the same document.

```{code-cell} ipython3
ng_tfidf_train[0]
```

```{code-cell} ipython3
ng_vec_train[0]
```

They both have the same number of stored elements; since they are two different representations of the same document, that makes sense. We can check a few others as well.

```{code-cell} ipython3
ng_tfidf_train[1]
```

```{code-cell} ipython3
ng_vec_train[1]
```

```{code-cell} ipython3
(ng_vec_train[4] > 0).sum() == (ng_tfidf_train[4] > 0).sum()
```
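
As a quick sanity check, we can confirm this holds for every document at once, not just the few rows we spot-checked. The check below relies only on the fact that the tf-idf transform rescales the counts, so it never creates or removes nonzero entries.

```{code-cell} ipython3
# every document should have nonzero entries in exactly the same columns
# in both representations, because tf-idf only rescales the counts
np.array_equal((ng_vec_train > 0).toarray(), (ng_tfidf_train > 0).toarray())
```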
Let's pick out a common word so that the tf-idf calculation is meaningful. To find a common word in the dictionary, we first filter the vocabulary to keep only the words that occur more than 300 times in the training set: we sum the count matrix along its columns, convert the result to an array, then enumerate it so that each total is paired with its column index, and keep the word and its index whenever the total is over 300. This turns out to be a fairly long list, so we only print the first 25. We print the index alongside each word so we can use it for the one we choose.

```{code-cell} ipython3
# sum the counts over documents (axis=0), then keep (word, column index)
# pairs for words that appear more than 300 times in the training set
[(count_vec.get_feature_names_out()[i], i) for i, n in
  enumerate(np.asarray(ng_vec_train.sum(axis=0))[0])
  if n > 300][:25]
```

Let's use computer.

```{code-cell} ipython3
computer_idx = 6786
count_vec.get_feature_names_out()[computer_idx]
```

```{code-cell} ipython3
ng_vec_train[:, computer_idx].toarray()[:10].T
```

```{code-cell} ipython3
ng_tfidf_train[:, computer_idx].toarray()[:10].T
```

We can see that they have nonzero elements in the same places, meaning that in both representations the column refers to the same word.

We can compare the raw text to the count representation:

```{code-cell} ipython3
len(ng_X_train[0].split())
```

```{code-cell} ipython3
ng_vec_train[0].sum()
```

We see that the row sum of the count matrix is the total number of words in the document, not the number of unique words. (The two numbers can differ slightly, because the vectorizer's default tokenizer drops punctuation and single-character tokens that `split` keeps.)

```{code-cell} ipython3
ng_tfidf_train[0].sum()
```

The tf-idf matrix, however, is normalized, which makes the sums much smaller. The row sums are not all the same, but they are much more similar to one another.

```{code-cell} ipython3
# sparse .sum(axis=1) returns a 2-d matrix; squeeze it into a 1-d array
sns.displot(np.asarray(ng_vec_train.sum(axis=1)).squeeze(), bins=20)
```

```{code-cell} ipython3
sns.displot(np.asarray(ng_tfidf_train.sum(axis=1)).squeeze(), bins=20)
```

We can see that the tf-idf normalization makes the totals across documents much smaller and more comparable.

+++

When we sum across words, we get to see how much total weight each document carries in each representation.

+++

From the documentation we see that the idf is not exactly the inverse of the document frequency; it is smoothed and shifted:

$$ \mathrm{idf}(t) = \log \frac{1 + n}{1 + \mathrm{df}(t)} + 1 $$

where $n$ is the number of documents and $\mathrm{df}(t)$ is the number of documents that contain the term $t$.

+++

In this implementation, each row is normalized as well, to keep the values small, so that documents of different lengths are more comparable.


## Questions After Class

### What are unique tokens vs total vocabulary?

The total vocabulary is the set of unique tokens across all documents. The number of values stored in the whole sparse matrix is the sum, over documents, of the number of unique tokens in each document; in that count, any word that appears in more than one document is counted more than once.
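
A small sketch makes this concrete; the three-document corpus here is made up just for illustration.

```{code-cell} ipython3
# made-up corpus: "the" appears in all three documents
docs = ["the cat sat", "the dog sat", "the cat and the dog"]
toy_vec = text.CountVectorizer()
toy_counts = toy_vec.fit_transform(docs)

# total vocabulary: unique tokens across all documents -> 5 (and, cat, dog, sat, the)
print(len(toy_vec.get_feature_names_out()))

# stored values: unique tokens per document, summed -> 3 + 3 + 4 = 10
print(toy_counts.nnz)
```

Here "the" contributes one entry to the vocabulary but three stored values, one for each document it appears in.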