From 790efb29c3e9e5d4c2cd9dcf2321868136b5e52e Mon Sep 17 00:00:00 2001
From: Sarah M Brown
Date: Thu, 17 Oct 2024 20:27:33 -0400
Subject: [PATCH] extension on a6, a7, notes on classification

---
 _toc.yml                         |   2 +-
 assignments/06-evaluate.md       |   2 +-
 assignments/07-classification.md | 108 ++++++
 notes/2024-10-17.md              | 567 +++++++++++++++++++++++++++++++
 resources/glossary.md            |   9 +
 5 files changed, 686 insertions(+), 2 deletions(-)
 create mode 100644 assignments/07-classification.md
 create mode 100644 notes/2024-10-17.md

diff --git a/_toc.yml b/_toc.yml
index ef3d090..45b471c 100644
--- a/_toc.yml
+++ b/_toc.yml
@@ -34,7 +34,7 @@ parts:
   - file: assignments/04-prepare
   - file: assignments/05-construct
   - file: assignments/06-evaluate
-  # - file: assignments/07-classification
+  - file: assignments/07-classification
   # - file: assignments/08-clustering
   # - file: assignments/09-regression
   # - file: assignments/10-optimize
diff --git a/assignments/06-evaluate.md b/assignments/06-evaluate.md
index c0b7a7a..908f0fe 100644
--- a/assignments/06-evaluate.md
+++ b/assignments/06-evaluate.md
@@ -1,7 +1,7 @@
 # Assignment 6: Auditing Algorithms

-__Due: 2023-10-18_
+__Due: 2024-10-21__

 Eligible skills:
 - evaluate level 1
diff --git a/assignments/07-classification.md b/assignments/07-classification.md
new file mode 100644
index 0000000..d398524
--- /dev/null
+++ b/assignments/07-classification.md
@@ -0,0 +1,108 @@
# Assignment 7

[accept the assignment](https://classroom.github.com/a/q-cpZN-M)

__Due: 2024-10-28__


Eligible skills:
- evaluate level 2
- classification levels 1, 2
- summarize levels 1, 2
- visualize levels 1, 2

## Related notes

- [](../notes/2024-10-17)


::::{important}
There is a large extra section in the notes that should be of use for this assignment.

You can use Gaussian Naive Bayes **or** a decision tree for the assignment.
::::

## Dataset and EDA


Choose a dataset that is well suited for classification and that has *all numerical features*.
If you want to use a dataset with non-numerical features, you will have to convert
the categorical features to numerical ones with one hot encoding (see the sketch at the end of this section).

```{hint}
Use the [UCI ML repository](https://archive.ics.uci.edu/datasets); it will let you filter datasets by the attributes you need.
```

1. Include a basic description of the data (what the features are)
1. Describe the classification task in your own words
1. Use EDA to determine if you expect the classification to get a high accuracy or not. What types of mistakes do you think will happen most (think about the confusion matrix)?
1. Hypothesize which classifier from the notes will do better and why you think that. Does the data meet the assumptions of Naive Bayes? What is important about this classifier for this application?

```{important}
You will get to reuse the above, and this dataset, for the clustering assignment *and* optionally one or both of A10 and A11.
```
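For the one hot encoding, here is a minimal sketch using `pandas.get_dummies` on a made-up dataset (the `color` and `length` columns are hypothetical, just for illustration):

```python
import pandas as pd

# hypothetical data: one categorical feature and one numerical feature
toy_df = pd.DataFrame({'color': ['red', 'blue', 'red'],
                       'length': [4.2, 3.1, 5.0]})

# one hot encoding: 'color' becomes one 0/1 column per category
pd.get_dummies(toy_df, columns=['color'], dtype=int)
```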
## Basic Classification

1. Fit your chosen classifier with the default parameters on 80% of the data
1. Inspect the model to answer the questions appropriate to your model.

   - Does this model make sense?
   - (if DT) Are there any leaves that are very small?
   - (if DT) Is this an interpretable number of levels?
   - (if GNB) Do the parameters fit the data well? Do the parameters generate synthetic data that is similar to the real data (you can answer statistically only, or with synthetic data & a plot)?
1. Test it on the 20% held-out data and generate a classification report
1. Interpret the model and its performance in terms of the application in order to give a recommendation: "would you deploy this model?" Example questions to consider in your response include:

   - do you think this model is good enough to use for real?
   - is this a model you would trust?
   - do you think that a more complex model should be used?
   - do you think that maybe this task cannot be done with machine learning?

:::{note}
You need to give a thorough answer to the deployment question, and these bulleted questions will help you create a thorough response.
:::

## Exploring Problem Setups

```{important}
Understanding the impact of test/train size is a part of classification and helps with evaluation. This exercise is *also* a chance at python level 2.
```

````{margin}
```{tip}
The summary statistics and visualization we used before are useful for helping to
investigate the performance of our model. We can try fitting a model with different settings
to create a new "dataset" for our experiments.
The same skills apply.
```


```{hint}
The most important thing about the max depth here is that it's the same across all of the models. If you get an error, try making it smaller.
```

````
Do an experiment to compare test set size vs performance:
1. Use a loop to train a model on 10%, 30%, ... , 90% of the data. Compute the {term}`training accuracy` and test accuracy for each training size. Create a DataFrame with columns ['train_pct','n_train_samples','n_test_samples','train_acc','test_acc']
2. Use EDA on this DataFrame to interpret the results of your experiment. How does training vs test size impact the model's performance? Does it impact training and test accuracy the same way?




```{admonition} Thinking Ahead
_ideas for level 3 evaluate, not required for A7_

Repeat the problem setup experiment with multiple test/train splits at each size and plot with error bars.
- What is the tradeoff to be made in choosing a test/train size?
- What is the best test/train size for this dataset?

or with variations:
- allowing it to figure out the model depth for each training size, and recording the depth in the loop as well.
- repeating each size 10 times, then using summary statistics on that data

Use the extensions above to experiment further with other model parameters.
**some of this we'll learn how to automate in a few weeks, but getting the
ideas by doing it yourself can help build understanding and intuition**
```
diff --git a/notes/2024-10-17.md b/notes/2024-10-17.md
new file mode 100644
index 0000000..5a68918
--- /dev/null
+++ b/notes/2024-10-17.md
@@ -0,0 +1,567 @@
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
  jupytext_version: 1.16.4
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Classification Models

Our first machine learning model: Naive Bayes


```{code-cell} ipython3
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
iris_df = sns.load_dataset('iris')
```

To start, we will look at the data:
```{code-cell} ipython3
iris_df.head()
```



![image of an iris with the petal width, petal length, sepal length and sepal width annotated](https://www.integratedots.com/wp-content/uploads/2019/06/iris_petal-sepal-e1560211020463.png)

We're trying to build an automatic flower classifier that, given the measurements of a new flower, returns the predicted species. To do this, we have a DataFrame with columns for species, petal width, petal length, sepal length, and sepal width. The species is what type of flower it is; the petal and sepal are parts of the flower.

:::::{margin}
:::{tip}
In class, I printed the column names to copy them into the variables below. This cell is hidden because it is not necessary for the narrative structure of our analysis, but it was useful for creating the next cell.
:::
:::::

```{code-cell} ipython3
:tags: ['hide-cell']
iris_df.columns
```

The species will be the target and the measurements will be the features. We want to predict the target from the features, the species from the measurements.


```{code-cell} ipython3
feature_vars = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
target_var = 'species'
```


## What does Naive Bayes do?

[docs](https://scikit-learn.org/stable/modules/naive_bayes.html)


Naive = independent features
Bayes = most probable

[Bayes Estimator](https://en.wikipedia.org/wiki/Bayes_estimator)



We can look at this data using a pair plot. It plots each pair of numerical variables in a grid of scatterplots, and on the diagonal (where it would be a variable with itself) shows the distribution of that variable.
```{code-cell} ipython3
sns.pairplot(data=iris_df, hue=target_var)
```

This data is reasonably **separable** because the different species (indicated with colors in the plot) do not overlap much. We see that the features are distributed sort of like a normal, or Gaussian, distribution. In 2D a Gaussian distribution is like a hill, so we expect to see more points near the center and fewer at the edge of circle-ish blobs. These blobs are slightly oval, but not too skewed.

This means that the assumptions of the Gaussian Naive Bayes model are met well enough that we can expect the classifier to do well.



## Separating Training and Test Data

To do machine learning, we split the data both sample-wise (rows if tidy) and variable-wise (columns if tidy). First, we'll designate the columns to use as features and as the target.

The features are the input that we wish to use to predict the target.
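For example, we can quickly confirm those selections by checking the shape of the features and how many samples there are of each species:

```{code-cell} ipython3
iris_df[feature_vars].shape, iris_df[target_var].value_counts()
```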
Next, we'll use a sklearn function to split the data randomly into test and train portions.
````{margin}
```{note}
Here I set the random state. This means that the site will always show the same result, even when this notebook is run over and over again.

Try downloading it (or adding `random_state` to your own code) and running it on your own computer.
```
````


```{code-cell} ipython3
X_train, X_test, y_train, y_test = train_test_split(iris_df[feature_vars],
                                                    iris_df[target_var],
                                                    random_state=5)
```


This function returns multiple values; the docs say that it returns [twice as many](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#:~:text=splittinglist%2C%20length%3D2%20*%20len(arrays)) as it is passed. We passed two separate things, the features and the labels, so we get train and test portions of each.

```{note}
If you get different numbers for the index than I do here, or you run the train test split multiple times and see things change, you have a different random seed above.
```

```{code-cell} ipython3
X_train.head()
```

```{code-cell} ipython3
X_train.shape, X_test.shape
```


We can see what fraction of the samples it puts in the training set by default:

```{code-cell} ipython3
len(X_train)/len(iris_df)
```

So by default we get a 75-25 split.
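If we want a different proportion, we can pass the `test_size` (or `train_size`) parameter; as a sketch, here is the 80-20 split that the assignment asks for (the `80` suffix on the names is just for illustration):

```{code-cell} ipython3
X_train80, X_test80, y_train80, y_test80 = train_test_split(iris_df[feature_vars],
                                                            iris_df[target_var],
                                                            test_size=0.2,
                                                            random_state=5)
len(X_train80)/len(iris_df)
```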
## Instantiating our Model Object

Next we will instantiate the object for our *model*. In `sklearn` they call these objects [estimators](https://scikit-learn.org/stable/tutorial/statistical_inference/settings.html#estimators-objects). All estimators have a similar usage. First we instantiate the object and set any *hyperparameters*.

Instantiating the object says we are assuming a particular type of model, in this case [Gaussian Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html). This sets several assumptions in one form:
- we assume data are Gaussian (normally) distributed
- the features are uncorrelated/independent (Naive)
- the best way to predict is to find the highest probability (Bayes)

this is one example of a [Bayes Estimator](https://en.wikipedia.org/wiki/Bayes_estimator)


````{margin}
```{admonition} Further Reading
_All of this is beyond the scope of this course, but may be of interest_

The Scikit Learn [User Guide](https://scikit-learn.org/stable/user_guide.html) is
a really good place to learn the details of machine learning. It is
high quality documentation, from both a statistical and computer science
perspective, of every element of the library.

The sklearn [API](https://scikit-learn.org/stable/modules/classes.html) describes
how the library is structured and organized. Because the library is so popular
(and it's pretty well architected from a software perspective as well), if you
are developing new machine learning techniques it's good to make them sklearn
compatible.

For example, IBM's [AIF360](https://aif360.readthedocs.io/en/latest/index.html) is a package for doing fair machine learning which has a
[sklearn compatible interface](https://aif360.readthedocs.io/en/latest/modules/sklearn.html). Scikit Learn documentation also includes a
[related projects](https://scikit-learn.org/stable/related_projects.html) page.
```
````


```{code-cell} ipython3
gnb = GaussianNB()
```

At this point the object is not very interesting:

```{code-cell} ipython3
gnb.__dict__
```



The fit method uses the data to learn the model's parameters. In this case, a Gaussian distribution is characterized by a mean and a variance, so the GNB classifier is characterized by one mean and one variance for each feature, for each class (4 of each per class here, since our data has 4 features).


```{code-cell} ipython3
gnb.fit(X_train, y_train)
```


The attributes of the [estimator object](https://scikit-learn.org/stable/glossary.html#term-estimators) (`gnb`) describe the data (eg the class list) and the model's parameters. The `theta_` (often in math as $\theta$ or $\mu$)
represents the means and the `var_` ($\sigma^2$) represents the variances of the
distributions.
```{code-cell} ipython3
gnb.__dict__
```

### Scoring a model

Estimator objects also have a score method. If the estimator is a classifier, that score is accuracy. We will see that for other types of estimators it is a different metric.

```{code-cell} ipython3
gnb.score(X_test, y_test)
```

## Making model predictions

We can predict for each sample as well:
```{code-cell} ipython3
y_pred = gnb.predict(X_test)
```

:::::{important}
At the end of class I tried to demo this and got an error.
:::::

We can also do one single sample; the `iloc` attribute lets us pick out rows by
integer index, even if that is not the actual index of the DataFrame:
```{code-cell} ipython3
X_test.iloc[0]
```
but if we pick one row, it returns a Series, which is incompatible with the predict method.


```{code-cell} ipython3
:tags: ["raises-exception"]
gnb.predict(X_test.iloc[0])
```

If we select with a range that only includes one row, it still returns a DataFrame:

```{code-cell} ipython3
X_test.iloc[0:1]
```

which we can get a prediction for:

```{code-cell} ipython3
gnb.predict(X_test.iloc[0:1])
```

We could also transform with `to_frame` and then {term}`transpose` with [`T`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.T.html#pandas.DataFrame.T) or ([`transpose`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html)):
```{code-cell} ipython3
gnb.predict(X_test.iloc[0].to_frame().T)
```

We can also pass a 2D array (list of lists) with values in it (here I typed in values similar to the mean for setosa above):
```{code-cell} ipython3
gnb.predict([[5.1, 3.6, 1.5, 0.25]])
```

This way it warns us that the feature names are missing, but it still gives a prediction.

### Evaluating Performance in more detail

```{code-cell} ipython3
confusion_matrix(y_test, y_pred)
```

This plain array is a little hard to read, but we can make it a DataFrame with labels to read it better.


```{code-cell} ipython3
n_classes = len(gnb.classes_)
prediction_labels = [['predicted class']*n_classes, gnb.classes_]
actual_labels = [['true class']*n_classes, gnb.classes_]
conf_mat = confusion_matrix(y_test, y_pred)
conf_df = pd.DataFrame(data=conf_mat, index=actual_labels, columns=prediction_labels)
conf_df
```

```{code-cell} ipython3
:tags: ['hide-cell']

from myst_nb import glue
c1 = gnb.classes_[1]
c2 = gnb.classes_[2]
conf12 = conf_mat[1][2]
glue('c1', c1)
glue('c2', c2)
glue('f1t2', conf12)
```

{glue}`f1t2` flowers were mistakenly classified as {glue}`c2` when they were really {glue}`c1`.

A more detailed report is also available:
```{code-cell} ipython3
print(classification_report(y_test, y_pred))
```
It includes a few metrics:

- Recall is the percent of each species that were predicted correctly.
- Precision is the percent of the ones predicted to be in a species that are truly that species.
- The F1 score is a combination of the two.

We see we have perfect recall and precision for setosa, as above, but we have lower values for the other two because there were mistakes where versicolor and virginica were mixed up.
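We can check these definitions directly from the confusion matrix above; as a sketch, recall divides the diagonal (the correct predictions) by the row sums (the true number of samples in each class), and precision divides the diagonal by the column sums (the number predicted to be in each class):

```{code-cell} ipython3
# rows of conf_mat are the true classes, columns are the predicted classes
recall_per_class = conf_mat.diagonal()/conf_mat.sum(axis=1)
precision_per_class = conf_mat.diagonal()/conf_mat.sum(axis=0)
recall_per_class, precision_per_class
```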
## What does a generative model mean?

Gaussian Naive Bayes is a very simple model, but it is a {term}`generative` model (in contrast to a {term}`discriminative` model), so we can use it to generate synthetic data that looks like the real data, based on what the model learned.




```{code-cell} ipython3
N = 20
gnb_df = pd.DataFrame(np.concatenate([np.random.multivariate_normal(th, sig*np.eye(4), N)
                                      for th, sig in zip(gnb.theta_, gnb.var_)]),
                      columns=gnb.feature_names_in_)
gnb_df['species'] = [ci for cl in [[c]*N for c in gnb.classes_] for ci in cl]
sns.pairplot(data=gnb_df, hue='species')
```

To break this code down:

We extract the mean and variance parameters from the model
(`gnb.theta_`, `gnb.var_`) and `zip` them together to create an iterable object
that in each iteration returns one value from each list (`for th, sig in zip(gnb.theta_, gnb.var_)`).
We do this inside of a list comprehension, and for each `th, sig` where `th` is
from `gnb.theta_` and `sig` is from `gnb.var_`, we use `np.random.multivariate_normal`
to get 20 samples. In a general [multivariate normal distribution](https://en.wikipedia.org/wiki/Multivariate_normal_distribution) the second parameter is actually a covariance
matrix. This describes both the variance of each individual feature and the
correlation of the features. Since Naive Bayes is naive, it assumes the features
are independent, or have 0 correlation. So, to create the matrix from the vector
of variances, we multiply by `np.eye(4)`, which is the identity matrix, or a matrix
with 1 on the diagonal and 0 elsewhere. Finally, we stack the groups for each
species together with `np.concatenate` (like `pd.concat`, but it works on numpy objects,
and `np.random.multivariate_normal` returns numpy arrays, not DataFrames) and put all of that in a
DataFrame using the feature names as the columns.

Then we add a species column by repeating each species 20 times
(`[c]*N for c in gnb.classes_`) and then unpacking that into a single flat list instead of
a list of lists.
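As a sketch, here is the same sampling written step by step with a loop; it should produce an equivalent DataFrame (the names `samples` and `gnb_df_loop` are just for illustration):

```{code-cell} ipython3
samples = []
for mean, var, species in zip(gnb.theta_, gnb.var_, gnb.classes_):
    # diagonal covariance, because Naive Bayes assumes the features are independent
    cov = var*np.eye(len(mean))
    species_df = pd.DataFrame(np.random.multivariate_normal(mean, cov, N),
                              columns=gnb.feature_names_in_)
    species_df['species'] = species
    samples.append(species_df)
gnb_df_loop = pd.concat(samples)
gnb_df_loop.head()
```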
## How does it make the predictions?

It computes the probability for each class and then predicts the highest one:

```{code-cell} ipython3
gnb.predict_proba(X_test)
```
We can also plot these:
```{code-cell} ipython3
# make the probabilities into a dataframe labeled with classes & make the index a separate column
prob_df = pd.DataFrame(data=gnb.predict_proba(X_test), columns=gnb.classes_).reset_index()
# add the predictions and the ground truth
prob_df['predicted_species'] = y_pred
prob_df['true_species'] = y_test.values
# for plotting, make a column that combines the index & prediction
pred_text = lambda r: str(r['index']) + ',' + r['predicted_species']
prob_df['i,pred'] = prob_df.apply(pred_text, axis=1)
# same for the ground truth
true_text = lambda r: str(r['index']) + ',' + r['true_species']
prob_df['i,true'] = prob_df.apply(true_text, axis=1)
# add a column for which predictions are correct
prob_df['correct'] = prob_df['predicted_species'] == prob_df['true_species']
# reshape long, with one row per sample per class, for plotting
prob_df_melted = prob_df.melt(id_vars=['index', 'predicted_species', 'true_species', 'i,pred', 'i,true', 'correct'],
                              value_vars=gnb.classes_,
                              var_name=target_var, value_name='probability')
prob_df_melted.head()
```

and then we can plot this:

::::{margin}
:::{note}
I added `hue='species'` here to make this more readable.
:::
::::

```{code-cell} ipython3
# plot a bar graph for each point labeled with the prediction
sns.catplot(data=prob_df_melted, x='species', y='probability', col='i,true',
            col_wrap=5, kind='bar', hue='species')
```

## What if the assumptions are not met?



Using a toy dataset here shows an easy-to-see challenge for the classifier that we have seen so far. Real datasets will be hard in different ways, and, since they're higher dimensional, it will be harder to visualize the cause.


```{code-cell} ipython3
corner_data = 'https://raw.githubusercontent.com/rhodyprog4ds/06-naive-bayes/f425ba121cc0c4dd8bcaa7ebb2ff0b40b0b03bff/data/dataset6.csv'
df6 = pd.read_csv(corner_data, usecols=[1, 2, 3])
sns.pairplot(data=df6, hue='char', hue_order=['A', 'B'])
```

As we can see in this dataset, the classes are quite separated. We give the splits new names here so that we do not overwrite the iris splits, which we use again at the end of these notes.

```{code-cell} ipython3
X_train6, X_test6, y_train6, y_test6 = train_test_split(df6[['x0', 'x1']],
                                                        df6['char'],
                                                        random_state=4)
```


```{code-cell} ipython3
gnb_corners = GaussianNB()
gnb_corners.fit(X_train6, y_train6)
gnb_corners.score(X_test6, y_test6)
```

But we do not get a very good classification score.

To see why, we can look at what it learned.


```{code-cell} ipython3
N = 100
gnb_df6 = pd.DataFrame(np.concatenate([np.random.multivariate_normal(th, sig*np.eye(2), N)
                                       for th, sig in zip(gnb_corners.theta_, gnb_corners.var_)]),
                       columns=['x0', 'x1'])
gnb_df6['char'] = [ci for cl in [[c]*N for c in gnb_corners.classes_] for ci in cl]

sns.pairplot(data=gnb_df6, hue='char', hue_order=['A', 'B'])
```

This does not look much like the data, and it's hard to tell which class is more probable at any given point in the 2D space. We know, though, that it has missed the mark. We can also look at the actual predictions:

```{code-cell} ipython3
df6_pred = X_test6.copy()
df6_pred['pred'] = gnb_corners.predict(X_test6)
```

```{code-cell} ipython3
sns.pairplot(data=df6_pred, hue='pred', hue_order=['A', 'B'])
```

If you try this again (split, fit, plot), it will learn different decisions, but always at least about 25% of the data will have to be classified incorrectly.
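As a sketch, we can check that claim by repeating the split and the fit with different random seeds and looking at summary statistics of the scores (the names `seed` and `corner_scores` are just for illustration):

```{code-cell} ipython3
corner_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(df6[['x0', 'x1']], df6['char'],
                                              random_state=seed)
    corner_scores.append(GaussianNB().fit(X_tr, y_tr).score(X_te, y_te))
pd.Series(corner_scores).describe()
```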
## Decision Trees

This data does not fit the assumptions of the Naive Bayes model, but a decision tree uses a different rule. It can be more complex, but the scikit-learn one relies on splitting the data at a series of points along one feature at a time, sequentially. It basically learns a flowchart for deciding what class a sample belongs to at test time.

It is a {term}`discriminative` model, because it describes how to discriminate (in the sense of differentiate) between the classes. This is in contrast to the {term}`generative` model that describes how the data is distributed.

That said, sklearn makes using new classifiers *really easy* once you have learned one. All of the [classifiers](https://scikit-learn.org/1.5/glossary.html#term-classifiers) have the same API (same methods and attributes).

::::{margin}
:::{tip}
They also provide a [guide for developing your own object](https://scikit-learn.org/1.5/developers/develop.html). If this is something you are interested in, this is a good research topic that could be done in a CSC499 project that I would be happy to supervise.

Their [glossary](https://scikit-learn.org/1.5/glossary.html#glossary-estimator-types) also can help disambiguate these different terms.
:::
::::

```{code-cell} ipython3
from sklearn import tree
import matplotlib.pyplot as plt

dt = tree.DecisionTreeClassifier()
dt.fit(X_train6, y_train6)
dt.score(X_test6, y_test6)
```

The sklearn estimator objects (that correspond to different models) all have the same API, so the `fit`, `predict`, and `score` methods are the same as above. We will see this also in regression and clustering. What each method does in terms of the specific calculations will vary depending on the model, but they're always there.

The `tree` module also allows you to plot the tree to examine it:

```{code-cell} ipython3
plt.figure(figsize=(15, 20))
tree.plot_tree(dt, rounded=True, class_names=['A', 'B'],
               proportion=True, filled=True, impurity=False, fontsize=10);
```

On the iris dataset, the [sklearn docs include a diagram showing the decision boundary](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html). You should be able to modify this for another classifier.

## Setting Classifier Parameters

The decision tree we had above has a lot more levels than we would expect. This is really simple data, so we still got perfect classification. However, the more complex the model, the more risk that it will learn something noisy about the training data that doesn't hold up in the test set.

Fortunately, we can control the parameters to make it find a simpler decision boundary.


```{code-cell} ipython3
dt2 = tree.DecisionTreeClassifier(max_depth=2)
dt2.fit(X_train6, y_train6)
dt2.score(X_test6, y_test6)
```

```{code-cell} ipython3
plt.figure(figsize=(15, 20))
tree.plot_tree(dt2, rounded=True, class_names=['A', 'B'],
               proportion=True, filled=True, impurity=False, fontsize=10);
```

## Questions
```{note}
I added in some questions from previous semesters because few questions were asked.
```

### Are there any good introductions to ScikitLearn that you are aware of?

The Scikit Learn [User Guide](https://scikit-learn.org/stable/user_guide.html) is the best one, and they have a large example gallery.


### Are there any popular machine learning models that use decision trees?

Yes, a lot of medical applications do, because they are easy to understand, which makes it easier for healthcare providers to trust them.
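Decision trees are also the building block of popular ensemble models like random forests, which fit many trees and combine their predictions. As a minimal sketch, beyond what we covered in class, using scikit-learn's `RandomForestClassifier` on the corner data from above:

```{code-cell} ipython3
from sklearn.ensemble import RandomForestClassifier

# an ensemble of decision trees, each fit to a random subsample of the data
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train6, y_train6)
rf.score(X_test6, y_test6)
```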
### Do predictive algorithms have pros and cons? Or is there a standard?

Each algorithm has different properties, strengths, and weaknesses. Some are more popular than others, but they all do have weaknesses.

### How often should we use confusion matrices? Would it be better just to check the accuracy without one?

A confusion matrix gives *more detail* on the performance than accuracy alone. If the accuracy is something like 99.99%, maybe the confusion matrix is not informative, but otherwise it
is generally useful for understanding what types of mistakes occur, as context for how you might use, not use, or trust the model.

### Due to the initial 'shuffling' of data: Is it possible to get a seed/shuffle of the data split so that it does much worse in a model?

Yes, you can get a bad split; next week we will see a statistical technique that helps us account for this. However, the larger your dataset, the less likely this is to happen, so in practice we
mostly just get bigger and bigger datasets.

### How does `gnb = GaussianNB()` work?



This line calls the `GaussianNB` class constructor method with all default parameters.

This means it *creates* an object of that type; we can see that as follows:


```{code-cell} ipython3
gnb_ex = GaussianNB()
type(gnb_ex)
```

### How will level 3s work?

I will update on level 3s in class on Tuesday (and on the site before then).

### Could we use strata to better identify the data in the toy dataset?

The dataset didn't have any additional columns we could use to stratify this data; this data
is simply not a good fit for `GaussianNB` because it does not fit the assumptions.
Above I included the decision tree example, so you can see a classifier that does work for it.

### Is it possible, after training the model, to add more data to it?

For the models we will see in class, no. However, there is a technique called "online learning" that involves getting more data on a regular basis and updating the model to improve its performance.



### Can you use machine learning for any type of data?

Yes; the features, for example, could be an image instead of four numbers. They could also be text. The basic ideas are the same for more complex data, so we are going to spend a lot of time building your understanding of what ML *is* on simple data. Past students have successfully applied ML to more complex data after this course, because once you have a good understanding of the core ideas, applying it to other forms of data is easier to learn on your own.

### Can we check how well the model did using the y_test df?

We could compare them directly, or use `score`, which does that comparison for us.


```{code-cell} ipython3
y_pred == y_test
```

```{code-cell} ipython3
sum(y_pred == y_test)/len(y_test)
```

```{code-cell} ipython3
gnb.score(X_test, y_test)
```


We can also use any of the other metrics we saw; we'll practice more on Wednesday.

### I want to know more about the `train_test_split()` function

[The docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) are a good place to start.
diff --git a/resources/glossary.md b/resources/glossary.md
index 903ccdc..64868bc 100644
--- a/resources/glossary.md
+++ b/resources/glossary.md
@@ -25,9 +25,15 @@ corpus
 [dictionary](https://docs.python.org/3/glossary.html#term-dictionary)
   (data type) a mapping that matches keys to values.
  (in NLP) all of the possible tokens a model knows

+discriminative
+  a model that describes the decision rule for labeling a sample as one class or another
+
 document
   unit of text for analysis (one sample). Could be one sentence, one paragraph, or an article, depending on the goal

+generative
+  a model that describes the data, and therefore can also be used to generate new data that looks like the training data
+
 gh
   GitHub's command line tools
@@ -97,6 +103,9 @@ TraceBack
 training accuracy
   percentage of predictions that the model predicts correctly, based on the training data

+transpose
+  swap the rows and columns of a matrix or DataFrame
+
 Web Scraping
   the process of extracting data from a website. In the context of this class, this is usually done using the python library Beautiful Soup and an html parser to retrieve specific data.