lab-2-polynomial-regression.Rmd

---
title: "Lab 2"
author: "Elke Windschitll"
date: "2023-01-18"
output: html_document
---

Today we will be continuing the pumpkin case study from last week. We will be using the data that you cleaned and split last time (pumpkins_train) and will be comparing our results today to those you have already obtained, so open and run your Lab 1 .Rmd as a first step so those objects are available in your Environment (unless you created an R Project last time, in which case, kudos to you!).  

Once you have done that, we'll start today's lab by specifying a recipe for a polynomial model.  First we specify a recipe that identifies our variables and data, converts package to a numerical form, and then add a polynomial effect with step_poly()

```{r, include=FALSE}

# Source from lab 1
sourceDir <- "/Users/elkewindschitl/Documents/MEDS/eds-232/labs/eds-232-labs/lab-1.Rmd"
library(knitr)
source(knitr::purl(sourceDir, quiet=TRUE))
```

**Note: Sourcing from lab1**
```{r}
# Specify a recipe
poly_pumpkins_recipe <-
  recipe(price ~ package, data = pumpkins_train) %>%
  step_integer(all_predictors(), zero_based = TRUE) %>% 
  step_poly(all_predictors(), degree = 4)
```

How did that work? Choose another value for degree if you need to. Later we will learn about model tuning that will let us do things like find the optimal value for degree.  For now, we'd like to have a flexible model, so find the highest value for degree that is consistent with our data.

Polynomial regression is still linear regression, so our model specification looks similar to before.

```{r}
# Create a model specification called poly_spec
poly_spec <- linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode("regression")
```
Question 1: Now take the recipe and model specification that just created and bundle them into a workflow called poly_df.

```{r}
# Bundle recipe and model spec into a workflow
poly_wf <- workflow() %>% 
  add_recipe(poly_pumpkins_recipe) %>% 
  add_model(poly_spec)
```

Question 2: fit a model to the pumpkins_train data using your workflow and assign it to poly_wf_fit
```{r}
# Create a model
poly_wf_fit <- poly_wf %>% 
  fit(data = pumpkins_train)
```

```{r}
# Print learned model coefficients
poly_wf_fit
```


```{r}
# Make price predictions on test data
poly_results <- poly_wf_fit %>% predict(new_data = pumpkins_test) %>% 
  bind_cols(pumpkins_test %>% select(c(package, price))) %>% 
  relocate(.pred, .after = last_col())
# Print the results
poly_results %>% 
  slice_head(n = 10)
```

Now let's evaluate how the model performed on the test_set using yardstick::metrics().
```{r}
metrics(data = poly_results, truth = price, estimate = .pred)
```
Question 3: How do the performance metrics differ between the linear model from last week and the polynomial model we fit today?  Which model performs better on predicting the price of different packages of pumpkins?

**The model from this week seems to be performing better than the model last week. We can see this from a lower rmse value (3.27 vs. 7.23 last week) and an rsq value (0.892 vs. 0.495 last week) closer to one, meaning that the correlation is higher here, and therefore better.** 

Let's visualize our model results.  First prep the results by binding the encoded package variable to them.
```{r}
# Bind encoded package column to the results
poly_results <- poly_results %>% 
  bind_cols(package_encode %>% 
              rename(package_integer = package)) %>% 
  relocate(package_integer, .after = package)
# Print new results data frame
poly_results %>% 
  slice_head(n = 5)
```

OK, now let's take a look! 

Question 4: Create a scatter plot that takes the poly_results and plots package vs. price.  Then draw a line showing our model's predicted values (.pred). Hint: you'll need separate geoms for the data points and the prediction line.
```{r}
# Make a scatter plot
poly_results %>% ggplot(aes(x = package_integer, y = price)) +
  geom_point(size = 1.6) +
  geom_line(aes(y = .pred), color = "orange", size = 1.2) +
  xlab("package")
```

You can see that a curved line fits your data much better.

Question 5: Now make a smoother line by using geom_smooth instead of geom_line and passing it a polynomial formula like this:
geom_smooth(method = lm, formula = y ~ poly(x, degree = 3), color = "midnightblue", size = 1.2, se = FALSE)

```{r}
# Make a smoother scatter plot 
poly_results %>% ggplot(aes(x = package_integer, y = price)) +
  geom_point(size = 1.6) +
  geom_smooth(method = lm, formula = y ~ poly(x, degree = 3), color = "midnightblue", size = 1.2, se = FALSE) +
  xlab("package")
```

OK, now it's your turn to go through the process one more time.
 
Additional assignment components :
6. Choose  a new predictor variable (anything not involving package type) in this dataset.
**Exploring variety**
7. Determine its correlation with the outcome variable (price).  (Remember we calculated a correlation matrix last week)
```{r}
cor(baked_pumpkins$variety, baked_pumpkins$price)
```

**The correlation between price and variety is -0.86.** 

8. Create and test a model for your new predictor:
  - Create a recipe
  - Build a model specification (linear or polynomial)
  - Bundle the recipe and model specification into a workflow
  - Create a model by fitting the workflow
  - Evaluate model performance on the test data
  - Create a visualization of model performance
  
Lab 2 due 1/24 at 11:59 PM
```{r}
# Specify a recipe for encode
variety_pumpkins_recipe_encode_step <-
  recipe(price ~ variety, data = pumpkins_train) %>%
  step_integer(all_predictors(), zero_based = TRUE)
```


```{r}
# Encode variety column
variety_encode <- variety_pumpkins_recipe_encode_step %>% 
  prep() %>% 
  bake(new_data = pumpkins_test) %>% 
  select(variety)
```

```{r}
# Specify a recipe
variety_pumpkins_recipe <-
  recipe(price ~ variety, data = pumpkins_train) %>%
  step_integer(all_predictors(), zero_based = TRUE) %>% 
  step_poly(all_predictors(), degree = 3) 

# Create a model specification called variety_spec
variety_spec <- linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode("regression")

# Bundle recipe and model spec into a workflow
variety_wf <- workflow() %>% 
  add_recipe(variety_pumpkins_recipe) %>% 
  add_model(variety_spec)

# Create a model
variety_wf_fit <- variety_wf %>% 
  fit(data = pumpkins_train)

# Print learned model coefficients
variety_wf_fit

# Make price predictions on test data
variety_results <- variety_wf_fit %>% predict(new_data = pumpkins_test) %>% 
  bind_cols(pumpkins_test %>% select(c(variety, price))) %>% 
  relocate(.pred, .after = last_col())

# Print the results
variety_results %>% 
  slice_head(n = 10)

# Evaluate
metrics(data = variety_results, truth = price, estimate = .pred)

# Bind encoded variety column to the results
variety_results <- variety_results %>% 
  bind_cols(variety_encode %>% 
              rename(variety_integer = variety)) %>% 
  relocate(variety_integer, .after = variety)

# Print new results data frame
variety_results %>% 
  slice_head(n = 5)

# Make a scatter plot
variety_results %>% ggplot(aes(x = variety_integer, y = price)) +
  geom_point(size = 1.6) +
  geom_line(aes(y = .pred), color = "orange", size = 1.2) +
  xlab("variety")

# Make a smoother scatter plot 
variety_results %>% ggplot(aes(x = variety_integer, y = price)) +
  geom_point(size = 1.6) +
  geom_smooth(method = lm, formula = y ~ poly(x, degree = 2), color = "midnightblue", size = 1.2, se = FALSE) +
  xlab("variety")
```

**Based on my metrics, it appears packages is a "better" indicator of price than variety. The rmse is lower in the package model than the variety model, and the rsq is higher in the package model than the variety model.**