---
title: "Predicting the Brexit vote: Tree-based methods"
author: ""
date: "8/14/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Data
- In this assignment, we classify votes in the Brexit referendum.
- The data come from the British Election Study Online Panel.
- Codebook: https://www.britishelectionstudy.com/wp-content/uploads/2020/05/Bes_wave19Documentation_V2.pdf
- The outcome is `LeaveVote` (1: Leave, 0: Remain).
## Libraries
- We will use the following packages
```{r}
library(tidyverse)
library(caret)
library(tree)
```
## Load data
We sub-sample the data because the full sample takes too long to estimate during class. (Feel free to run the full sample after class.)
```{r}
set.seed(20200813)
data_brexit <- read_csv("data/data_bes.csv.gz") %>%
  sample_n(3000) # sub-sample 3,000 rows to keep estimation fast
```
## Data preparation
- We will carry out the following steps:
    - convert `LeaveVote` to a factor variable
    - split the data into training and test sets
    - preprocess the predictors
```{r}
data_brexit <- data_brexit %>%
  mutate(LeaveVote = factor(LeaveVote))
```
### Train-test split
```{r}
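# Stratified 70/30 split on the outcome; [, 1] converts the index matrix to a plain vector
train_idx <- createDataPartition(data_brexit$LeaveVote, p = .7, list = FALSE)[, 1]
data_train <- data_brexit %>% slice(train_idx)
data_test <- data_brexit %>% slice(-train_idx)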
```
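As a quick check on what a stratified split does, here is a minimal sketch on simulated data. The `toy` data frame and its columns are hypothetical stand-ins, not the BES data; the point is only that the class shares should be nearly the same in both parts.

```r
library(caret)

set.seed(1)
# Hypothetical toy data standing in for the BES sample (60% class 0, 40% class 1)
toy <- data.frame(
  y = factor(rep(c(0, 1), times = c(600, 400))),
  x = rnorm(1000)
)

# Stratified 70/30 split on the outcome; [, 1] turns the index matrix into a vector
idx <- createDataPartition(toy$y, p = .7, list = FALSE)[, 1]
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Class shares should be nearly identical in the two parts
prop.table(table(toy_train$y))
prop.table(table(toy_test$y))
```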
### Preprocess
```{r}
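# Estimate centering and scaling parameters on the training predictors only
prep <- preProcess(data_train %>% select(-LeaveVote), method = c("center", "scale"))
prep
# Apply the same training-set parameters to both sets
data_train_preped <- predict(prep, data_train)
data_test_preped <- predict(prep, data_test)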
```
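The reason for fitting `preProcess()` on the training set and reusing it on the test set can be seen in a small sketch. The data below are simulated and the column names are illustrative assumptions only.

```r
library(caret)

set.seed(1)
# Hypothetical numeric predictors; the test set is drawn with a shifted mean on purpose
toy_train <- data.frame(age = rnorm(200, mean = 50, sd = 15),
                        income = rnorm(200, mean = 30, sd = 8))
toy_test  <- data.frame(age = rnorm(100, mean = 55, sd = 15),
                        income = rnorm(100, mean = 30, sd = 8))

# Learn centering and scaling parameters from the training data only
toy_prep <- preProcess(toy_train, method = c("center", "scale"))

toy_train_preped <- predict(toy_prep, toy_train)
toy_test_preped  <- predict(toy_prep, toy_test)

# Training columns have mean 0 and sd 1 (up to floating-point error)
colMeans(toy_train_preped)
sapply(toy_train_preped, sd)

# Test columns generally do not, because they reuse the training parameters
colMeans(toy_test_preped)
```

Reusing the training parameters keeps information from the test set out of the fitted pipeline, which is the usual motivation for this pattern.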
## Model formulas
There are four logistic regression models in the manuscript (Table 2).
1. Sociodemographics
2. Identity
3. Anti-elite
4. Attitudes
The following code defines a formula for each of these models, plus `fm_all`, which uses every available predictor.
```{r}
fm_socdem <- formula("LeaveVote ~ gender + age + edlevel + hhincome + econPersonalRetro1")
fm_identity <- formula("LeaveVote ~ gender + age + edlevel + hhincome +
                        EuropeanIdentity + EnglishIdentity + BritishIdentity")
fm_antielite <- formula("LeaveVote ~ gender + age + edlevel + hhincome +
                         PolMistrust + GovDisapproval + PopulismScale +
                         ConVote + LabVote + LibVote + SNPPCVote + UKIP")
fm_attitudes <- formula("LeaveVote ~ gender + age + edlevel + hhincome + euUKNotRich +
                         euNotPreventWar + FreeTradeBad + euParlOverRide1 + euUndermineIdentity1 +
                         lessEUmigrants + effectsEUTrade1 + effectsEUImmigrationLower")
fm_all <- formula("LeaveVote ~ .")
```
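To see what these formula objects contain, here is a small base-R sketch. The variable names (`y`, `a`, `b`, `c`) and the `toy` data frame are hypothetical, used only to illustrate the mechanics.

```r
# Build a formula from a string that spans two lines, as in the chunk above
fm_toy <- formula("y ~ a + b +
                   c")

class(fm_toy)      # "formula"
all.vars(fm_toy)   # "y" "a" "b" "c"

# `y ~ .` expands to every other column of the data supplied to the model
toy <- data.frame(y = c(0, 1, 0, 1), a = 1:4, b = c(2, 1, 4, 3))
attr(terms(formula("y ~ ."), data = toy), "term.labels")   # "a" "b"
```

Because `.` is expanded against the data, a formula such as `LeaveVote ~ .` uses whatever columns are present in the data frame passed to the model.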
## Simple tree method
- Estimate a simple classification tree, using one of the formulas above.
- Draw the tree.
- Evaluate the model.
- What do you find?
```{r eval = FALSE}
```
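One possible shape for these steps is sketched below on simulated data, not the BES data. The `toy` data frame, its columns, and the choice of `positive = "1"` are assumptions made for illustration; the same calls should carry over to `data_train_preped` and one of the formulas above.

```r
library(tree)
library(caret)

set.seed(1)
# Hypothetical toy data; the outcome depends mostly on x1
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + rnorm(n, sd = 0.5) > 0, 1, 0))

idx <- createDataPartition(toy$y, p = .7, list = FALSE)[, 1]
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Fit a classification tree (the outcome must be a factor)
fit_tree <- tree(y ~ x1 + x2, data = toy_train)

# Draw the tree and label the splits
plot(fit_tree)
text(fit_tree, pretty = 0)

# Evaluate on the held-out data
pred_tree <- predict(fit_tree, newdata = toy_test, type = "class")
confusionMatrix(pred_tree, toy_test$y, positive = "1")
```

The confusion matrix reports accuracy along with sensitivity and specificity, which is a reasonable starting point for the "What do you find?" question.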
## Random forest and AdaBoost
- Now, run the "forest" models.
- Does the prediction improve?
- For random forest training, use `tuneGrid = data.frame(mtry = c(2:4))`. Otherwise it's too slow...
```{r cache=TRUE}
```
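A minimal sketch of the random forest step with `caret::train()` follows, again on simulated data. It assumes the randomForest package is installed; the 5-fold cross-validation, `ntree = 200`, and the `toy` data are illustrative choices, not settings from the assignment. Only the `tuneGrid` comes from the text above.

```r
library(caret)

set.seed(1)
# Hypothetical toy data with four predictors; only x1 and x2 carry signal
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 - toy$x2 + rnorm(n, sd = 0.5) > 0, "leave", "remain"))

# Random forest tuned over a small mtry grid to keep run time short
fit_rf <- train(
  y ~ ., data = toy,
  method = "rf",
  tuneGrid = data.frame(mtry = 2:4),
  trControl = trainControl(method = "cv", number = 5),
  ntree = 200   # passed through to randomForest()
)

# Cross-validated accuracy for each mtry value, and the selected value
fit_rf$results[, c("mtry", "Accuracy")]
fit_rf$bestTune
```

For boosting, caret lists several boosted-tree methods, for example `method = "ada"` from the ada package. That variant is not run here, and its tuning grid would differ from the `mtry` grid used for the forest.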