Using NLP to Analyze Stand Up Comedy |
Unlike Python, R, Java, and other programming languages, a natural language is one that has evolved naturally through human history (English, Chinese, Spanish...). Natural Language Processing (NLP) is how computers process and analyze these natural languages: in other words, how they deal with text data. It's a branch of AI that really interests me. In this post, I'll guide you through a basic example of applying NLP techniques to stand-up comedy transcripts, using R.
About five years ago, I was not really into stand-up comedy; for some reason, most performers' sense of humor didn't click with me. That was until I discovered Sarah Silverman. There was something about her comedy that made me laugh until I cried. So in this analysis I want to know: why do I like her so much? Is there something in the language she uses that sets her apart? What makes Sarah Silverman's routine stand out?
We'll need the transcript of one of her specials. My favorite is Speck of Dust, so we'll focus on finding that one. The transcript can be found here.
At that same link, I found five other famous comedians to compare her with. Here is the full compilation of text data we'll use:
- Ali Wong ~ Baby Cobra
- Bill Burr ~ I'm Sorry You Feel That Way
- Dave Chapelle ~ The Age of Spin
- Hasan Minhaj ~ Homecoming King
- Louis C.K. ~ Oh My God
- Sarah Silverman ~ Jesus is Magic
For a computer to read a sentence such as "Hello? Is anyone there? I thought this was 543 W 2 S", the text has to be cleaned, tokenized, and put into a matrix. To do this we will:
A. Remove punctuation
B. Remove numbers
C. Lowercase all letters
Our sentence then would look like this: "hello is anyone there i thought this was w s".
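To make that concrete, here is a minimal sketch of those three steps applied to the example sentence, using base R's gsub() and tolower() (the last line simply collapses the leftover spaces; the full pipeline below does this with stripWhitespace()):
example <- "Hello? Is anyone there? I thought this was 543 W 2 S"
example <- gsub('\\W', ' ', example)  #A. remove punctuation
example <- gsub('\\d', ' ', example)  #B. remove numbers
example <- tolower(example)           #C. lowercase everything
trimws(gsub('\\s+', ' ', example))    #"hello is anyone there i thought this was w s"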
#Load the packages used throughout the analysis.
library(tm)         #removeWords, stopwords, stripWhitespace
library(stringr)    #str_split
library(wordcloud)  #wordcloud
library(dplyr)      #n_distinct, mutate, %>%
library(ggplot2)    #ggplot
library(DT)         #datatable
library(gridExtra)  #grid.arrange, to place ggplot charts side by side
#We put each transcript into one single line.
ali <- paste(readLines('ali.txt'), collapse = " ")
bill <- paste(readLines('bill.txt'), collapse = " ")
dave <- paste(readLines('dave.txt'), collapse = " ")
hasan <- paste(readLines('hasan.txt'), collapse = " ")
louis <- paste(readLines('louis.txt'), collapse = " ")
sarah <- paste(readLines('sarah.txt'), collapse = " ")
#We remove punctuation.
ali <- gsub(pattern = '\\W', replace = ' ', ali)
bill <- gsub(pattern = '\\W', replace = ' ', bill)
dave <- gsub(pattern = '\\W', replace = ' ', dave)
hasan <- gsub(pattern = '\\W', replace = ' ', hasan)
louis <- gsub(pattern = '\\W', replace = ' ', louis)
sarah <- gsub(pattern = '\\W', replace = ' ', sarah)
#We remove numbers.
ali <- gsub(pattern = '\\d', replace = ' ', ali)
bill <- gsub(pattern = '\\d', replace = ' ', bill)
dave <- gsub(pattern = '\\d', replace = ' ', dave)
hasan <- gsub(pattern = '\\d', replace = ' ', hasan)
louis <- gsub(pattern = '\\d', replace = ' ', louis)
sarah <- gsub(pattern = '\\d', replace = ' ', sarah)
#We lowercase everything.
ali <- tolower(ali)
bill <- tolower(bill)
dave <- tolower(dave)
hasan <- tolower(hasan)
louis <- tolower(louis)
sarah <- tolower(sarah)
#We remove stop words (e.g. and, or, the...)
custom_stopwords <- read.csv("stop-word-list.csv", header = FALSE)
custom_stopwords <- as.character(custom_stopwords$V1)
final_stopwords <- c(custom_stopwords, stopwords())
ali <- removeWords(ali, final_stopwords)
bill <- removeWords(bill, final_stopwords)
dave <- removeWords(dave, final_stopwords)
hasan <- removeWords(hasan, final_stopwords)
louis <- removeWords(louis, final_stopwords)
sarah <- removeWords(sarah, final_stopwords)
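A quick aside on what removeWords() is doing: it blanks out every word that appears in the stop-word list, leaving gaps of extra spaces behind, which is why stripWhitespace() is called at the end of the pipeline. A small illustration using only tm's built-in English stop words:
removeWords("hello is anyone there i thought this was w s", stopwords())
#stop words such as "is", "there", "i", "this" and "was" are blanked out; the gaps get cleaned up later.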
#The Sarah transcript contains a stray encoding character ("â") that needs to be removed.
sarah <- gsub(pattern = 'â', replace = ' ', sarah)
#We remove single letter words.
ali <- gsub(pattern = '\\b[a-z]\\b', replace = ' ', ali)
bill <- gsub(pattern = '\\b[a-z]\\b', replace = ' ', bill)
dave <- gsub(pattern = '\\b[a-z]\\b', replace = ' ', dave)
hasan <- gsub(pattern = '\\b[a-z]\\b', replace = ' ', hasan)
louis <- gsub(pattern = '\\b[a-z]\\b', replace = ' ', louis)
sarah <- gsub(pattern = '\\b[a-z]\\b', replace = ' ', sarah)
#Clean up extra white space.
ali <- stripWhitespace(ali)
bill <- stripWhitespace(bill)
dave <- stripWhitespace(dave)
hasan <- stripWhitespace(hasan)
louis <- stripWhitespace(louis)
sarah <- stripWhitespace(sarah)
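As a side note on the design: the same cleaning steps are written out six times above. If you want to avoid the repetition, the whole pipeline can be wrapped in a small helper and applied to every transcript with lapply(). Below is just a sketch (clean_text and transcripts are illustrative names, and the stray "â" fix for the Sarah transcript would still be a separate step):
#A sketch of the same cleaning pipeline as one reusable function.
clean_text <- function(file, stop_words) {
  txt <- paste(readLines(file), collapse = " ")  #one single line
  txt <- gsub('\\W', ' ', txt)                   #remove punctuation
  txt <- gsub('\\d', ' ', txt)                   #remove numbers
  txt <- tolower(txt)                            #lowercase everything
  txt <- removeWords(txt, stop_words)            #remove stop words
  txt <- gsub('\\b[a-z]\\b', ' ', txt)           #remove single-letter words
  stripWhitespace(txt)                           #clean up extra white space
}
transcripts <- lapply(c('ali.txt', 'bill.txt', 'dave.txt',
                        'hasan.txt', 'louis.txt', 'sarah.txt'),
                      clean_text, stop_words = final_stopwords)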
#We put our new variables into a table
Comedian <- c("Ali Wong", "Bill Burr", 'Dave Chapelle', 'Hasan Minhaj', 'Louis C.K.', 'Sarah Silverman')
Transcript <- c(ali, bill, dave, hasan, louis, sarah)
clean_transcripts <- data.frame(Comedian, Transcript, stringsAsFactors = FALSE)
#We split the transcript into separate words
ali_bag <- str_split(ali, pattern = "\\s+")
bill_bag <- str_split(bill, pattern = "\\s+")
dave_bag <- str_split(dave, pattern = "\\s+")
hasan_bag <- str_split(hasan, pattern = "\\s+")
louis_bag <- str_split(louis, pattern = "\\s+")
sarah_bag <- str_split(sarah, pattern = "\\s+")
#We unlist the variables above to make a bag of words (single char vectors)
ali_bag <- unlist(ali_bag)
bill_bag <- unlist(bill_bag)
dave_bag <- unlist(dave_bag)
hasan_bag <- unlist(hasan_bag)
louis_bag <- unlist(louis_bag)
sarah_bag <- unlist(sarah_bag)
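Before plotting, it can be useful to look at the raw counts behind the clouds; a quick sketch with base R's table(), shown here for Sarah's bag of words:
#The ten most frequent words in Sarah's transcript.
sarah_freq <- sort(table(sarah_bag), decreasing = TRUE)
head(sarah_freq, 10)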
We'll use word clouds to visualize the 30 words each comedian used most in their performance.
par(mfrow = c(1,2))
wordcloud(ali, max.words = 30, random.order = FALSE, colors = rainbow(2))
wordcloud(bill, max.words = 30, random.order = FALSE, colors = rainbow(2))
wordcloud(dave, max.words = 30, random.order = FALSE, colors = rainbow(2))
wordcloud(hasan, max.words = 30, random.order = FALSE, colors = rainbow(2))
wordcloud(louis, max.words = 30, random.order = FALSE, colors = rainbow(2))
wordcloud(sarah, max.words = 30, random.order = FALSE, colors = rainbow(2))
- "God" is a word used by both female comedians but not by the rest of the male comedians.
- All comedians use a lot of profanity.
- It's interesting Sarah Silverman uses a lot of adjectives (positive ones) as well as negative words, such as dying, die, bling, crying; maybe this reflect her big use of sarcasm which I think it's funny.
- It looks like she talks about some controversies such as god, kids, death, sex, disablities. I like people who talks about taboos, especially when it's in a funny way.
Here the goal is to find how big a vocabulary each comedian has. We do this by counting the number of unique words they use.
ali_unique <- n_distinct(ali_bag)
bill_unique <- n_distinct(bill_bag)
dave_unique <- n_distinct(dave_bag)
hasan_unique <- n_distinct(hasan_bag)
louis_unique <- n_distinct(louis_bag)
sarah_unique <- n_distinct(sarah_bag)
ali_total <- length(ali_bag)
bill_total <- length(bill_bag)
dave_total <- length(dave_bag)
hasan_total <- length(hasan_bag)
louis_total <- length(louis_bag)
sarah_total <- length(sarah_bag)
### On the internet, we find out how long each performance lasts (in minutes)
ali_time <- 60
bill_time <- 80
dave_time <- 67
hasan_time <- 73
louis_time <- 58
sarah_time <- 72
###Now we can determine each comedian's speed (words per minute)
ali_speed <- as.integer(ali_total/ali_time)
bill_speed <- as.integer(bill_total/bill_time)
dave_speed <- as.integer(dave_total/dave_time)
hasan_speed <- as.integer(hasan_total/hasan_time)
louis_speed <- as.integer(louis_total/louis_time)
sarah_speed <- as.integer(sarah_total/sarah_time)
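A quick worked check of the speed formula with made-up numbers, which also shows that as.integer() truncates the ratio rather than rounding it:
as.integer(7200 / 60)  #120 words per minute
as.integer(7230 / 60)  #7230/60 = 120.5, but as.integer() truncates to 120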
##Now let's put all of that in a table
Unique_words <- c(ali_unique, bill_unique, dave_unique, hasan_unique, louis_unique, sarah_unique)
Total_words <- c(ali_total, bill_total, dave_total, hasan_total, louis_total, sarah_total)
Time <- c(ali_time, bill_time, dave_time, hasan_time, louis_time, sarah_time)
Speed <- c(ali_speed, bill_speed, dave_speed, hasan_speed, louis_speed, sarah_speed)
transcripts_stats <- data.frame(Comedian, Unique_words, Total_words, Time, Speed)
datatable(transcripts_stats)
transcripts_stats <- transcripts_stats %>%
mutate(highlight_sarah = ifelse(Comedian == 'Sarah Silverman', T, F))
#par() has no effect on ggplot2 charts, so we build the two plots and place them side by side with grid.arrange() from gridExtra.
#theme_classic() is applied before the axis-text tweak so it does not overwrite it.
p_unique <- ggplot(transcripts_stats, aes(x = reorder(Comedian, -Unique_words), y = Unique_words)) +
  geom_bar(stat = "identity", width = 0.5, aes(fill = highlight_sarah)) +
  labs(title = "Number of Unique Words") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6)) +
  ylab('') +
  xlab('Comedian') +
  geom_hline(yintercept = mean(Unique_words), linetype = "dashed", color = "black") +
  theme(legend.position = 'none')
p_speed <- ggplot(transcripts_stats, aes(x = reorder(Comedian, -Speed), y = Speed)) +
  geom_bar(stat = "identity", width = 0.5, aes(fill = highlight_sarah)) +
  labs(title = "Number of Words per Minute") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6)) +
  ylab('') +
  xlab('Comedian') +
  geom_hline(yintercept = mean(Speed), linetype = "dashed", color = "black") +
  theme(legend.position = 'none')
grid.arrange(p_unique, p_speed, ncol = 2)
These two findings could explain why I like Sarah: it makes sense that I, as an English as a Second Language learner, would gravitate towards comedians who use a simpler vocabulary and speak more slowly. Now we'll look at the amount of profanity the comedians use.
ali_bad <- length(which(ali_bag == "fuck")) + length(which(ali_bag == "shit")) + length(which(ali_bag == "fucking"))
bill_bad <- length(which(bill_bag == "fuck")) + length(which(bill_bag == "shit")) + length(which(bill_bag == "fucking"))
dave_bad <- length(which(dave_bag == "fuck")) + length(which(dave_bag == "shit")) + length(which(dave_bag == "fucking"))
hasan_bad <- length(which(hasan_bag == "fuck")) + length(which(hasan_bag == "shit")) + length(which(hasan_bag == "fucking"))
louis_bad <- length(which(louis_bag == "fuck")) + length(which(louis_bag == "shit")) + length(which(louis_bag == "fucking"))
sarah_bad <- length(which(sarah_bag == "fuck")) + length(which(sarah_bag == "shit")) + length(which(sarah_bag == "fucking"))
Swear_words <- c(ali_bad, bill_bad, dave_bad, hasan_bad, louis_bad, sarah_bad)
transcripts_stats <- data.frame(Comedian, Unique_words, Total_words, Time, Swear_words)
datatable(transcripts_stats)
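As a design note, the repeated length(which(...)) sums above can be expressed as a single vectorised count per comedian; a small sketch (count_swears is an illustrative helper name):
#Count how many words in a bag of words belong to a set of swear words.
count_swears <- function(bag, swears = c("fuck", "fucking", "shit")) {
  sum(bag %in% swears)
}
count_swears(sarah_bag)  #same value as sarah_bad above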