Text

Amelia McNamara

August 16, 2016

Overview so far

Interactivity in R

Getting data from the web

We’ll go through some of Scott, Karthik, and Garrett’s useR tutorial. I’ll flip through the API stuff, and we’ll focus on scraping.

Scraping!

We’re switching over to the useR tutorial by Scott, Karthik, and Garrett.

See it here: useR tutorial.
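
To give a flavor before we dive in, here is a minimal scraping sketch using rvest. The page and the CSS selector are just placeholders (not taken from the tutorial); swap in whatever you actually want to scrape.

library(rvest)

# placeholder page and selector; substitute your own
page <- read_html("https://en.wikipedia.org/wiki/Billboard_Hot_100")
page %>%
  html_nodes("h2") %>%   # every second-level heading on the page
  html_text()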

Some text

Loading data

This is the data I mentioned from Kaylin Walker’s analysis.

(I got the URL right this time; notice it starts with raw.)

library(RCurl)
library(readr)
# download the raw CSV as one big string, then parse it with readr
webData <- getURL("https://raw.githubusercontent.com/walkerkq/musiclyrics/master/billboard_lyrics_1964-2015.csv")
lyrics <- read_csv(webData)
dim(lyrics)
## [1] 5100    6
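
(Side note: readr can usually read straight from a URL on its own, so the RCurl step is optional. If it works on your setup, it's one line:)

lyrics <- read_csv("https://raw.githubusercontent.com/walkerkq/musiclyrics/master/billboard_lyrics_1964-2015.csv")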

String manipulations

library(dplyr)
library(stringr)
beatles <- lyrics %>%
  filter(str_detect(Artist, "beatles"))   # Artist names are all lowercase here
dim(beatles)
## [1] 18  6
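
The Artist column in this data is already lowercase; if it weren't, a case-insensitive match would be the safer variation:

beatles <- lyrics %>%
  filter(str_detect(Artist, regex("beatles", ignore_case = TRUE)))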

Now you: find all the songs containing “love”

(How much smaller do you think love is than lyrics? How much smaller is it really?)

One approach

# matching "lov" also picks up "lovin", "loved", etc.
love <- lyrics %>%
  filter(str_detect(Lyrics, "lov"))
dim(love)
## [1] 3032    6
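
To answer the second question, compare the row counts:

nrow(love) / nrow(lyrics)   # 3032/5100, so about 59% of songs match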

Trump tweets

David Robinson wrote a great blog post about Trump’s tweets (varianceexplained.org/r/trump-tweets/). It’s also a great walkthrough of some text analysis! We’re going to try it on our own data.

Words

library(tidytext)
lyricwords <- lyrics %>%
  unnest_tokens(word, Lyrics, token = "words") %>%    # one row per word
  filter(!word %in% stop_words$word, str_detect(word, "[a-z]"))
lyricwords %>%
  select(Song, Artist, word)
## # A tibble: 579,910 x 3
##           Song                        Artist          word
##          <chr>                         <chr>         <chr>
## 1  wooly bully sam the sham and the pharaohs           sam
## 2  wooly bully sam the sham and the pharaohs          sham
## 3  wooly bully sam the sham and the pharaohs miscellaneous
## 4  wooly bully sam the sham and the pharaohs         wooly
## 5  wooly bully sam the sham and the pharaohs         bully
## 6  wooly bully sam the sham and the pharaohs         wooly
## 7  wooly bully sam the sham and the pharaohs         bully
## 8  wooly bully sam the sham and the pharaohs           sam
## 9  wooly bully sam the sham and the pharaohs          sham
## 10 wooly bully sam the sham and the pharaohs      pharaohs
## # ... with 579,900 more rows

Common words

library(ggplot2)
wordcounts <- lyricwords %>%
  group_by(word) %>%
  summarize(uses = n())
wordcounts <- wordcounts %>%
  arrange(desc(uses)) %>%
  slice(1:20) 

wordcounts %>%
  ggplot() + geom_bar(aes(x = reorder(word, uses), y = uses), stat = "identity")

# without reorder(), the bars would appear in alphabetical order:
#  ggplot() + geom_bar(aes(x = word, y = uses), stat = "identity")
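
With 20 words along one axis the labels get crowded; flipping the coordinates is one easy fix:

wordcounts %>%
  ggplot() + geom_bar(aes(x = reorder(word, uses), y = uses), stat = "identity") +
  coord_flip()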

Stop words

data(stop_words)
head(stop_words)
## # A tibble: 6 x 2
##        word lexicon
##       <chr>   <chr>
## 1         a   SMART
## 2       a's   SMART
## 3      able   SMART
## 4     about   SMART
## 5     above   SMART
## 6 according   SMART

We can make our own list of stop words to filter out:

morewords <- data.frame(word = c("im", "aint", "dont"), lexicon = "MM",
                        stringsAsFactors = FALSE)   # keep word as character, not factor
lyricwords <- lyricwords %>%
  filter(!word %in% morewords$word)
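
Another way to do the same thing, a little more in the tidyverse spirit, is an anti_join against the combined lists (a sketch; it relies on both data frames having a word column, which they do):

lyricwords <- lyricwords %>%
  anti_join(bind_rows(stop_words, morewords), by = "word")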

One approach

(Here, finding the most-used words in each decade.)

decadelyrics <- lyricwords %>%
  mutate(decade = (Year %/% 10) * 10)   # integer division: 1967 becomes 1960
wordcounts <- decadelyrics %>%
  group_by(word, decade) %>%
  summarize(uses = n()) 

popular <- wordcounts %>%
  group_by(decade) %>%
  slice(which.max(uses))

wordcounts %>%
  arrange(decade, desc(uses)) 
## Source: local data frame [69,679 x 3]
## Groups: word [41,335]
## 
##     word decade  uses
##    <chr>  <dbl> <int>
## 1   love   1960  1176
## 2   baby   1960   783
## 3   yeah   1960   387
## 4  youre   1960   357
## 5   girl   1960   322
## 6   time   1960   305
## 7    ill   1960   274
## 8  gonna   1960   256
## 9    hey   1960   215
## 10   day   1960   208
## # ... with 69,669 more rows
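
popular holds the single most-used word in each decade. To follow one word across all the decades instead, filter wordcounts; for example:

wordcounts %>%
  filter(word == "love") %>%
  ggplot() + geom_line(aes(x = decade, y = uses))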

Sentiment analysis

# pull out the NRC lexicon, which tags words with sentiments
nrc <- sentiments %>%
  filter(lexicon == "nrc") %>%
  dplyr::select(word, sentiment)

# count how often each word appears in each song, then keep the
# distinct Year/word/count combinations
years <- lyricwords %>%
  group_by(Year, Song, word) %>%
  mutate(total_words = n()) %>%
  ungroup() %>%
  distinct(Year, word, total_words)

# total the counts by year and sentiment, keeping each year's most
# common sentiment
by_source_sentiment <- years %>%
  inner_join(nrc, by = "word") %>%
  group_by(Year, sentiment) %>%
  summarize(total = sum(total_words)) %>%
  group_by(Year) %>%
  slice(which.max(total))

by_source_sentiment %>%
  arrange(desc(total)) 
## Source: local data frame [51 x 3]
## Groups: Year [51]
## 
##     Year sentiment total
##    <int>     <chr> <int>
## 1   2007  positive  1506
## 2   2003  positive  1483
## 3   2002  positive  1429
## 4   2006  positive  1418
## 5   2009  positive  1408
## 6   2010  positive  1354
## 7   2001  positive  1349
## 8   1991  positive  1280
## 9   2004  negative  1236
## 10  2015  positive  1226
## # ... with 41 more rows

p1 <- ggplot(by_source_sentiment, aes(x=Year, y=sentiment)) + geom_point() 
p1 

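A different cut at the trend (my own sketch, not part of the original analysis): keep every sentiment's yearly total and plot the share of positive versus negative words over time.

sentshare <- years %>%
  inner_join(nrc, by = "word") %>%
  group_by(Year, sentiment) %>%
  summarize(total = sum(total_words)) %>%
  mutate(share = total / sum(total))   # each sentiment's share within the year

sentshare %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  ggplot() + geom_line(aes(x = Year, y = share, color = sentiment))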

Are songs getting longer or shorter?

One approach

lyrics <- lyrics %>%
  mutate(lyrchar = str_length(Lyrics))   # character count, spaces included

lettery <- lyrics %>%
  group_by(Year) %>%
  summarize(songlength = mean(lyrchar, na.rm=TRUE))
ggplot(lettery) + geom_line(aes(x=Year, y=songlength)) + ylab("Number of characters in lyrics")

Another approach

wordy <- lyrics %>%
  unnest_tokens(word, Lyrics, token = "words") %>%
  group_by(Song, Year) %>%
  summarize(length=n()) %>%
  group_by(Year) %>%
  summarize(songlength = mean(length, na.rm=TRUE))
ggplot(wordy) + geom_line(aes(x=Year, y=songlength)) + ylab("Number of words in lyrics")

Repetitive words

repetitive <- lyricwords %>%
  group_by(Artist, Song, word) %>%
  summarize(n=n()) %>%    # number of times each word appears in each song
  arrange(desc(n)) 
repetitive %>% select(word, n, Song, Artist)
## Source: local data frame [256,320 x 4]
## Groups: Artist, Song [4,671]
## 
##     word     n                          Song
##    <chr> <int>                         <chr>
## 1    dit   180 december 1963 oh what a night
## 2  thoia   156                  thoia thoing
## 3     da   150                   be my lover
## 4    bum   148                     disturbia
## 5     la   141                      la la la
## 6     la   140                        nothin
## 7  shake   138                  shake it off
## 8    bay   136                     a bay bay
## 9     na   132                      la la la
## 10    na   132           gettin jiggy wit it
## # ... with 256,310 more rows, and 1 more variables: Artist <chr>
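
Raw counts favor long songs. One hypothetical normalization (an assumption of mine, not from the original analysis): the share of a song's non-stop-word lyrics taken up by its single most-repeated word.

repscore <- lyricwords %>%
  group_by(Artist, Song, word) %>%
  summarize(n = n()) %>%                  # per-word counts within each song
  summarize(score = max(n) / sum(n)) %>%  # top word's share of the song
  arrange(desc(score))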

Vocabulary size

uniques <- lyricwords %>%
  filter(!str_detect(Artist, "featuring")) %>%   # drop multi-artist credits
  filter(word != "instrumental") %>%             # drop instrumental placeholders
  group_by(Song, Artist) %>%
  summarize(n = length(unique(word))) %>%
  arrange(desc(n))
uniques
## Source: local data frame [4,161 x 3]
## Groups: Song [3,929]
## 
##                   Song                    Artist     n
##                  <chr>                     <chr> <int>
## 1      one more chance         the notorious big   237
## 2        i got 5 on it                     luniz   233
## 3        they want efx                   das efx   232
## 4  deja vu uptown baby lord tariq and peter gunz   229
## 5         oochie wally       nas and bravehearts   227
## 6         american pie                don mclean   219
## 7        ghetto cowboy                  mo thugs   219
## 8  sing for the moment                    eminem   219
## 9            hypnotize         the notorious big   218
## 10             my band                       d12   216
## # ... with 4,151 more rows
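
To see whether vocabularies have changed over time, average the unique-word counts by year; a sketch along the same lines:

vocabyear <- lyricwords %>%
  group_by(Song, Year) %>%
  summarize(vocab = n_distinct(word)) %>%
  group_by(Year) %>%
  summarize(meanvocab = mean(vocab))
ggplot(vocabyear) + geom_line(aes(x = Year, y = meanvocab)) + ylab("Mean unique words per song")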

Try it with Project Gutenberg data

Jordan gave us already-counted words from Project Gutenberg books!

Either: go to a URL like http://www.science.smith.edu/~jcrouser/data/burton-arabian-363.txt, change the .txt to .csv to download the data, and read the file in locally:

arabian <- read_csv("burton-arabian-363.csv")

Or read it straight from the web:

webData <- getURL("http://www.science.smith.edu/~jcrouser/data/alice.csv")
alice <- read_csv(webData)
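
From there, the same tools apply. One caution: these files come pre-counted, so the column names below (word, count) are my guess; check head(alice) and adjust.

# assuming columns named word and count; verify with head(alice)
alice %>%
  anti_join(stop_words, by = "word") %>%
  arrange(desc(count)) %>%
  slice(1:10)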