The Eras Tour has inspired endless discussion of the differences among Taylor Swift’s eras and the themes, words, and phrases that cut across them. T-Swift herself has grouped some of these themes into playlists organized around the five stages of grief. But what if we let the words do the talking?
The goal of this project is to use natural language processing tools to analyze Taylor Swift’s body of lyrics, from her debut album to The Tortured Poets Department. We’ll start with the Kaggle Taylor Swift lyric database and adapt some ideas from Analyze Taylor Swift Lyrics with Python and Tidy Text Mining with R. We’ll use tidyverse tools to clean up the data.
library(tidyverse)
library(tidytext)
# read every per-album CSV in the lyrics/ folder and stack them into one data frame
filenames <- list.files(path = "lyrics", full.names = TRUE)
df <- do.call(rbind, lapply(filenames, read.csv))
head(df)
# clean up titles by removing everything in parentheses
# (str_replace_all handles titles with more than one parenthetical)
df$album_name <- str_replace_all(df$album_name, " \\s*\\([^\\)]+\\)", "")
df$track_title <- str_replace_all(df$track_title, " \\s*\\([^\\)]+\\)", "")
# convert strings to factors & put albums in order
df$album_name <- factor(df$album_name,
                        levels = c("Taylor Swift", "Fearless", "Speak Now", "Red",
                                   "1989", "reputation", "Lover", "folklore", "evermore",
                                   "Midnights", "TTPD"))
df$track_title <- factor(df$track_title)
# number of lyrics per album
df %>%
  group_by(album_name) %>%
  count()
# save the original text, then lowercase everything
df$lyric.original <- df$lyric
df$lyric <- tolower(df$lyric)
# remove stop words
stopwords <- c('the', 'a', 'this', 'that', 'to', 'is', 'am', 'was', 'were', 'be', 'being', 'been', 'and', 'of', 'it', 'in', 'but', 'on', 'or')
# build a single regex, \bthe\b|\ba\b|..., so only whole words are removed
pattern <- paste(stopwords, collapse = "\\b|\\b")
pattern <- paste0("\\b", pattern, "\\b")
df$lyric <- gsub(pattern, "", df$lyric, ignore.case = TRUE)
# remove extra whitespace
df$lyric <- str_replace_all(df$lyric, "\\s+", " ") # within lyric
df$lyric <- trimws(df$lyric) # before or after lyric
# unpack lyric into its component words and save to new df
df %>%
  unnest_tokens(word, lyric) -> df.tokens
# an alternative approach to removing stop words: tidytext's stop_words lexicon is a much larger set
data(stop_words)
df.tokens <- df.tokens %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
# count most common words
token_count <- df.tokens %>%
  count(word, sort = TRUE)
head(token_count)
# plot most common words
df.tokens %>%
  count(word, sort = TRUE) %>%
  filter(n > 100) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL)
The most common word is “love”: no surprise there! If we hadn’t filtered out stop words, it would have been “you”, closely followed by “I”.
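We can double-check that claim by tokenizing the lyric.original column we saved earlier, which still contains the stop words. (A quick sketch; the exact counts depend on the version of the Kaggle data you downloaded.)

# most common words with stop words left in, using the saved originals
df %>%
  unnest_tokens(word, lyric.original) %>%
  count(word, sort = TRUE) %>%
  head()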
# get the AFINN database, which has valence scores from -5 to 5
afinn <- get_sentiments('afinn')
# combine with the tokenized lyric data
df.tokens.afinn <- df.tokens %>%
  inner_join(afinn)
## Joining with `by = join_by(word)`
# most negative words
df.tokens.afinn %>%
  select(word, value) %>%
  unique() %>%
  arrange(value) %>%
  head()
# most positive words
df.tokens.afinn %>%
  select(word, value) %>%
  unique() %>%
  arrange(desc(value)) %>%
  head()
# find lyrics with the word "win" in them
df.tokens.afinn %>%
  filter(word == "win")
A quick visual inspection reveals a major drawback of this kind of sentiment analysis: it picks up on individual words that are positive or negative, but it loses the context. For instance, the lyric “You play stupid games, you win stupid prizes” gets a positive valence boost from “win” and “prizes”, but the overall meaning of the metaphor is negative.
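One lightweight way to surface this context loss, adapted from Tidy Text Mining with R, is to find AFINN-scored words that directly follow a negation, since word-level scoring gives these the wrong sign. (A sketch; the three-word negation list here is our own and could easily be extended.)

# AFINN-scored words immediately preceded by a simple negation word
df %>%
  unnest_tokens(bigram, lyric.original, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% c("not", "no", "never")) %>% # assumed negation list
  inner_join(afinn, by = c("word2" = "word")) %>%
  count(word1, word2, value, sort = TRUE)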
Another noticeable pattern is that swear words drag down the valence rating regardless of how they’re used. We’ll remove them for this analysis.
# aggregate first by lyric line (sentiment is often conveyed in short phrases
# or sentences rather than long passages) and then by track,
# including overall sentiment as well as sentiment split by positive & negative weights
df.tokens.afinn %>%
  filter(!word %in% c("damn", "shit", "fuck", "fucking", "bitch", "hell")) %>%
  group_by(album_name, track_title, track_n, line) %>%
  summarize(lineval = mean(value),
            linepos = ifelse(mean(value) > 0, mean(value), NA),
            lineneg = ifelse(mean(value) < 0, mean(value), NA)) %>%
  group_by(album_name, track_title, track_n) %>%
  summarize(sentiment = mean(lineval),
            sentpos = mean(linepos, na.rm = TRUE),
            sentneg = mean(lineneg, na.rm = TRUE)) -> df.sentiment
# clean up positive & negative sentiment: replace missing values with 0
# and store negative sentiment as a magnitude
df.sentiment %>%
  mutate(sentpos = ifelse(is.na(sentpos), 0, abs(sentpos)),
         sentneg = ifelse(is.na(sentneg), 0, abs(sentneg))) -> df.sentiment
# colors for the eras
era.colors <- c("#a5c9a5", "#efc180", "#c7a8cb", "#823549", "#b5e5f8", "#746f70",
                "#f7b0cc", "#cdc9c1", "#c5ac90", "#242e47", "#6f6a66")
# plot the results
p <- ggplot(df.sentiment, aes(x = reorder(track_title, -sentiment),
                              y = sentiment, fill = album_name)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = era.colors) +
  facet_grid(. ~ album_name, scales = "free_x") +
  xlab('tracks in descending order by sentiment') +
  theme_minimal() +
  theme(axis.text.x = element_blank()) # remove x-axis tick labels
print(p)
What trends do you see in the data?
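One way to put a number on the trend is to average track-level sentiment within each era (a quick sketch built on the df.sentiment table we just created):

# mean track sentiment per album, saddest first
df.sentiment %>%
  group_by(album_name) %>%
  summarize(mean_sentiment = mean(sentiment)) %>%
  arrange(mean_sentiment)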
# randomize row order because of check_overlap (see below)
set.seed(1234)
df.sentiment$random_order <- sample(nrow(df.sentiment))
df.sentiment %>%
  arrange(random_order) %>%
  ggplot(aes(x = sentneg, y = sentpos, color = album_name)) +
  # check_overlap shows only non-overlapping labels, drawn in row order
  geom_text(aes(label = track_title), size = 3, check_overlap = TRUE) +
  geom_abline(slope = 1, intercept = 0, color = "grey70", linetype = "dotted") +
  scale_color_manual(values = era.colors) +
  xlab('negative sentiment') +
  ylab('positive sentiment') +
  theme_minimal()
Songs along the diagonal are more balanced, whereas songs off the diagonal skew positive (above the diagonal) or negative (below it). Songs toward the bottom left convey less sentiment overall in their lyrics.
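To rank tracks by how much sentiment they carry overall, we can collapse the two axes into rough intensity and balance scores (illustrative only; intensity and balance are column names we’re introducing here):

# intensity: total sentiment on both axes; balance: positive minus negative
df.sentiment %>%
  ungroup() %>%
  mutate(intensity = sentpos + sentneg,
         balance = sentpos - sentneg) %>%
  arrange(desc(intensity)) %>%
  head()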
What else would you want to visualize?
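As one starting point, a boxplot of track-level sentiment by era shows the spread within each album, not just the ranking (a sketch reusing df.sentiment and era.colors):

# distribution of track sentiment within each era
df.sentiment %>%
  ggplot(aes(x = album_name, y = sentiment, fill = album_name)) +
  geom_boxplot(show.legend = FALSE) +
  scale_fill_manual(values = era.colors) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))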