We are a group of music lovers from the MS Business Analytics Program 2018 at UC San Diego. By studying the lyrics of the songs on the Billboard Year-End Hot 100 from 1968 to 2017, we set out to answer the following questions:
- How did music topics change over time?
- Are pop song lyrics getting more repetitive?
- How did the sentiment of popular song lyrics change over time?
- Can we classify the genre of a song based on its lyrics?

Data sources:
- Song names and Hot 100 rankings: https://en.wikipedia.org/wiki/Billboard_Year-End
- Song lyrics: https://www.azlyrics.com/
- Song genre: Google search

Data collection:
- Hot 100 ranking: extract the data table from Wikipedia for each year
- Lyrics: from the above two websites
- Genre: Google search

These steps fetch the variables below:
a. Song name
b. Artist name
c. Lyrics
d. Genre: Pop, Rock, R&B, Soul, Hip-hop, Rap, Country, Dance, Alternative-indie, Blues, Punk, Metal, etc.
e. Ranking
f. Ranking year
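
As a rough illustration of the ranking extraction step, the sketch below pulls rows out of an HTML table and maps them to the ranking variables listed above. It is a Python sketch for illustration only, not the project's `billboard.R` code. The column layout (`No.`, `Title`, `Artist(s)`), the `parse_ranking_table` helper, the field names, and the placeholder rows in `SAMPLE_HTML` are all assumptions; a real run would download each year's page instead of using an inline string.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text nested in links inside a cell is collected as well.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def parse_ranking_table(html, year):
    """Turn one year's table into records (assumed columns: rank, title, artist)."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    records = []
    for cells in parser.rows[1:]:  # the first row is assumed to be the header
        rank, title, artist = cells[:3]
        records.append({
            "ranking": int(rank),
            "song_name": title.strip('"'),
            "artist_name": artist,
            "ranking_year": year,
        })
    return records


# Placeholder rows only; these are not real chart entries.
SAMPLE_HTML = """
<table class="wikitable">
<tr><th>No.</th><th>Title</th><th>Artist(s)</th></tr>
<tr><td>1</td><td>"Placeholder Song A"</td><td><a href="#">Placeholder Artist A</a></td></tr>
<tr><td>2</td><td>"Placeholder Song B"</td><td>Placeholder Artist B</td></tr>
</table>
"""

records = parse_ranking_table(SAMPLE_HTML, year=1968)
```

Lyrics and genre would then be joined to these records by song name and artist name.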

We conducted text mining along different dimensions and metrics:
- Dimensions: music genre / decade / ranking
- Metrics:
  a. Word use: frequency / repetition / diversity
  b. Top artists based on ranking
  c. Sentiment analysis
  d. Part-of-speech tagging
  e. Topic modeling
  f. Repetition
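
To make the word-use metrics concrete, here is a minimal Python sketch of word frequency, diversity, and repetition for a single lyric. It is not the project's R code. The helper names are hypothetical, the tokenizer is a deliberately simple regular expression, and the compression ratio is one common proxy for repetition that we assume here for illustration; the project's own repetition measure may differ.

```python
import re
import zlib
from collections import Counter


def tokenize(lyrics):
    """Lowercase the text and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", lyrics.lower())


def word_frequency(lyrics, top_n=3):
    """Most common words with their counts."""
    return Counter(tokenize(lyrics)).most_common(top_n)


def lexical_diversity(lyrics):
    """Type-token ratio: distinct words divided by total words."""
    tokens = tokenize(lyrics)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def compression_ratio(lyrics):
    """Compressed size over raw size; lower values suggest more repetition."""
    raw = lyrics.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw) if raw else 1.0


def decade_of(year):
    """Bucket a ranking year into its decade, e.g. 1987 -> 1980."""
    return year // 10 * 10


repetitive = "na na na hey hey goodbye " * 40
varied = "the quick brown fox jumps over the lazy dog while seven wizards quietly box"

top_words = word_frequency("hey hey hey you you me")
diversity = lexical_diversity("la la la la")
```

Averaging `lexical_diversity` or `compression_ratio` per `decade_of(year)` or per genre gives one trend line for each dimension.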

To predict whether a song belongs to a particular genre based solely on its lyrics, we engineered features in the following ways:
- Use the word counts of the lyrics as predictors
- Convert the word counts to binary 0 / 1 indicators (i.e., whether a certain word appears or not), then re-run the models to see whether accuracy improves
- Use Word2vec to build features for two-class and multi-class classification
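
The difference between the first two feature sets can be sketched in a few lines of Python. This is an illustration only, not the project's `model.R` code; the helper names and the two toy lyrics are hypothetical, and Word2vec features are not covered.

```python
import re
from collections import Counter


def tokenize(lyrics):
    """Lowercase the text and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", lyrics.lower())


def build_vocabulary(documents):
    """Sorted list of every distinct word across all lyrics."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def count_vector(lyrics, vocabulary):
    """One count per vocabulary word (word-count predictors)."""
    counts = Counter(tokenize(lyrics))
    return [counts[word] for word in vocabulary]


def binary_vector(lyrics, vocabulary):
    """1 if the word appears at all, else 0 (presence predictors)."""
    return [1 if count > 0 else 0 for count in count_vector(lyrics, vocabulary)]


# Toy lyrics for illustration only.
docs = ["baby baby love", "truck road love"]
vocab = build_vocabulary(docs)

counts_first = count_vector(docs[0], vocab)
binary_first = binary_vector(docs[0], vocab)
```

With the binary encoding, a chorus that repeats a word many times carries no more weight than a single mention of it.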
We used SVM and logistic regression models in this project.
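
For readers unfamiliar with logistic regression, the sketch below trains a one-genre-versus-rest classifier on binary word features using plain gradient descent. It is a simplified Python illustration, not the project's `model.R` code, and it does not cover the SVM. The two feature columns, the labels, the learning rate, and the number of epochs are all hypothetical choices.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample: p(genre) minus true label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n_samples for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n_samples
    return w, b


def predict(x, w, b):
    """1 if the song is predicted to belong to the genre, else 0."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if sigmoid(score) >= 0.5 else 0


# Hypothetical binary features: [contains "truck", contains "club"].
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, 0, 0]  # 1 = belongs to the genre, 0 = does not

w, b = train_logistic(X, y)
predictions = [predict(x, w, b) for x in X]
```

A positive weight marks a word as evidence for the genre and a negative weight as evidence against it, so the fitted weights can be inspected word by word.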

Please see:
- Web scraping code in `billboard.R`
- Text mining code in `lyrics text mining.R`
- Classification code in `model.R`
- Presentation deck in `Text Mining Project Deck`