
A LYRICAL EVOLUTION:
An Investigation of the Cultural Lexicon & Historical Relevance of U.S. Popular Music from 1958 - Present
===

Team Members: Ben Smith, Chris Teceno, Jerry Nolf, Rachel Robbins-Mayhill | Codeup | Innis Cohort | June 2022

Table of Contents

I. Project Overview
II. Project Data Context
III. Project Plan - Using the Data Science Pipeline
IV. Project Modules
V. Project Reproduction

I. PROJECT OVERVIEW

1. GOAL:

This project aimed to investigate patterns in song lyrics across decades using time series analysis and natural language processing techniques, including topic modeling, sentiment analysis, and term frequency. The data was collected from a Kaggle dataset of the Billboard Top 100 songs from 1958 to 2021, with lyrics pulled from the Genius.com API via web scraping. We believe the lyrics of popular songs can be used for historical analysis, applying exploratory methods and hypothesis testing to identify changing societal trends in relationships, technology, sexuality, and vulgarity. Furthermore, we believe we can predict the decade a song first appeared on the Top 100 using engineered features and machine learning methods.

2. DESCRIPTION:

Songs are powerful tokens: they can soothe, validate, ignite, confront, and educate us, among other things. Like time capsules, they are captured for eternity. The slang and language used are often indicative of the times, and you can often pinpoint when a song was made based on what it mentions. Arguably, music is a catalyst for societal and cultural evolution like no other art form. It has been causing controversy and societal upheaval for decades, and it seems every generation brings a new musical trend that has the older generations shaking their heads.

For centuries, songs were passed down through generations, sung as oral histories. The technological advancements of the 20th century, however, made the world of music a much smaller place: thanks to cheap, widely available audio equipment, songs are now distributed on a much larger scale, with a farther-reaching impact and a more permanent place in history.

This project aimed to combine the record of lyrical history and technological advancements to evaluate the changes in the cultural lexicon and societal evolution over the last 50+ years. Using machine learning and natural language processing methodologies, we investigated the topics prevalent in songs of the past, predicted the decade in which they were written, and conducted historical analysis through exploration to identify changing societal trends in relationships, technology, sexuality, and vulgarity.

To do this, we acquired a Kaggle dataset of the Billboard Top 100 songs from the chart's inception in 1958 to the present. We then used the Genius.com API and the LyricsGenius library to scrape the lyrics for the specified songs, which became the corpus for this project. The acquired data can be easily accessed via this Google Drive .csv file.

After acquiring and preparing the corpus, our team conducted time series analysis and natural language processing exploration utilizing methods such as topic modeling, word clouds, and bigrams. We also employed multiclass classification methods to create multiple machine learning models. The end goal was to create an NLP model that accurately predicted the decade a song first appeared on the Billboard Top 100 chart, based on the words and word combinations found in the lyrics of the song.

We chose the Billboard Hot 100 song list as our focus because it is the music industry's standard record chart for song popularity in the United States, published weekly by Billboard magazine. It provides a window into popular culture at a given time through chart rankings of the songs trending in sales, airplay, and (now) streaming for that week in the United States. It is arguably the best historical record of the impact of specific popular songs over time.

3. FORMULATING HYPOTHESES

The initial hypothesis of this project was that we could use the top songs of each decade in conjunction with topic modeling to identify unique words or topics which could be used as features to accurately predict the decade a song was on the Billboard Top 100 using machine learning. The thought behind this was that popular songs have been the historians of a unique lexicon, specific to their place in time. We believe the lyrics of popular songs could be analyzed through machine learning to identify societal trends in relationships, technology, sexuality, and vulgarity.

4. INITIAL QUESTIONS:

The focus of the project is on identifying the decade a song first appeared on the Billboard Top 100. Below are some of the initial questions this project looks to answer throughout the Data Science Pipeline; a minimal sketch of how the frequency questions might be approached follows the list.

Data-Focused Questions
  • What are the most frequently occurring words?
  • What are the most frequently occurring bigrams (pairs of words) by each decade?
  • What topics are most unique to each decade?
  • Is there a correlation between sentiment and decade?
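
As a minimal illustration of how the frequency questions above might be answered, the sketch below counts the most common words and bigrams with Python's Counter. The DataFrame df and its decade and lyrics columns follow the data dictionary in section II; the helper name top_ngrams is ours for illustration, not a function from the project modules.

```python
from collections import Counter

import nltk
import pandas as pd

def top_ngrams(lyrics: pd.Series, n: int = 1, k: int = 10):
    """Return the k most common n-grams across a Series of cleaned lyrics."""
    words = ' '.join(lyrics).split()
    return Counter(nltk.ngrams(words, n)).most_common(k)

# Most frequent words overall, then bigrams per decade:
# top_ngrams(df.lyrics, n=1)
# for decade, group in df.groupby('decade'):
#     print(decade, top_ngrams(group.lyrics, n=2))
```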

5. KEY FINDINGS:

The key findings for this presentation are available in slide format by clicking on the Final Slide Presentation.

Ultimately, our hypothesis that ______ TBD _____________

Exploration revealed ________ TBD ____________

Extensive feature engineering was completed prior to modeling in an attempt to create a higher-performing model; however, our best-performing models were made with ________ TBD ______________

6. DELIVERABLES:

  • README file - provides an overview of the project and steps for project reproduction
  • Draft Jupyter Notebook - provides all steps taken to produce the project
  • .py modules - provide reproducible code to automate acquiring, preparing, splitting, exploring, and modeling the corpus
  • Final Jupyter Notebook - provides presentation-ready acquire, prepare, exploration, modeling, and summary
  • Slide Deck - includes executive summary, takeaways, and explanation of key insights from each step of the Data Science Pipeline
  • One-Page Handout - provides overview and summary of project process and outcome
  • 10 Minute Presentation

II. PROJECT DATA CONTEXT

1. DATA DICTIONARY:

The final DataFrame used to explore the corpus for this project contains the following variables (columns). The variables, along with their data types, are defined below:

| Variable | Definition | DataType |
| --- | --- | --- |
| title | Title of song listed on the Billboard Top 100 chart | object |
| artist | Vocalist who performed the song | object |
| date | Date the song FIRST appeared on the Billboard Top 100 | datetime |
| lyrics | The lyric contents cleaned with the prep_data function | object |
| raw_lyrics | The contents of scraped song lyrics without cleaning | object |
| decade (target) | The decade the song was FIRST listed on the Billboard Top 100 | integer |
| character_count* | The number of characters within the cleaned document | integer |
| word_count* | The number of words within the cleaned document | integer |
| unique_words* | A list of the unique words in the cleaned document | object |
| unique_word_count* | The number of unique words in the cleaned document | integer |
| sentiment* | Score between -1.0 (negative) and 1.0 (positive) indicating the overall emotional leaning of the lyrics | float |
| sentiment_category* | Category based upon sentiment score: very negative, somewhat negative, neutral, somewhat positive, very positive | category |
| place_words* | A list of song part identifiers in the lyrics | object |
| chorus_count* | The number of choruses in the song | integer |
| verse_count* | The number of verses in the song | integer |
| verse_chorus_ratio* | The ratio of verses to choruses | float |
| pre_chorus_count* | The number of pre-choruses in the song | integer |
| outro_count* | The number of outros in the song | integer |
| bridge_count* | The number of bridges in the song | integer |
| hook_count* | The number of hooks in the song | integer |
| bigrams* | A list of bigrams in the cleaned document | object |
| trigrams* | A list of trigrams in the cleaned document | object |

\* = feature engineered (the sentiment columns are sketched below)
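
The sentiment tool is not named in this README, so the following sketch is an assumption rather than the project's actual implementation: it uses NLTK's VADER analyzer, whose compound score matches the -1.0 to 1.0 range above, with hypothetical bin edges for the five sentiment categories.

```python
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer  # requires nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()

def sentiment_features(lyrics: str) -> pd.Series:
    """Score lyrics and bucket the score into the five categories above."""
    score = sia.polarity_scores(lyrics)['compound']  # ranges from -1.0 to 1.0
    bins = [-1.0, -0.6, -0.2, 0.2, 0.6, 1.0]         # hypothetical cut points
    labels = ['very negative', 'somewhat negative', 'neutral',
              'somewhat positive', 'very positive']
    category = pd.cut([score], bins=bins, labels=labels, include_lowest=True)[0]
    return pd.Series({'sentiment': score, 'sentiment_category': category})
```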

III. PROJECT PLAN - USING THE DATA SCIENCE PIPELINE:

The following outlines the process taken through the Data Science Pipeline to complete this project.

Plan ➜ Acquire ➜ Prepare ➜ Explore ➜ Model & Evaluate ➜ Deliver


1. PLAN

  • Create an organizational tool for tracking project completion through the data science pipeline using Trello
  • Review project expectations
  • Draft project goal to include measures of success
  • Clarify questions related to the project
  • Create exploratory questions related to the corpus
  • Draft starting hypothesis
  • Add all planning and project breakdown tasks to the organizational tool

2. ACQUIRE

  • Create .gitignore
  • Obtain API token from Genius.com
  • Create env file and store the Genius token within
  • Store env file in .gitignore to ensure the security of sensitive data
  • Obtain Billboard Top 100 Song list from Kaggle
  • Pip Install LyricsGenius
  • Create acquire.py module
  • Store functions needed to acquire the lyric documents to make up the corpus
    • Pull artist and song title from Kaggle dataset
    • Run each song through LyricsGenius to obtain lyrics (a sketch of this step follows the list)
    • Store as .csv
  • Ensure all imports needed to run the acquire functions are inside the acquire.py document
  • Using Command Line / Terminal, run ‘python acquire.py’ to create the data.json file that contains the corpus
  • Using Jupyter Notebook or other Python Scripting Program
    • Run all required imports
    • Import functions for acquiring the corpus from acquire.py module
    • Obtain the original size of the corpus
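
Below is a minimal sketch of the core of this acquire step, assuming the Kaggle chart data is saved locally as billboard.csv with song, artist, and date columns (the filename and column names are assumptions, not the module's actual interface) and that env.py stores the token under api_token as described in section V.

```python
import pandas as pd
import lyricsgenius
from env import api_token  # personal Genius.com access token, kept out of version control

genius = lyricsgenius.Genius(api_token, timeout=15, retries=3)

billboard = pd.read_csv('billboard.csv')  # assumed columns: song, artist, date

records = []
for row in billboard.itertuples():
    result = genius.search_song(title=row.song, artist=row.artist)
    if result is not None:  # some songs are missing from Genius
        records.append({'title': row.song, 'artist': row.artist,
                        'date': row.date, 'raw_lyrics': result.lyrics})

pd.DataFrame(records).to_csv('lyrics.csv', index=False)
```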

3. PREPARE

Using Jupyter Notebook

  • Acquire corpus using functions from the acquire.py module or by utilizing the Google Drive .csv
  • Summarize corpus using methods and document observations
  • Clean documents/corpus
    • Make all text lowercase
    • Normalize, encode, and decode to remove accented text and special characters
    • Remove stopwords
    • Stem or Lemmatize words to acquire base words
    • Convert date to DateTime format
    • Expand contractions to include the full word/meaning
    • Remove song part identifiers ('verse', 'chorus', etc.)
  • Address missing values, data errors, unnecessary data, renaming
  • Conduct feature engineering to create features to explore and feed into the model
    • Add: Decade, Chorus Count, Verse Count, Verse/Chorus Ratio, Word Count, Unique Words per Song, Unique Words per Decade, Bigrams, and Trigrams
    • Conduct Topic Modeling using ________ to extract the main topics from the corpus
    • Conduct Sentiment Analysis using __________ to determine positive, negative, and neutral sentiment
  • Create a data dictionary framework to define final variables and data context
  • Split corpus into train, validate, and test samples prior to modeling if using features in the model

Using Python Scripting Program (Jupyter Notebook)
  • Create prepare functions within prepare.py
  • Store functions needed to prepare the Lyrics Corpus such as:
    • Cleaning Function: to normalize text and remove accented and special characters (sketched after this list)
    • Stem Function: to acquire root words
    • Lemmatize Function: to acquire lexicographically correct root words
    • Stopwords Function: to remove meaningless words
    • Clean_df Function: to remove nulls, convert to DateTime, add ______
    • Engineered Features Functions: to add desired features, topics, and sentiment
    • Split Function: to split the corpus prior to modeling if using features
  • Ensure all imports needed to run the prepare functions are added to the prepare.py document
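
As a rough sketch of what the cleaning, lemmatize, and stopwords functions in prepare.py might look like, here is one common NLTK-based pattern; the module's actual signatures may differ.

```python
import re
import unicodedata

import nltk
from nltk.corpus import stopwords  # requires nltk.download('stopwords') and nltk.download('wordnet')

def basic_clean(text: str) -> str:
    """Lowercase, strip accented characters, and drop special characters."""
    text = text.lower()
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
    return re.sub(r"[^a-z0-9'\s]", '', text)

def lemmatize(text: str) -> str:
    """Reduce each word to its lexicographically correct root."""
    wnl = nltk.stem.WordNetLemmatizer()
    return ' '.join(wnl.lemmatize(word) for word in text.split())

def remove_stopwords(text: str, extra_words=None) -> str:
    """Drop standard English stopwords plus any project-specific additions."""
    stopword_list = set(stopwords.words('english')) | set(extra_words or [])
    return ' '.join(word for word in text.split() if word not in stopword_list)
```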

4. EXPLORE

Using Jupyter Notebook:

  • Document key questions about hypotheses
  • Create visualizations with the intent to discover variable relationships
    • Identify variables related to decade of lyrics
    • Identify any potential data integrity issues
  • Document findings
  • Summarize conclusions, provide clear answers, and summarize takeaways
    • Explain plan of action as deduced from work to this point
  • Create explore functions within explore.py
  • Store functions needed to explore the Lyrics Corpus in explore.py (a topic-modeling example is sketched after this list)
  • Ensure all imports needed to run the explore functions are added to the explore.py document
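
The topic-modeling tool is left unspecified earlier in this README, so the sketch below assumes scikit-learn's LatentDirichletAllocation as one plausible way to surface the topics most characteristic of each decade.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def top_topic_words(docs, n_topics=5, n_words=10):
    """Fit LDA on a collection of documents and return the top words per topic."""
    vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
    counts = vectorizer.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=123).fit(counts)
    vocab = vectorizer.get_feature_names_out()
    return [[vocab[i] for i in topic.argsort()[-n_words:][::-1]]
            for topic in lda.components_]

# One topic summary per decade, assuming the df described in section II:
# for decade, group in df.groupby('decade'):
#     print(decade, top_topic_words(group.lyrics))
```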

5. MODEL & EVALUATE

Using Jupyter Notebook:

  • Establish baseline accuracy
  • Create two types of dataframes to run through models (TF-IDF and engineered features); the TF-IDF path is sketched after this list
  • Train and fit multiple (1000+) models with varying algorithms and/or hyperparameters
  • Compare evaluation metrics across models
  • Remove unnecessary features
  • Evaluate best performing models using validate set
  • Choose best performing validation model for use on test set
  • Test final model on out-of-sample testing corpus
  • Summarize performance
  • Interpret and document findings
  • Create modeling functions within model.py
  • Store functions needed to model the Lyrics Corpus in model.py
  • Ensure all imports needed to run the model functions are added to the model.py document
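
As an illustration of the TF-IDF path through these steps (not the full 1000+ model search), here is a minimal baseline-plus-one-model sketch; the DataFrame df and its lyrics and decade columns follow the data dictionary, and LogisticRegression stands in for whichever algorithms were actually compared.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split lyrics and target decade, keeping class balance with stratify
X_train, X_val, y_train, y_val = train_test_split(
    df.lyrics, df.decade, test_size=0.2, stratify=df.decade, random_state=123)

# Baseline accuracy: always predict the most common decade
baseline = y_train.value_counts(normalize=True).max()

tfidf = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(tfidf.fit_transform(X_train), y_train)

print(f'baseline: {baseline:.2%}')
print(f'validate accuracy: {clf.score(tfidf.transform(X_val), y_val):.2%}')
```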

6. DELIVERY

  • Prepare a presentation using Canva to document data science pipeline process and findings
    • Include an introduction of the project and goals
    • Provide an executive summary of findings, key takeaways, recommendations, and next steps
    • Create a walkthrough of the acquisition, preparation, exploration and modeling processes
    • Include presentation-worthy visualizations that support exploration and modeling
    • Provide final takeaways, recommend a course of action, and next steps
  • Prepare final notebook in Jupyter Notebook
    • Create clear walk-through of the Data Science Pipeline using headings and dividers
    • Explicitly define questions asked during the initial analysis
    • Visualize relationships
    • Document takeaways
    • Comment code thoroughly

IV. PROJECT MODULES:

  • acquire.py - provides reproducible python code to automate acquisition
  • prepare.py - provides reproducible python code to automate cleaning, preparing, and splitting the corpus
  • explore.py - provides reproducible python code to automate exploration and visualization
  • model.py - provides reproducible python code to automate modeling and evaluation

V. PROJECT REPRODUCTION:

EXPAND REPRODUCTION STEPS
  • There are two options for acquiring the data in order to reproduce the project:
    • 1. Download the .csv file of the song title, artist, lyrics, and date the song first appeared on the Top 100 through Google Drive
    • 2. Create an env.py file that will contain access credentials to the Genius API.
      • Follow directions at Genius to generate an API personal access token to gain access to the contents within the Genius API.
      • Save the token in your env.py file under the variable api_token
      • Make .gitignore and confirm .gitignore is hiding your env.py file
      • Store that .gitignore and env file locally in the repository
      • Run the acquire.py module
  • Clone our repo (including all .py modules)
  • Import supporting resources and Python libraries: pandas, matplotlib, seaborn, plotly, numpy, sklearn, scipy, nltk, contractions, json, re, unicodedata, unidecode, and lyricsgenius
  • Follow the steps as outlined in README.md and work.ipynb
  • Run Final_Report.ipynb to view the final product
