
A LYRICAL EVOLUTION:
An Investigation of the Cultural Lexicon & Historical Relevance of U.S. Popular Music from 1958 - Present
===

Team Members: Ben Smith, Chris Teceno, Jerry Nolf, Rachel Robbins-Mayhill | Codeup | Innis Cohort | June 2022

Table of Contents

I. Project Overview
II. Project Data Context
III. Project Plan - Using the Data Science Pipeline
IV. Project Modules
V. Project Reproduction

I. PROJECT OVERVIEW

1. GOAL:

This project aimed to investigate patterns in song lyrics across decades using time series analysis and natural language processing techniques, including topic modeling, sentiment analysis, and term frequency. The data was collected from a Kaggle dataset of the Billboard Top 100 songs from 1958 to 2021, with lyrics pulled from the Genius.com API via web scraping. We believe the lyrics of popular songs can be used for historical analysis, applying exploratory methods and hypothesis testing to identify changing societal trends in relationships, technology, sexuality, and vulgarity. Furthermore, we believe we can predict the decade a song first appeared on the Top 100 using engineered features and machine learning methods.

2. DESCRIPTION:

Songs are powerful tokens: they can soothe, validate, ignite, confront, and educate us, among other things. Like time capsules, they are captured for eternity. The slang and language used are often indicative of the times, and you can often pinpoint when a song was made based on what it mentions. Arguably, music is a catalyst for societal and cultural evolution like no other art form. It has been causing controversy and societal upheaval for decades, and it seems every generation brings a new musical trend that has the older generations shaking their heads.

For centuries, songs were passed down through generations, sung as oral histories. The technological advancements of the 20th century, however, made the world of music a much smaller place: thanks to cheap, widely available audio equipment, songs are now distributed on a much larger scale, with a farther-reaching impact and a more permanent place in history.

This project aimed to combine the record of lyrical history and technological advancements to evaluate the changes in the cultural lexicon and societal evolution over the last 50+ years. Using machine learning and natural language processing methodologies, we investigated the topics prevalent in songs of the past, predicted the decade in which they were written, and conducted historical analysis through exploration to identify changing societal trends in relationships, technology, sexuality, and vulgarity.

To do this, we acquired a Kaggle dataset of the Billboard Top 100 songs from the chart's inception in 1958 to the present. We then used the Genius.com API and the LyricsGenius library to scrape the lyrics for the specified songs, which became the corpus for this project. The acquired data can be easily accessed via this Google Drive .csv file.

After acquiring and preparing the corpus, our team conducted time series analysis and natural language processing exploration utilizing methods such as topic modeling, word clouds, and bigrams. We also employed multiclass classification methods to create multiple machine learning models. The end goal was to create an NLP model that accurately predicted the decade a song first appeared on the Billboard Top 100 chart, based on the words and word combinations found in the lyrics of the song.

We chose the Billboard Hot 100 song list as our focus because it is the music industry's standard record chart for song popularity in the United States, published weekly by Billboard magazine. It provides a window into popular culture at a given time through chart rankings of the songs trending in sales, airplay, and (now) streaming for that week in the United States. It is arguably the best historical record of the impact of specific popular songs over time.

3. FORMULATING HYPOTHESES

The initial hypothesis of this project was that we could use the top songs of each decade in conjunction with topic modeling to identify unique words or topics which could be used as features to accurately predict the decade a song was on the Billboard Top 100 using machine learning. The thought behind this was that popular songs have been the historians of a unique lexicon, specific to their place in time. We believe the lyrics of popular songs could be analyzed through machine learning to identify societal trends in relationships, technology, sexuality, and vulgarity.

4. INITIAL QUESTIONS:

The focus of the project is on identifying the decade a song first appeared on the Billboard Top 100. Below are some of the initial questions this project looks to answer throughout the Data Science Pipeline; a minimal sketch of how the frequency questions might be approached follows the list.

Data-Focused Questions
  • What are the most frequently occurring words?
  • What are the most frequently occurring bigrams (pairs of words) by each decade?
  • What topics are most unique to each decade?
  • Is there a correlation between sentiment and decade?
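
As a minimal illustration of how the frequency questions above might be answered, the sketch below counts the most common words and bigrams with Python's Counter. The DataFrame df and its decade and lyrics columns follow the data dictionary in section II; the helper name top_ngrams is ours for illustration, not a function from the project modules.

```python
from collections import Counter

import nltk
import pandas as pd

def top_ngrams(lyrics: pd.Series, n: int = 1, k: int = 10):
    """Return the k most common n-grams across a Series of cleaned lyrics."""
    words = ' '.join(lyrics).split()
    return Counter(nltk.ngrams(words, n)).most_common(k)

# Most frequent words overall, then bigrams per decade:
# top_ngrams(df.lyrics, n=1)
# for decade, group in df.groupby('decade'):
#     print(decade, top_ngrams(group.lyrics, n=2))
```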

5. KEY FINDINGS:

The key findings for this presentation are available in slide format by clicking on the Final Slide Presentation.

Ultimately, our hypothesis that ______ TBD _____________

Exploration revealed ________ TBD ____________

Extensive feature engineering was completed prior to modeling in an attempt to create a higher-performing model; however, our best-performing models were made with ________ TBD ______________

6. DELIVERABLES:

  • README file - provides an overview of the project and steps for project reproduction
  • Draft Jupyter Notebook - provides all steps taken to produce the project
  • .py modules - provide reproducible code to automate acquiring, preparing, splitting, exploring, and modeling the corpus
  • Final Jupyter Notebook - provides presentation-ready acquire, prepare, exploration, modeling, and summary
  • Slide Deck - includes executive summary, takeaways, and explanation of key insights from each step of the Data Science Pipeline
  • One-Page Handout - provides overview and summary of project process and outcome
  • 10 Minute Presentation

II. PROJECT DATA CONTEXT

1. DATA DICTIONARY:

The final DataFrame used to explore the corpus for this project contains the following variables (columns). The variables, along with their data types, are defined below:

| Variable | Definition | DataType |
| --- | --- | --- |
| title | Title of song listed on the Billboard Top 100 chart | object |
| artist | Vocalist who performed the song | object |
| date | Date the song FIRST appeared on the Billboard Top 100 | datetime |
| lyrics | The lyric contents cleaned with the prep_data function | object |
| raw_lyrics | The contents of scraped song lyrics without cleaning | object |
| decade (target) | The decade the song was FIRST listed on the Billboard Top 100 | integer |
| character_count* | The number of characters within the cleaned document | integer |
| word_count* | The number of words within the cleaned document | integer |
| unique_words* | A list of the unique words in the cleaned document | object |
| unique_word_count* | The number of unique words in the cleaned document | integer |
| sentiment* | Score between -1.0 (negative) and 1.0 (positive) indicating the overall emotional leaning of the lyrics | float |
| sentiment_category* | Category based upon sentiment score: very negative, somewhat negative, neutral, somewhat positive, very positive | category |
| place_words* | A list of song part identifiers in the lyrics | object |
| chorus_count* | The number of choruses in the song | integer |
| verse_count* | The number of verses in the song | integer |
| verse_chorus_ratio* | The ratio of verses to choruses | float |
| pre_chorus_count* | The number of pre-choruses in the song | integer |
| outro_count* | The number of outros in the song | integer |
| bridge_count* | The number of bridges in the song | integer |
| hook_count* | The number of hooks in the song | integer |
| bigrams* | A list of bigrams in the cleaned document | object |
| trigrams* | A list of trigrams in the cleaned document | object |

\* = feature engineered (the sentiment columns are sketched below)
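
The sentiment tool is not named in this README, so the following sketch is an assumption rather than the project's actual implementation: it uses NLTK's VADER analyzer, whose compound score matches the -1.0 to 1.0 range above, with hypothetical bin edges for the five sentiment categories.

```python
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer  # requires nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()

def sentiment_features(lyrics: str) -> pd.Series:
    """Score lyrics and bucket the score into the five categories above."""
    score = sia.polarity_scores(lyrics)['compound']  # ranges from -1.0 to 1.0
    bins = [-1.0, -0.6, -0.2, 0.2, 0.6, 1.0]         # hypothetical cut points
    labels = ['very negative', 'somewhat negative', 'neutral',
              'somewhat positive', 'very positive']
    category = pd.cut([score], bins=bins, labels=labels, include_lowest=True)[0]
    return pd.Series({'sentiment': score, 'sentiment_category': category})
```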

III. PROJECT PLAN - USING THE DATA SCIENCE PIPELINE:

The following outlines the process taken through the Data Science Pipeline to complete this project.

Plan ➜ Acquire ➜ Prepare ➜ Explore ➜ Model & Evaluate ➜ Deliver


1. PLAN

  • Create an organizational tool for tracking project completion through the data science pipeline using Trello
  • Review project expectations
  • Draft project goal to include measures of success
  • Clarify questions related to the project
  • Create exploratory questions related to the corpus
  • Draft starting hypothesis
  • Add all planning and project breakdown tasks to the organizational tool

2. ACQUIRE

  • Create .gitignore
  • Obtain API token from Genius.com
  • Create env file and store the Genius token within
  • Store env file in .gitignore to ensure the security of sensitive data
  • Obtain Billboard Top 100 Song list from Kaggle
  • Pip Install LyricsGenius
  • Create acquire.py module
  • Store functions needed to acquire the lyric documents to make up the corpus
    • Pull artist and song title from Kaggle dataset
    • Run each song through LyricsGenius to obtain lyrics (a sketch of this step follows the list)
    • Store as .csv
  • Ensure all imports needed to run the acquire functions are inside the acquire.py document
  • Using Command Line / Terminal, run ‘python acquire.py’ to create the data.json file that contains the corpus
  • Using Jupyter Notebook or other Python Scripting Program
    • Run all required imports
    • Import functions for acquiring the corpus from acquire.py module
    • Obtain the original size of the corpus
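
Below is a minimal sketch of the core of this acquire step, assuming the Kaggle chart data is saved locally as billboard.csv with song, artist, and date columns (the filename and column names are assumptions, not the module's actual interface) and that env.py stores the token under api_token as described in section V.

```python
import pandas as pd
import lyricsgenius
from env import api_token  # personal Genius.com access token, kept out of version control

genius = lyricsgenius.Genius(api_token, timeout=15, retries=3)

billboard = pd.read_csv('billboard.csv')  # assumed columns: song, artist, date

records = []
for row in billboard.itertuples():
    result = genius.search_song(title=row.song, artist=row.artist)
    if result is not None:  # some songs are missing from Genius
        records.append({'title': row.song, 'artist': row.artist,
                        'date': row.date, 'raw_lyrics': result.lyrics})

pd.DataFrame(records).to_csv('lyrics.csv', index=False)
```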

3. PREPARE

Using Jupyter Notebook

  • Acquire corpus using functions from the acquire.py module or by utilizing the Google Drive .csv
  • Summarize corpus using methods and document observations
  • Clean documents/corpus
    • Make all text lowercase
    • Normalize, encode, and decode to remove accented text and special characters
    • Remove stopwords
    • Stem or Lemmatize words to acquire base words
    • Convert date to DateTime format
    • Expand contractions to include the full word/meaning
    • Remove song part identifiers ('verse', 'chorus', etc.)
  • Address missing values, data errors, unnecessary data, renaming
  • Conduct feature engineering to create features to explore and feed into the model
    • Add: Decade, Chorus Count, Verse Count, Verse/Chorus Ratio, Word Count, Unique Words per Song, Unique Words per Decade, Bigrams, and Trigrams
    • Conduct Topic Modeling using ________ to extract the main topics from the corpus
    • Conduct Sentiment Analysis using __________ to determine positive, negative, and neutral sentiment
  • Create a data dictionary framework to define final variables and data context
  • Split corpus into train, validate, and test samples prior to modeling if using features in the model

Using Python Scripting Program (Jupyter Notebook)
  • Create prepare functions within prepare.py
  • Store functions needed to prepare the Lyrics Corpus such as:
    • Cleaning Function: to normalize text and remove accented and special characters (sketched after this list)
    • Stem Function: to acquire root words
    • Lemmatize Function: to acquire lexicographically correct root words
    • Stopwords Function: to remove meaningless words
    • Clean_df Function: to remove nulls, convert to DateTime, add ______
    • Engineered Features Functions: to add desired features, topics, and sentiment
    • Split Function: to split the corpus prior to modeling if using features
  • Ensure all imports needed to run the prepare functions are added to the prepare.py document
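
As a rough sketch of what the cleaning, lemmatize, and stopwords functions in prepare.py might look like, here is one common NLTK-based pattern; the module's actual signatures may differ.

```python
import re
import unicodedata

import nltk
from nltk.corpus import stopwords  # requires nltk.download('stopwords') and nltk.download('wordnet')

def basic_clean(text: str) -> str:
    """Lowercase, strip accented characters, and drop special characters."""
    text = text.lower()
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
    return re.sub(r"[^a-z0-9'\s]", '', text)

def lemmatize(text: str) -> str:
    """Reduce each word to its lexicographically correct root."""
    wnl = nltk.stem.WordNetLemmatizer()
    return ' '.join(wnl.lemmatize(word) for word in text.split())

def remove_stopwords(text: str, extra_words=None) -> str:
    """Drop standard English stopwords plus any project-specific additions."""
    stopword_list = set(stopwords.words('english')) | set(extra_words or [])
    return ' '.join(word for word in text.split() if word not in stopword_list)
```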

4. EXPLORE

Using Jupyter Notebook:

  • Document key questions about hypotheses
  • Create visualizations with the intent to discover variable relationships
    • Identify variables related to decade of lyrics
    • Identify any potential data integrity issues
  • Document findings
  • Summarize conclusions, provide clear answers, and summarize takeaways
    • Explain plan of action as deduced from work to this point
  • Create explore functions within explore.py
  • Store functions needed to explore the Lyrics Corpus in explore.py (a topic-modeling example is sketched after this list)
  • Ensure all imports needed to run the explore functions are added to the explore.py document
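
The topic-modeling tool is left unspecified earlier in this README, so the sketch below assumes scikit-learn's LatentDirichletAllocation as one plausible way to surface the topics most characteristic of each decade.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def top_topic_words(docs, n_topics=5, n_words=10):
    """Fit LDA on a collection of documents and return the top words per topic."""
    vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
    counts = vectorizer.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=123).fit(counts)
    vocab = vectorizer.get_feature_names_out()
    return [[vocab[i] for i in topic.argsort()[-n_words:][::-1]]
            for topic in lda.components_]

# One topic summary per decade, assuming the df described in section II:
# for decade, group in df.groupby('decade'):
#     print(decade, top_topic_words(group.lyrics))
```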

5. MODEL & EVALUATE

Using Jupyter Notebook:

  • Establish baseline accuracy
  • Create two types of dataframes to run through models (TF-IDF and engineered features); the TF-IDF path is sketched after this list
  • Train and fit multiple (1000+) models with varying algorithms and/or hyperparameters
  • Compare evaluation metrics across models
  • Remove unnecessary features
  • Evaluate best performing models using validate set
  • Choose best performing validation model for use on test set
  • Test final model on out-of-sample testing corpus
  • Summarize performance
  • Interpret and document findings
  • Create modeling functions within model.py
  • Store functions needed to model the Lyrics Corpus in model.py
  • Ensure all imports needed to run the model functions are added to the model.py document
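
As an illustration of the TF-IDF path through these steps (not the full 1000+ model search), here is a minimal baseline-plus-one-model sketch; the DataFrame df and its lyrics and decade columns follow the data dictionary, and LogisticRegression stands in for whichever algorithms were actually compared.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split lyrics and target decade, keeping class balance with stratify
X_train, X_val, y_train, y_val = train_test_split(
    df.lyrics, df.decade, test_size=0.2, stratify=df.decade, random_state=123)

# Baseline accuracy: always predict the most common decade
baseline = y_train.value_counts(normalize=True).max()

tfidf = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(tfidf.fit_transform(X_train), y_train)

print(f'baseline: {baseline:.2%}')
print(f'validate accuracy: {clf.score(tfidf.transform(X_val), y_val):.2%}')
```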

6. DELIVERY

  • Prepare a presentation using Canva to document data science pipeline process and findings
    • Include an introduction of the project and goals
    • Provide an executive summary of findings, key takeaways, recommendations, and next steps
    • Create a walkthrough of the acquisition, preparation, exploration and modeling processes
    • Include presentation-worthy visualizations that support exploration and modeling
    • Provide final takeaways, recommend a course of action, and next steps
  • Prepare final notebook in Jupyter Notebook
    • Create clear walk-through of the Data Science Pipeline using headings and dividers
    • Explicitly define questions asked during the initial analysis
    • Visualize relationships
    • Document takeaways
    • Comment code thoroughly

IV. PROJECT MODULES:

  • acquire.py - provides reproducible python code to automate acquisition
  • prepare.py - provides reproducible python code to automate cleaning, preparing, and splitting the corpus
  • explore.py - provides reproducible python code to automate exploration and visualization
  • model.py - provides reproducible python code to automate modeling and evaluation

V. PROJECT REPRODUCTION:

EXPAND REPRODUCTION STEPS
  • There are two options for acquiring the data in order to reproduce the project:
    • 1. Download the .csv file of the song title, artist, lyrics, and date the song first appeared on the Top 100 through Google Drive
    • 2. Create an env.py file that will contain access credentials to the Genius API.
      • Follow directions at Genius to generate an API personal access token to gain access to the contents within the Genius API.
      • Save the token in your env.py file under the variable api_token
      • Make .gitignore and confirm .gitignore is hiding your env.py file
      • Store that .gitignore and env file locally in the repository
      • Run the acquire.py module
  • Clone our repo (including all .py modules)
  • Import supporting resources and Python libraries: pandas, matplotlib, seaborn, plotly, numpy, sklearn, scipy, nltk, contractions, json, re, unicodedata, unidecode, and lyricsgenius
  • Follow the steps as outlined in README.md and work.ipynb
  • Run Final_Report.ipynb to view the final product
