The music streaming startup Sparkify has grown its user base and song database and wants to move its processes and data onto the cloud. The data resides in AWS S3 as JSON files, including user activity logs and song metadata.
- The analytics team wants insights into what songs users are listening to, using dimension tables
- The analytics team wants a dimensional model in the data warehouse optimized for song play analysis queries
- Launch an AWS Redshift cluster
- Ensure the Redshift cluster is able to access S3
- Create the dimensional model and ETL pipeline to prepare data for the analytics team
- Explore and load the JSON files in S3 into Redshift staging tables (see the COPY sketch after this list)
- Define and create fact and dimension tables for a star schema model for data analytics purposes
- Create an ETL pipeline to load data from the staging tables into the data warehouse tables on Redshift
- Connect to the Redshift cluster and complete testing using test queries
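As a rough sketch of the staging load step, a COPY statement of the kind that might live in sql_queries.py is shown below; the bucket path, JSONPaths file, IAM role ARN, and region are placeholders, not the project's actual values.

```python
# Sketch of a staging COPY as it could appear in sql_queries.py.
# All S3 paths, the IAM role ARN, and the region are placeholders.
staging_events_copy = """
    COPY staging_events
    FROM 's3://<log-data-bucket>/log_data'
    IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'
    FORMAT AS JSON 's3://<log-data-bucket>/log_json_path.json'
    REGION 'us-west-2';
"""
```

etl.py would execute such statements against the cluster through a psycopg2 cursor.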
- AWS S3
- Python
- Redshift
The data comprises song metadata and song play activity logs.
- Song Dataset
The song dataset is a subset of real data from the [Million Song Dataset](https://labrosa.ee.columbia.edu/millionsong/). Data files are in JSON format and contain metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example:
song_data/A/B/C/TRABCEI128F424C983.json song_data/A/A/B/TRAABJL12903CDCF1A.json
Below is an example of a single song file, TRAABJL12903CDCF1A.json:
{ "num_songs":1, "artist_id":"ARJIE2Y1187B994AB7", "artist_latitude":null, "artist_longitude":null, "artist_location":"", "artist_name":"Line Renaud", "song_id":"SOUPIRU12A6D4FA1E1", "title":"Der Kleine Dompfaff", "duration":152.92036, "year":0 }
- Log Dataset
The second dataset consists of log files in JSON format. The log files in the dataset are partitioned by year and month. For example:
log_data/2018/11/2018-11-12-events.json log_data/2018/11/2018-11-13-events.json
Below is an example of a single log file, 2018-11-13-events.json:
{ "artist":"Pavement", "auth":"Logged In", "firstName":"Sylvie", "gender":"F", "itemInSession":0, "lastName":"Cruz", "length":99.16036, "level":"free", "location":"Klamath Falls, OR", "method":"PUT", "page":"NextSong", "registration":"1.541078e+12", "sessionId":345, "song":"Mercy:The Laundromat", "status":200, "ts":1541990258796, "userAgent":"Mozilla/5.0(Macintosh; Intel Mac OS X 10_9_4...)", "userId":10 }
A star schema is used to optimize queries for song play analysis.
- Fact Table
| Table | Description |
| --- | --- |
| songplays | Records in event data associated with song plays, i.e. records with page `NextSong`. Columns: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent |
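A plausible DDL sketch for the fact table follows; the column types, the IDENTITY surrogate key, and the DISTKEY/SORTKEY choices are assumptions, and the project's actual definition lives in sql_queries.py.

```python
# Sketch of a songplays definition as it might appear in sql_queries.py.
# Types, IDENTITY, DISTKEY, and SORTKEY choices here are assumptions.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id  INT IDENTITY(0, 1) PRIMARY KEY,
        start_time   TIMESTAMP NOT NULL SORTKEY,
        user_id      INT       NOT NULL,
        level        VARCHAR,
        song_id      VARCHAR   DISTKEY,
        artist_id    VARCHAR,
        session_id   INT,
        location     VARCHAR,
        user_agent   VARCHAR
    );
"""
```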
- Dimension Tables
| Table | Description |
| --- | --- |
| users | Users in the app. Columns: user_id, first_name, last_name, gender, level |
| songs | Songs in the music database. Columns: song_id, title, artist_id, year, duration |
| artists | Artists in the music database. Columns: artist_id, name, location, latitude, longitude |
| time | Timestamps of records in songplays broken down into specific units. Columns: start_time, hour, day, week, month, year, weekday |
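The `ts` field in the log data is an epoch timestamp in milliseconds, so the time dimension can be derived in SQL when moving data out of staging. A sketch of that insert is shown below; the staging table name and its columns are assumptions.

```python
# Sketch of the staging-to-time-dimension insert as it might appear in
# sql_queries.py; staging_events and its columns are assumed names.
time_table_insert = """
    INSERT INTO time (start_time, hour, day, week, month, year, weekday)
    SELECT DISTINCT
        start_time,
        EXTRACT(hour  FROM start_time),
        EXTRACT(day   FROM start_time),
        EXTRACT(week  FROM start_time),
        EXTRACT(month FROM start_time),
        EXTRACT(year  FROM start_time),
        EXTRACT(dow   FROM start_time)
    FROM (
        -- ts is epoch milliseconds; convert it to a TIMESTAMP first.
        SELECT TIMESTAMP 'epoch' + ts / 1000 * INTERVAL '1 second' AS start_time
        FROM staging_events
        WHERE page = 'NextSong'
    ) AS events;
"""
```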
create_resources.py - Creates the IAM role, security group, and Redshift cluster.
create_table.py - Creates the staging, fact, and dimension tables for the star schema on Redshift.
etl.py - Loads data from S3 into staging tables on Redshift and then processes that data into the data warehouse tables (fact and dimension) on Redshift; a rough sketch of this flow follows the file list.
sql_queries.py - Contains the SQL queries to define tables and insert data; imported by the two scripts above.
validation.py - Validates the data in the data warehouse and presents some analysis of the data.
shutdown_resources.py - Shuts down the AWS resources after validation.
README.md - Describes the process and decisions for this ETL pipeline.
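A minimal sketch of how etl.py could tie these pieces together is shown below; the function names, the query lists imported from sql_queries.py, and the dwh.cfg keys are assumptions about the layout, not the project's exact code.

```python
import configparser

import psycopg2

# Assumed names for the query lists defined in sql_queries.py.
from sql_queries import copy_table_queries, insert_table_queries


def load_staging_tables(cur, conn):
    # Run each COPY statement to pull the JSON data from S3 into staging tables.
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()


def insert_tables(cur, conn):
    # Populate the star schema (fact and dimension) tables from staging.
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()


def main():
    config = configparser.ConfigParser()
    config.read("dwh.cfg")

    # A [CLUSTER] section with host/dbname/user/password/port keys is assumed.
    conn = psycopg2.connect(
        "host={host} dbname={dbname} user={user} password={password} port={port}".format(
            **config["CLUSTER"]
        )
    )
    cur = conn.cursor()

    load_staging_tables(cur, conn)
    insert_tables(cur, conn)

    conn.close()


if __name__ == "__main__":
    main()
```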
- Update the dwh.cfg file with valid AWS credentials
- Execute the code with the command `python3 etl.py` (the code is automated end to end to ensure minimal manual intervention)
- For example, if you want to execute only the table creation script, comment out the other steps, as illustrated below
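As a rough illustration (the function names below are hypothetical stubs, shown only to make the point), running only the table creation step would mean commenting out the later calls in the script's entry point:

```python
# Hypothetical entry points, stubbed out purely to illustrate commenting steps.
def create_resources(): ...
def create_tables(): ...
def load_staging_tables(): ...
def insert_tables(): ...


def main():
    create_resources()
    create_tables()
    # load_staging_tables()   # commented out: skip the staging load
    # insert_tables()         # commented out: skip the warehouse inserts


if __name__ == "__main__":
    main()
```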
- Top 10 popular songs
Query:
SELECT sp.song_id, s.title, count(*) AS cnt FROM songplays sp JOIN songs s ON sp.song_id = s.song_id GROUP BY 1, 2 ORDER BY 3 DESC LIMIT 10;
Result:
- Top 10 popular artists
Query:
SELECT sp.artist_id, a.name AS artist_name, count(*) AS cnt FROM songplays sp JOIN artists a ON sp.artist_id = a.artist_id GROUP BY 1, 2 ORDER BY 3 DESC LIMIT 10;
Result:
- When are most people listening to songs?
Query:
SELECT CASE WHEN t.hour BETWEEN 2 AND 8 THEN '2-8' WHEN t.hour BETWEEN 9 AND 12 THEN '9-12' WHEN t.hour BETWEEN 13 AND 18 THEN '13-18' WHEN t.hour BETWEEN 19 AND 22 THEN '19-22' ELSE '23-24, 0-2' END AS play_time, count(*) AS cnt FROM songplays sp JOIN time t ON sp.start_time = t.start_time GROUP BY 1 ORDER BY 2 DESC;
Result:
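The result screenshots are not reproduced here. Below is a small sketch of how validation.py might run one of these analysis queries and print the rows; the connection details read from dwh.cfg and the function name are assumptions.

```python
import configparser

import psycopg2

TOP_SONGS_QUERY = """
    SELECT sp.song_id, s.title, count(*) AS cnt
    FROM songplays sp
    JOIN songs s ON sp.song_id = s.song_id
    GROUP BY 1, 2
    ORDER BY 3 DESC
    LIMIT 10;
"""


def run_query(query, title):
    # A [CLUSTER] section in dwh.cfg with these keys is assumed.
    config = configparser.ConfigParser()
    config.read("dwh.cfg")
    conn = psycopg2.connect(
        "host={host} dbname={dbname} user={user} password={password} port={port}".format(
            **config["CLUSTER"]
        )
    )
    cur = conn.cursor()
    cur.execute(query)
    print(title)
    for row in cur.fetchall():
        print(row)
    conn.close()


if __name__ == "__main__":
    run_query(TOP_SONGS_QUERY, "Top 10 popular songs")
```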