Sparkify DB creation + ETL Pipeline

This repository contains the scripts needed to create a PostgreSQL DB for a music streaming app (Sparkify), with an architecture designed for performant queries, as well as the scripts that extract data from 2 sets of files and load it into the DB.

0. Context & Architecture

To keep queries against the database performant, a star schema of 5 tables is used: 1 fact table (songplays) and 4 dimension tables (users, songs, artists, time).
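As a rough illustration of the star schema described above, the sketch below builds the 5 tables in an in-memory SQLite database, used here only as a stand-in for PostgreSQL so the example runs without a server. The column names are hypothetical and trimmed down; the actual definitions are the ones in sql_queries.py.

```python
import sqlite3

# Hypothetical, trimmed-down columns: the real definitions live in sql_queries.py.
# The fact table (songplays) references each of the 4 dimension tables.
SCHEMA = """
CREATE TABLE users   (user_id INTEGER PRIMARY KEY, first_name TEXT, level TEXT);
CREATE TABLE songs   (song_id TEXT PRIMARY KEY, title TEXT, artist_id TEXT);
CREATE TABLE artists (artist_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE time    (start_time INTEGER PRIMARY KEY, hour INTEGER, weekday INTEGER);
CREATE TABLE songplays (
    songplay_id INTEGER PRIMARY KEY,
    start_time  INTEGER REFERENCES time(start_time),
    user_id     INTEGER REFERENCES users(user_id),
    song_id     TEXT REFERENCES songs(song_id),
    artist_id   TEXT REFERENCES artists(artist_id)
);
"""


def build_schema(conn):
    """Create the 5 tables and return their names in alphabetical order."""
    conn.executescript(SCHEMA)
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")  # SQLite stand-in, not the project's PostgreSQL DB
tables = build_schema(conn)
print(tables)
```

Because every foreign key in songplays points directly at a dimension table, a dashboard query typically needs at most one JOIN per dimension it touches.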

The startup has set up systems that collect 2 types of data:

  • Songs
  • Activity Logs

The scripts in this repo extract data from these 2 groups of files and load it into the 5 tables of the DB.
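The extraction step can be sketched roughly as follows. This is a minimal sketch, not the code in etl.py: it assumes the data files are newline-delimited JSON with a .json extension, which this README does not state, and the helper names find_files and read_records are hypothetical.

```python
import fnmatch
import json
import os


def find_files(root, pattern="*.json"):
    """Return sorted absolute paths of all files under root matching pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            matches.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(matches)


def read_records(path):
    """Yield one dict per non-empty line (assumes newline-delimited JSON)."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Under these assumptions, a loader would call find_files on each of the 2 data folders, iterate over read_records for every file, and insert the resulting rows into the matching tables.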

With the data cleanly placed into the 5 tables, the analytics team can easily create dashboards that focus on different areas of the business:

  1. Acquiring and retaining users
  2. Growing the song catalog
  3. Acquiring more artists
  4. Engaging users daily

These areas correspond to 4 independent tables, so most dashboard queries need few JOINs. The time table is used when a query needs time granularity.

Advanced queries and use cases, such as recommending songs to a user based on listening history, can be built easily on the songplays table.
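As one possible illustration of such a use case, the sketch below ranks songs that a user has not played yet by how often they were played by other users who share at least one song with them. It is a naive co-listening heuristic chosen for this example, not a method taken from this repo. It runs against an in-memory SQLite table with only two hypothetical columns (user_id, song_id); a PostgreSQL driver would use its own placeholder syntax instead of `?`.

```python
import sqlite3

# Self-join on songplays: mine = the user's plays, peer = other users who
# played one of the same songs, rec = everything those peers played.
RECOMMEND_SQL = """
SELECT rec.song_id, COUNT(*) AS score
FROM songplays AS mine
JOIN songplays AS peer ON peer.song_id = mine.song_id
                      AND peer.user_id <> mine.user_id
JOIN songplays AS rec  ON rec.user_id = peer.user_id
WHERE mine.user_id = ?
  AND rec.song_id NOT IN (
      SELECT song_id FROM songplays
      WHERE user_id = ? AND song_id IS NOT NULL
  )
GROUP BY rec.song_id
ORDER BY score DESC, rec.song_id
"""


def recommend(conn, user_id):
    """Return unplayed song_ids, ranked by how often the user's peers played them."""
    return [row[0] for row in conn.execute(RECOMMEND_SQL, (user_id, user_id))]


# Tiny made-up dataset, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songplays (user_id INTEGER, song_id TEXT)")
conn.executemany(
    "INSERT INTO songplays VALUES (?, ?)",
    [(1, "A"), (2, "A"), (2, "B"), (2, "B"), (3, "C")],
)
print(recommend(conn, 1))
```

The query only touches the fact table, which is the point of the design: listening history lives in songplays, and dimension tables are joined in only when song titles or artist names are needed for display.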

1. Running Python Scripts

To create the tables, run:

python create_tables.py

To load the data from the songs and activity logs, run:

python etl.py

2. Directory Structure

  • /data - folder containing all the data
    • /data/log_data - Activity Logs
    • /data/song_data - Songs
  • create_tables.py - script to create DB + tables
  • etl.py - script to extract data from /data/log_data and /data/song_data and load it to the DB
  • sql_queries.py - helper script with all SQL queries
  • etl.ipynb - Jupyter notebook with the ETL process, used for development and then converted into etl.py
  • test.ipynb - Jupyter notebook for testing
  • Readme.md
