Giter Site home page Giter Site logo

marwenhmidi / data-warehouse-with-aws Goto Github PK

View Code? Open in Web Editor NEW

This project forked from bondxue/data-warehouse-with-aws

0.0 0.0 0.0 50 KB

:mushroom:Udacity Data Engineering Nanodegree Project 3

License: MIT License

Python 88.53% Jupyter Notebook 11.47%

data-warehouse-with-aws's Introduction

Project 3: Song Play Analysis with S3 and Redshift


Project passed

Introduction

In this project, we try to help one music streaming startup, Sparkify, to move their user base and song database processes on to the cloud. Specifically, I build an ETL pipeline that extracts their data from AWS S3 (data storage), stages tables on AWS Redshift (data warehouse with columnar storage), and execute SQL statements that create the analytics tables from these staging tables.

Datasets

Datasets used in this project are provided in two public S3 buckets. One bucket contains info about songs and artists, the second bucket has info concerning actions done by users (which song are listening, etc.. ). The objects contained in both buckets are JSON files.

The Redshift service is where data will be ingested and transformed, in fact though COPY command we will access to the JSON files inside the buckets and copy their content on our staging tables.

Database Schema

We have two staging tables which copy the JSON file inside the S3 buckets.

Staging Table

  • staging_songs - info about songs and artists
  • staging_events - actions done by users (which song are listening, etc.. )

I createa a star schema optimized for queries on song play analysis. This includes the following tables.

Fact Table

  • songplays - records in event data associated with song plays i.e. records with page NextSong

Dimension Tables

  • users - users in the app
  • songs - songs in music database
  • artists - artists in music database
  • time - timestamps of records in songplays broken down into specific units

The database schema is shown as follows schema

Data Warehouse Configurations and Setup

  • Create a new IAM user in your AWS account
  • Give it AdministratorAccess and Attach policies
  • Use access key and secret key to create clients for EC2, S3, IAM, and Redshift.
  • Create an IAM Role that makes Redshift able to access S3 bucket (ReadOnly)
  • Create a RedShift Cluster and get the DWH_ENDPOIN(Host address) and DWH_ROLE_ARN and fill the config file.

ETL Pipeline

  • Created tables to store the data from S3 buckets.
  • Loading the data from S3 buckets to staging tables in the Redshift Cluster.
  • Inserted data into fact and dimension tables from the staging tables.

Project Structure

  • create_tables.py - This script will drop old tables (if exist) ad re-create new tables.
  • etl.py - This script executes the queries that extract JSON data from the S3 bucket and ingest them to Redshift.
  • sql_queries.py - This file contains variables with SQL statement in String formats, partitioned by CREATE, DROP, COPY and INSERT statement.
  • dhw.cfg - Configuration file used that contains info about Redshift, IAM and S3

How to Run

  1. Create tables by running create_tables.py.

  2. Execute ETL process by running etl.py.

data-warehouse-with-aws's People

Contributors

bondxue avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.