Giter Site home page Giter Site logo

Juan Carlos Rivera Cuadros's Projects

cloud_data_warehouses icon cloud_data_warehouses

Data Modeling for the company Sparkify. The data is a collection of songs and user activities on their new music streaming app. The goal of the program is to understand what songs users are listening to. The information comes from two JSON directories the datasets are in S3 AWS. The program is a Postgres database with tables designed to optimize queries on song play analysis and an ETL pipeline created in python. ## JSON directories (Start Dataset) Song Dataset The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. columns: artist_id, artist_latitude, artist_location,artist_longitude, artist_name, duration, num_songs, song_id, title, year Log Dataset The second dataset consists of log files in JSON format generated by this event simulator. These simulate activity logs from a music streaming app based on specified configurations. columns: artist, auth, firstName, gender, itemInSession, lastName, length, level, location, method, page, registration, sessionId, song, status, ts, userAgent, userId ## Tables data base in Posgres Basic Table (Data from S3 aws) staging_song, staging_events 1) Song Data: s3://udacity-dend/song_data staging_song: num_songs, artist_id, artist_latitude, artist_longitude, artist_location, artist_name, song_id, title, duration, year 2) Log Data: s3://udacity-dend/log_data ; Log data json path: s3://udacity-dend/log_json_path.json staging_events: staging_event_key, artist, auth, firstName, gender, iteminSession, lastName, length, level, location, method, page = Next Song, registration, sessionId, song, status, ts, userAgent, userId it will be only load the row if page = NextSong Dimension Table Data from Song Dataset as staging_song (song_table, artist_table) 1) song_table: song_id, title, artist_id, year, duration 2) artist_table: artist_id, artist_name, artist_location, artist_latitude, artist_longitude Data from Log Dataset as staging_events (time_table, users) + NextSong 1) time_table: start_time, hour, day, week, month, year, weekday 2) users: userId, firstName, lastName, gender, level Fact Table JOIN beteween staging_song und staging_events (songplay) songplay: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent ## Program description steps (create_tables.py and etl.py) 1) create_tables.py 1.1) connect to datawarehouse and SQL database 1.2) Drop all tables with fuction drop_tables 1.3) create all tables with the fuction create_tables 2) etl.py 2.1) connect to datawarehouse and SQL database 2.2) copy the information of the two S3 AWS datasets into SQL tables (staging_song, staging_events(page=NextSong)) with the function load_staging_tables 2.3) load the information from the staging tables into the fact and dim tables ## querys and dwh data (sql_queries.py and dwh.cfg) ### sql_queries.py in sql_queries.py are the 4 kinds of basics queries: basic queries: -Drop table staging_events_table_drop staging_songs_table_drop songplay_table_drop user_table_drop song_table_drop artist_table_drop time_table_drop -Create table staging_events_table_create staging_songs_table_create songplay_table_create user_table_create song_table_create artist_table_create time_table_create -load_staging_tables from S3 AWS into SQL staging_events_copy staging_songs_copy -insert into fact and dim tables SQL songplay_table_insert user_table_insert song_table_insert artist_table_insert time_table_insert # dhw.cfg in the file dhw is the information about the AWS cluster in redschift and the AIM role. The path from the S3-buckets are in this file as well

data-lake icon data-lake

A music streaming startup, Sparkify, has grown their user base and song database even more and want to move their data warehouse to a data lake

data_modeling_with_cassandra icon data_modeling_with_cassandra

Data Modelling for the company Sparkify. The data is a collection of songs and user activities on their new music streaming app. The goal of the program is to understand what songs users are listening to.The information comes from a directory of CSV files. The program is to created a data base in Apache Cassandra. The pipeline was writed in Python and the program is in a Jupyter Notebook (Data_Modeling_with_Cassandra.ipynb).

data_modeling_with_postgres icon data_modeling_with_postgres

Database and ETL pipeline in Postgres for a music streaming app. They will define Fact and Dimension tables and insert datainto new tables

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.