udacity_data_engineering_datalake's Introduction

Data lake using Spark

Sparkify is a music streaming startup that has grown rapidly over the past few months, and its service is now known worldwide.

The customer database has become huge, which makes it challenging to deliver diverse data to business analysts in a timely manner. In addition, new roles, such as data scientists, are going to work on that data.

Usage instructions: This Python notebook was run on an AWS EMR notebook. To run it from outside AWS, you might need to set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
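When running outside AWS, one possible way to hand the two keys to Spark is through the S3A Hadoop properties. The following is a minimal sketch, assuming the keys are exported as environment variables; the helper names s3a_conf and build_session and the application name are hypothetical and do not come from the notebook. Outside EMR, the hadoop-aws package generally also has to be available to Spark.

```python
import os


def s3a_conf(env=os.environ):
    """Map AWS credential environment variables to Spark S3A settings.

    Returns an empty dict when either variable is missing, for example on
    EMR, where the cluster normally supplies credentials itself.
    """
    key = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if not key or not secret:
        return {}
    return {
        "spark.hadoop.fs.s3a.access.key": key,
        "spark.hadoop.fs.s3a.secret.key": secret,
    }


def build_session(app_name="sparkify-datalake"):
    """Create a SparkSession, applying S3A credentials if they are set."""
    # Imported lazily so the module loads even where pyspark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for name, value in s3a_conf().items():
        builder = builder.config(name, value)
    return builder.getOrCreate()
```

On EMR, build_session() adds no credential settings, so the same code should work in both environments.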

Data Sources

Data resides in two directories that contain files in JSON format:

s3a://udacity-dend/song_data : Contains metadata about a song and the artist of that song.
s3a://udacity-dend/log_data : Consists of log files generated by the streaming app based on the songs in the dataset above.

Songs and Artists data processing

The data is loaded from the song_data S3 folder and saved as Parquet files in S3, which are later used to populate the fact table.
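The song step could look roughly like the sketch below. It is not the project's actual etl.py: the column names, the partitioning, the output folder names, and the function names are all assumptions made for illustration, and the caller supplies the input and output locations.

```python
# Assumed column names; the real song_data schema may differ.
SONG_COLUMNS = ["song_id", "title", "artist_id", "year", "duration"]
ARTIST_COLUMNS = [
    "artist_id",
    "artist_name",
    "artist_location",
    "artist_latitude",
    "artist_longitude",
]


def table_path(output_path, table_name):
    """Build the output location for one table under the output folder."""
    return output_path.rstrip("/") + "/" + table_name


def process_song_data(spark, input_path, output_path):
    """Read song JSON files and write songs and artists as Parquet."""
    df = spark.read.json(input_path)

    # One row per song, partitioned on disk by year and artist (assumed layout).
    songs = df.select(*SONG_COLUMNS).dropDuplicates(["song_id"])
    songs.write.mode("overwrite").partitionBy("year", "artist_id").parquet(
        table_path(output_path, "songs")
    )

    # One row per artist.
    artists = df.select(*ARTIST_COLUMNS).dropDuplicates(["artist_id"])
    artists.write.mode("overwrite").parquet(table_path(output_path, "artists"))
```

If the JSON files sit in nested subfolders, input_path may need to be a glob pattern rather than the bare folder.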

Logs data processing

The data is loaded from the log_data folder in S3 and combined with the songs and artists Parquet files to create the fact table.
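A rough sketch of how the log records might be joined with the songs and artists tables is shown below. The log field names (page, song, artist, ts, userId, sessionId, level, location, userAgent), the "NextSong" filter, the output column names, and the function name are assumptions for illustration; the source does not specify the fact table's schema.

```python
# Assumed renames from log field names to fact table column names.
RENAMES = {
    "userId": "user_id",
    "sessionId": "session_id",
    "userAgent": "user_agent",
}


def build_fact_table(log_df, songs_df, artists_df):
    """Join song-play log events with songs and artists into one fact table."""
    # Imported lazily so the module loads even where pyspark is not installed.
    from pyspark.sql import functions as F

    # Keep only events that represent a song being played (assumed page value).
    plays = log_df.filter(F.col("page") == "NextSong")

    # Lookup table keyed by song title and artist name.
    song_lookup = songs_df.join(artists_df, on="artist_id").select(
        "song_id", "artist_id", "title", "artist_name"
    )

    joined = plays.join(
        song_lookup,
        on=(plays["song"] == song_lookup["title"])
        & (plays["artist"] == song_lookup["artist_name"]),
        how="left",
    )

    return joined.select(
        # ts is assumed to be epoch milliseconds.
        (F.col("ts") / 1000).cast("timestamp").alias("start_time"),
        F.col("userId").alias(RENAMES["userId"]),
        "level",
        "song_id",
        "artist_id",
        F.col("sessionId").alias(RENAMES["sessionId"]),
        "location",
        F.col("userAgent").alias(RENAMES["userAgent"]),
    )
```

The left join keeps plays whose song or artist has no match in the songs data; an inner join would drop them, so the choice depends on what the analysts need.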

Usage:

Copy etl.py to your master node, connect to it, and submit the job:

    scp -i ~/yourkeypair.pem etl.py [email protected]:/home/hadoop/
    ssh -i ~/yourkeypair.pem [email protected]
    spark-submit etl.py

Use http://parquet-viewer-online.com/ to view the data in S3.

udacity_data_engineering_datalake's People

Contributors

narengowda
