Giter Site home page Giter Site logo

etl_with_pyspark_-_sparksql's Introduction

ETL_with_Pyspark_-_SparkSQL

A sample project designed to demonstrate ETL process using Pyspark & Spark SQL API in Apache Spark.

In this project I used Apache Sparks's Pyspark and Spark SQL API's to implement the ETL process on the data and finally load the transformed data to a destination source.

I have used Azure Databricks to run my notebooks and to create jobs for my notebooks. To orchestrate the entire workflow, I have used Azure data factory to create the pipelines.

Note: Any resources deployed in azure has an associated price involved. So, user's are wholely responsible for creating and deploying resources to azure and also responsible for all the charges that are incurred if any.

-------------------************************-------------------

main_latest branch:

This branch contains the updated code of the main project that's under main_old branch.

New implementaions/changes:

When compared with the code of main_old branch the number of notebooks and the number of lines of code were decreased to acheive the goal of automating the entire ETL process by creating a single generic notebook that will be used for performing the transformations on the data.

I will be updating this readme file soon with the links of medium post and youtube video where i have clearly explained the new changes that i have done to the old notebooks/code.

etl_with_pyspark_-_sparksql's People

Contributors

pujith-v avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.