Data-Engineering-Olist-Project

A Data Engineering deep dive with several data warehouses, using real data sourced from Kaggle.

In order to practice and enhance my data engineering skill set, I launched this project to model data, build data pipelines, set up a data lake, and more. It helped me deepen my understanding of Data Engineering both practically and conceptually. For this project, I am using Azure to set up the Data Lake.

Purpose of this project

  • Broaden my knowledge of big data infrastructure, and understand the differences between databases, data warehouses, data lakes, data marts, etc.
  • Explore on-premises and cloud data solutions, and the different use cases cloud technologies are suited for
  • Build and deploy solutions using a variety of modern tools: Google BigQuery, SQL Server, Microsoft Azure, Snowflake, Apache Cassandra
  • Learn how Data Pipelines work in detail
  • Use various data services to store and produce cleansed and enhanced datasets for analysis
  • Learn about common data engineering practices and a high-level architecting process for a data engineering project

Goals and Deliverables

Part 1: Establishing the Data Warehouses

For this project, I used several very different applications and cloud data warehouses (CDWs) in order to gain more exposure to how they differ in form and function. These include Apache Cassandra, Snowflake, Azure Data Lake, Google BigQuery, and SQL Server.

Data Build Tool (DBT)

In dbt, you can switch a model between a table and a view by changing a single keyword, rather than writing the data definition language (DDL) yourself; dbt generates the DDL behind the scenes.

The tool acts as an extra layer on top of your data warehouse that streamlines data transformation and integration, and it lets you organize all transformations neatly into discrete models. This takes the guesswork out of hand-writing DDL, so instead of taking a stab in the dark you can simply deploy a well-functioning data model.
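
For example (a minimal sketch, not a model from this repository), flipping a model between a table and a view is a one-word change in its config block. The model and source names below are hypothetical:

```sql
-- models/staging/stg_orders.sql (hypothetical model name)
-- Changing materialized='table' to materialized='view' is the only edit needed;
-- dbt generates the corresponding CREATE TABLE / CREATE VIEW DDL behind the scenes.
{{ config(materialized='table') }}

select
    order_id,
    customer_id,
    order_status,
    order_purchase_timestamp
from {{ source('olist_raw', 'olist_orders_dataset') }}  -- assumes a source declared in a sources .yml
```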

Part 2: Extracting and Loading the data

In this stage, I used the Kaggle API to retrieve the data and the Google BigQuery API to interface with a new BigQuery project. I downloaded and extracted all of the files from Kaggle, then looped through them with enumeration to create pandas DataFrames. Finally, I piped each DataFrame directly into a brand-new BigQuery table through the API.
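
A minimal sketch of that pipeline is shown below. It is not the repository's exact script: the dataset slug, BigQuery project, and dataset names are assumptions, and it uses the official `kaggle` and `google-cloud-bigquery` client libraries.

```python
# Sketch: download the Olist CSVs from Kaggle, load each into a pandas DataFrame,
# and push it into its own new BigQuery table.
import os

import pandas as pd
from kaggle.api.kaggle_api_extended import KaggleApi
from google.cloud import bigquery

DATA_DIR = "olist_data"
BQ_DATASET = "my_project.olist_raw"  # hypothetical project.dataset

# Authenticate against Kaggle (uses ~/.kaggle/kaggle.json) and pull the files.
api = KaggleApi()
api.authenticate()
api.dataset_download_files("olistbr/brazilian-ecommerce", path=DATA_DIR, unzip=True)

client = bigquery.Client()

# Enumerate every extracted CSV, read it into a DataFrame, and load it
# into a BigQuery table named after the file.
for i, filename in enumerate(sorted(os.listdir(DATA_DIR))):
    if not filename.endswith(".csv"):
        continue
    df = pd.read_csv(os.path.join(DATA_DIR, filename))
    table_id = f"{BQ_DATASET}.{filename.removesuffix('.csv')}"
    job = client.load_table_from_dataframe(df, table_id)
    job.result()  # wait for the load job to finish
    print(f"[{i}] loaded {len(df)} rows into {table_id}")
```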

Part 3: Modeling the data (ELT)

In all of the applications I used, I first loaded the raw extracts into the data warehouse, and transformation then takes place in the target system. The data is not transformed on its way into the warehouse; it is stored in its original format and reshaped there.
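
To make the pattern concrete, here is a minimal, hypothetical sketch of the "T" step running inside BigQuery after the raw load. The project, dataset, table, and column names are assumptions based on the public Olist schema, not this project's actual models.

```python
# Sketch: run the transformation inside the warehouse (the "T" of ELT) by
# submitting SQL to BigQuery rather than transforming the data before loading.
from google.cloud import bigquery

client = bigquery.Client()

# Build a cleansed orders table directly in the target system from the raw load.
transform_sql = """
CREATE OR REPLACE TABLE `my_project.olist_analytics.orders_cleaned` AS
SELECT
  order_id,
  customer_id,
  order_status,
  TIMESTAMP(order_purchase_timestamp) AS order_purchase_ts
FROM `my_project.olist_raw.olist_orders_dataset`
WHERE order_status IS NOT NULL
"""

client.query(transform_sql).result()  # the transform runs entirely in BigQuery
```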
