
Data Engineer Project: Retail Data Pipeline


Using Airflow, BigQuery, Google Cloud Storage, dbt, Soda, Metabase and Python

Objective

The goal of this project is to create an end-to-end data pipeline from a Kaggle dataset on retail data. This involves modeling the data into fact-dimension tables, implementing data quality steps, utilizing modern data stack technologies (dbt, Soda, and Airflow), and storing the data in the cloud (Google Cloud Platform). Finally, the reporting layer is consumed in Metabase. The project is containerized via Docker and versioned on GitHub.

Tech Stack Used

  • Python
  • Docker and Docker Compose
  • Soda.io
  • Metabase
  • Google Cloud Storage
  • Google BigQuery
  • Airflow (Astronomer version)
  • dbt
  • GitHub (repo: https://github.com/alanceloth/Retail_Data_Pipeline)

Here are all the details about the project's objective and results.

In the folder dags/include/datasets/ you will find 3 files: online_retail.csv is the original dataset downloaded from Kaggle, and country.csv was generated using a BigQuery table. This is all the data needed for this project.


To run this project you must

Install Docker

Install Docker for your OS

Install Astro CLI

Install Astro CLI for your OS
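For example, on macOS the Astro CLI is usually installed with Homebrew (a sketch; check the Astronomer docs for the exact command for your OS):

brew install astro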

Clone the GitHub repo

In your terminal:

Clone the repo using the GitHub CLI or Git:

gh repo clone alanceloth/Retail_Data_Pipeline

or

git clone https://github.com/alanceloth/Retail_Data_Pipeline.git

Open the folder with your code editor.

Reinitialize the Airflow project

Open the code editor terminal:

astro dev init

It will ask: "You are not in an empty directory. Are you sure you want to initialize a project? (y/n)". Type y and the project will be reinitialized.

Build the project

In the code editor terminal, type:

astro dev start

The default Airflow endpoint is http://localhost:8080/

  • Default username: admin
  • Default password: admin

Create the GCP project

In your browser, go to https://console.cloud.google.com/ and create a project; a recommended name is something like airflow-dbt-soda-pipeline.

Copy your project ID and save it for later.

Using the project ID from GCP

Change the following files:

  • .env (GCP_PROJECT_ID)
  • include/dbt/models/sources/sources.yml (database)
  • include/dbt/profiles.yml (project)
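For illustration, assuming the project ID created above is airflow-dbt-soda-pipeline (use your own), the .env entry would look like the line below; the database field in sources.yml and the project field in profiles.yml take the same value.

GCP_PROJECT_ID=airflow-dbt-soda-pipeline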

Create a Bucket on GCP

With the project selected, go to https://console.cloud.google.com/storage/browser and create a bucket. Use the name <yourname>_online_retail, then change the bucket_name variable to your bucket name in the dags/retail.py file.
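If you prefer the command line, the bucket can also be created with gsutil; a sketch, assuming you substitute your own project ID and bucket name:

gsutil mb -p <your-project-id> gs://<yourname>_online_retail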

Create a service account for the project

Go to the IAM tab and create a service account named airflow-online-retail. Give it admin access to GCS and BigQuery, and export the JSON key. Rename the file to service_account.json and put it inside the folder include/gcp/ (you will have to create this folder).
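The same setup can also be done with the gcloud CLI; a sketch, assuming <PROJECT_ID> is your project ID and that "admin access" means the storage.admin and bigquery.admin roles:

gcloud iam service-accounts create airflow-online-retail --project <PROJECT_ID>
gcloud projects add-iam-policy-binding <PROJECT_ID> --member "serviceAccount:airflow-online-retail@<PROJECT_ID>.iam.gserviceaccount.com" --role roles/storage.admin
gcloud projects add-iam-policy-binding <PROJECT_ID> --member "serviceAccount:airflow-online-retail@<PROJECT_ID>.iam.gserviceaccount.com" --role roles/bigquery.admin
gcloud iam service-accounts keys create include/gcp/service_account.json --iam-account airflow-online-retail@<PROJECT_ID>.iam.gserviceaccount.com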

Create a connection in your Airflow

In Airflow, at http://localhost:8080/, log in and go to Admin → Connections. Create a new connection with these settings:

  • id: gcp
  • type: Google Cloud
  • Keyfile Path: /usr/local/airflow/include/gcp/service_account.json

Save it.
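If you prefer the CLI, a roughly equivalent connection can be created with the Airflow CLI through the Astro CLI; a sketch, noting that the name of the extra field holding the key path can vary between Airflow and Google provider versions:

astro dev run connections add gcp --conn-type google_cloud_platform --conn-extra '{"key_path": "/usr/local/airflow/include/gcp/service_account.json"}'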

Create your Soda account and API keys

Go to https://www.soda.io/, click "Start a trial", and create an account. Then log in, go to your profile > API Keys, and create a new API key.

Copy the soda_cloud snippet; it will look like this:

soda_cloud:
  host: cloud.us.soda.io
  api_key_id: <KEY>
  api_key_secret: <SECRET>

Paste it into include/soda/configuration.yml, or edit the .env file with the respective values. Note that in this example the account was created in the US region; if your account is in the EU region, you will have to change the host value.

All set, start the DAG

With Airflow running, go to http://localhost:8080/, click on DAGs, and open the retail DAG. Then start the DAG (the play button in the upper right corner).
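Alternatively, the DAG can be triggered from the terminal; a sketch using the Astro CLI wrapper around the Airflow CLI:

astro dev run dags trigger retail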

It will run step by step, and if everything was followed, you will get a green run at the end. Check in Google Cloud Storage that the file was uploaded successfully, in BigQuery that the tables were built, and in your Soda dashboard that all checks passed.

Then move on to Metabase and build your own dashboard. The Metabase service runs at http://localhost:3000/.

Metabase

Go to http://localhost:3000/ and create your local account. When the "Add your data" option shows up, choose BigQuery and enter your details. It will ask for a display name; I recommend BigQuery_DW or something similar. The project ID is your GCP project ID (mine was airflow-dbt-soda-pipeline). Finally, provide the service_account.json file, the one you saved in the include/gcp/ folder.

Connect the database, and it's all set.

Contact

If you have questions, feel free to ask me.

References

Project based on Marc Lamberti's videos.
