A Data Engineering Project
In this project, I downloaded a CSV file from the New Zealand government website to build a dashboard that summarizes the country's income from 2013 to 2020. For this project we're going to use:
- Terraform: an Infrastructure-as-Code tool. More info here: https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/1_terraform_gcp/1_terraform_overview.md
- dbt: short for "data build tool". It's a transformation tool: it lets us transform the raw data in our data warehouse into modeled data that can later be used by Business Intelligence tools and other data consumers. dbt also lets us apply good software engineering practices by defining a deployment workflow: develop models, test and document them, then deploy them with version control and CI/CD. For more info, check: https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/4_analytics.md
- BigQuery: a data warehouse solution offered by Google Cloud Platform.
  - BQ is serverless. There are no servers to manage or database software to install; this is managed by Google and is transparent to customers.
  - BQ is scalable and highly available. Google takes care of the underlying software and infrastructure.
  - BQ has built-in features such as Machine Learning, Geospatial Analysis and Business Intelligence, among others.
  - BQ maximizes flexibility by separating data analysis and storage into different compute engines, allowing customers to budget accordingly and reduce costs.
  - Some alternatives to BigQuery from other cloud providers are AWS Redshift and Azure Synapse Analytics.
- Data Studio: a free data visualization tool.
We're going to design a pipeline based on GCP (Google Cloud Platform) using:
- Terraform to create resources in GCP
- Data Studio for visualization
- Airflow for pipeline orchestration
- BigQuery as the data warehouse
- dbt for transforming data
Terraform is used to create the resources in GCP, via two files:
- main.tf: the Terraform version and the GCP credentials
- variables.tf: all the resources (bucket, BigQuery dataset, storage class, ...)
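As a rough sketch, the Terraform configuration described above could look like this (resource names, region and variable names are placeholders, not necessarily the exact values used in this repository):

```terraform
# Sketch of main.tf: pin the Terraform/provider versions and pass GCP credentials
terraform {
  required_version = ">= 1.0"
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

provider "google" {
  project     = var.project                 # GCP project id (from variables.tf)
  region      = var.region
  credentials = file(var.credentials_file)  # path to the service-account key
}

# A GCS bucket used as the data lake
resource "google_storage_bucket" "data_lake_bucket" {
  name          = "${var.project}-data-lake"
  location      = var.region
  storage_class = "STANDARD"
  force_destroy = true
}

# A BigQuery dataset used as the data warehouse
resource "google_bigquery_dataset" "dataset" {
  dataset_id = var.bq_dataset
  location   = var.region
}
```

Running `terraform init`, `terraform plan` and `terraform apply` would then create the bucket and dataset in GCP.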
In this repository you'll find the data ingestion script in the airflow/dags folder, plus a docker-compose.yaml and a Dockerfile to install Airflow. I've put some comments in the code. The Airflow DAG performs the following steps:
- Download the dataset (choose your dataset)
- format_to_parquet: converts the file from CSV to Parquet
- local_to_gcs: uploads the data to the data lake (a GCS bucket)
- bigquery: creates the table in the data warehouse on GCP
After doing all that, we're going to use dbt (the cloud version) to transform our data (raw data to transformed data). I've created a table named total_income.
Look at the code:
- schema.yml: defines our sources
- mytransformation.sql: my first transformation
- Total_income.sql: my second one, with a WHERE clause to keep only the income rows
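A dbt model like Total_income.sql could look roughly like this; the column names (`year`, `variable`, `value`) are assumptions for illustration, since the real column names are defined in schema.yml and the upstream model:

```sql
-- Hypothetical sketch of Total_income.sql: materialize a table that keeps
-- only the income rows from the first transformation.
{{ config(materialized='table') }}

select
    year,
    value as total_income
from {{ ref('mytransformation') }}
where variable = 'Income'   -- WHERE clause to get only the income
```

`{{ ref('mytransformation') }}` lets dbt resolve the dependency on the first model and build the two models in the right order.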
Our new table is ready to be visualized in Data Studio.
After creating the table with dbt, I created three tiles with Data Studio: