A Data Engineering Project
In this project, I downloaded a CSV file from the New Zealand government website to build a dashboard that summarizes the country's income from 2013 to 2020. For this project we're going to use:
- Terraform: an Infrastructure-as-Code tool. More info here: https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/1_terraform_gcp/1_terraform_overview.md
- dbt: short for "data build tool". It's a transformation tool: it lets us transform the raw data in our data warehouse into modeled data that can later be used by Business Intelligence tools and other data consumers. dbt also lets us apply good software engineering practices by defining a deployment workflow: develop models, test and document them, then deploy them with version control and CI/CD. For more info, check: https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/4_analytics.md
- BigQuery: a data warehouse solution offered by Google Cloud Platform.
  - BQ is serverless. There are no servers to manage or database software to install; this is managed by Google and is transparent to customers.
  - BQ is scalable and highly available. Google takes care of the underlying software and infrastructure.
  - BQ has built-in features such as Machine Learning, Geospatial Analysis and Business Intelligence, among others.
  - BQ maximizes flexibility by separating data analysis and storage into different compute engines, allowing customers to budget accordingly and reduce costs.
  - Some alternatives to BigQuery from other cloud providers are AWS Redshift and Azure Synapse Analytics.
- Data Studio: a free data visualization tool.
We're going to design a pipeline based on GCP (Google Cloud Platform) using:
- Terraform to create resources in GCP
- Data Studio for visualization
- Airflow for pipeline orchestration
- BigQuery as the data warehouse
- dbt for transforming data
Terraform is used to create the resources in GCP, via two files:
- main.tf: the Terraform version and the GCP credentials
- variables.tf: all the resources (bucket, BigQuery dataset, storage class, ...)
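As a rough sketch, the Terraform configuration described above could look like this (resource names, region and variable names are placeholders, not necessarily the exact values used in this repository):

```terraform
# Sketch of main.tf: pin the Terraform/provider versions and pass GCP credentials
terraform {
  required_version = ">= 1.0"
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

provider "google" {
  project     = var.project                 # GCP project id (from variables.tf)
  region      = var.region
  credentials = file(var.credentials_file)  # path to the service-account key
}

# A GCS bucket used as the data lake
resource "google_storage_bucket" "data_lake_bucket" {
  name          = "${var.project}-data-lake"
  location      = var.region
  storage_class = "STANDARD"
  force_destroy = true
}

# A BigQuery dataset used as the data warehouse
resource "google_bigquery_dataset" "dataset" {
  dataset_id = var.bq_dataset
  location   = var.region
}
```

Running `terraform init`, `terraform plan` and `terraform apply` would then create the bucket and dataset in GCP.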
In this repository you'll find the data ingestion script in the airflow/dags folder, plus a docker-compose.yaml and a Dockerfile to install Airflow. I've put some comments in the code. The Airflow DAG performs the following steps:
- Download the dataset (choose your dataset)
- format_to_parquet: converts the file from CSV to Parquet
- local_to_gcs: uploads the data to the data lake (a GCS bucket)
- bigquery: creates the table in the data warehouse on GCP
After doing all that, we're going to use dbt (the cloud version) to transform our data (raw data to transformed data). I've created a table named total_income.
Look at the code:
- schema.yml: defines our sources
- mytransformation.sql: my first transformation
- Total_income.sql: my second one, with a WHERE clause to keep only the income rows
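A dbt model like Total_income.sql could look roughly like this; the column names (`year`, `variable`, `value`) are assumptions for illustration, since the real column names are defined in schema.yml and the upstream model:

```sql
-- Hypothetical sketch of Total_income.sql: materialize a table that keeps
-- only the income rows from the first transformation.
{{ config(materialized='table') }}

select
    year,
    value as total_income
from {{ ref('mytransformation') }}
where variable = 'Income'   -- WHERE clause to get only the income
```

`{{ ref('mytransformation') }}` lets dbt resolve the dependency on the first model and build the two models in the right order.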
Our new table is ready to be visualized in Data Studio.
After creating the table with dbt, I created three tiles with Data Studio: