Analysis of Transborder freight transport between the USA, Canada and Mexico

The last few years have witnessed significant developments in North American trade, including the renegotiation of NAFTA into the USMCA, trade tensions and tariff disputes due to changing governments, and the impact of the COVID-19 pandemic. I was interested in analyzing freight transport data to assess the possible effects of these events on trade trends over a decade (2014-2023) between the three biggest countries in North America: the USA, Canada, and Mexico.

Objective

My objective was to build a robust data ingestion and processing pipeline that would deliver reliable, reproducible and clean data for analysis, insight generation and potential decision-making regarding North American Transborder trade.

Technologies

I used the following tools to build the pipeline:

  • Terraform for cloud infrastructure provisioning
  • Google Cloud Platform (GCP) for:
    • data lake storage (Google Cloud Storage, GCS)
    • data warehouse storage (BigQuery)
    • Spark job processing (Dataproc)
    • data visualization + analysis (Looker Studio)
  • Docker/Docker Compose for containerization
  • Airflow for workflow orchestration
  • Apache Spark for distributed batch data processing
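
The DAG that drives this flow lives in the Airflow directory of the repo. As a rough, hypothetical sketch (not the project's actual DAG; all task names and callables are placeholders), the monthly extract-stage-transform-load sequence could be wired together like this:

    # Minimal sketch of a monthly ETL DAG; task names and callables are
    # hypothetical placeholders, not the project's actual DAG definition.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def download_monthly_csvs():
        """Fetch the month's TransBorder Freight CSVs from the BTS site (placeholder)."""


    def upload_raw_to_gcs():
        """Stage the raw CSVs in the GCS data-lake bucket (placeholder)."""


    def run_dataproc_spark_job():
        """Submit the PySpark cleaning/transformation job to Dataproc (placeholder)."""


    def load_to_bigquery():
        """Load the processed output into the BigQuery warehouse (placeholder)."""


    with DAG(
        dag_id="transborder_freight_etl",
        start_date=datetime(2014, 1, 1),
        schedule_interval="@monthly",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract", python_callable=download_monthly_csvs)
        stage = PythonOperator(task_id="stage", python_callable=upload_raw_to_gcs)
        transform = PythonOperator(task_id="transform", python_callable=run_dataproc_spark_job)
        load = PythonOperator(task_id="load", python_callable=load_to_bigquery)

        extract >> stage >> transform >> load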

Data description

The raw data is made available by the Bureau of Transportation Statistics (BTS) of the US Department of Transportation (DOT) and can be found here: Bureau of Transportation Statistics Transborder Freight Raw Data. It is collected and processed by multiple public and private agencies and organizations, through multiple approaches including administrative data required by regulation as well as surveys of both carriers and shippers. A more thorough description of the data can be found here: North American Transborder Freight Data FAQs.

The data is in .csv format and is updated on the BTS website on a monthly basis. A metadata file accompanies the raw data and details the codes used in the raw data files. The metadata is available in Excel and PDF format, with the PDF file containing additional information beyond what is available in the Excel file. Therefore, before using the metadata for this project, I manually curated and reconciled the Excel and PDF files. I've used this cleaned-up metadata Excel file as my input file, available here: TransBorderCodes.xls
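
To give a concrete sense of the processing step (the bucket names, file paths, lookup file, and column names below are hypothetical illustrations, not the project's actual code), decoding the coded raw columns against the curated metadata in PySpark might look like this:

    # Sketch only: bucket names, file paths and the lookup CSV are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("transborder-freight").getOrCreate()

    # Read all monthly raw extracts staged in the data lake.
    freight = spark.read.csv("gs://<raw-bucket>/raw/*.csv", header=True, inferSchema=True)

    # Code table exported from the curated TransBorderCodes.xls metadata
    # (assumed here to have columns DISAGMOT and MODE_DESC).
    modes = spark.read.csv("gs://<raw-bucket>/metadata/mode_codes.csv", header=True)

    # Replace the numeric mode-of-transport code with a readable description.
    decoded = freight.join(modes, on="DISAGMOT", how="left")
    decoded.show(5)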

Pipeline architecture

Overview of the Transborder Freight data processing pipeline (architecture diagram)

Pipeline execution

  1. Prerequisites: You need the following accounts and software installed on your system before you begin:

    • GCP account, GCP project and a service account that has the following permissions to the project:
      • Viewer
      • Storage Admin
      • Storage Object Admin
      • Storage Object Viewer
      • BigQuery Admin
      • Dataproc Administrator
      • Dataproc Service Agent
      • Editor
      Download the auth-key credentials JSON file for this service account and save it to ~/.google/credentials/google_credentials.json.
    • Google Cloud SDK on your local machine
    • Docker Desktop and Docker Compose
    • Terraform
  2. Clone this GitHub repo

  3. Create a bucket on GCP and copy the file "dataproc-pip-installation.sh" into it using gsutil:

gsutil cp <path-to-dataproc-pip-installation.sh> gs://accessory-bucket-name/scripts/dataproc-pip-installation.sh

For example, I've called this bucket "tbf-analysis-accessory", so my code would read:

gsutil cp /Users/ahegde/Job_misc/Portfolio/TransBorderFreightAnalysis/VM_project_files/TransBorderFreight-pySpark-BigQuery-Looker/Terraform/dataproc-pip-installation.sh gs://tbf-analysis-accessory/scripts/dataproc-pip-installation.sh

The variable "pip_initialization_script" in variables.tf file in the Terraform folder needs to be set to this GCP script path.

  4. Navigate to the Terraform directory, replace the GCP details (project name, bucket name, etc.) in the variables.tf file with your own, and set up the cloud infrastructure using the following Terraform commands:

    terraform init
    terraform plan
    terraform apply
    
  5. Navigate to the Airflow directory, make the necessary changes to the GCP project name, etc. in the docker-compose.yaml and Dockerfile, and run Docker Compose:

    docker compose build
    docker compose up airflow-init
    docker compose up
    
  6. To check whether Docker Compose has created all the necessary containers, list them and open a shell in the worker container by running:

    docker ps
    docker exec -it <worker-container-ID> bash
    
  7. Open localhost:8080 in a browser and log in with the username admin and password admin. You can now trigger your DAG and check its progress as well as the logs.

  8. After you're done running the entire project, destroy your cloud resources by running terraform destroy inside the Terraform directory. (You will need to add the force_destroy = true option to the relevant resource blocks in your main.tf file if you want Terraform to forcefully destroy those resources even when they still contain data. I have switched this off on purpose.)

Analysis overview

Overview of the Transborder Freight data analysis
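
The dashboards are built in Looker Studio on top of the BigQuery warehouse tables. For readers who prefer to query the warehouse directly, a hedged example using the BigQuery Python client is shown below; the dataset, table, and column names are hypothetical placeholders:

    # Hedged example: project, dataset, table and column names are hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS

    query = """
        SELECT YEAR, ROUND(SUM(VALUE) / 1e9, 2) AS total_value_billions_usd
        FROM `<your-project>.transborder_freight.processed_freight`
        GROUP BY YEAR
        ORDER BY YEAR
    """

    # Print total freight value per year from the processed table.
    for row in client.query(query).result():
        print(row.YEAR, row.total_value_billions_usd)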
