Case details are located here and here.
Clone the repository with git:

```shell
git clone https://github.com/gwyddie/grupoboticario-case
```
Set up the environment:

```shell
make prep-env
```

Adjust the settings in `.env`.
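The exact variables depend on the repository's own `.env` template, but a GCP-backed Airflow setup typically needs values along these lines — every variable name below is an illustrative assumption, not necessarily the repository's actual keys:

```shell
# Hypothetical .env sketch -- all variable names are assumptions
GCP_PROJECT_ID=my-gcp-project        # target GCP project
GCS_BUCKET=my-staging-bucket         # bucket the uploads go to
GCP_REGION=us-central1               # region for the Dataproc cluster
GOOGLE_APPLICATION_CREDENTIALS=/opt/airflow/service-account.json
```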
It's also necessary to use a service account to access GCP services, so download the `key_file` to `./service-account.json` and make it available to Airflow using:

```shell
make import-connection
```
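Before importing the connection, it can save a debugging round-trip to sanity-check the downloaded key. The sketch below (not part of the repository — just a hedged standard-library helper) verifies the file carries the fields every GCP service-account key contains:

```python
import json

# Fields present in every GCP service-account key file
REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email"}


def validate_key_file(path):
    """Return the parsed key if it looks like a service-account key, else raise."""
    with open(path) as fh:
        key = json.load(fh)
    missing = REQUIRED_FIELDS - key.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if key.get("type") != "service_account":
        raise ValueError(f"unexpected key type: {key.get('type')!r}")
    return key
```

Run it as `validate_key_file("./service-account.json")`; a `ValueError` means the wrong file (e.g. an OAuth user credential) was downloaded.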
The data to be processed and the resources required by the Dataproc cluster can be uploaded with:

```shell
make ci
```
Lastly, you can start running the containers:

```shell
make up
```
I haven't set up overcomplicated unit tests to assert the DAGs' structure, but I included the single one I find most useful: checking that all DAGs can be created without import errors.

```shell
make test
```
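The repository's test likely relies on Airflow's `DagBag` for this check, but the underlying idea can be sketched with the standard library alone: try to import every Python file in a DAGs folder and collect whatever fails. The folder path and function name below are illustrative, not the repository's actual code:

```python
import importlib.util
from pathlib import Path


def collect_import_errors(dags_folder):
    """Try to import every .py file under dags_folder; return {path: error}."""
    errors = {}
    for path in sorted(Path(dags_folder).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)
        except Exception as exc:  # any failure at import time counts
            errors[str(path)] = repr(exc)
    return errors


# A test then reduces to: assert collect_import_errors("dags/") == {}
```

This is the same assertion Airflow's `DagBag.import_errors` gives you, minus DAG-specific checks such as unique `dag_id`s.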
The stack:

- Apache Airflow 2.5.1
- Apache Spark 3.1.3
- Python 3.7
- Cloud Storage
- Dataproc
- BigQuery
That's all, folks!