jaketf / ci-cd-for-data-processing-workflow

This project is forked from googlecloudplatform/ci-cd-for-data-processing-workflow.


Cloud Build for Deploying Data Pipelines with Composer, Dataflow and BigQuery

License: Apache License 2.0

Shell 17.98% Java 9.63% Python 25.02% Go 25.82% HCL 17.92% Makefile 2.54% Dockerfile 1.09%

ci-cd-for-data-processing-workflow's People

Contributors: kingman

ci-cd-for-data-processing-workflow's Issues

Add separate cloud build for push to prod.

The current Cloud Build configuration is great for CI checks on PRs.
However, to push a release to prod we should have a separate Cloud Build file that deploys to a production Composer environment.

The prod push Cloud Build can be a subset of the CI Cloud Build; a sketch of the variable-rendering step follows the list below.

  1. deploy-sql-queries-for-composer: Copy BigQuery SQL scripts to the Composer DAGs bucket in a dags/sql/ directory.
  2. render-airflow-variables: Render Airflow variables based on Cloud Build parameters to automate deployments across environments.
  3. deploy-airflowignore: Copy an .airflowignore file so the DAG parser ignores non-DAG definition files (like SQL files).
  4. stage-airflow-variables: Copy the rendered AirflowVariables.json file to the Cloud Composer workers.
  5. import-airflow-variables: Import the rendered AirflowVariables.json file into the Cloud Composer environment.
  6. set-composer-test-jar-ref: Override the Airflow variable that points to the Dataflow jar built during this run (with this BUILD_ID).
  7. deploy-custom-plugins: Copy the source code for the Airflow plugins to the plugins/ directory of the Composer bucket.
  8. stage-for-integration-test: Copy the Airflow DAGs to a data/test/ directory in the Composer environment for integration testing.
  9. dag-parse-integration-test: Run list_dags on the data/test/ directory in the Composer environment.
  10. clean-up-data-dir-dags: Clean up the integration test artifacts.
  11. build-deploydags: Build the golang deploydags application (documented in composer/cloudbuild/README.md).
  12. run-deploydags: Run the deploydags application.
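As an illustration of the render-airflow-variables step, here is a minimal Python sketch, assuming a Jinja2-templated AirflowVariables.json; the template path, context keys, and environment variables are hypothetical stand-ins for Cloud Build substitutions:

```python
# render_airflow_variables.py -- hypothetical helper for the
# render-airflow-variables step: fills a Jinja2 template with values
# passed from Cloud Build substitutions and writes AirflowVariables.json.
import json
import os

from jinja2 import Template


def render_variables(template_path: str, output_path: str, context: dict) -> None:
    """Render the Airflow variables template with environment-specific values."""
    with open(template_path) as f:
        template = Template(f.read())
    rendered = template.render(**context)
    # Fail fast if the rendered output is not valid JSON.
    json.loads(rendered)
    with open(output_path, "w") as f:
        f.write(rendered)


if __name__ == "__main__":
    # Hypothetical substitutions; in Cloud Build these would come from
    # ${_GCP_PROJECT_ID}, ${_COMPOSER_REGION}, ${BUILD_ID}, etc.
    render_variables(
        template_path="config/AirflowVariables.json.j2",
        output_path="AirflowVariables.json",
        context={
            "gcp_project_id": os.environ["GCP_PROJECT_ID"],
            "composer_region": os.environ["COMPOSER_REGION"],
            "dataflow_jar_gcs_path": os.environ["DATAFLOW_JAR_GCS_PATH"],
        },
    )
```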

Add linting checks.

We should add linting checks:

  • Python: yapf
  • Go: gofmt
  • Shell: shellcheck
  • SQL: sql-format

Should Cache deploydags image

Currently, the build rebuilds the deploydags app every time.
This should be split out into a thin container that can be cached in GCR to speed up builds.

Improve DAG validation tests

  • Assert all DAGs have owners
  • Be more precise about enforcing file name == dag_id
  • Don't fail on non-DAG Python files in the dags folder (see the test sketch below)
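A minimal pytest sketch of what these checks might look like, assuming an Airflow 1.10-style DagBag and a dags/ folder next to the test directory (paths and the owner convention are assumptions):

```python
# test_dag_validation.py -- sketch of stricter DAG validation tests.
import os

import pytest
from airflow.models import DagBag

DAGS_DIR = os.path.join(os.path.dirname(__file__), "..", "dags")
dag_bag = DagBag(dag_folder=DAGS_DIR, include_examples=False)


def test_no_import_errors():
    assert not dag_bag.import_errors, f"DAG import errors: {dag_bag.import_errors}"


@pytest.mark.parametrize("dag_id", dag_bag.dag_ids)
def test_dag_has_owner(dag_id):
    dag = dag_bag.get_dag(dag_id)
    assert dag.default_args.get("owner"), f"{dag_id} has no owner"


@pytest.mark.parametrize("dag_id", dag_bag.dag_ids)
def test_filename_matches_dag_id(dag_id):
    # Only enforce filename == dag_id for files that actually define a DAG,
    # so helper modules in the dags folder do not fail this check.
    dag = dag_bag.get_dag(dag_id)
    filename = os.path.splitext(os.path.basename(dag.fileloc))[0]
    assert filename == dag_id, f"{dag.fileloc} defines DAG {dag_id}"
```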

This tool does not consider inter-dag dependencies

The deploydags app could be enhanced with the ability to ensure certain DAGs are successfully deployed before deploying others.

Potential solutions:

  1. Parse the DAGs looking for TriggerDagRun / ExternalTaskSensor, etc. This will likely miss some edge cases (see the sketch after this list).
  2. Provide a richer configuration file than a flat running_dags.txt that allows a user to specify a "DAG of DAGs".
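A rough sketch of option 1, assuming Airflow 1.10 import paths and only catching TriggerDagRunOperator and ExternalTaskSensor (anything else will be missed, which is the edge-case risk noted above):

```python
# dag_dependencies.py -- sketch: infer inter-DAG dependencies from a DagBag.
from collections import defaultdict

from airflow.models import DagBag
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor


def infer_dag_dependencies(dag_folder: str) -> dict:
    """Map each dag_id to the set of dag_ids that should be deployed before it."""
    dag_bag = DagBag(dag_folder=dag_folder, include_examples=False)
    depends_on = defaultdict(set)
    for dag_id, dag in dag_bag.dags.items():
        for task in dag.tasks:
            if isinstance(task, TriggerDagRunOperator):
                # The triggered DAG must exist before this DAG triggers it.
                depends_on[dag_id].add(task.trigger_dag_id)
            elif isinstance(task, ExternalTaskSensor):
                # This DAG waits on another DAG's task.
                depends_on[dag_id].add(task.external_dag_id)
    return dict(depends_on)
```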

Add solution for running on a private IP Composer cluster

Use Case:

Cloud Composer supports private IP clusters, which spin up a private IP GKE cluster in the customer project within a VPC network. Many customers have org policies or security practices that allow only private IP GKE clusters, making this a popular feature. This deployment solution should not prevent users from using a private IP cluster.

Issue:

Using private IP causes gcloud composer environments run ... commands (which are used heavily in this solution to run Airflow CLI commands) to fail or time out from Cloud Build.

Root Cause:

The Cloud Build execution environment is serverless and does not run on the customer's network, and therefore cannot reach the private IP GKE cluster when gcloud composer environments run invokes kubectl under the hood to execute Airflow commands.

Potential Resolution

Consider a redesign where the deploydags application is deployed as a Kubernetes Job on the Composer GKE cluster in the customer project and runs the Airflow commands directly on the cluster (in a worker pod) rather than via the gcloud indirection.
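For illustration, a minimal Python sketch of running an Airflow CLI command directly in a worker pod with the Kubernetes client instead of gcloud; the namespace, label selector, and container name are assumptions and would need to match the actual Composer cluster:

```python
# exec_in_worker.py -- sketch: run an Airflow CLI command in a worker pod
# via the Kubernetes API instead of `gcloud composer environments run`.
from kubernetes import client, config
from kubernetes.stream import stream


def run_airflow_command(command, namespace="default",
                        label_selector="run=airflow-worker"):
    """Exec an Airflow CLI command in the first matching worker pod."""
    # Inside a pod on the cluster, use config.load_incluster_config() instead.
    config.load_kube_config()
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(namespace, label_selector=label_selector)
    if not pods.items:
        raise RuntimeError("no airflow worker pods found")
    pod_name = pods.items[0].metadata.name
    return stream(
        core.connect_get_namespaced_pod_exec,
        pod_name,
        namespace,
        command=["airflow"] + list(command),
        container="airflow-worker",
        stderr=True, stdin=False, stdout=True, tty=False,
    )


if __name__ == "__main__":
    print(run_airflow_command(["list_dags"]))
```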

Alternatives

Interact with Airflow only through the REST API (public endpoint already secured with IAP).
This interface is experimental / subject to change and is currently being refactored in Airflow 2.0.

Add support for all Airflow connection types

The composer/cloudbuild/bin/run_tests.sh script should be able to set (or mock) all connection types (see the sketch after this list):

  • HTTP connection
  • Google Cloud Platform Connection
  • Google Cloud SQL Connection
  • gRPC
  • MySQL Connection
  • Oracle Connection
  • PostgreSQL Connection
  • SSH Connection
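One way to do this is Airflow's AIRFLOW_CONN_<CONN_ID> environment-variable convention, which lets tests resolve connections without touching the metadata DB. A sketch, with placeholder connection IDs and URIs:

```python
# mock_connections.py -- sketch: provide mock connections via Airflow's
# AIRFLOW_CONN_<CONN_ID> environment-variable convention, so DAG parsing and
# tests do not need a live metadata DB. URIs below are placeholders only.
import os

MOCK_CONNECTIONS = {
    "http_default": "http://mock-host:80/",
    "google_cloud_default": "google-cloud-platform://",
    "grpc_default": "grpc://mock-host:50051/",
    "mysql_default": "mysql://user:pass@mock-host:3306/db",
    "oracle_default": "oracle://user:pass@mock-host:1521/db",
    "postgres_default": "postgres://user:pass@mock-host:5432/db",
    "ssh_default": "ssh://user@mock-host:22/",
}


def export_mock_connections(connections=MOCK_CONNECTIONS):
    """Set AIRFLOW_CONN_* env vars so Airflow resolves these conn_ids."""
    for conn_id, uri in connections.items():
        os.environ["AIRFLOW_CONN_" + conn_id.upper()] = uri


if __name__ == "__main__":
    export_mock_connections()
```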

Handle building common dependencies before pipelines

Many organizations have built common data infrastructure packages that are used by many pipelines (e.g. convenience wrappers for Spark, Beam, or Airflow to enforce consistency across the team/org).
This repo should leverage a more sophisticated build tool like Bazel to manage builds of these dependencies in the appropriate order.

Improve portability of dags deployer app

Refactor gcloud composer environments run wrapper commands with something like kubectl exec.

Benefits

  1. Generalize its usefulness to "roll your own" Kubernetes deployments of Airflow.
  2. Pave the way to eliminate the gcloud dependency of the dagsdeployer app. (This would also require a refactor to use the Go storage client for file copying / hashing, etc.) That would allow us to use a MUCH smaller container for the dagsdeployer app (e.g. distroless). However, this may be a lot of work for not much advantage.

Refactor description:

  • Migrate most of the Airflow logic out of composer_ops.go into an airflow_ops package
  • Add a k8s wrapper for Airflow CLI commands to replace the gcloud wrapper
  • Use the composer go client to retrieve k8s cluster details (see the Python sketch below for the equivalent API call)
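As an illustration of the last point, a Python sketch of looking up the GKE cluster backing a Composer environment via the Cloud Composer API (the deploydags app itself would use the Go client; project, location, and environment names are placeholders):

```python
# get_composer_gke_cluster.py -- sketch: look up the GKE cluster backing a
# Composer environment, so Airflow CLI commands can be exec'd on it directly.
from googleapiclient.discovery import build


def get_gke_cluster(project: str, location: str, environment: str) -> str:
    """Return the fully qualified GKE cluster name for a Composer environment."""
    composer = build("composer", "v1", cache_discovery=False)
    name = f"projects/{project}/locations/{location}/environments/{environment}"
    env = composer.projects().locations().environments().get(name=name).execute()
    # e.g. "projects/<project>/zones/<zone>/clusters/<cluster-id>"
    return env["config"]["gkeCluster"]


if __name__ == "__main__":
    print(get_gke_cluster("my-project", "us-central1", "my-composer-env"))
```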

Other considerations

This should include adding a k8s airflow deployment to the integration test infrastructure.

References:

k8s go client for kubectl exec from go code
composer go client (to retrieve k8s cluster details)

Example Pipeline doesn't make any sense

Right now we have a Dataflow job and an unrelated BigQuery query in the word count DAG.

These should be refactored to actually seem like they are part of the same workflow.

For example

df_wordcount >> bq_load >> bq_top_10_query
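A hedged sketch of such a DAG using Airflow 1.10 GCP operators; bucket names, dataset/table names, jar path, and schema are placeholders:

```python
# wordcount_dag.py -- sketch: make the wordcount Dataflow job, the BigQuery
# load, and the top-10 query one coherent pipeline.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowJavaOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

default_args = {"owner": "data-eng", "start_date": datetime(2020, 1, 1)}

with DAG("wordcount", default_args=default_args, schedule_interval=None) as dag:
    df_wordcount = DataFlowJavaOperator(
        task_id="df_wordcount",
        jar="gs://my-bucket/dataflow/wordcount.jar",  # placeholder path
        options={"output": "gs://my-bucket/wordcount/output"},
    )

    bq_load = GoogleCloudStorageToBigQueryOperator(
        task_id="bq_load",
        bucket="my-bucket",
        source_objects=["wordcount/output*"],
        destination_project_dataset_table="my_dataset.wordcount",
        schema_fields=[
            {"name": "word", "type": "STRING"},
            {"name": "count", "type": "INTEGER"},
        ],
        write_disposition="WRITE_TRUNCATE",
    )

    bq_top_10_query = BigQueryOperator(
        task_id="bq_top_10_query",
        sql="SELECT word, count FROM `my_dataset.wordcount` "
            "ORDER BY count DESC LIMIT 10",
        use_legacy_sql=False,
        destination_dataset_table="my_dataset.wordcount_top_10",
    )

    df_wordcount >> bq_load >> bq_top_10_query
```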

Don't use plugins for custom operator

The current example uses plugins for the contrived xcom compare operator.

Instead, this should be managed as a separate Python module (e.g. in the DAGs folder on GCS, ignored by .airflowignore).
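A minimal sketch of what that could look like; the module path, operator name, and .airflowignore entry are hypothetical:

```python
# dags/common/compare_xcom.py -- sketch: custom operator as a plain module in
# the dags folder (the common/ directory is listed in .airflowignore), with no
# Airflow plugin machinery required.
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class CompareXComOperator(BaseOperator):
    """Fails the task if two upstream tasks pushed different XCom values."""

    @apply_defaults
    def __init__(self, left_task_id, right_task_id, *args, **kwargs):
        super(CompareXComOperator, self).__init__(*args, **kwargs)
        self.left_task_id = left_task_id
        self.right_task_id = right_task_id

    def execute(self, context):
        ti = context["ti"]
        left = ti.xcom_pull(task_ids=self.left_task_id)
        right = ti.xcom_pull(task_ids=self.right_task_id)
        if left != right:
            raise ValueError(f"XCom mismatch: {left!r} != {right!r}")


# In a DAG file, import it directly instead of via the plugins mechanism:
#   from common.compare_xcom import CompareXComOperator
```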

Also, let's consider adding a more useful plugin example, like a custom UI view.

Handle nested directories under dags folder

Currently the dagsdeployer app assumes all DAG definition files live directly under the dags folder.
Many Airflow code bases have subdirectories under the dags folder, used by different teams sharing a Composer instance.
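For example, a sketch of discovering DAG files recursively instead of assuming a flat layout (relative paths would then need to be preserved when copying to the Composer bucket):

```python
# find_dag_files.py -- sketch: walk the dags folder recursively instead of
# assuming all DAG definition files sit directly under it.
import os


def find_dag_files(dags_dir: str):
    """Yield paths of candidate DAG files, relative to dags_dir."""
    for root, _dirs, files in os.walk(dags_dir):
        for name in files:
            if not name.endswith(".py") or name == "__init__.py":
                continue
            yield os.path.relpath(os.path.join(root, name), dags_dir)


if __name__ == "__main__":
    for rel_path in find_dag_files("dags"):
        print(rel_path)  # e.g. "team_a/wordcount.py"
```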

Add example of separate running_dags.txt for ci and prod

We should support the use case where there are special CI-only DAGs (e.g. those that run large-scale integration tests) or prod-only DAGs (e.g. one to destroy / create the CI environment).
We can support this by having separate ci_running_dags.txt and prod_running_dags.txt files.
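A small sketch of how a deploy step could pick the right file based on the target environment; the ENV variable name and file naming are assumptions:

```python
# select_running_dags.py -- sketch: choose the running_dags list for the
# target environment (e.g. set ENV=ci or ENV=prod in the Cloud Build step).
import os


def load_running_dags(env=None):
    env = env or os.environ.get("ENV", "ci")
    path = f"{env}_running_dags.txt"  # ci_running_dags.txt / prod_running_dags.txt
    with open(path) as f:
        return [line.strip() for line in f
                if line.strip() and not line.startswith("#")]


if __name__ == "__main__":
    print(load_running_dags())
```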
