
astronomer / astronomer


Helm Charts for the Astronomer Platform, Apache Airflow as a Service on Kubernetes

Home Page: https://www.astronomer.io

License: Other

Makefile 0.77% Shell 5.39% Smarty 2.39% Python 79.47% Mustache 8.53% Jinja 3.44%
apache-airflow kubernetes docker astronomer-platform

astronomer's Introduction

Astronomer Platform Helm Charts

This repository contains the Helm charts for deploying the Astronomer Platform into a Kubernetes cluster.

Astronomer is a commercial "Airflow as a Service" platform that runs on Kubernetes. Source code is made available for the benefit of our customers; if you'd like to use the platform, reach out for a license.

Architecture

Astronomer Architecture

Docker images

Docker images for deploying and running Astronomer are currently available on Quay.io/Astronomer.

Documentation

You can read the Astronomer platform documentation at https://docs.astronomer.io/enterprise. For a record of all user-facing changes to the Astronomer platform, see Release Notes.

Contributing

We welcome any contributions:

  • Report all enhancements, bugs, and tasks as GitHub issues
  • Provide fixes or enhancements by opening pull requests in GitHub

Local Development

Install the following tools:

  • docker (make sure your user has permissions - try 'docker ps')
  • kubectl
  • kind
  • mkcert (make sure mkcert is in your PATH)
  • helm

Run this script from the root of this repository:

bin/reset-local-dev

Each time you run the script, the platform will be fully reset to the current Helm chart.

Customizing the local deployment

Turn on or off parts of the platform

Modify the "tags:" in configs/local-dev.yaml

  • platform: core Astronomer components
  • logging (large impact on RAM use): Elasticsearch, Kibana, Fluentd (aka the 'EFK' stack)
  • monitoring: Prometheus
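
For example, a minimal tags section might look like the following (a sketch; the exact keys and defaults are assumptions, so check configs/local-dev.yaml itself for the authoritative list):

tags:
  platform: true    # core Astronomer components
  logging: false    # disable the EFK stack to reduce RAM use
  monitoring: true  # Prometheus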

Load a Docker image into KinD's nodes (so it's available for pods)

kind load docker-image $your_local_image_name_with_tag

Make use of that image

Make note of your pod name

kubectl get pods -n astronomer

Find the corresponding deployment, daemonset, or statefulset

kubectl get deployment -n astronomer

Replace the pod with the new image: look for "image" on the appropriate container, replace it with the local tag, and set the pull policy to "Never".

kubectl edit deployment -n astronomer <your deployment>
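
After the edit, the relevant fragment of the container spec looks something like this (the container name and image tag here are hypothetical):

containers:
  - name: houston                # hypothetical container name
    image: my-local-image:dev    # the tag you loaded with 'kind load docker-image'
    imagePullPolicy: Never       # use the image already on the node; never pull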

Specify the Kubernetes version

bin/reset-local-dev -K 1.28.6

Locally test HA configurations

You need a powerful computer to run the HA testing locally. 28 GB or more of memory should be available to Docker.

Environment variables:

  • USE_HA: when set, will deploy using HA configurations
  • CORDON_NODE: when set, will cordon this node after kind create cluster
  • MULTI_NODE: when set, will deploy kind with two worker nodes

Scripts:

  • Use bin/run-ci to start the cluster
  • Modify / use bin/drain.sh to test draining

Example:

export USE_HA=1
export CORDON_NODE=kind-worker
export MULTI_NODE=1
bin/run-ci

After the platform is up, run:

bin/drain.sh

How to upgrade the Airflow chart JSON schema

Every time we upgrade the Airflow chart, we also need to update the JSON schema file with the list of acceptable top-level params (eventually this will be fixed on the OSS side, but for now this is a manual step; see https://github.com/astronomer/issues/issues/3774). Additionally, the JSON schema URL needs to be updated to something of the form https://raw.githubusercontent.com/apache/airflow/helm-chart/1.x.x/chart/values.schema.json. This param is found in astronomer/values.schema.json at the astronomer.houston.config.deployments.helm.airflow.$ref parameter.

To get a list of the top-level params, it is best to check out the apache/airflow commit tagged for that chart release, then run the ag command below to list them all.

Example:

git checkout tags/helm-chart/1.2.0
ag "\.Values\.\w+" -o --no-filename --no-numbers | sort | uniq

The values output by this command need to be inserted manually into astronomer/values.schema.json at the astronomer.houston.config.deployments.helm.airflow.allOf parameter. Two additional params need to be at this location beyond what is returned above: podMutation and useAstroSecurityManager. These can be found by running the same ag command against the astronomer/airflow-chart values.yaml file.
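
For orientation, the surrounding structure is shaped roughly like the following (a hypothetical sketch only; apart from $ref, allOf, podMutation, and useAstroSecurityManager, the property names are illustrative, so mirror the existing file rather than this snippet):

"airflow": {
  "$ref": "https://raw.githubusercontent.com/apache/airflow/helm-chart/1.2.0/chart/values.schema.json",
  "allOf": [
    {
      "properties": {
        "executor": {},
        "podMutation": {},
        "useAstroSecurityManager": {}
      }
    }
  ]
}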

Searching code

We include k8s schema files and calico CRD manifests in this repo to aid in testing, but their inclusion makes grepping for code a bit difficult in some cases. You can exclude those files from your `git grep` results if you use the following syntax:

git grep .Values.global. -- ':!tests/k8s_schema' ':!bin/kind'

The -- ends the git command arguments and indicates that the rest of the arguments are filenames or pathspecs. Pathspecs begin with a colon; :!tests/k8s_schema is a pathspec that instructs git to exclude the directory tests/k8s_schema.

Note that this pathspec syntax is a git feature, so this exclusion technique will not work with normal grep.

License

The code in this repo is licensed Apache 2.0 with Commons Clause; however, it installs Astronomer components that have a commercial license, and it requires a commercial subscription from Astronomer, Inc.

Optional schema validation

The ./values.schema.json.example file can be used to validate that the Helm values you are using work with the default Airflow chart shipped with this repo. To use it, remove the .example suffix from the file and proceed with the helm lint, install, and upgrade commands as normal.
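
For example (a sketch, assuming the chart root is the root of this repository):

mv values.schema.json.example values.schema.json
helm lint .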

astronomer's People

Contributors

aliotta, amflint, andriisoldatenko, andscoop, ashb, bote795, cwurtz, danielhoherd, himabindu07, ianstanton, jpweber, kaxil, kushalmalani, mishah334, neel-astro, pgvishnuram, pre-commit-ci[bot], rbankston, rishkarajgi, rob-1126, rujhan-arora-astronomer, ryanahamilton, ryw, schnie, shmanu017, simpcyclassy, sjmiller609, tedmiston, verdverm, vishwas-astro


astronomer's Issues

Redshift Loader v4

We are looking to replace the Airflow-based Redshift loader with some combination of Flink, Kafka, and Spark.

Create a lite version of Airflow in examples

A user may prefer to develop and work on only Airflow. We need to create an example that loads a bare-bones install of the Airflow Scheduler and Webserver along with the PostgresDB.

PostgresDB is required if we want users to be able to demo Airflow DAG concurrency.


Support multiple virtualenvs

I've been using your Open setup with moderate success, but I've spent a lot of time reconciling dependencies. It would be great if, instead of a single requirements.txt, you could support multiple virtual environments. I don't believe this is possible out of the box now, but if it is, please let me know.

To add more context, this is to be able to run Singer taps and targets, which are designed to be run from the command line (as opposed to as a Python module). Another package I had trouble with was dbt.

I realize there is a Python operator that has support for virtual environments, but given that these tools are run from the command line, I am not sure it would serve my use case.

As a workaround, I created a new Dockerfile based on astronomerinc/ap-airflow:latest-onbuild and installed target-stitch and dbt in their separate venvs. I ran a quick test and that seems to be working OK.
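
A minimal sketch of that workaround Dockerfile (the venv paths and the idea of invoking the tools from a BashOperator are assumptions, not part of the original report):

FROM astronomerinc/ap-airflow:latest-onbuild

# One isolated venv per CLI tool, so their dependencies never conflict
RUN python -m venv /opt/venvs/target-stitch && \
    /opt/venvs/target-stitch/bin/pip install target-stitch

RUN python -m venv /opt/venvs/dbt && \
    /opt/venvs/dbt/bin/pip install dbt

# DAG tasks can then call e.g. /opt/venvs/dbt/bin/dbt through a BashOperator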

Distinguish Open From EE

After a conversation with a customer - we need to be distinguishing Open from EE. Ideas around this are

  • Add a warning to the docs to highlight that Open is meant for testing and viewing of EE internals, not for dev or production workflows. For those, users should be pointed to the Astronomer EE CLI.

  • Change the airflow-enterprise example to airflow-ee-bundle in order to remove the impression that the example is anywhere near feature parity with an enterprise install.

This is somewhat in line with #39.

Clickstream: Streaming data mapper

@schnie commented on [Tue Oct 03 2017]

We've discussed this internally several times and Alooma has a similar feature. We'd like to be able to optionally slot a mapper function in the middle of our streaming pipelines. This could be useful for clickstream as well as our standard pipelines built on Kafka Connect. Potential use cases would be light data transformations, or enrichment. We're getting this request from customers and potential clients a lot lately.

Some open questions:

  1. Where do we slot this into the clickstream product, if at all?
  2. What languages should we support?
  3. Code function in browser or code push (like airflow) or git hooks?

@willastronomer commented on [Wed Nov 29 2017]

We could create consume/produce pairs for different languages, create an interface for them that the customer implements, and then pull that code in between the event-router and the integration-worker.

Broken links in README.md

README.md

Full documentation for using the images can be found here.

The project is licensed under the Apache 2 license. For more information on the licenses for each of the individual Astronomer Platform components packaged in the images, please refer to the respective Astronomer Platform documentation for each component.

This URL returns 404: https://astronomerio.github.io/astronomer/

Clickstream Open: Cross-Domain Analytics

@schnie commented on [Wed Nov 08 2017]

Several customers/prospects have been asking about cross-domain analytics and how they can track user actions across multiple domains.

An example use case is a company that uses a micro-site or blog to drive traffic into their main site to purchase products. This company could pump events from both sites into schemas within a single data warehouse or cluster and run queries that JOIN the two datasets to see how events/pageviews on one site drive sales on the other.

A solution would be to append a crossDomainId to all events originating from a domain that the customer has set up in our app somewhere in our event processing pipeline.

A simple example query could look like:

SELECT count(*)
FROM my_blog.pages
INNER JOIN my.bid_on_item
ON my_blog.pages.context_traits_crossDomainId = 
    my.bid_on_item.context_traits_crossDomainId
WHERE my_blog.pages.name = 'Introducing Collectibles' AND
    my.bid_on_item.product_category = 'Collectible'

Fix Airflow stats

Our Grafana dashboard for the scheduler/workers shows some inaccurate numbers. See: https://issues.apache.org/jira/browse/AIRFLOW-774.

We should fix those bugs on our fork, release it, and submit a PR to apache.

This is kind of important, but since people aren't actually using it in production, there are no complaints. We should get ahead of it.

Clickstream: Implement message deduping system

https://segment.com/blog/exactly-once-delivery

The big takeaway here is that we should be partitioning by messageId so that each message gets processed by the same worker every time; that way we can build a log of which messages we have already seen. If we haven't seen a message, we pass it along. If we have, we drop it.

This is more about de-duping messages that were sent multiple times (a true dupe). Kafka can give us exactly-once processing, but if there are duplicate messages, that would just guarantee that we process each of the duplicates once.
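
A minimal sketch of that idea (the kafka-python library, topic names, and the in-memory seen-set are assumptions; a production version would persist the log):

import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer('events.raw', bootstrap_servers='localhost:9092',
                         group_id='dedup-worker')
producer = KafkaProducer(bootstrap_servers='localhost:9092')

seen = set()  # log of messageIds this worker has already processed

for record in consumer:
    message_id = json.loads(record.value)['messageId']
    if message_id in seen:
        continue  # a true dupe: drop it
    seen.add(message_id)
    # keying by messageId keeps each message on the same partition/worker
    producer.send('events.deduped', value=record.value, key=message_id.encode())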

Open documentation needs updating

This section of the documentation is hard to follow:

https://open.astronomer.io/airflow/index.html

I noticed this commit 31f95f4#diff-364439c8141492fe1670af4f51c33125 removed the start/stop scripts.

What is the recommended way to start the services now? The commit message mentions the CLI but I could not find any documentation on using the CLI with Open.

I tried docker-compose up, but it does not seem to be pointing to the onbuild image that reads requirements.txt and packages.txt, so dependencies are not installed.

The document also mentions the .astro directory, which I don't believe is created by this version of the platform.

Consider pinning Celery to 4.0.2

Currently we take any version of Celery >=4.0.2; however, there is significant discussion on the Airflow dev list of stability issues with the latest release of Celery (4.1.0), e.g., the recent thread "Airflow looses track of Dag Tasks" and the issue apache/airflow#2806, where a core contributor recommended pinning to 4.0.

Our dependency is https://github.com/astronomerio/incubator-airflow/blob/148c8f4a26bcc1be745534e0a5981982202db66e/setup.py#L100-L103 via https://github.com/astronomerio/astronomer/blob/f03223f773c93cd09417044742722679dd8b97a8/docker/platform/airflow/Dockerfile#L26

The puckel/docker-airflow repo that we forked docker-airflow-saas and docker-airflow-clickstream from pins to ==4.0.2 as well.

https://github.com/puckel/docker-airflow/blob/ef712bccc2d68994c16f6851a065bb835a5d4f78/Dockerfile#L60

@schnie @andscoop @cwurtz @ryw I think we should pin to 4.0.2 like everyone else for now for similar reasons. Anyone opposed to that?
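
The pin itself is a one-line constraint wherever Celery is declared, e.g. in the setup.py or Dockerfile linked above (a sketch):

celery==4.0.2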

Build 'How to Use Airflow' Video

We have good content around installing Astronomer Airflow in Kubes, but we don't have very many videos or much material on actually using Airflow in the wild.

Airflow: Can we get memory usage over time for a task?

@tedmiston commented on Fri Dec 15 2017

It might be possible to get this info through Flower or the Airflow statsd thing?


@tedmiston commented on Mon Dec 18 2017

@schnie Can you add anything you've learned here recently? I think you were working with statsd in Airflow. Is there overlap with that and how we use Prometheus?


@schnie commented on Wed Dec 20 2017

@tedmiston I don't have anything currently that will track memory usage at a task level over time, but I think we could get live monitoring of the celery workers into prometheus/grafana with cadvisor. I'll tinker around some more and report back.

GDPR CS Compliance: delete data

  • Turn off the archiver
  • Delete all the archiver data (S3)
  • Add a new step to the Airflow CS DAG to delete S3 data for that run (addressed by bucket lifecycle policy)
  • Delete all redshift loader historical data (S3) (addressed by bucket lifecycle policy)
  • Send customer notice: turn on S3 integration if you want your events saved to a data lake
