Comments (8)
@schnie I think we should keep the default setup simple for the basic use case, but I'd expect we will experience this need more especially with EE and Celery workers on one box. In any case, the isolation is a good practice.
- We could expand the Airflow onbuild Dockerfile to support creating 0 or more virtualenvs to encapsulate dependencies. It'd be more of an advanced setup since I don't think Airflow's PythonOperator supports activating/deactivating a venv. Maybe the user config could look something like:

  .
  └── requirements/
      ├── first_venv.txt
      └── second_venv.txt

  then we could do all the magic automatically behind the scenes + one of these changes below.
- We could modify Airflow's standard PythonOperator to support optionally activating/deactivating an existing venv before running the task (passing it by name). I think that change has potential for upstream.
- The PythonVirtualenvOperator kind of addresses this use case but it's a bit wasteful creating and tearing down a venv for every single task. Alternatively, we could modify the PythonVirtualenvOperator to support optionally not creating a venv that already exists and not destroying the venv on exit.
I'm happy to help with this one — it's been a pet use case for me as well. We should discuss a bit more as it potentially has other implications like using different isolated sets of env vars in the different venvs, or maybe we draw a line there and say, use a DockerOperator for that.
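As a rough illustration of the "reuse a venv that already exists instead of creating and destroying one per task" idea, here is a minimal standard-library sketch. It is not Airflow's PythonVirtualenvOperator or any Astronomer code: the helper names (`venv_python`, `ensure_venv`, `run_in_venv`) are hypothetical, and it assumes the task can run as a subprocess using the venv's own interpreter rather than by activating the venv in-process.

```python
import subprocess
import sys
import venv
from pathlib import Path


def venv_python(venv_dir):
    """Return the path of the interpreter inside venv_dir."""
    if sys.platform == "win32":
        return Path(venv_dir) / "Scripts" / "python.exe"
    return Path(venv_dir) / "bin" / "python"


def ensure_venv(venv_dir, with_pip=False):
    """Create the venv only if it does not exist yet, then return its interpreter."""
    python = venv_python(venv_dir)
    if not python.exists():
        venv.EnvBuilder(with_pip=with_pip).create(str(venv_dir))
    return python


def run_in_venv(venv_dir, code):
    """Run a snippet of Python with the venv's interpreter and return its stdout.

    The venv is left in place afterwards so later tasks can reuse it.
    """
    python = ensure_venv(venv_dir)
    result = subprocess.run(
        [str(python), "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout
```

A function like `run_in_venv` could in principle be called from a regular PythonOperator callable; installing each venv's requirements (for example from the per-venv requirements files sketched above) is left out here and would need `with_pip=True` plus a `pip install` step.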
from astronomer.
@tedmiston what are your thoughts on something like this?
from astronomer.
@pedromachados Hmm... yeah, this use case is not super well served yet.
So in your forked Dockerfile, you create the two virtualenvs upfront, then activate them when you need to run the commands? Something like a BashOperator with:
source ~/.virtualenvs/<my_venv>/bin/activate && python -m foo_mod
Is that what you're currently doing? Or any chance you're using the SingerOperator that Ben wrote recently?
I'm not super familiar with Singer. Can you also point me toward their python docs you're using?
I wrote a comment a few weeks ago for someone asking to run multiple separate versions of Python (copied below for convenience), but the same answer could be applied to isolating venvs for instance with a DockerOperator. I think the separate Docker container setup would work fine but is slightly overkill unless you also need diff sys-level dependencies or something like that beyond the venv.
That's definitely a good question, and one that I don't think is clearly addressed in the docs (so I might take the opportunity to turn this into a little blog post later). tl;dr - Airflow can support running multiple Python versions.
Here's a three-pronged answer:
- PythonOperator - Assuming you're running Airflow on 3.6, you can run your 3.6 functions with the normal PythonOperator.
- PythonVirtualenvOperator - If upgrading your 2.7 code to 3.6 is an option, I would do that. If that's a hurdle, you can mix Python versions by using the PythonVirtualenvOperator to run the other tasks in dedicated virtual environments (2.7, etc.). That's pretty simple if it satisfies your requirements, but it's not a foolproof solution if your reason for being on 2.7 is complex, such as a system-level dependency that requires more.
- DockerOperator - If the setup requires more complex things to be installed, you can use the DockerOperator and build out a simple Python Docker image e.g., like this. This is the method we use to run other languages like Scala and JavaScript via Airflow.
Feel free to reach out via email if you'd like to discuss more or have questions - taylor [at] astronomer.io. We like to help people get started with our Airflow and have written a bit of content for it.
from astronomer.
@tedmiston DockerOperator will be problematic in EE and future cloud with Celery as the executor. If we get some time I'd love to explore better integration with running container tasks on kube. KubeOperator and KubeExecutor seem like a step in that direction, but that setup is not easy to duplicate locally.
from astronomer.
Sounds good, let's discuss more next week. I skimmed Airflow docs for k8s stuff and the KubernetesPodOperator looks promising.
- KubernetesPodOperator example
- Kubernetes Executor/Operator - Airflow wiki
- Airflow + Kubernetes slides from Bloomberg
from astronomer.
Hi @tedmiston
I did not want to fork your Dockerfile, although that could have been a little more efficient as I had to install some build dependencies in order to install a couple of packages.
Instead, I created a new Dockerfile based on your airflow image. This is what the Dockerfile looks like:
FROM astronomerinc/ap-airflow:latest-onbuild
RUN apk add --no-cache --virtual .build-deps \
        build-base \
        libffi-dev \
        libxml2-dev \
        libxslt-dev \
        linux-headers \
        python3-dev \
        postgresql-dev \
    && pip3 install --upgrade setuptools \
    && python -m venv ./venvs/dbt \
    && source ./venvs/dbt/bin/activate \
    && pip3 install --no-cache-dir dbt \
    && python -m venv ./venvs/target_stitch \
    && source ./venvs/target_stitch/bin/activate \
    && pip3 install --no-cache-dir target-stitch \
    && apk del .build-deps
I am creating the venvs as shown above and running the scripts with BashOperator as follows:
./<venvs directory>/<tap venv>/bin/<my tap>
This way, activation of the environment is not needed.
I could not use the SingerOperator because it does not allow me to pass all the required command line options. I may go back and try to tweak it to work with other taps and my current setup.
Here is a link to the singer documentation: https://github.com/singer-io/getting-started but you'll also want to read https://github.com/singer-io/getting-started/blob/master/BEST_PRACTICES.md
Here is a concrete example of how I am running the target-stitch script:
send_to_stitch = BashOperator(
    task_id="send_to_stitch",
    bash_command=("{0}/venvs/target_stitch/bin/target-stitch "
                  "-c {1}/libring_stitch_config.json"
                  " < {2}/{3} ").format(AIRFLOW_HOME, DAGS_DIR, DATA_DIR,
                                        local_file_name),
    dag=dag
)
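The command string above is assembled by hand with `str.format`. A minimal sketch of a helper that builds the same kind of command while shell-quoting each piece might look like the following. `venv_script_command` and its parameters are hypothetical names for illustration, not part of Airflow or Singer, and it assumes the POSIX venv layout where scripts live under `bin/`.

```python
import os
import shlex


def venv_script_command(venvs_dir, venv_name, script, args=(), stdin_path=None):
    """Build a shell command that runs a script from a venv's bin/ directory.

    Calling the script by its full path uses the venv's interpreter through
    the script's shebang line, so no `source .../bin/activate` step is needed.
    """
    executable = os.path.join(venvs_dir, venv_name, "bin", script)
    parts = [shlex.quote(executable)]
    parts.extend(shlex.quote(str(arg)) for arg in args)
    command = " ".join(parts)
    if stdin_path is not None:
        # Feed the file to the script on standard input, as in `script < file`.
        command += " < " + shlex.quote(stdin_path)
    return command
```

The returned string could then be passed as `bash_command` to a BashOperator; quoting matters mostly when file names such as `local_file_name` may contain spaces or shell metacharacters.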
Thanks for considering this!
from astronomer.
Hey @pedromachados, thanks for sharing your setup and the docs, and my apologies for not posting back here sooner.
I spoke with one of our data engineers familiar with Singer taps last week and he indicated there are some dependency oddities such as version conflicts there around running multiple taps in the same venv/execution environment, so this is likely a problem other tap users would experience too. Was it your experience that putting all of the dependencies into the top-level venv didn't work as well?
If you're interested in modifying the SingerOperator to take an optional venv path that it activates to run the tap like your sample command, we would be happy to merge that into https://github.com/airflow-plugins/singer_plugin. Feel free to assign a PR to me.
from astronomer.
Closing this issue as inactive for now.
from astronomer.