
Comments (8)

tedmiston commented on May 29, 2024

@schnie I think we should keep the default setup simple for the basic use case, but I'd expect we'll run into this need more often, especially with EE and Celery workers on one box. In any case, the isolation is a good practice.

  • We could expand the Airflow onbuild Dockerfile to support creating 0 or more virtualenvs to encapsulate dependencies. It'd be more of an advanced setup since I don't think Airflow's PythonOperator supports activating/deactivating a venv. Maybe the user config could look something like:

      .
      └── requirements/
          ├── first_venv.txt
          └── second_venv.txt
    

    then we could handle the setup automatically behind the scenes, combined with one of the changes below.

  • We could modify Airflow's standard PythonOperator to support optionally activating/deactivating an existing venv before running the task (passing it by name). I think that change has potential to be upstreamed (a rough sketch of the idea follows this list).

  • The PythonVirtualenvOperator kind of addresses this use case, but it's a bit wasteful to create and tear down a venv for every single task. Alternatively, we could modify the PythonVirtualenvOperator to optionally reuse a venv that already exists and skip destroying it on exit.
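One way to approximate the second and third options without repeatedly building venvs is to run the task's code through the interpreter of a pre-built venv in a subprocess. A rough sketch, assuming Airflow 1.10-era import paths; the class name, venv path, and module name are hypothetical and not anything that exists in Airflow or our images today:

import os
import subprocess

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class ExistingVirtualenvOperator(BaseOperator):
    """Run `python -m <module>` using the interpreter of a pre-built venv."""

    @apply_defaults
    def __init__(self, venv_path, module, module_args=None, *args, **kwargs):
        super(ExistingVirtualenvOperator, self).__init__(*args, **kwargs)
        self.venv_path = venv_path          # e.g. "/usr/local/airflow/venvs/dbt" (hypothetical)
        self.module = module                # e.g. "dbt"
        self.module_args = module_args or []

    def execute(self, context):
        # Calling the venv's own interpreter avoids activate/deactivate entirely.
        python_bin = os.path.join(self.venv_path, "bin", "python")
        subprocess.check_call([python_bin, "-m", self.module] + list(self.module_args))

The same idea is what makes the BashOperator approach later in this thread work: nothing needs to be "activated" if you address the venv's bin/ directory directly.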

I'm happy to help with this one; it's been a pet use case for me as well. We should discuss a bit more, as it potentially has other implications, like using different isolated sets of env vars in the different venvs, or maybe we draw a line there and say: use a DockerOperator for that.


schnie commented on May 29, 2024

@tedmiston what are your thoughts on something like this?


tedmiston commented on May 29, 2024

@pedromachados Hmm... yeah, this use case is not super well served yet.

So in your forked Dockerfile, you create the two virtualenvs upfront, then activate them when you need to run the commands? Something like a BashOperator with:

source ~/.virtualenvs/<my_venv>/bin/activate && python -m foo_mod

Is that what you're currently doing? Or any chance you're using the SingerOperator that Ben wrote recently?

I'm not super familiar with Singer. Can you also point me toward the Singer Python docs you're using?


I wrote a comment a few weeks ago for someone asking about running multiple separate versions of Python (copied below for convenience), but the same answer applies to isolating venvs, for instance with a DockerOperator. I think the separate-container setup would work fine, but it's slightly overkill unless you also need different system-level dependencies or something beyond the venv.

That's definitely a good question, and one that I don't think is clearly addressed in the docs (so I might take the opportunity to turn this into a little blog post later). tl;dr - Airflow can support running multiple Python versions.

Here's a three-pronged answer:

  1. PythonOperator - Assuming you're running Airflow on 3.6, you can run your 3.6 functions with the normal PythonOperator.

  2. PythonVirtualenvOperator - If upgrading your 2.7 code to 3.6 is an option, I would do that; if that's a hurdle, you can mix Python versions by using the PythonVirtualenvOperator to run the remaining tasks in dedicated 2.7 (etc.) virtual environments (see the sketch after this list). That's pretty simple if it satisfies your requirements, but it's not a foolproof solution if your reason for being on 2.7 is complex or involves a system-level dependency that requires more.

  3. DockerOperator - If the setup requires more complex things to be installed, you can use the DockerOperator and build out a simple Python Docker image e.g., like this. This is the method we use to run other languages like Scala and JavaScript via Airflow.
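For the second option, usage might look roughly like the sketch below. This assumes Airflow 1.10.x import paths and an existing dag object; the task id, callable, and some_py27_only_lib dependency are made-up placeholders:

from airflow.operators.python_operator import PythonVirtualenvOperator


def legacy_report():
    # Runs inside a throwaway Python 2.7 virtualenv created just for this task.
    import some_py27_only_lib  # hypothetical 2.7-only dependency
    some_py27_only_lib.run()


run_legacy = PythonVirtualenvOperator(
    task_id="run_legacy_report",
    python_callable=legacy_report,
    python_version="2.7",                 # a 2.7 interpreter must exist on the worker
    requirements=["some_py27_only_lib"],  # installed into the temporary venv
    system_site_packages=False,
    dag=dag,
)

The operator builds the venv, installs the listed requirements, runs the callable, and tears everything down afterwards, which is exactly the per-task overhead discussed earlier in this thread.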

Feel free to reach out via email if you'd like to discuss more or have questions - taylor [at] astronomer.io. We like helping people get started with Airflow and have written a bit of content for it.


schnie commented on May 29, 2024

@tedmiston DockerOperator will be problematic in EE and in the future cloud with Celery as the executor. If we get some time, I'd love to explore better integration with running container tasks on kube. KubeOperator and KubeExecutor seem like a step in that direction, but that setup isn't easily duplicated locally.


tedmiston commented on May 29, 2024

Sounds good, let's discuss more next week. I skimmed Airflow docs for k8s stuff and the KubernetesPodOperator looks promising.
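For reference, a bare-bones KubernetesPodOperator task might look something like the sketch below. This assumes Airflow 1.10.x contrib import paths, an existing dag object, and a reachable cluster; the image, namespace, and task names are placeholders:

from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

isolated_task = KubernetesPodOperator(
    task_id="isolated_task",
    name="isolated-task",
    namespace="default",                # placeholder namespace
    image="python:3.6-alpine",          # each task brings its own image and dependencies
    cmds=["python", "-c"],
    arguments=["print('hello from an isolated pod')"],
    get_logs=True,
    dag=dag,
)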


pedromachados commented on May 29, 2024

Hi @tedmiston

I did not want to fork your Dockerfile, although that could have been a little more efficient as I had to install some build dependencies in order to install a couple of packages.

Instead, I created a new Dockerfile based on your Airflow image. This is what the Dockerfile looks like:

FROM astronomerinc/ap-airflow:latest-onbuild

# Build deps are only needed to compile wheels and are removed again at the end
# of the same RUN layer; each tool gets its own venv so dependencies stay isolated.
RUN apk add --no-cache --virtual .build-deps \
        build-base \
        libffi-dev \
        libxml2-dev \
        libxslt-dev \
        linux-headers \
        python3-dev \
        postgresql-dev \
    && pip3 install --upgrade setuptools \
    && python -m venv ./venvs/dbt \
    && source ./venvs/dbt/bin/activate \
    && pip3 install --no-cache-dir dbt \
    && python -m venv ./venvs/target_stitch \
    && source ./venvs/target_stitch/bin/activate \
    && pip3 install --no-cache-dir target-stitch \
    && apk del .build-deps

I am creating the venvs as shown above and running the scripts with BashOperator as follows:

./<venvs directory>/<tap venv>/bin/<my tap>

This way, activation of the environment is not needed.

I could not use the SingerOperator because it does not allow me to pass all the required command line options. I may go back and try to tweak it to work with other taps and my current setup.

Here is a link to the Singer documentation: https://github.com/singer-io/getting-started, but you'll also want to read https://github.com/singer-io/getting-started/blob/master/BEST_PRACTICES.md

Here is a concrete example of how I am running the target-stitch script:

send_to_stitch = BashOperator(
    task_id="send_to_stitch",
    bash_command=("{0}/venvs/target_stitch/bin/target-stitch "
                  "-c {1}/libring_stitch_config.json"
                  " < {2}/{3} ").format(AIRFLOW_HOME, DAGS_DIR, DATA_DIR,
                                        local_file_name),
    dag=dag
)

Thanks for considering this!


tedmiston commented on May 29, 2024

Hey @pedromachados, thanks for sharing your setup and the docs, and my apologies for not posting back here sooner.

I spoke with one of our data engineers familiar with Singer taps last week, and he indicated there are some dependency oddities, such as version conflicts, when running multiple taps in the same venv/execution environment, so this is likely a problem other tap users would run into too. Was it your experience that putting all of the dependencies into the top-level venv didn't work as well?

If you're interested in modifying the SingerOperator to take an optional venv path and use it to run the tap, like your sample command does, we would be happy to merge that into https://github.com/airflow-plugins/singer_plugin. Feel free to assign the PR to me.
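I haven't looked at the operator's internals recently, so treat this as a hypothetical sketch rather than the plugin's actual API, but the venv-aware piece could be as small as resolving the executable from the venv's bin/ directory when a path is given:

import os

def resolve_singer_executable(executable, venv_path=None):
    # With a pre-built venv, point at its bin/ directly; otherwise fall back to PATH.
    if venv_path:
        return os.path.join(venv_path, "bin", executable)
    return executable

# e.g. resolve_singer_executable("target-stitch", "/usr/local/airflow/venvs/target_stitch")
# -> "/usr/local/airflow/venvs/target_stitch/bin/target-stitch"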


tedmiston commented on May 29, 2024

Closing this issue as inactive for now.
