
Status: This provider is currently in development and should not be used with production data.

Airflow Provider for Snowpark

This guide demonstrates using Apache Airflow to orchestrate a machine learning pipeline leveraging Airflow Operators and Decorators for Snowpark Python, as well as a new custom XCOM backend that uses Snowflake tables and stages. While this demo shows both the operators and the XCOM backend, either can be used without the other.

At a high level, Astronomer's Snowflake Provider offers the following:

  • Snowpark operators and decorators: Snowpark Python provides an intuitive Dataframe API as well as the ability to run Python-based user-defined functions and stored procedures for processing data in Snowflake. The provider includes operators and decorators that simplify data passing, remove boilerplate setup and teardown operations, and generally simplify running Snowpark Python tasks.

Example:

@task.snowpark_python(task_id='feature_engineering', temp_data_output='stage', temp_data_stage='MY_STAGE')
def feature_engineering(taxidf: SnowparkTable) -> list[SnowparkTable]:
    from snowflake.snowpark import functions as F, types as T
    import pandas as pd

    # Cast the HOUR column to a string to create an HOUR_OF_DAY feature.
    taxidf = taxidf.with_column('HOUR_OF_DAY', F.col('HOUR').cast(T.StringType()))

    # snowpark_session is created automatically by the provider from the Airflow connection.
    new_df = snowpark_session.create_dataframe(pd.DataFrame(['x', 'y'], columns=['name']))

    return [taxidf, new_df]

All operators include the following functionality:

  • Snowpark Session: A session instance called snowpark_session is automatically created (using the Airflow Snowflake connection parameters) and can be referenced by user Python code.
  • SnowparkTable object: A new datatype, SnowparkTable, represents a table in Snowflake. It is serializable and can be passed between tasks. Any arguments of the Python function that are annotated as SnowparkTable are automatically instantiated as Snowpark Dataframes in the user's code. Additionally, SnowparkTable is interchangeable with the Astro SDK Table and TempTable objects.
  • Snowpark Dataframe serialization and deserialization: Snowpark Dataframes returned from Python functions can be automatically serialized to a Snowflake table or stage, as set by temp_data_output = 'stage'|'table'|None. The database and schema used for serialization can optionally be set with temp_data_database and temp_data_schema. If these are not set, the provider falls back to the database/schema set in the operator/decorator parameters, then the Airflow Snowflake connection, and lastly the default database/schema in the Snowflake user settings. Snowpark Dataframes serialized to a stage are saved as Parquet files. Table/stage data is deserialized automatically as Snowpark Dataframes in the receiving task (see the sketch after this list).
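
For illustration, a minimal sketch of a downstream task that receives one of the dataframes returned by the feature_engineering example above (the task name is hypothetical; the provider deserializes the SnowparkTable argument and serializes the returned Dataframe automatically):

@task.snowpark_python(temp_data_output='table', temp_data_schema='demo')
def transform(taxidf: SnowparkTable) -> SnowparkTable:
    # taxidf arrives as a Snowpark Dataframe; the returned Dataframe is written
    # to a table in the 'demo' schema before being passed to any downstream task.
    return taxidf.group_by('HOUR_OF_DAY').count()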

See below for a list of all Operators and Decorators.

  • The custom XCOM backend: To provide additional security and data governance, the Snowflake XCOM backend allows storing task input and output in Snowflake. Rather than storing potentially sensitive data in the Airflow XCOM tables, Snowflake users can ensure that all of their data stays in Snowflake. This also allows passing large and/or non-JSON-serializable data (e.g., Pandas dataframes) between tasks. JSON-serializable data is stored in an XCOM table, and large or non-serializable data is stored as objects in a Snowflake stage.
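
For illustration only, once this backend is enabled, ordinary TaskFlow tasks can exchange a Pandas dataframe directly. A minimal sketch, with hypothetical task names, that belongs inside a DAG definition:

import pandas as pd
from airflow.decorators import task

@task()
def extract() -> pd.DataFrame:
    # With the Snowflake XCOM backend, this dataframe is written to a Snowflake
    # stage and only a URI is stored as the XCOM value.
    return pd.DataFrame({'name': ['x', 'y'], 'value': [1, 2]})

@task()
def load(df: pd.DataFrame):
    # The dataframe is read back from the stage before this task runs.
    print(df.describe())

load(extract())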

Package

While in development, the provider package is not yet on PyPI. For this demo the provider is installed from a wheel file at `include/astro_provider_snowflake-0.0.1.dev1-py3-none-any.whl` and can be used in other projects by copying this file and installing it with pip.

Demonstration

This demo shows the use of the provider and leverages the Astronomer Runtime and the Astro CLI to create a local development instance of Airflow.

Prerequisites

  • Astro CLI
  • Docker Desktop
  • Git
  • Snowflake account: For this demo, a free-tier trial account will suffice.

Setup

  1. Install Astronomer's Astro CLI. The Astro CLI is an Apache 2.0-licensed, open-source tool for building Airflow instances and is the fastest and easiest way to get up and running with Airflow in minutes. Open a terminal window and run:

For macOS

brew install astro

For Linux

curl -sSL install.astronomer.io | sudo bash -s
  2. Clone this repository:
git clone https://github.com/astronomer/airflow-snowpark-demo
cd airflow-snowpark-demo
  3. Save your Snowflake account credentials as an environment variable. Edit the following string with your account information and run the export command in the terminal window where you will run the remaining commands.
export AIRFLOW_CONN_SNOWFLAKE_DEFAULT='{"conn_type": "snowflake", "login": "USER_NAME", "password": "PASSWORD", "schema": "demo", "extra": {"account": "ORG_NAME-ACCOUNT_NAME", "warehouse": "WAREHOUSE_NAME", "database": "demo", "region": "REGION_NAME", "role": "USER_ROLE", "authenticator": "snowflake", "session_parameters": null, "application": "AIRFLOW"}}'
  4. Start Apache Airflow:
astro dev start
  5. Connect to the Airflow scheduler container to set up Snowflake objects for the demo.
astro dev bash -s

Set up the Snowflake database, schema, tables, etc. for this demo. This must be run as a user with admin privileges. Alternatively, use an existing database and schema, or look at the setup scripts and have a Snowflake administrator create these objects and grant permissions.

python include/utils/setup_snowflake.py \
  --conn_id 'snowflake_default' \
  --admin_role 'sysadmin' \
  --database 'demo' \
  --schema 'demo'
exit
  6. Run the Snowpark demo DAG:
astro dev run dags unpause snowpark_demo
astro dev run dags trigger snowpark_demo
  7. Connect to the local Airflow UI (http://localhost:8080 by default) and log in with Admin/Admin.

  8. As the DAG runs, notice that the XCOM values in the Airflow UI only contain URIs and not the actual data.
    For example:
    snowflake://myORG-myACCT?&table=DEMO.XCOM.XCOM_TABLE&key=snowpark_demo/load.load_yellow_tripdata_sample_2019_01.csv/manual__2023-06-19T15:41:46.538589+00:00/0/return_value

Available Snowpark Operators and Decorators:

  • SnowparkPythonOperator: This is the simplest operator; it runs as a PythonOperator in the Airflow instance. This requires that the Airflow instance is running a version of Python supported by Snowpark and has the Snowpark Python package installed. NOTE: Currently Snowpark only supports Python 3.8, so this operator has limited functionality. Snowpark Python support for 3.9 and 3.10 is expected soon.
  • SnowparkVirtualenvOperator: This operator creates a Python virtualenv to run the Python callable in a subprocess. Users can specify Python package requirements (e.g., snowflake-snowpark-python), as shown in the sketch after this list. It is assumed that the specified Python version is installed; the Astronomer buildkit can be used to add it to a Docker container.
  • SnowparkExternalPythonOperator: This operator runs the Snowpark Python callable in a pre-existing virtualenv. It is assumed that Snowpark is already installed in that environment. Using the Astronomer buildkit simplifies building this environment.
  • SnowparkPythonUDFOperator: Work in progress
  • SnowparkPythonSPROCOperator: Work in progress
  • snowpark_python_task: Decorator for SnowparkPythonOperator
  • snowpark_virtualenv_task: Decorator for SnowparkVirtualenvOperator
  • snowpark_ext_python_task: Decorator for SnowparkExternalPythonOperator
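
For illustration, a minimal sketch of the virtualenv decorator, assuming it follows the @task.snowpark_python example above and accepts a requirements list like Airflow's standard virtualenv decorator (the exact signature may change while the provider is in development):

@task.snowpark_virtualenv(requirements=['snowflake-snowpark-python'])
def count_rows(taxidf: SnowparkTable) -> int:
    # Runs the callable in a subprocess virtualenv with the listed packages installed;
    # taxidf is deserialized to a Snowpark Dataframe as with the other operators.
    return taxidf.count()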
