
Dynamic Airflow Pipeline Generator

Installation:

1.) Clone the repo:

git clone https://github.com/shubham-padia/asr_airflow
cd asr_airflow

2.) Run bash install.sh

3.) To deploy (i.e. run) your project, execute bash tools/deploy.sh

Changing the IP:

By default, the installation script will pick up your network IP (which you can find with ifconfig) and use that. If you want to change it to 127.0.0.1 or some other IP, please follow the instructions below.

1.) Run bash tools/change_ip.sh <your_ip_here>

Please note that you also have to change the IP for the frontend; please have a look at the README of the frontend repo for details.

Adding a task:

Let's assume we are adding another version of the VAD task called vad2.

1.) Create a new file named vad2.py in the dags/tasks directory.
2.) Add a function called vad2_task which will do all of the operations that your task needs to do.
3.) Add another function called get_vad2_task which returns the PythonOperator to the pipeline generator file. You can access the params passed in the PythonOperator in vad2_task.
4.) Now open the pipeline generator file, which will be under the dags directory (dags/test_asr_tree_v5.py at the time of writing).
5.) Add VAD2='vad2' at the top of the file along with the other tasks.
6.) Add an elif condition checking for your task in get_task_by_type which will call the get_vad2_task method from the file that we just added. Pass the parameters of your choice to get_vad2_task. A minimal sketch of the new task file is shown below.
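As a rough illustration, dags/tasks/vad2.py could look like the sketch below. The arguments to get_vad2_task (dag, task_id, params) are only an example, since the parameters you pass from the pipeline generator are your choice:

# dags/tasks/vad2.py (illustrative sketch)
from airflow.operators.python_operator import PythonOperator


def vad2_task(**kwargs):
    # All the work of the vad2 step goes here. The params passed to the
    # PythonOperator below are available through the task context.
    params = kwargs['params']
    # 'recording_id' is just an illustrative key; use whatever params you pass in.
    print('Running vad2 for', params.get('recording_id'))


def get_vad2_task(dag, task_id, params):
    # Returns the PythonOperator that get_task_by_type wires into the DAG.
    return PythonOperator(
        task_id=task_id,
        python_callable=vad2_task,
        provide_context=True,
        params=params,
        dag=dag,
    )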

Please note that you also have to add the task to the frontend; please have a look at the README of the frontend repo for details.

Manual Installation:

0.) Clone the repo:

git clone https://github.com/shubham-padia/asr_airflow
cd asr_airflow

1.) Install pyenv for managing virtual environments using pyenv-installer (You can use any environment manager you like):

Install prerequisites for Ubuntu/Debian:

sudo apt-get install -y make build-essential libssl-dev zlib1g-dev libbz2-dev \
libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev \
xz-utils tk-dev libffi-dev liblzma-dev python-openssl git

For other distributions or any other build problems, please refer to the pyenv wiki.

curl https://pyenv.run | bash

2.) Install Python 3.7.2

pyenv install 3.7.2

3.) Create virtual environment.

pyenv virtualenv 3.7.2 asr_airflow

4.) Activate virtual environment.

pyenv activate asr_airflow

5.) Install dependencies

export SLUGIFY_USES_TEXT_UNIDECODE=yes
pip install -r requirements.txt

6.) Postgres:

  • Make sure postgres is installed:
sudo apt-get update
sudo apt-get install postgresql postgresql-contrib libssl-dev
  • Create the database. For that, we first open the psql shell. Go to the directory where your postgres files are stored.
# For linux users
sudo -u postgres psql

# For macOS users
psql -d postgres
  • When inside psql, create a user for asr_airflow and then, using that user, create the asr_airflow database. Also create a separate user and database for the watcher service.
CREATE USER asr_airflow WITH PASSWORD 'asr_airflow_password';
CREATE DATABASE asr_airflow WITH OWNER asr_airflow;
CREATE USER watcher WITH PASSWORD 'yeshallnotpass';
CREATE DATABASE watcher WITH OWNER watcher;
  • Once the databases are created, exit the psql shell with \q followed by ENTER.

7.) Create application environment variables.

cp env-example .env
  • Change AIRFLOW_HOME to your current directory i.e. the one you cloned the project into.
  • Change AIRFLOW__CORE__SQL_ALCHEMY_CONN and AIRFLOW_CONN_AIRFLOW_DB to point to the database we created earlier i.e. asr_airflow db.
  • Change PATH to your current system path; make sure the virtual environment that we created earlier is active. You can view your current path by typing echo $PATH.
  • Set WATCHER_DB_URL to the URL of the watcher database we created in the earlier steps. A rough sketch of the resulting .env is shown below.
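Putting the bullets above together, the relevant lines of .env might end up looking roughly like this. The exact URI formats for AIRFLOW_CONN_AIRFLOW_DB and WATCHER_DB_URL are assumptions here, so keep whatever format env-example already uses and only swap in your own paths and credentials:

AIRFLOW_HOME=/home/user/github/asr_airflow
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://asr_airflow:asr_airflow_password@localhost:5432/asr_airflow
AIRFLOW_CONN_AIRFLOW_DB=postgres://asr_airflow:asr_airflow_password@localhost:5432/asr_airflow
WATCHER_DB_URL=postgresql+psycopg2://watcher:yeshallnotpass@localhost:5432/watcher
PATH=/home/user/.pyenv/versions/3.7.2/envs/asr_airflow/bin:/usr/local/bin:/usr/bin:/bin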

8.) Create the tables and run migration.

cd app
python manage.py migrate

9.) Copy the example services from example_services to a directory named services. Please keep the name services for the directory with the actual services, since that name has been added to .gitignore.

cp -r example_services/ services

10.) Change the systemd service paths in the services directory according to your system (an example unit is sketched after this list):

  • EnvironmentFile should point to the absolute location of the .env file that we created in the earlier step.
  • User should be your linux user name and Group should be your user group.
  • ExecStart should use the binary present in your virtual environment. e.g. /home/user/.pyenv/versions/3.7.2/envs/asr_airflow/bin/airflow or /home/user/.pyenv/versions/3.7.2/envs/asr_airflow/bin/python for the python binary.
  • In case of airflow webserver or scheduler, an extra argument for --pid is required in ExecStart e.g. --pid /home/user/github/asr_airflow/scheduler.pid
  • Set WorkingDirectory to /home/user/github/asr_airflow
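For instance, after these edits the [Service] section of the airflow-scheduler unit might look roughly like the following (the paths and user are the placeholder examples from above; keep whatever ExecStart command the shipped unit file already uses and only adjust its paths):

[Service]
EnvironmentFile=/home/user/github/asr_airflow/.env
User=user
Group=user
WorkingDirectory=/home/user/github/asr_airflow
ExecStart=/home/user/.pyenv/versions/3.7.2/envs/asr_airflow/bin/airflow scheduler --pid /home/user/github/asr_airflow/scheduler.pid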

11.) Copy the systemd services to your system's service folder:

cp -r services/ /lib/systemd/system
systemctl enable watcher-to-db
systemctl enable airflow-scheduler
systemctl enable airflow-webserver
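If systemctl cannot find the new units, reload systemd first and then re-run the enable commands:

sudo systemctl daemon-reload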

12.) Start the server

systemctl start watcher-to-db
systemctl start airflow-webserver
systemctl start airflow-scheduler
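You can check that the services came up correctly with, for example:

systemctl status airflow-webserver
journalctl -u airflow-scheduler -f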

13.) Set up the Airflow variables:

  • Go to Admin > Variables
  • Import variables by choosing the file airflow_default_variables.json
  • Change the variables to point to the path of scripts on your system.
Glossary:
  • Recording_id: Usually the name of your Recording Info file i.e. the file that contains the information about the sessions, mic types and channels
  • Pipeline_id: The name for the different combination of steps you are running for the same recording id.
  • BASE_FOLDER: The folder where asr_airflow has been installed.
  • Pipeline Creator: The frontend for generating pipeline files.
Output file Structure:
  • BASE_FOLDER/<Recording_id>/<Pipeline_id>/<session(1/2/3...)>/<mic_name or "hybrid">/<0_raw or 1_vad or 1_dia or 2_decoder>/<Your_output_files>
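For example, for a hypothetical recording id rec01 with pipeline id p1, the VAD output of mic m1 in session 1 would end up under:

BASE_FOLDER/rec01/p1/1/m1/1_vad/<Your_output_files>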

Steps

  1. Upload your audio files to BASE_FOLDER/data/<Recording_id>
  2. Open the Pipeline Creator and upload your recording metadata files. Fill in the name (which will be the name of the file you download when you click the submit button), recording_id, version, and pipeline_id. The combination of Recording_id + Pipeline_id should always be unique. The name of the pipeline will be <Recording_id>-<Pipeline_id>.
  3. Add steps for your pipeline in the steps section, and check the graph for your steps by clicking view graph on the navbar. If you are satisfied with the graph, click the submit button and the file will be downloaded as a .json file.
  4. NOTE: To use the downloaded file for another recording metadata file, you can use the import function at the top of the navbar. The imported pipeline will then contain the old recording data. You can replace that by uploading the new metadata file. Please remember to first import the pipeline and then import the metadata file.
  5. Upload the downloaded metadata file to the BASE_FOLDER/metadata directory. The file watcher in that directory will log the file in the database. The pipeline will then be registered with Airflow after a wait of 1-2 minutes.
  6. Search for <Recording_id>-<Pipeline_id> in Airflow and turn on the toggle for whichever session you want to run for the uploaded pipeline. Check the status of the pipeline to see whether it has executed successfully.
