
Dynamic Airflow Pipeline Generator

Installation:

1.) Clone the repo:

git clone https://github.com/shubham-padia/asr_airflow
cd asr_airflow

2.) Run bash install.sh

3.) To deploy (i.e. run) your project, execute bash tools/deploy.sh

Changing the IP:

By default, the installation script will pick up your network IP (which you can find with ifconfig) and use that. If you want to change it to 127.0.0.1 or some other IP, please follow the instructions below.

1.) Run bash tools/change_ip.sh <your_ip_here>

Please note that you also have to change the IP for the frontend; please have a look at the README of the frontend repo for details.

Adding a task:

Let's assume we are adding another version of the VAD task called vad2.

1.) Create a new file named vad2.py in the dags/tasks directory.
2.) Add a function called vad2_task which will do all of the operations that your task needs to do.
3.) Add another function called get_vad2_task which returns the PythonOperator to the pipeline generator file. You can access the params passed in the PythonOperator in vad2_task.
4.) Now open the pipeline generator file, which will be under the dags directory (dags/test_asr_tree_v5.py at the time of writing).
5.) Add VAD2='vad2' at the top of the file along with the other tasks.
6.) Add an elif condition checking for your task in get_task_by_type which will call the get_vad2_task method from the file that we just added. Pass the parameters of your choice to get_vad2_task. A minimal sketch of the new task file is shown below.
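As a rough illustration, dags/tasks/vad2.py could look like the sketch below. The arguments to get_vad2_task (dag, task_id, params) are only an example, since the parameters you pass from the pipeline generator are your choice:

# dags/tasks/vad2.py (illustrative sketch)
from airflow.operators.python_operator import PythonOperator


def vad2_task(**kwargs):
    # All the work of the vad2 step goes here. The params passed to the
    # PythonOperator below are available through the task context.
    params = kwargs['params']
    # 'recording_id' is just an illustrative key; use whatever params you pass in.
    print('Running vad2 for', params.get('recording_id'))


def get_vad2_task(dag, task_id, params):
    # Returns the PythonOperator that get_task_by_type wires into the DAG.
    return PythonOperator(
        task_id=task_id,
        python_callable=vad2_task,
        provide_context=True,
        params=params,
        dag=dag,
    )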

Please note that you also have to add the task to the frontend; please have a look at the README of the frontend repo for details.

Manual Installation:

0.) Clone the repo:

git clone https://github.com/shubham-padia/asr_airflow
cd asr_airflow

1.) Install pyenv for managing virtual environments using pyenv-installer (You can use any environment manager you like):

Install prerequisites for Ubuntu/Debian:

sudo apt-get install -y make build-essential libssl-dev zlib1g-dev libbz2-dev \
libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev \
xz-utils tk-dev libffi-dev liblzma-dev python-openssl git

For other distributions or any other build problems, please refer to the pyenv wiki.

curl https://pyenv.run | bash

2.) Install Python 3.7.2

pyenv install 3.7.2

3.) Create virtual environment.

pyenv virtualenv 3.7.2 asr_airflow

4.) Activate virtual environment.

pyenv activate asr_airflow

5.) Install dependencies

export SLUGIFY_USES_TEXT_UNIDECODE=yes
pip install -r requirements.txt

6.) Postgres:

  • Make sure postgres is installed:
sudo apt-get update
sudo apt-get install postgresql postgresql-contrib libssl-dev
  • Create the database. For that, we first open the psql shell. Go to the directory where your postgres files are stored.
# For linux users
sudo -u postgres psql

# For macOS users
psql -d postgres
  • When inside psql, create a user for asr_airflow and then, using that user, create the asr_airflow database. Also create a separate user and database for the watcher service.
CREATE USER asr_airflow WITH PASSWORD 'asr_airflow_password';
CREATE DATABASE asr_airflow WITH OWNER asr_airflow;
CREATE USER watcher WITH PASSWORD 'yeshallnotpass';
CREATE DATABASE watcher WITH OWNER watcher;
  • Once the databases are created, exit the psql shell with \q followed by ENTER.

7.) Create application environment variables.

cp env-example .env
  • Change AIRFLOW_HOME to your current directory i.e. the one you cloned the project into.
  • Change AIRFLOW__CORE__SQL_ALCHEMY_CONN and AIRFLOW_CONN_AIRFLOW_DB to point to the database we created earlier i.e. asr_airflow db.
  • Change PATH to your current system path; make sure the virtual environment that we created earlier is active. You can view your current path by typing echo $PATH.
  • Set WATCHER_DB_URL to the URL of the watcher database we created in the earlier steps. A rough sketch of the resulting .env is shown below.
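Putting the bullets above together, the relevant lines of .env might end up looking roughly like this. The exact URI formats for AIRFLOW_CONN_AIRFLOW_DB and WATCHER_DB_URL are assumptions here, so keep whatever format env-example already uses and only swap in your own paths and credentials:

AIRFLOW_HOME=/home/user/github/asr_airflow
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://asr_airflow:asr_airflow_password@localhost:5432/asr_airflow
AIRFLOW_CONN_AIRFLOW_DB=postgres://asr_airflow:asr_airflow_password@localhost:5432/asr_airflow
WATCHER_DB_URL=postgresql+psycopg2://watcher:yeshallnotpass@localhost:5432/watcher
PATH=/home/user/.pyenv/versions/3.7.2/envs/asr_airflow/bin:/usr/local/bin:/usr/bin:/bin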

8.) Create the tables and run migration.

cd app
python manage.py migrate

9.) Copy the example services from example_services to a directory named services. Please keep the name services for the directory with the actual services, since that name has been added to .gitignore.

cp -r example_services/ services

10.) Change the systemd service paths in the services directory according to your system (an example unit is sketched after this list):

  • EnvironmentFile should point to the absolute location of the .env file that we created in the earlier step.
  • User should be your linux user name and Group should be your user group.
  • ExecStart should use the binary present in your virtual environment. e.g. /home/user/.pyenv/versions/3.7.2/envs/asr_airflow/bin/airflow or /home/user/.pyenv/versions/3.7.2/envs/asr_airflow/bin/python for the python binary.
  • In case of airflow webserver or scheduler, an extra argument for --pid is required in ExecStart e.g. --pid /home/user/github/asr_airflow/scheduler.pid
  • Set WorkingDirectory to /home/user/github/asr_airflow
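For instance, after these edits the [Service] section of the airflow-scheduler unit might look roughly like the following (the paths and user are the placeholder examples from above; keep whatever ExecStart command the shipped unit file already uses and only adjust its paths):

[Service]
EnvironmentFile=/home/user/github/asr_airflow/.env
User=user
Group=user
WorkingDirectory=/home/user/github/asr_airflow
ExecStart=/home/user/.pyenv/versions/3.7.2/envs/asr_airflow/bin/airflow scheduler --pid /home/user/github/asr_airflow/scheduler.pid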

11.) Copy the systemd services to your system's service folder:

cp -r services/ /lib/systemd/system
systemctl enable watcher-to-db
systemctl enable airflow-scheduler
systemctl enable airflow-webserver
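If systemctl cannot find the new units, reload systemd first and then re-run the enable commands:

sudo systemctl daemon-reload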

12.) Start the server

systemctl start watcher-to-db
systemctl start airflow-webserver
systemctl start airflow-scheduler
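You can check that the services came up correctly with, for example:

systemctl status airflow-webserver
journalctl -u airflow-scheduler -f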

13.) Set up the Airflow variables:

  • Go to Admin > Variables
  • Import variables by choosing the file airflow_default_variables.json
  • Change the variables to point to the path of scripts on your system.
Glossary:
  • Recording_id: Usually the name of your Recording Info file i.e. the file that contains the information about the sessions, mic types and channels
  • Pipeline_id: The name for the different combination of steps you are running for the same recording id.
  • BASE_FOLDER: The folder where asr_airflow has been installed.
  • Pipeline Creator: The frontend for generating pipeline files.
Output file Structure:
  • BASE_FOLDER/<Recording_id>/<Pipeline_id>/<session(1/2/3...)>/<mic_name or "hybrid">/<0_raw or 1_vad or 1_dia or 2_decoder>/<Your_output_files>
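For example, for a hypothetical recording id rec01 with pipeline id p1, the VAD output of mic m1 in session 1 would end up under:

BASE_FOLDER/rec01/p1/1/m1/1_vad/<Your_output_files>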

Steps

  1. Upload your audio files to BASE_FOLDER/data/<Recording_id>
  2. Open the Pipeline Creator and upload your recording metadata files. Fill in the name (which will be the name of the file you download when you click the submit button), recording_id, version, and pipeline_id. The combination of Recording_id + Pipeline_id should always be unique. The name of the pipeline will be <Recording_id>-<Pipeline_id>.
  3. Add steps for your pipeline in the steps section, and check the graph for your steps by clicking view graph on the navbar. If you are satisfied with the graph, click the submit button and the file will be downloaded as a .json file.
  4. NOTE: To use the downloaded file for another recording metadata file, you can use the import function at the top of the navbar. The imported pipeline will then contain the old recording data. You can replace that by uploading the new metadata file. Please remember to first import the pipeline and then import the metadata file.
  5. Upload the downloaded metadata file to the BASE_FOLDER/metadata directory. The file watcher in that directory will log the file in the database. The pipeline will then be registered with Airflow after a wait of 1-2 minutes.
  6. Search for <Recording_id>-<Pipeline_id> in Airflow and turn on the toggle for whichever session you want to run for the uploaded pipeline. Check the status of the pipeline to see whether it has executed successfully.
