
PyOrchDB

A package for designing and implementing ETL pipelines. This repository uses the following Azure infrastructure: a landing zone represented by a Storage account where the data is located. The storage has two stages, the first containing the raw data and the second containing the processed data in .parquet format, which is uploaded to the SQL Server database. The solution is hybrid: some processes run locally and others run on Azure ML. A more detailed description follows below.
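
As a rough sketch of the second stage (illustrative only, not PyOrchDB's actual API; the connection strings, container, blob, and table names are hypothetical placeholders), loading a processed .parquet blob into the SQL Server database could look like this:

import io

import pandas as pd
from azure.storage.blob import BlobServiceClient
from sqlalchemy import create_engine

# Hypothetical names: the connection strings, container, blob, and table
# below are placeholders, not part of PyOrchDB.
service = BlobServiceClient.from_connection_string("<storage_connection_string>")
blob = service.get_blob_client(container="processed", blob="sales.parquet")

# Download the processed blob into memory and parse it as a DataFrame.
df = pd.read_parquet(io.BytesIO(blob.download_blob().readall()))

# Append the rows to the target table in the SQL Server database.
engine = create_engine("mssql+pyodbc:///?odbc_connect=<odbc_connection_string>")
df.to_sql("sales", engine, if_exists="append", index=False)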

Prerequisites

  • To execute the commands shown below, an Azure Subscription is required.
  • To run the commands from your local environment, you need to install Azure CLI.

Install and run Azure CLI

Docker container

You can use Docker to run a stand-alone Linux container with Azure CLI and Python 3.10 pre-installed.

docker build -t your-image-name:tag .

The Docker image can be run using the following command:

docker run -it your-image-name:tag
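
For example, using the arbitrary tag pyorchdb:latest:

docker build -t pyorchdb:latest .
docker run -it pyorchdb:latest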

Windows

Here are the instructions to install Azure CLI on Windows using the MSI installer.

  1. Download Azure CLI

  2. In the installer, accept the terms and select Install.

  3. Verify the installation:

    You can run Azure CLI from Bash, the Command Prompt, or PowerShell. To verify, execute the following command:

    az --version

    Additionally, you must install the Azure Machine Learning extension. To do so, execute the following commands:

    az upgrade
    az extension add -n ml

    If you already have the extension (or the older azure-cli-ml extension), you must first remove it with the following commands:

    az extension remove -n azure-cli-ml
    az extension remove -n ml

    If you want to update the extension, execute the following:

    az extension update -n ml
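
    You can confirm that the ml extension is installed by running:

    az extension show -n ml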

Project structure

The project is structured as follows:

PyOrchDB
├── src
│   ├── jobs
│   │   ├── models
│   │   │   └── .gitignore
│   │   ├── create-instance.yml
│   │   ├── create-serialized-model.ipynb
│   │   ├── experiment_job.yml
│   │   ├── job_workflow.py
│   │   ├── ml_complete_job.py
│   │   ├── pipeline_job.yml
│   │   ├── README.md
│   │   ├── requirements.txt
│   │   └── upload_model.py
│   ├── templates
│   ├── config.yml
│   ├── Dockerfile
│   ├── pipeline.py
│   └── run.ps1
├── .gitignore
├── .pre-commit-config.yaml
├── diagram.png
├── Dockerfile
├── LICENSE
├── README.md
├── requirements.txt
├── run_workflow.py
└── setup.py

ETL creation and execution

  • Scenario 1

    This scenario assumes that the infrastructure is not yet deployed, so the following commands will be executed:

    .\src\templates\run_template_infra.ps1
    cd src
    Start-Process python pipeline.py -NoNewWindow -Wait

    Before executing the last command, verify that the $location_name variable in the run_template_infra.ps1 file matches the location variable in the config.yml file; the same applies to $resource_group_name and the resource_group_name variable. The only variables the user modifies in the run_template_infra.ps1 file are those that begin with the $ symbol. However, it is not necessary to modify them directly in the script: if you answer n to the "use the default values" option that appears in the console, the console will let you enter the values as input.
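
    For illustration, a config.yml excerpt with hypothetical values, showing the two entries that must match their counterparts in run_template_infra.ps1:

    # Hypothetical values: location must match $location_name, and
    # resource_group_name must match $resource_group_name in run_template_infra.ps1.
    location: eastus
    resource_group_name: my-resource-group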

    Executing the following command reproduces the architecture shown above. However, the user may have the data in a Storage account that belongs to another resource group, or may want to execute steps 3) and 4) on an already deployed architecture; both cases are explored in the following scenarios.

    .\run.ps1
  • Scenario 2

    Suppose you already have the infrastructure in place and all you want to do is run an extraction, transformation, and loading job on a Storage account in a resource group. In this case, execute the following commands:

    az login
    $resource_group_name = '<resource_group_name>'
    $storage_account_name = '<storage_account_name>'
    $container_name = '<container_name>'
    $sql_server_name = '<sql_server_name>'
    $database_name = '<database_name>'
    $storageBlob_conn = (az storage account show-connection-string --name $storage_account_name --resource-group $resource_group_name --query 'connectionString' --output tsv)
    $db_conn_string = (az sql db show-connection-string -c odbc -n $database_name -s $sql_server_name -a SqlPassword --output tsv)

    Replace the fields such as <resource_group_name>, <storage_account_name>, and <container_name> with the corresponding values. Running run_workflow.py requires a dedicated environment, so as an intermediate step you must create and activate such an environment with the necessary requirements before executing the last command.
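
    A minimal sketch of that intermediate step, assuming a standard venv virtual environment and the repository's requirements.txt:

    python -m venv .venv
    .\.venv\Scripts\Activate.ps1
    pip install -r requirements.txt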

    Start-Process python -ArgumentList './run_workflow.py', $storageBlob_conn, $container_name, '<exclude_files>', '/', $db_conn_string -NoNewWindow -Wait
  • Scenario 3

    In this scenario, the resource group is already created and contains the Storage account with the data. Check the config.yml file and verify that everything is correct. Then execute the following lines:

    cd src
    Start-Process python pipeline.py -NoNewWindow -Wait
    .\run.ps1

