
PyOrchDB

A package for designing and implementing ETL pipelines. This repository uses the following Azure infrastructure: a landing zone represented by a Storage account where the data is located. The storage has two stages, the first containing the raw data and the second containing the processed data in .parquet format, which is uploaded to the SQL Server database. The solution is hybrid: some processes run locally and others run on Azure ML. A more detailed description follows below.
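
As a rough sketch of the second stage (illustrative only, not PyOrchDB's actual API; the connection strings, container, blob, and table names are hypothetical placeholders), loading a processed .parquet blob into the SQL Server database could look like this:

import io

import pandas as pd
from azure.storage.blob import BlobServiceClient
from sqlalchemy import create_engine

# Hypothetical names: the connection strings, container, blob, and table
# below are placeholders, not part of PyOrchDB.
service = BlobServiceClient.from_connection_string("<storage_connection_string>")
blob = service.get_blob_client(container="processed", blob="sales.parquet")

# Download the processed blob into memory and parse it as a DataFrame.
df = pd.read_parquet(io.BytesIO(blob.download_blob().readall()))

# Append the rows to the target table in the SQL Server database.
engine = create_engine("mssql+pyodbc:///?odbc_connect=<odbc_connection_string>")
df.to_sql("sales", engine, if_exists="append", index=False)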

Prerequisites

  • To execute the commands shown below, an Azure Subscription is required.
  • To run the commands from your local environment, you need to install Azure CLI.

Install and run Azure CLI

Docker container

You can use Docker to run a stand-alone Linux container with Azure CLI and Python 3.10 pre-installed.

docker build -t your-image-name:tag .

The Docker image can be run using the following command:

docker run -it your-image-name:tag
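
For example, using the arbitrary tag pyorchdb:latest:

docker build -t pyorchdb:latest .
docker run -it pyorchdb:latest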

Windows

Here are the instructions to install Azure CLI on Windows using the MSI installer.

  1. Download Azure CLI

  2. In the installer, accept the terms and select Install.

  3. Verify the installation:

    You can run Azure CLI from Bash, the Command Prompt, or PowerShell. To verify, execute the following command:

    az --version

    Additionally, you must install the Azure Machine Learning extension. To do so, execute the following commands:

    az upgrade
    az extension add -n ml

    If you already have the extension (or the older azure-cli-ml extension), you must first remove it with the following commands:

    az extension remove -n azure-cli-ml
    az extension remove -n ml

    If you want to update the extension, execute the following:

    az extension update -n ml
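
    You can confirm that the ml extension is installed by running:

    az extension show -n ml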

Project structure

The project is structured as follows:

PyOrchDB
├── src
│   ├── jobs
│   │   ├── models
│   │   │   └── .gitignore
│   │   ├── create-instance.yml
│   │   ├── create-serialized-model.ipynb
│   │   ├── experiment_job.yml
│   │   ├── job_workflow.py
│   │   ├── ml_complete_job.py
│   │   ├── pipeline_job.yml
│   │   ├── README.md
│   │   ├── requirements.txt
│   │   └── upload_model.py
│   ├── templates
│   ├── config.yml
│   ├── Dockerfile
│   ├── pipeline.py
│   └── run.ps1
├── .gitignore
├── .pre-commit-config.yaml
├── diagram.png
├── Dockerfile
├── LICENSE
├── README.md
├── requirements.txt
├── run_workflow.py
└── setup.py

ETL creation and execution

  • Scenario 1

    This scenario assumes that the infrastructure is not yet deployed, so the following commands will be executed:

    .\src\templates\run_template_infra.ps1
    cd src
    Start-Process python pipeline.py -NoNewWindow -Wait

    Before executing the last command, verify that the $location_name variable in the run_template_infra.ps1 file matches the location variable in the config.yml file; the same applies to $resource_group_name and the resource_group_name variable. The only variables the user modifies in the run_template_infra.ps1 file are those that begin with the $ symbol. However, it is not necessary to modify them directly in the script: if you answer n to the "use the default values" option that appears in the console, the console will let you enter the values as input.
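
    For illustration, a config.yml excerpt with hypothetical values, showing the two entries that must match their counterparts in run_template_infra.ps1:

    # Hypothetical values: location must match $location_name, and
    # resource_group_name must match $resource_group_name in run_template_infra.ps1.
    location: eastus
    resource_group_name: my-resource-group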

    Executing the following command reproduces the architecture shown above. However, the user may have the data in a Storage account that belongs to another resource group, or may want to execute steps 3) and 4) on an already deployed architecture; both cases are explored in the following scenarios.

    .\run.ps1
  • Scenario 2

    Suppose you already have the infrastructure in place and all you want to do is run an extraction, transformation, and loading job on a Storage account in a resource group. In this case, execute the following commands:

    az login
    $resource_group_name = '<resource_group_name>'
    $storage_account_name = '<storage_account_name>'
    $container_name = '<container_name>'
    $sql_server_name = '<sql_server_name>'
    $database_name = '<database_name>'
    $storageBlob_conn = (az storage account show-connection-string --name $storage_account_name --resource-group $resource_group_name --query 'connectionString' --output tsv)
    $db_conn_string = (az sql db show-connection-string -c odbc -n $database_name -s $sql_server_name -a SqlPassword --output tsv)

    Replace the fields such as <resource_group_name>, <storage_account_name>, and <container_name> with the corresponding values. Running run_workflow.py requires a dedicated environment, so as an intermediate step you must create and activate such an environment with the necessary requirements before executing the last command.
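
    A minimal sketch of that intermediate step, assuming a standard venv virtual environment and the repository's requirements.txt:

    python -m venv .venv
    .\.venv\Scripts\Activate.ps1
    pip install -r requirements.txt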

    Start-Process python -ArgumentList './run_workflow.py', $storageBlob_conn, $container_name, '<exclude_files>', '/', $db_conn_string -NoNewWindow -Wait
  • Scenario 3

    In this scenario, the resource group is already created and contains the Storage account with the data. Check the config.yml file and verify that everything is correct. Then execute the following lines:

    cd src
    Start-Process python pipeline.py -NoNewWindow -Wait
    .\run.ps1

