microsoftdocs / pipelines-azureml

Example Azure Pipeline to train and deploy a machine learning model using the Azure Machine Learning service

License: Creative Commons Attribution 4.0 International

Python 100.00%

pipelines-azureml's Introduction

Note

This repo uses Azure Machine Learning Python SDK v1 and is not actively maintained. For Azure Machine Learning Python SDK v2 examples, see https://github.com/Azure/azureml-examples.

Introduction

This repo shows an end-to-end training and deployment pipeline built with the Azure Machine Learning CLI. For more info, please visit the Azure Machine Learning CLI documentation.

This example requires some familiarity with Azure Pipelines or GitHub Actions. For more information, see here.

Instructions

Detailed Instructions

First, fork (or clone) the repository to your own GitHub account, so that you can make modifications to your pipelines. From there, follow these instructions to get the whole setup and demo up and running:

📄 Detailed step-by-step setup instructions 📄

Short Instructions

If you are familiar with Azure Machine Learning and Azure DevOps, you can follow these shortened instructions:

Fork or clone this repo
Create an Azure Machine Learning workspace named aml-demo in a resource group named aml-demo
Create a new project in Azure DevOps/Pipelines
Go to Project settings, select Service connections, create a new connection of type Azure Resource Manager, select Service principal (automatic), and scope it to the resource group of your Machine Learning workspace. Name it azmldemows. For more details see here or follow the tutorial.
Create a new pipeline for the project, point it to the pipelines/diabetes-train-and-deploy.yml file in your forked GitHub repo. This defines an example pipeline.
Modify the pipelines/diabetes-train-and-deploy.yml and change the ml-rg variable to the Azure resource group that contains your workspace. You may also change the ml-ws variable to the name of your Azure Machine Learning service workspace.
Run the pipeline.

Declare variables for CI/CD pipeline

If you want to use an existing Azure Machine Learning workspace, adjust the following variables in the example pipeline pipelines/diabetes-train-and-deploy.yml:

 - ml-ws-connection: 'azmldemows'  # Workspace Service Connection name
 - ml-ws: 'aml-demo'               # AML Workspace name
 - ml-rg: 'aml-demo'               # AML resource Group name
 - ml-ct: 'cpu-cluster-1'          # AML Compute cluster name
 - ml-path: 'models/diabetes'      # Model directory path in repo
 - ml-exp: 'exp-test'              # Experiment name
 - ml-model-name: 'diabetes-model' # Model name
 - ml-aks-name: 'aks-prod'         # AKS cluster name

Run CLI scripts to create training compute, train model, register model, deploy model

You can also manually emulate the example pipeline on your machine by running the following commands (make sure to substitute the variables from above):

az extension add -n azure-cli-ml

cd models/diabetes/
az ml folder attach -w $(ml-ws) -g $(ml-rg)
az ml computetarget create amlcompute -n $(ml-ct) --vm-size STANDARD_D2_V2 --max-nodes 1
az ml run submit-script -c config/train --ct $(ml-ct) -e $(ml-exp) -t run.json train.py
az ml model register -n $(ml-model-name) -f run.json --asset-path outputs/ridge_0.95.pkl -t model.json
az ml model deploy -n diabetes-qa-aci -f model.json --ic config/inference-config.yml --dc config/deployment-config-aci.yml --overwrite
az ml computetarget create aks --name $(ml-aks-name) --cluster-purpose DevTest
az ml model deploy --name diabetes-prod-aks --ct $(ml-aks-name) -f model.json --ic config/inference-config.yml --dc config/deployment-config-aks.yml  --overwrite
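The $(name) placeholders in the commands above are Azure Pipelines macro syntax. In a bash shell, $(...) is command substitution, so pasting the lines unchanged would try to run a program called ml-ws rather than insert a value. A minimal sketch of the substitution step, assuming the values from the variables section above; the helper name expand_macros is hypothetical and not part of the repo:

```python
import re

# Values copied from the variables section of the example pipeline.
PIPELINE_VARS = {
    "ml-ws": "aml-demo",
    "ml-rg": "aml-demo",
    "ml-ct": "cpu-cluster-1",
    "ml-exp": "exp-test",
    "ml-model-name": "diabetes-model",
    "ml-aks-name": "aks-prod",
}

# Matches $(name), where name may contain letters, digits, '.', '_' and '-'.
_MACRO = re.compile(r"\$\(([A-Za-z0-9_.-]+)\)")


def expand_macros(command, variables):
    """Replace every $(name) in command; raise KeyError for unknown names."""

    def _lookup(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value for pipeline variable {name!r}")
        return variables[name]

    return _MACRO.sub(_lookup, command)


if __name__ == "__main__":
    # Print a command that is safe to paste into a shell.
    print(expand_macros("az ml folder attach -w $(ml-ws) -g $(ml-rg)", PIPELINE_VARS))
```

Failing loudly on an unknown name is a deliberate choice here: a silently empty value would produce a command that runs against the wrong workspace or resource group.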

Further notes

If you want to scope your project to your Azure Machine Learning service workspace, you can install the Machine Learning DevOps extension in your Azure DevOps project.

pipelines-azureml's Issues

Running h2o.ai in Azure ML (Installing Java is a must)

The mcr.microsoft.com/azureml/base:0.2.4 image is fairly minimal, so I tried a few ways to install Java.

Adding a custom base dockerfile

script: train.py
arguments: []
framework: Python
environment:
  python:
    userManagedDependencies: false
    interpreterPath: python
    condaDependenciesFile: train-env.yml
  docker:
    enabled: true
    baseDockerfile: Dockerfile

Returns error:

Output from dependency scanning: fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).

Adding an argument to the docker command. According to this documentation, and this one as well, I can add an argument to the docker command, so I tried the following:

script: train.py
arguments: []
framework: Python
environment:
  python:
    userManagedDependencies: false
    interpreterPath: python
    condaDependenciesFile: config/train-conda.yml
  docker:
    enabled: true
    baseImage: mcr.microsoft.com/azureml/base:0.2.4
    arguments: ["--run","apt-get install default-jdk"]

I also tried arguments: "apt-get install default-jdk".

As there is no documentation about this, I'm having trouble installing Java in the environment. Any help would be appreciated.

Error in Attach folder to workspace step

Hi,
When I run the pipeline, I get the error below.
The problem seems to be in the Attach folder to workspace step.

task: AzureCLI@2
displayName: 'Attach folder to workspace'
inputs:
  azureSubscription: $(ml-ws-connection)
  workingDirectory: $(ml-path)
  scriptLocation: inlineScript
  scriptType: 'bash'
  inlineScript: 'az ml folder attach -w $(ml-ws) -g $(ml-rg)'

ERROR: ProjectSystemException:
Message: {
"error_details": {
"error": {
"code": "AuthorizationFailed",
"message": "The client 'a43e0215-c079-499e-b242-2c8cdc19e0ec' with object id 'a43e0215-c079-499e-b242-2c8cdc19e0ec' does not have authorization to perform action 'Microsoft.MachineLearningServices/workspaces/read' over scope '/subscriptions/#######-####-####-####-###########/resourceGroups/aml-demo/providers/Microsoft.MachineLearningServices/workspaces/aml-demo' or the scope is invalid. If access was recently granted, please refresh your credentials."
}
},
"status_code": 403,
"url": "https://management.azure.com/subscriptions/ce55f75a-7c5d-4393-ac9e-601083781d51/resourceGroups/aml-demo/providers/Microsoft.MachineLearningServices/workspaces/aml-demo?api-version=2020-01-01"
}
InnerException None
ErrorResponse
{
"error": {
"message": "{\n "error_details": {\n "error": {\n "code": "AuthorizationFailed",\n "message": "The client 'a43e0215-c079-499e-b242-2c8cdc19e0ec' with object id 'a43e0215-c079-499e-b242-2c8cdc19e0ec' does not have authorization to perform action 'Microsoft.MachineLearningServices/workspaces/read' over scope '/subscriptions/#######-####-####-####-###########/resourceGroups/aml-demo/providers/Microsoft.MachineLearningServices/workspaces/aml-demo' or the scope is invalid. If access was recently granted, please refresh your credentials."\n }\n },\n "status_code": 403,\n "url": "https://management.azure.com/subscriptions/#######-####-####-####-###########/resourceGroups/aml-demo/providers/Microsoft.MachineLearningServices/workspaces/aml-demo?api-version=2020-01-01\"\n}"
}
}
##[error]Script failed with exit code: 1

Testing the model

My deployments to AKS and ACI completed successfully. How can I test whether the services are running as expected?
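One way to check a deployed web service is to POST a sample row to its scoring URI and inspect the response. The sketch below is an illustration under stated assumptions, not the repo's own test: the scoring URI and key are placeholders you would take from your own deployment, the {"data": [...]} payload shape depends on what the service's scoring script expects, and the Bearer header only applies when key authentication is enabled.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, rows, api_key=None):
    """Build a JSON POST request for a scoring endpoint.

    Assumption: the scoring script reads a JSON body shaped like
    {"data": [[...feature values...], ...]}.
    """
    body = json.dumps({"data": rows}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: key-based auth passed as a Bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")


def score(scoring_uri, rows, api_key=None, timeout=30):
    """Send the rows to the service and return the decoded JSON response."""
    request = build_scoring_request(scoring_uri, rows, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call would look like score("<your scoring URI>", [[0.0] * 10]), where the ten values per row are an assumption based on a model trained on scikit-learn's diabetes dataset. An HTTP error or a response without predictions indicates that the service is not healthy.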

Retry pipeline and/or task on failure

I use the Python SDK to develop ML pipelines for Azure ML.

How do I get my PythonScriptStep tasks or the encompassing Pipeline object to simply rerun upon failure?
I reckon it's pretty common for pipelines to temporarily break upon temporary network, storage, etc. issues so a simple rerun / retry seems pretty basic for task orchestration frameworks to provide (see e.g. Apache Airflow).

I've spent a fair amount of time going over the documentation for Azure ML and I just can't figure out how to get "retry upon failure" behaviour.

The closest there is is the continue_on_step_failure pipeline / task parameter which doesn't really do what's needed.

Any advice please?
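Until a built-in option turns up, one workaround is to retry from the client side. The sketch below is a generic wrapper, not an Azure ML feature: submit is a hypothetical callable that you would write yourself (for example, a function that submits the pipeline, waits for it to finish, and raises if the run failed). The attempt count and delays are illustrative defaults.

```python
import time


def run_with_retries(submit, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call submit() until it succeeds or max_attempts is exhausted.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... seconds between
    attempts and re-raises the last exception when all attempts fail.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential backoff before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Retrying the whole submission is coarse: every step runs again unless step outputs are reused, so this fits transient network or storage failures better than deterministic errors in the step code.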

Model not found in cache or in root at ./diabetes-model

Hello,

While following the steps of the Azure Pipeline, I ran into this issue:

"message": "Service deployment polling reached non-successful terminal state, current service state: Unhealthy\nOperation ID: e9252f0d-81f8-44e5-bd6d-983076eca1f5\nMore information can be found using '.get_logs()'\nError:\n{\n "code": "DeploymentTimedOut",\n "statusCode": 504,\n "message": "The deployment operation polling has TimedOut. The service creation is taking longer than our normal time. We are still trying to achieve the desired state for the web service. Please check the webservice state for the current webservice health. You can run print(service.state) from the python SDK to retrieve the current state of the webservice."\n}

Looking at the logs with get_logs(), I extracted this part of the message:
Model not found in cache or in root at ./diabetes-model

The az CLI command is the following: az ml model deploy -n diabetes-qa-aci -f model.json --ic config/inference-config.yml --dc config/deployment-config-aci.yml --overwrite -v

model.json is created by the previous step and contains:
{
"cpu": "",
"createdTime": "2020-06-09T04:57:54.550301+00:00",
"description": "",
"experimentName": "diabetes-exp",
"framework": "Custom",
"frameworkVersion": null,
"gpu": "",
"id": "diabetes_reg_model:2",
"memoryInGB": "",
"name": "diabetes_reg_model",
"properties": "",
"runId": "diabetes-exp_1591678184_b25da442",
"sampleInputDatasetId": "",
"sampleOutputDatasetId": "",
"tags": "",
"version": 2
}

Any ideas?
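One thing that may be worth checking: the log looks for a model at ./diabetes-model, while model.json registers the model under the name diabetes_reg_model. Whether that mismatch is the actual cause is not confirmed anywhere in this issue. The sketch below only compares the two names; the function names are hypothetical, and the expected name is whatever your scoring script asks for.

```python
import json


def registered_model_name(model_json_text):
    """Return the "name" field from the contents of a model.json file."""
    return json.loads(model_json_text)["name"]


def check_model_name(model_json_text, expected_name):
    """Return None if the names match, otherwise a description of the mismatch."""
    actual = registered_model_name(model_json_text)
    if actual != expected_name:
        return (
            f"model.json registers {actual!r}, "
            f"but the service is looking for {expected_name!r}"
        )
    return None


if __name__ == "__main__":
    # Reads the file produced by the register step in the current directory.
    with open("model.json", encoding="utf-8") as handle:
        print(check_model_name(handle.read(), "diabetes-model") or "names match")
```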

Unable to delete pipeline drafts?

The Designer UI has a feature to delete pipeline drafts.

This feature is grayed out. There is no ability to select the pipeline draft and delete it either. Is this a defect?

Error in train model

I'm having trouble completing the getting_started example (getting_started.md): the pipeline stalls on the Train model job and is canceled automatically after about 60 minutes. Here are the last log lines before the cancellation (the attached file, Complete Logs.txt, contains the entire log):

2022-02-07T00:52:37.0050192Z WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. (read timeout=15)",)': /packages/6b/b2/c0d62a3a91c13641e09af294c13fe16929f88dc5902718388cd9b292217f/azure_mgmt_authorization-0.52.0-py2.py3-none-any.whl
2022-02-07T00:52:37.0052090Z Downloading azure_mgmt_authorization-0.52.0-py2.py3-none-any.whl (112 kB)
2022-02-07T00:52:37.0052735Z
2022-02-07T00:57:40.9228879Z WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. (read timeout=15)",)': /packages/a1/71/9a20913e92771b3c23564f1bea54d376d09fb30a75585087c70b769d75c8/azure_mgmt_authorization-0.51.1-py2.py3-none-any.whl
2022-02-07T00:58:41.5520782Z Downloading azure_mgmt_authorization-0.51.1-py2.py3-none-any.whl (111 kB)
2022-02-07T00:58:41.5521395Z
2022-02-07T00:59:42.2727333Z INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. If you want to abort this run, you can press Ctrl + C to do so. To improve how pip performs, tell us what happened here: https://pip.pypa.io/surveys/backtracking
2022-02-07T01:03:45.8869909Z Downloading azure_mgmt_authorization-0.51.0-py2.py3-none-any.whl (111 kB)
2022-02-07T01:09:52.4374279Z WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. (read timeout=15)",)': /packages/6f/17/55b974603c16be89c7a7c16bac57b7bce48527bf1bebc3f116f7215176e6/azure_mgmt_authorization-0.50.0-py2.py3-none-any.whl
2022-02-07T01:09:52.4376241Z Downloading azure_mgmt_authorization-0.50.0-py2.py3-none-any.whl (81 kB)
2022-02-07T01:09:52.4376835Z
2022-02-07T01:26:07.6809069Z WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. (read timeout=15)",)': /packages/67/e4/b3535daae30db9b3f73046a0c151c5c2ae2d2bff96ba0c28c1f26a21dbf1/azure_mgmt_authorization-0.40.0-py2.py3-none-any.whl
2022-02-07T01:26:07.6811091Z Downloading azure_mgmt_authorization-0.40.0-py2.py3-none-any.whl (38 kB)
2022-02-07T01:26:07.6811445Z
2022-02-07T01:39:04.9650251Z ##[error]The operation was canceled.
2022-02-07T01:39:04.9664245Z ##[section]Finishing: Train model

Any example of model deployment on local compute?

Instead of ACI, what if we want to test our deployment via Azure DevOps locally?

What would the steps be? Could you please add an example? So far I have this in deployment-config-local.yml:

computeType: local
port: 13579

and in the pipeline I have

az ml model deploy -n diabetes-qa-local --model diabetes-model:1 --ic config/inference-config.yml --dc config/deployment-config-local.yml

But it returns

Downloading model diabetes-model:1 to C:\Users\mkrdi\AppData\Local\Temp\azureml_s5877b_f\diabetes-model\1
Generating Docker build context.

then it fails

{'Azure-cli-ml Version': '1.4.0', 'Error': WebserviceException:
        Message: Received bad response from service:
Response Code: 400
Headers: {'Date': 'Wed, 06 May 2020 02:01:46 GMT', 'Content-Type': 'application/json', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Request-Context': 'appId=cid-v1:2d2e8e63-272e-4b3c-8598-4ee570a0e70d', 'x-ms-client-request-id': 'e734f89cdce14741bf8dc8ca879a8bab', 'x-ms-client-session-id': '71665c61-45e2-465a-9b6b-10d23ce6b0f8', 'api-supported-versions': '1.0, 2018-03-01-preview, 2018-11-19', 'Strict-Transport-Security': 'max-age=15724800; includeSubDomains; preload'}
Content: b'{"code":"BadRequest","statusCode":400,"message":"The request is invalid.","details":[{"code":"ServiceModelConflict","message":"Exactly one of the ModelIds or Models must be specified for a service."}],"correlation":{"RequestId":"e734f89cdce14741bf8dc8ca879a8bab"}}'
        InnerException None
        ErrorResponse
{
    "error": {
        "message": "Received bad response from service:\nResponse Code: 400\nHeaders: {'Date': 'Wed, 06 May 2020 02:01:46 GMT', 'Content-Type': 'application/json', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Request-Context': 'appId=cid-v1:2d2e8e63-272e-4b3c-8598-4ee570a0e70d', 'x-ms-client-request-id': 'e734f89cdce14741bf8dc8ca879a8bab', 'x-ms-client-session-id': '71665c61-45e2-465a-9b6b-10d23ce6b0f8', 'api-supported-versions': '1.0, 2018-03-01-preview, 2018-11-19', 'Strict-Transport-Security': 'max-age=15724800; includeSubDomains; preload'}\nContent: b'{\"code\":\"BadRequest\",\"statusCode\":400,\"message\":\"The request is invalid.\",\"details\":[{\"code\":\"ServiceModelConflict\",\"message\":\"Exactly one of the ModelIds or Models must be specified for a service.\"}],\"correlation\":{\"RequestId\":\"e734f89cdce14741bf8dc8ca879a8bab\"}}'"     
    }
}}

Issue running Azure DevOps pipeline from pipelines/diabetes-train-and-deploy.yml

I've followed the instructions in the readme to set up the repo, created the service connection as directed, and created an Azure DevOps pipeline based on the diabetes-train-and-deploy.yml file. The workspace the pipeline points to is an existing resource that was created prior to finding the pipelines-azureml repo. When I run the pipeline it always fails on the Train Model step with the following error:

"error": {
    "code": "UserError",
    "message": "Image build run on compute failed: User starting the run is not an owner or assigned user to the Compute Instance",
    "details": []
},

I'm able to dig in further to the error in ML Studio and it shows the user calling is the service connection I set up for the pipeline. On the off chance that it might be a permissions issue, I added that user as a contributor to the workspace but I see the same error. I've also tried the powershell commands from the "Run CLI scripts..." section at the bottom of the README.md file and I get the same message running under my Azure account which has the Owner role on the ML Workspace.

The pipeline was able to create the compute cluster, but it seems that it doesn't have access to the cluster after it's created? Another possibility is that our workspace has something locked down that is preventing this pipeline from working properly. Any help is greatly appreciated. Thank you!

'Error': TypeError("'<' not supported between instances of 'int' and 'NoneType'",)}

I get this error when executing the computetarget create command below:
"az ml computetarget create amlcompute -n $(ml-ct) --vm-size STANDARD_D2_V2 --max-nodes 1"

Version:
'Azure-cli-ml Version': '1.24.0'

Issue with model train command

Hi,

We are getting an error when running the command below:
az ml run submit-script -c config/train --ct $(ml-ct) -e $(ml-exp) -t run.json train.py

Readme instructions broken

Since the last change to the azure-pipelines.yml the instructions in the readme.md are not valid anymore:

Modify the azure-pipelines.yml and change myresourcegroup to the Azure resource group that contains your workspace. You must also change the myworkspace entry to the name of your Azure Machine Learning service workspace.

azureSubscription (service connection) is now "build-demo" everywhere instead of "azmldemows"
resource group name is now "scottgu-all-hands" instead of "myresourcegroup"
ML workspace name is now "build-2019-demo" instead of "myworkspace"

This repo is missing important files

There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the pr is merged this issue will be closed automatically.

Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

Merge this pull request

Compute name 'cpu-cluster-1' is invalid

Raising a ticket because the compute name 'cpu-cluster-1' is invalid. My suggestion would be to change it into 'cpu'. See error message below:

Command group 'ml' is experimental and under development. Reference and support levels: https://aka.ms/CLI_refstatus
Creating compute instance...
{'Azure-cli-ml Version': '1.29.0', 'Error': ComputeTargetException:
        Message: Compute name 'cpu-cluster-1' is not available. Reason: Invalid. Message: A name for an Azure ML Compute Instance must be between 3 and 24 characters in length and must use only numbers, letters and minus symbol (-)，must start with letters. Numbers cannot be the ending of the name if the previous character is a minus symbol (-). Please specify a different Azure ML Instance name
        InnerException None
        ErrorResponse
{
    "error": {
        "message": "Compute name 'cpu-cluster-1' is not available. Reason: Invalid. Message: A name for an Azure ML Compute Instance must be between 3 and 24 characters in length and must use only numbers, letters and minus symbol (-)\uff0cmust start with letters. Numbers cannot be the ending of the name if the previous character is a minus symbol (-). Please specify a different Azure ML Instance name"
    }
}}
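The error message spells out the naming rule, so a candidate name can be checked before calling the CLI. The sketch below is a literal reading of that message and nothing more: 3 to 24 characters, only letters, digits, and hyphens, starting with a letter, and not ending in a digit that directly follows a hyphen. The function name is hypothetical, and the service may enforce rules beyond what the message states.

```python
import re

# A letter followed by 2 to 23 letters, digits or hyphens (3 to 24 in total).
_ALLOWED = re.compile(r"[A-Za-z][A-Za-z0-9-]{2,23}")


def is_valid_compute_instance_name(name):
    """Check a name against the rule quoted in the error message."""
    if not _ALLOWED.fullmatch(name):
        return False
    # "Numbers cannot be the ending of the name if the previous
    # character is a minus symbol (-)."
    if name[-1].isdigit() and name[-2] == "-":
        return False
    return True


if __name__ == "__main__":
    for candidate in ("cpu-cluster-1", "cpu-cluster", "cpu"):
        print(candidate, is_valid_compute_instance_name(candidate))
```

Under this reading, cpu-cluster-1 fails only the last clause (it ends in "-1"), while both cpu-cluster and the suggested cpu pass.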

Problems executing the pipeline examples

Hello there,
I'm trying to follow the tutorial but when I executed it I got the following error

##[error]No hosted parallelism has been purchased or granted. To request a free parallelism grant, please fill out the following form https://aka.ms/azpipelines-parallelism-request
Pool: Azure Pipelines
Image: Ubuntu-16.04
Started: Just now
Duration: 11s

Job preparation parameters
ContinueOnError: False
TimeoutInMinutes: 60
CancelTimeoutInMinutes: 5
Expand:
  MaxConcurrency: 0
  ########## System Pipeline Decorator(s) ##########

  Begin evaluating template 'system-pre-steps.yml'
Evaluating: eq('true', variables['system.debugContext'])
Expanded: eq('true', Null)
Result: False
Evaluating: resources['repositories']['self']
Expanded: Object
Result: True
Evaluating: not(containsValue(job['steps']['*']['task']['id'], '6d15af64-176c-496d-b583-fd2ae21d4df4'))
Expanded: not(containsValue(Object, '6d15af64-176c-496d-b583-fd2ae21d4df4'))
Result: True
Evaluating: resources['repositories']['self']['checkoutOptions']
Result: Object
Finished evaluating template 'system-pre-steps.yml'
********************************************************************************
Template and static variable resolution complete. Final runtime YAML document:
steps:
- task: 6d15af64-176c-496d-b583-fd2ae21d4df4@1
  inputs:
    repository: self

I found that you now have to request a parallelism grant from Microsoft. Is there any way to run the pipeline without requesting one?

Thank you