aws-do-inference's Introduction

Inference workload deployment sample with optional bin-packing

The aws-do-inference repository contains an end-to-end example for running model inference locally on Docker or at scale on EKS. It supports CPU, GPU, and Inferentia processors and can pack multiple models into a single processor core for improved cost efficiency. This example focuses on one processor target at a time, but repeating the steps below for CPU/GPU and Inferentia enables hybrid deployments in which each model is served by the processor or accelerator that best fits its resource consumption profile. This sample repository uses a bert-base NLP model from huggingface.co, but the project structure and workflow are generic and can be adapted to other models.


Fig. 1 - Sample Amazon EKS cluster infrastructure for deploying, running and testing ML Inference workloads

The ML inference workloads in this sample project are deployed on the CPU, GPU, or Inferentia nodes shown in Fig. 1. The control scripts can run from any location that has access to the cluster API. To eliminate latency concerns related to the cluster ingress, load tests run in a pod within the cluster and send requests to the models directly through the cluster pod network.

1. The Amazon EKS cluster has several node groups, with one EC2 instance family per node group. Each node group can support different instance types, such as CPU (c5, c6i, c7g), GPU (g4dn), and AWS Inferentia (Inf2), and can pack multiple models per EKS node to maximize the number of ML models served in a node group. Model bin packing is used to maximize compute and memory utilization of the EC2 instances in the cluster node groups.
2. Users build the open-source natural language processing (NLP) PyTorch model from [huggingface.co](https://huggingface.co/), the serving application, and the ML framework dependencies into container images using the project's automation framework, then upload the images to Amazon Elastic Container Registry ([Amazon ECR](https://aws.amazon.com/ecr/)).
3. Using the project's automation framework, model container images are pulled from ECR and deployed to the [Amazon EKS cluster](https://aws.amazon.com/eks/) with generated Deployment and Service manifests through the Kubernetes API, which is exposed via an Elastic Load Balancer (ELB). Model deployments are customized for each target EKS compute node instance type through settings in the central configuration file.
4. Following the best practice of separating model data from the containers that run it, the ML model microservice design scales out to a large number of models. In this project, model containers pull data from Amazon Simple Storage Service ([Amazon S3](https://aws.amazon.com)) and other public model data sources each time they are initialized.
5. Using the project's automation framework, test container images are pulled from ECR and deployed to the Amazon EKS cluster with generated Deployment and Service manifests through the Kubernetes API. Test deployments are customized for each target EKS compute node architecture through settings in the central configuration file. Load and scale testing is performed by sending simultaneous requests to the model service pool. Performance test metrics are collected, recorded, and aggregated.
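
The bin-packing idea from item 1 can be illustrated with a minimal sketch. It assumes each model instance is described by a single memory figure and uses a first-fit-decreasing heuristic; this heuristic is a stand-in chosen for illustration and is not necessarily how this project assigns models to nodes. The function name, model names, and sizes are hypothetical.

```python
def pack_models(model_mem_gb, node_capacity_gb):
    """Assign models to nodes with a first-fit-decreasing heuristic.

    model_mem_gb: dict mapping a (hypothetical) model name to its memory need in GB.
    node_capacity_gb: memory available for models on each node, in GB.
    Returns a list of nodes, each a list of model names.
    """
    nodes = []  # model names placed on each node
    free = []   # remaining capacity per node
    # Place the largest models first so that smaller ones fill the gaps.
    by_size = sorted(model_mem_gb.items(), key=lambda kv: kv[1], reverse=True)
    for name, need in by_size:
        if need > node_capacity_gb:
            raise ValueError(f"{name} does not fit on an empty node")
        for i, remaining in enumerate(free):
            if need <= remaining:
                nodes[i].append(name)
                free[i] -= need
                break
        else:
            # No existing node has room, so open a new one.
            nodes.append([name])
            free.append(node_capacity_gb - need)
    return nodes


# Hypothetical sizes, for illustration only.
models = {"bert-a": 6, "bert-b": 5, "bert-c": 4, "bert-d": 3, "bert-e": 2}
placement = pack_models(models, node_capacity_gb=10)
print(placement)
```

With these made-up numbers, five models fit on two nodes instead of five, which is the utilization gain that bin packing aims for.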



Fig. 2 - aws-do-inference video walkthrough

See an end-to-end accelerated video walkthrough (7 min) or follow the instructions below to build and run your own inference solution.

Prerequisites

It is assumed that an EKS cluster exists and contains node groups of the desired target instance types. It is also assumed that the following basic tools are present: docker, kubectl, envsubst, kubetail, bc.

Operation

The project is operated through a set of action scripts as described below. To complete a full cycle from beginning to end, first configure the project, then follow steps 1 through 5, executing the corresponding action scripts. Each action script has a help screen, which can be invoked by passing "help" as an argument: <script>.sh help

Configure

./config.sh

A centralized configuration file, config.properties, contains all settings that are customizable for the project. This file comes pre-configured with reasonable defaults that work out of the box. To set the processor target or any other setting, edit the config file or execute the config.sh script. Configuration changes take effect upon execution of the next action script.
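
To make the role of the configuration file concrete, here is a minimal sketch of reading key=value settings. The action scripts in this project are shell scripts and read the file their own way; this parser, and the keys `processor` and `num_models`, are assumptions for illustration and may not match the real config.properties.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines without '='
        settings[key.strip()] = value.strip()
    return settings


# Hypothetical contents; the real config.properties defines its own keys.
sample = """
# processor target
processor=cpu
num_models=4
"""
settings = parse_properties(sample)
print(settings["processor"], int(settings["num_models"]))
```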

1. Build

./build.sh

This step builds a base container for the selected processor. A base container is required for any of the subsequent steps. This step can be executed on any instance type, regardless of processor target.

Optionally, to push the base image to a container registry, execute ./build.sh push. Pushing the base image to a container registry is required if you plan to run the test step against models deployed to Kubernetes. If you are using a private registry and need to log in before pushing, execute ./login.sh. This script logs in to AWS ECR; other private registry implementations can be added to the script as needed.

2. Trace

./trace.sh

Compiles the model into a TorchScript serialized graph file (.pt). This step requires the model to run on the target processor. Therefore it is necessary to run this step on an instance that has the target processor available.

Upon successful compilation, the model will be saved in a local folder named trace-{model_name}.

Note

It is recommended to use the AWS Deep Learning AMI to launch the instance where your model will be traced.

3. Pack

./pack.sh

Packs the model in a container with FastAPI and allows multiple models to be packed within the same container. FastAPI is used here as an example for its simplicity and performance, but it can be replaced with any other model server. For the purpose of this project, we pack several instances of the same model in the container; a natural extension of the same concept is to pack different models in the same container.
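
The idea of several model instances sharing one container can be sketched without any web framework. The example below is not the project's FastAPI server; it only shows, under assumed names, how one process might hold several model instances and dispatch a request to one of them by index. `ModelPool` and `fake_loader` are hypothetical.

```python
class ModelPool:
    """Hold several model instances in one process and dispatch by index."""

    def __init__(self, load_model, num_models):
        # load_model is a hypothetical factory; a real one might load a traced .pt file.
        self.models = [load_model(i) for i in range(num_models)]

    def predict(self, model_id, text):
        if not 0 <= model_id < len(self.models):
            raise KeyError(f"unknown model id {model_id}")
        return self.models[model_id](text)


def fake_loader(index):
    """Stand-in for a real model: returns a callable tagged with its index."""
    return lambda text: {"model": index, "tokens": len(text.split())}


pool = ModelPool(fake_loader, num_models=3)
result = pool.predict(2, "What does the little engine say?")
print(result)
```

A model server would typically expose `predict` behind an HTTP route that carries the model index, so that a test client can address each packed model individually.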

To push the model container image to a registry, execute ./pack.sh push. The model container must be pushed to a registry if you are deploying your models to Kubernetes.

4. Deploy

./deploy.sh

This script runs your models on the configured runtime. The project has built-in support for both local Docker runtimes and Kubernetes. The deploy script also has several sub-commands that facilitate the management of the full lifecycle of your model server containers.

  • ./deploy.sh run - (default) runs model server containers
  • ./deploy.sh status [number] - show container / pod / service status. Optionally show only specified instance number
  • ./deploy.sh logs [number] - tail container logs. Optionally tail only specified instance number
  • ./deploy.sh exec <number> - open bash into model server container with the specified instance number
  • ./deploy.sh stop - stop and remove deployed model containers from runtime
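
For Kubernetes deployments, the overview mentions generated Deployment and Service manifests, and envsubst is listed as a prerequisite. The sketch below shows the general technique of variable substitution into a manifest template, using Python's `string.Template` as an analogue of envsubst. The template contents, variable names, image name, and instance type are assumptions for illustration; the project's real templates may differ.

```python
from string import Template

# Hypothetical manifest fragment; not copied from the project.
MANIFEST_TEMPLATE = Template("""\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${APP_NAME}-${INSTANCE}
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ${APP_NAME}-${INSTANCE}
  template:
    metadata:
      labels:
        app: ${APP_NAME}-${INSTANCE}
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: ${INSTANCE_TYPE}
      containers:
        - name: main
          image: ${IMAGE}
""")


def render_manifests(app_name, image, instance_type, count):
    """Render one Deployment manifest per model server instance."""
    return [
        MANIFEST_TEMPLATE.substitute(
            APP_NAME=app_name, INSTANCE=i, IMAGE=image, INSTANCE_TYPE=instance_type
        )
        for i in range(count)
    ]


manifests = render_manifests("bert", "example.registry/bert:latest", "c5.4xlarge", 2)
print("\n---\n".join(manifests))
```

Selecting nodes by instance type is one way a single configuration value could steer a deployment to the CPU, GPU, or Inferentia node group.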

5. Test

./test.sh

The test script helps run a number of tests against the model servers deployed in your runtime environment.

  • ./test.sh build - build test container image
  • ./test.sh push - push test image to container registry
  • ./test.sh pull - pull the current test image from the container registry if one exists
  • ./test.sh run - run a test client container instance for advanced testing and exploration
  • ./test.sh exec - open shell in test container
  • ./test.sh status - show status of test container
  • ./test.sh stop - stop test container
  • ./test.sh help - list the available test commands
  • ./test.sh run seq - run sequential test. One request at a time submitted to each model server and model in sequential order.
  • ./test.sh run rnd - run random test. One request at a time submitted to a randomly selected server and model at a preset frequency.
  • ./test.sh run bmk - run benchmark test client to measure throughput and latency under load with random requests
  • ./test.sh run bma - run benchmark analysis - aggregate and average stats from logs of all completed benchmark containers
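
The benchmark and benchmark analysis commands report throughput and latency. As a rough illustration of what such an analysis could compute, the sketch below summarizes per-client latencies and then aggregates across clients. The function names, the nearest-rank percentile method, and the sample numbers are assumptions, not the project's actual log format or statistics.

```python
import statistics


def summarize(latencies_ms, duration_s):
    """Return throughput and latency statistics for one benchmark client."""
    ordered = sorted(latencies_ms)

    def percentile(p):
        # Nearest-rank percentile using integer ceiling division.
        rank = max(1, -((-len(ordered) * p) // 100))
        return ordered[rank - 1]

    return {
        "requests": len(ordered),
        "throughput_rps": len(ordered) / duration_s,
        "avg_ms": statistics.mean(ordered),
        "p50_ms": percentile(50),
        "p90_ms": percentile(90),
    }


def aggregate(runs):
    """Sum throughput across clients and take the unweighted mean of their average latencies."""
    return {
        "total_throughput_rps": sum(r["throughput_rps"] for r in runs),
        "avg_ms": statistics.mean(r["avg_ms"] for r in runs),
    }


# Made-up latencies for two hypothetical benchmark clients, each run for 2 seconds.
s1 = summarize([10, 12, 11, 13, 14, 10, 12, 11, 13, 14], duration_s=2)
s2 = summarize([20, 22, 24, 26], duration_s=2)
total = aggregate([s1, s2])
print(s1, s2, total, sep="\n")
```

Throughput adds up across concurrent clients, while latency is averaged, which is why the two are aggregated differently here.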

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

aws-do-inference's People

Contributors

amazon-auto, dzilbermanvmw, iankouls-aws, keitaw, modestcigit

aws-do-inference's Issues

neuron runtime error

I executed ./trace.sh according to the README, but a Neuron runtime error occurred.

Part of error log.

Question: What does the little engine say?
2022-Mar-29 07:07:29.0082    11:11    ERROR   NRT:nrt_init                                Unable to determine Neuron Driver version. Please check aws-neuron-dkms package is installed.
Traceback (most recent call last):
  File "model-tracer.py", line 101, in <module>
    answer_logits = model_traced(*example_inputs)
  File "/usr/local/lib64/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
/usr/local/lib64/python3.7/site-packages/torch_neuron/decorators.py(373): forward
/usr/local/lib64/python3.7/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/usr/local/lib64/python3.7/site-packages/torch/nn/modules/module.py(1102): _call_impl
/usr/local/lib64/python3.7/site-packages/torch_neuron/graph.py(546): __call__
/usr/local/lib64/python3.7/site-packages/torch_neuron/graph.py(205): run_op
/usr/local/lib64/python3.7/site-packages/torch_neuron/graph.py(194): __call__
/usr/local/lib64/python3.7/site-packages/torch_neuron/convert.py(217): forward
/usr/local/lib64/python3.7/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/usr/local/lib64/python3.7/site-packages/torch/nn/modules/module.py(1102): _call_impl
/usr/local/lib64/python3.7/site-packages/torch/jit/_trace.py(965): trace_module
/usr/local/lib64/python3.7/site-packages/torch/jit/_trace.py(750): trace
/usr/local/lib64/python3.7/site-packages/torch_neuron/convert.py(183): trace
model-tracer.py(92): <module>
RuntimeError: The PyTorch Neuron Runtime could not be initialized. Neuron Driver issues are logged
to your system logs. See the Neuron Runtime's troubleshooting guide for help on this
topic: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/

I changed 1-build/Dockerfile-base-inf#L16

After this change, it worked normally in my environment.

RUN yum update -y && \
    yum install -y python3 python3-devel gcc-c++ && \
    yum install -y tar gzip ca-certificates procps net-tools which vim wget libgomp htop jq bind-utils bc && \
    yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r) aws-neuron-dkms aws-neuron-tools # I changed here.

I'm an AWS employee in Japan (alias akazawt).

Error while executing build script

Getting the following error when I run the build script.

I have the registry configured

#8 38.24 No package aws-neuron-runtime-base available.
#8 38.48 No package aws-neuron-runtime available.
#8 38.70 No package aws-neuron-tools available.
#8 38.83 Error: Nothing to do
------
executor failed running [/bin/sh -c yum update -y &&     yum install -y python3 python3-devel gcc-c++ &&     yum install -y tar gzip ca-certificates procps net-tools which vim wget libgomp htop jq bind-utils bc &&     yum install -y aws-neuron-runtime-base aws-neuron-runtime aws-neuron-tools]: exit code: 1
