
kaito's Introduction

Kubernetes AI Toolchain Operator (Kaito)


What is NEW!
Latest Release: March 28th, 2024. Kaito v0.2.2.
First Release: Nov 15th, 2023. Kaito v0.1.0.

Kaito is an operator that automates AI/ML inference model deployment in a Kubernetes cluster. The target models are popular open-source large models such as falcon and llama2. Kaito has the following key differentiations compared with most mainstream model deployment methodologies built on top of virtual machine infrastructures:

  • Manage large model files using container images. An HTTP server is provided to perform inference calls using the model library.
  • Avoid tuning deployment parameters to fit GPU hardware by providing preset configurations.
  • Auto-provision GPU nodes based on model requirements.
  • Host large model images in the public Microsoft Container Registry (MCR) if the license allows.

Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

Architecture

Kaito follows the classic Kubernetes Custom Resource Definition (CRD)/controller design pattern. Users manage a workspace custom resource that describes the GPU requirements and the inference specification. Kaito controllers automate the deployment by reconciling the workspace custom resource.

Kaito architecture

The above figure presents the Kaito architecture overview. Its major components consist of:

  • Workspace controller: It reconciles the workspace custom resource, creates machine (explained below) custom resources to trigger node auto provisioning, and creates the inference workload (deployment or statefulset) based on the model preset configurations.
  • Node provisioner controller: The controller's name is gpu-provisioner in the gpu-provisioner helm chart. It uses the machine CRD that originated from Karpenter to interact with the workspace controller. It integrates with Azure Kubernetes Service (AKS) APIs to add new GPU nodes to the AKS cluster.

Note: The gpu-provisioner is an open-source component. It can be replaced by other controllers if they support Karpenter-core APIs.

Installation

Please check the installation guidance here.
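For reference, a minimal sketch of a helm-based install (the same commands appear in the issues further below); the linked guidance remains the authoritative source and also covers installing the gpu-provisioner:

# Add the Kaito helm repository and install the workspace controller.
# Release name and namespace follow the examples used elsewhere in this document.
helm repo add kaito https://azure.github.io/kaito/charts/kaito
helm repo update
helm install workspace kaito/workspace --namespace kaito-workspace --create-namespace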

Quick start

After installing Kaito, one can run the following commands to start a falcon-7b inference service.

$ cat examples/kaito_workspace_falcon_7b.yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-falcon-7b
resource:
  instanceType: "Standard_NC12s_v3"
  labelSelector:
    matchLabels:
      apps: falcon-7b
inference:
  preset:
    name: "falcon-7b"

$ kubectl apply -f examples/kaito_workspace_falcon_7b.yaml

The workspace status can be tracked by running the following command. When the WORKSPACEREADY column becomes True, the model has been deployed successfully.

$ kubectl get workspace workspace-falcon-7b
NAME                  INSTANCE            RESOURCEREADY   INFERENCEREADY   WORKSPACEREADY   AGE
workspace-falcon-7b   Standard_NC12s_v3   True            True             True             10m
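To avoid polling manually, one can watch the resource, or use kubectl wait (the condition name below is assumed to match the WORKSPACEREADY column; adjust it if your Kaito version reports the condition differently):

# Watch the workspace until all readiness columns turn True.
$ kubectl get workspace workspace-falcon-7b -w

# Or block until the (assumed) WorkspaceReady condition is met.
$ kubectl wait workspace/workspace-falcon-7b --for=condition=WorkspaceReady --timeout=30m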

Next, one can find the inference service's cluster IP and use a temporary curl pod to test the service endpoint in the cluster.

$ kubectl get svc workspace-falcon-7b
NAME                  TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)            AGE
workspace-falcon-7b   ClusterIP   <CLUSTERIP>  <none>        80/TCP,29500/TCP   10m

$ export CLUSTERIP=$(kubectl get svc workspace-falcon-7b -o jsonpath="{.spec.clusterIPs[0]}")
$ kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -X POST http://$CLUSTERIP/chat -H "accept: application/json" -H "Content-Type: application/json" -d "{\"prompt\":\"YOUR QUESTION HERE\"}"
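Alternatively, to test from a local machine rather than a temporary pod, a port-forward sketch (local port 8080 is an arbitrary choice):

# Forward a local port to the ClusterIP service, then call the /chat endpoint locally.
$ kubectl port-forward svc/workspace-falcon-7b 8080:80 &
$ curl -X POST http://localhost:8080/chat -H "accept: application/json" -H "Content-Type: application/json" -d "{\"prompt\":\"YOUR QUESTION HERE\"}"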

Usage

The detailed usage for Kaito supported models can be found here. In case users want to deploy their own containerized models, they can provide the pod template in the inference field of the workspace custom resource (please see the API definitions for details), as sketched below. The controller will create a deployment workload using all provisioned GPU nodes. Note that the controller currently does NOT handle automatic model upgrades; it only creates inference workloads based on the preset configurations if the workloads do not exist.
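For illustration, a sketch of a workspace that supplies its own pod template instead of a preset (the image reference and container command are placeholders, not published artifacts; see the API definitions for the full schema):

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-custom-model
resource:
  instanceType: "Standard_NC12s_v3"
  labelSelector:
    matchLabels:
      apps: custom-model
inference:
  template:
    spec:
      containers:
      - name: custom-model-container
        # Placeholder image; point this at your own containerized model.
        image: myregistry.azurecr.io/my-model:latest
        # Placeholder command; use whatever starts your inference server.
        command: ["python3", "inference_api.py"]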

The number of the supported models in Kaito is growing! Please check this document to see how to add a new supported model.

FAQ

How to upgrade the existing deployment to use the latest model configuration?

When using hosted public models, a user can delete the existing inference workload (Deployment or StatefulSet) manually, and the workspace controller will create a new one with the latest preset configuration (e.g., the image version) defined in the current release. For private models, it is recommended to create a new workspace with a new image version in the Spec.
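For example, to refresh the falcon-7b workspace from the quick start, a sketch (whether the workload is a Deployment or a StatefulSet depends on the model preset):

# Delete the existing inference workload; the workspace controller recreates it
# with the preset configuration from the current release.
$ kubectl delete deployment workspace-falcon-7b
$ kubectl get workspace workspace-falcon-7b -w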

How to update model/inference parameters to override the Kaito Preset Configuration?

Kaito provides limited capability to manually override preset configurations for models that use the transformers runtime. To update parameters for a deployed model, perform kubectl edit against the workload, which could be either a StatefulSet or a Deployment. For example, to enable 4-bit quantization on a falcon-7b-instruct deployment, you would execute:

kubectl edit deployment workspace-falcon-7b-instruct

Within the deployment specification, locate and modify the command field.

Original

accelerate launch --num_processes 1 --num_machines 1 --machine_rank 0 --gpu_ids all inference_api.py --pipeline text-generation --torch_dtype bfloat16

Modify to enable 4-bit Quantization

accelerate launch --num_processes 1 --num_machines 1 --machine_rank 0 --gpu_ids all inference_api.py --pipeline text-generation --torch_dtype bfloat16 --load_in_4bit
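For orientation, the command sits under the container spec of the workload; a trimmed excerpt with an assumed layout (the container name and surrounding fields may differ in your deployment):

spec:
  template:
    spec:
      containers:
      - name: workspace-falcon-7b-instruct   # container name is an assumption
        command:
        - accelerate
        - launch
        # ... existing arguments as shown above ...
        - --load_in_4bit                     # appended to enable 4-bit quantization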

Currently, we allow users to change the following parameters manually:

  • pipeline: For text-generation models this can be either text-generation or conversational.
  • load_in_4bit or load_in_8bit: Model quantization resolution.

Should you need to customize other parameters, kindly file an issue for potential future inclusion.

What is the difference between instruct and non-instruct models?

The main distinction lies in their intended use cases. Instruct models are fine-tuned versions optimized for interactive chat applications. They are typically the preferred choice for most implementations due to their enhanced performance in conversational contexts. On the other hand, non-instruct, or raw models, are designed for further fine-tuning.

Contributing

Read more

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

License

See LICENSE.


Contact

"Kaito devs" [email protected]

kaito's People

Contributors

bangqipropel, davefellows, dependabot[bot], fei-guo, helayoty, ishaansehgal99, microsoft-github-policy-service[bot], pauldotyu, smritidahal653


kaito's Issues

Can't get it to work

Describe the bug
workspace is never ready

NAME                            INSTANCE            RESOURCEREADY   INFERENCEREADY   WORKSPACEREADY   AGE
workspace-falcon-7b-instruct    Standard_NC6s_v3    False                            False            33m
workspace-mistral-7b-instruct   Standard_NC12s_v3   False                            False            85m

Steps To Reproduce
First I wanted to follow the documentation (but the links to the yaml files are broken); after fixing that, I still can't get any workspace or resource ready.

For the NC12s, I thought it could be a quota issue, but for the NC6 I even added a nodepool to check and it's fine.

How can I get some logs on what is going wrong?

Last point: I see that Kaito is now v0.2.2 here on GitHub, but I'm using the AKS add-on. Could it be using an older, broken version?

Kaito AKS add-on labels

Is your feature request related to a problem? Please describe.

The Kaito managed add-on on AKS deploys the gpu-provisioner and the workspace-controller in a managed fashion. Both apps (pods) come with the same label:

app=ai-toolchain-operator

Describe the solution you'd like

For better separation, there should be an additional label such as app.kubernetes.io/instance: workspace or component: gpu-provisioner (naming is hard 😄).

Additional context

This would also make it easier to fetch the logs from the gpu-provisioner, as described in the docs.
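For illustration, with the shared label a selector-based query pulls logs from both components at once, which is exactly the pain point (the namespace below is an assumption; the managed add-on may install the pods elsewhere):

# Matches both the workspace controller and the gpu-provisioner pods.
$ kubectl logs -n kube-system -l app=ai-toolchain-operator --all-containers --prefix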

Inference allow for retries

models/llama2/inference-api.py

The worker task could be more flexible by allowing retries when a socket timeout occurs. We would also need a maximum number of retries to prevent it from getting into a continuous loop of timeouts.

Original
except Exception as e:
    print(f"Error in Worker Listen Task", e)
    if 'Socket Timeout' in str(e):
        print("A socket timeout occurred.")
        os.killpg(os.getpgrp(), signal.SIGTERM)

Proposed
def worker_listen_tasks():
    max_socket_timeouts = 3
    socket_timeout_count = 0
    ...
    if 'Socket Timeout' in str(e):
        socket_timeout_count += 1
        if socket_timeout_count > max_socket_timeouts:
            print("Maximum socket timeouts exceeded. Exiting...")
            os.killpg(os.getpgrp(), signal.SIGTERM)
        else:
            print(f"Retrying operation after socket timeout (attempt {socket_timeout_count}/{max_socket_timeouts})...")
            continue  # Continue the loop to retry

@Fei-Guo @ishaansehgal99

Service not being created on workspace deployment

When testing with an image that I pulled into my own Azure Container Registry from the supported MCR images for Kaito, the pod comes online fine and is functional; however, the Kubernetes service is never created.

  1. Pull an image for Kaito into your own ACR:
    az acr import --name kaitodemo --source mcr.microsoft.com/aks/kaito/kaito-mistral-7b-instruct:0.0.4 --image mistral-7b-instruct:0.0.4

  2. Deploy a manifest with custom image source:

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-mistral-7b
resource:
  instanceType: "Standard_NC64as_T4_v3"
  labelSelector:
    matchLabels:
      apps: mistral-7b
inference:
  template:
    spec:
      containers:
      - name: mistral-7b-instruct-container
        image: kaitodemo.azurecr.io/mistral-7b-instruct:0.0.4
        command: ["accelerate"]
        args: ["launch", "--num_processes", "1", "--num_machines", "1", "--gpu_ids", "all", "text-gen-inference.py", "--pipeline", "text-generation", "--torch_dtype", "bfloat16"]
        volumeMounts:
        - name: dshm
          mountPath: /dev/shm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
  3. Watch for the creation of the deployment and service

Expected Result
Pods and service all created successfully

Environment

WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:13:27Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"28", GitVersion:"v1.28.5", GitCommit:"9cf543f4f17cbc5b74c24880e77590eeb1af683c", GitTreeState:"clean", BuildDate:"2024-01-31T09:07:34Z", GoVersion:"go1.20.12", Compiler:"gc", Platform:"linux/amd64"}

Additional context

Examine the default inference parameters

So far, the default inference parameters mainly come from the library defaults. We have found the default values were changed in the library without any justifications and we also found some of the changes lead to suboptimal inference output from language perspective. It would be ideal to design/use benchmarks to drive the changes of Kaito default values for inference.

Swagger API Docs

Add the Swagger API docs for inference - /presets/inference/text-generation/API.md

Segmentation fault in Llama2

Hi, I've uploaded the llama2 model image to Azure but I'm facing a Segmentation fault error in Python that is preventing my container to start.

Any suggestions?

Output

> kubectl logs workspace-llama-2-7b-0
Fatal Python error: Segmentation fault

Current thread 0x00007fa2975e0b80 (most recent call first):
  File "/usr/local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113 in _call_store
  File "/usr/local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 64 in __init__
  File "/usr/local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 253 in create_backend
  File "/usr/local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36 in _create_c10d_handler
  File "/usr/local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/api.py", line 258 in create_handler
  File "/usr/local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66 in get_rendezvous_handler
  File "/usr/local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 238 in launch_agent        
  File "/usr/local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 135 in __call__
  File "/usr/local/lib/python3.12/site-packages/torch/distributed/run.py", line 803 in run
  File "/usr/local/lib/python3.12/site-packages/torch/distributed/run.py", line 812 in main
  File "/usr/local/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347 in wrapper
  File "/usr/local/bin/torchrun", line 8 in 

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 20)
Segmentation fault (core dumped)

Kaito workspace-controller support for Karpenter nodeclaim

Describe the bug

The Kaito workspace controller seems to be compatible only with Karpenter versions prior to v0.33.0, as the machine CRD (karpenter.sh/v1alpha5) was deprecated in that release and the controller seems to rely on that machine CRD for spinning up the workspace.

Steps To Reproduce

  1. Create an AKS cluster with NAP enabled
  2. Install Kaito workspace-controller helm install workspace kaito/workspace --namespace workspace --create-namespace
  3. Add a workspace: kubectl apply -f https://raw.githubusercontent.com/Azure/kaito/main/examples/inference/kaito_workspace_phi-2.yaml
  4. Nothing happens; run kubectl api-resources and see that there is no machines CRD (karpenter.sh/v1alpha5)

Expected behavior

Since Kaito, per the docs, supports node provisioning controllers that support Karpenter-core APIs, it should also support the new Karpenter nodeclaim CRD (the machine CRD was deprecated in December). In the best-case scenario, Kaito (workspace-controller) should also run on AKS with NAP enabled and should be aware of the Karpenter version (or of which CRD is available: machines vs. nodeclaims).

Logs
Controller starts and prints this:

2024-04-04T10:07:49Z INFO Starting EventSource {"controller": "workspace", "controllerGroup": "kaito.sh", "controllerKind": "Workspace", "source": "kind source: *v1alpha5.Machine"}

Environment

  • Kubernetes version (use kubectl version): 1.28.5
  • Install tools: AKS NAP (Karpenter)

Support Karpenter Tasks

Installation fails workspace:0.2.1 Image not found

The installation fails with the provided values for the helm chart.

Failed to pull image "mcr.microsoft.com/aks/kaito/workspace:0.2.1": rpc error: code = NotFound

desc = failed to pull and unpack image "mcr.microsoft.com/aks/kaito/workspace:0.2.1": failed to resolve reference "mcr.microsoft.com/aks/kaito/workspace:0.2.1": mcr.microsoft.com/aks/kaito/workspace:0.2.1: not found

Bumping down to tag: 0.2.0 works.

9894f3d

Combine with KEDA

It would be amazing to combine this with KEDA and the http scaler so the instances would scale to 0 when not in use.

failed calling webhook when namespace is not `kaito-workspace`

Describe the bug

If you follow the installation docs here and here, it is stated to install the workspace controller into the namespace kaito-workspace. But when you do this, you get a validation error when adding any preset model.

Steps To Reproduce

helm repo add kaito https://azure.github.io/kaito/charts/kaito
helm repo update
helm install workspace kaito/workspace --namespace kaito-workspace --create-namespace
kubectl apply -f https://raw.githubusercontent.com/Azure/kaito/main/examples/inference/kaito_workspace_phi-2.yaml

Expected behavior

Installation in any namespace should be the final solution, but until then the docs should be adapted.

Logs

Error from server (InternalError): error when creating "https://raw.githubusercontent.com/Azure/kaito/main/examples/inference/kaito_workspace_phi-2.yaml": Internal error occurred: failed calling webhook "validation.workspace.kaito.sh": failed to call webhook: Post "https://workspace.workspace.svc:9443/?timeout=10s": service "workspace" not found

Environment

  • Kubernetes version (use kubectl version): v1.28.5

Additional context

PR for this is created here #334

Error while installing Inference Examples

I have installed the Kaito add-on on AKS. When running the phi-2 example I get an error saying:

Error from server (InternalError): error when creating "examples/inference/kaito_workspace_phi-2.yaml": Internal error occurred: failed calling webhook "validation.workspace.kaito.sh": failed to call webhook: Post "https://workspace.workspace.svc:9443/?timeout=10s": service "workspace" not found

Output from the svc command is as follows

~/dev/projects/msft/kaito main ❯ kubectl get svc -A    13:32:36
NAMESPACE         NAME                               TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)             AGE
default           kubernetes                         ClusterIP   10.0.0.1      <none>        443/TCP             65m
kaito-workspace   workspace                          ClusterIP   10.0.193.0    <none>        8080/TCP,9443/TCP   20m
kube-system       ama-metrics-ksm                    ClusterIP   10.0.54.212   <none>        8080/TCP            58m
kube-system       azure-wi-webhook-webhook-service   ClusterIP   10.0.9.150    <none>        443/TCP             17m
kube-system       kube-dns                           ClusterIP   10.0.0.10     <none>        53/UDP,53/TCP       65m
kube-system       metrics-server                     ClusterIP   10.0.219.30   <none>        443/TCP             65m

Output of kubectl get Pods -A is this

NAMESPACE         NAME                                                   READY   STATUS             RESTARTS        AGE
gpu-provisioner   gpu-provisioner-6fb5dfcb6b-l657r                       1/1     Running            3 (18m ago)     18m
kaito-workspace   workspace-578f7f9b97-z9nkp                             0/1     CrashLoopBackOff   9 (2m51s ago)   25m
kube-system       ama-metrics-575c7c7c87-xzfth                           2/2     Running            0               63m
kube-system       ama-metrics-ksm-d9c6f475b-7652s                        1/1     Running            0               63m
kube-system       ama-metrics-node-s29zq                                 2/2     Running            0               63m
kube-system       azure-ip-masq-agent-4gfw6                              1/1     Running            0               67m
kube-system       azure-wi-webhook-controller-manager-6dc49dfffd-5qpwf   1/1     Running            0               22m
kube-system       azure-wi-webhook-controller-manager-6dc49dfffd-rqfwj   1/1     Running            0               22m
kube-system       cloud-node-manager-5ngrx                               1/1     Running            0               67m
kube-system       coredns-7459659b97-92m4b                               1/1     Running            0               70m
kube-system       coredns-7459659b97-ssmwn                               1/1     Running            0               66m
kube-system       coredns-autoscaler-7c88465478-cwtpl                    1/1     Running            0               70m
kube-system       csi-azuredisk-node-j5qt2                               3/3     Running            0               67m
kube-system       csi-azurefile-node-hcn8b                               3/3     Running            0               67m
kube-system       konnectivity-agent-57f5549bdc-hh8hp                    1/1     Running            0               22m
kube-system       konnectivity-agent-57f5549bdc-pbf5z                    1/1     Running            0               22m
kube-system       kube-proxy-tqgjq                                       1/1     Running            0               67m
kube-system       metrics-server-7fd45bf99d-q5qrj                        2/2     Running            0               66m
kube-system       metrics-server-7fd45bf99d-v2v8v                        2/2     Running            0               66m

Logs from the workspace pod are as follows:

2024-04-05T17:34:45Z INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
I0405 17:34:45.590174 1 main.go:122] "starting webhook reconcilers"
2024/04/05 17:34:45 Registering 1 clients
2024/04/05 17:34:45 Registering 2 informer factories
2024/04/05 17:34:45 Registering 2 informers
2024/04/05 17:34:45 Registering 2 controllers
I0405 17:34:47.591241 1 main.go:142] "starting manager"
2024-04-05T17:34:47Z INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
W0405 17:34:47.593569 1 reflector.go:533] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kaito-workspace:workspace-sa" cannot list resource "pods" in API group "" at the cluster scope
E0405 17:34:47.593619 1 reflector.go:148] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kaito-workspace:workspace-sa" cannot list resource "pods" in API group "" at the cluster scope
W0405 17:34:48.983035 1 reflector.go:533] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kaito-workspace:workspace-sa" cannot list resource "pods" in API group "" at the cluster scope
E0405 17:34:48.983075 1 reflector.go:148] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kaito-workspace:workspace-sa" cannot list resource "pods" in API group "" at the cluster scope
2024/04/05 17:34:50 Error reading/parsing logging configuration: timed out waiting for the condition: configmaps "config-logging" is forbidden: User "system:serviceaccount:kaito-workspace:workspace-sa" cannot get resource "configmaps" in API group "" in the namespace "kaito-workspace"

Action required: migrate or opt-out of migration to GitHub inside Microsoft

Migrate non-Open Source or non-External Collaboration repositories to GitHub inside Microsoft

In order to protect and secure Microsoft, private or internal repositories in GitHub for Open Source which are not related to open source projects or require collaboration with 3rd parties (customer, partners, etc.) must be migrated to GitHub inside Microsoft a.k.a GitHub Enterprise Cloud with Enterprise Managed User (GHEC EMU).

Action

✍️ Please RSVP to opt-in or opt-out of the migration to GitHub inside Microsoft.

❗Only users with admin permission in the repository are allowed to respond. Failure to provide a response will result in your repository getting automatically archived.🔒

Instructions

Reply with a comment on this issue containing one of the following optin or optout command options below.

✅ Opt-in to migrate

@gimsvc optin --date <target_migration_date in mm-dd-yyyy format>

Example: @gimsvc optin --date 03-15-2023

OR

❌ Opt-out of migration

@gimsvc optout --reason <staging|collaboration|delete|other>

Example: @gimsvc optout --reason staging

Options:

  • staging : This repository will ship as Open Source or go public
  • collaboration : Used for external or 3rd party collaboration with customers, partners, suppliers, etc.
  • delete : This repository will be deleted because it is no longer needed.
  • other : Other reasons not specified

Need more help? 🖐️

Best way to train a custom dataset?

I've got Llama 2 running in an AKS cluster, but I need to train my model with a custom dataset and make it work with Kaito. Do you have any suggestions?

Thanks in advance.

OpenAI compatibility for open source models

Having OpenAI-compatible endpoints would make it easier for developers to use different models with the same code base.

Describe the solution you'd like
Beyond just having the compatible endpoints, it should also be payload compatible for things like function calling and image upload. I've seen the JSON payload vary for both of those with different open-source models.

Describe alternatives you've considered
Using a tool like LiteLLM to host the OpenAI compatible endpoints is what I use now, and it works with any models I can host with Ollama, for example.

Support for other LLMs?

Hi, is there any roadmap to support other open-source LLMs? If there is any documentation already in place, please share.

Other instance types besides Standard_NC12s_v3?

The Quick start section features the "Standard_NC12s_v3" instance type for starting an inference service:

resource:
  instanceType: "Standard_NC12s_v3"

The Standard_NC12s_v3 instance is GPU-powered; however, it is very expensive. What is the minimum (cost-wise) instance type required?

Phi-3 Models Support

20240527-phi3-instruct.md
Is your feature request related to a problem? Please describe.
Add Phi-3 Models to Supported List

Describe the solution you'd like
Addition of the latest SLMs from Microsoft to the supported list of models. The proposal document is attached here.
