66ring / elastic-gpu-scheduler

This project forked from elastic-ai/elastic-gpu-scheduler

elastic-gpu-scheduler is a Kubernetes scheduler extender for GPU resources scheduling.

License: Apache License 2.0

Go 98.48% Makefile 0.66% Dockerfile 0.86%

Elastic GPU Scheduler

About This Project

With the continuous evolution of cloud native AI scenarios, more and more users run AI tasks on Kubernetes, which also brings more and more challenges to GPU resource scheduling.

Elastic GPU Scheduler is a GPU scheduling framework based on Kubernetes that focuses on GPU sharing and allocation.

You may also be interested in Elastic GPU Agent, a Kubernetes device plugin implementation.

Motivation

In the GPU container field, GPU providers such as NVIDIA have introduced a Docker-based GPU containerization project that lets users consume GPU cards in Kubernetes Pods via Kubernetes extended resources and the NVIDIA k8s device plugin. However, that project focuses on how containers use GPU cards on Kubernetes nodes, not on GPU resource scheduling.

Elastic GPU Scheduler is built on the Kubernetes scheduler extender mechanism. It can schedule GPU cores, memory and percentages, share one GPU among multiple containers, and even spread the containers of a pod across different GPUs. The scheduling algorithm supports binpack, spread, random and other policies. In addition, through the companion Elastic GPU Agent, it can be adapted to nvidia-docker, gpushare, qgpu and other GPU container solutions. Elastic GPU Scheduler mainly addresses GPU resource scheduling and allocation requirements in Kubernetes.
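
To make the difference between the binpack and spread policies concrete, here is a minimal Go sketch. It is not the project's actual implementation: the gpu type, the pick function and the policy strings are hypothetical names chosen for illustration, and the only assumption taken from this README is that 100 gpu-core units correspond to one whole card.

```go
package main

import "fmt"

// gpu is a hypothetical view of one card, measured in gpu-core units
// (assumption: 100 units = one whole card).
type gpu struct {
	Name      string
	Total     int
	Allocated int
}

// pick returns the name of the card chosen for a request of `want` units,
// or "" if no card has enough free capacity.
// "binpack" prefers the fitting card with the least free capacity;
// "spread" prefers the fitting card with the most free capacity.
func pick(gpus []gpu, want int, policy string) string {
	best := -1
	for i, g := range gpus {
		free := g.Total - g.Allocated
		if free < want {
			continue // this card cannot hold the request
		}
		if best == -1 {
			best = i
			continue
		}
		bestFree := gpus[best].Total - gpus[best].Allocated
		if (policy == "binpack" && free < bestFree) ||
			(policy == "spread" && free > bestFree) {
			best = i
		}
	}
	if best == -1 {
		return ""
	}
	return gpus[best].Name
}

func main() {
	gpus := []gpu{
		{Name: "gpu0", Total: 100, Allocated: 70}, // 30 units free
		{Name: "gpu1", Total: 100, Allocated: 20}, // 80 units free
	}
	fmt.Println(pick(gpus, 25, "binpack")) // gpu0
	fmt.Println(pick(gpus, 25, "spread"))  // gpu1
}
```

Binpack fills the busiest card that still fits, which keeps other cards free for large requests; spread does the opposite to reduce contention on any single card.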

Architecture

Prerequisites

  • Kubernetes v1.17+
  • golang 1.16+
  • NVIDIA drivers
  • nvidia-docker
  • Set nvidia as the Docker default runtime: add "default-runtime": "nvidia" to /etc/docker/daemon.json, then restart the Docker daemon.

Build Image

Run make or TAG=<image-tag> make to build the elastic-gpu-scheduler image.

Getting Started

  1. Deploy Elastic GPU Agent
$ kubectl apply -f https://raw.githubusercontent.com/elastic-gpu/elastic-gpu-agent/master/deploy/elastic-gpu-agent.yaml

For more information, please refer to Elastic GPU Agent.

  2. Deploy Elastic GPU Scheduler
$ kubectl apply -f deploy/elastic-gpu-scheduler.yaml
  3. Enable Kubernetes scheduler extender

Below Kubernetes v1.23

Add the following configuration to the extenders section of the file passed via --policy-config-file (<elastic-gpu-scheduler-svc-clusterip> is the cluster IP of the elastic-gpu-scheduler service, which you can find with kubectl get svc elastic-gpu-scheduler -n kube-system -o jsonpath='{.spec.clusterIP}').

{
  "urlPrefix": "http://<elastic-gpu-scheduler-svc-clusterip>:39999/scheduler",
  "filterVerb": "filter",
  "prioritizeVerb": "priorities",
  "bindVerb": "bind",
  "weight": 1,
  "enableHttps": false,
  "nodeCacheCapable": true,
  "managedResources": [
    {
      "name": "elasticgpu.io/gpu-memory"
    },
    {
      "name": "elasticgpu.io/gpu-core"
    }
  ]
}

You can set a scheduling policy by running kube-scheduler --policy-config-file <filename> or kube-scheduler --policy-configmap <ConfigMap>. Here is a scheduler policy config sample.

From Kubernetes v1.23

The --policy-config-file flag of kube-scheduler is no longer supported. Use --config=/etc/kubernetes/scheduler-policy-config.yaml instead, and create a scheduler-policy-config.yaml file that complies with the KubeSchedulerConfiguration requirements.

apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
extenders:
- urlPrefix: "http://<elastic-gpu-scheduler-svc-clusterip>:39999/scheduler"
  filterVerb: filter
  prioritizeVerb: priorities
  bindVerb: bind
  weight: 1
  enableHTTPS: false
  nodeCacheCapable: true
  managedResources:
  - name: elasticgpu.io/gpu-core
  - name: elasticgpu.io/gpu-memory
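
With either configuration, kube-scheduler sends HTTP POST requests to urlPrefix plus the verb, for example .../scheduler/filter. The Go sketch below shows roughly what a filter endpoint does with such a request. It is an illustration under stated assumptions, not this project's code: filterArgs and filterResult are simplified stand-ins for the real extender request and response bodies, the pod is reduced to a hypothetical GPUMemoryMB field, and freeMemMB is an invented in-memory cache.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// filterArgs is a simplified stand-in for the extender filter request.
// GPUMemoryMB is hypothetical; the real request carries a full Pod object.
type filterArgs struct {
	GPUMemoryMB int
	NodeNames   *[]string // candidate nodes, sent when nodeCacheCapable is true
}

// filterResult is a simplified stand-in for the extender filter response.
type filterResult struct {
	NodeNames   *[]string
	FailedNodes map[string]string
	Error       string
}

// freeMemMB is an assumed cache of free GPU memory per node, in MB.
var freeMemMB = map[string]int{"node-a": 128, "node-b": 4096}

// filter keeps the candidate nodes with enough free GPU memory and
// records a reason for every node it rejects.
func filter(args filterArgs) filterResult {
	fit := []string{}
	failed := map[string]string{}
	if args.NodeNames != nil {
		for _, n := range *args.NodeNames {
			if freeMemMB[n] >= args.GPUMemoryMB {
				fit = append(fit, n)
			} else {
				failed[n] = "insufficient GPU memory"
			}
		}
	}
	return filterResult{NodeNames: &fit, FailedNodes: failed}
}

func filterHandler(w http.ResponseWriter, r *http.Request) {
	var args filterArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(filter(args))
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/scheduler/filter", filterHandler)
	srv := httptest.NewServer(mux) // throwaway local server for the demo
	defer srv.Close()

	body := []byte(`{"GPUMemoryMB":256,"NodeNames":["node-a","node-b"]}`)
	resp, err := http.Post(srv.URL+"/scheduler/filter", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out filterResult
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(*out.NodeNames, out.FailedNodes)
}
```

Because nodeCacheCapable is set to true in the configurations above, the scheduler passes node names rather than full node objects, so the extender is expected to keep its own view of per-node GPU state.
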
  4. Create a pod that shares one GPU
cat <<EOF  | kubectl create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-gpu-test
  labels:
    app: gpu-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      containers:
        - name: cuda
          image: nvidia/cuda:10.0-base
          command: [ "sleep", "100000" ]
          resources:
            limits:
              elasticgpu.io/gpu-memory: "256" # 256MB memory
EOF
  5. Create a pod with multiple GPU cards
cat <<EOF  | kubectl create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-gpu-test
  labels:
    app: gpu-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      containers:
        - name: cuda
          image: nvidia/cuda:10.0-base
          command: [ "sleep", "100000" ]
          resources:
            limits:
              elasticgpu.io/gpu-core: "200" # 2 GPU cards
EOF
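
The two manifests imply the resource units: elasticgpu.io/gpu-memory is counted in MB, and elasticgpu.io/gpu-core uses 100 units per card, so "200" means two cards. The small Go sketch below only illustrates that arithmetic. splitCores and unitsPerCard are hypothetical names, and how the scheduler actually treats requests that are not multiples of 100 is not stated here.

```go
package main

import "fmt"

// unitsPerCard is inferred from the example above ("200" = 2 GPU cards).
const unitsPerCard = 100

// splitCores splits a gpu-core request into whole cards and leftover units.
func splitCores(units int) (whole, remainder int) {
	return units / unitsPerCard, units % unitsPerCard
}

func main() {
	for _, u := range []int{50, 100, 200} {
		w, r := splitCores(u)
		fmt.Printf("gpu-core=%d -> %d whole card(s), %d leftover unit(s)\n", u, w, r)
	}
}
```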

Roadmap

  • Support GPU topology-aware scheduling
  • Support GPU load-aware scheduling

License

Distributed under the Apache License 2.0.

Contributors

kerthcet, xiaoxubeii
