edgetpu-device-plugin's Introduction

edgetpu-device-plugin

Experimental Kubernetes Device Plugin for Coral Edge TPU

edgetpu-device-plugin's Issues

K3S compatibility

First of all, thank you for sharing your work on this plugin; it has already saved me a lot of time.

I have started working on k3s compatibility for this device plugin, and I have gotten to the point where the plugin discovers available TPUs and registers them with the kubelet.

$ kubectl describe nodes
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource              Requests   Limits
  --------              --------   ------
  cpu                   100m (2%)  0 (0%)
  memory                70Mi (7%)  170Mi (18%)
  ephemeral-storage     0 (0%)     0 (0%)
  kkohtaka.org/edgetpu  1          1

I0327 17:02:58.014104       1 plugin.go:98] Started gRPC service on plugin socket
I0327 17:02:58.014183       1 plugin.go:101] Started monitoring devices
I0327 17:02:58.014211       1 plugin.go:49] gRPC server started.
I0327 17:02:58.015416       1 plugin.go:118] Opened connection to kubelet socket
I0327 17:02:58.015486       1 plugin.go:121] Registering dpServer: &{[] 0xd58100 0xd58140}
I0327 17:02:58.017211       1 plugin.go:134] Registered device plugin
I0327 17:02:58.019309       1 server.go:56] Start watching devices
I0327 17:02:58.019419       1 server.go:66] Update a device list
I0327 17:02:58.019455       1 server.go:126] Starting Edge TPU device monitor
I0327 17:03:03.184672       1 server.go:155] Edge TPU became active.
I0327 17:03:03.185096       1 server.go:66] Update a device list
I0327 17:04:03.390627       1 server.go:79] Container TPU request: &ContainerAllocateRequest{DevicesIDs:[42],}
I0327 17:04:03.390765       1 server.go:80] Allocating devices... Device IDs: [42]

Side note: the device ID appears to be hardcoded to 42, so only one TPU per node is currently supported. Do you plan to support multiple devices?

I am able to schedule a pod that requests a TPU, but the container fails to start with the following error:
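
As a rough illustration of what per-device IDs could look like, here is a minimal Go sketch. It is not the plugin's actual code: the `usbDevice` struct, the `deviceIDs` helper, and the `edgetpu-<sysfs name>` ID scheme are all hypothetical, and the two vendor:product pairs are simply the ones that appear in the `lsusb` output quoted in the second issue on this page.

```go
package main

import (
	"fmt"
	"sort"
)

// usbDevice is a hypothetical, simplified view of one USB device on the host.
type usbDevice struct {
	SysName string // sysfs-style name such as "2-1" (assumed to identify a physical port)
	Vendor  string // idVendor as lowercase hex
	Product string // idProduct as lowercase hex
}

// edgeTPUIDs holds the vendor:product pairs treated as an Edge TPU.
// Assumption: both pairs are copied from the lsusb output quoted on this page.
var edgeTPUIDs = map[string]bool{
	"1a6e:089a": true,
	"18d1:9302": true,
}

// deviceIDs returns one ID per matching device, sorted so the list is stable
// between calls regardless of enumeration order.
func deviceIDs(devs []usbDevice) []string {
	var ids []string
	for _, d := range devs {
		if edgeTPUIDs[d.Vendor+":"+d.Product] {
			ids = append(ids, "edgetpu-"+d.SysName)
		}
	}
	sort.Strings(ids)
	return ids
}

func main() {
	devs := []usbDevice{
		{SysName: "2-1", Vendor: "18d1", Product: "9302"},
		{SysName: "1-3", Vendor: "1a6e", Product: "089a"},
		{SysName: "1-1", Vendor: "046d", Product: "c52b"}, // unrelated device, skipped
	}
	fmt.Println(deviceIDs(devs)) // prints [edgetpu-1-3 edgetpu-2-1]
}
```

Deriving the ID from a port-based name rather than the bus/device number is a design choice made here because the device number can change while the accelerator stays plugged in.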

Events:
  Type     Reason     Age               From                  Message
  ----     ------     ----              ----                  -------
  Normal   Scheduled  86s               default-scheduler     Successfully assigned default/edgetpu-demo-54f5l to raspberrypi
  Normal   Pulling    2s (x2 over 84s)  kubelet, raspberrypi  pulling image "quay.io/kkohtaka/edgetpu-demo:arm32"
  Warning  Failed     2s                kubelet, raspberrypi  Error: failed to generate container "41e1245b846a2a815f54ca40741d10fc071f91b34eadf22309c52f949ea1d4ce" spec: failed to set devices mapping [&Device{ContainerPath:/dev/bus/usb,HostPath:/dev/bus/usb,Permissions:rw,}]: not a device node
  Normal   Pulled     1s (x2 over 2s)   kubelet, raspberrypi  Successfully pulled image "quay.io/kkohtaka/edgetpu-demo:arm32"
  Warning  Failed     1s                kubelet, raspberrypi  Error: failed to generate container "d72f48a0b04a560b1c7e81ec20d686b6ca3710f0a02fd38cecb1b937b52e8d05" spec: failed to set devices mapping [&Device{ContainerPath:/dev/bus/usb,HostPath:/dev/bus/usb,Permissions:rw,}]: not a device node

It looks like I need to specify an absolute path to the device, but I'm not sure what that would be. Could you point me in the right direction? I'm testing with a Raspberry Pi B+ and a Coral USB Accelerator.

If you're open to it, I'd be happy to submit a PR for k3s support once I get this working.

Host USB devices changes when sample runs

I'm running your device plugin in OpenShift 3.11, which has Kubernetes under the hood. I realize you might not have tested it with OCP, but I figured you might be able to help. Here is the setup:

  • Physical host (with edge tpu usb device attached)
  • OCP cluster virtual machines

I am able to connect the TPU to the physical host and run the Python demo code to show that it works. I can even assign the USB device to a compute node VM and run the Python demo code inside the VM to show that the VM sees the device and can talk to it.

Before I do anything with the daemonset I see this on the physical host:

$ lsusb
Bus 002 Device 005: ID 1a6e:089a Global Unichip Corp. 

I then use your YAML to deploy the DaemonSet. One of the pods in the DaemonSet shows:

$ oc logs -f edgetpu-device-plugin-52sjt
I0812 16:17:10.264373       1 plugin.go:98] Started gRPC service on plugin socket
I0812 16:17:10.264399       1 plugin.go:101] Started monitoring devices
I0812 16:17:10.264404       1 plugin.go:49] gRPC server started.
I0812 16:17:10.264607       1 plugin.go:118] Opened connection to kubelet socket
I0812 16:17:10.268002       1 server.go:56] Start watching devices
I0812 16:17:10.268025       1 server.go:66] Update a device list
I0812 16:17:10.268092       1 plugin.go:132] Registered device plugin
I0812 16:17:15.369094       1 server.go:150] Edge TPU became active.
I0812 16:17:15.369137       1 server.go:66] Update a device list

So far that all looks good. I then deploy the sample with your YAML file, and it comes back with:

$ oc logs -f edgetpu-demo-9cb92

ERROR: Failed to retrieve TPU context.
ERROR: Node number 0 (edgetpu-custom-op) failed to prepare.

Failed in Tensor allocation, status_code: 1

And then if I go back to the physical host:

$ lsusb
Bus 002 Device 006: ID 18d1:9302 Google Inc. 

The device changed from 002:005 to 002:006, as if the physical host thinks the USB device was disconnected and reconnected. I had seen this before I started using your code: I would run a container on the VM and it would fail, I would see that the device had changed on the host, and after I re-added the device to the VM and ran the container again, it would work.

Would you have any insight into why talking to the device, or somehow assigning it to a container, causes this name change? Thank you.
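
One possible explanation, which nothing in this thread confirms, is that the accelerator re-enumerates under a different vendor:product ID once the Edge TPU runtime first initializes it, and the host sees that as a disconnect followed by a reconnect. Under that assumption, anything that matches the device by a single ID would lose track of it. The Go sketch below parses `lsusb`-style lines and recognizes both IDs quoted above; the `classify` helper and the "uninitialized"/"initialized" labels are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// lsusbLine matches lines such as "Bus 002 Device 005: ID 1a6e:089a ...".
var lsusbLine = regexp.MustCompile(`^Bus (\d{3}) Device (\d{3}): ID ([0-9a-f]{4}):([0-9a-f]{4})`)

// classify labels one lsusb line. Mapping the two IDs to "uninitialized" and
// "initialized" is an assumption based on the order in which they were seen.
func classify(line string) string {
	m := lsusbLine.FindStringSubmatch(line)
	if m == nil {
		return "unparsed"
	}
	switch m[3] + ":" + m[4] {
	case "1a6e:089a":
		return "uninitialized"
	case "18d1:9302":
		return "initialized"
	default:
		return "other"
	}
}

func main() {
	lines := []string{
		"Bus 002 Device 005: ID 1a6e:089a Global Unichip Corp. ",
		"Bus 002 Device 006: ID 18d1:9302 Google Inc. ",
	}
	for _, l := range lines {
		fmt.Printf("%q -> %s\n", l, classify(l))
	}
}
```

If the assumption holds, any USB passthrough rule for the VM would need to cover both IDs so that the device stays attached across the change.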
