

PCI Devices Controller


PCI Devices Controller is a Kubernetes controller that:

  • Discovers PCI Devices for nodes in your cluster and
  • Allows users to prepare devices for PCI Passthrough, for use with KubeVirt-managed virtual machines.

API

This operator introduces these CRDs:

  • PCIDevice
  • PCIDeviceClaim
  • USBDevice
  • USBDeviceClaim

These CRDs cover two types of devices. In general, once a device is claimed, it is marked as "healthy" and is ready to be used. Each device type has its own custom device plugin to manage the device lifecycle.

  • PCI Device: Each resource name might have multiple PCI devices.
  • USB Device: Each resource name only has one USB device.

PCIDevice

This custom resource represents PCI Devices on the host. The motivation behind getting a list of PCIDevice objects for a node is to have a cloud-native equivalent to the lspci command.

For example, suppose I have a 3-node cluster:

NAME     STATUS   ROLES                       AGE   VERSION
node1    Ready    control-plane,etcd,master   26h   v1.24.3+k3s1
node2    Ready    control-plane,etcd,master   26h   v1.24.3+k3s1
node3    Ready    control-plane,etcd,master   26h   v1.24.3+k3s1

If I wanted to see which PCI devices were on node1, I would have to ssh into it and run lspci:

user@host % ssh node1
user@node1 ~$ lspci
00:1c.0 PCI bridge: Intel Corporation Device 06b8 (rev f0)
00:1c.7 PCI bridge: Intel Corporation Device 06bf (rev f0)
00:1d.0 PCI bridge: Intel Corporation Comet Lake PCI Express Root Port #9 (rev f0)
00:1f.0 ISA bridge: Intel Corporation Device 068e

But as more nodes are added to the cluster, this kind of manual work gets tedious. The solution is to have a DaemonSet that runs lspci on each node and then synchronizes the results with the rest of the cluster.

CRD

The PCIDevice CR looks like this:

apiVersion: devices.harvesterhci.io/v1beta1
kind: PCIDevice
metadata:
  name: pcidevice-sample
status:
  address: "00:1f.6"
  vendorId: "8086"
  deviceId: "0d4c"
  classId: "0200"
  nodeName: "titan"
  resourceName: "intel.com/ETHERNET_CONNECTION_11_I219LM"
  description: "Ethernet controller: Intel Corporation Ethernet Connection (11) I219-LM"
  kernelDriverInUse: "e1000e"

PCIDeviceClaim

This custom resource is created to store the request to prepare a device for PCI Passthrough. It has an address and a nodeName, since each request is unique for a device on a particular node.

CRD

The PCIDeviceClaim CR looks like this:

apiVersion: devices.harvesterhci.io/v1beta1
kind: PCIDeviceClaim
metadata:
  name: pcideviceclaim-sample
spec:
  address: "00:1f.6"
  nodeName: "titan"
  userName: "yuri"
status:
  kernelDriverToUnbind: "e1000e"
  passthroughEnabled: true

The PCIDeviceClaim is created with the target PCI address of the device that the user wants to prepare for PCI Passthrough. status.passthroughEnabled is set to false while preparation is in progress, and to true once the device is bound to the vfio-pci driver.

The status.kernelDriverToUnbind is stored so that deleting the claim can re-bind the device to the original driver.

USBDevice

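This custom resource represents the USB devices collected from all nodes and exposes them to users, so they know which USB devices are available to use.

Because several devices can share the same vendor and product IDs, the bus and device numbers are appended to the name as a suffix to keep it unique. In the example below the suffix is 002004, taken from the device path /dev/bus/usb/002/004.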

CRD

apiVersion: devices.harvesterhci.io/v1beta1
kind: USBDevice
metadata:
  labels:
    nodename: jacknode
  name: jacknode-0951-1666-002004
status:
  description: DataTraveler 100 G3/G4/SE9 G2/50 Kyson (Kingston Technology)
  devicePath: /dev/bus/usb/002/004
  enabled: false
  nodeName: jacknode
  pciAddress: "0000:02:01.0"
  productID: "1666"
  resourceName: kubevirt.io/jacknode-0951-1666-002004
  vendorID: "0951"
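
The naming scheme can be sketched in Go as follows. This is an illustration that assumes the suffix is taken from the last two components of devicePath; usbDeviceName is a hypothetical helper, not the controller's own function.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// usbDeviceName builds "<node>-<vendorID>-<productID>-<bus><device>".
// devicePath is expected to look like /dev/bus/usb/<bus>/<device>.
func usbDeviceName(nodeName, vendorID, productID, devicePath string) (string, error) {
	dev := filepath.Base(devicePath)               // e.g. "004"
	bus := filepath.Base(filepath.Dir(devicePath)) // e.g. "002"
	if dev == "." || dev == "/" || bus == "." || bus == "/" {
		return "", fmt.Errorf("unexpected device path %q", devicePath)
	}
	return fmt.Sprintf("%s-%s-%s-%s%s", nodeName, vendorID, productID, bus, dev), nil
}
```

With the values from the CR above, usbDeviceName("jacknode", "0951", "1666", "/dev/bus/usb/002/004") yields jacknode-0951-1666-002004, and a second identical stick on device number 005 would get a different name.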

USBDeviceClaim

When users decide which USB devices they want to use, they create a USBDeviceClaim. After the USBDeviceClaim is created, the controller updates spec.configuration.permittedHostDevices.usb in the KubeVirt resource.

The pciAddress field is used to prevent users from enabling passthrough for the PCI device at that address.

CRD

apiVersion: devices.harvesterhci.io/v1beta1
kind: USBDeviceClaim
metadata:
  name: jacknode-0951-1666-002004
  ownerReferences:
    - apiVersion: devices.harvesterhci.io/v1beta1
      kind: USBDevice
      name: jacknode-0951-1666-002004
      uid: b584d7d0-0820-445e-bc90-4ffbfe82d63b
spec: {}
status:
  userName: ""
  nodeName: jacknode
  pciAddress: "0000:02:01.0"

Controllers

There is a DaemonSet that runs the device controller on each node. The controller reconciles the stored list of devices for that node to the actual current list of devices for that node.

PCIDevice Controller

The PCIDevice controller will pick up on the new currently active driver automatically, as part of its normal operation.

PCIDeviceClaim Controller

The PCIDeviceClaim controller will process the requests by attempting to set up devices for PCI Passthrough. The steps involved are:

  • Load vfio-pci kernel module
  • Unbind current driver from device
  • Create a driver_override for the device
  • Bind the vfio-pci driver to the device

Once the device is confirmed to have been bound to vfio-pci, the PCIDeviceClaim controller will delete the request.

USBDevice Controller

The USBDevice Controller triggers the collection of USB devices on the host at two specific times:

  • When the pod starts.
  • When there is a change in the USB devices (a device being plugged in or removed), primarily detected through fsnotify.

Once the USBDevice Controller detects the USB devices on the host, it converts them into the USBDevice CRD format and creates the corresponding resources, ignoring any devices that have already been created.

USBDeviceClaim Controller

The USBDeviceClaim Controller is triggered when a user decides which USB device to use and creates a USB device claim. Upon the creation of a USBDeviceClaim, this controller performs two main actions:

  • It activates the device plugin.
  • It updates the kubevirt resource.

This controller primarily manages the lifecycle of the device plugin, ensuring that the claimed USB device is properly integrated and available for use.

Limitation

USB Device

  • Live migration is not supported.
  • Hot-plug (including re-plug) is not supported.
  • If the device path changes, the USBDeviceClaim must be re-created to enable the USB device again. This happens in the following situations:
    • Re-plugging the USB device.
    • Rebooting the node.

Daemon

The daemon runs on each node in the cluster and builds up the PCIDevice list. A DaemonSet ensures that this daemon is running on each node.

Alternatives considered

Node Feature Discovery (NFD) detects all kinds of features, like CPU features, USB devices, PCI devices, etc. It needs to be configured, and its output is a node label that tells whether a given device is present or not.

This only detects the presence or absence of a device, not how many of them there are:

  "feature.node.kubernetes.io/pci-<device label>.present": "true",

Another reason not to use these simple labels is that we want to be able to allow our customers to set custom RBAC rules that restrict who can use which device in the cluster. We can do that with a custom PCIDevice CRD, but it's not clear how to do that with node labels.

License

Copyright (c) 2024 Rancher Labs, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

pcidevices's People

Contributors

chrisho, eliaskoromilas, frankyang0529, ibrokethecloud, tlehman, votdev, yu-jack


pcidevices's Issues

[BUG] Making a PCIDeviceClaim for a device in an IOMMU group can violate IOMMU restrictions, preventing successful allocation to a VM

For example, suppose you have an NVIDIA GPU with

  • a VGA controller on board and
  • an Audio device on board

Both have their own PCI addresses, and they belong to the same IOMMU group.

If you claim just the VGA controller but not the Audio device, then it cannot be attached to a KubeVirt VM. Currently, users have to claim both and then assign them both to the same VM, but this is a sub-par user experience. It allows the user to violate IOMMU restrictions.

One way to solve this problem would be to have the PCIDeviceClaim controller automatically create claims for all members of the same IOMMU group. But assignment to a VM will have to change as well, since it's possible that two VMs get assigned members of the same IOMMU group.

See this archived page for a more detailed discussion of this failure mode.

file is not closed in some cases

The file is not closed on every return path.

Several similar functions have the same issue, e.g. unbindPCIDeviceFromDriver and unbindPCIDeviceFromVfioPCIDriver. Please check and update them.

func addNewIdToVfioPCIDriver(vendorId string, deviceId string) error {
	var id string = fmt.Sprintf("%s %s", vendorId, deviceId)

	file, err := os.OpenFile("/sys/bus/pci/drivers/vfio-pci/new_id", os.O_WRONLY, 0400)
	if err != nil {
		return err
	}
	_, err = file.WriteString(id)
	if err != nil {
		return err   // the file is not closed in this return
	}
	file.Close()
	return nil
}

Found when reviewing: #14

How to install pcidevices into k8s using helm

I'm not using Harvester.
Following issues #31 and #33, I ran:

  1. kubectl apply -f crd.yaml
  2. helm install harvester-pcidevices-controller -n virtnest-system --debug ./harvester-pcidevices-controller-0.2.6.tgz

harvester-pcidevices-controller log

time="2024-01-15T10:20:42Z" level=info msg="No access to list CRDs, assuming CRDs are pre-created."
time="2024-01-15T10:20:42Z" level=info msg="Registering PCI Device Claims controller"
time="2024-01-15T10:20:42Z" level=fatal msg="error registering pcidevicclaim controllers :pcideviceclaims.devices.harvesterhci.io is forbidden: User \"system:serviceaccount:virtnest-system:harvester-pcidevices-controller\" cannot list resource \"pcideviceclaims\" in API group \"devices.harvesterhci.io\" at the cluster scope"

Build failure

When I try to build this, I see a bunch of timeout errors connecting to proxy.golang.org

pkg/deviceplugins/common.go:37:2: kubevirt.io/kubevirt@v0.54.0: Get "https://proxy.golang.org/kubevirt.io/kubevirt/@v/v0.54.0.zip": dial tcp: lookup proxy.golang.org on 192.168.1.1:53: read udp 172.17.0.3:48019->192.168.1.1:53: i/o timeout

I can curl proxy.golang.org on the host, but inside the container (dapper sh) I can't curl anything, so this is a container networking issue:

% dapper sh
sh-4.4#   curl https://proxy.golang.org
curl: (6) Could not resolve host: proxy.golang.org

PDC controller is slow for user claim

The PCIDeviceClaim (PDC) controller is slow to react to a user's claim because it depends on a 20s timer.

A better solution is to work like a normal controller: watch and react to changes on PDC objects, and use a timer only to enqueue a routine reconcile.

Controller doesn't update permittedHostDevices

Environment:

I have a Harvester v1.3.1 cluster with:

  • one master: DELL T7820 with an NVIDIA P620
  • one worker: DELL T7920 with an NVIDIA P5000

Steps:

  • I enabled passthrough for both GPUs.
  • I created an Ubuntu 22 VM, added the P5000 device to it, and started it.

Error:

  • HostDevice nvidia.com/GP104GL_QUADRO_P5000 is not permitted in permittedHostDevices configuration

When I run
kubectl get kubevirt -n harvester-system -o yaml
I see that the device is not in the list.

Looking in the logs, I see the controller attempts to add it:

kubectl logs -n harvester-system harvester-pcidevices-controller-fgml9
time="2024-06-22T22:14:52Z" level=info msg="Adding t7920-0000d5000 to KubeVirt list of permitted devices"
time="2024-06-22T22:14:52Z" level=info msg="Enabling passthrough for PDC: t7920-0000d5000"
time="2024-06-22T22:14:52Z" level=info msg="Binding device t7920-0000d5000 [10de 1bb0] to vfio-pci"
time="2024-06-22T22:14:52Z" level=info msg="Binding device 0000:d5:00.0 vfio-pci"
time="2024-06-22T22:14:52Z" level=error msg="error syncing 't7920-0000d5000': handler PCIDeviceClaimReconcile: error writing to bind file: write /sys/bus/pci/drivers/vfio-pci/bind: device or resource busy, requeuing"
time="2024-06-22T22:14:52Z" level=info msg="Adding t7920-0000d5001 to KubeVirt list of permitted devices"
time="2024-06-22T22:14:52Z" level=info msg="Enabling passthrough for PDC: t7920-0000d5001"
time="2024-06-22T22:14:53Z" level=info msg="Binding device t7920-0000d5001 [10de 10f0] to vfio-pci"
time="2024-06-22T22:14:53Z" level=info msg="Binding device 0000:d5:00.1 vfio-pci"
time="2024-06-22T22:14:53Z" level=error msg="error syncing 't7920-0000d5001': handler PCIDeviceClaimReconcile: error writing to bind file: write /sys/bus/pci/drivers/vfio-pci/bind: device or resource busy, requeuing"
time="2024-06-22T22:14:53Z" level=info msg="Adding t7920-0000d5000 to KubeVirt list of permitted devices"
time="2024-06-22T22:14:53Z" level=info msg="Creating DevicePlugin: nvidia.com/GP104GL_QUADRO_P5000"
time="2024-06-22T22:14:53Z" level=info msg="Started DevicePlugin: nvidia.com/GP104GL_QUADRO_P5000"
{"component":"","level":"info","msg":"Initialized DevicePlugin: nvidia.com/GP104GL_QUADRO_P5000","pos":"device_manager.go:206","timestamp":"2024-06-22T22:14:53.581669Z"}
time="2024-06-22T22:14:53Z" level=info msg="Adding t7920-0000d5001 to KubeVirt list of permitted devices"
time="2024-06-22T22:14:53Z" level=info msg="Creating DevicePlugin: nvidia.com/GP104_HIGH_DEFINITION_AUDIO_CONTROLLER"
time="2024-06-22T22:14:53Z" level=info msg="Started DevicePlugin: nvidia.com/GP104_HIGH_DEFINITION_AUDIO_CONTROLLER"
{"component":"","level":"info","msg":"Initialized DevicePlugin: nvidia.com/GP104_HIGH_DEFINITION_AUDIO_CONTROLLER","pos":"device_manager.go:206","timestamp":"2024-06-22T22:14:53.899496Z"}
time="2024-06-22T22:14:54Z" level=info msg="Adding t7920-0000d5000 to KubeVirt list of permitted devices"
time="2024-06-22T22:14:54Z" level=info msg="Adding t7920-0000d5001 to KubeVirt list of permitted devices"
time="2024-06-22T22:14:55Z" level=info msg="Adding t7920-0000d5000 to KubeVirt list of permitted devices"
time="2024-06-22T22:14:55Z" level=info msg="Adding t7920-0000d5001 to KubeVirt list of permitted devices"

But something is preventing that entry from being added to "permittedHostDevices -> pciHostDevices".

I was able to resolve it by running

kubectl edit kubevirt -n harvester-system -o yaml

and adding the entry manually:

    - externalResourceProvider: true
      pciVendorSelector: 10de:1bb0
      resourceName: nvidia.com/GP104GL_QUADRO_P5000

Then the VM started.

But I still need to figure out why the entry was not added as expected.

Also, I see a lot of duplicates in that section:
kubevirt.yaml.txt
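
One way to avoid duplicate entries is to make the add operation idempotent. The Go sketch below is a hedged illustration: pciHostDevice is a local stand-in whose fields mirror the YAML keys shown above, and addPermitted is a hypothetical helper, not the controller's code.

```go
package main

// pciHostDevice is a local stand-in for one permittedHostDevices entry.
type pciHostDevice struct {
	PCIVendorSelector        string // e.g. "10de:1bb0"
	ResourceName             string // e.g. "nvidia.com/GP104GL_QUADRO_P5000"
	ExternalResourceProvider bool
}

// addPermitted returns list with dev appended unless an entry with the same
// vendor selector and resource name is already present. The bool reports
// whether the list changed, so the caller can skip a needless update.
func addPermitted(list []pciHostDevice, dev pciHostDevice) ([]pciHostDevice, bool) {
	for _, existing := range list {
		if existing.PCIVendorSelector == dev.PCIVendorSelector && existing.ResourceName == dev.ResourceName {
			return list, false
		}
	}
	return append(list, dev), true
}
```

Returning the changed flag lets a reconcile loop call this repeatedly, as the log above suggests happens, without growing the list.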

Introduce a separate `resourceName` property for KubeVirt device plugin names

Description

I suggest using the .status.description property for a human-readable, lspci-like description of the PCI device and introducing a .spec.resourceName property for the device plugin name. It would be nice to allow users to edit this.

Describe the results you received:

$ kubectl get pcidevices
NAME                           ADDRESS        VENDOR ID   DEVICE ID   NODE NAME   DESCRIPTION                     KERNEL DRIVER IN USE
...
minikube-10de-174d-000001000   0000:01:00.0   10de        174d        minikube    nvidia.com/GM108MGeForceMX130   nouveau
...

Describe the results you expected:

$ kubectl get pcidevices
NAME                           ADDRESS        VENDOR ID   DEVICE ID   CLASS ID   NODE NAME   RESOURCE NAME                  DESCRIPTION                                                 KERNEL DRIVER IN USE
...
minikube-10de-174d-000001000   0000:01:00.0   10de        174d        0302       minikube    nvidia.com/GM108MGeForceMX130  3D controller: NVIDIA Corporation GM108M [GeForce MX130]    nouveau
...

attemptToEnablePassthrough func return object conflict

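attemptToEnablePassthrough can return an object conflict error. The suggested modification is to wrap the status update in retry.RetryOnConflict: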

retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
	// Re-fetch the claim on every attempt so the update uses the latest resourceVersion.
	newPdc, err := h.pdcClient.Get(pdc.Name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	newPdc.Status.KernelDriverToUnbind = pd.Status.KernelDriverInUse
	newPdc.Status.PassthroughEnabled = true
	_, err = h.pdcClient.UpdateStatus(newPdc)
	return err
})

Race condition preventing PCIDeviceClaims from being deleted in some cases

time="2022-10-07T00:01:44Z" level=info msg="Removing janus-000024002 from KubeVirt list of permitted devices"
time="2022-10-07T00:01:44Z" level=info msg="Attempting to enable passthrough for janus-000024002"
time="2022-10-07T00:01:45Z" level=info msg="Adding janus-000024002 to KubeVirt list of permitted devices"
time="2022-10-07T00:01:46Z" level=info msg="Attempting to disable passthrough for janus-000024000"
time="2022-10-07T00:01:46Z" level=info msg="Removing janus-000024000 from KubeVirt list of permitted devices"
time="2022-10-07T00:01:47Z" level=info msg="Attempting to enable passthrough for janus-000024000"
time="2022-10-07T00:01:47Z" level=info msg="Adding janus-000024000 to KubeVirt list of permitted devices"
time="2022-10-07T00:01:47Z" level=info msg="Attempting to disable passthrough for janus-000004000"
time="2022-10-07T00:01:47Z" level=info msg="Removing janus-000004000 from KubeVirt list of permitted devices"
time="2022-10-07T00:01:48Z" level=info msg="Attempting to enable passthrough for janus-000004000"
time="2022-10-07T00:01:48Z" level=info msg="Adding janus-000004000 to KubeVirt list of permitted devices"
time="2022-10-07T00:01:50Z" level=info msg="Attempting to disable passthrough for janus-000024001"
time="2022-10-07T00:01:50Z" level=info msg="Removing janus-000024001 from KubeVirt list of permitted devices"
time="2022-10-07T00:01:50Z" level=info msg="Attempting to disable passthrough for janus-000024003"
time="2022-10-07T00:01:50Z" level=info msg="Removing janus-000024003 from KubeVirt list of permitted devices"
time="2022-10-07T00:01:51Z" level=info msg="Attempting to enable passthrough for janus-000024001"
time="2022-10-07T00:01:51Z" level=info msg="Attempting to enable passthrough for janus-000024003"
time="2022-10-07T00:01:51Z" level=info msg="Adding janus-000024001 to KubeVirt list of permitted devices"
time="2022-10-07T00:01:52Z" level=info msg="Adding janus-000024003 to KubeVirt list of permitted devices"
time="2022-10-07T00:02:04Z" level=info msg="Reconciling PCI Device Claims list"
time="2022-10-07T00:02:14Z" level=info msg="Attempting to disable passthrough for janus-000004001"
time="2022-10-07T00:02:14Z" level=info msg="Removing janus-000004001 from KubeVirt list of permitted devices"
time="2022-10-07T00:02:14Z" level=info msg="Attempting to enable passthrough for janus-000004001"
time="2022-10-07T00:02:15Z" level=info msg="Adding janus-000004001 to KubeVirt list of permitted devices"
time="2022-10-07T00:02:15Z" level=info msg="Attempting to disable passthrough for janus-000024002"
time="2022-10-07T00:02:16Z" level=info msg="Removing janus-000024002 from KubeVirt list of permitted devices"
time="2022-10-07T00:02:16Z" level=info msg="Attempting to enable passthrough for janus-000024002"
time="2022-10-07T00:02:16Z" level=info msg="Adding janus-000024002 to KubeVirt list of permitted devices"
time="2022-10-07T00:02:18Z" level=info msg="Attempting to disable passthrough for janus-000024000"
time="2022-10-07T00:02:18Z" level=info msg="Removing janus-000024000 from KubeVirt list of permitted devices"
time="2022-10-07T00:02:18Z" level=info msg="Attempting to enable passthrough for janus-000024000"
time="2022-10-07T00:02:19Z" level=info msg="Adding janus-000024000 to KubeVirt list of permitted devices"
time="2022-10-07T00:02:19Z" level=info msg="Attempting to disable passthrough for janus-000004000"

Notice that the enable and disable can be running in parallel.

How to use it?

How do I use it with Harvester?
I created crds.yaml, along with a PCIDevice and a PCIDeviceClaim. What is the next step?

Can anyone help?

I found a bug in the indexer

Adding to the indexer: the index function should use VirtualMachine.Name as the key.
Otherwise, when reading from the indexer:
looking up by VmName-NameSpace returns a response saying the VM list doesn't exist.

[Bug] KubeVirt Allow-list doesn't always get refreshed

Changing the kubevirt list of permitted devices will trigger a change to the .status.allocatable value. If you add multiple identical devices, it only counts the first one. Also, if you remove the device, the allocatable list is not updated.

Proposed solution: remove the device and then re-add it, to trigger an update.

Question about reconcilePCIDeviceClaims

Question about reconcilePCIDeviceClaims:

  1. The following Update operations do not use a copied object, but the locally queried one:

				_, err = h.pdcClient.Update(&pdc)
				if err != nil {
					return err
				}
				_, err = h.pdcClient.UpdateStatus(&pdc)
				if err != nil {
					return err
				}

  2. The pdc spec does not seem to change in reconcilePCIDeviceClaims, yet it calls both Update and UpdateStatus. Is a single Update* call enough?

  3. Should if pdc.DeletionTimestamp != nil be checked before if !pdc.Status.PassthroughEnabled?

  4. Code that changes pdc.Status.PassthroughEnabled and then returns err may leave the (k8s) controller's locally cached data inconsistent with the apiserver:

						if err != nil {
							pdc.Status.PassthroughEnabled = false
							return err
						}
