Comments (4)

jfroy commented on July 16, 2024

The Talos Nvidia driver extensions install nvidia-smi under /usr/local/bin, which is a somewhat non-standard location for an Nvidia driver component (other components are under /usr/local/lib, which is also non-standard; this will come up later if you read on). The current release version of nvidia-validator will not find nvidia-smi at that path. However, the main branch of the operator and operator-validator has significantly different code (to handle driver container images). If you override the operator and operator-validator images to use one of the daily CI builds on GitHub, you should get past that issue.
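
To check on a given node which directory actually holds nvidia-smi, a small probe along these lines may help. This is only a sketch: the directory list and the helper names (candidateDirs, isExecutable, findBinary) are assumptions made for illustration, not the validator's real search order or code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// candidateDirs is an illustrative list (an assumption), not the search
// order used by nvidia-validator. /usr/local/bin is where the Talos
// extensions are reported to install nvidia-smi.
var candidateDirs = []string{"/usr/bin", "/usr/sbin", "/bin", "/sbin", "/usr/local/bin"}

// isExecutable reports whether path is a regular file with at least one
// execute permission bit set.
func isExecutable(path string) bool {
	info, err := os.Stat(path)
	if err != nil {
		return false
	}
	return info.Mode().IsRegular() && info.Mode().Perm()&0o111 != 0
}

// findBinary returns the first dir/name for which check reports true.
// The check is passed in so the search logic can be exercised without
// touching the real filesystem.
func findBinary(name string, dirs []string, check func(string) bool) (string, bool) {
	for _, dir := range dirs {
		p := filepath.Join(dir, name)
		if check(p) {
			return p, true
		}
	}
	return "", false
}

func main() {
	if p, ok := findBinary("nvidia-smi", candidateDirs, isExecutable); ok {
		fmt.Println("found:", p)
	} else {
		fmt.Println("nvidia-smi not found in", candidateDirs)
	}
}
```

Run inside a container that has the host driver root mounted, this would show whether the binary is only reachable through /usr/local/bin.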

However, you will then find that the device plug-in will not find a core CUDA library as part of its driver detection process. This is because of the aforementioned custom install path for other driver components. Furthermore, Talos applies a patch to the container toolkit to change the ldcache path (which the toolkit uses to find libraries), because Talos needs to maintain separate glibc and musl LD caches and thus stores them in custom locations. You will need to patch the device plug-in, build and publish a custom image, and use that image to get past that issue. Something like this:

diff --git a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go
index 2f6de2fe..35f62f45 100644
--- a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go
+++ b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go
@@ -33,7 +33,7 @@ import (
 	"github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/symlinks"
 )
 
-const ldcachePath = "/etc/ld.so.cache"
+const ldcachePath = "/usr/local/glibc/etc/ld.so.cache"
 
 const (
 	magicString1 = "ld.so-1.7.0"
diff --git a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go
index 7f5cf7c8..85fd1db9 100644
--- a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go
+++ b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go
@@ -36,6 +36,7 @@ func NewLibraryLocator(opts ...Option) Locator {
 
 	// If search paths are already specified, we return a locator for the specified search paths.
 	if len(b.searchPaths) > 0 {
+		b.logger.Infof("Returning symlink locator with paths: %v", b.searchPaths)
 		return NewSymlinkLocator(
 			WithLogger(b.logger),
 			WithSearchPaths(b.searchPaths...),
@@ -56,6 +57,7 @@ func NewLibraryLocator(opts ...Option) Locator {
 			"/lib/aarch64-linux-gnu",
 			"/lib/x86_64-linux-gnu/nvidia/current",
 			"/lib/aarch64-linux-gnu/nvidia/current",
+			"/usr/local/lib",
 		}...),
 	)
 	// We construct a symlink locator for expected library locations.
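
The effect of adding /usr/local/lib to the search paths can be illustrated with a standalone sketch of a search-path lookup. This is not the toolkit's locator: findLibraries, the directory list, and the libcuda.so.* pattern are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// findLibraries returns every match for pattern in dirs, in search-path
// order. The glob function is a parameter (filepath.Glob in normal use)
// so the lookup can be tested without a real filesystem.
func findLibraries(pattern string, dirs []string, glob func(string) ([]string, error)) []string {
	var found []string
	for _, dir := range dirs {
		matches, err := glob(filepath.Join(dir, pattern))
		if err != nil {
			continue // malformed pattern for this directory; skip it
		}
		found = append(found, matches...)
	}
	return found
}

func main() {
	// Hypothetical search paths: two conventional locations plus the
	// Talos-specific /usr/local/lib that the patch above adds.
	dirs := []string{"/usr/lib64", "/lib/x86_64-linux-gnu", "/usr/local/lib"}

	// The library name is an assumption; substitute the one the device
	// plug-in reports as missing.
	for _, p := range findLibraries("libcuda.so.*", dirs, filepath.Glob) {
		fmt.Println(p)
	}
}
```

If the program prints nothing until /usr/local/lib is in the list, that matches the failure described above: the library exists, but not on any path the locator searches.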

With the previously mentioned upcoming support for driver container images in the GPU operator, Talos may want to consider reworking their Nvidia extensions to deliver all the components as container images. That should hopefully provide a more supported and long-term stable solution.

from extensions.

TimJones commented on July 16, 2024

Hi @jfroy, one issue here is that Talos requires signed drivers, and the signing key is ephemeral to each build process, which is why each release of Talos has a specific corresponding release of each system extension.
We would love to work through some potential ideas with you if you'd be interested in joining our Slack community, or even be available for a call to run through them.

jfroy commented on July 16, 2024

> Hi @jfroy, one issue here is that Talos requires signed drivers, and the signing key is ephemeral to each build process, which is why each release of Talos has a specific corresponding release of each system extension.
>
> We would love to work through some potential ideas with you if you'd be interested in joining our Slack community, or even be available for a call to run through them.

Yeah I like that Talos provides a chain of trust. You would need a per-release driver container just like you have a per-release extension.

I work at Nvidia, but I only speak for myself here. It would be inappropriate to engage beyond the occasional comment and bug fix PR on GitHub. I will however reach out to the folks working on our container technologies.

TimJones commented on July 16, 2024

> I will however reach out to the folks working on our container technologies.

That would be greatly appreciated, and thank you for reaching out in the first instance.
