Comments (4)
The Talos Nvidia driver extensions install nvidia-smi under /usr/local/bin, which is a somewhat non-standard location for an Nvidia driver component (other components are under /usr/local/lib, which is also non-standard; this will come up later if you read on). The current release version of nvidia-validator will not find nvidia-smi at that path. However, the main branch of the operator and operator-validator has significantly different code (to handle driver container images). If you override the operator and operator-validator images to use one of the daily CI builds on GitHub, you should get past that issue.
However, you will then find that the device plug-in will not find a core CUDA library as part of its driver detection process. This is because of the aforementioned custom install path for other driver components. Furthermore, Talos applies a patch to the container toolkit to change the ldcache path (which the toolkit uses to find libraries), because Talos needs to maintain separate glibc and musl LD caches and thus stores them in custom locations. You will need to patch the device plug-in, build and publish a custom image, and use that image to get past that issue. Something like this:
diff --git a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go
index 2f6de2fe..35f62f45 100644
--- a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go
+++ b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go
@@ -33,7 +33,7 @@ import (
 	"github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/symlinks"
 )
 
-const ldcachePath = "/etc/ld.so.cache"
+const ldcachePath = "/usr/local/glibc/etc/ld.so.cache"
 
 const (
 	magicString1 = "ld.so-1.7.0"
diff --git a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go
index 7f5cf7c8..85fd1db9 100644
--- a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go
+++ b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go
@@ -36,6 +36,7 @@ func NewLibraryLocator(opts ...Option) Locator {
 
 	// If search paths are already specified, we return a locator for the specified search paths.
 	if len(b.searchPaths) > 0 {
+		b.logger.Infof("Returning symlink locator with paths: %v", b.searchPaths)
 		return NewSymlinkLocator(
 			WithLogger(b.logger),
 			WithSearchPaths(b.searchPaths...),
@@ -56,6 +57,7 @@ func NewLibraryLocator(opts ...Option) Locator {
 			"/lib/aarch64-linux-gnu",
 			"/lib/x86_64-linux-gnu/nvidia/current",
 			"/lib/aarch64-linux-gnu/nvidia/current",
+			"/usr/local/lib",
 		}...),
 	)
 	// We construct a symlink locator for expected library locations.
With the previously mentioned upcoming support for driver container images in the GPU operator, Talos may want to consider reworking its Nvidia extensions to deliver all the components as a container image. That should hopefully provide a better-supported and more stable long-term solution.
Hi @jfroy, one issue here is that Talos requires signed drivers, and the signing key is ephemeral to each build process, which is why each release of Talos has a specific corresponding release of each system extension.
We would love to work through some potential ideas with you if you'd be interested in joining our Slack community, or would be available for a call to run through them.
Yeah, I like that Talos provides a chain of trust. You would need a per-release driver container, just like you have a per-release extension.
I work at Nvidia, but I only speak for myself here. It would be inappropriate to engage beyond the occasional comment and bug fix PR on GitHub. I will however reach out to the folks working on our container technologies.
I will however reach out to the folks working on our container technologies.
That would be greatly appreciated, and thank you for reaching out in the first instance.