shelmangroup / coreos-gpu-installer
Scripts to build and use a container to install GPU drivers on CoreOS Container Linux
Home Page: https://shelman.io
License: Apache License 2.0
Hi,
I'm getting this error:
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most
frequently when this kernel module was built against the wrong or
improperly configured kernel sources, with a version of gcc that
differs from the one used to build the target kernel, or if a driver
such as rivafb, nvidiafb, or nouveau is present and prevents the
NVIDIA kernel module from obtaining ownership of the NVIDIA graphics
device(s), or no NVIDIA GPU installed in this system is supported by
this NVIDIA Linux graphics driver release.
Please see the log entries 'Kernel module load error' and 'Kernel
messages' at the end of the file
'/usr/local/nvidia/nvidia-installer.log' for more information.
ERROR: Installation has failed. Please see the file
'/usr/local/nvidia/nvidia-installer.log' for details. You may find
suggestions on fixing installation problems in the README available
on the Linux driver download page at www.nvidia.com.
In the nvidia-installer log I see:
LD [M] /tmp/selfgz31129/NVIDIA-Linux-x86_64-390.46/kernel/nvidia-drm.ko
LD [M] /tmp/selfgz31129/NVIDIA-Linux-x86_64-390.46/kernel/nvidia-modeset.ko
LD [M] /tmp/selfgz31129/NVIDIA-Linux-x86_64-390.46/kernel/nvidia.ko
LD [M] /tmp/selfgz31129/NVIDIA-Linux-x86_64-390.46/kernel/nvidia-uvm.ko
make[1]: Leaving directory '/build/usr/src/linux'
-> done.
-> Kernel module compilation complete.
-> Unable to determine if Secure Boot is enabled: No such file or directory
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if a driver such as rivafb, nvidiafb, or nouveau is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA graphics device(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.
Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/usr/local/nvidia/nvidia-installer.log' for more information.
-> Kernel module load error: No such file or directory
-> Kernel messages:
[ 270.651503] nvidia: Unknown symbol ipmi_free_recv_msg (err 0)
[ 270.651523] nvidia: Unknown symbol ipmi_set_my_address (err 0)
[ 270.651541] nvidia: Unknown symbol ipmi_request_settime (err 0)
[ 270.651552] nvidia: Unknown symbol ipmi_set_gets_events (err 0)
[ 353.361619] nvidia: Unknown symbol ipmi_create_user (err 0)
[ 353.361661] nvidia: Unknown symbol ipmi_destroy_user (err 0)
[ 353.361699] nvidia: Unknown symbol ipmi_validate_addr (err 0)
[ 353.361716] nvidia: Unknown symbol ipmi_free_recv_msg (err 0)
[ 353.361728] nvidia: Unknown symbol ipmi_set_my_address (err 0)
[ 353.361745] nvidia: Unknown symbol ipmi_request_settime (err 0)
[ 353.361756] nvidia: Unknown symbol ipmi_set_gets_events (err 0)
[ 451.583456] nvidia: Unknown symbol ipmi_create_user (err 0)
[ 451.583501] nvidia: Unknown symbol ipmi_destroy_user (err 0)
[ 451.583539] nvidia: Unknown symbol ipmi_validate_addr (err 0)
[ 451.583555] nvidia: Unknown symbol ipmi_free_recv_msg (err 0)
[ 451.583567] nvidia: Unknown symbol ipmi_set_my_address (err 0)
[ 451.583606] nvidia: Unknown symbol ipmi_request_settime (err 0)
[ 451.583623] nvidia: Unknown symbol ipmi_set_gets_events (err 0)
[ 558.985632] nvidia: Unknown symbol ipmi_create_user (err 0)
[ 558.985674] nvidia: Unknown symbol ipmi_destroy_user (err 0)
[ 558.985711] nvidia: Unknown symbol ipmi_validate_addr (err 0)
[ 558.985727] nvidia: Unknown symbol ipmi_free_recv_msg (err 0)
[ 558.985739] nvidia: Unknown symbol ipmi_set_my_address (err 0)
[ 558.985756] nvidia: Unknown symbol ipmi_request_settime (err 0)
[ 558.985767] nvidia: Unknown symbol ipmi_set_gets_events (err 0)
ERROR: Installation has failed. Please see the file '/usr/local/nvidia/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
On the other hand, my automated build based off your fork of container-engine-accelerators is still working fine: https://hub.docker.com/r/fish/nvidia-driver-installer-containerlinux/
Any idea what changed between that and this stand-alone repo? Possibly some kernel header mismatch?
Running CoreOS 1520.8.0, which is kernel 4.13.9-coreos.
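For what it's worth, the repeated "Unknown symbol ipmi_*" messages above usually mean that the in-kernel IPMI handler modules were not loaded before nvidia.ko was inserted (newer NVIDIA drivers reference those symbols). A sketch of a possible pre-load step; the module names are the standard in-tree ones, not something this repo ships, and the insmod path is only illustrative:

```shell
# Load the IPMI modules that export the ipmi_* symbols nvidia.ko references,
# then retry the driver load. Requires root on the host.
modprobe ipmi_msghandler
modprobe ipmi_devintf
# then retry, e.g.:
# insmod /path/to/kernel/nvidia.ko
```

If this is the cause, the installer container may need to run these before invoking the NVIDIA installer.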
In entrypoint.sh, the "NVIDIA_DRIVER_DOWNLOAD_URL" variable interpolates the requested driver version into the URL from which to download the driver runfile. When choosing a recent driver version like 410.48, the interpolated URL yields a 404 response from nvidia.com. Instead, the working URL looks more like this:
https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
It's possible to use the "NVIDIA_DRIVER_DOWNLOAD_URL" variable to override the template entirely, but we may want to use a more recent template.
Mind you, this driver doesn't build correctly for me yet, but I'm still investigating that problem separately.
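As a stopgap, the override can be set directly in the installer daemonset's env for the init container. A sketch of the stanza; only NVIDIA_DRIVER_DOWNLOAD_URL itself comes from entrypoint.sh, and the URL is the new-style CUDA 10.0 runfile location quoted above:

```yaml
# Env override sketch for the driver installer daemonset.
- name: NVIDIA_DRIVER_DOWNLOAD_URL
  value: https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
```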
Hello, and thank you a lot for this repo.
I installed the nvidia driver:
kubectl apply -f https://raw.githubusercontent.com/shelmangroup/coreos-gpu-installer/master/daemonset.yaml
Then the device manager:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
Then submit a pod:
kind: Pod
apiVersion: v1
metadata:
  name: nvidia-pod
spec:
  containers:
    - name: nvidia
      image: 'nvidia/cuda'
      command:
        - bash
        - -c
        - "exec sleep 60000"
      resources:
        limits:
          nvidia.com/gpu: 1
I just have one small issue: the binaries/libraries are not available from within the pod. When I copy the contents of /opt/nvidia to /home/kubernetes/bin/nvidia on the host, everything works like a charm.
After running docker inspect on my container, I figured out the mount source was /home/kubernetes/bin/nvidia:
{
  "Type": "bind",
  "Source": "/home/kubernetes/bin/nvidia",
  "Destination": "/usr/local/nvidia",
  "Mode": "ro",
  "RW": false,
  "Propagation": "rprivate"
},
but the host path should be /opt/nvidia, as specified in the driver installer daemonset:
- name: NVIDIA_INSTALL_DIR_HOST
  value: /opt/nvidia
I tried to redeploy the driver-installer daemonset with /home/kubernetes/bin/nvidia as NVIDIA_INSTALL_DIR_HOST and an updated volume, but the init pod returns an error:
overlayfs: filesystem on 'bin' not supported as upperdir
I don't know how I could configure that path on the device manager; it seems to be hardcoded: https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/70a1113ee4ceb413e21c54ca943bc2f4b984117d/cmd/nvidia_gpu/nvidia_gpu.go#L33
Am I running the wrong version of the device manager? What version of the device manager daemonSet manifest are you using?
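One possible workaround sketch, instead of copying files: bind-mount the existing install dir onto the path the device plugin has hardcoded. The paths are taken from the messages above; this requires root on the host and would need to be made persistent (e.g. via a systemd mount unit) to survive reboots:

```shell
# Expose /opt/nvidia at the path the device plugin expects, without copying
# and without changing NVIDIA_INSTALL_DIR_HOST (which hits the overlayfs error).
sudo mkdir -p /home/kubernetes/bin/nvidia
sudo mount --bind /opt/nvidia /home/kubernetes/bin/nvidia
```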
Thank you.
Hello,
Thanks for your work on NVIDIA driver installation on CoreOS, it works really well.
I just have an issue when I reboot the server where the drivers are installed.
Indeed, the command nvidia-smi does not work anymore:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
It seems like some kernel modules are not loaded anymore:
$ lsmod | grep nvidia
returns an empty output.
Do you have an idea about what should be done to make it work again after a reboot?
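In case it helps debugging, here is a sketch of what a manual recovery might look like: the built modules are not installed into the kernel's standard module path, so they are not reloaded automatically at boot. The layout under the install dir is an assumption (NVIDIA_INSTALL_DIR_HOST=/opt/nvidia), so locate the .ko files first:

```shell
# Find the modules the installer built (install dir path is an assumption):
find /opt/nvidia -name 'nvidia*.ko'
# Load dependencies first, then the driver modules, in order:
sudo modprobe ipmi_msghandler ipmi_devintf || true
sudo insmod "$(find /opt/nvidia -name nvidia.ko | head -n 1)"
sudo insmod "$(find /opt/nvidia -name nvidia-uvm.ko | head -n 1)"
```

Alternatively, re-running the installer daemonset pod on the node should rebuild/reload the modules.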
Thanks,
Thomas
Hi
It would be helpful if you could support setting up nvidia-docker v2 and the nvidia runtime environment for using the GPU drivers. Driver installation on a Tesla V100 Volta 32 GB and a GeForce 1080 Ti 8 GB works perfectly well, and the drivers are installed to /opt, as /usr is a read-only filesystem in the CoreOS distribution.
CoreOS version installed: 2079.3.0 (Stable)
k8s version: v1.14.1 (flannel)
Docker version 18.06.3-ce, build d7080c1
Hardware:
1080Ti Desktop: Linux coreos-01 4.19.34-coreos #1 SMP Mon Apr 22 20:32:34 -00 2019 x86_64 Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz GenuineIntel GNU/Linux
v100 EDGE: Linux coreos-01 4.19.34-coreos #1 SMP Mon Apr 22 20:32:34 -00 2019 x86_64 Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz GenuineIntel GNU/Linux
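For reference, on most distributions nvidia-docker v2 is wired up through Docker's daemon.json "runtimes" stanza; a sketch of what that could look like here. The /opt/bin path for nvidia-container-runtime is an assumption (it has to live somewhere writable on CoreOS, since /usr is read-only):

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/opt/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```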
Near the end of the steps performed by the entrypoint.sh script, the verify_nvidia_installation function invokes nvidia-smi and nvidia-modprobe. I don't expect those would run without error on a machine with no GPU, but what about the rest of the script?
I ask because I'll be invoking this script (running it inside a Docker or rkt container) when building an AMI for use on Amazon EC2 instances. We usually build our AMIs on impoverished, small instance types with no GPUs, and I was considering doing the same here. We'd build the image on a machine without a GPU, but then use the image later with machines that do have at least one GPU.
If we made the verify_nvidia_installation step optional (whether by command-line flag or environment variable), could the rest succeed? The verify_nvidia_installation function doesn't just check whether nvidia-smi succeeds; it also calls nvidia-modprobe, which sounds necessary. I can't tell whether nvidia-modprobe is part of confirming success, or an essential step we need to perform before making use of the GPU.
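A sketch of what the environment-variable opt-out could look like in entrypoint.sh. The SKIP_VERIFICATION name and the stub body are hypothetical; only verify_nvidia_installation is from the script:

```shell
# Hypothetical opt-out: skip the nvidia-smi / nvidia-modprobe verification
# step when building images on GPU-less machines. SKIP_VERIFICATION is an
# assumed variable name, not part of the current entrypoint.sh.
verify_nvidia_installation() {
  # stub standing in for the real nvidia-smi / nvidia-modprobe checks
  echo "verifying"
}

if [ "${SKIP_VERIFICATION:-false}" = "true" ]; then
  echo "verification skipped"
else
  verify_nvidia_installation
fi
```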
Current Dockerhub images are outdated and practically unusable. A CI pipeline for the Dockerhub images should be implemented. Best wishes from SEED :)