shelmangroup / coreos-gpu-installer

Scripts to build and use a container to install GPU drivers on CoreOS Container Linux

Home Page: https://shelman.io

License: Apache License 2.0

Languages: Makefile 2.00%, Shell 89.44%, Dockerfile 8.56%
Topics: container, docker, drivers, gpu, kubernetes, nvidia

coreos-gpu-installer's People

Contributors

linki, lsjostro, seh


coreos-gpu-installer's Issues

Can't load nvidia module due to unknown symbols

Hi,

I'm getting this error:

ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most
       frequently when this kernel module was built against the wrong or
       improperly configured kernel sources, with a version of gcc that
       differs from the one used to build the target kernel, or if a driver
       such as rivafb, nvidiafb, or nouveau is present and prevents the
       NVIDIA kernel module from obtaining ownership of the NVIDIA graphics
       device(s), or no NVIDIA GPU installed in this system is supported by
       this NVIDIA Linux graphics driver release.
       
       Please see the log entries 'Kernel module load error' and 'Kernel
       messages' at the end of the file
       '/usr/local/nvidia/nvidia-installer.log' for more information.


ERROR: Installation has failed.  Please see the file
       '/usr/local/nvidia/nvidia-installer.log' for details.  You may find
       suggestions on fixing installation problems in the README available
       on the Linux driver download page at www.nvidia.com.

In the NVIDIA installer log I see:

     LD [M]  /tmp/selfgz31129/NVIDIA-Linux-x86_64-390.46/kernel/nvidia-drm.ko
     LD [M]  /tmp/selfgz31129/NVIDIA-Linux-x86_64-390.46/kernel/nvidia-modeset.ko
     LD [M]  /tmp/selfgz31129/NVIDIA-Linux-x86_64-390.46/kernel/nvidia.ko
     LD [M]  /tmp/selfgz31129/NVIDIA-Linux-x86_64-390.46/kernel/nvidia-uvm.ko
   make[1]: Leaving directory '/build/usr/src/linux'
-> done.
-> Kernel module compilation complete.
-> Unable to determine if Secure Boot is enabled: No such file or directory
ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if a driver such as rivafb, nvidiafb, or nouveau is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA graphics device(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/usr/local/nvidia/nvidia-installer.log' for more information.
-> Kernel module load error: No such file or directory
-> Kernel messages:
[  270.651503] nvidia: Unknown symbol ipmi_free_recv_msg (err 0)
[  270.651523] nvidia: Unknown symbol ipmi_set_my_address (err 0)
[  270.651541] nvidia: Unknown symbol ipmi_request_settime (err 0)
[  270.651552] nvidia: Unknown symbol ipmi_set_gets_events (err 0)
[  353.361619] nvidia: Unknown symbol ipmi_create_user (err 0)
[  353.361661] nvidia: Unknown symbol ipmi_destroy_user (err 0)
[  353.361699] nvidia: Unknown symbol ipmi_validate_addr (err 0)
[  353.361716] nvidia: Unknown symbol ipmi_free_recv_msg (err 0)
[  353.361728] nvidia: Unknown symbol ipmi_set_my_address (err 0)
[  353.361745] nvidia: Unknown symbol ipmi_request_settime (err 0)
[  353.361756] nvidia: Unknown symbol ipmi_set_gets_events (err 0)
[  451.583456] nvidia: Unknown symbol ipmi_create_user (err 0)
[  451.583501] nvidia: Unknown symbol ipmi_destroy_user (err 0)
[  451.583539] nvidia: Unknown symbol ipmi_validate_addr (err 0)
[  451.583555] nvidia: Unknown symbol ipmi_free_recv_msg (err 0)
[  451.583567] nvidia: Unknown symbol ipmi_set_my_address (err 0)
[  451.583606] nvidia: Unknown symbol ipmi_request_settime (err 0)
[  451.583623] nvidia: Unknown symbol ipmi_set_gets_events (err 0)
[  558.985632] nvidia: Unknown symbol ipmi_create_user (err 0)
[  558.985674] nvidia: Unknown symbol ipmi_destroy_user (err 0)
[  558.985711] nvidia: Unknown symbol ipmi_validate_addr (err 0)
[  558.985727] nvidia: Unknown symbol ipmi_free_recv_msg (err 0)
[  558.985739] nvidia: Unknown symbol ipmi_set_my_address (err 0)
[  558.985756] nvidia: Unknown symbol ipmi_request_settime (err 0)
[  558.985767] nvidia: Unknown symbol ipmi_set_gets_events (err 0)
ERROR: Installation has failed.  Please see the file '/usr/local/nvidia/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

On the other hand, my automated build based off your fork of container-engine-accelerators is still working fine: https://hub.docker.com/r/fish/nvidia-driver-installer-containerlinux/

Any idea what changed between that and this stand-alone repo? Possibly some kernel header mismatch?

Running CoreOS 1520.8.0, which ships kernel 4.13.9-coreos.
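
For what it's worth, the ipmi_* symbols in the kernel messages above are exported by the kernel's ipmi_msghandler module, so one possible workaround (assuming that module ships with the CoreOS kernel and is simply not loaded) is to load it on the host before the installer tries to insert nvidia.ko:

# Load the IPMI message handler so nvidia.ko can resolve the ipmi_* symbols.
sudo modprobe ipmi_msghandler
# Optional: the device interface module pulls in ipmi_msghandler as a dependency too.
sudo modprobe ipmi_devintf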

Default driver download URL template is out of date

In entrypoint.sh, the "NVIDIA_DRIVER_DOWNLOAD_URL" variable interpolates the prescribed version into the URL from which to download the driver runfile. When choosing a recent driver version such as 410.48, the interpolated URL yields a 404 response from nvidia.com. Instead, the working URL looks more like this:

https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux

It's possible to use the "NVIDIA_DRIVER_DOWNLOAD_URL" variable to override the template entirely, but we may want to use a more recent template.

Mind you, this driver doesn't build correctly for me yet, but I'm still investigating that problem separately.
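
As a sketch, overriding the template entirely (which, as noted above, is already possible) with the URL reported here would look something like the following; how entrypoint.sh is invoked in your environment may differ, and whether the installer can extract the driver from this CUDA-bundled runfile is still an open question:

# Point the installer at the CUDA-bundled runfile for 410.48 instead of the default template.
export NVIDIA_DRIVER_DOWNLOAD_URL="https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux"
./entrypoint.sh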

Device manager - nvidia drivers host path

Hello, and thanks a lot for this repo.

Reproduce

I installed the NVIDIA driver:

kubectl apply -f https://raw.githubusercontent.com/shelmangroup/coreos-gpu-installer/master/daemonset.yaml

Then the device manager:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml

Then submitted a pod:

kind: Pod
apiVersion: v1
metadata:
  name: nvidia-pod
spec:
  containers:
    - name: nvidia
      image: 'nvidia/cuda'
      command:
        - bash
        - -c
        - "exec sleep 60000"
      resources:
        limits:
          nvidia.com/gpu: 1
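
As an aside, one quick way to confirm the device plugin has registered the GPU resource on the node at all is to check the node's capacity with plain kubectl:

# nvidia.com/gpu should appear under the node's Capacity/Allocatable once the device plugin is running.
kubectl describe nodes | grep 'nvidia.com/gpu'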

The issue

I just have one little issue: the binaries and libraries are not available from within the pod.
When I copy the contents of /opt/nvidia to /home/kubernetes/bin/nvidia on the host, everything works like a charm.
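
For reference, that manual workaround on the host looks like this (paths exactly as described above):

# Copy the installed driver tree to the path the device plugin actually mounts.
sudo mkdir -p /home/kubernetes/bin/nvidia
sudo cp -a /opt/nvidia/. /home/kubernetes/bin/nvidia/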

After running docker inspect on my container, I figured out the mount source was /home/kubernetes/bin/nvidia:

            {
                "Type": "bind",
                "Source": "/home/kubernetes/bin/nvidia",
                "Destination": "/usr/local/nvidia",
                "Mode": "ro",
                "RW": false,
                "Propagation": "rprivate"
            },

but the host path should be /opt/nvidia, as specified in the driver installer DaemonSet:

          - name: NVIDIA_INSTALL_DIR_HOST
            value: /opt/nvidia

Investigation

I tried to redeploy the driver-installer DaemonSet with /home/kubernetes/bin/nvidia as NVIDIA_INSTALL_DIR_HOST and an updated volume, but the init pod returns an error:

overlayfs: filesystem on 'bin' not supported as upperdir

I don't know how I could set that path on the device manager, and it seems to be hardcoded: https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/70a1113ee4ceb413e21c54ca943bc2f4b984117d/cmd/nvidia_gpu/nvidia_gpu.go#L33

Am I running the wrong version of the device manager?
What version of the device manager DaemonSet manifest are you using?

Thank you.

Driver installation not reboot proof

Hello,

Thanks for your work on NVIDIA driver installation on CoreOS; it works really well.

I just have an issue when I reboot the server where the drivers are installed.

Indeed, the command nvidia-smi does not work anymore:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

It seems like some kernel modules are not loaded anymore:
$ lsmod | grep nvidia returns empty output.
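
In runnable form, plus one possible workaround (forcing the installer DaemonSet pod on this node to run again so it reloads or rebuilds the modules; the namespace and label selector below are assumptions, so adjust them to match the metadata in daemonset.yaml):

# Check whether the modules are loaded; if not, delete the installer pod so the DaemonSet reschedules it.
lsmod | grep nvidia || echo "nvidia modules not loaded"
kubectl --namespace kube-system delete pod --selector name=nvidia-driver-installer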

Do you have an idea of what should be done to make it work again after a reboot?

Thanks,
Thomas

Nvidia runtime not available to use the installed drivers for Docker and Kubernetes

Hi

It would be helpful if you could support setting up nvidia-docker v2 and the NVIDIA runtime environment for using the GPU drivers. The driver installation on a Tesla V100 (Volta, 32 GB) and a GeForce 1080 Ti (8 GB) works perfectly well, and the drivers are installed to /opt because /usr is a read-only filesystem in the CoreOS distribution.

CoreOS version installed: 2079.3.0 (Stable)
k8s version: v1.14.1 (flannel)
Docker version 18.06.3-ce, build d7080c1

Hardware:
1080Ti Desktop: Linux coreos-01 4.19.34-coreos #1 SMP Mon Apr 22 20:32:34 -00 2019 x86_64 Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz GenuineIntel GNU/Linux


v100 EDGE: Linux coreos-01 4.19.34-coreos #1 SMP Mon Apr 22 20:32:34 -00 2019 x86_64 Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz GenuineIntel GNU/Linux
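
As a rough sketch, wiring up the nvidia runtime on Container Linux could look something like the following, assuming nvidia-container-runtime has been built and installed to a writable location such as /opt/nvidia/bin (that path is an assumption; the runtimes stanza itself is the standard nvidia-docker v2 daemon configuration):

# Register an "nvidia" runtime with Docker, then restart the daemon.
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "/opt/nvidia/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
sudo systemctl restart docker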


Can entrypoint.sh run successfully on a machine with no GPU?

Near the end of the steps performed by the entrypoint.sh script, the verify_nvidia_installation function invokes nvidia-smi and nvidia-modprobe. I don't expect those would run without error on a machine with no GPU, but what about the rest of the script?

I ask because I'll be invoking this script (running it inside of a Docker or rkt container) when building an AMI for use on Amazon EC2 instances. We usually build our AMIs on impoverished, small instance types—those with no GPUs—and I was considering doing the same here. We'd build the image on a machine without a GPU, but then use the image later with machines that do have at least one GPU.

If we made the verify_nvidia_installation step optional (whether by command-line flag or environment variable), could the rest succeed? The verify_nvidia_installation function doesn't just check to see whether nvidia-smi succeeds; it also calls on nvidia-modprobe, which sounds necessary. I can't tell whether nvidia-modprobe is part of confirming success, or if it's an essential step we need to perform before making use of the GPU.
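
To illustrate the kind of opt-out being asked about, here is a minimal sketch; the SKIP_VERIFICATION variable is invented for illustration and does not exist in entrypoint.sh today:

# Hypothetical guard around the final verification step in entrypoint.sh.
if [ "${SKIP_VERIFICATION:-false}" != "true" ]; then
  verify_nvidia_installation
fi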
