ansible-role-nvidia-docker's Introduction

An Ansible role to install nvidia-docker.

ansible-role-nvidia-docker's People

Contributors

ajdecon, aurelien-bareille, dholt, lukeyeager, michael-balint, phogan-nvidia, ryanolson, supertetelman

ansible-role-nvidia-docker's Issues

Add a License

It would be great if NVIDIA could add an explicit license to this repository.
Note that nvidia-docker uses the Apache 2.0 license.

RHEL8.4: Missing dependency [nvidia-container-toolkit]

After installing nvidia-container-runtime with this role, I still could not launch any GPU tasks. They failed with the error: "Error response from daemon: OCI runtime create failed"

I had used the role on Ubuntu, where it worked fine, but on RHEL 8.4 I always got this error after installation.

After investigating, I found that on Ubuntu, installing the nvidia-container-runtime package pulls in nvidia-container-toolkit as a dependency, whereas on RHEL it does not. This is the executable that container runtime platforms use to initiate GPU tasks.

nvidia-container-toolkit is also a dependency of the nvidia-docker2 package, but this role only fetches the script.

I was able to make everything work by installing the missing nvidia-container-toolkit package with yum.

Is this missing dependency expected on Red Hat platforms?

Fix ansible-lint errors

The following errors propagate up to a playbook that lists this role as a dependency when running ansible-lint:

risky-file-permissions: File permissions unset or incorrect
../../../../../root/.cache/ansible-lint/aafa07/roles/nvidia.nvidia_docker/tasks/main.yml:2 Task/Handler: ensure facts directory exists
var-spacing: Variables should have spaces before and after: {{proxy_env if proxy_env is defined else {}}}
../../../../../root/.cache/ansible-lint/aafa07/roles/nvidia.nvidia_docker/tasks/main.yml:51 Task/Handler: grab nvidia-docker wrapper
var-spacing: Variables should have spaces before and after: {{proxy_env if proxy_env is defined else {}}}
../../../../../root/.cache/ansible-lint/aafa07/roles/nvidia.nvidia_docker/tasks/redhat-pre-install.yml:10 Task/Handler: add repo
var-spacing: Variables should have spaces before and after: {{proxy_env if proxy_env is defined else {}}}
../../../../../root/.cache/ansible-lint/aafa07/roles/nvidia.nvidia_docker/tasks/redhat-pre-install.yml:21 Task/Handler: install packages
var-spacing: Variables should have spaces before and after: {{proxy_env if proxy_env is defined else {}}}
../../../../../root/.cache/ansible-lint/aafa07/roles/nvidia.nvidia_docker/tasks/ubuntu-pre-install.yml:11 Task/Handler: add key
var-spacing: Variables should have spaces before and after: {{proxy_env if proxy_env is defined else {}}}
../../../../../root/.cache/ansible-lint/aafa07/roles/nvidia.nvidia_docker/tasks/ubuntu-pre-install.yml:18 Task/Handler: add repo
var-spacing: Variables should have spaces before and after: {{proxy_env if proxy_env is defined else {}}}
../../../../../root/.cache/ansible-lint/aafa07/roles/nvidia.nvidia_docker/tasks/ubuntu-pre-install.yml:28 Task/Handler: install packages
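
The var-spacing findings above all concern the same expression, which lacks a space after `{{` and before `}}`. As a rough illustration of the fix the linter is asking for, here is a minimal Python sketch that normalizes the spacing in a string. The function name `fix_var_spacing` is hypothetical and not part of this role or of ansible-lint, and the regex only handles single-line `{{ ... }}` expressions.

```python
import re

# Match "{{ expr }}" where the closing "}}" is not followed by another "}".
# The lookahead lets an expression that ends in "{}" (an empty dict) keep
# its own braces, as in "{{proxy_env if proxy_env is defined else {}}}".
_JINJA_EXPR = re.compile(r"\{\{\s*(.*?)\s*\}\}(?!\})")


def fix_var_spacing(text: str) -> str:
    """Return text with exactly one space inside each {{ ... }} delimiter."""
    return _JINJA_EXPR.sub(lambda m: "{{ " + m.group(1) + " }}", text)


if __name__ == "__main__":
    print(fix_var_spacing("{{proxy_env if proxy_env is defined else {}}}"))
```

Applied to the flagged expression, this yields `{{ proxy_env if proxy_env is defined else {} }}`. The risky-file-permissions finding is a separate matter and is typically resolved by setting an explicit `mode` on the task.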

[nvidia.nvidia_docker : install packages] Failed to update apt cache: unknown reason

The apt package installation fails with "Failed to update apt cache: unknown reason", even though the repo setup appears to succeed.

Running on Ubuntu 22.04.1 LTS

$ ansible-playbook roles.yml -i inventory/hosts -K --tags docker -vv 
ansible-playbook [core 2.13.6]
  config file = None
  configured module search path = ['/Users/user/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /opt/homebrew/Cellar/ansible/6.6.0/libexec/lib/python3.11/site-packages/ansible
  ansible collection location = /Users/user/.ansible/collections:/usr/share/ansible/collections
  executable location = /opt/homebrew/bin/ansible-playbook
  python version = 3.11.0 (main, Oct 26 2022, 19:06:18) [Clang 14.0.0 (clang-1400.0.29.202)]
  jinja version = 3.1.2
  libyaml = True
No config file found; using defaults

PLAYBOOK: roles.yml **********************************************************************
1 plays in roles.yml

PLAY [host] *****************************************************************************

TASK [Gathering Facts] *******************************************************************
task path: /Users/user/code/co/infrastructure/host/roles.yml:2
ok: [host.com]
META: ran handlers
META: 

TASK [nvidia.nvidia_docker : ensure facts directory exists] ******************************
task path: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/main.yml:2
ok: [192.168.127.58] => {"changed": false, "gid": 0, "group": "root", "mode": "0755", "owner": "root", "path": "/etc/ansible/facts.d", "size": 4096, "state": "directory", "uid": 0}

TASK [nvidia.nvidia_docker : setup custom facts] *****************************************
task path: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/main.yml:9
ok: [192.168.127.58] => {"changed": false, "checksum": "7ca5812abf54241cf2127bd6a0722fa919a4de24", "dest": "/etc/ansible/facts.d/nv_os_release.fact", "gid": 0, "group": "root", "mode": "0755", "owner": "root", "path": "/etc/ansible/facts.d/nv_os_release.fact", "size": 572, "state": "file", "uid": 0}

TASK [nvidia.nvidia_docker : re-gather facts] ********************************************
task path: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/main.yml:17
ok: [192.168.127.58]

TASK [nvidia.nvidia_docker : check distro] ***********************************************
task path: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/main.yml:20
skipping: [192.168.127.58] => {"changed": false, "skip_reason": "Conditional result was False"}

TASK [nvidia.nvidia_docker : ubuntu pre-install tasks] ***********************************
task path: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/main.yml:25
included: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/ubuntu-pre-install.yml for 192.168.127.58

TASK [nvidia.nvidia_docker : remove packages] ********************************************
task path: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/ubuntu-pre-install.yml:2
ok: [192.168.127.58] => {"changed": false}

TASK [nvidia.nvidia_docker : add key] ****************************************************
task path: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/ubuntu-pre-install.yml:11
ok: [192.168.127.58] => {"before": ["DDCAE044F796ECB0", "93C4A3FD7BB9C367", "EB3E94ADBE1229CF", "D94AA3F0EFE21092", "871920D1991BC93C"], "changed": false, "fp": "DDCAE044F796ECB0", "id": "DDCAE044F796ECB0", "key_id": "DDCAE044F796ECB0", "short_id": "F796ECB0"}

TASK [nvidia.nvidia_docker : add repo] ***************************************************
task path: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/ubuntu-pre-install.yml:18
ok: [192.168.127.58] => {"changed": false, "dest": "/etc/apt/sources.list.d/nvidia-docker.list", "elapsed": 0, "gid": 0, "group": "root", "mode": "0644", "msg": "HTTP Error 304: Not Modified", "owner": "root", "size": 401, "state": "file", "status_code": 304, "uid": 0, "url": "https://nvidia.github.io/nvidia-docker/ubuntu22.04/nvidia-docker.list"}

TASK [nvidia.nvidia_docker : install packages] *******************************************
task path: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/ubuntu-pre-install.yml:28
fatal: [192.168.127.58]: FAILED! => {"changed": false, "msg": "Failed to update apt cache: unknown reason"}

TASK [docker : set docker daemon configuration] - Failed

The system it fails on:

ubuntu@c12:~$ hostnamectl
   Static hostname: c12
         Icon name: computer-desktop
           Chassis: desktop
        Machine ID: d2f9d8ea573f45d98a5053af204e1ed5
           Boot ID: 3821b5c73f6b460db404b912f42fd7d1
  Operating System: Ubuntu 18.04.2 LTS
            Kernel: Linux 4.18.0-21-lowlatency
      Architecture: x86-64
ubuntu@c12:~$ 

Playbook feedback

shane.holloman at 10-81-0-252 in ~/Documents/GitHub/stage-ansible on master [!?]
$ ansible-playbook provision.yml

PLAY [stage] *************************************************************************************************************

TASK [Gathering Facts] ***************************************************************************************************
ok: [10.81.0.118]

TASK [docker : check distro] *********************************************************************************************
skipping: [10.81.0.118]

TASK [docker : ubuntu pre-install tasks] *********************************************************************************
included: /Users/shane.holloman/Documents/GitHub/stage-ansible/roles/docker/tasks/ubuntu-pre-install.yml for 10.81.0.118

TASK [docker : remove packages] ******************************************************************************************
ok: [10.81.0.118]

TASK [docker : add key] **************************************************************************************************
[DEPRECATION WARNING]: evaluating nvidia_docker_add_repo as a bare variable, this behaviour will go away and you might 
need to add |bool to the expression in the future. Also see CONDITIONAL_BARE_VARS configuration toggle.. This feature 
will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in 
ansible.cfg.
changed: [10.81.0.118]

TASK [docker : add repo] *************************************************************************************************
[DEPRECATION WARNING]: evaluating nvidia_docker_add_repo as a bare variable, this behaviour will go away and you might 
need to add |bool to the expression in the future. Also see CONDITIONAL_BARE_VARS configuration toggle.. This feature 
will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in 
ansible.cfg.
changed: [10.81.0.118]

TASK [docker : install packages] *****************************************************************************************
changed: [10.81.0.118]

TASK [docker : redhat family pre-install tasks] **************************************************************************
skipping: [10.81.0.118]

TASK [docker : set docker daemon configuration] **************************************************************************
fatal: [10.81.0.118]: FAILED! => {"changed": false, "checksum": "c3d8b05372117ae3c0e6a7b1f5b917c88b33844e", "msg": "Destination directory /etc/docker does not exist"}

RUNNING HANDLER [docker : reload docker] *********************************************************************************

PLAY RECAP ***************************************************************************************************************
10.81.0.118                : ok=6    changed=3    unreachable=0    failed=1    skipped=2    rescued=0    ignored=0   


shane.holloman at 10-81-0-252 in ~/Documents/GitHub/stage-ansible on master [!?]

Confirmed here:

ubuntu@c12:~$ cd /etc/docker
-bash: cd: /etc/docker: No such file or directory

Any thoughts?
Maybe this task could run later, and the configuration could even stay in the default location (see "Docker daemon directory")?

daemon.json gets overwritten

Hi,
When installing nvidia-docker on Ubuntu 18, the Docker daemon's daemon.json gets overwritten with the nvidia runtime configuration.

The correct behavior would be to append the NVIDIA-specific configuration to the existing daemon.json.
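
To make the requested behavior concrete, here is a minimal Python sketch of merging rather than overwriting. It is not how the role works. The helper names are hypothetical, and the runtime entry in `NVIDIA_RUNTIME` is an assumption based on the commonly documented nvidia runtime stanza. The sketch also creates the parent directory first, which would avoid the "Destination directory /etc/docker does not exist" failure reported above.

```python
import json
from pathlib import Path

# Assumed runtime entry; adjust to whatever the role actually templates.
NVIDIA_RUNTIME = {"path": "nvidia-container-runtime", "runtimeArgs": []}


def merge_nvidia_runtime(existing: dict) -> dict:
    """Return a copy of existing with the nvidia runtime added.

    All other top-level keys and any other runtimes are preserved.
    An nvidia entry that is already present is left untouched.
    """
    merged = dict(existing)
    runtimes = dict(merged.get("runtimes", {}))
    runtimes.setdefault("nvidia", dict(NVIDIA_RUNTIME))
    merged["runtimes"] = runtimes
    return merged


def update_daemon_json(path: Path) -> dict:
    """Merge the nvidia runtime into the daemon.json at path."""
    # Create the directory if Docker has not created it yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    existing = json.loads(path.read_text()) if path.exists() else {}
    merged = merge_nvidia_runtime(existing)
    path.write_text(json.dumps(merged, indent=4) + "\n")
    return merged


if __name__ == "__main__":
    # Demonstrate against a scratch file rather than /etc/docker/daemon.json.
    import tempfile

    target = Path(tempfile.mkdtemp()) / "docker" / "daemon.json"
    print(update_daemon_json(target))
```

Inside the role itself, the same read, merge, and write sequence would more naturally be expressed as Ansible tasks. The Python version is only meant to show the intended semantics: existing settings survive and only the `runtimes.nvidia` key is added.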
