An ansible role to install nvidia-docker.
ansible-role-nvidia-docker's Issues
Add a License
It would be great if NVIDIA could add an explicit license for this repository.
Note that nvidia-docker is using the Apache 2 license.
RHEL8.4: Missing dependency [nvidia-container-toolkit]
After installing the nvidia-container-runtime with this role, I still could not launch any GPU tasks. They failed with the error: "Error response from daemon: OCI runtime create failed"
I had used the Ansible role on Ubuntu and it worked fine, but on RHEL 8.4 I always got this error after the install.
After investigating, I found that on Ubuntu, installing the nvidia-container-runtime package pulls in the nvidia-container-toolkit dependency, but on RHEL it does not. This executable is what container runtime platforms use to initiate GPU tasks.
It is also a dependency of the nvidia-docker2 package, but this role only fetches the script.
I was able to make everything work by installing the missing nvidia-container-toolkit with yum.
Is this missing dependency on Red Hat platforms expected?
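To check a host for the gap described above before launching GPU containers, a minimal Python sketch could look like the following. It assumes the executables are named after the packages mentioned in this report; the tuple of names and the helper `missing_executables` are hypothetical and may need adjusting for a given distribution.

```python
import shutil

# Assumption: executable names mirror the package names from the report above.
REQUIRED_EXECUTABLES = ("nvidia-container-runtime", "nvidia-container-toolkit")


def missing_executables(names=REQUIRED_EXECUTABLES, which=shutil.which):
    """Return the entries of `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_executables()
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("All expected executables were found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests; on a real host the default `shutil.which` is used.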
ubuntu-19 is not supported
Currently it looks like Ubuntu 19 is not supported. Is it possible to fall back to 18.04 in this case?
fatal: [workstation-*]: FAILED! => {"changed": false, "dest": "/etc/apt/sources.list.d/nvidia-docker.list", "elapsed": 0, "msg": "Request failed", "response": "HTTP Error 404: Not Found", "status_code": 404, "url": "https://nvidia.github.io/nvidia-docker/ubuntu19.04/nvidia-docker.list"}
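The requested fallback could be sketched in Python as below. Only the URL pattern is taken from the error message above; the list of published releases is an illustrative assumption, not an authoritative record of what the repository hosts, and the helper names are hypothetical.

```python
# Assumption: the URL pattern is taken from the 404 error above.
REPO_URL_TEMPLATE = (
    "https://nvidia.github.io/nvidia-docker/ubuntu{release}/nvidia-docker.list"
)
# Assumption: illustrative set of releases that have a repo list published.
ASSUMED_PUBLISHED_RELEASES = ("16.04", "18.04")


def release_key(release):
    """Turn '18.04' into (18, 4) so releases compare numerically."""
    return tuple(int(part) for part in release.split("."))


def pick_repo_release(release, published=ASSUMED_PUBLISHED_RELEASES):
    """Return `release` if published, else the newest older published release."""
    if release in published:
        return release
    older = [r for r in published if release_key(r) < release_key(release)]
    if not older:
        raise ValueError(f"no published release at or below {release}")
    return max(older, key=release_key)


def repo_list_url(release, published=ASSUMED_PUBLISHED_RELEASES):
    """Build the repo list URL, falling back to an older release if needed."""
    return REPO_URL_TEMPLATE.format(release=pick_repo_release(release, published))
```

With these assumptions, a host reporting 19.04 would be pointed at the 18.04 list instead of a URL that returns 404.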
Fix ansible-lint errors
The following errors propagate up to a playbook that lists this role as a dependency when running ansible-lint:
risky-file-permissions: File permissions unset or incorrect
../../../../../root/.cache/ansible-lint/aafa07/roles/nvidia.nvidia_docker/tasks/main.yml:2 Task/Handler: ensure facts directory exists
var-spacing: Variables should have spaces before and after: {{proxy_env if proxy_env is defined else {}}}
../../../../../root/.cache/ansible-lint/aafa07/roles/nvidia.nvidia_docker/tasks/main.yml:51 Task/Handler: grab nvidia-docker wrapper
var-spacing: Variables should have spaces before and after: {{proxy_env if proxy_env is defined else {}}}
../../../../../root/.cache/ansible-lint/aafa07/roles/nvidia.nvidia_docker/tasks/redhat-pre-install.yml:10 Task/Handler: add repo
var-spacing: Variables should have spaces before and after: {{proxy_env if proxy_env is defined else {}}}
../../../../../root/.cache/ansible-lint/aafa07/roles/nvidia.nvidia_docker/tasks/redhat-pre-install.yml:21 Task/Handler: install packages
var-spacing: Variables should have spaces before and after: {{proxy_env if proxy_env is defined else {}}}
../../../../../root/.cache/ansible-lint/aafa07/roles/nvidia.nvidia_docker/tasks/ubuntu-pre-install.yml:11 Task/Handler: add key
var-spacing: Variables should have spaces before and after: {{proxy_env if proxy_env is defined else {}}}
../../../../../root/.cache/ansible-lint/aafa07/roles/nvidia.nvidia_docker/tasks/ubuntu-pre-install.yml:18 Task/Handler: add repo
var-spacing: Variables should have spaces before and after: {{proxy_env if proxy_env is defined else {}}}
../../../../../root/.cache/ansible-lint/aafa07/roles/nvidia.nvidia_docker/tasks/ubuntu-pre-install.yml:28 Task/Handler: install packages
[nvidia.nvidia_docker : install packages] Failed to update apt cache: unknown reason
Installing the apt packages fails with "Failed to update apt cache: unknown reason", even though the repo setup seems to succeed.
Running on Ubuntu 22.04.1 LTS.
$ ansible-playbook roles.yml -i inventory/hosts -K --tags docker -vv
ansible-playbook [core 2.13.6]
config file = None
configured module search path = ['/Users/user/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /opt/homebrew/Cellar/ansible/6.6.0/libexec/lib/python3.11/site-packages/ansible
ansible collection location = /Users/user/.ansible/collections:/usr/share/ansible/collections
executable location = /opt/homebrew/bin/ansible-playbook
python version = 3.11.0 (main, Oct 26 2022, 19:06:18) [Clang 14.0.0 (clang-1400.0.29.202)]
jinja version = 3.1.2
libyaml = True
No config file found; using defaults
PLAYBOOK: roles.yml **********************************************************************
1 plays in roles.yml
PLAY [host] *****************************************************************************
TASK [Gathering Facts] *******************************************************************
task path: /Users/user/code/co/infrastructure/host/roles.yml:2
ok: [host.com]
META: ran handlers
META:
TASK [nvidia.nvidia_docker : ensure facts directory exists] ******************************
task path: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/main.yml:2
ok: [192.168.127.58] => {"changed": false, "gid": 0, "group": "root", "mode": "0755", "owner": "root", "path": "/etc/ansible/facts.d", "size": 4096, "state": "directory", "uid": 0}
TASK [nvidia.nvidia_docker : setup custom facts] *****************************************
task path: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/main.yml:9
ok: [192.168.127.58] => {"changed": false, "checksum": "7ca5812abf54241cf2127bd6a0722fa919a4de24", "dest": "/etc/ansible/facts.d/nv_os_release.fact", "gid": 0, "group": "root", "mode": "0755", "owner": "root", "path": "/etc/ansible/facts.d/nv_os_release.fact", "size": 572, "state": "file", "uid": 0}
TASK [nvidia.nvidia_docker : re-gather facts] ********************************************
task path: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/main.yml:17
ok: [192.168.127.58]
TASK [nvidia.nvidia_docker : check distro] ***********************************************
task path: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/main.yml:20
skipping: [192.168.127.58] => {"changed": false, "skip_reason": "Conditional result was False"}
TASK [nvidia.nvidia_docker : ubuntu pre-install tasks] ***********************************
task path: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/main.yml:25
included: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/ubuntu-pre-install.yml for 192.168.127.58
TASK [nvidia.nvidia_docker : remove packages] ********************************************
task path: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/ubuntu-pre-install.yml:2
ok: [192.168.127.58] => {"changed": false}
TASK [nvidia.nvidia_docker : add key] ****************************************************
task path: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/ubuntu-pre-install.yml:11
ok: [192.168.127.58] => {"before": ["DDCAE044F796ECB0", "93C4A3FD7BB9C367", "EB3E94ADBE1229CF", "D94AA3F0EFE21092", "871920D1991BC93C"], "changed": false, "fp": "DDCAE044F796ECB0", "id": "DDCAE044F796ECB0", "key_id": "DDCAE044F796ECB0", "short_id": "F796ECB0"}
TASK [nvidia.nvidia_docker : add repo] ***************************************************
task path: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/ubuntu-pre-install.yml:18
ok: [192.168.127.58] => {"changed": false, "dest": "/etc/apt/sources.list.d/nvidia-docker.list", "elapsed": 0, "gid": 0, "group": "root", "mode": "0644", "msg": "HTTP Error 304: Not Modified", "owner": "root", "size": 401, "state": "file", "status_code": 304, "uid": 0, "url": "https://nvidia.github.io/nvidia-docker/ubuntu22.04/nvidia-docker.list"}
TASK [nvidia.nvidia_docker : install packages] *******************************************
task path: /Users/user/.ansible/roles/nvidia.nvidia_docker/tasks/ubuntu-pre-install.yml:28
fatal: [192.168.127.58]: FAILED! => {"changed": false, "msg": "Failed to update apt cache: unknown reason"}
TASK [docker : set docker daemon configuration] - Failed
The system it fails on:
ubuntu@c12:~$ hostnamectl
Static hostname: c12
Icon name: computer-desktop
Chassis: desktop
Machine ID: d2f9d8ea573f45d98a5053af204e1ed5
Boot ID: 3821b5c73f6b460db404b912f42fd7d1
Operating System: Ubuntu 18.04.2 LTS
Kernel: Linux 4.18.0-21-lowlatency
Architecture: x86-64
ubuntu@c12:~$
Playbook feedback
shane.holloman at 10-81-0-252 in ~/Documents/GitHub/stage-ansible on master [!?]
$ ansible-playbook provision.yml
PLAY [stage] *************************************************************************************************************
TASK [Gathering Facts] ***************************************************************************************************
ok: [10.81.0.118]
TASK [docker : check distro] *********************************************************************************************
skipping: [10.81.0.118]
TASK [docker : ubuntu pre-install tasks] *********************************************************************************
included: /Users/shane.holloman/Documents/GitHub/stage-ansible/roles/docker/tasks/ubuntu-pre-install.yml for 10.81.0.118
TASK [docker : remove packages] ******************************************************************************************
ok: [10.81.0.118]
TASK [docker : add key] **************************************************************************************************
[DEPRECATION WARNING]: evaluating nvidia_docker_add_repo as a bare variable, this behaviour will go away and you might
need to add |bool to the expression in the future. Also see CONDITIONAL_BARE_VARS configuration toggle.. This feature
will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in
ansible.cfg.
changed: [10.81.0.118]
TASK [docker : add repo] *************************************************************************************************
[DEPRECATION WARNING]: evaluating nvidia_docker_add_repo as a bare variable, this behaviour will go away and you might
need to add |bool to the expression in the future. Also see CONDITIONAL_BARE_VARS configuration toggle.. This feature
will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in
ansible.cfg.
changed: [10.81.0.118]
TASK [docker : install packages] *****************************************************************************************
changed: [10.81.0.118]
TASK [docker : redhat family pre-install tasks] **************************************************************************
skipping: [10.81.0.118]
TASK [docker : set docker daemon configuration] **************************************************************************
fatal: [10.81.0.118]: FAILED! => {"changed": false, "checksum": "c3d8b05372117ae3c0e6a7b1f5b917c88b33844e", "msg": "Destination directory /etc/docker does not exist"}
RUNNING HANDLER [docker : reload docker] *********************************************************************************
PLAY RECAP ***************************************************************************************************************
10.81.0.118 : ok=6 changed=3 unreachable=0 failed=1 skipped=2 rescued=0 ignored=0
shane.holloman at 10-81-0-252 in ~/Documents/GitHub/stage-ansible on master [!?]
Confirmed here:
ubuntu@c12:~$ cd /etc/docker
-bash: cd: /etc/docker: No such file or directory
Any thoughts?
Maybe run this task later on, and maybe even keep the file in the default location? See:
Docker daemon directory
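The failure above comes from writing the configuration before its directory exists. One way to avoid that, sketched here in Python rather than as an Ansible task, is to create the parent directory first. The helper name `write_daemon_config` is hypothetical, and the example writes into a temporary directory instead of the real /etc/docker.

```python
import json
import os
import tempfile


def write_daemon_config(config, path):
    """Write `config` as JSON to `path`, creating the parent directory first."""
    parent = os.path.dirname(path)
    if parent:
        # exist_ok=True makes this safe to repeat on every run.
        os.makedirs(parent, mode=0o755, exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=4)
        handle.write("\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "etc", "docker", "daemon.json")
        write_daemon_config({"debug": False}, target)
        print("written:", os.path.exists(target))
```

In the role itself, the equivalent would be a directory-creation step that runs before the configuration task; where exactly it belongs is for the maintainers to decide.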
daemon.json gets overwritten
Hi,
When installing nvidia-docker on Ubuntu 18, the Docker daemon's daemon.json gets overwritten with the nvidia runtime configuration.
The correct behavior would be to add the nvidia-specific configuration to the existing daemon.json and leave the other settings intact.
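The merge behavior requested here can be illustrated with a short Python sketch. It assumes the nvidia runtime entry has the commonly documented shape shown in `NVIDIA_RUNTIME`; the role's actual template may differ, and `merge_nvidia_runtime` is a hypothetical helper, not part of the role.

```python
import json

# Assumption: commonly documented shape of the nvidia runtime entry.
NVIDIA_RUNTIME = {
    "nvidia": {"path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []}
}


def merge_nvidia_runtime(existing_text, runtimes=NVIDIA_RUNTIME):
    """Return daemon.json text with the nvidia runtime added, keeping other keys."""
    config = json.loads(existing_text) if existing_text.strip() else {}
    merged = dict(config)
    # Keep any runtimes that are already registered and add ours alongside them.
    merged_runtimes = dict(config.get("runtimes", {}))
    merged_runtimes.update(runtimes)
    merged["runtimes"] = merged_runtimes
    return json.dumps(merged, indent=4, sort_keys=True)


if __name__ == "__main__":
    current = '{"log-driver": "json-file"}'
    print(merge_nvidia_runtime(current))
```

Existing top-level settings and previously registered runtimes survive the merge; only an existing "nvidia" entry would be replaced.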