Comments (10)
OK, this bug is solved.
My environment and versions:
- OS : ubuntu 20.04.1
- k8s version : v1.23.1
- docker-client version : 19.03.13
- docker-server version : 20.10.12
- CRI: docker
- cgroup driver : systemd
I use the NVIDIA k8s-device-plugin,
and my "/etc/docker/daemon.json" contains:
{
    "exec-opts": ["native.cgroupdriver=systemd"],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m"
    },
    "storage-driver": "overlay2",
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
In "/pkg/util/util.go", "cgroupfs" is always passed as cgroupDriver when calling the function GetCgroupName, which fails on my setup because my cgroup driver is systemd.
So this bug is not a k8s version problem!
Also, the pod ID contains "_" in k8s v1.23.1, so the function NewCgroupName should not check for "_" characters.
So the code needs to detect which cgroup driver is currently in use.
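To illustrate why the driver matters, here is a minimal sketch of how the pod-level cgroup path differs between the two drivers. It assumes the usual kubelet naming convention, where the systemd driver escapes "-" inside each component to "_" and chains the components into nested ".slice" names. The function podCgroupPath is a hypothetical helper written for this example; it is not GPUMounter's GetCgroupName or NewCgroupName.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// podCgroupPath is a hypothetical helper (not part of GPUMounter).
// It builds the pod-level cgroup path for the given driver, assuming the
// kubelet naming convention for "cgroupfs" and "systemd".
func podCgroupPath(driver, qosClass, podUID string) (string, error) {
	// Guaranteed pods sit directly under "kubepods"; the other QoS classes
	// ("burstable", "besteffort") get one extra level.
	parts := []string{"kubepods"}
	if qosClass != "guaranteed" {
		parts = append(parts, qosClass)
	}
	parts = append(parts, "pod"+podUID)

	switch driver {
	case "cgroupfs":
		// cgroupfs: plain directory hierarchy, pod UID kept as-is.
		return "/" + path.Join(parts...), nil
	case "systemd":
		// systemd: "-" separates hierarchy levels inside a slice name,
		// so "-" inside a component (e.g. the pod UID) becomes "_".
		var segments []string
		prefix := ""
		for _, p := range parts {
			escaped := strings.ReplaceAll(p, "-", "_")
			if prefix != "" {
				escaped = prefix + "-" + escaped
			}
			prefix = escaped
			segments = append(segments, prefix+".slice")
		}
		return "/" + path.Join(segments...), nil
	default:
		return "", fmt.Errorf("unsupported cgroup driver %q", driver)
	}
}

func main() {
	uid := "a55bc88b-60d1-460f-a7c7-4072fe6a9a2c"
	for _, driver := range []string{"cgroupfs", "systemd"} {
		p, err := podCgroupPath(driver, "burstable", uid)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-8s -> %s\n", driver, p)
	}
}
```

If this convention holds, the "_" characters show up only in the systemd form of the path, while the pod UID itself still uses "-". Passing "cgroupfs" on a systemd node would then produce a path that does not exist.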
But I'm a rookie at Golang, so I need more time to code this. I will send a PR in a few days.
Does the title need editing? If so, feel free to edit it directly.
from gpumounter.
Thanks for your feedback. I will try to fix it.
PRs are also very welcome!
from gpumounter.
@cool9203 Happy Spring Festival!
Thanks for your efforts. Sorry for the long wait.
- The check for "_" is there to handle the systemd cgroup driver. But if "_" can appear in a pod ID, it may be complex to handle. Can you show me some k8s documentation about "_" in pod IDs?
  GPUMounter/pkg/util/cgroup/cgroup.go
  Lines 33 to 40 in 7036133
- Passing the constant "cgroupfs" is really a bug! It should be configurable.
  Line 25 in 7036133
from gpumounter.
@pokerfaceSad Happy Spring Festival!!
Thanks for your reply.
GPUMounter/pkg/util/cgroup/cgroup.go
Lines 33 to 40 in 7036133
You are right. I finished testing today, and this code runs as it is.
Editing it is not necessary.
That edit was only something I tried at the beginning of my testing.
My bug was passing "cgroupfs" in Line 25 in 7036133.
from gpumounter.
I ran into another problem.
GPUMounter/pkg/server/gpu-mount/server.go
Lines 124 to 135 in 7036133
When calling RemoveGPU, I sometimes get the error "Invalid UUIDs".
I tracked this error down and found that the slave pod's status is Terminating, and then the pod is deleted.
Example:
kubectl get pod --all-namespaces
NAMESPACE   NAME                            READY   STATUS        RESTARTS   AGE
gpu-pool    test460c04d4-slave-pod-bca118   1/1     Terminating   0          30s
So updategpu cannot find any slave pod,
and then it gets no GPU resource for the pod that has the GPU mounted.
Does this error occur only in k8s v1.23.1, or in other versions too?
from gpumounter.
#19 (comment)
Maybe I solved this.
https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/
It seems that from k8s v1.20+, the owner pod and the slave pod need to be in the same namespace.
If the owner pod and the slave pod are not in the same namespace, the slave pod's status becomes Terminating.
So the slave pod's namespace needs to be set to the owner pod's namespace.
But I need to test more. I will report the test results.
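The namespace rule described above can be sketched as a small pre-check before creating the slave pod. This is only an illustration under stated assumptions: podRef and checkSameNamespace are hypothetical names made up for this example, not GPUMounter or client-go types, and the error text is not Kubernetes output.

```go
package main

import "fmt"

// podRef is a hypothetical, minimal stand-in for a pod's identity.
type podRef struct {
	Namespace string
	Name      string
}

// checkSameNamespace is a hypothetical pre-check. It returns an error when a
// dependent (slave) pod would reference an owner pod in another namespace,
// which is the situation reported in this thread as leading to Terminating.
func checkSameNamespace(owner, dependent podRef) error {
	if owner.Namespace != dependent.Namespace {
		return fmt.Errorf(
			"owner %s/%s and dependent %s/%s are in different namespaces",
			owner.Namespace, owner.Name, dependent.Namespace, dependent.Name)
	}
	return nil
}

func main() {
	owner := podRef{Namespace: "default", Name: "test460c04d4"}
	slave := podRef{Namespace: "gpu-pool", Name: "test460c04d4-slave-pod-bca118"}

	if err := checkSameNamespace(owner, slave); err != nil {
		fmt.Println("refusing to set the owner reference:", err)
		return
	}
	fmt.Println("owner reference is namespace-consistent")
}
```

The owner namespace "default" in main is an assumed value for the example; the thread does not say which namespace the owner pod was in.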
Update:
My test result:
kubectl get pod -n gpu-pool
NAME                    READY   STATUS    RESTARTS   AGE
test                    1/1     Running   0          3m12s
test-slave-pod-d34ea2   1/1     Running   0          19s
pod/test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test
  namespace: gpu-pool
  labels:
    app: test
spec:
  containers:
    - name: test
      image: [docker-image]
      resources:
        requests:
          memory: "1024M"
          cpu: "1"
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "none"
kubectl describe pod test-slave-pod-d34ea2 -n gpu-pool
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  4s    default-scheduler  Successfully assigned gpu-pool/test-slave-pod-290964 to rtxws
  Normal  Pulling    3s    kubelet            Pulling image "alpine:latest"
  Normal  Pulled     1s    kubelet            Successfully pulled image "alpine:latest" in 2.563965249s
  Normal  Created    1s    kubelet            Created container gpu-container
  Normal  Started    1s    kubelet            Started container gpu-container
Pod events when the owner pod and the slave pod are not in the same namespace:
Events:
  Type     Reason                    Age   From                          Message
  ----     ------                    ----  ----                          -------
  Normal   Scheduled                 4s    default-scheduler             Successfully assigned gpu-pool/test460c04d4-slave-pod-22d29a to rtxws
  Warning  OwnerRefInvalidNamespace  5s    garbage-collector-controller  ownerRef [v1/Pod, namespace: gpu-pool, name: test460c04d4, uid: a55bc88b-60d1-460f-a7c7-4072fe6a9a2c] does not exist in namespace "gpu-pool"
  Normal   Pulling                   4s    kubelet                       Pulling image "alpine:latest"
  Normal   Pulled                    1s    kubelet                       Successfully pulled image "alpine:latest" in 2.568386225s
  Normal   Created                   1s    kubelet                       Created container gpu-container
  Normal   Started                   1s    kubelet                       Started container gpu-container
  Normal   Killing                   0s    kubelet                       Stopping container gpu-container
As this test shows, after the namespace in pod/test.yaml was changed to gpu-pool, the slave pod's status is Running, not Terminating.
The pod events also show the container running, not being stopped.
I let it idle for 15 minutes, and the slave pod kept running and was not deleted.
In the event log for the case where the namespaces differ, you can see the warning: does not exist in namespace "gpu-pool".
So in k8s v1.20+, the slave pod and the owner pod must be in the same namespace.
If they are not, the slave pod's status becomes Terminating, and calling the RemoveGPU service returns the "Invalid UUIDs" error.
Maybe the gpu-pool namespace should not be used in k8s v1.20+.
The slave pod would then always use the owner pod's namespace instead of gpu-pool.
Is this a good idea or not? Please give me advice, thanks!!
from gpumounter.
@cool9203
Thank you for revealing this!
The reason why the slave pod can't be created in the owner pod's namespace is #3.
Some modifications may be needed to adapt to k8s v1.20+.
from gpumounter.
@cool9203
The bug of the constant cgroup driver has been fixed in 163ef7b.
The cgroup driver can be set in /deploy/gpu-mounter-workers.yaml by the environment variable CGROUP_DRIVER.
from gpumounter.
@pokerfaceSad
Sorry for the late reply.
> @cool9203 The bug of the constant cgroup driver has been fixed in 163ef7b. The cgroup driver can be set in /deploy/gpu-mounter-workers.yaml by the environment variable CGROUP_DRIVER.
Thanks for the fix. Passing an environment variable in worker.yaml is a good idea!
> @cool9203 Thank you for revealing this! The reason why the slave pod can't be created in the owner pod's namespace is #3. Some modifications may be needed to adapt to k8s v1.20+.
I showed one solution in #19 (comment).
In that solution, the owner pod and the slave pod must be in the same namespace, such as gpu-pool, default, kube-system, or any other namespace.
And I did not set any resource quota.
So, as that solution shows, I think the owner pod and the slave pod must be in the same namespace in k8s v1.20+.
What do you think?
from gpumounter.
@cool9203
In fact, slave pods were created in the owner pod's namespace before a378e39.
However, in a multi-tenant cluster scenario, the cluster administrator may use the resource quota feature to limit the resource usage of users.
If GPUMounter creates the slave pods in the owner pods' namespaces, the slave pods will consume the users' resource quota.
from gpumounter.