
With the preempt or reclaim plugin, a high-priority pod cannot be placed on a node that meets the conditions for preemption (volcano, 18 comments, CLOSED)

LivingCcj commented on August 18, 2024

With the preempt or reclaim plugin, a high-priority pod cannot be placed on a node that meets the conditions for preemption.

from volcano.

Comments (18)

lowang-bh commented on August 18, 2024

Would you please supply more information, such as the scheduler configmap, scheduler logs, and job configs?

LivingCcj commented on August 18, 2024

During preempt or reclaim, if a predicate function handler returns a status in the Unschedulable state but err is not nil, the potential node is ignored. According to the comment on predicateFn ("Allows scheduling to nodes that are in Success or Unschedulable state after filtering by predicate"), such a node should still be treated as a preemption candidate.
Here is the volcano scheduler configmap:

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, preempt, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
      - name: cdp
    - plugins:
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: nodeorder
        arguments:
          nodeaffinity.weight: 5
      - name: binpack
        arguments:
          binpack.weight: 5
          binpack.cpu: 2
          binpack.memory: 1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system

lowang-bh commented on August 18, 2024

> but the err is not nil

What is the error?

LivingCcj commented on August 18, 2024

Here is the scenario: an unscheduled pod requesting GPU resources is in a preempt-action session. One node has some lower-priority pods; if one of them were preempted, the unscheduled pod could be placed on that node. However, at the predicate stage predicateStatus.Code is Unschedulable and err is not nil (see the code below), which causes the potential node to be ignored when filtering preemptable nodes.

for _, val := range api.RegisteredDevices {
	if dev, ok := node.Others[val].(api.Devices); ok {
		if dev == nil {
			predicateStatus = append(predicateStatus, &api.Status{
				Code:   devices.Unschedulable,
				Reason: "node not initialized with device" + val,
			})
			return predicateStatus, fmt.Errorf("node not initialized with device %s", val)
		}
		code, msg, err := dev.FilterNode(task.Pod)
		filterNodeStatus := &api.Status{
			Code:   code,
			Reason: msg,
		}
		if err != nil {
			return predicateStatus, err
		}
		if filterNodeStatus.Code != api.Success {
			predicateStatus = append(predicateStatus, filterNodeStatus)
			return predicateStatus, fmt.Errorf("plugin device filternode predicates failed %s", msg)
		}
	} else {
		klog.Warningf("Devices %s assertion conversion failed, skip", val)
	}
}

Monokaix commented on August 18, 2024

> (quoting the scenario and device-predicate code from the comment above)

It's truly a problem in vGPU preemption. I think we should not return err here when vGPU resources are insufficient; if you're interested, you're welcome to fix it.

Monokaix commented on August 18, 2024

Same problem: #3186. We can fix it to resolve both of them.

Monokaix commented on August 18, 2024

@LivingCcj @lowang-bh You're welcome to fix this: )

Monokaix commented on August 18, 2024

This phenomenon recurred when vGPU resources were insufficient.
Here are the volcano scheduler logs:

I0319 02:51:05.281886       1 preempt.go:43] Enter Preempt ...
I0319 02:51:05.281895       1 job_info.go:728] job podgroup-f354bb74-7c3d-4429-aa92-3c02a7ab99ba/kubeflow actual: map[:1], ji.TaskMinAvailable: map[]
I0319 02:51:05.281913       1 preempt.go:65] Added Queue <default> for Job <kubeflow/podgroup-f354bb74-7c3d-4429-aa92-3c02a7ab99ba>
I0319 02:51:05.281925       1 job_info.go:728] job podgroup-ccffee3d-b1e2-4a94-9f4d-f15502dc3f77/kubeflow actual: map[:1], ji.TaskMinAvailable: map[]
I0319 02:51:05.281942       1 job_info.go:728] job podgroup-6ba7f409-d8c0-498f-a2ec-ec7b0c7f75fc/kubeflow actual: map[:1], ji.TaskMinAvailable: map[]
I0319 02:51:05.281973       1 predicates.go:384] pod(kubeflow/x-v1-76d645bc8c-8sr2m) affinity require information is nil, plugin InterPodAffinity is skipped
I0319 02:51:05.282004       1 predicate_helper.go:55] Considering Task <kubeflow/x-v1-76d645bc8c-8sr2m> on node <10.x.x.x>: <cpu 1000.00, memory 4294967296.00, volcano.sh/vgpu-number 2000.00> vs. <cpu 2750.00, memory 8095842304.00, ephemeral-storage 38644306266000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00>
I0319 02:51:05.282013       1 predicate_helper.go:55] Considering Task <kubeflow/x-v1-76d645bc8c-8sr2m> on node <10.x.x.x>: <cpu 1000.00, memory 4294967296.00, volcano.sh/vgpu-number 2000.00> vs. <cpu 28530.00, memory 126106681344.00, hugepages-2Mi 0.00, nstack/vcuda-core 0.00, nstack/vcuda-memory 0.00, nvidia.com/gpu 2000.00, volcano.sh/vgpu-number 17000.00, ephemeral-storage 482947890401000.00, hugepages-1Gi 0.00>
I0319 02:51:05.282064       1 predicate_helper.go:75] Predicates failed for task <kubeflow/x-v1-76d645bc8c-8sr2m> on node <10.x.x.x>: task kubeflow/x-v1-76d645bc8c-8sr2m on node 10.x.x.x fit failed: plugin TaintToleration predicates failed node(s) had untolerated taint {node-role.kubernetes.io/controlplane: true}
I0319 02:51:05.282078       1 predicates.go:505] pod(kubeflow/x-v1-76d645bc8c-8sr2m) affinity require information is nil, plugin InterPodAffinity is skip for node 10.x.x.x
I0319 02:51:05.282105       1 csi.go:210] "Could not find a CSI driver name or volume handle, not counting volume"
I0319 02:51:05.282125       1 device_info.go:152] DeviceSharing:Into FitInPod x-v1-76d645bc8c-8sr2m
I0319 02:51:05.282136       1 device_info.go:167] DeviceSharing:FitInPod successed
I0319 02:51:05.282143       1 device_info.go:183] 4pdvgpu DeviceSharing starts filtering pods x-v1-76d645bc8c-8sr2m
I0319 02:51:05.282153       1 utils.go:256] counts= [{2 NVIDIA 10240 101 0}]
I0319 02:51:05.282178       1 utils.go:350] Allocating device for container request {2 NVIDIA 10240 101 0}
I0319 02:51:05.282201       1 utils.go:353] Scoring pod 10240:101:0:2i1device:1
I0319 02:51:05.282223       1 utils.go:354] gs 1 = 11441 10250 2
I0319 02:51:05.282244       1 utils.go:353] Scoring pod 10240:101:0:2i0device:0
I0319 02:51:05.282268       1 utils.go:354] gs 0 = 11441 10240 1
E0319 02:51:05.282285      1 device_info.go:187] deviceSharing err= not enough gpu fitted on this node
I0319 02:51:05.282306       1 predicate_helper.go:75] Predicates failed for task <kubeflow/x-v1-76d645bc8c-8sr2m> on node <10.x.x.x>: task kubeflow/x-v1-76d645bc8c-8sr2m on node 10.x.x.x fit failed: not enough gpu fitted on this node

Vital information: device_info.go:187] deviceSharing err= not enough gpu fitted on this node

lowang-bh commented on August 18, 2024

> (quoting the scenario, device-predicate code, and reply from the comments above)

@archlitchi owns and is familiar with the vGPU code. @Monokaix

dmitsh commented on August 18, 2024

I might experience a similar issue.
My cluster has 4 GPU nodes.
First, I start a 4-node job with low priority, which gets scheduled and runs.
A little later I start two 2-node jobs with high priority.
I would expect the high-priority jobs to preempt the first job, but that doesn't happen.
Please refer to the attached files.
volcano.zip
/cc @k82cn
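For context, a setup like the one described can be expressed with two PriorityClasses and a Volcano Job referencing the higher one. This is an illustrative sketch; the names, image, and resource values are assumptions, not taken from the attached volcano.zip:

```yaml
# Illustrative priority classes (names and values are assumptions).
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 10
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100
---
# A Volcano job that should preempt lower-priority jobs to get scheduled.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: high-prio-job
spec:
  schedulerName: volcano
  minAvailable: 2
  priorityClassName: high-priority
  tasks:
    - replicas: 2
      name: worker
      template:
        spec:
          containers:
            - name: worker
              image: busybox
              resources:
                limits:
                  nvidia.com/gpu: 1
```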

Monokaix commented on August 18, 2024

> (quoting @dmitsh's scenario above)

Maybe you can provide some logs: )

dmitsh commented on August 18, 2024

> Maybe you can provide some logs: )

logs are in the zip file.

k82cn commented on August 18, 2024

@dmitsh, I think your case may be different from this one. According to the log, it seems your case is preemption for gang scheduling. I'd like to start a Google doc for this case, as we have already had several discussions about it; it's time to close this :)

lowang-bh commented on August 18, 2024

> volcano.zip

Your jobs are in the same queue, and the queue is overused, so tasks in the same queue need to be preempted. Please update to the latest version on the master branch, which fixes the issue of a high-priority job preempting a low-priority job in the same queue.

william-wang commented on August 18, 2024

> (quoting @dmitsh's scenario above)

@dmitsh We are reproducing this issue and finding the cause.

Monokaix commented on August 18, 2024

> logs are in the zip file.

It seems your case is a different issue, which #3230 has fixed; please check whether your volcano version includes this PR: )

Monokaix commented on August 18, 2024

/close

volcano-sh-bot commented on August 18, 2024

@Monokaix: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
