Comments (17)

johnzheng1975 commented on June 28, 2024

Let me close this ticket. Thanks for your excellent support. @szuecs @mikkeloscar

johnzheng1975 commented on June 28, 2024

Thanks for your reminder.

Unrelated to the issue: one other small thing you likely want to change is memory averageUtilization: 95. With that target, scale-out only happens at roughly +10%, i.e. around 105% of the requested memory, which is likely already OOM.

Since averageUtilization: 95 is based on the memory request, it will not OOM if the memory limit is higher than the memory request, am I right? Thanks.
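
In other words, with a resource spec like this (a hypothetical sketch, values illustrative only):

resources:
  requests:
    memory: "1Gi"   # utilization targets are computed against this request
  limits:
    memory: "2Gi"   # the OOM kill only happens at the limit

95% of the 1Gi request is still far below the 2Gi limit.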

mikkeloscar commented on June 28, 2024

@johnzheng1975 I wanted to see the events. You shared the get hpa output, but I want to see the describe hpa output.

johnzheng1975 commented on June 28, 2024

Please advise on how to make it work, thanks.

johnzheng1975 commented on June 28, 2024

@mikkeloscar could you take a look? Is this a defect? Thanks.

johnzheng1975 commented on June 28, 2024

kubectl scale deployment aiservice --replicas=1 -n zone-dev
can work, but it will trigger one more ReplicaSet. I am not sure this is the correct way to do it.
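
For illustration, the extra ReplicaSet can be seen with a plain listing (filtering by name):

kubectl get rs -n zone-dev | grep aiservice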

johnzheng1975 commented on June 28, 2024

The thing is: we are using Argo Rollouts, so the replicas of the Deployment will be 0. The real replica count is 1 (and will change with the HPA).
So I think the HPA should not show the metrics as null. It should show them as long as they can be queried from Prometheus.
Is this a defect of kube-metrics-adapter or a defect of the HPA? Thanks.

Or can you provide me some workaround? Thanks.

johnzheng1975 commented on June 28, 2024

Here is the Kubernetes code for this: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/horizontal.go#L821

Note that CPU/memory metrics do not raise this issue. The workaround is:
add a combined metric, so that desired replicas and current replicas become 1.
Then currentMetrics will not be null.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: aiservice
  namespace: zone-dev
  annotations:
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/prometheus-server: http://prometheus-server.infra.svc/
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/query: |
     avg(
       avg_over_time(
         DCGM_FI_DEV_GPU_UTIL{
           app="nvidia-dcgm-exporter",
           container="service",
           exported_namespace="zone-dev",
           pod=~"aiservice-.*",
           service="nvidia-dcgm-exporter"
         }[1m]
       )
     )
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: aiservice
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: External
    external:
      metric:
        name: dcgm-fi-dev-gpu-util
        selector:
          matchLabels:
            type: prometheus
      target:
        type: AverageValue
        averageValue: "50"
  - resource:
      name: memory
      target:
        averageUtilization: 95
        type: Utilization
    type: Resource
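
To check that the workaround took effect, currentMetrics can be inspected directly, for example:

kubectl --namespace zone-dev get hpa aiservice -o jsonpath='{.status.currentMetrics}'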

johnzheng1975 commented on June 28, 2024

I think kube-metrics-adapter needs an improvement for this issue. FYI.

szuecs commented on June 28, 2024

Can you show the table view of this query?

        DCGM_FI_DEV_GPU_UTIL{
            app="nvidia-dcgm-exporter",
            container="service",
            exported_namespace="zone-dev",
            pod=~"aiservice-.*",
            service="nvidia-dcgm-exporter"
        }

I checked your pictures, and to me it looks like the labels are not matching.

Unrelated to the issue: one other small thing you likely want to change is memory averageUtilization: 95. With that target, scale-out only happens at roughly +10%, i.e. around 105% of the requested memory, which is likely already OOM.
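
Rough math, assuming the HPA's default tolerance of 0.1: the controller only scales when currentUtilization / targetUtilization deviates from 1.0 by more than the tolerance, so with a target of 95 scale-out starts at about 95 * 1.1 ≈ 104.5% of the requested memory.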

johnzheng1975 commented on June 28, 2024

Thanks for your answer. @szuecs
Here is the result:

DCGM_FI_DEV_GPU_UTIL{DCGM_FI_DRIVER_VERSION="535.161.08",
Hostname="ip-10-200-181-23.us-west-2.compute.internal",
UUID="GPU-e1a61ba4-0fff-2b29-744f-110f9ca929cf",
app="nvidia-dcgm-exporter",
container="service",
device="nvidia0",
exported_namespace="zone-dev",
gpu="0",
instance="10.200.164.17:9400",
job="kubernetes-service-endpoints",
modelName="Tesla T4",
namespace="infra",
node="ip-10-200-181-23.us-west-2.compute.internal",
pod="aiservice-84f444c7df-pw2jk",
service="nvidia-dcgm-exporter"}

Value: 65

szuecs commented on June 28, 2024

OK, thanks, the data looks good.
Now I wonder whether I understand the following correctly:

The thing is: we are using Argo Rollouts, so the replicas of the Deployment will be 0. The real replica count is 1 (and will change with the HPA).

So I think the HPA should not show the metrics as null. It should show them as long as they can be queried from Prometheus.

Is this a defect of kube-metrics-adapter or a defect of the HPA? Thanks.

Or can you provide me some workaround? Thanks.

So if replicas are more than 0, everything works: the Prometheus query returns data and kube-metrics-adapter provides the data for the HPA, right?
However, Argo Rollouts will set the replicas to 0 and then it breaks, right?
And your expectation is that we would provide the last non-zero data. Do I understand this correctly?

johnzheng1975 commented on June 28, 2024

Thanks. @szuecs

So if replicas are more than 0, everything works: the Prometheus query returns data and kube-metrics-adapter provides the data for the HPA, right?
Answer: Yes, if the Deployment replicas > 0, everything is fine.

However, Argo Rollouts will set the replicas to 0 and then it breaks, right?
Answer: Because of Argo Rollouts, we have to set the Deployment replicas to 0.

And your expectation is that we would provide the last non-zero data. Do I understand this correctly?
Answer: I expect:

  1. No error message: message: the HPA controller was able to get the target's current scale

  2. currentMetrics is not null

  3. Works like another case in the same environment as above: istio-requests-total

  4. Or works like the combined-metrics case:
    #724 (comment)

Note that for the same Deployment whose replicas is 0:

  • an HPA based on CPU or memory still works;
  • a combined HPA based on memory and dcgm-fi-dev-gpu-util also works.
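
For reference, the istio-requests-total case from point 3 above follows the same annotation pattern; a hypothetical sketch (the query and labels are illustrative, not our actual config):

metric-config.external.istio-requests-total.prometheus/query: |
  sum(
    rate(
      istio_requests_total{
        destination_workload="aiservice",
        destination_workload_namespace="zone-dev"
      }[1m]
    )
  )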

mikkeloscar commented on June 28, 2024

@johnzheng1975 What is the output if you describe the hpa?

kubectl --namespace zone-dev describe hpa aiservice

The events at the bottom are the most interesting part of that output.

johnzheng1975 commented on June 28, 2024

@mikkeloscar, please see the screenshot above.

szuecs commented on June 28, 2024

From what I understand, the Istio query will return no data if you scale down to zero.
CPU and memory are not Prometheus queries but a Kubernetes-internal metrics-server lookup that could respond with data from a cache. I wonder a bit why it returns non-zero CPU/memory for 0 replicas, but that seems to be a side effect that makes Argo Rollouts work.

To be honest, from my side this sounds like a bug in Argo Rollouts.
I personally would not like this controller to cache data, and null/nil seems to be the right value for a query with no data.

johnzheng1975 commented on June 28, 2024

@mikkeloscar @szuecs I found the root cause now. This is not a defect of kube-metrics-adapter.
It was caused by an incorrect configuration. Sorry for the confusion I brought.

The wrong configuration is:
scaleTargetRef points at the Deployment, whose replicas is 0. This brings up the issue above: "scaling is disabled since the replica count of the target is zero".

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/prometheus-server: http://prometheus-server.infra.svc
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/query: |
      avg(
        avg_over_time(
          DCGM_FI_DEV_GPU_UTIL{
            app="nvidia-dcgm-exporter",
            container="service",
            exported_namespace="zone-dev",
            pod=~"aiservice-.*",
            service="nvidia-dcgm-exporter"
          }[1m]
        )
      )
  creationTimestamp: "2024-06-06T12:37:33Z"
  name: aiservice
  namespace: zone-dev
  resourceVersion: "68926327"
  uid: f0e5f9cf-cc9e-4f60-b97f-0ad8a0727cfd
spec:
  maxReplicas: 5
  metrics:
  - external:
      metric:
        name: dcgm-fi-dev-gpu-util
        selector:
          matchLabels:
            type: prometheus
      target:
        averageValue: "50"
        type: AverageValue
    type: External
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aiservice

The right configuration changes scaleTargetRef from the Deployment to the Rollout, whose replicas is 1.
It works perfectly.

The complete, correct configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/prometheus-server: http://prometheus-server.infra.svc
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/query: |
      avg(
        avg_over_time(
          DCGM_FI_DEV_GPU_UTIL{
            app="nvidia-dcgm-exporter",
            container="service",
            exported_namespace="zone-dev",
            pod=~"aiservice-.*",
            service="nvidia-dcgm-exporter"
          }[1m]
        )
      )
  creationTimestamp: "2024-06-06T12:37:33Z"
  name: aiservice
  namespace: zone-dev
  resourceVersion: "68926327"
  uid: f0e5f9cf-cc9e-4f60-b97f-0ad8a0727cfd
spec:
  maxReplicas: 5
  metrics:
  - external:
      metric:
        name: dcgm-fi-dev-gpu-util
        selector:
          matchLabels:
            type: prometheus
      target:
        averageValue: "50"
        type: AverageValue
    type: External
  minReplicas: 1
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: aiservice
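
To confirm the fix, describe the HPA again, e.g.:

kubectl --namespace zone-dev describe hpa aiservice

The "scaling is disabled since the replica count of the target is zero" condition should be gone and currentMetrics should be populated.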
