Comments (17)

johnzheng1975 commented on June 28, 2024

Let me close this ticket. Thanks for your excellent support. @szuecs @mikkeloscar

johnzheng1975 commented on June 28, 2024

Thanks for your reminder.

Unrelated to the issue: one other small thing you likely want to change is memory averageUtilization: 95. With that target, scale-out only happens at roughly +10%, i.e. around 105% of the requested memory, which is likely already OOM.

Since averageUtilization: 95 is based on the memory request, it will not OOM if the memory limit is higher than the memory request, am I right? Thanks.
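
In other words, with a resource spec like this (a hypothetical sketch, values illustrative only):

resources:
  requests:
    memory: "1Gi"   # utilization targets are computed against this request
  limits:
    memory: "2Gi"   # the OOM kill only happens at the limit

95% of the 1Gi request is still far below the 2Gi limit.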

mikkeloscar commented on June 28, 2024

@johnzheng1975 I wanted to see the events. You shared the get hpa output, but I want to see the describe hpa output.

johnzheng1975 commented on June 28, 2024

Please advise on how to make it work, thanks.

johnzheng1975 commented on June 28, 2024

@mikkeloscar could you take a look? Is this a defect? Thanks.

johnzheng1975 commented on June 28, 2024

kubectl scale deployment aiservice --replicas=1 -n zone-dev
can work, but it will trigger one more ReplicaSet. I am not sure this is the correct way to do it.
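
For illustration, the extra ReplicaSet can be seen with a plain listing (filtering by name):

kubectl get rs -n zone-dev | grep aiservice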

johnzheng1975 commented on June 28, 2024

The thing is: we are using Argo Rollouts, so the replicas of the Deployment will be 0. The real replica count is 1 (and will change with the HPA).
So I think the HPA should not show the metrics as null. It should show them as long as they can be queried from Prometheus.
Is this a defect of kube-metrics-adapter or a defect of the HPA? Thanks.

Or can you provide me some workaround? Thanks.

johnzheng1975 commented on June 28, 2024

Here is the Kubernetes code for this: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/horizontal.go#L821

Note that CPU/memory metrics do not raise this issue. The workaround is:
add a combined metric, so that desired replicas and current replicas become 1.
Then currentMetrics will not be null.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: aiservice
  namespace: zone-dev
  annotations:
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/prometheus-server: http://prometheus-server.infra.svc/
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/query: |
     avg(
       avg_over_time(
         DCGM_FI_DEV_GPU_UTIL{
           app="nvidia-dcgm-exporter",
           container="service",
           exported_namespace="zone-dev",
           pod=~"aiservice-.*",
           service="nvidia-dcgm-exporter"
         }[1m]
       )
     )
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: aiservice
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: External
    external:
      metric:
        name: dcgm-fi-dev-gpu-util
        selector:
          matchLabels:
            type: prometheus
      target:
        type: AverageValue
        averageValue: "50"
  - resource:
      name: memory
      target:
        averageUtilization: 95
        type: Utilization
    type: Resource
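
To check that the workaround took effect, currentMetrics can be inspected directly, for example:

kubectl --namespace zone-dev get hpa aiservice -o jsonpath='{.status.currentMetrics}'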

johnzheng1975 commented on June 28, 2024

I think kube-metrics-adapter needs an improvement for this issue. FYI.

szuecs commented on June 28, 2024

Can you show the table view of this query?

        DCGM_FI_DEV_GPU_UTIL{
            app="nvidia-dcgm-exporter",
            container="service",
            exported_namespace="zone-dev",
            pod=~"aiservice-.*",
            service="nvidia-dcgm-exporter"
        }

I checked your pictures, and to me it looks like the labels are not matching.

Unrelated to the issue: one other small thing you likely want to change is memory averageUtilization: 95. With that target, scale-out only happens at roughly +10%, i.e. around 105% of the requested memory, which is likely already OOM.
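
Rough math, assuming the HPA's default tolerance of 0.1: the controller only scales when currentUtilization / targetUtilization deviates from 1.0 by more than the tolerance, so with a target of 95 scale-out starts at about 95 * 1.1 ≈ 104.5% of the requested memory.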

johnzheng1975 commented on June 28, 2024

Thanks for your answer. @szuecs
Here is the result:

DCGM_FI_DEV_GPU_UTIL{DCGM_FI_DRIVER_VERSION="535.161.08",
Hostname="ip-10-200-181-23.us-west-2.compute.internal",
UUID="GPU-e1a61ba4-0fff-2b29-744f-110f9ca929cf",
app="nvidia-dcgm-exporter",
container="service",
device="nvidia0",
exported_namespace="zone-dev",
gpu="0",
instance="10.200.164.17:9400",
job="kubernetes-service-endpoints",
modelName="Tesla T4",
namespace="infra",
node="ip-10-200-181-23.us-west-2.compute.internal",
pod="aiservice-84f444c7df-pw2jk",
service="nvidia-dcgm-exporter"}

Value: 65

szuecs commented on June 28, 2024

OK, thanks, the data looks good.
Now I wonder whether I understand the following correctly:

The thing is: we are using Argo Rollouts, so the replicas of the Deployment will be 0. The real replica count is 1 (and will change with the HPA).

So I think the HPA should not show the metrics as null. It should show them as long as they can be queried from Prometheus.

Is this a defect of kube-metrics-adapter or a defect of the HPA? Thanks.

Or can you provide me some workaround? Thanks.

So if replicas are more than 0, everything works: the Prometheus query returns data and kube-metrics-adapter provides the data for the HPA, right?
However, Argo Rollouts will set the replicas to 0 and then it breaks, right?
And your expectation is that we would provide the last non-zero data. Do I understand this correctly?

johnzheng1975 commented on June 28, 2024

Thanks. @szuecs

So if replicas are more than 0, everything works: the Prometheus query returns data and kube-metrics-adapter provides the data for the HPA, right?
Answer: Yes, if the Deployment replicas > 0, everything is fine.

However, Argo Rollouts will set the replicas to 0 and then it breaks, right?
Answer: Because of Argo Rollouts, we have to set the Deployment replicas to 0.

And your expectation is that we would provide the last non-zero data. Do I understand this correctly?
Answer: I expect:

  1. No error message: message: the HPA controller was able to get the target's current scale

  2. currentMetrics is not null

  3. Works like another case in the same environment as above: istio-requests-total

  4. Or works like the combined-metrics case:
    #724 (comment)

Note that for the same Deployment whose replicas is 0:

  • an HPA based on CPU or memory still works;
  • a combined HPA based on memory and dcgm-fi-dev-gpu-util also works.
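
For reference, the istio-requests-total case from point 3 above follows the same annotation pattern; a hypothetical sketch (the query and labels are illustrative, not our actual config):

metric-config.external.istio-requests-total.prometheus/query: |
  sum(
    rate(
      istio_requests_total{
        destination_workload="aiservice",
        destination_workload_namespace="zone-dev"
      }[1m]
    )
  )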

mikkeloscar commented on June 28, 2024

@johnzheng1975 What is the output if you describe the hpa?

kubectl --namespace zone-dev describe hpa aiservice

The events at the bottom are the most interesting part of that output.

johnzheng1975 commented on June 28, 2024

@mikkeloscar, please see the screenshot above.

szuecs commented on June 28, 2024

From what I understand, the Istio query will return no data if you scale down to zero.
CPU and memory are not Prometheus queries but a Kubernetes-internal metrics-server lookup that could respond with data from a cache. I wonder a bit why it returns non-zero CPU/memory for 0 replicas, but that seems to be a side effect that makes Argo Rollouts work.

To be honest, from my side this sounds like a bug in Argo Rollouts.
I personally would not like this controller to cache data, and null/nil seems to be the right value for a query with no data.

johnzheng1975 commented on June 28, 2024

@mikkeloscar @szuecs I found the root cause now. This is not a defect of kube-metrics-adapter.
It was caused by an incorrect configuration. Sorry for the confusion I brought.

The wrong configuration is:
scaleTargetRef points at the Deployment, whose replicas is 0. This brings up the issue above: "scaling is disabled since the replica count of the target is zero".

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/prometheus-server: http://prometheus-server.infra.svc
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/query: |
      avg(
        avg_over_time(
          DCGM_FI_DEV_GPU_UTIL{
            app="nvidia-dcgm-exporter",
            container="service",
            exported_namespace="zone-dev",
            pod=~"aiservice-.*",
            service="nvidia-dcgm-exporter"
          }[1m]
        )
      )
  creationTimestamp: "2024-06-06T12:37:33Z"
  name: aiservice
  namespace: zone-dev
  resourceVersion: "68926327"
  uid: f0e5f9cf-cc9e-4f60-b97f-0ad8a0727cfd
spec:
  maxReplicas: 5
  metrics:
  - external:
      metric:
        name: dcgm-fi-dev-gpu-util
        selector:
          matchLabels:
            type: prometheus
      target:
        averageValue: "50"
        type: AverageValue
    type: External
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aiservice

The right configuration changes scaleTargetRef from the Deployment to the Rollout, whose replicas is 1.
It works perfectly.

The complete, correct configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/prometheus-server: http://prometheus-server.infra.svc
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/query: |
      avg(
        avg_over_time(
          DCGM_FI_DEV_GPU_UTIL{
            app="nvidia-dcgm-exporter",
            container="service",
            exported_namespace="zone-dev",
            pod=~"aiservice-.*",
            service="nvidia-dcgm-exporter"
          }[1m]
        )
      )
  creationTimestamp: "2024-06-06T12:37:33Z"
  name: aiservice
  namespace: zone-dev
  resourceVersion: "68926327"
  uid: f0e5f9cf-cc9e-4f60-b97f-0ad8a0727cfd
spec:
  maxReplicas: 5
  metrics:
  - external:
      metric:
        name: dcgm-fi-dev-gpu-util
        selector:
          matchLabels:
            type: prometheus
      target:
        averageValue: "50"
        type: AverageValue
    type: External
  minReplicas: 1
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: aiservice
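
To confirm the fix, describe the HPA again, e.g.:

kubectl --namespace zone-dev describe hpa aiservice

The "scaling is disabled since the replica count of the target is zero" condition should be gone and currentMetrics should be populated.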
