Comments (17)
Let me close this ticket. Thanks for your excellent support. @szuecs @mikkeloscar
from kube-metrics-adapter.
Thanks for your reminder.
Unrelated to the issue: one other small thing that you likely want to change is the memory averageUtilization: 95. It would mean that it scales out only at +10%, which is 105% and likely already OOM.
Since averageUtilization: 95 is based on the memory request, it will not OOM if the memory limit is higher than the request, am I right? Thanks.
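Since Utilization targets are computed against the request, here is a quick sketch of the arithmetic, using hypothetical request/limit values that are not taken from this thread:

```python
def memory_utilization_pct(usage_bytes: float, request_bytes: float) -> float:
    """HPA Utilization targets compare usage against the resource *request*, not the limit."""
    return usage_bytes / request_bytes * 100

# Hypothetical pod sizing: request 1 GiB, limit 2 GiB.
request = 1 * 1024**3
limit = 2 * 1024**3

# Usage sitting right at the averageUtilization: 95 target...
usage_at_target = 0.95 * request
print(memory_utilization_pct(usage_at_target, request))  # 95.0
# ...is still well below the limit, so the target alone does not imply OOM:
print(usage_at_target < limit)  # True
```

So the warning mainly bites when the limit equals the request; with limit above request there is headroom, though the node can still OOM-kill or evict the pod under memory pressure.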
@johnzheng1975 I wanted to see the events; you shared the get hpa output, but I want to see the describe hpa output.
Please guide me on how to make it work, thanks.
@mikkeloscar could you help take a look? Is this a defect? Thanks.
kubectl scale deployment aiservice --replicas=1 -n zone-dev
can work, but it will trigger one more ReplicaSet. Not sure this is the correct way to do it.
The thing is: we are using Argo Rollouts, so the replicas of the Deployment will be 0. The real replica count is 1 (and will be changed by the HPA).
So I think the HPA should not show the metrics as null. It should show them as long as they can be queried from Prometheus.
Is this a defect of kube-metrics-adapter or a defect of the HPA? Thanks.
Or can you provide me some workaround? Thanks.
Here is the k8s code for this: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/horizontal.go#L821
Note that cpu/memory metrics do not raise this issue. The workaround is:
add a combined metric; then the desired replicas and current replicas will be 1, and currentMetrics will not be null.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: aiservice
  namespace: zone-dev
  annotations:
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/prometheus-server: http://prometheus-server.infra.svc/
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/query: |
      avg(
        avg_over_time(
          DCGM_FI_DEV_GPU_UTIL{
            app="nvidia-dcgm-exporter",
            container="service",
            exported_namespace="zone-dev",
            pod=~"aiservice-.*",
            service="nvidia-dcgm-exporter"
          }[1m]
        )
      )
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: aiservice
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: External
    external:
      metric:
        name: dcgm-fi-dev-gpu-util
        selector:
          matchLabels:
            type: prometheus
      target:
        type: AverageValue
        averageValue: "50"
  - resource:
      name: memory
      target:
        averageUtilization: 95
        type: Utilization
    type: Resource
I think kube-metrics-adapter needs an improvement for this issue. FYI
Can you show the table view of the query?
DCGM_FI_DEV_GPU_UTIL{
  app="nvidia-dcgm-exporter",
  container="service",
  exported_namespace="zone-dev",
  pod=~"aiservice-.*",
  service="nvidia-dcgm-exporter"
}
I checked your pictures and to me it looks like the labels are not matching.
Unrelated to the issue: one other small thing that you likely want to change is the memory averageUtilization: 95. It would mean that it scales out only at +10%, which is 105% and likely already OOM.
Thanks for your answer. @szuecs
Here is the result:
DCGM_FI_DEV_GPU_UTIL{DCGM_FI_DRIVER_VERSION="535.161.08",
Hostname="ip-10-200-181-23.us-west-2.compute.internal",
UUID="GPU-e1a61ba4-0fff-2b29-744f-110f9ca929cf",
app="nvidia-dcgm-exporter",
container="service",
device="nvidia0",
exported_namespace="zone-dev",
gpu="0",
instance="10.200.164.17:9400",
job="kubernetes-service-endpoints",
modelName="Tesla T4",
namespace="infra",
node="ip-10-200-181-23.us-west-2.compute.internal",
pod="aiservice-84f444c7df-pw2jk",
service="nvidia-dcgm-exporter"}
Value: 65
Ok, thanks, the data looks good.
Now I wonder whether I understand the following correctly:
The thing is: we are using Argo Rollouts, so the replicas of the Deployment will be 0. The real replica count is 1 (and will be changed by the HPA).
So I think the HPA should not show the metrics as null. It should show them as long as they can be queried from Prometheus.
Is this a defect of kube-metrics-adapter or a defect of the HPA? Thanks.
Or can you provide me some workaround? Thanks.
So if the replicas are more than 0, everything works: the Prometheus query returns data and kube-metrics-adapter provides the data for the HPA, right?
However, the argocd rollout will set the replicas to 0 and then it breaks, right?
And your expectation is that we would provide the last non-zero data. Do I understand this correctly?
Thanks. @szuecs
So if the replicas are more than 0, everything works: the Prometheus query returns data and kube-metrics-adapter provides the data for the HPA, right?
Answer: Yes, if the Deployment replicas > 0, everything is fine.
However, the argocd rollout will set the replicas to 0 and then it breaks, right?
Answer: Because of the argocd rollout, we have to set the Deployment replicas to 0.
And your expectation is that we would provide the last non-zero data. Do I understand this correctly?
Answer: I expect:
- no error message, i.e. message: the HPA controller was able to get the target's current scale
- currentMetrics is not null
- it works like the other case in the same environment mentioned above: istio-requests-total
- or it works like the combined metrics in #724 (comment)
Note that for the same Deployment whose replicas is 0:
- an HPA based on cpu or memory still works.
- a combined HPA based on memory and dcgm-fi-dev-gpu-util works.
@johnzheng1975 What is the output if you describe the hpa?
kubectl --namespace zone-dev describe hpa aiservice
The events at the bottom are the most interesting part of that output.
@mikkeloscar, please see above.
From what I understand, the istio query will return no data if you scale down to zero.
CPU and memory are not a Prometheus query but a Kubernetes-internal metrics-server lookup that could respond with data from a cache. I wonder a bit why it returns non-zero CPU/memory at 0 replicas, but that seems to be a side effect that makes argocd work.
From my side it sounds like a bug in argocd, to be honest.
I personally would not like this controller to cache data, and null/nil seems to be the right value for a query with no data.
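For context on why a null metric matters here: the HPA's documented scaling rule is desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), so with no metric value there is nothing to plug in. A minimal sketch with this thread's numbers (averageValue target "50", measured GPU utilization 65), ignoring the roughly 10% tolerance band and min/max clamping:

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Documented HPA formula, without the tolerance band and min/max clamping."""
    return math.ceil(current_replicas * current_metric / target_metric)

# One running replica at GPU utilization 65 against a target of 50:
print(desired_replicas(1, 65, 50))  # 2

# With the scale target reporting 0 replicas, the controller disables scaling
# ("scaling is disabled since the replica count of the target is zero"),
# and the formula would pin the workload at 0 anyway:
print(desired_replicas(0, 65, 50))  # 0
```

This also shows why pointing the HPA at an object with a real non-zero replica count, as the root-cause comment below concludes, is the fix rather than caching stale metric values.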
@mikkeloscar @szuecs I found the root cause now. This is not a defect of kube-metrics-adapter.
It is caused by an incorrect configuration. Sorry for the confusion I caused.
The wrong configuration is: the scaleTargetRef is a Deployment whose replicas is 0. It causes the issue above, "scaling is disabled since the replica count of the target is zero":
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/prometheus-server: http://prometheus-server.infra.svc
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/query: |
      avg(
        avg_over_time(
          DCGM_FI_DEV_GPU_UTIL{
            app="nvidia-dcgm-exporter",
            container="service",
            exported_namespace="zone-dev",
            pod=~"aiservice-.*",
            service="nvidia-dcgm-exporter"
          }[1m]
        )
      )
  creationTimestamp: "2024-06-06T12:37:33Z"
  name: aiservice
  namespace: zone-dev
  resourceVersion: "68926327"
  uid: f0e5f9cf-cc9e-4f60-b97f-0ad8a0727cfd
spec:
  maxReplicas: 5
  metrics:
  - external:
      metric:
        name: dcgm-fi-dev-gpu-util
        selector:
          matchLabels:
            type: prometheus
      target:
        averageValue: "50"
        type: AverageValue
    type: External
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aiservice
The right configuration changes the scaleTargetRef from the Deployment to the Rollout, whose replicas is 1.
It works perfectly.
The complete correct configuration:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/prometheus-server: http://prometheus-server.infra.svc
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/query: |
      avg(
        avg_over_time(
          DCGM_FI_DEV_GPU_UTIL{
            app="nvidia-dcgm-exporter",
            container="service",
            exported_namespace="zone-dev",
            pod=~"aiservice-.*",
            service="nvidia-dcgm-exporter"
          }[1m]
        )
      )
  creationTimestamp: "2024-06-06T12:37:33Z"
  name: aiservice
  namespace: zone-dev
  resourceVersion: "68926327"
  uid: f0e5f9cf-cc9e-4f60-b97f-0ad8a0727cfd
spec:
  maxReplicas: 5
  metrics:
  - external:
      metric:
        name: dcgm-fi-dev-gpu-util
        selector:
          matchLabels:
            type: prometheus
      target:
        averageValue: "50"
        type: AverageValue
    type: External
  minReplicas: 1
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: aiservice