traas-stack / kapacity

An open cloud native capacity solution which helps you achieve ultimate resource utilization in an intelligent and risk-free way.

Home Page: https://kapacity.netlify.app

License: Apache License 2.0

Dockerfile 0.34% Makefile 1.56% Go 76.22% Python 21.88%

aiops autoscaling capacity cloud-native finops hpa kubernetes monitoring risk-mitigation time-series-forecasting

kapacity's Introduction

♻️ Kapacity is an open cloud native capacity solution which helps you achieve ultimate resource utilization in an intelligent and risk-free way.

It automates scaling, mitigates capacity risks, and saves you both effort and cost.

Kapacity is built upon core ideas and years of experience of the large-scale production capacity system at Ant Group, which saves ~100k cores yearly with high stability and zero downtime, combined with best practices from the cloud native community.

✨ Watch our talk (in Chinese) at KubeCon China 2023 "How We Build Production-Grade HPA: From Effective Algorithm to Risk-Free Autoscaling" to learn the core idea and principles of Kapacity's Intelligent HPA in depth!

🚀 Please note that Kapacity is still under active development, and not all proposed features have been implemented. Feel free to talk to us directly through the community if you have any requests or questions.

Core Features

Intelligent HPA

Kubernetes HPA is a common way to scale cloud native workloads automatically, but it has some major limitations, listed below, that make it less effective and practical in real-world, large-scale production use:

HPA works reactively: it only acts AFTER the target metrics exceed the expected value. It can hardly respond rapidly and gracefully to traffic peaks, especially for applications with long startup times.
HPA calculates the replica count from metrics with a simple ratio algorithm, assuming a strictly linear correlation between the replica count and the related metrics. However, this is not always the case in the real world.
Scaling is a highly risky operation in production, but HPA provides few means of risk mitigation other than scaling rate control.
HPA is built into Kubernetes. This is not a limitation in itself, but it ties some functions and behaviors to specific Kubernetes versions, and end users have no way to extend or adjust its functionality for their own needs.
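
To make the second limitation concrete, here is a minimal sketch of the ratio rule that the built-in HPA applies. The function name is an assumption made for illustration, and details such as the tolerance band and stabilization windows are deliberately left out.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Classic HPA ratio rule: scale replicas in proportion to metric / target.

    This encodes the linearity assumption discussed above: if load per replica
    does not shrink proportionally as replicas are added, the result is off.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    return math.ceil(current_replicas * current_metric / target_metric)


# 4 replicas running at 90% average CPU against a 60% target.
print(hpa_desired_replicas(4, 90.0, 60.0))  # prints 6
```

Note that the rule can only react to a metric value that has already been observed, which is the first limitation in the list.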

So we built Intelligent HPA (IHPA), an intelligent, risk-defensive, highly adaptive and customizable substitute for HPA. It has the following core features:

Autoscaling powered by multiple intelligent algorithms, all combinable and customizable
- An algorithm that predicts appropriate future replica counts using time series forecasting of metrics and advanced metrics-replicas modeling. This makes it suitable for a variety of real-world production scenarios, such as multi-period and trending traffic, load affected by multiple kinds of traffic, and non-linear correlation between load and replica count.
- An algorithm that detects abnormal traffic or potential capacity risks and proactively suggests a safe replica count.
- The classic reactive ratio algorithm and cron-based replica control are also included out of the box.
Scaling with multiple means of risk defense
- Fine-grained pod state control that enables multi-stage scale-down. You can scale down a pod by only turning off its traffic, or by releasing its resources without actually stopping the application or deleting the pod. This can greatly speed up rollback (scaling up again) if needed.
- Fully customizable gray change for both scale-up and scale-down. You can even combine it with the pod state control mechanism to achieve multi-stage gray change.
- Automatic risk mitigation based on customizable stability checks. You can let it monitor arbitrary metrics (not limited to the metrics that drive autoscaling) for risk detection, or even define your own detection logic, and it can automatically take actions such as suspending or rolling back the scaling to mitigate risks.
Open and highly extensible architecture
- IHPA is split into three independent modules: replica count calculation, workload replica control, and overall autoscaling process management. Each module is replaceable and extensible.
- Various extension points are exposed, which make the behavior of IHPA fully customizable and extensible. For example, you can customize how pod traffic is controlled, which pods are scaled down first, how risks are detected during autoscaling, and so on.
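
The multi-stage scale-down described under the risk defense features can be pictured as a small state machine. The sketch below is only an illustration: the state names and helper functions are assumptions, not Kapacity's actual API.

```python
from enum import Enum


class PodState(Enum):
    # Hypothetical state names, chosen for this illustration only.
    ONLINE = "Online"    # serving traffic
    CUTOFF = "Cutoff"    # traffic turned off, application still running
    STANDBY = "Standby"  # resources released, pod still exists
    DELETED = "Deleted"  # pod removed


# Each scale-down stage moves a pod one step to the right.
SCALE_DOWN_ORDER = [PodState.ONLINE, PodState.CUTOFF, PodState.STANDBY, PodState.DELETED]


def next_scale_down_state(state: PodState) -> PodState:
    """Advance a pod by one scale-down stage; DELETED is terminal."""
    idx = SCALE_DOWN_ORDER.index(state)
    return SCALE_DOWN_ORDER[min(idx + 1, len(SCALE_DOWN_ORDER) - 1)]


def can_roll_back(state: PodState) -> bool:
    """A pod that still exists but is not serving can be brought back online
    without scheduling and starting a new pod."""
    return state in (PodState.CUTOFF, PodState.STANDBY)
```

The intermediate stages are what make rollback fast: returning a pod from one of them to serving traffic avoids a cold start.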

To start using Kapacity

See our documentation at kapacity.netlify.app.

Walking through the Quick Start Tutorial is also a good way to get started.

Community & Support

Have questions or ideas? Here are the ways to reach us:

Have some general questions or ideas? → GitHub Discussions
Want to report a bug or request a feature? → GitHub Issues
Want to connect further? Join our community on:
- Slack (for English speakers mainly)
- DingTalk (for Chinese speakers mainly, group number is 27855025593)

Contributing

Contributions of any kind are warmly welcome 🤗. Read the contribution guidelines for details.

kapacity's People

Contributors

suziewong zqzten dayko2019 niconical zhy76 piaobeizu hexiaofeng charlesqq teddy-syq conghuhu bootgo mvandermeulen itonyli

kapacity's Issues

Support automatic readiness gate injection

What would you like to be added?

Introduce an admission webhook to automatically inject the kapacity.traas.io/online readiness gate into the pods that need it.

Why is this needed?

To support the readiness gate pod traffic controller.
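
As a rough illustration of what such a webhook would do, the sketch below builds the JSON Patch a mutating admission webhook could return for a pod. The function name is an assumption, and the logic for deciding which pods need the gate is omitted.

```python
READINESS_GATE = "kapacity.traas.io/online"


def readiness_gate_patch(pod: dict) -> list:
    """Return JSON Patch operations that add the readiness gate to a pod.

    Returns an empty list when the gate is already present, so the
    mutation is idempotent.
    """
    gate = {"conditionType": READINESS_GATE}
    gates = pod.get("spec", {}).get("readinessGates")
    if gates is None:
        # No readinessGates field yet: create the array with our gate in it.
        return [{"op": "add", "path": "/spec/readinessGates", "value": [gate]}]
    if gate in gates:
        return []
    # Append to the existing array ("-" addresses the end of a JSON array).
    return [{"op": "add", "path": "/spec/readinessGates/-", "value": gate}]
```

With the gate in place, the pod is only reported Ready once the matching pod condition is set to True, which is what lets a controller switch its traffic on and off.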

Introduce time series forecasting model training script

What would you like to be added?

Introduce the Python script to train the time series forecasting model for predictive scaling.

Why is this needed?

The time series forecasting model needs to be trained beforehand so that the actual prediction job can use it.

Can you provide the dataset required for algorithm training?

What would you like to be added?

The dataset required for algorithm training

Why is this needed?

This can help participants conduct algorithm research and verify the quality of the Kapacity algorithm.

Expose the unified metric provider interface as a service

What would you like to be added?

Expose the unified metric provider interface as a (probably gRPC) service.

Why is this needed?

This enables external algorithm jobs to use the internal unified metric provider interface, so that they don't need to implement metrics queries themselves.

Support controller reconcile concurrency configuration

What would you like to be added?

Add a reconcile-concurrency flag to the manager to configure the reconcile concurrency of every controller.

Why is this needed?

The reconcile concurrency defaults to 1, which does not meet the efficiency requirements of production.

[FAQ] More details about `recommendation of Pod resource specifications (CPU, memory, etc.) intelligent algorithms`

Thank you very much for open-sourcing Kapacity. Sharing experiences from the industry is very important for this field. My main research direction includes container workload resource recommendation, so even if the resource recommendation algorithm of Kapacity will be open-sourced in the future, I still want to learn about your ideas as soon as possible. Could you please provide a detailed introduction of the models and methods you use for resource recommendation?

Ref: https://kapacity.netlify.app/docs/roadmap/

Support arbitrary metrics queries for the Prometheus metric provider

What would you like to be added?

Support arbitrary metrics queries for the Prometheus metric provider.

Why is this needed?

Currently the Prometheus metric provider only supports resource metrics queries, which limits the metrics that algorithms can use. We need a mechanism to support arbitrary PromQL queries to make the most of this provider.
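
For context, an arbitrary PromQL expression is typically evaluated over a time window through the Prometheus HTTP API endpoint `/api/v1/query_range`. The sketch below only builds the request URL. The helper name, base URL and example expression are assumptions and do not reflect how the Kapacity provider is implemented.

```python
from urllib.parse import urlencode


def build_query_range_url(base_url: str, promql: str, start: int, end: int, step_seconds: int) -> str:
    """Build a Prometheus range-query URL for an arbitrary PromQL expression.

    start and end are Unix timestamps in seconds; step is the resolution.
    """
    params = urlencode({
        "query": promql,
        "start": start,
        "end": end,
        "step": f"{step_seconds}s",
    })
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# Hypothetical server address and metric name, for illustration only.
url = build_query_range_url(
    "http://prometheus:9090",
    "sum(rate(http_requests_total[1m]))",
    start=1700000000,
    end=1700003600,
    step_seconds=60,
)
print(url)
```

Because the expression is passed through as-is, a provider built this way is not limited to resource metrics such as CPU and memory.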

[Umbrella] Introduce predictive portrait

What would you like to be added?

This issue tracks the tasks TBD for introducing predictive portrait to IHPA.

Why is this needed?

Planned feature to support predictive HPA.

Introduce replicas prediction job

What would you like to be added?

Introduce the Python program which gathers metrics, does replicas prediction, and writes out the results. This program can be run as a Kubernetes CronJob. The model used by the program should be trained beforehand so that the job does not rely on GPUs.

Why is this needed?

It is needed to do the real prediction work for the predictive portrait.
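
The overall shape of such a job (forecast the metric, then turn the forecast into replica counts) can be sketched as follows. This is a toy under stated assumptions: a seasonal-naive forecast stands in for the pre-trained model, the load-to-replicas mapping is a simple linear one, and every name here is hypothetical rather than taken from Kapacity.

```python
import math


def seasonal_naive_forecast(history: list, period: int, horizon: int) -> list:
    """Stand-in for a trained model: repeat the last full period of observations."""
    if len(history) < period:
        raise ValueError("need at least one full period of history")
    last_period = history[-period:]
    return [last_period[i % period] for i in range(horizon)]


def replicas_for_load(load: float, capacity_per_replica: float, min_replicas: int = 1) -> int:
    """Linear load-to-replicas mapping; a real model may be non-linear."""
    return max(min_replicas, math.ceil(load / capacity_per_replica))


def predict_replicas(history: list, period: int, horizon: int, capacity_per_replica: float) -> list:
    forecast = seasonal_naive_forecast(history, period, horizon)
    return [replicas_for_load(v, capacity_per_replica) for v in forecast]


# Two periods of four QPS samples each; each replica is assumed to handle 200 QPS.
qps_history = [100, 400, 900, 300, 120, 420, 950, 310]
print(predict_replicas(qps_history, period=4, horizon=4, capacity_per_replica=200))
# prints [1, 3, 5, 2]
```

Since inference here needs no training step, a job of this shape can run on CPU-only nodes on a schedule.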

Reasonable behavior on selection of portraits with the same priority

What would you like to be added?

Currently, the behavior of portrait selection is undefined when multiple portraits have the same priority. We need to define a reasonable behavior for this case: select the portrait that desires the most replicas.

Why is this needed?

To provide a more reasonable and predictable behavior.
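
The proposed rule can be written as a single ordered comparison. The sketch below assumes that a larger priority value wins. The `Portrait` class and its fields are made up for illustration and are not Kapacity's types.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Portrait:
    # Hypothetical fields for illustration.
    name: str
    priority: int  # assumption: larger value means higher priority
    replicas: int  # replica count this portrait desires


def select_portrait(portraits: Sequence[Portrait]) -> Optional[Portrait]:
    """Pick the highest-priority portrait; break ties by the most desired replicas."""
    if not portraits:
        return None
    return max(portraits, key=lambda p: (p.priority, p.replicas))


candidates = [
    Portrait("cron", priority=1, replicas=10),
    Portrait("predictive", priority=2, replicas=4),
    Portrait("reactive", priority=2, replicas=6),
]
print(select_portrait(candidates).name)  # prints reactive
```

Preferring the larger replica count on a tie is the conservative choice: it errs toward over-provisioning rather than risking insufficient capacity.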

Full replica control support for Deployment (as well as ReplicaSet)

What would you like to be added?

Introduce a best-effort pod sort algorithm for Deployment as well as ReplicaSet.

Why is this needed?

To fully support replica status control.

As a user, can I control Kapacity through an HTTP API?

What would you like to be added?

Hello, I would like to know whether I can control Kapacity's behavior through an API (for example, via HTTP calls). If so, where can I find the API?

Why is this needed?

With an HTTP API to control Kapacity's behavior, I could build my own control portal page and use the API to publish configuration to the Kubernetes cluster.

Support Kubernetes CronJob as external horizontal portrait algorithm job

What would you like to be added?

Support Kubernetes CronJob as external horizontal portrait algorithm job.

Why is this needed?

It is a lightweight way to run complex long-running algorithms.

Support collecting result of external algorithm job from ConfigMap to HorizontalPortrait

What would you like to be added?

Support collecting result of external algorithm job from ConfigMap to HorizontalPortrait.

Why is this needed?

It is a lightweight way to collect the results of external algorithm jobs that are able to write ConfigMaps in the cluster.

When will the release version be available?

What would you like to be added?

According to the WeChat article (linked here), v0.2 was to be published in June, but I cannot find it. What concerns me more is when a release version will be available so that I can use it in a production environment. Waiting for your reply. Thanks.

Why is this needed?

I want to know when a release version will be available.

Request for help: when will the dashboard project be available?

What would you like to be added?

Hello, I learned that Kapacity has a dashboard project that will be introduced. I would like to know when it will be available.

Why is this needed?

To use the dashboard as a control plane.
