traas-stack / kapacity

An open cloud native capacity solution which helps you achieve ultimate resource utilization in an intelligent and risk-free way.

Home Page: https://kapacity.netlify.app

License: Apache License 2.0

Languages: Go 76.22%, Python 21.88%, Makefile 1.56%, Dockerfile 0.34%
Topics: aiops, autoscaling, capacity, cloud-native, finops, hpa, kubernetes, monitoring, risk-mitigation, time-series-forecasting

kapacity's Introduction


♻️ Kapacity is an open cloud native capacity solution which helps you achieve ultimate resource utilization in an intelligent and risk-free way.

It automates scaling, mitigates capacity risks, and saves you both effort and cost.

Kapacity is built upon core ideas and years of experience of the large-scale production capacity system at Ant Group, which saves ~100k cores yearly with high stability and zero downtime, combined with best practices from the cloud native community.

Watch our talk (in Chinese) at KubeCon China 2023 "How We Build Production-Grade HPA: From Effective Algorithm to Risk-Free Autoscaling" to learn the core idea and principles of Kapacity's Intelligent HPA in depth!

🚀 Please note that Kapacity is still under active development, and not all proposed features have been implemented. Feel free to reach out to us through the community if you have any requests or questions.


Core Features

Intelligent HPA

Kubernetes HPA is a common way to scale cloud native workloads automatically, but it has some BIG limitations, listed below, which make it less effective and practical in real world large-scale production use:

  • HPA works reactively: it only acts AFTER the target metrics exceed the expected value. It can hardly respond rapidly and gracefully to traffic peaks, especially for applications with long startup times.
  • HPA calculates the replica count from metrics with a simple ratio algorithm, which assumes that the replica count has a strictly linear correlation with the related metrics. This is not always the case in the real world.
  • Scaling is a highly risky operation in production, but HPA provides few means of risk mitigation other than scaling rate control.
  • HPA is built into Kubernetes. This is not literally a limitation, but it does tie some functions and behaviors to specific Kubernetes versions, and end users cannot extend or adjust its functionality for their own needs.
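
For reference, the simple ratio algorithm mentioned in the list above can be sketched as follows. This is a minimal illustration of the rule documented for Kubernetes HPA, desired = ceil(current × currentMetric / targetMetric); the function name is hypothetical, and the real controller also applies a tolerance, readiness handling and stabilization that are omitted here.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the plain ratio rule:
// desired = ceil(current * currentMetric / targetMetric).
// It assumes the metric scales linearly with the replica count,
// which is exactly the assumption questioned above.
func desiredReplicas(current int, currentMetric, targetMetric float64) int {
	return int(math.Ceil(float64(current) * currentMetric / targetMetric))
}

func main() {
	// 4 replicas at 90% average CPU utilization against a 60% target.
	fmt.Println(desiredReplicas(4, 90, 60)) // prints 6
}
```

Note that the rule only reacts to a metric that has already moved away from the target, so new replicas start after the load has arrived.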

So we built Intelligent HPA (IHPA), an intelligent, risk-defensive, highly adaptive and customizable substitute for HPA. It has the following core features:

  • Autoscaling powered by multiple intelligent algorithms, all combinable and customizable
    • An algorithm which predicts appropriate future replica counts using time series forecasting of metrics and advanced metrics-replicas modeling. This makes it suitable for a variety of real world production scenarios, such as multi-period and trending traffic, load affected by multiple kinds of traffic, non-linear correlation between load and replica count, and so on.
    • An algorithm which detects abnormal traffic or potential capacity risks, and proactively suggests a safe replica count.
    • The classic reactive ratio algorithm and cron-based replica control are also included out of the box.
  • Scaling with multiple risk defense means
    • Fine-grained pod state control which enables a multi-stage scale down. You can scale down a pod by only cutting off its traffic, or by releasing its resources, without actually stopping the application or deleting the pod. This can greatly speed up a rollback (scaling up again) if needed.
    • Fully customizable gray change for both scale up and scale down. You can even combine it with the pod state control mechanism to achieve multi-stage gray change.
    • Automatic risk mitigation based on customizable stability checks. You can let it monitor arbitrary metrics (not limited to the metrics which drive autoscaling) for risk detection, or even define your own detection logic, and it can automatically take actions such as suspending or rolling back the scaling to mitigate risks.
  • Open and highly extensible architecture
    • IHPA is split into three independent modules for replica count calculation, workload replicas control and overall autoscaling process management. Each module is replaceable and extensible.
    • Various extension points are exposed, which make the behavior of IHPA fully customizable. For example, you can customize how pod traffic is controlled, which pods are scaled down first, how risks are detected during autoscaling, and so on.

To start using Kapacity

See our documentation on kapacity.netlify.app.

Walking through the Quick Start Tutorial is also a good way to get started.

Community & Support

Do you have questions or ideas? Here are the ways to reach us:

  • Have some general questions or ideas? → GitHub Discussions
  • Want to report a bug or request a feature? → GitHub Issues
  • Want to connect further? Join our community through:
    • Slack (mainly for English speakers)
    • DingTalk (mainly for Chinese speakers, group number is 27855025593)

Contributing

Contributions of any form are warmly welcomed 🤗. Read the contribution guidelines for details.

kapacity's People

Contributors

archerny, dayko2019, zqzten

kapacity's Issues

Support automatic readiness gate injection

What would you like to be added?

Introduce an admission webhook to automatically inject the kapacity.traas.io/online readiness gate into pods which need it.

Why is this needed?

To support the readiness gate pod traffic controller.
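
The mutation such a webhook would perform can be sketched as below. This is only an illustration under stated assumptions: `podSpec` is a hypothetical stand-in for the readiness gate part of a Kubernetes pod spec (the real API models each gate as an object with a condition type), and `injectGate` is a made-up helper name, not Kapacity's implementation. Only the gate name comes from the issue.

```go
package main

import "fmt"

// onlineGate is the readiness gate condition type named in the issue.
const onlineGate = "kapacity.traas.io/online"

// podSpec is a hypothetical, simplified stand-in for the part of a pod spec
// that lists readiness gate condition types.
type podSpec struct {
	ReadinessGates []string
}

// injectGate appends the gate if it is absent and reports whether the spec
// was changed, so that repeated admission calls stay idempotent.
func injectGate(spec *podSpec) bool {
	for _, g := range spec.ReadinessGates {
		if g == onlineGate {
			return false
		}
	}
	spec.ReadinessGates = append(spec.ReadinessGates, onlineGate)
	return true
}

func main() {
	spec := &podSpec{}
	fmt.Println(injectGate(spec), spec.ReadinessGates)
	fmt.Println(injectGate(spec), spec.ReadinessGates)
}
```

With the gate present, a pod is only considered ready once something sets the matching pod condition to true, which is what lets a traffic controller take a pod offline without deleting it.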

Introduce time series forecasting model training script

What would you like to be added?

Introduce the Python script to train the time series forecasting model for predictive scaling.

Why is this needed?

The time series forecasting model needs to be trained beforehand so that the actual prediction job can use it.

Expose the unified metric provider interface as a service

What would you like to be added?

Expose the unified metric provider interface as a (probably gRPC) service.

Why is this needed?

This enables external algorithm jobs to use the internal unified metric provider interface so that they don't need to implement metrics queries themselves.

Support controller reconcile concurrency configuration

What would you like to be added?

Add a reconcile-concurrency flag to the manager to configure the reconcile concurrency of every controller.

Why is this needed?

The reconcile concurrency defaults to 1, which does not meet the efficiency requirements of production.
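
A minimal sketch of how such a flag could be parsed and validated with Go's standard `flag` package is shown below. The flag name and the default of 1 come from the issue; the function name and the validation rule are assumptions. In a controller-runtime based manager, the parsed value would typically be passed to each controller through `controller.Options.MaxConcurrentReconciles`, but that wiring is not shown and is not confirmed by the issue.

```go
package main

import (
	"flag"
	"fmt"
)

// parseReconcileConcurrency reads the reconcile-concurrency flag from args.
// The default of 1 mirrors the current behavior described in the issue.
func parseReconcileConcurrency(args []string) (int, error) {
	fs := flag.NewFlagSet("manager", flag.ContinueOnError)
	n := fs.Int("reconcile-concurrency", 1,
		"number of concurrent reconciles for every controller")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	// Assumed validation: at least one worker is needed to make progress.
	if *n < 1 {
		return 0, fmt.Errorf("reconcile-concurrency must be at least 1, got %d", *n)
	}
	return *n, nil
}

func main() {
	n, err := parseReconcileConcurrency([]string{"--reconcile-concurrency=4"})
	fmt.Println(n, err)
}
```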

[FAQ] More details about `recommendation of Pod resource specifications (CPU, memory, etc.) intelligent algorithms`

Thank you very much for open-sourcing Kapacity. Sharing experiences from the industry is very important for this field. My main research direction includes container workload resource recommendation, so even if the resource recommendation algorithm of Kapacity will be open-sourced in the future, I still want to learn about your ideas as soon as possible. Could you please provide a detailed introduction of the models and methods you use for resource recommendation?

Ref: https://kapacity.netlify.app/docs/roadmap/

Support arbitrary metrics query for Prometheus metric provider

What would you like to be added?

Support arbitrary metrics queries for the Prometheus metric provider.

Why is this needed?

Currently the Prometheus metric provider only supports resource metrics queries, which limits the metrics that algorithms can use. We need a mechanism to support arbitrary PromQL queries to make the most of this provider.

Introduce replicas prediction job

What would you like to be added?

Introduce the Python program which gathers metrics, predicts replicas, and writes out the results. This program can be run as a Kubernetes CronJob. The model used by the program should be trained beforehand so that the program does not rely on GPUs.

Why is this needed?

It is needed to do the real prediction work for the predictive portrait.

Reasonable behavior on selection of portraits with the same priority

What would you like to be added?

Currently, the behavior of portrait selection is undefined when multiple portraits have the same priority. We need to define a reasonable behavior for this case, namely selecting the one which desires the most replicas.

Why is this needed?

To provide a more reasonable and predictable behavior.
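
The proposed tie-breaking rule can be sketched as follows. The `portrait` struct and its fields are hypothetical simplifications, and the sketch assumes that a larger number means a higher priority, which the issue does not state.

```go
package main

import "fmt"

// portrait is a hypothetical, simplified view of a portrait: only the
// fields needed for selection are modeled.
type portrait struct {
	Name     string
	Priority int // assumption: a larger value means a higher priority
	Replicas int // replica count desired by this portrait
}

// selectPortrait returns the portrait with the highest priority. Among
// portraits with the same priority it picks the one desiring the most
// replicas, which makes the result independent of input order.
func selectPortrait(ps []portrait) (portrait, bool) {
	if len(ps) == 0 {
		return portrait{}, false
	}
	best := ps[0]
	for _, p := range ps[1:] {
		if p.Priority > best.Priority ||
			(p.Priority == best.Priority && p.Replicas > best.Replicas) {
			best = p
		}
	}
	return best, true
}

func main() {
	best, _ := selectPortrait([]portrait{
		{Name: "cron", Priority: 1, Replicas: 10},
		{Name: "predictive", Priority: 2, Replicas: 5},
		{Name: "reactive", Priority: 2, Replicas: 8},
	})
	fmt.Println(best.Name) // prints reactive
}
```

Preferring the larger replica count on a tie errs on the side of capacity, which fits the risk-averse goal stated in the introduction.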

As a user, can I control Kapacity through an HTTP API?

What would you like to be added?

Hello, I would like to know whether I can use an API (such as HTTP calls) to control Kapacity's behavior. If so, where can I find the API?

Why is this needed?

If I could control Kapacity's behavior over HTTP, I could build our own control portal page and use the HTTP API to publish configuration into the Kubernetes cluster.

When will the release version be available?

What would you like to be added?

According to a WeChat article, v0.2 was to be published in June, but I cannot find it. What concerns me more is when I can get a release version to use in a production environment.
Looking forward to your reply. Thanks.

Why is this needed?

I want to know when a release version will be available.

Request for help: when will the dashboard project be available?

What would you like to be added?

Hello, I learned that Kapacity has a dashboard project which will be introduced. I would like to know when it will be available.

Why is this needed?

To use the dashboard as a control plane.
