sustainable-computing-io / susql-operator

a Kubernetes operator that aggregates energy data from tagged resources

Home Page: http://susql.org

License: Apache License 2.0

Dockerfile 3.50% Makefile 22.09% Go 54.48% Shell 19.92%
energy kepler kubernetes monitoring openshift operator susql sustainability

susql-operator's People

Contributors

mamy-cs, trent-s

susql-operator's Issues

Improve deployment flexibility

Add parameters that allow deployment in additional environments.
(Essentially, port the flexibility added to the github.com/trent-s/susql-operator clone.)

Improve http vs https selection within prometheus_manager.go

https and http traffic need to be handled differently and correctly. Here is sample log output:

ERROR [Reconcile]: Querying Prometheus didn't work: read tcp 10.217.0.203:33522->10.217.5.82:9091: read: connection reset by peer
ERROR [GetMetricValuesForPodNames]: Querying Prometheus didn't work: read tcp 10.217.0.203:33524->10.217.5.82:9091: read: connection reset by peer
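
One possible shape for this fix is to choose the client configuration from the URL scheme rather than from a separate flag. The following is a minimal sketch using only the Go standard library; the function name newPrometheusClient and the 10 second timeout are assumptions for illustration, not the code currently in prometheus_manager.go. A "connection reset by peer" like the one above may be what results from speaking plain http to a TLS port, but that is not confirmed here.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// newPrometheusClient (hypothetical helper) returns an HTTP client configured
// for the scheme of rawURL. rootCAs may be nil, in which case the system
// trust store is used for https endpoints.
func newPrometheusClient(rawURL string, rootCAs *x509.CertPool) (*http.Client, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, fmt.Errorf("parsing Prometheus URL %q: %w", rawURL, err)
	}
	client := &http.Client{Timeout: 10 * time.Second} // timeout is an assumed value
	switch u.Scheme {
	case "http":
		// Plain http: the default transport is sufficient.
	case "https":
		client.Transport = &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: rootCAs, MinVersion: tls.VersionTLS12},
		}
	default:
		return nil, fmt.Errorf("unsupported scheme %q in Prometheus URL %q", u.Scheme, rawURL)
	}
	return client, nil
}

func main() {
	for _, endpoint := range []string{
		"http://prometheus-k8s.monitoring.svc.cluster.local:9090",
		"https://example.invalid:9091",
		"ftp://example.invalid",
	} {
		_, err := newPrometheusClient(endpoint, nil)
		fmt.Printf("%s -> err=%v\n", endpoint, err)
	}
}
```

Rejecting unknown schemes early means a misconfigured URL is reported at startup instead of surfacing later as a query error.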

Improve configuration by using configmaps

Much of the configuration is currently done with environment variables at deploy time. Using a ConfigMap is preferable: it is less ad hoc and provides a more dynamic way to update values.

Improve tuning options

  • Measurement interval is currently hardcoded. Should at least be tunable on deployment.
  • Review the code for other items that should be user tunable on deployment.
  • Consider if any items should be dynamically tunable without redeployment.
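
As a starting point for the first item, the interval could be read from the environment with a fallback. This is a minimal sketch: the variable name SUSQL_SAMPLING_INTERVAL and the 2 second default are assumptions for illustration, not values taken from the current code.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultSamplingInterval is an assumed fallback, not the value hardcoded today.
const defaultSamplingInterval = 2 * time.Second

// samplingInterval parses a duration such as "30s" or "1m". On empty input it
// returns the default; on invalid input it returns the default plus an error
// so the caller can log a warning and keep running.
func samplingInterval(raw string) (time.Duration, error) {
	if raw == "" {
		return defaultSamplingInterval, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return defaultSamplingInterval, fmt.Errorf("invalid sampling interval %q: %w", raw, err)
	}
	if d <= 0 {
		return defaultSamplingInterval, fmt.Errorf("sampling interval must be positive, got %s", d)
	}
	return d, nil
}

func main() {
	// SUSQL_SAMPLING_INTERVAL is a hypothetical variable name.
	d, err := samplingInterval(os.Getenv("SUSQL_SAMPLING_INTERVAL"))
	if err != nil {
		fmt.Println("warning:", err)
	}
	fmt.Println("sampling every", d)
}
```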

Double counting on controller restart

The controller keeps an in-memory table of how much energy has already been accounted for per pod. This table is lost when the controller restarts, so total energy consumption is incorrectly double counted after a restart when the label group is not set up to reset counts on restart.
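
The mechanism can be illustrated with a small model. This is a sketch under stated assumptions, not the controller's actual data structures: energyLedger and its methods are hypothetical names, and the model assumes the running total survives a restart (for example in the CR) while the per-pod table does not.

```go
package main

import "fmt"

// energyLedger (hypothetical) accumulates per-pod counter deltas into a total.
type energyLedger struct {
	Total    float64            // aggregated joules for the label group
	LastSeen map[string]float64 // counter value already accounted for, per pod
}

func newLedger() *energyLedger {
	return &energyLedger{LastSeen: map[string]float64{}}
}

// observe adds only the energy that has not been accounted for yet.
func (l *energyLedger) observe(pod string, counter float64) {
	delta := counter - l.LastSeen[pod]
	if delta < 0 { // the pod's counter was reset: count the new value in full
		delta = counter
	}
	l.Total += delta
	l.LastSeen[pod] = counter
}

// snapshot returns a deep copy that could be persisted across restarts.
func (l *energyLedger) snapshot() *energyLedger {
	cp := &energyLedger{Total: l.Total, LastSeen: make(map[string]float64, len(l.LastSeen))}
	for pod, v := range l.LastSeen {
		cp.LastSeen[pod] = v
	}
	return cp
}

func main() {
	l := newLedger()
	l.observe("pod-a", 100) // total is now 100

	// Restart that keeps the total but loses the per-pod table:
	// the first 100 J are added a second time.
	lost := &energyLedger{Total: l.Total, LastSeen: map[string]float64{}}
	lost.observe("pod-a", 120)

	// Restart that restores the per-pod table: only the 20 J delta is added.
	restored := l.snapshot()
	restored.observe("pod-a", 120)

	fmt.Println("table lost:", lost.Total, "table restored:", restored.Total)
}
```

Persisting the per-pod table together with the total, wherever that ends up being stored, is one way to keep the two consistent across restarts.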

Kepler Decoupling Part 2

  • Some variable names and labels use the string "kepler". These should be renamed to something more general, such as "input" or "base input".
  • Various other strings also use "kepler". These should be renamed where reasonable.

Enable customization of output metric

Currently SusQL aggregates total energy in joules. It could be useful to enable customization of this metric. For example, financial cost or carbon footprint might be other options.
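
Both options mentioned above reduce to scaling the energy by a per-kWh factor, so a single conversion type may be enough. The sketch below is illustrative only: the conversion type is a hypothetical name, and the price and carbon intensity are made-up placeholder values. The only fixed fact is that 1 kWh equals 3.6e6 J.

```go
package main

import "fmt"

// joulesPerKWh is the exact conversion factor: 1 kWh = 3.6e6 J.
const joulesPerKWh = 3.6e6

// conversion (hypothetical) describes an output unit as a factor per kWh.
type conversion struct {
	Unit   string
	PerKWh float64 // output units per kWh of energy
}

// convert scales an energy value in joules into the target unit.
func convert(joules float64, c conversion) float64 {
	return joules / joulesPerKWh * c.PerKWh
}

func main() {
	joules := 7.2e6 // 2 kWh
	for _, c := range []conversion{
		{Unit: "kWh", PerKWh: 1},
		{Unit: "USD", PerKWh: 0.15},  // placeholder electricity price
		{Unit: "gCO2e", PerKWh: 400}, // placeholder grid carbon intensity
	} {
		fmt.Printf("%.2f %s\n", convert(joules, c), c.Unit)
	}
}
```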

Access rights

Users should only be able to aggregate metrics that they have access to.

Add runtime logging framework

Currently runtime errors and certain warnings are available from container console output.

However, for debugging, verification, and exploration it would be very useful to adopt a logging framework and use it to customize additional output, e.g., log output at INFO, DEBUG, WARNING, and ERROR levels, with the desired output level selectable at deploy time.

Including the logging framework should be trivial, and adding more output within the code will not be challenging, but it may take a day or two to go through the code and identify potentially interesting data to log.
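
To give a sense of the amount of code involved, here is a minimal sketch using the standard library's log/slog package (Go 1.21 or later) as a stand-in for whichever framework is eventually chosen. The environment variable SUSQL_LOG_LEVEL and the helper names are assumptions for illustration. Note that slog spells the warning level "WARN", so "WARNING" would fall back to INFO in this sketch.

```go
package main

import (
	"io"
	"log/slog"
	"os"
)

// parseLevel maps a string such as "debug" or "ERROR" to a slog level.
// Unknown or empty values fall back to INFO.
func parseLevel(raw string) slog.Level {
	var lvl slog.Level
	if err := lvl.UnmarshalText([]byte(raw)); err != nil {
		return slog.LevelInfo
	}
	return lvl
}

// newLogger builds a text logger that drops records below the given level.
func newLogger(w io.Writer, rawLevel string) *slog.Logger {
	opts := &slog.HandlerOptions{Level: parseLevel(rawLevel)}
	return slog.New(slog.NewTextHandler(w, opts))
}

func main() {
	// SUSQL_LOG_LEVEL is a hypothetical variable name.
	logger := newLogger(os.Stderr, os.Getenv("SUSQL_LOG_LEVEL"))
	logger.Debug("queried Prometheus", "pods", 3) // shown only at DEBUG
	logger.Error("querying Prometheus didn't work", "err", "connection reset by peer")
}
```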

update pointers to use new repository, etc

There are a number of self-references to the old repository, github.com/metalcycling/susql.
Likewise, we should probably find a new location to store Docker images other than docker.io/metalcycling.

Reference:

./cmd/main.go:	susqlv1 "github.com/metalcycling/susql/api/v1"
./cmd/main.go:	"github.com/metalcycling/susql/internal/controller"
./go.mod:module github.com/metalcycling/susql
./test/labelgroups.sh:namespace=metalcycling
./internal/controller/resource_manager.go:	susql "github.com/metalcycling/susql/api/v1"
./internal/controller/labelgroup_controller.go:	susql "github.com/metalcycling/susql/api/v1"
./internal/controller/suite_test.go:	susqlv1 "github.com/metalcycling/susql/api/v1"
./PROJECT:repo: github.com/metalcycling/susql
./PROJECT:  path: github.com/metalcycling/susql/api/v1
./deployment/susql-controller/values.yaml:containerImage: docker.io/metalcycling/susql-controller
./deployment/deploy.sh:SUSQL_REGISTRY="docker.io/metalcycling"

Refactor to use Operator SDK

Refactoring this project from Kubebuilder to OperatorSDK would enable improved operator packaging and deployment.

security issue with built images

Apparently our build process pulls in oldish packages with a potential security impact. I'm not overly worried about container security, but it would be good if we can easily fix this.

Path: /home/xyz/.local/share/containers/storage/overlay/.../diff/usr/lib/x86_64-linux-gnu/libcurl.so.4.8.0
Installed version: 7.88.1
Fixed version: 8.4.0

Path: /home/xyz/.local/share/containers/storage/overlay/.../diff/usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4.8.0
Installed version: 7.88.1
Fixed version: 8.4.0

And

Path: /home/xyz/.local/share/containers/storage/overlay/.../diff/usr/bin/curl
Installed version: 7.88.1
Fixed version: 8.4.0

Documentation Enhancement

For example:

  • Describe additional use cases along with detailed configuration/usage steps.
  • Give detailed instructions on deployment and usage in multiple environments including OpenShift, Kind, etc.
  • Consider what information would be useful as groundwork for a KubeCon presentation.

user permission support

susql needs to understand user permissions (admin view vs. normal viewer, for single or multiple namespaces).

list of very minor fixes

  • internal/controller/prometheus_manager.go: fmt.Printf("ERROR [GetMetricValuesForPodNames]: Couldn't created an HTTP client: %v\n", err)
    Note: Need to change "created" to "create"

apparent double counting bug

When workloads are deleted and restarted, there appears to be a double counting bug.
In the following demo, workloads are deleted and restarted.
Before the restart, susql-demo-1 accurately shows the sum of all susql-demo-1 workloads. After the restart, however, it shows unexpected behavior. Details are in the comments.

Investigate scope of interest specification

Currently the "scope of interest" is specified by creating label groups and attaching these labels to pods of interest. Consider approaches to make this easier for users.

Fix memory leak

Susql is being killed periodically on the gdr-test cluster due to OOM errors.

  • Review code for obvious memory leaks
  • Monitor memory usage
  • Possibly increase memory quota if it is too small

Cleanup error/warning messages

  1. During deployment, the kepler-check functionality displays errors while it waits for the test pod to come up.
    For example:
Checking if Kepler is deployed...
Error from server (Forbidden): error when creating "STDIN": pods "kepler-check" is forbidden: error looking up service account openshift-kepler-operator/prometheus-k8s: serviceaccount "prometheus-k8s" not found
Error from server (NotFound): pods "kepler-check" not found
Error from server (NotFound): pods "kepler-check" not found
Error from server (NotFound): error when deleting "kepler-check.yaml": pods "kepler-check" not found
Kepler service at 'http://prometheus-k8s.monitoring.svc.cluster.local:9090/' was found. Using it to collect SusQL data.
  2. Investigate the following intermittent errors:
2024-02-01T17:21:09Z ERROR controller-runtime.source.EventHandler failed to get informer from cache {"error": "failed to get API group resources: unable to retrieve the complete list of server APIs: susql.ibm.com/v1: the server could not find the requested resource"}

and

2024-02-01T17:21:39Z ERROR setup problem running manager {"error": "failed to wait for labelgroup caches to sync: timed out waiting for cache to be synced for Kind *v1.LabelGroup"}

Improve Prometheus storage naming

internal/controller/prometheus_manager.go contains the following code:

var (
        susqlMetrics = &SusqlMetrics{
                totalEnergy: prometheus.NewGaugeVec(prometheus.GaugeOpts{
                        Namespace: "kepler",
                        Name:      "node_platform_joules_total",
                        Help:      "Accumulated energy over time for set of labels",
                }, susqlPrometheusLabelNames),
        }
)

Concerns:

  • Is this the right namespace? Do we need to include a namespace at all?
  • The "name" feels wrong. Something closer to "application_aggregated_joules" seems better.
  • Gauges are commonly used for values that go up and down. Counters are intended for monotonically increasing data such as accumulated energy, so a counter may be the better choice.

Enable input metric customization

Although it is possible to customize various details about how to connect to Prometheus, the referenced metric is hardcoded.

It would be useful to enable this to be specified at deployment time. Additionally, a short (1-2 word?) description of this metric should also be configurable, as this too is hardcoded in multiple locations.
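
One way to structure this is to carry the metric name and its short description in a single value and build the query from it. The sketch below is hypothetical throughout: the inputMetric type, the example metric name kepler_container_joules_total, and the pod_name label are assumptions used for illustration and may not match the query the controller issues today. The metric and label name patterns follow the Prometheus data model.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	metricNameRE = regexp.MustCompile(`^[a-zA-Z_:][a-zA-Z0-9_:]*$`)
	labelNameRE  = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)
	podNameRE    = regexp.MustCompile(`^[a-z0-9]([a-z0-9.-]*[a-z0-9])?$`)
)

// inputMetric (hypothetical) groups the values that would become configurable.
type inputMetric struct {
	Name        string // metric to read, e.g. "kepler_container_joules_total"
	Description string // short name used in messages, e.g. "Kepler"
	PodLabel    string // label that carries the pod name, e.g. "pod_name"
}

// query builds a selector matching any of the given pods.
func (m inputMetric) query(pods []string) (string, error) {
	if !metricNameRE.MatchString(m.Name) {
		return "", fmt.Errorf("invalid %s metric name %q", m.Description, m.Name)
	}
	if !labelNameRE.MatchString(m.PodLabel) {
		return "", fmt.Errorf("invalid label name %q", m.PodLabel)
	}
	alternatives := make([]string, len(pods))
	for i, p := range pods {
		if !podNameRE.MatchString(p) {
			return "", fmt.Errorf("invalid pod name %q", p)
		}
		// "." is the only regex metacharacter a valid pod name can contain.
		alternatives[i] = strings.ReplaceAll(p, ".", "[.]")
	}
	return fmt.Sprintf(`%s{%s=~"%s"}`, m.Name, m.PodLabel, strings.Join(alternatives, "|")), nil
}

func main() {
	m := inputMetric{Name: "kepler_container_joules_total", Description: "Kepler", PodLabel: "pod_name"}
	q, err := m.query([]string{"web-0", "db.1"})
	fmt.Println(q, err)
}
```

Validating the configured names before building the query keeps a typo in the deployment configuration from turning into an opaque Prometheus error.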

Investigate ODH integration

Ideally, ODH would automatically create and apply labelgroups for workloads it creates, and then it would provide a dashboard for susql generated metrics.

add cluster connection check

Add a cluster connection check before deploying the operator, so that an unreachable cluster is reported up front instead of surfacing as unrelated errors such as:
E0213 16:08:14.306333 2593 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E0213 16:08:14.317558 2593 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E0213 16:08:14.318604 2593 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E0213 16:08:14.319466 2593 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E0213 16:08:14.320811 2593 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
The connection to the server localhost:8080 was refused - did you specify the right host or port?

Investigate an interval query

Investigate how to use the existing time series data in SusQL Prometheus to perform an interval-based power consumption query on the labels of interest.
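
The underlying calculation is the increase of the accumulated value between two points in time. The sketch below shows that calculation client-side on already fetched samples; the sample type and the function energyBetween are hypothetical names, and the numbers are made up. Prometheus can do something similar server-side with the PromQL increase() function, which is designed for counters and also handles resets, although it extrapolates to the edges of the range.

```go
package main

import "fmt"

// sample (hypothetical) is one point of the accumulated-energy time series.
type sample struct {
	Unix   int64   // timestamp in Unix seconds
	Joules float64 // accumulated energy at that time
}

// energyBetween returns the energy added between from and to (inclusive),
// assuming samples are sorted by time. A drop in the accumulated value is
// treated as a reset to zero, so the new value is counted in full.
func energyBetween(samples []sample, from, to int64) float64 {
	var total float64
	var prev *sample
	for i := range samples {
		s := &samples[i]
		if s.Unix < from || s.Unix > to {
			continue
		}
		if prev != nil {
			if d := s.Joules - prev.Joules; d >= 0 {
				total += d
			} else {
				total += s.Joules
			}
		}
		prev = s
	}
	return total
}

func main() {
	series := []sample{{0, 100}, {60, 160}, {120, 20}, {180, 50}} // reset at t=120
	from, to := int64(0), int64(180)
	joules := energyBetween(series, from, to)
	// Average power over the interval is energy divided by elapsed seconds.
	fmt.Printf("%.0f J, %.2f W average\n", joules, joules/float64(to-from))
}
```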

SusQL as a bundled operator

Once issue 11 is finished:

  • Verify that make bundle works
  • Verify that the SusQL operator produced by make bundle works properly locally.
  • Prepare to submit to operator hub.
  • Submit to operator hub

Log configuration information on deploy

It would be useful to keep a log of deployment information for debugging, and even as a reference to double check how a current susql implementation is deployed.

  • Preserve previous deployment information: if .susql-deploy-info.txt exists, copy it to .susql-deploy-info-last.txt, overwriting the older file if necessary.
  • Output information to .susql-deploy-info.txt in sh-readable format
  • Include the deployment date/time stamp and host name
  • Add .susql-deploy-info.txt and .susql-deploy-info-last.txt to .gitignore

This information could be useful for either un-deployment or re-deployment.

In the future, it may be useful to include detailed errors, warnings, and other output, perhaps in another file. It is simple enough to redirect stdout/stderr, but since this script is slightly interactive, redirection could be problematic unless we remove interaction.

Handling of SusQL time series data

Enhance and document time series susql data.

Enable operations such as max, min, average, total, etc. over custom time ranges along with label/pod selection for the CLI, API, and dashboard. (Currently only aggregation and label selection are supported.)

Reconsider where data is located. e.g., storing data in multiple locations can lead to double counting as reported in issue 3. (Currently, data seems to be stored separately in pod memory, CR, and Prometheus.)

Fix install logic error and doc typo

Fix an installation bug that leaves the current working directory at an unexpected value on error. In the original code, cd ${SUSQL_DIR} && make manifests && make install && cd -, if either of the make commands fails, the current working directory stays in ${SUSQL_DIR} instead of returning to the starting point as intended with cd -. The simplest way to ensure that the final cd is executed is to put it on a separate line, as follows:

cd ${SUSQL_DIR} && make manifests && make install
cd -

Improvements to susqltop

  • Replace two calls to kubectl in loop with one. Parse out required data rather than calling twice. Or even better yet, call this (kubectl get labelgroups -o json --all-namespaces) once and then extract data with jq. While we are at it, for extra credit do similar improvement to test/labelgroups.sh.
  • Rewrite susqltop in go using Cobra api. Sample code includes: https://github.ibm.com/CognitiveAdvisor/cactl
