converged-computing / metrics-operator

Testing designs for a benchmarking operator (in experimental mode!)

Home Page: https://converged-computing.github.io/metrics-operator/

License: MIT License

Languages: Makefile 5.11%, Go 75.85%, Dockerfile 0.43%, Smarty 0.60%, Shell 0.67%, Python 17.34%

Topics: converged-computing, high-performance-computing, hpc, kubernetes, metrics, operator

metrics-operator's Introduction

metrics-operator

Developing metrics and a catalog of applications to assess different kinds of Kubernetes performance. We will likely choose metrics that are important for HPC. Note that I haven't started the operator yet because I'm still testing ideas for the design. To learn more:

⭐️ Documentation ⭐️
🐯️ Python Module 🐯️

Dinosaur TODO

Figure out issue with errors.IsNotFound not working...
The entrypoint command used for monitoring may need to differ depending on the container, so we need a way to customize it
For larger metric collections, we should have a log streaming mode (and not wait for Completed/Successful)
For services we are measuring, we likely need to be able to kill after N seconds (to complete job) or to specify the success policy on the metrics containers instead of the application
Add assertion checks to the Python tests
Plotting examples (Python parsers) are needed for:
- io-sysstat
- app-kripke
- app-quicksilver
- app-pennant

License

HPCIC DevTools is distributed under the terms of the MIT license. All new contributions must be made under this license.

See LICENSE, COPYRIGHT, and NOTICE for details.

SPDX-License-Identifier: (MIT)

LLNL-CODE- 842614

metrics-operator's Issues

Rate / completions should likely be scoped to an app

My original design assumed these would be globally relevant, but I don't think that's the case. They should be metric-specific options instead, so as not to mislead the user into thinking they apply across metrics (they do not).

Consider metric app template

For the app-* metrics, I'm starting to see common patterns - there is some number of custom options, and then custom logic to derive entrypoints for a launcher and one or more workers. But the code files are getting very redundant! I'm wondering if there is some way (that would work with the limits of go interfaces) to have common JobSet patterns. In this case the launcher / worker would be a template that has the rest populated by a simpler struct.

Think of how to integrate flux operator

We would want to be able to run a Flux Operator application and measure metrics for it.

volumes need target containers

Right now volumes are added to all pods in the set; instead, we need a way to select which containers a volume targets.

Metrics table: ability to collapse rows

The table is getting long! I think (for long descriptions) it would be good to find a way to collapse rows. I haven't looked much into it yet but I suspect it is possible. https://stackoverflow.com/questions/57550993/datatable-button-expand-collapse-row-jquery

Timing options

We should provide a start / end time for the entire collection. E.g., for storage (using FIO) it's likely the tool collects the time, but this likely isn't the case for most, and it would be an interesting (albeit simple) comparison metric.
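A minimal sketch of the idea in Python, assuming the collection can be wrapped by a context manager. The `timed_collection` helper and the keys of the record it fills in are hypothetical and not part of the operator; the sleep stands in for a real collection.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_collection(name):
    """Record wall-clock start/end times around one metric collection.

    Hypothetical helper: the record keys are illustrative only.
    """
    record = {"metric": name, "start": time.time()}
    clock = time.perf_counter()  # monotonic clock for the duration
    try:
        yield record
    finally:
        record["end"] = time.time()
        record["duration_seconds"] = time.perf_counter() - clock


with timed_collection("io-sysstat") as timing:
    time.sleep(0.01)  # stand-in for running the real collection
```

After the block exits, `timing` holds the start, end, and duration, which could be written to the log next to whatever the tool itself reports.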

add nodeSelector to metrics pods

We want to be able to assign hwloc metrics to run on specific nodes, so we need the nodeSelector of the pod exposed.

storage metric: fio (good for NFS)

https://docs.gitlab.com/ee/administration/operations/filesystem_benchmarking.html

Metrics / apps to consider

These are important to the labs! If you'd like to see an app, metric, or other added, please comment here.

Unsure

Dr. Memory https://drmemory.org/page_running.html

More Workflow Based

parsl (demo for molecular design)
merlin (demo) - too many steps / services to be considered a proxy app
fireworks (demo)
balsam (containers built, but part of the server seems buggy and/or proprietary, so this is likely not going to be it as a way to orchestrate workflows)
mlcommons-deepcam (also very complex to actually set up; I stopped at the base container)
nextflow ml workflow example
snakemake bioscience example workflow
weave (demos)

In Progress / Attempted

perfKitBenchmark https://github.com/GoogleCloudPlatform/PerfKitBenchmarker (run locally on pod(s))? Likely we would use the launcher / worker model (see here). Note: having trouble.
darshan https://github.com/darshan-hpc/darshan (WIP): likely this should be a spack view? rse-ops/hpc-apps#24 - views are done, but it segfaulted. https://github.com/converged-computing/metrics-operator/tree/add/darshan
exaMPM (based on cabana): didn't get MPI working (ref)

Recent readings / tools for performance

use pod anti-affinity with hostname

This will allow us to do a 1:1 mapping of nodes to pods. https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#namespace-selector. Likely we want a variable to control this when creating the replicated job, jobset, etc.
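As a rough sketch, the affinity stanza could be generated like this in Python. The field names (`podAntiAffinity`, `requiredDuringSchedulingIgnoredDuringExecution`, `labelSelector`, `topologyKey`) are standard Kubernetes pod spec fields; the helper functions, the `anti_affinity` toggle, and the `app=metrics` label are assumptions made for illustration.

```python
def hostname_anti_affinity(label_key, label_value):
    """Build an affinity stanza that keeps matching pods on separate nodes."""
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {
                        "matchExpressions": [
                            {
                                "key": label_key,
                                "operator": "In",
                                "values": [label_value],
                            }
                        ]
                    },
                    # One pod per distinct hostname, i.e. one pod per node
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


def pod_spec(containers, anti_affinity=False, label=("app", "metrics")):
    """Hypothetical pod spec builder with a variable controlling anti-affinity."""
    spec = {"containers": containers}
    if anti_affinity:
        spec["affinity"] = hostname_anti_affinity(*label)
    return spec


spec = pod_spec([{"name": "worker", "image": "example"}], anti_affinity=True)
```

Keeping the toggle off by default leaves existing behavior unchanged for anyone who does not need the 1:1 mapping.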

operator needs pre-defined delineation of sections and settings

I'm writing a small Python parsing library for metric logs, and I realize we need:

structured way to define different sections for splitting
timestamp between each section collection
dump of options / settings at the beginning (we can get this from the spec, but it is better not to rely on it and to keep it with the log).

Python helper library should parse metric sectioned output
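To make the three requirements concrete, here is a sketch of a parser for such a log. The marker strings (`METADATA START`, `METADATA END`, `SECTION`) and the JSON settings dump are assumptions chosen for the example, not the format the operator emits.

```python
import json

# Hypothetical delimiters; the operator would need to define the real ones
METADATA_START = "METADATA START"
METADATA_END = "METADATA END"
SECTION_MARKER = "SECTION"


def parse_log(text):
    """Split a metric log into a settings dict and timestamped sections."""
    metadata_lines, sections = [], []
    in_metadata = False
    for line in text.splitlines():
        if line.startswith(METADATA_START):
            in_metadata = True
        elif line.startswith(METADATA_END):
            in_metadata = False
        elif in_metadata:
            metadata_lines.append(line)
        elif line.startswith(SECTION_MARKER):
            # Everything after the marker is treated as the timestamp
            timestamp = line[len(SECTION_MARKER):].strip()
            sections.append({"timestamp": timestamp, "lines": []})
        elif sections:
            sections[-1]["lines"].append(line)
    metadata = json.loads("\n".join(metadata_lines)) if metadata_lines else {}
    return metadata, sections


example = """METADATA START
{"metric": "io-sysstat", "rate": 10}
METADATA END
SECTION 2023-01-01T00:00:00
cpu 1.0
SECTION 2023-01-01T00:00:10
cpu 2.0
"""
metadata, sections = parse_log(example)
```

Because the settings travel with the log, the parser never has to consult the original spec.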

Devise registry strategy

I should be able to search for and view metrics by type, and get a description / link to more information. Ideally this could be derived via another command provided by the operator that parses metadata.

Consider mode to save entire jobset yaml

If we want to be able to reproduce a run, we could arguably generate yaml and save to logs: https://github.com/flux-framework/flux-operator/blob/b54246feaba2ca7abeca62efd42accbbaacff13e/controllers/flux/logging.go

OR we could provide a means to do this via Python (probably the better idea).

Addons to create (or add, lol)

Timing addon: should prefix the command with time (assumed time available in container)
Commands addon: should allow for arbitrary post commands to any entrypoint
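The two addons could be sketched as plain functions over an entrypoint command. The function names and the `./app --iterations 10` command are made up for the example, and the sketch assumes the entrypoint is rendered as a shell script.

```python
import shlex


def timing_addon(command):
    """Prefix an entrypoint command with `time` (assumed present in the container)."""
    return ["time"] + list(command)


def commands_addon(command, post):
    """Render a script that runs the entrypoint, then each post command in order."""
    lines = [shlex.join(command)]
    lines.extend(post)
    return "\n".join(lines)


entrypoint = ["./app", "--iterations", "10"]  # hypothetical application command
script = commands_addon(timing_addon(entrypoint), ["echo done"])
```

Since each addon only transforms the command, the two compose in either order.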

Just here to say ....

Love the logo!

Add osu-benchmarks metric
