redhat-developer / osd-monitor-poc

Shell 18.19% HTML 76.37% Dockerfile 5.44%

osd-monitor-poc's Introduction

This repository collects a number of interrelated Dockerfiles for building openshift.io's self-monitoring facilities.
They are based on Performance Co-Pilot (www.pcp.io) RPMs, built into small special-purpose configurations.

osd-monitor-poc's People

kbsingh jfchevrette vpavlin aditya-konarde goodwinos alexxnica kryndex aslakknutsen xyntrix jmelis riuvshin dipak-pawar alexeykazakov snowind skabashnyuk matousjobanek miteshvp deepak1725

osd-monitor-poc's Issues

fix osd->zabbix feed reliability

internal dtsd/housekeeping/issues/926

replace webapi-guard components with oauth_proxy

The OSD-side webapi-guard components should integrate into OAUTH SSO and offer programmatic access to the timeseries guarded by pmwebd. File service provided by the current guard httpd could probably be done by pmwebd too, so we wouldn't need an apache httpd.

https://github.com/bitly/oauth2_proxy

a container is already built into the OSIO registry

OSD: bayesian & keycloak monitoring

Both bayesian and keycloak namespaces on prod & stage have some OSD-style pcp instrumentation in them, and a functional local osd-monitor pod. However, neither reports to zabbix, though this may be necessary. This is a placeholder for this known absence.

toy jaeger configuration for rhche

@ibuziuk From gazing at eclipse-che's repo on github, it seems the configuration change you need in order to use the toy jaeger now in osd-monitor in -stg is to add the following to your env/configmap:

CHE_TRACING_ENABLED: "true"
JAEGER_ENDPOINT: "http://osd-monitor-jaeger:14268/api/traces"
JAEGER_SERVICE_NAME: "che-server"
JAEGER_SAMPLER_MANAGER_HOST_PORT: "osd-monitor-jaeger:5778"
JAEGER_SAMPLER_TYPE: "const"
JAEGER_SAMPLER_PARAM: "1"
JAEGER_REPORTER_MAX_QUEUE_SIZE: "10000"

OSO: pull data from OSO prometheus server

There are indications that some of our target OSO clusters may expose a usable prometheus-server federation-type webapi, via which we might be able to pull in at least basic container-level system data. We need information & testing to see this work (@kbsingh), and then some pcp work in order to exploit it.

performancecopilot/pcp#444
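As a rough illustration of what a federation-type pull looks like, the sketch below builds a request URL for Prometheus's `/federate` endpoint, which takes one `match[]` parameter per series selector. The server address and the selector are hypothetical placeholders, and whether the OSO clusters actually expose this endpoint is exactly what still needs testing.

```python
from urllib.parse import urlencode


def federation_url(base_url, selectors):
    """Build a Prometheus /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    # Hypothetical server address and selector, for illustration only.
    print(federation_url("http://prometheus.example:9090",
                         ['{job="kubernetes-cadvisor"}']))
```

The response to such a request is in the Prometheus text exposition format, so the same scraping path used for other Prometheus endpoints should in principle apply.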

OSO: tenant enumeration

For the oso-monitor to reliably track tenants that come and go, it needs this:

fabric8-services/fabric8-tenant#369
= fabric8-services/fabric8-tenant#371

This reverse lookup facility from tenant-URL to tenant may help for tenant-initiated introduction, but is unlikely to be sufficient as the OSD side may lose such messages. For redundancy/reliability we should have both.
fabric8-services/fabric8-tenant#476

Partitioning tenants by 'profile' will allow grouping them into different levels of metric collection/retention categories.
fabric8-services/fabric8-tenant#355
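The redundancy argument above (tenant-initiated messages can be lost, so periodic enumeration is also needed) amounts to reconciling the monitor's known tenant set against a fresh enumeration. A minimal sketch, assuming tenants are identified by plain strings; the function name and the notion of starting or stopping a per-tenant logger are hypothetical, not part of the fabric8-tenant API:

```python
def reconcile(known, enumerated):
    """Compare the tenants currently tracked with a fresh enumeration.

    Returns (to_start, to_stop): tenants that appeared since the last pass
    and tenants that are no longer listed.
    """
    known, enumerated = set(known), set(enumerated)
    return sorted(enumerated - known), sorted(known - enumerated)


if __name__ == "__main__":
    # Hypothetical tenant names.
    to_start, to_stop = reconcile({"alice", "bob"}, {"bob", "carol"})
    print("start monitoring:", to_start)
    print("stop monitoring:", to_stop)
```

Running such a pass on a timer would pick up any tenant whose introduction message was missed, at the cost of a delay of up to one enumeration interval.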

OSO: tenant end-user code, instrumentation insertion

In order to collect metrics from tenant-written code, it seems the best way is for us to manipulate the DCs for the TENANT-{run,stage} namespaces/pods before they're activated. @aslakknutsen

fabric8-services/fabric8-tenant#512

Add repository description in README.md

Use environment variable for devshift tag length in cico_build.sh

cico_build.sh needs to use the environment variable DEVSHIFT_TAG_LEN instead of a hardcoded value.
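The pattern being asked for is "read the length from the environment, fall back to a default". The sketch below shows it in Python for illustration only; cico_build.sh itself is a shell script. The fallback of 7 and the idea that the tag is a truncated commit hash are both assumptions, since the issue does not state them.

```python
import os

# Assumed fallback; the value currently hardcoded in cico_build.sh is not stated in the issue.
DEFAULT_TAG_LEN = 7


def short_tag(commit, environ=os.environ):
    """Truncate a commit hash to DEVSHIFT_TAG_LEN characters (or the default)."""
    length = int(environ.get("DEVSHIFT_TAG_LEN", DEFAULT_TAG_LEN))
    return commit[:length]


if __name__ == "__main__":
    # Hypothetical commit hash.
    print(short_tag("0123456789abcdef0123456789abcdef01234567"))
```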

OSO: activate jenkins monitoring

fabric8io/fabric8-build-team#24

activate osd pcp monitoring for fabric8-auth-proxy

including its prometheus data

publish time series access API

pmwebd's graphite webapi would be the default way to access metric time series by OSIO components, whether for OSD-side components themselves, or tenant OSO side code. This API needs to be better documented, secured, exposed, pushed, filed, stamped, indexed, briefed, debriefed, and numbered.
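As a starting point for the documentation, here is a sketch of how a client might form a Graphite-style render request. The `target`, `from`, `until`, and `format` parameters are those of the standard Graphite render API; the host name, the `/graphite` path prefix, and the metric path in the example are assumptions and should be checked against the deployed pmwebd.

```python
from urllib.parse import urlencode


def render_url(base_url, targets, since="-1h", until="now"):
    """Build a Graphite render URL asking for JSON time series."""
    params = [("target", target) for target in targets]
    params += [("from", since), ("until", until), ("format", "json")]
    return f"{base_url.rstrip('/')}/render?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host, path prefix, and metric path.
    print(render_url("http://osd-monitor:44323/graphite",
                     ["somehost.kernel.all.load.1_minute"]))
```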

activate pcp pmcd+pmda logging to stderr

performancecopilot/pcp#365

OSO: activate eclipse-ws pod monitoring

redhat-developer/rh-che#522

pcp repos

Frank, regarding the various pcp.repo used in the Dockerfiles - these are pointing to https://copr-be.cloud.fedoraproject.org/results/fche/pcp/epel-7-$basearch/

Is that built from your pcpfans tree? I noticed there are undocumented pmwebd options being used, and likely quite a lot of other stuff that hasn't been merged with pcp upstream. This is something I can help with, but would need some guidelines on what you want to keep in your pcpfans tree, what's experimental, and what's only temporary.

zabbix low-level-discovery metrics needed

performancecopilot/pcp#439
internal dtsd/housekeeping/issues/1464
@mmclanerh

pmdapostgresql regression with >4.2 pcp

tracking progress being made at performancecopilot/pcp#590

PCP graphite function support

For supplying more cooked timeseries results via pmwebd's graphite webapi, this is needed. Work in progress. @aslakknutsen

performancecopilot/pcp#122

need pcp2zabbix with PMNS_CHANGED / optional-metric tolerance

performancecopilot/pcp#521

cc: @aditya-konarde

activate pcp archive compression

This should in theory reduce our PV disk space requirements by a factor greater than 10, though in exchange for more temporary RAM. performancecopilot/pcp#422 performancecopilot/pcp#386

OSO scaleup testing

We need to estimate the limitations of an oso-monitor pod's fanout:

how many concurrent pmloggers can it practically handle?
does that number match up with pmdaprometheus fanout limits?
how well does pmwebd handle forecast-typical queries about so many fresh archives?
does pmwebd need to be parallelized further, or replicated?
if any of these limits is too small to encompass the entire online-OSO-tenant population, we need to extend oso-monitor to shard users into partitions, and add a pmwebd-demultiplexing proxy in front.
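If sharding does turn out to be necessary, one simple way to assign tenants to partitions is a stable hash of the tenant name. This is a sketch under that assumption, not a description of anything oso-monitor does today; the function names and tenant names are hypothetical.

```python
import hashlib


def shard_for(tenant, shard_count):
    """Map a tenant name to a shard index that is stable across runs and hosts."""
    digest = hashlib.sha256(tenant.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def partition(tenants, shard_count):
    """Group tenants into shard_count lists, one per oso-monitor partition."""
    shards = [[] for _ in range(shard_count)]
    for tenant in tenants:
        shards[shard_for(tenant, shard_count)].append(tenant)
    return shards


if __name__ == "__main__":
    # Hypothetical tenant names.
    for index, members in enumerate(partition(["alice", "bob", "carol", "dave"], 2)):
        print(index, members)
```

A demultiplexing proxy could use the same `shard_for` function to route a query to the partition holding a tenant's archives. Note that plain modulo hashing moves most tenants when the shard count changes; a consistent-hashing scheme would reduce that churn.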

Enabling monitoring for rhche on dsaas / dsaas-stg and exposing metrics to zabbix

rhche-host service on dsaas / dsaas-stg exposes 8087 port for obtaining metrics in Prometheus format:

rhche-host ClusterIP 172.30.149.180 <none> 8080/TCP,8087/TCP 52d

Currently it is possible to obtain (ClassLoader / JVM / Tomcat) metrics from osd monitor via the service name & port combo, e.g. curl rhche-host:8087

Those metrics need to be consumed & visualized by osd monitor + exposed to zabbix. Currently the most important metrics are the following:

# HELP jvm_threads_peak_threads The peak live thread count since the Java virtual machine started or peak was reset
# TYPE jvm_threads_peak_threads gauge
jvm_threads_peak_threads 43.0
# HELP jvm_threads_live_threads The current number of live threads including both daemon and non-daemon threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads 43.0
# HELP jvm_threads_states_threads The current number of threads having NEW state
# TYPE jvm_threads_states_threads gauge
jvm_threads_states_threads{state="new",} 0.0
jvm_threads_states_threads{state="runnable",} 11.0
jvm_threads_states_threads{state="blocked",} 0.0
jvm_threads_states_threads{state="waiting",} 25.0
jvm_threads_states_threads{state="timed-waiting",} 7.0
jvm_threads_states_threads{state="terminated",} 0.0
# HELP jvm_threads_daemon_threads The current number of live daemon threads
# TYPE jvm_threads_daemon_threads gauge
jvm_threads_daemon_threads 38.0

# HELP jvm_memory_committed_bytes The amount of memory in bytes that is committed for the Java virtual machine to use
# TYPE jvm_memory_committed_bytes gauge
jvm_memory_committed_bytes{area="nonheap",id="Code Cache",} 2.3396352E7
jvm_memory_committed_bytes{area="nonheap",id="Metaspace",} 6.9337088E7
jvm_memory_committed_bytes{area="nonheap",id="Compressed Class Space",} 8519680.0
jvm_memory_committed_bytes{area="heap",id="PS Eden Space",} 1.5204352E7
jvm_memory_committed_bytes{area="heap",id="PS Survivor Space",} 524288.0
jvm_memory_committed_bytes{area="heap",id="PS Old Gen",} 4.9283072E7
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="nonheap",id="Code Cache",} 2.2897152E7
jvm_memory_used_bytes{area="nonheap",id="Metaspace",} 6.7874864E7
jvm_memory_used_bytes{area="nonheap",id="Compressed Class Space",} 8108328.0
jvm_memory_used_bytes{area="heap",id="PS Eden Space",} 1.322024E7
jvm_memory_used_bytes{area="heap",id="PS Survivor Space",} 484352.0
jvm_memory_used_bytes{area="heap",id="PS Old Gen",} 4.0086792E7
# HELP jvm_memory_max_bytes The maximum amount of memory in bytes that can be used for memory management
# TYPE jvm_memory_max_bytes gauge
jvm_memory_max_bytes{area="nonheap",id="Code Cache",} 2.5165824E8
jvm_memory_max_bytes{area="nonheap",id="Metaspace",} -1.0
jvm_memory_max_bytes{area="nonheap",id="Compressed Class Space",} 1.073741824E9
jvm_memory_max_bytes{area="heap",id="PS Eden Space",} 1.77733632E8
jvm_memory_max_bytes{area="heap",id="PS Survivor Space",} 524288.0
jvm_memory_max_bytes{area="heap",id="PS Old Gen",} 3.58088704E8

# HELP jvm_buffer_count_buffers An estimate of the number of buffers in the pool
# TYPE jvm_buffer_count_buffers gauge
jvm_buffer_count_buffers{id="direct",} 17.0
jvm_buffer_count_buffers{id="mapped",} 0.0
# HELP jvm_buffer_memory_used_bytes An estimate of the memory that the Java virtual machine is using for this buffer pool
# TYPE jvm_buffer_memory_used_bytes gauge
jvm_buffer_memory_used_bytes{id="direct",} 515770.0
jvm_buffer_memory_used_bytes{id="mapped",} 0.0
# HELP jvm_buffer_total_capacity_bytes An estimate of the total capacity of the buffers in this pool
# TYPE jvm_buffer_total_capacity_bytes gauge
jvm_buffer_total_capacity_bytes{id="direct",} 515770.0
jvm_buffer_total_capacity_bytes{id="mapped",} 0.0

# HELP jvm_gc_memory_allocated_bytes_total Incremented for an increase in the size of the young generation memory pool after one GC to before the next
# TYPE jvm_gc_memory_allocated_bytes_total counter
jvm_gc_memory_allocated_bytes_total 5.75668224E8
# HELP jvm_gc_max_data_size_bytes Max size of old generation memory pool
# TYPE jvm_gc_max_data_size_bytes gauge
jvm_gc_max_data_size_bytes 3.58088704E8
# HELP jvm_gc_live_data_size_bytes Size of old generation memory pool after a full GC
# TYPE jvm_gc_live_data_size_bytes gauge
jvm_gc_live_data_size_bytes 3.9021816E7
# HELP jvm_gc_pause_seconds Time spent in GC pause
# TYPE jvm_gc_pause_seconds summary
jvm_gc_pause_seconds_count{action="end of minor GC",cause="Allocation Failure",} 37.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="Allocation Failure",} 0.444
jvm_gc_pause_seconds_count{action="end of minor GC",cause="GCLocker Initiated GC",} 1.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="GCLocker Initiated GC",} 0.006
jvm_gc_pause_seconds_count{action="end of major GC",cause="Ergonomics",} 1.0
jvm_gc_pause_seconds_sum{action="end of major GC",cause="Ergonomics",} 0.136
# HELP jvm_gc_pause_seconds_max Time spent in GC pause
# TYPE jvm_gc_pause_seconds_max gauge
jvm_gc_pause_seconds_max{action="end of minor GC",cause="Allocation Failure",} 0.045
jvm_gc_pause_seconds_max{action="end of minor GC",cause="GCLocker Initiated GC",} 0.006
jvm_gc_pause_seconds_max{action="end of major GC",cause="Ergonomics",} 0.136
# HELP jvm_gc_memory_promoted_bytes_total Count of positive increases in the size of the old generation memory pool before GC to after GC
# TYPE jvm_gc_memory_promoted_bytes_total counter
jvm_gc_memory_promoted_bytes_total 1.0042528E7


# HELP process_files_max_files The maximum file descriptor count
# TYPE process_files_max_files gauge
process_files_max_files 1048576.0
# HELP process_files_open_files The open file descriptor count
# TYPE process_files_open_files gauge
process_files_open_files 77.0



# HELP process_cpu_usage The "recent cpu usage" for the Java Virtual Machine process
# TYPE process_cpu_usage gauge
process_cpu_usage 0.002553191489361702
# HELP system_cpu_count The number of processors available to the Java virtual machine
# TYPE system_cpu_count gauge
system_cpu_count 4.0
# HELP system_cpu_usage The "recent cpu usage" for the whole system
# TYPE system_cpu_usage gauge
system_cpu_usage 0.13787234042553193

The number of live threads is currently by far the most important metric, since it would allow us to investigate the P1 issue openshiftio/openshift.io#4626
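To make the consumption side concrete, the sketch below parses the Prometheus text format shown above into (name, labels, value) tuples and picks out the live thread count. It is a simplified parser written for this illustration: it handles the trailing comma inside the label braces seen in this output, but ignores optional timestamps and is not a substitute for a full exposition-format parser.

```python
import re

# metric_name, optional {labels}, then a single value token
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# label="value" pairs; tolerates escaped characters inside the quotes
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) for each sample line in text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything that is not a plain "name{labels} value" line
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    sample = (
        "# TYPE jvm_threads_live_threads gauge\n"
        "jvm_threads_live_threads 43.0\n"
        'jvm_threads_states_threads{state="runnable",} 11.0\n'
    )
    for name, labels, value in parse_metrics(sample):
        if name == "jvm_threads_live_threads":
            print("live threads:", value)
```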

Add liveness and readiness probes to osd-monitor.yml

The deployment config osd-monitor.yml needs to have:
For API containers

A liveness probe
A readiness probe
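The shape of the two probe stanzas can be sketched as follows. The field names (`livenessProbe`, `readinessProbe`, `httpGet`, `initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`) are the standard Kubernetes/OpenShift container probe fields; the container name, path, port, and timing values are placeholders, since the issue does not say which endpoint the API containers should be probed on.

```python
import json


def http_probe(path, port, initial_delay=10, period=10, timeout=1):
    """Build an HTTP GET probe definition."""
    return {
        "httpGet": {"path": path, "port": port},
        "initialDelaySeconds": initial_delay,
        "periodSeconds": period,
        "timeoutSeconds": timeout,
    }


def with_probes(container, path, port):
    """Return a copy of a container spec with liveness and readiness probes added."""
    patched = dict(container)
    # Give the liveness probe a longer initial delay so a slow start is not killed.
    patched["livenessProbe"] = http_probe(path, port, initial_delay=30)
    patched["readinessProbe"] = http_probe(path, port)
    return patched


if __name__ == "__main__":
    # Hypothetical container name, path, and port.
    container = with_probes({"name": "api"}, "/", 8080)
    print(json.dumps(container, indent=2))
```

The printed JSON can be pasted into the container entry of osd-monitor.yml, since YAML parsers generally accept JSON syntax.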

PCP dynamic pmlogger

Required to make OSO tenant logging as efficient/dynamic as possible.

Already in good hands:
performancecopilot/pcp#372
