
osd-monitor-poc's Introduction

  • This repository collects a number of interrelated Dockerfiles for building openshift.io's self-monitoring facilities.

  • They are based on Performance Co-Pilot (www.pcp.io) RPMs, built into small special-purpose configurations.

osd-monitor-poc's People

Contributors

aditya-konarde, alexeykazakov, fche, jfchevrette, jmelis, kbsingh, matousjobanek, miteshvp, skabashnyuk, vpavlin, xcoulon

osd-monitor-poc's Issues

OSD: bayesian & keycloak monitoring

Both bayesian and keycloak namespaces on prod & stage have some OSD-style pcp instrumentation in them, and a functional local osd-monitor pod. However, neither reports to zabbix, though this may be necessary. This is a placeholder for this known absence.

toy jaeger configuration for rhche

@ibuziuk From gazing at eclipse-che's repo on github, it seems the configuration change you need in order to use the toy jaeger now in osd-monitor in -stg is to add the following to your env/configmap:

CHE_TRACING_ENABLED: "true"
JAEGER_ENDPOINT: "http://osd-monitor-jaeger:14268/api/traces"
JAEGER_SERVICE_NAME: "che-server"
JAEGER_SAMPLER_MANAGER_HOST_PORT: "osd-monitor-jaeger:5778"
JAEGER_SAMPLER_TYPE: "const"
JAEGER_SAMPLER_PARAM: "1"
JAEGER_REPORTER_MAX_QUEUE_SIZE: "10000"

OSO: pull data from OSO prometheus server

There are indications that some of our target OSO clusters may expose a usable prometheus-server federation-type webapi, via which we might be able to pull in at least basic container-level system data. We need information & testing to see this work (@kbsingh), and then some pcp work in order to exploit it.

performancecopilot/pcp#444
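
As a rough illustration of what pulling from a federation-type webapi could look like, here is a minimal sketch that builds a request URL for Prometheus's /federate endpoint, which takes one match[] parameter per series selector. The host name and the container_ selector are assumptions for illustration only; nothing here has been tested against an OSO cluster.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def federate_url(base_url, selectors):
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", s) for s in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical server address and selector, chosen only for the example.
url = federate_url(
    "http://prometheus.example:9090",
    ['{__name__=~"container_.*"}'],
)
print(url)

# Round-trip check: the selector survives URL encoding unchanged.
print(parse_qs(urlparse(url).query)["match[]"])
```

The response to such a request is plain Prometheus text exposition, so whatever scrapes rhche-style endpoints could in principle consume it too.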

OSO: tenant enumeration

For the oso-monitor to reliably track tenants that come and go, it needs this:

fabric8-services/fabric8-tenant#369
= fabric8-services/fabric8-tenant#371

This reverse lookup facility from tenant-URL to tenant may help for tenant-initiated introduction, but is unlikely to be sufficient as the OSD side may lose such messages. For redundancy/reliability we should have both.
fabric8-services/fabric8-tenant#476

Partitioning tenants by 'profile' will allow grouping them into different levels of metric collection/retention categories.
fabric8-services/fabric8-tenant#355
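
To illustrate why a full enumeration gives redundancy over tenant-initiated messages, here is a minimal sketch of a periodic reconciliation step. It assumes the monitor keeps a set of tenants it currently tracks and can obtain a complete tenant list from somewhere; both inputs are plain strings here, and no real fabric8-tenant API is called.

```python
def reconcile(monitored, enumerated):
    """Compare the tenants being tracked with a fresh full enumeration.

    Returns (to_start, to_stop): tenants that appeared without us
    noticing, and tenants that went away without us noticing.
    """
    monitored, enumerated = set(monitored), set(enumerated)
    to_start = sorted(enumerated - monitored)
    to_stop = sorted(monitored - enumerated)
    return to_start, to_stop


# Hypothetical tenant names, for illustration only.
tracked = {"alice", "bob"}
fresh_list = {"bob", "carol"}
print(reconcile(tracked, fresh_list))
```

Running such a pass on a timer would repair the state after any lost introduction message, at the cost of one full listing per pass.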

publish time series access API

pmwebd's graphite webapi would be the default way to access metric time series by OSIO components, whether for OSD-side components themselves, or tenant OSO side code. This API needs to be better documented, secured, exposed, pushed, filed, stamped, indexed, briefed, debriefed, and numbered.
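
Until that documentation exists, here is a hedged sketch of what a client query might look like. It assumes pmwebd serves Graphite's render API under /graphite/render, accepts the usual target / from / format=json parameters, and answers in Graphite's JSON shape of [value, timestamp] pairs. The host, port, and metric path are hypothetical, and the sample response is made up.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def render_url(base_url, target, since="-1h"):
    """Build a Graphite-style render URL asking for JSON output."""
    params = [("target", target), ("from", since), ("format", "json")]
    return f"{base_url.rstrip('/')}/graphite/render?{urlencode(params)}"


def latest_values(series_list):
    """Map each series target to its most recent non-null value (or None)."""
    result = {}
    for series in series_list:
        values = [v for v, _ts in series["datapoints"] if v is not None]
        result[series["target"]] = values[-1] if values else None
    return result


# Hypothetical host and metric path; 44323 is assumed to be the pmwebd port.
url = render_url("http://osd-monitor:44323", "*.kernel.all.load.*")

# Made-up response body in Graphite's JSON layout.
sample = [
    {"target": "host1.kernel.all.load.1", "datapoints": [[0.5, 100], [None, 160]]}
]
print(url)
print(latest_values(sample))
```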

pcp repos

Frank, regarding the various pcp.repo used in the Dockerfiles - these are pointing to https://copr-be.cloud.fedoraproject.org/results/fche/pcp/epel-7-$basearch/

Is that built from your pcpfans tree? I noticed there are undocumented pmwebd options being used, and likely quite a lot of other stuff that hasn't been merged with pcp upstream. This is something I can help with, but would need some guidelines on what you want to keep in your pcpfans tree, what's experimental, and what's only temporary.

OSO scaleup testing

We need to estimate the limitations of an oso-monitor pod's fanout:

  • how many concurrent pmloggers can it practically handle?
  • does that number match up with pmdaprometheus fanout limits?
  • how well does pmwebd handle forecast-typical queries about so many fresh archives?
  • does pmwebd need to be parallelized further, or replicated?
  • if any of these limits are too small to encompass the entire online-OSO-tenant population, we need to extend oso-monitor to shard users into partitions, and add a pmwebd-demultiplexing proxy in the front.
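
If sharding does turn out to be needed, one simple way to partition users is a stable hash of the tenant identifier. The sketch below is only an assumption about how that could be done, not a description of oso-monitor; the tenant names and the shard count are invented for the example.

```python
import hashlib


def shard_for(tenant_id, num_shards):
    """Return a stable shard index in [0, num_shards) for a tenant."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(tenants, num_shards):
    """Group tenants into num_shards lists according to shard_for()."""
    shards = [[] for _ in range(num_shards)]
    for tenant in tenants:
        shards[shard_for(tenant, num_shards)].append(tenant)
    return shards


# Hypothetical tenants and shard count.
tenants = [f"tenant-{i}" for i in range(10)]
for index, members in enumerate(partition(tenants, 3)):
    print(index, members)
```

A front proxy could use the same shard_for() call to route a query to the pmwebd holding that tenant's archives. Note that plain modulo hashing moves most tenants when the shard count changes; consistent hashing would be the usual alternative if shards are added often.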

Enabling monitoring for rhche on dsaas / dsaas-stg and exposing metrics to zabbix

The rhche-host service on dsaas / dsaas-stg exposes port 8087 for obtaining metrics in Prometheus format:

rhche-host ClusterIP 172.30.149.180 <none> 8080/TCP,8087/TCP 52d

Currently it is possible to obtain (ClassLoader / JVM / Tomcat) metrics from the osd monitor via the service name & port combo, e.g. curl rhche-host:8087.

Those metrics need to be consumed & visualized by osd monitor + exposed to zabbix. Currently the most important metrics are the following:

# HELP jvm_threads_peak_threads The peak live thread count since the Java virtual machine started or peak was reset
# TYPE jvm_threads_peak_threads gauge
jvm_threads_peak_threads 43.0
# HELP jvm_threads_live_threads The current number of live threads including both daemon and non-daemon threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads 43.0
# HELP jvm_threads_states_threads The current number of threads having NEW state
# TYPE jvm_threads_states_threads gauge
jvm_threads_states_threads{state="new",} 0.0
jvm_threads_states_threads{state="runnable",} 11.0
jvm_threads_states_threads{state="blocked",} 0.0
jvm_threads_states_threads{state="waiting",} 25.0
jvm_threads_states_threads{state="timed-waiting",} 7.0
jvm_threads_states_threads{state="terminated",} 0.0
# HELP jvm_threads_daemon_threads The current number of live daemon threads
# TYPE jvm_threads_daemon_threads gauge
jvm_threads_daemon_threads 38.0

# HELP jvm_memory_committed_bytes The amount of memory in bytes that is committed for the Java virtual machine to use
# TYPE jvm_memory_committed_bytes gauge
jvm_memory_committed_bytes{area="nonheap",id="Code Cache",} 2.3396352E7
jvm_memory_committed_bytes{area="nonheap",id="Metaspace",} 6.9337088E7
jvm_memory_committed_bytes{area="nonheap",id="Compressed Class Space",} 8519680.0
jvm_memory_committed_bytes{area="heap",id="PS Eden Space",} 1.5204352E7
jvm_memory_committed_bytes{area="heap",id="PS Survivor Space",} 524288.0
jvm_memory_committed_bytes{area="heap",id="PS Old Gen",} 4.9283072E7
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="nonheap",id="Code Cache",} 2.2897152E7
jvm_memory_used_bytes{area="nonheap",id="Metaspace",} 6.7874864E7
jvm_memory_used_bytes{area="nonheap",id="Compressed Class Space",} 8108328.0
jvm_memory_used_bytes{area="heap",id="PS Eden Space",} 1.322024E7
jvm_memory_used_bytes{area="heap",id="PS Survivor Space",} 484352.0
jvm_memory_used_bytes{area="heap",id="PS Old Gen",} 4.0086792E7
# HELP jvm_memory_max_bytes The maximum amount of memory in bytes that can be used for memory management
# TYPE jvm_memory_max_bytes gauge
jvm_memory_max_bytes{area="nonheap",id="Code Cache",} 2.5165824E8
jvm_memory_max_bytes{area="nonheap",id="Metaspace",} -1.0
jvm_memory_max_bytes{area="nonheap",id="Compressed Class Space",} 1.073741824E9
jvm_memory_max_bytes{area="heap",id="PS Eden Space",} 1.77733632E8
jvm_memory_max_bytes{area="heap",id="PS Survivor Space",} 524288.0
jvm_memory_max_bytes{area="heap",id="PS Old Gen",} 3.58088704E8

# HELP jvm_buffer_count_buffers An estimate of the number of buffers in the pool
# TYPE jvm_buffer_count_buffers gauge
jvm_buffer_count_buffers{id="direct",} 17.0
jvm_buffer_count_buffers{id="mapped",} 0.0
# HELP jvm_buffer_memory_used_bytes An estimate of the memory that the Java virtual machine is using for this buffer pool
# TYPE jvm_buffer_memory_used_bytes gauge
jvm_buffer_memory_used_bytes{id="direct",} 515770.0
jvm_buffer_memory_used_bytes{id="mapped",} 0.0
# HELP jvm_buffer_total_capacity_bytes An estimate of the total capacity of the buffers in this pool
# TYPE jvm_buffer_total_capacity_bytes gauge
jvm_buffer_total_capacity_bytes{id="direct",} 515770.0
jvm_buffer_total_capacity_bytes{id="mapped",} 0.0

# HELP jvm_gc_memory_allocated_bytes_total Incremented for an increase in the size of the young generation memory pool after one GC to before the next
# TYPE jvm_gc_memory_allocated_bytes_total counter
jvm_gc_memory_allocated_bytes_total 5.75668224E8
# HELP jvm_gc_max_data_size_bytes Max size of old generation memory pool
# TYPE jvm_gc_max_data_size_bytes gauge
jvm_gc_max_data_size_bytes 3.58088704E8
# HELP jvm_gc_live_data_size_bytes Size of old generation memory pool after a full GC
# TYPE jvm_gc_live_data_size_bytes gauge
jvm_gc_live_data_size_bytes 3.9021816E7
# HELP jvm_gc_pause_seconds Time spent in GC pause
# TYPE jvm_gc_pause_seconds summary
jvm_gc_pause_seconds_count{action="end of minor GC",cause="Allocation Failure",} 37.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="Allocation Failure",} 0.444
jvm_gc_pause_seconds_count{action="end of minor GC",cause="GCLocker Initiated GC",} 1.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="GCLocker Initiated GC",} 0.006
jvm_gc_pause_seconds_count{action="end of major GC",cause="Ergonomics",} 1.0
jvm_gc_pause_seconds_sum{action="end of major GC",cause="Ergonomics",} 0.136
# HELP jvm_gc_pause_seconds_max Time spent in GC pause
# TYPE jvm_gc_pause_seconds_max gauge
jvm_gc_pause_seconds_max{action="end of minor GC",cause="Allocation Failure",} 0.045
jvm_gc_pause_seconds_max{action="end of minor GC",cause="GCLocker Initiated GC",} 0.006
jvm_gc_pause_seconds_max{action="end of major GC",cause="Ergonomics",} 0.136
# HELP jvm_gc_memory_promoted_bytes_total Count of positive increases in the size of the old generation memory pool before GC to after GC
# TYPE jvm_gc_memory_promoted_bytes_total counter
jvm_gc_memory_promoted_bytes_total 1.0042528E7


# HELP process_files_max_files The maximum file descriptor count
# TYPE process_files_max_files gauge
process_files_max_files 1048576.0
# HELP process_files_open_files The open file descriptor count
# TYPE process_files_open_files gauge
process_files_open_files 77.0



# HELP process_cpu_usage The "recent cpu usage" for the Java Virtual Machine process
# TYPE process_cpu_usage gauge
process_cpu_usage 0.002553191489361702
# HELP system_cpu_count The number of processors available to the Java virtual machine
# TYPE system_cpu_count gauge
system_cpu_count 4.0
# HELP system_cpu_usage The "recent cpu usage" for the whole system
# TYPE system_cpu_usage gauge
system_cpu_usage 0.13787234042553193
 

The number of live threads is currently by far the most important metric, since it would allow us to investigate the P1 issue openshiftio/openshift.io#4626
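
For reference, here is a minimal sketch of how the text format above could be parsed to pick out the thread metrics. It is a hand-rolled parser written for this illustration, not the pcp or zabbix ingestion path, and it assumes the simple line shape seen in the rhche output (optional labels with a trailing comma, optional timestamp). The sample text is copied from the listing above.

```python
import re

# name, optional {labels}, value, optional integer timestamp
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this sketch does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


SAMPLE = """\
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads 43.0
jvm_threads_states_threads{state="runnable",} 11.0
jvm_memory_used_bytes{area="heap",id="PS Old Gen",} 4.0086792E7
"""

for name, labels, value in parse_metrics(SAMPLE):
    if name.startswith("jvm_threads_"):
        print(name, labels, value)
```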
