osd-monitor-poc's Issues

OSD: bayesian & keycloak monitoring

Both bayesian and keycloak namespaces on prod & stage have some OSD-style pcp instrumentation in them, and a functional local osd-monitor pod. However, neither reports to zabbix, though this may be necessary. This is a placeholder for this known absence.

OSO scaleup testing

We need to estimate the limitations of an oso-monitor pod's fanout:

  • how many concurrent pmloggers can it practically handle?
  • does that number match up with pmdaprometheus fanout limits?
  • how well does pmwebd handle forecast-typical queries about so many fresh archives?
  • does pmwebd need to be parallelized further, or replicated?
  • if any of these limits are too small to encompass the entire online-OSO-tenant population, we need to extend oso-monitor to shard users into partitions, and add a pmwebd-demultiplexing proxy in the front.
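
For the sharding case in the last bullet, here is a minimal sketch of one possible tenant-to-partition mapping, using a stable hash so that every oso-monitor instance computes the same assignment. The function names, the tenant id strings, and the partition count are all hypothetical; nothing here is taken from the oso-monitor code.

```python
import hashlib


def partition_for(tenant_id: str, num_partitions: int) -> int:
    """Map a tenant id to a partition index in [0, num_partitions).

    sha256 is used instead of Python's built-in hash() because the
    built-in is randomized per process and so not stable across pods.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def shard_tenants(tenants, num_partitions):
    """Group tenant ids into num_partitions lists, one per monitor shard."""
    shards = [[] for _ in range(num_partitions)]
    for tenant in tenants:
        shards[partition_for(tenant, num_partitions)].append(tenant)
    return shards
```

A front-end proxy could call the same partition_for() to decide which pmwebd to forward a tenant's query to. Note that a plain modulo mapping reshuffles most tenants when the partition count changes; if that matters, consistent hashing would be the usual alternative.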

pcp repos

Frank, regarding the various pcp.repo used in the Dockerfiles - these are pointing to https://copr-be.cloud.fedoraproject.org/results/fche/pcp/epel-7-$basearch/

Is that built from your pcpfans tree? I noticed there are undocumented pmwebd options being used, and likely quite a lot of other stuff that hasn't been merged with pcp upstream. This is something I can help with, but would need some guidelines on what you want to keep in your pcpfans tree, what's experimental, and what's only temporary.

toy jaeger configuration for rhche

@ibuziuk From gazing at eclipse-che's repo on github, it seems the configuration change you need in order to use the toy jaeger now in osd-monitor in -stg is to add the following to your env/configmap:

CHE_TRACING_ENABLED: "true"
JAEGER_ENDPOINT: "http://osd-monitor-jaeger:14268/api/traces"
JAEGER_SERVICE_NAME: "che-server"
JAEGER_SAMPLER_MANAGER_HOST_PORT: "osd-monitor-jaeger:5778"
JAEGER_SAMPLER_TYPE: "const"
JAEGER_SAMPLER_PARAM: "1"
JAEGER_REPORTER_MAX_QUEUE_SIZE: "10000"

OSO: pull data from OSO prometheus server

There are indications that some of our target OSO clusters may expose a usable prometheus-server federation-type webapi, via which we might be able to pull in at least basic container-level system data. We need information & testing to see this work (@kbsingh), and then some pcp work in order to exploit it.
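
As a rough illustration of what a federation pull looks like: a Prometheus server serves federation at /federate and selects series via repeated match[] query parameters. The sketch below only builds such a URL. The host name and the selector in the usage note are made-up examples, and whether the OSO clusters actually expose this endpoint is exactly what still needs testing.

```python
from urllib.parse import urlencode


def federate_url(base_url, matchers):
    """Build a Prometheus /federate URL for the given series selectors.

    base_url is the (assumed) address of the cluster's Prometheus server;
    each entry in matchers becomes one match[] query parameter.
    """
    query = urlencode([("match[]", m) for m in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"
```

For example, federate_url("http://prometheus.example:9090", ['{job="kubernetes-cadvisor"}']) would request all series for a hypothetical cadvisor job; the response is in the same text exposition format that pmdaprometheus already consumes.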

performancecopilot/pcp#444

Enabling monitoring for rhche on dsaas / dsaas-stg and exposing metrics to zabbix

The rhche-host service on dsaas / dsaas-stg exposes port 8087 for obtaining metrics in Prometheus format:

rhche-host ClusterIP 172.30.149.180 <none> 8080/TCP,8087/TCP 52d

Currently it is possible to obtain (ClassLoader / JVM / Tomcat) metrics from osd monitor via the service name & port combo, e.g. curl rhche-host:8087.

Those metrics need to be consumed & visualized by osd monitor + exposed to zabbix. Currently the most important metrics are the following:

# HELP jvm_threads_peak_threads The peak live thread count since the Java virtual machine started or peak was reset
# TYPE jvm_threads_peak_threads gauge
jvm_threads_peak_threads 43.0
# HELP jvm_threads_live_threads The current number of live threads including both daemon and non-daemon threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads 43.0
# HELP jvm_threads_states_threads The current number of threads having NEW state
# TYPE jvm_threads_states_threads gauge
jvm_threads_states_threads{state="new",} 0.0
jvm_threads_states_threads{state="runnable",} 11.0
jvm_threads_states_threads{state="blocked",} 0.0
jvm_threads_states_threads{state="waiting",} 25.0
jvm_threads_states_threads{state="timed-waiting",} 7.0
jvm_threads_states_threads{state="terminated",} 0.0
# HELP jvm_threads_daemon_threads The current number of live daemon threads
# TYPE jvm_threads_daemon_threads gauge
jvm_threads_daemon_threads 38.0

# HELP jvm_memory_committed_bytes The amount of memory in bytes that is committed for the Java virtual machine to use
# TYPE jvm_memory_committed_bytes gauge
jvm_memory_committed_bytes{area="nonheap",id="Code Cache",} 2.3396352E7
jvm_memory_committed_bytes{area="nonheap",id="Metaspace",} 6.9337088E7
jvm_memory_committed_bytes{area="nonheap",id="Compressed Class Space",} 8519680.0
jvm_memory_committed_bytes{area="heap",id="PS Eden Space",} 1.5204352E7
jvm_memory_committed_bytes{area="heap",id="PS Survivor Space",} 524288.0
jvm_memory_committed_bytes{area="heap",id="PS Old Gen",} 4.9283072E7
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="nonheap",id="Code Cache",} 2.2897152E7
jvm_memory_used_bytes{area="nonheap",id="Metaspace",} 6.7874864E7
jvm_memory_used_bytes{area="nonheap",id="Compressed Class Space",} 8108328.0
jvm_memory_used_bytes{area="heap",id="PS Eden Space",} 1.322024E7
jvm_memory_used_bytes{area="heap",id="PS Survivor Space",} 484352.0
jvm_memory_used_bytes{area="heap",id="PS Old Gen",} 4.0086792E7
# HELP jvm_memory_max_bytes The maximum amount of memory in bytes that can be used for memory management
# TYPE jvm_memory_max_bytes gauge
jvm_memory_max_bytes{area="nonheap",id="Code Cache",} 2.5165824E8
jvm_memory_max_bytes{area="nonheap",id="Metaspace",} -1.0
jvm_memory_max_bytes{area="nonheap",id="Compressed Class Space",} 1.073741824E9
jvm_memory_max_bytes{area="heap",id="PS Eden Space",} 1.77733632E8
jvm_memory_max_bytes{area="heap",id="PS Survivor Space",} 524288.0
jvm_memory_max_bytes{area="heap",id="PS Old Gen",} 3.58088704E8

# HELP jvm_buffer_count_buffers An estimate of the number of buffers in the pool
# TYPE jvm_buffer_count_buffers gauge
jvm_buffer_count_buffers{id="direct",} 17.0
jvm_buffer_count_buffers{id="mapped",} 0.0
# HELP jvm_buffer_memory_used_bytes An estimate of the memory that the Java virtual machine is using for this buffer pool
# TYPE jvm_buffer_memory_used_bytes gauge
jvm_buffer_memory_used_bytes{id="direct",} 515770.0
jvm_buffer_memory_used_bytes{id="mapped",} 0.0
# HELP jvm_buffer_total_capacity_bytes An estimate of the total capacity of the buffers in this pool
# TYPE jvm_buffer_total_capacity_bytes gauge
jvm_buffer_total_capacity_bytes{id="direct",} 515770.0
jvm_buffer_total_capacity_bytes{id="mapped",} 0.0

# HELP jvm_gc_memory_allocated_bytes_total Incremented for an increase in the size of the young generation memory pool after one GC to before the next
# TYPE jvm_gc_memory_allocated_bytes_total counter
jvm_gc_memory_allocated_bytes_total 5.75668224E8
# HELP jvm_gc_max_data_size_bytes Max size of old generation memory pool
# TYPE jvm_gc_max_data_size_bytes gauge
jvm_gc_max_data_size_bytes 3.58088704E8
# HELP jvm_gc_live_data_size_bytes Size of old generation memory pool after a full GC
# TYPE jvm_gc_live_data_size_bytes gauge
jvm_gc_live_data_size_bytes 3.9021816E7
# HELP jvm_gc_pause_seconds Time spent in GC pause
# TYPE jvm_gc_pause_seconds summary
jvm_gc_pause_seconds_count{action="end of minor GC",cause="Allocation Failure",} 37.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="Allocation Failure",} 0.444
jvm_gc_pause_seconds_count{action="end of minor GC",cause="GCLocker Initiated GC",} 1.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="GCLocker Initiated GC",} 0.006
jvm_gc_pause_seconds_count{action="end of major GC",cause="Ergonomics",} 1.0
jvm_gc_pause_seconds_sum{action="end of major GC",cause="Ergonomics",} 0.136
# HELP jvm_gc_pause_seconds_max Time spent in GC pause
# TYPE jvm_gc_pause_seconds_max gauge
jvm_gc_pause_seconds_max{action="end of minor GC",cause="Allocation Failure",} 0.045
jvm_gc_pause_seconds_max{action="end of minor GC",cause="GCLocker Initiated GC",} 0.006
jvm_gc_pause_seconds_max{action="end of major GC",cause="Ergonomics",} 0.136
# HELP jvm_gc_memory_promoted_bytes_total Count of positive increases in the size of the old generation memory pool before GC to after GC
# TYPE jvm_gc_memory_promoted_bytes_total counter
jvm_gc_memory_promoted_bytes_total 1.0042528E7


# HELP process_files_max_files The maximum file descriptor count
# TYPE process_files_max_files gauge
process_files_max_files 1048576.0
# HELP process_files_open_files The open file descriptor count
# TYPE process_files_open_files gauge
process_files_open_files 77.0



# HELP process_cpu_usage The "recent cpu usage" for the Java Virtual Machine process
# TYPE process_cpu_usage gauge
process_cpu_usage 0.002553191489361702
# HELP system_cpu_count The number of processors available to the Java virtual machine
# TYPE system_cpu_count gauge
system_cpu_count 4.0
# HELP system_cpu_usage The "recent cpu usage" for the whole system
# TYPE system_cpu_usage gauge
system_cpu_usage 0.13787234042553193
 

The number of live threads is currently by far the most important metric, since it would allow us to investigate the P1 issue openshiftio/openshift.io#4626
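
To make the consumption side concrete, here is a minimal sketch of pulling jvm_threads_live_threads out of a scrape like the one above. It is not how pmdaprometheus does it; it is a deliberately simplified parser that assumes samples carry no trailing timestamp and skips anything it cannot parse. The function names are hypothetical.

```python
import re

# name="value" pairs inside {...}; tolerates the trailing comma seen above
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        # Split on the last space: label values may contain spaces.
        head, _, value = line.rpartition(" ")
        if not head:
            continue
        name, brace, rest = head.partition("{")
        labels = dict(_LABEL_RE.findall(rest)) if brace else {}
        try:
            samples.append((name.strip(), labels, float(value)))
        except ValueError:
            continue  # not a sample, e.g. a wrapped HELP line
    return samples


def live_threads(text):
    """Return the jvm_threads_live_threads gauge, or None if absent."""
    for name, _labels, value in parse_metrics(text):
        if name == "jvm_threads_live_threads":
            return value
    return None
```

Fed the output of curl rhche-host:8087, live_threads() would return the current gauge value, which could then be forwarded to zabbix by whatever sender the osd-monitor pod ends up using.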

publish time series access API

pmwebd's graphite webapi would be the default way to access metric time series by OSIO components, whether for OSD-side components themselves, or tenant OSO side code. This API needs to be better documented, secured, exposed, pushed, filed, stamped, indexed, briefed, debriefed, and numbered.
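
As a starting point for that documentation, a sketch of how a client might form a query, assuming pmwebd exposes a Graphite-compatible render endpoint under a /graphite prefix. The target, from, until and format parameters are standard Graphite render API parameters; the base URL, port, and metric target names in the usage note are assumptions and would need to be checked against the deployed pmwebd.

```python
from urllib.parse import urlencode


def render_url(base_url, targets, frm="-1h", until="now", fmt="json"):
    """Build a Graphite-style render URL for one or more metric targets.

    base_url is the assumed graphite root of pmwebd, passed in by the
    caller rather than hard-coded, since the real prefix may differ.
    """
    params = [("target", t) for t in targets]
    params += [("from", frm), ("until", until), ("format", fmt)]
    return f"{base_url.rstrip('/')}/render?{urlencode(params)}"
```

For instance, render_url("http://pmwebd.example:44323/graphite", ["some.archive.kernel.all.load"]) asks for the last hour of a hypothetical archive metric as JSON. Any authentication or tenant scoping would have to sit in front of this, which is part of what this issue is asking for.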

OSO: tenant enumeration

For the oso-monitor to reliably track tenants that come and go, it needs this:

fabric8-services/fabric8-tenant#369
= fabric8-services/fabric8-tenant#371

This reverse lookup facility from tenant-URL to tenant may help for tenant-initiated introduction, but is unlikely to be sufficient as the OSD side may lose such messages. For redundancy/reliability we should have both.
fabric8-services/fabric8-tenant#476

Partitioning tenants by 'profile' will allow grouping them into different levels of metric collection/retention categories.
fabric8-services/fabric8-tenant#355
