This repository collects a number of interrelated Dockerfiles for building openshift.io's self-monitoring facilities.
They are based on Performance Co-Pilot (www.pcp.io) RPMs, built into small special-purpose configurations.
osd-monitor-poc's Issues
fix osd->zabbix feed reliability
internal dtsd/housekeeping/issues/926
replace webapi-guard components with oauth_proxy
The OSD-side webapi-guard components should integrate with OAuth SSO and offer programmatic access to the timeseries guarded by pmwebd. The file service provided by the current guard httpd could probably be handled by pmwebd too, so we wouldn't need an Apache httpd.
https://github.com/bitly/oauth2_proxy
a container is already built into the OSIO registry
OSD: bayesian & keycloak monitoring
Both bayesian and keycloak namespaces on prod & stage have some OSD-style pcp instrumentation in them, and a functional local osd-monitor pod. However, neither reports to zabbix, though this may be necessary. This is a placeholder for this known absence.
toy jaeger configuration for rhche
@ibuziuk From gazing at eclipse-che's repo on github, it seems the configuration change you need in order to use the toy jaeger now in osd-monitor in -stg is to add the following to your env/configmap:
CHE_TRACING_ENABLED: "true"
JAEGER_ENDPOINT: "http://osd-monitor-jaeger:14268/api/traces"
JAEGER_SERVICE_NAME: "che-server"
JAEGER_SAMPLER_MANAGER_HOST_PORT: "osd-monitor-jaeger:5778"
JAEGER_SAMPLER_TYPE: "const"
JAEGER_SAMPLER_PARAM: "1"
JAEGER_REPORTER_MAX_QUEUE_SIZE: "10000"
OSO: pull data from OSO prometheus server
There are indications that some of our target OSO clusters may expose a usable prometheus-server federation-type webapi, via which we might be able to pull in at least basic container-level system data. We need information & testing to see this work (@kbsingh), and then some pcp work in order to exploit it.
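As a rough illustration of what a federation-type pull could look like: Prometheus servers expose a `/federate` endpoint that takes one or more `match[]` series selectors. The sketch below only builds such a URL. The server address and the `container_.*` selector are assumptions made for illustration; which series the OSO clusters actually expose is still to be determined.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a Prometheus /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    # Hypothetical server address and selector, for illustration only.
    print(federate_url("http://prometheus.example:9090",
                       ['{__name__=~"container_.*"}']))
```

The response from such an endpoint is in the usual Prometheus text format, so whatever pcp-side consumer is chosen would see the same kind of data it already gets from individual exporters.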
OSO: tenant enumeration
For the oso-monitor to reliably track tenants that come and go, it needs this:
fabric8-services/fabric8-tenant#369
= fabric8-services/fabric8-tenant#371
This reverse lookup facility from tenant-URL to tenant may help for tenant-initiated introduction, but is unlikely to be sufficient as the OSD side may lose such messages. For redundancy/reliability we should have both.
fabric8-services/fabric8-tenant#476
Partitioning tenants by 'profile' will allow grouping them into different levels of metric collection/retention categories.
fabric8-services/fabric8-tenant#355
OSO: tenant end-user code, instrumentation insertion
In order to collect metrics from tenant-written code, it seems the best way is for us to manipulate the DCs for the TENANT-{run,stage} namespaces/pods before they're activated. @aslakknutsen
Add repository description in README.md
Use environment variable for devshift tag length in cico_build.sh
cico_build.sh needs to make use of the environment variable DEVSHIFT_TAG_LEN instead of a hardcoded value
OSO: activate jenkins monitoring
activate osd pcp monitoring for fabric8-auth-proxy
including its prometheus data
publish time series access API
pmwebd's graphite webapi would be the default way to access metric time series by OSIO components, whether for OSD-side components themselves, or tenant OSO side code. This API needs to be better documented, secured, exposed, pushed, filed, stamped, indexed, briefed, debriefed, and numbered.
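A minimal sketch of what programmatic access might look like, assuming pmwebd serves a Graphite-compatible render endpoint under `/graphite/render` and returns the standard Graphite JSON shape (a list of objects with `target` and `datapoints` as `[value, timestamp]` pairs). The host, port, and target name in the usage lines are hypothetical, and the path prefix should be checked against the deployed pmwebd.

```python
import json
from urllib.parse import urlencode


def render_url(base_url, target, since="-1h"):
    """Build a Graphite-style render URL asking for JSON output."""
    query = urlencode({"target": target, "from": since, "format": "json"})
    return f"{base_url.rstrip('/')}/graphite/render?{query}"


def parse_render_json(text):
    """Return {target: [(timestamp, value), ...]}, dropping null samples."""
    series = {}
    for entry in json.loads(text):
        series[entry["target"]] = [
            (timestamp, value)
            for value, timestamp in entry["datapoints"]
            if value is not None
        ]
    return series


if __name__ == "__main__":
    # Hypothetical endpoint and target, for illustration only.
    print(render_url("http://osd-monitor.example:44323", "some.archive.metric"))
    sample = '[{"target": "a.b", "datapoints": [[1.0, 100], [null, 160]]}]'
    print(parse_render_json(sample))
```

Fetching the URL is left out on purpose, since authentication and exposure of the API are exactly the open questions of this issue.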
activate pcp pmcd+pmda logging to stderr
OSO: activate eclipse-ws pod monitoring
pcp repos
Frank, regarding the various pcp.repo used in the Dockerfiles - these are pointing to https://copr-be.cloud.fedoraproject.org/results/fche/pcp/epel-7-$basearch/
Is that built from your pcpfans tree? I noticed there are undocumented pmwebd options being used, and likely quite a lot of other stuff that hasn't been merged with pcp upstream. This is something I can help with, but would need some guidelines on what you want to keep in your pcpfans tree, what's experimental, and what's only temporary.
zabbix low-level-discovery metrics needed
performancecopilot/pcp#439
internal dtsd/housekeeping/issues/1464
@mmclanerh
pmdapostgresql regression with >4.2 pcp
tracking progress being made at performancecopilot/pcp#590
PCP graphite function support
For supplying more cooked timeseries results via pmwebd's graphite webapi, this is needed. Work in progress. @aslakknutsen
need pcp2zabbix with PMNS_CHANGED / optional-metric tolerance
activate pcp archive compression
This should in theory reduce our PV disk space requirements by a factor greater than 10, though in exchange for more temporary RAM. performancecopilot/pcp#422 performancecopilot/pcp#386
OSO scaleup testing
We need to estimate the limitations of an oso-monitor pod's fanout:
- how many concurrent pmloggers can it practically handle?
- does that number match up with pmdaprometheus fanout limits?
- how well does pmwebd handle forecast-typical queries about so many fresh archives?
- does pmwebd need to be parallelized further, or replicated?
- if any of these limits are too small to encompass the entire online-OSO-tenant population, we need to extend oso-monitor to shard users into partitions, and add a pmwebd-demultiplexing proxy in front.
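If sharding turns out to be necessary, one simple way to assign tenants to oso-monitor partitions is a stable hash of the tenant identifier. The sketch below is only one possible scheme, not something oso-monitor currently implements; the tenant names and shard count are hypothetical.

```python
import hashlib


def shard_for(tenant_id, shard_count):
    """Map a tenant id to a shard index in [0, shard_count), stably across runs."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def partition(tenant_ids, shard_count):
    """Group tenant ids into shard_count lists according to shard_for()."""
    shards = [[] for _ in range(shard_count)]
    for tenant_id in tenant_ids:
        shards[shard_for(tenant_id, shard_count)].append(tenant_id)
    return shards


if __name__ == "__main__":
    # Hypothetical tenant names, for illustration only.
    tenants = [f"tenant-{n}" for n in range(10)]
    for index, members in enumerate(partition(tenants, 3)):
        print(index, members)
```

A front proxy could use the same function to route a pmwebd query to the shard that holds a given tenant's archives. Note that plain modulo hashing moves most tenants when the shard count changes, so a consistent-hashing variant may be preferable if shards are added often.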
Enabling monitoring for rhche on dsaas / dsaas-stg and exposing metrics to zabbix
The rhche-host service on dsaas / dsaas-stg exposes port 8087 for obtaining metrics in Prometheus format:
rhche-host ClusterIP 172.30.149.180 <none> 8080/TCP,8087/TCP 52d
Currently it is possible to obtain (ClassLoader / JVM / Tomcat) metrics from osd monitor via the service name & port combo, e.g. curl rhche-host:8087
Those metrics need to be consumed & visualized by osd monitor + exposed to zabbix. Currently the most important metrics are the following:
# HELP jvm_threads_peak_threads The peak live thread count since the Java virtual machine started or peak was reset
# TYPE jvm_threads_peak_threads gauge
jvm_threads_peak_threads 43.0
# HELP jvm_threads_live_threads The current number of live threads including both daemon and non-daemon threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads 43.0
# HELP jvm_threads_states_threads The current number of threads having NEW state
# TYPE jvm_threads_states_threads gauge
jvm_threads_states_threads{state="new",} 0.0
jvm_threads_states_threads{state="runnable",} 11.0
jvm_threads_states_threads{state="blocked",} 0.0
jvm_threads_states_threads{state="waiting",} 25.0
jvm_threads_states_threads{state="timed-waiting",} 7.0
jvm_threads_states_threads{state="terminated",} 0.0
# HELP jvm_threads_daemon_threads The current number of live daemon threads
# TYPE jvm_threads_daemon_threads gauge
jvm_threads_daemon_threads 38.0
# HELP jvm_memory_committed_bytes The amount of memory in bytes that is committed for the Java virtual machine to use
# TYPE jvm_memory_committed_bytes gauge
jvm_memory_committed_bytes{area="nonheap",id="Code Cache",} 2.3396352E7
jvm_memory_committed_bytes{area="nonheap",id="Metaspace",} 6.9337088E7
jvm_memory_committed_bytes{area="nonheap",id="Compressed Class Space",} 8519680.0
jvm_memory_committed_bytes{area="heap",id="PS Eden Space",} 1.5204352E7
jvm_memory_committed_bytes{area="heap",id="PS Survivor Space",} 524288.0
jvm_memory_committed_bytes{area="heap",id="PS Old Gen",} 4.9283072E7
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="nonheap",id="Code Cache",} 2.2897152E7
jvm_memory_used_bytes{area="nonheap",id="Metaspace",} 6.7874864E7
jvm_memory_used_bytes{area="nonheap",id="Compressed Class Space",} 8108328.0
jvm_memory_used_bytes{area="heap",id="PS Eden Space",} 1.322024E7
jvm_memory_used_bytes{area="heap",id="PS Survivor Space",} 484352.0
jvm_memory_used_bytes{area="heap",id="PS Old Gen",} 4.0086792E7
# HELP jvm_memory_max_bytes The maximum amount of memory in bytes that can be used for memory management
# TYPE jvm_memory_max_bytes gauge
jvm_memory_max_bytes{area="nonheap",id="Code Cache",} 2.5165824E8
jvm_memory_max_bytes{area="nonheap",id="Metaspace",} -1.0
jvm_memory_max_bytes{area="nonheap",id="Compressed Class Space",} 1.073741824E9
jvm_memory_max_bytes{area="heap",id="PS Eden Space",} 1.77733632E8
jvm_memory_max_bytes{area="heap",id="PS Survivor Space",} 524288.0
jvm_memory_max_bytes{area="heap",id="PS Old Gen",} 3.58088704E8
# HELP jvm_buffer_count_buffers An estimate of the number of buffers in the pool
# TYPE jvm_buffer_count_buffers gauge
jvm_buffer_count_buffers{id="direct",} 17.0
jvm_buffer_count_buffers{id="mapped",} 0.0
# HELP jvm_buffer_memory_used_bytes An estimate of the memory that the Java virtual machine is using for this buffer pool
# TYPE jvm_buffer_memory_used_bytes gauge
jvm_buffer_memory_used_bytes{id="direct",} 515770.0
jvm_buffer_memory_used_bytes{id="mapped",} 0.0
# HELP jvm_buffer_total_capacity_bytes An estimate of the total capacity of the buffers in this pool
# TYPE jvm_buffer_total_capacity_bytes gauge
jvm_buffer_total_capacity_bytes{id="direct",} 515770.0
jvm_buffer_total_capacity_bytes{id="mapped",} 0.0
# HELP jvm_gc_memory_allocated_bytes_total Incremented for an increase in the size of the young generation memory pool after one GC to before the next
# TYPE jvm_gc_memory_allocated_bytes_total counter
jvm_gc_memory_allocated_bytes_total 5.75668224E8
# HELP jvm_gc_max_data_size_bytes Max size of old generation memory pool
# TYPE jvm_gc_max_data_size_bytes gauge
jvm_gc_max_data_size_bytes 3.58088704E8
# HELP jvm_gc_live_data_size_bytes Size of old generation memory pool after a full GC
# TYPE jvm_gc_live_data_size_bytes gauge
jvm_gc_live_data_size_bytes 3.9021816E7
# HELP jvm_gc_pause_seconds Time spent in GC pause
# TYPE jvm_gc_pause_seconds summary
jvm_gc_pause_seconds_count{action="end of minor GC",cause="Allocation Failure",} 37.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="Allocation Failure",} 0.444
jvm_gc_pause_seconds_count{action="end of minor GC",cause="GCLocker Initiated GC",} 1.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="GCLocker Initiated GC",} 0.006
jvm_gc_pause_seconds_count{action="end of major GC",cause="Ergonomics",} 1.0
jvm_gc_pause_seconds_sum{action="end of major GC",cause="Ergonomics",} 0.136
# HELP jvm_gc_pause_seconds_max Time spent in GC pause
# TYPE jvm_gc_pause_seconds_max gauge
jvm_gc_pause_seconds_max{action="end of minor GC",cause="Allocation Failure",} 0.045
jvm_gc_pause_seconds_max{action="end of minor GC",cause="GCLocker Initiated GC",} 0.006
jvm_gc_pause_seconds_max{action="end of major GC",cause="Ergonomics",} 0.136
# HELP jvm_gc_memory_promoted_bytes_total Count of positive increases in the size of the old generation memory pool before GC to after GC
# TYPE jvm_gc_memory_promoted_bytes_total counter
jvm_gc_memory_promoted_bytes_total 1.0042528E7
# HELP process_files_max_files The maximum file descriptor count
# TYPE process_files_max_files gauge
process_files_max_files 1048576.0
# HELP process_files_open_files The open file descriptor count
# TYPE process_files_open_files gauge
process_files_open_files 77.0
# HELP process_cpu_usage The "recent cpu usage" for the Java Virtual Machine process
# TYPE process_cpu_usage gauge
process_cpu_usage 0.002553191489361702
# HELP system_cpu_count The number of processors available to the Java virtual machine
# TYPE system_cpu_count gauge
system_cpu_count 4.0
# HELP system_cpu_usage The "recent cpu usage" for the whole system
# TYPE system_cpu_usage gauge
system_cpu_usage 0.13787234042553193
The number of live threads is currently by far the most important metric, since it would allow us to investigate the P1 issue openshiftio/openshift.io#4626
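To make the shape of this data concrete, here is a minimal sketch of parsing the Prometheus text output shown above and picking out the thread metrics. It is a simplified reader written for illustration, not the parser used by pmdaprometheus: it skips comment lines, tolerates the trailing comma inside label sets seen in this output, and ignores samples that carry an optional timestamp.

```python
import re

# metric name, optional {labels}, then a single value token
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) for each sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if not match:
            continue  # e.g. samples with a timestamp are not handled here
        name, labels, value = match.groups()
        samples.append((name, dict(_LABEL.findall(labels or "")), float(value)))
    return samples


if __name__ == "__main__":
    text = (
        "# TYPE jvm_threads_live_threads gauge\n"
        "jvm_threads_live_threads 43.0\n"
        'jvm_threads_states_threads{state="runnable",} 11.0\n'
    )
    for name, labels, value in parse_metrics(text):
        if name.startswith("jvm_threads_"):
            print(name, labels, value)
```

In practice the text would come from the `rhche-host:8087` endpoint mentioned above, and the selected values would then be forwarded to zabbix by whatever feed is settled on.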
Add liveness and readiness probes to osd-monitor.yml
The deployment config osd-monitor.yml needs to have:
For API containers:
- A liveness probe
- A readiness probe
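As a sketch of what the two probes could look like, the snippet below builds the `livenessProbe` and `readinessProbe` stanzas as plain dictionaries and prints them as JSON, which maps directly onto the YAML fields of a container spec. The container name, the port, the choice of a TCP check, and all timing values are assumptions for illustration; the real values depend on what each container in osd-monitor.yml actually serves.

```python
import json


def tcp_probe(port, initial_delay, period):
    """Build a Kubernetes-style TCP socket probe definition."""
    return {
        "tcpSocket": {"port": port},
        "initialDelaySeconds": initial_delay,
        "periodSeconds": period,
    }


def with_probes(container, port):
    """Return a copy of a container spec with liveness and readiness probes added."""
    patched = dict(container)
    patched["livenessProbe"] = tcp_probe(port, initial_delay=30, period=20)
    patched["readinessProbe"] = tcp_probe(port, initial_delay=5, period=10)
    return patched


if __name__ == "__main__":
    # Hypothetical container name and port, for illustration only.
    container = {"name": "example-api", "ports": [{"containerPort": 8080}]}
    print(json.dumps(with_probes(container, 8080), indent=2))
```

An `httpGet` probe against a known-good path would be a stronger check than a TCP connect wherever the container exposes one.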
PCP dynamic pmlogger
Required to make OSO tenant logging as efficient/dynamic as possible.
Already in good hands:
performancecopilot/pcp#372