Prometheus BOSH Release

This is a BOSH release for Prometheus, Alertmanager, and Grafana. It also includes various Prometheus exporters and Grafana plugins.

The detailed list of included components and their maintenance status is available in VERSIONS.md.

Questions? Pop in our Slack channel!

Usage

Requirements

In order to use this BOSH release you will need a working BOSH environment and the BOSH CLI v2.

Although not mandatory, it is recommended to deploy the node exporter addon in order to get system metrics.
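For example, the node exporter can be deployed on every VM through a BOSH runtime config. The sketch below is illustrative only; the release version is a placeholder (runtime configs require a pinned version, 'latest' is not accepted), and the job name should be verified against the job specs in this repository:

```yaml
# runtime.yml - illustrative runtime-config addon for the node exporter
releases:
- name: prometheus
  version: "23.3.0"   # placeholder: pin the release version you uploaded
addons:
- name: node-exporter
  jobs:
  - name: node_exporter
    release: prometheus
```

It would be applied with `bosh update-runtime-config runtime.yml` (file name is illustrative).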

Clone the repository

First, clone this repository into your workspace:

git clone https://github.com/bosh-prometheus/prometheus-boshrelease
cd prometheus-boshrelease
export BOSH_ENVIRONMENT=<name>

Then check out the release branch you want to use, so that the manifest files are in sync with the release version:

git checkout v...
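If you are unsure which tag to pick, a small sketch for checking out the newest tagged release (this assumes the release tags follow the v-prefixed semver naming shown above; pick a specific tag instead if you need a particular version):

```shell
# List v-prefixed tags newest-first (version sort) and check out the top one
latest=$(git tag --list 'v*' --sort=-v:refname | head -n1)
git checkout "$latest"
```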

Basic deployment

To deploy a basic Prometheus server with Alertmanager and Grafana (but no exporters), use the following command:

bosh -d prometheus deploy manifests/prometheus.yml \
  --vars-store tmp/deployment-vars.yml

Once deployed, look for the nginx instance IP address:

bosh -d prometheus instances
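A hedged helper for scripting this lookup. The column layout (instance name first, IP last) is an assumption about the CLI's default table format; verify it against your bosh CLI version:

```shell
# Print the first nginx instance's IP from `bosh instances` output
nginx_ip() { awk '$1 ~ /^nginx\// { print $NF; exit }'; }

# usage: bosh -d prometheus instances | nginx_ip
```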

You can reach each component's web UI at:

  • alertmanager: http://<nginx-ip-address>:9093
  • grafana: http://<nginx-ip-address>:3000
  • prometheus: http://<nginx-ip-address>:9090

Credentials for each component can be found in the tmp/deployment-vars.yml file.
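A minimal sketch for reading a generated credential out of the flat YAML vars store. The key name grafana_password is an assumption; inspect the file for the exact keys your ops files generated (the BOSH CLI's `bosh int tmp/deployment-vars.yml --path /grafana_password` does the same job):

```shell
# Print the value of a top-level key from the vars store
get_var() { awk -v k="$1" '$1 == k":" { print $2; exit }' tmp/deployment-vars.yml; }

# usage: get_var grafana_password
```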

Using BOSH Service Discovery

If you want to use BOSH Service Discovery to dynamically discover your exporters, add the monitor-bosh.yml ops file by running the following command (filling in the required variables with your own values):

bosh -d prometheus deploy manifests/prometheus.yml \
  --vars-store tmp/deployment-vars.yml \
  -o manifests/operators/monitor-bosh.yml \
  -v bosh_url= \
  -v bosh_username= \
  -v bosh_password= \
  --var-file bosh_ca_cert= \
  -v metrics_environment=

NOTE: metrics_environment is an arbitrary name identifying your environment (e.g. test, nyc-prod, ...)
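As an alternative to repeating -v flags, the BOSH CLI can read these variables from a YAML file passed with -l (--vars-file). A sketch with placeholder values only:

```yaml
# vars.yml - illustrative values; substitute your own
bosh_url: https://10.0.0.6:25555
bosh_username: admin
bosh_password: some-password
metrics_environment: nyc-prod
```

It would then be used as `bosh -d prometheus deploy manifests/prometheus.yml --vars-store tmp/deployment-vars.yml -o manifests/operators/monitor-bosh.yml -l vars.yml --var-file bosh_ca_cert=bosh-ca.pem` (the CA file name is a placeholder).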

If you have configured your bosh-deployment to use UAA user management (via the uaa.yml ops file), we recommend first applying the add-bosh-exporter-uaa-clients.yml ops file to your bosh-deployment, and then adding the enable-bosh-uaa.yml ops file to the prometheus deployment by running the following command (filling in the required variables with your own values):

bosh -d prometheus deploy manifests/prometheus.yml \
  --vars-store tmp/deployment-vars.yml \
  -o manifests/operators/monitor-bosh.yml \
  -o manifests/operators/enable-bosh-uaa.yml \
  -v bosh_url= \
  --var-file bosh_ca_cert= \
  -v metrics_environment=

If you have manually configured a UAA client_id for the bosh_exporter (different from the default bosh_exporter), run the following command instead:

bosh -d prometheus deploy manifests/prometheus.yml \
  --vars-store tmp/deployment-vars.yml \
  -o manifests/operators/monitor-bosh.yml \
  -o manifests/operators/enable-bosh-uaa.yml \
  -o manifests/operators/configure-bosh-exporter-uaa-client-id.yml \
  -v bosh_url= \
  -v uaa_bosh_exporter_client_id= \
  -v uaa_bosh_exporter_client_secret= \
  --var-file bosh_ca_cert= \
  -v metrics_environment=

Monitoring Cloud Foundry

If you want to monitor your Cloud Foundry platform, first update your cf-deployment by adding the add-prometheus-uaa-clients.yml ops file.

This will add the UAA clients required to gather information from the Cloud Foundry API and the Firehose. Then add the monitor-cf.yml ops file by running the following command (filling in the required variables with your own values):

bosh -d prometheus deploy manifests/prometheus.yml \
  --vars-store tmp/deployment-vars.yml \
  -o manifests/operators/monitor-bosh.yml \
  -v bosh_url= \
  -v bosh_username= \
  -v bosh_password= \
  --var-file bosh_ca_cert= \
  -v metrics_environment= \
  -o manifests/operators/monitor-cf.yml \
  -v metron_deployment_name= \
  -v system_domain= \
  -v uaa_clients_cf_exporter_secret= \
  -v loggregator_ca_name= \
  -v skip_ssl_verify=

NOTE: The metron_deployment_name property should match the deployment property of your metron_agent or loggregator_agent jobs.

NOTE: The loggregator_ca_name property should match the full CredHub path of the loggregator_ca certificate variable, e.g. /bosh-mydirector/cf/loggregator_ca.

NOTE: You can switch to the legacy implementation of the firehose_exporter and the legacy Cloud Foundry dashboards by adding the following ops files:

  • on prometheus deployment, adapt:
    ...
    -o manifests/operators/monitor-cf.yml \
    -o manifests/operators/deprecated/monitor-cf-attic.yml \
    -v uaa_clients_firehose_exporter_secret= \
    -v traffic_controller_external_port= \
    ...
    
  • When using add-prometheus-uaa-clients.yml on cloud foundry deployment, adapt:
    ...
    -o manifests/operators/cf/add-prometheus-uaa-clients.yml
    -o manifests/operators/deprecated/cf/add-prometheus-uaa-clients-attic.yml
    ...
    

This will switch the deployment to firehose_exporter-attic, cloudfoundry_dashboards-attic, and cloudfoundry_alerts-attic.

Register Cloud Foundry routes

If you want to access the alertmanager, grafana, and prometheus web UIs using your Cloud Foundry system domain instead of IP addresses, you can register those routes inside your Cloud Foundry environment with the enable-cf-route-registrar.yml ops file by running the following command (filling in the required variables with your own values):

bosh -d prometheus deploy manifests/prometheus.yml \
  --vars-store tmp/deployment-vars.yml \
  ...
  -o manifests/operators/enable-cf-route-registrar.yml \
  -v system_domain= \
  -v cf_deployment_name=

The ops file will register the following routes:

  • https://alertmanager.<cf system domain>
  • https://grafana.<cf system domain>
  • https://prometheus.<cf system domain>

Use UAA for Grafana authentication

If you want to allow users registered in your Cloud Foundry environment to access the Grafana dashboards (Viewer role only), first update your cf-deployment by adding the add-grafana-uaa-clients.yml ops file. This will add the UAA client required by the Grafana-UAA integration.

Then add the enable-grafana-uaa.yml op file by running the following command (filling the required variables with your own values):

bosh -d prometheus deploy manifests/prometheus.yml \
  --vars-store tmp/deployment-vars.yml \
  ...
  -o manifests/operators/enable-grafana-uaa.yml \
  -v system_domain= \
  -v uaa_clients_grafana_secret= \
  --var-file uaa_ssl.ca= \
  --var-file uaa_ssl.certificate= \
  --var-file uaa_ssl.private_key=

Operations files

Additional operations files are located in the manifests/operators directory. These files include only a basic configuration, so extra ops files might be needed for additional configuration.

Please review the ops files before deploying them to check their requirements, dependencies, and necessary variables.
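As an illustration of the format, ops files in this repository are go-patch documents. A hypothetical one that overrides a property might look like the following; the path shown is an assumption, so verify the instance-group and job names against manifests/prometheus.yml before using anything like this:

```yaml
# Hypothetical ops file: override the Prometheus scrape interval
- type: replace
  path: /instance_groups/name=prometheus2/jobs/name=prometheus2/properties/prometheus/scrape_interval?
  value: 30s
```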

File Description exporter dashboards alerts
(the trailing 'x' marks indicate whether an ops file brings an exporter, dashboards, and/or alerts)
alertmanager-group-by-alertname.yml Groups alertmanager alerts by name
alertmanager-hipchat-receiver.yml Configures a HipChat receiver for alertmanager
alertmanager-opsgenie-receiver.yml Configures a OpsGenie receiver for alertmanager
alertmanager-pagerduty-receiver.yml Configures a PagerDuty receiver for alertmanager
alertmanager-pushover-receiver.yml Configures a Pushover receiver for alertmanager
alertmanager-slack-receiver.yml Configures a Slack receiver for alertmanager
alertmanager-victorops-receiver.yml Configures a VictorOps receiver for alertmanager
alertmanager-webhook-receiver.yml Configures a generic webhook receiver for alertmanager
alertmanager-web-external-url.yml Configures the URL under which alertmanager is externally reachable
configure-bosh-exporter-uaa-client-id.yml Configures a custom bosh_exporter UAA client_id for the enable-bosh-uaa.yml op-file
enable-bosh-uaa.yml Configures monitor-bosh.yml to use an UAA client (you must apply the add-bosh-exporter-uaa-clients.yml op file to your bosh-deployment)
enable-cf-route-registrar.yml Registers alertmanager, grafana, and prometheus as Cloud Foundry routes (under your system domain)
enable-grafana-uaa.yml Configures grafana user authentication to use Cloud Foundry UAA (you must apply the add-grafana-uaa-clients.yml op file to your cf-deployment)
enable-grafana-generic-oauth.yml Configures grafana user authentication to use a generic OAuth2 provider
enable-service-discovery.yml Enable service discovery files using BOSH links
enable-proxy-alertmanager.yml Enables http(s) proxy for alertmanager
enable-proxy-blackbox-exporter.yml Enables http(s) proxy for blackbox_exporter
enable-proxy-bosh-exporter.yml Enables http(s) proxy for bosh_exporter
enable-proxy-cf-exporter.yml Enables http(s) proxy for cf_exporter
enable-proxy-firehose-exporter.yml Enables http(s) proxy for firehose_exporter
enable-proxy-grafana.yml Enables http(s) proxy for grafana
enable-proxy-kubernetes.yml Enables http(s) proxy for kube_state_metrics_exporter
enable-proxy-prometheus.yml Enables http(s) proxy for prometheus
enable-proxy-shield-exporter.yml Enables http(s) proxy for shield_exporter
enable-proxy-stackdriver-exporter.yml Enables http(s) proxy for stackdriver_exporter
enable-root-url.yml Enables root_url for grafana
migrate_from_prometheus_1.yml Allows migrating an instance from Prometheus 1.x to Prometheus 2.x
monitor-bosh.yml Enables monitoring BOSH jobs and processes and enables Service Discovery x x x
monitor-cadvisor.yml Enables monitoring cAdvisor x
monitor-cf.yml Enables monitoring Cloud Foundry via the Cloud Foundry and Cloud Foundry Firehose exporters (you must apply the add-prometheus-uaa-clients.yml op file to your cf-deployment) x x x
monitor-collectd.yml Enables monitoring Collectd x
monitor-concourse.yml Enables monitoring Concourse CI >= v3.8.0 (you must apply the enable-prometheus-metrics.yml op file to your concourse-deployment) x x
monitor-concourse-influxdb.yml Enables monitoring Concourse CI < v3.8.0. Requires node exporter on Concourse VMs (probably as a BOSH add-on) and InfluxDB to be deployed independently and configured as a data source in Grafana as well as Concourse configured to send events to InfluxDB x
monitor-consul.yml Enables monitoring Consul x x x
monitor-credhub.yml Enables monitoring Credhub x x
monitor-elasticsearch.yml Enables monitoring Elasticsearch x x x
monitor-graphite.yml Enables monitoring Graphite x
monitor-haproxy.yml Enables monitoring HAProxy x x x
monitor-http-probe.yml Enables monitoring HTTP(s) endpoints via the Blackbox exporter x x x
monitor-influxdb.yml Enables monitoring InfluxDB x
monitor-kubernetes.yml Enables monitoring Kubernetes x x x
monitor-memcached.yml Enables monitoring Memcached x
monitor-mongodb.yml Enables monitoring MongoDB x
monitor-mysql.yml Enables monitoring MySQL x x x
monitor-nats.yml Enables monitoring NATS x
monitor-node.yml Enables monitoring system metrics via the node exporter x
monitor-p-rabbitmq.yml Enables monitoring RabbitMQ for PCF (requires the monitor-cf.yml op file) x x
monitor-p-redis.yml Enables monitoring Redis for PCF (requires the monitor-cf.yml op file) x x
monitor-postgres.yml Enables monitoring PostgreSQL x x x
monitor-pushgateway.yml Deploys a PushGateway x
monitor-rabbitmq.yml Enables monitoring RabbitMQ x x x
monitor-redis.yml Enables monitoring Redis x x x
monitor-shield.yml Enables monitoring Shield x x x
monitor-stackdriver.yml Enables monitoring Stackdriver x
monitor-statsd.yml Enables monitoring Statsd x
monitor-vault.yml Enables monitoring Vault x x
nginx-vm-extension.yml Adds a VM Extension block to the nginx instance, useful to attach a Load Balancer
prometheus-web-external-url.yml Configures the URL under which prometheus is externally reachable
use-sqlite3.yml Use sqlite3 instead of postgres

In addition, some deprecated ops files allow switching back to legacy behaviours:

File Description exporter dashboards alerts
deprecated/monitor-cf-attic.yml Use legacy implementation of monitor-cf.yml x x x
deprecated/cf/add-prometheus-uaa-clients-attic.yml Adds UAA client in cloud foundry deployment when using monitor-cf-attic.yml
deprecated/enable-cf-loggregator-v2.yml Enables Cloud Foundry Loggregator V2 API calls in the legacy firehose_exporter

Deployment variables and the var-store

Some operations files require additional information providing environment-specific or sensitive configuration, such as various credentials. In the default configuration this is handled with the --vars-store flag, which takes the name of a YAML file that the CLI reads and writes. Where necessary credential values are not present, the CLI generates new values based on the type information stored in the deployment files. Variables that BOSH can't generate must be supplied as well; see each ops file you're using for any additional required variables.
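For illustration, a vars store generated by the basic deployment might look roughly like this; the variable names depend on the ops files applied, and all values here are made-up placeholders:

```yaml
# tmp/deployment-vars.yml - illustrative shape only
alertmanager_password: pQ2x9vR7sT1u
grafana_password: aB3cD4eF5gH6
grafana_secret_key: 0123456789abcdef
```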

See also the BOSH CLI documentation for more information about ways to supply such additional variables.

Contributing

Refer to CONTRIBUTING.md.

License

Apache License 2.0, see LICENSE.

prometheus-boshrelease's People

Contributors

aegershman, alexvianet, arthurhlt, aveyrenc, bandesz, benjaminguttmann-avtq, bkez322, chitoku-k, dark5un, drnic, frodenas, geofffranks, infra-red, ionphractal, jmcarp, jtuchscherer, kinjelom, making, mchabane, mjsjinsu, mkuratczyk, peterellisjones, pommi, psycofdj, rkoster, romain-dartigues, sba30, solera-concourse, thehandsomezebra, ywei2017


prometheus-boshrelease's Issues

CF Exporter - Filter Collector

Hi,

I am facing currently an issue with the cf_exporter.
When I define any collector mentioned in the specs of the exporter, the exporter isn't able to start and shows errors like:

time="2017-07-21T12:43:56Z" level=error msg="Collector filter [ApplicationEvents] is not supported" source="cf_exporter.go:213"

Any ideas?

Regards,

Benjamin

Prometheus API responds, but the Prometheus web UI doesn't

Hi.

Although I did set up prometheus-17.6.0 with BOSH, I couldn't get data from the Prometheus web UI (no data points), but the Prometheus API is working.

$ curl -g 'http://100.99.50.80:9090/api/v1/series?match[]=cf_organization_info{environment="hoge"}'
{"status":"success","data":[{"__name__":"cf_organization_info","deployment":"cf","environment":"hoge","instance":"localhost:9193","job":"cf","organization_id":"e00539b9-4437-4e88-9a9a-6b807861383c","organization_name":"org","quota_name":"default"},
...
..
.

Why would it happen?

bosh_exporter should not filter.azs by default

Hi,

What is the use case for the 'else' statement here: https://github.com/cloudfoundry-community/prometheus-boshrelease/blob/master/jobs/bosh_exporter/templates/bin/bosh_exporter_ctl#L50-L51 ? We just ran into an issue where we had no job-level information after deploying prometheus-release and we tracked it down to this - we didn't intend to filter at all but by default it was filtering on AZ (and we didn't have any interesting jobs in that AZ).

Thanks,

Authentication issue

Hi

When I upgrade prometheus from version 12.3.3 to 14.0.0 and hit the grafana dashboard, I get an authentication pop-up multiple times asking for username and password, and a warning saying my connection is not private.

Enhancement of documentation

Hi,
I am currently trying to install prometheus-boshrelease on open-source CF, backed by OpenStack. The deployment went through smoothly after creating an OpenStack deployment manifest, and I am able to open Grafana, but I am not able:

  • to see the dashboards provided
  • when importing the dashboards manually, any datapoints

So it seems that I am doing something wrong. Could anyone give me a heads up? I would also participate in creating the documentation.

Thanks a lot & BR,
Johannes

Alerting on diego low remaining memory seems incorrect

First, thanks for the great work on this release, it helps a lot.

I just have one issue with alerting: I always have a "DiegoLowRemainingMemory" alert in the Alertmanager, but I have 82 GB available in my cells, as the CF cells capacity dashboard says. This value on the dashboard seems correct, as does the query used to build it.

To me the alert has a bad expression, because as we can see here https://github.com/cloudfoundry-community/prometheus-boshrelease/blob/master/src/cloudfoundry_alerts/diego.alerts#L26 it never sums over the multiple cells we can have in a deployment.

Am I wrong? If not, I can propose this solution: sum(avg(firehose_value_metric_rep_capacity_remaining_memory) by(bosh_deployment, bosh_job_name, bosh_job_id)) to replace the current expression in the alert.

Alert BOSHJobHighCPULoad does not take number of CPUs into account

The BOSHJobHighCPULoad alert queries the bosh_job_load_avg01 metric. The problem is that a sensible warning threshold for this metric depends on the number of CPUs: generally speaking, 100% CPU load on 1 core is indicated by a value of 1, while 100% CPU load on a 16-core machine results in a 16.

Since the query for this alert does not divide the metric by the number of CPUs, it is not very useful: on a 1-core machine the default threshold of 5 means you have to fix things immediately, on a 4-core machine a load average of 5 still indicates a slight problem, but on an 8-core machine a load average of 5 is absolutely fine.

So my request: can the load average please be divided by the number of CPUs in the machine before comparing it to the threshold value?

grafana dashboard error

Hello, I have hit this problem since 15.0.0 (reproduced on 17.0.0 and the latest 17.0.2).

The Grafana post-start script fails on a strange file:

Updating dashboard /var/vcap/jobs/cloudfoundry_dashboards/prometheus_cf_exporter.json at Wed Jun  7 10:52:23 UTC 2017
Validating /var/vcap/store/grafana/dashboards//prometheus_cf_exporter.json
Updating dashboard /var/vcap/jobs/cloudfoundry_dashboards/prometheus_firehose_exporter.json at Wed Jun  7 10:52:23 UTC 2017
Validating /var/vcap/store/grafana/dashboards//prometheus_firehose_exporter.json
Updating dashboard agent.cert at Wed Jun  7 10:52:23 UTC 2017
Validating /var/vcap/store/grafana/dashboards//agent.cert
parse error: Invalid numeric literal at line 1, column 11

This file is present in /var/vcap/store; I don't know where it comes from.

grafana/47c0244a-3c59-4837-8602-4b619ef29c60:/var/vcap/store/grafana/dashboards# pwd                                                                       
/var/vcap/store/grafana/dashboards                                                                                                                         
grafana/47c0244a-3c59-4837-8602-4b619ef29c60:/var/vcap/store/grafana/dashboards# ls -lrt                                                                   
total 860                                                                                                                                                  
-rw-r--r-- 1 root root  6687 Jun  7 10:52 bosh_deployments.json                                                                                            
-rw-r--r-- 1 root root 35055 Jun  7 10:52 bosh_jobs.json                                                                                                   
-rw-r--r-- 1 root root 35046 Jun  7 10:52 bosh_overview.json                                                                                               
-rw-r--r-- 1 root root 15181 Jun  7 10:52 bosh_processes.json                                                                                              
-rw-r--r-- 1 root root 28081 Jun  7 10:52 prometheus_bosh_exporter.json
-rw-r--r-- 1 root root 11501 Jun  7 10:52 prometheus_bosh_tsdb_exporter.json
-rw-r--r-- 1 root root 16435 Jun  7 10:52 cf_apps_events.json
-rw-r--r-- 1 root root 13365 Jun  7 10:52 cf_apps_latency.json
-rw-r--r-- 1 root root 36548 Jun  7 10:52 cf_apps_requests.json
-rw-r--r-- 1 root root 20817 Jun  7 10:52 cf_apps_system.json
-rw-r--r-- 1 root root 19154 Jun  7 10:52 cf_bbs.json
-rw-r--r-- 1 root root 21344 Jun  7 10:52 cf_cc.json
-rw-r--r-- 1 root root 20338 Jun  7 10:52 cf_cells_capacity.json
-rw-r--r-- 1 root root 25104 Jun  7 10:52 cf_component_metrics.json
-rw-r--r-- 1 root root 13976 Jun  7 10:52 cf_diego_auctions.json
-rw-r--r-- 1 root root 11125 Jun  7 10:52 cf_diego_health.json
-rw-r--r-- 1 root root 21552 Jun  7 10:52 cf_doppler_server.json
-rw-r--r-- 1 root root 26061 Jun  7 10:52 cf_etcd.json
-rw-r--r-- 1 root root 22500 Jun  7 10:52 cf_etcd_operations.json
-rw-r--r-- 1 root root 11880 Jun  7 10:52 cf_garden_linux.json
-rw-r--r-- 1 root root 78176 Jun  7 10:52 cf_kpis.json
-rw-r--r-- 1 root root 16062 Jun  7 10:52 cf_lrps_tasks.json
-rw-r--r-- 1 root root 15427 Jun  7 10:52 cf_metron_agent.json
-rw-r--r-- 1 root root 13902 Jun  7 10:52 cf_metron_agent_doppler.json
-rw-r--r-- 1 root root 10589 Jun  7 10:52 cf_organization_memory_quotas.json
-rw-r--r-- 1 root root 15680 Jun  7 10:52 cf_organization_summary.json
-rw-r--r-- 1 root root 11204 Jun  7 10:52 cf_route_emitter.json
-rw-r--r-- 1 root root 27468 Jun  7 10:52 cf_router.json
-rw-r--r-- 1 root root 22637 Jun  7 10:52 cf_space_summary.json
-rw-r--r-- 1 root root 47125 Jun  7 10:52 cf_summary.json
-rw-r--r-- 1 root root 22052 Jun  7 10:52 cf_uaa.json
-rw-r--r-- 1 root root 65631 Jun  7 10:52 prometheus_cf_exporter.json
-rw-r--r-- 1 root root 53417 Jun  7 10:52 prometheus_firehose_exporter.json
-rw-r--r-- 1 root root  1058 Jun  7 10:52 agent.cert

multiple bosh exporters / exporter authent

Hello, I have a complex deployment, including multiple BOSH directors.
I understand I can configure different ports for multiple exporters; however, I can't see how to configure multiple exporters in the same vm/instance group.

Do I have to configure a separate vm/instance group per bosh_exporter?
In that case, is it possible to secure the prometheus => bosh_exporter link (basic auth, SSL)?
Could BOSH links help wire multiple bosh_exporters to the prometheus server transparently?

thx
Pierre

BTW: fantastic job on this BOSH release! A really nice out-of-the-box grafana / prometheus / alerting experience!

Is anyone having issues with `label_values` in v17 Grafana?

Just upgraded to the 17.0.0 release and I am having a quite weird issue with Grafana.

If I keep the label_values in the template as is, it doesn't return anything. If I change it to just use the label, it works.

I ended up changing the dashboards to label_values(environments).

Has anyone seen something like this?

Support arbitrary alerts from deployment manifest

As a user of this release, I want to add miscellaneous custom alerts without writing a whole new bosh release. Instead, it would be useful to configure arbitrary alerts like this:

...
properties:
  prometheus:
    custom_rules:
    - (( file "path/to/some/rules" ))
    - (( file "path/to/different/rules" ))
...

The release would then write the contents of prometheus.custom_rules to a file and add the file path to rule_files. If that makes sense, we'd be happy to send a patch. WDYT @frodenas ?
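For illustration, each file referenced by such a custom_rules list would hold ordinary Prometheus rule definitions; a minimal alert in the 1.x rule syntax current at the time of this issue might look like this (the alert itself is a made-up example):

```
# contents of "path/to/some/rules" (Prometheus 1.x rule syntax; illustrative)
ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} is down",
  }
```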

cc @cnelson

spec.address issue with grafana job

When we deploy prometheus without BOSH PowerDNS, the deployment fails: the job scripts use the BOSH hostname.
<spec.address> is used in the job scripts, but without PowerDNS the hostname can't be resolved.
I think it would be better if we could choose whether to use spec.address or spec.ip.

Corrupt packages in release v13?

Hi,
I tried to upgrade to Version 13 today and ran into an issue during the compile phase:

 Started compiling packages
  Started compiling packages > grafana/5271e14376bcd8484d934e4379fcfb1e8467adea
  Started compiling packages > cf_exporter/149d9106a695629e0fe86f38edd4501261dad41e
  Started compiling packages > rabbitmq_exporter/85c19dbb43607b7d2f6eb6a35c0ca30fa313ca93
   Failed compiling packages > cf_exporter/149d9106a695629e0fe86f38edd4501261dad41e: Action Failed get_task: Task cede0d5c-cac7-4027-79ea-975a73f6201a result: Compiling package cf_exporter: Fetching package cf_exporter: Fetching package blob f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0: Getting blob from inner blobstore: Getting blob from inner blobstore: Shelling out to bosh-blobstore-dav cli: Running command: 'bosh-blobstore-dav -c /var/vcap/bosh/etc/blobstore-dav.json get f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0 /var/vcap/data/tmp/bosh-blobstore-externalBlobstore-Get233330094', stdout: 'Error running app - Getting dav blob f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0: Get http://172.16.106.4:25250/0c/f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0: read tcp 172.16.106.32:34312->172.16.106.4:25250: read: connection reset by peer', stderr: '': exit status 1 (00:07:33)
   Failed compiling packages > rabbitmq_exporter/85c19dbb43607b7d2f6eb6a35c0ca30fa313ca93: Action Failed get_task: Task 03f1847d-d5e1-40c6-7021-fd250418c042 result: Compiling package rabbitmq_exporter: Fetching package rabbitmq_exporter: Fetching package blob 43f0c2f9-6d98-404c-ab10-fde39d50f1b7: Getting blob from inner blobstore: Getting blob from inner blobstore: Shelling out to bosh-blobstore-dav cli: Running command: 'bosh-blobstore-dav -c /var/vcap/bosh/etc/blobstore-dav.json get 43f0c2f9-6d98-404c-ab10-fde39d50f1b7 /var/vcap/data/tmp/bosh-blobstore-externalBlobstore-Get610207793', stdout: 'Error running app - Getting dav blob 43f0c2f9-6d98-404c-ab10-fde39d50f1b7: Get http://172.16.106.4:25250/ce/43f0c2f9-6d98-404c-ab10-fde39d50f1b7: read tcp 172.16.106.31:42296->172.16.106.4:25250: read: connection reset by peer', stderr: '': exit status 1 (00:07:39)
   Failed compiling packages > grafana/5271e14376bcd8484d934e4379fcfb1e8467adea: Action Failed get_task: Task c49dfcb5-680c-4c96-5298-0f406bc7df17 result: Compiling package grafana: Fetching package grafana: Fetching package blob 8e050cc5-6ec7-4b63-80d3-65b957a8179e: Getting blob from inner blobstore: Getting blob from inner blobstore: Shelling out to bosh-blobstore-dav cli: Running command: 'bosh-blobstore-dav -c /var/vcap/bosh/etc/blobstore-dav.json get 8e050cc5-6ec7-4b63-80d3-65b957a8179e /var/vcap/data/tmp/bosh-blobstore-externalBlobstore-Get218393412', stdout: 'Error running app - Getting dav blob 8e050cc5-6ec7-4b63-80d3-65b957a8179e: Get http://172.16.106.4:25250/7b/8e050cc5-6ec7-4b63-80d3-65b957a8179e: read tcp 172.16.106.30:56982->172.16.106.4:25250: read: connection reset by peer', stderr: '': exit status 1 (00:09:08)
   Failed compiling packages (00:09:08)

Error 450001: Action Failed get_task: Task cede0d5c-cac7-4027-79ea-975a73f6201a result: Compiling package cf_exporter: Fetching package cf_exporter: Fetching package blob f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0: Getting blob from inner blobstore: Getting blob from inner blobstore: Shelling out to bosh-blobstore-dav cli: Running command: 'bosh-blobstore-dav -c /var/vcap/bosh/etc/blobstore-dav.json get f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0 /var/vcap/data/tmp/bosh-blobstore-externalBlobstore-Get233330094', stdout: 'Error running app - Getting dav blob f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0: Get http://172.16.106.4:25250/0c/f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0: read tcp 172.16.106.32:34312->172.16.106.4:25250: read: connection reset by peer', stderr: '': exit status 1

The other packages compiled smoothly. Might it be the case that those 3 files are corrupt?

Add smoke-test

Add smoke tests so we can verify that the final deployment succeeded (i.e. alertmanager, prometheus, and grafana are up).

Prometheus data files stored under root partition

Prometheus data files are currently stored under the root partition at /var/vcap/store/prometheus, which can quickly cause the root partition to run out of disk space. Can we add an attribute to the spec to override the data path to point at the data partition, /var/vcap/data/prometheus?

It seems that storage.local.path is hardcoded to use /var/vcap/store/prometheus
https://github.com/cloudfoundry-community/prometheus-boshrelease/blob/master/jobs/prometheus/templates/bin/prometheus_ctl#L90

high cpu usage on director and slow bosh CLI with bosh_exporter

Running against a small PCF 1.9, the bosh_exporter scrape interval of 30s is really causing bosh task queueing, as expected, but that is impacting the bosh user experience:

  • on the bosh director, top reports 50% CPU usage for the user
  • with the bosh CLI, it takes close to 2 minutes to run a "bosh vms" across 5 bosh deployments; the FAQ is not so clear about this (queues example attached)

Moving to a scrape interval of 10 minutes fixes this fully, but is likely to impact alerting on bosh health messages.
I am planning to change the default to use BoshHMforwarder.
On PCF the ECS team has made that easy with a tile - http://www.ecsteam.com/deploying-bosh-health-metrics-forwarder-pivotal-cloud-foundry-tile
I would think defaulting this bosh release to using boshhmforwarder (even without a tile, bringing its own as part of this release) would be a wiser choice.
(screenshot attached, 2017-03-05)

Audit events

Any plans to add auditing events to the cf_exporter? It would be nice to gather those details per app too.

Grafana-UAA integration

Create a document explaining how you can integrate Grafana user authentication with Cloud Foundry UAA.

firehose_exporter.metrics.environment values

Hi,
currently trying to upgrade to 15.0.0 and was stopped by the following error:

Error 100: Unable to render instance groups for deployment. Errors are:
   - Unable to render jobs for instance group 'prometheus'. Errors are:
     - Unable to render templates for job 'bosh_exporter'. Errors are:
       - Error filling in template 'bosh_exporter_ctl' (line 76: Can't find property '["bosh_exporter.metrics.environment"]')
     - Unable to render templates for job 'firehose_exporter'. Errors are:
       - Error filling in template 'firehose_exporter_ctl' (line 70: Can't find property '["firehose_exporter.metrics.environment"]')
     - Unable to render templates for job 'cf_exporter'. Errors are:

Looking at the code at:

https://github.com/cloudfoundry-community/prometheus-boshrelease/blob/589eb4e4292c1687910978d7330c0aed09d89444/jobs/firehose_exporter/templates/bin/firehose_exporter_ctl#L70

I see it is mandatory. Is this a mistake or intended behaviour? If intended, what would be appropriate values for it?

Metrics showing for a non-existent CF component after scale down

I scaled my routing tier up to 2 routers and then back down to a single router, to validate my dashboard settings and the ability to add metrics as the platform changes. I noticed the second router still shows up in the dashboards. Using the Prometheus graph tool with my query, both routers show up. What can I look at to validate this behavior?

(screenshots attached, 2016-12-02)

Scale-out

As the number of metrics can be huge for large deployments, how do we achieve a scale-out architecture?

Thanks

firehose_exporter internal servererror 500 after update to Pivotal RabbitMQ Service 1.8.*

Hello,

I'm using the prometheus bosh release in a Pivotal Cloud Foundry environment.
After updating the rabbitmq service tile to 1.8.* I am experiencing an issue with the firehose_exporter.

The update added an additional service broker for dedicated RabbitMQ services.
This additional service broker produces the following issue in the firehose_exporter:

  • collected metric firehose_value_metric_p_rabbitmq_log_sender_total_messages_read label:<name:"bosh_deployment" value:"cf-rabbitmq" > label:<name:"bosh_job_id" value:"8daeceac-05cf-4094-a1ab-d9355dac1584" > label:<name:"bosh_job_ip" value:"192.168.17.15" > label:<name:"bosh_job_name" value:"rabbitmq-broker" > label:<name:"environment" value:"P" > label:<name:"origin" value:"p-rabbitmq" > label:<name:"unit" value:"count" > gauge:<value:0 > has help "Cloud Foundry Firehose 'logSenderTotalMessagesRead' value metric from 'p-rabbitmq'." but should have "Cloud Foundry Firehose 'logSenderTotalMessagesRead' value metric from 'p.rabbitmq'."

Turning off the logs from the dedicated RabbitMQ service broker fixes this issue temporarily.

Can you please update the firehose_exporter to consume logs from both the shared and the dedicated RabbitMQ service brokers?

Thank you,
Martin

Some metrics are missing

Hi,
I'm trying to monitor CF with prometheus-boshrelease; currently I have set up the cf_exporter and firehose_exporter.

Most metrics are collected by these exporters, but some metrics do not show up in the Prometheus DB.

Ex)
firehose_counter_event_gorouter_bad_gateways_delta
firehose_counter_event_gorouter_bad_gateways_total
firehose_counter_event_gorouter_rejected_requests_delta
firehose_counter_event_gorouter_rejected_requests_total
firehose_counter_event_bbs_* metrics are empty, except for two: "firehose_counter_event_bbs_request_count_delta" and "firehose_counter_event_bbs_request_count_total"

ENV)
cf : 238
diego : 0.1476.0

cf_exporter, version 0.4.3 (branch: master, revision: 9e37d9069bbb87d739d2c326981fed917ec016e4)
build user: root@1d21624a3782
build date: 20170216-02:38:17
go version: go1.7.5

firehose_exporter, version 4.1.0 (branch: master, revision: 95333eab4c8295bf727faa564add32422f5d71c6)
build user: root@387dfcd86a81
build date: 20170216-01:18:51
go version: go1.7.5

Monitoring multiple bosh directors

Is it possible to monitor multiple bosh directors with a single prometheus deployment? We would like to run a single prometheus that collects metrics from multiple directors (staging and production). It looks like the bosh exporter only talks to a single director, so we'd need to run multiple instances of the exporter. And that would mean running each exporter on a separate vm, since bosh doesn't know how to run multiple instances of the same job on the same host. And that would mean prometheus wouldn't be able to read service discovery files generated by bosh exporters, since we can only colocate prometheus with one of the exporters.

It seems like our options for this use case would be:

  • Teach the bosh exporter to monitor multiple bosh directors
  • Run multiple vms, each with prometheus and bosh exporter, and use federation to combine metrics

Does the first option make sense to you @frodenas, or do you think the exporter should only monitor a single director? Or am I missing a simpler solution?

cc @cnelson
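
For the federation option, a hedged sketch of what the scrape configuration on the combining Prometheus might look like (the `match[]` selector and target addresses are placeholders, not values from this release):

```yaml
# Sketch: a "parent" Prometheus federating from one per-director
# Prometheus instance per environment. Addresses are placeholders.
scrape_configs:
- job_name: federate
  honor_labels: true        # keep the labels set by the source Prometheus
  metrics_path: /federate
  params:
    'match[]':
    - '{__name__=~"bosh_.*"}'   # placeholder selector for the series to pull
  static_configs:
  - targets:
    - 10.0.1.10:9090   # staging Prometheus (placeholder)
    - 10.0.2.10:9090   # production Prometheus (placeholder)
```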

Interesting behaviour during upgrade

Hi,
I did an extremely smooth upgrade from v239 to v245 of CF Release and Diego (according to the specs of the CF Release) today. There were no errors; the deployment went through.

I have two findings, which I would like to share and also discuss:
[screenshot: 2017-01-06 11:16:47]
In the picture above you can see that the capacity decreased in this demo deployment, although CF is running fine and we are still able to scale our sample application.

[screenshot: 2017-01-06 11:19:23]
The compilation workers are visible in the dashboard. Perhaps we should enable filtering them out in the dashboards?

"No data points" when viewing Apps dashboards

Been wanting to kick the tires on Prometheus for a while and this release got it up and going super quickly! The only issue I'm having is the Apps dashboards don't display any info (firehose and bosh dashboards are fine). Here's my config, a cf-deployment ops file: https://github.com/cloudfoundry/capi-ci/blob/master/cf-deployment-operations/add-prometheus.yml.

Looks like the cf_exporter is in charge of generating metrics for the Apps dashboards. Even when changing the log level for the exporter job to debug, the only log line is level=info msg="Listening on :9193" source="cf_exporter.go:278". The expected metrics, like "cf_total_application_events", also don't appear in the Prometheus metrics explorer.

Appreciate the help and thanks for building this release!

firehose_value_metric_etcd_is_leader duplicated

We are starting to look at and test the prometheus-boshrelease with 2 firehose exporters, and Prometheus started to trigger one alert, CFEtcdMoreThanOneLeader (1 active):

ALERT CFEtcdMoreThanOneLeader
  IF count(firehose_value_metric_etcd_is_leader == 1) BY (environment, bosh_deployment) > 1
  FOR 10m
  LABELS {service="cf-etcd", severity="critical"}
  ANNOTATIONS {description="CF etcd cluster at deployment `{{$labels.environment}}/{{$labels.bosh_deployment}}` had more than one leader in the last 10 minutes: {{value}}", summary="CF etcd cluster at deployment `{{$labels.environment}}/{{$labels.bosh_deployment}}` > 1 leader"}

because there are two "similar" metrics. I have checked the status of the cluster and it is healthy; the leader is 10.230.16.79 (no partitions, only one leader). In the next picture you can see that the "same" metric was reported by a different firehose_exporter instance at a different time (see the instance tag), so Prometheus considers them different metrics.

[image attached]

I do not have enough experience with Prometheus, but, assuming that tagging the metric with the firehose_exporter instance is needed, where do you think this issue should be fixed: in the alert definition, or by doing some kind of duplicate deletion? Any other ideas?

Thanks!
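
One possible fix on the alert-definition side, sketched here in the newer YAML rule format rather than the release's current format, is to aggregate away the per-exporter `instance` label before counting leaders. This assumes each etcd node is uniquely identified by its `bosh_job_ip`:

```yaml
# Sketch: collapse duplicate reports of the same etcd node from different
# firehose_exporter instances with max() before counting leaders.
groups:
- name: cf-etcd
  rules:
  - alert: CFEtcdMoreThanOneLeader
    expr: >
      count(
        max(firehose_value_metric_etcd_is_leader)
          by (environment, bosh_deployment, bosh_job_ip) == 1
      ) by (environment, bosh_deployment) > 1
    for: 10m
    labels:
      service: cf-etcd
      severity: critical
```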

Mysql Dashboard open source license type

When prometheus-boshrelease v17.0.0 was released, all dashboard files moved from packages to job templates,
but the mysql dashboard LICENSE file was removed (d5abc49).
What is the open source license of the mysql dashboard (Apache or AGPL-3.0)?

Better stemcell alerts

I was talking to @LinuxBozo about the bosh outdated stemcell alerts that I contributed here, and he pointed out it's easy to miss outdated stemcells. If the prometheus deploy that bumps the expected version fails, or gets canceled, or if an operator pauses the concourse job that deploys it, etc, we won't notice if stemcells are out of date.

I'm wondering if we can do better by adding a tiny exporter that emits the current stemcell version for a particular stemcell series. Two quick proposals:

First approach

The stemcell exporter queries http://bosh.io/api/v1/stemcells/ and emits metrics like this:

bosh_stemcell_info{bosh_stemcell_version="3312.29"} 1
bosh_stemcell_info{bosh_stemcell_version="3312.28"} 0
bosh_stemcell_info{bosh_stemcell_version="3312.27"} 0
...

Then we can write a query like this:

bosh_deployment_stemcell_info * on(bosh_stemcell_version) group_left bosh_stemcell_info

This gives a simple prometheus query, but could potentially miss stemcells that are from the wrong series entirely. I don't know if that's realistic, but we can handle that using a different approach:

Second approach

The stemcell exporter emits a single metric, for the expected stemcell:

bosh_stemcell_info{bosh_stemcell_version="3312.29"} 1

Then we can find outdated stemcells by listing all deployments, then subtracting deployments with the expected version:

bosh_deployment_stemcell_info unless (bosh_deployment_stemcell_info * on(bosh_stemcell_version) group_left bosh_stemcell_info)

I'm still new to prometheus, so maybe there's a simpler approach. WDYT?

cc @cnelson
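
If the second approach were adopted, the query could be packaged as a recording rule so dashboards and alerts can share it. A sketch, assuming the metric names described above exist (the rule name is a placeholder):

```yaml
# Sketch: precompute "deployments on an unexpected stemcell version" as a
# recording rule. bosh_stemcell_info is the hypothetical exporter's metric.
groups:
- name: stemcells
  rules:
  - record: bosh_deployment_stemcell_outdated
    expr: >
      bosh_deployment_stemcell_info
        unless (bosh_deployment_stemcell_info
          * on(bosh_stemcell_version) group_left
          bosh_stemcell_info)
```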

BOSH dashboards fail when "All" is selected

Hi,

In Grafana, it seems that since v11 all BOSH dashboards are failing.
That is, when I select "All" jobs, for example, I get this error:

TypeError: Cannot read property 'replace' of undefined

Grafana Admin Passwort Update Failing

Hi,

we updated today from Prometheus v17.0.0 to v17.5.0, and the deployment failed because Grafana did not successfully execute the post-script.

Error: Failed to update user password
 
NAME:
   Grafana cli admin reset-admin-password - reset-admin-password <new password>
 
USAGE:
   Grafana cli admin reset-admin-password [command options] [arguments...]
 
OPTIONS:
   --homepath   path to grafana install/home path, defaults to working directory
   --config     path to config file

But to me the command executed seems correct. Any suggestions?
