Prometheus BOSH Release

This is a BOSH release for Prometheus, Alertmanager, and Grafana. It also includes various Prometheus exporters and Grafana plugins.

The detailed list of included components and their maintenance status is available in VERSIONS.md.

Questions? Pop in our Slack channel!

Usage

Requirements

In order to use this BOSH release you will need a working BOSH environment and the BOSH CLI v2.

Although not mandatory, it is recommended to deploy the node exporter addon in order to get system metrics.
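For example, the node exporter can be deployed on every VM through a BOSH runtime config. The sketch below is illustrative only; the release version is a placeholder (runtime configs require a pinned version, 'latest' is not accepted), and the job name should be verified against the job specs in this repository:

```yaml
# runtime.yml - illustrative runtime-config addon for the node exporter
releases:
- name: prometheus
  version: "23.3.0"   # placeholder: pin the release version you uploaded
addons:
- name: node-exporter
  jobs:
  - name: node_exporter
    release: prometheus
```

It would be applied with `bosh update-runtime-config runtime.yml` (file name is illustrative).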

Clone the repository

First, clone this repository into your workspace:

git clone https://github.com/bosh-prometheus/prometheus-boshrelease
cd prometheus-boshrelease
export BOSH_ENVIRONMENT=<name>

Then check out the release branch you want to use, so that the manifest files are in sync with the release version:

git checkout v...
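If you are unsure which tag to pick, a small sketch for checking out the newest tagged release (this assumes the release tags follow the v-prefixed semver naming shown above; pick a specific tag instead if you need a particular version):

```shell
# List v-prefixed tags newest-first (version sort) and check out the top one
latest=$(git tag --list 'v*' --sort=-v:refname | head -n1)
git checkout "$latest"
```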

Basic deployment

To deploy a basic Prometheus server with Alertmanager and Grafana (but no exporters), use the following command:

bosh -d prometheus deploy manifests/prometheus.yml \
  --vars-store tmp/deployment-vars.yml

Once deployed, look for the nginx instance IP address:

bosh -d prometheus instances
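A hedged helper for scripting this lookup. The column layout (instance name first, IP last) is an assumption about the CLI's default table format; verify it against your bosh CLI version:

```shell
# Print the first nginx instance's IP from `bosh instances` output
nginx_ip() { awk '$1 ~ /^nginx\// { print $NF; exit }'; }

# usage: bosh -d prometheus instances | nginx_ip
```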

You can reach each component's web UI at:

  • alertmanager: http://<nginx-ip-address>:9093
  • grafana: http://<nginx-ip-address>:3000
  • prometheus: http://<nginx-ip-address>:9090

Credentials for each component can be found in the tmp/deployment-vars.yml file.
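A minimal sketch for reading a generated credential out of the flat YAML vars store. The key name grafana_password is an assumption; inspect the file for the exact keys your ops files generated (the BOSH CLI's `bosh int tmp/deployment-vars.yml --path /grafana_password` does the same job):

```shell
# Print the value of a top-level key from the vars store
get_var() { awk -v k="$1" '$1 == k":" { print $2; exit }' tmp/deployment-vars.yml; }

# usage: get_var grafana_password
```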

Using BOSH Service Discovery

If you want to use BOSH Service Discovery to dynamically discover your exporters, add the monitor-bosh.yml ops file by running the following command (filling in the required variables with your own values):

bosh -d prometheus deploy manifests/prometheus.yml \
  --vars-store tmp/deployment-vars.yml \
  -o manifests/operators/monitor-bosh.yml \
  -v bosh_url= \
  -v bosh_username= \
  -v bosh_password= \
  --var-file bosh_ca_cert= \
  -v metrics_environment=

NOTE: metrics_environment is an arbitrary name identifying your environment (e.g. test, nyc-prod, ...)
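As an alternative to repeating -v flags, the BOSH CLI can read these variables from a YAML file passed with -l (--vars-file). A sketch with placeholder values only:

```yaml
# vars.yml - illustrative values; substitute your own
bosh_url: https://10.0.0.6:25555
bosh_username: admin
bosh_password: some-password
metrics_environment: nyc-prod
```

It would then be used as `bosh -d prometheus deploy manifests/prometheus.yml --vars-store tmp/deployment-vars.yml -o manifests/operators/monitor-bosh.yml -l vars.yml --var-file bosh_ca_cert=bosh-ca.pem` (the CA file name is a placeholder).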

If you have configured your bosh-deployment to use UAA user management (via the uaa.yml ops file), we recommend first applying the add-bosh-exporter-uaa-clients.yml ops file to your bosh-deployment, and then adding the enable-bosh-uaa.yml ops file to the prometheus deployment by running the following command (filling in the required variables with your own values):

bosh -d prometheus deploy manifests/prometheus.yml \
  --vars-store tmp/deployment-vars.yml \
  -o manifests/operators/monitor-bosh.yml \
  -o manifests/operators/enable-bosh-uaa.yml \
  -v bosh_url= \
  --var-file bosh_ca_cert= \
  -v metrics_environment=

If you have manually configured a UAA client_id for the bosh_exporter (different from the default bosh_exporter), run the following command instead:

bosh -d prometheus deploy manifests/prometheus.yml \
  --vars-store tmp/deployment-vars.yml \
  -o manifests/operators/monitor-bosh.yml \
  -o manifests/operators/enable-bosh-uaa.yml \
  -o manifests/operators/configure-bosh-exporter-uaa-client-id.yml \
  -v bosh_url= \
  -v uaa_bosh_exporter_client_id= \
  -v uaa_bosh_exporter_client_secret= \
  --var-file bosh_ca_cert= \
  -v metrics_environment=

Monitoring Cloud Foundry

If you want to monitor your Cloud Foundry platform, first update your cf-deployment by adding the add-prometheus-uaa-clients.yml ops file.

This will add the UAA clients required to gather information from the Cloud Foundry API and the Firehose. Then add the monitor-cf.yml ops file by running the following command (filling in the required variables with your own values):

bosh -d prometheus deploy manifests/prometheus.yml \
  --vars-store tmp/deployment-vars.yml \
  -o manifests/operators/monitor-bosh.yml \
  -v bosh_url= \
  -v bosh_username= \
  -v bosh_password= \
  --var-file bosh_ca_cert= \
  -v metrics_environment= \
  -o manifests/operators/monitor-cf.yml \
  -v metron_deployment_name= \
  -v system_domain= \
  -v uaa_clients_cf_exporter_secret= \
  -v loggregator_ca_name= \
  -v skip_ssl_verify=

NOTE: The metron_deployment_name property should match the deployment property of your metron_agent or loggregator_agent jobs.

NOTE: The loggregator_ca_name property should match the full CredHub path of the loggregator_ca certificate variable, e.g. /bosh-mydirector/cf/loggregator_ca.

NOTE: You can switch to the legacy implementation of the firehose_exporter and the legacy Cloud Foundry dashboards by adding the following ops files:

  • on prometheus deployment, adapt:
    ...
    -o manifests/operators/monitor-cf.yml \
    -o manifests/operators/deprecated/monitor-cf-attic.yml \
    -v uaa_clients_firehose_exporter_secret= \
    -v traffic_controller_external_port= \
    ...
    
  • When using add-prometheus-uaa-clients.yml on cloud foundry deployment, adapt:
    ...
    -o manifests/operators/cf/add-prometheus-uaa-clients.yml
    -o manifests/operators/deprecated/cf/add-prometheus-uaa-clients-attic.yml
    ...
    

This will switch the deployment to firehose_exporter-attic, cloudfoundry_dashboards-attic, and cloudfoundry_alerts-attic.

Register Cloud Foundry routes

If you want to access the alertmanager, grafana, and prometheus web UIs using your Cloud Foundry system domain instead of IP addresses, you can register those routes inside your Cloud Foundry environment with the enable-cf-route-registrar.yml ops file by running the following command (filling in the required variables with your own values):

bosh -d prometheus deploy manifests/prometheus.yml \
  --vars-store tmp/deployment-vars.yml \
  ...
  -o manifests/operators/enable-cf-route-registrar.yml \
  -v system_domain= \
  -v cf_deployment_name=

The ops file will register the following routes:

  • https://alertmanager.<cf system domain>
  • https://grafana.<cf system domain>
  • https://prometheus.<cf system domain>

Use UAA for Grafana authentication

If you want to allow users registered in your Cloud Foundry environment to access the Grafana dashboards (Viewer role only), first update your cf-deployment by adding the add-grafana-uaa-clients.yml ops file. This will add the UAA client required by the Grafana-UAA integration.

Then add the enable-grafana-uaa.yml op file by running the following command (filling the required variables with your own values):

bosh -d prometheus deploy manifests/prometheus.yml \
  --vars-store tmp/deployment-vars.yml \
  ...
  -o manifests/operators/enable-grafana-uaa.yml \
  -v system_domain= \
  -v uaa_clients_grafana_secret= \
  --var-file uaa_ssl.ca= \
  --var-file uaa_ssl.certificate= \
  --var-file uaa_ssl.private_key=

Operations files

Additional operations files are located in the manifests/operators directory. These files include only a basic configuration, so extra ops files might be needed for additional configuration.

Please review the ops files before deploying them to check their requirements, dependencies, and necessary variables.
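As an illustration of the format, ops files in this repository are go-patch documents. A hypothetical one that overrides a property might look like the following; the path shown is an assumption, so verify the instance-group and job names against manifests/prometheus.yml before using anything like this:

```yaml
# Hypothetical ops file: override the Prometheus scrape interval
- type: replace
  path: /instance_groups/name=prometheus2/jobs/name=prometheus2/properties/prometheus/scrape_interval?
  value: 30s
```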

File Description exporter dashboards alerts
(the trailing 'x' marks indicate whether an ops file brings an exporter, dashboards, and/or alerts)
alertmanager-group-by-alertname.yml Groups alertmanager alerts by name
alertmanager-hipchat-receiver.yml Configures a HipChat receiver for alertmanager
alertmanager-opsgenie-receiver.yml Configures a OpsGenie receiver for alertmanager
alertmanager-pagerduty-receiver.yml Configures a PagerDuty receiver for alertmanager
alertmanager-pushover-receiver.yml Configures a Pushover receiver for alertmanager
alertmanager-slack-receiver.yml Configures a Slack receiver for alertmanager
alertmanager-victorops-receiver.yml Configures a VictorOps receiver for alertmanager
alertmanager-webhook-receiver.yml Configures a generic webhook receiver for alertmanager
alertmanager-web-external-url.yml Configures the URL under which alertmanager is externally reachable
configure-bosh-exporter-uaa-client-id.yml Configures a custom bosh_exporter UAA client_id for the enable-bosh-uaa.yml op-file
enable-bosh-uaa.yml Configures monitor-bosh.yml to use an UAA client (you must apply the add-bosh-exporter-uaa-clients.yml op file to your bosh-deployment)
enable-cf-route-registrar.yml Registers alertmanager, grafana, and prometheus as Cloud Foundry routes (under your system domain)
enable-grafana-uaa.yml Configures grafana user authentication to use Cloud Foundry UAA (you must apply the add-grafana-uaa-clients.yml op file to your cf-deployment)
enable-grafana-generic-oauth.yml Configures grafana user authentication to use a generic OAuth2 provider
enable-service-discovery.yml Enable service discovery files using BOSH links
enable-proxy-alertmanager.yml Enables http(s) proxy for alertmanager
enable-proxy-blackbox-exporter.yml Enables http(s) proxy for blackbox_exporter
enable-proxy-bosh-exporter.yml Enables http(s) proxy for bosh_exporter
enable-proxy-cf-exporter.yml Enables http(s) proxy for cf_exporter
enable-proxy-firehose-exporter.yml Enables http(s) proxy for firehose_exporter
enable-proxy-grafana.yml Enables http(s) proxy for grafana
enable-proxy-kubernetes.yml Enables http(s) proxy for kube_state_metrics_exporter
enable-proxy-prometheus.yml Enables http(s) proxy for prometheus
enable-proxy-shield-exporter.yml Enables http(s) proxy for shield_exporter
enable-proxy-stackdriver-exporter.yml Enables http(s) proxy for stackdriver_exporter
enable-root-url.yml Enables root_url for grafana
migrate_from_prometheus_1.yml Allows migrating an instance from Prometheus 1.x to Prometheus 2.x
monitor-bosh.yml Enables monitoring BOSH jobs and processes and enables Service Discovery x x x
monitor-cadvisor.yml Enables monitoring cAdvisor x
monitor-cf.yml Enables monitoring Cloud Foundry via the Cloud Foundry and Cloud Foundry Firehose exporters (you must apply the add-prometheus-uaa-clients.yml op file to your cf-deployment) x x x
monitor-collectd.yml Enables monitoring Collectd x
monitor-concourse.yml Enables monitoring Concourse CI >= v3.8.0 (you must apply the enable-prometheus-metrics.yml op file to your concourse-deployment) x x
monitor-concourse-influxdb.yml Enables monitoring Concourse CI < v3.8.0. Requires node exporter on Concourse VMs (probably as a BOSH add-on) and InfluxDB to be deployed independently and configured as a data source in Grafana as well as Concourse configured to send events to InfluxDB x
monitor-consul.yml Enables monitoring Consul x x x
monitor-credhub.yml Enables monitoring Credhub x x
monitor-elasticsearch.yml Enables monitoring Elasticsearch x x x
monitor-graphite.yml Enables monitoring Graphite x
monitor-haproxy.yml Enables monitoring HAProxy x x x
monitor-http-probe.yml Enables monitoring HTTP(s) endpoints via the Blackbox exporter x x x
monitor-influxdb.yml Enables monitoring InfluxDB x
monitor-kubernetes.yml Enables monitoring Kubernetes x x x
monitor-memcached.yml Enables monitoring Memcached x
monitor-mongodb.yml Enables monitoring MongoDB x
monitor-mysql.yml Enables monitoring MySQL x x x
monitor-nats.yml Enables monitoring NATS x
monitor-node.yml Enables monitoring system metrics via the node exporter x
monitor-p-rabbitmq.yml Enables monitoring RabbitMQ for PCF (requires the monitor-cf.yml op file) x x
monitor-p-redis.yml Enables monitoring Redis for PCF (requires the monitor-cf.yml op file) x x
monitor-postgres.yml Enables monitoring PostgreSQL x x x
monitor-pushgateway.yml Deploys a PushGateway x
monitor-rabbitmq.yml Enables monitoring RabbitMQ x x x
monitor-redis.yml Enables monitoring Redis x x x
monitor-shield.yml Enables monitoring Shield x x x
monitor-stackdriver.yml Enables monitoring Stackdriver x
monitor-statsd.yml Enables monitoring Statsd x
monitor-vault.yml Enables monitoring Vault x x
nginx-vm-extension.yml Adds a VM Extension block to the nginx instance, useful to attach a Load Balancer
prometheus-web-external-url.yml Configures the URL under which prometheus is externally reachable
use-sqlite3.yml Use sqlite3 instead of postgres

In addition, some deprecated ops files allow switching back to legacy behaviours:

File Description exporter dashboards alerts
deprecated/monitor-cf-attic.yml Use legacy implementation of monitor-cf.yml x x x
deprecated/cf/add-prometheus-uaa-clients-attic.yml Adds UAA client in cloud foundry deployment when using monitor-cf-attic.yml
deprecated/enable-cf-loggregator-v2.yml Enables Cloud Foundry Loggregator V2 API calls in the legacy firehose_exporter

Deployment variables and the var-store

Some operations files require additional information providing environment-specific or sensitive configuration, such as various credentials. In the default configuration this is handled with the --vars-store flag, which takes the name of a YAML file that the CLI reads and writes. Where necessary credential values are not present, the CLI generates new values based on the type information stored in the deployment files. Variables that BOSH can't generate must be supplied as well; see each ops file you're using for any additional required variables.
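For illustration, a vars store generated by the basic deployment might look roughly like this; the variable names depend on the ops files applied, and all values here are made-up placeholders:

```yaml
# tmp/deployment-vars.yml - illustrative shape only
alertmanager_password: pQ2x9vR7sT1u
grafana_password: aB3cD4eF5gH6
grafana_secret_key: 0123456789abcdef
```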

See also the BOSH CLI documentation for more information about ways to supply such additional variables.

Contributing

Refer to CONTRIBUTING.md.

License

Apache License 2.0, see LICENSE.

prometheus-boshrelease's People

Contributors

aegershman, alexvianet, arthurhlt, aveyrenc, bandesz, benjaminguttmann-avtq, bkez322, chitoku-k, dark5un, drnic, frodenas, geofffranks, infra-red, ionphractal, jmcarp, jtuchscherer, kinjelom, making, mchabane, mjsjinsu, mkuratczyk, peterellisjones, pommi, psycofdj, rkoster, romain-dartigues, sba30, solera-concourse, thehandsomezebra, ywei2017


prometheus-boshrelease's Issues

CF Exporter - Filter Collector

Hi,

I am facing currently an issue with the cf_exporter.
When I define any collector mentioned in the specs of the exporter, the exporter isn't able to start and shows errors like:

time="2017-07-21T12:43:56Z" level=error msg="Collector filter [ApplicationEvents] is not supported" source="cf_exporter.go:213"

Any ideas?

Regards,

Benjamin

Prometheus API responds, but the Prometheus web UI doesn't

Hi.

Although I did set up prometheus-17.6.0 with BOSH, I couldn't get data from the Prometheus web UI (no data points), but the Prometheus API is working.

$ curl -g 'http://100.99.50.80:9090/api/v1/series?match[]=cf_organization_info{environment="hoge"}'
{"status":"success","data":[{"__name__":"cf_organization_info","deployment":"cf","environment":"hoge","instance":"localhost:9193","job":"cf","organization_id":"e00539b9-4437-4e88-9a9a-6b807861383c","organization_name":"org","quota_name":"default"},
...
..
.

Why would it happen?

bosh_exporter should not filter.azs by default

Hi,

What is the use case for the 'else' statement here: https://github.com/cloudfoundry-community/prometheus-boshrelease/blob/master/jobs/bosh_exporter/templates/bin/bosh_exporter_ctl#L50-L51 ? We just ran into an issue where we had no job-level information after deploying prometheus-release and we tracked it down to this - we didn't intend to filter at all but by default it was filtering on AZ (and we didn't have any interesting jobs in that AZ).

Thanks,

Authentication issue

Hi

When I upgrade prometheus from version 12.3.3 to 14.0.0 and hit the grafana dashboard, I get an authentication pop-up multiple times asking for username and password, and a warning saying my connection is not private.

Enhancement of documentation

Hi,
I am currently trying to install prometheus-boshrelease on open-source CF, backed by OpenStack. The deployment went through smoothly after creating an OpenStack deployment manifest, and I am able to open Grafana, but I am not able:

  • to see the dashboards provided
  • when importing the dashboards manually, any datapoints

So it seems that I am doing something wrong. Could anyone give me a heads up? I would also participate in creating the documentation.

Thanks a lot & BR,
Johannes

Alerting on diego low remaining memory seems incorrect

First, thanks for the great work on this release, it helps a lot.

I just have one issue with alerting: I always have a "DiegoLowRemainingMemory" alert in the Alertmanager, but I have 82 GB available in my cells, as the CF cells capacity dashboard says. This value on the dashboard seems correct, as does the query used to build it.

To me the alert has a bad expression, because as we can see here https://github.com/cloudfoundry-community/prometheus-boshrelease/blob/master/src/cloudfoundry_alerts/diego.alerts#L26 it never sums over the multiple cells we can have in a deployment.

Am I wrong? If not, I can propose this solution: sum(avg(firehose_value_metric_rep_capacity_remaining_memory) by(bosh_deployment, bosh_job_name, bosh_job_id)) to replace the current expression in the alert.

Alert BOSHJobHighCPULoad does not take number of CPUs into account

The BOSHJobHighCPULoad alert queries the bosh_job_load_avg01 metric. The problem is that a sensible warning threshold for this metric depends on the number of CPUs: generally speaking, 100% CPU load on 1 core is indicated by a value of 1, while 100% CPU load on a 16-core machine results in a 16.

Since the query for this alert does not divide the metric by the number of CPUs, it is not very useful: on a 1-core machine the default threshold of 5 means you have to fix things immediately, on a 4-core machine a load average of 5 still indicates a slight problem, but on an 8-core machine a load average of 5 is absolutely fine.

So my request: can the load average please be divided by the number of CPUs in the machine before comparing it to the threshold value?

grafana dashboard error

Hello, I have hit this problem since 15.0.0 (reproduced on 17.0.0 and the latest 17.0.2).

The Grafana post-start script fails on a strange file:

Updating dashboard /var/vcap/jobs/cloudfoundry_dashboards/prometheus_cf_exporter.json at Wed Jun  7 10:52:23 UTC 2017
Validating /var/vcap/store/grafana/dashboards//prometheus_cf_exporter.json
Updating dashboard /var/vcap/jobs/cloudfoundry_dashboards/prometheus_firehose_exporter.json at Wed Jun  7 10:52:23 UTC 2017
Validating /var/vcap/store/grafana/dashboards//prometheus_firehose_exporter.json
Updating dashboard agent.cert at Wed Jun  7 10:52:23 UTC 2017
Validating /var/vcap/store/grafana/dashboards//agent.cert
parse error: Invalid numeric literal at line 1, column 11

This file is present in /var/vcap/store; I don't know where it comes from.

grafana/47c0244a-3c59-4837-8602-4b619ef29c60:/var/vcap/store/grafana/dashboards# pwd                                                                       
/var/vcap/store/grafana/dashboards                                                                                                                         
grafana/47c0244a-3c59-4837-8602-4b619ef29c60:/var/vcap/store/grafana/dashboards# ls -lrt                                                                   
total 860                                                                                                                                                  
-rw-r--r-- 1 root root  6687 Jun  7 10:52 bosh_deployments.json                                                                                            
-rw-r--r-- 1 root root 35055 Jun  7 10:52 bosh_jobs.json                                                                                                   
-rw-r--r-- 1 root root 35046 Jun  7 10:52 bosh_overview.json                                                                                               
-rw-r--r-- 1 root root 15181 Jun  7 10:52 bosh_processes.json                                                                                              
-rw-r--r-- 1 root root 28081 Jun  7 10:52 prometheus_bosh_exporter.json
-rw-r--r-- 1 root root 11501 Jun  7 10:52 prometheus_bosh_tsdb_exporter.json
-rw-r--r-- 1 root root 16435 Jun  7 10:52 cf_apps_events.json
-rw-r--r-- 1 root root 13365 Jun  7 10:52 cf_apps_latency.json
-rw-r--r-- 1 root root 36548 Jun  7 10:52 cf_apps_requests.json
-rw-r--r-- 1 root root 20817 Jun  7 10:52 cf_apps_system.json
-rw-r--r-- 1 root root 19154 Jun  7 10:52 cf_bbs.json
-rw-r--r-- 1 root root 21344 Jun  7 10:52 cf_cc.json
-rw-r--r-- 1 root root 20338 Jun  7 10:52 cf_cells_capacity.json
-rw-r--r-- 1 root root 25104 Jun  7 10:52 cf_component_metrics.json
-rw-r--r-- 1 root root 13976 Jun  7 10:52 cf_diego_auctions.json
-rw-r--r-- 1 root root 11125 Jun  7 10:52 cf_diego_health.json
-rw-r--r-- 1 root root 21552 Jun  7 10:52 cf_doppler_server.json
-rw-r--r-- 1 root root 26061 Jun  7 10:52 cf_etcd.json
-rw-r--r-- 1 root root 22500 Jun  7 10:52 cf_etcd_operations.json
-rw-r--r-- 1 root root 11880 Jun  7 10:52 cf_garden_linux.json
-rw-r--r-- 1 root root 78176 Jun  7 10:52 cf_kpis.json
-rw-r--r-- 1 root root 16062 Jun  7 10:52 cf_lrps_tasks.json
-rw-r--r-- 1 root root 15427 Jun  7 10:52 cf_metron_agent.json
-rw-r--r-- 1 root root 13902 Jun  7 10:52 cf_metron_agent_doppler.json
-rw-r--r-- 1 root root 10589 Jun  7 10:52 cf_organization_memory_quotas.json
-rw-r--r-- 1 root root 15680 Jun  7 10:52 cf_organization_summary.json
-rw-r--r-- 1 root root 11204 Jun  7 10:52 cf_route_emitter.json
-rw-r--r-- 1 root root 27468 Jun  7 10:52 cf_router.json
-rw-r--r-- 1 root root 22637 Jun  7 10:52 cf_space_summary.json
-rw-r--r-- 1 root root 47125 Jun  7 10:52 cf_summary.json
-rw-r--r-- 1 root root 22052 Jun  7 10:52 cf_uaa.json
-rw-r--r-- 1 root root 65631 Jun  7 10:52 prometheus_cf_exporter.json
-rw-r--r-- 1 root root 53417 Jun  7 10:52 prometheus_firehose_exporter.json
-rw-r--r-- 1 root root  1058 Jun  7 10:52 agent.cert

multiple bosh exporters / exporter authent

Hello, I have a complex deployment, including multiple BOSH directors.
I understand I can configure different ports for multiple exporters; however, I can't see how to configure multiple exporters in the same vm/instance group.

Do I have to configure a separate vm/instance group per bosh_exporter?
In that case, is it possible to secure the prometheus => bosh_exporter link (basic auth, SSL)?
Could BOSH links help wire multiple bosh_exporters to the prometheus server transparently?

thx
Pierre

BTW: fantastic job on this BOSH release! A really nice out-of-the-box grafana / prometheus / alerting experience!

Is anyone having issues with `label_values` in v17 Grafana?

Just upgraded to the 17.0.0 release and I am having a quite weird issue with Grafana.

If I keep the label_values in the template as is, it doesn't return anything. If I change it to just use the label, it works.

I ended up changing the dashboards to label_values(environments).

Has anyone seen something like this?

Support arbitrary alerts from deployment manifest

As a user of this release, I want to add miscellaneous custom alerts without writing a whole new bosh release. Instead, it would be useful to configure arbitrary alerts like this:

...
properties:
  prometheus:
    custom_rules:
    - (( file "path/to/some/rules" ))
    - (( file "path/to/different/rules" ))
...

The release would then write the contents of prometheus.custom_rules to a file and add the file path to rule_files. If that makes sense, we'd be happy to send a patch. WDYT @frodenas ?
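For illustration, each file referenced by such a custom_rules list would hold ordinary Prometheus rule definitions; a minimal alert in the 1.x rule syntax current at the time of this issue might look like this (the alert itself is a made-up example):

```
# contents of "path/to/some/rules" (Prometheus 1.x rule syntax; illustrative)
ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} is down",
  }
```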

cc @cnelson

spec.address issue with grafana job

When we deploy prometheus without BOSH PowerDNS, the deployment fails: the job scripts use the BOSH hostname.
<spec.address> is used in the job scripts, but without PowerDNS the hostname can't be resolved.
I think it would be better if we could choose whether to use spec.address or spec.ip.

Corrupt packages in release v13?

Hi,
I tried to upgrade to Version 13 today and ran into an issue during the compile phase:

 Started compiling packages
  Started compiling packages > grafana/5271e14376bcd8484d934e4379fcfb1e8467adea
  Started compiling packages > cf_exporter/149d9106a695629e0fe86f38edd4501261dad41e
  Started compiling packages > rabbitmq_exporter/85c19dbb43607b7d2f6eb6a35c0ca30fa313ca93
   Failed compiling packages > cf_exporter/149d9106a695629e0fe86f38edd4501261dad41e: Action Failed get_task: Task cede0d5c-cac7-4027-79ea-975a73f6201a result: Compiling package cf_exporter: Fetching package cf_exporter: Fetching package blob f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0: Getting blob from inner blobstore: Getting blob from inner blobstore: Shelling out to bosh-blobstore-dav cli: Running command: 'bosh-blobstore-dav -c /var/vcap/bosh/etc/blobstore-dav.json get f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0 /var/vcap/data/tmp/bosh-blobstore-externalBlobstore-Get233330094', stdout: 'Error running app - Getting dav blob f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0: Get http://172.16.106.4:25250/0c/f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0: read tcp 172.16.106.32:34312->172.16.106.4:25250: read: connection reset by peer', stderr: '': exit status 1 (00:07:33)
   Failed compiling packages > rabbitmq_exporter/85c19dbb43607b7d2f6eb6a35c0ca30fa313ca93: Action Failed get_task: Task 03f1847d-d5e1-40c6-7021-fd250418c042 result: Compiling package rabbitmq_exporter: Fetching package rabbitmq_exporter: Fetching package blob 43f0c2f9-6d98-404c-ab10-fde39d50f1b7: Getting blob from inner blobstore: Getting blob from inner blobstore: Shelling out to bosh-blobstore-dav cli: Running command: 'bosh-blobstore-dav -c /var/vcap/bosh/etc/blobstore-dav.json get 43f0c2f9-6d98-404c-ab10-fde39d50f1b7 /var/vcap/data/tmp/bosh-blobstore-externalBlobstore-Get610207793', stdout: 'Error running app - Getting dav blob 43f0c2f9-6d98-404c-ab10-fde39d50f1b7: Get http://172.16.106.4:25250/ce/43f0c2f9-6d98-404c-ab10-fde39d50f1b7: read tcp 172.16.106.31:42296->172.16.106.4:25250: read: connection reset by peer', stderr: '': exit status 1 (00:07:39)
   Failed compiling packages > grafana/5271e14376bcd8484d934e4379fcfb1e8467adea: Action Failed get_task: Task c49dfcb5-680c-4c96-5298-0f406bc7df17 result: Compiling package grafana: Fetching package grafana: Fetching package blob 8e050cc5-6ec7-4b63-80d3-65b957a8179e: Getting blob from inner blobstore: Getting blob from inner blobstore: Shelling out to bosh-blobstore-dav cli: Running command: 'bosh-blobstore-dav -c /var/vcap/bosh/etc/blobstore-dav.json get 8e050cc5-6ec7-4b63-80d3-65b957a8179e /var/vcap/data/tmp/bosh-blobstore-externalBlobstore-Get218393412', stdout: 'Error running app - Getting dav blob 8e050cc5-6ec7-4b63-80d3-65b957a8179e: Get http://172.16.106.4:25250/7b/8e050cc5-6ec7-4b63-80d3-65b957a8179e: read tcp 172.16.106.30:56982->172.16.106.4:25250: read: connection reset by peer', stderr: '': exit status 1 (00:09:08)
   Failed compiling packages (00:09:08)

Error 450001: Action Failed get_task: Task cede0d5c-cac7-4027-79ea-975a73f6201a result: Compiling package cf_exporter: Fetching package cf_exporter: Fetching package blob f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0: Getting blob from inner blobstore: Getting blob from inner blobstore: Shelling out to bosh-blobstore-dav cli: Running command: 'bosh-blobstore-dav -c /var/vcap/bosh/etc/blobstore-dav.json get f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0 /var/vcap/data/tmp/bosh-blobstore-externalBlobstore-Get233330094', stdout: 'Error running app - Getting dav blob f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0: Get http://172.16.106.4:25250/0c/f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0: read tcp 172.16.106.32:34312->172.16.106.4:25250: read: connection reset by peer', stderr: '': exit status 1

The other packages compiled smoothly. Might it be the case that those 3 files are corrupt?

Add smoke-test

Add smoke tests so we can verify that the final deployment succeeded (i.e. alertmanager, prometheus, and grafana are up).

Prometheus data files stored under root partition

Prometheus data files are currently stored under the root partition at /var/vcap/store/prometheus, which can quickly cause the root partition to run out of disk space. Can we add an attribute to the spec to override the data path to point at the data partition, /var/vcap/data/prometheus?

It seems that storage.local.path is hardcoded to use /var/vcap/store/prometheus
https://github.com/cloudfoundry-community/prometheus-boshrelease/blob/master/jobs/prometheus/templates/bin/prometheus_ctl#L90

high cpu usage on director and slow bosh CLI with bosh_exporter

Running against a small PCF 1.9, the bosh_exporter scrape interval of 30s is really causing bosh task queueing, as expected, but that is impacting the bosh user experience:

  • on the bosh director, top reports 50% CPU usage for the user
  • with the bosh CLI, it takes close to 2 minutes to run a "bosh vms" across 5 bosh deployments; the FAQ is not so clear about this (queues example attached)

Moving to a scrape interval of 10 minutes fixes this fully, but is likely to impact alerting on bosh health messages.
I am planning to change the default to use BoshHMforwarder.
On PCF the ECS team has made that easy with a tile - http://www.ecsteam.com/deploying-bosh-health-metrics-forwarder-pivotal-cloud-foundry-tile
I would think defaulting this bosh release to using boshhmforwarder (even without a tile, bringing its own as part of this release) would be a wiser choice.
(screenshot attached, 2017-03-05)

Audit events

Any plans to add auditing events to the cf_exporter? It would be nice to gather those details per app too.

Grafana-UAA integration

Create a document explaining how you can integrate Grafana user authentication with Cloud Foundry UAA.

firehose_exporter.metrics.environment values

Hi,
currently trying to upgrade to 15.0.0 and was stopped by the following error:

Error 100: Unable to render instance groups for deployment. Errors are:
   - Unable to render jobs for instance group 'prometheus'. Errors are:
     - Unable to render templates for job 'bosh_exporter'. Errors are:
       - Error filling in template 'bosh_exporter_ctl' (line 76: Can't find property '["bosh_exporter.metrics.environment"]')
     - Unable to render templates for job 'firehose_exporter'. Errors are:
       - Error filling in template 'firehose_exporter_ctl' (line 70: Can't find property '["firehose_exporter.metrics.environment"]')
     - Unable to render templates for job 'cf_exporter'. Errors are:

Looking at the code at:

https://github.com/cloudfoundry-community/prometheus-boshrelease/blob/589eb4e4292c1687910978d7330c0aed09d89444/jobs/firehose_exporter/templates/bin/firehose_exporter_ctl#L70

I see it is mandatory. Is this a mistake or intended behaviour? If intended, what would be appropriate values for it?

Metrics showing for a non-existent CF component after scale down

I scaled my routing tier up to 2 routers and then back down to a single router, to validate my dashboard settings and the ability to add metrics as the platform changes. I noticed the second router still shows up in the dashboards. Using the Prometheus graph tool with my query, both routers show up. What can I look at to validate this behavior?

(screenshots attached, 2016-12-02)

Scale-out

As the number of metrics can be huge for large deployments, how do we achieve a scale-out architecture?

Thanks

firehose_exporter internal servererror 500 after update to Pivotal RabbitMQ Service 1.8.*

Hello,

I'm using the prometheus bosh release in a Pivotal Cloud Foundry environment.
After updating the rabbitmq service tile to 1.8.* I am experiencing an issue with the firehose_exporter.

The update added an additional service broker for dedicated RabbitMQ services.
This additional service broker produces the following issue in the firehose_exporter:

  • collected metric firehose_value_metric_p_rabbitmq_log_sender_total_messages_read label:<name:"bosh_deployment" value:"cf-rabbitmq" > label:<name:"bosh_job_id" value:"8daeceac-05cf-4094-a1ab-d9355dac1584" > label:<name:"bosh_job_ip" value:"192.168.17.15" > label:<name:"bosh_job_name" value:"rabbitmq-broker" > label:<name:"environment" value:"P" > label:<name:"origin" value:"p-rabbitmq" > label:<name:"unit" value:"count" > gauge:<value:0 > has help "Cloud Foundry Firehose 'logSenderTotalMessagesRead' value metric from 'p-rabbitmq'." but should have "Cloud Foundry Firehose 'logSenderTotalMessagesRead' value metric from 'p.rabbitmq'."

Turning off the logs from the dedicated RabbitMQ service broker fixes this issue temporarily.

Can you please update the firehose_exporter to consume logs from both the shared and the dedicated RabbitMQ service brokers?

Thank you,
Martin

Some metrics are missing

Hi,
I'm trying to monitor CF with prometheus-boshrelease; currently I have set up the cf_exporter and firehose_exporter.

Most metrics are collected by these exporters, but some metrics do not show up in the Prometheus DB.

Ex)
firehose_counter_event_gorouter_bad_gateways_delta
firehose_counter_event_gorouter_bad_gateways_total
firehose_counter_event_gorouter_rejected_requests_delta
firehose_counter_event_gorouter_rejected_requests_total
firehose_counter_event_bbs_* metrics are empty, except for two: "firehose_counter_event_bbs_request_count_delta" and "firehose_counter_event_bbs_request_count_total"

ENV)
cf : 238
diego : 0.1476.0

cf_exporter, version 0.4.3 (branch: master, revision: 9e37d9069bbb87d739d2c326981fed917ec016e4)
build user: root@1d21624a3782
build date: 20170216-02:38:17
go version: go1.7.5

firehose_exporter, version 4.1.0 (branch: master, revision: 95333eab4c8295bf727faa564add32422f5d71c6)
build user: root@387dfcd86a81
build date: 20170216-01:18:51
go version: go1.7.5

Monitoring multiple bosh directors

Is it possible to monitor multiple bosh directors with a single prometheus deployment? We would like to run a single prometheus that collects metrics from multiple directors (staging and production). It looks like the bosh exporter only talks to a single director, so we'd need to run multiple instances of the exporter. And that would mean running each exporter on a separate vm, since bosh doesn't know how to run multiple instances of the same job on the same host. And that would mean prometheus wouldn't be able to read service discovery files generated by bosh exporters, since we can only colocate prometheus with one of the exporters.

It seems like our options for this use case would be:

  • Teach the bosh exporter to monitor multiple bosh directors
  • Run multiple vms, each with prometheus and bosh exporter, and use federation to combine metrics

Does the first option make sense to you @frodenas, or do you think the exporter should only monitor a single director? Or am I missing a simpler solution?

cc @cnelson
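
For the federation option, a hedged sketch of what the scrape configuration on the combining Prometheus might look like (the `match[]` selector and target addresses are placeholders, not values from this release):

```yaml
# Sketch: a "parent" Prometheus federating from one per-director
# Prometheus instance per environment. Addresses are placeholders.
scrape_configs:
- job_name: federate
  honor_labels: true        # keep the labels set by the source Prometheus
  metrics_path: /federate
  params:
    'match[]':
    - '{__name__=~"bosh_.*"}'   # placeholder selector for the series to pull
  static_configs:
  - targets:
    - 10.0.1.10:9090   # staging Prometheus (placeholder)
    - 10.0.2.10:9090   # production Prometheus (placeholder)
```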

Interesting behaviour during upgrade

Hi,
I did an extremely smooth upgrade from v239 to v245 of CF Release and Diego (according to the specs of the CF Release) today. There were no errors; the deployment went through.

I have two findings, which I would like to share and also discuss:
[screenshot: 2017-01-06 11:16:47]
In the picture above you can see that the capacity decreased in this demo deployment, although CF is running fine and we are still able to scale our sample application.

[screenshot: 2017-01-06 11:19:23]
The compilation workers are visible in the dashboard. Perhaps we should enable filtering them out in the dashboards?

"No data points" when viewing Apps dashboards

Been wanting to kick the tires on Prometheus for a while and this release got it up and going super quickly! The only issue I'm having is the Apps dashboards don't display any info (firehose and bosh dashboards are fine). Here's my config, a cf-deployment ops file: https://github.com/cloudfoundry/capi-ci/blob/master/cf-deployment-operations/add-prometheus.yml.

Looks like the cf_exporter is in charge of generating metrics for the Apps dashboards. Even when changing the log level for the exporter job to debug, the only log line is level=info msg="Listening on :9193" source="cf_exporter.go:278". The expected metrics, like "cf_total_application_events", also don't appear in the Prometheus metrics explorer.

Appreciate the help and thanks for building this release!

firehose_value_metric_etcd_is_leader duplicated

We are starting to look at and test the prometheus-boshrelease with 2 firehose exporters, and Prometheus started to trigger one alert, CFEtcdMoreThanOneLeader (1 active):

ALERT CFEtcdMoreThanOneLeader
  IF count(firehose_value_metric_etcd_is_leader == 1) BY (environment, bosh_deployment) > 1
  FOR 10m
  LABELS {service="cf-etcd", severity="critical"}
  ANNOTATIONS {description="CF etcd cluster at deployment `{{$labels.environment}}/{{$labels.bosh_deployment}}` had more than one leader in the last 10 minutes: {{value}}", summary="CF etcd cluster at deployment `{{$labels.environment}}/{{$labels.bosh_deployment}}` > 1 leader"}

because there are two "similar" metrics. I have checked the status of the cluster and it is healthy; the leader is 10.230.16.79 (no partitions, only one leader). In the next picture you can see that the "same" metric was reported by a different firehose_exporter instance at a different time (see the instance tag), so Prometheus considers them different metrics.

[image attached]

I do not have enough experience with Prometheus, but, assuming that tagging the metric with the firehose_exporter instance is needed, where do you think this issue should be fixed: in the alert definition, or by doing some kind of duplicate deletion? Any other ideas?

Thanks!
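
One possible fix on the alert-definition side, sketched here in the newer YAML rule format rather than the release's current format, is to aggregate away the per-exporter `instance` label before counting leaders. This assumes each etcd node is uniquely identified by its `bosh_job_ip`:

```yaml
# Sketch: collapse duplicate reports of the same etcd node from different
# firehose_exporter instances with max() before counting leaders.
groups:
- name: cf-etcd
  rules:
  - alert: CFEtcdMoreThanOneLeader
    expr: >
      count(
        max(firehose_value_metric_etcd_is_leader)
          by (environment, bosh_deployment, bosh_job_ip) == 1
      ) by (environment, bosh_deployment) > 1
    for: 10m
    labels:
      service: cf-etcd
      severity: critical
```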

Mysql Dashboard open source license type

When prometheus-boshrelease v17.0.0 was released, all dashboard files moved from packages to job templates,
but the mysql dashboard LICENSE file was removed (d5abc49).
What is the open source license of the mysql dashboard (Apache or AGPL-3.0)?

Better stemcell alerts

I was talking to @LinuxBozo about the bosh outdated stemcell alerts that I contributed here, and he pointed out it's easy to miss outdated stemcells. If the prometheus deploy that bumps the expected version fails, or gets canceled, or if an operator pauses the concourse job that deploys it, etc, we won't notice if stemcells are out of date.

I'm wondering if we can do better by adding a tiny exporter that emits the current stemcell version for a particular stemcell series. Two quick proposals:

First approach

The stemcell exporter queries http://bosh.io/api/v1/stemcells/ and emits metrics like this:

bosh_stemcell_info{bosh_stemcell_version="3312.29"} 1
bosh_stemcell_info{bosh_stemcell_version="3312.28"} 0
bosh_stemcell_info{bosh_stemcell_version="3312.27"} 0
...

Then we can write a query like this:

bosh_deployment_stemcell_info * on(bosh_stemcell_version) group_left bosh_stemcell_info

This gives a simple prometheus query, but could potentially miss stemcells that are from the wrong series entirely. I don't know if that's realistic, but we can handle that using a different approach:

Second approach

The stemcell exporter emits a single metric, for the expected stemcell:

bosh_stemcell_info{bosh_stemcell_version="3312.29"} 1

Then we can find outdated stemcells by listing all deployments, then subtracting deployments with the expected version:

bosh_deployment_stemcell_info unless (bosh_deployment_stemcell_info * on(bosh_stemcell_version) group_left bosh_stemcell_info)

I'm still new to prometheus, so maybe there's a simpler approach. WDYT?

cc @cnelson
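
If the second approach were adopted, the query could be packaged as a recording rule so dashboards and alerts can share it. A sketch, assuming the metric names described above exist (the rule name is a placeholder):

```yaml
# Sketch: precompute "deployments on an unexpected stemcell version" as a
# recording rule. bosh_stemcell_info is the hypothetical exporter's metric.
groups:
- name: stemcells
  rules:
  - record: bosh_deployment_stemcell_outdated
    expr: >
      bosh_deployment_stemcell_info
        unless (bosh_deployment_stemcell_info
          * on(bosh_stemcell_version) group_left
          bosh_stemcell_info)
```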

BOSH dashboards fail when "All" is selected

Hi,

In Grafana, it seems that since v11 all BOSH dashboards are failing.
That is, when I select "All" jobs, for example, I get this error:

TypeError: Cannot read property 'replace' of undefined

Grafana Admin Passwort Update Failing

Hi,

we updated today from Prometheus v17.0.0 to v17.5.0, and the deployment failed because Grafana did not successfully execute the post-script.

Error: Failed to update user password
 
NAME:
   Grafana cli admin reset-admin-password - reset-admin-password <new password>
 
USAGE:
   Grafana cli admin reset-admin-password [command options] [arguments...]
 
OPTIONS:
   --homepath   path to grafana install/home path, defaults to working directory
   --config     path to config file

But to me the command executed seems correct. Any suggestions?
