clusterlabs / ha_cluster_exporter

Prometheus exporter for Pacemaker based Linux HA clusters

License: Apache License 2.0

Go 68.02% Makefile 2.86% Shell 27.40% Python 1.73%
pacemaker high-availability golang prometheus-exporter metrics monitoring cluster linux prometheus

ha_cluster_exporter's People

Contributors

angelabriel, cbosdo, dependabot[bot], diegoakechi, dirkmueller, kermat, mallozup, mbiagetti, nikhil-vats, quytm, rmakram-ims, sathia27, stefanotorresi, thecloudsaver, vvidic, witekest, yeoldegrove


ha_cluster_exporter's Issues

deprecate ha configured total metrics?

I have the feeling that these 2 metrics provide the same info a user could obtain by reading other cluster metrics, doing a total, etc.

For the sake of minimalism we could perhaps just remove them:

ha_cluster_nodes_configured_total
ha_cluster_resources_configured_total

I think they were historically added during the migration from the original hawk-apiserver metrics, but I was also doubting their need to exist, because IMHO one could write a PromQL query for these 2 metrics.

IMHO I would vote to remove them, if we can, for the sake of simplicity and minimalism

add label to metric

Currently:

# TYPE cluster_nodes gauge
cluster_nodes{type="DC"} 1
cluster_nodes{type="configured"} 2
cluster_nodes{type="expected_up"} 2
cluster_nodes{type="maintenance"} 0
cluster_nodes{type="member"} 2
cluster_nodes{type="online"} 2
cluster_nodes{type="pending"} 0
cluster_nodes{type="ping"} 0
cluster_nodes{type="remote"} 0
cluster_nodes{type="shutdown"} 0
cluster_nodes{type="standby"} 0
cluster_nodes{type="standby_onfail"} 0
cluster_nodes{type="unclean"} 0
cluster_nodes{type="unknown"} 0

we should add a node label, e.g.:
cluster_nodes{node="foobar1",type="unknown"} 0
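For illustration, a minimal client_golang sketch (hypothetical code, not the exporter's actual implementation) of declaring the metric with both labels:

package main

import "github.com/prometheus/client_golang/prometheus"

var clusterNodes = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "cluster_nodes",
		Help: "Cluster nodes, partitioned by node name and type",
	},
	[]string{"node", "type"}, // "node" is the newly proposed label
)

func main() {
	prometheus.MustRegister(clusterNodes)
	clusterNodes.WithLabelValues("foobar1", "unknown").Set(0)
}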

DOC: find a way to generate documentation automatically and write other things manually

Description:

This issue follows a discussion between me, @diegoakechi and @stefanotorresi.

We agreed that we should find a compromise between generating docs automatically and maintaining them manually.

Tasks:

  • research and investigate what we can generate automatically (metric name, HELP text, etc.)

This task may also depend on whether we want to refactor the exporter to use collectors, which would change the API a bit, so the automatic generation could be impacted as well. We should wait a little before doing this.

update readme description

Desc

the readme contains an outdated text:

This prometheus exporter is used to serve metrics for pacemaker https://github.com/ClusterLabs/pacemaker. It should run inside a node of the cluster or both.

We now export more metrics, for components like SBD, DRBD and others.

We should write a more generic description for this.

An example (it might need improvement and research):

The exporter is used to serve metrics for HA clusters and their components.
A second usage is that the exporter is also intended to work as an exporter for a single component only; in that case it will just serve the metrics for that component.

Add label RING_ID to corosync metric

Create a metric:
corosync_ring_errors{ring_id="1"} 1
corosync_ring_errors{ring_id="42"} 0

so we can point out which ring is failing. The value is either 0 or 1, where 1 indicates a failure.

implement metric about resource configuration monitoring

desc:

Implement a metric which gives the date when the configuration was last updated.

A pseudo-metric implementation would look more or less like:

ha_cluster_pacemaker_resource_configuration_changes = Mon Oct 28 11:03:29 2019
(the value should be a timestamp)

How to implement this:

we already have the crm_mon command, which exposes this information:

<crm_mon version="2.0.0">
    <summary>
        <stack type="corosync" />
        <current_dc present="true" version="1.1.18+20180430.b12c320f5-3.15.1-b12c320f5" name="damadog-hana01" id="1084780051" with_quorum="true" />
        <last_update time="Mon Oct 28 11:03:34 2019" />
        <last_change time="Mon Oct 28 11:03:29 2019" user="root" client="crm_attribute" origin="damadog-hana02" />

the `last_change` is the field we are looking for.

todo:

research how to convert Mon Oct 28 11:03:29 2019 into a Prometheus-compatible value, perhaps a Unix timestamp
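As a starting point, a hedged sketch using only the standard library; the layout string mirrors the observed crm_mon format:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Go's reference time written in crm_mon's format.
	const crmMonLayout = "Mon Jan 2 15:04:05 2006"
	t, err := time.Parse(crmMonLayout, "Mon Oct 28 11:03:29 2019")
	if err != nil {
		panic(err)
	}
	// Note: time.Parse assumes UTC; crm_mon prints local time, so
	// time.ParseInLocation may be needed in practice.
	fmt.Println(t.Unix()) // 1572260609, usable as a Prometheus gauge value
}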

Check if empty map

We should check whether the maps we are accessing are empty, to avoid possible errors:


if len(m) == 0 {
    // map is nil or empty: skip the lookup and return early
}

Research on error handling in context of metrics

Desc:

I think that, in case of error, we should export empty metrics rather than error out or panic.
We should log the error but not interrupt.

When the exporter actually needs to panic is something to be researched.

[challenge]: build the exporter package for Arch Linux, Debian, CentOS 7 or another distro with OBS

description:

we need to build the exporter for other distros.

A good first target would be CentOS 7, since we already have the repo.

For Arch Linux or Debian, we could add repos there.

we build the pkg with:
https://build.opensuse.org/package/show/server:monitoring/prometheus-ha_cluster_exporter

Tutorial:

https://duncan.codes/tutorials/rpm-packaging/

what you will learn:

you will learn how to package software for Linux distributions

warning:

this issue is hacker-level. Are you ready to be one? If yes, take this challenge. (We will help along the way with any info required.)

If you want to take this issue, let us know.

DRBD Metric-02: Implement remote connection

desc

follow-up from #25

implement something like:
ha_cluster_drbd_resource_remote_connection{resource_name="1-single-0",role="primary",volume="0",peer_node_id="2",peer_role="Secondary"} 1

Feature Request: Implement listen IPv4 option

Currently, it is not possible to alter which address the exporter listens on; I propose the addition of an --ip4 flag.

E.g. ./ha_cluster_exporter --ip4 127.0.0.1

Pull request to follow :)
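A minimal sketch of what this could look like (the flag name comes from the proposal; everything else is illustrative, not the actual implementation):

package main

import (
	"flag"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	ip4 := flag.String("ip4", "0.0.0.0", "IPv4 address the exporter listens on")
	flag.Parse()

	http.Handle("/metrics", promhttp.Handler())
	// e.g. --ip4 127.0.0.1 restricts the exporter to the loopback interface
	log.Fatal(http.ListenAndServe(*ip4+":9002", nil))
}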

merge 2 metrics into 1 with a label

from these 2 metrics create 1:

cluster_node_resources{node="ayoub-monitoring-hana01",resource_name="rsc_saphana_prd_hdb00",role="master"} 1

cluster_resources_status{status="master"} 1

become:

cluster_node_resources{node="ayoub-monitoring-hana01",resource_name="rsc_saphana_prd_hdb00",role="master",status="orphaned"} 1

Errors/warnings need more context

We have errors/warnings like this:

log.Warnln(err). IMHO we should use log.Warnf("error while doing x: %s", err) instead, so we have more context info.

Configured metric

In the past we used the configured metric. We should expose this metric separately.

add clustername to metrics

Description

In the case of multiple clusters, we should add a label indicating which cluster a metric belongs to, so we can see which cluster it comes from.

This might need research, and we should perhaps do it at the dashboard level rather than in the exporter.

Design a creative and emotional logo for ha exporter

desc:

It would be nice to have a logo for this project.

The logo should mix or allude to elements of Prometheus and Grafana, plus something evoking clusters and HA. This is a starting point; any imagination is welcome!

refactor pacemaker metrics like the DRBD ones

The DRBD PR (#26) introduced a refactoring.

The pacemaker metrics with types are similar to the DRBD metrics.

We should follow the same pattern. Also:

  • move all function that belong to pacemaker to that new file
  • add unit-tests with a XML data example

Travis configuration seems off

We currently have 4 jobs, 2 for each stage (gofmt and build), but something must be off: we should only have 1 job per stage.

research on how to remove/reset metrics

Currently we have a function for resetting some metrics with labels.

Historically I didn't find any API for doing this job with the Prometheus client library. This issue is mostly a placeholder to research whether we can use some client API to remove metrics or reset their state.
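For what it's worth, client_golang's vector types do expose removal APIs; a small sketch (illustrative names):

package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	gauge := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "cluster_nodes", Help: "example"},
		[]string{"node", "type"},
	)
	gauge.WithLabelValues("node1", "online").Set(1)

	gauge.DeleteLabelValues("node1", "online") // remove a single labelled series
	gauge.Reset()                              // remove every series in the vector
}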

improve english in readme

The readme might contain some English phrases which need rewording or clearer statements to improve the reading experience

create a makefile target for running tests with code coverage and visualize them

the Go toolchain has a feature to create and display code coverage for unit tests

Task:

  • research about this interesting feature

  • create a makefile target like make coverage which executes the tests with code coverage.
    By default we don't need it, so make test will keep running without coverage, while make coverage runs go test with the coverage flags

  • this task should create Go's HTML coverage report and then open a browser window to visualize it (a sketch follows at the end of this issue)

rationale:

We don't want to enable code coverage or similar in Travis or CI, because code coverage can be a really misleading metric when enforced in PRs or automation.

We just want to use code coverage from time to time to see what we could perhaps cover but there should be a human analyzing it.

As use-case could be:

As a dev, I run make coverage and this automatically opens a browser window with Go's HTML code-coverage report.

Then I can analyze from time to time what perhaps isn't covered by tests. (We can't cover everything with tests, since we don't have binary tests.)

The code-coverage review should remain a manual, human process.
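A possible sketch of the target, assuming the standard Go tooling (go test -coverprofile records the coverage; go tool cover -html renders the report and opens it in the default browser):

coverage:
	go test -coverprofile=coverage.out ./...
	go tool cover -html=coverage.out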

Feature Request: Implement Log Levelling

It would be very nice to see log levelling implemented, to enable specifying which log levels are output; a useful feature for production environments.

Currently, the exporter continually outputs info-level logs.

Example:

2019/10/11 11:10:10 [INFO]: Reading quorum status with corosync-quorumtool...
2019/10/11 11:10:10 [INFO]: Reading cluster configuration with crm_mon..
2019/10/11 11:10:15 [INFO]: Reading cluster configuration with crm_mon..
2019/10/11 11:10:15 [INFO]: Reading quorum status with corosync-quorumtool...
[... the same two lines repeat every 5 seconds ...]
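A hedged sketch of how this could look with logrus, which the exporter's log output suggests it already uses (the --log-level flag name is hypothetical):

package main

import (
	"flag"

	log "github.com/sirupsen/logrus"
)

func main() {
	levelFlag := flag.String("log-level", "info", "minimum level of log messages to output")
	flag.Parse()

	level, err := log.ParseLevel(*levelFlag)
	if err != nil {
		log.Fatalf("invalid log level %q: %s", *levelFlag, err)
	}
	// e.g. --log-level=warn silences the periodic INFO lines above
	log.SetLevel(level)
}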

differentiate node type and status

Desc

From metric ha_cluster_nodes
differentiate node "type" and "status":
type: member/remote
status: combinations of "online/standby/standby_onfail/maintenance/pending/unclean/shutdown/expected_up/is_dc".

follow-up from
#58 (comment)

Bug / Feature Request: Cluster SBD erroring (when not present)

I have noticed that when operating the exporter on a machine that does not have SBD set up, the following errors are seen repeatedly:

2019/10/11 15:14:12 [INFO]: Reading cluster SBD configuration..
2019/10/11 15:14:12 [ERROR] Could not open sbd config file open /etc/sysconfig/sbd: no such file or directory

There should maybe be a check such as

if _, err := os.Stat("/etc/sysconfig/sbd"); os.IsNotExist(err) {
	log.Println("[INFO]: SBD is not present... skipping.")
} else {
	// read and parse the SBD configuration as usual
}

This would give a single output of 0000/00/00 00:00:00 [INFO]: SBD is not present... skipping. instead of continually repeating an error in the absence of SBD. :)

The cluster binaries' locations vary depending on release or distribution

Depending on the distro or the service pack, the cluster binaries' locations can vary, causing the exporter to not find the executables. For example, between SLE-12-SP3 and SLE-15, the drbdsetup binary changed location and the exporter shows the following error:

Oct 31 13:32:14 dev-1 ha_cluster_exporter[73317]: time="2019-10-31T13:32:14Z" level=warning msg="Could not initialize DRBD collector: '/usr/sbin/drbdsetup' not found: stat /usr/sbin/drbdsetup: n...r directory\n"

And here we see the real location

dev-1:~ # which drbdsetup
/sbin/drbdsetup

The exporter should allow different locations and ideally should try to find it by itself.
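One possible approach, sketched under the assumption that a PATH lookup is acceptable (findBinary is a hypothetical helper, not the exporter's actual code):

package main

import (
	"fmt"
	"os"
	"os/exec"
)

// findBinary honours an explicitly configured path first, then falls back
// to searching the directories in $PATH via exec.LookPath, so that e.g.
// /sbin/drbdsetup is found even when /usr/sbin/drbdsetup is absent.
func findBinary(configured, name string) (string, error) {
	if configured != "" {
		if _, err := os.Stat(configured); err == nil {
			return configured, nil
		}
	}
	return exec.LookPath(name)
}

func main() {
	path, err := findBinary("", "drbdsetup")
	if err != nil {
		fmt.Fprintln(os.Stderr, "drbdsetup not found:", err)
		os.Exit(1)
	}
	fmt.Println("using", path)
}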

implement metric about constraints with ban

desc:

The metric should look like:
ha_cluster_pacemaker_constraint{type="ban|prefer",id="foobar"}

How to implement it:

  1. use crm resource ban rsc_ip_PRD_HDB00
  2. crm_mon already shows this info in the <bans> field

so it will show the following:

    <bans>
        <ban id="cli-ban-rsc_ip_PRD_HDB00-on-damadog-hana02" resource="rsc_ip_PRD_HDB00" node="damadog-hana02" weight="-1000000" master_only="false" />
    </bans>

And we can expose the resource ID and add the type (see the sketch below).
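A hedged sketch of extracting the <bans> section from crm_mon's XML output with encoding/xml (struct names are illustrative):

package main

import (
	"encoding/xml"
	"fmt"
)

type crmMon struct {
	Bans []ban `xml:"bans>ban"`
}

type ban struct {
	ID       string `xml:"id,attr"`
	Resource string `xml:"resource,attr"`
	Node     string `xml:"node,attr"`
}

func main() {
	data := []byte(`<crm_mon><bans>
	    <ban id="cli-ban-rsc_ip_PRD_HDB00-on-damadog-hana02" resource="rsc_ip_PRD_HDB00" node="damadog-hana02" weight="-1000000" master_only="false" />
	</bans></crm_mon>`)

	var status crmMon
	if err := xml.Unmarshal(data, &status); err != nil {
		panic(err)
	}
	for _, b := range status.Bans {
		// each ban becomes one labelled sample of the proposed metric
		fmt.Printf("ha_cluster_pacemaker_constraint{type=%q,id=%q} 1\n", "ban", b.ID)
	}
}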

Feature Request: Implement a landing page

Currently, if you visit the exporter on http://localhost:9002/ you get a 404 error, which is not an issue per se; however, it would be nice to have a simple landing page which just explains what the exporter is, similar to various others such as the node exporter.

This could be done using something such as:

http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte(`<html>
		<head>
			<title>HACluster Exporter</title>
		</head>
		<body>
			<h1>HACluster Exporter</h1>
			<p><a href="/metrics">Metrics</a></p>
			<br />
			<h2>More information:</h2>
			<p><a href="https://github.com/ClusterLabs/ha_cluster_exporter">github.com/ClusterLabs/ha_cluster_exporter</a></p>
		</body>
		</html>`))
})

Implement DRBD monitoring

metric:

  1. DRBD resource:
  • ha_cluster_drbd_resource{resource_name="1-single-0", role="primary", volume="0", disk_state="uptodate", replication_state="established", peer_disk_state="uptodate"}

  • ha_cluster_drbd_resource{resource_name="vg2", role="secondary", volume="0", disk_state="uptodate", replication_state="established", peer_disk_state="uptodate"}

  • ha_cluster_drbd_resource{resource_name="vg2", role="secondary", volume="1", disk_state="outdated", replication_state="established", peer_disk_state="uptodate"}


drbdsetup status --json

[
{
  "name": "1-single-0",
  "node-id": 1,
  "role": "Primary",
  "suspended": false,
  "write-ordering": "flush",
  "devices": [
    {
      "volume": 0,
      "minor": 2,
      "disk-state": "UpToDate",
      "client": false,
      "quorum": true,
      "size": 409600,
      "read": 8457,
      "written": 536513,
      "al-writes": 4,
      "bm-writes": 0,
      "upper-pending": 0,
      "lower-pending": 0
    } ],
  "connections": [
    {
      "peer-node-id": 2,
      "name": "SLE15-sp1-gm-drbd1145296-node2",
      "connection-state": "Connected", 
      "congested": false,
      "peer-role": "Secondary",
      "ap-in-flight": 0,
      "rs-in-flight": 0,
      "peer_devices": [
        {
          "volume": 0,
          "replication-state": "Established",
          "peer-disk-state": "UpToDate",
          "peer-client": false,
          "resync-suspended": "no",
          "received": 8202,
          "sent": 534390,
          "out-of-sync": 0,
          "pending": 0,
          "unacked": 0,
          "has-sync-details": false,
          "has-online-verify-details": false,
          "percent-in-sync": 100.00
        } ]
    } ]
}
, … (output truncated)
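A hedged sketch of the structs needed to decode this output with encoding/json, covering only the fields shown above (running it requires drbdsetup on the machine):

package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

type drbdResource struct {
	Name        string           `json:"name"`
	Role        string           `json:"role"`
	Devices     []drbdDevice     `json:"devices"`
	Connections []drbdConnection `json:"connections"`
}

type drbdDevice struct {
	Volume    int    `json:"volume"`
	DiskState string `json:"disk-state"`
}

type drbdConnection struct {
	PeerNodeID  int              `json:"peer-node-id"`
	PeerRole    string           `json:"peer-role"`
	PeerDevices []drbdPeerDevice `json:"peer_devices"`
}

type drbdPeerDevice struct {
	Volume           int    `json:"volume"`
	ReplicationState string `json:"replication-state"`
	PeerDiskState    string `json:"peer-disk-state"`
}

func main() {
	out, err := exec.Command("drbdsetup", "status", "--json").Output()
	if err != nil {
		panic(err)
	}
	var resources []drbdResource
	if err := json.Unmarshal(out, &resources); err != nil {
		panic(err)
	}
	for _, r := range resources {
		fmt.Println(r.Name, r.Role) // e.g. "1-single-0 Primary"
	}
}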
