clusterlabs / ha_cluster_exporter

Prometheus exporter for Pacemaker based Linux HA clusters

License: Apache License 2.0

Go 68.02% Makefile 2.86% Shell 27.40% Python 1.73%
pacemaker high-availability golang prometheus-exporter metrics monitoring cluster linux prometheus

ha_cluster_exporter's People

Contributors

angelabriel, cbosdo, dependabot[bot], diegoakechi, dirkmueller, kermat, mallozup, mbiagetti, nikhil-vats, quytm, rmakram-ims, sathia27, stefanotorresi, thecloudsaver, vvidic, witekest, yeoldegrove


ha_cluster_exporter's Issues

deprecate ha configured total metrics?

I have the feeling that these 2 metrics provide the same info a user could obtain by reading other cluster metrics, doing a total, etc.

For the sake of minimalism we could perhaps just remove them:

ha_cluster_nodes_configured_total
ha_cluster_resources_configured_total

I think they were historically added during the migration from the original hawk-apiserver metrics, but I was also doubting their need to exist, because IMHO one could write a PromQL query for these 2 metrics.

IMHO I would vote to remove them, if we can, for the sake of simplicity and minimalism

add label to metric

Currently:

# TYPE cluster_nodes gauge
cluster_nodes{type="DC"} 1
cluster_nodes{type="configured"} 2
cluster_nodes{type="expected_up"} 2
cluster_nodes{type="maintenance"} 0
cluster_nodes{type="member"} 2
cluster_nodes{type="online"} 2
cluster_nodes{type="pending"} 0
cluster_nodes{type="ping"} 0
cluster_nodes{type="remote"} 0
cluster_nodes{type="shutdown"} 0
cluster_nodes{type="standby"} 0
cluster_nodes{type="standby_onfail"} 0
cluster_nodes{type="unclean"} 0
cluster_nodes{type="unknown"} 0

we should add a node label, e.g.:
cluster_nodes{node="foobar1",type="unknown"} 0
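For illustration, a minimal client_golang sketch (hypothetical code, not the exporter's actual implementation) of declaring the metric with both labels:

package main

import "github.com/prometheus/client_golang/prometheus"

var clusterNodes = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "cluster_nodes",
		Help: "Cluster nodes, partitioned by node name and type",
	},
	[]string{"node", "type"}, // "node" is the newly proposed label
)

func main() {
	prometheus.MustRegister(clusterNodes)
	clusterNodes.WithLabelValues("foobar1", "unknown").Set(0)
}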

DOC: find a way to generate documentation automatically and write other things manually

Description:

This issue follows a discussion between me, @diegoakechi and @stefanotorresi.

We agreed that we should find a compromise between generating docs automatically and maintaining them manually.

Tasks:

  • research and investigate what we can generate automatically (metric name, HELP text, etc.)

This task may also depend on whether we want to refactor the exporter to use collectors, which would change the API a bit, so the automatic generation could be impacted as well. We should wait a little before doing this.

update readme description

Desc

the readme contains an outdated text:

This prometheus exporter is used to serve metrics for pacemaker https://github.com/ClusterLabs/pacemaker. It should run inside a node of the cluster or both.

We now export more metrics, for components like SBD, DRBD and others.

We should write a more generic description for this.

An example (it might need improvement and research):

The exporter is used to serve metrics for HA clusters and their components.
A second usage is that the exporter is also intended to work as an exporter for a single component only; in that case it will just serve the metrics for that component.

Add label RING_ID to corosync metric

Create a metric:
corosync_ring_errors{ring_id="1"} 1
corosync_ring_errors{ring_id="42"} 0

so we can point out which ring is failing. The value is either 0 or 1, where 1 indicates a failure.

implement metric about resource configuration monitoring

desc:

Implement a metric which gives the date when the configuration was last updated.

A pseudo-metric implementation would look more or less like:

ha_cluster_pacemaker_resource_configuration_changes = Mon Oct 28 11:03:29 2019
(the value should be a timestamp)

How to implement this:

we already have the crm_mon command, which exposes this information:

<crm_mon version="2.0.0">
    <summary>
        <stack type="corosync" />
        <current_dc present="true" version="1.1.18+20180430.b12c320f5-3.15.1-b12c320f5" name="damadog-hana01" id="1084780051" with_quorum="true" />
        <last_update time="Mon Oct 28 11:03:34 2019" />
        <last_change time="Mon Oct 28 11:03:29 2019" user="root" client="crm_attribute" origin="damadog-hana02" />

the `last_change` is the field we are looking for.

todo:

research how to convert Mon Oct 28 11:03:29 2019 into a Prometheus-compatible value, perhaps a Unix timestamp
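As a starting point, a hedged sketch using only the standard library; the layout string mirrors the observed crm_mon format:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Go's reference time written in crm_mon's format.
	const crmMonLayout = "Mon Jan 2 15:04:05 2006"
	t, err := time.Parse(crmMonLayout, "Mon Oct 28 11:03:29 2019")
	if err != nil {
		panic(err)
	}
	// Note: time.Parse assumes UTC; crm_mon prints local time, so
	// time.ParseInLocation may be needed in practice.
	fmt.Println(t.Unix()) // 1572260609, usable as a Prometheus gauge value
}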

Check if empty map

We should check whether the maps we are accessing are empty, to avoid possible errors:


if len(m) == 0 {
    // map is nil or empty: skip the lookup and return early
}

Research on error handling in context of metrics

Desc:

I think that, in case of error, we should export empty metrics rather than error out or panic.
We should log the error but not interrupt.

When the exporter actually needs to panic is something to be researched.

[challenge]: build the exporter package for Arch Linux, Debian, CentOS 7 or another distro with OBS

description:

we need to build the exporter for other distros.

A good first target would be CentOS 7, since we already have the repo.

For Arch Linux or Debian, we could add repos there.

we build the pkg with:
https://build.opensuse.org/package/show/server:monitoring/prometheus-ha_cluster_exporter

Tutorial:

https://duncan.codes/tutorials/rpm-packaging/

what you will learn:

you will learn how to package software for Linux distributions

warning:

this issue is hacker-level. Are you ready to be one? If yes, take this challenge. (We will help along the way with any info required.)

If you want to take this issue, let us know.

DRBD Metric-02: Implement remote connection

desc

follow-up from #25

implement something like:
ha_cluster_drbd_resource_remote_connection{resource_name="1-single-0",role="primary",volume="0",peer_node_id="2",peer_role="Secondary"} 1

Feature Request: Implement listen IPv4 option

Currently, it is not possible to alter which address the exporter listens on; I propose the addition of an --ip4 flag.

E.g. ./ha_cluster_exporter --ip4 127.0.0.1

Pull request to follow :)
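A minimal sketch of what this could look like (the flag name comes from the proposal; everything else is illustrative, not the actual implementation):

package main

import (
	"flag"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	ip4 := flag.String("ip4", "0.0.0.0", "IPv4 address the exporter listens on")
	flag.Parse()

	http.Handle("/metrics", promhttp.Handler())
	// e.g. --ip4 127.0.0.1 restricts the exporter to the loopback interface
	log.Fatal(http.ListenAndServe(*ip4+":9002", nil))
}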

merge 2 metrics into 1 with a label

from these 2 metrics create 1:

cluster_node_resources{node="ayoub-monitoring-hana01",resource_name="rsc_saphana_prd_hdb00",role="master"} 1

cluster_resources_status{status="master"} 1

become:

cluster_node_resources{node="ayoub-monitoring-hana01",resource_name="rsc_saphana_prd_hdb00",role="master",status="orphaned"} 1

Errors/warnings need more context

We have errors/warnings like this:

log.Warnln(err). IMHO we should use log.Warnf("error while doing x: %s", err) instead, so we have more context info.

Configured metric

In the past we used the configured metric. We should expose this metric separately.

add clustername to metrics

Description

In the case of multiple clusters, we should add a label indicating which cluster a metric belongs to, so we can see which cluster it comes from.

This might need research, and we should perhaps do it at the dashboard level rather than in the exporter.

Design a creative and emotional logo for ha exporter

desc:

It would be nice to have a logo for this project.

The logo should mix or allude to elements of Prometheus and Grafana, plus something evoking clusters and HA. This is a starting point; any imagination is welcome!

refactor pacemaker metrics like the DRBD ones

The DRBD PR (#26) introduced a refactoring.

The pacemaker metrics with types are similar to the DRBD metrics.

We should follow the same pattern. Also:

  • move all function that belong to pacemaker to that new file
  • add unit-tests with a XML data example

Travis configuration seems off

We currently have 4 jobs, 2 for each stage (gofmt and build), but something must be off: we should only have 1 job per stage.

research on how to remove/reset metrics

Currently we have a function for resetting some metrics with labels.

Historically I didn't find any API for doing this job with the Prometheus client library. This issue is mostly a placeholder to research whether we can use some client API to remove metrics or reset their state.
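For what it's worth, client_golang's vector types do expose removal APIs; a small sketch (illustrative names):

package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	gauge := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "cluster_nodes", Help: "example"},
		[]string{"node", "type"},
	)
	gauge.WithLabelValues("node1", "online").Set(1)

	gauge.DeleteLabelValues("node1", "online") // remove a single labelled series
	gauge.Reset()                              // remove every series in the vector
}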

improve english in readme

The readme might contain some English phrases which need rewording or clearer statements to improve the reading experience

create a makefile target for running tests with code coverage and visualize them

the Go toolchain has a feature to create and display code coverage for unit tests

Task:

  • research about this interesting feature

  • create a makefile target like make coverage which executes the tests with code coverage.
    By default we don't need it, so make test will keep running without coverage, while make coverage runs go test with the coverage flags

  • this task should create Go's HTML coverage report and then open a browser window to visualize it (a sketch follows at the end of this issue)

rationale:

We don't want to enable code coverage or similar in Travis or CI, because code coverage can be a really misleading metric when enforced in PRs or automation.

We just want to use code coverage from time to time to see what we could perhaps cover but there should be a human analyzing it.

As use-case could be:

As a dev, I run make coverage and this automatically opens a browser window with Go's HTML code-coverage report.

Then I can analyze from time to time what perhaps isn't covered by tests. (We can't cover everything with tests, since we don't have binary tests.)

The code-coverage review should remain a manual, human process.
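A possible sketch of the target, assuming the standard Go tooling (go test -coverprofile records the coverage; go tool cover -html renders the report and opens it in the default browser):

coverage:
	go test -coverprofile=coverage.out ./...
	go tool cover -html=coverage.out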

Feature Request: Implement Log Levelling

It would be very nice to see log levelling implemented, to enable specifying which log levels are output; a useful feature for production environments.

Currently, the exporter continually outputs info-level logs.

Example:

2019/10/11 11:10:10 [INFO]: Reading quorum status with corosync-quorumtool...
2019/10/11 11:10:10 [INFO]: Reading cluster configuration with crm_mon..
2019/10/11 11:10:15 [INFO]: Reading cluster configuration with crm_mon..
2019/10/11 11:10:15 [INFO]: Reading quorum status with corosync-quorumtool...
[... the same two lines repeat every 5 seconds ...]
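A hedged sketch of how this could look with logrus, which the exporter's log output suggests it already uses (the --log-level flag name is hypothetical):

package main

import (
	"flag"

	log "github.com/sirupsen/logrus"
)

func main() {
	levelFlag := flag.String("log-level", "info", "minimum level of log messages to output")
	flag.Parse()

	level, err := log.ParseLevel(*levelFlag)
	if err != nil {
		log.Fatalf("invalid log level %q: %s", *levelFlag, err)
	}
	// e.g. --log-level=warn silences the periodic INFO lines above
	log.SetLevel(level)
}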

differentiate node type and status

Desc

From metric ha_cluster_nodes
differentiate node "type" and "status":
type: member/remote
status: combinations of "online/standby/standby_onfail/maintenance/pending/unclean/shutdown/expected_up/is_dc".

follow-up from
#58 (comment)

Bug / Feature Request: Cluster SBD erroring (when not present)

I have noticed that when operating the exporter on a machine that does not have SBD set up, the following errors are seen repeatedly:

2019/10/11 15:14:12 [INFO]: Reading cluster SBD configuration..
2019/10/11 15:14:12 [ERROR] Could not open sbd config file open /etc/sysconfig/sbd: no such file or directory

There should maybe be a check such as

if _, err := os.Stat("/etc/sysconfig/sbd"); os.IsNotExist(err) {
	log.Println("[INFO]: SBD is not present... skipping.")
} else {
	// read and parse the SBD configuration as usual
}

This would give a single output of 0000/00/00 00:00:00 [INFO]: SBD is not present... skipping. instead of continually repeating an error in the absence of SBD. :)

The cluster binaries' locations vary depending on release or distribution

Depending on the distro or the service pack, the cluster binaries' locations can vary, causing the exporter to not find the executables. For example, between SLE-12-SP3 and SLE-15, the drbdsetup binary changed location and the exporter shows the following error:

Oct 31 13:32:14 dev-1 ha_cluster_exporter[73317]: time="2019-10-31T13:32:14Z" level=warning msg="Could not initialize DRBD collector: '/usr/sbin/drbdsetup' not found: stat /usr/sbin/drbdsetup: n...r directory\n"

And here we see the real location

dev-1:~ # which drbdsetup
/sbin/drbdsetup

The exporter should allow different locations and ideally should try to find it by itself.
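One possible approach, sketched under the assumption that a PATH lookup is acceptable (findBinary is a hypothetical helper, not the exporter's actual code):

package main

import (
	"fmt"
	"os"
	"os/exec"
)

// findBinary honours an explicitly configured path first, then falls back
// to searching the directories in $PATH via exec.LookPath, so that e.g.
// /sbin/drbdsetup is found even when /usr/sbin/drbdsetup is absent.
func findBinary(configured, name string) (string, error) {
	if configured != "" {
		if _, err := os.Stat(configured); err == nil {
			return configured, nil
		}
	}
	return exec.LookPath(name)
}

func main() {
	path, err := findBinary("", "drbdsetup")
	if err != nil {
		fmt.Fprintln(os.Stderr, "drbdsetup not found:", err)
		os.Exit(1)
	}
	fmt.Println("using", path)
}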

implement metric about constraints with ban

desc:

The metric should look like:
ha_cluster_pacemaker_constraint{type="ban|prefer",id="foobar"}

How to implement it:

  1. use crm resource ban rsc_ip_PRD_HDB00
  2. crm_mon already shows this info in the <bans> field

so it will show the following:

    <bans>
        <ban id="cli-ban-rsc_ip_PRD_HDB00-on-damadog-hana02" resource="rsc_ip_PRD_HDB00" node="damadog-hana02" weight="-1000000" master_only="false" />
    </bans>

And we can expose the resource ID and add the type (see the sketch below).
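A hedged sketch of extracting the <bans> section from crm_mon's XML output with encoding/xml (struct names are illustrative):

package main

import (
	"encoding/xml"
	"fmt"
)

type crmMon struct {
	Bans []ban `xml:"bans>ban"`
}

type ban struct {
	ID       string `xml:"id,attr"`
	Resource string `xml:"resource,attr"`
	Node     string `xml:"node,attr"`
}

func main() {
	data := []byte(`<crm_mon><bans>
	    <ban id="cli-ban-rsc_ip_PRD_HDB00-on-damadog-hana02" resource="rsc_ip_PRD_HDB00" node="damadog-hana02" weight="-1000000" master_only="false" />
	</bans></crm_mon>`)

	var status crmMon
	if err := xml.Unmarshal(data, &status); err != nil {
		panic(err)
	}
	for _, b := range status.Bans {
		// each ban becomes one labelled sample of the proposed metric
		fmt.Printf("ha_cluster_pacemaker_constraint{type=%q,id=%q} 1\n", "ban", b.ID)
	}
}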

Feature Request: Implement a landing page

Currently, if you visit the exporter on http://localhost:9002/ you get a 404 error, which is not an issue per se; however, it would be nice to have a simple landing page which just explains what the exporter is, similar to various others such as the node exporter.

This could be done using something such as:

http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte(`<html>
		<head>
			<title>HACluster Exporter</title>
		</head>
		<body>
			<h1>HACluster Exporter</h1>
			<p><a href="/metrics">Metrics</a></p>
			<br />
			<h2>More information:</h2>
			<p><a href="https://github.com/ClusterLabs/ha_cluster_exporter">github.com/ClusterLabs/ha_cluster_exporter</a></p>
		</body>
		</html>`))
})

Implement DRBD monitoring

metric:

  1. DRBD resource:
  • ha_cluster_drbd_resource{resource_name="1-single-0", role="primary", volume="0", disk_state="uptodate", replication_state="established", peer_disk_state="uptodate"}

  • ha_cluster_drbd_resource{resource_name="vg2", role="secondary", volume="0", disk_state="uptodate", replication_state="established", peer_disk_state="uptodate"}

  • ha_cluster_drbd_resource{resource_name="vg2", role="secondary", volume="1", disk_state="outdated", replication_state="established", peer_disk_state="uptodate"}


drbdsetup status --json

[
{
  "name": "1-single-0",
  "node-id": 1,
  "role": "Primary",
  "suspended": false,
  "write-ordering": "flush",
  "devices": [
    {
      "volume": 0,
      "minor": 2,
      "disk-state": "UpToDate",
      "client": false,
      "quorum": true,
      "size": 409600,
      "read": 8457,
      "written": 536513,
      "al-writes": 4,
      "bm-writes": 0,
      "upper-pending": 0,
      "lower-pending": 0
    } ],
  "connections": [
    {
      "peer-node-id": 2,
      "name": "SLE15-sp1-gm-drbd1145296-node2",
      "connection-state": "Connected", 
      "congested": false,
      "peer-role": "Secondary",
      "ap-in-flight": 0,
      "rs-in-flight": 0,
      "peer_devices": [
        {
          "volume": 0,
          "replication-state": "Established",
          "peer-disk-state": "UpToDate",
          "peer-client": false,
          "resync-suspended": "no",
          "received": 8202,
          "sent": 534390,
          "out-of-sync": 0,
          "pending": 0,
          "unacked": 0,
          "has-sync-details": false,
          "has-online-verify-details": false,
          "percent-in-sync": 100.00
        } ]
    } ]
}
, … (output truncated)
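A hedged sketch of the structs needed to decode this output with encoding/json, covering only the fields shown above (running it requires drbdsetup on the machine):

package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

type drbdResource struct {
	Name        string           `json:"name"`
	Role        string           `json:"role"`
	Devices     []drbdDevice     `json:"devices"`
	Connections []drbdConnection `json:"connections"`
}

type drbdDevice struct {
	Volume    int    `json:"volume"`
	DiskState string `json:"disk-state"`
}

type drbdConnection struct {
	PeerNodeID  int              `json:"peer-node-id"`
	PeerRole    string           `json:"peer-role"`
	PeerDevices []drbdPeerDevice `json:"peer_devices"`
}

type drbdPeerDevice struct {
	Volume           int    `json:"volume"`
	ReplicationState string `json:"replication-state"`
	PeerDiskState    string `json:"peer-disk-state"`
}

func main() {
	out, err := exec.Command("drbdsetup", "status", "--json").Output()
	if err != nil {
		panic(err)
	}
	var resources []drbdResource
	if err := json.Unmarshal(out, &resources); err != nil {
		panic(err)
	}
	for _, r := range resources {
		fmt.Println(r.Name, r.Role) // e.g. "1-single-0 Primary"
	}
}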
