
netapp / harvest


Open-metrics endpoint for ONTAP and StorageGRID

Home Page: https://netapp.github.io/harvest/latest

License: Apache License 2.0

Makefile 0.48% Shell 0.98% Go 97.66% Dockerfile 0.09% Python 0.61% CUE 0.12% JavaScript 0.07%
netapp-public prometheus monitoring grafana-dashboard observability storagegrid go

harvest's Introduction

What is NetApp Harvest?

Harvest is the open-metrics endpoint for ONTAP and StorageGRID

NetApp Harvest brings observability to ONTAP and StorageGRID clusters. Harvest collects performance, capacity and hardware metrics from ONTAP and StorageGRID, transforms them, and routes them to your choice of time-series database.

The included Grafana dashboards deliver the datacenter insights you need, while new metrics can be collected with a few edits of the included template files.

Harvest is open-source, built with Go, released under an Apache2 license, and offers great flexibility in how you collect, augment, and export your datacenter metrics.

To get started, follow our quickstart guide or install Harvest.

Community

There is a vibrant community of Harvest users on Discord and GitHub discussions. Come join! 👋

Documentation

📕 https://netapp.github.io/harvest/

Videos


Developed with 💙 by NetApp - Privacy Policy

harvest's People

Contributors

7840vz, burkl, cdurai-netapp, cgrinds, chrishenzie, dependabot[bot], george-strother, github-actions[bot], hardikl, mrydeen, rahulguptajss, renovate[bot], samyuktham, schmots1, sridevimm, vgratian


harvest's Issues

Build release artifacts for other OSes

  • Alpine Linux for small containers - see relevant stackoverflow
  • Mac (darwin) - several folks have asked for this
  • Evaluate GoReleaser

While Harvest runs on the Mac (several of us develop/run on Macs), we aren't packaging pre-compiled binaries for it yet.

Building is easy with

GOOS=darwin make

You'll also want to edit your harvest.yml and add a cluster, since the unix poller doesn't work on the Mac.

harvest stop does not stop pollers that have been renamed

Steps to reproduce - will add

  1. Edit harvest.yml and add/enable one poller, call it foo
  2. Verify not running
bin/harvest status
Datacenter            Poller                PID        PromPort        Status
+++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
nane                  foo                                              not running
+++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
  3. Start poller
bin/harvest start
Datacenter            Poller                PID        PromPort        Status
+++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
nane                  foo                   5828                       running
+++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
  4. Edit harvest.yml and change the name of foo to foo2
  5. Status fails because the "wrong" poller is queried
bin/harvest status
Datacenter            Poller                PID        PromPort        Status
+++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
nane                  foo2                                             not running
+++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++

If you run harvest start, a new poller named foo2 is created while the first poller is still running:

ps aux | grep poller
root      5828  3.0  0.0 2795344 76752 ?       Sl   11:17   0:04 bin/poller --poller foo --loglevel 2 --promPort  --daemon
root      5912 49.8  0.0 2869588 97988 ?       Sl   11:19   0:02 bin/poller --poller foo2 --loglevel 2 --promPort  --daemon

start/stop/status should be more resilient to name changes. In a few places, we already interrogate /proc, extract command-line arguments, and parse them. We should do the same for stop and status. In other words, stop and status should not depend on the names in harvest.yml; instead they should query the OS.
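
A rough sketch (not Harvest's actual implementation) of discovering running pollers by reading /proc/<pid>/cmdline instead of trusting the names in harvest.yml:

package main

import (
    "bytes"
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

// findPollers returns a map of poller name -> PID for every process whose
// command line looks like "bin/poller --poller <name> ...".
func findPollers() map[string]string {
    pollers := make(map[string]string)
    procs, _ := filepath.Glob("/proc/[0-9]*/cmdline")
    for _, p := range procs {
        raw, err := os.ReadFile(p)
        if err != nil {
            continue
        }
        // cmdline arguments are NUL-separated
        args := strings.Split(string(bytes.TrimRight(raw, "\x00")), "\x00")
        for i, a := range args {
            if a == "--poller" && i+1 < len(args) {
                pid := strings.Split(p, "/")[2] // "/proc/<pid>/cmdline"
                pollers[args[i+1]] = pid
            }
        }
    }
    return pollers
}

func main() {
    for name, pid := range findPollers() {
        fmt.Printf("poller=%s pid=%s\n", name, pid)
    }
}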

Support global_prefix on Grafana import

Is your feature request related to a problem? Please describe.
Support the exporter global_prefix during Grafana dashboard import. If a prefix is defined today, all queries on the dashboards have to be updated manually.

Describe the solution you'd like
While importing the dashboards via the Grafana tool, the exporter section should be checked for global_prefix entries.

Describe alternatives you've considered
Add a command-line parameter (e.g. --db-prefix) to the Grafana tool, similar to the --datasource parameter, for naming the prefix and updating the queries in the dashboards.
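
A rough illustration of what the Grafana tool could do when importing. Real PromQL rewriting would need a parser; this regex-based version only sketches the idea, and the metric-name pattern is an assumption:

package main

import (
    "fmt"
    "regexp"
)

// metricRe is a simplified, assumed pattern for Harvest metric families.
var metricRe = regexp.MustCompile(`\b(volume_|node_|aggr_|cluster_)`)

// applyPrefix prepends the exporter's global_prefix to metric names in a query.
func applyPrefix(expr, prefix string) string {
    return metricRe.ReplaceAllString(expr, prefix+"$1")
}

func main() {
    fmt.Println(applyPrefix(`sum(volume_read_data{datacenter="$Datacenter"})`, "netapp_"))
    // prints: sum(netapp_volume_read_data{datacenter="$Datacenter"})
}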

Harvest stops working after reboot on CentOS / RHEL (tmpfilesd configuration missing)

Describe the bug
After a reboot, Harvest pollers do not start.

Environment

  • Harvest: harvest version 21.05.1-1 (commit 2211c00) (build date 2021-05-21T01:28:12+0530) linux/amd64
  • OS: RHEL 7.9
  • Install method: yum

To Reproduce

  1. yum install harvest.rpm
  2. configure poller
  3. systemctl enable harvest
  4. restart

Expected behavior
Harvest pollers should run after a restart.

Actual behavior
Harvest pollers don't start because /var/run/harvest is missing and cannot be recreated by Harvest itself.

error mkdir [/var/run/harvest/]: mkdir /var/run/harvest/: permission denied

Possible solution, workaround, fix
The RPM creates the directory /var/run/harvest for the PID files with mkdir. The directory /var/run in RHEL/CentOS 7 is managed by systemd-tmpfiles and recreated from scratch during reboot. You need to include a tmpfiles.d configuration for the directory.

Example:
cat /usr/lib/tmpfiles.d/harvest.conf
d /var/run/harvest 0755 harvest harvest -

Additional context

Status label on cluster_status metric disappears

Describe the bug
The status label on the cluster_status metric disappears roughly 15 minutes after starting Harvest.

Environment
Provide accurate information about the environment to help us reproduce the issue.

  • Harvest version: harvest version 21.05.1-1 (commit 2211c00) (build date 2021-05-21T01:28:12+0530) linux/amd64
  • Command line arguments used: [e.g. bin/harvest start --config=foo.yml --collectors Zapi]
  • OS: RHEL 7.9
  • Install method: yum
  • ONTAP Version: 9.7
  • Other:

To Reproduce
curl -s localhost:12990/metrics | grep cluster_status

Expected behavior
Should report cluster_status metric as
cluster_status{cluster="clustername",datacenter="dc",status="ok"} 1

Actual behavior
Initially the correct metric is reported; after roughly 15 minutes it turns into:
cluster_status{cluster="clustername",datacenter="dc"} 1

Possible solution, workaround, fix
None

Performance metrics don't display volume name

When polling performance metrics, the output looks like the following:
volume_write_latency{datacenter="lod",cluster="cluster1",node="cluster1-01",svm="cluster1-01",aggr="aggr0",type="flexvol"} 79.36234458259325
=> volume name is missing

whereas other metrics display the following:
volume_size_available{datacenter="lod",cluster="cluster1",volume="trident_qtree_pool_nas2_LRHCOABWFT",node="cluster1-01",svm="nfs_svm",aggr="aggr1",style="flexvol"} 5367836672
=> volume name is present

Is that expected behavior?

Environment configuration:

  • NetApp Lab on Demand "Using Trident with Kubernetes and ONTAP v4.0"
  • host: RHEL 7.5
  • installation method: yum install harvest-21.05.1-1.x86_64.rpm
  • Harvest collectors: zapi & zapiperf
  • ONTAP 9.7

Allow TLS server verification for basic auth

Describe the bug
Pollers won't start with this combination of options:

use_insecure_tls: false
auth_style: basic_auth
username: user
password: pass

Environment
Provide accurate information about the environment to help us reproduce the issue.

  • Harvest version: harvest version 21.05.1-1 (commit 2211c00) (build date 2021-05-21T01:28:12+0530) linux/amd64
  • OS: RHEL 7.9
  • Install method: yum
  • ONTAP Version: 9.7P12

To Reproduce
Set use_insecure_tls to false and use basic authentication with username & password

Expected behavior
Pollers should start; the server certificate in ONTAP is valid.

Actual behavior
Pollers won't start.

2021/05/25 09:40:40  (warning) (poller) (hostname): init collector-object (Zapi:Node): connection error => invalid parameter => use_insecure_tls is false, but no certificates
2021/05/25 09:40:40  (warning) (poller) (hostname): aborting collector (Zapi)
2021/05/25 09:40:40  (warning) (poller) (hostname): init collector-object (ZapiPerf:SystemNode): connection error => invalid parameter => use_insecure_tls is false, but no certificates
2021/05/25 09:40:40  (warning) (poller) (hostname): aborting collector (ZapiPerf)
2021/05/25 09:40:40  (warning) (poller) (hostname): no collectors initialized, stopping
2021/05/25 09:40:40  (info   ) (poller) (hostname): cleaning up and stopping [pid=22027]

Possible solution, workaround, fix
Fix client.go logic to allow certificate verification with basic auth.
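
For reference, a minimal sketch (not Harvest's client.go) of an HTTPS client that keeps server certificate verification enabled while authenticating with basic auth; the URL and credentials are placeholders:

package main

import (
    "crypto/tls"
    "fmt"
    "net/http"
)

func main() {
    client := &http.Client{
        Transport: &http.Transport{
            // use_insecure_tls: false -> keep certificate verification on
            TLSClientConfig: &tls.Config{InsecureSkipVerify: false},
        },
    }
    req, _ := http.NewRequest("GET", "https://cluster.example.com/", nil)
    req.SetBasicAuth("user", "pass") // basic_auth credentials from harvest.yml
    resp, err := client.Do(req)
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}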

Add Universal Collector to collect metrics from files or HTTP endpoint

Is your feature request related to a problem? Please describe.
Harvest 1.6 supported running extensions (scripts in any language) to easily collect custom metrics. One example is collecting NFS mount information from the ONTAP CLI, but extensions were able to collect any metric from any source. This functionality is missing in Harvest 2.0. Many users already have scripts to collect custom metrics which they cannot easily integrate with Harvest. Currently the only solution is to rewrite those scripts in Go and create a Harvest collector.

Describe the solution you'd like
The idea was suggested by @georgmey. Create a collector that collects metrics in the open-metrics format from files or a remote HTTP endpoint. This would be very similar to Prometheus' node exporter.

Describe alternatives you've considered
Implementing the same extension framework that we had in Harvest 1.6 would require too much work. Moreover, running custom scripts as part of the Harvest runtime would raise considerable security concerns (one of the reasons we discontinued 1.6).

Additional context
The general workflow of the Universal collector:

  • The user has a custom script or program that generates metrics and writes them either to a file or exposes them on an HTTP endpoint.
  • The Universal Collector collects those metrics and parses them into the Matrix.
  • The user can choose which databases to export these metrics to (as usual).

The Universal Collector will only be a good fit for a small number of metrics. It would not be a good solution for large volumes of metrics where high performance is critical.
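
A minimal sketch of the collection side, assuming the custom script exposes metrics in the Prometheus/open-metrics text format; the endpoint URL and the line-based parsing are illustrative only:

package main

import (
    "bufio"
    "fmt"
    "net/http"
    "strconv"
    "strings"
)

// scrape reads one sample per line ("name{labels} value"), skipping
// HELP/TYPE comments, and returns a map of raw series -> value.
func scrape(url string) (map[string]float64, error) {
    resp, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    metrics := make(map[string]float64)
    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        line := strings.TrimSpace(scanner.Text())
        if line == "" || strings.HasPrefix(line, "#") {
            continue
        }
        idx := strings.LastIndex(line, " ")
        if idx < 0 {
            continue
        }
        value, err := strconv.ParseFloat(line[idx+1:], 64)
        if err != nil {
            continue
        }
        metrics[line[:idx]] = value
    }
    return metrics, scanner.Err()
}

func main() {
    m, err := scrape("http://localhost:9100/metrics") // hypothetical endpoint
    if err != nil {
        fmt.Println(err)
        return
    }
    for series, value := range m {
        fmt.Println(series, value)
    }
}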

Prometheus exporter: incomplete metatags

Describe the bug

  • Instance labels (when exported as a separate metric) and status metrics do not get the "HELP" and "TYPE" metatags.

This was reported by a customer; I will add more details when I get them from the customer and/or reproduce the issue myself.

Shelf metrics appear to only collect metrics for one shelf

Describe the bug
Shelf metrics appear to only be collected for one shelf

Environment
Provide accurate information about the environment to help us reproduce the issue.

  • Harvest version: harvest version 21.05.1-1 (commit 2211c00) (build date 2021-05-21T01:28:12+0530) linux/amd64
  • Command line arguments used: [e.g. bin/harvest start --config=foo.yml --collectors Zapi]
  • OS: RHEL 7.9
  • Install method: yum
  • ONTAP Version: 9.7
  • Other:

To Reproduce
N/A

Expected behavior
Metrics such as shelf_temperature_reading and shelf_psu_power_drawn should be reported for each shelf id.

Actual behavior
One time series is returned per cluster, for what appears to be a randomly chosen shelf id.

Possible solution, workaround, fix
N/A

Additional context
Appears to be related to the Shelf plugin used at /conf/zapi/cdot/9.8.0/shelf.yaml; shelf_labels appears to correctly collect data for each shelf.

InfluxDB exporter should support URL end-point

Is your feature request related to a problem? Please describe.
From Slack:

Steve S  3:19 PM
for the influxdb exporter is it possible to just pass a URL?

Chris Grindstaff  3:21 PM
@Steve S can you give an example of what you want? you mean instead of decomposing the url into an addr and port?

Steve S  3:25 PM
correct - we have a large-scale configuration behind a load balancer (so we're not hitting a single host).

Chris Grindstaff  3:28 PM
make sense - I'll create an issue to track - if you're interested this is the line of code to change if you want to give it a try

Describe the solution you'd like
Allow the exporter to use a URL instead of a separate address and port.
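
A hedged sketch of the requested behavior: accept a full url in the exporter config and fall back to composing one from addr and port (the /write?db= form is the InfluxDB 1.x write API; the names here are illustrative, not Harvest's config keys):

package main

import (
    "fmt"
    "net/url"
)

// influxEndpoint prefers a full URL (e.g. a load-balancer front end) and
// otherwise builds one from addr, port, and database name.
func influxEndpoint(rawURL, addr, port, db string) (string, error) {
    if rawURL != "" {
        u, err := url.Parse(rawURL)
        if err != nil {
            return "", err
        }
        return u.String(), nil
    }
    return fmt.Sprintf("http://%s:%s/write?db=%s", addr, port, db), nil
}

func main() {
    e, _ := influxEndpoint("https://influx-lb.example.com/write?db=harvest", "", "", "")
    fmt.Println(e)
    e, _ = influxEndpoint("", "localhost", "8086", "harvest")
    fmt.Println(e)
}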

Add more metrics to Grafana dashboards

The current set of Prometheus dashboards displays a small fraction of the metrics Harvest collects.
We should consider:

  • culling metrics we don't plan on ever displaying
  • adding useful ones to existing dashboards
  • making it easy for customers to decide on their own

Improve Grafana tool

Needs more tests.

Feature request: Make datasource a variable.
See

  -v, --variable            use datasource as variable, overrides: --datasource

Unified Manager (AIQUM) as data source for clusters

Is your feature request related to a problem? Please describe.
With more than 100 clusters, it is a lot of work to keep harvest.conf updated.

Describe the solution you'd like
Harvest connects to AIQUM to get a list of all clusters (maybe with the possibility to filter by annotations). Afterwards it uses this list to connect to every cluster and gather all the performance and health information.
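
A rough sketch of the idea; the AIQUM endpoint path and response field names below are assumptions about the AIQUM REST API, not verified:

package main

import (
    "crypto/tls"
    "encoding/json"
    "fmt"
    "net/http"
)

type clusterList struct {
    Records []struct {
        Name         string `json:"name"`
        ManagementIP string `json:"management_ip"` // assumed field name
    } `json:"records"`
}

func main() {
    client := &http.Client{Transport: &http.Transport{
        TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // lab only
    }}
    // Assumed endpoint; adjust to the AIQUM API version in use.
    req, _ := http.NewRequest("GET", "https://aiqum.example.com/api/datacenter/cluster/clusters", nil)
    req.SetBasicAuth("admin", "password")
    resp, err := client.Do(req)
    if err != nil {
        fmt.Println(err)
        return
    }
    defer resp.Body.Close()
    var clusters clusterList
    _ = json.NewDecoder(resp.Body).Decode(&clusters)
    // Print what a generated Pollers section could use.
    for _, c := range clusters.Records {
        fmt.Printf("%s: addr: %s\n", c.Name, c.ManagementIP)
    }
}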

"systemctl status harvest" not accurate

Hi

My Harvest environment seems to be working; however, the systemctl status harvest command does not show my poller as running, even though it works...

Some details about the environment:

 systemctl status harvest
โ— harvest.service - Harvest
   Loaded: loaded (/etc/systemd/system/harvest.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2021-05-21 08:06:19 UTC; 5min ago
  Process: 8875 ExecStart=/opt/harvest/bin/harvest restart --config /opt/harvest/harvest.yml (code=exited, status=0/SUCCESS)
 Main PID: 8882 (poller)
    Tasks: 8
   Memory: 31.2M
   CGroup: /system.slice/harvest.service
           └─8882 bin/poller --poller cluster1 --loglevel 2 --config /opt/harvest/harvest.yml --daemon

May 21 08:06:19 rhel6 systemd[1]: Starting Harvest...
May 21 08:06:19 rhel6 harvest[8875]: Datacenter            Poller                PID        PromPort        Status
May 21 08:06:19 rhel6 harvest[8875]: +++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
May 21 08:06:19 rhel6 harvest[8875]: lod                   cluster1                                         not running
May 21 08:06:19 rhel6 systemd[1]: Started Harvest.

Polling Harvest:

curl localhost:31000/metrics | grep volume_write_data
volume_write_data{datacenter="lod",cluster="cluster1",node="cluster1-01",svm="nfs_svm",aggr="aggr1",type="flexvol"} 0
volume_write_data{datacenter="lod",cluster="cluster1",node="cluster1-01",svm="nfs_svm",aggr="aggr1",type="flexvol"} 282.42660719938857
volume_write_data{datacenter="lod",cluster="cluster1",node="cluster1-01",svm="cluster1-01",aggr="aggr0",type="flexvol"} 11622.309120154781

my Harvest configuration

$ more harvest.yml
Exporters:
  Harvest:
    exporter: prometheus
    port: 31000
    addr: 0.0.0.0

Defaults:
  collectors:
    - Zapi
    - ZapiPerf
  exporters:
    - Harvest
  use_insecure_tls: true

Pollers:
  cluster1:
    datacenter: lod
    addr: 192.168.0.101
    auth_style: basic_auth
    username: admin
    password: Netapp1!

Environment configuration:

  • NetApp Lab on Demand "Using Trident with Kubernetes and ONTAP v4.0"
  • host: RHEL 7.5
  • installation method: yum install harvest-21.05.1-1.x86_64.rpm

Adding promPort to pollers section in harvest.conf

Is your feature request related to a problem? Please describe.
We want to have only one exporter and define the Prometheus exporter ports in the individual pollers.

Describe the solution you'd like
Add the parameter promPort to the Poller section.

Additional context
Example conf file

Exporters:
  harvest:
    exporter: prometheus
    addr: 0.0.0.0
    global_prefix: netapp_
    master: True

Defaults:
  collectors:
    - Zapi
    - ZapiPerf
  exporters:
    - harvest
  use_insecure_tls: true
  auth_style: basic_auth
  username: <user>
  password: <password>

Pollers:
  clusterA:
    datacenter: DC1
    addr: clusterA
    promPort: 25000
  clusterB:
    datacenter: DC2
    addr: clusterB
    promPort: 25001

Improve config tool

The config tool helps configure pollers that monitor ONTAP clusters.

The tool should:

  • validate the Harvest configuration file (harvest.yml)

Implemented, but needs more testing:

  • create a client certificate on the local system
  • create a read-only harvest user for ONTAP
  • install client certificate on ONTAP

Known issues

  • client certificate is re-generated each time

We may want to integrate this tool with the Zapi tool.

Create harvest doctor to validate customer environments

Similar to brew doctor and a support bundle: collect information, validate it, and present it to the customer.
Compress and offer to share.

Ideas:

  • validate yaml
  • verify harvest is or is not running
  • verify that Prometheus is or is not reachable
  • verify permissions
  • verify credentials
  • hit the Prometheus endpoint(s) and show results
  • check versions of Prometheus, Grafana, InfluxDB, OS, and embedded release/commit/build date
  • validate /var/run/harvest permissions

Extract what's reasonable from the Troubleshooting Harvest guide and automate and/or include it.
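
One possible doctor check, sketched with gopkg.in/yaml.v3: confirm that harvest.yml parses and that every poller has an addr. The struct fields are a simplified assumption of the real config schema:

package main

import (
    "fmt"
    "os"

    "gopkg.in/yaml.v3"
)

type config struct {
    Pollers map[string]struct {
        Datacenter string `yaml:"datacenter"`
        Addr       string `yaml:"addr"`
    } `yaml:"Pollers"`
}

func main() {
    data, err := os.ReadFile("harvest.yml")
    if err != nil {
        fmt.Println("cannot read harvest.yml:", err)
        return
    }
    var c config
    if err := yaml.Unmarshal(data, &c); err != nil {
        fmt.Println("harvest.yml is not valid YAML:", err)
        return
    }
    for name, p := range c.Pollers {
        if p.Addr == "" {
            fmt.Printf("poller %q is missing addr\n", name)
        }
    }
    fmt.Printf("checked %d pollers\n", len(c.Pollers))
}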

Add Prometheus service discovery

Harvest should use one of Prometheus's existing service discovery options to make it easier for customers to add hundreds of pollers to Harvest without having to specify the exact port for each poller and each Prometheus target.

Reference

https://prometheus.io/docs/guides/file-sd/
https://prometheus.io/blog/2018/07/05/implementing-custom-sd/
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#dns_sd_config
https://yetiops.net/posts/prometheus-srv-discovery/
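
A hedged sketch of the file_sd direction: Harvest could write a Prometheus file_sd targets file for its pollers, so Prometheus discovers them without hand-maintained scrape targets. The poller names and ports below are examples, not read from harvest.yml:

package main

import (
    "encoding/json"
    "fmt"
    "os"
)

// targetGroup matches the Prometheus file_sd JSON format.
type targetGroup struct {
    Targets []string          `json:"targets"`
    Labels  map[string]string `json:"labels"`
}

func main() {
    groups := []targetGroup{
        {Targets: []string{"localhost:12990"}, Labels: map[string]string{"poller": "cluster1", "datacenter": "dc1"}},
        {Targets: []string{"localhost:12991"}, Labels: map[string]string{"poller": "cluster2", "datacenter": "dc2"}},
    }
    out, err := json.MarshalIndent(groups, "", "  ")
    if err != nil {
        fmt.Println(err)
        return
    }
    // Point a file_sd_configs entry in prometheus.yml at this file.
    if err := os.WriteFile("harvest_targets.json", out, 0644); err != nil {
        fmt.Println(err)
    }
}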

version flag is missing new line on some shells

$ bin/harvest --version
harvest version 21.05.1919 darwin/amd6 $
---------------------------------------^  cursor is left here, should be on next line

This happens in bash, does not happen in zsh or fish.

Improve ZAPI tool

The Zapi tool (harvest zapi) retrieves available counters and metadata from ONTAP (CDOT or 7mode). The tool should enable you to configure Harvest to collect any ONTAP metric.

Currently the tool is able to

  • retrieve ZAPI attributes and ZAPIPERF objects and counters
  • export ZAPIPERF objects into Harvest .yaml templates

Issues:

  • ugly output
  • exported templates not properly tested

NAbox compatibility

As Alpine Linux doesn't use glibc, Harvest 2 doesn't currently work on it, which prevents integration into NAbox 3.

This can be solved either by providing an Alpine build in the pipeline (or an apk package), or by compiling Harvest statically.

Sorry @vgratian, I totally forgot about that one!

Thoughts?

Harvest Plugins

Harvest v21.05.01 has limited support for dynamically linked Go plugins built using buildmode=plugin. Unfortunately Go's support for dynamic plugins is weak and comes with significant drawbacks.

Basically, Go plugins were not designed as a way for other people to extend your app. They were designed for you to extend your app.

Before outlining the pros and cons of Go's plugins, let's explore why we want them.

Why do we want Harvest plugins?

  1. Plugins allow 3rd parties to extend Harvest without (re)building it. You program to Harvest's API and, at runtime, Harvest dynamically loads your code into its process and calls it.

  2. You only "pay" for what you use. If you don't use a feature from plugin A, you don't pay for it in disk footprint, you don't load the code into memory, etc. This is less important than #1.

Plugins, as a concept, are great. They allow us to build a loosely-coupled modular system. Harvest's current implementation doesn't address 1 or 2, and arguably introduces more problems than it solves.

Cons of Go Plugins

  1. Plugins and Harvest must be compiled with the exact same version of Go.
  2. Plugins and Harvest must be compiled with the same GOPATH.
  3. Any packages imported by both Harvest and the plugin must be the exact same version.
  4. Plugins and Harvest can NOT vendor dependencies. If either has vendored dependencies, Harvest won't work.
  5. Debuggers don't work with dynamic code - this means you can't use a debugger with Harvest right now because the interesting parts of Harvest are implemented as Go plugins.
  6. Makes cross compiling harder or impossible (see Windows, Alpine, etc.)
  7. Creates ~3x larger executables - with buildmode=plugin the bin directory is 140 MB, without buildmode=plugin it is 48 MB
  8. ~7x slower builds
# with buildmode=plugin
$ make clean
$ time GOOS=darwin make build
Executed in   24.94 secs

# without buildmode=plugin
$ make clean
$ time GOOS=darwin make build
Executed in    3.39 secs

Experience of others

We're not the only team to hit issues with Go plugins. I haven't found a project that recommends them.

Traefik

Traefik tried and abandoned plugins due to development pain

traefik/traefik#1336 (comment)
traefik/traefik#1336 (comment)

OK, bad news...
We can't load an external plugin if Traefik is built with CGO_ENABLED=0, and we really need this to build a statically linked golang executable to run in a Docker container.
golang/go#19569 and there are no plan on this golang/go#19569 (comment)

If you compile traefik binary on your laptop, and a plugin in docker on your laptop, it does not work either: Error loading plugin: error opening plugin: plugin.Open: plugin was built with a different version of package

Ultimately they had so many problems they abandoned Go plugins and built a Go interpreter and use that instead.

Telegraf

Telegraf added Go plugin support, but they consider it experimental with limited support, and it still requires a custom build of Telegraf. Their issue tracker has the usual build and version issues everyone else hits.

influxdata/telegraf#7162

Prometheus and VictoriaMetrics

Not sure if they rejected or never tried. Probably rejected.

Caddy

Not sure if Caddy learned from others and rejected Go plugins or went a different way from the beginning. The way you extend Caddy is by building a custom version yourself with side-effect-loading init functions: no dynamic linking; you edit Caddy's main and include your imports.

Options for Harvest

  1. Remove buildmode=plugin code. If folks want to extend Harvest with plugins they clone, add their code, and build their own version of Harvest.

  2. Add an approach similar to Caddy and Benthos - build your plugin and add an import to Harvest's main. Benthos example

  3. Keep what we have - not great given the downsides: no debugger, bigger executables, no cross compile.

  4. Add some sort of exec model where we can call any process and read/write to stdout/stdin. Not great from a performance or security point of view. A poor man's RPC.

  5. RPC - something like Hashicorp's. Performance may be a concern here too - the trick with both of these last two is to avoid too many trips across the RPC layer.

Recommendation

We should go with #1 & #2. Keep the plugin concept in Harvest, but don't implement it with Go plugins. First we remove buildmode=plugin (already done) and then we work on #2 using an architecture similar to Caddy's.

Until we implement #2, if you want to extend Harvest, you extend it the same way you would most open-source projects: clone, make your changes, keep your fork up to date.
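
A self-contained sketch of the option #2 registration pattern; names like Plugin, Register, and registry are illustrative, not Harvest's API:

package main

import "fmt"

// Plugin is the interface a compiled-in extension implements.
type Plugin interface {
    Name() string
    Run() error
}

var registry = map[string]Plugin{}

// Register is called from a plugin's init(); adding a plugin is just an
// import line in main, with no buildmode=plugin or dynamic linking involved.
func Register(p Plugin) { registry[p.Name()] = p }

type helloPlugin struct{}

func (helloPlugin) Name() string { return "hello" }
func (helloPlugin) Run() error   { fmt.Println("hello from a compiled-in plugin"); return nil }

func init() { Register(helloPlugin{}) }

func main() {
    for name, p := range registry {
        fmt.Println("running", name)
        _ = p.Run()
    }
}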

Resources

Release 21.05.0

  • Cherry pick remaining changes
  • Use Jenkins to create release artifacts
  • Update changelog for 21.05
  • Back fill rc2 and rc1 sections of changelog
  • Create NOTICE file for 21.05
  • Commit, push, create a new release pull request - use the same one all the way through to the final release
  • Use GitHub Create Release to publish
  • Announce

Replace current Harvest logging with Zerolog

Current logging is pretty basic. We should be using a standard framework to solve this.

logrus is currently in maintenance mode. That leaves Zerolog and zap. Zerolog is simpler and integrates well with lumberjack.
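
A hedged sketch of what Zerolog plus lumberjack rotation could look like; the log path and field names are illustrative, not Harvest's final setup:

package main

import (
    "github.com/rs/zerolog"
    "gopkg.in/natefinch/lumberjack.v2"
)

func main() {
    // Rotating file writer (hypothetical path and limits).
    writer := &lumberjack.Logger{
        Filename:   "/var/log/harvest/poller.log",
        MaxSize:    5, // megabytes before rotation
        MaxBackups: 3,
        MaxAge:     7, // days
    }
    // Structured logger with a timestamp and a per-poller field.
    logger := zerolog.New(writer).With().Timestamp().Str("poller", "cluster1").Logger()
    logger.Info().Msg("poller started")
    logger.Warn().Str("collector", "Zapi").Msg("connection error")
}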

Volume dashboard does not display volume names

Kicking the tires on 2.0. The NetApp Detail: Volume dashboard does not display volume names on any of the charts, except the "volumes in cluster" table. All other charts on the dashboard display data but no volume labels.

(Screenshot of the dashboard attached in the original issue.)

Add REST support to Harvest

ZAPIs are being deprecated in ONTAP.
From the Slack forum:

ZAPIs will still be available in the version of ONTAP released around Oct 2022.
That version of ONTAP will have support till 2025.
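
For illustration, a sketch of polling ONTAP's REST API instead of ZAPI. The /api/storage/volumes endpoint exists in ONTAP 9.6+, but the field selection, auth, and TLS handling here are simplified assumptions:

package main

import (
    "crypto/tls"
    "encoding/json"
    "fmt"
    "net/http"
)

type volumeList struct {
    Records []struct {
        Name string `json:"name"`
        Size struct {
            Available int64 `json:"available"`
        } `json:"size"`
    } `json:"records"`
}

func main() {
    client := &http.Client{Transport: &http.Transport{
        TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // lab only
    }}
    req, _ := http.NewRequest("GET", "https://cluster.example.com/api/storage/volumes?fields=name,size", nil)
    req.SetBasicAuth("admin", "password")
    resp, err := client.Do(req)
    if err != nil {
        fmt.Println(err)
        return
    }
    defer resp.Body.Close()
    var vols volumeList
    _ = json.NewDecoder(resp.Body).Decode(&vols)
    for _, v := range vols.Records {
        fmt.Printf("volume=%s size_available=%d\n", v.Name, v.Size.Available)
    }
}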

Disk serial number and is-failed are missing from cdot query

Describe the bug
Disk serial number was in rc2, but isn't in 21.05.

Environment

  • Harvest version: harvest version 21.05.2512-v21.05.0 (commit fc433fe) (build date 2021-05-25T12:21:55-0400) linux/amd64
  • Command line arguments used: bin/harvest start
  • OS: RHEL 8
  • Install method: native
  • ONTAP Version: 9.9
  • Other:

See change here
6bb79ec

Workaround
Edit conf/zapi/cdot/9.8.0/disk.yaml and add ^serial-number => serial_number to the counters/storage-disk-info/disk-inventory-info section:

counters:
  storage-disk-info:
    - ^^disk-uid
    - ^^disk-name               => disk
    - disk-inventory-info:
      - bytes-per-sector        => bytes_per_sector
      - capacity-sectors        => sectors
      - ^disk-type              => type
      - ^is-shared              => shared
      - ^model                  => model
      - ^serial-number          => serial_number

In the same file export it like so:

export_options:
  instance_keys:
    - node
    - disk
  instance_labels:
    - type
    - model
    - outage
    - owner_node
    - shared
    - shelf
    - shelf_bay
    - serial_number

Make vendored copy of dependencies

This is a small change. I've found that using vendored dependencies can help reduce build flakiness, particularly when a project pulls in many dependencies. It doesn't look like there are many here, but maybe this can help future-proof it.

Enhance systemd service implementation to monitor all pollers

Is your feature request related to a problem? Please describe.
In the current systemd service implementation, harvest just forks all poller child processes and exits. Systemd has no concept of keeping track of all the children. If I have more than one poller, I can never be sure that all are running by monitoring the systemd service.

Example:
Harvest Systemd status is active

โ— harvest.service - Harvest
   Loaded: loaded (/etc/systemd/system/harvest.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-05-25 09:56:15 CEST; 24min ago
  Process: 2512 ExecStart=/opt/harvest/bin/harvest restart --config /opt/harvest/harvest.yml (code=exited, status=0/SUCCESS)
 Main PID: 28539 (code=exited, status=0/SUCCESS)
    Tasks: 30
   Memory: 104.6M
   CGroup: /system.slice/harvest.service
           ├─2521 bin/poller --poller unix --loglevel 2 --config /opt/harvest/harvest.yml --daemon
           ├─2538 bin/poller --poller ontap1 --loglevel 2 --config /opt/harvest/harvest.yml --daemon
           └─2548 bin/poller --poller ontap2 --loglevel 2 --config /opt/harvest/harvest.yml --daemon

If I kill a single poller, the service status remains active:

kill 2548

# ./bin/harvest status
Datacenter            Poller                PID        PromPort        Status
+++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
local                   unix                2521                       running
DC1                     ontap1              2538                       running
DC2                     ontap2                                         not running
+++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
โ— harvest.service - Harvest
   Loaded: loaded (/etc/systemd/system/harvest.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-05-25 09:56:15 CEST; 25min ago
  Process: 2512 ExecStart=/opt/harvest/bin/harvest restart --config /opt/harvest/harvest.yml (code=exited, status=0/SUCCESS)
 Main PID: 28539 (code=exited, status=0/SUCCESS)
    Tasks: 20
   Memory: 63.2M
   CGroup: /system.slice/harvest.service
           ├─2521 bin/poller --poller unix --loglevel 2 --config /opt/harvest/harvest.yml --daemon
           └─2538 bin/poller --poller ontap1 --loglevel 2 --config /opt/harvest/harvest.yml --daemon

Describe the solution you'd like
The systemd harvest service should monitor the status of all pollers and should switch to "failed" if any poller is down. Alternatively, create individual service instances, one for each poller.

Describe alternatives you've considered
Monitoring the service and getting the status of every poller is essential for production usage.

Add ASUP analytics

Similar to Harvest 1.6, Harvest 2 should offer an option (ASUP, EMS, etc.) to track the number of installs, number of pollers, etc.

Code that sends ASUP from Harvest 1.6

# Send AutoSupport with statistics about Harvest poller
sub send_autosupport_stats()
{
	
	# Get counts of nodes and volumes monitored
	# No point to continue if we don't have node instances
	my $nodes_count = keys %{$instance{"system:node"}};
	return 0 unless ($nodes_count > 0);
	my $vol_count = keys %{$instance{volume}};
	
	# Get OS
	my $distro = `cat /etc/os-release 2>/dev/null | grep PRETTY` || $Config{osname} || "unknown";
	# If we read os-release, what we need is the string after = optionally between single/double quotes
	$distro =~ s/^\s?PRETTY_NAME\s?=\s?['"]?(.*?)\s?['"]?.?$/$1/ms; # if $distro =~ /PRETTY/;

	# Runtime of worker
	my $runtime = `ps -ly --pid=$$ --no-headers 2>/dev/null` || "--";
	$runtime =~ s/.*\s(\d\d:\d\d:\d\d)\s.*/$1/ms;

	# Compose name for Harvest instance
	my $host_name = hostname();
	my $harvest_name = $connection{group} . "_" . $host_name;;

	# Stats about worker performance
	my $stats = "";
	for my $k ( qw (metrics fails skips plugin_time api_time last_time) )
	{
		$stats = $stats . ";" . $connection{statistics}{$k};
	}

	my $log_message = "HARVEST $VERSION [$harvest_name] [$distro] [$connection{product};$connection{host_version_generation};$connection{host_version_major}] [$nodes_count;$vol_count;$runtime] [$stats]";

	logger ("DEBUG", "[send_autosupport_stats] Composed AutuSupport log with statistics [$log_message]");

	my $server = $connection{server_obj};

	for my $node (keys %{$instance{"system:node"}})
	{
		my $node_name = $instance{"system:node"}{$node}{name};

		my $in = NaElement->new("autosupport-invoke");
		$in->child_add_string("force", "true");
		$in->child_add_string("message", $log_message);
		$in->child_add_string("node-name", $node_name);
		$in->child_add_string("type", "all");

		my $out = $server->invoke_elem($in);

		if ($out->results_status() eq "passed")
		{
			logger ("NORMAL", "[send_autosupport_stats] Sent AutoSupport Log Message for Node [$node_name] with statistics [$log_message].");
		}
		else
		{
			logger ("WARNING", "[send_autosupport_stats] Sending AutoSupport log message for Node [$node_name] failed with reason: $out->results_reason(). Message was [$log_message].");
		}
	}

	return 0;
}

Grafana Dashboards panels getting 404

Describe the bug
Some metrics in some Grafana dashboard panels return error code 404.

Environment

  • Harvest version: harvest version 21.05.1-1 (commit 2211c00) (build date 2021-05-21T01:28:12+0530) linux/amd64
  • OS: CentOS Linux release 7.9.2009 (Core)
  • Install method: rhel
  • ONTAP Version: 9.7

To Reproduce
Just open any dashboard

Expected behavior
All panels should return values.

Actual behavior
Some of them return error code 404.

Possible solution, workaround, fix
Disable Exemplars feature

Additional context
I noticed that only panels with that feature enabled were getting 404s. I did some tests and it looks like the issue comes from the query_exemplars query. When I disabled the Exemplars feature in each panel of each dashboard, all panels started to get metrics and everything looks good.
