sparrow's Issues

Bug: A check that fails to run when initialized won't run at all

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Currently, during the reconciliation of checks, if a new config enters the cCfgChecks channel:

// after checking if the check is already registered (if yes, it is skipped), the check is carried out

		go func() {
			err := check.Run(ctx)
			if err != nil {
				log.ErrorContext(ctx, "Failed to run check", "name", name, "error", err)
			}
		}()

If an error occurs here, the check will never run, because it is already marked as registered and is never started again.

Expected Behavior

A check that fails to run after being initialized should still be retried.
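
A minimal sketch of what retrying could look like (checks.Check, the logger and the backoff parameters are assumptions; the existing internal retry helper could be used instead):

// runWithRetry keeps retrying a check's Run with exponential backoff instead
// of giving up after the first failure.
func runWithRetry(ctx context.Context, name string, check checks.Check, log *slog.Logger) {
	delay := time.Second
	for {
		err := check.Run(ctx)
		if err == nil || ctx.Err() != nil {
			return
		}
		log.ErrorContext(ctx, "Failed to run check, retrying", "name", name, "error", err, "delay", delay)
		select {
		case <-ctx.Done():
			return
		case <-time.After(delay):
			delay *= 2 // simple exponential backoff
		}
	}
}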

Steps To Reproduce

It's complicated to reproduce this manually. A simple unit test could be written with a check whose Run method simply errors out. The test should wait for the check's retry/delay window and verify whether any results containing errors come in.

Relevant logs and/or screenshots, environment information, etc.

No response

Who can address the issue?

No response

Anything else?

No response

Feature: Support other remote state backends

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

Currently only gitlab is implemented as target manager.

Solution Description

Refactor the implemented gitlab target manager to use git instead of gitlab API calls.

That way we can support all git based remote repositories as remote state backends.
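
A rough sketch of how pushing a state file via go-git could look (repository URL, token handling, paths and the package layout are illustrative assumptions, not the actual implementation):

package remote

import (
	"fmt"
	"os"
	"path/filepath"
	"time"

	git "github.com/go-git/go-git/v5"
	"github.com/go-git/go-git/v5/plumbing/object"
	githttp "github.com/go-git/go-git/v5/plumbing/transport/http"
)

// pushStateFile clones the remote state repository, writes the given file and
// pushes the commit back. This works against any git remote, not just gitlab.
func pushStateFile(repoURL, token, dir, file string, content []byte) error {
	auth := &githttp.BasicAuth{Username: "token", Password: token}

	repo, err := git.PlainClone(dir, false, &git.CloneOptions{URL: repoURL, Auth: auth})
	if err != nil {
		return fmt.Errorf("clone failed: %w", err)
	}

	if err = os.WriteFile(filepath.Join(dir, file), content, 0o644); err != nil {
		return fmt.Errorf("write failed: %w", err)
	}

	wt, err := repo.Worktree()
	if err != nil {
		return err
	}
	if _, err = wt.Add(file); err != nil {
		return err
	}
	if _, err = wt.Commit("update sparrow state", &git.CommitOptions{
		Author: &object.Signature{Name: "sparrow", Email: "sparrow@example.com", When: time.Now()},
	}); err != nil {
		return err
	}

	return repo.Push(&git.PushOptions{Auth: auth})
}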

Who can address the issue?

Go devs

Additional Context

https://github.com/go-git/go-git

Bug: Panic during Check Shutdown

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Currently, the Sparrow itself shuts down the fan-in data channels of the checks.
When a check that is currently running is shut down, the result data channel might already be closed. This results in a panic when the check tries to write to the result channel.

Expected Behavior

A graceful shutdown should be performed: the check should close the resultData fan-in channel itself.
The check must never write to a closed channel, so that no panic can occur.
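
A minimal sketch of this ownership model (the Result type, the field names and the interval are assumptions):

package health

import (
	"context"
	"time"
)

type Result struct {
	Err string
}

type Check struct {
	results chan Result
	done    chan struct{}
}

func (c *Check) Run(ctx context.Context) error {
	// The check, not the Sparrow, closes its fan-in channel - and only once
	// Run has returned, so no send can ever race with the close.
	defer close(c.results)
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-c.done:
			return nil
		case <-ticker.C:
			c.results <- Result{} // sends only happen while Run is still looping
		}
	}
}

func (c *Check) Shutdown(_ context.Context) error {
	close(c.done) // signal Run to return; it closes the result channel on its way out
	return nil
}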

Steps To Reproduce

No response

Relevant logs and/or screenshots, environment information, etc.

No response

Who can address the issue?

No response

Anything else?

No response

Feature: Automatic Central Registration of Instances

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

Sparrow instances can be spawned automatically. Other instances then need to be configured (via the runtime config for the checks) to know about the newly spawned instance.

Solution Description

Not 100% defined yet. A list of all running sparrow instances, including their domain names, needs to be available. The checks should perform health & latency checks against all instances.

After a brainstorming session we came to the following conclusion:

  • Use git to manage a centralized targets configuration, which the Sparrow will read & update (registration & liveness update) periodically. These entries will serve as the list of globalTargets for each Check.
    • This will eventually limit the Sparrow to protocol-level checks such as DNS, TCP, UDP, ICMP, gRPC and HTTP.
    • There shouldn't be any merge conflicts if each sparrow pushes its own file with its health report/registration info.
    • A possible format: {"url": "pipapo", "lastSeen": "<Go timestamp in UTC>"} (see the sketch after this list).
    • The pull & push procedure should be resilient & must not deadlock the Sparrow. It should simply fail after a few retries and try again later.
  • The Sparrow will either pass the globalTargets down to the Check instances periodically, OR the Check instances will fetch the globalTargets from the Sparrow themselves (possibly before running the check).
  • The Sparrow should decide whether a globalTarget is healthy or not and remove unhealthy ones from its in-memory list.
  • The repository with the global targets should be maintained by a simple pipeline that removes old targets which haven't updated their registration in a while.
  • The Check instances should still be able to define extraTargets, to add more targets to themselves.
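
For illustration, the proposed registration entry as a Go struct (the field names follow the format suggested above; the final schema is still open):

package targets

import "time"

// GlobalTarget mirrors the registration file each sparrow pushes,
// e.g. {"url": "https://sparrow-a.example.com", "lastSeen": "2024-01-01T10:00:00Z"}.
type GlobalTarget struct {
	URL      string    `json:"url"`
	LastSeen time.Time `json:"lastSeen"` // UTC timestamp of the last liveness update
}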

Who can address the issue?

Devs

Additional Context

No response

Feature: Expose Checks Result as Prometheus Metrics

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

Currently, the check results are exposed by a RESTful API in JSON format. The data should be exposed as metrics as well.

Solution Description

Results gathered by the checks should be exposed as Prometheus metrics.

The check itself should set the metrics.
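
A sketch of a check owning and updating its own collectors (metric names, labels and the registry wiring are illustrative, not the actual code):

package latency

import "github.com/prometheus/client_golang/prometheus"

// metrics bundles the collectors the check owns; the check registers them once
// and updates them after every run.
type metrics struct {
	duration *prometheus.GaugeVec
}

func newMetrics(registry *prometheus.Registry) *metrics {
	m := &metrics{
		duration: prometheus.NewGaugeVec(prometheus.GaugeOpts{
			Name: "sparrow_latency_duration_seconds",
			Help: "Latency of the target in seconds",
		}, []string{"target", "status"}),
	}
	registry.MustRegister(m.duration)
	return m
}

// record is called by the check after each measurement.
func (m *metrics) record(target, status string, seconds float64) {
	m.duration.WithLabelValues(target, status).Set(seconds)
}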

Who can address the issue?

No response

Additional Context

No response

Feature: Global HTTP client

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

Every check and loader currently needs to create its own http client. When we implement logging and tracing, we would have to put this responsibility on the checks and loaders as well.

Solution Description

Create a global http.Client on the sparrow struct and inject it into the context of the loader and the checks. Then we can use auto-instrumentation to automatically add traces and logging (and other middleware, if the need arises) to every request.
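
A sketch of the injection (the key type and helper names are hypothetical):

package sparrow

import (
	"context"
	"net/http"
	"time"
)

type clientKey struct{}

// WithHTTPClient stores the shared client in the context; the sparrow would
// call this once before starting the loader and the checks.
func WithHTTPClient(ctx context.Context, c *http.Client) context.Context {
	return context.WithValue(ctx, clientKey{}, c)
}

// HTTPClientFromContext returns the injected client, falling back to a default
// so callers never need a nil check.
func HTTPClientFromContext(ctx context.Context) *http.Client {
	if c, ok := ctx.Value(clientKey{}).(*http.Client); ok {
		return c
	}
	return &http.Client{Timeout: 30 * time.Second}
}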

Who can address the issue?

No response

Additional Context

No response

Prune old images and charts

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

The github container registry is getting (and will keep getting) really big, because we're pushing images and charts for every commit into it.

In order not to lose clarity, we should prune old images and charts.


Solution Description

Add a new github action that runs periodically and prunes all images older than 7 days or so.

I've found a github action that can do this, but it's unmaintained. I've linked it below.

Additionally we should think about only packaging and pushing the charts if they've changed.

Who can address the issue?

Everyone who wants to explore github actions.

Additional Context

https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#schedule
https://github.com/vlaurin/action-ghcr-prune

Feature: File Loader should also dynamically load files

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

The experimental file loader should also load the configuration file periodically. This will simplify testing and harmonize our loaders, which should have a common interface.

Solution Description

Add a simple for loop with some select cases to periodically reload the file, send it to the config channel, and gracefully handle shutdowns. No fancy error handling or config-file update detection is needed; a simple reload of the file is enough.

Remember to include the done channel in the select loop to handle graceful shutdown (see the sketch below).
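
A minimal sketch of the reload loop (the FileLoader fields and getRuntimeConfig are assumptions about the existing loader):

func (f *FileLoader) Run(ctx context.Context) {
	tick := time.NewTicker(f.interval)
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-f.done:
			return
		case <-tick.C:
			cfg, err := f.getRuntimeConfig(ctx)
			if err != nil {
				// no fancy error handling: log and retry on the next tick
				continue
			}
			f.cCfgChecks <- cfg.Checks
		}
	}
}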

Who can address the issue?

Everyone.

Additional Context

No response

Enhance Health Check Configuration

Problem Description

The latency check is configurable for several parameters like the interval. The health check should be configurable as well.

Solution

The following parameters should be configurable:

  • interval the check should run
  • the http timeout
  • the retry count
  • the initial retry delay

The implementation can be done similarly to the latency check.

Code smell: error is never produced in Shutdown

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Both the Health and Latency checks never return an error from their Shutdown implementation. We should either:

  • check for error cases and return them (catching eventual panics that could happen on the channels, e.g. h.done <- true, where done has a capacity of 1), or
  • remove the error from the function signature altogether.

Expected Behavior

Functions that return errors which are then ignored or only logged are a code smell. An error means that something has gone wrong and should be handled.

Steps To Reproduce

No reproduction needed.

Relevant logs and/or screenshots, environment information, etc.

No response

Who can address the issue?

No response

Anything else?

No response

Refactor: remove any typed Configuration

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

We're using any as a type in multiple places to parse the runtime configuration of the checks. Since we're not actually making use of that flexibility, we should move away from the any interface. It complicates our code and makes parsing the checks' configuration error-prone.

Solution Description

  • Use a RuntimeConfiguration struct to handle the configuration for all checks (as is done now) and also use it in every place where the RuntimeConfiguration is touched (like the reconciliation of the config itself), so we use something that's as close to universal as possible.
  • This object will be passed around, and the per-check config can be retrieved where needed with functions like GetHealthConfig() (see the sketch below).
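
A sketch of what that could look like (the check packages and field names are assumptions):

// Config is the typed runtime configuration for all checks and replaces the
// any-typed maps used for parsing today.
type Config struct {
	Health  *health.Config  `yaml:"health"`
	Latency *latency.Config `yaml:"latency"`
	DNS     *dns.Config     `yaml:"dns"`
}

// GetHealthConfig returns the health check configuration, or nil if the check
// is not configured at all.
func (c *Config) GetHealthConfig() *health.Config {
	return c.Health
}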

Who can address the issue?

No response

Additional Context

No response

Feature: DNS Name to start a sparrow instance

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

The sparrow instance should have a DNS name, so it can be uniquely identified among other sparrows.

Solution Description

  • Add a configuration parameter to implement the solution.
  • Add a DNS validation for the name
  • Pass the name to the TargetManager constructor (currently has a hardcoded value)

Who can address the issue?

Everyone

Additional Context

No response

Harmonized sparrow.Run() and error handling

As noted in #51, we should work on how the sparrow handles errors in its vital components (api, targets & possibly also the loaders, which currently don't have any solid error handling beyond context-cancelled errors).

Context and Problem description

Unfortunately, our handling of errors in the sparrow is done in multiple ways, which can result in the sparrow stopping abruptly or in errors being ignored. The api uses an error channel, the sparrow itself has none, and the checks have their own way of shutting down and handling errors.

Currently, if an error happens during the API server's operation, it is handled by the sparrow's api(ctx) function and returned (unless it is an expected error or the context is already done).

If we return the error from the api(ctx) function, no graceful shutdown can be done, as the Sparrow's run function would return instantly with that error.

Proposal

To handle errors in the sparrow in a harmonized way, I think we should introduce one global shutdown function that the sparrow executes for the components it manages (api, targetManager & loader) when a non-recoverable failure happens. This function will shut down the api and targetManager instances and then just log any errors that occur (there is no point in acting on them anymore).

General error handling in the Sparrow's run function can be done via an error channel and a dedicated handler function, which shuts down the sparrow's components if needed. Each sparrow component should keep doing its job unless something unrecoverable happens: in that case an error is sent down the error channel and the sparrow initiates the cleanup procedure.

Additionally, all the sparrow components should be started in a similar fashion (the api starts in a separate goroutine because it returns an error; the other functions just run and handle errors on their own). This will harmonize the Run function and should make the code simpler to extend.

Originally posted by @puffitos in #51 (comment)

cc @lvlcn-t

Feature: Use tcp for traceroute

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

The current implementation of the traceroute check uses UDP to perform its logic. This is not ideal, since UDP is not connection-oriented. This means the check can never really know whether a packet reached its destination, unless that destination sends back an ICMP packet, which tends to be blocked by firewalls.

Solution Description

We can get around the above issues by reimplementing the traceroute functionality on top of TCP instead of UDP. We essentially send a TCP SYN packet to initiate a handshake with a server on one of its open ports. The underlying logic of the check stays the same; we're only changing the protocol.

Who can address the issue?

Anyone who wants to read some RFCs and feels like playing around with Berkeley sockets

Additional Context

https://en.wikipedia.org/wiki/Berkeley_sockets
https://www.ietf.org/rfc/rfc793.txt#:~:text=Transmission%20Control%20Protocol-,3.%20%20FUNCTIONAL%20SPECIFICATION,-3.1.%20%20Header%20Format
https://github.com/mct/tcptraceroute/blob/master/probe.c#L80

Feature: Generate Check Boilerplate Code

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

Currently every check needs to be implemented manually, and there's a risk that some default setups aren't implemented correctly. Also, every time a new check is registered, it needs to be added to the runtime package, which holds the runtime configuration for all checks.

Solution Description

Add a generate script and some templates to be filled by this script. Go templating with the text/template standard library can be used for this. The script should generate the new check's boilerplate code and append the new check to the existing runtime.Config and its accompanying methods.
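
A minimal sketch of such a generator using text/template (the template content, paths and CLI handling are placeholders, not the actual script):

package main

import (
	"os"
	"strings"
	"text/template"
)

// checkTemplate is a trimmed-down placeholder; the real templates would cover
// the whole check boilerplate and the runtime.Config wiring.
const checkTemplate = `package {{.Package}}

// {{.Name}} implements the checks.Check interface.
type {{.Name}} struct{}
`

func main() {
	name := os.Args[1] // e.g. "Traceroute"
	tmpl := template.Must(template.New("check").Parse(checkTemplate))

	dir := "pkg/checks/" + strings.ToLower(name)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		panic(err)
	}
	f, err := os.Create(dir + "/" + strings.ToLower(name) + ".go")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if err := tmpl.Execute(f, map[string]string{
		"Package": strings.ToLower(name),
		"Name":    name,
	}); err != nil {
		panic(err)
	}
}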

Who can address the issue?

Everyone who wants to get some experience with Go templates.

Additional Context

I've started to implement this in feat/generate-check if you need some inspiration.

Refactoring: Check reconciliation simplification

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

The check (config) reconciliation logic should be simplified. We currently employ a map of check names to checks, which we use exclusively to register, delete or update the checks. We use this map structure to handle the more complicated logic:

  • if the map is empty, register all checks
  • iterate over the map and unregister checks whose names are not in the incoming config
  • iterate over the checks created from the incoming config and update or register them, depending on whether they already exist in the map.

Even though we are not misusing the data structure, having to deal with a map doesn't make for the simplest experience. It makes the whole reconciliation logic (the triplet of functions ReconcileCheck, registerCheck and unregisterCheck) a tad complicated.

Solution Description

The sparrow is currently in charge of actively reconciling the configurations of the various checks whenever a new runtime.Config appears in its runtime channel.

There are multiple ways we could handle this, but creating a dedicated struct that manages the Checks seems like the way to go: a ChecksController, if you will, which handles the reconciliation logic of the Checks (registration, update and removal) based on the runtime configuration channel. Given the modular approach of the sparrow, this feels like the proper way to go; another component of the struct, which can start, act and shut down autonomously (like the API, the TargetManager, the Loader and even the Check instances themselves).

A simpler solution would be to create a small struct that replaces the map[string]Check and add some business logic to it (e.g. an updateConfigForCheck(name string, cfg checks.Config) method).

The end result should make the registration, update and removal of checks a simple affair, which is easy to read, interpret and, most importantly, test.

Who can address the issue?

Everyone

Additional Context

No response

Bug: Nil pointer in http loader if endpoint is not reachable

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

If the endpoint to load the checks' configuration is not reachable, the sparrow panics with a nil pointer dereference.

runtimeCfg will be nil if the endpoint is not reachable, even after the retry logic.

hl.cCfgChecks <- runtimeCfg.Checks

Currently, only a warning log is printed and the function does not return or handle the error properly.

Expected Behavior

The loader should not panic if it is unable to load the remote checks' config.

An else block (or an early continue) covering the non-error case might be a solution; a sketch follows below.
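
A sketch of the guard, in the style of the snippet above (function and field names are assumptions about the existing loader loop):

runtimeCfg, err := hl.getRuntimeConfig(ctx)
if err != nil {
	log.WarnContext(ctx, "Could not get remote runtime configuration", "error", err)
	continue // skip this iteration instead of dereferencing a nil config
}
log.InfoContext(ctx, "Successfully got remote runtime configuration")
hl.cCfgChecks <- runtimeCfg.Checks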

Steps To Reproduce

No response

Relevant logs and/or screenshots, environment information, etc.

Using config file: /config/.sparrow.yaml
{"time":"2024-01-23T08:50:09.525650498Z","level":"INFO","source":{"function":"github.com/caas-team/sparrow/cmd.NewCmdRun.run.func1","file":"/home/runner/work/sparrow/sparrow/cmd/run.go","line":82},"msg":"Running sparrow"}
{"time":"2024-01-23T08:50:09.52593781Z","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).Reconcile","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/targets/gitlab.go","line":80},"msg":"Starting global gitlabTargetManager reconciler"}
{"time":"2024-01-23T08:50:09.526036566Z","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).api.func1","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/api.go","line":81},"msg":"Serving Api","addr":":8080"}
{"time":"2024-01-23T08:50:59.534917331Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).GetRuntimeConfig","file":"/home/runner/work/sparrow/sparrow/pkg/config/http.go","line":96},"msg":"Http get request failed","url":"https://gitlab.devops.telekom.de/api/v4/projects/228644/repository/files/config%2Eyaml/raw?ref=main","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/228644/repository/files/config%2Eyaml/raw?ref=main\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
{"time":"2024-01-23T08:50:59.53498963Z","level":"WARN","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).Run.Retry.func2","file":"/home/runner/work/sparrow/sparrow/internal/helper/retry.go","line":49},"msg":"Effector call failed, retrying in 1s"}
{"time":"2024-01-23T08:51:30.538843352Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).GetRuntimeConfig","file":"/home/runner/work/sparrow/sparrow/pkg/config/http.go","line":96},"msg":"Http get request failed","url":"https://gitlab.devops.telekom.de/api/v4/projects/228644/repository/files/config%2Eyaml/raw?ref=main","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/228644/repository/files/config%2Eyaml/raw?ref=main\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
{"time":"2024-01-23T08:51:30.53888887Z","level":"WARN","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).Run.Retry.func2","file":"/home/runner/work/sparrow/sparrow/internal/helper/retry.go","line":49},"msg":"Effector call failed, retrying in 2s"}
{"time":"2024-01-23T08:51:39.532393218Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/gitlab.(*Client).fetchFileList","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/gitlab/gitlab.go","line":222},"msg":"Failed to fetch file list","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/tree?ref=main\": dial tcp: lookup gitlab.devops.telekom.de: i/o timeout"}
{"time":"2024-01-23T08:51:39.532471895Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/gitlab.(*Client).FetchFiles","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/gitlab/gitlab.go","line":135},"msg":"Failed to fetch files","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/tree?ref=main\": dial tcp: lookup gitlab.devops.telekom.de: i/o timeout"}
{"time":"2024-01-23T08:51:39.532491426Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).refreshTargets","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/targets/gitlab.go","line":203},"msg":"Failed to update global targets","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/tree?ref=main\": dial tcp: lookup gitlab.devops.telekom.de: i/o timeout"}
{"time":"2024-01-23T08:51:39.532511174Z","level":"WARN","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).Reconcile","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/targets/gitlab.go","line":99},"msg":"Failed to get global targets","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/tree?ref=main\": dial tcp: lookup gitlab.devops.telekom.de: i/o timeout"}
{"time":"2024-01-23T08:51:50.561977627Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).GetRuntimeConfig","file":"/home/runner/work/sparrow/sparrow/pkg/config/http.go","line":96},"msg":"Http get request failed","url":"https://gitlab.devops.telekom.de/api/v4/projects/228644/repository/files/config%2Eyaml/raw?ref=main","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/228644/repository/files/config%2Eyaml/raw?ref=main\": dial tcp: lookup gitlab.devops.telekom.de on 10.43.0.10:53: read udp 10.42.3.11:59453->10.43.0.10:53: i/o timeout"}
{"time":"2024-01-23T08:51:50.561926507Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/gitlab.(*Client).PostFile","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/gitlab/gitlab.go","line":331},"msg":"Failed to post file","error":"Post \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/files/sparrow.caas-t01.telekom.de.json\": dial tcp: lookup gitlab.devops.telekom.de on 10.43.0.10:53: read udp 10.42.3.11:59453->10.43.0.10:53: i/o timeout"}
{"time":"2024-01-23T08:51:50.56204608Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).updateRegistration","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/targets/gitlab.go","line":185},"msg":"Failed to register global gitlabTargetManager","error":"Post \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/files/sparrow.caas-t01.telekom.de.json\": dial tcp: lookup gitlab.devops.telekom.de on 10.43.0.10:53: read udp 10.42.3.11:59453->10.43.0.10:53: i/o timeout"}
{"time":"2024-01-23T08:51:50.562066866Z","level":"WARN","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).Reconcile","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/targets/gitlab.go","line":105},"msg":"Failed to register self as global target","error":"Post \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/files/sparrow.caas-t01.telekom.de.json\": dial tcp: lookup gitlab.devops.telekom.de on 10.43.0.10:53: read udp 10.42.3.11:59453->10.43.0.10:53: i/o timeout"}
{"time":"2024-01-23T08:51:50.562031059Z","level":"WARN","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).Run.Retry.func2","file":"/home/runner/work/sparrow/sparrow/internal/helper/retry.go","line":49},"msg":"Effector call failed, retrying in 4s"}
{"time":"2024-01-23T08:52:24.569515778Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).GetRuntimeConfig","file":"/home/runner/work/sparrow/sparrow/pkg/config/http.go","line":96},"msg":"Http get request failed","url":"https://gitlab.devops.telekom.de/api/v4/projects/228644/repository/files/config%2Eyaml/raw?ref=main","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/228644/repository/files/config%2Eyaml/raw?ref=main\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
{"time":"2024-01-23T08:52:24.569558001Z","level":"WARN","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).Run","file":"/home/runner/work/sparrow/sparrow/pkg/config/http.go","line":72},"msg":"Could not get remote runtime configuration","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/228644/repository/files/config%2Eyaml/raw?ref=main\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
{"time":"2024-01-23T08:52:24.569574733Z","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).Run","file":"/home/runner/work/sparrow/sparrow/pkg/config/http.go","line":75},"msg":"Successfully got remote runtime configuration"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x81c50c]

goroutine 35 [running]:
github.com/caas-team/sparrow/pkg/config.(*HttpLoader).Run(0xc000206b88, {0xb70440?, 0xc000217b00?})
    /home/runner/work/sparrow/sparrow/pkg/config/http.go:76 +0xac
github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).Run.func1()
    /home/runner/work/sparrow/sparrow/pkg/sparrow/run.go:107 +0x2d
created by github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).Run in goroutine 1
    /home/runner/work/sparrow/sparrow/pkg/sparrow/run.go:106 +0xfe
Stream closed EOF for caas-sparrow/sparrow-857b77d885-mkkjk (sparrow)

Who can address the issue?

No response

Anything else?

No response

Feature: Implement DNS Check

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

A DNS check should be implemented.

Solution Description

For the MVP of the check, the following requirement needs to be fulfilled:

  • A target is resolvable via DNS (see the sketch below)
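
A minimal sketch of that requirement using the standard resolver:

package dns

import (
	"context"
	"net"
)

// resolvable reports whether the target can be resolved via DNS, which is the
// only requirement for the check's MVP.
func resolvable(ctx context.Context, target string) (bool, error) {
	addrs, err := net.DefaultResolver.LookupHost(ctx, target)
	if err != nil {
		return false, err
	}
	return len(addrs) > 0, nil
}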

Who can address the issue?

No response

Additional Context

No response

Match up Health & Latency Checks

Problem Description

Currently, the two checks (health and latency) have been designed separately from each other, even though their functionality is similar. Their architecture should be similar as well.

Additionally, the latency check should use time-unit references in exposed API keys, e.g. if the interval is defined in seconds, use intervalSec: 1.

Solution

The checks should be matched up so that the functionality is consistent and the code is clearer and easier to read.

Feature: Add ServiceMonitors in Helm

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

Since #45 introduced prometheus metrics, it would be nice to have optional ServiceMonitors in the helm chart.

Solution Description

Extend the helm chart with an option for deploying a service monitor that targets sparrow's metrics endpoint. Look at grafana as an example

Who can address the issue?

Anyone familiar with prometheus and writing helm charts

Additional Context

No response

Option to Disable the Target Manager

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

Currently, the Target Manager is enabled by default. The sparrow works fine without the Target Manager, but if the feature is not needed, there is no way to disable the TM.

Solution Description

Introduce a flag to disable the Target Manager if not needed.

Who can address the issue?

No response

Additional Context

End-to-end testing is quite complicated due to the default-enabled TM.

Bug: Latency Check doesn't retry on request failure

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Currently the latency check doesn't use its retry mechanism because no error is returned.

Expected Behavior

The retry mechanism is triggered if the request fails.

Steps To Reproduce

  1. Startup the sparrow
  2. Set one of the latency targets to an invalid url
  3. Look at the logs

Relevant logs and/or screenshots, environment information, etc.

$ go run main.go run --config .tmp/start-config.yaml 
Using config file: .tmp/start-config.yaml
{"time":"2024-01-25T14:56:22.596885375+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/cmd.NewCmdRun.run.func1","file":"/home/installadm/dev/github/sparrow/cmd/run.go","line":82},"msg":"Running sparrow"}
{"time":"2024-01-25T14:56:22.597003935+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).api.func1","file":"/home/installadm/dev/github/sparrow/pkg/sparrow/api.go","line":81},"msg":"Serving Api","addr":":8080"}
{"time":"2024-01-25T14:56:22.596999388+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).Reconcile","file":"/home/installadm/dev/github/sparrow/pkg/sparrow/targets/gitlab.go","line":80},"msg":"Starting global gitlabTargetManager reconciler"}
{"time":"2024-01-25T14:56:22.5971122+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*FileLoader).Run","file":"/home/installadm/dev/github/sparrow/pkg/config/file.go","line":48},"msg":"Reading config from file","file":"./.tmp/run-config.yaml"}
{"time":"2024-01-25T14:56:22.597416945+01:00","level":"WARN","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).registerCheck","file":"/home/installadm/dev/github/sparrow/pkg/sparrow/run.go","line":232},"msg":"Check is not registered","name":"health"}
{"time":"2024-01-25T14:56:22.59755689+01:00","level":"WARN","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).registerCheck","file":"/home/installadm/dev/github/sparrow/pkg/sparrow/run.go","line":232},"msg":"Check is not registered","name":"dns"}
{"time":"2024-01-25T14:56:22.597578372+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/checks/latency.(*Latency).Run","file":"/home/installadm/dev/github/sparrow/pkg/checks/latency/latency.go","line":91},"msg":"Starting latency check","interval":"20s"}
{"time":"2024-01-25T14:56:42.612641142+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/checks/latency.getLatency","file":"/home/installadm/dev/github/sparrow/pkg/checks/latency/latency.go","line":284},"msg":"Error while checking latency","url":"xhttps://example.com","error":"Get \"xhttps://example.com\": unsupported protocol scheme \"xhttps\""}

Who can address the issue?

No response

Anything else?

No response

Check Config (and RuntimeConfig) Validation

@puffitos Can you create an issue for the validation part you've mentioned in your PR?

Originally posted by @lvlcn-t in #93 (comment)

Issue

The checks.Runtime interface should also have a Validate method to validate the configuration of each check.

Motivation

Validate can then trickle up to the runtime.Config struct, which should have a Validate method of its own to validate all configurations. This ensures that only valid configurations are passed to the running checks and should add to the robustness of the sparrow.
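
A sketch of how this could look (the interface shape and the perCheck accessor are assumptions, not the actual code):

// Runtime is the per-check configuration interface; Validate is the new addition.
type Runtime interface {
	Validate() error
}

// Validate on the aggregate runtime config delegates to every check config,
// so only valid configurations reach the running checks.
func (c *Config) Validate() error {
	for name, rt := range c.perCheck() { // perCheck is a hypothetical accessor over all check configs
		if err := rt.Validate(); err != nil {
			return fmt.Errorf("invalid configuration for check %q: %w", name, err)
		}
	}
	return nil
}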

Bug: Configuration handling is getting too complicated

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

We should discuss alternatives for how we parse the startup and runtime configuration. Currently, we have to define multiple flags and their string mappings, and we must provide all those long keys in a YAML file, flat and linear, like this:

runParameterOne: 1
mostImportantRunParameterTwenty: 20

This inflates the codebase, and it's questionable whether it offers us any concrete advantages.

Expected Behavior

We're able to load the (startup) configuration from one file, which is formatted as follows:

checks:
  - check1: {}
  - check2: {}
targetManager: 
  gitlab: {}
# and so on

Steps To Reproduce

No response

Relevant logs and/or screenshots, environment information, etc.

No response

Who can address the issue?

Everyone who has some time to check how the cobra & viper framework works.

Anything else?

No response

Refactor: Move global target merging responsibility to checks

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

The global targets are injected into the checks' config, into their target field. This was fine for the target manager MVP, but the checks will become more complex and not every check can implement a plain string slice as its target field.

Solution Description

Add a dedicated global targets field to the checks' config structs. This field can then be merged into the targets field by the check itself, depending on its config implementation.
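
A sketch of such a config struct and merge (field and method names are assumptions):

// Config sketch for a check: GlobalTargets is filled by the sparrow, and the
// check decides how to fold it into its own Targets.
type Config struct {
	Targets       []string `yaml:"targets"`
	GlobalTargets []string `yaml:"-"`
}

// allTargets merges configured and global targets, dropping duplicates.
func (c *Config) allTargets() []string {
	seen := make(map[string]struct{}, len(c.Targets)+len(c.GlobalTargets))
	out := make([]string, 0, len(c.Targets)+len(c.GlobalTargets))
	for _, t := range append(append([]string{}, c.Targets...), c.GlobalTargets...) {
		if _, ok := seen[t]; ok {
			continue
		}
		seen[t] = struct{}{}
		out = append(out, t)
	}
	return out
}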

Who can address the issue?

Everyone

Additional Context

No response

Feature: Implement Traceroute Check

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

Features

We want to gather data similar to the output of traceroute. We want to declare a list of targets, and sparrow should collect the hops through which the packets travel to reach each destination.

Metrics

For now, we want to collect the following metrics for every target:

  • num hops
  • path taken
    • Hop number
    • IP address

Solution Description

Features

Unlike traditional traceroute, the check will use TCP. This avoids requiring root permissions or cap_net_raw. I propose the config look like this:

checks:
  traceroute:
    retry:
      delay: 10s
      count: 3
    timeout: 30s
    targets:
      - https://google.com
      - https://bing.com:80
      - https://myservice.com:12345

Metrics

Num Hops should be a simple counter:

For example: sparrow_traceroute_hops_count{target="https://google.com"} 12

The path taken is a bit more difficult, as we can't really convey the graph-like nature of the data to prometheus. My suggestion is that we export these metrics in an OpenTelemetry-compatible format. We could use the otel SDK to create the metrics and then ship them off to a trace aggregator like jaeger. This makes it easy to adopt sparrow, as grafana already has a native jaeger datasource, so there would be no need to hack together our own grafana datasource.

While we can't collect traces in prometheus, we can at least link a time series to a trace using prometheus exemplars. This is not a requirement, but it makes the UX nicer when viewing the data in grafana.

Feature: Implement ICMP Check

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

tbd

Solution Description

tbd

Who can address the issue?

No response

Additional Context

No response

Feature: Grafana Dashboards

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

#69 introduces ServiceMonitors to the helm chart. How about we also provide grafana dashboards that go along with the service monitors? We could use the dashboard that the SRE team provides as a baseline. We should only work on this once the prometheus metrics are semi-stabilized; otherwise we'll waste a lot of time maintaining the dashboards.

Solution Description

No response

Who can address the issue?

Anyone willing to spend some time in grafana writing promql queries

Additional Context

No response

Feature: OpenTelemetry

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

The prometheus format is not sufficient for exporting metrics from the traceroute check, aside from the number of hops. To provide more complex data, it might be necessary to support a second data format like OpenTelemetry.

Solution Description

Implement the OTEL library and inject it into the checks so they can write their own traces. Traceroute can use this to collect more detailed data about how every single invocation of the check behaves, essentially allowing a user to visualize how packets move from sparrow to their target.

Who can address the issue?

No response

Additional Context

https://github.com/open-telemetry/opentelemetry-go
https://opentelemetry.io/docs/languages/go/
https://www.jaegertracing.io/docs/1.54/client-libraries/#go
https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/traces/

Bug: Missing startup probe causes gap in data collection

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

If a sparrow pod is restarted, it causes a gap in monitoring. This is because the healthcheck returns an HTTP 200 to the kubernetes probe right away: Kubernetes considers the new sparrow pod ready to accept traffic, so prometheus scrape requests are sent to the new pod immediately, before the startup routine has actually finished.

Screenshot from prometheus below

Expected Behavior

A restart of the sparrow deployment should wait for the new pod to have finished its first reconciliation cycle; only then should the pod accept traffic. This ensures that prometheus never scrapes a pod that is still starting up.
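
A minimal sketch of such a readiness gate (the handler path and the wiring into the chart's startup/readiness probes are assumptions):

package sparrow

import (
	"net/http"
	"sync/atomic"
)

// ready is flipped once the first reconciliation cycle has finished, so the
// kubelet only routes traffic (and prometheus only scrapes) after startup.
var ready atomic.Bool

func readyHandler(w http.ResponseWriter, _ *http.Request) {
	if !ready.Load() {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}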

Steps To Reproduce

  1. Spin up sparrow on kubernetes
  2. Restart the deployment
  3. HTTP GET /metrics or /openapi
    You will notice that sparrow returns no metrics for a few seconds after startup. It also does not show any checks in the /openapi endpoint.

Relevant logs and/or screenshots, environment information, etc.

(Prometheus screenshot showing the gap in scraped metrics after the pod restart.)

Who can address the issue?

No response

Anything else?

No response

Bug: Global Targets need to be handled properly by Checks

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Currently, e.g. the health check uses the global targets provided by the target manager directly.
A registered global target such as https://sparrow-b.telekom.de is used as-is, but every sparrow serves its health check endpoint at /checks/health. The result is that the target appears unhealthy.

Expected Behavior

The sparrow should use the global targets for its checks correctly. It should therefore append the check's health endpoint (similar to other checks that use the global targets),
e.g. https://sparrow-b.telekom.de/checks/health.
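
A small sketch of deriving the health endpoint from a registered global target using net/url:

package health

import "net/url"

// healthEndpoint appends the health check's path to a global target, e.g.
// https://sparrow-b.telekom.de -> https://sparrow-b.telekom.de/checks/health.
func healthEndpoint(target string) (string, error) {
	return url.JoinPath(target, "checks", "health")
}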

Steps To Reproduce

No response

Relevant logs and/or screenshots, environment information, etc.

No response

Who can address the issue?

No response

Anything else?

No response

Cleanup API with dynamic Handler

Motivation

With #91 we removed the handler functions from the checks interface. Currently, the http handlers are not used by the checks; the health & latency checks can use any endpoint that returns HTTP 200 OK, and for checking other sparrow instances the global sparrow health endpoint is used.

Refactor/Cleanup

The api is still configured to allow HTTP handlers to be registered manually. To clean up the code base, this functionality needs to be removed, including all related functions.

The following sections need to be checked and cleaned up:

https://github.com/caas-team/sparrow/blob/main/pkg%2Fsparrow%2Fapi.go#L56

https://github.com/caas-team/sparrow/blob/main/pkg%2Fsparrow%2Fapi.go#L258

https://github.com/caas-team/sparrow/tree/main/pkg%2Fapi

Bug: Removed targets remain as prometheus metrics

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Sparrow starts with two targets:

https://google.com
https://example.com

This creates two prometheus metrics for those targets:

sparrow_latency_duration_seconds{status="200",target="https://google.com"} 0.014013145
sparrow_latency_duration_seconds{status="200",target="https://example.com"} 0.003617327
...

If the config gets updated and one target gets removed, the last recorded prometheus metric for the deleted target remains exposed as the latest value.
New targets:

https://example.com

Prometheus metrics:

sparrow_latency_duration_seconds{status="200",target="https://google.com"} 0.014013145 <- This remains and is not cleaned up
sparrow_latency_duration_seconds{status="200",target="https://example.com"} 0.007915321

Expected Behavior

Let's discuss how (and whether) we should handle this case.

Maybe it's possible to remove metrics from the prometheus registry.
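
client_golang's metric vectors do support this: Delete and DeletePartialMatch can drop the series of a removed target. A sketch (the vector and function names are illustrative):

// onTargetRemoved drops every series carrying the removed target's label from
// the latency duration vector; the same would be done for the other vectors.
func onTargetRemoved(duration *prometheus.GaugeVec, target string) {
	// DeletePartialMatch removes all children whose labels contain this pair,
	// regardless of the remaining label values (e.g. the status code).
	duration.DeletePartialMatch(prometheus.Labels{"target": target})
}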

Steps To Reproduce

  1. Deploy sparrow with two targets for any check using http loader
  2. Let it run for a check cycle
  3. Remove one target from the config
  4. HTTP GET /metrics

Relevant logs and/or screenshots, environment information, etc.

The screenshot shows what happened after google.com was removed from the config, compared to a target that is still active.


Since the sparrow metrics api still exposes the value for google.com, prometheus stores it on every scrape

Who can address the issue?

@puffitos @y-eight @lvlcn-t Let's discuss how we handle this next week

Anything else?

No response

Bug: Panic if loader interval is not set

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

Currently, the loaders panic if no interval is set, because they then call time.NewTicker(0), which panics.

Solution Description

We should at least validate that the interval is > 0.

But imo this offers the opportunity to introduce another feature to the loaders.

We could disable the continuous loading of the check config when the interval is 0. This would also align with the feature introduced by #101, which does something similar for the instance registration and update.

If we introduced this proposal, we'd only need to check our startup config for interval < 0.

Who can address the issue?

Everyone

Additional Context

No response

Feature: Dynamically switch logging verbosity

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

The logger currently has a fixed verbosity.

The logging verbosity should be switchable at runtime, or at least configurable via an environment variable.
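
For illustration, log/slog already supports this kind of runtime switching via a LevelVar; the environment variable name below is an assumption:

package logger

import (
	"log/slog"
	"os"
)

// level is shared by every handler, so calling level.Set later changes the
// verbosity of all loggers at runtime.
var level = new(slog.LevelVar)

// New builds a JSON logger whose initial verbosity comes from an environment
// variable (the variable name is an assumption).
func New() *slog.Logger {
	if os.Getenv("SPARROW_LOG_LEVEL") == "DEBUG" {
		level.Set(slog.LevelDebug)
	}
	return slog.New(slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{Level: level}))
}

// SetDebug could be called at runtime, e.g. from a signal handler or an admin endpoint.
func SetDebug(on bool) {
	if on {
		level.Set(slog.LevelDebug)
		return
	}
	level.Set(slog.LevelInfo)
}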

Solution Description

No response

Who can address the issue?

No response

Additional Context

No response

Bug: Viper BindStruct hidden behind feature flag in viper >=v1.18.0

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

We need to be careful when updating viper beyond v1.18.0.

#73 (comment)

We should write a test case that checks whether unmarshalling from env variables works properly everywhere, so that we notice when someone updates the library and breaks it.

Expected Behavior

To fix this, we need to update our build and test pipelines to include the feature flag (a Go build tag): -tags=viper_bind_struct

Steps To Reproduce

No response

Relevant logs and/or screenshots, environment information, etc.

No response

Who can address the issue?

Anyone

Anything else?

No response

CI sec check fail & no linting is present

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

The CI checks regarding security fail, and no linting is done across the entire repository. This may result in poorly maintained code and errors that go unchecked.

Expected Behavior

We have automatic linting to avoid typical code mistakes & code smells.

Steps To Reproduce

No response

Relevant logs and/or screenshots, environment information, etc.

No response

Who can address the issue?

Everyone

Anything else?

No response

Feature: Shutdown for loaders

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

To harmonize the sparrow's component interfaces and allow for graceful shutdowns, all components should have a shutdown routine integrated.

Solution Description

Add a Shutdown(ctx) error function to the Loader interface and implement it.
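
A sketch of the extended interface (the existing Run signature is an assumption):

// Loader is the common interface of the config loaders; Shutdown is the new addition.
type Loader interface {
	Run(ctx context.Context) error
	Shutdown(ctx context.Context) error
}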

Who can address the issue?

Everyone
cc @lvlcn-t

Additional Context

Low-priority task. This is only needed to have a more standardized way in which all sparrow components work. The sparrow should be responsible for shutting down its components, if needed, and the components should always just keep running as long as they can.

Feature: make the update feature of the target manager configurable

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

The update functionality of the target manager (updating the registration with the current timestamp) should be configurable, so that the update-registration feature can be activated or deactivated.

Solution Description

Make the update-registration feature configurable (enable/disable) and separate its interval from the registration interval. The two features should be independent.

Who can address the issue?

No response

Additional Context

No response

Investigate on & Improve Latency Metrics

Problem to investigate & solve

Currently, 3 different latency metrics are available.

  • Counter
  • Latency time
  • Histogram

If the check fails (internally), the latency time will be 0, and so will the status code.

This might be OK for the counter and latency metrics, but it's probably not best practice for the histogram, since those 0 values still fill the buckets.

Example with 2 errors and 308 total requests:

# HELP sparrow_latency_duration Latency of targets in seconds
# TYPE sparrow_latency_duration histogram
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.005"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.01"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.025"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.05"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.1"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.25"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.5"} 288
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="1"} 307
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="2.5"} 308
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="5"} 308
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="10"} 308
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="+Inf"} 308
sparrow_latency_duration_sum{target="https://gitlab.devops.telekom.de"} 120.39378972299998
sparrow_latency_duration_count{target="https://gitlab.devops.telekom.de"} 308

As @puffitos stated in #45, we should probably solve this with labelling or another set of metrics, e.g. a label for the check's state.
