caas-team / sparrow
A monitoring tool to gather infrastructure network information
License: Apache License 2.0
Currently, during the reconciliation of checks, if a new config enters the cCfgChecks
channel:
// after checking whether the check is already registered (if so, it is skipped), the check is run:
go func() {
	err := check.Run(ctx)
	if err != nil {
		log.ErrorContext(ctx, "Failed to run check", "name", name, "error", err)
	}
}()
If an error is returned here, the check won't run at all anymore, as it is already registered. A check that fails to run after being initialized should still try to run again.
This is complicated to reproduce. A simple unit test could be written with a check whose Run method simply returns an error. The test should wait for the check's retry/delay window and verify that results containing errors come in.
No response
No response
No response
Currently, only GitLab is implemented as a target manager.
Refactor the implemented gitlab target manager to use git instead of GitLab API calls.
That way we can support all git-based remote repositories as remote state backends.
Go devs
Currently, the Sparrow shuts down the fanIn data channels of the checks itself.
When shutting down a check that is currently running, the result data channel might already be closed. This results in a panic when the check tries to write to the result channel.
A soft shutdown should be performed: the check should close the resultData fanIn channel itself.
The check must never write to a closed channel, so that it cannot panic.
No response
No response
No response
No response
Sparrow instance can be spawned automatically. Other instances need to be configured (runtime config for checks) to know about this newly spawned instance.
Not 100% defined yet. A list needs to be available with all running sparrow instances incl. domain names. The checks should perform health & latency checks to all instances.
After a brainstorming session we came to the following conclusion:

- Sparrow will read & update (registration & liveness update) periodically. Those will serve as a list of `globalTargets` for each `Check`.
- Registration entries look like `{"url": "pipapo", "lastSeen": "golang timestamp UTC"}`.
- Sparrow should just fail after a few retries and try again later.
- Sparrow will either pass the `globalTargets` periodically down to the `Check` instances, OR the `Check` instances will get the `globalTargets` from the Sparrow themselves (possibly before running the check).
- Sparrow should decide whether a `globalTarget` is healthy or not and remove it from its in-memory list.
- `Check` instances should still define `extraTargets`, to add more targets to themselves.

Devs
No response
Currently, the check results are exposed via a RESTful API in JSON format. The data should be exposed as metrics as well.
Results gathered by the checks should be exposed as Prometheus metrics.
The check itself should set the metrics.
No response
No response
Every check and loader needs to create their own http client. When we implement logging and tracing, we would have to put this responsibility on the checks and loader.
Create a global `http.Client` on the sparrow struct. Inject this client into the context of the loader and checks. Then we can use auto-instrumentation to automatically add traces and logging (and other middleware if the need arises) to every request.
No response
No response
The GitHub container registry is getting/will get really big because we're pushing all commit images and charts into it. In order not to lose clarity, we should prune old images and charts.
Add a new GitHub Action that runs periodically and prunes all images older than 7 days or so.
I've found a GitHub Action that can do this, but it's unmaintained; I've linked it below.
Additionally, we should think about only packaging and pushing the charts if they've changed.
Everyone who wants to explore github actions.
https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#schedule
https://github.com/vlaurin/action-ghcr-prune
The experimental `file` loader should also load the configuration file periodically. This will simplify testing and harmonize our loaders, which should have a common interface.
Add a simple for loop with some select cases to periodically reload the file, send it to the config channel and gracefully handle shutdowns. No fancy error handling or updating of the config file is needed; a simple reload of the file is enough.
Remember to add the done channel to the select loop to handle graceful shutdown.
Everyone.
No response
The latency check is configurable for several parameters like the interval. The health check should be configurable as well.
The following parameters should be configurable:
The implementation can be done similar to the latency check
Both the `Health` and `Latency` checks never return an error from their `Shutdown` implementation (the shutdown is just `h.done <- true`; `done` has a capacity of 1). We should either drop the error from the `Shutdown` signature or make the implementations return meaningful errors.

Functions that return errors which are then ignored or only logged are a code smell. An error means that something has gone wrong and should be handled.
No reproduction needed.
No response
No response
No response
We're using `any` as a type in multiple places to parse the runtime configuration of the checks. Since we're not actually making use of that flexibility, we should move away from the `any` interface. It complicates our code and makes parsing the checks' configuration error-prone.
No response
No response
The sparrow instance should have a DNS name, so it can be uniquely identified from other sparrows.
Everyone
No response
As noted in #51, we should work on how the sparrow handles errors in its vital components (`api`, `targets` & possibly also the `loaders`, which currently don't have any solid error handling outside of context-cancelled errors).
Unfortunately, our handling of errors in sparrow is done in multiple ways, which can result in abrupt stopping of the sparrow or ignored errors. The `api` uses an error channel, the sparrow itself has none, and the checks have their own way of shutting down and handling errors.
Currently, if an error happens during the API server's operation, it is handled by the `api(ctx)` function of the sparrow and returned (unless it's an expected error or the context is done). If we return the error when running the `api(ctx)` function, no graceful shutdown can be done, as the Sparrow's run function would return instantly with the error.
To handle errors in the sparrow in a harmonized way, I think we should introduce one global shutdown function that the sparrow executes for the components it manages (`api`, `targetManager` & `loader`) when a non-recoverable failure happens. This function will shut down the api and targetManager instances and then just log any errors that occur (there is no point in acting on the errors anymore).
General error handling in the Sparrow's `run` function can be done via an error channel and a dedicated handler function that shuts down the sparrow components if needed. Each sparrow component should keep on doing its job unless something unrecoverable happens: in that case an error is sent down the error channel and the sparrow initiates the cleanup procedure.
Additionally, all the sparrow components should start similarly (the api starts in a separate goroutine because it returns an error; the other functions just run and handle errors on their own). This will harmonize the Run function and should make the code simpler to expand.
Originally posted by @puffitos in #51 (comment)
cc @lvlcn-t
The current implementation of the traceroute check uses UDP to perform its logic. This is not ideal, since UDP is not connection oriented: the check can never really know whether a packet reached its destination unless that destination sends back an ICMP packet, which tends to be blocked by firewalls.
We can get around the above issues by reimplementing the traceroute functionality on top of TCP instead of UDP. We essentially send a TCP SYN packet to initiate a handshake with a server on an open port. The underlying logic of the check stays the same; we're only changing the protocol.
Anyone who wants to read some RFCs and feels like playing around with berkeley sockets
https://en.wikipedia.org/wiki/Berkeley_sockets
https://www.ietf.org/rfc/rfc793.txt#:~:text=Transmission%20Control%20Protocol-,3.%20%20FUNCTIONAL%20SPECIFICATION,-3.1.%20%20Header%20Format
https://github.com/mct/tcptraceroute/blob/master/probe.c#L80
Currently every check needs to be implemented manually and there's a risk that some default setups aren't correctly implemented. Also, every time a new check is registered it needs to be added to the `runtime` package, which holds the runtime configuration for all checks.
Add a generate script and some templates to be filled by this script. You can use Go templating with the `text/template` standard library for this. The script should generate the new check's boilerplate code and append the new check to the already existing `runtime.Config` and its utilizing methods.
Everyone who wants to experience Go templates.
I've started to implement this in feat/generate-check if you need some inspiration.
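A toy example of the templating part (the template text and names are illustrative, not the actual generator):

```go
package main

import (
	"bytes"
	"text/template"
)

// checkTemplate is a tiny example of the boilerplate a generator could
// emit for a new check; real templates would live in their own files.
const checkTemplate = `package {{.Package}}

// {{.Name}} implements the checks.Check interface.
type {{.Name}} struct{}
`

type checkData struct {
	Package string
	Name    string
}

// renderCheck fills the template with the new check's data.
func renderCheck(d checkData) (string, error) {
	tmpl, err := template.New("check").Parse(checkTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, d); err != nil {
		return "", err
	}
	return buf.String(), nil
}
```

The same mechanism (a second template plus some string surgery) could append the new check to `runtime.Config`.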
The check (config) reconciliation logic should be simplified. We currently employ a map of check names to checks, which we use exclusively to register, delete or update the checks, and to handle more complicated logic on top of that.
Even though we are not misusing the data structure, having to deal with a map does not make for the simplest experience. This makes the whole reconciliation logic (the triplet of functions `ReconcileCheck`, `registerCheck` and `unregisterCheck`) a tad complicated.
The sparrow is currently in charge of actively reconciling the configurations of the various checks when a new `runtime.Config` appears in its runtime channel.
There are multiple ways we could handle this, but the creation of a dedicated struct which handles the checks seems like the way to go. A `ChecksController`, if you may, which will handle the reconciliation logic of the checks (registration, update and removal), depending on the runtime configuration channel. Given the modular approach of the sparrow, this feels like the proper way to go: another component of the struct, which can start, act and shut down autonomously (like the API, the `TargetManager`, the `Loader` and even the `Check` instances themselves).
A simpler solution would be to just create a small struct to replace the `map[string]Check` and add some business logic to it (e.g. an `updateConfigForCheck(name string, cfg checks.Config)` method).
The end result should make the registration, update and removal of checks a simple affair, which is easy to read, interpret and, most importantly, test.
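A compact sketch of what such a controller could look like (the `Check`/`Config` stand-ins and method names are assumptions):

```go
package main

import "sync"

// Check and Config are minimal stand-ins for sparrow's types (assumptions).
type Config struct{ Interval int }
type Check struct{ Cfg Config }

// ChecksController wraps the name->check map so registration, update and
// removal live behind a small, testable API instead of ad-hoc map access.
type ChecksController struct {
	mu     sync.Mutex
	checks map[string]*Check
}

func NewChecksController() *ChecksController {
	return &ChecksController{checks: make(map[string]*Check)}
}

// Reconcile registers new checks, updates existing ones and removes
// checks that are absent from the desired configuration.
func (c *ChecksController) Reconcile(desired map[string]Config) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for name, cfg := range desired {
		if ch, ok := c.checks[name]; ok {
			ch.Cfg = cfg // update
			continue
		}
		c.checks[name] = &Check{Cfg: cfg} // register
	}
	for name := range c.checks {
		if _, ok := desired[name]; !ok {
			delete(c.checks, name) // unregister
		}
	}
}

func (c *ChecksController) Get(name string) (*Check, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	ch, ok := c.checks[name]
	return ch, ok
}
```

In the full version, register/unregister would also start and stop the check goroutines, driven by the runtime configuration channel.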
Everyone
No response
If the endpoint to load the checks' configuration is not reachable, the sparrow panics with a nil pointer dereference: `runtimeCfg` will be nil if the endpoint is not reachable even after the retry logic.
Line 76 in c4a6b54
Currently, just a warning log is printed and the function does not return/handle the error properly.
The loader should not panic if it is unable to load the remote checks' config.
An else block covering the non-error case might be a solution.
No response
Using config file: /config/.sparrow.yaml
{"time":"2024-01-23T08:50:09.525650498Z","level":"INFO","source":{"function":"github.com/caas-team/sparrow/cmd.NewCmdRun.run.func1","file":"/home/runner/work/sparrow/sparrow/cmd/run.go","line":82},"msg":"Running sparrow"}
{"time":"2024-01-23T08:50:09.52593781Z","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).Reconcile","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/targets/gitlab.go","line":80},"msg":"Starting global gitlabTargetManager reconciler"}
{"time":"2024-01-23T08:50:09.526036566Z","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).api.func1","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/api.go","line":81},"msg":"Serving Api","addr":":8080"}
{"time":"2024-01-23T08:50:59.534917331Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).GetRuntimeConfig","file":"/home/runner/work/sparrow/sparrow/pkg/config/http.go","line":96},"msg":"Http get request failed","url":"https://gitlab.devops.telekom.de/api/v4/projects/228644/repository/files/config%2Eyaml/raw?ref=main","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/228644/repository/files/config%2Eyaml/raw?ref=main\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
{"time":"2024-01-23T08:50:59.53498963Z","level":"WARN","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).Run.Retry.func2","file":"/home/runner/work/sparrow/sparrow/internal/helper/retry.go","line":49},"msg":"Effector call failed, retrying in 1s"}
{"time":"2024-01-23T08:51:30.538843352Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).GetRuntimeConfig","file":"/home/runner/work/sparrow/sparrow/pkg/config/http.go","line":96},"msg":"Http get request failed","url":"https://gitlab.devops.telekom.de/api/v4/projects/228644/repository/files/config%2Eyaml/raw?ref=main","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/228644/repository/files/config%2Eyaml/raw?ref=main\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
{"time":"2024-01-23T08:51:30.53888887Z","level":"WARN","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).Run.Retry.func2","file":"/home/runner/work/sparrow/sparrow/internal/helper/retry.go","line":49},"msg":"Effector call failed, retrying in 2s"}
{"time":"2024-01-23T08:51:39.532393218Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/gitlab.(*Client).fetchFileList","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/gitlab/gitlab.go","line":222},"msg":"Failed to fetch file list","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/tree?ref=main\": dial tcp: lookup gitlab.devops.telekom.de: i/o timeout"}
{"time":"2024-01-23T08:51:39.532471895Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/gitlab.(*Client).FetchFiles","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/gitlab/gitlab.go","line":135},"msg":"Failed to fetch files","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/tree?ref=main\": dial tcp: lookup gitlab.devops.telekom.de: i/o timeout"}
{"time":"2024-01-23T08:51:39.532491426Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).refreshTargets","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/targets/gitlab.go","line":203},"msg":"Failed to update global targets","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/tree?ref=main\": dial tcp: lookup gitlab.devops.telekom.de: i/o timeout"}
{"time":"2024-01-23T08:51:39.532511174Z","level":"WARN","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).Reconcile","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/targets/gitlab.go","line":99},"msg":"Failed to get global targets","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/tree?ref=main\": dial tcp: lookup gitlab.devops.telekom.de: i/o timeout"}
{"time":"2024-01-23T08:51:50.561977627Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).GetRuntimeConfig","file":"/home/runner/work/sparrow/sparrow/pkg/config/http.go","line":96},"msg":"Http get request failed","url":"https://gitlab.devops.telekom.de/api/v4/projects/228644/repository/files/config%2Eyaml/raw?ref=main","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/228644/repository/files/config%2Eyaml/raw?ref=main\": dial tcp: lookup gitlab.devops.telekom.de on 10.43.0.10:53: read udp 10.42.3.11:59453->10.43.0.10:53: i/o timeout"}
{"time":"2024-01-23T08:51:50.561926507Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/gitlab.(*Client).PostFile","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/gitlab/gitlab.go","line":331},"msg":"Failed to post file","error":"Post \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/files/sparrow.caas-t01.telekom.de.json\": dial tcp: lookup gitlab.devops.telekom.de on 10.43.0.10:53: read udp 10.42.3.11:59453->10.43.0.10:53: i/o timeout"}
{"time":"2024-01-23T08:51:50.56204608Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).updateRegistration","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/targets/gitlab.go","line":185},"msg":"Failed to register global gitlabTargetManager","error":"Post \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/files/sparrow.caas-t01.telekom.de.json\": dial tcp: lookup gitlab.devops.telekom.de on 10.43.0.10:53: read udp 10.42.3.11:59453->10.43.0.10:53: i/o timeout"}
{"time":"2024-01-23T08:51:50.562066866Z","level":"WARN","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).Reconcile","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/targets/gitlab.go","line":105},"msg":"Failed to register self as global target","error":"Post \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/files/sparrow.caas-t01.telekom.de.json\": dial tcp: lookup gitlab.devops.telekom.de on 10.43.0.10:53: read udp 10.42.3.11:59453->10.43.0.10:53: i/o timeout"}
{"time":"2024-01-23T08:51:50.562031059Z","level":"WARN","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).Run.Retry.func2","file":"/home/runner/work/sparrow/sparrow/internal/helper/retry.go","line":49},"msg":"Effector call failed, retrying in 4s"}
{"time":"2024-01-23T08:52:24.569515778Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).GetRuntimeConfig","file":"/home/runner/work/sparrow/sparrow/pkg/config/http.go","line":96},"msg":"Http get request failed","url":"https://gitlab.devops.telekom.de/api/v4/projects/228644/repository/files/config%2Eyaml/raw?ref=main","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/228644/repository/files/config%2Eyaml/raw?ref=main\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
{"time":"2024-01-23T08:52:24.569558001Z","level":"WARN","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).Run","file":"/home/runner/work/sparrow/sparrow/pkg/config/http.go","line":72},"msg":"Could not get remote runtime configuration","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/228644/repository/files/config%2Eyaml/raw?ref=main\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
{"time":"2024-01-23T08:52:24.569574733Z","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).Run","file":"/home/runner/work/sparrow/sparrow/pkg/config/http.go","line":75},"msg":"Successfully got remote runtime configuration"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x81c50c]
goroutine 35 [running]:
github.com/caas-team/sparrow/pkg/config.(*HttpLoader).Run(0xc000206b88, {0xb70440?, 0xc000217b00?})
/home/runner/work/sparrow/sparrow/pkg/config/http.go:76 +0xac
github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).Run.func1()
/home/runner/work/sparrow/sparrow/pkg/sparrow/run.go:107 +0x2d
created by github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).Run in goroutine 1
/home/runner/work/sparrow/sparrow/pkg/sparrow/run.go:106 +0xfe
Stream closed EOF for caas-sparrow/sparrow-857b77d885-mkkjk (sparrow)
No response
No response
A DNS check should be implemented.
For the MVP of the check, the following requirement needs to be fulfilled:
No response
No response
Currently, the two checks (health and latency) have been designed separately from each other, although their functionality is similar. The checks' architecture should be similar as well.
Additionally, the latency check should use time references in exposed API keys, e.g. if the interval is defined in seconds, use `intervalSec: 1`.
The checks should be aligned so that functionality and code readability become clearer.
The test for the latency check is failing due to race conditions.
See: https://github.com/caas-team/sparrow/actions/runs/7058951310/job/19215535279
Hints by @puffitos see: #26 (comment)
No response
No response
No response
No response
No response
Since #45 introduced Prometheus metrics, it would be nice to have optional ServiceMonitors in the Helm chart.
Extend the Helm chart with an option for deploying a ServiceMonitor that targets sparrow's metrics endpoint. Look at the grafana chart as an example.
Anyone familiar with prometheus and writing helm charts
No response
Currently, the Target Manager is enabled by default. The sparrow works fine without the Target Manager, but if its use case is not needed, it is currently impossible to disable the TM.
Introduce a flag to disable the Target Manager if it is not needed.
No response
End-to-end testing is quite complicated due to a default-enabled TM.
Currently the latency check doesn't use its retry mechanism because no error is returned.
The retry mechanism is triggered if the request fails.
$ go run main.go run --config .tmp/start-config.yaml
Using config file: .tmp/start-config.yaml
{"time":"2024-01-25T14:56:22.596885375+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/cmd.NewCmdRun.run.func1","file":"/home/installadm/dev/github/sparrow/cmd/run.go","line":82},"msg":"Running sparrow"}
{"time":"2024-01-25T14:56:22.597003935+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).api.func1","file":"/home/installadm/dev/github/sparrow/pkg/sparrow/api.go","line":81},"msg":"Serving Api","addr":":8080"}
{"time":"2024-01-25T14:56:22.596999388+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).Reconcile","file":"/home/installadm/dev/github/sparrow/pkg/sparrow/targets/gitlab.go","line":80},"msg":"Starting global gitlabTargetManager reconciler"}
{"time":"2024-01-25T14:56:22.5971122+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*FileLoader).Run","file":"/home/installadm/dev/github/sparrow/pkg/config/file.go","line":48},"msg":"Reading config from file","file":"./.tmp/run-config.yaml"}
{"time":"2024-01-25T14:56:22.597416945+01:00","level":"WARN","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).registerCheck","file":"/home/installadm/dev/github/sparrow/pkg/sparrow/run.go","line":232},"msg":"Check is not registered","name":"health"}
{"time":"2024-01-25T14:56:22.59755689+01:00","level":"WARN","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).registerCheck","file":"/home/installadm/dev/github/sparrow/pkg/sparrow/run.go","line":232},"msg":"Check is not registered","name":"dns"}
{"time":"2024-01-25T14:56:22.597578372+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/checks/latency.(*Latency).Run","file":"/home/installadm/dev/github/sparrow/pkg/checks/latency/latency.go","line":91},"msg":"Starting latency check","interval":"20s"}
{"time":"2024-01-25T14:56:42.612641142+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/checks/latency.getLatency","file":"/home/installadm/dev/github/sparrow/pkg/checks/latency/latency.go","line":284},"msg":"Error while checking latency","url":"xhttps://example.com","error":"Get \"xhttps://example.com\": unsupported protocol scheme \"xhttps\""}
No response
No response
@puffitos Can you create an issue to the validation part you've mentioned in your PR?
Originally posted by @lvlcn-t in #93 (comment)
The `checks.Runtime` interface should also have a `Validate` method to validate the configuration per check.
The `Validate` call can then trickle up to the `runtime.Config` struct, which should have a `Validate` method of its own to validate all configurations. This will ensure only valid configurations are passed to the running checks and should add to the robustness of the sparrow.
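A sketch of the trickle-up validation (the `Runtime`/`Config` shapes and the `healthConfig` example are assumptions, not sparrow's actual types):

```go
package main

import (
	"errors"
	"fmt"
)

// Runtime is a sketch of the per-check config interface with the
// proposed Validate method added.
type Runtime interface {
	Validate() error
}

// healthConfig is a hypothetical check config implementing Runtime.
type healthConfig struct {
	Targets []string
}

func (h healthConfig) Validate() error {
	if len(h.Targets) == 0 {
		return errors.New("health: at least one target is required")
	}
	return nil
}

// Config aggregates all per-check configs; its Validate trickles down
// to every check so only valid configurations reach the running checks.
type Config struct {
	Checks map[string]Runtime
}

func (c Config) Validate() error {
	for name, rt := range c.Checks {
		if err := rt.Validate(); err != nil {
			return fmt.Errorf("invalid config for check %q: %w", name, err)
		}
	}
	return nil
}
```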
We should discuss an alternative to how we parse the startup and runtime configuration. Currently, we have to define multiple flags and their string mapping, and we must provide all those long strings in a YAML file, linearly, like this:
runParameterOne: 1
mostImportantRunParameterTwenty: 20
This inflates the codebase and it's questionable whether it offers us any concrete advantages.
We should instead be able to load the (startup) configuration from one file, formatted as follows:
checks:
- check1: {}
- check2: {}
targetManager:
gitlab: {}
# and so on
No response
No response
Everyone who has some time to check how the cobra & viper framework works.
No response
The global targets are injected into the checks' config via their `target` field. This was OK for the target manager MVP, but the checks will become more complex and not every check can implement a string slice as its target field.
Add a dedicated global targets field to the checks' config structs. This field can then be merged into the targets field by the check itself, depending on its config implementation.
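A small sketch of the dedicated field and the merge (struct and method names are illustrative assumptions):

```go
package main

// healthConfig is a hypothetical check config with a dedicated field for
// targets injected by the target manager, kept apart from user targets.
type healthConfig struct {
	Targets       []string // from the user's runtime config
	GlobalTargets []string // injected by the target manager
}

// effectiveTargets merges global targets into the user-defined ones,
// letting the check decide how (and whether) to combine them.
func (c healthConfig) effectiveTargets() []string {
	seen := make(map[string]struct{}, len(c.Targets)+len(c.GlobalTargets))
	out := make([]string, 0, len(c.Targets)+len(c.GlobalTargets))
	for _, t := range append(append([]string{}, c.Targets...), c.GlobalTargets...) {
		if _, ok := seen[t]; ok {
			continue // deduplicate overlapping targets
		}
		seen[t] = struct{}{}
		out = append(out, t)
	}
	return out
}
```

A check with a non-slice target representation would implement its own merge instead.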
Everyone
No response
We want to gather data similar to the output of traceroute: we declare a list of targets, and sparrow should collect the hops through which packets travel to reach each destination.
For now, we want to collect the following metrics for every target:
Unlike traditional traceroute, the check will use TCP. This avoids requiring root permissions or cap_net_raw. I'm proposing the config to look like this:
checks:
traceroute:
retry:
delay: 10s
count: 3
timeout: 30s
targets:
- https://google.com
- https://bing.com:80
- https://myservice.com:12345
Num Hops should be a simple counter.
For example: `sparrow_traceroute_hops_count{target="https://google.com"} 12`
The path taken is a bit more difficult, as we can't really convey the graph-like nature of the data to Prometheus. My suggestion is that we export these metrics in an OpenTelemetry-compatible format. We could use the otel SDK to create the metrics and then ship them off to a trace aggregator like Jaeger. This makes it easy to adopt sparrow, as Grafana already has a native Jaeger datasource, so there would be no need to hack together our own Grafana datasource.
While we can't collect traces in Prometheus, we can at least link a timeseries to a trace using Prometheus exemplars. This is not a requirement, but it makes the UX nicer when viewing the data in Grafana.
tbd
tbd
No response
No response
#69 introduces ServiceMonitors to the Helm chart. How about we also provide Grafana dashboards that go along with the ServiceMonitors? We could use the dashboard that the SRE team provides as a baseline. We should only work on this once the Prometheus metrics are semi-stable; otherwise we waste a lot of time maintaining the dashboards.
No response
Anyone willing to spend some time in grafana writing promql queries
No response
The Prometheus format is not sufficient for exporting metrics from the traceroute check, aside from the number of hops. To provide more complex data, it might be necessary to support a second data format like OpenTelemetry.
Integrate the OTel library and inject it into the checks, so they can write their own traces. Traceroute can use this to collect more detailed data about how every single invocation of the check behaves, essentially allowing a user to visualize how packets move from sparrow to their target.
No response
https://github.com/open-telemetry/opentelemetry-go
https://opentelemetry.io/docs/languages/go/
https://www.jaegertracing.io/docs/1.54/client-libraries/#go
https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/traces/
If a sparrow pod is restarted, it causes a gap in monitoring. This is due to the healthcheck returning an HTTP 200 to the Kubernetes probe. Since Kubernetes thinks the new sparrow pod is ready to accept traffic, Prometheus scrape requests are sent to that new pod immediately, before the startup routine has actually finished.
Screenshot from Prometheus below
A restart of the sparrow deployment should wait for the new pod to have finished its first reconciliation cycle. Only then will the pod accept traffic. This ensures that Prometheus never scrapes a pod that is still starting up.
No response
No response
Currently, e.g. the health check uses the global targets provided by the target manager directly. A registered global target `https://sparrow-b.telekom.de` will be used as-is, but every sparrow serves its health check endpoint at `/checks/health`. The result is that the target appears unhealthy.
The sparrow should use the global targets for its checks correctly. It should therefore append the check's health endpoint (similar to other checks that use the global targets).
E.g.: `https://sparrow-b.telekom.de/checks/health`
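A tiny helper sketch for appending the endpoint (the function name is hypothetical; the `/checks/health` path comes from this issue):

```go
package main

import (
	"net/url"
	"strings"
)

// healthEndpoint turns a registered global target into the URL of its
// sparrow health endpoint.
func healthEndpoint(target string) (string, error) {
	u, err := url.Parse(target)
	if err != nil {
		return "", err
	}
	// normalize trailing slashes before appending the check path
	u.Path = strings.TrimRight(u.Path, "/") + "/checks/health"
	return u.String(), nil
}
```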
No response
No response
No response
No response
With #91 we have removed the handler functions from the checks interface. Currently the HTTP handlers are not used by the checks. The http & latency checks can use endpoints that return 200 HTTP OK, and for checking other sparrow instances the global sparrow health endpoint can be/is used.
The API is still configured to allow HTTP handlers to be registered manually. To clean up the code base, this functionality needs to be removed, including all related functions.
The following sections need to be checked and cleaned up:
https://github.com/caas-team/sparrow/blob/main/pkg%2Fsparrow%2Fapi.go#L56
https://github.com/caas-team/sparrow/blob/main/pkg%2Fsparrow%2Fapi.go#L258
Sparrow starts with two targets:
https://google.com
https://example.com
this creates two Prometheus metrics for those targets:
sparrow_latency_duration_seconds{status="200",target="https://google.com"} 0.014013145
sparrow_latency_duration_seconds{status="200",target="https://example.com"} 0.003617327
...
If the config gets updated and one target gets removed, the last Prometheus metric for the deleted target remains exposed as its latest value.
New targets:
https://example.com
Prometheus metrics:
sparrow_latency_duration_seconds{status="200",target="https://google.com"} 0.014013145 <- This remains and is not cleaned up
sparrow_latency_duration_seconds{status="200",target="https://example.com"} 0.007915321
Let's discuss how (and if) we should handle this case. Maybe it's possible to remove metrics from the Prometheus registry.
The screenshot shows what happened after google.com was removed from the config, compared to a target that is currently active: since the sparrow metrics API still exposes the value for google.com, Prometheus stores it on every scrape.
@puffitos @y-eight @lvlcn-t Let's discuss how we handle this next week
No response
Currently, the retry pattern implementation and the loader use a timer. This can be simplified by using `time.After`.
sparrow/internal/helper/retry.go
Line 51 in faab23e
Line 72 in faab23e
Currently the loaders would panic if no interval is set, because they then call `time.NewTicker(0)`, which panics.
We should at least validate that the interval is > 0.
But IMO this offers the opportunity to introduce another feature to the loaders: we could disable the continuous loading of the check config with an interval of 0. This would also align with the feature introduced by #101, which kinda does this for the instance registration and update.
If we introduced this proposal, we'd only need to check our startup config for interval < 0.
Everyone
No response
The logger has a fixed verbosity.
The logging verbosity should be switchable at runtime, or at least by defining an environment variable.
We need to be careful when updating viper beyond v1.18.0.
We should write a test case that checks whether unmarshalling from environment variables works properly everywhere, so that we notice when someone updates the library and breaks it.
To fix this, we need to update our build and test pipelines to include the build tag -tags=viper_bind_struct.
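The pipeline change could look roughly like this (a sketch of a single CI step; the surrounding workflow layout is illustrative, not the repo's actual pipeline):

```yaml
# Build and test with the viper_bind_struct build tag enabled, so that
# struct binding from environment variables keeps working after updates.
- name: Build and test with struct binding enabled
  run: |
    go build -tags=viper_bind_struct ./...
    go test -tags=viper_bind_struct ./...
```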
Anyone
The CI checks regarding security fail, and no linting is done on the entirety of the repository. This may result in poorly maintained code and errors that go unchecked.
We have automatic linting to avoid typical code mistakes & code smells.
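A golangci-lint configuration would cover both points; a minimal sketch (the linter selection is a suggestion, not a decision):

```yaml
# .golangci.yaml -- minimal sketch for linting the whole repository
run:
  timeout: 5m
linters:
  enable:
    - gosec       # security checks (addresses the failing security CI)
    - staticcheck # typical code mistakes
    - errcheck    # unchecked errors
    - govet
```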
Everyone
To harmonize the component interfaces of the sparrow and allow for graceful shutdowns, all components should have a shutdown routine integrated.
Add a Shutdown(ctx) error function to the Loader interface and implement it.
Everyone
cc @lvlcn-t
Low-priority task. This is only needed to have a more standardized way of how all sparrow components work. The sparrow should be responsible for shutting down its components, if needed, and the components should always just keep running as long as they can.
The update functionality of the target manager (update with current timestamp) should be configurable. This allows the target manager to activate/deactivate the update registration feature.
Make the update-registration feature configurable (enable/disable) and decouple the update interval from the registration interval. The two features should be separate.
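The resulting configuration could look roughly like this (all field names are assumptions for illustration, not existing sparrow options):

```yaml
targetManager:
  registration:
    enabled: true   # toggle the update-registration feature
    interval: 5m    # how often the registration timestamp is refreshed
  updateInterval: 1m # separate from the registration interval
```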
Currently, 3 different latency metrics are available.
If the health check fails (internally), the latency and the status code are both recorded as 0.
This might be ok for the counter and the plain latency metric, but it is not best practice for the histogram: its lowest buckets get filled by these 0-value observations.
Example with 2 errors and 308 total requests:
# HELP sparrow_latency_duration Latency of targets in seconds
# TYPE sparrow_latency_duration histogram
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.005"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.01"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.025"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.05"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.1"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.25"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.5"} 288
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="1"} 307
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="2.5"} 308
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="5"} 308
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="10"} 308
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="+Inf"} 308
sparrow_latency_duration_sum{target="https://gitlab.devops.telekom.de"} 120.39378972299998
sparrow_latency_duration_count{target="https://gitlab.devops.telekom.de"} 308
As @puffitos stated in #45, we should probably solve this with labelling or another set of metrics, e.g. a label for the check's state.
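One of the options, keeping failed checks out of the histogram entirely and counting them separately, can be sketched like this (the types are stand-ins for the real prometheus collectors, modelled with plain Go values here):

```go
package main

import "fmt"

// metrics models the two collectors from the discussion: the slice
// stands in for histogram observations, the int for an error counter.
type metrics struct {
	latencies []float64
	errors    int
}

// recordResult keeps failed checks out of the latency histogram so its
// low buckets are not skewed by 0-second observations; failures are
// tracked in a dedicated counter instead.
func (m *metrics) recordResult(latencySeconds float64, failed bool) {
	if failed {
		m.errors++
		return
	}
	m.latencies = append(m.latencies, latencySeconds)
}

func main() {
	m := &metrics{}
	m.recordResult(0.42, false) // successful check: observe latency
	m.recordResult(0, true)     // failed check: count the error only
	fmt.Println(len(m.latencies), m.errors)
}
```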