grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines

Home Page: https://grafana.com/oss/alloy

License: Apache License 2.0

Jsonnet 1.73% Makefile 0.38% Dockerfile 0.17% Go 96.17% Smarty 0.12% Shell 0.27% HTML 0.01% CSS 0.17% TypeScript 0.97%
collector grafana loki monitoring observability opentelemetry opentelemetry-collector prometheus

alloy's Introduction

Grafana Alloy is an open source OpenTelemetry Collector distribution with built-in Prometheus pipelines and support for metrics, logs, traces, and profiles.

What can Alloy do?

  • Programmable pipelines: Use a rich expression-based syntax for configuring powerful observability pipelines.

  • OpenTelemetry Collector Distribution: Alloy is a distribution of OpenTelemetry Collector and supports dozens of its components, alongside new components that make use of Alloy's programmable pipelines.

  • Big tent: Alloy embraces Grafana's "big tent" philosophy, where Alloy can be used with other vendors or open source databases. It has components that integrate with multiple telemetry ecosystems.

  • Kubernetes-native: Use components to interact with native and custom Kubernetes resources; no need to learn how to use a separate Kubernetes operator.

  • Shareable pipelines: Use modules to share your pipelines with the world.

  • Automatic workload distribution: Configure Alloy instances to form a cluster for automatic workload distribution.

  • Centralized configuration support: Alloy supports retrieving its configuration from a server for centralized configuration management.

  • Debugging utilities: Use the built-in UI for visualizing and debugging pipelines.

Example

otelcol.receiver.otlp "example" {
  grpc {
    endpoint = "127.0.0.1:4317"
  }

  output {
    metrics = [otelcol.processor.batch.example.input]
    logs    = [otelcol.processor.batch.example.input]
    traces  = [otelcol.processor.batch.example.input]
  }
}

otelcol.processor.batch "example" {
  output {
    metrics = [otelcol.exporter.otlp.default.input]
    logs    = [otelcol.exporter.otlp.default.input]
    traces  = [otelcol.exporter.otlp.default.input]
  }
}

otelcol.exporter.otlp "default" {
  client {
    endpoint = "my-otlp-grpc-server:4317"
  }
}
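
If you want to try this example locally, you can save it to a file (for example config.alloy; the filename is just an illustration) and start Alloy with `alloy run config.alloy`. The pipeline receives OTLP data over gRPC on 127.0.0.1:4317, batches it, and forwards it to the configured OTLP gRPC endpoint.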

Getting started

Check out our documentation for installation instructions and steps for getting started.

Release cadence

A new minor release is planned every six weeks.

The release cadence is best-effort: if necessary, releases may be performed outside of this cadence, or a scheduled release date can be moved forwards or backwards.

Minor releases published on cadence include updating dependencies for upstream OpenTelemetry Collector code if new versions are available. Minor releases published outside of the release cadence may not include these dependency updates.

Patch and security releases may be published at any time.

Community

To engage with the Alloy community, use the community channels listed in our documentation.

Contributing

Refer to our contributors guide to learn how to contribute.

Thanks to all the people who have already contributed!

alloy's People

Contributors

56quarters, captncraig, clayton-cornell, cyriltovena, dependabot[bot], erikbaranowski, github-actions[bot], gotjosh, hainenber, hjet, jcreixell, jdbaldry, jkroepke, joe-elliott, karengermond, kgeckhart, korniltsev, mapno, marctc, mattdurham, ptodev, rfratto, rgeyer, rlankfo, simonswine, spartan0x117, thampiotr, thepalbi, tpaschalis, wildum

alloy's Issues

Flow API: support returning unevaluated config files

To enable showing unevaluated config files as described in #517, the API needs to be extended to expose this information.

Doing this securely isn't simple and will take some effort to implement properly: we'd still want to scrub any hard-coded strings which are passed to any attribute with a secret type.

Work with upstream to make node_exporter integration easier to maintain

grafana/agent#1228 updated our dependency on node_exporter from v1.0.1 to v1.3.1. This ended up being a lot more work than I thought it would be, primarily due to its mapping of YAML to the global kingpin flags node_exporter creates.

It will be easier (and less error prone) to maintain the node_exporter integration in the future if we work with upstream to make the integration easier to maintain. I haven't put a ton of thought into it, but I think this at least involves removing the usage of unexported globals that configure collectors.

CI: automate fuzz tests

We should create a GitHub Action to run our fuzz tests for some fixed period of time.

There's a few ways we could do this:

  • Per PR, per push
  • As a cronjob

Note that it's not currently possible to run `go test ./... -fuzz=.`; it fails in Go 1.18 with `cannot use -fuzz flag with multiple packages`.

In the meantime, we would have to enumerate the packages returned by go list ./... and fuzz each of them.

Scrape custom pods

Hi, the docs state the following:

The pod must not have an annotation matching prometheus.io/scrape: "false" (this wouldn't be there unless you explicitly add it or if you deploy a Helm chart that has it).
The pod must have a port with a name ending in -metrics. This is the port that will be scraped by the Agent. A lot of people using Helm struggle with this, since Helm charts don't usually follow this. You would need to add a new scrape config to scrape helm charts or find a way to tweak the Helm chart to follow these rules.
The pod must have a label named name with any non-empty value. Helm usually lets you add extra labels, so this is less of a problem for Helm users.
The pod must currently be running. (i.e., Kubernetes must not report it having a phase of Succeeded or Failed).

But it is unclear to me what is meant by a port with a name ending in -metrics. As far as I know, a pod only has a containerPort, and that doesn't have a name property.

Windows Service Error: A timeout was reached (30000 milliseconds) while waiting for the Grafana Agent service to connect.

Similar to windows_exporter (and a bunch of other Go applications on Windows), Grafana Agent can fail to start as a service on Windows following a Windows update or another high-CPU event during service startup.

I wrote a fairly detailed analysis of the issue here that explains the cause, but to be brief: it comes down to the way Go initialises packages versus the time within which Windows expects a response from an application starting as a service (30s or 60s, depending on your version of Windows).

A way to work around this can be to delay the start of the service (default 60s or 120s, depending on your version of Windows), but there are still some situations where the resource contention does not clear up in time and the service fails to start with the following recorded in the event log:

The Grafana Agent service failed to start due to the following error: 
The service did not respond to the start or control request in a timely fashion.
A timeout was reached (30000 milliseconds) while waiting for the Grafana Agent service to connect.

I've submitted a PR against windows_exporter to move the service initiation code out of main and into its own package, so that the Windows service can be started as early in startup as possible rather than in main(), and I intend to do the same against Grafana Agent.

Air Gap install

Hi

I'm running the agent with the loki helm chart behind a firewall that will not let me access the internet.
Is there a way to tell the operator which registry it should use in the statefulsets that are created?

Thanks for your help

[Tempo] Improve vision on dropped spans

Currently observability on dropped spans is poor. The agent is based on the OTEL collector which has this issue:

open-telemetry/opentelemetry-collector#2434

Additionally, if batches are dropped because the queue is full, they are not counted in any metric. This is logged, but not recorded as a failed send: "Dropping data because sending_queue is full"

https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/monitoring.md#secondary-monitoring

Let's keep an eye on these issues and revendor when they are fixed. If we have spare cycles these are also good issues for upstream contributions.

[traces] spanmetrics and AutomaticLogging unable to get otel span tags as a metric/log tag

Grafana agent version: v0.26.1

I have configured the spanmetrics and automatic_logging features for the traces agent, with dimensions and span_attributes respectively.

I am trying to record the otel.library.name. Any other tag is recorded successfully in both pipelines (logs and metrics), for example net.peer.name.

When I inspect the span, its otel tags, including otel.library.name, are there, but when it comes to logging them it fails silently.

external_labels.cluster should be more prominently configurable

I'm using the cluster external label to distinguish metrics/logs pushed from several clusters.

The cluster label seems to get derived from the namespace and resource name of the GrafanaAgent resource.

If I change the resource name of GrafanaAgent to $clustername, resulting pods are somewhat confusingly named ($clustername-0, $clustername-logs-*), and there's no reference to it being grafana-agent anymore.
Maybe we could always prefix them with grafana-agent?

There are fields to set the name of the label itself from cluster to something else (metricsExternalLabelName, logsExternalLabelName).

I might be able to set these labels by setting grafanaagent.spec.metrics.externalLabels (if it takes priority, didn't check), but there's no option to set labels for the grafanaagent.spec.logs hierarchy. Maybe this was missed when designing the CRD?

Maybe we should expose a grafanaagent.spec.clusterName field, that's used if set instead of ${grafanaagent.spec.metadata.namespace}/${grafanaagent.spec.metadata.name}?

Kubernetes security: Potentially set `readOnlyRootFilesystem: true` in sample agent yamls

It's a security best practice to run containers with a read only root filesystem. It keeps things immutable and reduces the attack surface should a container be compromised. It would be good to adjust the sample yamls and jsonnet behaviour to set this by default if we can.

While a Kubernetes administrator can configure AdmissionControllers to prevent containers running without this, or even forcibly add the settings, it would be good to be more secure by default and offer it from the get-go.

From some limited testing, the only adjustment that needs to be made to account for this is that the WAL and positions directories need to be mounted as a specific volume, since those directories need to be writable. They can be backed by emptyDir: {} if desired rather than a long-running persistent volume, though that may have some caveats in terms of WAL / positions persistence. https://kubernetes.io/docs/concepts/storage/volumes/#emptydir

I'm unsure if the traces collection portion also needs some writable directory; at a glance it doesn't seem so.

(Feature) PodLogs namespaceSelector add matchExpressions

Today it's not possible to set matchExpressions on the namespaceSelector in PodLogs.
This is a very useful feature if you have multiple loki instances and you only want to send all namespaces with label x to loki A.

For example, I want to be able to gather logs from all namespaces where platform=true.

I know this is a bit more than what you can do in a Prometheus ServiceMonitor, but I think this would be nice since PodLogs can be very general.
In my case it's not quite that general, so I want to use the any field, and it would be too painful to use matchNames in bigger clusters.

➜ k explain podlogs.spec.namespaceSelector                 
KIND:     PodLogs
VERSION:  monitoring.grafana.com/v1alpha1

RESOURCE: namespaceSelector <Object>

DESCRIPTION:
     Selector to select which namespaces the Pod objects are discovered from.

FIELDS:
   any	<boolean>
     Boolean describing whether all namespaces are selected in contrast to a
     list restricting them.

   matchNames	<[]string>
     List of namespace names.

I can help out with implementing this if it's a feature you want.

Lower default active series in agent integration

The agent integration exposes ~700 active series out of the box, most of which may not be useful for just scraping agent metrics.

Some extra options should be provided to make the default set of collected metrics as small as possible.

Log when the agent is not able to discover services/pods

When trying out agent v0.26.1 on a cluster with invalid RBAC configured (i.e. an invalid service account), the agent does not log any error, which makes it impossible to debug what is failing as there are no errors in the logs:

ts=2022-08-17T08:56:34.585432153Z caller=server.go:191 level=info msg="server listening on addresses" http=[::]:8080 grpc=127.0.0.1:12346 http_tls_enabled=false grpc_tls_enabled=false
ts=2022-08-17T08:56:34.585885928Z caller=node.go:85 level=info agent=prometheus component=cluster msg="applying config"
ts=2022-08-17T08:56:34.586011104Z caller=remote.go:180 level=info agent=prometheus component=cluster msg="not watching the KV, none set"
ts=2022-08-17T08:56:34Z level=info caller=traces/traces.go:143 msg="Traces Logger Initialized" component=traces
ts=2022-08-17T08:56:34.589747493Z caller=integrations.go:138 level=warn msg="integrations-next is enabled. integrations-next is subject to change"
ts=2022-08-17T08:56:34.598095095Z caller=reporter.go:107 level=info msg="running usage stats reporter"
ts=2022-08-17T08:56:34.601227094Z caller=wal.go:197 level=info agent=prometheus instance=f991db25d94df6b6a2d34e419c0bdac7 msg="replaying WAL, this may take a while" dir=/var/lib/grafana-agent/data/f991db25d94df6b6a2d34e419c0bdac7/wal
ts=2022-08-17T08:56:34.602215367Z caller=wal.go:244 level=info agent=prometheus instance=f991db25d94df6b6a2d34e419c0bdac7 msg="WAL segment loaded" segment=0 maxSegment=0
ts=2022-08-17T08:56:34.602621943Z caller=kubernetes.go:313 level=info agent=prometheus instance=f991db25d94df6b6a2d34e419c0bdac7 component="discovery manager" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-08-17T08:56:34.603232667Z caller=kubernetes.go:313 level=info agent=prometheus instance=f991db25d94df6b6a2d34e419c0bdac7 component="discovery manager" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-08-17T08:56:34.603757106Z caller=kubernetes.go:313 level=info agent=prometheus instance=f991db25d94df6b6a2d34e419c0bdac7 component="discovery manager" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-08-17T08:56:34.605157196Z caller=dedupe.go:112 agent=prometheus instance=f991db25d94df6b6a2d34e419c0bdac7 component=remote level=info remote_name=f991db-4b5dd5 url=http://thanos-receive-ingestor-default.thanos.svc:19291/api/v1/receive msg="Starting WAL watcher" queue=f991db-4b5dd5
ts=2022-08-17T08:56:34.605333032Z caller=dedupe.go:112 agent=prometheus instance=f991db25d94df6b6a2d34e419c0bdac7 component=remote level=info remote_name=f991db-4b5dd5 url=http://thanos-receive-ingestor-default.thanos.svc:19291/api/v1/receive msg="Starting scraped metadata watcher"
ts=2022-08-17T08:56:34.605842841Z caller=dedupe.go:112 agent=prometheus instance=f991db25d94df6b6a2d34e419c0bdac7 component=remote level=info remote_name=f991db-4b5dd5 url=http://thanos-receive-ingestor-default.thanos.svc:19291/api/v1/receive msg="Replaying WAL" queue=f991db-4b5dd5
ts=2022-08-17T08:56:39.427639186Z caller=entrypoint.go:249 level=info msg="reload of config file requested"
ts=2022-08-17T08:56:45.698794738Z caller=dedupe.go:112 agent=prometheus instance=f991db25d94df6b6a2d34e419c0bdac7 component=remote level=info remote_name=f991db-4b5dd5 url=http://thanos-receive-ingestor-default.thanos.svc:19291/api/v1/receive msg="Done replaying WAL" duration=11.092992576s

Meanwhile, with the same service monitors, the Prometheus operator in agent mode logs these errors:

ts=2022-08-17T09:21:06.200Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"thanos\""
ts=2022-08-17T09:21:15.976Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"thanos\""
ts=2022-08-17T09:21:15.976Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"thanos\""
ts=2022-08-17T09:21:41.453Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"thanos\""
ts=2022-08-17T09:21:41.454Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"thanos\""

Docs feedback: /set-up/install-agent-macos.md

Howdy, I just installed the agent via brew on macOS 13.0 (22A380) and there's no file at $(brew --prefix)/etc/grafana-agent/config.yml as the docs state.

The directory exists but it's empty.

~ » ls -la /opt/homebrew/etc/grafana-agent/
total 0
drwxr-xr-x   2 tim  admin   64 Oct 31 11:16 .
drwxrwxr-x  21 tim  admin  672 Oct 31 11:16 ..

Either this is unintended, or the docs need to be updated to say that we must add our own config file first; otherwise grafana-agent doesn't really start.

[feature request] spanmetrics filtering

Is your proposal related to a problem?
context: we're using Grafana Agent to route a subset of traces to Grafana Cloud and to calculate spanmetrics. We don't control the code generating the spans, only the configuration (sampling/logging/metrics) of Grafana Agent.

problem: since we don't control the spans, we can't make sure that unique data (e.g. span.name) will not be generated; this results in both high memory usage on the agent side and high-cardinality data in the generated metrics.

Describe the solution you'd like
I would like to be able to choose which spans are used in spanmetrics calculations.

Describe alternatives you've considered
currently we're dropping the unwanted metrics on remote_write, but there's still the issue of high memory usage by the agent.

Additional context
none

syntax: diagnostics error type not returned consistently

Alloy syntax code currently doesn't use the diag.Diagnostics error type consistently through all the exposed APIs. For example, using a block name as an attribute returns a fmt.Errorf instead of diag.Diagnostics.

We should make sure that diag.Diagnostics is used consistently so that multiple errors can be aggregated and reported with correct position information.
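
As a rough illustration of the difference (the types below are simplified, hypothetical stand-ins, not the real diag package API), an aggregated diagnostics value lets validation keep collecting problems with their positions instead of bailing out on the first fmt.Errorf:

package diagsketch

import "fmt"

// Diagnostic and Diagnostics are illustrative only; the real types carry
// proper start/end positions and severities.
type Diagnostic struct {
  Line    int
  Message string
}

type Diagnostics []Diagnostic

// Error lets a Diagnostics value flow through APIs that expect a plain
// error while still carrying every individual problem.
func (ds Diagnostics) Error() string {
  return fmt.Sprintf("%d diagnostics", len(ds))
}

// checkAttributes keeps collecting problems instead of returning on the
// first one, which is what consistent use of Diagnostics enables.
func checkAttributes(attrLines map[string]int) error {
  var ds Diagnostics
  for name, line := range attrLines {
    if name == "" {
      ds = append(ds, Diagnostic{Line: line, Message: "attribute has an empty name"})
    }
  }
  if len(ds) > 0 {
    return ds
  }
  return nil
}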

Agent will fail to start if windows_events.bookmark_path file is truncated

I'm using windows_events together with bookmark_path to gather Event logs on Windows.
If the host system is shut down incorrectly (power outage, BSOD, etc...), the bookmark.xml file can get truncated.
Because it's then invalid, the Grafana Agent service will fail to start until I remove the file manually.

Can the agent flush the bookmark file to disk from time to time? Or use a SQLite db instead of an .xml file?
And is there a workaround to ignore truncated/corrupted bookmarks during startup?

Flow UI: show evaluated config blocks

Currently, the Flow UI will only show evaluated components, but not any of the config blocks (logging, tracing). We should find a way to expose these so users can verify that a config block evaluated as they expected.

Config blocks should:

  • Be displayed in a table on the home page
    • A separate table may be appropriate since config blocks don't have health
  • Appear in the config graph
  • Be given a detail page similar to components

traces: Improve Error Message on Tail Sampling policy

This policy:

        tail_sampling:
          policies:
          - type: and
            and:
              and_sub_policy:
              - type: string_attribute
                string_attribute:
                  key: foo
                  values:
                    - bar
              - type: string_attribute
                string_attribute:
                  key: also_foo
                  values:
                    - also_bar

Causes the agent to log this error and exit:

msg="error creating the agent server entrypoint" err="failed to create tracing instance default: failed to create pipeline: failed to load otelConfig from agent traces config: malformed sampling policy"

Removing the type: and key fixes this issue. This config is slightly askew from the upstream config which requires type: and.

Can we improve this error message?

Also, this minor difference between configs has caused a fair amount of confusion, can we just accept the type: and key to be compatible with upstream?

Manually stored metrics do not have metadata

Samples which are manually written to the WAL (i.e., written outside of the normal scraping process of a metrics instance) do not support metadata being sent over remote write.

Metadata is used to display metric type and help info from Prometheus query frontends like Grafana.

Remote write currently expects that metadata comes from an instance of scrape.Manager, which is not always the case for code relying on remote write to send samples.

This impacts at least the following:

  • spanmetrics
  • integrations-next (which uses an external scrape manager whose metadata is not exposed directly to remote_write)
  • Metrics from integrations when using Grafana Agent Operator (since it uses integrations-next; see above)
  • Flow components

This went mostly unnoticed because most metrics still have metadata. I first became aware of this when prototyping grafana/agent#1261, where scraping and remote_write was initially fully decoupled, causing all metadata to disappear.

It's not immediately obvious what the best solution here is. Some way of being able to customize what metadata gets collected by remote_write would be ideal though.
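
One possible shape for that, purely as a sketch with hypothetical names and not an existing API, is a small interface that remote_write could accept in addition to (or instead of) a scrape.Manager, so that other writers can supply metadata too:

package metadatasketch

// MetricMetadata mirrors the information remote_write ships for a metric
// family: its type, help text and unit.
type MetricMetadata struct {
  Metric string
  Type   string
  Help   string
  Unit   string
}

// MetadataSource is a hypothetical hook: anything that writes samples to the
// WAL outside of a scrape.Manager (spanmetrics, integrations-next, Flow
// components) could implement it so remote_write still has metadata to send.
type MetadataSource interface {
  ListMetadata() []MetricMetadata
}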

Flow UI: Page not very responsive while loading large component

Description

  • A component detail page with a large amount of export values takes a long time to load due to the large amount of returned data (~20-30 seconds loading time).
  • Sometimes this will cause the entire page to crash.
  • The page reacts very slowly to user actions when the data size is large.

Flow: Add component for multi-tenant remote_write support

The Prometheus remote_write protocol doesn't have the notion of multi-tenancy itself. For this reason, different backends use various methods to enable multi-tenancy, most often an X-Scope-OrgID header or labels with a special meaning.

When using a prometheus.remote_write component, the Prometheus queue_manager reads WAL segments sequentially and enqueues metrics opportunistically to be batched off as remote_write requests. This behaviour offers no fine-grained control at the request level. Our current suggestion to users is that they add an extra header to their endpoint block, but these per-endpoint headers are static, and we don't support write_relabel_config filtering in Flow yet.

The new component would act as a remote_write middleware, receiving metrics from upstream components, extracting a given label (say tenant), and batching timeseries grouped by this label value.

It would then send discrete remote_write requests for these batches while also adding the correct X-Scope-OrgID header for each one.

Notes

  • This was inspired by the conversation in grafana/agent#1821 and existing solutions like cortex-tenant.
  • I don't believe it would be easy to try and shoehorn this behaviour into the current prometheus.remote_write component.
  • Using this component would entail some performance penalty which should be measured and explicitly documented.
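
A rough sketch of the batching idea follows; the sample type and function names are hypothetical and only illustrate the grouping and the per-request header, not the real component or the Prometheus remote-write client:

package tenantwrite

import (
  "bytes"
  "net/http"
)

// sample is a deliberately simplified stand-in for a remote_write timeseries.
type sample struct {
  labels map[string]string
  value  float64
  ts     int64
}

// groupByTenant splits incoming samples by the value of the tenant label so
// that each group can be sent as its own remote_write request.
func groupByTenant(samples []sample, tenantLabel string) map[string][]sample {
  out := make(map[string][]sample)
  for _, s := range samples {
    tenant := s.labels[tenantLabel]
    out[tenant] = append(out[tenant], s)
  }
  return out
}

// sendBatch sends one encoded batch for a single tenant; the only per-request
// difference between tenants is the X-Scope-OrgID header.
func sendBatch(client *http.Client, endpoint, tenant string, body []byte) error {
  req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(body))
  if err != nil {
    return err
  }
  req.Header.Set("X-Scope-OrgID", tenant)
  resp, err := client.Do(req)
  if err != nil {
    return err
  }
  return resp.Body.Close()
}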

Windows installer should allow providing a custom config file

There is no single step to install the agent and point it at a custom YAML config file.

The agent has to be installed, it has to run, then we have to run a separate command to repoint the config file. We would like a way to do a silent install and to include a YAML config file in one step.

[Flow] Ability to drag blocks on graph around

Currently you can't move the blocks on the graph once the Agent starts. It would be good if we could move them around, as when you have lots of connected components the lines can overlap in weird ways and make it hard to read. Being able to click and hold onto a block in the graph, and move it somewhere else with the arrows adjusting automatically would be great :)

Flow web app: expose unevaluated config files

It would be nice to have an endpoint which shows the unevaluated config file for all config blocks and components. This can be viewable in a single page (similar to Prometheus' config file endpoint), or we could also show the unevaluated config per-component when viewing a component-specific page.

However, doing this securely isn't simple: we'd still want to scrub any hard-coded strings which are passed to any attribute with a secret type.

#517

Kubernetes security: hostPath mounts should be read only

This one may be an oversight but there are places where hostPaths are mounted without being in read-only mode:

https://github.com/grafana/agent/blob/main/production/kubernetes/agent-loki.yaml#L79-L86

Per https://kubernetes.io/docs/concepts/storage/volumes/#hostpath

HostPath volumes present many security risks, and it is a best practice to avoid the use of HostPaths when possible. When a HostPath volume must be used, it should be scoped to only the required file or directory, and mounted as ReadOnly.

If restricting HostPath access to specific directories through AdmissionPolicy, volumeMounts MUST be required to use readOnly mounts for the policy to be effective.

It would be good to evaluate/fix these where and if possible.

[feature request] ability to filter/collect X slowest traces in tail sampling

Is your proposal related to a problem?
we would like to be able to select only a subset (e.g. the 10 slowest/fastest) of transactions and save them into Tempo

Describe the solution you'd like
I would like to have a topk/bottomk filter in the tail sampling processor

Describe alternatives you've considered
None; there's currently an ability to filter traces based on transaction time, but the duration needs to be hardcoded rather than being relative to the rest of the traces.

Additional context
none

__agent_hostname should be available as a meta label for prometheus scrape configs

It would be useful if __agent_hostname was available as a meta label for prometheus scrape configs.

Our specific case is to run the agent on each node, with a static scrape config and a local endpoint. For example, our configuration might look like this:

prometheus:
  configs:
    - name: agent
      host_filter: false
      scrape_configs:
        - job_name: example
          metrics_path: '/metrics'
          static_configs:
            - targets: ['127.0.0.1:8080']

The labels for this might look something like:

      "labels": {
        "instance": "127.0.0.1:8080",
        "job": "example"
      },
      "discovered_labels": {
        "__address__": "127.0.0.1:8080",
        "__metrics_path__": "/metrics",
        "__scheme__": "http",
        "job": "example"
      },

This series will not be unique across multiple instances. Setting the target to ${HOSTNAME}:8080 requires enabling variable interpolation, and may resolve to a network interface rather than 127.0.0.1.

The solution I've found is to set the instance label manually using interpolation:

          static_configs:
            - targets: ['127.0.0.1:8080']
              labels:
                instance: "${HOSTNAME}:8080"

However, this requires enabling variable interpolation, and seems to be somewhat less reliable than the mechanism used to set the instance label on the agent integrations.

Ideally, __agent_hostname would be available as a meta label globally, so that I could use it in a relabel config wherever I would like.

Grafana-agent Errors

I'm seeing the following errors

C:\Program Files\Grafana Agent>.\agent-windows-amd64.exe -config.file=.\agent-config.yaml
panic: ReqQueryValueEx failed: The system cannot find the file specified. errno 2

goroutine 1 [running]:
github.com/leoluk/perflib_exporter/perflib.QueryNameTable(0x459e310, 0xb, 0xc0000c39e0)
        /go/pkg/mod/github.com/grafana/[email protected]/perflib/nametable.go:43 +0x330
github.com/leoluk/perflib_exporter/perflib.init.0()
        /go/pkg/mod/github.com/grafana/[email protected]/perflib/perflib.go:266 +0x45

Failed scrapes prevent series from being deletable

When a scrape fails, series written to the Appender will be discarded with a Rollback. This breaks the garbage collection check, where it looks to see if a series is pending a commit. However, the pending commit status is only updated on Commit, and not on Rollback.

This means the series will never be considered for deletion, and may permanently leak if it never appears in another scrape.

This bug is not present in the upstream Prometheus Agent, where a simplification of the GC logic avoided the need for a "pending commit" status. We may not want to bother fixing this downstream and just switch over to Prometheus Agent instead (though that's lightly blocked by grafana/agent#1332).

You can reproduce with the following:

server:
  http_listen_port: 12345 

metrics:
  global:
    remote_write:
    - url: CONFIGURE_ME 
  configs:
  - name: default
    scrape_configs:
    - job_name: integrations/agent 
      static_configs:
      - targets: ['127.0.0.1:12345']
        labels:
          __metrics_path__: /integrations/agent/metrics
      sample_limit: 1

integrations:
  agent:
    enabled: true 
    scrape_integration: false 

This causes all scrapes to fail, as the agent target always has more than one sample, and series will never be deleted.
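
For reference, a minimal sketch of the fix idea described above (hypothetical types, not the Agent's actual WAL code): clear the pending-commit flag in Rollback the same way Commit does, so series touched by a failed scrape become eligible for garbage collection again.

package walsketch

// memSeries and appender are simplified stand-ins for the Agent's WAL
// bookkeeping, only to illustrate the bug described above.
type memSeries struct {
  pendingCommit bool
}

type appender struct {
  touched []*memSeries
}

func (a *appender) Commit() error {
  // The pending-commit status is cleared here today...
  for _, s := range a.touched {
    s.pendingCommit = false
  }
  a.touched = nil
  return nil
}

func (a *appender) Rollback() error {
  // ...but a failed scrape ends in Rollback, so without the same clearing
  // here the series stays "pending" forever and GC never considers it.
  for _, s := range a.touched {
    s.pendingCommit = false
  }
  a.touched = nil
  return nil
}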

Panic when reloading with changed remote_write config

I was testing grafana/agent#1478 with a dummy remote_write endpoint, when the Agent suddenly panicked and shut down.

Incidentally, it was around the same time I was trying to reload a config file when curl localhost:12345/-/reload hung; after a while the command returned and the Agent was down. Before and after that, all calls to /-/reload worked fine, so I'm not sure whether it was related or a one-off occurrence.

I'll investigate this further as I wasn't able to reproduce this under the same conditions, but I'm leaving the stack trace here for future reference and for others.

ts=2022-04-19T09:42:47.020614397Z caller=gokit.go:65 level=debug msg="GET /agent/api/v1/metrics/integrations/sd?instance=ebpf&integrations=ebpf (200) 181.043µs"
ts=2022-04-19T09:42:50.709134329Z caller=dedupe.go:112 agent=prometheus instance=1441107aae5b2bac39bf41c80aa47c88 component=remote level=error remote_name=144110-88a234 url=http://localhost:9009/api/prom/push msg="Failed to flush metadata"
ts=2022-04-19T09:42:50.70948193Z caller=dedupe.go:112 agent=prometheus instance=1441107aae5b2bac39bf41c80aa47c88 component=remote level=error remote_name=144110-88a234 url=http://localhost:9009/api/prom/push msg="non-recoverable error while sending metadata" count=178 err="context canceled"
ts=2022-04-19T09:42:50.709647309Z caller=dedupe.go:112 agent=prometheus instance=1441107aae5b2bac39bf41c80aa47c88 component=remote level=info remote_name=144110-88a234 url=http://localhost:9009/api/prom/push msg="Scraped metadata watcher stopped"
ts=2022-04-19T09:42:50.710576469Z caller=dedupe.go:112 agent=prometheus instance=1441107aae5b2bac39bf41c80aa47c88 component=remote level=info remote_name=144110-88a234 url=http://localhost:9009/api/prom/push msg="Remote storage stopped."
ts=2022-04-19T09:42:50.711054063Z caller=manager.go:271 level=info agent=prometheus msg="stopped instance" instance=1441107aae5b2bac39bf41c80aa47c88
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x26d30ec]

goroutine 276 [running]:
github.com/grafana/agent/pkg/metrics/instance.(*Instance).Appender(0xc00024a100?, {0x52744b0?, 0xc000696800?})
	/root/agent/pkg/metrics/instance/instance.go:568 +0x2c
github.com/grafana/agent/pkg/integrations/v2/autoscrape.(*agentAppender).Appender(0xc00014bce0, {0x52744b0, 0xc000696800})
	/root/agent/pkg/integrations/v2/autoscrape/autoscrape.go:248 +0xa3
github.com/prometheus/prometheus/scrape.newScrapePool.func2.3({0x52744b0?, 0xc000696800?})
	/root/go/pkg/mod/github.com/grafana/[email protected]/scrape/scrape.go:311 +0x35
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport(0xc0007bf380, {0xed9f07abf?, 0x7746400?, 0x7746400?}, {0xed9f07abf?, 0x7746400?, 0x7746400?}, 0x0)
	/root/go/pkg/mod/github.com/grafana/[email protected]/scrape/scrape.go:1257 +0x3bc
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run(0xc0007bf380, 0xc000bda480?)
	/root/go/pkg/mod/github.com/grafana/[email protected]/scrape/scrape.go:1216 +0x345
created by github.com/prometheus/prometheus/scrape.(*scrapePool).sync
	/root/go/pkg/mod/github.com/grafana/[email protected]/scrape/scrape.go:587 +0xa1e

Edit: as far as I could tell, this happened with the default agent-local-config.yaml file, as I was replacing the dummy remote_write with my Grafana Cloud stack's details.

Improve windows installer

During the interactive Windows installer, we can only toggle windows_exporter on or off. So the agent is not fully functional after setup, as it is not sending data anywhere; the config has to be updated anyway.

Since some of our users will be installing the agent in interactive mode, it would be great to:

  • add more options in windows installer to fill in prometheus/loki endpoints
  • enable/disable logs collection

That way the agent would be reporting data right after installation is done.

Metrics: GC stale series separately from truncating WAL

Background

Today, Grafana Agent performs a metrics "garbage collection" every 60 minutes (when metrics collection is enabled). The garbage collection process does the following for a specific point in time (the "GC timestamp"):

  1. Mark in-memory series that haven't received a write since the last GC for deletion.
  2. Remove in-memory series that were marked for deletion during the previous GC.
  3. Cut the current WAL segment and create a new one.
  4. Create a new WAL checkpoint from the lower 2/3rds of segment files, composed of all active series records and samples newer than the GC timestamp.
  5. Delete the lower 2/3rds of WAL segment files which composed the checkpoint.

The details of how the GC timestamp is determined are out of scope for this proposal.

When Grafana Agent was first released, this GC process happened every minute, but it was lowered to 60 minutes due to the high constant IOPS introduced by the frequency of the GC.

Performing the GC every 60 minutes has the following effects compared to the original 60 second frequency:

  • Average IOPS rate is dramatically lowered, due to only needing to create checkpoints 60x less often
  • Memory usage increases, due to keeping stale series in-memory 60x longer

Proposal

The proposal is to supplement the current process with the following two additions:

  1. Use the staleness marker added by Prometheus' scraper as a marker for deletion.
  2. Delete marked in-memory series every 5 minutes in a background goroutine. Only series that have been marked for deletion for at least 5 minutes should be deleted.

5 minutes is chosen slightly arbitrarily: it needs to be big enough to allow for flapping series to be unmarked for deletion, and 4 minutes is the maximum recommended scrape interval for any target.

This change will reduce the overall memory consumption of Grafana Agent in environments that experience frequent series churn by allowing stale series to be removed within 10 minutes instead of 2 hours.
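
As a minimal sketch of the proposed background deletion (hypothetical interface and function names, not actual Agent code): a goroutine wakes up every 5 minutes and drops series whose deletion mark is at least 5 minutes old, independently of the hourly WAL truncation.

package walsketch

import (
  "context"
  "time"
)

// staleSeriesStore is a hypothetical view of the WAL storage: all this loop
// needs is a way to drop series marked stale before a given cutoff.
type staleSeriesStore interface {
  DeleteMarkedBefore(cutoff time.Time)
}

// runStaleSeriesGC deletes stale series on a 5-minute ticker until the
// context is cancelled.
func runStaleSeriesGC(ctx context.Context, store staleSeriesStore) {
  ticker := time.NewTicker(5 * time.Minute)
  defer ticker.Stop()
  for {
    select {
    case <-ctx.Done():
      return
    case <-ticker.C:
      store.DeleteMarkedBefore(time.Now().Add(-5 * time.Minute))
    }
  }
}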

Trade-offs

Pros

  • Lowers total memory usage in environments that have frequently changing targets or series churn

Cons

  • May cause more duplicate series records in the WAL: new series records will be created for targets which have flapping series that are scraped very infrequently. Being able to disable this enhancement may be a suitable workaround.

Memory Impact

Let's imagine a Prometheus target that returns a unique set of 1000 series every 60 seconds, and is scraped every 15 seconds.

Today, when the GC runs after the first hour, there will be 60k active series, and roughly all of them will be marked for deletion. During the second GC, after another hour, there will be a total of 120k active series, and the marked series will be deleted, bringing the total back down to 60k. The active series will continue fluctuating between 60k and 120k for this target for the process lifetime.

With this proposal, series will be marked as stale on subsequent scrapes, and deleted within 10 minutes. After 5 minutes, there will be 5,000 active series. The active series will continue fluctuating between 5,000 and 10,000 for this target for the process lifetime.

Summary

We are not able to directly correlate specific memory improvements to this change; too many factors determine the cost of an individual series. Similarly, we are not able to say whether this would improve agent memory usage in the general case.

We will only be able to say that this change deletes stale series roughly 12x faster than before, which may lower memory usage in some cases.

Upstreaming to Prometheus Agent

This change is something that Prometheus Agent can also benefit from. While we are still working on unifying the code and using Prometheus Agent directly, I propose we implement this downstream first to experiment, and then upstream once we know whether it's a good idea.

scraping errors should be logged with `logLevel: info`

Far too often, I realized some metrics were missing, because grafana-agent had no permissions to access the target (grafana/agent#1186), or because some paths were wrong.

I don't really want to run grafana-agent with logLevel: debug, as that's also spamming healthchecks into the logs.

I feel like failures to scrape explicitly configured targets should be logged at info (or even warn), not debug.

Proposal: Add conditional expressions to Alloy syntax

Alloy syntax currently does not support conditional expressions. Adding them would bring the language a step closer to being able to solve general-purpose problems and provide flexibility in the way that users write Alloy configuration files.

Some things to consider when choosing a design:

  • Syntax familiarity for users
  • Readability and terseness, both when reading and writing Alloy configurations
  • Ease of maintenance as a language feature

Current

Option 1: Do nothing

One could argue that Alloy does not need conditional expressions, and that all conditional logic should be moved either to the configuration management layer, or to the component layer where the component would decide its behavior based on some external stimuli.

Option 2: Add Templating conditionals

An option would be to add conditionals that don’t resolve to a value, but rather define what the text of the final config should render to.
Ignoring the specific syntax, the idea is presented below

local.file "LABEL" {
  filename = FILE_NAME
  {{ if env("POLL_FREQUENCY") != "" }}
  poll_frequency = env("POLL_FREQUENCY")
  {{ else }}
  poll_frequency = "1m"
  {{ end }}
}

The rendering would then need to happen first, before the config file gets evaluated.

This option makes it easier to work when the conditions dictate what kind of blocks should be used, but it is not the only way to go about it if we make use of the enabled reserved keyword.

Option 3: Add Value conditionals

The other option would be to add conditionals that resolve to a value and are handled at evaluation time.

Ignoring the specific syntax, the idea is presented below

local.file "LABEL" {
  filename = FILE_NAME
  poll_frequency = {{ if env("POLL_FREQUENCY") != "" }} env("POLL_FREQUENCY") {{ else }} "1m" {{ end }}
}

This option could be enabled by various syntax implementations.

Option 3a: Ternary expressions

One option to implement conditionals is to do something similar to HCL’s implementation of ternary expressions, for either single-line or multiline statements.

attr = condition ? true_val : false_val
attr = (condition) ?
  (some
  multiline
  statement) : (another
   multiline)

We don’t currently use the ‘?’ or ‘:’ tokens in Alloy, so it might be relatively easy from an implementation standpoint to distinguish their use in conditionals.

Option 3b: Standard Library function

Another option is to implement conditionals by adding a new function in Alloy's stdlib. The function signature would look like:

if(condition, true_val, false_val)

Alloy already supports multi-line statements in function calls, so the implementation effort would be easier. It also provides a useful precedent of new features being available as stdlib functions, similar to how map/reduce/filter might be in the near future.

Option 3c: Multiline conditional statements

A third option is to implement conditionals in multiline statements, similar to bash's if/fi blocks. We could use curly braces to denote where statements begin and end, but I'd discourage this as it might be confusing with regular Alloy blocks when scanning a configuration file; if we go this route we should depend on specific keywords or other symbols instead.

if (condition) 
  ret1
elseif (condition_2)
  ret2
else 
  ret
endif

Option 3d: Switch-like statements

A fourth option would be to implement conditionals in pseudo-switch blocks by evaluating conditions in order and returning the statement defined in the first one that evaluates to true, similar to what Erlang does.

if
condition ->  statement#1
condition2 -> statement#2
condition3 -> statement#3
true -> final_statement
end

Proposal

Regarding the three main approaches:

  • Doing nothing: My personal opinion is that the dynamic nature of Alloy lends itself to conditional expressions; adding them will simplify configuration files, and give more flexibility to power users. This option might keep the Alloy syntax smaller as a language, but it moves the goalposts and makes it rely on an external configuration system.
  • Template rendering: My personal opinion is that this is not an approach we should move towards, as the Alloy syntax is a programming language and not a templating language. Configuration errors would be harder to diagnose and would only manifest at evaluation time. And while this approach seems to make it easier to conditionally change the type of blocks to use (eg. prometheus.* vs otelcol.*), I believe that it will make it harder to read and reason about.
  • Returning values: My personal opinion is that this is the most flexible of our options. It makes it the easiest to reason about a configuration file, and can be reused in different contexts, from the selection of which blocks to activate, to general Alloy expressions that in the future might be able to perform relabeling rules, or modifying log lines.

My proposal is to go with Option 3, and implement a type of Value conditionals.

Syntax Proposal

After looking through the syntax options, they all come with pros and cons.
The HCL-like format might be familiar to users, but requires some work in the Alloy syntax parser.
The stdlib function would probably be the easiest to implement.
For both of them, writing complex nested conditions is harder.

The multiline conditionals and switch-like statements, on the other hand, make it easier to evaluate multiple conditions, but they don't exactly fit the HCL-inspired nature of the language. That said, this should not discourage us from going forward with what we think is best for Alloy itself.

I propose we go with Option 3d and the if/elseif/end statements.

Example usage

Using the proposed solution will enable us to use the following configuration:

local.file "apikey" {
  filename  = "/var/data/my-api-key.txt"
  is_secret = if env("CLUSTER") == "prod" then "true" else "false" end
}

Implementation

The implementation consists of updating the Alloy syntax parser to recognize these new tokens to build the expression. The conditions in each branch should be valid Alloy expressions that result in a boolean value. The result values may be of any type (either an Alloy builtin type or a capsule), but they must all be of the same type, so that Alloy understands what type the whole conditional expression will return.

Let me know how this sounds to you!

syntax: require all capsule values to implement marker interface

Today, a Go value maps to an Alloy syntax capsule when any of the following are true:

  1. The Go type implements the capsule marker interface
  2. It is a Go interface type but not interface{}
  3. It is a Go map which is not keyed by string
  4. It does not map to an Alloy null, number, string, bool, object, or function.

These rules were intended to make capsules feel "magical," but they have historically caused bugs (#2361, grafana/agent#2362) and are generally confusing to remember (I even had to look up the rules again just when writing this). The biggest issue with capsules is that capsule information can get lost if a type is cast to an interface{}, which can happen in many situations, including storing a capsule value in a *vm.Scope as a variable.

About a month ago I suggested changing these rules, and this issue is a more formal proposal of that suggestion.

Proposal

I propose that:

  1. Go types only map to an Alloy capsule if they implement the capsule marker interface
  2. We introduce a high-level API to encapsulate any Go value
  3. Go types which do not map to any Alloy type now cause an encoding error (rather than defaulting to a capsule type)

This proposal only affects developers building components on top of Alloy. Users of Alloy are not impacted.

The new API looks like this:

package syntax

// Capsule wraps around a value and marks it as an Alloy capsule, 
// allowing arbitrary Go types to be passed around in Alloy. 
//
// A manual capsule type can be created by implementing the 
// non-pointer receiver AlloyCapsule method on a type.    
type Capsule[T any] struct {
  val T 
} 

// Encapsulate creates a new Capsule[T] for a value.
func Encapsulate[T any](v T) Capsule[T] {
  return Capsule[T]{val: v} 
} 

// NewCapsule creates a new *Capsule[T] for a value. This is 
// useful when capsules values are optional. 
func NewCapsule[T any](v T) *Capsule[T] {
  return &Capsule[T]{val: v} 
}

// AlloyCapsule marks Capsule[T] as a capsule type. 
func (c Capsule[T]) AlloyCapsule() {} 

// Get returns the underlying value for the Capsule. If 
// the Capsule is nil, Get returns the zero value for the 
// underlying type. 
func (c *Capsule[T]) Get() T {
  if c == nil {
    var zero T 
    return zero 
  }
  return c.val 
}

Example usage

Using the new proposed API above, the exports for prometheus.remote_write change to:

package remotewrite 

... 

type Exports struct {
  Receiver syntax.Capsule[storage.Appendable] `alloy:"receiver,attr"`
}

The arguments type for types which accept capsules would also change:

package scrape // prometheus.scrape 

... 

type Arguments struct {
  ForwardTo []syntax.Capsule[storage.Appendable] `alloy:"forward_to,attr"`

  ...
}
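
For completeness, here is a sketch of the "manual capsule" route mentioned in the API comment above (the component and type names are hypothetical): a component-owned type can opt into being a capsule by implementing the marker method itself rather than being wrapped in Capsule[T].

package output // hypothetical component package

// Receiver is an example component-owned type.
type Receiver struct {
  // component-specific fields
}

// AlloyCapsule is the non-pointer receiver marker method from the proposal;
// implementing it is what makes Receiver map to an Alloy capsule.
func (Receiver) AlloyCapsule() {}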

Discussion

This proposal does come at a cost: developing against Alloy becomes slightly more tedious for capsule values, as it removes the magic introduced in the initial implementation.

However, the new rules are much easier to remember, and the new API should offload most of the tedium introduced by the new rules. This proposal even potentially makes Alloy easier to understand for developers reading the code, as it now must be explicit when something is being encapsulated.
