
beam-telemetry / telemetry_metrics_prometheus_core

34 stars · 5 watchers · 25 forks · 138 KB

Core Prometheus Telemetry.Metrics Reporter package for telemetry_metrics_prometheus

License: Apache License 2.0

Elixir 100.00%
Topics: elixir, telemetry, observability, metrics, monitoring, prometheus, reporter

telemetry_metrics_prometheus_core's People

Contributors

aedwardg, akoutmos, bryannaegele, c-brenn, david-klemenc, eellson, gazler, george124816, jamcito, mikaak, nathancyam, thiamsantos, tobias-tress, wojtekmach


telemetry_metrics_prometheus_core's Issues

resetting sum/counter metrics

Hello,
we are switching to this library for metrics exposure in an application that previously used https://hex.pm/packages/prometheus.
In doing so we ran into usage scenarios that we can't handle at the moment (please correct me if I am wrong).

The biggest problem is when a sanity check triggers, which leads to sum/counter metrics being re-initialized with absolute values.
For sum metrics we found an (ugly) workaround: we pull the metric value from the ETS table, calculate the delta to our desired absolute value, and emit an event that will have the desired effect.
For counter metrics this approach is not possible (but it's also fine to use a sum with prometheus_type: counter in this case).

So we would really like to be able to simply set an absolute value for sum and counter metrics.
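For reference, the workaround amounts to computing the delta against the stored value; a minimal sketch (the table handle, key shape, and function name are assumptions for illustration, not this library's internals):

```elixir
defmodule SumReset do
  # Hypothetical sketch of the delta workaround described above.
  # `table` and `key` shapes are assumptions, not this library's internals.
  #
  # Returns the delta that, when emitted as a telemetry measurement
  # (e.g. via :telemetry.execute/3), drives the stored sum to `absolute`.
  def delta_for(table, key, absolute) do
    current =
      case :ets.lookup(table, key) do
        [{^key, value}] -> value
        [] -> 0
      end

    absolute - current
  end
end
```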

Thanks for your attention to this topic,
Ole

Logger.warn/1 is deprecated

While building I get a few warnings about deprecated uses of Logger.warn/1:

warning: Logger.warn/1 is deprecated. Use Logger.warning/2 instead
  lib/core/aggregator.ex:58: TelemetryMetricsPrometheus.Core.Aggregator.filter_and_drop_time_series_with_bad_tag_values/2
warning: Logger.warn/1 is deprecated. Use Logger.warning/2 instead
  lib/core/registry.ex:196: TelemetryMetricsPrometheus.Core.Registry.register_metrics/2
warning: Logger.warn/1 is deprecated. Use Logger.warning/2 instead
  lib/core/registry.ex:203: TelemetryMetricsPrometheus.Core.Registry.register_metrics/2

Can I make a PR?
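The fix is mechanical at each call site; roughly (a sketch, not the exact library code):

```elixir
require Logger

# Before (deprecated since Elixir 1.15):
# Logger.warn("tags #{inspect(tags)} not found in event metadata")

# After:
Logger.warning("tags not found in event metadata")
```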

persistence when supervisor restarts / down

Today this library uses :ets. Would it be possible to persist metric values across supervisor restarts? When deploying with Kubernetes, pods restart, the applications restart with them, and the metrics held in :ets are reset. We could keep using :ets but also save the metrics to DETS, and read them back from DETS when the supervisor starts.
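A minimal sketch of that idea, using plain OTP (`:ets.to_dets/2` and `:dets.to_ets/2`); the file path and table names here are assumptions, not part of this library:

```elixir
defmodule MetricsPersistence do
  # Sketch: copy the metrics ETS table to a DETS file on shutdown and
  # restore it on startup. Path and table names are illustrative only.

  def save(ets_table, path) do
    {:ok, dets} = :dets.open_file(:metrics_backup, file: String.to_charlist(path))
    :ok = :dets.delete_all_objects(dets)
    _ = :ets.to_dets(ets_table, dets)
    :dets.close(dets)
  end

  def restore(ets_table, path) do
    {:ok, dets} = :dets.open_file(:metrics_backup, file: String.to_charlist(path))
    _ = :dets.to_ets(dets, ets_table)
    :dets.close(dets)
  end
end
```

A real implementation would likely call save/2 from the reporter's terminate callback and restore/2 during init.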

`distribution` assumes we are collecting a histogram of time

Hi! I'm trying to collect a histogram of the size in bytes of a piece of data in addition to some other timing metrics. When I do this I get warnings about both inconsistent units and not using seconds. This seems incorrect to me as I'm collecting something that's not time-based. I'm relatively new to Prometheus but AFAICT it's not unexpected to use the histogram metric type to collect non-timing data.

As an example:

def metrics do
  [
    distribution("my_app.response_sent.body_size",
      buckets: [0.1, 1.0, 10.0],
      unit: {:byte, :kilobyte}
    ),
    distribution("my_app.response_sent.duration",
      buckets: [0.01, 0.1, 1.0, 10.0],
      unit: :second
    )
  ]
end

This results in the following warning logs:

15:30:10.285 [warn] Multiple time units found in your Telemetry.Metrics definitions.

Prometheus recommends using consistent time units to make view creation simpler.

You can disable this validation check by adding `consistent_units: false` in the validations options on reporter init.
15:30:10.285 [warn] Prometheus requires that time units MUST only be offered in seconds according to their guidelines, though this is not always practical.

https://prometheus.io/docs/instrumenting/writing_clientlibs/#histogram.

You can disable this validation check by adding `require_seconds: false` in the validations options on reporter init.

On the whole I appreciate the warnings, but it would be nice to be able to disable them at the level of an individual metric.
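For anyone hitting this in the meantime, both checks can be turned off globally at reporter init, using the option names the warnings themselves mention (the exact call shape may differ from your setup):

```elixir
# Disable both unit validations globally at reporter init.
# Option names are taken from the warning messages above.
TelemetryMetricsPrometheus.Core.init(metrics(),
  validations: [consistent_units: false, require_seconds: false]
)
```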

`child_spec` specs don't take `metrics` into account

If doing

  def child_spec(_) do
    TelemetryMetricsPrometheus.Core.child_spec(metrics: metrics())
  end

Dialyzer complains about no local return, because the spec for Core.child_spec/1 doesn't mention metrics:

@spec child_spec(prometheus_options()) :: Supervisor.child_spec()

The culprit is the prometheus_options() type, which is much more restrictive than what, for example, start_link/1 will actually accept.

No option to delete value related to specific set of tags in ETS table

There is no way to delete an existing entry in the ETS table. For example, if I have a sum metric with tags, there is no way to remove the value associated with a specific set of tags. As a result, the size of the reports generated during scrapes can only grow, and values that are no longer needed cannot be removed from these reports.
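If the storage layout matches the `key = {config.name, labels}` shape quoted in another issue on this page, dropping one time series would be a plain `:ets.delete/2`; a sketch under that assumption:

```elixir
# Assumed key shape: {metric_name, labels_map}, mirroring the Sum handler.
table = :ets.new(:metrics, [:set, :public])
:ets.insert(table, {{:http_requests_total, %{path: "/old"}}, 10})
:ets.insert(table, {{:http_requests_total, %{path: "/new"}}, 4})

# Remove the time series for one specific set of tags.
:ets.delete(table, {:http_requests_total, %{path: "/old"}})
```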

Add exception handling when formatting labels

It is possible for labels to have values that do not implement the String.Chars protocol, such as Ecto structs. If one of these values is encountered, the entire scrape fails.

We should add exception handling around this operation, log a warning that it occurred, and continue with the export.
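A sketch of what that guard could look like (hypothetical helper name; it falls back to inspect/1 so the scrape survives):

```elixir
defmodule LabelFormat do
  require Logger

  # Hypothetical helper: format a label value, rescuing the
  # Protocol.UndefinedError raised when String.Chars is not implemented.
  def format_label(value) do
    to_string(value)
  rescue
    Protocol.UndefinedError ->
      Logger.warning("label value #{inspect(value)} has no String.Chars impl; using inspect/1")
      inspect(value)
  end
end
```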

sum of float fails because of ets:update_counter/4 usage

I have a sum of request durations from Plug.Telemetry, defined more or less like this:

sum(
  "http.request.duration.seconds.total",
  event_name: "phoenix.endpoint.stop",
  measurement: :duration,
  unit: {:native, :second}
)

Telemetry.Metrics converts the native time integer to a float.
I think the behaviour makes sense.
The alternative of just using the output from System.convert_time_unit/3 would be that most of my measurements would be 0s because they are sub-second.

However, this fails because the Sum module's handle_event/4 uses :ets.update_counter/4, which only works with integers.
See line 61:

def handle_event(_event, measurements, metadata, config) do
  with true <- EventHandler.keep?(config.keep, metadata),
       {:ok, measurement} <-
         EventHandler.get_measurement(measurements, metadata, config.measurement),
       mapped_values <- config.tag_values_fun.(metadata),
       :ok <- EventHandler.validate_tags_in_tag_values(config.tags, mapped_values) do
    labels = Map.take(mapped_values, config.tags)
    key = {config.name, labels}
    _res = :ets.update_counter(config.table, key, measurement, {key, 0})
    :ok
  else
    false -> :ok
    error -> EventHandler.handle_event_error(error, config)
  end
end

I'm not quite sure how to attack this problem.
If the conversion happened at render time instead of at storage time this wouldn't be an issue, but that only solves the :native vs. :second case.
In general, I think it should be possible to sum any numeric value, not just integers.
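One possible direction, sketched below: a read-modify-write that accepts any number. Note this loses the atomicity of :ets.update_counter/4, so a real fix would need to serialize writes through one process (or use :ets.select_replace/2):

```elixir
defmodule FloatSum do
  # Sketch: accumulate any numeric by lookup + insert. Unlike
  # :ets.update_counter/4 this is NOT atomic under concurrent writers.
  def add(table, key, measurement) when is_number(measurement) do
    current =
      case :ets.lookup(table, key) do
        [{^key, value}] -> value
        [] -> 0
      end

    :ets.insert(table, {key, current + measurement})
    :ok
  end
end
```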

Prometheus metric name via reporter_options ?

Instead of using a custom metric name and overriding event_name and measurement, what about accepting a name in reporter_options?

From:

Metrics.distribution("prometheus_metrics.scrape.duration.milliseconds",
  event_name: [:prometheus_metrics, :plug, :stop],
  measurement: :duration,
  unit: {:native, :second},
  reporter_options: [
    buckets: [0.05, 0.1, 0.2, 0.5, 1]
  ]
)

To:

Metrics.distribution("prometheus_metrics.plug.stop",
  unit: {:native, :second},
  reporter_options: [
    name: "prometheus_metrics.scrape.duration.milliseconds",
    buckets: [0.05, 0.1, 0.2, 0.5, 1]
  ]
)

telemetry_poller probably not necessary

Hi @bryannaegele! The lib looks great. The only feedback I have is that the dependency on telemetry_poller is probably not necessary. Recent versions of telemetry start their own process to perform measurements, and as long as the user depends on it and defines the proper metrics, the measurements will reach this lib.

Issues with ~S sigil on Elixir main branch

I have issues compiling this part:

defp escape(value) do
  value
  |> to_string()
  |> String.replace(~S("), ~S(\"))
  |> String.replace(~S(\\), ~S(\\\\))
  |> String.replace(~S(\n), ~S(\\n))
end

defp escape_help(value) do
  value
  |> to_string()
  |> String.replace(~S(\\), ~S(\\\\))
  |> String.replace(~S(\n), ~S(\\n))
end
using Elixir's main branch.

Using git bisect I narrowed down the issue to this commit: elixir-lang/elixir@51d23cb

It looks like now ~S does not work as it did before and some reworks are needed to avoid compile-time errors:

== Compilation error in file lib/core/exporter.ex ==
** (MismatchedDelimiterError) mismatched delimiter found on lib/core/exporter.ex:116:3:
    error: unexpected reserved word: end
    114 │   |> String.replace(~S(\\), ~S(\\\\))
        │                        └ unclosed delimiter
    115 │   |> String.replace(~S(\n), ~S(\\n))
    116 │ end
        │     └ mismatched closing delimiter (expected ")")

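One way to sidestep the sigil change entirely is to drop ~S in favour of plain double-quoted literals; a sketch (it also escapes backslashes before quotes, so the backslash inserted for a quote is not escaped a second time):

```elixir
defmodule Exporter do
  # Sketch of escape/1 without ~S sigils. Escaping "\\" first means the
  # backslashes added by the quote replacement are left untouched.
  def escape(value) do
    value
    |> to_string()
    |> String.replace("\\", "\\\\")
    |> String.replace("\"", "\\\"")
    |> String.replace("\n", "\\n")
  end
end
```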
Initialising counter metrics

Please forgive me if this is a dumb question - I'm still learning my way around metrics in general and Prometheus in particular.

Would it make sense to have, at least as an option, the ability to initialise counter metrics to 0 when they're first set up? As it stands (if my reading is correct), they return no value until the first time they're incremented. That leads to a not-entirely-intuitive result when you later use the increase function in PromQL: the very first event doesn't count, because there is no "increase" between absent and 1.

Question: Design Considerations

Thanks for making this library. The abstractions seem really great. We are looking at using this library in our enterprise systems and I have a few design questions.

Multiple Handler Processes

Have you considered using a dynamic supervisor and spawning a child for every metric that is subscribed to? I'm thinking along the lines of multiple Registry children instead of just one.

My reasoning is that every call to :telemetry.execute/3 is synchronous and has to wait for the registry to process the message. A high volume of emitted events could cause a call in Plug to block, slowing down web request response times.

Maybe this is a non-issue because event handling is so efficient that it never gets bottlenecked in practice. I'm just curious.

Best-effort reporting

When an event is handled, it looks like it writes to an :ets table that has {:write_concurrency, true}, so it should be extremely fast. The only vectors I can see that could potentially cause an issue are the :keep and :tag_values options:

Telemetry.Metrics.counter("http.request.count",
  keep: fn _metadata ->
    # something with high latency, resource consumption, or error prone 
    true
  end,
  tag_values: fn metadata ->
    # something with high latency, resource consumption, or error prone 
    metadata
  end
)

I'm not sure of the likelihood of this happening, but Murphy's Law. Given that developers could hurt themselves here if they are not aware of the full impact of those functions, have you considered reporting metrics in a best-effort fashion?

My thought would be something along the lines of spawning a task using Task.Supervisor.async_nolink/2 immediately when an event is received. My reasoning is that I would prefer my application to continue to respond to web requests quickly regardless of my metrics being reported.

If a developer did do something resource intensive in those functions I'm thinking it would eventually manifest itself in an OOM or some other issue, but response latency would not be affected until a complete crash. The host application would be completely protected from a failed task because of a runtime exception or a timeout.
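The hand-off described above could look roughly like this (a sketch; the supervisor name and handler body are assumptions):

```elixir
# Sketch: run the handler body in an unlinked task so a slow
# keep/tag_values function cannot block the emitting process.
{:ok, _sup} = Task.Supervisor.start_link(name: MetricsTasks)

handle = fn measurements ->
  Task.Supervisor.async_nolink(MetricsTasks, fn ->
    # keep/tag_values and the ETS write would run here
    Map.fetch!(measurements, :count)
  end)
end

task = handle.(%{count: 1})
```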

Sorry for the long post and thank you if you made it this far reading. I'm curious what your thoughts are on these questions -- even if for my own learning.

Thanks,
Spencer

Add support for Pushgateway

Hello,

Is there any plan to add support for sending metrics to a Pushgateway? I have a few use cases for this including:

  • batch jobs which may last less than the scrape-interval.
  • emitting an application_started counter metric (it is hard to do this right now because if my app crashes on startup before the first scrape the metric is never consumed by Prometheus).

I am not 100% sure where this functionality belongs, I originally started in the prom_ex library and made my way down to here.

I can implement this if someone can point me in the right direction. I believe we just need to get an extra config value into the Counter/Sum/LastValue/Distribution modules' handle_event functions, which would indicate whether the metric should be written to ETS as it is now, or sent to the Pushgateway. I just don't know where that extra config value should be declared.

Memory leak

Hi. We are using your library in our project, and consider it helpful. Thank you.

However, we found a memory leak: if nobody ever asks for metrics through the API, the dist ETS table is never cleaned up.

We had 1 GB of data and 3M records in the dist table before noticing it.

Using the same default buckets as other libraries

Hey all,

I would like to suggest that we use the same default buckets as other client libraries.

Python

https://github.com/prometheus/client_python/blob/5a5261dd45d65914b5e3d8225b94d6e0578882f3/prometheus_client/metrics.py#L544

DEFAULT_BUCKETS = (.005, .01, .025, .05, .075, .1, .25, .5, .75, 1.0, 2.5, 5.0, 7.5, 10.0, INF)

Golang

https://github.com/prometheus/client_golang/blob/66837e3298bdc57a828794c23bacb253ed8c04cd/prometheus/histogram.go#L68

DefBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}

Java

https://github.com/prometheus/client_java/blob/6ae0c961cfc19b8ef0575c98ac82ee55b11002ed/simpleclient/src/main/java/io/prometheus/client/Histogram.java#L88

private double[] buckets = new double[] { .005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10 };

Overlap

Value Duration Python Golang Java
0.005 5ms Yes Yes Yes
0.01 10ms Yes Yes Yes
0.025 25ms Yes Yes Yes
0.05 50ms Yes Yes Yes
0.075 75ms Yes No Yes
0.1 100ms Yes Yes Yes
0.25 250ms Yes Yes Yes
0.5 500ms Yes Yes Yes
0.75 750ms Yes No Yes
1.0 1s Yes Yes Yes
2.5 2.5s Yes Yes Yes
5.0 5s Yes Yes Yes
7.5 7.5s Yes No Yes
10.0 10s Yes Yes Yes
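In Elixir this could land as a module attribute; a sketch using the Python set, which is a superset of the Go and Java sets (the +Inf bucket is implicit in Prometheus client libraries, so it is omitted here):

```elixir
defmodule Defaults do
  # Hypothetical default, mirroring client_python's DEFAULT_BUCKETS
  # (without the trailing INF, which clients add implicitly).
  @default_buckets [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75,
                    1.0, 2.5, 5.0, 7.5, 10.0]

  def default_buckets, do: @default_buckets
end
```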
