beam-telemetry / telemetry_metrics_prometheus_core
Core Prometheus Telemetry.Metrics Reporter package for telemetry_metrics_prometheus
License: Apache License 2.0
Hello,
we are switching to this library for metrics exposure with an application that used https://hex.pm/packages/prometheus beforehand.
And by doing so we ran into usage scenarios that we can't handle at the moment (please correct me if I am wrong).
The biggest problem is when a sanity-check triggers which leads to sum/counter metrics getting re-initialized with absolute values.
For sum metrics we found an (ugly) workaround: we pull the metric value from the ETS table, calculate the delta to our desired absolute value, and emit an event that has the desired effect.
For counter metrics this approach is not possible (but it's also fine to use sum with prometheus_type: counter in this case).
So we would really like to be able to simply set an absolute value for sum and counter metrics.
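The workaround described above can be sketched roughly as follows. This is only an illustration, assuming atom keys and a {key, value} record shape; the reporter's real ETS schema is internal, and SumReset is a hypothetical module name.

```elixir
defmodule SumReset do
  # Compute how much must be added to reach `target`, given the stored value.
  def delta(table, key, target) do
    current =
      case :ets.lookup(table, key) do
        [{^key, value}] -> value
        [] -> 0
      end

    target - current
  end

  # Emit a telemetry event whose measurement is exactly that delta, so the
  # reported sum lands on the absolute value we want.
  def set_absolute(table, key, event, measurement, target) do
    :telemetry.execute(event, %{measurement => delta(table, key, target)}, %{})
  end
end
```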
Thanks for your attention to this topic,
Ole
While building, I get a few warnings about deprecated use of Logger.warn:
warning: Logger.warn/1 is deprecated. Use Logger.warning/2 instead
lib/core/aggregator.ex:58: TelemetryMetricsPrometheus.Core.Aggregator.filter_and_drop_time_series_with_bad_tag_values/2
warning: Logger.warn/1 is deprecated. Use Logger.warning/2 instead
lib/core/registry.ex:196: TelemetryMetricsPrometheus.Core.Registry.register_metrics/2
warning: Logger.warn/1 is deprecated. Use Logger.warning/2 instead
lib/core/registry.ex:203: TelemetryMetricsPrometheus.Core.Registry.register_metrics/2
Can I make a PR?
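For reference, the change is mechanical; a sketch (the message text here is illustrative, not the library's actual log string):

```elixir
require Logger

# Before (deprecated since Elixir 1.15):
# Logger.warn("Dropping event")

# After -- Logger.warning/2 accepts the message plus optional metadata:
Logger.warning("Dropping event")
```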
Today this library uses :ets. Would it be possible to persist metric values across supervisor restarts? When deploying with Kubernetes, pods restart, the applications with them, and the metrics in :ets are reset. I think we could keep using :ets and also save metrics to DETS, reading them back from DETS when the supervisor starts.
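The idea above could be sketched like this: snapshot the metrics ETS table to DETS on shutdown and restore it on startup. The module, table, and file names are illustrative assumptions, not the library's API.

```elixir
defmodule MetricsPersistence do
  @dets_file ~c"metrics_snapshot.dets"

  # Load any previously saved metrics into the live ETS table.
  def restore(ets_table) do
    {:ok, dets} = :dets.open_file(:metrics_snapshot, file: @dets_file)
    :dets.to_ets(dets, ets_table)
    :dets.close(dets)
  end

  # Copy the current ETS contents into the DETS file.
  def snapshot(ets_table) do
    {:ok, dets} = :dets.open_file(:metrics_snapshot, file: @dets_file)
    :dets.from_ets(dets, ets_table)
    :dets.close(dets)
  end
end
```

Note that on Kubernetes the snapshot file would have to live on a persistent volume, and that restoring counter values after a restart changes how Prometheus perceives counter resets.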
Hi! I'm trying to collect a histogram of the size in bytes of a piece of data in addition to some other timing metrics. When I do this I get warnings about both inconsistent units and not using seconds. This seems incorrect to me as I'm collecting something that's not time-based. I'm relatively new to Prometheus but AFAICT it's not unexpected to use the histogram metric type to collect non-timing data.
As an example:
def metrics do
  [
    distribution("my_app.response_sent.body_size",
      buckets: [0.1, 1.0, 10.0],
      unit: {:byte, :kilobyte}
    ),
    distribution("my_app.response_sent.duration",
      buckets: [0.01, 0.1, 1.0, 10.0],
      unit: :second
    )
  ]
end
This results in the following warning logs:
15:30:10.285 [warn] Multiple time units found in your Telemetry.Metrics definitions.
Prometheus recommends using consistent time units to make view creation simpler.
You can disable this validation check by adding `consistent_units: false` in the validations options on reporter init.
15:30:10.285 [warn] Prometheus requires that time units MUST only be offered in seconds according to their guidelines, though this is not always practical.
https://prometheus.io/docs/instrumenting/writing_clientlibs/#histogram.
You can disable this validation check by adding `require_seconds: false` in the validations options on reporter init.
On the whole I appreciate the warnings, so it would be nice to be able to disable them at the level of an individual metric.
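Per the warning text, both checks can currently only be disabled globally at reporter init. A sketch, assuming the validations option accepts the keys named in the warnings above:

```elixir
TelemetryMetricsPrometheus.Core.child_spec(
  metrics: metrics(),
  validations: [consistent_units: false, require_seconds: false]
)
```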
If doing
def child_spec(_) do
  TelemetryMetricsPrometheus.Core.child_spec(metrics: metrics())
end
Dialyzer complains about no local return, since the spec for Core.child_spec doesn't mention any metrics:
telemetry_metrics_prometheus_core/lib/core.ex, line 141 (at c3cc031)
The problem is the prometheus_option type, which is far more restrictive than what, for example, start_link will accept.
There is no option to delete an existing entry in the ETS table. For example, if I have a sum metric with some tags, there is no way to remove the value associated with a specific set of tags. Because of that, the size of the reports generated during scrapes can only grow; values that are no longer needed cannot be removed from those reports.
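If such an option were added, it might boil down to a match-delete against the internal table. The {name, tags} key shape below is purely an assumed illustration, not the reporter's actual schema:

```elixir
# Hypothetical: remove every stored value for a given metric name whose key
# carries a specific tag set.
:ets.match_delete(table, {{metric_name, tags}, :_})
```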
It is possible for labels to have values that do not implement the String.Chars protocol, such as Ecto structs. If one of these values is encountered, the entire scrape fails to be processed.
We should add exception handling around this operation, logging a warning that it occurred and continuing with the export.
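The suggested guard could look something like this sketch: render one label value defensively, logging a warning and substituting a placeholder when String.Chars is not implemented (SafeLabel and the placeholder string are assumed names, not the library's code):

```elixir
defmodule SafeLabel do
  require Logger

  # Convert a label value to a string, falling back to a placeholder instead
  # of letting a Protocol.UndefinedError abort the whole scrape.
  def format(value) do
    to_string(value)
  rescue
    Protocol.UndefinedError ->
      Logger.warning("Label value has no String.Chars impl: #{inspect(value)}")
      "<unrenderable>"
  end
end
```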
I have a sum over the request duration from Plug.Telemetry, defined more or less like this:
sum(
  "http.request.duration.seconds.total",
  event_name: "phoenix.endpoint.stop",
  measurement: :duration,
  unit: {:native, :second}
)
Telemetry.Metrics converts the native time integer to a float. I think that behaviour makes sense: the alternative of just using the output of System.convert_time_unit/3 would mean most of my measurements come out as 0, because they are sub-second.
However, this fails because the Sum module's handle_event/4 uses :ets.update_counter/4, which only works with integers. See line 61 of telemetry_metrics_prometheus_core/lib/core/sum.ex (lines 53 to 67 at cd8bea3).
I'm not quite sure how to attack this problem. If the conversion happened when rendering instead of when storing, this wouldn't be an issue, but that only solves the :native vs. :second case. In general I think it should be possible to sum any numeric value, not just integers.
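One possible direction, not the library's actual implementation: :ets.select_replace/2 is atomic per object and its match-spec arithmetic handles floats, so it can stand in for :ets.update_counter/4. A sketch assuming a {key, value} record shape with atom keys:

```elixir
defmodule FloatSum do
  # Atomically add `delta` (integer or float) to the value stored under `key`,
  # inserting the first observation if the key is absent.
  def add(table, key, delta) when is_atom(key) and is_number(delta) do
    replaced =
      :ets.select_replace(table, [
        {{key, :"$1"}, [], [{{key, {:+, :"$1", delta}}}]}
      ])

    if replaced == 0 do
      # No existing row: insert it, or retry if another writer won the race.
      :ets.insert_new(table, {key, delta}) || add(table, key, delta)
    end

    :ok
  end
end
```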
Instead of using a custom metric name and overriding event_name and measurement, what about accepting a name in reporter_options?
From:
Metrics.distribution("prometheus_metrics.scrape.duration.milliseconds",
  event_name: [:prometheus_metrics, :plug, :stop],
  measurement: :duration,
  unit: {:native, :second},
  reporter_options: [
    buckets: [0.05, 0.1, 0.2, 0.5, 1]
  ]
)
To:
Metrics.distribution("prometheus_metrics.plug.stop",
  unit: {:native, :second},
  reporter_options: [
    name: "prometheus_metrics.scrape.duration.milliseconds",
    buckets: [0.05, 0.1, 0.2, 0.5, 1]
  ]
)
I'd like to add support for exemplars as in the OpenMetrics Specification.
Hi @bryannaegele! The lib looks great. The only feedback I have is that the dependency on telemetry_poller is probably not necessary: recent versions of telemetry start their own process to perform measurements, and as long as the user depends on it and defines the proper metrics, the measurements will reach this lib.
I have issues compiling this part: telemetry_metrics_prometheus_core/lib/core/exporter.ex, lines 108 to 121 (at a0e82ac).
Using git bisect I narrowed the issue down to this commit: elixir-lang/elixir@51d23cb
It looks like ~S no longer works as it did before, and some rework is needed to avoid compile-time errors:
== Compilation error in file lib/core/exporter.ex ==
** (MismatchedDelimiterError) mismatched delimiter found on lib/core/exporter.ex:116:3:
error: unexpected reserved word: end
│
114 │ |> String.replace(~S(\\), ~S(\\\\))
│ └ unclosed delimiter
115 │ |> String.replace(~S(\n), ~S(\\n))
116 │ end
│ └ mismatched closing delimiter (expected ")")
Please forgive me if this is a dumb question - I'm still learning my way around metrics in general and Prometheus in particular.
Would it make sense to have, at least as an option, the ability to initialise counter metrics to 0 when they are first set up? As it stands (if my reading of things is correct), they return no value until the first time they are incremented. That leads to a not-entirely-intuitive result when you later use the increase function in PromQL and the very first event doesn't count, because there is no "increase" between null and 1.
Thanks for making this library. The abstractions seem really great. We are looking at using this library in our enterprise systems and I have a few design questions.
Have you considered using a dynamic supervisor and spawning a child for every metric that is subscribed to? I'm thinking along the lines of multiple Registry children instead of just one.
My reasoning is that every call to :telemetry.execute is synchronous and has to wait for the registry to process the message. A high volume of emitted events could cause a call in Plug to block, slowing down web request response times.
Maybe this is a non-issue because event handling is so efficient that it never gets bottlenecked in practice. I'm just curious.
When an event is handled, it looks like it writes to an :ets table that has {:write_concurrency, true}, so it should be extremely fast. The only vectors I can see that could potentially cause an issue are the :keep and :tag_values options:
Telemetry.Metrics.counter("http.request.count",
  keep: fn _metadata ->
    # something with high latency, resource consumption, or error prone
    true
  end,
  tag_values: fn metadata ->
    # something with high latency, resource consumption, or error prone
    metadata
  end
)
I'm not sure of the likelihood of this happening, but Murphy's Law. Given that developers could hurt themselves here if they are not aware of the full impact of those functions, have you considered reporting metrics in a best-effort fashion?
My thought would be something along the lines of spawning a task using Task.Supervisor.async_nolink/2
immediately when an event is received. My reasoning is that I would prefer my application to continue to respond to web requests quickly regardless of my metrics being reported.
If a developer did do something resource-intensive in those functions, I imagine it would eventually manifest as an OOM or some other issue, but response latency would not be affected until a complete crash. The host application would be completely protected from a task that failed because of a runtime exception or a timeout.
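The best-effort idea above could be sketched as follows. Names are assumptions: MyApp.TaskSupervisor is a pre-started Task.Supervisor, and record_metric/4 stands in for the real keep/tag_values evaluation and ETS write.

```elixir
{:ok, _sup} = Task.Supervisor.start_link(name: MyApp.TaskSupervisor)

defmodule BestEffortHandler do
  def handle_event(event, measurements, metadata, config) do
    # async_nolink/2: a crash or timeout in the task cannot take down the
    # process that emitted the event.
    Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fn ->
      record_metric(event, measurements, metadata, config)
    end)

    :ok
  end

  # Placeholder for the actual metric recording.
  defp record_metric(_event, _measurements, _metadata, _config), do: :ok
end
```

One wrinkle: async_nolink sends the task's result back to the caller as a message, so a real handler would need to ignore or drain those replies; Task.Supervisor.start_child/2 may be a better fit for pure fire-and-forget.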
Sorry for the long post and thank you if you made it this far reading. I'm curious what your thoughts are on these questions -- even if for my own learning.
Thanks,
Spencer
Hello,
Is there any plan to add support for sending metrics to a Pushgateway? I have a few use cases for this, including:
an application_started counter metric (it is hard to do this right now because, if my app crashes on startup before the first scrape, the metric is never consumed by Prometheus).
I am not 100% sure where this functionality belongs; I originally started in the prom_ex library and made my way down to here.
I can implement this if someone can point me in the right direction. I believe we just need to get an extra config value into the Counter/Sum/LastValue/Distribution modules' handle_event functions, indicating whether the metric should be written to ETS as it is now or sent to the Pushgateway. I just don't know where that extra config value should be declared.
I noticed that summary is not yet supported.
Just to make sure: is this a possible DoS vector if your Prometheus is not scraping a node and you have distribution metrics on board? It seems so; am I wrong?
TODO:
I have scoured this repo, and it is not used.
Hi. We are using your library in our project, and consider it helpful. Thank you.
However, we found a memory leak: if nobody ever asks for metrics through the API, the ETS dist table is never cleaned up.
We had 1 GB of data and 3 million records in the dist table before noticing it.
Hey all,
I would like to suggest that we adopt a default bucket set matching what other client libraries use:
Python: DEFAULT_BUCKETS = (.005, .01, .025, .05, .075, .1, .25, .5, .75, 1.0, 2.5, 5.0, 7.5, 10.0, INF)
Go: DefBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
Java: private double[] buckets = new double[] { .005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10 };
Value | Duration | Python | Golang | Java |
---|---|---|---|---|
0.005 | 5ms | Yes | Yes | Yes |
0.01 | 10ms | Yes | Yes | Yes |
0.025 | 25ms | Yes | Yes | Yes |
0.05 | 50ms | Yes | Yes | Yes |
0.075 | 75ms | Yes | No | Yes |
0.1 | 100ms | Yes | Yes | Yes |
0.25 | 250ms | Yes | Yes | Yes |
0.5 | 500ms | Yes | Yes | Yes |
0.75 | 750ms | Yes | No | Yes |
1.0 | 1s | Yes | Yes | Yes |
2.5 | 2.5s | Yes | Yes | Yes |
5.0 | 5s | Yes | Yes | Yes |
7.5 | 7.5s | Yes | No | Yes |
10.0 | 10s | Yes | Yes | Yes |
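If this library adopted a default, an Elixir counterpart to the Python list (the superset in the table above) might look like this; @default_buckets is a hypothetical attribute name, and +Inf is implied by the histogram encoding rather than listed:

```elixir
@default_buckets [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75,
                  1.0, 2.5, 5.0, 7.5, 10.0]
```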