
Comments (13)

evgenyfedorov2 avatar evgenyfedorov2 commented on August 16, 2024 1

Yes, we will be moving forward. I've been busy with other things recently, and I still am, which is why there's been a delay.


geeknoid avatar geeknoid commented on August 16, 2024

@noahfalk This is as we discussed a little while ago.


noahfalk avatar noahfalk commented on August 16, 2024

Cool! I've got this flagged to come and take a closer look but it might be a little bit.


noahfalk avatar noahfalk commented on August 16, 2024

Thanks for putting this together!

A couple thoughts here:

  • For the timestamping issue, this sounds like something we'd want to firm up prior to doing major work. My main concern is that if we expect the commonly used ILoggerProviders (say ConsoleProvider and OTel's provider at least) to test the TState for some pre-existing timestamp information, that is going to impose a performance overhead on every log message being processed, regardless of whether the application is using the features proposed here. If we are asking all .NET logging users to take a perf regression so that a small fraction of those users can use new features, we should do due diligence to ensure that perf overhead is very small. If the mechanism to recognize pre-timestamped log messages requires any new shared APIs in M.E.L.A, then we should also ensure we are going to be able to ship that in .NET 9.

    Specific things to firm up:

    1. How do we expect ILoggerProviders to recognize and treat buffered data?
    2. What is the performance overhead for checking a TState to determine it isn't buffered?
    3. What APIs (if any) need to be added to M.E.L.A to make this work?
  • How do we expect the sampling API to interact with the filtering API?
    Is the idea that filters happen first, then sampling happens on what remains?
    If I want some data for some new category to be buffered, do I have to both add a filter rule to include it and a buffer rule to get it buffered, or is it sufficient to just add the buffering rule?

  • From an API design perspective, filtering and sampling seem related so could we make their usage appear more unified? For example if I wanted to include only Microsoft.Extensions.Hosting informational and all errors I could write this:

builder // ILoggingBuilder
        .AddFilter("Microsoft.Extensions.Hosting", LogLevel.Information)
        .AddFilter(null, LogLevel.Error);

If I wanted to select that same set of logs to be buffered and sampled could I do something like this?

builder // ILoggingBuilder
        .AddBuffer("MyBuffer", durationSecs:10, size: 1_000_000)
        .AddBufferedFilter("MyBuffer", "Microsoft.Extensions.Hosting", LogLevel.Information,
             samplingRate: 0.01)
        .AddBufferedFilter("MyBuffer", null, LogLevel.Error);               

I'm not tied to that approach at all - it's just an off-the-cuff example of what I am guessing would be possible. My broader hopes from the API design perspective:

  1. Can we make the buffering and sampling appear more unified with the filtering that is already there? Or, if you think making them appear unified will create confusion, let's discuss why.
  2. Can we make scenarios we expect to be common more compact? Reducing the number of lines of code, types, and method calls users need to interact with should make it easier to understand and author. This shouldn't preclude having verbose APIs available to handle more complex cases.

There is probably more fine-grained API feedback but I think we'd want to look at the high level stuff before the smaller stuff.

@tarekgh @samsp-msft - you guys also probably want to look at this.


tarekgh avatar tarekgh commented on August 16, 2024

I second @noahfalk on all the points he raised. One more thing: why are we combining filtering and buffering together? Buffering may be used in more scenarios than just filtering. And should we follow the pattern we used in redaction? I mean something like:

    builder.EnableRedaction(...)
               .EnableFiltering(....) // EnableSampler?
               .EnableBuffering(....)

We could also create a global filter class which can easily be registered to support global filtering. For request filtering, we can have a local filter, plus a setting to enable it in ASP.NET so it applies to anyone handling HTTP requests. This could simplify the APIs, I think, and remove ControlAction. Just thoughts!


samsp-msft avatar samsp-msft commented on August 16, 2024

I have a number of thoughts, but I think this also needs to be fit into the larger picture with OpenTelemetry.

Filtering based on EventID

I have heard from customers that the granularity of the logging levels is not sufficient, and that they would like to be able to collect specific log messages from a lower log level (Debug == 1, Critical == 5), or filter out higher-level messages that occur too frequently.

This should happen regardless of whether the log message is associated with a trace, and regardless of how frequently it is used. Ideally it is handled through configuration along with the rest of the log levels, so it provides an override of the more granular defaults.

Head-based sampling of per-request telemetry.

If your service is handling a large number of requests, collecting tracing and log messages from all of them uses up a lot of data. Rather than sampling each source of data independently, you want to sample them using the same algorithm based on the traceID. Each source can be given a percentage for what to keep. If the same hash algorithm is used to decide what to keep, then requests that fall under the lowest common denominator will have all their records, and the others will be filtered out as applicable.

Ideally, log messages that carry a traceid go through the same processing as the distributed tracing, so both the trace and its log messages will be emitted together. Because this is a head-based sampling algorithm, it doesn't depend on buffering and waiting for completion to be useful.
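The "same hash algorithm" idea above can be sketched roughly as follows. This is an illustrative sketch, not a real API: the `TraceIdSampler` type and the FNV-1a hash choice are assumptions for the example, though OTel's TraceIdRatioBased sampler is built on the same principle.

```csharp
using System.Diagnostics;

// Sketch: a consistent head-sampling decision keyed on the W3C trace ID,
// so logs and spans for the same request are kept or dropped together.
static class TraceIdSampler
{
    // Returns true when the trace falls inside the configured keep ratio.
    public static bool ShouldSample(ActivityTraceId traceId, double ratio)
    {
        // Map the trace ID to a stable value in [0, 1) and compare to the ratio.
        ulong hash = Fnv1a64(traceId.ToHexString());
        return (hash / (double)ulong.MaxValue) < ratio;
    }

    private static ulong Fnv1a64(string s)
    {
        ulong h = 14695981039346656037UL;
        foreach (char c in s)
        {
            h ^= c;
            h *= 1099511628211UL;
        }
        return h;
    }
}
```

Because each keep decision is a threshold comparison on the same hash, a signal sampled at 1% is a strict subset of one sampled at 5%, which gives the "lowest common denominator" behavior described above.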

Rate limiting

The classic Application Insights SDK has a sampling algorithm that supports rate-based limiting - so you can cap the tracing at 5 requests/sec (as an example). I believe it's a semi-adaptive algorithm, so it's not just the first 5 requests that will be emitted; it uses an average over a timespan to determine what to retain.

I think we need the same algorithm to be implemented as a processor for OpenTelemetry so it's not dependent on being used with a specific exporter. When Aspire apps are deployed to ACA, this mechanism could be used to ensure that the dashboard is not overloaded with results.
The great thing about this kind of algorithm is that it's unlikely to filter out records when doing local debugging (unless load testing), so you are not left wondering where your telemetry went.
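To make the shape of such a sampler concrete, here is a crude, illustrative sketch in the spirit of that semi-adaptive behavior. It is not the actual Application Insights algorithm; the class name, smoothing factor, and single-threaded design are all assumptions for the example.

```csharp
using System;

// Sketch: track an exponentially weighted average of the observed request
// rate and derive a keep probability so the emitted rate approaches the cap.
// Not thread-safe; a real implementation would need synchronization.
sealed class AdaptiveRateSampler
{
    private readonly double _targetPerSecond;
    private double _observedPerSecond;
    private long _lastTicks = Environment.TickCount64;

    public AdaptiveRateSampler(double targetPerSecond) =>
        _targetPerSecond = targetPerSecond;

    public bool ShouldKeep(Random rng)
    {
        // Update the smoothed rate estimate from the inter-arrival time.
        long now = Environment.TickCount64;
        double intervalSec = Math.Max(now - _lastTicks, 1) / 1000.0;
        _lastTicks = now;
        _observedPerSecond = 0.9 * _observedPerSecond + 0.1 * (1.0 / intervalSec);

        // Below the cap everything is kept, so local debugging is unaffected.
        if (_observedPerSecond <= _targetPerSecond) return true;

        // Above the cap, keep a proportional fraction of requests.
        return rng.NextDouble() < _targetPerSecond / _observedPerSecond;
    }
}
```

The averaging is what makes it "semi-adaptive": a short burst above the cap only reduces the keep probability gradually, rather than hard-cutting after the first N requests each second.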

Hard limits

The adaptive algorithm can be augmented with additional safeguards to ensure that you don't get too much or too little data. For example, you could have an override of a minimum sampling percentage, so you get at least 0.1% of requests regardless of the rate limit.

It could also have max limits to reduce the chance of coding/configuration errors causing high bills. For example, have a max-log-messages-per-traceid, max-spans-per-traceid, and max-events-per-span, which would be calculated and would limit the data per request, per process. That should be doable on a per-process basis without needing to do tail sampling, as long as there is an event for when the request completes so that the count stats can be removed.

In the case that the limits are reached, ideally an additional log message is emitted that details that messages have been suppressed, and what their counts were. This can then be used to understand the patterns that have caused the suppression to occur and the extent of the problem.
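A minimal sketch of the per-request hard limit described above might look like this. Everything here is hypothetical (the `PerTraceLogCap` name, the counting scheme); it only illustrates counting per trace ID in-process, dropping the stats on request completion, and reporting how much was suppressed so a summary message can be logged.

```csharp
using System;
using System.Collections.Concurrent;

// Sketch: enforce max-log-messages-per-traceid inside one process.
sealed class PerTraceLogCap
{
    private readonly int _maxLogsPerTraceId;
    private readonly ConcurrentDictionary<string, int> _counts = new();

    public PerTraceLogCap(int maxLogsPerTraceId) =>
        _maxLogsPerTraceId = maxLogsPerTraceId;

    // Returns true if the message should be written.
    public bool TryCount(string traceId)
    {
        int count = _counts.AddOrUpdate(traceId, 1, (_, c) => c + 1);
        return count <= _maxLogsPerTraceId;
    }

    // Called when the request completes, so the count stats can be removed.
    // Returns the suppressed count so the caller can emit a single summary
    // log message ("N messages suppressed for trace X") when it is non-zero.
    public int Complete(string traceId)
    {
        _counts.TryRemove(traceId, out int count);
        return Math.Max(0, count - _maxLogsPerTraceId);
    }
}
```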

Log messages not associated with a trace

Not all log messages are going to be associated with a request, and therefore have the context of a traceid. These should have the same concepts of fixed rate and time-based sampling applied to them. The values for those rates should be independent from those for trace-based sampling.

Processing order

  • I suspect the EventID based filtering should be applied first, and be performed regardless of the other options. If a message is elevated, or suppressed, it should then go through the rest of the algorithm to determine what happens.

  • Ideally the decision for whether a trace is sampled is made at the time the traceid is retrieved from the request headers, and the Recorded bit is set. There should be an option as to whether to:

    • always honor the Recorded bit from the incoming request
    • ignore the incoming Recorded bit and use the algorithm to set the value
    • only use the algorithm if the incoming request doesn't have the recorded bit

    This should then set the Recorded bit on the Activity so that code can use that bit to determine whether to write telemetry or not.

    • Ideally the IsEnabled() method for ILogger should use the Activity Recorded bit to determine if the message should be written. If no activity is present, it short-circuits that check.

    • In the output processing for OpenTelemetry, an additional filter is applied that will remove Traces and Logs that are trace-specific. The above is then just an optimization: bad logging code may waste CPU, but it will not affect the output.
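The three parent-handling options above can be expressed with the real System.Diagnostics.ActivityListener API, which decides the Recorded bit at Activity creation time. The `ParentPolicy` enum and the injected `algorithm` callback are hypothetical names for this sketch; `ActivityListener`, `ActivitySamplingResult`, and `ActivityTraceFlags.Recorded` are the actual .NET types.

```csharp
using System;
using System.Diagnostics;

// Hypothetical policy mirroring the three bullets above.
enum ParentPolicy { HonorParent, IgnoreParent, AlgorithmWhenNoParent }

static class RecordedBitSetup
{
    public static ActivityListener CreateListener(
        ParentPolicy policy, Func<ActivityContext, bool> algorithm)
    {
        return new ActivityListener
        {
            ShouldListenTo = _ => true,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
            {
                bool hasParent = options.Parent.TraceId != default;
                bool parentRecorded =
                    (options.Parent.TraceFlags & ActivityTraceFlags.Recorded) != 0;

                bool record = policy switch
                {
                    // Always honor the Recorded bit from the incoming request.
                    ParentPolicy.HonorParent => parentRecorded,
                    // Ignore the incoming bit and decide locally.
                    ParentPolicy.IgnoreParent => algorithm(options.Parent),
                    // Use the algorithm only when no parent context exists.
                    _ => hasParent ? parentRecorded : algorithm(options.Parent),
                };

                // AllDataAndRecorded sets Activity.Recorded, so downstream
                // IsEnabled-style checks can consult that bit.
                return record
                    ? ActivitySamplingResult.AllDataAndRecorded
                    : ActivitySamplingResult.PropagationData;
            }
        };
    }
}
```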

Tail-based sampling

All of the above is applied on a per-process basis and is decided up front, without knowledge of the final success/failure of the request. In most high-performance systems, this is the most cost-effective way to make decisions.
An alternative would be to use an out-of-process, tail-based sampler, typically implemented in an agent such as the OTel Collector or Strato. As the implementation of that is external to the process, it's not specific to .NET and so not part of this feature set.


tarekgh avatar tarekgh commented on August 16, 2024

I'm just checking in to see if we're moving forward with this. The reason I ask is because we have the tracking issue dotnet/runtime#82465, and we need to determine whether we should close it if we're going ahead with the proposal mentioned here. Naturally, we can collaborate to decide what needs to be incorporated into the runtime, if anything.


geeknoid avatar geeknoid commented on August 16, 2024

I think LogSamplingOptions.Matchers should be an IList instead of a List.

