
Comments (13)

evgenyfedorov2 avatar evgenyfedorov2 commented on August 16, 2024 1

Yes, we will be moving forward. I've been busy with other things recently, and I still am, which is why there's been a delay.


geeknoid avatar geeknoid commented on August 16, 2024

@noahfalk This is as we discussed a little while ago.


noahfalk avatar noahfalk commented on August 16, 2024

Cool! I've got this flagged to come and take a closer look but it might be a little bit.


noahfalk avatar noahfalk commented on August 16, 2024

Thanks for putting this together!

A couple thoughts here:

  • For the timestamping issue, this sounds like something we'd want to firm up prior to doing major work. My main concern is that if we expect the commonly used ILoggerProviders (say ConsoleProvider and OTel's provider at least) to test the TState for some pre-existing timestamp information, that is going to impose a performance overhead on every log message being processed, regardless of whether the application is using the features proposed here. If we are asking all .NET logging users to take a perf regression so that a small fraction of those users can use new features, we should do due diligence to ensure that perf overhead is very small. If the mechanism to recognize pre-timestamped log messages requires any new shared APIs in M.E.L.A, then we should also ensure we are going to be able to ship that in .NET 9.

    Specific things to firm up:

    1. How do we expect ILoggerProviders to recognize and treat buffered data?
    2. What is the performance overhead for checking a TState to determine it isn't buffered?
    3. What APIs (if any) need to be added to M.E.L.A to make this work?
  • How do we expect the sampling API to interact with the filtering API?
    Is the idea that filters happen first, then sampling happens on what remains?
    If I want some data for some new category to be buffered, do I have to both add a filter rule to include it and a buffer rule to get it buffered, or is it sufficient to just add the buffering rule?

  • From an API design perspective, filtering and sampling seem related so could we make their usage appear more unified? For example if I wanted to include only Microsoft.Extensions.Hosting informational and all errors I could write this:

builder // ILoggingBuilder
        .AddFilter("Microsoft.Extensions.Hosting", LogLevel.Information)
        .AddFilter(null, LogLevel.Error);

If I wanted to select that same set of logs to be buffered and sampled could I do something like this?

builder // ILoggingBuilder
        .AddBuffer("MyBuffer", durationSecs:10, size: 1_000_000)
        .AddBufferedFilter("MyBuffer", "Microsoft.Extensions.Hosting", LogLevel.Information,
             samplingRate: 0.01)
        .AddBufferedFilter("MyBuffer", null, LogLevel.Error);               

I'm not tied to that approach at all - it's just an off-the-cuff example of what I am guessing would be possible. My broader hopes from the API design perspective:

  1. Can we make the buffering and sampling appear more unified with the filtering that is already there? Or, if you think making them appear unified will create confusion, let's discuss why.
  2. Can we make scenarios we expect to be common more compact? Reducing the number of lines of code, types, and method calls users need to interact with should make it easier to understand and author. This shouldn't preclude having verbose APIs available to handle more complex cases.

There is probably more fine-grained API feedback but I think we'd want to look at the high level stuff before the smaller stuff.

@tarekgh @samsp-msft - you guys also probably want to look at this.


tarekgh avatar tarekgh commented on August 16, 2024

I second @noahfalk on all the points he raised. One more thing: why are we combining filtering and buffering together? Buffering may be used in more scenarios than just filtering. And should we follow the pattern we used in redaction? I mean something like:

    builder.EnableRedaction(...)
               .EnableFiltering(....) // EnableSampler?
               .EnableBuffering(....)

We could also create a global filter class which can easily be registered to support global filtering. For request filtering, we can have a local filter, plus a setting to enable it in ASP.NET so it applies to anyone handling HTTP requests. This could simplify the APIs, I think, and remove ControlAction. Just thoughts!


samsp-msft avatar samsp-msft commented on August 16, 2024

I have a number of thoughts, but I think this also needs to be fit into the larger picture with OpenTelemetry.

Filtering based on EventID

I have heard from customers that the granularity of the logging levels is not sufficient, and that they would like to be able to collect specific log messages from a lower log level (Debug == 1, Critical == 5), or filter out higher-level messages that occur too frequently.

This should happen regardless of whether the log message is associated with a trace, and regardless of how frequently it is used. Ideally it is handled through configuration along with the rest of the log levels, so it provides an override of the more granular defaults.

Head-based sampling of per-request telemetry.

If your service is handling a large number of requests, collecting tracing and log messages from all of them uses up a lot of data. Rather than sampling each source of data independently, you want to sample them using the same algorithm based on the traceID. Each source can be given a percentage for what to keep. If the same hash algorithm is used to decide what to keep, then requests that fall under the lowest common denominator will have all their records, and the others will be filtered out as applicable.

Ideally, log messages that carry a traceid go through the same processing as the distributed tracing, so both the trace and its log messages will be emitted together. Because this is a head-based sampling algorithm, it doesn't depend on buffering and waiting for completion to be useful.
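The "same hash algorithm" idea above can be sketched roughly as follows. This is an illustrative sketch, not a real API: the `TraceIdSampler` type and the FNV-1a hash choice are assumptions for the example, though OTel's TraceIdRatioBased sampler is built on the same principle.

```csharp
using System.Diagnostics;

// Sketch: a consistent head-sampling decision keyed on the W3C trace ID,
// so logs and spans for the same request are kept or dropped together.
static class TraceIdSampler
{
    // Returns true when the trace falls inside the configured keep ratio.
    public static bool ShouldSample(ActivityTraceId traceId, double ratio)
    {
        // Map the trace ID to a stable value in [0, 1) and compare to the ratio.
        ulong hash = Fnv1a64(traceId.ToHexString());
        return (hash / (double)ulong.MaxValue) < ratio;
    }

    private static ulong Fnv1a64(string s)
    {
        ulong h = 14695981039346656037UL;
        foreach (char c in s)
        {
            h ^= c;
            h *= 1099511628211UL;
        }
        return h;
    }
}
```

Because each keep decision is a threshold comparison on the same hash, a signal sampled at 1% is a strict subset of one sampled at 5%, which gives the "lowest common denominator" behavior described above.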

Rate limiting

The classic Application Insights SDK has a sampling algorithm that supports rate-based limiting - so you can cap the tracing at 5 requests/sec (as an example). I believe it's a semi-adaptive algorithm, so it's not just the first 5 requests that will be emitted; it uses an average over a timespan to determine what to retain.

I think we need the same algorithm to be implemented as a processor for OpenTelemetry so it's not dependent on being used with a specific exporter. When Aspire apps are deployed to ACA, this mechanism could be used to ensure that the dashboard is not overloaded with results.
The great thing about this kind of algorithm is that it's unlikely to filter out records when doing local debugging (unless load testing), so you are not left wondering where your telemetry went.
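To make the shape of such a sampler concrete, here is a crude, illustrative sketch in the spirit of that semi-adaptive behavior. It is not the actual Application Insights algorithm; the class name, smoothing factor, and single-threaded design are all assumptions for the example.

```csharp
using System;

// Sketch: track an exponentially weighted average of the observed request
// rate and derive a keep probability so the emitted rate approaches the cap.
// Not thread-safe; a real implementation would need synchronization.
sealed class AdaptiveRateSampler
{
    private readonly double _targetPerSecond;
    private double _observedPerSecond;
    private long _lastTicks = Environment.TickCount64;

    public AdaptiveRateSampler(double targetPerSecond) =>
        _targetPerSecond = targetPerSecond;

    public bool ShouldKeep(Random rng)
    {
        // Update the smoothed rate estimate from the inter-arrival time.
        long now = Environment.TickCount64;
        double intervalSec = Math.Max(now - _lastTicks, 1) / 1000.0;
        _lastTicks = now;
        _observedPerSecond = 0.9 * _observedPerSecond + 0.1 * (1.0 / intervalSec);

        // Below the cap everything is kept, so local debugging is unaffected.
        if (_observedPerSecond <= _targetPerSecond) return true;

        // Above the cap, keep a proportional fraction of requests.
        return rng.NextDouble() < _targetPerSecond / _observedPerSecond;
    }
}
```

The averaging is what makes it "semi-adaptive": a short burst above the cap only reduces the keep probability gradually, rather than hard-cutting after the first N requests each second.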

Hard limits

The adaptive algorithm can be augmented with additional safeguards to ensure that you don't get too much or too little data. For example, you could have an override of a minimum sampling percentage, so you get at least 0.1% of requests regardless of the rate limit.

It could also have max limits to reduce the chance of coding/configuration errors causing high bills. For example, have a max-log-messages-per-traceid, max-spans-per-traceid, and max-events-per-span, which would be calculated and would limit the data per request, per process. That should be doable on a per-process basis without needing to do tail sampling, as long as there is an event for when the request completes so that the count stats can be removed.

In the case that the limits are reached, ideally an additional log message is emitted that details that messages have been suppressed, and what their counts were. This can then be used to understand the patterns that have caused the suppression to occur and the extent of the problem.
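A minimal sketch of the per-request hard limit described above might look like this. Everything here is hypothetical (the `PerTraceLogCap` name, the counting scheme); it only illustrates counting per trace ID in-process, dropping the stats on request completion, and reporting how much was suppressed so a summary message can be logged.

```csharp
using System;
using System.Collections.Concurrent;

// Sketch: enforce max-log-messages-per-traceid inside one process.
sealed class PerTraceLogCap
{
    private readonly int _maxLogsPerTraceId;
    private readonly ConcurrentDictionary<string, int> _counts = new();

    public PerTraceLogCap(int maxLogsPerTraceId) =>
        _maxLogsPerTraceId = maxLogsPerTraceId;

    // Returns true if the message should be written.
    public bool TryCount(string traceId)
    {
        int count = _counts.AddOrUpdate(traceId, 1, (_, c) => c + 1);
        return count <= _maxLogsPerTraceId;
    }

    // Called when the request completes, so the count stats can be removed.
    // Returns the suppressed count so the caller can emit a single summary
    // log message ("N messages suppressed for trace X") when it is non-zero.
    public int Complete(string traceId)
    {
        _counts.TryRemove(traceId, out int count);
        return Math.Max(0, count - _maxLogsPerTraceId);
    }
}
```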

Log messages not associated with a trace

Not all log messages are going to be associated with a request, and therefore have the context of a traceid. These should have the same concepts of fixed rate and time-based sampling applied to them. The values for those rates should be independent from those for trace-based sampling.

Processing order

  • I suspect the EventID based filtering should be applied first, and be performed regardless of the other options. If a message is elevated, or suppressed, it should then go through the rest of the algorithm to determine what happens.

  • Ideally the decision for whether a trace is sampled is made at the time the traceid is retrieved from the request headers, and the Recorded bit is set. There should be an option as to whether to:

    • always honor the Recorded bit from the incoming request
    • ignore the incoming Recorded bit and use the algorithm to set the value
    • only use the algorithm if the incoming request doesn't have the recorded bit

    This should then set the Recorded bit on the Activity so that code can use that bit to determine whether to write telemetry or not.

    • Ideally the IsEnabled() method for ILogger should use the Activity Recorded bit to determine if the message should be written. If no activity is present, it short-circuits that check.

    • In the output processing for OpenTelemetry, an additional filter is applied that will remove Traces and Logs that are trace-specific. The above is then just an optimization: bad logging code may waste CPU, but it will not affect the output.
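The three parent-handling options above can be expressed with the real System.Diagnostics.ActivityListener API, which decides the Recorded bit at Activity creation time. The `ParentPolicy` enum and the injected `algorithm` callback are hypothetical names for this sketch; `ActivityListener`, `ActivitySamplingResult`, and `ActivityTraceFlags.Recorded` are the actual .NET types.

```csharp
using System;
using System.Diagnostics;

// Hypothetical policy mirroring the three bullets above.
enum ParentPolicy { HonorParent, IgnoreParent, AlgorithmWhenNoParent }

static class RecordedBitSetup
{
    public static ActivityListener CreateListener(
        ParentPolicy policy, Func<ActivityContext, bool> algorithm)
    {
        return new ActivityListener
        {
            ShouldListenTo = _ => true,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
            {
                bool hasParent = options.Parent.TraceId != default;
                bool parentRecorded =
                    (options.Parent.TraceFlags & ActivityTraceFlags.Recorded) != 0;

                bool record = policy switch
                {
                    // Always honor the Recorded bit from the incoming request.
                    ParentPolicy.HonorParent => parentRecorded,
                    // Ignore the incoming bit and decide locally.
                    ParentPolicy.IgnoreParent => algorithm(options.Parent),
                    // Use the algorithm only when no parent context exists.
                    _ => hasParent ? parentRecorded : algorithm(options.Parent),
                };

                // AllDataAndRecorded sets Activity.Recorded, so downstream
                // IsEnabled-style checks can consult that bit.
                return record
                    ? ActivitySamplingResult.AllDataAndRecorded
                    : ActivitySamplingResult.PropagationData;
            }
        };
    }
}
```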

Tail-based sampling

All of the above is applied on a per-process basis and is decided up front, without knowledge of the final success/failure of the request. In most high-performance systems, this is the most cost-effective way to make decisions.
An alternative would be to use an out-of-process, tail-based sampler, typically implemented in an agent such as the OTel Collector or Strato. As the implementation of that is external to the process, it's not specific to .NET and so not part of this feature set.


tarekgh avatar tarekgh commented on August 16, 2024

I'm just checking in to see if we're moving forward with this. The reason I ask is because we have the tracking issue dotnet/runtime#82465, and we need to determine whether we should close it if we're going ahead with the proposal mentioned here. Naturally, we can collaborate to decide what needs to be incorporated into the runtime, if anything.


geeknoid avatar geeknoid commented on August 16, 2024

I think LogSamplingOptions.Matchers should be an IList instead of a List.

