
Comments (11)

trentm commented on September 22, 2024

For those watching this issue: v3.14.0, which includes this fix, was released yesterday.


astorm commented on September 22, 2024

Capturing some context from a conversation elsewhere.

[Our current] assumption/educated-guess here is that: (a) the client buffers up all traces to send, (b) the client.write*() methods eventually end up blocking when those buffers fill, and (c) possibly (though we don't have evidence yet) there are bugs that result in map leaks if/when the APM server jam can be cleared.


trentm commented on September 22, 2024

The issue did not appear (for me) when the demo app.js was handling only one request at a time (autocannon -p 1). That could mean it is still just a load issue, or it could mean there is a concurrency handling bug in the apm-server client.


astorm commented on September 22, 2024

I spent this morning capturing some heap snapshots during these periods of an unresponsive APM server that lead to Node blocking. Nothing conclusive to draw from these yet, but I did notice that over time the number of "objects allocated but not released" appears to be growing in an unbounded way.

[screenshot: heap snapshot summary, early in the run]

[screenshot: heap snapshot summary, late in the run]

My intuition/working-theory is that something about the client's state during these periods leads to a growing number of objects that can't be released immediately, and this eventually leads to a situation where the garbage collector starts thrashing and blocking. (If that sounds a little hand-wavy, it is.)

Heap Snapshots are here if anyone wants to take a look: https://drive.google.com/file/d/1nYKgEboIKj9tY8FD1mycPxrqoNYxuPiZ/view?usp=sharing -- these should be in a format that Chrome's V8 debugging tools can read.


astorm commented on September 22, 2024

More data -- TL;DR -- it looks like garbage collection might be contributing here but isn't the sole reason for those long 5 - 10 second periods of Node blocking.

I used gc-profiler to log collection events across a 5 minute run. The gc-profiler package allows us to observe when V8 performs a garbage collection operation and tells us how long (in milliseconds) that operation took. This is important because Node will block while garbage collecting.
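For context, a minimal sketch of that kind of logging (the 'gc' event and the info.type / info.duration fields follow my reading of the gc-profiler README, so treat the exact names as assumptions):

    // Log each V8 GC event with its type and duration so the log lines can be
    // shipped to Elasticsearch and charted.
    const profiler = require('gc-profiler')

    profiler.on('gc', (info) => {
      // info.type is e.g. 'Scavenge' or 'MarkSweepCompact'; info.duration is in ms.
      console.log(JSON.stringify({
        ts: new Date().toISOString(),
        gcType: info.type,
        durationMs: info.duration
      }))
    })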

Pulling the logs into Elasticsearch via Filebeat, we came up with this chart.

[Screenshot taken 2021-02-24 at 2:30:04 PM: GC event durations over the run]

The chart above is dominated by increasingly long MarkSweepCompact GC events that reach as high as 800 milliseconds and occur routinely within 10 - 15 seconds of each other. If we remove these events from the chart:

[Screenshot taken 2021-02-24 at 2:36:18 PM: the same chart with MarkSweepCompact events removed]

We can see more reasonable (if not great) times for the Scavenge and IncrementalMarking events.

So these GC pauses aren't the sole reason for these long periods of blocking, but I'm guessing they don't help things -- long GC pause, then unpause, a slew of pending network requests come in and create a bunch more objects, even longer GC pause, etc.

More info on Node.js garbage collection: "Understanding Garbage Collection and Hunting Memory Leaks in Node.js" and (more in depth) here.


astorm commented on September 22, 2024

One more datapoint:

Using the blocked-at package mentioned in this repo, we've been able to see messages like this

    Blocked for 3514.5704949951173ms, operation started here: [
      '    at Gzip.Zlib (zlib.js:656:18)',
      '    at new Gzip (zlib.js:724:8)',
      '    at Object.value [as createGzip] (zlib.js:849:14)',
      '    at StreamChopper.transform [as _transform] (/path/to/elastic-experiments/node_modules/elastic-apm-http-client/index.js:119:19)',
      '    at StreamChopper._startStream (/path/to/elastic-experiments/node_modules/stream-chopper/index.js:87:25)',
      '    at StreamChopper._write (/path/to/elastic-experiments/node_modules/stream-chopper/index.js:192:10)',
      '    at /path/to/elastic-experiments/node_modules/stream-chopper/index.js:193:12',
      '    at /path/to/elastic-experiments/node_modules/stream-chopper/index.js:116:7',
      '    at /path/to/elastic-experiments/node_modules/elastic-apm-http-client/index.js:523:7',
      '    at /path/to/elastic-experiments/node_modules/pump/index.js:75:7',
      '    at f (/path/to/elastic-experiments/node_modules/once/once.js:25:25)',
      '    at ClientRequest.<anonymous> (/path/to/elastic-experiments/node_modules/pump/index.js:31:21)',
      '    at ClientRequest.f (/path/to/elastic-experiments/node_modules/once/once.js:25:25)',
      '    at ClientRequest.onerror (/path/to/elastic-experiments/node_modules/end-of-stream/index.js:44:12)',
      '    at ClientRequest.emit (events.js:327:22)',
      '    at Socket.socketErrorListener (_http_client.js:469:9)',
      '    at Socket.emit (events.js:315:20)',
      '    at emitErrorNT (internal/streams/destroy.js:106:8)',
      '    at emitErrorCloseNT (internal/streams/destroy.js:74:3)',
      '    at processTicksAndRejections (internal/process/task_queues.js:80:21)'
    ]

under heavy load. There are other situations where things block, but this is the long one.

This indicates the longest blocking periods are due to the creation of the gzip stream for reasons that aren't immediately apparent.
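For reference, a minimal sketch of how blocked-at can be wired up to produce messages like the one above (the callback signature and the threshold option follow my reading of the blocked-at README; treat the details as assumptions):

    // Report any async operation whose handler blocks the event loop for longer
    // than the threshold, along with the stack captured when the op was created.
    const blockedAt = require('blocked-at')

    blockedAt((ms, stack) => {
      console.log(`Blocked for ${ms}ms, operation started here:`, stack)
    }, { threshold: 100 }) // only report pauses longer than 100ms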


trentm commented on September 22, 2024

I've been looking at this. Some summary notes so far:

  1. The current onStream handling doesn't wait and check the apm server response before calling next() to move on to the next chopped stream. This means multiple "chopped stream" requests can be in flight at the same time if the apm-server is slow or non-responsive.

  2. If the apm-server endpoint accepts the connection but doesn't read its input stream, then (IIUC) the gzip stream's buffer wedges with the result that you get a "finish" event from the zlib.createGzip() stream, but you do not get an "end" event... at least not until we manually destroy the stream in the 30s serverTimeout. That is why, at least in my testing scenario, I was getting 30s+ (actually often >60s) delays before the next request would go through.

  3. (Possibly only considered a nice-to-have for this ticket.) The current client does no backoff as specified at https://github.com/elastic/apm/blob/master/specs/agents/transport.md#transport-errors

  4. And the main kicker is that there is no dropping of events (transactions, spans, et al.) from the buffer/queue when they are coming in too fast. If apm-server is down, they build up forever on the client._writableState.$iForgetTheFieldName. (A rough sketch of such a drop guard follows this list.)

  5. Also (at least some of) the latency/event-loop bubbles are from Client.prototype._writev doing encoding of all events in one big chunk synchronously. Those need to be broken up to not hog the event loop.
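To make (4) concrete, a hypothetical sketch of a drop guard -- the names (maxQueueSize, enqueue) are made up for illustration and are not the client's actual API:

    // Wrap the send path with a counter so events are dropped (and counted)
    // once too many are already queued, instead of buffering without bound.
    function makeDropGuard (maxQueueSize, enqueue) {
      let queuedCount = 0
      let droppedCount = 0
      return {
        send (event) {
          if (queuedCount >= maxQueueSize) {
            droppedCount++ // drop instead of buffering forever
            return false
          }
          queuedCount++
          enqueue(event, () => { queuedCount-- }) // decrement once the event is flushed
          return true
        },
        stats () {
          return { queued: queuedCount, dropped: droppedCount }
        }
      }
    }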

Next week I'll get a starter PR with some fix ideas.


trentm commented on September 22, 2024

implementation plan

Here is a possible implementation plan for the issues described in this earlier comment.

  • Strongly consider getting #139 merged for this work so we are always running with a keep-alive agent and don't have to test both with- and without-keep-alive scenarios.
  • Redo the onStream (the handler for the chopped-stream "stream" event) implementation to handle (a) waiting for a response from the apm-server (handles (1)) and (b) using a new serverResponseTimeout timer on the gzip stream "finish" event (handles (2)). See the "intake request notes" section below.
  • Update the http-client sendSpan, sendTransaction, sendError, and sendMetricSet methods to drop an event if the buffer is over limit (handles (4)).
  • Re-measure and add logging to see if we still get the too-large .encode()s (5). I think we still will. _writevCleaned is already async, so it should be trivial to have an async.forEach (in series, parallel, or in batches) for this.
  • That leaves backoff (3). Could do this with a feature on StreamChopper to setStartStreamDelay(delay) that would be called before calling next(). StreamChopper's _startStream would wait that delay before the next "stream" event. The http-client's onStream handler would set that stream delay appropriately based on APM server responses. (A rough sketch of the delay calculation follows this list.)
    Update 20210310: Might be able to handle this backoff delay by not calling next() -- the callback to StreamChopper's "stream" event -- until after the backoff delay. That'd avoid adding features to StreamChopper.
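For the backoff delay itself, a hedged sketch based on my reading of the transport spec linked in (3) -- a grace period that grows with the square of the consecutive-failure count, capped, with random jitter; verify the exact formula against the spec:

    // Compute the reconnect grace period in milliseconds.
    function backoffDelayMs (reconnectCount) {
      const seconds = Math.pow(Math.min(reconnectCount, 6), 2) // 0, 1, 4, 9, ... 36
      const jitter = seconds * 0.1 * (Math.random() * 2 - 1)   // +/- 10%
      return (seconds + jitter) * 1000
    }

    // Per the 20210310 update above, the onStream handler could then delay the
    // next intake request with something like:
    //   setTimeout(next, backoffDelayMs(reconnectCount))
    // instead of calling next() immediately.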

intake request notes

Each "intake request" (the name I'm using for each chopped-stream "stream"
event handler call), can conclude (I'm intentionally avoiding the overloaded
terms "end", "finish") for any of the following reasons:

  • the 10s apiRequestTime - handled by stream-chopper's _timer calling _endStream
  • the 768kB apiRequestSize - handled by stream-chopper counting bytes in the "data" event of the transform stream (a zlib.createGzip()) and calling _endStream if the bytes written plus bytes buffered exceed that size
  • the 30s serverTimeout - a socket.setTimeout set by the http-client in onStream; this measures 30s of socket inactivity and, IME, can take 60s or more (perhaps because of delayed flushing in the Gzip stream resetting that idle time)
  • an early response from apm-server - handled by the "response" event handler in the http-client onStream
  • an "error" event on the HTTP request or the gzip stream - Handled by "error" event handlers in the http-client onStream. Typically these error events would be ECONNRESET or ECONNREFUSED due to connection troubles at any point in the request. Q: What are other error causes?
  • a new proposed ~5s (?) serverResponseTimeout - Handled by the http-client in onStream. This is a timer started after we are "done" sending data to the intake API, as measured by the "finish" event on the gzip stream. We cannot use the more typical "end" event because of issue (2) from above.
  • Q: How is an intake request terminated for http-client shutdown?

A typical "successful" intake request should happen like this:

  • the StreamChopper starts a new stream:
    • creates a Gzip transform stream
    • emits a "stream" event with that gzip stream and a next callback
    • starts writing encoded event data to it
  • onStream (will probably rename this) in the http-client handles the "stream" event
    • creates an HTTP request to the APM server
    • pipes the gzip stream into the HTTP request stream
    • listens for appropriate events (gzip "finish", gzip and req "end", gzip and req "error", req "response")
  • the StreamChopper ends the stream (_endStream) for either of apiRequestTime or apiRequestSize
    • calls gzipStream.end() which begins draining and flushing the gzip stream
    • the gzip stream emits "finish" when it has completed processing its input (though not necessarily completed flushing all data downstream)
  • the http-client handles the gzip "finish" event:
    • starts a serverResponseTimeout timer
  • apm-server responds and we get a "response" event on the request:
    • the statusCode is 200
    • read the apm-server response body, which involves waiting for the "end"
      event on the res
  • "end" event on the res (the apm-server response)
    • clearTimeouts
    • we should have seen "end" events on gzipStream and req
    • update the "backoff state" for the next intake request
    • call the StreamChopper next callback to allow the next intake request
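A rough, hedged sketch of that wiring (the onStream shape, the timer names, and the intake URL/timeout values are illustrative only, not the actual client code):

    const http = require('http')

    // Hypothetical values for illustration only:
    const intakeUrl = 'http://localhost:8200/intake/v2/events'
    const serverTimeoutMs = 30 * 1000        // the 30s serverTimeout
    const serverResponseTimeoutMs = 5 * 1000 // the proposed ~5s serverResponseTimeout

    function onStream (gzipStream, next) {
      const req = http.request(intakeUrl, { method: 'POST' }, onResponse)
      req.setTimeout(serverTimeoutMs)
      req.on('timeout', () => req.destroy(new Error('serverTimeout'))) // socket inactivity
      gzipStream.pipe(req) // "error" handling on gzipStream/req elided

      let responseTimer = null
      gzipStream.on('finish', () => {
        // Done writing the request body; give the server a bounded time to respond.
        responseTimer = setTimeout(
          () => req.destroy(new Error('serverResponseTimeout')),
          serverResponseTimeoutMs
        )
      })

      function onResponse (res) {
        let body = ''
        res.on('data', (chunk) => { body += chunk })
        res.on('end', () => {
          clearTimeout(responseTimer)
          // Update the "backoff state" based on res.statusCode and body, then
          // allow the next intake request.
          next()
        })
      }
    }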


trentm commented on September 22, 2024

tl;dr: Circumstantial evidence showing the "Blocked for" occurrences are due to the encoding in _writevCleaned.

Here, I'll annotate the blocked-at stack that Alan posted above to show how it is caused (I believe) by a slow Client.prototype._writevCleaned.

First, background on how blocked-at works: It adds an async hook that:

  • captures a stack when an async op is created (init),
  • notes the time when the async op starts running (before),
  • calculates the run time when the async op stops running (after).
    If the run time is greater than the threshold (default 20ms), then it calls its callback, which is configured here to print out these "Blocked for" messages.
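A minimal sketch of that init/before/after pattern with Node's async_hooks (a simplification; blocked-at's actual implementation differs in its details):

    const asyncHooks = require('async_hooks')
    const fs = require('fs')

    const thresholdMs = 20
    const ops = new Map() // asyncId -> { type, stack, start }

    asyncHooks.createHook({
      init (asyncId, type) {
        // Capture the creation stack -- this is the only "where" we will have.
        ops.set(asyncId, { type, stack: new Error().stack, start: 0n })
      },
      before (asyncId) {
        const op = ops.get(asyncId)
        if (op) op.start = process.hrtime.bigint()
      },
      after (asyncId) {
        const op = ops.get(asyncId)
        if (!op || op.start === 0n) return
        const ms = Number(process.hrtime.bigint() - op.start) / 1e6
        if (ms > thresholdMs) {
          // Use a synchronous write: console.log would itself create async ops here.
          fs.writeSync(1, `Blocked for ${ms}ms, operation started here: ${op.stack}\n`)
        }
        ops.delete(asyncId)
      },
      destroy (asyncId) {
        ops.delete(asyncId)
      }
    }).enable()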

The important point for pinpointing the slow code here is that the captured stack only identifies the call stack when the async op is created. The slow code can be anything in that function or any synchronous calls it makes. So let's work backwards on this call stack to see what the synchronous call stack after Gzip.Zlib might be:

Blocked for 3514.5704949951173ms, operation started here: [
  '    at Gzip.Zlib (zlib.js:656:18)',

From the line numbers I infer this is roughly node v14.15.4. This line is where new binding.Zlib(mode) calls the zlib C++ binding ZlibStream ctor. This subclasses CompressionStream which creates the "ZLIB" async op.

  '    at new Gzip (zlib.js:724:8)',
  '    at Object.value [as createGzip] (zlib.js:849:14)',
  '    at StreamChopper.transform [as _transform] (/path/to/elastic-experiments/node_modules/elastic-apm-http-client/index.js:119:19)',

Here we are creating the gzip stream ...

  '    at StreamChopper._startStream (/path/to/elastic-experiments/node_modules/stream-chopper/index.js:87:25)',

... as part of StreamChopper starting a new stream

  '    at StreamChopper._write (/path/to/elastic-experiments/node_modules/stream-chopper/index.js:192:10)',

... which is called as a result of _write having been called with data (from the http Client, whenever it is uncorked and has data)

  '    at /path/to/elastic-experiments/node_modules/stream-chopper/index.js:193:12',

... which is a callback used to re-call _write(chunk, ...) until after the stream is started

  '    at /path/to/elastic-experiments/node_modules/stream-chopper/index.js:116:7',

... this shows the this._onunlock mechanics for that callback.

  '    at /path/to/elastic-experiments/node_modules/elastic-apm-http-client/index.js:523:7',

... this was all after an "error" event that ended the pump() in the Client's onStream handler.

  '    at /path/to/elastic-experiments/node_modules/pump/index.js:75:7',
  '    at f (/path/to/elastic-experiments/node_modules/once/once.js:25:25)',
  '    at ClientRequest.<anonymous> (/path/to/elastic-experiments/node_modules/pump/index.js:31:21)',
  '    at ClientRequest.f (/path/to/elastic-experiments/node_modules/once/once.js:25:25)',
  '    at ClientRequest.onerror (/path/to/elastic-experiments/node_modules/end-of-stream/index.js:44:12)',
  '    at ClientRequest.emit (events.js:327:22)',
  '    at Socket.socketErrorListener (_http_client.js:469:9)',
  '    at Socket.emit (events.js:315:20)',
  '    at emitErrorNT (internal/streams/destroy.js:106:8)',
  '    at emitErrorCloseNT (internal/streams/destroy.js:74:3)',
  '    at processTicksAndRejections (internal/process/task_queues.js:80:21)'
]

To be honest, I don't show here that this leads to the _writev call on the Client stream which leads to the long _writevCleaned. Instead I offer this circumstantial evidence from a load testing run with both blocked-at and my added logging:

[2021-03-08T22:44:54.844Z]  WARN (elastic-apm-node on pink.local): _writev 2313 objs
    process: {"pid": 29719}
    module: "elastic-apm-http-client"
    writableState: (143) {
        "needDrain": true,
        "length": 2313,
        "writing": true,
        "corked": 0,
        "bufferProcessing": true,
        "writelen": 2313,
        "pendingcb": 2366,
        "bufferedRequestCount": 2313
    }
    error: (1361) {
        "type": "Error",
        "message": "hi",
        "stack_trace":
            Error: hi
                at Client._writev (/Users/trentm/tm/apm-nodejs-http-client/index.js:317:60)
                at doWrite (/Users/trentm/tm/apm-nodejs-http-client/node_modules/readable-stream/lib/_stream_writable.js:409:97)
                at clearBuffer (/Users/trentm/tm/apm-nodejs-http-client/node_modules/readable-stream/lib/_stream_writable.js:505:5)
                at onwrite (/Users/trentm/tm/apm-nodejs-http-client/node_modules/readable-stream/lib/_stream_writable.js:455:7)
                at WritableState.onwrite (/Users/trentm/tm/apm-nodejs-http-client/node_modules/readable-stream/lib/_stream_writable.js:160:5)
                at afterWrite (/Users/trentm/tm/stream-chopper/node_modules/readable-stream/lib/_stream_writable.js:469:3)
                at onwrite (/Users/trentm/tm/stream-chopper/node_modules/readable-stream/lib/_stream_writable.js:461:7)
                at WritableState.onwrite (/Users/trentm/tm/stream-chopper/node_modules/readable-stream/lib/_stream_writable.js:160:5)
                at Gzip.ondrain (/Users/trentm/tm/stream-chopper/index.js:59:5)
                at Gzip.emit (events.js:315:20)
                at afterWrite (internal/streams/writable.js:461:12)
                at onwrite (internal/streams/writable.js:446:7)
                at Gzip.afterTransform (internal/streams/transform.js:103:3)
                at Zlib.processCallback (zlib.js:586:8)
                at Zlib.callbackTrampoline (internal/async_hooks.js:131:14)
    }
[2021-03-08T22:44:54.921Z]  WARN (elastic-apm-node on pink.local): _writevCleaned: encoded objects
    process: {"pid": 29719}
    module: "elastic-apm-http-client"
    elapsedMs: 76.09864
    numObjs: 2313
    chunkLen: 2074515
Blocked for 79.46283703613281ms, operation started here: [
  '    at Gzip.Zlib (zlib.js:655:18)',
  '    at new Gzip (zlib.js:723:8)',
  '    at Object.value [as createGzip] (zlib.js:848:14)',
  '    at StreamChopper.transform [as _transform] (/Users/trentm/tm/apm-nodejs-http-client/index.js:130:31)',
  '    at StreamChopper._startStream (/Users/trentm/tm/stream-chopper/index.js:91:25)',
  '    at StreamChopper._write (/Users/trentm/tm/stream-chopper/index.js:202:10)',
  '    at doWrite (/Users/trentm/tm/stream-chopper/node_modules/readable-stream/lib/_stream_writable.js:409:139)',
  '    at writeOrBuffer (/Users/trentm/tm/stream-chopper/node_modules/readable-stream/lib/_stream_writable.js:398:5)',
  '    at StreamChopper.Writable.write (/Users/trentm/tm/stream-chopper/node_modules/readable-stream/lib/_stream_writable.js:307:11)',
  '    at Client._writevCleaned (/Users/trentm/tm/apm-nodejs-http-client/index.js:368:17)',
  '    at processBatch (/Users/trentm/tm/apm-nodejs-http-client/index.js:332:12)',
  '    at Client._writev (/Users/trentm/tm/apm-nodejs-http-client/index.js:350:3)',
  '    at doWrite (/Users/trentm/tm/apm-nodejs-http-client/node_modules/readable-stream/lib/_stream_writable.js:409:97)',
  '    at clearBuffer (/Users/trentm/tm/apm-nodejs-http-client/node_modules/readable-stream/lib/_stream_writable.js:505:5)',
  '    at Client.Writable.uncork (/Users/trentm/tm/apm-nodejs-http-client/node_modules/readable-stream/lib/_stream_writable.js:321:94)',
  '    at /Users/trentm/tm/apm-nodejs-http-client/index.js:431:14',
  '    at processTicksAndRejections (internal/process/task_queues.js:75:11)'
] ZLIB undefined

Note how the time in _writevCleaned doing event encoding (elapsedMs: 76.09864) is the lion's share of the time for this blocked op (Blocked for 79.46283703613281ms). This repeats for all the "Blocked for" outputs I had for a few test runs.
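The encoding issue (5) noted in an earlier comment suggests breaking that encode work up; a hedged sketch of one way to do it -- encode in batches and yield to the event loop between batches (the encode callback and batch size are illustrative; this is not the client's actual _writevCleaned):

    // Encode a large array of objects in fixed-size batches, using setImmediate
    // between batches so pending I/O and timers can run.
    function encodeInBatches (objs, encode, batchSize, done) {
      const chunks = []
      let i = 0
      function encodeNextBatch () {
        const end = Math.min(i + batchSize, objs.length)
        for (; i < end; i++) {
          chunks.push(encode(objs[i]))
        }
        if (i < objs.length) {
          setImmediate(encodeNextBatch) // yield the event loop
        } else {
          done(null, chunks.join(''))
        }
      }
      encodeNextBatch()
    }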


trentm commented on September 22, 2024

Nope, #143 wasn't a fix yet for this issue.


trentm commented on September 22, 2024

Grr. GH auto-linking and then auto-closing for only related tickets is biting me.
