Comments (11)
For those watching this issue: v3.14.0, which includes this fix, was released yesterday.
from apm-nodejs-http-client.
Capturing some context from a conversation elsewhere.
[Our current] assumption/educated guess here is that: (a) it buffers up all traces to send, (b) the client.write*() methods eventually end up blocking when those buffers fill, and (c) possibly (though we don't have evidence yet) there are bugs that result in memory leaks if/when the APM server jam is cleared.
The issue did not appear (for me) when the demo app.js was handling only one request at a time (autocannon -p 1). That could mean it is still just a load issue, or it could mean there is a concurrency-handling bug in the apm-server client.
I spent this morning capturing some heap snapshots during these periods of an unresponsive APM server that lead to Node blocking. Nothing conclusive to draw from these yet, but I did notice that over time the number of "objects allocated but not released" appears to grow in an unbounded way.
My intuition/working theory is that something about the client's state during these periods leads to a growing number of objects that can't be released immediately, and this eventually leads to a situation where the garbage collector starts thrashing and blocking. (If that sounds a little hand-wavy, it is.)
Heap Snapshots are here if anyone wants to take a look: https://drive.google.com/file/d/1nYKgEboIKj9tY8FD1mycPxrqoNYxuPiZ/view?usp=sharing -- these should be in a format that Chrome's V8 debugging tools can read.
More data -- TL;DR -- it looks like garbage collection might be contributing here but isn't the sole reason for those long 5-10 second periods of Node blocking.
I used gc-profiler to log collection events across a 5-minute run. The gc-profiler package allows us to observe when V8 performs a garbage collection operation and tells us how long (in milliseconds) that operation took. This is important because Node will block while garbage collecting.
Pulling the logs into Elasticsearch via Filebeat, we came up with this chart. The chart above is dominated by increasingly long MarkSweepCompact GC events that reach as high as 800 milliseconds and occur routinely within 10-15 seconds of each other. If we remove these events from the chart, we can see more reasonable (if not great) times for the Scavenge and IncrementalMarking events.
So these GC pauses aren't the sole reason for these long periods of blocking, but I'm guessing they don't help things -- long GC pause, then unpause, a slew of pending network requests comes in, creating a bunch more objects, leading to an even longer GC pause, etc.
More info on Node.js garbage collection: Understanding Garbage Collection and Hunting Memory Leaks in Node.js and (more in depth) here.
One more datapoint:
Using the blocked-at package mentioned in this repo, we've been able to see messages like this:
Blocked for 3514.5704949951173ms, operation started here: [
' at Gzip.Zlib (zlib.js:656:18)',
' at new Gzip (zlib.js:724:8)',
' at Object.value [as createGzip] (zlib.js:849:14)',
' at StreamChopper.transform [as _transform] (/path/to/elastic-experiments/node_modules/elastic-apm-http-client/index.js:119:19)',
' at StreamChopper._startStream (/path/to/elastic-experiments/node_modules/stream-chopper/index.js:87:25)',
' at StreamChopper._write (/path/to/elastic-experiments/node_modules/stream-chopper/index.js:192:10)',
' at /path/to/elastic-experiments/node_modules/stream-chopper/index.js:193:12',
' at /path/to/elastic-experiments/node_modules/stream-chopper/index.js:116:7',
' at /path/to/elastic-experiments/node_modules/elastic-apm-http-client/index.js:523:7',
' at /path/to/elastic-experiments/node_modules/pump/index.js:75:7',
' at f (/path/to/elastic-experiments/node_modules/once/once.js:25:25)',
' at ClientRequest.<anonymous> (/path/to/elastic-experiments/node_modules/pump/index.js:31:21)',
' at ClientRequest.f (/path/to/elastic-experiments/node_modules/once/once.js:25:25)',
' at ClientRequest.onerror (/path/to/elastic-experiments/node_modules/end-of-stream/index.js:44:12)',
' at ClientRequest.emit (events.js:327:22)',
' at Socket.socketErrorListener (_http_client.js:469:9)',
' at Socket.emit (events.js:315:20)',
' at emitErrorNT (internal/streams/destroy.js:106:8)',
' at emitErrorCloseNT (internal/streams/destroy.js:74:3)',
' at processTicksAndRejections (internal/process/task_queues.js:80:21)'
]
under heavy load. There are other situations where things block, but this is the long one.
This indicates that the longest blocking periods are due to the creation of the gzip stream, for reasons that aren't immediately apparent.
I've been looking at this. Some summary notes so far:
1. The current onStream handling doesn't wait for and check the APM server response before calling next() to move on to the next chopped stream. This means multiple "chopped stream" requests can be in flight at the same time if the apm-server is slow or non-responsive.
2. If the apm-server endpoint accepts the connection but doesn't read its input stream, then (IIUC) the gzip stream's buffer wedges, with the result that you get a "finish" event from the zlib.createGzip() stream but you do not get an "end" event... at least not until we manually destroy the stream in the 30s serverTimeout. That is why, at least in my testing scenario, I was getting 30s+ (actually often >60s) delays before the next request would go through.
3. (Possibly only considered a nice-to-have for this ticket.) The current client does no backoff as specified at https://github.com/elastic/apm/blob/master/specs/agents/transport.md#transport-errors
4. The main kicker: there is no dropping of events (transactions, spans, et al.) from the buffer/queue when they are coming in too fast. If apm-server is down, they build up forever on client._writableState.$iForgetTheFieldName.
5. Also, (at least some of) the latency/event-loop bubbles are from Client.prototype._writev doing encoding of all events in one big chunk synchronously. Those need to be broken up to not hog the event loop.
Next week I'll get a starter PR with some fix ideas.
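The "no dropping of events" point above amounts to putting a bound on the buffer. A hypothetical sketch (BoundedEventQueue and its fields are invented names, not the client's API):

```javascript
// Hypothetical sketch: cap the number of buffered events and count drops,
// instead of letting the queue grow forever while apm-server is down.
class BoundedEventQueue {
  constructor (maxQueueSize = 1024) {
    this.maxQueueSize = maxQueueSize;
    this.queue = [];
    this.dropped = 0; // worth reporting periodically so drops are visible
  }

  push (event) {
    if (this.queue.length >= this.maxQueueSize) {
      this.dropped++;
      return false; // event dropped
    }
    this.queue.push(event);
    return true;
  }
}
```

Dropping (rather than blocking the producer) keeps instrumented request handlers from stalling when the APM server is unavailable.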
implementation plan
Here is a possible implementation plan for the issues described in this earlier comment.
1. Strongly consider getting #139 merged for this work so we are always running with a keep-alive agent and not having to test with- and without-keep-alive scenarios.
2. Redo the onStream (the handler for the chopped-stream "stream" event) implementation to handle (a) waiting for a response from the apm-server (handles (1)) and (b) using a new serverResponseTimeout timer on the gzip stream "finish" event (handles (2)). See the "intake request notes" section below.
3. Update the http-client sendSpan, sendTransaction, sendError, and sendMetricSet methods to drop an event if the buffer is over limit (handles (4)).
4. Re-measure and add logging to see if we still get the too-large .encode()s (handles (5)). I think we still will. _writevCleaned is already async, so it should be trivial to have an async.forEach (in series, parallel, or in batches) for this.
5. That leaves backoff (3). Could do this with a feature on StreamChopper, setStartStreamDelay(delay), that would be called before calling next(). StreamChopper's _startStream would wait that delay before emitting the next "stream" event. The http-client's onStream handler would set that stream delay appropriately based on APM server responses.
   Update 20210310: Might be able to handle this backoff delay by not calling next() -- the callback to StreamChopper's "stream" event -- until after the backoff delay. That'd avoid adding features to StreamChopper.
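If I'm reading the linked transport spec correctly, the reconnect delay is min(reconnectCount, 6)^2 seconds with up to ±10% random jitter; a sketch of that calculation:

```javascript
// Backoff delay per the agents transport spec (as I read it):
// min(reconnectCount, 6)^2 seconds, with +/-10% random jitter.
function backoffDelayMs (reconnectCount) {
  const seconds = Math.pow(Math.min(reconnectCount, 6), 2);
  const jitter = (Math.random() * 0.2 - 0.1) * seconds; // +/-10%
  return (seconds + jitter) * 1000;
}

// 0 -> 0ms, 1 -> ~1s, 2 -> ~4s, ... capped at ~36s from the 6th failure on.
```

Whether the delay lives in StreamChopper or in the deferred next() call, the computed value would be the same.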
intake request notes
Each "intake request" (the name I'm using for each chopped-stream "stream"
event handler call) can conclude (I'm intentionally avoiding the overloaded
terms "end", "finish") for any of the following reasons:
- the 10s apiRequestTime - handled by stream-chopper's _timer calling _endStream
- the 768kB apiRequestSize - handled by stream-chopper counting bytes in the "data" event of the transform stream (a zlib.createGzip()) and calling _endStream if the bytes written plus bytes buffered exceed that limit
- the 30s serverTimeout - a socket.setTimeout set by the http-client in onStream; this measures 30s of socket inactivity and, IME, can take 60s or more (perhaps because delayed flushing in the Gzip stream resets that idle time)
- an early response from apm-server - handled by the "response" event handler in the http-client onStream
- an "error" event on the HTTP request or the gzip stream - handled by "error" event handlers in the http-client onStream. Typically these error events would be ECONNRESET or ECONNREFUSED due to connection troubles at any point in the request. Q: What are other error causes?
- a new proposed ~5s (?) serverResponseTimeout - handled by the http-client in onStream. This is a timer started after we are "done" sending data to the intake API, as measured by the "finish" event on the gzip stream. We cannot use the more typical "end" event because of issue (2) from above.
- Q: How is an intake request terminated for http-client shutdown?
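That proposed serverResponseTimeout could be wired up roughly like this (a hypothetical sketch; the function name and callback shape are invented):

```javascript
// Hypothetical sketch: start a response timer once the gzip stream "finish"es
// (we are done sending), and cancel it if/when the server responds.
function startServerResponseTimeout (gzipStream, req, timeoutMs, onTimeout) {
  let timer = null;
  gzipStream.on('finish', () => {
    // Done sending; the server now has a bounded time to respond.
    timer = setTimeout(onTimeout, timeoutMs);
  });
  req.on('response', () => {
    if (timer) clearTimeout(timer);
  });
  // Return a cancel function for shutdown/error paths.
  return () => { if (timer) clearTimeout(timer); };
}
```

On timeout the client would destroy the request (and gzip stream) rather than waiting out the 30s+ observed with the socket-inactivity serverTimeout.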
A typical "successful" intake request should happen like this:
- the StreamChopper starts a new stream:
  - creates a Gzip transform stream
  - emits a "stream" event with that gzip stream and a next callback
  - starts writing encoded event data to it
- onStream (will probably rename this) in the http-client handles the "stream" event:
  - creates an HTTP request to the APM server
  - pipes the gzip stream into the HTTP request stream
  - listens for appropriate events (gzip "finish", gzip and req "end", gzip and req "error", req "response")
- the StreamChopper ends the stream (_endStream) for either of apiRequestTime or apiRequestSize:
  - calls gzipStream.end(), which begins draining and flushing the gzip stream
  - the gzip stream emits "finish" when it has completed processing its input (though not necessarily completed flushing all data downstream)
- the http-client handles the gzip "finish" event:
  - starts a serverResponseTimeout timer
- apm-server responds and we get a "response" event on the request:
  - the statusCode is 200
  - read the apm-server response body, which involves waiting for the "end" event on the res
- "end" event on the res (the apm-server response):
  - clear timeouts
  - we should have seen "end" events on gzipStream and req
  - update the "backoff state" for the next intake request
  - call the StreamChopper next callback to allow the next intake request
tl;dr: Circumstantial evidence showing the "Blocked for" occurrences are due to the encoding in _writevCleaned.
Here, I'll annotate the blocked-at stack that Alan posted above to show how it is caused (I believe) by a slow Client.prototype._writevCleaned.
First, background on how blocked-at works: it adds an async hook that:
- captures a stack when an async op is created (init),
- notes the time when the async op starts running (before),
- calculates the run time when the async op stops running (after).
If the run time is greater than threshold (default 20ms), it calls its callback, which is configured here to print out these "Blocked for" messages.
The important point for pinpointing the slow code here is that the captured stack only identifies the call stack when the async op is created. The slow code can be anything in that function or any synchronous calls it makes. So let's work backwards on this call stack to see what the synchronous call stack after Gzip.Zlib
might be:
Blocked for 3514.5704949951173ms, operation started here: [
' at Gzip.Zlib (zlib.js:656:18)',
From the line numbers I infer this is roughly node v14.15.4. This line is where new binding.Zlib(mode)
calls the zlib C++ binding ZlibStream ctor. This subclasses CompressionStream which creates the "ZLIB" async op.
' at new Gzip (zlib.js:724:8)',
' at Object.value [as createGzip] (zlib.js:849:14)',
' at StreamChopper.transform [as _transform] (/path/to/elastic-experiments/node_modules/elastic-apm-http-client/index.js:119:19)',
Here we are creating the gzip stream ...
' at StreamChopper._startStream (/path/to/elastic-experiments/node_modules/stream-chopper/index.js:87:25)',
... as part of StreamChopper starting a new stream
' at StreamChopper._write (/path/to/elastic-experiments/node_modules/stream-chopper/index.js:192:10)',
... which is called as a result of _write
having been called with data (from the http Client, whenever it is uncorked and has data)
' at /path/to/elastic-experiments/node_modules/stream-chopper/index.js:193:12',
... which is a callback used to re-call _write(chunk, ...)
until after the stream is started
' at /path/to/elastic-experiments/node_modules/stream-chopper/index.js:116:7',
... this shows the this._onunlock
mechanics for that callback.
' at /path/to/elastic-experiments/node_modules/elastic-apm-http-client/index.js:523:7',
... this was all after an "error" event handler that ended the pump() in the Client's onStream handler.
' at /path/to/elastic-experiments/node_modules/pump/index.js:75:7',
' at f (/path/to/elastic-experiments/node_modules/once/once.js:25:25)',
' at ClientRequest.<anonymous> (/path/to/elastic-experiments/node_modules/pump/index.js:31:21)',
' at ClientRequest.f (/path/to/elastic-experiments/node_modules/once/once.js:25:25)',
' at ClientRequest.onerror (/path/to/elastic-experiments/node_modules/end-of-stream/index.js:44:12)',
' at ClientRequest.emit (events.js:327:22)',
' at Socket.socketErrorListener (_http_client.js:469:9)',
' at Socket.emit (events.js:315:20)',
' at emitErrorNT (internal/streams/destroy.js:106:8)',
' at emitErrorCloseNT (internal/streams/destroy.js:74:3)',
' at processTicksAndRejections (internal/process/task_queues.js:80:21)'
]
To be honest, I don't show here that this leads to the _writev call on the Client stream, which leads to the long _writevCleaned. Instead I offer this circumstantial evidence from a load-testing run with both blocked-at and my added logging:
[2021-03-08T22:44:54.844Z] WARN (elastic-apm-node on pink.local): _writev 2313 objs
process: {"pid": 29719}
module: "elastic-apm-http-client"
writableState: (143) {
"needDrain": true,
"length": 2313,
"writing": true,
"corked": 0,
"bufferProcessing": true,
"writelen": 2313,
"pendingcb": 2366,
"bufferedRequestCount": 2313
}
error: (1361) {
"type": "Error",
"message": "hi",
"stack_trace":
Error: hi
at Client._writev (/Users/trentm/tm/apm-nodejs-http-client/index.js:317:60)
at doWrite (/Users/trentm/tm/apm-nodejs-http-client/node_modules/readable-stream/lib/_stream_writable.js:409:97)
at clearBuffer (/Users/trentm/tm/apm-nodejs-http-client/node_modules/readable-stream/lib/_stream_writable.js:505:5)
at onwrite (/Users/trentm/tm/apm-nodejs-http-client/node_modules/readable-stream/lib/_stream_writable.js:455:7)
at WritableState.onwrite (/Users/trentm/tm/apm-nodejs-http-client/node_modules/readable-stream/lib/_stream_writable.js:160:5)
at afterWrite (/Users/trentm/tm/stream-chopper/node_modules/readable-stream/lib/_stream_writable.js:469:3)
at onwrite (/Users/trentm/tm/stream-chopper/node_modules/readable-stream/lib/_stream_writable.js:461:7)
at WritableState.onwrite (/Users/trentm/tm/stream-chopper/node_modules/readable-stream/lib/_stream_writable.js:160:5)
at Gzip.ondrain (/Users/trentm/tm/stream-chopper/index.js:59:5)
at Gzip.emit (events.js:315:20)
at afterWrite (internal/streams/writable.js:461:12)
at onwrite (internal/streams/writable.js:446:7)
at Gzip.afterTransform (internal/streams/transform.js:103:3)
at Zlib.processCallback (zlib.js:586:8)
at Zlib.callbackTrampoline (internal/async_hooks.js:131:14)
}
[2021-03-08T22:44:54.921Z] WARN (elastic-apm-node on pink.local): _writevCleaned: encoded objects
process: {"pid": 29719}
module: "elastic-apm-http-client"
elapsedMs: 76.09864
numObjs: 2313
chunkLen: 2074515
Blocked for 79.46283703613281ms, operation started here: [
' at Gzip.Zlib (zlib.js:655:18)',
' at new Gzip (zlib.js:723:8)',
' at Object.value [as createGzip] (zlib.js:848:14)',
' at StreamChopper.transform [as _transform] (/Users/trentm/tm/apm-nodejs-http-client/index.js:130:31)',
' at StreamChopper._startStream (/Users/trentm/tm/stream-chopper/index.js:91:25)',
' at StreamChopper._write (/Users/trentm/tm/stream-chopper/index.js:202:10)',
' at doWrite (/Users/trentm/tm/stream-chopper/node_modules/readable-stream/lib/_stream_writable.js:409:139)',
' at writeOrBuffer (/Users/trentm/tm/stream-chopper/node_modules/readable-stream/lib/_stream_writable.js:398:5)',
' at StreamChopper.Writable.write (/Users/trentm/tm/stream-chopper/node_modules/readable-stream/lib/_stream_writable.js:307:11)',
' at Client._writevCleaned (/Users/trentm/tm/apm-nodejs-http-client/index.js:368:17)',
' at processBatch (/Users/trentm/tm/apm-nodejs-http-client/index.js:332:12)',
' at Client._writev (/Users/trentm/tm/apm-nodejs-http-client/index.js:350:3)',
' at doWrite (/Users/trentm/tm/apm-nodejs-http-client/node_modules/readable-stream/lib/_stream_writable.js:409:97)',
' at clearBuffer (/Users/trentm/tm/apm-nodejs-http-client/node_modules/readable-stream/lib/_stream_writable.js:505:5)',
' at Client.Writable.uncork (/Users/trentm/tm/apm-nodejs-http-client/node_modules/readable-stream/lib/_stream_writable.js:321:94)',
' at /Users/trentm/tm/apm-nodejs-http-client/index.js:431:14',
' at processTicksAndRejections (internal/process/task_queues.js:75:11)'
] ZLIB undefined
Note how the time in _writevCleaned doing event encoding (elapsedMs: 76.09864) is the lion's share of the time for this blocked op (Blocked for 79.46283703613281ms). This repeats for all the "Blocked for" outputs I had across a few test runs.
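One way to break up that big synchronous encode, mentioned in the summary notes earlier, is to yield to the event loop between batches. A hypothetical sketch (encodeObject and the batch size are invented stand-ins for the client's encoding):

```javascript
// Hypothetical sketch: encode queued events in fixed-size batches, yielding
// to the event loop between batches so a few thousand buffered events can't
// produce one multi-hundred-millisecond synchronous encode.
function encodeInBatches (objs, encodeObject, batchSize, done) {
  const chunks = [];
  let i = 0;
  function encodeNextBatch () {
    const end = Math.min(i + batchSize, objs.length);
    for (; i < end; i++) {
      chunks.push(encodeObject(objs[i]));
    }
    if (i < objs.length) {
      setImmediate(encodeNextBatch); // let pending I/O and timers run
    } else {
      done(null, chunks.join(''));
    }
  }
  encodeNextBatch();
}
```

This trades a little total encode time for much smaller individual event-loop stalls, which is the right trade for an in-process agent.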
Nope, #143 wasn't a fix yet for this issue.
Grr. GH auto-linking and then auto-closing for only related tickets is biting me.