Comments (8)
Fleet cloud environments have the following ELB configuration:
- "Connection idle timeout": 305 seconds
- "HTTP client keepalive duration": 3600 seconds (default value). (see HTTP client keepalive duration)
I was able to confirm that Fleet's dogfood environment closes an open connection after 1 hour, regardless of whether the connection has been active.
I used the following dummy changes:
https://github.com/fleetdm/fleet/compare/18783-test-changes
and ran fleetd with Fleet Desktop and scripts disabled (to have only one active connection to dogfood which calls GetConfig every 30 seconds).
(After running the test I realized I could have used https://pkg.go.dev/net/http/httptrace#ClientTrace :)
And got the following result:
connect: 2024-05-07 21:45:26.876302 -0300 -03 m=+2.021171248: tcp: <redacted dogfood URL>:443
connect: 2024-05-07 22:45:57.763169 -0300 -03 m=+3632.988133763: tcp: <redacted dogfood URL>:443
(The second connect happened after one hour.)
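For reference, the httptrace approach mentioned above could look roughly like the sketch below. It is not the code from the test branch: the helper names (`newConnectLogger`, `postTraced`, `countConnects`) are made up for illustration, and a local `httptest` server stands in for dogfood. It logs a line each time the client opens a new TCP connection, so a reconnect forced by the load balancer would show up as a second "connect" line.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
	"sync/atomic"
	"time"
)

// newConnectLogger (hypothetical helper) returns a trace that logs and
// counts every successfully established TCP connection.
func newConnectLogger(connects *atomic.Int64) *httptrace.ClientTrace {
	return &httptrace.ClientTrace{
		ConnectDone: func(network, addr string, err error) {
			if err != nil {
				return
			}
			connects.Add(1)
			fmt.Printf("connect: %s: %s: %s\n",
				time.Now().Format(time.RFC3339Nano), network, addr)
		},
	}
}

// postTraced sends one POST with the trace attached and drains the
// response body so the connection can be reused by the next request.
func postTraced(client *http.Client, url string, trace *httptrace.ClientTrace) error {
	req, err := http.NewRequest(http.MethodPost, url, nil)
	if err != nil {
		return err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

// countConnects issues n sequential requests against a local test server
// (a stand-in for the real Fleet server) and reports how many TCP
// connections were opened.
func countConnects(n int) (int64, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	var connects atomic.Int64
	trace := newConnectLogger(&connects)
	client := srv.Client()
	for i := 0; i < n; i++ {
		if err := postTraced(client, srv.URL, trace); err != nil {
			return 0, err
		}
	}
	return connects.Load(), nil
}

func main() {
	n, err := countConnects(3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("new connections:", n)
}
```

With keep-alive working, the three requests share one connection, so only one "connect" line is logged.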
The above, plus the fact that orbit/config is a POST (which cannot be retried), explains the sporadic "connection reset by peer" errors seen in fleetd logs. They are very sporadic because, to reproduce, the connection must be terminated by the ELB just as fleetd is attempting to make a request.
--
Solutions:
1. Avoid printing the "connection reset by peer" errors in fleetd.
2. Change fleetd to force a reconnect after ~30m or so of a connection being open, even if the connection is active. I haven't found a good/clean way to do this in Go (I can see some unattended proposals like golang/go#54429 and golang/go#43905).
3. Decrease fleetd's TCP keep-alive probing interval from 30s to 15s to reduce the likelihood of trying to use a closed connection (at the cost of doubling probe traffic? Given that fleetd performs requests every 30s or less, depending on whether it runs scripts, has Fleet Desktop, etc.). AFAICS the current KeepAlive value of 30s is pointless because fleetd performs requests to Fleet every 30s or less...
4. Increase "HTTP client keepalive duration" to 1 day or to the max allowed value: 604800 seconds (7 days). This would reduce the likelihood of the "connection reset by peer" error happening.
   TODO: What's the drawback of keeping a connection open for up to 7 days?
   - resource-wise: maybe more resource utilization on both server and client side? I'm not so sure, as fleetd will re-open a connection right away. (Reconnecting may even be good if there are memory leaks in the connection?)
   - security-wise: I don't think there's an issue. Using TLS 1.2 means there's session resumption already, and TLS 1.3 improves the security of re-used sessions.
5. Turn POST orbit/config into a GET: You still get the "connection reset by peer" error, but fleetd retries automatically (so no error logs). It would only work with newer Fleet server versions. I don't think it would be a big improvement, as fleetd already retries the GetConfig request after a few seconds.
To discuss:
A. We could start with (4). I'm not saying we should increase it to 7 days right away; we can start by increasing it to 1 day and see if the connection errors decrease.
B. We could do a combination of (3) and (4).
--
from fleet.
Thanks @lukeheath.
We will look at this asap.
from fleet.
@sharon-fdm I am prioritizing this as a P2 because the customer began seeing it on April 30, so I want to determine if a code change could have caused it.
This could also be an infrastructure or Fleet configuration issue. This customer has recently been onboarding a large number of new hosts. In that case, please provide support to @rfairburn in investigating further.
from fleet.
Timebox to today 3 pts
from fleet.
In one of the proposals mentioned in (2) there's a "workaround" of calling Transport.CloseIdleConnections in a loop (every few minutes). I will give that a go (it would also help reduce these "connection reset by peer" errors).
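A minimal sketch of that workaround, assuming the client uses an `*http.Transport` we control (this is not the actual fleetd change; `closeIdleOnTick` and `dialsAcrossTick` are illustrative names). `CloseIdleConnections` does not interrupt connections that are in use, so it only drops the connection while it sits idle between requests, and the next request dials a fresh one. The tick source is passed in as a channel so the demo can trigger it deterministically against a local test server.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// closeIdleOnTick closes the transport's idle connections on every tick
// until ctx is cancelled.
func closeIdleOnTick(ctx context.Context, t *http.Transport, ticks <-chan time.Time) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticks:
			t.CloseIdleConnections()
		}
	}
}

// dialsAcrossTick sends one request before and one after a tick and
// returns how many TCP connections were dialed in total.
func dialsAcrossTick() (int64, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {}))
	defer srv.Close()

	var dials atomic.Int64
	dialer := &net.Dialer{}
	transport := &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			dials.Add(1) // count every new connection
			return dialer.DialContext(ctx, network, addr)
		},
	}
	client := &http.Client{Transport: transport}

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	ticks := make(chan time.Time)
	go closeIdleOnTick(ctx, transport, ticks)

	post := func() error {
		resp, err := client.Post(srv.URL, "application/json", nil)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		_, err = io.Copy(io.Discard, resp.Body)
		return err
	}

	if err := post(); err != nil {
		return 0, err
	}
	ticks <- time.Now()
	// The second send is only received after the first tick was handled,
	// so the idle connection is closed by the time it returns.
	ticks <- time.Now()
	if err := post(); err != nil {
		return 0, err
	}
	return dials.Load(), nil
}

func main() {
	n, err := dialsAcrossTick()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("connections dialed:", n)
}
```

In a long-running client the channel would come from something like `time.NewTicker(5 * time.Minute).C`; the interval just needs to be comfortably shorter than the load balancer's 1h client keepalive duration.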
from fleet.
@xpkoala Added QA notes.
from fleet.
@rfairburn The fleetd change will be released next week.
As discussed, if we want to start reducing these errors we can increase the Fleet cloud ELB's client_keep_alive.seconds from the default of 1h to 1 day.
from fleet.
Connection reset cleared,
Fleet's orbit/config tuned,
Silent, like dew drops.
from fleet.