Comments (8)

lucasmrod commented on September 22, 2024

@edwardsb @rfairburn

Fleet cloud environments have the following ELB configuration:

I was able to confirm that Fleet's dogfood environment closes an open connection after 1 hour, regardless of whether the connection has been active.

I used the following dummy changes:
https://github.com/fleetdm/fleet/compare/18783-test-changes
and ran fleetd with Fleet Desktop and scripts disabled (to have only one active connection to dogfood which calls GetConfig every 30 seconds).
(After running the test I realized I could have used https://pkg.go.dev/net/http/httptrace#ClientTrace :)

And got the following result:

connect: 2024-05-07 21:45:26.876302 -0300 -03 m=+2.021171248: tcp: <redacted dogfood URL>:443
connect: 2024-05-07 22:45:57.763169 -0300 -03 m=+3632.988133763: tcp: <redacted dogfood URL>:443

(The second connect happened after one hour.)

The above, plus the fact that orbit/config is a POST (which is not retried automatically), explains the sporadic "connection reset by peer" errors seen in fleetd logs. They are rare because, to reproduce one, the ELB must terminate the connection just as fleetd is attempting to make a request.

--

Solutions:

  1. Avoid printing the "connection reset by peer" errors in fleetd.
  2. Change fleetd to force reconnects after ~30m or so of a connection being open even if the connection is active. I haven't found a good/clean way to do this in Go (I can see some unattended proposals like golang/go#54429 and golang/go#43905).
  3. Decrease fleetd's TCP keep-alive probe interval from 30s to 15s to reduce the likelihood of trying to use a closed connection (at the cost of doubling probe traffic?). AFAICS the current KeepAlive value of 30s is pointless, because fleetd performs requests to Fleet every 30s or less (depending on whether it runs scripts, has Fleet Desktop, etc.).
  4. Increase "HTTP client keepalive duration" to 1 day or the max allowed value: 604800 seconds (7 days). This would reduce the likelihood of the "connection reset by peer" happening.
    TODO: What's the drawback of keeping a connection open for up to 7 days?
  5. Turn POST orbit/config into a GET: You still get the "connection reset by peer" error, but fleetd retries automatically (so no error logs). It would only work with newer Fleet server versions. I don't think it would be a big improvement, as fleetd already retries the GetConfig request after a few seconds.

To discuss:

A. We could start with (4): not necessarily increasing it to 7 days right away, but starting with 1 day and seeing whether the connection errors decrease.
B. We could do a combination of (3) and (4).

--

from fleet.

sharon-fdm commented on September 22, 2024

Thanks @lukeheath.
We will look at this asap.

lukeheath commented on September 22, 2024

@sharon-fdm I am prioritizing this as a P2 because the customer began seeing it on April 30, so I want to determine if a code change could have caused it.

This could also be an infrastructure or Fleet configuration issue. This customer has recently been onboarding a large number of new hosts. In that case, please provide support to @rfairburn in investigating further.

sharon-fdm commented on September 22, 2024

Timebox to today 3 pts

lucasmrod commented on September 22, 2024

In one of the proposals mentioned in (2) there's a "workaround" of calling Transport.CloseIdleConnections in a for loop (every few minutes). I will give that a go (it would also help reduce these "connection reset by peer" errors).

lucasmrod commented on September 22, 2024

@xpkoala Added QA notes.

lucasmrod commented on September 22, 2024

@rfairburn The fleetd change will be released next week.
As discussed, if we want to start reducing these errors, we can increase the Fleet cloud ELB's client_keep_alive.seconds from the default of 1 hour to 1 day.

fleet-release commented on September 22, 2024

Connection reset cleared,
Fleet's orbit/config tuned,
Silent, like dew drops.
