Comments (15)

bcwaldon avatar bcwaldon commented on July 19, 2024

@bashcoder I suspect the client is hiding some errors from you. When you reduce your cluster size to 2, your etcd cluster can no longer achieve a quorum, making writes impossible. Can you look at the logs of the remaining fleet services for anything suspicious?
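For readers unfamiliar with the quorum requirement mentioned here, the sketch below shows the general majority arithmetic used by consensus clusters. It is generic arithmetic, not etcd code, and the function names are hypothetical. It assumes the member count is the number of configured members, not the number currently alive.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


# Print the majority size and failure tolerance for small clusters.
for n in range(1, 6):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

Under this arithmetic a two-member cluster needs both members to agree, so it cannot lose either one and keep accepting writes.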

from fleet.

bcwaldon avatar bcwaldon commented on July 19, 2024

To clarify - I suspect your autoscaling group killed your etcd cluster leader, causing a leader election among two nodes. I do not think this is resolvable and the cluster will lock up. Waiting to hear back from @philips

bashcoder avatar bashcoder commented on July 19, 2024

OK, I found the etcd log but didn't see any error messages in it that seemed helpful. I then started another cluster from scratch using an auto-scaling group of 4 servers. I started the three example units again, and they all started on the same host.

I then terminated the instance that was running the units. A new host was added in its place, so the output of fleetctl list-machines once again showed 4 server entries:

$ fleetctl list-machines
MACHINE      IP             METADATA
cfe623e3...  10.225.27.238  -
cfafe2c4...  10.181.228.98  -
041ad036...  10.167.4.22    -
5de3f7e5...  10.178.6.185   -

But the output of fleetctl list-units showed that the units were still running on the terminated host:

$ fleetctl list-units
UNIT           LOAD    ACTIVE  SUB      DESC  MACHINE
hello.service  loaded  active  running  -     60e7a353.../10.157.68.172
ping.service   loaded  active  running  -     60e7a353.../10.157.68.172
pong.service   loaded  active  running  -     60e7a353.../10.157.68.172

At this point, the fleetctl journal hello.service and fleetctl status hello.service commands hung indefinitely without timing out.

However, after running fleetctl stop hello.service and fleetctl start hello.service, that unit did start up again on a valid host:

$ fleetctl list-units
UNIT           LOAD    ACTIVE  SUB      DESC  MACHINE
hello.service  loaded  active  running  -     5de3f7e5.../10.178.6.185
ping.service   loaded  active  running  -     60e7a353.../10.157.68.172
pong.service   loaded  active  running  -     60e7a353.../10.157.68.172

So, even though I avoided the problem with the quorum/leader election with only two servers, the units did not migrate properly when their server was terminated. It's possible that the terminated server was also the leader, but I don't know how to determine that. Hope this helps.
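The mismatch described above can be spotted mechanically by comparing the two listings. The following is a minimal sketch, not part of fleet: the helper names are hypothetical, and it assumes the text layout shown in this comment (a header row, whitespace-separated columns, and a final MACHINE column of the form ID/IP).

```python
def live_machine_ids(list_machines_output):
    """Return the set of machine IDs from `fleetctl list-machines` text."""
    rows = list_machines_output.strip().splitlines()[1:]  # skip header row
    return {row.split()[0] for row in rows if row.strip()}


def stranded_units(list_units_output, live_ids):
    """Return unit names whose MACHINE column names an ID not in live_ids."""
    stranded = []
    for row in list_units_output.strip().splitlines()[1:]:  # skip header row
        fields = row.split()
        if not fields:
            continue
        machine_id = fields[-1].split("/")[0]  # e.g. "60e7a353..."
        if machine_id not in live_ids:
            stranded.append(fields[0])
    return stranded


# Listings copied from this comment (second list-units output).
MACHINES = """\
MACHINE IP METADATA
cfe623e3... 10.225.27.238 -
cfafe2c4... 10.181.228.98 -
041ad036... 10.167.4.22 -
5de3f7e5... 10.178.6.185 -
"""

UNITS = """\
UNIT LOAD ACTIVE SUB DESC MACHINE
hello.service loaded active running - 5de3f7e5.../10.178.6.185
ping.service loaded active running - 60e7a353.../10.157.68.172
pong.service loaded active running - 60e7a353.../10.157.68.172
"""

stranded = stranded_units(UNITS, live_machine_ids(MACHINES))
print(stranded)
```

With the listings from this comment, ping.service and pong.service are reported as stranded, because machine 60e7a353... no longer appears in list-machines.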

philips avatar philips commented on July 19, 2024

One hunch: the clocks aren't in sync between the machines, causing TTL lag.
(Email reply to bashcoder's comment of Feb 20, 2014, 5:06 PM.)

bcwaldon avatar bcwaldon commented on July 19, 2024

@bashcoder I'll start digging into this today.

Btw, the behavior you saw with journal and ssh hanging indefinitely should be fixed in master. Now they'll time out after 10s if they can't dial the IP reported in the JobState.
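To illustrate the idea of a bounded dial attempt, here is a minimal Python sketch. It is not fleet's code (fleet is written in Go); the function name is hypothetical, and the choice of port 22 in the usage note is an assumption.

```python
import socket


def can_dial(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection opens within `timeout` seconds.

    Any connection failure, including a timeout, is reported as False
    instead of blocking the caller indefinitely.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, unreachable hosts
        return False
```

A caller could check something like can_dial("10.157.68.172", 22) before attempting to stream a journal, and report an error when the machine recorded for the unit is no longer reachable.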

philips avatar philips commented on July 19, 2024

@bcwaldon When should we cut a new fleetctl binary release with these UX fixes?

bcwaldon avatar bcwaldon commented on July 19, 2024

@philips Once these are done: https://github.com/coreos/fleet/issues?milestone=3&state=open

bcwaldon avatar bcwaldon commented on July 19, 2024

Ok - one lead here. If the etcd leader in the cluster used by fleet restarts, the fleet cluster becomes unusable. I simulated your deployment by manually deploying 4 vms, then I restarted the one that was the etcd leader. The cluster members were no longer receiving any events from etcd, and therefore could not actually do anything. I'll keep digging.

bashcoder avatar bashcoder commented on July 19, 2024

Yes, I think you're on to something there, @bcwaldon. I'll run a few more scenarios on my end too to see if I can confirm.

bcwaldon avatar bcwaldon commented on July 19, 2024

I'm thinking https://github.com/coreos/go-etcd/issues/106 is the root of the issue here.

/cc @bashcoder @philips

bcwaldon avatar bcwaldon commented on July 19, 2024

@bashcoder Good news, everyone! #167

bashcoder avatar bashcoder commented on July 19, 2024

Awesome! Great job on the patches and the logging updates @bcwaldon!

bcwaldon avatar bcwaldon commented on July 19, 2024

Closing, as I believe this is fixed in master. Reopen if not.

bashcoder avatar bashcoder commented on July 19, 2024

Great - any idea when this stuff will make its way into a CoreOS release?

bcwaldon avatar bcwaldon commented on July 19, 2024

Hopefully within the next couple of weeks. We're working through a large change in CoreOS that we're blocking the next release on.
