I had a 3-machine CoreOS cluster running the 3 example fleet services. After r


Running units not migrating when their machine shuts down (fleet issue, closed, 15 comments)

coreos commented on July 19, 2024

Running units not migrating when their machine shuts down

from fleet.

Comments (15)

bcwaldon commented on July 19, 2024

@bashcoder I suspect the client is hiding some errors from you. When you reduce your cluster size to 2, your etcd cluster can no longer achieve a quorum, making writes impossible. Can you look at the logs of the remaining fleet services for anything suspicious?

bcwaldon commented on July 19, 2024

To clarify - I suspect your autoscaling group killed your etcd cluster leader, causing a leader election among two nodes. I do not think this is resolvable, and the cluster will lock up. Waiting to hear back from @philips.
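
For reference, the majority arithmetic behind the quorum concern can be sketched as follows. This assumes the usual Raft-style rule that writes need a strict majority of the configured members; `quorum` and `tolerated` are illustrative names, not fleet or etcd APIs.

```go
package main

import "fmt"

// quorum returns the smallest strict majority of a cluster with n voting members.
// Assumption: a Raft-style majority rule; this is not code from etcd or fleet.
func quorum(n int) int {
	return n/2 + 1
}

// tolerated returns how many members can be lost while a majority remains.
func tolerated(n int) int {
	return n - quorum(n)
}

func main() {
	for _, n := range []int{2, 3, 4, 5} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n", n, quorum(n), tolerated(n))
	}
}
```

Under this rule a membership of two tolerates no failures at all. Whether a real cluster locks up also depends on what its membership list contains after an instance is replaced, which this sketch does not model.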

bashcoder commented on July 19, 2024

OK, I found the etcd log and didn't see any error messages in it that seemed helpful. But I started another cluster from scratch using an auto-scaling group of 4 servers. I started the three example units again, and they all started on the same host.

I then terminated the instance that was running the units. A new host was added in its place, so the output of fleetctl list-machines once again showed 4 server entries:

$ fleetctl list-machines
MACHINE      IP             METADATA
cfe623e3...  10.225.27.238  -
cfafe2c4...  10.181.228.98  -
041ad036...  10.167.4.22    -
5de3f7e5...  10.178.6.185   -

But the output of fleetctl list-units showed that the units were still running on the terminated host:

$ fleetctl list-units
UNIT           LOAD    ACTIVE  SUB      DESC  MACHINE
hello.service  loaded  active  running  -     60e7a353.../10.157.68.172
ping.service   loaded  active  running  -     60e7a353.../10.157.68.172
pong.service   loaded  active  running  -     60e7a353.../10.157.68.172

At this point, the fleetctl journal hello.service and fleetctl status hello.service commands hung indefinitely without timing out.

However, after running fleetctl stop hello.service and fleetctl start hello.service, that unit did start up again on a valid host:

$ fleetctl list-units
UNIT           LOAD    ACTIVE  SUB      DESC  MACHINE
hello.service  loaded  active  running  -     5de3f7e5.../10.178.6.185
ping.service   loaded  active  running  -     60e7a353.../10.157.68.172
pong.service   loaded  active  running  -     60e7a353.../10.157.68.172

So, even though I avoided the problem with the quorum/leader election with only two servers, the units did not migrate properly when their server was terminated. It's possible that the terminated server was also the leader, but I don't know how to determine that. Hope this helps.
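
The mismatch described above (units reported on a machine that no longer appears in the machine list) can be checked mechanically. The following is a minimal sketch that cross-references the two listings pasted in this comment; `liveMachines` and `orphanedUnits` are hypothetical helper names, and the parsing assumes the whitespace-separated column layout shown here rather than any documented fleetctl output format.

```go
package main

import (
	"fmt"
	"strings"
)

// Sample output copied from this comment (header row first, no shell prompt).
const machinesOut = `MACHINE      IP             METADATA
cfe623e3...  10.225.27.238  -
cfafe2c4...  10.181.228.98  -
041ad036...  10.167.4.22    -
5de3f7e5...  10.178.6.185   -`

const unitsOut = `UNIT           LOAD    ACTIVE  SUB      DESC  MACHINE
hello.service  loaded  active  running  -     5de3f7e5.../10.178.6.185
ping.service   loaded  active  running  -     60e7a353.../10.157.68.172
pong.service   loaded  active  running  -     60e7a353.../10.157.68.172`

// liveMachines collects the short machine IDs from list-machines style
// output, skipping the header row.
func liveMachines(out string) map[string]bool {
	ids := make(map[string]bool)
	for i, line := range strings.Split(strings.TrimSpace(out), "\n") {
		fields := strings.Fields(line)
		if i == 0 || len(fields) == 0 {
			continue
		}
		ids[fields[0]] = true
	}
	return ids
}

// orphanedUnits returns the names of units whose MACHINE column (the last
// field, formatted as "id/ip") names an ID that is not in live.
func orphanedUnits(out string, live map[string]bool) []string {
	var orphans []string
	for i, line := range strings.Split(strings.TrimSpace(out), "\n") {
		fields := strings.Fields(line)
		if i == 0 || len(fields) < 2 {
			continue
		}
		machine := strings.SplitN(fields[len(fields)-1], "/", 2)[0]
		if !live[machine] {
			orphans = append(orphans, fields[0])
		}
	}
	return orphans
}

func main() {
	fmt.Println(orphanedUnits(unitsOut, liveMachines(machinesOut)))
}
```

With the listings above, this reports ping.service and pong.service, the two units still attributed to the terminated host 60e7a353...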

philips commented on July 19, 2024

One hunch would be that the clocks aren't in sync between the machines, causing TTL lag.
(Replying by email to bashcoder's comment above, dated Feb 20, 2014 5:06 PM.)

bcwaldon commented on July 19, 2024

@bashcoder I'll start digging into this today.

Btw, the behavior you saw with journal and ssh hanging indefinitely should be fixed in master. Now they'll time out after 10s if they can't dial the IP reported in the JobState.

philips commented on July 19, 2024

@bcwaldon When should we cut a new fleetctl binary release with these UX fixes?

bcwaldon commented on July 19, 2024

@philips Once these are done: https://github.com/coreos/fleet/issues?milestone=3&state=open

bcwaldon commented on July 19, 2024

Ok - one lead here. If the etcd leader in the cluster used by fleet restarts, the fleet cluster becomes unusable. I simulated your deployment by manually deploying 4 VMs, then I restarted the one that was the etcd leader. The cluster members were no longer receiving any events from etcd, and therefore could not actually do anything. I'll keep digging.

bashcoder commented on July 19, 2024

Yes, I think you're on to something there, @bcwaldon. I'll run a few more scenarios on my end too to see if I can confirm.

bcwaldon commented on July 19, 2024

I'm thinking https://github.com/coreos/go-etcd/issues/106 is the root of the issue here.

/cc @bashcoder @philips

bcwaldon commented on July 19, 2024

@bashcoder Good news, everyone! #167

bashcoder commented on July 19, 2024

Awesome! Great job on the patches and the logging updates @bcwaldon!

bcwaldon commented on July 19, 2024

Closing as I believe this is fixed in master. Reopen if not.

bashcoder commented on July 19, 2024

Great - any idea when this stuff will make its way into a CoreOS release?

bcwaldon commented on July 19, 2024

Hopefully within the next couple of weeks. We're working through a large change in CoreOS that we're blocking the next release on.
