Comments (15)
@bashcoder I suspect the client is hiding some errors from you. When you reduce your cluster size to 2, your etcd cluster can no longer achieve a quorum, making writes impossible. Can you look at the logs of the remaining fleet services for anything suspicious?
from fleet.
To clarify - I suspect your autoscaling group killed your etcd cluster leader, causing a leader election among two nodes. I do not think this is resolvable and the cluster will lock up. Waiting to hear back from @philips
from fleet.
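For reference, the majority arithmetic behind the quorum concern above can be sketched as follows. This assumes the usual Raft rule that a write or a leader election needs more than half of the configured members; the function names `quorum` and `tolerated` are illustrative and are not part of etcd or fleet.

```go
package main

import "fmt"

// quorum returns the smallest majority of a cluster with n voting members.
// Assumption: simple majority, as in Raft-style consensus.
func quorum(n int) int {
	return n/2 + 1
}

// tolerated returns how many members can fail while a majority remains.
func tolerated(n int) int {
	return n - quorum(n)
}

func main() {
	for _, n := range []int{2, 3, 4, 5} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), tolerated(n))
	}
}
```

With two members, both must be reachable, so losing either one blocks elections and writes. Going from three to four members raises the quorum without tolerating any additional failures.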
OK, I found the etcd log but didn't see any error messages in it that seemed helpful. I then started another cluster from scratch using an auto-scaling group of 4 servers. I started the three example units again, and they all started on the same host.
I then terminated the instance that was running the units. A new host was added in its place, so the output of fleetctl list-machines once again showed 4 server entries:
$ fleetctl list-machines
MACHINE      IP             METADATA
cfe623e3...  10.225.27.238  -
cfafe2c4...  10.181.228.98  -
041ad036...  10.167.4.22    -
5de3f7e5...  10.178.6.185   -
But the output of fleetctl list-units showed that the units were still running on the terminated host:
$ fleetctl list-units
UNIT           LOAD    ACTIVE  SUB      DESC  MACHINE
hello.service  loaded  active  running  -     60e7a353.../10.157.68.172
ping.service   loaded  active  running  -     60e7a353.../10.157.68.172
pong.service   loaded  active  running  -     60e7a353.../10.157.68.172
At this point, the fleetctl journal hello.service and fleetctl status hello.service commands hung indefinitely without timing out.
However, after running fleetctl stop hello.service and fleetctl start hello.service, that unit did start up again on a valid host:
$ fleetctl list-units
UNIT           LOAD    ACTIVE  SUB      DESC  MACHINE
hello.service  loaded  active  running  -     5de3f7e5.../10.178.6.185
ping.service   loaded  active  running  -     60e7a353.../10.157.68.172
pong.service   loaded  active  running  -     60e7a353.../10.157.68.172
So, even though I avoided the problem with the quorum/leader election with only two servers, the units did not migrate properly when their server was terminated. It's possible that the terminated server was also the leader, but I don't know how to determine that. Hope this helps.
from fleet.
One hunch would be that the clocks aren't in sync between the machines, causing TTL lag.
from fleet.
@bashcoder I'll start digging into this today.
Btw, the behavior you saw with journal and ssh hanging indefinitely should be fixed in master. Now they'll time out after 10s if they can't dial the IP reported in the JobState.
from fleet.
@bcwaldon When should we cut a new fleetctl binary release with these UX fixes?
from fleet.
@philips Once these are done: https://github.com/coreos/fleet/issues?milestone=3&state=open
from fleet.
Ok - one lead here. If the etcd leader in the cluster used by fleet restarts, the fleet cluster becomes unusable. I simulated your deployment by manually deploying 4 vms, then I restarted the one that was the etcd leader. The cluster members were no longer receiving any events from etcd, and therefore could not actually do anything. I'll keep digging.
from fleet.
Yes, I think you're on to something there, @bcwaldon. I'll run a few more scenarios on my end too to see if I can confirm.
from fleet.
I'm thinking https://github.com/coreos/go-etcd/issues/106 is the root of the issue here.
/cc @bashcoder @philips
from fleet.
@bashcoder Good news, everyone! #167
from fleet.
Awesome! Great job on the patches and the logging updates @bcwaldon!
from fleet.
Closing as I believe this is fixed in master. Reopen if not.
from fleet.
Great - any idea when this stuff will make its way into a coreos release?
from fleet.
Hopefully within the next couple of weeks. We're working through a large change in CoreOS that we're blocking the next release on.
from fleet.