fleet's Introduction

WARNING: Deprecation warning

fleet is no longer developed or maintained by CoreOS. After February 1, 2018, a fleet container image will continue to be available from the CoreOS Quay registry, but will not be shipped as part of Container Linux. CoreOS instead recommends Kubernetes for all clustering needs.

The project exists here for historical reference. If you are interested in the future of the project and taking over stewardship, please contact [email protected]

fleet - a distributed init system

fleet ties together systemd and etcd into a simple distributed init system. Think of it as an extension of systemd that operates at the cluster level instead of the machine level.

This project is quite low-level, and is designed as a foundation for higher-order orchestration. fleet is a cluster-wide elaboration on systemd units, and is not a container manager or orchestration system. fleet supports basic scheduling of systemd units across nodes in a cluster. Those looking for more complex scheduling requirements or a first-class container orchestration system should check out Kubernetes. The fleet and Kubernetes comparison table has more information about the two systems.

Current status

The fleet project is no longer maintained.

As of v1.0.0, fleet has seen production use for some time and is largely considered stable. However, there are various known and unresolved issues, including scalability limitations with its architecture. As such, it is not recommended to run fleet clusters larger than 100 nodes or with more than 1000 services.

Using fleet

Launching a unit with fleet is as simple as running fleetctl start:

$ fleetctl start examples/hello.service
Unit hello.service launched on 113f16a7.../172.17.8.103

The fleetctl start command waits for the unit to get scheduled and actually start somewhere in the cluster. fleetctl list-unit-files tells you the desired state of your units and where they are currently scheduled:

$ fleetctl list-unit-files
UNIT            HASH     DSTATE    STATE     TMACHINE
hello.service   e55c0ae  launched  launched  113f16a7.../172.17.8.103

fleetctl list-units exposes the systemd state for each unit in your fleet cluster:

$ fleetctl list-units
UNIT            MACHINE                    ACTIVE   SUB
hello.service   113f16a7.../172.17.8.103   active   running

Supported Deployment Patterns

fleet is not intended to be an all-purpose orchestration system, and as such supports only a few simple deployment patterns:

  • Deploy a single unit anywhere on the cluster
  • Deploy a unit globally everywhere in the cluster
  • Automatically reschedule units on machine failure
  • Ensure that units are deployed together on the same machine
  • Forbid specific units from colocation on the same machine (anti-affinity)
  • Deploy units to machines only with specific metadata

These patterns are all defined using custom systemd unit options.
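
For example, a unit that should never run alongside its siblings and should land only on machines with matching metadata combines these options in an [X-Fleet] section. This is a sketch: the option names are the ones that appear in the issues further down this page, and the ExecStart line is borrowed from the hello.service example above.

[Unit]
Description=Hello World

[Service]
ExecStart=/usr/bin/docker run busybox /bin/sh -c "while true; do echo Hello World; sleep 1; done"

[X-Fleet]
X-Conflicts=hello*
X-ConditionMachineMetadata=region=us-east-1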

Getting Started

Before you can deploy units, fleet must be deployed and configured on each host in your cluster. (If you are running CoreOS, fleet is already installed.)

After you have machines configured (check fleetctl list-machines), get to work with the client.
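
On a healthy cluster, fleetctl list-machines prints one row per host, along these lines (the ID and IP match the examples above):

$ fleetctl list-machines
MACHINE     IP      METADATA
113f16a7... 172.17.8.103    -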

Building

fleet must be built with Go 1.5+ on a Linux machine. Simply run ./build and then copy the binaries out of the bin/ directory onto each of your machines. The tests can similarly be run by invoking ./test.

If you're on a machine without Go 1.5+ but you have Docker installed, run ./build-docker to compile the binaries instead.
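
A typical build-and-deploy flow looks roughly like this (the target host is illustrative):

$ git clone https://github.com/coreos/fleet
$ cd fleet
$ ./build          # or ./build-docker without a local Go 1.5+ toolchain
$ ./test
$ scp bin/fleet bin/fleetctl core@10.0.0.1: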

Project Details

API

The fleet API uses JSON over HTTP to manage units in a fleet cluster. See the API documentation for more information.
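
As a sketch, assuming the API is exposed over the fleet.sock unix socket (the socket path and URL are assumptions; check your fleet.socket configuration), listing units looks something like:

$ curl --unix-socket /var/run/fleet.sock http://localhost/fleet/v1/units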

Release Notes

See the releases tab for more information on each release.

License

fleet is released under the Apache 2.0 license. See the LICENSE file for details.

Specific components of fleet use code derivative from software distributed under other licenses; in those cases the appropriate licenses are stipulated alongside the code.

fleet's People

Contributors

antrik, bcwaldon, burke, calavera, chancez, crawford, dalbani, epipho, hectorj2f, hmalphettes, htr, iaguis, jonboulle, kayrus, marineam, miekg, mischief, mwhooker, philips, philk, polvi, pop, ramschmaerchen, reneengelhard, robszumski, sirupsen, tclavier, tixxdz, uwedeportivo, yichengq

fleet's Issues

Getting status of stopped unit causes panic

core@ip-172-31-20-135 ~ $ fleetctl stop test2.service 
core@ip-172-31-20-135 ~ $ fleetctl status test*
test.service - My Service
   Loaded: loaded (/run/systemd/system/test.service; enabled-runtime)
   Active: active (running) since Wed 2014-02-12 22:24:35 UTC; 1h 33min ago
 Main PID: 569 (docker)
   CGroup: /system.slice/test.service
           └─569 /usr/bin/docker run busybox /bin/sh -c while true; do echo Hello World; sleep 1; done

Feb 12 23:58:12 ip-172-31-20-135 docker[569]: Hello World
Feb 12 23:58:13 ip-172-31-20-135 docker[569]: Hello World
Feb 12 23:58:14 ip-172-31-20-135 docker[569]: Hello World
Feb 12 23:58:15 ip-172-31-20-135 docker[569]: Hello World
Feb 12 23:58:16 ip-172-31-20-135 docker[569]: Hello World
Feb 12 23:58:17 ip-172-31-20-135 docker[569]: Hello World
Feb 12 23:58:18 ip-172-31-20-135 docker[569]: Hello World
Feb 12 23:58:19 ip-172-31-20-135 docker[569]: Hello World
Feb 12 23:58:20 ip-172-31-20-135 docker[569]: Hello World
Feb 12 23:58:21 ip-172-31-20-135 docker[569]: Hello World

%s does not appear to be running test2.service
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x48 pc=0x405563]

goroutine 1 [running]:
runtime.panic(0x6cf220, 0xa9a188)
    /usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
main.printUnitStatus(0xc210000290, 0x7fffb8e7e6ca, 0xd)
    /build/amd64-generic/tmp/portage/app-admin/fleet-0.1.0-r1/work/fleet-0.1.0/third_party/src/github.com/coreos/fleet/cmd/status.go:44 +0x1d3
main.statusUnitsAction(0xc21004fec0)
    /build/amd64-generic/tmp/portage/app-admin/fleet-0.1.0-r1/work/fleet-0.1.0/third_party/src/github.com/coreos/fleet/cmd/status.go:32 +0xef
github.com/codegangsta/cli.Command.Run(0x737b40, 0x6, 0x0, 0x0, 0x79c270, ...)
    /build/amd64-generic/tmp/portage/app-admin/fleet-0.1.0-r1/work/fleet-0.1.0/third_party/src/github.com/codegangsta/cli/command.go:73 +0x994
github.com/codegangsta/cli.(*App).Run(0xc21007a000, 0xc21000a000, 0x4, 0x4, 0x7, ...)
    /build/amd64-generic/tmp/portage/app-admin/fleet-0.1.0-r1/work/fleet-0.1.0/third_party/src/github.com/codegangsta/cli/app.go:111 +0x855
main.main()
    /build/amd64-generic/tmp/portage/app-admin/fleet-0.1.0-r1/work/fleet-0.1.0/third_party/src/github.com/coreos/fleet/cmd/cmd.go:90 +0xd0d
...
...

fleetctl memory address errors with status, ssh, and journal commands

When testing the three fleet example services using the latest AWS images, many of the fleetctl commands work just fine: submit, start, stop, destroy, list-machines, list-units, and cat.

But when running fleetctl status hello.service, fleetctl journal or any form of the fleetctl ssh command, I get a runtime error for dereferencing a nil pointer.

Versions:

  • CoreOS-231.0.0 (ami-c1f3f4a8)
  • fleet version 0.1.2+git

Example error:

$ fleetctl status hello.service

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x4eb5fe]

goroutine 1 [running]:
runtime.panic(0x6cf360, 0xae0188)
    /usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
github.com/coreos/fleet/third_party/code.google.com/p/go.crypto/ssh.clientWithAddress(0x7f9a0fdb28b0, 0xc210000598, 0xc21008b1c0, 0x10, 0x0, ...)
    /build/amd64-generic/tmp/portage/app-admin/fleet-0.1.2/work/fleet-0.1.2/src/github.com/coreos/fleet/third_party/code.google.com/p/go.crypto/ssh/client.go:43 +0x2e
github.com/coreos/fleet/third_party/code.google.com/p/go.crypto/ssh.Dial(0x7380e0, 0x3, 0xc21008b1c0, 0x10, 0x0, ...)
    /build/amd64-generic/tmp/portage/app-admin/fleet-0.1.2/work/fleet-0.1.2/src/github.com/coreos/fleet/third_party/code.google.com/p/go.crypto/ssh/client.go:433 +0xab
github.com/coreos/fleet/ssh.NewSSHClient(0x732be0, 0x4, 0xc21008b1c0, 0x10, 0x1, ...)
    /build/amd64-generic/tmp/portage/app-admin/fleet-0.1.2/work/fleet-0.1.2/src/github.com/coreos/fleet/ssh/ssh.go:83 +0x6f
main.printUnitStatus(0xc210051ec0, 0xc210000290, 0x7fff2b4cd6dc, 0xd)
    /build/amd64-generic/tmp/portage/app-admin/fleet-0.1.2/work/fleet-0.1.2/src/github.com/coreos/fleet/fleetctl/status.go:62 +0x637
main.statusUnitsAction(0xc210051ec0)
    /build/amd64-generic/tmp/portage/app-admin/fleet-0.1.2/work/fleet-0.1.2/src/github.com/coreos/fleet/fleetctl/status.go:43 +0xf9
github.com/coreos/fleet/third_party/github.com/codegangsta/cli.Command.Run(0x737d00, 0x6, 0x0, 0x0, 0x79c4f0, ...)
    /build/amd64-generic/tmp/portage/app-admin/fleet-0.1.2/work/fleet-0.1.2/src/github.com/coreos/fleet/third_party/github.com/codegangsta/cli/command.go:73 +0x994
github.com/coreos/fleet/third_party/github.com/codegangsta/cli.(*App).Run(0xc210083000, 0xc21000a000, 0x3, 0x3, 0x7, ...)
    /build/amd64-generic/tmp/portage/app-admin/fleet-0.1.2/work/fleet-0.1.2/src/github.com/coreos/fleet/third_party/github.com/codegangsta/cli/app.go:111 +0x855
main.main()
    /build/amd64-generic/tmp/portage/app-admin/fleet-0.1.2/work/fleet-0.1.2/src/github.com/coreos/fleet/fleetctl/cmd.go:99 +0x9c4

goroutine 3 [chan receive]:
github.com/coreos/fleet/third_party/github.com/golang/glog.(*loggingT).flushDaemon(0xae56e0)
    /build/amd64-generic/tmp/portage/app-admin/fleet-0.1.2/work/fleet-0.1.2/src/github.com/coreos/fleet/third_party/github.com/golang/glog/glog.go:839 +0x50
created by github.com/coreos/fleet/third_party/github.com/golang/glog.init·1
    /build/amd64-generic/tmp/portage/app-admin/fleet-0.1.2/work/fleet-0.1.2/src/github.com/coreos/fleet/third_party/github.com/golang/glog/glog.go:406 +0x276

goroutine 4 [syscall]:
runtime.goexit()
    /usr/lib/go/src/pkg/runtime/proc.c:1394

goroutine 11 [select]:
net/http.(*persistConn).writeLoop(0xc210069800)
    /usr/lib/go/src/pkg/net/http/transport.go:791 +0x271
created by net/http.(*Transport).dialConn
    /usr/lib/go/src/pkg/net/http/transport.go:529 +0x61e

goroutine 10 [IO wait]:
net.runtime_pollWait(0x7f9a0fdb3878, 0x72, 0x0)
    /usr/lib/go/src/pkg/runtime/netpoll.goc:116 +0x6a
net.(*pollDesc).Wait(0xc210057bc0, 0x72, 0x7f9a0fdb1f88, 0xb)
    /usr/lib/go/src/pkg/net/fd_poll_runtime.go:81 +0x34
net.(*pollDesc).WaitRead(0xc210057bc0, 0xb, 0x7f9a0fdb1f88)
    /usr/lib/go/src/pkg/net/fd_poll_runtime.go:86 +0x30
net.(*netFD).Read(0xc210057b60, 0xc210089000, 0x1000, 0x1000, 0x0, ...)
    /usr/lib/go/src/pkg/net/fd_unix.go:204 +0x2a0
net.(*conn).Read(0xc210000358, 0xc210089000, 0x1000, 0x1000, 0x30, ...)
    /usr/lib/go/src/pkg/net/net.go:122 +0xc5
bufio.(*Reader).fill(0xc2100377e0)
    /usr/lib/go/src/pkg/bufio/bufio.go:91 +0x110
bufio.(*Reader).Peek(0xc2100377e0, 0x1, 0x0, 0x0, 0x0, ...)
    /usr/lib/go/src/pkg/bufio/bufio.go:119 +0xcb
net/http.(*persistConn).readLoop(0xc210069800)
    /usr/lib/go/src/pkg/net/http/transport.go:687 +0xb7
created by net/http.(*Transport).dialConn
    /usr/lib/go/src/pkg/net/http/transport.go:528 +0x607

fleetctl destroy doesn't do glob matching

core@ip-10-157-42-56 ~ $ fleetctl destroy apache*
core@ip-10-157-42-56 ~ $ fleetctl list-units     
UNIT            LOAD    ACTIVE  SUB DESC    MACHINE
apache.1.service    loaded  failed  failed  -   6794c9ec.../10.156.144.241
apache.2.service    loaded  failed  failed  -   0019f2d8.../10.182.139.116
myapp.service       loaded  active  running -   0019f2d8.../10.182.139.116

It works if you name the file directly:

core@ip-10-157-42-56 ~ $ fleetctl destroy apache.1.service
core@ip-10-157-42-56 ~ $ fleetctl list-units
UNIT            LOAD    ACTIVE  SUB DESC    MACHINE
apache.2.service    loaded  failed  failed  -   0019f2d8.../10.182.139.116
myapp.service       loaded  active  running -   0019f2d8.../10.182.139.116

fleet panics when default authorized_keys file is not found

Starting a fleet daemon on CoreOS as root with verify_units=true in the config file yields this in the log:

Mar 04 15:39:46 ip-10-244-143-28 fleet-local[568]: panic: open /root/.ssh/authorized_keys: no such file or directory
Mar 04 15:39:46 ip-10-244-143-28 fleet-local[568]: goroutine 1 [running]:
Mar 04 15:39:46 ip-10-244-143-28 fleet-local[568]: runtime.panic(0x6d3f80, 0xc2100990f0)
Mar 04 15:39:46 ip-10-244-143-28 fleet-local[568]: /usr/local/go/src/pkg/runtime/panic.c:266 +0xb6
Mar 04 15:39:46 ip-10-244-143-28 fleet-local[568]: github.com/coreos/fleet/server.New(0x745dc0, 0x0, 0xb196b8, 0x0, 0x0, ...)
Mar 04 15:39:46 ip-10-244-143-28 fleet-local[568]: /home/core/fleet/src/github.com/coreos/fleet/server/server.go:45 +0x2db
Mar 04 15:39:46 ip-10-244-143-28 fleet-local[568]: main.main()
Mar 04 15:39:46 ip-10-244-143-28 fleet-local[568]: /home/core/fleet/src/github.com/coreos/fleet/fleet.go:63 +0x9ee
Mar 04 15:39:46 ip-10-244-143-28 fleet-local[568]: goroutine 3 [chan receive]:
Mar 04 15:39:46 ip-10-244-143-28 fleet-local[568]: github.com/coreos/fleet/third_party/github.com/golang/glog.(*loggingT).flushDaemon(0xb10960)
Mar 04 15:39:46 ip-10-244-143-28 fleet-local[568]: /home/core/fleet/src/github.com/coreos/fleet/third_party/github.com/golang/glog/glog.go:839 +0x50
Mar 04 15:39:46 ip-10-244-143-28 fleet-local[568]: created by github.com/coreos/fleet/third_party/github.com/golang/glog.init·1

I would expect an error to get logged that the authorized_keys file could not be found, and the server should not halt.

X-ConditionMachineBootID should accept short-form of ID

A user wishing to use X-ConditionMachineBootID is going to run a list-machines command to get the ID:

$ fleetctl list-machines
MACHINE     IP      METADATA
d66f8269... 192.168.101.10  -

The user will then take that shortened ID and pass it in as the value of X-ConditionMachineBootID in their unit file. This unit will be unschedulable, as the value will not actually be a valid boot ID.

The user should be able to provide a shortened machine ID in X-ConditionMachineBootID and have it schedule out to the necessary machine.
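
For illustration, the unit fragment a user naturally writes after list-machines looks like this (short ID taken from the output above); today it never schedules because the value is not a full boot ID:

[X-Fleet]
X-ConditionMachineBootID=d66f8269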

fleetctl: block on start to share result of scheduling

Idea: Have fleetctl start block until the unit is scheduled so we can show you something like:

$ fleetctl start myunit.service
Started on 9ef9b9ea.../10.178.32.22

Based on my usage, it seems justified to wait a few milliseconds since you end up running list-units right afterwards to check that everything went well. Would need a --no-block option as well.

fleet panics when attempting to verify signature of unsigned unit

I deployed fleet with this config file:

verbosity=2
metadata="region=us-east-1"
verify_units=true
authorized_key_file="/home/core/.ssh/authorized_keys"

Then I attempted to start a unit with fleetctl, without enabling the client-side signing. I found this stacktrace on the agent my unit was scheduled to:

Mar 04 15:44:26 ip-10-242-210-235 fleet-local[592]: I0304 15:44:26.167889 00592 agent.go:248] Fetching Job(hello.service) from Registry
Mar 04 15:44:26 ip-10-242-210-235 fleet-local[592]: panic: runtime error: invalid memory address or nil pointer dereference
Mar 04 15:44:26 ip-10-242-210-235 fleet-local[592]: [signal 0xb code=0x1 addr=0x0 pc=0x4658c4]
Mar 04 15:44:26 ip-10-242-210-235 fleet-local[592]: goroutine 215 [running]:
Mar 04 15:44:26 ip-10-242-210-235 fleet-local[592]: runtime.panic(0x6eee80, 0xb0b228)
Mar 04 15:44:26 ip-10-242-210-235 fleet-local[592]: /usr/local/go/src/pkg/runtime/panic.c:266 +0xb6
Mar 04 15:44:26 ip-10-242-210-235 fleet-local[592]: github.com/coreos/fleet/sign.(*SignatureVerifier).VerifyPayload(0xc21008a480, 0xc210283f40, 0x0, 0x0,
Mar 04 15:44:26 ip-10-242-210-235 fleet-local[592]: /home/core/fleet/src/github.com/coreos/fleet/sign/job.go:29 +0xd4
Mar 04 15:44:26 ip-10-242-210-235 fleet-local[592]: github.com/coreos/fleet/agent.(*Agent).FetchJob(0xc21008f5f0, 0xc2100f2857, 0xd, 0x7f2320cf2bc8)
Mar 04 15:44:26 ip-10-242-210-235 fleet-local[592]: /home/core/fleet/src/github.com/coreos/fleet/agent/agent.go:257 +0x1fe
Mar 04 15:44:26 ip-10-242-210-235 fleet-local[592]: github.com/coreos/fleet/agent.(*EventHandler).HandleEventJobScheduled(0xc2100ac200, 0x773ab0, 0x11, 0
Mar 04 15:44:26 ip-10-242-210-235 fleet-local[592]: /home/core/fleet/src/github.com/coreos/fleet/agent/event.go:49 +0x33a
Mar 04 15:44:26 ip-10-242-210-235 fleet-local[592]: reflect.Value.call(0x71d300, 0xc2100ac200, 0x338, 0x747a00, 0x4, ...)
Mar 04 15:44:26 ip-10-242-210-235 fleet-local[592]: /usr/local/go/src/pkg/reflect/value.go:474 +0xe0b
Mar 04 15:44:26 ip-10-242-210-235 fleet-local[592]: reflect.Value.Call(0x71d300, 0xc2100ac200, 0x338, 0xc210283700, 0x1, ...)
Mar 04 15:44:26 ip-10-242-210-235 fleet-local[592]: /usr/local/go/src/pkg/reflect/value.go:345 +0x9d
Mar 04 15:44:26 ip-10-242-210-235 fleet-local[592]: created by github.com/coreos/fleet/event.(*EventBus).dispatch
Mar 04 15:44:26 ip-10-242-210-235 fleet-local[592]: /home/core/fleet/src/github.com/coreos/fleet/event/bus.go:63 +0x6d8

fleet has no IP information when automatically started on EC2

I booted a five-node CoreOS cluster on EC2 and brought fleet up automatically. Looking at the machine list, I see no IP information:

% fleetctl list-machines
MACHINE     IP      METADATA
18805c5d... -       -
28d9993e... -       -
3a74f97d... -       -
758fe95c... -       -
ade94405... -       -

Restarting fleet on each of the instances does correct the problem:

% fleetctl list-machines
MACHINE     IP      METADATA
18805c5d... 10.80.83.51 -
28d9993e... 10.114.167.102  -
3a74f97d... 10.77.63.132    -
758fe95c... 10.194.243.63   -
ade94405... 10.203.65.129   -

It seems like we're starting fleet before the instances actually have IPs bound to their interfaces, and fleet won't refresh its machine state once it starts. The Machine object simply gets the IP address at creation time: https://github.com/coreos/fleet/blob/master/machine/machine.go#L24

Allow Job requirements to be passed through unit file

The --require flag is not the best way to provide machine-filtering of a given job. The original idea was that deployment-specific parameters should not be in the X-Fleet section of the unit files. We've realized that if the unit file generation is going to be scripted, it doesn't really help at all. Additionally, we've already let deployment-specific parameters sneak into the unit files in the form of X-ConditionMachineBootID.

Add an X-ConditionMachineMetadata option to unit files and deprecate the --require flag.
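
A sketch of what that could look like, reusing the metadata value from the config examples elsewhere on this page:

[X-Fleet]
X-ConditionMachineMetadata=region=us-east-1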

No job offers were sent due to rolling update

After a rolling update, job offers were not sent out when one of the machines went down. list-units still showed an old machine running the units. The etcd cluster was not affected by this update other than a leader election.

Starting and stopping the units fixed the problem.

Allow user to set FLEETCTL_TUNNEL env var

It's kind of annoying to have to provide a --tunnel flag with every command. Yes, I could just alias the thing, but it's much more common to allow users to set environment variables for this kind of stuff. Let's look for the FLEETCTL_TUNNEL env var if --tunnel is not provided.
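
A sketch of the proposed behavior, using the Vagrant tunnel address that appears later on this page:

$ export FLEETCTL_TUNNEL=127.0.0.1:2222
$ fleetctl list-units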

Better error message for fleetctl start failures

I started two units that are already running:

$ fleetctl start apache.*
Creation of Job apache.1.service failed
Creation of Job apache.2.service failed

As a user, it's not clear why this failed.

fleetctl: fetching the journal of a failed unit causes panic

Fetching the journal or status of a failed unit causes panic:

core@ip-10-178-32-22 ~ $ fleetctl list-units
UNIT                LOAD    ACTIVE  SUB DESC    MACHINE
robszumski.1.service        loaded  active  running -   9ef9b9ea.../10.178.32.22
robszumski.2.service        loaded  failed  failed  -   ff82bd16.../10.180.219.216
core@ip-10-178-32-22 ~ $ fleetctl journal robszumski.2.service
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x4eb5fe]

goroutine 1 [running]:
runtime.panic(0x6cf360, 0xae0188)
    /usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
github.com/coreos/fleet/third_party/code.google.com/p/go.crypto/ssh.clientWithAddress(0x7f2e15920958, 0xc2100005b8, 0xc210085540, 0x11, 0x0, ...)
    /build/amd64-generic/tmp/portage/app-admin/fleet-0.1.2/work/fleet-0.1.2/src/github.com/coreos/fleet/third_party/code.google.com/p/go.crypto/ssh/client.go:43 +0x2e
github.com/coreos/fleet/third_party/code.google.com/p/go.crypto/ssh.Dial(0x7380e0, 0x3, 0xc210085540, 0x11, 0x0, ...)
    /build/amd64-generic/tmp/portage/app-admin/fleet-0.1.2/work/fleet-0.1.2/src/github.com/coreos/fleet/third_party/code.google.com/p/go.crypto/ssh/client.go:433 +0xab
...
core@ip-10-178-32-22 ~ $ 

better error messages

This one should say "already started" or something

alexs-air-3:fleet-redis-demo polvi$ fleetctl-imc start redis-dyn-amb.service
Creation of job redis-dyn-amb.service failed: 105: Key already exists (/_coreos.com/fleet/job/redis-dyn-amb.service/object) [2088]

Submitted units vanish after Vagrant halt/up

Script started on Wed 19 Feb 2014 09:20:11 AM CST
nolan-laptop% vagrant ssh
Last login: Wed Feb 19 14:54:54 UTC 2014 from 10.0.2.2 on pts/0


[CoreOS ASCII-art login banner]
core@localhost ~ $ fleetctl list-units
UNIT LOAD ACTIVE SUB DESC MACHINE
core@localhost ~/coreos $ fleetctl submit skydns.service
core@localhost ~/coreos $ fleetctl list-units
UNIT LOAD ACTIVE SUB DESC MACHINE
skydns.service - - - - -
core@localhost ~/coreos $ exit
logout
Connection to 127.0.0.1 closed.
nolan-laptop% vagrant halt
[default] Attempting graceful shutdown of VM...
nolan-laptop% vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
[default] Clearing any previously set forwarded ports...
[default] Clearing any previously set network interfaces...
[default] Preparing network interfaces based on configuration...
[default] Forwarding ports...
[default] -- 22 => 2222 (adapter 1)
[default] Running 'pre-boot' VM customizations...
[default] Booting VM...
[default] Waiting for machine to boot. This may take a few minutes...
[default] Machine booted and ready!
[default] No guest additions were detected on the base box for this VM! Guest
additions are required for forwarded ports, shared folders, host only
networking, and more. If SSH fails on this machine, please install
the guest additions and repackage the box to continue.

This is not an error message; everything may continue to work properly,
in which case you may ignore this message.
[default] VM already provisioned. Run vagrant provision or use --provision to force it
nolan-laptop% vagrant ssh
Last login: Wed Feb 19 15:22:05 UTC 2014 from 10.0.2.2 on ssh


[CoreOS ASCII-art login banner]
core@localhost ~ $ fleetctl list-units
UNIT LOAD ACTIVE SUB DESC MACHINE
core@localhost ~ $

ec2 reboot caused weird fleet status, and we can't launch new services now

This issue has been moved over from etcd-io/etcd#615

We rebooted a server (manually) and all of a sudden a whole lot of the elasticsearch services lost their state. They were all listed as running before the reboot. This is what they look like now:

UNIT                                    LOAD    ACTIVE  SUB     DESC                            MACHINE
elasticsearch.1.service                 loaded  active  running elasticsearch                   534ead73.../10.28.7.6
elasticsearch.2.service                 -       -       -       elasticsearch                   -
elasticsearch.3.service                 -       -       -       elasticsearch                   -
elasticsearch.4.service                 -       -       -       elasticsearch                   -
elasticsearch.5.service                 -       -       -       elasticsearch                   -
elasticsearch.6.service                 loaded  active  running elasticsearch                   008faf7b.../10.28.70.224
elasticsearch.7.service                 -       -       -       elasticsearch                   -
github-authentication-proxy.1.service   loaded  active  running github-authentication-proxy     6ec5b6a7.../10.28.69.136
github-authentication-proxy.2.service   loaded  active  running github-authentication-proxy     008faf7b.../10.28.70.224
github-authentication-proxy.3.service   -       -       -       github-authentication-proxy     -
github-authentication-proxy.4.service   -       -       -       github-authentication-proxy     -
github-authentication-proxy.5.service   -       -       -       github-authentication-proxy     -
github-authentication-proxy.6.service   -       -       -       github-authentication-proxy     -
github-authentication-proxy.7.service   -       -       -       github-authentication-proxy     -
prodotti.1.service                      -       -       -       prodotti                        -
prodotti.2.service                      loaded  active  running prodotti                        534ead73.../10.28.7.6
prodotti.3.service                      -       -       -       prodotti                        -
prodotti.4.service                      loaded  active  running prodotti                        6ec5b6a7.../10.28.69.136
prodotti.5.service                      -       -       -       prodotti                        -
prodotti.6.service                      -       -       -       prodotti                        -
prodotti.7.service                      loaded  active  running prodotti                        008faf7b.../10.28.70.224

Also, we now can't seem to schedule any prodotti services. We have left the elasticsearch services running for a few days.

@bcwaldon mentioned I should include the X-Fleet stuff:

[X-Fleet]
X-Conflicts=prodotti.*.service

(Previously we had X-Conflicts=prodotti* but today switched over to the above, thinking that maybe the path globbing was an issue. Turns out this doesn't work either).

Any other information I can get? The servers are still in this state and will stay in this state until a reboot.

(BTW if anyone wants to look at this issue I can let you look at the server state with tmate.io or something?)

Failed machine heartbeats are not reattempted

A machine is configured to heartbeat its state to etcd every 15s with a TTL of 30s. If a single heartbeat fails, it will cause the machine's key to expire ~15s later. This expiration causes unnecessary churn in cluster membership. The agent should reattempt failed heartbeats.
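
A minimal Go sketch of the retry loop being asked for; heartbeat() is an illustrative stub, not fleet's actual Registry call:

package main

import (
	"errors"
	"log"
	"time"
)

const (
	heartbeatInterval = 15 * time.Second // publish interval described above
	heartbeatTTL      = 30 * time.Second // key expires this long after the last success
	maxAttempts       = 3
)

// heartbeat stands in for the etcd write that refreshes the machine's key.
func heartbeat() error { return errors.New("etcd write failed") }

func main() {
	for range time.Tick(heartbeatInterval) {
		var err error
		for attempt := 1; attempt <= maxAttempts; attempt++ {
			if err = heartbeat(); err == nil {
				break
			}
			// Retry quickly so a transient failure is resolved well
			// before the 30s TTL lets the key expire.
			time.Sleep(2 * time.Second)
		}
		if err != nil {
			log.Printf("heartbeat still failing after %d attempts: %v", maxAttempts, err)
		}
	}
}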

Cannot see any error message when starting the verify-fail unit again

No error message is shown when starting the unit again. This should be fixed so the user can tell what happened.

core@localhost ~/fleet $ ~/coredev/start_service.sh
core@localhost ~/fleet $ I0304 11:08:40.915487 00474 fleet.go:114] Continuing without config file
I0304 11:08:40.920157 00474 manager.go:195] Writing systemd unit file fleet-d638e6dc-c656-4dac-a2a1-fe3233d581b4.target

core@localhost ~/fleet $ ./bin/fleetctl list-units
UNIT    LOAD    ACTIVE  SUB DESC    MACHINE
core@localhost ~/fleet $ ./bin/fleetctl submit examples/hello.service
core@localhost ~/fleet $ ./bin/fleetctl list-units -verify
Check of payload hello.service failed: signature to verify is nil
core@localhost ~/fleet $ ./bin/fleetctl start hello.service
I0304 11:09:13.524296 00474 engine.go:78] Published JobOffer(hello.service)
I0304 11:09:13.525041 00474 event.go:28] EventJobOffered(hello.service): passed all criteria, submitting JobBid
core@localhost ~/fleet $ I0304 11:09:13.527050 00474 agent.go:209] Submitting JobBid for Job(hello.service)
I0304 11:09:13.532611 00474 engine.go:108] Scheduled Job(hello.service) to Machine(d638e6dc-c656-4dac-a2a1-fe3233d581b4)
E0304 11:09:13.533308 00474 agent.go:259] Check of payload hello.service failed when fetch: signature to verify is nil
E0304 11:09:13.533407 00474 event.go:51] EventJobScheduled(hello.service): Failed to fetch Job

core@localhost ~/fleet $ ./bin/fleetctl list-units
UNIT        LOAD    ACTIVE  SUB DESC        MACHINE
hello.service   -   -   -   Hello World -
core@localhost ~/fleet $ ./bin/fleetctl start hello.service
I0304 11:10:37.380773 00474 agent.go:142] Starting Job(hello.service)
I0304 11:10:37.381356 00474 manager.go:195] Writing systemd unit file hello.service
I0304 11:10:37.381783 00474 manager.go:189] Instructing systemd to reload units
I0304 11:10:37.385772 00474 manager.go:154] Started systemd unit hello.service
core@localhost ~/fleet $

Running units not migrating when their machine shuts down

I had a 3-machine CoreOS cluster running the 3 fleet example services. After running several tests, the 3 units all ended up running on one machine (the distribution of units isn't the issue I'm posting about; normally they are spread out evenly). I reduced the size of the cluster down to 2 machines in my AWS auto scaling group. This caused the server that was running the example units to shut down.

On shutdown, the running units did not migrate to the other hosts. In addition, it appears to have left the cluster in a state where it cannot run any units. When I ran fleetctl list-units, all of the services were still registered, but none were running:

UNIT        LOAD    ACTIVE  SUB DESC    MACHINE
hello.service   -   -   -   -   -
ping.service    -   -   -   -   -
pong.service    -   -   -   -   -

When I tried fleetctl start hello.service, I got this error:

Creation of job hello.service failed: 105: Key already exists (/_coreos.com/fleet/job/hello.service/object) [14518]

So I ran fleetctl stop hello.service, then fleetctl start hello.service and received no errors. But the output of fleetctl list-units was still the same, and fleetctl status hello.service reported:

hello.service does not appear to be running

I've tried destroying and re-submitting the services, but still cannot successfully start any units.

Versions:

  • CoreOS-231.0.0 (ami-c1f3f4a8)
  • fleet version 0.1.2+git

Static bootid & publicip defined in config not taking effect

I'm running fleet with this config:

verbosity=2
public_ip="54.81.21.194"
metadata="region=us-east-1"

Yet the list-machines command tells me this:

% fleetctl list-machines -l
MACHINE                 IP      METADATA
c31e44e1-f858-436e-933e-59c642517860    10.114.233.176  region=us-east-1

need to add usage to journal

Need more than just [command options] [arguments...] to describe how to use the journal. Check other commands too and make sure they are properly documented.

$ fleetctl journal [email protected]
2014/02/12 01:20:56 Unable to run command over SSH: dial unix: missing address
core@ip-10-151-62-78 ~ $ fleetctl journal -u [email protected]
Incorrect Usage.

NAME:
   journal - Print the journal of a unit in the cluster to stdout

USAGE:
   command journal [command options] [arguments...]

DESCRIPTION:


OPTIONS:
   --lines, -n '10' Number of log lines to return.

Can't restart services, key already exists

I can't start a service after stopping it. This happened after starting a service with fleet on a single Vagrant box. To me it looks like the etcd key-value information is out of sync with the machine state.

core@localhost ~ $ fleetctl start nginx.1.service 
core@localhost ~ $ fleetctl list-units
UNIT        LOAD    ACTIVE  SUB DESC    MACHINE
nginx.1.service loaded  active  running -   86badc2e...
core@localhost ~ $ journalctl -f -u nginx.1.service 
-- Logs begin at Wed 2014-03-12 11:05:54 UTC. --
Mar 12 11:08:41 localhost systemd[1]: [/run/systemd/system/nginx.1.service:8] Unknown section 'X-Fleet'. Ignoring.
Mar 12 11:08:41 localhost systemd[1]: [/run/systemd/system/nginx.1.service:9] Assignment outside of section. Ignoring.
Mar 12 11:08:41 localhost systemd[1]: Starting nginx...
Mar 12 11:08:41 localhost systemd[1]: Started nginx.
^C
core@localhost ~ $ curl localhost
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> ...
core@localhost ~ $ fleetctl stop nginx.1.service 
core@localhost ~ $ curl localhost
curl: (7) Failed connect to localhost:80; Connection refused
core@localhost ~ $ fleetctl status nginx.1.service 
nginx.1.service does not appear to be running
core@localhost ~ $ fleetctl start nginx.1.service
Creation of job nginx.1.service failed: 105: Key already exists (/_coreos.com/fleet/job/nginx.1.service/object) [12]

Some context information:

core@localhost ~ $ curl 'http://127.0.0.1:4001/v2/keys/_coreos.com/fleet/?recursive=true'
{"action":"get","node":{"key":"/_coreos.com/fleet","dir":true,"nodes":[{"key":"/_coreos.com/fleet/payload","dir":true,"nodes":[{"key":"/_coreos.com/fleet/payload/nginx.1.service","value":"{\"Name\":\"nginx.1.service\",\"Unit\":{\"Contents\":{\"Service\":{\"ExecStart\":\"/usr/bin/docker run -rm -name nginx-1 -p 80:80 nginx:base\",\"ExecStop\":\"/usr/bin/docker kill nginx-1\"},\"Unit\":{\"Description\":\"nginx\"},\"X-Fleet\":{\"X-Conflicts\":\"nginx.*.service\"}}}}","modifiedIndex":2,"createdIndex":2}],"modifiedIndex":2,"createdIndex":2},{"key":"/_coreos.com/fleet/job","dir":true,"nodes":[{"key":"/_coreos.com/fleet/job/nginx.1.service","dir":true,"nodes":[{"key":"/_coreos.com/fleet/job/nginx.1.service/object","value":"{\"Name\":\"nginx.1.service\",\"JobRequirements\":{},\"Payload\":{\"Name\":\"nginx.1.service\",\"Unit\":{\"Contents\":{\"Service\":{\"ExecStart\":\"/usr/bin/docker run -rm -name nginx-1 -p 80:80 nginx:base\",\"ExecStop\":\"/usr/bin/docker kill nginx-1\"},\"Unit\":{\"Description\":\"nginx\"},\"X-Fleet\":{\"X-Conflicts\":\"nginx.*.service\"}}}},\"State\":null}","modifiedIndex":3,"createdIndex":3}],"modifiedIndex":3,"createdIndex":3}],"modifiedIndex":3,"createdIndex":3},{"key":"/_coreos.com/fleet/machines","dir":true,"nodes":[{"key":"/_coreos.com/fleet/machines/fba93380-6e5c-43fb-a620-829e166f0865","dir":true,"nodes":[{"key":"/_coreos.com/fleet/machines/fba93380-6e5c-43fb-a620-829e166f0865/object","value":"{\"BootId\":\"fba93380-6e5c-43fb-a620-829e166f0865\",\"PublicIP\":\"10.0.2.15\",\"Metadata\":{}}","expiration":"2014-03-12T11:17:27.15988847Z","ttl":18,"modifiedIndex":8,"createdIndex":4}],"modifiedIndex":4,"createdIndex":4}],"modifiedIndex":4,"createdIndex":4}],"modifiedIndex":2,"createdIndex":2}}core@localhost ~

My nginx.1.service file:

[Unit]
Description=nginx

[Service]
ExecStartPre=/usr/bin/docker build -t nginx:base github.com/tscheepers/docker-centos
ExecStart=/usr/bin/docker run -rm -name nginx-1 -p 80:80 nginx:base
ExecStop=/usr/bin/docker kill nginx-1

[X-Fleet]
X-Conflicts=nginx.*.service

README: Add a debugging section

I think the most useful thing is to recommend people dump the keystate. We should build something that does curl 'http://127.0.0.1:4001/v2/keys/_coreos.com/fleet/?recursive=true' into fleetctl. Thoughts?
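
Until that exists, the raw dump is at least readable when piped through a pretty-printer (assuming a host with Python available):

$ curl 'http://127.0.0.1:4001/v2/keys/_coreos.com/fleet/?recursive=true' | python -m json.tool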

UX is poor when ssh-agent misconfigured

fleetctl is not very helpful when your ssh-agent is misconfigured:

$ fleetctl status apache.1.service
2014/02/12 19:07:48 Unable to execute command over SSH: dial unix: missing address
$ fleetctl-imc list-machines
panic: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

goroutine 1 [running]:
runtime.panic(0x2b1920, 0x2108979e0)
    /usr/local/go/src/pkg/runtime/panic.c:266 +0xb6
main.getRegistry(0x21086eee0, 0x2108710d0)
    /home/philips/coreos/fleet/fleet/src/github.com/coreos/fleet/fleetctl/cmd.go:51 +0x1ab
main.listMachinesAction(0x21086eee0)
    /home/philips/coreos/fleet/fleet/src/github.com/coreos/fleet/fleetctl/list_machines.go:31 +0x35
github.com/coreos/fleet/third_party/github.com/codegangsta/cli.Command.Run(0x347eb0, 0xd, 0x0, 0x0, 0x38fe10, ...)
    /home/philips/coreos/fleet/fleet/src/github.com/coreos/fleet/third_party/github.com/codegangsta/cli/command.go:73 +0x994
github.com/coreos/fleet/third_party/github.com/codegangsta/cli.(*App).Run(0x21088e000, 0x21082d000, 0x4, 0x4, 0x7, ...)
    /home/philips/coreos/fleet/fleet/src/github.com/coreos/fleet/third_party/github.com/codegangsta/cli/app.go:111 +0x855
main.main()
    /home/philips/coreos/fleet/fleet/src/github.com/coreos/fleet/fleetctl/cmd.go:99 +0x9c4

goroutine 3 [chan receive]:
github.com/coreos/fleet/third_party/github.com/golang/glog.(*loggingT).flushDaemon(0x6dad80)
    /home/philips/coreos/fleet/fleet/src/github.com/coreos/fleet/third_party/github.com/golang/glog/glog.go:839 +0x50
created by github.com/coreos/fleet/third_party/github.com/golang/glog.init·1
    /home/philips/coreos/fleet/fleet/src/github.com/coreos/fleet/third_party/github.com/golang/glog/glog.go:406 +0x276
Alexs-MacBook-Air-3:Downloads polvi$ ssh-add ~/.ssh/polvi-air2.pem 
Identity added: /Users/polvi/.ssh/polvi-air2.pem (/Users/polvi/.ssh/polvi-air2.pem)
Alexs-MacBook-Air-3:Downloads polvi$ fleetctl-imc list-machines
MACHINE     IP      METADATA
...

Here are some methods of being more helpful:

$ fleetctl status apache.1.service
Unable to make an SSH connection because there is no ssh-agent running. Ensure that an agent
is running on this machine or forward your ssh agent from your local machine before
connecting to the remote machine. Example: "ssh -A [email protected]"
$ fleetctl list-machines
ssh: unable to authenticate via ssh agent. 
Maybe ssh-add ~/.ssh/my-key.pem and try again?

X-Conflicts not working

Thought it was related to #96, but the machine it gets scheduled to does not have an IP... so... not sure. I did a fleetctl start docker* followed by a fleetctl destroy docker\@*, then a start again to get this behavior.

core@ip-10-151-62-78 ~ $ fleetctl list-units
UNIT            LOAD    ACTIVE  SUB DESC    MACHINE
[email protected]    -   -   -   -   -
[email protected]    loaded  failed  failed  -   7855caaf...
[email protected]    -   -   -   -   -
core@ip-10-151-62-78 ~ $ fleetctl destroy docker*
core@ip-10-151-62-78 ~ $ fleetctl list-units
UNIT    LOAD    ACTIVE  SUB DESC    MACHINE
core@ip-10-151-62-78 ~ $ fleetctl start docker\@*
core@ip-10-151-62-78 ~ $ fleetctl list-units
UNIT            LOAD    ACTIVE  SUB DESC    MACHINE
[email protected]    loaded  active  running -   c8823909...
[email protected]    -   -   -   -   -
[email protected]    -   -   -   -   -
core@ip-10-151-62-78 ~ $ fleetctl list-machines
MACHINE     IP      METADATA
c8823909... -       -
7855caaf... -       -
4073a287... 10.151.62.78    -
core@ip-10-151-62-78 ~ $ cat docker\@test2.service 
[Service]
ExecStart=/usr/bin/docker run -rm -name %i busybox sleep 30
ExecStop=/usr/bin/docker stop %i 

[X-Fleet]
X-Conflicts=docker*
core@ip-10-151-62-78 ~ $ cat docker\@test3.service 
[Service]
ExecStart=/usr/bin/docker run -rm -name %i busybox sleep 30
ExecStop=/usr/bin/docker stop %i 

[X-Fleet]
X-Conflicts=docker*

Another example (all from scratch):

core@ip-10-151-62-78 ~ $ tee docker-test@{test1,test2,test3}.service < docker.service 
[Service]
ExecStart=/usr/bin/docker run -rm -name %i busybox sleep 30
ExecStop=/usr/bin/docker stop %i 

[X-Fleet]
X-Conflicts=docker-test*
core@ip-10-151-62-78 ~ $ fleetctl start docker-test*
core@ip-10-151-62-78 ~ $ fleetctl list-units
UNIT                LOAD    ACTIVE  SUB DESC    MACHINE
[email protected]   loaded  active  running -   4073a287.../10.151.62.78
[email protected]   loaded  active  running -   c8823909...
[email protected]   -   -   -   -   -
core@ip-10-151-62-78 ~ $ fleetctl list-machines
MACHINE     IP      METADATA
c8823909... -       -
7855caaf... -       -
4073a287... 10.151.62.78    -

test3 is not scheduled.

One service is conflicting, but not running

I have three services, portiere.{1,2,3}.service, and three EC2 instances running. Each service X-Conflicts with portiere*.

I have two of these running at the moment, but the third never finds a place to run because for some reason fleet thinks that portiere is being run in all three places.

This conflicts with what list-units thinks:

UNIT                    LOAD    ACTIVE  SUB     DESC            MACHINE
portiere.1.service      loaded  active  running portiere        de2eacc3.../10.185.208.27
portiere.2.service      -       -       -       portiere        -
portiere.3.service      loaded  active  running portiere        b6bfeed6.../10.165.32.45

This is all I get in systemctl status fleet:
Mar 06 03:00:05 ip-10-185-208-27 fleet[469]: I0306 03:00:05.759903 00469 engine.go:78] Published JobOffer(portiere.3.service)

Any idea how I can debug this?

Replace `fleetctl --verify` feature with `fleetctl verify` command

The --verify feature does not actually allow us to enforce anything. We should remove it until we have a user-facing API that can actually enforce it for us.

In the meantime, we should provide a fleetctl verify command that will validate the signatures of a payload in the system.

unit descriptions are always exposed as a '-' in list-units output

My unit has a valid Description option in the [Unit] section, yet it's exposed as a hyphen in the output of fleetctl list-units:

$ fleetctl list-units
UNIT                    LOAD    ACTIVE  SUB     DESC    MACHINE
app-http.1.service      loaded  active  running -       ac79b6a5.../54.80.152.184

I expect the description to be exposed just like systemctl list-units.

Panic when running fleetctl status or journal on vagrant coreos

I was following the instructions in Run a Container in the Cluster in https://coreos.com/docs/launching-containers/launching/launching-containers-fleet/ after installing and running things on the vagrant coreos image.

Once I found out from the IRC channel that I needed to do sudo systemctl start fleet, I was able to do the steps in Run a Container in the Cluster and get the process running, as seen in list-units:

UNIT        LOAD    ACTIVE  SUB DESC    MACHINE
myapp.service   loaded  active  running -   64d11319...

But when I tried fleetctl status myapp.service or fleetctl journal myapp.service I got panics. You can see one at https://gist.github.com/rberger/9108181

fleetctl ssh fails with Vagrant

Using fleetctl ssh with a Vagrant setup fails:

$ echo $FLEETCTL_TUNNEL                                    
127.0.0.1:2222
$ fleetctl ssh -u hello.service
2014/02/20 20:48:27 Unable to establish SSH connection: dial tcp :22: connection refused

Jobs are not stopped on machine when fleet is stopped

If you SIGTERM fleet, it will attempt to "gracefully" shut down. It proactively removes the state of the jobs it is currently running, and the state of itself, from the Registry. It even re-offers its own jobs.

It does not, however, stop the jobs it just re-offered. This should happen.
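
A self-contained Go sketch of the shutdown order being requested; the agent and registry types are hypothetical stand-ins for fleet's internals, with the missing stop step added before the re-offer:

package main

import "log"

type registry struct{}

func (registry) removeJobState(job string) { log.Println("cleared state for", job) }
func (registry) offerJob(job string)       { log.Println("re-offered", job) }
func (registry) removeMachineState()       { log.Println("removed machine state") }

type agent struct {
	reg  registry
	jobs []string
}

// stopJob stands in for stopping the local systemd unit.
func (a *agent) stopJob(job string) { log.Println("stopped systemd unit", job) }

// shutdown is the SIGTERM path. Today fleet performs every step below
// except stopJob, leaving the unit running on the dying machine.
func (a *agent) shutdown() {
	for _, j := range a.jobs {
		a.stopJob(j)
		a.reg.removeJobState(j)
		a.reg.offerJob(j)
	}
	a.reg.removeMachineState()
}

func main() {
	a := &agent{jobs: []string{"hello.service"}}
	a.shutdown()
}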

Unit on dead machine listed as active

I launched 3 machines in an autoscaling group, then ran 4 conflicting units on the cluster. As expected, 1 unit did not start because it didn't have a qualified machine.

I told the autoscale group to have a minimum of 4 servers, and as expected the 4th unit was scheduled successfully.

Afterwards, I removed a random server from the autoscale group, but I still see 4 units running even though fleet has detected that the machine has disappeared. apache.1.service isn't running anymore because 172.31.29.243 has died.

core@ip-172-31-32-178 ~ $ fleetctl list-machines
MACHINE     IP      METADATA
b21e97e5... 172.31.29.243   -
0b739cea... 172.31.0.153    -
fd8aab4a... 172.31.32.178   -
b97044dc... 172.31.20.142   -

--MACHINE KILLED--

core@ip-172-31-32-178 ~ $ fleetctl list-machines
MACHINE     IP      METADATA
0b739cea... 172.31.0.153    -
fd8aab4a... 172.31.32.178   -
b97044dc... 172.31.20.142   -
core@ip-172-31-32-178 ~ $ fleetctl list-units
UNIT            LOAD    ACTIVE  SUB DESC    MACHINE
apache.1.service    loaded  active  running -   b21e97e5.../172.31.29.243
apache.2.service    loaded  active  running -   b97044dc.../172.31.20.142
apache.3.service    loaded  active  running -   0b739cea.../172.31.0.153
apache.4.service    loaded  active  running -   fd8aab4a.../172.31.32.178

nicer error message when ssh-add needs to be ran

Right now we panic. Maybe we could just print a nice error instead?

Existing behavior:

$ fleetctl-imc list-machines
panic: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

goroutine 1 [running]:
runtime.panic(0x2b1920, 0x2108979e0)
    /usr/local/go/src/pkg/runtime/panic.c:266 +0xb6
main.getRegistry(0x21086eee0, 0x2108710d0)
    /home/philips/coreos/fleet/fleet/src/github.com/coreos/fleet/fleetctl/cmd.go:51 +0x1ab
main.listMachinesAction(0x21086eee0)
    /home/philips/coreos/fleet/fleet/src/github.com/coreos/fleet/fleetctl/list_machines.go:31 +0x35
github.com/coreos/fleet/third_party/github.com/codegangsta/cli.Command.Run(0x347eb0, 0xd, 0x0, 0x0, 0x38fe10, ...)
    /home/philips/coreos/fleet/fleet/src/github.com/coreos/fleet/third_party/github.com/codegangsta/cli/command.go:73 +0x994
github.com/coreos/fleet/third_party/github.com/codegangsta/cli.(*App).Run(0x21088e000, 0x21082d000, 0x4, 0x4, 0x7, ...)
    /home/philips/coreos/fleet/fleet/src/github.com/coreos/fleet/third_party/github.com/codegangsta/cli/app.go:111 +0x855
main.main()
    /home/philips/coreos/fleet/fleet/src/github.com/coreos/fleet/fleetctl/cmd.go:99 +0x9c4

goroutine 3 [chan receive]:
github.com/coreos/fleet/third_party/github.com/golang/glog.(*loggingT).flushDaemon(0x6dad80)
    /home/philips/coreos/fleet/fleet/src/github.com/coreos/fleet/third_party/github.com/golang/glog/glog.go:839 +0x50
created by github.com/coreos/fleet/third_party/github.com/golang/glog.init·1
    /home/philips/coreos/fleet/fleet/src/github.com/coreos/fleet/third_party/github.com/golang/glog/glog.go:406 +0x276
Alexs-MacBook-Air-3:Downloads polvi$ ssh-add ~/.ssh/polvi-air2.pem 
Identity added: /Users/polvi/.ssh/polvi-air2.pem (/Users/polvi/.ssh/polvi-air2.pem)
Alexs-MacBook-Air-3:Downloads polvi$ fleetctl-imc list-machines
MACHINE     IP      METADATA
...

How about something like this?

$ fleetctl list-machines
ssh: unable to authenticate via ssh agent. 
Maybe ssh-add ~/.ssh/my-key.pem and try again?

fleet agents submitting bids for signed jobs they cannot verify

I deployed three fleet machines, each with my normal SSH key. I then created a new SSH key, authorized it on one of the machines, and replaced my old SSH key in my local agent with it. I then called fleetctl start --sign hello2.service.

Here's what ended up in etcd:

$ curl localhost:4001/v2/keys/_coreos.com/fleet/signing/payload
{"action":"get","node":{"key":"/_coreos.com/fleet/signing/payload","dir":true,"nodes":[{"key":"/_coreos.com/fleet/signing/payload/hello2.service","value":"{\"Tag\":\"/payload/hello2.service\",\"Signs\":[\"ipiWtx3jfm8psL33IE9gKyqr96xf2gg5PywIo7Nf0UkGyBKMBPnpxzl23Vq3XoLhK2o5tbyoEsVD6FJUOGUhYKCacjAb5ONuBfGOs1T7+iZbsDhhc+LoOwf7/A8axSpZ2V+T9vIPtJ3HrITYUlj2kdner3P2xF4WCysxufTDDVRGvFD2YcCBZ04re/SIdsdTiBkGqvB0pk9W6mrsp5BrhRFoQQMrcBTqyOlHSTn1fARrJJY1hBGlTtHXeQTZzwc/SGkfhgQkUmRwImuL+zqGvnYzQB7jLlj4RXOtxuOt77IEcDc2gTVhqah95hiD8+bHvNdnj2GbPtq7L2L1h6IfbA==\",\"p7fZyP9Nm7XL6TKQCuHTqgTu3qq4tPcrA0S+FG/IlOtVOC5VRVoQdi79hsi/I6V0lBRB1I7n589o/Kt/TdxmSSpEnGx1MY2vkaSdJnyQ9KSbvTb5Cifl/AhchcPZFMXrfjIrCQjOMO0n/4h6O7gJ3Czt89396u1yLRqmq6okTOvKv2jhlFKGfsZmqD1zxAZguIQeb04+Si5rdoHGwcZYqL+KhMDl7QyaEeSO4BYp+9QTcVn6OFCy4/UGlFcT+yPciQ+9ViMNejAlgQDe6jJeWL4rH4AYC2bbl2rlh2zHJ94rlIec25oU23n6iyHIBYalYQBF7U9T1wzYnTg2g3x4fg==\"]}","modifiedIndex":196,"createdIndex":196}],"modifiedIndex":113,"createdIndex":113}}

Each of the agents ended up bidding for my job, and the agent to which the job was scheduled logged this error:

I0304 15:55:05.544323 00635 agent.go:259] Check of payload hello2.service failed: <nil>
E0304 15:55:05.544361 00635 event.go:51] EventJobScheduled(%!s(MISSING)): Failed to fetch Job

I expected the one machine that had my new key to bid, and the other two to ignore my job.

I also tested disabling the two agents without my new key, and the remaining agent was able to verify and run my job.

odd stuff in the log

What is I0212?

Feb 12 01:31:15 ip-10-151-62-78 fleet[1628]: I0212 01:31:15.934924 01628 agent.go:142] Stopping Job([email protected])
Feb 12 01:31:15 ip-10-151-62-78 fleet[1628]: I0212 01:31:15.936894 01628 manager.go:159] Stopped systemd unit [email protected]
Feb 12 01:31:15 ip-10-151-62-78 fleet[1628]: I0212 01:31:15.936917 01628 manager.go:167] Unlinking systemd unit [email protected] from target fleet-4073a287-16bd-4431-87e7-e3f309ccd967.target
Feb 12 01:31:15 ip-10-151-62-78 fleet[1628]: I0212 01:31:15.936968 01628 manager.go:172] Removing systemd unit file /run/systemd/system/[email protected]
Feb 12 01:31:16 ip-10-151-62-78 fleet[1628]: I0212 01:31:16.329188 01628 event.go:28] EventJobOffered([email protected]): passed all criteria, submitting JobBid

Attempting to start unit in verified cluster crashes Agents

I set verify_units=true in my cluster, then attempted to start units without providing --sign. Every Agent crashed with this traceback:

I0305 05:32:19.405602 00741 agent.go:248] Fetching Job(hello.service) from Registry
Mar 05 05:32:19 ip-10-164-111-220 fleet-local[741]: panic: runtime error: invalid memory address or nil pointer dereference
Mar 05 05:32:19 ip-10-164-111-220 fleet-local[741]: [signal 0xb code=0x1 addr=0x0 pc=0x4658c4]
Mar 05 05:32:19 ip-10-164-111-220 fleet-local[741]: goroutine 91 [running]:
Mar 05 05:32:19 ip-10-164-111-220 fleet-local[741]: runtime.panic(0x6eee80, 0xb0b228)
Mar 05 05:32:19 ip-10-164-111-220 fleet-local[741]: /usr/local/go/src/pkg/runtime/panic.c:266 +0xb6
Mar 05 05:32:19 ip-10-164-111-220 fleet-local[741]: github.com/coreos/fleet/sign.(*SignatureVerifier).VerifyPayload(0xc210082500, 0xc2107e89a0, 0x0, 0x0,
Mar 05 05:32:19 ip-10-164-111-220 fleet-local[741]: /home/core/fleet/src/github.com/coreos/fleet/sign/job.go:29 +0xd4
Mar 05 05:32:19 ip-10-164-111-220 fleet-local[741]: github.com/coreos/fleet/agent.(*Agent).FetchJob(0xc2100875f0, 0xc2106d2e87, 0xd, 0x7f6ce8091bc8)
Mar 05 05:32:19 ip-10-164-111-220 fleet-local[741]: /home/core/fleet/src/github.com/coreos/fleet/agent/agent.go:257 +0x1fe
Mar 05 05:32:19 ip-10-164-111-220 fleet-local[741]: github.com/coreos/fleet/agent.(*EventHandler).HandleEventJobScheduled(0xc2100a7208, 0x773ab0, 0x11, 0
Mar 05 05:32:19 ip-10-164-111-220 fleet-local[741]: /home/core/fleet/src/github.com/coreos/fleet/agent/event.go:49 +0x33a
Mar 05 05:32:19 ip-10-164-111-220 fleet-local[741]: reflect.Value.call(0x71d300, 0xc2100a7208, 0x338, 0x747a00, 0x4, ...)
Mar 05 05:32:19 ip-10-164-111-220 fleet-local[741]: /usr/local/go/src/pkg/reflect/value.go:474 +0xe0b
Mar 05 05:32:19 ip-10-164-111-220 fleet-local[741]: reflect.Value.Call(0x71d300, 0xc2100a7208, 0x338, 0xc2107e86c0, 0x1, ...)
Mar 05 05:32:19 ip-10-164-111-220 fleet-local[741]: /usr/local/go/src/pkg/reflect/value.go:345 +0x9d
Mar 05 05:32:19 ip-10-164-111-220 fleet-local[741]: created by github.com/coreos/fleet/event.(*EventBus).dispatch
Mar 05 05:32:19 ip-10-164-111-220 fleet-local[741]: /home/core/fleet/src/github.com/coreos/fleet/event/bus.go:63 +0x6d8

fleetctl --tunnel with invalid IP times out and panics

$ fleetctl --tunnel 10.10.10.10 list-units

panic: dial tcp 10.10.10.10:22: operation timed out

goroutine 1 [running]:
runtime.panic(0x2eab60, 0xc21000a780)
    /usr/local/go/src/pkg/runtime/panic.c:266 +0xb6
main.getRegistry(0xc210060ee0, 0xc210037820)
    /Users/robszumski/Documents/fleet/src/github.com/coreos/fleet/fleetctl/cmd.go:50 +0x1ab
main.listUnitsAction(0xc210060ee0)
    /Users/robszumski/Documents/fleet/src/github.com/coreos/fleet/fleetctl/list_units.go:32 +0x35
github.com/coreos/fleet/third_party/github.com/codegangsta/cli.Command.Run(0x34d7b0, 0xa, 0x0, 0x0, 0x386e70, ...)
    /Users/robszumski/Documents/fleet/src/github.com/coreos/fleet/third_party/github.com/codegangsta/cli/command.go:73 +0x994
github.com/coreos/fleet/third_party/github.com/codegangsta/cli.(*App).Run(0xc21007b000, 0xc21000a000, 0x4, 0x4, 0x7, ...)
    /Users/robszumski/Documents/fleet/src/github.com/coreos/fleet/third_party/github.com/codegangsta/cli/app.go:111 +0x855
main.main()
    /Users/robszumski/Documents/fleet/src/github.com/coreos/fleet/fleetctl/cmd.go:97 +0x9ae

goroutine 3 [chan receive]:
github.com/coreos/fleet/third_party/github.com/golang/glog.(*loggingT).flushDaemon(0x6e7000)
    /Users/robszumski/Documents/fleet/src/github.com/coreos/fleet/third_party/github.com/golang/glog/glog.go:839 +0x50
created by github.com/coreos/fleet/third_party/github.com/golang/glog.init·1
    /Users/robszumski/Documents/fleet/src/github.com/coreos/fleet/third_party/github.com/golang/glog/glog.go:406 +0x276

goroutine 4 [syscall]:
runtime.goexit()
    /usr/local/go/src/pkg/runtime/proc.c:1394

Unable to `fleetctl start` units in failed state

I attempted to start a bunch of units, but two of them failed. I then attempted to call fleetctl start on them, thinking this would just attempt to restart the units. The fleetctl command failed:

% fleetctl start units/subgun-{presence,http}.2.service
Creation of job subgun-presence.2.service failed: 105: Key already exists (/_coreos.com/fleet/job/subgun-presence.2.service/object) [292]
Creation of job subgun-http.2.service failed: 105: Key already exists (/_coreos.com/fleet/job/subgun-http.2.service/object) [292]

I thought I could get around this by providing just the unit name instead of the local path to the unit file, but that did not help either:

% fleetctl start subgun-presence.2.service subgun-http.2.service
Creation of job subgun-presence.2.service failed: 105: Key already exists (/_coreos.com/fleet/job/subgun-presence.2.service/object) [312]
Creation of job subgun-http.2.service failed: 105: Key already exists (/_coreos.com/fleet/job/subgun-http.2.service/object) [312]

I would expect in this case that the unit simply be started, as the payload is already in the system.
