Comments (11)
@angelcar That's great - thank you.
Please let me know if you need any further information.
FYI @stuart-warren
from amazon-ecs-init.
Hi,
I don't seem to be able to reproduce this.
Here's what I'm doing:
- I start an instance with
dockerd
running and ecs agent stopped - I start the ECS agent using
sudo systemctl start ecs
- With ECS agent up and running, I stop docker by executing:
sudo systemctl stop docker
- I don't see timeouts in the logs (see below)
Nov 16 00:01:58 ip-10-0-1-247.us-west-2.compute.internal sudo[32064]: ec2-user : TTY=pts/2 ; PWD=/home/ec2-user ; USER=root ; COMMAND=/bin/systemctl stop docker
Nov 16 00:01:58 ip-10-0-1-247.us-west-2.compute.internal sudo[32064]: pam_unix(sudo:session): session opened for user root by ec2-user(uid=0)
Nov 16 00:01:58 ip-10-0-1-247.us-west-2.compute.internal systemd[1]: Stopping Amazon Elastic Container Service - container agent...
Nov 16 00:01:58 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32068]: 2021-11-16T00:01:58Z [INFO] stop
Nov 16 00:01:58 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32068]: 2021-11-16T00:01:58Z [INFO] Stopping Amazon Elastic Container Service Agent
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32068]: 2021-11-16T00:01:59Z [INFO] Container name: /ecs-agent
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal dockerd[30841]: time="2021-11-16T00:01:59.266804960Z" level=info msg="ignoring event" container=d3861bf0983c424fdc81041a4f8c5f4db725b8fbc67c072c29fbe223f5869ca2 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal containerd[23767]: time="2021-11-16T00:01:59.266900919Z" level=info msg="shim disconnected" id=d3861bf0983c424fdc81041a4f8c5f4db725b8fbc67c072c29fbe223f5869ca2
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal containerd[23767]: time="2021-11-16T00:01:59.266959772Z" level=error msg="copy shim log" error="read /proc/self/fd/17: file already closed"
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[31821]: 2021-11-16T00:01:59Z [INFO] Agent exited with code 0
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32068]:
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[31821]:
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32127]: 2021-11-16T00:01:59Z [INFO] post-stop
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32127]: 2021-11-16T00:01:59Z [INFO] Cleaning up the credentials endpoint setup for Amazon Elastic Container Service Agent
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32127]:
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal systemd[1]: Stopped Amazon Elastic Container Service - container agent.
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal systemd[1]: Stopping Docker Application Container Engine...
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal dockerd[30841]: time="2021-11-16T00:01:59.372893427Z" level=info msg="Processing signal 'terminated'"
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal dockerd[30841]: time="2021-11-16T00:01:59.374549610Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal dockerd[30841]: time="2021-11-16T00:01:59.374839252Z" level=info msg="Daemon shutdown complete"
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal systemd[1]: Stopped Docker Application Container Engine.
Could you share the exact repro steps please?
from amazon-ecs-init.
Hi.
Thanks for taking a look at this.
The reproduce it, you will need to replicate the service configuration on this system, because as I mentioned above, you cannot specify dependencies between arbitrary systemd service units. I also just found another piece of this puzzle, it's that the service responsible for docker on this system also removes the socket file after stopping docker.
Here are the reproduction steps:
# Update service config to remove inter service dependency
cp /lib/systemd/system/ecs.service /etc/systemd/system/
sed -i '/docker.service$/d' /etc/systemd/system/ecs.service
systemctl daemon-reload
# Stop docker
systemctl stop docker.socket
systemctl stop docker.service
# Remove socket
rm /var/run/docker.sock
Then you will see messages in the ecs service journal such as:
Nov 16 14:50:54 ip-10-0-1-119 systemd[1]: ecs.service: Control process exited, code=exited, status=255/EXCEPTION
Nov 16 14:50:54 ip-10-0-1-119 amazon-ecs-init[3497486]: 2021-11-16T14:50:54Z [INFO] Error connecting to docker, backing off for 1.14777941s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
Nov 16 14:50:55 ip-10-0-1-119 amazon-ecs-init[3497486]: 2021-11-16T14:50:55Z [INFO] Error connecting to docker, backing off for 2.282153551s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
Nov 16 14:50:57 ip-10-0-1-119 amazon-ecs-init[3497486]: 2021-11-16T14:50:57Z [INFO] Error connecting to docker, backing off for 4.466145821s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
Nov 16 14:51:02 ip-10-0-1-119 amazon-ecs-init[3497486]: 2021-11-16T14:51:02Z [INFO] Error connecting to docker, backing off for 5.235010051s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
Nov 16 14:51:07 ip-10-0-1-119 amazon-ecs-init[3497486]: 2021-11-16T14:51:07Z [INFO] Error connecting to docker, backing off for 5.287113937s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
Those are the same backoff messages I get from the ecs service when trying to run an ecs-init post-stop
on system shutdown if docker has already stopped running.
from amazon-ecs-init.
I was able to repro the issue, but I'm afraid the errors are not coming from post-stop
. Here is what happens:
- ECS Agent runs inside a container
- When docker service & socket are stopped, ECS Agent gets the termination signal and proceeds to exit with code zero, as shown by ecs-init logs here.
StartSupervised
function then proceeds to end, returning a nil error here- In turn, ecs-init main function ends (i.e. ecs-init process exits) as you can see here
- Since ecs-init exits,
systemd
restarts it automatically - Finally, the errors you're seeing actually come from ecs-init's main function starting over, when it tries to create the docker client, which it will need to start ecs agent:
engine.New()
->docker.NewClient()
->newDockerClient
->"Error connecting to docker...
Therefore, systemd is just trying to restart ecs-init, which in turn is trying to restart ecs agent. stop, and post-stop are never actually executed with the repro steps you provided.
Thoughts?
from amazon-ecs-init.
I think our situation is as a result of what you may currently consider a corner case. I think in our case docker and ecs are being stopped at the same time, but with the above dependency and socket caveats, which seems to result the unit running ecs-init stop
before the runing amazon-init process exists with an error.
We are running on a not officially supported OS, Ubuntu Core 20 [ based on 20.04 components ], so I don't want to waist your time trying to reproduce something that may be tricky.
The question remains though, should amazon-ecs-init post-stop
require the docker socket to exist ? Does the current code require it ?
from amazon-ecs-init.
The question remains though, should amazon-ecs-init post-stop require the docker socket to exist ? Does the current code require it ?
post-stop
does not require docker, however stop
does, since this step attempts to cleanly stop the ecs agent via docker container stop
from amazon-ecs-init.
Hmmm, ok, that's interesting. I will try and show why it appears like that from the situation I have, perhaps that will help shed some light on it.
As I mentioned above, when we do a system shutdown, the ecs-init service gets stopped when the docker service isn't running and the docker socket is removed. This causes significant delays on shutdown while there are backoff retries from ecs-init while the service unit runs the ecs-init stop
and post-stop
To prevent the delays, we added the following workarounds:
stop
was replaced with this:
#!/bin/bash
[ -S /var/run/docker.sock ] || echo "Not running ecs-init stop, as docker is not available"
[ -S /var/run/docker.sock ] && ${SNAP}/bin/amazon-ecs-init stop
post-stop
was replaced with this:
#!/bin/bash
[ -S /var/run/docker.sock ] || echo "Not running ecs-init post-stop, as docker is not available"
[ -S /var/run/docker.sock ] && ${SNAP}/bin/amazon-ecs-init post-stop
The above approach means that if docker is running then it works normally [ e.g. any normal service restart or stop ], but in system shutdown it doesn't timeout if the docker socket is not available.
It works as expected, and produces the following logs:
2021-11-18T08:36:40Z systemd[1]: Stopping Service for snap application ecs-anywhere.ecs-init...
2021-11-18T08:36:41Z ecs-anywhere.ecs-init[11803]: Not running ecs-init stop, as docker is not available
2021-11-18T08:36:41Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Control process exited, code=exited, status=1/FAILURE
2021-11-18T08:36:41Z ecs-anywhere.ecs-init[11856]: Not running ecs-init post-stop, as docker is not available
2021-11-18T08:36:41Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Control process exited, code=exited, status=1/FAILURE
2021-11-18T08:36:41Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Failed with result 'exit-code'.
2021-11-18T08:36:41Z systemd[1]: Stopped Service for snap application ecs-anywhere.ecs-init.
If I remove the socket check from the post-stop
, like this:
#!/bin/bash
[ -S /var/run/docker.sock ] || echo "Will run ecs-init post-stop, even though docker socket is not available"
${SNAP}/bin/amazon-ecs-init post-stop
I hangs on shutdown, which you can see in the following logs:
2021-11-18T08:51:04Z systemd[1]: Stopping Service for snap application ecs-anywhere.ecs-init...
2021-11-18T08:51:05Z ecs-anywhere.ecs-init[19141]: Not running ecs-init stop, as docker is not available
2021-11-18T08:51:05Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Control process exited, code=exited, status=1/FAILURE
2021-11-18T08:51:05Z ecs-anywhere.ecs-init[9973]:
2021-11-18T08:51:05Z ecs-anywhere.ecs-init[19187]: Will run ecs-init post-stop, even though docker socket is not available
2021-11-18T08:51:11Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:11Z [WARN] Could not retrieve the region from EC2 Instance Metadata. Error: %sEC2MetadataRequestError: failed to get EC2 instance identity document
2021-11-18T08:51:11Z ecs-anywhere.ecs-init[19222]: caused by: RequestError: send request failed
2021-11-18T08:51:11Z ecs-anywhere.ecs-init[19222]: caused by: Get http://169.254.169.254/latest/dynamic/instance-identity/document: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2021-11-18T08:51:11Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:11Z [INFO] Error connecting to docker, backing off for 1.14777941s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-18T08:51:12Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:12Z [INFO] Error connecting to docker, backing off for 2.282153551s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-18T08:51:14Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:14Z [INFO] Error connecting to docker, backing off for 4.466145821s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-18T08:51:19Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:19Z [INFO] Error connecting to docker, backing off for 5.235010051s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-18T08:51:24Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:24Z [INFO] Error connecting to docker, backing off for 5.287113937s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-18T08:51:29Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:29Z [ERROR] Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-18T08:51:29Z ecs-anywhere.ecs-init[19222]:
2021-11-18T08:51:29Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Control process exited, code=exited, status=255/EXCEPTION
2021-11-18T08:51:29Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Failed with result 'exit-code'.
2021-11-18T08:51:29Z systemd[1]: Stopped Service for snap application ecs-anywhere.ecs-init.
Some of the delay appears to be trying to reach the ec2 metadata service, but the bulk of the time seems to be trying to connect to docker.
Am I simply misinterpreting this ?
from amazon-ecs-init.
Hmm, that's very interesting. I still really don't see how post-stop
requires docker, but admittedly I can't explain the behavior observed in the logs above when post-stop
is executed.
To further investigate, could you repeat the scenario where you don't execute stop
; you DO execute post-stop
, but this time also add an echo
to both, pre-start
, and start
? this is to see if any of the latter is executed after stop
.
from amazon-ecs-init.
I added that debugging. We have both pre-start
and start
in the same wrapper, which now looks like this:
#!/bin/bash
set -a
source ${SNAP_DATA}/etc/ecs/ecs.config
source ${SNAP_DATA}/var/lib/ecs/ecs.config
set +a
echo "Will run ecs-init pre-start"
${SNAP}/bin/amazon-ecs-init pre-start
echo "Finished ecs-init pre-start"
echo "Will run ecs-init start"
${SNAP}/bin/amazon-ecs-init start
echo "Finished ecs-init start"
But, it doesn't seem to be executed during the system shutdown. Here are the logs:
2021-11-19T08:52:44Z systemd[1]: Stopping Service for snap application ecs-anywhere.ecs-init...
2021-11-19T08:52:44Z ecs-anywhere.ecs-init[16760]: Not running ecs-init stop, as docker is not available
2021-11-19T08:52:44Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Control process exited, code=exited, status=1/FAILURE
2021-11-19T08:52:44Z ecs-anywhere.ecs-init[14952]:
2021-11-19T08:52:44Z ecs-anywhere.ecs-init[16829]: Will run ecs-init post-stop, even though docker socket is not available
2021-11-19T08:52:50Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:52:50Z [WARN] Could not retrieve the region from EC2 Instance Metadata. Error: %sEC2MetadataRequestError: failed to get EC2 instance identity document
2021-11-19T08:52:50Z ecs-anywhere.ecs-init[16893]: caused by: RequestError: send request failed
2021-11-19T08:52:50Z ecs-anywhere.ecs-init[16893]: caused by: Get http://169.254.169.254/latest/dynamic/instance-identity/document: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2021-11-19T08:52:50Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:52:50Z [INFO] Error connecting to docker, backing off for 1.14777941s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-19T08:52:52Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:52:52Z [INFO] Error connecting to docker, backing off for 2.282153551s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-19T08:52:54Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:52:54Z [INFO] Error connecting to docker, backing off for 4.466145821s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-19T08:52:58Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:52:58Z [INFO] Error connecting to docker, backing off for 5.235010051s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-19T08:53:04Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:53:04Z [INFO] Error connecting to docker, backing off for 5.287113937s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-19T08:53:09Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:53:09Z [ERROR] Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-19T08:53:09Z ecs-anywhere.ecs-init[16893]:
2021-11-19T08:53:09Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Control process exited, code=exited, status=255/EXCEPTION
2021-11-19T08:53:09Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Failed with result 'exit-code'.
2021-11-19T08:53:09Z systemd[1]: Stopped Service for snap application ecs-anywhere.ecs-init.
Seems to be a bit of a mystery!
from amazon-ecs-init.
Thanks for providing the additional info. I think I know what's happening. The problem is that the docker client is created on ecs-init's main function for all the actions, even if it's not going to be used (like in post-stop
).
I'll try and get a PR to address this and post an update here when the change is released.
from amazon-ecs-init.
Hi,
The change has been released as part of ECS agent 1.57.1.
I will close this issue. Kindly reopen if you face further problems.
from amazon-ecs-init.
Related Issues (20)
- Add staticcheck tool to ecs-init (same as agent) HOT 1
- Containers with multiple listening ports in 'host' networking unable to send traffic to other local ports HOT 2
- Installation of ecs volume plugin on RHEL HOT 3
- ECS Agent update issue HOT 1
- iptables rule isn't removed on service restart after changing ECS_ALLOW_OFFHOST_INTROSPECTION_ACCESS from false to true HOT 1
- Setting ECS_ALLOW_OFFHOST_INTROSPECTION_ACCESS=true doesn't appear to work on Amazon Linux 1 HOT 1
- ECS cluster registers new task every time While uploading large file in aws ecs through loadbalancer with 503 gateway error HOT 7
- ECS service doesn't always run at boot time due to systemd breaking service ordering cycles HOT 4
- Migrate from dep to go modules for dependency management HOT 1
- Configurable ecs-agent image version HOT 8
- Support overriding ecs agent certificate with host certificate for generic rpm package
- 1.53.0 not availalbe as rpm HOT 1
- Running "make deb" or "make generic-rpm" does not package amazon-ecs-volume-plugin HOT 2
- Offhost ECS Agent Container Introspection is permanently enabled when primary NIC is not called eth0 HOT 4
- ecs anywhere install script: unnecessary requirement on dirmngr HOT 1
- Cannot build RPM package on CentOS 7.9.2009 HOT 2
- amazon-ecs-volume-plugin.service is missing in SLES 15 SP2 amazon-ecs-init 1.53 version HOT 2
- [Proposal] Integrate ECS Init into ECS Agent & deprecate standalone ECS Init repository HOT 3
- VolumeDriver.Create: encountered error(s) in creating the volume HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from amazon-ecs-init.