Giter Site home page Giter Site logo

Comments (11)

jocado avatar jocado commented on June 3, 2024 1

@angelcar That's great - thank you.

Please let me know if you need any further information.

FYI @stuart-warren

from amazon-ecs-init.

angelcar avatar angelcar commented on June 3, 2024

Hi,

I don't seem to be able to reproduce this.

Here's what I'm doing:

  • I start an instance with dockerd running and ecs agent stopped
  • I start the ECS agent using sudo systemctl start ecs
  • With ECS agent up and running, I stop docker by executing: sudo systemctl stop docker
  • I don't see timeouts in the logs (see below)
Nov 16 00:01:58 ip-10-0-1-247.us-west-2.compute.internal sudo[32064]: ec2-user : TTY=pts/2 ; PWD=/home/ec2-user ; USER=root ; COMMAND=/bin/systemctl stop docker
Nov 16 00:01:58 ip-10-0-1-247.us-west-2.compute.internal sudo[32064]: pam_unix(sudo:session): session opened for user root by ec2-user(uid=0)
Nov 16 00:01:58 ip-10-0-1-247.us-west-2.compute.internal systemd[1]: Stopping Amazon Elastic Container Service - container agent...
Nov 16 00:01:58 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32068]: 2021-11-16T00:01:58Z [INFO] stop
Nov 16 00:01:58 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32068]: 2021-11-16T00:01:58Z [INFO] Stopping Amazon Elastic Container Service Agent
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32068]: 2021-11-16T00:01:59Z [INFO] Container name: /ecs-agent
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal dockerd[30841]: time="2021-11-16T00:01:59.266804960Z" level=info msg="ignoring event" container=d3861bf0983c424fdc81041a4f8c5f4db725b8fbc67c072c29fbe223f5869ca2 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal containerd[23767]: time="2021-11-16T00:01:59.266900919Z" level=info msg="shim disconnected" id=d3861bf0983c424fdc81041a4f8c5f4db725b8fbc67c072c29fbe223f5869ca2
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal containerd[23767]: time="2021-11-16T00:01:59.266959772Z" level=error msg="copy shim log" error="read /proc/self/fd/17: file already closed"
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[31821]: 2021-11-16T00:01:59Z [INFO] Agent exited with code 0
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32068]:
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[31821]:
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32127]: 2021-11-16T00:01:59Z [INFO] post-stop
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32127]: 2021-11-16T00:01:59Z [INFO] Cleaning up the credentials endpoint setup for Amazon Elastic Container Service Agent
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32127]:
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal systemd[1]: Stopped Amazon Elastic Container Service - container agent.
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal systemd[1]: Stopping Docker Application Container Engine...
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal dockerd[30841]: time="2021-11-16T00:01:59.372893427Z" level=info msg="Processing signal 'terminated'"
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal dockerd[30841]: time="2021-11-16T00:01:59.374549610Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal dockerd[30841]: time="2021-11-16T00:01:59.374839252Z" level=info msg="Daemon shutdown complete"
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal systemd[1]: Stopped Docker Application Container Engine.

Could you share the exact repro steps please?

from amazon-ecs-init.

jocado avatar jocado commented on June 3, 2024

Hi.

Thanks for taking a look at this.

The reproduce it, you will need to replicate the service configuration on this system, because as I mentioned above, you cannot specify dependencies between arbitrary systemd service units. I also just found another piece of this puzzle, it's that the service responsible for docker on this system also removes the socket file after stopping docker.

Here are the reproduction steps:

# Update service config to remove inter service dependency 
cp /lib/systemd/system/ecs.service /etc/systemd/system/
sed -i '/docker.service$/d' /etc/systemd/system/ecs.service
systemctl daemon-reload

# Stop docker
systemctl stop docker.socket
systemctl stop docker.service

# Remove socket
rm /var/run/docker.sock

Then you will see messages in the ecs service journal such as:

Nov 16 14:50:54 ip-10-0-1-119 systemd[1]: ecs.service: Control process exited, code=exited, status=255/EXCEPTION
Nov 16 14:50:54 ip-10-0-1-119 amazon-ecs-init[3497486]: 2021-11-16T14:50:54Z [INFO] Error connecting to docker, backing off for 1.14777941s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
Nov 16 14:50:55 ip-10-0-1-119 amazon-ecs-init[3497486]: 2021-11-16T14:50:55Z [INFO] Error connecting to docker, backing off for 2.282153551s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
Nov 16 14:50:57 ip-10-0-1-119 amazon-ecs-init[3497486]: 2021-11-16T14:50:57Z [INFO] Error connecting to docker, backing off for 4.466145821s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
Nov 16 14:51:02 ip-10-0-1-119 amazon-ecs-init[3497486]: 2021-11-16T14:51:02Z [INFO] Error connecting to docker, backing off for 5.235010051s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
Nov 16 14:51:07 ip-10-0-1-119 amazon-ecs-init[3497486]: 2021-11-16T14:51:07Z [INFO] Error connecting to docker, backing off for 5.287113937s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory

Those are the same backoff messages I get from the ecs service when trying to run an ecs-init post-stop on system shutdown if docker has already stopped running.

from amazon-ecs-init.

angelcar avatar angelcar commented on June 3, 2024

I was able to repro the issue, but I'm afraid the errors are not coming from post-stop. Here is what happens:

  • ECS Agent runs inside a container
  • When docker service & socket are stopped, ECS Agent gets the termination signal and proceeds to exit with code zero, as shown by ecs-init logs here.
  • StartSupervised function then proceeds to end, returning a nil error here
  • In turn, ecs-init main function ends (i.e. ecs-init process exits) as you can see here
  • Since ecs-init exits, systemd restarts it automatically
  • Finally, the errors you're seeing actually come from ecs-init's main function starting over, when it tries to create the docker client, which it will need to start ecs agent: engine.New() -> docker.NewClient() -> newDockerClient -> "Error connecting to docker...

Therefore, systemd is just trying to restart ecs-init, which in turn is trying to restart ecs agent. stop, and post-stop are never actually executed with the repro steps you provided.

Thoughts?

from amazon-ecs-init.

jocado avatar jocado commented on June 3, 2024

I think our situation is as a result of what you may currently consider a corner case. I think in our case docker and ecs are being stopped at the same time, but with the above dependency and socket caveats, which seems to result the unit running ecs-init stop before the runing amazon-init process exists with an error.

We are running on a not officially supported OS, Ubuntu Core 20 [ based on 20.04 components ], so I don't want to waist your time trying to reproduce something that may be tricky.

The question remains though, should amazon-ecs-init post-stop require the docker socket to exist ? Does the current code require it ?

from amazon-ecs-init.

angelcar avatar angelcar commented on June 3, 2024

The question remains though, should amazon-ecs-init post-stop require the docker socket to exist ? Does the current code require it ?

post-stop does not require docker, however stop does, since this step attempts to cleanly stop the ecs agent via docker container stop

from amazon-ecs-init.

jocado avatar jocado commented on June 3, 2024

Hmmm, ok, that's interesting. I will try and show why it appears like that from the situation I have, perhaps that will help shed some light on it.

As I mentioned above, when we do a system shutdown, the ecs-init service gets stopped when the docker service isn't running and the docker socket is removed. This causes significant delays on shutdown while there are backoff retries from ecs-init while the service unit runs the ecs-init stop and post-stop

To prevent the delays, we added the following workarounds:

stop was replaced with this:

#!/bin/bash

[ -S /var/run/docker.sock ] || echo "Not running ecs-init stop, as docker is not available"
[ -S /var/run/docker.sock ] && ${SNAP}/bin/amazon-ecs-init stop

post-stop was replaced with this:

#!/bin/bash

[ -S /var/run/docker.sock ] || echo "Not running ecs-init post-stop, as docker is not available"
[ -S /var/run/docker.sock ] && ${SNAP}/bin/amazon-ecs-init post-stop

The above approach means that if docker is running then it works normally [ e.g. any normal service restart or stop ], but in system shutdown it doesn't timeout if the docker socket is not available.

It works as expected, and produces the following logs:

2021-11-18T08:36:40Z systemd[1]: Stopping Service for snap application ecs-anywhere.ecs-init...
2021-11-18T08:36:41Z ecs-anywhere.ecs-init[11803]: Not running ecs-init stop, as docker is not available
2021-11-18T08:36:41Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Control process exited, code=exited, status=1/FAILURE
2021-11-18T08:36:41Z ecs-anywhere.ecs-init[11856]: Not running ecs-init post-stop, as docker is not available
2021-11-18T08:36:41Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Control process exited, code=exited, status=1/FAILURE
2021-11-18T08:36:41Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Failed with result 'exit-code'.
2021-11-18T08:36:41Z systemd[1]: Stopped Service for snap application ecs-anywhere.ecs-init.

If I remove the socket check from the post-stop, like this:

#!/bin/bash

[ -S /var/run/docker.sock ] || echo "Will run ecs-init post-stop, even though docker socket is not available"
${SNAP}/bin/amazon-ecs-init post-stop

I hangs on shutdown, which you can see in the following logs:

2021-11-18T08:51:04Z systemd[1]: Stopping Service for snap application ecs-anywhere.ecs-init...
2021-11-18T08:51:05Z ecs-anywhere.ecs-init[19141]: Not running ecs-init stop, as docker is not available
2021-11-18T08:51:05Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Control process exited, code=exited, status=1/FAILURE
2021-11-18T08:51:05Z ecs-anywhere.ecs-init[9973]: 
2021-11-18T08:51:05Z ecs-anywhere.ecs-init[19187]: Will run ecs-init post-stop, even though docker socket is not available
2021-11-18T08:51:11Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:11Z [WARN] Could not retrieve the region from EC2 Instance Metadata. Error: %sEC2MetadataRequestError: failed to get EC2 instance identity document
2021-11-18T08:51:11Z ecs-anywhere.ecs-init[19222]: caused by: RequestError: send request failed
2021-11-18T08:51:11Z ecs-anywhere.ecs-init[19222]: caused by: Get http://169.254.169.254/latest/dynamic/instance-identity/document: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2021-11-18T08:51:11Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:11Z [INFO] Error connecting to docker, backing off for 1.14777941s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-18T08:51:12Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:12Z [INFO] Error connecting to docker, backing off for 2.282153551s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-18T08:51:14Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:14Z [INFO] Error connecting to docker, backing off for 4.466145821s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-18T08:51:19Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:19Z [INFO] Error connecting to docker, backing off for 5.235010051s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-18T08:51:24Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:24Z [INFO] Error connecting to docker, backing off for 5.287113937s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-18T08:51:29Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:29Z [ERROR] Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-18T08:51:29Z ecs-anywhere.ecs-init[19222]: 
2021-11-18T08:51:29Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Control process exited, code=exited, status=255/EXCEPTION
2021-11-18T08:51:29Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Failed with result 'exit-code'.
2021-11-18T08:51:29Z systemd[1]: Stopped Service for snap application ecs-anywhere.ecs-init.

Some of the delay appears to be trying to reach the ec2 metadata service, but the bulk of the time seems to be trying to connect to docker.

Am I simply misinterpreting this ?

from amazon-ecs-init.

angelcar avatar angelcar commented on June 3, 2024

Hmm, that's very interesting. I still really don't see how post-stop requires docker, but admittedly I can't explain the behavior observed in the logs above when post-stop is executed.

To further investigate, could you repeat the scenario where you don't execute stop; you DO execute post-stop, but this time also add an echo to both, pre-start, and start? this is to see if any of the latter is executed after stop.

from amazon-ecs-init.

jocado avatar jocado commented on June 3, 2024

I added that debugging. We have both pre-start and start in the same wrapper, which now looks like this:

#!/bin/bash

set -a
source ${SNAP_DATA}/etc/ecs/ecs.config
source ${SNAP_DATA}/var/lib/ecs/ecs.config
set +a

echo "Will run ecs-init pre-start"
${SNAP}/bin/amazon-ecs-init pre-start
echo "Finished ecs-init pre-start"

echo "Will run ecs-init start"
${SNAP}/bin/amazon-ecs-init start
echo "Finished ecs-init start"

But, it doesn't seem to be executed during the system shutdown. Here are the logs:

2021-11-19T08:52:44Z systemd[1]: Stopping Service for snap application ecs-anywhere.ecs-init...
2021-11-19T08:52:44Z ecs-anywhere.ecs-init[16760]: Not running ecs-init stop, as docker is not available
2021-11-19T08:52:44Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Control process exited, code=exited, status=1/FAILURE
2021-11-19T08:52:44Z ecs-anywhere.ecs-init[14952]: 
2021-11-19T08:52:44Z ecs-anywhere.ecs-init[16829]: Will run ecs-init post-stop, even though docker socket is not available
2021-11-19T08:52:50Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:52:50Z [WARN] Could not retrieve the region from EC2 Instance Metadata. Error: %sEC2MetadataRequestError: failed to get EC2 instance identity document
2021-11-19T08:52:50Z ecs-anywhere.ecs-init[16893]: caused by: RequestError: send request failed
2021-11-19T08:52:50Z ecs-anywhere.ecs-init[16893]: caused by: Get http://169.254.169.254/latest/dynamic/instance-identity/document: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2021-11-19T08:52:50Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:52:50Z [INFO] Error connecting to docker, backing off for 1.14777941s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-19T08:52:52Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:52:52Z [INFO] Error connecting to docker, backing off for 2.282153551s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-19T08:52:54Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:52:54Z [INFO] Error connecting to docker, backing off for 4.466145821s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-19T08:52:58Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:52:58Z [INFO] Error connecting to docker, backing off for 5.235010051s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-19T08:53:04Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:53:04Z [INFO] Error connecting to docker, backing off for 5.287113937s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-19T08:53:09Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:53:09Z [ERROR] Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-19T08:53:09Z ecs-anywhere.ecs-init[16893]: 
2021-11-19T08:53:09Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Control process exited, code=exited, status=255/EXCEPTION
2021-11-19T08:53:09Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Failed with result 'exit-code'.
2021-11-19T08:53:09Z systemd[1]: Stopped Service for snap application ecs-anywhere.ecs-init.

Seems to be a bit of a mystery!

from amazon-ecs-init.

angelcar avatar angelcar commented on June 3, 2024

Thanks for providing the additional info. I think I know what's happening. The problem is that the docker client is created on ecs-init's main function for all the actions, even if it's not going to be used (like in post-stop).

I'll try and get a PR to address this and post an update here when the change is released.

from amazon-ecs-init.

ubhattacharjya avatar ubhattacharjya commented on June 3, 2024

Hi,

The change has been released as part of ECS agent 1.57.1.

I will close this issue. Kindly reopen if you face further problems.

from amazon-ecs-init.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.