Comments (24)

ederst commented on August 28, 2024

Yeah that's a bummer, just found that out. I think there is no easy solution to that.

When the Mesos slave runs the normal way (without a Docker container), the executor processes are re-parented to the init process (PID 1). To my understanding this is not possible when a containerized process dies, because it would violate the whole container idea.

However, it may be possible to create a Mesos executor image (basically the Mesos slave image but without the mesos-slave ENTRYPOINT) and switch the mesos-executor binary/script of the Mesos slave image with a script that executes the Mesos executor image.

Since Mesos slave is already spawning Docker containers on the host system (-v /var/run/docker.sock:/var/run/docker.sock) the Mesos executor container probably won't die when the slave process gets killed.

At least the Task containers launched by the Mesos slave container don't die - I've tested this. They just get killed when the Slave container is restarted, because it cannot find a running executor.

The Executor containers probably need to be started with --privileged and --pid=host.

Sounds hacky, but maybe it will work. I'll try that.

from docker-containers.

ederst commented on August 28, 2024

@bobrik I did it. Basically, with the method I mentioned in my last post.

Roughly explained:
I replaced /usr/libexec/mesos/mesos-executor in the mesos_slave container with a Python script which uses docker-py to start a Docker container on the host machine. This script just passes the "--override..." arguments on to a container with entrypoint mesos-executor and changes forked.pid to the PID of the started mesos-executor Docker container.

Now it is possible to restart the mesos-slave service without losing/killing the executor (it's a Docker container on the host machine!), and on restart mesos-slave finds the executor and reconnects to it. Yay!
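
For readers who want a feel for the wrapper idea, here is a minimal sketch. It is not the script from this thread: it builds plain `docker` command lines instead of using docker-py, the image name `example/mesos-executor` is made up, and it assumes forked.pid simply holds the PID as plain text.

```python
import shlex
import sys
from pathlib import Path

# Hypothetical image name; the thread does not say what the executor image is called.
EXECUTOR_IMAGE = "example/mesos-executor"


def executor_run_command(executor_args, image=EXECUTOR_IMAGE):
    """Build a `docker run` argv that starts mesos-executor in its own container."""
    return [
        "docker", "run", "--detach",
        "--privileged", "--pid=host",       # flags suggested earlier in the thread
        "--entrypoint", "mesos-executor",   # run the executor instead of the slave
        image,
        *executor_args,                     # forwarded unchanged from mesos-slave
    ]


def container_pid_command(container_id):
    """Build the `docker inspect` argv that prints the container's PID on the host."""
    return ["docker", "inspect", "--format", "{{.State.Pid}}", container_id]


def write_forked_pid(pid_file, pid):
    """Overwrite the forked.pid file with the PID of the containerized executor.

    Assumption: the file holds the PID as plain text.
    """
    Path(pid_file).write_text(str(pid))


if __name__ == "__main__":
    # Print the command only; nothing is executed here.
    print(shlex.join(executor_run_command(sys.argv[1:])))
```

A real wrapper would run the first command, feed the returned container ID into the second, and pass the result to `write_forked_pid`; the location of forked.pid is left as a parameter because the thread does not give it.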

I'll post the code here ASAP.

ederst commented on August 28, 2024

The solution I came up with: https://github.com/ederst/docker-containers/tree/master/mesos/dockerfile-templates

tobilg commented on August 28, 2024

Is this still an issue with 0.25.0? As I understand, this was fixed in 0.23.0?

ederst commented on August 28, 2024

Good question. AFAIK we never implemented my somewhat hacky solution.
Currently we use 0.24.1, so at least I could test it for that version.

tobilg commented on August 28, 2024

Well, I was asking because #20 was closed, which references this issue.

I'm not really sure if using the /usr/libexec/mesos/mesos-executor replacement is still necessary. I'm running Mesos 0.25.0 Masters/Slaves in CoreOS as Docker images with no problems so far.

gregory90 commented on August 28, 2024

@tobilg I did mention this issue in #20. Unfortunately it still isn't working for me even with 0.25.0. Did you manage to run mesos-slave with Docker container recovery (e.g. after a slave restart/failure)? If yes, I'd really appreciate it if you could share how you start Mesos on CoreOS.
I've also raised this issue on the Mesos mailing list: https://www.mail-archive.com/[email protected]/msg04975.html . I'd be very grateful if you could give me some directions on how to use this feature properly :)

tnachen commented on August 28, 2024

@gregory90 to run the slave so that it can recover correctly, you need to both 1) set the right flags when you run the Mesos slave Docker container and 2) set the right flags when you start the slave.

  1. The Mesos slave Docker container should at least run with --net=host --pid=host -v /var/run/docker.sock:/var/run/docker.sock -v /sys/fs/cgroup:/sys/fs/cgroup -v /tmp/mesos:/tmp/mesos

  2. You also need to specify the --docker_mesos_image flag, pointing to the image you used to launch the slave

Basically, the flag lets the Mesos slave launch all executors in Docker containers as well, so when mesos-slave restarts it can find the running executors in their containers and reattach to them.
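
The two steps above can be sketched as a small helper that assembles the `docker run` command line. This is only an illustration under assumptions: the image name is whatever you pass in, the image's entrypoint is assumed to be mesos-slave (so trailing arguments reach it), and other required slave flags such as the master address are left out.

```python
import shlex

# Mounts listed in step 1: Docker socket, cgroup hierarchy, slave work directory.
SLAVE_VOLUMES = [
    "/var/run/docker.sock:/var/run/docker.sock",
    "/sys/fs/cgroup:/sys/fs/cgroup",
    "/tmp/mesos:/tmp/mesos",
]


def slave_run_command(image):
    """Build a `docker run` argv for the slave container with the recovery flags."""
    argv = ["docker", "run", "--net=host", "--pid=host"]
    for volume in SLAVE_VOLUMES:
        argv += ["-v", volume]
    # Step 2: tell the slave which image to use for the executor containers.
    # Assumes the image's entrypoint is mesos-slave, so this flag is passed to it.
    argv += [image, f"--docker_mesos_image={image}"]
    return argv


if __name__ == "__main__":
    # "example/mesos-slave:tag" is a placeholder, not a real image reference.
    print(shlex.join(slave_run_command("example/mesos-slave:tag")))
```

The same image is used twice on purpose: once to start the slave and once as the value of --docker_mesos_image.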

gregory90 commented on August 28, 2024

@tnachen what I'm trying to say is that I'm still having problems with container recovery on Mesos with those settings. I'm still not sure whether it's a misconfiguration on my side. I've provided all the information I could think of in https://www.mail-archive.com/[email protected]/msg04983.html . Is there something suspicious?

tnachen commented on August 28, 2024

@gregory90 so it looks like the container received a SIGTERM? What shows up on the host when you run docker ps -a, and also docker inspect on the finished Docker container?

gregory90 commented on August 28, 2024

@tnachen docker ps -a: https://gist.github.com/gregory90/3aa141ed6acb56f02ee1
docker inspect on finished container: https://gist.github.com/gregory90/5f9b3c59d357943aa8a1

tobilg commented on August 28, 2024

I just tested it with the official mesosphere/mesos-slave:0.25.0-0.2.70.ubuntu1404 image, and passing --docker_mesos_image=mesosphere/mesos-slave:0.25.0-0.2.70.ubuntu1404 leads to Marathon not being able to deploy any tasks on the slaves.

Removing this option enables the task deployment again.

Current config:

[Unit]
Description=MesosSlave
After=docker.service
Requires=docker.service

[Service]
Restart=on-failure
RestartSec=20
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill mesos_slave
ExecStartPre=-/usr/bin/docker rm mesos_slave
ExecStartPre=/usr/bin/docker pull mesosphere/mesos-slave:0.25.0-0.2.70.ubuntu1404
ExecStart=/usr/bin/sh -c "/usr/bin/docker run \
    --name=mesos_slave \
    --net=host \
    --privileged \
    --pid=host \
    -v /sys/fs/cgroup:/sys/fs/cgroup \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /usr/bin/docker:/usr/bin/docker \
    -v /lib64/libdevmapper.so.1.02:/lib/libdevmapper.so.1.02:ro \
    -v /tmp/mesos:/tmp/mesos \
    -p 5051:5051 \
    -e MESOS_IP=192.168.200.167 \
    -e MESOS_HOSTNAME=192.168.200.167 \
    -e MESOS_CONTAINERIZERS=docker,mesos \
    -e MESOS_MASTER=zk://192.168.200.169:2181,192.168.200.168:2181,192.168.200.167:2181/mesos \
    -e MESOS_LOG_DIR=/var/log/mesos/slave \
    -e MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins \
    -e MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD=90secs \
    -e MESOS_DOCKER_STOP_TIMEOUT=60secs \
    -e MESOS_CGROUPS_ENABLE_CFS=true \
    mesosphere/mesos-slave:0.25.0-0.2.70.ubuntu1404"
ExecStop=/usr/bin/docker stop mesos_slave

[Install]
WantedBy=multi-user.target

gregory90 commented on August 28, 2024

@tobilg That's exactly my issue. Without that flag recovering containers won't work though. Hope @tnachen can help us troubleshoot this problem a little more.

gregory90 commented on August 28, 2024

@tnachen I just tried it on boot2docker on OS X (thinking the problem was CoreOS-specific) and can't get Docker recovery to work there either.
Steps to reproduce using docker-compose:

  1. Run this docker-compose.yml file: https://gist.github.com/gregory90/421ccf59294184694410
  2. Try to run any docker container through marathon - I tried this: https://gist.github.com/gregory90/83a57756e2f3c5e461a4

dhorbach commented on August 28, 2024

The problem with the Docker executor container is that it doesn't have the appropriate mounts for Docker files, like /var/run/docker.sock:/var/run/docker.sock and /usr/bin/docker:/usr/bin/docker, and thus is not able to launch Docker tasks.

  1. Mesos-slave container - mounts specified by -v options. It spawns executor containers.
  2. Mesos-slave executor container - mounts absent - not able to launch Docker tasks with Marathon. The error is: "docker" command not found.
  3. Task container - not started.

I haven't found a way to specify the appropriate mounts for executor containers so far.

gregory90 commented on August 28, 2024

Related: https://issues.apache.org/jira/browse/MESOS-2115

bobrik commented on August 28, 2024

@dhorbach https://github.com/bobrik/mesos-compose/blob/master/docker-compose.yml#L22-L37

tobilg commented on August 28, 2024

@bobrik I don't think that this solves the problem. If I understand correctly, Mesos doesn't yet provide a way to even specify additional options together with the --docker_mesos_image argument.

bobrik commented on August 28, 2024

Ah, right, you're trying to make the executor work in a separate container. I'd prefer having MESOS-3573 resolved.

ederst commented on August 28, 2024

@tnachen Great to see a flag like --docker_mesos_image implemented. However, I ran into an issue using it (using Docker image mesosphere/mesos-slave:0.27.2-2.0.15.ubuntu1404):

I0401 11:02:55.378304     7 exec.cpp:134] Version: 0.27.2
I0401 11:02:55.383819    12 exec.cpp:208] Executor registered on slave e0f2a7a3-878b-4cd5-b9ac-d4393223a00a-S0
ABORT: (/tmp/mesos-build/mesos-repo/3rdparty/libprocess/src/subprocess.cpp:322): Failed to os::execvpe on path '/usr/bin/docker': No such file or directory

According to the logs it seems that the mesos-executor-container cannot find the Docker binary.

This is because the Mesos slave Docker image does not include a Docker binary (my original slave container mounts that binary to /usr/bin/docker with the -v option like here) and the executor container which is started by the slave container does not mount this binary, if I understand the code of docker.cpp correctly.

Without the --docker_mesos_image option everything works fine, maybe I'm missing something else when trying to use this option.

Edit: aesthetic changes...

Edit2: Adding something like this to the code of docker.cpp (where the slave starts the executor in the container) would probably help:

Volume* dockerBinVolume = newContainerInfo.add_volumes();
dockerBinVolume->set_host_path(flags.docker);
dockerBinVolume->set_container_path(flags.docker);
dockerBinVolume->set_mode(Volume::RO);

(flags.docker == --docker=<path to docker bin> == docker -v <ptdb_host>:<ptdb_container> ... mesos_slave ... I guess?)

Edit3: It probably would work with the Images of mesoscloud/mesos-slave since a docker binary is included in those. However there are no images for Mesos >=0.25* and having a different version of a Docker binary accessing a registry on the host via a mounted socket does not always work, as far as I remember (some versions of Docker requiring a certain version of the registry, etc.).

ederst commented on August 28, 2024

I have implemented my change (adding the binary volume mount to the executor) and compiled it to test it. With this change, the executor container is able to find the docker binary (duh), but another problem arises: it seems that fetching is not triggered anymore.

To explain: we are using Mesos in combination with the Jenkins Mesos Plugin, and the fetcher should download slave.jar (and some other files defined in "Additional URIs"), but this does not happen.

When looking at the stdout/stderr outputs in the sandbox, it seems that the fetcher is never invoked:

I0405 17:27:09.908493     7 exec.cpp:134] Version: 0.27.2
I0405 17:27:09.931807    13 exec.cpp:208] Executor registered on slave 7c1a98fb-1292-417a-b44e-fcd2935d8a53-S2
Error: Unable to access jarfile /mnt/mesos/sandbox/slave.jar

Without the --docker_mesos_image option it looks like this (I have shortened the output, so it is not complete, if you are wondering):

I0405 17:31:11.068203 24698 fetcher.cpp:424] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/7c1a98fb-1292-417a-b44e-fcd2935d8a53-S2","items":[{"action":"BYPASS_CACHE","uri":{"executable":false,"extract":false,"value":"http:\/\/10.1.80.85:8080\/mesos\/createSlave\/mesos-6c51c183-5fae-4f11-8444-4a8f364c6fc3"}},{"action":"BYPASS_CACHE","uri":{"executable":false,"extract":false,"value":"http:\/\/10.1.80.85:8080\/jnlpJars\/slave.jar"}},{"action":"BYPASS_CACHE","uri":{"executable":false,"extract":false,"value":"\/home\/jenkins\/.dockercfg"}}],"sandbox_directory":"\/var\/lib\/mesos\/slave\/slaves\<longPathAhead>"}
I0405 17:31:11.080420 24698 fetcher.cpp:250] Fetching directly into the sandbox directory
I0405 17:31:11.080476 24698 fetcher.cpp:187] Fetching URI 
I0405 17:31:11.411428 24698 fetcher.cpp:456] Fetched 
I0405 17:31:11.411519 24698 fetcher.cpp:379] Fetching URI 'http://10.1.80.85:8080/jnlpJars/slave.jar'
I0405 17:31:11.411540 24698 fetcher.cpp:250] Fetching directly into the sandbox directory
I0405 17:31:11.411578 24698 fetcher.cpp:187] Fetching URI 'http://10.1.80.85:8080/jnlpJars/slave.jar'
I0405 17:31:11.455317 24698 fetcher.cpp:379] Fetching URI '/home/jenkins/.dockercfg'
I0405 17:31:11.455390 24698 fetcher.cpp:250] Fetching directly into the sandbox directory
I0405 17:31:11.455472 24698 fetcher.cpp:187] Fetching URI '/home/jenkins/.dockercfg'
I0405 17:31:11.455543 24698 fetcher.cpp:167] Copying resource with command:cp 
I0405 17:31:11.717672 24716 exec.cpp:134] Version: 0.27.2
I0405 17:31:11.766477 24723 exec.cpp:208] Executor registered on slave 7c1a98fb-1292-417a-b44e-fcd2935d8a53-S2
Picked up _JAVA_OPTIONS: -Djava.net.preferIPv4Stack=true
Apr 05, 2016 7:31:16 PM hudson.remoting.jnlp.Main createEngine

It looks like the executor containerization does not work as expected.

asridharan commented on August 28, 2024

From the comments, it seems that --docker_mesos_image does not solve this problem. At the very least we need to document this behavior/limitation. Adding the documentation label to start with.

ederst commented on August 28, 2024

@asridharan It seems to me that the skipped fetcher step is resolved by issue https://issues.apache.org/jira/browse/MESOS-4249 (version 0.28.0). I have not tested it, but it looks good.

Still, the only issue that seems to remain is that the Docker container running a Mesos agent needs a Docker executable, either installed in the corresponding Docker image (which could lead to incompatibilities between that executable and the Docker daemon behind the mounted-in socket) or mounted in from the host.

The latter would require a code change in docker.cpp (somewhere between those lines), looking something like this:

Volume* dockerBinVolume = newContainerInfo.add_volumes();
dockerBinVolume->set_host_path(flags.docker);
dockerBinVolume->set_container_path(flags.docker);
dockerBinVolume->set_mode(Volume::RO);

This would mount the Docker executable from the host into the Mesos agent container, using the path to the Docker executable that is set with --docker=....

Maybe I should open this issue at https://issues.apache.org since this seems to be the wrong place for this.

h0tbird commented on August 28, 2024

The Docker container running the Mesos executor needs not only the docker binary; other host libraries might be needed too, such as:

  --volume /usr/bin/docker:/usr/bin/docker:ro
  --volume /lib64/libdevmapper.so.1.02:/lib/libdevmapper.so.1.02:ro
  --volume /lib64/libsystemd.so.0:/lib/libsystemd.so.0:ro
  --volume /lib64/libgcrypt.so.20:/lib/libgcrypt.so.20:ro
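
As a rough sketch of how such a list could be managed, the snippet below turns a host-to-container mapping into read-only `--volume` arguments and reports host files that are missing before any container is started. The helper names are made up for illustration, and the exact library paths depend on the host distribution.

```python
import os

# Host path -> container path, copied from the list above.
EXECUTOR_MOUNTS = {
    "/usr/bin/docker": "/usr/bin/docker",
    "/lib64/libdevmapper.so.1.02": "/lib/libdevmapper.so.1.02",
    "/lib64/libsystemd.so.0": "/lib/libsystemd.so.0",
    "/lib64/libgcrypt.so.20": "/lib/libgcrypt.so.20",
}


def volume_flags(mounts):
    """Turn a host->container mapping into read-only `--volume` arguments."""
    flags = []
    for host_path, container_path in mounts.items():
        flags += ["--volume", f"{host_path}:{container_path}:ro"]
    return flags


def missing_on_host(mounts):
    """Return the host paths that do not exist, so a bad mount is caught early."""
    return [path for path in mounts if not os.path.exists(path)]


if __name__ == "__main__":
    print(" ".join(volume_flags(EXECUTOR_MOUNTS)))
    for path in missing_on_host(EXECUTOR_MOUNTS):
        print(f"warning: {path} not found on this host")
```

Checking the host paths first matters because Docker creates a missing bind-mount source as an empty directory, which tends to hide the real problem.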
