Comments (21)
FYI, I also tried updating deimos to 0.4.0, which was available by doing `pip install deimos --upgrade`. Nothing has changed with the above issue.
I might have allowed Ubuntu to update the kernel through a standard upgrade. Would I need to reinstall anything to do with Mesos after that?
from deimos.
Try with `--default_container_image=docker:///libmesos/ubuntu:13.10`.
I tried that, and it's still either ignoring that option or passing in `ubuntu:14.04` directly:
@rcsenkbeil I believe, though I may be wrong, that the default image in the deimos config file takes precedence over the one you configure on the mesos slave. Are you using the example config? https://github.com/mesosphere/deimos/blob/master/example.cfg#L24
Yeah, I also added the example config. Since there wasn't much documentation on each individual option, I mostly guessed at what to do.
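In case it helps anyone else reading: the default image is set in the `[containers.image]` section of the config. A minimal sketch (section and option names per the example.cfg linked above; the exact value here is just an illustration, not the actual file from this machine):

```ini
[containers.image]
ignore: false
default: docker:///ubuntu:13.10
```

Per the comment above, a `default` set here appears to win over whatever the slave is configured with, so this is the first place to check when the wrong image gets pulled.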
The config located at `/home/mesos/.deimos`:
The results of running `deimos config`:
Any chance you can dump your deimos logs here? I notice you have `console: DEBUG` enabled in your configuration.
Another thing I can think of is that mesos master/slave/marathon are all running as root, since I installed them all through deb packages rather than building from source. I guess I'd need to move the deimos config to the root level for it to apply to the running instances?
But, if that's the case, there shouldn't be any config affecting mesos-master and mesos-slave, right?
The results of running `deimos config` as root:
@tarnfeld Sorry, I literally installed and got everything running yesterday. Where would the log you want be found?
Looks like the console logging is written to the stderr file in the sandbox, can you paste the full output of that? You can get to that through the Mesos Web UI.
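If it's easier from a shell than the Web UI, the sandbox directories live under the slave work_dir, which defaults to `/tmp/mesos` (an assumption; adjust if you run `mesos-slave` with `--work_dir`). Something like:

```shell
# default slave work_dir; change this if mesos-slave runs with --work_dir
workdir=/tmp/mesos
# list every task sandbox's stderr file (prints nothing if none exist yet)
find "$workdir" -name stderr 2>/dev/null || true
```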
Ah, okay. Also, it's showing up in `/var/log/syslog`:
In terms of the stderr file, this is what I see:
Unable to find image 'libmesos/ubuntu:14.04' locally
Pulling repository libmesos/ubuntu
2014/07/10 11:21:57 Tag 14.04 not found in repository libmesos/ubuntu
A new update. I moved the `.deimos` file to `/root` since mesos was running under root. I deployed a new spark job and this is what I see:
Unable to find image 'ubuntu:13.10' locally
Pulling repository ubuntu
nix:///var/run/docker.sock run --sig-proxy --rm --cidfile /tmp/deimos/mesos/1d9f565c-43a9-4467-a105-2be15524a086/cid -w /tmp/mesos-sandbox -v /tmp/deimos/mesos/1d9f565c-43a9-4467-a105-2be15524a086/fs:/tmp/mesos-sandbox -c 1024 -m 2048m -e SPARK_USER=mesos -e SPARK_EXECUTOR_MEMORY=2048m -e MESOS_SLAVE_ID=20140710-110412-16842879-5050-1441-0 -e 'MESOS_SLAVE_PID=slave(1)@127.0.1.1:5051' -e MESOS_FRAMEWORK_ID=20140710-110412-16842879-5050-1441-0004 -e MESOS_EXECUTOR_ID=20140710-110412-16842879-5050-1441-0 -e MESOS_CHECKPOINT=0 -e MESOS_DIRECTORY=/tmp/mesos-sandbox ubuntu:13.10 sh -c 'cd spark-1*; ./sbin/spark-executor'
11:31:43.415 deimos.state.await_cid() Awaiting CID file: /tmp/deimos/mesos/1d9f565c-43a9-4467-a105-2be15524a086/cid
11:32:56.925 deimos.cli() deimos.state.CIDTimeout: No CID file after 60s
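For anyone puzzling over the `CIDTimeout`: the log above shows deimos starts `docker run` with `--cidfile` and then polls for that file (`await_cid`); if Docker never writes it within 60s (for example, a slow image pull on a starved VM), the launch is abandoned. A rough shell sketch of that wait, with a 3s timeout and a throwaway path purely for illustration:

```shell
cidfile=$(mktemp -u)   # stand-in for /tmp/deimos/mesos/<uuid>/cid; never created here
timeout=3              # deimos waits 60s
waited=0
while [ ! -s "$cidfile" ] && [ "$waited" -lt "$timeout" ]; do
  sleep 1
  waited=$((waited + 1))
done
if [ -s "$cidfile" ]; then
  echo "got CID: $(cat "$cidfile")"
else
  echo "No CID file after ${timeout}s"   # same shape as the error in the log above
fi
```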
I'm going to update the VM with more resources to see if it is just really sluggish.
And now I can launch Spark, but all of the tasks become lost. From one of the tasks, I see the following in stderr:
WARNING: Your kernel does not support swap limit capabilities. Limitation discarded.
JAVA_HOME is not set
11:37:15.791 deimos.containerizer.docker.launch() call // docker --host unix:///var/run/docker.sock run --sig-proxy --rm --cidfile /tmp/deimos/mesos/d93c6db7-42fa-4083-9c7d-d9e4cb10c4af/cid -w /tmp/mesos-sandbox -v /tmp/deimos/mesos/d93c6db7-42fa-4083-9c7d-d9e4cb10c4af/fs:/tmp/mesos-sandbox -c 1024 -m 2048m -e SPARK_USER=ibm -e SPARK_EXECUTOR_MEMORY=2048m -e MESOS_SLAVE_ID=20140710-113632-16842879-5050-1568-0 -e 'MESOS_SLAVE_PID=slave(1)@127.0.1.1:5051' -e MESOS_FRAMEWORK_ID=20140710-113632-16842879-5050-1568-0000 -e MESOS_EXECUTOR_ID=20140710-113632-16842879-5050-1568-0 -e MESOS_CHECKPOINT=0 -e MESOS_DIRECTORY=/tmp/mesos-sandbox ubuntu:13.10 sh -c 'cd spark-1*; ./sbin/spark-executor'
11:37:15.799 deimos.state.await_cid() Awaiting CID file: /tmp/deimos/mesos/d93c6db7-42fa-4083-9c7d-d9e4cb10c4af/cid
11:37:16.804 deimos.state.lock() request // wait EX (60s)
11:37:16.807 deimos.state.lock() success // wait EX (60s)
11:37:16.810 deimos.containerizer.docker.launch() eid = 20140710-113632-16842879-5050-1568-0
11:37:16.811 deimos.containerizer.docker.launch() mesos = d93c6db7-42fa-4083-9c7d-d9e4cb10c4af
11:37:16.810 deimos.containerizer.docker.launch() eid = 20140710-113632-16842879-5050-1568-0
11:37:16.811 deimos.containerizer.docker.launch() docker = a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515
11:37:16.811 deimos.containerizer.docker.launch() mesos = d93c6db7-42fa-4083-9c7d-d9e4cb10c4af
11:37:16.812 deimos.containerizer.docker.launch() docker = a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515
11:37:16.812 deimos.containerizer.docker.launch() Forking watcher into child...
11:37:16.812 deimos.containerizer.docker.launch() call // docker --host unix:///var/run/docker.sock wait a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515
11:37:16.813 deimos.usage.self() rss = 16.047M user = 0.158 sys = 0.074
11:37:16.813 deimos.usage.children() rss = 10.777M user = 1.991 sys = 2.107
11:37:16.858 deimos.containerizer.docker.launch() exit 1 // docker --host unix:///var/run/docker.sock wait a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515
11:37:16.859 deimos.containerizer.docker.launch() STDERR // Error: wait: no such container: a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515
2014/07/10 11:37:16 Error: failed to wait one or more containers
11:37:16.860 deimos.cli() Command '['docker', '--host', u'unix:///var/run/docker.sock', 'wait', 'a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515']' returned non-zero exit status 1
11:37:17.585 deimos.config.load_configuration() Loaded configuration from /root/.deimos
11:37:17.586 deimos.config.load_configuration() Found: deimos.config.DockerIndex(account_libmesos='libmesos', index='interior-node:2222', account=None, dockercfg=None)
11:37:17.586 deimos.config.load_configuration() Found: deimos.config.State(root='/tmp/deimos')
11:37:17.586 deimos.config.load_configuration() Found: deimos.config.Docker(host=[u'unix:///var/run/docker.sock'])
11:37:17.587 deimos.config.load_configuration() Found: deimos.config.Log(syslog=20, console=10)
11:37:17.587 deimos.config.load_configuration() Found: deimos.config.Containers(image=deimos.config.Image(ignore=False, default='docker:///ubuntu:13.10'), options=deimos.config.Options(ignore=False, default=[], append=[]))
11:37:17.588 deimos.usage.self() rss = 16.043M user = 0.121 sys = 0.045
11:37:17.588 deimos.usage.children() rss = 0.000M user = 0.000 sys = 0.000
11:37:17.589 deimos.containerizer.stdio() (<bound method Docker.wait of deimos.containerizer.docker.Docker(index_settings=deimos.config.DockerIndex(account_libmesos='libmesos', index='interior-node:2222', account=None, dockercfg=None), workdir='/tmp/mesos-sandbox', state_root='/tmp/deimos', runner=None, optimistic_unpack=True, state=None, shared_dir='fs', container_settings=deimos.config.Containers(image=deimos.config.Image(ignore=False, default='docker:///ubuntu:13.10'), options=deimos.config.Options(ignore=False, default=[], append=[])))>, <class 'containerizer_pb2.Wait'>)
11:37:17.589 deimos.proto.deserialize() container_id {
11:37:17.590 deimos.proto.deserialize() value: "d93c6db7-42fa-4083-9c7d-d9e4cb10c4af"
11:37:17.590 deimos.proto.deserialize() }
11:37:17.591 deimos.containerizer.docker.wait()
11:37:17.592 deimos.state.lock() request // launch SH (60s)
11:37:17.593 deimos.state.lock() success // launch SH (60s)
11:37:17.594 deimos.containerizer.docker.wait() eid = 20140710-113632-16842879-5050-1568-0
11:37:17.594 deimos.containerizer.docker.wait() mesos = d93c6db7-42fa-4083-9c7d-d9e4cb10c4af
11:37:17.595 deimos.containerizer.docker.wait() docker = a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515
11:37:17.595 deimos.state.lock() request // observe SH (indefinite)
11:37:17.596 deimos.state.lock() success // observe SH (indefinite)
11:37:17.597 deimos.state.lock() request // wait SH (indefinite)
11:37:17.598 deimos.state.lock() success // wait SH (indefinite)
11:37:17.598 deimos.proto.serialize() Termination.status = 16384
11:37:17.599 deimos.proto.serialize() Termination.message = ''
11:37:17.599 deimos.proto.serialize() Termination.killed = False
11:37:17.600 deimos.cli() deimos.err.Err: Wait lock is not held nor is exit file present
And the following in stdout:
Running spark-executor with framework dir = /tmp/mesos-sandbox/spark-1.0.0-bin-hadoop2
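One note on the `Termination.status = 16384` line above: that looks like a raw waitpid(2)-style status, so the actual process exit code is the high byte:

```shell
status=16384
echo "exit code: $((status >> 8))"   # prints 64; a status of 256 would decode to exit code 1
```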
[EDIT]
Below is the output from running spark-submit:
ibm@ibm-VirtualBox:/opt/spark-1.0.0-bin-hadoop2$ spark-submit --class org.apache.spark.examples.SparkPi --master mesos://127.0.1.1:5050 --num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 lib/spark-examples*.jar 10
14/07/10 11:41:52 INFO Utils: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/07/10 11:41:52 WARN Utils: Your hostname, ibm-VirtualBox resolves to a loopback address: 127.0.1.1; using 172.17.42.1 instead (on interface docker0)
14/07/10 11:41:52 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
14/07/10 11:41:52 INFO SecurityManager: Changing view acls to: ibm
14/07/10 11:41:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ibm)
14/07/10 11:41:53 INFO Slf4jLogger: Slf4jLogger started
14/07/10 11:41:53 INFO Remoting: Starting remoting
14/07/10 11:41:53 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:58372]
14/07/10 11:41:53 INFO Remoting: Remoting now listens on addresses: [akka.tcp://[email protected]:58372]
14/07/10 11:41:54 INFO SparkEnv: Registering MapOutputTracker
14/07/10 11:41:54 INFO SparkEnv: Registering BlockManagerMaster
14/07/10 11:41:54 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140710114154-7601
14/07/10 11:41:54 INFO MemoryStore: MemoryStore started with capacity 589.2 MB.
14/07/10 11:41:54 INFO ConnectionManager: Bound socket to port 54642 with id = ConnectionManagerId(172.17.42.1,54642)
14/07/10 11:41:54 INFO BlockManagerMaster: Trying to register BlockManager
14/07/10 11:41:54 INFO BlockManagerInfo: Registering block manager 172.17.42.1:54642 with 589.2 MB RAM
14/07/10 11:41:54 INFO BlockManagerMaster: Registered BlockManager
14/07/10 11:41:54 INFO HttpServer: Starting HTTP Server
14/07/10 11:41:54 INFO HttpBroadcast: Broadcast server started at http://172.17.42.1:40132
14/07/10 11:41:54 INFO HttpFileServer: HTTP File server directory is /tmp/spark-2b706d82-7321-47a9-a6ed-f884b5d4cfa0
14/07/10 11:41:54 INFO HttpServer: Starting HTTP Server
14/07/10 11:41:54 INFO SparkUI: Started SparkUI at http://172.17.42.1:4040
14/07/10 11:41:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/10 11:41:55 INFO EventLoggingListener: Logging events to /tmp/sparklogs/spark-pi-1405010515376
14/07/10 11:41:55 INFO SparkContext: Added JAR file:/opt/spark-1.0.0-bin-hadoop2/lib/spark-examples-1.0.0-hadoop2.2.0.jar at http://172.17.42.1:58841/jars/spark-examples-1.0.0-hadoop2.2.0.jar with timestamp 1405010515911
I0710 11:41:56.042367 3435 sched.cpp:126] Version: 0.19.0
I0710 11:41:56.044584 3439 sched.cpp:222] New master detected at [email protected]:5050
I0710 11:41:56.044920 3439 sched.cpp:230] No credentials provided. Attempting to register without authentication
I0710 11:41:56.047376 3439 sched.cpp:397] Framework registered with 20140710-113632-16842879-5050-1568-0002
14/07/10 11:41:56 INFO MesosSchedulerBackend: Registered as framework ID 20140710-113632-16842879-5050-1568-0002
14/07/10 11:41:56 INFO SparkContext: Starting job: reduce at SparkPi.scala:35
14/07/10 11:41:56 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:35) with 10 output partitions (allowLocal=false)
14/07/10 11:41:56 INFO DAGScheduler: Final stage: Stage 0(reduce at SparkPi.scala:35)
14/07/10 11:41:56 INFO DAGScheduler: Parents of final stage: List()
14/07/10 11:41:56 INFO DAGScheduler: Missing parents: List()
14/07/10 11:41:56 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at SparkPi.scala:31), which has no missing parents
14/07/10 11:41:56 INFO DAGScheduler: Submitting 10 missing tasks from Stage 0 (MappedRDD[1] at map at SparkPi.scala:31)
14/07/10 11:41:56 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
14/07/10 11:41:56 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:41:56 INFO TaskSetManager: Serialized task 0.0:0 as 1411 bytes in 5 ms
14/07/10 11:41:56 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:41:56 INFO TaskSetManager: Serialized task 0.0:1 as 1411 bytes in 0 ms
14/07/10 11:41:56 INFO TaskSetManager: Starting task 0.0:2 as TID 2 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:41:56 INFO TaskSetManager: Serialized task 0.0:2 as 1411 bytes in 0 ms
14/07/10 11:41:56 INFO TaskSetManager: Starting task 0.0:3 as TID 3 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:41:56 INFO TaskSetManager: Serialized task 0.0:3 as 1411 bytes in 0 ms
W0710 11:41:56.416663 3438 sched.cpp:901] Attempting to launch task 1 with an unknown offer 20140710-113632-16842879-5050-1568-66
W0710 11:41:56.416805 3438 sched.cpp:901] Attempting to launch task 2 with an unknown offer 20140710-113632-16842879-5050-1568-66
W0710 11:41:56.416862 3438 sched.cpp:901] Attempting to launch task 3 with an unknown offer 20140710-113632-16842879-5050-1568-66
14/07/10 11:42:01 INFO TaskSetManager: Re-queueing tasks for 20140710-113632-16842879-5050-1568-0 from TaskSet 0.0
14/07/10 11:42:01 WARN TaskSetManager: Lost TID 2 (task 0.0:2)
14/07/10 11:42:01 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
14/07/10 11:42:01 WARN TaskSetManager: Lost TID 3 (task 0.0:3)
14/07/10 11:42:01 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/07/10 11:42:01 INFO DAGScheduler: Executor lost: 20140710-113632-16842879-5050-1568-0 (epoch 0)
14/07/10 11:42:01 INFO BlockManagerMasterActor: Trying to remove executor 20140710-113632-16842879-5050-1568-0 from BlockManagerMaster.
14/07/10 11:42:01 INFO BlockManagerMaster: Removed 20140710-113632-16842879-5050-1568-0 successfully in removeExecutor
14/07/10 11:42:01 INFO TaskSetManager: Starting task 0.0:0 as TID 4 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:01 INFO TaskSetManager: Serialized task 0.0:0 as 1411 bytes in 0 ms
14/07/10 11:42:01 INFO TaskSetManager: Starting task 0.0:3 as TID 5 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:01 INFO TaskSetManager: Serialized task 0.0:3 as 1411 bytes in 0 ms
14/07/10 11:42:01 INFO DAGScheduler: Host added was in lost list earlier: ibm-VirtualBox
14/07/10 11:42:01 INFO TaskSetManager: Starting task 0.0:1 as TID 6 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:01 INFO TaskSetManager: Serialized task 0.0:1 as 1411 bytes in 1 ms
14/07/10 11:42:01 INFO TaskSetManager: Starting task 0.0:2 as TID 7 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:01 INFO TaskSetManager: Serialized task 0.0:2 as 1411 bytes in 1 ms
W0710 11:42:01.881386 3440 sched.cpp:901] Attempting to launch task 5 with an unknown offer 20140710-113632-16842879-5050-1568-67
W0710 11:42:01.881475 3440 sched.cpp:901] Attempting to launch task 6 with an unknown offer 20140710-113632-16842879-5050-1568-67
W0710 11:42:01.881731 3440 sched.cpp:901] Attempting to launch task 7 with an unknown offer 20140710-113632-16842879-5050-1568-67
14/07/10 11:42:07 INFO TaskSetManager: Re-queueing tasks for 20140710-113632-16842879-5050-1568-0 from TaskSet 0.0
14/07/10 11:42:07 WARN TaskSetManager: Lost TID 5 (task 0.0:3)
14/07/10 11:42:07 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
14/07/10 11:42:07 WARN TaskSetManager: Lost TID 7 (task 0.0:2)
14/07/10 11:42:07 WARN TaskSetManager: Lost TID 6 (task 0.0:1)
14/07/10 11:42:07 INFO DAGScheduler: Executor lost: 20140710-113632-16842879-5050-1568-0 (epoch 1)
14/07/10 11:42:07 INFO BlockManagerMasterActor: Trying to remove executor 20140710-113632-16842879-5050-1568-0 from BlockManagerMaster.
14/07/10 11:42:07 INFO BlockManagerMaster: Removed 20140710-113632-16842879-5050-1568-0 successfully in removeExecutor
14/07/10 11:42:07 INFO DAGScheduler: Host added was in lost list earlier: ibm-VirtualBox
14/07/10 11:42:07 INFO TaskSetManager: Starting task 0.0:1 as TID 8 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:07 INFO TaskSetManager: Serialized task 0.0:1 as 1411 bytes in 1 ms
14/07/10 11:42:07 INFO TaskSetManager: Starting task 0.0:2 as TID 9 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:07 INFO TaskSetManager: Serialized task 0.0:2 as 1411 bytes in 0 ms
14/07/10 11:42:07 INFO TaskSetManager: Starting task 0.0:0 as TID 10 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:07 INFO TaskSetManager: Serialized task 0.0:0 as 1411 bytes in 0 ms
14/07/10 11:42:07 INFO TaskSetManager: Starting task 0.0:3 as TID 11 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:07 INFO TaskSetManager: Serialized task 0.0:3 as 1411 bytes in 0 ms
W0710 11:42:07.820994 3437 sched.cpp:901] Attempting to launch task 9 with an unknown offer 20140710-113632-16842879-5050-1568-70
W0710 11:42:07.821722 3437 sched.cpp:901] Attempting to launch task 10 with an unknown offer 20140710-113632-16842879-5050-1568-70
W0710 11:42:07.821792 3437 sched.cpp:901] Attempting to launch task 11 with an unknown offer 20140710-113632-16842879-5050-1568-70
14/07/10 11:42:13 INFO TaskSetManager: Re-queueing tasks for 20140710-113632-16842879-5050-1568-0 from TaskSet 0.0
14/07/10 11:42:13 WARN TaskSetManager: Lost TID 8 (task 0.0:1)
14/07/10 11:42:13 WARN TaskSetManager: Lost TID 11 (task 0.0:3)
14/07/10 11:42:13 WARN TaskSetManager: Lost TID 10 (task 0.0:0)
14/07/10 11:42:13 WARN TaskSetManager: Lost TID 9 (task 0.0:2)
14/07/10 11:42:13 INFO DAGScheduler: Executor lost: 20140710-113632-16842879-5050-1568-0 (epoch 2)
14/07/10 11:42:13 INFO BlockManagerMasterActor: Trying to remove executor 20140710-113632-16842879-5050-1568-0 from BlockManagerMaster.
14/07/10 11:42:13 INFO BlockManagerMaster: Removed 20140710-113632-16842879-5050-1568-0 successfully in removeExecutor
14/07/10 11:42:13 INFO DAGScheduler: Host added was in lost list earlier: ibm-VirtualBox
14/07/10 11:42:13 INFO TaskSetManager: Starting task 0.0:2 as TID 12 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:13 INFO TaskSetManager: Serialized task 0.0:2 as 1411 bytes in 1 ms
14/07/10 11:42:13 INFO TaskSetManager: Starting task 0.0:0 as TID 13 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:13 INFO TaskSetManager: Serialized task 0.0:0 as 1411 bytes in 0 ms
14/07/10 11:42:13 INFO TaskSetManager: Re-queueing tasks for 20140710-113632-16842879-5050-1568-0 from TaskSet 0.0
14/07/10 11:42:13 WARN TaskSetManager: Lost TID 13 (task 0.0:0)
14/07/10 11:42:13 ERROR TaskSetManager: Task 0.0:0 failed 4 times; aborting job
14/07/10 11:42:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
W0710 11:42:13.820405 3438 sched.cpp:901] Attempting to launch task 13 with an unknown offer 20140710-113632-16842879-5050-1568-72
14/07/10 11:42:13 INFO DAGScheduler: Failed to run reduce at SparkPi.scala:35
Exception in thread "main" 14/07/10 11:42:13 INFO TaskSchedulerImpl: Cancelling stage 0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: TID 13 on host ibm-VirtualBox failed for unknown reason
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/07/10 11:42:13 INFO DAGScheduler: Executor lost: 20140710-113632-16842879-5050-1568-0 (epoch 3)
14/07/10 11:42:13 INFO BlockManagerMasterActor: Trying to remove executor 20140710-113632-16842879-5050-1568-0 from BlockManagerMaster.
14/07/10 11:42:13 INFO BlockManagerMaster: Removed 20140710-113632-16842879-5050-1568-0 successfully in removeExecutor
14/07/10 11:42:13 INFO DAGScheduler: Host added was in lost list earlier: ibm-VirtualBox
Ok, progress :-) So it looks like the docker container is failing to launch (by that I mean the executor is not launching successfully inside it). You can see the container ID as `d93c6db7-42fa-4083-9c7d-d9e4cb10c4af`. Try running `docker logs d93c6db7-42fa-4083-9c7d-d9e4cb10c4af` and see what comes up; you might see some spark errors that help you debug.
Weird, it claims the container doesn't exist. There isn't any sort of cleanup of containers, is there?
ibm@ibm-VirtualBox:/opt/spark-1.0.0-bin-hadoop2$ sudo docker logs d93c6db7-42fa-4083-9c7d-d9e4cb10c4af
2014/07/10 11:48:41 Error: No such container: d93c6db7-42fa-4083-9c7d-d9e4cb10c4af
I tried it with a short segment, `sudo docker logs d93c6`, with the same results.
[EDIT]
In fact, there don't seem to be any previously-run containers:
So, I tried updating the deimos config that is being used, replacing `docker:///ubuntu:13.10` with `docker:///libmesos/ubuntu:13.10`, to see what would happen. It looks like the mesos library couldn't be found? I thought libmesos/ubuntu was built to contain that library?
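As a sanity check on the image naming: judging by the launch commands in the logs, the `docker:///` prefix is simply stripped to get the name Docker sees, so the two values map to `ubuntu:13.10` and `libmesos/ubuntu:13.10` respectively:

```shell
image='docker:///libmesos/ubuntu:13.10'
echo "${image#docker:///}"   # prints libmesos/ubuntu:13.10
```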
This is the new error log:
WARNING: Your kernel does not support swap limit capabilities. Limitation discarded.
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14/07/10 16:55:11 INFO SparkHadoopUtil: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Failed to load native Mesos library from /usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
Exception in thread "main" java.lang.UnsatisfiedLinkError: no mesos in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)
at java.lang.Runtime.loadLibrary0(Runtime.java:849)
at java.lang.System.loadLibrary(System.java:1088)
at org.apache.mesos.MesosNativeLibrary.load(MesosNativeLibrary.java:52)
at org.apache.mesos.MesosNativeLibrary.load(MesosNativeLibrary.java:64)
at org.apache.spark.executor.MesosExecutorBackend$$anonfun$main$1.apply$mcV$sp(MesosExecutorBackend.scala:99)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
at org.apache.spark.executor.MesosExecutorBackend$.main(MesosExecutorBackend.scala:98)
at org.apache.spark.executor.MesosExecutorBackend.main(MesosExecutorBackend.scala)
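The directories in the `Failed to load native Mesos library from ...` line are the JVM's `java.library.path`; for the executor to start, `libmesos.so` has to be present in one of them (or pointed at via the `MESOS_NATIVE_LIBRARY` environment variable). A self-contained sketch of that lookup, using a temp dir in place of a real install location:

```shell
# simulate a java.library.path search: drop a fake libmesos.so into a temp dir
tmp=$(mktemp -d)
touch "$tmp/libmesos.so"
libpath="/usr/lib/jni:$tmp:/usr/lib"   # colon-separated, like the list in the log
found=
IFS=':'
for dir in $libpath; do
  [ -e "$dir/libmesos.so" ] && found="$dir"
done
unset IFS
echo "libmesos.so found in: $found"
rm -rf "$tmp"
```

Running the equivalent loop inside the container (with the path list from the log) would confirm whether the image actually ships the library.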
7 deimos.containerizer.docker.launch() call // docker --host unix:///var/run/docker.sock wait 71de134814fe64d5094321ce54fca132d1fe14e9cd7163ed5461ea6708c9a3cd
11:55:10.148 deimos.usage.self() rss = 16.047M user = 0.157 sys = 0.072
11:55:10.148 deimos.usage.children() rss = 10.777M user = 1.982 sys = 1.986
11:55:11.090 deimos.config.load_configuration() Loaded configuration from /root/.deimos
11:55:11.091 deimos.config.load_configuration() Found: deimos.config.DockerIndex(account_libmesos='libmesos', index='interior-node:2222', account=None, dockercfg=None)
11:55:11.091 deimos.config.load_configuration() Found: deimos.config.State(root='/tmp/deimos')
11:55:11.092 deimos.config.load_configuration() Found: deimos.config.Docker(host=[u'unix:///var/run/docker.sock'])
11:55:11.092 deimos.config.load_configuration() Found: deimos.config.Log(syslog=20, console=10)
11:55:11.092 deimos.config.load_configuration() Found: deimos.config.Containers(image=deimos.config.Image(ignore=False, default='docker:///libmesos/ubuntu:13.10'), options=deimos.config.Options(ignore=False, default=[], append=[]))
11:55:11.093 deimos.usage.self() rss = 16.047M user = 0.115 sys = 0.045
11:55:11.094 deimos.usage.children() rss = 0.000M user = 0.000 sys = 0.000
11:55:11.094 deimos.containerizer.stdio() (<bound method Docker.wait of deimos.containerizer.docker.Docker(index_settings=deimos.config.DockerIndex(account_libmesos='libmesos', index='interior-node:2222', account=None, dockercfg=None), workdir='/tmp/mesos-sandbox', state_root='/tmp/deimos', runner=None, optimistic_unpack=True, state=None, shared_dir='fs', container_settings=deimos.config.Containers(image=deimos.config.Image(ignore=False, default='docker:///libmesos/ubuntu:13.10'), options=deimos.config.Options(ignore=False, default=[], append=[])))>, <class 'containerizer_pb2.Wait'>)
11:55:11.095 deimos.proto.deserialize() container_id {
11:55:11.095 deimos.proto.deserialize() value: "8c2312fb-3fec-4ac0-941a-db0f587e01d8"
11:55:11.096 deimos.proto.deserialize() }
11:55:11.096 deimos.containerizer.docker.wait()
11:55:11.098 deimos.state.lock() request // launch SH (60s)
11:55:11.098 deimos.state.lock() success // launch SH (60s)
11:55:11.099 deimos.containerizer.docker.wait() eid = 20140710-113632-16842879-5050-1568-0
11:55:11.099 deimos.containerizer.docker.wait() mesos = 8c2312fb-3fec-4ac0-941a-db0f587e01d8
11:55:11.100 deimos.containerizer.docker.wait() docker = 71de134814fe64d5094321ce54fca132d1fe14e9cd7163ed5461ea6708c9a3cd
11:55:11.100 deimos.state.lock() request // observe SH (indefinite)
11:55:11.101 deimos.state.lock() success // observe SH (indefinite)
11:55:11.102 deimos.state.lock() request // wait SH (indefinite)
11:55:11.375 deimos.containerizer.docker.launch() exit 0 // docker --host unix:///var/run/docker.sock wait 71de134814fe64d5094321ce54fca132d1fe14e9cd7163ed5461ea6708c9a3cd
11:55:11.376 deimos.containerizer.docker.launch() STDOUT // 1
11:55:11.377 deimos.state.lock() success // wait SH (indefinite)
11:55:11.378 deimos.proto.serialize() Termination.status = 256
11:55:11.378 deimos.proto.serialize() Termination.message = ''
11:55:11.379 deimos.proto.serialize() Termination.killed = False
11:55:11.380 deimos.usage.self() rss = 16.047M user = 0.124 sys = 0.049
11:55:11.380 deimos.usage.children() rss = 0.000M user = 0.000 sys = 0.000
11:55:11.383 deimos.containerizer.docker.launch() exit 0 // docker --host unix:///var/run/docker.sock run --sig-proxy --rm --cidfile /tmp/deimos/mesos/8c2312fb-3fec-4ac0-941a-db0f587e01d8/cid -w /tmp/mesos-sandbox -v /tmp/deimos/mesos/8c2312fb-3fec-4ac0-941a-db0f587e01d8/fs:/tmp/mesos-sandbox -c 1024 -m 512m -e SPARK_USER=ibm -e SPARK_EXECUTOR_MEMORY=512m -e MESOS_SLAVE_ID=20140710-113632-16842879-5050-1568-0 -e 'MESOS_SLAVE_PID=slave(1)@127.0.1.1:5051' -e MESOS_FRAMEWORK_ID=20140710-113632-16842879-5050-1568-0005 -e MESOS_EXECUTOR_ID=20140710-113632-16842879-5050-1568-0 -e MESOS_CHECKPOINT=0 -e MESOS_DIRECTORY=/tmp/mesos-sandbox libmesos/ubuntu:13.10 sh -c 'cd spark-1*; ./sbin/spark-executor'
11:55:11.385 deimos.usage.self() rss = 11.770M user = 0.007 sys = 0.006
11:55:11.385 deimos.usage.children() rss = 10.777M user = 0.018 sys = 0.039
To reiterate, the previous stderr from using `docker:///ubuntu:13.10` was the following:
WARNING: Your kernel does not support swap limit capabilities. Limitation discarded.
JAVA_HOME is not set
11:57:33.447 deimos.containerizer.docker.launch() call // docker --host unix:///var/run/docker.sock run --sig-proxy --rm --cidfile /tmp/deimos/mesos/4ff94f11-2671-48e2-bfbf-6b077eb60bf8/cid -w /tmp/mesos-sandbox -v /tmp/deimos/mesos/4ff94f11-2671-48e2-bfbf-6b077eb60bf8/fs:/tmp/mesos-sandbox -c 1024 -m 512m -e SPARK_USER=ibm -e SPARK_EXECUTOR_MEMORY=512m -e MESOS_SLAVE_ID=20140710-113632-16842879-5050-1568-0 -e 'MESOS_SLAVE_PID=slave(1)@127.0.1.1:5051' -e MESOS_FRAMEWORK_ID=20140710-113632-16842879-5050-1568-0006 -e MESOS_EXECUTOR_ID=20140710-113632-16842879-5050-1568-0 -e MESOS_CHECKPOINT=0 -e MESOS_DIRECTORY=/tmp/mesos-sandbox ubuntu:13.10 sh -c 'cd spark-1*; ./sbin/spark-executor'
11:57:33.457 deimos.state.await_cid() Awaiting CID file: /tmp/deimos/mesos/4ff94f11-2671-48e2-bfbf-6b077eb60bf8/cid
11:57:34.461 deimos.state.lock() request // wait EX (60s)
11:57:34.462 deimos.state.lock() success // wait EX (60s)
11:57:34.467 deimos.containerizer.docker.launch() eid = 20140710-113632-16842879-5050-1568-0
11:57:34.468 deimos.containerizer.docker.launch() mesos = 4ff94f11-2671-48e2-bfbf-6b077eb60bf8
11:57:34.467 deimos.containerizer.docker.launch() eid = 20140710-113632-16842879-5050-1568-0
11:57:34.469 deimos.containerizer.docker.launch() docker = d7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471
11:57:34.470 deimos.containerizer.docker.launch() mesos = 4ff94f11-2671-48e2-bfbf-6b077eb60bf8
11:57:34.470 deimos.containerizer.docker.launch() docker = d7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471
11:57:34.470 deimos.containerizer.docker.launch() Forking watcher into child...
11:57:34.472 deimos.containerizer.docker.launch() call // docker --host unix:///var/run/docker.sock wait d7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471
11:57:34.473 deimos.usage.self() rss = 16.047M user = 0.145 sys = 0.064
11:57:34.473 deimos.usage.children() rss = 10.777M user = 1.968 sys = 1.972
11:57:34.540 deimos.containerizer.docker.launch() exit 1 // docker --host unix:///var/run/docker.sock wait d7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471
11:57:34.541 deimos.containerizer.docker.launch() STDERR // Error: wait: no such container: d7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471
2014/07/10 11:57:34 Error: failed to wait one or more containers
11:57:34.542 deimos.cli() Command '['docker', '--host', u'unix:///var/run/docker.sock', 'wait', 'd7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471']' returned non-zero exit status 1
11:57:35.265 deimos.config.load_configuration() Loaded configuration from /root/.deimos
11:57:35.265 deimos.config.load_configuration() Found: deimos.config.DockerIndex(account_libmesos='libmesos', index='interior-node:2222', account=None, dockercfg=None)
11:57:35.266 deimos.config.load_configuration() Found: deimos.config.State(root='/tmp/deimos')
11:57:35.266 deimos.config.load_configuration() Found: deimos.config.Docker(host=[u'unix:///var/run/docker.sock'])
11:57:35.266 deimos.config.load_configuration() Found: deimos.config.Log(syslog=20, console=10)
11:57:35.267 deimos.config.load_configuration() Found: deimos.config.Containers(image=deimos.config.Image(ignore=False, default='docker:///ubuntu:13.10'), options=deimos.config.Options(ignore=False, default=[], append=[]))
11:57:35.267 deimos.usage.self() rss = 16.047M user = 0.104 sys = 0.039
11:57:35.268 deimos.usage.children() rss = 0.000M user = 0.000 sys = 0.000
11:57:35.268 deimos.containerizer.stdio() (<bound method Docker.wait of deimos.containerizer.docker.Docker(index_settings=deimos.config.DockerIndex(account_libmesos='libmesos', index='interior-node:2222', account=None, dockercfg=None), workdir='/tmp/mesos-sandbox', state_root='/tmp/deimos', runner=None, optimistic_unpack=True, state=None, shared_dir='fs', container_settings=deimos.config.Containers(image=deimos.config.Image(ignore=False, default='docker:///ubuntu:13.10'), options=deimos.config.Options(ignore=False, default=[], append=[])))>, <class 'containerizer_pb2.Wait'>)
11:57:35.269 deimos.proto.deserialize() container_id {
11:57:35.269 deimos.proto.deserialize() value: "4ff94f11-2671-48e2-bfbf-6b077eb60bf8"
11:57:35.270 deimos.proto.deserialize() }
11:57:35.270 deimos.containerizer.docker.wait()
11:57:35.271 deimos.state.lock() request // launch SH (60s)
11:57:35.272 deimos.state.lock() success // launch SH (60s)
11:57:35.272 deimos.containerizer.docker.wait() eid = 20140710-113632-16842879-5050-1568-0
11:57:35.273 deimos.containerizer.docker.wait() mesos = 4ff94f11-2671-48e2-bfbf-6b077eb60bf8
11:57:35.273 deimos.containerizer.docker.wait() docker = d7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471
11:57:35.274 deimos.state.lock() request // observe SH (indefinite)
11:57:35.274 deimos.state.lock() success // observe SH (indefinite)
11:57:35.275 deimos.state.lock() request // wait SH (indefinite)
11:57:35.276 deimos.state.lock() success // wait SH (indefinite)
11:57:35.276 deimos.proto.serialize() Termination.status = 16384
11:57:35.277 deimos.proto.serialize() Termination.message = ''
11:57:35.277 deimos.proto.serialize() Termination.killed = False
11:57:35.278 deimos.cli() deimos.err.Err: Wait lock is not held nor is exit file present
from deimos.
Another interesting thing to note is that the containers only show up for a split second when listing all containers (including stopped ones). So I'm guessing they are being removed rather than being allowed to persist?

`watch -n 0.1 sudo docker ps -a`
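That flicker fits what the log above shows: Deimos launches with `docker run --rm`, so an exited container is auto-removed, and the later `docker wait` then fails with "no such container". A toy illustration of the CID-file side of that race (pure Python, nothing Docker-specific; the function name mimics Deimos's `await_cid` log line but is otherwise made up):

```python
import os
import time

def await_cid(path, timeout=1.0, poll=0.05):
    """Poll for a container-ID file, returning its contents or None on timeout.

    Even once the CID is read, the container it names may already have
    exited and been removed (the `docker run --rm` race seen in the logs),
    so a subsequent `docker wait <cid>` can fail with "no such container".
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(path) and os.path.getsize(path) > 0:
            with open(path) as f:
                return f.read().strip()
        time.sleep(poll)
    return None
```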
[EDIT]
Also, I want to confirm that deploying Docker containers through Marathon works fine. Maybe Spark itself cannot run in Docker with the way that Deimos handles it?
[EDIT2]
Finally, going back to not using Deimos (and in turn not using Docker as the containerizer), I can run the Spark framework without issues and deploy tasks through Marathon, including a command that runs a Docker container. I don't know how well Docker plays inside Mesos' default container system, but I can at least run Docker through the normal setup.
For me, I'd love the ability to have Marathon use Deimos while the Spark framework uses the default containerizer, which I see is being worked on. That would let me deploy standard Docker containers through a nice RESTful API.
from deimos.
I believe that `libmesos` gets placed in `/usr/local/lib` by default; or at least, that used to be the case. (The `libmesos/ubuntu` container was built against an RC for Mesos 0.19.0.) Maybe `java.library.path` does not include `/usr/local` when Spark is run?
You could change `account_libmesos` to point to a different Docker account and create a new container in that account, tagged as `ubuntu:14.04`. Then you'd have control over the Java version as well as the other supporting libraries.
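A minimal sketch of what such a custom image might look like (everything here is an assumption, not from this thread: the base image, the JRE package, and installing the `mesos` package, which presupposes the Mesosphere apt repository is configured in the image):

```dockerfile
# Hypothetical Dockerfile for a custom executor image.
# Assumes the Mesosphere apt repo is available so `apt-get install mesos`
# places libmesos.so where the JVM's loader can find it.
FROM ubuntu:14.04
RUN apt-get update && \
    apt-get install -y openjdk-7-jre-headless mesos
# If libmesos ends up under /usr/local/lib instead of /usr/lib, either
# symlink it into /usr/lib or make sure java.library.path covers it.
```

You would then build and push it under your own account (e.g. a hypothetical `myaccount/ubuntu:14.04`) and point `account_libmesos` in deimos.cfg at that account.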
from deimos.
Thanks for the post. I went through the same process; creating a new Docker container with `libmesos` in `/usr/lib` solved the `UnsatisfiedLinkError`.
from deimos.
Yes. I also managed to get Spark working with Deimos; the only thing I couldn't figure out was how to pass parameters when submitting a job (overriding deimos.cfg), e.g. which image should be used. (I only ever ran with the same image.)
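For what it's worth, the image-precedence question from earlier in this thread (a per-task image vs. the deimos.cfg default vs. the slave's `--default_container_image`) can be sketched roughly like this. The function and argument names are hypothetical, not Deimos's actual API, and the ordering reflects the belief stated above that the deimos.cfg default takes precedence over the slave flag:

```python
def effective_image(task_image, deimos_default, slave_default):
    """Pick the Docker image for a task (hypothetical sketch, not real
    Deimos code). An image named in the task itself wins, then the
    `default` under [containers.image] in deimos.cfg, and only then the
    slave's --default_container_image.
    """
    return task_image or deimos_default or slave_default
```

That ordering would explain why the slave flag appears to be ignored whenever deimos.cfg sets a default image.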
from deimos.
@tdna That's currently not supported by Spark. It's something @tnachen or I will be adding over the coming few days. Feel free to contribute, too!
from deimos.
That's cool, thanks!
from deimos.