
Comments (21)

chipsenkbeil avatar chipsenkbeil commented on July 21, 2024

FYI, I also tried updating deimos to 0.4.0, which was available via `pip install deimos --upgrade`. Nothing has changed with the above issue.

I might have allowed Ubuntu to update the kernel through a standard upgrade. Would I need to reinstall anything to do with Mesos after that?


tarnfeld avatar tarnfeld commented on July 21, 2024

Try with --default_container_image=docker:///libmesos/ubuntu:13.10.
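
If you installed the Mesosphere debs, I believe the slave reads its flags from one file per flag under /etc/mesos-slave. Assuming that packaging convention holds on your box, a rough sketch:

echo 'docker:///libmesos/ubuntu:13.10' | sudo tee /etc/mesos-slave/default_container_image
sudo service mesos-slave restart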


chipsenkbeil avatar chipsenkbeil commented on July 21, 2024

I tried that, and it is still either ignoring that option or passing in ubuntu:14.04 directly:

[screenshot: mesos2]


tarnfeld avatar tarnfeld commented on July 21, 2024

@rcsenkbeil I believe (though I may be wrong) that the default image in the deimos config file takes precedence over the one you configure on the mesos slave. Are you using the example config? https://github.com/mesosphere/deimos/blob/master/example.cfg#L24.
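
For reference, the part of the config that matters here is the [containers.image] section. A minimal sketch, assuming the same section and key names as example.cfg, written for whatever user the slave runs as:

cat > ~/.deimos <<'EOF'
[containers.image]
default: docker:///libmesos/ubuntu:13.10
ignore: false
EOF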


chipsenkbeil avatar chipsenkbeil commented on July 21, 2024

Yeah, I also added the example config. Since there wasn't much documentation on each individual option, I just guessed at what to do.

The config located at /home/mesos/.deimos:

[screenshot: dot_deimos]

The results of running deimos config:

[screenshot: deimos_config]


tarnfeld avatar tarnfeld commented on July 21, 2024

Any chance you can dump your deimos logs here? I notice you have console: DEBUG enabled in your configuration.


chipsenkbeil avatar chipsenkbeil commented on July 21, 2024

Another thing I can think of is that mesos master/slave/marathon are all running as root since I installed them all through deb packages rather than building from source. I guess I'd need to move the deimos config to the root level for it to apply to the running instances?

But, if that's the case, there shouldn't be any config affecting mesos-master and mesos-slave, right?

The results of running deimos config as root:

[screenshot: deimos_config_root]
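
If the config does need to be visible to root, I'm assuming something as simple as this would do it (a sketch only; I haven't checked which paths deimos searches besides the home directory):

sudo cp /home/mesos/.deimos /root/.deimos
sudo -H deimos config   # -H so it reads /root/.deimos rather than my own config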


chipsenkbeil avatar chipsenkbeil commented on July 21, 2024

@tarnfeld Sorry, I literally installed and got everything running yesterday. Where would the log you want be found?


tarnfeld avatar tarnfeld commented on July 21, 2024

Looks like the console logging is written to the stderr file in the sandbox; can you paste the full output of that? You can get to it through the Mesos Web UI.
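
If the web UI is awkward, the sandbox also lives on disk under the slave's work_dir (assuming the default of /tmp/mesos here), so something like this should turn up the recent executor stderr files:

sudo find /tmp/mesos -name stderr -mmin -60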


chipsenkbeil avatar chipsenkbeil commented on July 21, 2024

Ah, okay. Also, it's showing up in /var/log/syslog:

[screenshot: syslog_snapshot]

In terms of the stderr file, this is what I see:

[screenshot: framework_for_sparkpi]
stderr:

Unable to find image 'libmesos/ubuntu:14.04' locally
Pulling repository libmesos/ubuntu
2014/07/10 11:21:57 Tag 14.04 not found in repository libmesos/ubuntu
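
So it's asking Docker for a libmesos/ubuntu:14.04 tag that apparently doesn't exist. As a sanity check, I can at least pull the tag that should exist by hand:

sudo docker pull libmesos/ubuntu:13.10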


chipsenkbeil avatar chipsenkbeil commented on July 21, 2024

A new update. I moved the .deimos file to /root since mesos was running under root. I deployed a new spark job and this is what I see:

Unable to find image 'ubuntu:13.10' locally
Pulling repository ubuntu



nix:///var/run/docker.sock run --sig-proxy --rm --cidfile /tmp/deimos/mesos/1d9f565c-43a9-4467-a105-2be15524a086/cid -w /tmp/mesos-sandbox -v /tmp/deimos/mesos/1d9f565c-43a9-4467-a105-2be15524a086/fs:/tmp/mesos-sandbox -c 1024 -m 2048m -e SPARK_USER=mesos -e SPARK_EXECUTOR_MEMORY=2048m -e MESOS_SLAVE_ID=20140710-110412-16842879-5050-1441-0 -e 'MESOS_SLAVE_PID=slave(1)@127.0.1.1:5051' -e MESOS_FRAMEWORK_ID=20140710-110412-16842879-5050-1441-0004 -e MESOS_EXECUTOR_ID=20140710-110412-16842879-5050-1441-0 -e MESOS_CHECKPOINT=0 -e MESOS_DIRECTORY=/tmp/mesos-sandbox ubuntu:13.10 sh -c 'cd spark-1*; ./sbin/spark-executor'
11:31:43.415 deimos.state.await_cid() Awaiting CID file: /tmp/deimos/mesos/1d9f565c-43a9-4467-a105-2be15524a086/cid
11:32:56.925 deimos.cli() deimos.state.CIDTimeout: No CID file after 60s

I'm going to update the VM with more resources to see if it is just really sluggish.
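
In the meantime, pre-pulling the image on the slave should take the download out of that 60-second CID window (just a guess that the pull is what's eating the time):

sudo docker pull ubuntu:13.10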


chipsenkbeil avatar chipsenkbeil commented on July 21, 2024

And now I can launch Spark, but all of the tasks become lost. From one of the tasks, I see the following in stderr:



WARNING: Your kernel does not support swap limit capabilities. Limitation discarded.
JAVA_HOME is not set
00
11:37:15.791 deimos.containerizer.docker.launch() call // docker --host unix:///var/run/docker.sock run --sig-proxy --rm --cidfile /tmp/deimos/mesos/d93c6db7-42fa-4083-9c7d-d9e4cb10c4af/cid -w /tmp/mesos-sandbox -v /tmp/deimos/mesos/d93c6db7-42fa-4083-9c7d-d9e4cb10c4af/fs:/tmp/mesos-sandbox -c 1024 -m 2048m -e SPARK_USER=ibm -e SPARK_EXECUTOR_MEMORY=2048m -e MESOS_SLAVE_ID=20140710-113632-16842879-5050-1568-0 -e 'MESOS_SLAVE_PID=slave(1)@127.0.1.1:5051' -e MESOS_FRAMEWORK_ID=20140710-113632-16842879-5050-1568-0000 -e MESOS_EXECUTOR_ID=20140710-113632-16842879-5050-1568-0 -e MESOS_CHECKPOINT=0 -e MESOS_DIRECTORY=/tmp/mesos-sandbox ubuntu:13.10 sh -c 'cd spark-1*; ./sbin/spark-executor'
11:37:15.799 deimos.state.await_cid() Awaiting CID file: /tmp/deimos/mesos/d93c6db7-42fa-4083-9c7d-d9e4cb10c4af/cid
11:37:16.804 deimos.state.lock() request // wait EX (60s)
11:37:16.807 deimos.state.lock() success // wait EX (60s)
11:37:16.810 deimos.containerizer.docker.launch() eid    = 20140710-113632-16842879-5050-1568-0
11:37:16.811 deimos.containerizer.docker.launch() mesos  = d93c6db7-42fa-4083-9c7d-d9e4cb10c4af
11:37:16.810 deimos.containerizer.docker.launch() eid    = 20140710-113632-16842879-5050-1568-0
11:37:16.811 deimos.containerizer.docker.launch() docker = a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515
11:37:16.811 deimos.containerizer.docker.launch() mesos  = d93c6db7-42fa-4083-9c7d-d9e4cb10c4af
11:37:16.812 deimos.containerizer.docker.launch() docker = a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515
11:37:16.812 deimos.containerizer.docker.launch() Forking watcher into child...
11:37:16.812 deimos.containerizer.docker.launch() call // docker --host unix:///var/run/docker.sock wait a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515
11:37:16.813 deimos.usage.self() rss = 16.047M  user = 0.158  sys = 0.074
11:37:16.813 deimos.usage.children() rss = 10.777M  user = 1.991  sys = 2.107
11:37:16.858 deimos.containerizer.docker.launch() exit 1 // docker --host unix:///var/run/docker.sock wait a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515
11:37:16.859 deimos.containerizer.docker.launch() STDERR // Error: wait: no such container: a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515
2014/07/10 11:37:16 Error: failed to wait one or more containers

11:37:16.860 deimos.cli() Command '['docker', '--host', u'unix:///var/run/docker.sock', 'wait', 'a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515']' returned non-zero exit status 1
11:37:17.585 deimos.config.load_configuration() Loaded configuration from /root/.deimos
11:37:17.586 deimos.config.load_configuration() Found: deimos.config.DockerIndex(account_libmesos='libmesos', index='interior-node:2222', account=None, dockercfg=None)
11:37:17.586 deimos.config.load_configuration() Found: deimos.config.State(root='/tmp/deimos')
11:37:17.586 deimos.config.load_configuration() Found: deimos.config.Docker(host=[u'unix:///var/run/docker.sock'])
11:37:17.587 deimos.config.load_configuration() Found: deimos.config.Log(syslog=20, console=10)
11:37:17.587 deimos.config.load_configuration() Found: deimos.config.Containers(image=deimos.config.Image(ignore=False, default='docker:///ubuntu:13.10'), options=deimos.config.Options(ignore=False, default=[], append=[]))
11:37:17.588 deimos.usage.self() rss = 16.043M  user = 0.121  sys = 0.045
11:37:17.588 deimos.usage.children() rss = 0.000M  user = 0.000  sys = 0.000
11:37:17.589 deimos.containerizer.stdio() (<bound method Docker.wait of deimos.containerizer.docker.Docker(index_settings=deimos.config.DockerIndex(account_libmesos='libmesos', index='interior-node:2222', account=None, dockercfg=None), workdir='/tmp/mesos-sandbox', state_root='/tmp/deimos', runner=None, optimistic_unpack=True, state=None, shared_dir='fs', container_settings=deimos.config.Containers(image=deimos.config.Image(ignore=False, default='docker:///ubuntu:13.10'), options=deimos.config.Options(ignore=False, default=[], append=[])))>, <class 'containerizer_pb2.Wait'>)
11:37:17.589 deimos.proto.deserialize() container_id {
11:37:17.590 deimos.proto.deserialize()   value: "d93c6db7-42fa-4083-9c7d-d9e4cb10c4af"
11:37:17.590 deimos.proto.deserialize() }
11:37:17.591 deimos.containerizer.docker.wait() 
11:37:17.592 deimos.state.lock() request // launch SH (60s)
11:37:17.593 deimos.state.lock() success // launch SH (60s)
11:37:17.594 deimos.containerizer.docker.wait() eid    = 20140710-113632-16842879-5050-1568-0
11:37:17.594 deimos.containerizer.docker.wait() mesos  = d93c6db7-42fa-4083-9c7d-d9e4cb10c4af
11:37:17.595 deimos.containerizer.docker.wait() docker = a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515
11:37:17.595 deimos.state.lock() request // observe SH (indefinite)
11:37:17.596 deimos.state.lock() success // observe SH (indefinite)
11:37:17.597 deimos.state.lock() request // wait SH (indefinite)
11:37:17.598 deimos.state.lock() success // wait SH (indefinite)
11:37:17.598 deimos.proto.serialize() Termination.status = 16384
11:37:17.599 deimos.proto.serialize() Termination.message = ''
11:37:17.599 deimos.proto.serialize() Termination.killed = False
11:37:17.600 deimos.cli() deimos.err.Err: Wait lock is not held nor is exit file present

And the following in stdout:



Running spark-executor with framework dir = /tmp/mesos-sandbox/spark-1.0.0-bin-hadoop2

[EDIT]

Below is the output from running spark-submit:

ibm@ibm-VirtualBox:/opt/spark-1.0.0-bin-hadoop2$ spark-submit     --class org.apache.spark.examples.SparkPi     --master mesos://127.0.1.1:5050     --num-executors 1     --driver-memory 1g     --executor-memory 1g     --executor-cores 1     lib/spark-examples*.jar 10
14/07/10 11:41:52 INFO Utils: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/07/10 11:41:52 WARN Utils: Your hostname, ibm-VirtualBox resolves to a loopback address: 127.0.1.1; using 172.17.42.1 instead (on interface docker0)
14/07/10 11:41:52 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
14/07/10 11:41:52 INFO SecurityManager: Changing view acls to: ibm
14/07/10 11:41:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ibm)
14/07/10 11:41:53 INFO Slf4jLogger: Slf4jLogger started
14/07/10 11:41:53 INFO Remoting: Starting remoting
14/07/10 11:41:53 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:58372]
14/07/10 11:41:53 INFO Remoting: Remoting now listens on addresses: [akka.tcp://[email protected]:58372]
14/07/10 11:41:54 INFO SparkEnv: Registering MapOutputTracker
14/07/10 11:41:54 INFO SparkEnv: Registering BlockManagerMaster
14/07/10 11:41:54 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140710114154-7601
14/07/10 11:41:54 INFO MemoryStore: MemoryStore started with capacity 589.2 MB.
14/07/10 11:41:54 INFO ConnectionManager: Bound socket to port 54642 with id = ConnectionManagerId(172.17.42.1,54642)
14/07/10 11:41:54 INFO BlockManagerMaster: Trying to register BlockManager
14/07/10 11:41:54 INFO BlockManagerInfo: Registering block manager 172.17.42.1:54642 with 589.2 MB RAM
14/07/10 11:41:54 INFO BlockManagerMaster: Registered BlockManager
14/07/10 11:41:54 INFO HttpServer: Starting HTTP Server
14/07/10 11:41:54 INFO HttpBroadcast: Broadcast server started at http://172.17.42.1:40132
14/07/10 11:41:54 INFO HttpFileServer: HTTP File server directory is /tmp/spark-2b706d82-7321-47a9-a6ed-f884b5d4cfa0
14/07/10 11:41:54 INFO HttpServer: Starting HTTP Server
14/07/10 11:41:54 INFO SparkUI: Started SparkUI at http://172.17.42.1:4040
14/07/10 11:41:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/10 11:41:55 INFO EventLoggingListener: Logging events to /tmp/sparklogs/spark-pi-1405010515376
14/07/10 11:41:55 INFO SparkContext: Added JAR file:/opt/spark-1.0.0-bin-hadoop2/lib/spark-examples-1.0.0-hadoop2.2.0.jar at http://172.17.42.1:58841/jars/spark-examples-1.0.0-hadoop2.2.0.jar with timestamp 1405010515911
I0710 11:41:56.042367  3435 sched.cpp:126] Version: 0.19.0
I0710 11:41:56.044584  3439 sched.cpp:222] New master detected at [email protected]:5050
I0710 11:41:56.044920  3439 sched.cpp:230] No credentials provided. Attempting to register without authentication
I0710 11:41:56.047376  3439 sched.cpp:397] Framework registered with 20140710-113632-16842879-5050-1568-0002
14/07/10 11:41:56 INFO MesosSchedulerBackend: Registered as framework ID 20140710-113632-16842879-5050-1568-0002
14/07/10 11:41:56 INFO SparkContext: Starting job: reduce at SparkPi.scala:35
14/07/10 11:41:56 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:35) with 10 output partitions (allowLocal=false)
14/07/10 11:41:56 INFO DAGScheduler: Final stage: Stage 0(reduce at SparkPi.scala:35)
14/07/10 11:41:56 INFO DAGScheduler: Parents of final stage: List()
14/07/10 11:41:56 INFO DAGScheduler: Missing parents: List()
14/07/10 11:41:56 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at SparkPi.scala:31), which has no missing parents
14/07/10 11:41:56 INFO DAGScheduler: Submitting 10 missing tasks from Stage 0 (MappedRDD[1] at map at SparkPi.scala:31)
14/07/10 11:41:56 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
14/07/10 11:41:56 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:41:56 INFO TaskSetManager: Serialized task 0.0:0 as 1411 bytes in 5 ms
14/07/10 11:41:56 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:41:56 INFO TaskSetManager: Serialized task 0.0:1 as 1411 bytes in 0 ms
14/07/10 11:41:56 INFO TaskSetManager: Starting task 0.0:2 as TID 2 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:41:56 INFO TaskSetManager: Serialized task 0.0:2 as 1411 bytes in 0 ms
14/07/10 11:41:56 INFO TaskSetManager: Starting task 0.0:3 as TID 3 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:41:56 INFO TaskSetManager: Serialized task 0.0:3 as 1411 bytes in 0 ms
W0710 11:41:56.416663  3438 sched.cpp:901] Attempting to launch task 1 with an unknown offer 20140710-113632-16842879-5050-1568-66
W0710 11:41:56.416805  3438 sched.cpp:901] Attempting to launch task 2 with an unknown offer 20140710-113632-16842879-5050-1568-66
W0710 11:41:56.416862  3438 sched.cpp:901] Attempting to launch task 3 with an unknown offer 20140710-113632-16842879-5050-1568-66
14/07/10 11:42:01 INFO TaskSetManager: Re-queueing tasks for 20140710-113632-16842879-5050-1568-0 from TaskSet 0.0
14/07/10 11:42:01 WARN TaskSetManager: Lost TID 2 (task 0.0:2)
14/07/10 11:42:01 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
14/07/10 11:42:01 WARN TaskSetManager: Lost TID 3 (task 0.0:3)
14/07/10 11:42:01 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/07/10 11:42:01 INFO DAGScheduler: Executor lost: 20140710-113632-16842879-5050-1568-0 (epoch 0)
14/07/10 11:42:01 INFO BlockManagerMasterActor: Trying to remove executor 20140710-113632-16842879-5050-1568-0 from BlockManagerMaster.
14/07/10 11:42:01 INFO BlockManagerMaster: Removed 20140710-113632-16842879-5050-1568-0 successfully in removeExecutor
14/07/10 11:42:01 INFO TaskSetManager: Starting task 0.0:0 as TID 4 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:01 INFO TaskSetManager: Serialized task 0.0:0 as 1411 bytes in 0 ms
14/07/10 11:42:01 INFO TaskSetManager: Starting task 0.0:3 as TID 5 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:01 INFO TaskSetManager: Serialized task 0.0:3 as 1411 bytes in 0 ms
14/07/10 11:42:01 INFO DAGScheduler: Host added was in lost list earlier: ibm-VirtualBox
14/07/10 11:42:01 INFO TaskSetManager: Starting task 0.0:1 as TID 6 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:01 INFO TaskSetManager: Serialized task 0.0:1 as 1411 bytes in 1 ms
14/07/10 11:42:01 INFO TaskSetManager: Starting task 0.0:2 as TID 7 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:01 INFO TaskSetManager: Serialized task 0.0:2 as 1411 bytes in 1 ms
W0710 11:42:01.881386  3440 sched.cpp:901] Attempting to launch task 5 with an unknown offer 20140710-113632-16842879-5050-1568-67
W0710 11:42:01.881475  3440 sched.cpp:901] Attempting to launch task 6 with an unknown offer 20140710-113632-16842879-5050-1568-67
W0710 11:42:01.881731  3440 sched.cpp:901] Attempting to launch task 7 with an unknown offer 20140710-113632-16842879-5050-1568-67
14/07/10 11:42:07 INFO TaskSetManager: Re-queueing tasks for 20140710-113632-16842879-5050-1568-0 from TaskSet 0.0
14/07/10 11:42:07 WARN TaskSetManager: Lost TID 5 (task 0.0:3)
14/07/10 11:42:07 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
14/07/10 11:42:07 WARN TaskSetManager: Lost TID 7 (task 0.0:2)
14/07/10 11:42:07 WARN TaskSetManager: Lost TID 6 (task 0.0:1)
14/07/10 11:42:07 INFO DAGScheduler: Executor lost: 20140710-113632-16842879-5050-1568-0 (epoch 1)
14/07/10 11:42:07 INFO BlockManagerMasterActor: Trying to remove executor 20140710-113632-16842879-5050-1568-0 from BlockManagerMaster.
14/07/10 11:42:07 INFO BlockManagerMaster: Removed 20140710-113632-16842879-5050-1568-0 successfully in removeExecutor
14/07/10 11:42:07 INFO DAGScheduler: Host added was in lost list earlier: ibm-VirtualBox
14/07/10 11:42:07 INFO TaskSetManager: Starting task 0.0:1 as TID 8 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:07 INFO TaskSetManager: Serialized task 0.0:1 as 1411 bytes in 1 ms
14/07/10 11:42:07 INFO TaskSetManager: Starting task 0.0:2 as TID 9 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:07 INFO TaskSetManager: Serialized task 0.0:2 as 1411 bytes in 0 ms
14/07/10 11:42:07 INFO TaskSetManager: Starting task 0.0:0 as TID 10 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:07 INFO TaskSetManager: Serialized task 0.0:0 as 1411 bytes in 0 ms
14/07/10 11:42:07 INFO TaskSetManager: Starting task 0.0:3 as TID 11 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:07 INFO TaskSetManager: Serialized task 0.0:3 as 1411 bytes in 0 ms
W0710 11:42:07.820994  3437 sched.cpp:901] Attempting to launch task 9 with an unknown offer 20140710-113632-16842879-5050-1568-70
W0710 11:42:07.821722  3437 sched.cpp:901] Attempting to launch task 10 with an unknown offer 20140710-113632-16842879-5050-1568-70
W0710 11:42:07.821792  3437 sched.cpp:901] Attempting to launch task 11 with an unknown offer 20140710-113632-16842879-5050-1568-70
14/07/10 11:42:13 INFO TaskSetManager: Re-queueing tasks for 20140710-113632-16842879-5050-1568-0 from TaskSet 0.0
14/07/10 11:42:13 WARN TaskSetManager: Lost TID 8 (task 0.0:1)
14/07/10 11:42:13 WARN TaskSetManager: Lost TID 11 (task 0.0:3)
14/07/10 11:42:13 WARN TaskSetManager: Lost TID 10 (task 0.0:0)
14/07/10 11:42:13 WARN TaskSetManager: Lost TID 9 (task 0.0:2)
14/07/10 11:42:13 INFO DAGScheduler: Executor lost: 20140710-113632-16842879-5050-1568-0 (epoch 2)
14/07/10 11:42:13 INFO BlockManagerMasterActor: Trying to remove executor 20140710-113632-16842879-5050-1568-0 from BlockManagerMaster.
14/07/10 11:42:13 INFO BlockManagerMaster: Removed 20140710-113632-16842879-5050-1568-0 successfully in removeExecutor
14/07/10 11:42:13 INFO DAGScheduler: Host added was in lost list earlier: ibm-VirtualBox
14/07/10 11:42:13 INFO TaskSetManager: Starting task 0.0:2 as TID 12 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:13 INFO TaskSetManager: Serialized task 0.0:2 as 1411 bytes in 1 ms
14/07/10 11:42:13 INFO TaskSetManager: Starting task 0.0:0 as TID 13 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:13 INFO TaskSetManager: Serialized task 0.0:0 as 1411 bytes in 0 ms
14/07/10 11:42:13 INFO TaskSetManager: Re-queueing tasks for 20140710-113632-16842879-5050-1568-0 from TaskSet 0.0
14/07/10 11:42:13 WARN TaskSetManager: Lost TID 13 (task 0.0:0)
14/07/10 11:42:13 ERROR TaskSetManager: Task 0.0:0 failed 4 times; aborting job
14/07/10 11:42:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
W0710 11:42:13.820405  3438 sched.cpp:901] Attempting to launch task 13 with an unknown offer 20140710-113632-16842879-5050-1568-72
14/07/10 11:42:13 INFO DAGScheduler: Failed to run reduce at SparkPi.scala:35
Exception in thread "main" 14/07/10 11:42:13 INFO TaskSchedulerImpl: Cancelling stage 0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: TID 13 on host ibm-VirtualBox failed for unknown reason
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/07/10 11:42:13 INFO DAGScheduler: Executor lost: 20140710-113632-16842879-5050-1568-0 (epoch 3)
14/07/10 11:42:13 INFO BlockManagerMasterActor: Trying to remove executor 20140710-113632-16842879-5050-1568-0 from BlockManagerMaster.
14/07/10 11:42:13 INFO BlockManagerMaster: Removed 20140710-113632-16842879-5050-1568-0 successfully in removeExecutor
14/07/10 11:42:13 INFO DAGScheduler: Host added was in lost list earlier: ibm-VirtualBox


tarnfeld avatar tarnfeld commented on July 21, 2024

Ok, progress :-) So it looks like the docker container is failing to launch (by that I mean the executor is not launching successfully inside it). You can see the container ID as d93c6db7-42fa-4083-9c7d-d9e4cb10c4af. Try running docker logs d93c6db7-42fa-4083-9c7d-d9e4cb10c4af and see what comes up; you might see some Spark errors that help you debug.


chipsenkbeil avatar chipsenkbeil commented on July 21, 2024

Weird, it claims the container doesn't exist. There isn't any sort of cleanup of containers, is there?

ibm@ibm-VirtualBox:/opt/spark-1.0.0-bin-hadoop2$ sudo docker logs d93c6db7-42fa-4083-9c7d-d9e4cb10c4af
2014/07/10 11:48:41 Error: No such container: d93c6db7-42fa-4083-9c7d-d9e4cb10c4af

I tried it with a shortened ID, sudo docker logs d93c6, with the same results.

[EDIT]

In fact, there don't seem to be any previously run containers:

[screenshot: docker_info]


chipsenkbeil avatar chipsenkbeil commented on July 21, 2024

So, I tried updating the deimos config that is being used and replacing docker:///ubuntu:13.10 with docker:///libmesos/ubuntu:13.10 to see what would happen. It looks like the mesos library couldn't be found? I thought libmesos/ubuntu was built to contain that library?

This is the new error log:



WARNING: Your kernel does not support swap limit capabilities. Limitation discarded.
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14/07/10 16:55:11 INFO SparkHadoopUtil: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Failed to load native Mesos library from /usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
Exception in thread "main" java.lang.UnsatisfiedLinkError: no mesos in java.library.path
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)
    at java.lang.Runtime.loadLibrary0(Runtime.java:849)
    at java.lang.System.loadLibrary(System.java:1088)
    at org.apache.mesos.MesosNativeLibrary.load(MesosNativeLibrary.java:52)
    at org.apache.mesos.MesosNativeLibrary.load(MesosNativeLibrary.java:64)
    at org.apache.spark.executor.MesosExecutorBackend$$anonfun$main$1.apply$mcV$sp(MesosExecutorBackend.scala:99)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
    at org.apache.spark.executor.MesosExecutorBackend$.main(MesosExecutorBackend.scala:98)
    at org.apache.spark.executor.MesosExecutorBackend.main(MesosExecutorBackend.scala)
7 deimos.containerizer.docker.launch() call // docker --host unix:///var/run/docker.sock wait 71de134814fe64d5094321ce54fca132d1fe14e9cd7163ed5461ea6708c9a3cd
11:55:10.148 deimos.usage.self() rss = 16.047M  user = 0.157  sys = 0.072
11:55:10.148 deimos.usage.children() rss = 10.777M  user = 1.982  sys = 1.986
11:55:11.090 deimos.config.load_configuration() Loaded configuration from /root/.deimos
11:55:11.091 deimos.config.load_configuration() Found: deimos.config.DockerIndex(account_libmesos='libmesos', index='interior-node:2222', account=None, dockercfg=None)
11:55:11.091 deimos.config.load_configuration() Found: deimos.config.State(root='/tmp/deimos')
11:55:11.092 deimos.config.load_configuration() Found: deimos.config.Docker(host=[u'unix:///var/run/docker.sock'])
11:55:11.092 deimos.config.load_configuration() Found: deimos.config.Log(syslog=20, console=10)
11:55:11.092 deimos.config.load_configuration() Found: deimos.config.Containers(image=deimos.config.Image(ignore=False, default='docker:///libmesos/ubuntu:13.10'), options=deimos.config.Options(ignore=False, default=[], append=[]))
11:55:11.093 deimos.usage.self() rss = 16.047M  user = 0.115  sys = 0.045
11:55:11.094 deimos.usage.children() rss = 0.000M  user = 0.000  sys = 0.000
11:55:11.094 deimos.containerizer.stdio() (<bound method Docker.wait of deimos.containerizer.docker.Docker(index_settings=deimos.config.DockerIndex(account_libmesos='libmesos', index='interior-node:2222', account=None, dockercfg=None), workdir='/tmp/mesos-sandbox', state_root='/tmp/deimos', runner=None, optimistic_unpack=True, state=None, shared_dir='fs', container_settings=deimos.config.Containers(image=deimos.config.Image(ignore=False, default='docker:///libmesos/ubuntu:13.10'), options=deimos.config.Options(ignore=False, default=[], append=[])))>, <class 'containerizer_pb2.Wait'>)
11:55:11.095 deimos.proto.deserialize() container_id {
11:55:11.095 deimos.proto.deserialize()   value: "8c2312fb-3fec-4ac0-941a-db0f587e01d8"
11:55:11.096 deimos.proto.deserialize() }
11:55:11.096 deimos.containerizer.docker.wait() 
11:55:11.098 deimos.state.lock() request // launch SH (60s)
11:55:11.098 deimos.state.lock() success // launch SH (60s)
11:55:11.099 deimos.containerizer.docker.wait() eid    = 20140710-113632-16842879-5050-1568-0
11:55:11.099 deimos.containerizer.docker.wait() mesos  = 8c2312fb-3fec-4ac0-941a-db0f587e01d8
11:55:11.100 deimos.containerizer.docker.wait() docker = 71de134814fe64d5094321ce54fca132d1fe14e9cd7163ed5461ea6708c9a3cd
11:55:11.100 deimos.state.lock() request // observe SH (indefinite)
11:55:11.101 deimos.state.lock() success // observe SH (indefinite)
11:55:11.102 deimos.state.lock() request // wait SH (indefinite)
11:55:11.375 deimos.containerizer.docker.launch() exit 0 // docker --host unix:///var/run/docker.sock wait 71de134814fe64d5094321ce54fca132d1fe14e9cd7163ed5461ea6708c9a3cd
11:55:11.376 deimos.containerizer.docker.launch() STDOUT // 1

11:55:11.377 deimos.state.lock() success // wait SH (indefinite)
11:55:11.378 deimos.proto.serialize() Termination.status = 256
11:55:11.378 deimos.proto.serialize() Termination.message = ''
11:55:11.379 deimos.proto.serialize() Termination.killed = False
11:55:11.380 deimos.usage.self() rss = 16.047M  user = 0.124  sys = 0.049
11:55:11.380 deimos.usage.children() rss = 0.000M  user = 0.000  sys = 0.000
11:55:11.383 deimos.containerizer.docker.launch() exit 0 // docker --host unix:///var/run/docker.sock run --sig-proxy --rm --cidfile /tmp/deimos/mesos/8c2312fb-3fec-4ac0-941a-db0f587e01d8/cid -w /tmp/mesos-sandbox -v /tmp/deimos/mesos/8c2312fb-3fec-4ac0-941a-db0f587e01d8/fs:/tmp/mesos-sandbox -c 1024 -m 512m -e SPARK_USER=ibm -e SPARK_EXECUTOR_MEMORY=512m -e MESOS_SLAVE_ID=20140710-113632-16842879-5050-1568-0 -e 'MESOS_SLAVE_PID=slave(1)@127.0.1.1:5051' -e MESOS_FRAMEWORK_ID=20140710-113632-16842879-5050-1568-0005 -e MESOS_EXECUTOR_ID=20140710-113632-16842879-5050-1568-0 -e MESOS_CHECKPOINT=0 -e MESOS_DIRECTORY=/tmp/mesos-sandbox libmesos/ubuntu:13.10 sh -c 'cd spark-1*; ./sbin/spark-executor'
11:55:11.385 deimos.usage.self() rss = 11.770M  user = 0.007  sys = 0.006
11:55:11.385 deimos.usage.children() rss = 10.777M  user = 0.018  sys = 0.039

To reiterate, the previous stderr from using docker:///ubuntu:13.10 was the following:



WARNING: Your kernel does not support swap limit capabilities. Limitation discarded.
JAVA_HOME is not set
06
11:57:33.447 deimos.containerizer.docker.launch() call // docker --host unix:///var/run/docker.sock run --sig-proxy --rm --cidfile /tmp/deimos/mesos/4ff94f11-2671-48e2-bfbf-6b077eb60bf8/cid -w /tmp/mesos-sandbox -v /tmp/deimos/mesos/4ff94f11-2671-48e2-bfbf-6b077eb60bf8/fs:/tmp/mesos-sandbox -c 1024 -m 512m -e SPARK_USER=ibm -e SPARK_EXECUTOR_MEMORY=512m -e MESOS_SLAVE_ID=20140710-113632-16842879-5050-1568-0 -e 'MESOS_SLAVE_PID=slave(1)@127.0.1.1:5051' -e MESOS_FRAMEWORK_ID=20140710-113632-16842879-5050-1568-0006 -e MESOS_EXECUTOR_ID=20140710-113632-16842879-5050-1568-0 -e MESOS_CHECKPOINT=0 -e MESOS_DIRECTORY=/tmp/mesos-sandbox ubuntu:13.10 sh -c 'cd spark-1*; ./sbin/spark-executor'
11:57:33.457 deimos.state.await_cid() Awaiting CID file: /tmp/deimos/mesos/4ff94f11-2671-48e2-bfbf-6b077eb60bf8/cid
11:57:34.461 deimos.state.lock() request // wait EX (60s)
11:57:34.462 deimos.state.lock() success // wait EX (60s)
11:57:34.467 deimos.containerizer.docker.launch() eid    = 20140710-113632-16842879-5050-1568-0
11:57:34.468 deimos.containerizer.docker.launch() mesos  = 4ff94f11-2671-48e2-bfbf-6b077eb60bf8
11:57:34.467 deimos.containerizer.docker.launch() eid    = 20140710-113632-16842879-5050-1568-0
11:57:34.469 deimos.containerizer.docker.launch() docker = d7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471
11:57:34.470 deimos.containerizer.docker.launch() mesos  = 4ff94f11-2671-48e2-bfbf-6b077eb60bf8
11:57:34.470 deimos.containerizer.docker.launch() docker = d7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471
11:57:34.470 deimos.containerizer.docker.launch() Forking watcher into child...
11:57:34.472 deimos.containerizer.docker.launch() call // docker --host unix:///var/run/docker.sock wait d7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471
11:57:34.473 deimos.usage.self() rss = 16.047M  user = 0.145  sys = 0.064
11:57:34.473 deimos.usage.children() rss = 10.777M  user = 1.968  sys = 1.972
11:57:34.540 deimos.containerizer.docker.launch() exit 1 // docker --host unix:///var/run/docker.sock wait d7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471
11:57:34.541 deimos.containerizer.docker.launch() STDERR // Error: wait: no such container: d7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471
2014/07/10 11:57:34 Error: failed to wait one or more containers

11:57:34.542 deimos.cli() Command '['docker', '--host', u'unix:///var/run/docker.sock', 'wait', 'd7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471']' returned non-zero exit status 1
11:57:35.265 deimos.config.load_configuration() Loaded configuration from /root/.deimos
11:57:35.265 deimos.config.load_configuration() Found: deimos.config.DockerIndex(account_libmesos='libmesos', index='interior-node:2222', account=None, dockercfg=None)
11:57:35.266 deimos.config.load_configuration() Found: deimos.config.State(root='/tmp/deimos')
11:57:35.266 deimos.config.load_configuration() Found: deimos.config.Docker(host=[u'unix:///var/run/docker.sock'])
11:57:35.266 deimos.config.load_configuration() Found: deimos.config.Log(syslog=20, console=10)
11:57:35.267 deimos.config.load_configuration() Found: deimos.config.Containers(image=deimos.config.Image(ignore=False, default='docker:///ubuntu:13.10'), options=deimos.config.Options(ignore=False, default=[], append=[]))
11:57:35.267 deimos.usage.self() rss = 16.047M  user = 0.104  sys = 0.039
11:57:35.268 deimos.usage.children() rss = 0.000M  user = 0.000  sys = 0.000
11:57:35.268 deimos.containerizer.stdio() (<bound method Docker.wait of deimos.containerizer.docker.Docker(index_settings=deimos.config.DockerIndex(account_libmesos='libmesos', index='interior-node:2222', account=None, dockercfg=None), workdir='/tmp/mesos-sandbox', state_root='/tmp/deimos', runner=None, optimistic_unpack=True, state=None, shared_dir='fs', container_settings=deimos.config.Containers(image=deimos.config.Image(ignore=False, default='docker:///ubuntu:13.10'), options=deimos.config.Options(ignore=False, default=[], append=[])))>, <class 'containerizer_pb2.Wait'>)
11:57:35.269 deimos.proto.deserialize() container_id {
11:57:35.269 deimos.proto.deserialize()   value: "4ff94f11-2671-48e2-bfbf-6b077eb60bf8"
11:57:35.270 deimos.proto.deserialize() }
11:57:35.270 deimos.containerizer.docker.wait() 
11:57:35.271 deimos.state.lock() request // launch SH (60s)
11:57:35.272 deimos.state.lock() success // launch SH (60s)
11:57:35.272 deimos.containerizer.docker.wait() eid    = 20140710-113632-16842879-5050-1568-0
11:57:35.273 deimos.containerizer.docker.wait() mesos  = 4ff94f11-2671-48e2-bfbf-6b077eb60bf8
11:57:35.273 deimos.containerizer.docker.wait() docker = d7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471
11:57:35.274 deimos.state.lock() request // observe SH (indefinite)
11:57:35.274 deimos.state.lock() success // observe SH (indefinite)
11:57:35.275 deimos.state.lock() request // wait SH (indefinite)
11:57:35.276 deimos.state.lock() success // wait SH (indefinite)
11:57:35.276 deimos.proto.serialize() Termination.status = 16384
11:57:35.277 deimos.proto.serialize() Termination.message = ''
11:57:35.277 deimos.proto.serialize() Termination.killed = False
11:57:35.278 deimos.cli() deimos.err.Err: Wait lock is not held nor is exit file present


chipsenkbeil avatar chipsenkbeil commented on July 21, 2024

Another interesting thing to note is that the containers only show up for a split second when listing all containers (including stopped ones). So I'm guessing they are being removed rather than being allowed to persist?

watch -n 0.1 sudo docker ps -a
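
Since the run command in the log passes --rm, I assume the container gets deleted the moment it exits. One thing I could try is re-running the same docker run line from the Deimos log by hand, dropping --rm and --cidfile so the dead container sticks around; <sandbox> below is a stand-in for the /tmp/deimos/mesos/<uuid>/fs path from the log:

sudo docker run -w /tmp/mesos-sandbox -v <sandbox>:/tmp/mesos-sandbox \
  ubuntu:13.10 sh -c 'cd spark-1*; ./sbin/spark-executor'
sudo docker ps -a                # the exited container should still be listed
sudo docker logs <container-id>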

[EDIT]
Also, I want to confirm that deploying Docker containers through Marathon works fine. Maybe Spark itself cannot run in Docker with the way that Deimos handles it?

[EDIT2]
Finally, going back to not using deimos (and in turn not using Docker as the containerizer), I can run the Spark framework without issues and I can deploy tasks through Marathon. This includes deploying a command that runs a Docker container. I don't know how well Docker plays inside Mesos' default container system, but I can run Docker through the normal setup, at least.

For me, I'd love to have the ability to let Marathon use deimos and the Spark framework use the default containerizer, which I see is being worked on. It would allow me to deploy standard Docker containers through a nice RESTful API.


solidsnack avatar solidsnack commented on July 21, 2024

I believe that libmesos gets placed in /usr/local/lib by default; or at least, that used to be the case. (The libmesos/ubuntu container was built against an RC for Mesos 0.19.0.) Maybe java.library.path does not include /usr/local when Spark is run?

You could change account_libmesos to point to a different Docker account and create a new container in that account, tagged as ubuntu:14.04. Then you'd have control over the Java version as well as the other supporting libraries.
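
A quick way to check, and a possible workaround, sketched from memory (treat the paths and the MESOS_NATIVE_LIBRARY part as assumptions rather than gospel):

sudo docker run --rm libmesos/ubuntu:13.10 ls /usr/local/lib | grep -i mesos
# if libmesos is there, pointing the executor at it explicitly may be enough,
# e.g. by exporting this in spark-env.sh inside the image:
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so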


efeller avatar efeller commented on July 21, 2024

Thanks for the post. I went through the same process. Creating a new Docker container with libmesos in /usr/lib solved the UnsatisfiedLinkError.
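
Roughly what I did, from memory; the exact .so name depends on the Mesos build you copy it from, so treat this as a sketch:

cat > Dockerfile <<'EOF'
FROM libmesos/ubuntu:13.10
# libmesos-0.19.0.so copied in from a host with the Mesosphere package installed
ADD libmesos-0.19.0.so /usr/lib/
RUN ldconfig
EOF
sudo docker build -t myaccount/ubuntu-mesos:13.10 .   # myaccount is a placeholder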


tdna avatar tdna commented on July 21, 2024

Yes. I also managed to get Spark working with Deimos. The only thing I couldn't figure out was how to pass parameters when submitting a job (different from those in deimos.cfg), e.g. which image should be used. (I only ever ran with the same image.)


tarnfeld avatar tarnfeld commented on July 21, 2024

@tdna That's currently not supported by Spark. It's something @tnachen or I will be adding over the coming few days. Feel free to contribute, too!


tdna avatar tdna commented on July 21, 2024

That's cool, thanks!

