Hello, thanks for your works. I try to execute "Run distributed MNIS


Hang tensorflowonspark with example in standalonecluster mode about tensorflowonspark HOT 5 CLOSED

yahoo commented on May 12, 2024

Hang tensorflowonspark with example in standalonecluster mode

from tensorflowonspark.

Comments (5)

leewyang commented on May 12, 2024

I don't see anything obviously wrong, but there might be more information in the executor logs. Do you see any errors in the executor logs?

Also, since you're using Tensorflow-GPU 0.12, you may want to check out the "r0.12_examples" branch, since we've updated the "master" branch to TF 1.0 API recently. (Or you can update your env to TF 1.0 instead).


brickbit commented on May 12, 2024

Hi, I have now updated TF to 1.0, but it still does not work :-(

I have attached the output: the GPU runs out of memory.
output.txt

It seems that the execution hangs because my GPU has only 2 GB of RAM. Spark works with blocks of 511 MB: three blocks of 511 MB are reserved without problems, but the next block of 511 MB fails with a CUDA memory error. Why does this happen, if the MNIST CSV is only 126 MB but TFoS needs more than 1.5 GB? I suppose it is related to Spark, but I do not understand it. I also tried --executor-memory 500M and it does not work either.

Thanks for your support!

17/05/09 18:03:17 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 158.49.189.177:42554 (size: 6.5 KB, free: 511.1 MB)
17/05/09 18:03:17 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 158.49.189.177:42554 (size: 19.3 KB, free: 511.1 MB)
17/05/09 18:03:17 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 158.49.189.177:42554 (size: 19.3 KB, free: 511.1 MB)
HANGS AT THIS POINT

org.apache.spark.deploy.worker.Worker running as process 14529. Stop it first.
roberto@ladrillo:/software/TensorFlowOnSpark/spark-1.6.0-bin-hadoop2.6$ cd ..
roberto@ladrillo:/software/TensorFlowOnSpark$ ${SPARK_HOME}/bin/spark-submit --master ${MASTER} --py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py --conf spark.cores.max=${TOTAL_CORES} --conf spark.task.cpus=${CORES_PER_WORKER} --conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" ${TFoS_HOME}/examples/mnist/spark/mnist_spark.py --cluster_size ${SPARK_WORKER_INSTANCES} --images examples/mnist/csv/train/images --labels examples/mnist/csv/train/labels --format csv --mode train --model mnist_model
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
17/05/09 18:03:10 INFO SparkContext: Running Spark version 1.6.0
17/05/09 18:03:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/05/09 18:03:10 WARN SparkConf:
SPARK_WORKER_INSTANCES was detected (set to '2').
This is deprecated in Spark 1.0+.

Please instead use:

./spark-submit with --num-executors to specify the number of executors
Or set SPARK_EXECUTOR_INSTANCES
spark.executor.instances to configure the number of instances in the spark config.

17/05/09 18:03:10 WARN Utils: Your hostname, ladrillo resolves to a loopback address: 127.0.1.1; using 158.49.189.177 instead (on interface wlp2s0)
17/05/09 18:03:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/05/09 18:03:10 INFO SecurityManager: Changing view acls to: roberto
17/05/09 18:03:10 INFO SecurityManager: Changing modify acls to: roberto
17/05/09 18:03:10 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(roberto); users with modify permissions: Set(roberto)
17/05/09 18:03:11 INFO Utils: Successfully started service 'sparkDriver' on port 35313.
17/05/09 18:03:11 INFO Slf4jLogger: Slf4jLogger started
17/05/09 18:03:11 INFO Remoting: Starting remoting

[...]
2017-05-09 18:03:15,941 INFO (MainThread-18547) {'addr': ('ladrillo', 40163), 'task_index': 0, 'port': 45598, 'authkey': 'O\xd8\xfc\xed1dJX\x80s\xceS\xc7\xac\xdaD', 'worker_num': 0, 'host': 'ladrillo', 'ppid': 18863, 'job_name': 'ps', 'tb_pid': 0, 'tb_port': 0}
2017-05-09 18:03:15,941 INFO (MainThread-18547) Feeding training data
17/05/09 18:03:15 INFO SparkContext: Starting job: collect at PythonRDD.scala:405
17/05/09 18:03:15 INFO DAGScheduler: Got job 1 (collect at PythonRDD.scala:405) with 10 output partitions
17/05/09 18:03:15 INFO DAGScheduler: Final stage: ResultStage 1 (collect at PythonRDD.scala:405)
17/05/09 18:03:15 INFO DAGScheduler: Parents of final stage: List()
17/05/09 18:03:15 INFO DAGScheduler: Missing parents: List()
17/05/09 18:03:15 INFO DAGScheduler: Submitting ResultStage 1 (PythonRDD[10] at RDD at PythonRDD.scala:43), which has no missing parents
17/05/09 18:03:15 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 12.4 KB, free 486.1 KB)
17/05/09 18:03:15 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 6.5 KB, free 492.6 KB)
17/05/09 18:03:15 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 158.49.189.177:42896 (size: 6.5 KB, free: 511.1 MB)
17/05/09 18:03:15 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
17/05/09 18:03:15 INFO DAGScheduler: Submitting 10 missing tasks from ResultStage 1 (PythonRDD[10] at RDD at PythonRDD.scala:43)
17/05/09 18:03:15 INFO TaskSchedulerImpl: Adding task set 1.0 with 10 tasks
17/05/09 18:03:17 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, 158.49.189.177, partition 0,PROCESS_LOCAL, 2838 bytes)
17/05/09 18:03:17 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 3045 ms on 158.49.189.177 (1/2)
17/05/09 18:03:17 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 158.49.189.177:42554 (size: 6.5 KB, free: 511.1 MB)
17/05/09 18:03:17 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 158.49.189.177:42554 (size: 19.3 KB, free: 511.1 MB)
17/05/09 18:03:17 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 158.49.189.177:42554 (size: 19.3 KB, free: 511.1 MB)
HANGS AT THIS POINT


leewyang commented on May 12, 2024

Per your logs, I see

Total memory: 1.96GiB
Free memory: 173.69MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 950M, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 173.69M (182124544 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY

Please note that your GPU memory is not managed by Spark at all, so it's unrelated to the --executor-memory setting. That said, it looks like your GPU is already using 1.7GB of memory, so it's unable to start the TensorFlow device on it. I'd recommend trying the CPU version of TensorFlow for now (unless you can stop/remove whatever is tying up the rest of your GPU resources).
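For readers following along, the "already in use" figure can be checked directly from the two numbers in the log excerpt above. This is only a small sketch of the arithmetic; the helper names `parse_size` and `used_gib` are made up for illustration and are not part of TensorFlow or TensorFlowOnSpark.

```python
import re

# Binary units as printed in the TensorFlow device log.
UNITS = {"KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}


def parse_size(text):
    """Convert a string such as '1.96GiB' into a number of bytes."""
    match = re.fullmatch(r"\s*([\d.]+)\s*(KiB|MiB|GiB)\s*", text)
    if match is None:
        raise ValueError("unrecognized size: %r" % text)
    return float(match.group(1)) * UNITS[match.group(2)]


def used_gib(total, free):
    """Memory already taken on the device, in GiB."""
    return (parse_size(total) - parse_size(free)) / UNITS["GiB"]


# Values copied from the log: "Total memory: 1.96GiB", "Free memory: 173.69MiB".
used = used_gib("1.96GiB", "173.69MiB")
print("GPU memory already in use: %.2f GiB" % used)  # prints 1.79 GiB
```

So roughly 1.8 GiB of the card was occupied before TensorFlow tried to create its device, which is why the 173.69M allocation was all that it attempted and why it failed.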


mhaut commented on May 12, 2024

Hi @leewyang, thanks. I have the same problem.

I think the problem is that the tutorial says "export SPARK_WORKER_INSTANCES=2". The first worker reserves all of the GPU memory (1.73G), so the second worker gets no GPU.

The question is: how can we solve this and run multiple workers (on the same or on different virtual machines)?


leewyang commented on May 12, 2024

@mhaut GPU resources are managed by TensorFlow, so it's mostly beyond our control. You can either use more GPUs, or try this, but note that we currently assume one GPU per TensorFlow process.
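As an illustration of what "managed by TensorFlow" can mean in practice, TensorFlow 1.x lets a process cap how much GPU memory it claims through `per_process_gpu_memory_fraction` in `tf.GPUOptions`. The sketch below is an assumption about one possible workaround, not necessarily what the link above pointed to, and it does not change the one-GPU-per-process assumption mentioned in the reply. The helper names `per_worker_gpu_fraction` and `make_gpu_config` and the `headroom` parameter are hypothetical.

```python
def per_worker_gpu_fraction(num_workers, headroom=0.1):
    """Split one GPU evenly among workers, leaving `headroom` unallocated."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    return (1.0 - headroom) / num_workers


def make_gpu_config(fraction):
    """Build a session config capped at `fraction` of the GPU's memory.

    Assumes the TensorFlow 1.x API; the import is deferred so that the
    arithmetic above can be used without TensorFlow installed.
    """
    import tensorflow as tf
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=fraction)
    return tf.ConfigProto(gpu_options=gpu_options)


# Two workers on one card, as with SPARK_WORKER_INSTANCES=2 in this thread.
fraction = per_worker_gpu_fraction(num_workers=2)
print("per-process GPU memory fraction: %.2f" % fraction)  # prints 0.45
```

The resulting config would then be passed wherever each process creates its TensorFlow server or session. Note that the fraction is relative to the card's total memory, so this only helps when that memory is actually free; it would not have helped with the log earlier in the thread, where another consumer already held most of the 2 GB.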
