Comments (5)

leewyang commented on May 12, 2024

I don't see anything obviously wrong, but there might be more information in the executor logs. Do you see any errors in the executor logs?

Also, since you're using TensorFlow-GPU 0.12, you may want to check out the "r0.12_examples" branch, because we recently updated the "master" branch to the TF 1.0 API. (Alternatively, you can update your environment to TF 1.0.)

brickbit commented on May 12, 2024

Hi, I have now updated TF to 1.0, but it still doesn't work :-(

I've attached the output; the GPU runs out of memory.
output.txt

It seems that the execution hangs because my GPU has only 2 GB of RAM. Spark works with blocks of 511 MB: three blocks of 511 MB are reserved fine, but the next 511 MB block fails with a CUDA memory error. Why, though? The MNIST CSV is only 126 MB, yet TFoS needs more than 1.5 GB. I suppose this is caused by Spark, but I don't understand it. I also tried --executor-memory 500M, and it doesn't work either.

Thanks for your support!


17/05/09 18:03:17 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 158.49.189.177:42554 (size: 6.5 KB, free: 511.1 MB)
17/05/09 18:03:17 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 158.49.189.177:42554 (size: 19.3 KB, free: 511.1 MB)
17/05/09 18:03:17 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 158.49.189.177:42554 (size: 19.3 KB, free: 511.1 MB)
HANGS AT THIS POINT

org.apache.spark.deploy.worker.Worker running as process 14529. Stop it first.
roberto@ladrillo:/software/TensorFlowOnSpark/spark-1.6.0-bin-hadoop2.6$ cd ..
roberto@ladrillo:/software/TensorFlowOnSpark$ ${SPARK_HOME}/bin/spark-submit --master ${MASTER} --py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py --conf spark.cores.max=${TOTAL_CORES} --conf spark.task.cpus=${CORES_PER_WORKER} --conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" ${TFoS_HOME}/examples/mnist/spark/mnist_spark.py --cluster_size ${SPARK_WORKER_INSTANCES} --images examples/mnist/csv/train/images --labels examples/mnist/csv/train/labels --format csv --mode train --model mnist_model
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
17/05/09 18:03:10 INFO SparkContext: Running Spark version 1.6.0
17/05/09 18:03:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/05/09 18:03:10 WARN SparkConf:
SPARK_WORKER_INSTANCES was detected (set to '2').
This is deprecated in Spark 1.0+.

Please instead use:

  • ./spark-submit with --num-executors to specify the number of executors
  • Or set SPARK_EXECUTOR_INSTANCES
  • spark.executor.instances to configure the number of instances in the spark config.

17/05/09 18:03:10 WARN Utils: Your hostname, ladrillo resolves to a loopback address: 127.0.1.1; using 158.49.189.177 instead (on interface wlp2s0)
17/05/09 18:03:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/05/09 18:03:10 INFO SecurityManager: Changing view acls to: roberto
17/05/09 18:03:10 INFO SecurityManager: Changing modify acls to: roberto
17/05/09 18:03:10 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(roberto); users with modify permissions: Set(roberto)
17/05/09 18:03:11 INFO Utils: Successfully started service 'sparkDriver' on port 35313.
17/05/09 18:03:11 INFO Slf4jLogger: Slf4jLogger started
17/05/09 18:03:11 INFO Remoting: Starting remoting

[...]
2017-05-09 18:03:15,941 INFO (MainThread-18547) {'addr': ('ladrillo', 40163), 'task_index': 0, 'port': 45598, 'authkey': 'O\xd8\xfc\xed1dJX\x80s\xceS\xc7\xac\xdaD', 'worker_num': 0, 'host': 'ladrillo', 'ppid': 18863, 'job_name': 'ps', 'tb_pid': 0, 'tb_port': 0}
2017-05-09 18:03:15,941 INFO (MainThread-18547) Feeding training data
17/05/09 18:03:15 INFO SparkContext: Starting job: collect at PythonRDD.scala:405
17/05/09 18:03:15 INFO DAGScheduler: Got job 1 (collect at PythonRDD.scala:405) with 10 output partitions
17/05/09 18:03:15 INFO DAGScheduler: Final stage: ResultStage 1 (collect at PythonRDD.scala:405)
17/05/09 18:03:15 INFO DAGScheduler: Parents of final stage: List()
17/05/09 18:03:15 INFO DAGScheduler: Missing parents: List()
17/05/09 18:03:15 INFO DAGScheduler: Submitting ResultStage 1 (PythonRDD[10] at RDD at PythonRDD.scala:43), which has no missing parents
17/05/09 18:03:15 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 12.4 KB, free 486.1 KB)
17/05/09 18:03:15 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 6.5 KB, free 492.6 KB)
17/05/09 18:03:15 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 158.49.189.177:42896 (size: 6.5 KB, free: 511.1 MB)
17/05/09 18:03:15 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
17/05/09 18:03:15 INFO DAGScheduler: Submitting 10 missing tasks from ResultStage 1 (PythonRDD[10] at RDD at PythonRDD.scala:43)
17/05/09 18:03:15 INFO TaskSchedulerImpl: Adding task set 1.0 with 10 tasks
17/05/09 18:03:17 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, 158.49.189.177, partition 0,PROCESS_LOCAL, 2838 bytes)
17/05/09 18:03:17 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 3045 ms on 158.49.189.177 (1/2)
17/05/09 18:03:17 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 158.49.189.177:42554 (size: 6.5 KB, free: 511.1 MB)
17/05/09 18:03:17 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 158.49.189.177:42554 (size: 19.3 KB, free: 511.1 MB)
17/05/09 18:03:17 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 158.49.189.177:42554 (size: 19.3 KB, free: 511.1 MB)
HANGS AT THIS POINT

leewyang commented on May 12, 2024

Per your logs, I see:

Total memory: 1.96GiB
Free memory: 173.69MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 950M, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 173.69M (182124544 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY

Please note that your GPU memory is not managed by Spark at all, so it's unrelated to the --executor-memory setting. That said, it looks like your GPU is already using 1.7GB of memory, so it's unable to start the TensorFlow device on it. I'd recommend trying the CPU version of TensorFlow for now (unless you can stop/remove whatever is tying up the rest of your GPU resources).
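
For reference, the "already using 1.7GB" figure follows directly from the two memory lines quoted above. The short Python sketch below just redoes that arithmetic; the helper names `parse_size` and `memory_in_use` are made up for this illustration and are not part of TensorFlow or TensorFlowOnSpark.

```python
import re

# Unit sizes in bytes, matching the suffixes used in the TensorFlow log lines.
UNITS = {"MiB": 1024 ** 2, "GiB": 1024 ** 3}


def parse_size(text):
    """Turn a string such as '1.96GiB' into a number of bytes."""
    match = re.fullmatch(r"([\d.]+)(MiB|GiB)", text.strip())
    if match is None:
        raise ValueError("unrecognised size: %r" % text)
    value, unit = match.groups()
    return float(value) * UNITS[unit]


def memory_in_use(log_text):
    """Return (total, free, used) in bytes from the two memory log lines."""
    total = parse_size(re.search(r"Total memory:\s*(\S+)", log_text).group(1))
    free = parse_size(re.search(r"Free memory:\s*(\S+)", log_text).group(1))
    return total, free, total - free


# The two lines quoted from the attached output.
log_excerpt = """Total memory: 1.96GiB
Free memory: 173.69MiB"""

total, free, used = memory_in_use(log_excerpt)
print("in use before TensorFlow allocates anything: %.2f GiB" % (used / UNITS["GiB"]))
```

This comes out at roughly 1.79 GiB already in use, which leaves only the 173.69 MiB that TensorFlow then fails to allocate. None of these numbers are affected by --executor-memory, since that setting only sizes the Spark executor's JVM heap.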

mhaut commented on May 12, 2024

Hi @leewyang, thanks. I have the same problem.

I think the problem is that the tutorial says "export SPARK_WORKER_INSTANCES=2". The first worker reserves the whole GPU (1.73G), so the second worker gets no GPU.

The question is: how can we solve this and work with multiple workers (on the same or on different virtual machines)?

leewyang commented on May 12, 2024

@mhaut GPU resources are managed by TensorFlow, so it's mostly beyond our control. You can either use more GPUs, or try this, but note that we currently assume one GPU per TensorFlow process.
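
The target of the link above did not survive in this copy of the thread, so the following is not necessarily what was being pointed to. One TensorFlow 1.x setting that is relevant here is `per_process_gpu_memory_fraction` on `tf.GPUOptions`: by default a TensorFlow process maps nearly all of the GPU memory it can see, and this option caps that share. The sketch below is only an illustration under stated assumptions: `fraction_per_process`, `make_session_config`, and the 64 MiB headroom are hypothetical names and values chosen for this example, and sharing one GPU between two processes goes against the one-GPU-per-process assumption mentioned above.

```python
def fraction_per_process(total_mib, free_mib, processes, headroom_mib=64.0):
    """Share of total GPU memory that each process may claim.

    Only memory that is currently free is divided up; headroom_mib is a
    hypothetical safety margin that is left unallocated.
    """
    if processes < 1:
        raise ValueError("processes must be at least 1")
    usable_mib = max(free_mib - headroom_mib, 0.0)
    return usable_mib / processes / total_mib


def make_session_config(fraction):
    """Build a TF 1.x config capped at `fraction` of the GPU's memory."""
    import tensorflow as tf  # imported lazily; assumes a TensorFlow 1.x install
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=fraction)
    return tf.ConfigProto(gpu_options=gpu_options)


# Figures from the log earlier in the thread: 1.96 GiB total, 173.69 MiB free.
busy_gpu = fraction_per_process(1.96 * 1024, 173.69, processes=2)
# The same card if nothing else were holding GPU memory (an assumption).
idle_gpu = fraction_per_process(1.96 * 1024, 1.96 * 1024, processes=2)
print("busy GPU: %.3f of total memory per process" % busy_gpu)
print("idle GPU: %.3f of total memory per process" % idle_gpu)
```

With the memory figures from the log, each of two processes would get under 3% of the card, which is unlikely to be enough for training, so freeing the GPU first (or using the CPU build) still applies. Where the resulting config is passed depends on how the training script creates its sessions or servers, so that part is left out here.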
