I can successfully run the CIFAR-10 test, with and without RDMA, on configurations ranging from a single node containing 2 GPUs up to 2 nodes containing a total of 4 GPUs. However, when I scale out further, I get an unknown error pertaining to HDFS. Kindly check out the log below and see if it makes sense:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/filecache/12/__spark_libs__6493683702299857357.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/javed.19/git-pull/finished/HiBench/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/05/18 03:13:51 INFO util.SignalUtils: Registered signal handler for TERM
17/05/18 03:13:51 INFO util.SignalUtils: Registered signal handler for HUP
17/05/18 03:13:51 INFO util.SignalUtils: Registered signal handler for INT
17/05/18 03:13:52 INFO yarn.ApplicationMaster: Preparing Local resources
17/05/18 03:13:52 INFO yarn.ApplicationMaster: Prepared Local resources Map(pyspark.zip -> resource { scheme: "hdfs" host: "gpu09" port: 9000 file: "/user/javed.19/.sparkStaging/application_1495091612438_0001/pyspark.zip" } size: 440385 timestamp: 1495091625707 type: FILE visibility: PRIVATE, __spark_libs__ -> resource { scheme: "hdfs" host: "gpu09" port: 9000 file: "/user/javed.19/.sparkStaging/application_1495091612438_0001/__spark_libs__6493683702299857357.zip" } size: 192670368 timestamp: 1495091625513 type: ARCHIVE visibility: PRIVATE, __spark_conf__ -> resource { scheme: "hdfs" host: "gpu09" port: 9000 file: "/user/javed.19/.sparkStaging/application_1495091612438_0001/__spark_conf__.zip" } size: 85884 timestamp: 1495091626034 type: ARCHIVE visibility: PRIVATE, py4j-0.10.3-src.zip -> resource { scheme: "hdfs" host: "gpu09" port: 9000 file: "/user/javed.19/.sparkStaging/application_1495091612438_0001/py4j-0.10.3-src.zip" } size: 91275 timestamp: 1495091625776 type: FILE visibility: PRIVATE, tfspark.zip -> resource { scheme: "hdfs" host: "gpu09" port: 9000 file: "/user/javed.19/.sparkStaging/application_1495091612438_0001/tfspark.zip" } size: 19385 timestamp: 1495091625849 type: FILE visibility: PRIVATE, cifar10.zip -> resource { scheme: "hdfs" host: "gpu09" port: 9000 file: "/user/javed.19/.sparkStaging/application_1495091612438_0001/cifar10.zip" } size: 21343 timestamp: 1495091625938 type: FILE visibility: PRIVATE, Python -> resource { scheme: "hdfs" host: "gpu09" port: 9000 file: "/Python.zip" } size: 148009343 timestamp: 1495091620084 type: ARCHIVE visibility: PUBLIC)
17/05/18 03:13:52 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1495091612438_0001_000001
17/05/18 03:13:52 INFO spark.SecurityManager: Changing view acls to: javed.19
17/05/18 03:13:52 INFO spark.SecurityManager: Changing modify acls to: javed.19
17/05/18 03:13:52 INFO spark.SecurityManager: Changing view acls groups to:
17/05/18 03:13:52 INFO spark.SecurityManager: Changing modify acls groups to:
17/05/18 03:13:52 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(javed.19); groups with view permissions: Set(); users with modify permissions: Set(javed.19); groups with modify permissions: Set()
17/05/18 03:13:52 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
17/05/18 03:13:52 INFO yarn.ApplicationMaster: Waiting for spark context initialization
17/05/18 03:13:52 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
17/05/18 03:13:53 INFO spark.SparkContext: Running Spark version 2.0.2
17/05/18 03:13:53 INFO spark.SecurityManager: Changing view acls to: javed.19
17/05/18 03:13:53 INFO spark.SecurityManager: Changing modify acls to: javed.19
17/05/18 03:13:53 INFO spark.SecurityManager: Changing view acls groups to:
17/05/18 03:13:53 INFO spark.SecurityManager: Changing modify acls groups to:
17/05/18 03:13:53 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(javed.19); groups with view permissions: Set(); users with modify permissions: Set(javed.19); groups with modify permissions: Set()
17/05/18 03:13:53 INFO util.Utils: Successfully started service 'sparkDriver' on port 36823.
17/05/18 03:13:53 INFO spark.SparkEnv: Registering MapOutputTracker
17/05/18 03:13:53 INFO spark.SparkEnv: Registering BlockManagerMaster
17/05/18 03:13:53 INFO storage.DiskBlockManager: Created local directory at /tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/blockmgr-d5c580a9-2b04-4b3c-b4df-4a06e7372741
17/05/18 03:13:53 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
17/05/18 03:13:53 INFO spark.SparkEnv: Registering OutputCommitCoordinator
17/05/18 03:13:53 INFO util.log: Logging initialized @2194ms
17/05/18 03:13:53 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
17/05/18 03:13:53 INFO server.Server: jetty-9.2.z-SNAPSHOT
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6321bc55{/jobs,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3d283b81{/jobs/json,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@122dc555{/jobs/job,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@701070ff{/jobs/job/json,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1e3db9bc{/stages,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@62357dc9{/stages/json,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1ece0bb7{/stages/stage,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6d2a1719{/stages/stage/json,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1e29b359{/stages/pool,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@57bb4e60{/stages/pool/json,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@39467493{/storage,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4e92e7d{/storage/json,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5789f6c2{/storage/rdd,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5a4c7a1d{/storage/rdd/json,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@50247f2b{/environment,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1d733994{/environment/json,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@554e31e{/executors,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3cf1ed3b{/executors/json,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4fdf10a9{/executors/threadDump,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4be42f5f{/executors/threadDump/json,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@690a792e{/static,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5a21c901{/,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6d4cc0b4{/api,null,AVAILABLE}
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7b88495{/stages/stage/kill,null,AVAILABLE}
17/05/18 03:13:53 INFO server.ServerConnector: Started ServerConnector@7995f9ac{HTTP/1.1}{0.0.0.0:43181}
17/05/18 03:13:53 INFO server.Server: Started @2290ms
17/05/18 03:13:53 INFO util.Utils: Successfully started service 'SparkUI' on port 43181.
17/05/18 03:13:53 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.3.1.12:43181
17/05/18 03:13:53 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler
17/05/18 03:13:53 INFO cluster.SchedulerExtensionServices: Starting Yarn extension services with app application_1495091612438_0001 and attemptId Some(appattempt_1495091612438_0001_000001)
17/05/18 03:13:53 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 49987.
17/05/18 03:13:53 INFO netty.NettyBlockTransferService: Server created on 10.3.1.12:49987
17/05/18 03:13:53 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.3.1.12, 49987)
17/05/18 03:13:53 INFO storage.BlockManagerMasterEndpoint: Registering block manager 10.3.1.12:49987 with 366.3 MB RAM, BlockManagerId(driver, 10.3.1.12, 49987)
17/05/18 03:13:53 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.3.1.12, 49987)
17/05/18 03:13:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@43822b7b{/metrics/json,null,AVAILABLE}
17/05/18 03:13:53 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark://[email protected]:36823)
17/05/18 03:13:53 INFO client.RMProxy: Connecting to ResourceManager at gpu09/10.3.1.9:8030
17/05/18 03:13:53 INFO yarn.YarnRMClient: Registering the ApplicationMaster
17/05/18 03:13:53 INFO yarn.YarnAllocator: Will request 4 executor containers, each with 1 cores and 12390 MB memory including 1126 MB overhead
17/05/18 03:13:53 INFO yarn.YarnAllocator: Canceled 0 container requests (locality no longer needed)
17/05/18 03:13:53 INFO yarn.YarnAllocator: Submitted container request (host: Any, capability: <memory:12390, vCores:1>)
17/05/18 03:13:53 INFO yarn.YarnAllocator: Submitted container request (host: Any, capability: <memory:12390, vCores:1>)
17/05/18 03:13:53 INFO yarn.YarnAllocator: Submitted container request (host: Any, capability: <memory:12390, vCores:1>)
17/05/18 03:13:53 INFO yarn.YarnAllocator: Submitted container request (host: Any, capability: <memory:12390, vCores:1>)
17/05/18 03:13:54 INFO yarn.ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
17/05/18 03:13:55 INFO impl.AMRMClientImpl: Received new token for : gpu10-ib.cluster:54754
17/05/18 03:13:55 INFO impl.AMRMClientImpl: Received new token for : gpu11-ib.cluster:38611
17/05/18 03:13:55 INFO impl.AMRMClientImpl: Received new token for : gpu13-ib.cluster:44034
17/05/18 03:13:55 INFO impl.AMRMClientImpl: Received new token for : gpu12-ib.cluster:40470
17/05/18 03:13:55 INFO yarn.YarnAllocator: Launching container container_1495091612438_0001_01_000002 for on host gpu10-ib.cluster
17/05/18 03:13:55 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: spark://[email protected]:36823, executorHostname: gpu10-ib.cluster
17/05/18 03:13:55 INFO yarn.YarnAllocator: Launching container container_1495091612438_0001_01_000003 for on host gpu11-ib.cluster
17/05/18 03:13:55 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: spark://[email protected]:36823, executorHostname: gpu11-ib.cluster
17/05/18 03:13:55 INFO yarn.YarnAllocator: Launching container container_1495091612438_0001_01_000004 for on host gpu13-ib.cluster
17/05/18 03:13:55 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: spark://[email protected]:36823, executorHostname: gpu13-ib.cluster
17/05/18 03:13:55 INFO yarn.YarnAllocator: Launching container container_1495091612438_0001_01_000005 for on host gpu12-ib.cluster
17/05/18 03:13:55 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: spark://[email protected]:36823, executorHostname: gpu12-ib.cluster
17/05/18 03:13:55 INFO yarn.YarnAllocator: Received 4 containers from YARN, launching executors on 4 of them.
17/05/18 03:13:55 INFO yarn.ExecutorRunnable: Starting Executor Container
17/05/18 03:13:55 INFO yarn.ExecutorRunnable: Starting Executor Container
17/05/18 03:13:55 INFO yarn.ExecutorRunnable: Starting Executor Container
17/05/18 03:13:55 INFO yarn.ExecutorRunnable: Starting Executor Container
17/05/18 03:13:55 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
17/05/18 03:13:55 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
17/05/18 03:13:55 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
17/05/18 03:13:55 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
17/05/18 03:13:55 INFO yarn.ExecutorRunnable: Setting up ContainerLaunchContext
17/05/18 03:13:55 INFO yarn.ExecutorRunnable: Setting up ContainerLaunchContext
17/05/18 03:13:55 INFO yarn.ExecutorRunnable: Setting up ContainerLaunchContext
17/05/18 03:13:55 INFO yarn.ExecutorRunnable: Setting up ContainerLaunchContext
17/05/18 03:13:55 INFO yarn.ExecutorRunnable:
===============================================================================
YARN executor launch context:
env:
SPARK_YARN_USER_ENV -> PYSPARK_PYTHON=Python/bin/python
CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/lib/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*
SPARK_LOG_URL_STDERR -> http://gpu11-ib.cluster:8042/node/containerlogs/container_1495091612438_0001_01_000003/javed.19/stderr?start=-4096
SPARK_YARN_STAGING_DIR -> hdfs://gpu09:9000/user/javed.19/.sparkStaging/application_1495091612438_0001
SPARK_USER -> javed.19
SPARK_YARN_MODE -> true
PYTHONPATH -> /home/javed.19/bin/pydoop-install/lib/python<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.3-src.zip<CPS>{{PWD}}/tfspark.zip<CPS>{{PWD}}/cifar10.zip
LD_LIBRARY_PATH -> /usr/local/cuda/lib64:/usr/lib/jvm/java-1.8.0/jre/lib/amd64/server
SPARK_LOG_URL_STDOUT -> http://gpu11-ib.cluster:8042/node/containerlogs/container_1495091612438_0001_01_000003/javed.19/stdout?start=-4096
PYSPARK_PYTHON -> Python/bin/python
command:
{{JAVA_HOME}}/bin/java -server -Xmx11264m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0' '-Dspark.driver.port=36823' -Dspark.yarn.app.container.log.dir=<LOG_DIR> -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://[email protected]:36823 --executor-id 2 --hostname gpu11-ib.cluster --cores 1 --app-id application_1495091612438_0001 --user-class-path file:$PWD/__app__.jar 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
===============================================================================
17/05/18 03:13:55 INFO yarn.ExecutorRunnable:
===============================================================================
YARN executor launch context:
env:
SPARK_YARN_USER_ENV -> PYSPARK_PYTHON=Python/bin/python
CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/lib/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*
SPARK_LOG_URL_STDERR -> http://gpu10-ib.cluster:8042/node/containerlogs/container_1495091612438_0001_01_000002/javed.19/stderr?start=-4096
SPARK_YARN_STAGING_DIR -> hdfs://gpu09:9000/user/javed.19/.sparkStaging/application_1495091612438_0001
SPARK_USER -> javed.19
SPARK_YARN_MODE -> true
PYTHONPATH -> /home/javed.19/bin/pydoop-install/lib/python<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.3-src.zip<CPS>{{PWD}}/tfspark.zip<CPS>{{PWD}}/cifar10.zip
LD_LIBRARY_PATH -> /usr/local/cuda/lib64:/usr/lib/jvm/java-1.8.0/jre/lib/amd64/server
SPARK_LOG_URL_STDOUT -> http://gpu10-ib.cluster:8042/node/containerlogs/container_1495091612438_0001_01_000002/javed.19/stdout?start=-4096
PYSPARK_PYTHON -> Python/bin/python
command:
{{JAVA_HOME}}/bin/java -server -Xmx11264m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0' '-Dspark.driver.port=36823' -Dspark.yarn.app.container.log.dir=<LOG_DIR> -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://[email protected]:36823 --executor-id 1 --hostname gpu10-ib.cluster --cores 1 --app-id application_1495091612438_0001 --user-class-path file:$PWD/__app__.jar 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
===============================================================================
17/05/18 03:13:55 INFO yarn.ExecutorRunnable:
===============================================================================
YARN executor launch context:
env:
SPARK_YARN_USER_ENV -> PYSPARK_PYTHON=Python/bin/python
CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/lib/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*
SPARK_LOG_URL_STDERR -> http://gpu12-ib.cluster:8042/node/containerlogs/container_1495091612438_0001_01_000005/javed.19/stderr?start=-4096
SPARK_YARN_STAGING_DIR -> hdfs://gpu09:9000/user/javed.19/.sparkStaging/application_1495091612438_0001
SPARK_USER -> javed.19
SPARK_YARN_MODE -> true
PYTHONPATH -> /home/javed.19/bin/pydoop-install/lib/python<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.3-src.zip<CPS>{{PWD}}/tfspark.zip<CPS>{{PWD}}/cifar10.zip
LD_LIBRARY_PATH -> /usr/local/cuda/lib64:/usr/lib/jvm/java-1.8.0/jre/lib/amd64/server
SPARK_LOG_URL_STDOUT -> http://gpu12-ib.cluster:8042/node/containerlogs/container_1495091612438_0001_01_000005/javed.19/stdout?start=-4096
PYSPARK_PYTHON -> Python/bin/python
command:
{{JAVA_HOME}}/bin/java -server -Xmx11264m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0' '-Dspark.driver.port=36823' -Dspark.yarn.app.container.log.dir=<LOG_DIR> -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://[email protected]:36823 --executor-id 4 --hostname gpu12-ib.cluster --cores 1 --app-id application_1495091612438_0001 --user-class-path file:$PWD/__app__.jar 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
===============================================================================
17/05/18 03:13:55 INFO yarn.ExecutorRunnable:
===============================================================================
YARN executor launch context:
env:
SPARK_YARN_USER_ENV -> PYSPARK_PYTHON=Python/bin/python
CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/lib/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*
SPARK_LOG_URL_STDERR -> http://gpu13-ib.cluster:8042/node/containerlogs/container_1495091612438_0001_01_000004/javed.19/stderr?start=-4096
SPARK_YARN_STAGING_DIR -> hdfs://gpu09:9000/user/javed.19/.sparkStaging/application_1495091612438_0001
SPARK_USER -> javed.19
SPARK_YARN_MODE -> true
PYTHONPATH -> /home/javed.19/bin/pydoop-install/lib/python<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.3-src.zip<CPS>{{PWD}}/tfspark.zip<CPS>{{PWD}}/cifar10.zip
LD_LIBRARY_PATH -> /usr/local/cuda/lib64:/usr/lib/jvm/java-1.8.0/jre/lib/amd64/server
SPARK_LOG_URL_STDOUT -> http://gpu13-ib.cluster:8042/node/containerlogs/container_1495091612438_0001_01_000004/javed.19/stdout?start=-4096
PYSPARK_PYTHON -> Python/bin/python
command:
{{JAVA_HOME}}/bin/java -server -Xmx11264m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0' '-Dspark.driver.port=36823' -Dspark.yarn.app.container.log.dir=<LOG_DIR> -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://[email protected]:36823 --executor-id 3 --hostname gpu13-ib.cluster --cores 1 --app-id application_1495091612438_0001 --user-class-path file:$PWD/__app__.jar 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
===============================================================================
17/05/18 03:13:55 INFO impl.ContainerManagementProtocolProxy: Opening proxy : gpu12-ib.cluster:40470
17/05/18 03:13:55 INFO impl.ContainerManagementProtocolProxy: Opening proxy : gpu11-ib.cluster:38611
17/05/18 03:13:55 INFO impl.ContainerManagementProtocolProxy: Opening proxy : gpu10-ib.cluster:54754
17/05/18 03:13:55 INFO impl.ContainerManagementProtocolProxy: Opening proxy : gpu13-ib.cluster:44034
17/05/18 03:13:57 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.3.1.12:44066) with ID 4
17/05/18 03:13:57 INFO storage.BlockManagerMasterEndpoint: Registering block manager gpu12-ib.cluster:52469 with 5.7 GB RAM, BlockManagerId(4, gpu12-ib.cluster, 52469)
17/05/18 03:14:01 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.3.1.11:42183) with ID 2
17/05/18 03:14:01 INFO storage.BlockManagerMasterEndpoint: Registering block manager gpu11-ib.cluster:34736 with 5.7 GB RAM, BlockManagerId(2, gpu11-ib.cluster, 34736)
17/05/18 03:14:01 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.3.1.13:39548) with ID 3
17/05/18 03:14:01 INFO storage.BlockManagerMasterEndpoint: Registering block manager gpu13-ib.cluster:39205 with 5.7 GB RAM, BlockManagerId(3, gpu13-ib.cluster, 39205)
17/05/18 03:14:01 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.3.1.10:34494) with ID 1
17/05/18 03:14:01 INFO storage.BlockManagerMasterEndpoint: Registering block manager gpu10-ib.cluster:43050 with 5.7 GB RAM, BlockManagerId(1, gpu10-ib.cluster, 43050)
17/05/18 03:14:01 INFO cluster.YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
17/05/18 03:14:01 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
17/05/18 03:14:01 INFO spark.SparkContext: Starting job: foreachPartition at /tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/tfspark.zip/com/yahoo/ml/tf/TFCluster.py:279
17/05/18 03:14:01 INFO scheduler.DAGScheduler: Got job 0 (foreachPartition at /tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/tfspark.zip/com/yahoo/ml/tf/TFCluster.py:279) with 4 output partitions
17/05/18 03:14:01 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (foreachPartition at /tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/tfspark.zip/com/yahoo/ml/tf/TFCluster.py:279)
17/05/18 03:14:01 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/05/18 03:14:01 INFO scheduler.DAGScheduler: Missing parents: List()
17/05/18 03:14:01 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (PythonRDD[1] at foreachPartition at /tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/tfspark.zip/com/yahoo/ml/tf/TFCluster.py:279), which has no missing parents
17/05/18 03:14:01 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 18.5 KB, free 366.3 MB)
17/05/18 03:14:01 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 12.5 KB, free 366.3 MB)
17/05/18 03:14:01 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.3.1.12:49987 (size: 12.5 KB, free: 366.3 MB)
17/05/18 03:14:01 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1012
17/05/18 03:14:01 INFO scheduler.DAGScheduler: Submitting 4 missing tasks from ResultStage 0 (PythonRDD[1] at foreachPartition at /tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/tfspark.zip/com/yahoo/ml/tf/TFCluster.py:279)
17/05/18 03:14:01 INFO cluster.YarnClusterScheduler: Adding task set 0.0 with 4 tasks
17/05/18 03:14:02 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, gpu11-ib.cluster, partition 0, PROCESS_LOCAL, 5594 bytes)
17/05/18 03:14:02 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, gpu12-ib.cluster, partition 1, PROCESS_LOCAL, 5594 bytes)
17/05/18 03:14:02 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, gpu13-ib.cluster, partition 2, PROCESS_LOCAL, 5594 bytes)
17/05/18 03:14:02 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, gpu10-ib.cluster, partition 3, PROCESS_LOCAL, 5594 bytes)
17/05/18 03:14:02 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Launching task 0 on executor id: 2 hostname: gpu11-ib.cluster.
17/05/18 03:14:02 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Launching task 1 on executor id: 4 hostname: gpu12-ib.cluster.
17/05/18 03:14:02 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Launching task 2 on executor id: 3 hostname: gpu13-ib.cluster.
17/05/18 03:14:02 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Launching task 3 on executor id: 1 hostname: gpu10-ib.cluster.
17/05/18 03:14:02 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on gpu10-ib.cluster:43050 (size: 12.5 KB, free: 5.7 GB)
17/05/18 03:14:02 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on gpu11-ib.cluster:34736 (size: 12.5 KB, free: 5.7 GB)
17/05/18 03:14:02 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on gpu13-ib.cluster:39205 (size: 12.5 KB, free: 5.7 GB)
17/05/18 03:14:02 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on gpu12-ib.cluster:52469 (size: 12.5 KB, free: 5.7 GB)
17/05/18 03:14:07 WARN scheduler.TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, gpu10-ib.cluster): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000002/pyspark.zip/pyspark/worker.py", line 172, in main
process()
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000002/pyspark.zip/pyspark/worker.py", line 167, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 317, in func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 762, in func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/tfspark.zip/com/yahoo/ml/tf/TFSparkNode.py", line 421, in _mapfn
File "cifar10_multi_gpu_train.py", line 271, in main_fun
File "/home/javed.19/Python/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 432, in delete_recursively
pywrap_tensorflow.DeleteRecursively(compat.as_bytes(dirname), status)
File "/home/javed.19/Python/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/home/javed.19/Python/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
UnknownError: hdfs://default/user/javed.19/cifar10_train/events.out.tfevents.1495091646.gpu13.cluster; Input/output error
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/05/18 03:14:07 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, gpu12-ib.cluster): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000005/pyspark.zip/pyspark/worker.py", line 172, in main
process()
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000005/pyspark.zip/pyspark/worker.py", line 167, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 317, in func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 762, in func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/tfspark.zip/com/yahoo/ml/tf/TFSparkNode.py", line 421, in _mapfn
File "cifar10_multi_gpu_train.py", line 271, in main_fun
File "/home/javed.19/Python/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 432, in delete_recursively
pywrap_tensorflow.DeleteRecursively(compat.as_bytes(dirname), status)
File "/home/javed.19/Python/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/home/javed.19/Python/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
UnknownError: hdfs://default/user/javed.19/cifar10_train/events.out.tfevents.1495091646.gpu13.cluster; Input/output error
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/05/18 03:14:07 INFO scheduler.TaskSetManager: Starting task 1.1 in stage 0.0 (TID 4, gpu12-ib.cluster, partition 1, PROCESS_LOCAL, 5594 bytes)
17/05/18 03:14:07 INFO scheduler.TaskSetManager: Starting task 3.1 in stage 0.0 (TID 5, gpu10-ib.cluster, partition 3, PROCESS_LOCAL, 5594 bytes)
17/05/18 03:14:07 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Launching task 4 on executor id: 4 hostname: gpu12-ib.cluster.
17/05/18 03:14:07 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Launching task 5 on executor id: 1 hostname: gpu10-ib.cluster.
17/05/18 03:14:09 WARN scheduler.TaskSetManager: Lost task 1.1 in stage 0.0 (TID 4, gpu12-ib.cluster): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000005/pyspark.zip/pyspark/worker.py", line 172, in main
process()
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000005/pyspark.zip/pyspark/worker.py", line 167, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 317, in func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 762, in func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/tfspark.zip/com/yahoo/ml/tf/TFSparkNode.py", line 421, in _mapfn
File "cifar10_multi_gpu_train.py", line 271, in main_fun
File "/home/javed.19/Python/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 432, in delete_recursively
pywrap_tensorflow.DeleteRecursively(compat.as_bytes(dirname), status)
File "/home/javed.19/Python/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/home/javed.19/Python/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
UnknownError: hdfs://default/user/javed.19/cifar10_train; Input/output error
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/05/18 03:14:09 INFO scheduler.TaskSetManager: Starting task 1.2 in stage 0.0 (TID 6, gpu12-ib.cluster, partition 1, PROCESS_LOCAL, 5594 bytes)
17/05/18 03:14:09 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Launching task 6 on executor id: 4 hostname: gpu12-ib.cluster.
17/05/18 03:14:09 WARN scheduler.TaskSetManager: Lost task 3.1 in stage 0.0 (TID 5, gpu10-ib.cluster): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000002/pyspark.zip/pyspark/worker.py", line 172, in main
process()
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000002/pyspark.zip/pyspark/worker.py", line 167, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 317, in func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/pyspark.zip/pyspark/rdd.py", line 762, in func
File "/tmp/hadoop-javed.19/nm-local-dir/usercache/javed.19/appcache/application_1495091612438_0001/container_1495091612438_0001_01_000001/tfspark.zip/com/yahoo/ml/tf/TFSparkNode.py", line 421, in _mapfn
File "cifar10_multi_gpu_train.py", line 271, in main_fun
File "/home/javed.19/Python/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 432, in delete_recursively
pywrap_tensorflow.DeleteRecursively(compat.as_bytes(dirname), status)
File "/home/javed.19/Python/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/home/javed.19/Python/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
UnknownError: hdfs://default/user/javed.19/cifar10_train/events.out.tfevents.1495091647.gpu11.cluster; Input/output error
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/05/18 03:14:09 INFO scheduler.TaskSetManager: Starting task 3.2 in stage 0.0 (TID 7, gpu10-ib.cluster, partition 3, PROCESS_LOCAL, 5594 bytes)
17/05/18 03:14:09 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Launching task 7 on executor id: 1 hostname: gpu10-ib.cluster.
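In case it helps: judging from the traceback, line 271 of cifar10_multi_gpu_train.py is presumably the stock CIFAR-10 tutorial's train-dir reset, which in this setup every executor runs against the same shared HDFS path (hdfs://default/user/javed.19/cifar10_train), so concurrent DeleteRecursively calls could be racing. Below is a minimal sketch of what I believe is happening, plus a guard I'm considering; maybe_reset_train_dir is a hypothetical helper of mine, not part of the script:

```python
import tensorflow as tf

# Assumed failing pattern (based on the stock CIFAR-10 tutorial code):
# every executor resets the same shared train_dir on HDFS.
#
#   if tf.gfile.Exists(FLAGS.train_dir):
#       tf.gfile.DeleteRecursively(FLAGS.train_dir)  # <- line 271 in the traceback
#   tf.gfile.MakeDirs(FLAGS.train_dir)

# Hypothetical guard: let only the chief task (index 0) clear the directory,
# so the other executors never issue a concurrent DeleteRecursively on HDFS.
def maybe_reset_train_dir(train_dir, task_index):
    if task_index != 0:
        return
    if tf.gfile.Exists(train_dir):
        tf.gfile.DeleteRecursively(train_dir)
    tf.gfile.MakeDirs(train_dir)
```

Does that explanation fit the Input/output errors above, or is something else going on with HDFS at this scale?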