
docker-spark's Introduction

Apache Spark on Docker


This repository contains a Dockerfile to build a Docker image with Apache Spark. It builds on our previous Hadoop Docker image, available on the SequenceIQ GitHub page. The base Hadoop Docker image is also available as an official Docker image.

Pull the image from the Docker repository

docker pull sequenceiq/spark:1.6.0

Building the image

docker build --rm -t sequenceiq/spark:1.6.0 .

Running the image

  • if using boot2docker make sure your VM has more than 2GB memory
  • in your /etc/hosts file add $(boot2docker ip) as host 'sandbox' to make it easier to access your sandbox UI (a one-liner sketch follows the run commands below)
  • open yarn UI ports when running container
docker run -it -p 8088:8088 -p 8042:8042 -p 4040:4040 -h sandbox sequenceiq/spark:1.6.0 bash

or

docker run -d -h sandbox sequenceiq/spark:1.6.0
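
If you want to add the 'sandbox' host entry mentioned above, a one-liner like the following does it. This is only a sketch, assuming boot2docker is on your PATH; it appends to /etc/hosts, so review it before running.

# append the boot2docker VM's IP as host 'sandbox'
echo "$(boot2docker ip) sandbox" | sudo tee -a /etc/hosts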

Versions

Hadoop 2.6.0 and Apache Spark v1.6.0 on CentOS

Testing

There are two deploy modes that can be used to launch Spark applications on YARN.

YARN-client mode

In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

# run the spark shell
spark-shell \
--master yarn-client \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1

# execute the following command, which should return 1000
scala> sc.parallelize(1 to 1000).count()

YARN-cluster mode

In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application.

Estimating Pi (yarn-cluster mode):

# execute the following command, which should write "Pi is roughly 3.1418" into the logs
# note: you must specify the --files argument in cluster mode to enable metrics
spark-submit \
--class org.apache.spark.examples.SparkPi \
--files $SPARK_HOME/conf/metrics.properties \
--master yarn-cluster \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
$SPARK_HOME/lib/spark-examples-1.6.0-hadoop2.6.0.jar
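
Because the driver runs inside the ApplicationMaster in cluster mode, the Pi line ends up in the YARN container logs. One way to find it afterwards is sketched below; this assumes YARN log aggregation is enabled, otherwise browse the NodeManager UI on port 8042. Replace the application id with the one printed by spark-submit.

# fetch the aggregated logs for the finished application and grep for the result
yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX | grep "Pi is roughly"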

Estimating Pi (yarn-client mode):

# execute the following command, which should print "Pi is roughly 3.1418" to the screen
spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-client \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
$SPARK_HOME/lib/spark-examples-1.6.0-hadoop2.6.0.jar

Submitting from the outside of the container

To use Spark from outside the container, set the YARN_CONF_DIR environment variable to a directory containing a configuration appropriate for the Dockerized cluster. The repository ships such a configuration in the yarn-remote-client directory.

export YARN_CONF_DIR="`pwd`/yarn-remote-client"

The HDFS inside the container can be accessed only by root. When submitting Spark applications from outside the cluster as a user other than root, set the HADOOP_USER_NAME variable so that the root user is used.

export HADOOP_USER_NAME=root
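
Putting the two exports together, a submission from outside the container might look like the sketch below. This assumes a local Spark 1.6.0 client installation and that the container's ports are reachable from the host; adjust paths as needed.

export YARN_CONF_DIR="`pwd`/yarn-remote-client"
export HADOOP_USER_NAME=root

spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-client \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
$SPARK_HOME/lib/spark-examples-1.6.0-hadoop2.6.0.jar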

docker-spark's People

Contributors

akanto, biswajitsahu, int32bit, joshjdevl, lalyos, lukeforehand, mmrezaie, neil-rubens, oleewere, pjfanning, puhrez, smungee, teramonagi, wjur


docker-spark's Issues

when I build it,

liudeMacBook-Pro:~ liu$ docker build --rm -t sequenceiq/spark:1.6.0 .
unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat /Users/liu/Dockerfile: no such file or directory

Should I give the path? But I don't know where it is.
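
A likely fix, sketched below: docker build looks for a Dockerfile in the build context (the trailing .), so run the command from a clone of this repository rather than from your home directory.

git clone https://github.com/sequenceiq/docker-spark.git
cd docker-spark
docker build --rm -t sequenceiq/spark:1.6.0 .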

Can we get the latest version of Spark (1.2) on Ubuntu?

Thanks again for these images!

I'm astonished to learn that CentOS 6.5 (used in the latest Spark 1.2.0 tag at https://registry.hub.docker.com/u/sequenceiq/spark/tags/manage/) apparently remains in the dark ages of Python, since it has Python version 2.6, and it seems that upgrading to 2.7 breaks the yum installer. That means that the best way to run python (ipython notebook) only works with a version that is years out of date, and is painful to install:

Big Mac data: Pain in the culo setting up IPython Centos 6.5
http://bigmacdata.blogspot.com/2015/01/pain-in-culo-setting-up-ipython-centos.html

So I'd once again love to see an Ubuntu build of Spark 1.2.0. (like with sequenceiq/hadoop-docker#16).

Thanks!

Running example throws java exception.

Running the suggested example:
./bin/spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1

Throws the following java exception:

java.net.ConnectException: Call From sandbox/172.16.0.6 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

Running the subsequent command produces the following:

scala> sc.parallelize(1 to 1000).count()
<console>:11: error: not found: value sc
               sc.parallelize(1 to 1000).count()
               ^
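
The "Connection refused" on localhost:9000 usually means the HDFS daemons are not running inside the container. A quick check, sketched under that assumption (the bootstrap script path comes from the image's BOOTSTRAP environment variable):

# inside the container: NameNode, DataNode, ResourceManager and NodeManager should all be listed
jps
# if they are missing, re-run the bootstrap script and retry spark-shell
# (the -bash flag is an assumption borrowed from the base hadoop-docker image)
/etc/bootstrap.sh -bash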

Support Java 8

I was trying to run lambda expressions in a Java job and noticed the image is running Java 1.5. That's really old. Is there any reason it can't be updated to 1.8?

ps - image works great other than that.

unable to submit multiple applications

Hi,

I'm able to run a single application fine. Though if I submit two different applications, only the first application is RUNNING. I'm using yarn-cluster, so not sure why two applications can't run.

Thanks

Cannot start spark

Hello,

I am able to successfully pull/run the container and I can confirm hadoop is indeed working. However, any attempt to start spark (spark-shell or other) results in a hang and I cannot find logs that have any information indicating what might be going wrong. Unsure if any of the below helps:

  • I have created a 4GB VM and am using it
  • My physical machine is Win7
  • I am using DockerToolbox-1.9.1b
  • /etc/hosts has 'sandbox' resolving to the IP of the 4 GB VM

In another experiment, I tried running start-all.sh and saw the following exception in the log:

Exception in thread "main" java.lang.ClassFormatError: org.apache.spark.launcher.Main (unrecognized class file version)
at java.lang.VMClassLoader.defineClass(libgcj.so.10)
at java.lang.ClassLoader.defineClass(libgcj.so.10)
at java.security.SecureClassLoader.defineClass(libgcj.so.10)
at java.net.URLClassLoader.findClass(libgcj.so.10)
at java.lang.ClassLoader.loadClass(libgcj.so.10)
at java.lang.ClassLoader.loadClass(libgcj.so.10)
at gnu.java.lang.MainThread.run(libgcj.so.10)

This made me think it might be using the wrong JVM (perhaps above is GNU java 1.5?). But the hadoop process is indeed using Java 1.7.

Any help or advice on where to see what might be going wrong much appreciated!
Thanks.
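
The libgcj frames in that trace suggest the GNU gcj "java" is first on the PATH rather than the JDK the image ships (the image sets JAVA_HOME=/usr/java/default). A sketch of how to check and correct this inside the container:

# check which java is being picked up
which java
java -version
# if it is gcj, put the bundled JDK first on the PATH
export JAVA_HOME=/usr/java/default
export PATH=$JAVA_HOME/bin:$PATH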

yum install wget failing with non-zero code: 1

I tried installing wget in my Docker image based on sequenceiq and I keep getting the following error.

My command in the Dockerfile

RUN yum update && yum install wget -y

(note: also tried with just yum install without the update first)

Output:

The command '/bin/sh -c yum update && yum install wget -y' returned a non-zero code: 1

Full trace of related output:

Step 14/15 : RUN yum install wget -y
 ---> Running in bc7a6c2fae7a
Loaded plugins: fastestmirror, keys, protect-packages, protectbase
Determining fastest mirrors
 * base: mirror.lug.udel.edu
 * epel: mirror.us.leaseweb.net
 * extras: mirror.cogentco.com
 * updates: mirror.linux.duke.edu
0 packages excluded due to repository protections
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package wget.x86_64 0:1.12-10.el6 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
 Package         Arch              Version                Repository       Size
================================================================================
Installing:
 wget            x86_64            1.12-10.el6            base            484 k

Transaction Summary
================================================================================
Install       1 Package(s)

Total download size: 484 k
Installed size: 1.8 M
Downloading Packages:
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing : wget-1.12-10.el6.x86_64                                      1/1

Rpmdb checksum is invalid: dCDPT(pkg checksums): wget.x86_64 0:1.12-10.el6 - u
 
The command '/bin/sh -c yum install wget -y' returned a non-zero code: 1
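
"Rpmdb checksum is invalid" is a known problem with yum running on an overlay storage driver rather than a problem with the package itself. A commonly cited workaround, offered here only as a sketch under the assumption that your Docker host uses overlayfs, is to touch the rpmdb files in the same layer:

RUN touch /var/lib/rpm/* && yum install -y wget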

Get permission denied when launching examples

Get the following error when launching a clean docker container and trying any of the examples:

bash-4.1# spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1 ./lib/spark-examples-1.1.0-hadoop2.4.0.jar
...
14/11/06 18:16:22 INFO yarn.Client: Got cluster metric info from ResourceManager, number of NodeManagers: 1
14/11/06 18:16:22 INFO yarn.Client: Max mem capabililty of a single resource in this cluster 8192
14/11/06 18:16:22 INFO yarn.Client: Preparing Local resources
Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=hdfs, access=WRITE, inode="/user":root:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
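
The README's note about HDFS being accessible only by root applies here as well: the submission runs as user hdfs while /user is owned by root. A minimal sketch of the workaround:

# submit as the root HDFS user (the same trick the README uses for remote clients)
export HADOOP_USER_NAME=root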

sparkR command does not work

I downloaded docker-spark images via docker pull command and run it with

docker run -it sequenceiq/spark:1.4.0 bash

After that, I found "sparkR" was already in /usr/local/spark/bin.
I wanted to try it and got the following result:

bash-4.1# sparkR
env: R: No such file or directory

It seems we have to add "install R"-like commands to the Dockerfile.
Do you have any plan to install R in this Docker image in the near future?
(Would it be possible for me to write that and make a PR to this repository?)
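
Until R ships in the image, a possible workaround is to install it inside the container yourself. R is available from EPEL on CentOS 6, and the yum output elsewhere on this page suggests the EPEL repository is already enabled; treat the exact steps as a sketch:

# R lives in EPEL; install it and retry sparkR
yum install -y epel-release
yum install -y R
sparkR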

What is the password when logging in over SSH?

I am running the Docker container and exposing port 2122. When I connect to the container over SSH with ssh root@10.10.8.166 -p 2122, it prompts me with:

$ ssh root@10.10.8.166 -p 2122
The authenticity of host '[10.10.8.166]:2122 ([10.10.8.166]:2122)' can't be established.
RSA key fingerprint is SHA256:a8Zv0AK3cp/h4JPs/hKDzyhZ5OxA6PURFxoQW2nILjw.
Are you sure you want to continue connecting (yes/no)? yes

Warning: Permanently added '[10.10.8.166]:2122' (RSA) to the list of known hosts.
root@10.10.8.166's password:

What root password should I enter?

Cannot run the spark shell

I can't get to the prompt; the Application report for application_1469690833305_0001 (state: ACCEPTED) message keeps repeating, as follows:

$ docker run -it -p 8088:8088 -p 8042:8042 -p 4040:4040 -h sandbox sequenceiq/spark:1.6.0 bash
...
bash-4.1# spark-shell \
> --master yarn-client \
> --driver-memory 1g \
> --executor-memory 1g \
> --executor-cores 1
...
16/07/28 03:29:23 INFO yarn.Client: Application report for application_1469690833305_0001 (state: ACCEPTED)
16/07/28 03:29:24 INFO yarn.Client: Application report for application_1469690833305_0001 (state: ACCEPTED)
16/07/28 03:29:25 INFO yarn.Client: Application report for application_1469690833305_0001 (state: ACCEPTED)
...
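
An application stuck in ACCEPTED usually means YARN cannot find enough free memory or vcores to start the ApplicationMaster (hence the hint above about giving the VM more than 2GB). A couple of commands to confirm that, sketched here:

# inside the container: see pending applications and how much memory each NodeManager offers
yarn application -list
yarn node -list -all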

Container terminates after 1-2 minutes

Hi,

my container (*) always terminates after a couple of minutes:

Is this expected? Do you have further information how to debug it?

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e3e36cf709bf sequenceiq/spark:1.5.1 "/etc/bootstrap.sh" 2 minutes ago Exited (0) 5 seconds ago elated_jones

Many thanks for help,
Dennis

(*)
docker run -d -h sandbox sequenceiq/spark:1.5.1
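
A first debugging step, sketched below: since the container exited with code 0, look at what the bootstrap printed and check whether the image's default command is expected to stay in the foreground.

docker logs e3e36cf709bf
docker inspect --format '{{.Config.Cmd}}' sequenceiq/spark:1.5.1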

Can't install packages

  1. sudo docker run -i -t sequenceiq/spark:latest /bin/bash
  2. uname -a :
    Linux xxx 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

This implies the image is built from an Ubuntu image. But when in the Dockerfile I use this:
RUN apt-get update
It says apt-get : command not found

Where am I going wrong? I need to install a lot of packages using this. What's the workaround?
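
For what it's worth, uname reports the host kernel, not the container's distribution; the image is CentOS-based (see the Versions section above), so yum rather than apt-get is the package manager to use. A quick check and an example install:

# confirm the base distro inside the container, then install packages with yum
cat /etc/redhat-release
yum install -y wget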

Hadoop "Unable to load native-hadoop library for your platform" error

I originally posted this issue on SO, but nobody answered it, so I hope I can get an answer here. Thanks!

I am using docker-spark 1.3.0. After starting spark-shell, it outputs:

15/05/21 04:28:22 DEBUG NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError:no hadoop in java.library.path
15/05/21 04:28:22 DEBUG NativeCodeLoader: java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib

The environment variables of this spark container are:

bash-4.1# export
declare -x BOOTSTRAP="/etc/bootstrap.sh"
declare -x HADOOP_COMMON_HOME="/usr/local/hadoop"
declare -x HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
declare -x HADOOP_HDFS_HOME="/usr/local/hadoop"
declare -x HADOOP_MAPRED_HOME="/usr/local/hadoop"
declare -x HADOOP_PREFIX="/usr/local/hadoop"
declare -x HADOOP_YARN_HOME="/usr/local/hadoop"
declare -x HOME="/"
declare -x HOSTNAME="sandbox"
declare -x JAVA_HOME="/usr/java/default"
declare -x OLDPWD
declare -x PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/java/default/bin:/usr/local/spark/bin:/usr/local/hadoop/bin"
declare -x PWD="/"
declare -x SHLVL="3"
declare -x SPARK_HOME="/usr/local/spark"
declare -x SPARK_JAR="hdfs:///spark/spark-assembly-1.3.0-hadoop2.4.0.jar"
declare -x TERM="xterm"
declare -x YARN_CONF_DIR="/usr/local/hadoop/etc/hadoop"

After reading Hadoop "Unable to load native-hadoop library for your platform" error on CentOS, I have done the following:

(1) Check the hadoop library:

bash-4.1# file /usr/local/hadoop/lib/native/libhadoop.so.1.1.0
/usr/local/hadoop/lib/native/libhadoop.so.1.0.0: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, not stripped

Yes, it is 64-bit library.

(2) Try adding the HADOOP_OPTS environment variable:

export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native"

It doesn't work, and reports the same error.

(3) Try adding the HADOOP_OPTS and HADOOP_COMMON_LIB_NATIVE_DIR environment variable:

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

It still doesn't work, and reports the same error.

Could anyone give some clues about the issue?
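
Not a definitive fix, but one more thing commonly suggested for this warning (which is itself usually harmless) is to point the dynamic linker at the native libraries directly before starting spark-shell:

export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native:$LD_LIBRARY_PATH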

What is yarn-remote-client used for? Is it useless?

There are two files (core-site.xml and yarn-site.xml) added to the Spark home folder. What are they used for? I don't see where they are used.

Actually, HDFS and YARN are configured using the core-site.xml and yarn-site.xml inside HADOOP_CONF_DIR, right?
Thanks

stateful python streaming examples fail

Thanks so much for doing this; this Docker image has come in handy getting me going with Spark.

I tried the streaming examples with

spark-submit /usr/local/spark/examples/src/main/python/streaming/network_wordcount.py 172.17.42.1 9999 2> out.txt

And it works great but when I run

spark-submit /usr/local/spark/examples/src/main/python/streaming/stateful_network_wordcount.py 172.17.42.1 9999 2> out.txt

I don't get any results and the errors I get are

15/03/16 20:09:23 ERROR scheduler.JobScheduler: Error running job streaming job 1426550835000 ms.0
py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: null
    at py4j.Protocol.getReturnValue(Protocol.java:417)
--
15/03/16 20:09:23 ERROR scheduler.JobScheduler: Error running job streaming job  1426550836000 ms.0
py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: null
    at py4j.Protocol.getReturnValue(Protocol.java:417)
--
15/03/16 20:09:23 ERROR scheduler.JobScheduler: Error running job streaming job 1426550837000 ms.0
py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: null
    at py4j.Protocol.getReturnValue(Protocol.java:417)
--
15/03/16 20:09:23 ERROR scheduler.JobScheduler: Error running job streaming job 1426550838000 ms.0
py4j.Py4JException: Error while obtaining a new communication channel
    at py4j.CallbackClient.getConnectionLock(CallbackClient.java:155)
--
15/03/16 20:09:23 ERROR scheduler.JobScheduler: Error running job streaming job 1426550839000 ms.0
py4j.Py4JException: Error while obtaining a new communication channel
    at py4j.CallbackClient.getConnectionLock(CallbackClient.java:155)
--
15/03/16 20:09:23 ERROR scheduler.JobScheduler: Error running job streaming job 1426550840000 ms.0
py4j.Py4JException: Error while obtaining a new communication channel
    at py4j.CallbackClient.getConnectionLock(CallbackClient.java:155)

Apologies if this is just a misunderstanding of how I need to run this in the context of docker or a simple flag I need to add to the spark-submit command. I'm new to spark (and docker) and I'm unclear about what's failing. The recoverable_network_wordcount.py failed as well.

Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/":heriipurnama:supergroup:drwxr-xr-x

Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/":heriipurnama:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:238)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:179)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6512)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6494)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6446)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4248)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4218)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4191)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2755)
at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2724)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:870)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:866)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:866)
at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:859)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:133)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:437)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1314)
at mr.WordCount.main(WordCount.java:87)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=WRITE, inode="/":heriipurnama:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:238)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:179)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6512)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6494)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6446)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4248)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4218)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4191)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy9.mkdirs(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:539)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy10.mkdirs(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2753)
... 22 more
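
This trace comes from an HDFS whose / is owned by user heriipurnama, so a job submitted as root cannot create its staging directory. Two workarounds, sketched under the assumption that you control that HDFS: submit as the owning user, or have the superuser create a writable /user/root first.

# option 1: impersonate the HDFS owner shown in the error
export HADOOP_USER_NAME=heriipurnama

# option 2: as the HDFS superuser, prepare a home directory for root
hdfs dfs -mkdir -p /user/root
hdfs dfs -chown root:root /user/root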

Using KafkaUtils doesn't work

I'm getting this error:

>>> directKafkaStream = KafkaUtils.createDirectStream(ssc, ["help_center.activity.events"], {"metadata.broker.list": "kafka.service.consul:9092"})

________________________________________________________________________________________________

  Spark Streaming's Kafka libraries not found in class path. Try one of the following.

  1. Include the Kafka library and its dependencies with in the
     spark-submit command as

     $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka:1.5.1 ...

  2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
     Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-assembly, Version = 1.5.1.
     Then, include the jar in the spark-submit command as

     $ bin/spark-submit --jars <spark-streaming-kafka-assembly.jar> ...

________________________________________________________________________________________________


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/spark/python/pyspark/streaming/kafka.py", line 130, in createDirectStream
    raise e
py4j.protocol.Py4JJavaError: An error occurred while calling o21.loadClass.
: java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConne

Would there be a downside to including the necessary libraries in the image?
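
Until the Kafka libraries are baked into the image, the error message's own first suggestion works from inside the container. A sketch, where my_streaming_job.py is a hypothetical script name and the version must match the image's Spark (1.5.1 here):

spark-submit \
--packages org.apache.spark:spark-streaming-kafka:1.5.1 \
my_streaming_job.py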

Using pyspark in standalone scripts

How can I import pyspark to create a SparkContext in a standalone script?

Running

PYTHONPATH=/usr/local/spark/python python -c 'import pyspark'

fails:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/spark/python/pyspark/__init__.py", line 41, in <module>
    from pyspark.context import SparkContext
  File "/usr/local/spark/python/pyspark/context.py", line 31, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/usr/local/spark/python/pyspark/java_gateway.py", line 31, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
ImportError: No module named py4j.java_gateway

And indeed py4j only exists as a zip file under $SPARK_HOME; it is not "installed".
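
A sketch of a workaround: py4j ships as a zip under $SPARK_HOME/python/lib, and adding that zip to PYTHONPATH alongside the python directory usually makes the import work. The exact zip file name depends on the Spark version, so list the directory first and substitute the name you see; the one below is only an example.

ls $SPARK_HOME/python/lib
# substitute the actual py4j zip name from the listing above
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip
python -c 'import pyspark'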

Spark_v1.3.0-ubuntu with scala_v2.11.4

I'm currently using the spark:1.3.0-ubuntu image built against scala_2.10.4
Can we please have the same image built against scala_2.11.4 as we are facing dependency failure issues.

Failure for TFSparkNode.mgr is NULL

Hello teams,

I am running TFoS on a Hadoop cluster. Everything goes well, but during the training step it hangs and hits the error below. Could you have a look? Thanks.

File "/root/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1506665300625_0002/container_1506665300625_0002_01_000008/tfspark.zip/tensorflowonspark/TFSparkNode.py", line 66, in _get_manager
logging.info("Connected to TFSparkNode.mgr on {0}, ppid={1}, state={2}".format(host, ppid, str(TFSparkNode.mgr.get('state'))))
AttributeError: 'NoneType' object has no attribute 'get'

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more

================================================

[Feature] Cluster mode

How hard would it be to add support for running Spark apps in a Dockerized YARN cluster, at least for development?

Fail to load local file

I was trying to read a local file with sc.textFile("./README.md") in the Spark container, but it fails and throws an exception referring to hdfs://README.md, which is really not what we want.
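
That behaviour is expected: inside the image the default filesystem is HDFS, so relative paths resolve against hdfs://. Two sketched ways around it, using either an explicit file:// URI or copying the file into HDFS first:

# option 1: copy the file into HDFS and read it from there
hdfs dfs -put README.md /README.md
# then in the shell: sc.textFile("hdfs:///README.md")

# option 2: keep it local with an explicit URI, e.g. sc.textFile("file:///usr/local/spark/README.md")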

Licensing

It looks like many of your other open source codebases are Apache 2.0.

strange error

Hi, I get the following error (seemingly at random; sometimes it works). Any idea why?

$ docker run -it -p 8088:8088 -p 8042:8042 -p 4040:4040 -h sandbox sequenceiq/spark:2.1.0 bash
/
Starting sshd:                                             [  OK  ]
Starting namenodes on [sandbox]
sandbox: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-sandbox.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-sandbox.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-sandbox.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-sandbox.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-sandbox.out
chown: missing operand after `/usr/local/hadoop/logs'
Try `chown --help' for more information.
starting historyserver, logging to /usr/local/hadoop/logs/mapred--historyserver-sandbox.out
bash-4.1# 
bash-4.1# 
bash-4.1# spark-submit --class org.apache.spark.examples.SparkPi \
>              --master yarn \
>              --driver-memory 1g \
>              --executor-memory 1g \
>              --executor-cores 1 \
>              $SPARK_HOME/examples/jars/spark-examples*.jar
17/12/08 11:05:57 INFO spark.SparkContext: Running Spark version 2.1.0
17/12/08 11:05:57 WARN spark.SparkContext: Support for Java 7 is deprecated as of Spark 2.0.0
17/12/08 11:05:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/12/08 11:05:58 INFO spark.SecurityManager: Changing view acls to: root
17/12/08 11:05:58 INFO spark.SecurityManager: Changing modify acls to: root
17/12/08 11:05:58 INFO spark.SecurityManager: Changing view acls groups to: 
17/12/08 11:05:58 INFO spark.SecurityManager: Changing modify acls groups to: 
17/12/08 11:05:58 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
17/12/08 11:05:59 INFO util.Utils: Successfully started service 'sparkDriver' on port 32857.
17/12/08 11:05:59 INFO spark.SparkEnv: Registering MapOutputTracker
17/12/08 11:05:59 INFO spark.SparkEnv: Registering BlockManagerMaster
17/12/08 11:05:59 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/12/08 11:05:59 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/12/08 11:05:59 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-9ef99bf3-adf8-4299-b473-1f220542b8d2
17/12/08 11:05:59 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
17/12/08 11:05:59 INFO spark.SparkEnv: Registering OutputCommitCoordinator
17/12/08 11:05:59 INFO util.log: Logging initialized @2813ms
17/12/08 11:05:59 INFO server.Server: jetty-9.2.z-SNAPSHOT
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@29968f06{/jobs,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5b87e83e{/jobs/json,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@37a06d64{/jobs/job,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@56ddcc4{/jobs/job/json,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6fb8caa4{/stages,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4d000e49{/stages/json,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3eaa021d{/stages/stage,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@b70de0f{/stages/stage/json,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1f02b0a7{/stages/pool,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@699bb3d8{/stages/pool/json,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6d3c6012{/storage,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@16c775c5{/storage/json,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@104e432{/storage/rdd,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@68218f23{/storage/rdd/json,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@733c783d{/environment,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6fa27e6{/environment/json,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1151709e{/executors,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@79b89df3{/executors/json,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4789faf3{/executors/threadDump,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@33ba8c36{/executors/threadDump/json,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1c4b47c2{/static,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@12542011{/,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5105457d{/api,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@31153b19{/jobs/job/kill,null,AVAILABLE}
17/12/08 11:05:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@68daff7b{/stages/stage/kill,null,AVAILABLE}
17/12/08 11:05:59 INFO server.ServerConnector: Started ServerConnector@433c482{HTTP/1.1}{0.0.0.0:4040}
17/12/08 11:05:59 INFO server.Server: Started @3037ms
17/12/08 11:05:59 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
17/12/08 11:05:59 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://172.17.0.2:4040
17/12/08 11:05:59 INFO spark.SparkContext: Added JAR file:/usr/local/spark/examples/jars/spark-examples_2.11-2.1.0.jar at spark://172.17.0.2:32857/jars/spark-examples_2.11-2.1.0.jar with timestamp 1512749159790
17/12/08 11:06:00 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/12/08 11:06:01 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers
17/12/08 11:06:01 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
17/12/08 11:06:01 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
17/12/08 11:06:01 INFO yarn.Client: Setting up container launch context for our AM
17/12/08 11:06:01 INFO yarn.Client: Setting up the launch environment for our AM container
17/12/08 11:06:01 INFO yarn.Client: Preparing resources for our AM container
17/12/08 11:06:02 WARN yarn.Client: Failed to cleanup staging dir hdfs://sandbox:9000/user/root/.sparkStaging/application_1512749152998_0001
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot delete /user/root/.sparkStaging/application_1512749152998_0001. Name node is in safe mode.
The reported blocks 238 has reached the threshold 0.9990 of total blocks 238. The number of live datanodes 1 has reached the minimum number 0. In safe mode extension. Safe mode will be turned off automatically in 9 seconds.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1327)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3674)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:946)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:611)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

	at org.apache.hadoop.ipc.Client.call(Client.java:1475)
	at org.apache.hadoop.ipc.Client.call(Client.java:1412)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy12.delete(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.delete(ClientNamenodeProtocolTranslatorPB.java:540)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy13.delete(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:2044)
	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:707)
	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:703)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:714)
	at org.apache.spark.deploy.yarn.Client.cleanupStagingDir(Client.scala:194)
	at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:180)
	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
	at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
	at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
	at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/12/08 11:06:02 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /user/root/.sparkStaging/application_1512749152998_0001. Name node is in safe mode.
The reported blocks 238 has reached the threshold 0.9990 of total blocks 238. The number of live datanodes 1 has reached the minimum number 0. In safe mode extension. Safe mode will be turned off automatically in 9 seconds.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1327)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3855)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:977)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

	at org.apache.hadoop.ipc.Client.call(Client.java:1475)
	at org.apache.hadoop.ipc.Client.call(Client.java:1412)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy12.mkdirs(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:558)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy13.mkdirs(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:3000)
	at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2970)
	at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1047)
	at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1043)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:1061)
	at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:1036)
	at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1881)
	at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:600)
	at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:432)
	at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:868)
	at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:170)
	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
	at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
	at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
	at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/12/08 11:06:02 INFO server.ServerConnector: Stopped ServerConnector@433c482{HTTP/1.1}{0.0.0.0:4040}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@68daff7b{/stages/stage/kill,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@31153b19{/jobs/job/kill,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5105457d{/api,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@12542011{/,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@1c4b47c2{/static,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@33ba8c36{/executors/threadDump/json,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4789faf3{/executors/threadDump,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@79b89df3{/executors/json,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@1151709e{/executors,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@6fa27e6{/environment/json,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@733c783d{/environment,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@68218f23{/storage/rdd/json,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@104e432{/storage/rdd,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@16c775c5{/storage/json,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@6d3c6012{/storage,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@699bb3d8{/stages/pool/json,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@1f02b0a7{/stages/pool,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@b70de0f{/stages/stage/json,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@3eaa021d{/stages/stage,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4d000e49{/stages/json,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@6fb8caa4{/stages,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@56ddcc4{/jobs/job/json,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@37a06d64{/jobs/job,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5b87e83e{/jobs/json,null,UNAVAILABLE}
17/12/08 11:06:02 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@29968f06{/jobs,null,UNAVAILABLE}
17/12/08 11:06:02 INFO ui.SparkUI: Stopped Spark web UI at http://172.17.0.2:4040
17/12/08 11:06:02 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
17/12/08 11:06:02 INFO cluster.YarnClientSchedulerBackend: Stopped
17/12/08 11:06:02 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/12/08 11:06:02 INFO memory.MemoryStore: MemoryStore cleared
17/12/08 11:06:02 INFO storage.BlockManager: BlockManager stopped
17/12/08 11:06:02 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
17/12/08 11:06:02 WARN metrics.MetricsSystem: Stopping a MetricsSystem that is not running
17/12/08 11:06:02 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/12/08 11:06:02 INFO spark.SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /user/root/.sparkStaging/application_1512749152998_0001. Name node is in safe mode.
The reported blocks 238 has reached the threshold 0.9990 of total blocks 238. The number of live datanodes 1 has reached the minimum number 0. In safe mode extension. Safe mode will be turned off automatically in 9 seconds.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1327)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3855)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:977)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

	at org.apache.hadoop.ipc.Client.call(Client.java:1475)
	at org.apache.hadoop.ipc.Client.call(Client.java:1412)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy12.mkdirs(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:558)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy13.mkdirs(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:3000)
	at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2970)
	at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1047)
	at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1043)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:1061)
	at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:1036)
	at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1881)
	at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:600)
	at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:432)
	at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:868)
	at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:170)
	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
	at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
	at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
	at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/12/08 11:06:02 INFO util.ShutdownHookManager: Shutdown hook called
17/12/08 11:06:02 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-cb8c0672-ec3a-482a-aad4-fa64cbee6b49
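
The root cause is in the message itself: the NameNode is still in safe mode right after the container starts, and the log even says it will leave safe mode in a few seconds. Waiting briefly and resubmitting is usually enough; for this single-node sandbox you can also force it out, as sketched here:

hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode leave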

Installation Issues with spark-2.3.0-bin-hadoop2.7 on Windows 10

Hi guys,
Can someone help me? It's my first time installing and using Spark. I'm installing spark-2.3.0-bin-hadoop2.7 on Windows 10. I encountered many issues, but now when I run the spark-shell command in CMD I receive a message like the one below:

2018-03-12 03:28:17 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://DESKTOP-TI34DO6:4040
Spark context available as 'sc' (master = local[*], app id = local-1520796534341).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

What exactly should I do, and what does that mean? Do I need to install another library? If so, which one?
Thanks in advance for your help.

web ui

Hi,

If I expose the port and also provide the mapping to 4040, I'm unable to view the web UI. The docs mention that the web UI is started automatically. Is this also the case with this particular docker-spark build?

Thanks

Container hangs on Ubuntu 14.04 Docker version 1.11.1

rpc error: code = 2 desc = "oci runtime error: exec failed: exit status 1

Client:
Version: 1.11.1
API version: 1.23
Go version: go1.5.4
Git commit: 5604cbe
Built: Tue Apr 26 23:30:23 2016
OS/Arch: linux/amd64

Server:
Version: 1.11.1
API version: 1.23
Go version: go1.5.4
Git commit: 5604cbe
Built: Tue Apr 26 23:30:23 2016
OS/Arch: linux/amd64

Support spark 1.6.1 on ubuntu

Please support Spark 1.6.1 on Ubuntu. I have tried to build it myself based on sequenceiq/hadoop-ubuntu:2.6.0, but I always run into problems.

Is there an ETA on a sequenceiq/spark:1.3.1 image?

Currently we have tests running in a Docker container that fail because they use features from 1.3.1, which is the version live on EMR. It would be great to have the same version of Spark in a Docker image as we have in our live environment.

Comes to a halt after a few hours of stream processing

Trying to run "sequenceiq/spark:1.2.1" on Amazon AWS, mainly to process a predictable stream of events. It works for a few hours with constant throughput and constant CPU usage. After a few hours the whole event processing comes to a near halt, with the CPU pegged within a span of a few minutes. At that time the disk reads also increase.

One of the things we noticed is that the Docker container memory usage keeps climbing, even though the stream data is constant and the RDD storage used is minimal/virtually constant.

We used the following command to list Docker memory usage:

for line in `sudo docker ps | awk '{print $1}' | grep -v CONTAINER`; do sudo docker ps | grep $line | awk '{printf $NF" "}' && echo $((`cat /sys/fs/cgroup/memory/docker/$line*/memory.usage_in_bytes` / 1024 / 1024))MB ; done

Not sure why the memory usage is constantly increasing for unchanging streams of data. Could it be that the default YARN/Spark temp data and logs are written to the default RAM disk on Amazon AWS? If so, is there a way to configure all YARN/Spark data to be written to a volume mounted into Docker from outside using "-v"?
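
On the last question: yes, host directories can be mounted into the container with -v and pointed at the YARN/Spark local and log directories. The following is only a sketch; the container-side paths are assumptions and must match hadoop.tmp.dir, yarn.nodemanager.local-dirs and spark.local.dir in the image's configuration.

docker run -d -h sandbox \
-v /mnt/data/hadoop-tmp:/tmp/hadoop-root \
-v /mnt/data/yarn-logs:/usr/local/hadoop/logs \
sequenceiq/spark:1.2.1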
