
cluster-apps-on-docker / spark-standalone-cluster-on-docker


Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. :zap:

License: MIT License

Languages: Shell 9.44%, Dockerfile 10.42%, Jupyter Notebook 80.14%
Topics: spark, docker, python, scala, pyspark, jupyter, r, sparkr


Apache Spark Standalone Cluster on Docker

The project was featured in an article on the official MongoDB tech blog! 😱

The project just got its own article on the Towards Data Science Medium blog! ✨

Introduction

This project gives you an Apache Spark cluster in standalone mode with a JupyterLab interface, built on top of Docker. Learn Apache Spark through its Scala, Python (PySpark) and R (SparkR) APIs by running the Jupyter notebooks with examples on how to read, process and write data.
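For instance, here is a minimal PySpark sketch of that read, process and write flow, assuming the default master URL and executor memory used throughout the project's notebooks (the output path is illustrative only):

from pyspark.sql import SparkSession

# Connect to the standalone master running in the spark-master container.
spark = SparkSession.builder \
    .appName("intro-example") \
    .master("spark://spark-master:7077") \
    .config("spark.executor.memory", "512m") \
    .getOrCreate()

# Read: build a small DataFrame (the bundled notebooks download real datasets).
df = spark.createDataFrame([("alpha", 1), ("beta", 2), ("gamma", 3)],
                           ["name", "value"])

# Process: a trivial transformation executed on the workers.
filtered = df.where(df.value >= 2)

# Write: persist the result (illustrative relative path).
filtered.write.mode("overwrite").parquet("data/example.parquet")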


TL;DR

curl -LO https://raw.githubusercontent.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker/master/docker-compose.yml
docker-compose up

Contents

Quick Start

Cluster overview

Application URL Description
JupyterLab localhost:8888 Cluster interface with built-in Jupyter notebooks
Spark Driver localhost:4040 Spark Driver web UI
Spark Master localhost:8080 Spark Master node
Spark Worker I localhost:8081 Spark Worker node with 1 core and 512m of memory (default)
Spark Worker II localhost:8082 Spark Worker node with 1 core and 512m of memory (default)

Prerequisites

Download from Docker Hub (easier)

  1. Download the docker compose file;
curl -LO https://raw.githubusercontent.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker/master/docker-compose.yml
  2. Edit the docker compose file with your favorite tech stack versions (check the supported versions under Apps in the Tech Stack section);
  3. Start the cluster;
docker-compose up
  4. Run Apache Spark code using the provided Jupyter notebooks with Scala, PySpark and SparkR examples (a quick smoke test is sketched below);
  5. Stop the cluster by typing ctrl+c on the terminal;
  6. Run step 3 to restart the cluster.
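Once the cluster is up, a quick smoke test can be run from a new Python notebook to confirm that the notebook reaches the master and the workers accept tasks. A sketch, assuming the default master URL spark://spark-master:7077:

from pyspark.sql import SparkSession

# Minimal smoke test: create a session against the standalone master and
# run a tiny distributed job on the default workers.
spark = SparkSession.builder \
    .appName("smoke-test") \
    .master("spark://spark-master:7077") \
    .getOrCreate()

print(spark.sparkContext.parallelize(range(1000)).count())  # expected output: 1000
spark.stop()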

Build from your local machine

Note: Local build is currently only supported on Linux OS distributions.

  1. Download the source code or clone the repository;
  2. Move to the build directory;
cd build
  3. Edit the build.yml file with your favorite tech stack versions;
  4. Match those versions in the docker compose file;
  5. Build the images;
chmod +x build.sh ; ./build.sh
  6. Start the cluster;
docker-compose up
  7. Run Apache Spark code using the provided Jupyter notebooks with Scala, PySpark and SparkR examples;
  8. Stop the cluster by typing ctrl+c on the terminal;
  9. Run step 6 to restart the cluster.

Tech Stack

  • Infra
Component Version
Docker Engine 1.13.0+
Docker Compose 1.10.0+
  • Languages and Kernels
Spark 3.x: Hadoop 3.2, Scala 2.12.10 (kernel 0.10.9), Python 3.7.3 (kernel 7.19.0), R 3.5.2 (kernel 1.1.1)
Spark 2.x: Hadoop 2.7, Scala 2.11.12 (kernel 0.6.0), Python 3.7.3 (kernel 7.19.0), R 3.5.2 (kernel 1.1.1)
  • Apps
Component Version Docker Tag
Apache Spark 2.4.0 | 2.4.4 | 3.0.0 <spark-version>
JupyterLab 2.1.4 | 3.0.0 <jupyterlab-version>-spark-<spark-version>
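For example, the JupyterLab 3.0.0 image built for Apache Spark 3.0.0 is tagged andreper/jupyterlab:3.0.0-spark-3.0.0, which is the image referenced in the project's docker-compose file.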


Contributing

We'd love some help. To contribute, please read this file.

Contributors

A list of the amazing people who have contributed to the project can be found in this file. This project is maintained by:

André Perez - dekoperez - [email protected]

Support

Support us on GitHub by starring this project ⭐

Support us on Patreon. πŸ’–


spark-standalone-cluster-on-docker's Issues

[FEATURE] Docker-Compose script with resources

Add resource limits to the docker-compose file

Description

While trying out Spark 3.0.0, I wanted to initialize workers with 2G or more of memory, and it failed every time. After looking it up in the docker-compose documentation, I found that memory is restricted to 2G by default and can be set in the docker-compose file. I would suggest adding these couple of lines for a better understanding and setup of the local Docker Spark cluster.

Here is an example:

    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 8G

[FEATURE] Core Development for 1.2.x

Introduction

On version 1.1.0 we delivered full support for the Scala language (Jupyter kernel, Jupyter notebook, etc.) and improved the documentation. We also improved and fixed some bugs in the PySpark Jupyter notebook. Lastly, we have reached 1900+ downloads on Docker Hub, roughly 45 downloads per day. 🚀

Feature

On version 1.2.x we plan a short incremental development cycle to deliver full support for the R programming language.

Description

  • R programming language;
  • R Jupyter kernel;
  • Spark R API (SparkR);
  • SparkR Jupyter notebook.

Comments (optional)

Wanna join? Please get in touch and help us to deliver a great experience for the community to learn and practice parallel computing on distributed systems.

import wget fails in jupyter notebook (not installed)

Bug

Expected behaviour

Current behaviour

import wget fails with:

ModuleNotFoundError: No module named 'wget'

Steps to reproduce

  1. Step 1
git clone (this repo)
docker-compose up
  2. Step 2

open JupyterLab at localhost:8888

  3. Step 3

Follow instructions:

In [1]:

from pyspark.sql import SparkSession

spark = SparkSession.\
        builder.\
        appName("pyspark-notebook").\
        master("spark://spark-master:7077").\
        config("spark.executor.memory", "512m").\
        getOrCreate()

Learn and practice Apache Spark using PySpark
In [2]:

import wget

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
wget.download(url)

import wget fails with

ModuleNotFoundError: No module named 'wget'
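As a possible workaround (not part of the project), the dataset can be fetched with the Python standard library instead of the missing wget package:

# Workaround sketch: download the file with urllib instead of wget.
import urllib.request

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
urllib.request.urlretrieve(url, "iris.data")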

Possible solutions (optional)

  • add apt install pip3 to base Dockerfile.
  • I tried the above, but am getting errors pulling down the scala deb from https://www.lightbend.com/

Which brings me to another question...why are you using a bespoke scala image stuffed on some random server? Attempts to rebuild the docker/base/Dockerfile are failing b/c (I think) the scala deb is no longer there:

Processing triggers for libc-bin (2.28-10) ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   243    0   243    0     0    682      0 --:--:-- --:--:-- --:--:--   680

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Reading package lists...E: Sub-process Popen returned an error code (2)
E: Encountered a section with no Package: header
E: Problem with MergeList /scala.deb
E: The package lists or status file could not be parsed or opened.

The command '/bin/sh -c mkdir -p ${shared_workspace}/data &&     mkdir -p /usr/share/man/man1 &&     apt-get update -y &&     apt-get install -y curl python3 r-base &&     ln -s /usr/bin/python3 /usr/bin/python &&     curl https://downloads.lightbend.com/scala/${scala_version}/scala-${scala_version}.deb -k -o scala.deb &&     apt install -y ./scala.deb &&     rm -rf scala.deb /var/lib/apt/lists/*' returned a non-zero code: 100



Checklist

Please provide the following:

  • Docker Engine version: 20.10.1
  • Docker Compose version: 1.25.0

Client: Docker Engine - Community
Version: 20.10.1
API version: 1.41
Go version: go1.13.15
Git commit: 831ebea
Built: Tue Dec 15 04:34:58 2020
OS/Arch: linux/amd64
Context: default
Experimental: true

Server: Docker Engine - Community
Engine:
Version: 20.10.1
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: f001486
Built: Tue Dec 15 04:32:52 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.3
GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b
runc:
Version: 1.0.0-rc92
GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
docker-init:
Version: 0.19.0
GitCommit: de40ad0

docker-compose version 1.25.0, build unknown
docker-py version: 4.1.0
CPython version: 3.8.5
OpenSSL version: OpenSSL 1.1.1f 31 Mar 2020

Hive and Hue

Introduction

This is a great stack, thank you! I managed to combine it with a 4-node Hadoop setup as well. Any chance you could extend this to include the same by default, and add Hive and the latest Hue?

Feature


Description

Extend to include hadoop with hdfs proper, and add Hive and Hue

Comments (optional)

Happy to send you my docker-compose with the Hadoop stack included. I tried adding Hive and Hadoop myself but could not get it working; perhaps you could give it a try?

[BUG] Python 3.9 in image "andreper/jupyterlab:3.0.0-spark-2.4.4"

Hey! Thank you very much for the project, it helps us a lot.

However, it looks like during the last update (about 15 hours ago) there was a problem updating the images. The image with Spark 2.4.4 now has Python 3.9 installed instead of 3.7.
For this reason, errors occur when the container starts.

Is it possible to fix this problem and roll back the Python version in the image to 3.7?

Thanks in advance

unable to connect to kafka from jupyter lab container but able to connect from spark-master container shell

Introduction

Hi there, thanks for helping the project! We are doing our best to help the community to learn and practice
parallel computing in distributed environments through our projects. ✨

Bug

Hi, I built all the Docker images from scratch with the help of build.sh, slightly modifying the JupyterLab Dockerfile (by removing that c.notebook="" line).

After bringing up the containers, I renamed spark-defaults.conf.template to spark-defaults.conf in the spark-master conf folder and added the Kafka dependency to the conf file.

Expected behaviour

We should be able to connect to the Kafka broker when running code from the PySpark JupyterLab container.

Current behaviour

When running PySpark code from JupyterLab, it throws this error:

AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;

But when running the same code to connect to the Kafka broker from the pyspark shell on spark-master, it does not.
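A hedged note: since the driver runs inside the JupyterLab container, editing spark-defaults.conf on spark-master may not add the Kafka connector to the notebook's session. One possible workaround is to request the package when building the session; the artifact coordinates below are an assumption for Spark 3.0.0 built with Scala 2.12 and must match the versions actually in use:

from pyspark.sql import SparkSession

# Pull the Structured Streaming Kafka connector into the driver and executors.
# Adjust the coordinates to the cluster's Spark and Scala versions.
spark = SparkSession.builder \
    .appName("pyspark-kafka-notebook") \
    .master("spark://spark-master:7077") \
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0") \
    .getOrCreate()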

Checklist

Please provide the following:

  • Docker Engine version: Can be found using docker version, e.g.: 19.03.6
  • Docker Compose version: Can be found using docker-compose version, e.g.: 1.29.0

[BUG]

Introduction

The project is very helpful, but the principal issue I faced is that I am not able to access other containers by their IP.

Bug

Issue accessing external containers by their IP

Expected behaviour

Should be able to access external containers such as Hadoop, Kafka or Elasticsearch by their IP.

Current behaviour

The request hangs until it shows a timeout error.

Possible solutions (optional)

Maybe work on the networking part of the docker-compose yml file.

[FEATURE] Build images on Mac OSX (with solution)

Building on Mac OSX

Description

I have tried to build your images on Mac OSX, which failed as expected, BUT it is pretty easy to solve and I was able to build everything locally. 👍

The problem with the original shell script and Mac OSX is that Apple has BSD grep preinstalled in contrast to GNU grep. On Mac OSX one can simply install GNU grep with Homebrew:

brew install grep

Afterwards, one can put GNU grep on the PATH, although this might change some system behaviour. Locally, I replaced all grep calls in the script with ggrep and everything worked. The command could be made configurable by setting it as a variable (e.g. in build.yml?); then the build script should work on both Linux and Mac OSX.

[FEATURE] Ability to split Spark Master and Workers Nodes and run on different cluster nodes

Ability to split Spark Master and Workers Nodes and run on different cluster nodes

Description

As this repository is a really nice start for creating a small Spark cluster on Docker, it would be amazing to use this same setup as a starting point for distributing the cluster across multiple nodes. I was trying to split up the docker compose into Spark Master + Node1 and Node2 as two Docker instances, but was not able to connect Node2 to the Master. Do you think this would be interesting?

[BUG] Pulling bespoke scala deb from unvetted location

Bug

I mentioned this in 65, but downloading the scala .deb from a random location is really unusual and will prevent me from both using and suggesting others use this project.

Expected behaviour

I'd expect that scala be either pulled from a vetted location, pulled in as a vetted base image, or built from source using sbt.

Current behaviour

build/docker/base/Dockerfile:    curl https://downloads.lightbend.com/scala/${scala_version}/scala-${scala_version}.deb -k -o scala.deb && \
  • I do not know who or what lightbend.com is
  • I see no reason to trust an arbitrary binary being pulled from this location

Steps to reproduce

  1. Step 1
    Clone this repo

  2. Step 2

cd ./build
  3. Step 3
build.sh

Possible solutions (optional)

  • Use standard scala tooling to generate the scala binary

https://www.scala-sbt.org/sbt-native-packager/introduction.html

Comments (optional)

Add some comments, if any

Checklist

Please provide the following:

  • Docker Engine version: 20.10.1
  • Docker Compose version: 1.25.0
 $ git pretty | head
*   d908cb6 Merge pull request #72 from cluster-apps-on-docker/staging

[BUG] unable to create spark context from jupyterlab

Bug

Expected behaviour

Should be able to create spark context using code from the given examples

Current behaviour

The spark-master logs show the following error when trying to connect to the cluster from the notebook:

ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.

java.io.InvalidClassException: org.apache.spark.deploy.ApplicationDescription; local class incompatible: stream classdesc serialVersionUID = 6543101073799644159, local class serialVersionUID = 1574364215946805297

The root cause is the mismatch between the scala versions. Spark 3.0.0 is pre-built using Scala 2.12.10 while the containers are bundled with Scala 2.12.11. This is a known issue.

Reference:

  1. https://stackoverflow.com/questions/45755881/unable-to-connect-to-remote-spark-master-ror-transportrequesthandler-error-wh
  2. https://almond.sh/docs/usage-spark

    ammonite-spark handles loading Spark in a clever way, and does not rely on a specific Spark distribution. Because of that, you can use it with any Spark 2.x version. The only limitation is that the Scala version of Spark and the running Almond kernel must match, so make sure your kernel uses the same Scala version as your Spark cluster. Spark 2.0.x - 2.3.x requires Scala 2.11. Spark 2.4.x supports both Scala 2.11 and 2.12.

Steps to reproduce

  1. Use docker-compose up -d to bring up the cluster
  2. Create a new Scala notebook and create Spark context using the following code:
import $ivy.`org.apache.spark::spark-sql:2.4.4`;

import org.apache.log4j.{Level, Logger};
Logger.getLogger("org").setLevel(Level.OFF);

import org.apache.spark.sql._

val spark = {
  NotebookSparkSession.builder()
    .master("spark://spark-master:7077")
    .config("spark.executor.instances", "4")
    .config("spark.executor.memory", "2g")
    .getOrCreate()
}

Steps to check Scala version

  1. Jupyterlab Launcher shows the Scala version installed
  2. Launch a terminal on spark-master and fire up the spark shell using the command: bin/spark-shell

From my spark-master node:

# bin/spark-shell
20/09/05 10:07:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://d8b69de92ccc:4040
Spark context available as 'sc' (master = local[*], app id = local-1599300457142).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
Type in expressions to have them evaluated.
Type :help for more information.

scala> ^C
# scala
Welcome to Scala 2.12.11 (OpenJDK 64-Bit Server VM, Java 1.8.0_265).
Type in expressions for evaluation. Or try :help.

scala> ^C
#

Possible solutions (optional)

Bundle the Spark 3.0.0 containers with Scala 2.12.10 and check for similar issues for other versions of Spark as well

Comments (optional)

I can help resolve this issue with a PR

Checklist

Please provide the following:

  • Docker Engine version: 19.03.8
  • Docker Compose version: 1.25.5

[BUG] Application Web UI unavailable

Application Web UI unavailable

Expected behaviour

Run docker-compose up, run PySpark and see all the details of the running application through the web UI.

Current behaviour

After running docker-compose up and starting the pyspark session I am able to run everything, but I am not able to view the application web UI to see the stats of the finished jobs, DAGs, etc.

Steps to reproduce

  1. Dockerfile with Spark 2.4.4
  2. Run docker-compose up
  3. Use Pyspark notebook to start a new session
  4. Go to the Web UI localhost:4040 (while localhost:8080 is up).

Possible solutions (optional)

I have added port 4040 also to the Spark Master in the Dockerfile, but this didn't change anything.

Checklist

Please provide the following:

  • Docker Engine version:
Client: Docker Engine - Community
 Cloud integration: 1.0.1
 Version:           19.03.13
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        4484c46d9d
 Built:             Wed Sep 16 16:58:31 2020
 OS/Arch:           darwin/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.13
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       4484c46d9d
  Built:            Wed Sep 16 17:07:04 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.3.7
  GitCommit:        8fba4e9a7d01810a393d5d25a3621dc101981175
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
  • Docker Compose version:
docker-compose version 1.27.4, build 40524192
docker-py version: 4.3.1
CPython version: 3.7.7
OpenSSL version: OpenSSL 1.1.1g  21 Apr 2020

[BUG] root docker-compose.yaml needs to be updated

REPRO

cd build
./build.sh

spark-standalone-cluster-on-docker/build $ diff docker-compose.yml ../docker-compose.yml 
12c12
<     image: jupyterlab:3.0.0-spark-3.0.0
---
>     image: andreper/jupyterlab:3.0.0-spark-3.0.0
20c20
<     image: spark-master:3.0.0
---
>     image: andreper/spark-master:3.0.0
28c28
<     image: spark-worker:3.0.0
---
>     image: andreper/spark-worker:3.0.0
40c40
<     image: spark-worker:3.0.0
---
>     image: andreper/spark-worker:3.0.0
  1. Does the root yaml need to be updated as part of the build process?
  2. Misspelled "maintainer" in Dockerfiles:
$ grep manteiner * -R
docker/base/Dockerfile:LABEL manteiner="Andre Perez <[email protected]>"
docker/spark-base/Dockerfile:LABEL manteiner="Andre Perez <[email protected]>"
docker/jupyterlab/Dockerfile:LABEL manteiner="Andre Perez <[email protected]>"
docker/spark-worker/Dockerfile:LABEL manteiner="Andre Perez <[email protected]>"
docker/spark-master/Dockerfile:LABEL manteiner="Andre Perez <[email protected]>"
  3. Not sure why you are overwriting the maintainer label in child Dockerfiles. IIRC labels are inherited, and so only need defining in the base image.

[FEATURE] Core Development for 1.3.x

Introduction

On version 1.2.x we delivered full support for the R programming language (R Jupyter kernel with IRkernel, plus the SparkR API and Jupyter notebook).

Feature

On version 1.3.x we will clean up the house. We will be looking at the CI/CD script, which is too long and repetitive. We will also be rethinking the documentation and repo structure to better deliver the project, including a dedicated section of examples.

Description

  • Improve CI/CD script;
  • Refactor documentation;
  • Create examples folder.

Comments (optional)

Wanna join? Please get in touch and help us to deliver a great experience for the community to learn and practice parallel computing on distributed systems

[FEATURE] Core Development for 1.1.x

Introduction

Hi there, what a ride version 1.0.0 has been. In this version we set up the project's basic infrastructure: Docker Hub build, manual build, support for Apache Spark 2.4.0 + 2.4.4 + 3.0.0, and so on. Setting up Docker Hub CI using GitHub Actions was definitely a game changer because it took a lot of pressure off me. The results? We received lots of feedback from all over the world and reached more than 450 downloads on Docker Hub in just 1 month. 🚀

Thanks a lot folks.

Feature

On version 1.1.x we plan to deliver Scala and R JupyterLab kernels to support their corresponding Apache Spark APIs. Currently only Python and PySpark are available, though the Scala language is already present on the Spark base images. Also, the contributing flow will require the creation of an issue (bug or feature) to better track the project's progress and ease the deployment documentation. Lastly, some minor changes will be made to improve the documentation.

Description

  • JupyterLab: Scala programming language and kernel;
  • JupyterLab: R programming language and kernel;
  • Spark: R programming language;
  • Repository: Issue workflow, change log file and general documentation improvements.

Comments (optional)

Wanna join? Please get in touch and help us to deliver a great experience for the community to learn and practice parallel computing on distributed systems.

[BUG]

Hello! I started JupyterLab, but only the PySpark API is available in it. The Spark Scala API is broken. Could you please tell me how to run a notebook with Scala and Spark?

Example:

import org.apache.spark.sql._

cmd0.sc:1: object apache is not a member of package org
import org.apache.spark.sql._
^Compilation Failed
Compilation Failed

Client: Docker Engine - Community
Version: 24.0.2
API version: 1.43
Go version: go1.20.4
Git commit: cb74dfc
Built: Thu May 25 21:51:00 2023
OS/Arch: linux/amd64
Context: default

Server: Docker Engine - Community
Engine:
Version: 24.0.2
API version: 1.43 (minimum version 1.12)
Go version: go1.20.4
Git commit: 659604f
Built: Thu May 25 21:51:00 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.21
GitCommit: 3dce8eb055cbb6872793272b4f20ed16117344f8
runc:
Version: 1.1.7
GitCommit: v1.1.7-0-g860f061
docker-init:
Version: 0.19.0
GitCommit: de40ad0

Docker Compose version v2.17.3

[FEATURE] OpenJDK Docker Image is getting deprecated

Introduction

As stated at https://hub.docker.com/_/openjdk, the OpenJDK Docker image is getting deprecated, which means it could open the way to insecure applications.

Feature

Update project version and remove deprecated ecosystem

Description

Would it be possible to select the amazoncorretto Docker image, for example, and also a minified version such as alpine? This could greatly reduce the Docker images' size and complexity.
Another feature I would love to see is updated versions of Spark as well.

[BUG] Unable to start service with Docker Compose

Introduction

I am unable to start all the services on my Ubuntu Server 20.10. I cloned the current master and only executed build.sh.

Bug

When I then try to start the services, the following error occurs:

thomas@srv:/srv/spark/build$ sudo rm -rf  /var/lib/docker/volumes/hadoop-distributed-file-system/
thomas@srv/srv/spark/build$ docker-compose up
jupyterlab is up-to-date
Creating spark-master ... error

ERROR: for spark-master  Cannot create container for service spark-master: open /var/lib/docker/volumes/hadoop-distributed-file-system/_data: no such file or directory

ERROR: for spark-master  Cannot create container for service spark-master: open /var/lib/docker/volumes/hadoop-distributed-file-system/_data: no such file or directory
ERROR: Encountered errors while bringing up the project.

Checklist

Please provide the following:

  • Docker Engine version: 20.10.2
  • Docker Compose version: 1.25.0

[FEATURE] Core Development for 1.2.2

Introduction

Hi there, thanks for helping the project! We are doing our best to help the community to learn and practice
parallel computing in distributed environments through our projects. ✨

Feature

On version 1.2.2 we did a short development cycle to improve the CI/CD script by empowering it with GitHub Actions.

Description

  • Improved CI/CD with GitHub Actions;
  • Fixed some docs typos.

[FEATURE]

I receive an OOM exception.

jupyterlab | Setting default log level to "WARN".
jupyterlab | To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/06/15 09:28:30 WARN TaskMemoryManager: Failed to allocate a page (1048576 bytes), try again.
jupyterlab | 23/06/15 09:28:40 ERROR Utils: Uncaught exception in thread stdout writer for /usr/bin/python3
jupyterlab | java.lang.OutOfMemoryError: Java heap space
jupyterlab | Exception in thread "stdout writer for /usr/bin/python3" java.lang.OutOfMemoryError: Java heap space
jupyterlab | 23/06/15 09:28:53 ERROR Utils: Uncaught exception in thread stdout writer for /usr/bin/python3
jupyterlab | java.lang.OutOfMemoryError: GC overhead limit exceeded

Is it possible to adjust the memory settings of the worker and increase the RAM?
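For reference, a sketch of requesting more executor memory when building the session from the notebook, based on the SparkSession configuration shown in the project's PySpark examples. Note that the workers only offer 512m by default (see the cluster overview), so the worker memory in docker-compose.yml would also need to be raised for a larger request to be satisfied:

from pyspark.sql import SparkSession

# Ask for more memory per executor than the 512m used in the examples.
# The standalone workers must be configured with at least this much memory,
# otherwise executors cannot be granted the requested amount.
spark = SparkSession.builder \
    .appName("pyspark-notebook") \
    .master("spark://spark-master:7077") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()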

[BUG] Cannot submit tasks to master

Hi,
I ran into an issue. Can anyone help? Thanks in advance.

After deploying spark-standalone-cluster-on-docker (images: andreper/spark-master:3.0.0) on a server (192.XX.X.X), I tried to test it from another PC (192.XX.X.Y).
Command steps:
$ spark-shell --master spark://192.XX.X.X:7077

val count = sc.parallelize(1 to 1000).filter { _ =>
  val x = math.random
  val y = math.random
  x*x + y*y < 1
}.count()

I got the error below (the message repeats in an infinite loop):

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Build Env.

  • Images: andreper/spark-master:3.0.0
  • Docker Engine version: 20.10.14
  • Docker Compose version: Docker Compose version v2.2.3

[FEATURE] Suppress verbosity from build.sh

Feature

The shell script build.sh is veeeery verbose. We could make it less verbose by adding some flags to the commands and suppressing the logs.

Description

On section "Feature".

[FEATURE] Core Development for 1.2.1

Introduction

On version 1.2.x we delivered full support for the R programming language (R Jupyter kernel with IRkernel, plus the SparkR API and Jupyter notebook).

Feature

Version 1.2.1 is totally dedicated to community contributions. For SparkR, we removed the CRAN repo dependency and started to download the dependency directly from the Spark repository. For Scala, a lot of improvements: we enhanced Scala language and kernel interoperability with Spark by dynamically setting their versions according to the Spark distro version. We also exposed port 4040 to allow access to the Spark driver web UI and upgraded JupyterLab to its latest stable version. Finally, we added a staging branch between develop and master to allow CI/CD pipeline testing.

Almost forgot: we integrated the repo with Patreon, so the project can now be sponsored 💖

Description

  • Added Patreon support link 💖
  • Added staging branch between develop and master to test the CI/CD pipeline without pushing images to Docker Hub;
  • Exposed Spark driver web UI port 4040 (#39);
  • Upgraded JupyterLab from v2.1.4 to v3.0.0;
  • Made SparkR available for all Spark versions;
  • Enhanced Spark compatibility with the Scala kernel (#35).

Comments (optional)

We have reached 10k downloads. ✨
