How to: deploy a local Spark cluster (standalone) with Docker (Linux)

License: MIT | Made with Python

Dockerized env: [JupyterLab server => Spark (master <-> 1 worker)]

Deploying a local Spark cluster (standalone) can be tricky.
Most online resources focus on a single-driver installation, with Spark in a custom env or via jupyter-docker-stacks.
Here are my notes on running Spark locally, with a JupyterLab interface, one master and one worker, using Docker Compose.
All the PySpark dependencies come pre-configured in a container, with access to your local files (in an existing directory).
You might also want to do it the easy way --not local, though-- using Databricks Community Edition (free).

1. Prerequisites


  • Install Docker Engine, either through Docker Desktop or directly as the engine. Personally, I use the latter.
  • Make sure Docker Compose is installed, or install it.
  • Resources:
    Medium article on installing and basic usage of Docker. The official Docker resources should be enough, though.
    Jupyter Docker Stacks: "a set of ready-to-run Docker images containing Jupyter applications".
    The source article I (very slightly) adapted the docker-compose file from.
    Install Docker Engine (apt-get), official resource.
    Install Docker Compose (apt-get), official resource.

2. How to


After installing Docker Engine/Compose on Linux, do not forget the post-installation steps.

  1. Git clone this repository, or create a new one (name of your choice)
  2. Open a terminal, cd into your directory, and make sure the docker-compose.yml file is present (copy it in if needed):

    version: '3'
    services:
      spark:
        image: bitnami/spark:3.3.1
        environment:
          - SPARK_MODE=master
        ports:
          - '8080:8080'
          - '7077:7077'
        volumes:
          - $PWD:/home/jovyan/work
      spark-worker:
        image: bitnami/spark:3.3.1
        environment:
          - SPARK_MODE=worker
          - SPARK_MASTER_URL=spark://spark:7077
          - SPARK_WORKER_MEMORY=4G
          - SPARK_EXECUTOR_MEMORY=4G
          - SPARK_WORKER_CORES=4
        ports:
          - '8081:8081'
        volumes:
          - $PWD:/home/jovyan/work
      jupyter:
        image: jupyter/pyspark-notebook:spark-3.3.1
        ports:
          - '8888:8888'
        volumes:
          - $PWD:/home/jovyan/work

Basically, the yml file tells Docker Compose how to run the Spark master, the worker, and JupyterLab. Thanks to the $PWD volume mounts, your current working directory is accessible from the containers every time you run this command:

  3. Run docker compose
cd my-directory
docker compose up
# or, depending on your Docker Compose install:
docker-compose up

Docker Compose will automatically download the needed images (bitnami/spark:3.3.1 for the master and worker, jupyter/pyspark-notebook:spark-3.3.1 for the JupyterLab interface) and run the whole thing. On subsequent runs, the cached images are reused and the containers simply start.
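Because of the $PWD:/home/jovyan/work volume mounts, anything in your host's working directory is visible inside the containers. As a minimal sanity check from a notebook cell (a sketch, assuming the default jupyter/pyspark-notebook layout):

import os

# The host's current directory is mounted here (see the volumes in docker-compose.yml);
# files you add on the host should appear in this listing.
print(os.listdir("/home/jovyan/work"))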

3. Profit: JupyterLab interface, Spark cluster (standalone) mode

Access the different interfaces at:

JupyterLab interface: http://localhost:8888
Spark master UI: http://localhost:8080
Spark worker UI: http://localhost:8081

You can use the demo notebook spark-cluster.ipynb for a ready-to-use PySpark notebook, or simply create a new one and start a SparkSession like this:

from pyspark.sql import SparkSession

# Spark master URL, as exposed by the "spark" service in docker-compose.yml
URL_SPARK = "spark://spark:7077"

spark = (
    SparkSession.builder
    .appName("spark-ml")
    .config("spark.executor.memory", "4g")  # full key is "spark.executor.memory"
    .master(URL_SPARK)
    .getOrCreate()
)
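To confirm the notebook is really talking to the cluster, here is a quick smoke test (the data and column names are illustrative only):

# Build a tiny DataFrame on the cluster and read it back
df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c")],
    ["id", "letter"],  # hypothetical column names, for illustration
)
df.show()

# Should print spark://spark:7077
print(spark.sparkContext.master)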

Bonus: Notebook, predict using spark.ml Pipeline()


If you use spark-cluster.ipynb, the demo shows how to build a spark.ml prediction Pipeline() with a random forest regressor, on a well-known dataset.
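For orientation, here is a minimal sketch of what such a pipeline looks like; the feature and label column names are hypothetical placeholders, and the notebook uses its own dataset and preprocessing:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

# Assemble raw columns into a single feature vector
# ("feature_1", "feature_2" and "label" are placeholder names)
assembler = VectorAssembler(
    inputCols=["feature_1", "feature_2"],
    outputCol="features",
)

rf = RandomForestRegressor(featuresCol="features", labelCol="label")

# Chain the stages: fit on a training DataFrame, then transform to predict
pipeline = Pipeline(stages=[assembler, rf])
# model = pipeline.fit(train_df)
# predictions = model.transform(test_df)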
