How to: deploy a local Spark cluster (standalone) with Docker (Linux)

License: MIT | Made with Python

Dockerized env: [JupyterLab server => Spark (master <-> 1 worker)]

Deploying a local Spark cluster (standalone) can be tricky.
Most online resources focus on a single-driver installation, with Spark in a custom env or via jupyter-docker-stacks.
Here are my notes on running Spark locally, with a JupyterLab interface, one master and one worker, using Docker Compose.
All the PySpark dependencies come pre-configured in a container, with access to your local files (in an existing directory).
You might also want to do it the easy way --not local, though-- using Databricks Community Edition (free).

1. Prerequisites


  • Install Docker Engine, either through Docker Desktop or directly as the engine. Personally, I use the latter.
  • Make sure Docker Compose is installed, or install it.
  • Resources:
    Medium article on installing and basic usage of Docker. The official Docker resources should be enough, though.
    Jupyter Docker Stacks: "a set of ready-to-run Docker images containing Jupyter applications".
    The source article I (very slightly) adapted the docker-compose file from.
    Install Docker Engine (apt-get), official resource.
    Install Docker Compose (apt-get), official resource.

2. How to


After installing Docker Engine/Compose on Linux, do not forget the post-installation steps.

  1. Git clone this repository, or create a new one (name of your choice)
  2. Open a terminal, cd into your directory, and make sure the docker-compose.yml file is present (copy it in if needed):

    version: '3'
    services:
      spark:
        image: bitnami/spark:3.3.1
        environment:
          - SPARK_MODE=master
        ports:
          - '8080:8080'
          - '7077:7077'
        volumes:
          - $PWD:/home/jovyan/work
      spark-worker:
        image: bitnami/spark:3.3.1
        environment:
          - SPARK_MODE=worker
          - SPARK_MASTER_URL=spark://spark:7077
          - SPARK_WORKER_MEMORY=4G
          - SPARK_EXECUTOR_MEMORY=4G
          - SPARK_WORKER_CORES=4
        ports:
          - '8081:8081'
        volumes:
          - $PWD:/home/jovyan/work
      jupyter:
        image: jupyter/pyspark-notebook:spark-3.3.1
        ports:
          - '8888:8888'
        volumes:
          - $PWD:/home/jovyan/work

Basically, the yml file tells Docker Compose how to run the Spark master, the worker, and JupyterLab. Thanks to the $PWD volume mounts, your current working directory is accessible from the containers every time you run this command:

  3. Run docker compose
cd my-directory
docker compose up
# or, depending on your Docker Compose install:
docker-compose up

Docker Compose will automatically download the needed images (bitnami/spark:3.3.1 for the master and worker, jupyter/pyspark-notebook:spark-3.3.1 for the JupyterLab interface) and run the whole thing. On subsequent runs, the cached images are reused and the containers simply start.
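Because of the $PWD:/home/jovyan/work volume mounts, anything in your host's working directory is visible inside the containers. As a minimal sanity check from a notebook cell (a sketch, assuming the default jupyter/pyspark-notebook layout):

import os

# The host's current directory is mounted here (see the volumes in docker-compose.yml);
# files you add on the host should appear in this listing.
print(os.listdir("/home/jovyan/work"))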

3. Profit: JupyterLab interface, Spark cluster (standalone) mode

Access the different interfaces at:

JupyterLab interface: http://localhost:8888
Spark master UI: http://localhost:8080
Spark worker UI: http://localhost:8081

You can use the demo notebook spark-cluster.ipynb for a ready-to-use PySpark notebook, or simply create a new one and start a SparkSession like this:

from pyspark.sql import SparkSession

# Spark master URL, as exposed by the "spark" service in docker-compose.yml
URL_SPARK = "spark://spark:7077"

spark = (
    SparkSession.builder
    .appName("spark-ml")
    .config("spark.executor.memory", "4g")  # full key is "spark.executor.memory"
    .master(URL_SPARK)
    .getOrCreate()
)
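To confirm the notebook is really talking to the cluster, here is a quick smoke test (the data and column names are illustrative only):

# Build a tiny DataFrame on the cluster and read it back
df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c")],
    ["id", "letter"],  # hypothetical column names, for illustration
)
df.show()

# Should print spark://spark:7077
print(spark.sparkContext.master)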

Bonus: Notebook, predict using spark.ml Pipeline()


If you use spark-cluster.ipynb, the demo shows how to build a spark.ml prediction Pipeline() with a random forest regressor, on a well-known dataset.
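For orientation, here is a minimal sketch of what such a pipeline looks like; the feature and label column names are hypothetical placeholders, and the notebook uses its own dataset and preprocessing:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

# Assemble raw columns into a single feature vector
# ("feature_1", "feature_2" and "label" are placeholder names)
assembler = VectorAssembler(
    inputCols=["feature_1", "feature_2"],
    outputCol="features",
)

rf = RandomForestRegressor(featuresCol="features", labelCol="label")

# Chain the stages: fit on a training DataFrame, then transform to predict
pipeline = Pipeline(stages=[assembler, rf])
# model = pipeline.fit(train_df)
# predictions = model.transform(test_df)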
