Welcome to μcluster (mucluster)! A Docker playground for experimenting with Big Data technologies like Spark, Hadoop, YARN, and more. This Docker-based setup allows you to quickly spin up an environment to test and play with various tools and frameworks without worrying about the complexities of installation and configuration.
- Prerequisites
- Getting Started
- Included Technologies
- Configuration
- Usage
- Additional Resources
- Disclaimer
- Docker installed on your system. You can download and install Docker from the official Docker website.
- A decent amount of RAM and disk space: the Docker containers need to be able to allocate at least 10 GB of RAM (a 16 GB machine is barely enough).
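
A quick sanity check before starting is to confirm Docker is installed and see how much memory it can allocate. This is just a convenience sketch; the exact `docker info` output varies by platform:

```bash
# Check that Docker is installed and the daemon is reachable
docker --version

# Show how much memory Docker can allocate (should be >= 10GB)
docker info 2>/dev/null | grep -i "total memory"
```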
To get started with this Docker playground, follow these simple steps:

1. Clone this repository to your local environment:

   ```bash
   git clone https://github.com/AlexStirban/mucluster.git
   ```

2. Navigate to the project directory:

   ```bash
   cd mucluster
   ```

3. Copy the `.env.sample` file as `.env` and replace the values of `NOTEBOOK_UID` and `NOTEBOOK_GID` with the User ID and Group ID of your current user:

   ```bash
   cp .env.sample .env \
       && sed -i -e "s|\[UID\]|$(id -u)|g" -e "s|\[GID\]|$(id -g)|g" .env
   ```

4. Build the Docker images:

   ```bash
   docker-compose build
   ```

5. Run the services in detached mode:

   ```bash
   docker-compose up -d
   ```

   > [!NOTE]
   > You can also follow the services' logs interactively by omitting the `-d` flag.

6. Access the environment UIs using the URLs provided or submit jobs to your newly created cluster. Please see Usage for more information.
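
Once the services are up, a quick sanity check is to list the running containers and probe one of the UIs. A minimal sketch (container names depend on your compose project name):

```bash
# List the containers and their state (all should be "Up")
docker-compose ps

# Probe the YARN ResourceManager UI (see the Usage section for all URLs)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088/ui2/
```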
This playground deploys the following Big Data tools:
- Spark 3.5.1
- Apache Iceberg
- Hadoop 3.3.4
- HBase 1.2.6 (standalone), required by the YARN Timeline Service v2. Please see the documentation linked in Additional Resources.
- Livy 0.8.0-rc2
- Hive 4.0.0
- PostgreSQL Database (used as the metastore)
- MinIO (used as the data lake)
- Jupyter Lab & Sparkmagic (used for interactive analytics)
- Hue (used as a centralized SQL assistant)
> [!NOTE]
> Keep in mind that we won't be using the actual Livy release binaries, but a fork I've made to update the dependencies and fix some missing JVM class imports for Spark 3. Please see: (LIVY-863) Missing JVM class imports for Spark3 and Livy Fork.
Most of the time everything should work out of the box; however, you might need to tweak how the services are deployed. Below you can find the main configurations used by μcluster.
In the root directory you'll find a `.env` file that stores some basic settings shared between services:
| Name | Default Value | Description |
|---|---|---|
| `MINIO_PORT` | 10000 | Port the MinIO server will bind to |
| `MINIO_UI_PORT` | 10001 | Port the MinIO Web UI will bind to |
| `MINIO_ROOT_USER` | root | Root user for MinIO |
| `MINIO_ROOT_PASSWORD` | wfNFHGIvYf1RrJ1 | Root password for MinIO |
| `AWS_ACCESS_KEY_ID` | y9kmpQA4RC2asIF4toWi | Initial access key created and propagated to the other services |
| `AWS_SECRET_ACCESS_KEY` | M9piklUa69UTjMcnX4zW1hl7LJJHym7BLbm0GZ2C | Initial access secret created and propagated to the other services |
| `METASTORE_USER` | meta | User for the PostgreSQL metastore instance |
| `METASTORE_PASS` | 7dDWK2Ft | Password for the PostgreSQL metastore instance |
| `NOTEBOOK_UID` | [UID] | User ID used by Jupyter Lab when creating files and folders. Set it to the same as your current user with `id -u` |
| `NOTEBOOK_GID` | [GID] | Group ID used by Jupyter Lab when creating files and folders. Set it to the same as your current user with `id -g` |
| `INCLUDE_CSV_SAMPLES` | true | Creates a `samples-csv-src` bucket in MinIO with two sample datasets: `iris.csv` and `wine.csv` |
| `INCLUDE_NOTEBOOK_SAMPLES` | true | Creates a `samples` directory with two notebooks that show how to read/write files and a quick Iceberg showcase |
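
As a reference, a minimal `.env` after running the `sed` substitution from Getting Started might look like this (values are the defaults from the table above; the UID/GID of 1000 is just an assumption for a typical first user):

```bash
# .env — shared settings picked up by the compose files
MINIO_ROOT_USER=root
MINIO_ROOT_PASSWORD=wfNFHGIvYf1RrJ1
METASTORE_USER=meta
METASTORE_PASS=7dDWK2Ft
# Filled in by sed from `id -u` and `id -g`
NOTEBOOK_UID=1000
NOTEBOOK_GID=1000
INCLUDE_CSV_SAMPLES=true
INCLUDE_NOTEBOOK_SAMPLES=true
```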
> [!WARNING]
> If you plan on changing the ports or hostnames used by default in the `.yml` files, you'll need to propagate some of them to the corresponding service configuration as well.
By default, μcluster deploys a master instance and two slaves (see `docker/compute.yml`):
- The master instance runs the RM, HBase, the Timeline Service, the Spark History Server, Livy, and Hive.
- The slaves ship only with Hadoop and Spark and act as workers.
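
If your machine has spare resources, you can scale the number of workers up. As a sketch, assuming the worker service in `docker/compute.yml` is named `slave` (check the file for the actual service name):

```bash
# Start the cluster with three workers instead of the default two
docker-compose up -d --scale slave=3
```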
For any tweaks, use the subfolders inside the `cluster` directory:
```
📦cluster
 ┣ 📂hadoop
 ┃ ┣ 📜core-site.xml
 ┃ ┣ 📜hdfs-site.xml
 ┃ ┣ 📜mapred-site.xml
 ┃ ┗ 📜yarn-site.xml
 ┣ 📂hbase
 ┃ ┗ 📜hbase-site.xml
 ┣ 📂hive
 ┃ ┗ 📜hive-site.xml
 ┣ 📂livy
 ┃ ┣ 📜livy-env.sh
 ┃ ┗ 📜livy.conf
 ┗ 📂spark
   ┣ 📜hive-site.xml
   ┣ 📜log4j.properties
   ┗ 📜spark-defaults.conf
```
Check Additional Resources to get more information on each specific file.
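
For example, the YARN memory limits that control how many containers (and therefore Livy sessions) fit on each worker live in `cluster/hadoop/yarn-site.xml`. A quick way to inspect them, assuming the memory-related properties are set in that file:

```bash
# Show the memory-related YARN properties currently configured
grep -B1 -A2 "memory" cluster/hadoop/yarn-site.xml
```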
Under the `docker` folder you'll find the definitions of each service in compose files, grouped by their purpose:
```
📦docker
 ┣ 📜analytics.yml --> Hue & Jupyter Lab + Sparkmagic
 ┣ 📜compute.yml --> Hadoop, YARN, Hive...
 ┣ 📜networks.yml --> Network shared by the services
 ┗ 📜storage.yml --> MinIO & PostgreSQL
```
If you need to include a new tool, feel free to add it to any of the above categories or create a new one (remember to use the same network and include it in `compose.yml`).
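
As a sketch, wiring in a new tool could look like the following; the service, image, and the network name `cluster-net` are all hypothetical here, so check `docker/networks.yml` for the real network name:

```bash
# Create a compose file for the new tool (everything below is a hypothetical example)
cat > docker/mytool.yml <<'EOF'
services:
  mytool:
    image: alpine:3.19
    command: ["sleep", "infinity"]
    networks:
      - cluster-net   # must match the network defined in docker/networks.yml
EOF

# Then add docker/mytool.yml to the files referenced by compose.yml
```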
Both Hive and Hue require auxiliary storage to keep track of schema changes, logged-in users, configurations, and so on. To manage that, a PostgreSQL instance is deployed. Inside the `metastore` folder you'll find:
```
📦metastore
 ┣ 📂data --> Mounted folder to store PostgreSQL data
 ┣ 📜hue.ini --> Configuration file for Hue
 ┗ 📜init.sh --> Bootstrapping script for PostgreSQL
```
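
To poke at the metastore directly, you can open a `psql` shell inside the PostgreSQL container. A sketch using the default `METASTORE_USER` from the `.env` table; the container name is an assumption, so check `docker-compose ps` for the actual one:

```bash
# List the databases backing Hive and Hue (container name is hypothetical)
docker exec -it metastore psql -U meta -c '\l'
```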
Check Additional Resources to get more information on how Hue should be configured.
We're using MinIO as the data lake for its simplicity and compatibility with S3; in the `minio` folder you'll find:
```
📦minio
 ┣ 📂data --> Mounted folder to store MinIO data
 ┣ 📂samples --> Folder with the samples that will be uploaded to the `samples-csv-src` bucket
 ┗ 📜init.sh --> Bootstrapping script for MinIO
```
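
You can interact with it like any other S3 endpoint, for example with the MinIO client (`mc`) from your host. A sketch using the default port and root credentials from the `.env` table above:

```bash
# Register the local MinIO server under the alias "local"
mc alias set local http://localhost:10000 root wfNFHGIvYf1RrJ1

# Create a bucket, upload a file into it, and list its contents
mc mb local/scratch
mc cp ./iris.csv local/scratch/
mc ls local/scratch
```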
As a "quick-n-dirty" analytics tool an instance of Jupyter Lab and sparkmagic is provided,
in the sparkmagic
folder you'll find:
```
📦sparkmagic
 ┣ 📂samples --> Samples included in the environment
 ┃ ┣ 📜sample.ipynb
 ┃ ┗ 📜sample_iceberg.ipynb
 ┣ 📜Dockerfile --> Image definition
 ┣ 📜config.json --> Configuration for sparkmagic (Livy server, timeouts, kernel...)
 ┣ 📜entrypoint.sh --> Bootstrapping script
 ┗ 📜requirements.txt --> Requirements file used to install additional packages
```
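
If you need extra Python packages in the notebooks, `requirements.txt` is the place to add them; afterwards, rebuild the image. The service name `jupyter` below is an assumption, so check `docker/analytics.yml` for the actual one:

```bash
# Add a package and rebuild only the Jupyter Lab image
echo "pandas==2.2.2" >> sparkmagic/requirements.txt
docker-compose build jupyter
docker-compose up -d jupyter
```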
After starting the Docker containers as instructed in the "Getting Started" section, you can start experimenting with various Big Data technologies. Here are a few common tasks you might want to try:
- Launching a Spark job using the Spark shell or submitting a Spark application.
- Monitoring resource usage and job status via the YARN ResourceManager UI.
- Creating S3 resources (buckets, file uploads...) using MinIO.
- Reading, transforming, and saving data using Spark in Jupyter Lab through the Livy server (see the sketch after this list).
- Querying data using Hive through Hue or the beeline CLI.
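
For instance, you can talk to Livy's REST API directly to spin up an interactive session and run a statement, without going through Jupyter. These are standard Livy endpoints on the port listed in the table below; the session and statement IDs of 0 assume this is the first session you create:

```bash
# Create an interactive PySpark session through Livy's REST API
curl -s -X POST -H "Content-Type: application/json" \
     -d '{"kind": "pyspark"}' http://localhost:8998/sessions

# Once the session is idle (check with GET /sessions), submit a statement
curl -s -X POST -H "Content-Type: application/json" \
     -d '{"code": "print(spark.range(10).count())"}' \
     http://localhost:8998/sessions/0/statements

# Poll for the result
curl -s http://localhost:8998/sessions/0/statements/0
```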
> [!WARNING]
> Only two slaves are available (feel free to scale up if you can), so only one Livy session can run at a time (one driver and one worker). Starting a second session will keep it queued until more resources are available.
By default, the UIs used for monitoring jobs and sessions or for accessing Jupyter Lab can be found at:
| Service | URL | Description |
|---|---|---|
| Jupyter Lab | http://localhost:8888 | Jupyter Lab + Sparkmagic |
| YARN UI | http://localhost:8088/ui2/ | YARN UI v2 |
| MinIO Console | http://localhost:10001/ | MinIO Console |
| Livy | http://localhost:8998/ | Livy UI & Server |
| Spark UI | http://localhost:18080/ | Spark History Server |
| Hue | http://localhost:9999/ | Hue SQL Assistant |
| Hadoop UI | http://localhost:9870/ | Hadoop UI |
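
A quick way to check that all the UIs are actually reachable is to loop over them with `curl` (just a convenience sketch using the default ports above):

```bash
# Print an HTTP status code for each UI endpoint
for url in http://localhost:8888 http://localhost:8088/ui2/ \
           http://localhost:10001/ http://localhost:8998/ \
           http://localhost:18080/ http://localhost:9999/ \
           http://localhost:9870/; do
    printf "%-32s " "$url"
    curl -s -o /dev/null -w "%{http_code}\n" "$url"
done
```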
While this project serves as a good starting point, it's not meant to explain how each service actually works; see the following resources for more info:
- Apache Related Configuration:
  - XML files: core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml
  - YARN: Determining Memory Config
  - Timeline Service v2: Overview & HBase Deployment
  - Hive: Quickstart
  - Livy: .conf template, Build Livy with Docker, (LIVY-863) Missing JVM class imports for Spark3
  - Spark: Spark Configuration
- Sparkmagic: Install and Settings
- Hue: hue.ini
- MinIO: MinIO CLI
> [!WARNING]
> You might be tempted to use the latest HBase version; however, the Timeline Service v2 doesn't seem to work properly unless you're on HBase 1.2.6 and Java 1.8. Upgrade at your own risk!
This project is nothing more than a learning exercise for me. I tried making it approachable for anyone else; nonetheless, you'll need some understanding of Spark, Apache tools, and Docker. None of this should be used as-is in production environments, as security was not a concern at the time.
In the best-case scenario, this will work for you out-of-the-box; if not, tinker and play around, don't get frustrated and enjoy!