Single Node Big Data Playground with Docker compose : Apache Spark, Apache Airflow, Apache Livy, Jupyter Notebook

License: Apache License 2.0



Big Data Playground with Spark Ecosystem

Note: This is shamelessly copied from here

This project contains the following containers:

  • Airflow (webserver and scheduler), with Postgres as its metadata database
  • Postgres (postgres:9.6), hosting the airflow metadata database and a test database
  • Spark master and worker (bitnami/spark:3.1.2)
  • Livy REST server for Spark (bitnami/spark-livy:3.1.2)
  • Jupyter Notebook (jupyter/pyspark-notebook:spark-3.1.2)

Architecture components

Setup

Clone project

$ git clone https://github.com/gyan42/docker-big-data-playground

Build the Airflow Docker image

Change to directory docker-big-data-playground/docker/docker-airflow and run

$ docker build --rm --force-rm -t docker-airflow-spark:1.10.7_3.1.2 .

Change to directory docker-big-data-playground/docker/bitnami-livy and run the commands below. Note: Livy doesn't support Spark 3+ out of the box. Refer here

```
# build the Livy-enabled Spark image
docker build -t bitnami/spark-livy:3.1.2 .

# start a container and launch the Livy server inside it
docker run --network="bridge" -p 8998:8998 -it bitnami/spark-livy:3.1.2 bash
/opt/bitnami/livy/bin/livy-server

# mac (docker-machine)
export HOST=`docker-machine ip`

# linux
export HOST=localhost

# create a PySpark session
curl -X POST -d '{"kind": "pyspark"}' -H "Content-Type: application/json" ${HOST}:8998/sessions/

# run a code statement in the session created above (id 0)
curl -X POST -d '{
"kind": "pyspark",
"code": "for i in range(1,10): print(i)"
}' \
-H "Content-Type: application/json" \
${HOST}:8998/sessions/0/statements
```
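Statement execution is asynchronous, so the POST above returns before the code runs. A minimal sketch of polling the result over Livy's REST API, using the session and statement ids created above:

```
# wait until the session state is "idle", then fetch the statement output
curl ${HOST}:8998/sessions/0 | python -m json.tool
curl ${HOST}:8998/sessions/0/statements/0 | python -m json.tool
```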

Optionally, you can override the build arguments to choose specific Spark, Hadoop and Airflow versions. As an example, here is how to build an image containing Airflow 1.10.14, Spark 2.4.7 and Hadoop 2.7.

$ docker build --rm --force-rm \
-t docker-airflow-spark:1.10.14_2.4.7 \
--build-arg AIRFLOW_VERSION=1.10.14 \
--build-arg SPARK_VERSION=2.4.7 \
--build-arg HADOOP_VERSION=2.7 .

Spark and Hadoop versions follow those listed on the Spark download page: https://spark.apache.org/downloads.html

Airflow versions can be found here: https://pypi.org/project/apache-airflow/#history

If you change the name or tag of the Docker image when building, remember to update the name/tag in the docker-compose file.
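For example, to find the entries that need the new tag (run from the directory containing docker-compose.yml):

```
$ grep -n "docker-airflow-spark" docker-compose.yml
# then edit the matching image: entries, e.g.
#   image: docker-airflow-spark:1.10.14_2.4.7
```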

Start containers

Navigate to docker-big-data-playground/docker and run:

$ docker-compose up

If you want to run in background:

$ docker-compose up -d

Note: when running docker-compose for the first time, the images postgres:9.6, bitnami/spark:3.1.2 and jupyter/pyspark-notebook:spark-3.1.2 are downloaded before the containers are started.
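To confirm that all services came up, list the compose-managed containers:

$ docker-compose ps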

Check that you can access the following services:

Airflow: http://localhost:8282

Spark Master: http://localhost:8181

PostgreSQL - test database:

  • Server: localhost:5432
  • Database: test
  • User: test
  • Password: postgres

PostgreSQL - airflow database:

  • Server: localhost:5432
  • Database: airflow
  • User: airflow
  • Password: airflow
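
A quick connectivity check from the host, a sketch assuming the psql client is installed locally:

$ PGPASSWORD=airflow psql -h localhost -p 5432 -U airflow -d airflow -c '\conninfo'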

Jupyter Notebook: http://127.0.0.1:8888

  • For the Jupyter notebook, copy the URL with the token generated when the container starts and paste it into your browser. The URL with the token can be taken from the container logs:

    $ docker logs -f docker_jupyter-spark_1
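
    If the log is noisy, this sketch pulls out just the tokenized URL:

    $ docker logs docker_jupyter-spark_1 2>&1 | grep -o 'http://127.0.0.1:8888/?token=[a-zA-Z0-9]*' | tail -1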
    

How to run a DAG to test

  1. Configure the Spark connection by accessing the Airflow web UI at http://localhost:8282 and going to Connections

  2. Edit the spark_default connection, setting Host to spark://spark and Port to 7077 (or use the CLI sketch after this list)

  3. Run the spark-test DAG

  4. Check the DAG log for the task spark_job. You will see the result printed in the log

  5. Check the spark application in the Spark Master web UI (http://localhost:8181)
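
Steps 1 and 2 can also be scripted. A sketch using the Airflow 1.10 connections CLI inside the webserver container (the container name docker_airflow-webserver_1 is an assumption; confirm it with docker container ls):

```
# container name is an assumption; confirm with `docker container ls`
docker exec -it docker_airflow-webserver_1 \
    airflow connections --delete --conn_id spark_default
docker exec -it docker_airflow-webserver_1 \
    airflow connections --add --conn_id spark_default --conn_type spark \
    --conn_host spark://spark --conn_port 7077
```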

How to run the Spark Apps via spark-submit

After your containers have started, run the command below in your terminal:

$ docker exec -it docker_spark_1 spark-submit --master spark://spark:7077 <spark_app_path> [optional]<list_of_app_args>

Example running the hello-world.py application:

$ docker exec -it docker_spark_1 spark-submit --master spark://spark:7077 /usr/local/spark/app/hello-world.py /usr/local/spark/resources/data/airflow.cfg

Increasing the number of Spark Workers

You can increase the number of Spark workers by adding new services based on the bitnami/spark:3.1.2 image to the docker-compose.yml file, like the following:

spark-worker-n:
        image: bitnami/spark:3.1.2
        user: root
        networks:
            - default_net
        environment:
            - SPARK_MODE=worker
            - SPARK_MASTER_URL=spark://spark:7077
            - SPARK_WORKER_MEMORY=1G
            - SPARK_WORKER_CORES=1
            - SPARK_RPC_AUTHENTICATION_ENABLED=no
            - SPARK_RPC_ENCRYPTION_ENABLED=no
            - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
            - SPARK_SSL_ENABLED=no
        volumes:
            - ../spark/app:/usr/local/spark/app # Spark scripts folder (must be the same path in Airflow and the Spark cluster)
            - ../spark/resources/data:/usr/local/spark/resources/data # Data folder (must be the same path in Airflow and the Spark cluster)
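
Alternatively, if the worker service has no fixed container_name, docker-compose can replicate a single service definition directly; a sketch assuming the service is named spark-worker:

$ docker-compose up -d --scale spark-worker=3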

Adding Airflow Extra packages

Rebuild the image (in this example, adding the gcp extra):

$ docker build --rm --build-arg AIRFLOW_DEPS="gcp" -t docker-airflow-spark:1.10.7_3.1.2 .

After the build succeeds, run docker-compose to start the containers:

$ docker-compose up

More info at: https://github.com/puckel/docker-airflow#build

Useful docker commands

List Images:
$ docker images <repository_name>

List Containers:
$ docker container ls

Check container logs:
$ docker logs -f <container_name>

To rebuild an image after changing something (run inside the directory containing the Dockerfile):
$ docker build --rm -t <tag_name> .

Access container bash:
$ docker exec -i -t <container_name> /bin/bash

Useful docker-compose commands

Start Containers:
$ docker-compose -f <compose-file.yml> up -d

Stop Containers:
$ docker-compose -f <compose-file.yml> down --remove-orphans

Extras

Spark + Postgres sample
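
A hedged sketch of submitting such a job yourself, pulling the Postgres JDBC driver at submit time (the app path is a placeholder, and the --packages coordinate is a published Maven artifact, not something this repo pins):

```
$ docker exec -it docker_spark_1 spark-submit \
    --master spark://spark:7077 \
    --packages org.postgresql:postgresql:42.2.23 \
    /usr/local/spark/app/<your_postgres_app>.py
```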
