Docker Hadoop Workbench

A Hadoop cluster based on Docker, including Hive and Spark.

Introduction

This repository uses Docker Compose to initialize a Hadoop cluster including the following:

  • Hadoop
  • Hive
  • Spark

Please note that this project is built on top of Big Data Europe's work. Please check their Docker Hub for the latest images.

This project is based on the following Docker and Docker Compose versions:

Client:
 Version:           20.10.2
Server: Docker Engine - Community
 Engine:
  Version:          20.10.6
docker-compose version 1.29.1, build c34c88b2

Quick Start

To start the cluster, simply run:

./start_demo.sh

Alternatively, you can use v2, which is built on top of my own spark-master and spark-history-server:

./start_demo_v2.sh

You can stop the cluster using ./stop_demo.sh or ./stop_demo_v2.sh. You can also modify DOCKER_COMPOSE_FILE in start_demo.sh and stop_demo.sh to use other YAML files.
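
The start and stop scripts point docker-compose at the file named in DOCKER_COMPOSE_FILE, so switching that variable is roughly equivalent to running docker-compose with a different file yourself. A minimal sketch (the file name below is an assumption, not one of the shipped YAML files):

DOCKER_COMPOSE_FILE=docker-compose-custom.yml   # assumed file name
docker-compose -f "$DOCKER_COMPOSE_FILE" up -d  # start the cluster
docker-compose -f "$DOCKER_COMPOSE_FILE" down   # stop the cluster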

Interfaces

The cluster exposes several web UIs that are referenced throughout this README, including the YARN ResourceManager at http://localhost:8088/cluster/apps, the application history server at http://localhost:8188/applicationhistory, the Spark master UI at http://localhost:8080/, and the Spark application UI at http://localhost:4040/jobs/.

Connections

Use hdfs dfs to connect to hdfs://localhost:9000/ (Please make sure you have Hadoop installed first):

hdfs dfs -ls hdfs://localhost:9000/
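
As a quick smoke test you can push a local file into HDFS and read it back; the paths below are just examples:

echo "hello hdfs" > hello.txt
hdfs dfs -mkdir -p hdfs://localhost:9000/tmp/smoke-test
hdfs dfs -put -f hello.txt hdfs://localhost:9000/tmp/smoke-test/
hdfs dfs -cat hdfs://localhost:9000/tmp/smoke-test/hello.txt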

Use Beeline to connect to HiveServer2 (Please make sure you have Hive installed first):

beeline -u jdbc:hive2://localhost:10000/default -n hive -p hive
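
Beeline can also run a single statement non-interactively with its -e flag, for example:

beeline -u jdbc:hive2://localhost:10000/default -n hive -p hive -e "show databases;"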

Use spark-shell to connect to Hive Metastore via thrift protocol (Please make sure you have Spark installed first):

$ spark-shell

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.11)

scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local")
              .config("hive.metastore.uris", "thrift://localhost:9083")
              .enableHiveSupport.appName("thrift-test").getOrCreate

spark.sql("show databases").show


// Exiting paste mode, now interpreting.

+---------+
|namespace|
+---------+
|  default|
+---------+

import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@1223467f

Use Presto CLI to connect to Presto and query Hive data:

wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.255/presto-cli-0.255-executable.jar
mv presto-cli-0.255-executable.jar presto
chmod +x presto
./presto --server localhost:8090 --catalog hive --schema default
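
This drops you into an interactive prompt. The CLI can also run a single statement with --execute, for example:

./presto --server localhost:8090 --catalog hive --schema default --execute "show tables;"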

Run MapReduce Job WordCount

This part is based on Big Data Europe's Hadoop Docker project.

First run hadoop-base as a helper container:

docker run -d --network hadoop --env-file hadoop.env --name hadoop-base bde2020/hadoop-base:2.0.0-hadoop3.2.1-java8 tail -f /dev/null

Then run the following:

docker exec -it hadoop-base hdfs dfs -mkdir -p /input/
docker exec -it hadoop-base hdfs dfs -copyFromLocal -f /opt/hadoop-3.2.1/README.txt /input/
docker exec -it hadoop-base mkdir jars
docker cp jars/WordCount.jar hadoop-base:jars/WordCount.jar
docker exec -it hadoop-base /bin/bash 
hadoop jar jars/WordCount.jar WordCount /input /output

You should be able to see the job at http://localhost:8088/cluster/apps and, once it finishes, at http://localhost:8188/applicationhistory.

After the job is finished, check the result:

hdfs dfs -cat /output/*

Then type exit to exit the container.
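
If you would rather have the result on the host, one way (a sketch using a temporary path inside the container) is to merge the output files and copy them out:

docker exec -it hadoop-base hdfs dfs -getmerge /output /tmp/wordcount-output.txt
docker cp hadoop-base:/tmp/wordcount-output.txt .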

Run Hive Job

Make sure hadoop-base is running. Then prepare the data:

docker exec -it hadoop-base hdfs dfs -mkdir -p /test/
docker exec -it hadoop-base mkdir test
docker cp data hadoop-base:test/data
docker exec -it hadoop-base /bin/bash
hdfs dfs -put test/data/* /test/
hdfs dfs -ls /test
exit

Then create the table:

docker cp scripts/hive-beers.q hive-server:hive-beers.q
docker exec -it hive-server /bin/bash
cd /
hive -f hive-beers.q
exit
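
To verify that the script created the table, you can list the tables from outside the container (assuming, as the next step implies, that the table lives in the test database):

docker exec -it hive-server hive -e "show tables in test;"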

Then play with data using Beeline:

beeline -u jdbc:hive2://localhost:10000/test -n hive -p hive

0: jdbc:hive2://localhost:10000/test> select count(*) from beers;
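
Any other HiveQL statement works the same way at this prompt; for example, grouping by brewery_id (one of the columns shown in the Spark Shell section below):

0: jdbc:hive2://localhost:10000/test> select brewery_id, count(*) from beers group by brewery_id limit 10;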

You should be able to see the job at http://localhost:8088/cluster/apps and, once it finishes, at http://localhost:8188/applicationhistory.

Run Spark Shell

Make sure you have prepared the data and created the table in the previous step.

docker exec -it spark-master spark/bin/spark-shell

scala> spark.sql("show databases").show
+---------+
|namespace|
+---------+
|  default|
|     test|
+---------+

scala> val df = spark.sql("select * from test.beers")
df: org.apache.spark.sql.DataFrame = [id: int, brewery_id: int ... 11 more fields]

scala> df.count
res0: Long = 7822

You should be able to see the Spark Shell session at http://localhost:8080/ and your job at http://localhost:4040/jobs/.

If you encounter the following warning when running spark-shell:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Check the logs of spark-master using docker logs -f spark-master. If the following message appears, restart spark-worker using docker-compose restart spark-worker:

WARN Master: Got heartbeat from unregistered worker worker-20210622022950-xxx.xx.xx.xx-xxxxx. This worker was never registered, so ignoring the heartbeat.

Similarly, to run spark-sql, use docker exec -it spark-master spark/bin/spark-sql.
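
For instance, the count from the Spark Shell section can be run non-interactively with spark-sql's -e flag (a small illustration):

docker exec -it spark-master spark/bin/spark-sql -e "select count(*) from test.beers"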

Run Spark Submit

docker exec -it spark-master /spark/bin/spark-submit --class org.apache.spark.examples.SparkPi /spark/examples/jars/spark-examples_2.12-3.1.1.jar 100

You should be able to see Spark Pi at http://localhost:8080/ and your job at http://localhost:4040/jobs/.
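
If you want to target the standalone master explicitly, spark-submit also accepts --master. The URL below assumes the conventional standalone port 7077 rather than anything taken from this project's configuration:

docker exec -it spark-master /spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --class org.apache.spark.examples.SparkPi \
  /spark/examples/jars/spark-examples_2.12-3.1.1.jar 100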

Configuration Files

Some configuration file locations are listed below. The non-empty configuration files are also copied to the conf folder for future reference.

  • namenode:
    • /etc/hadoop/core-site.xml CORE_CONF
    • /etc/hadoop/hdfs-site.xml HDFS_CONF
    • /etc/hadoop/yarn-site.xml YARN_CONF
    • /etc/hadoop/httpfs-site.xml HTTPFS_CONF
    • /etc/hadoop/kms-site.xml KMS_CONF
    • /etc/hadoop/mapred-site.xml MAPRED_CONF
  • hive-server:
    • /opt/hive/hive-site.xml HIVE_CONF
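
To inspect one of these files or pull it out of a running container (assuming the container names match the service names above), for example core-site.xml from the namenode:

docker exec -it namenode cat /etc/hadoop/core-site.xml
docker cp namenode:/etc/hadoop/core-site.xml conf/core-site.xml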
