Giter Site home page Giter Site logo

kafka-sparkstreaming-cassandra's Introduction

Docker container for Kafka - Spark streaming - Cassandra

This Dockerfile sets up a complete streaming environment for experimenting with Kafka, Spark streaming (PySpark), and Cassandra. It installs

  • Kafka 0.10.2.1
  • Spark 2.1.1 for Scala 2.11
  • Cassandra 3.7

It additionnally installs

  • Anaconda distribution 4.4.0 for Python 2.7.10
  • Jupyter notebook for Python

Quick start-up guide

Run container using DockerHub image

docker run -p 4040:4040 -p 8888:8888 -p 23:22 -ti --privileged yannael/kafka-sparkstreaming-cassandra

See following video for usage demo.
Demo

Note that any changes you make in the notebook will be lost once you exit de container. In order to keep the changes, it is necessary put your notebooks in a folder on your host, that you share with the container, using for example

docker run -v `pwd`:/home/guest/host -p 4040:4040 -p 8888:8888 -p 23:22 -ti --privileged yannael/kafka-sparkstreaming-cassandra

Note:

  • The "-v pwd:/home/guest/host" shares the local folder (i.e. folder containing Dockerfile, ipynb files, etc...) on your computer - the 'host') with the container in the '/home/guest/host' folder.
  • Port are shared as follows:
    • 4040 bridges to Spark UI
    • 8888 bridges to the Jupyter Notebook
    • 23 bridges to SSH

SSH allows to get a onnection to the container

ssh -p 23 guest@containerIP

where 'containerIP' is the IP of th container (127.0.0.1 on Linux). Password is 'guest'.

Start services

Once run, you are logged in as root in the container. Run the startup_script.sh (in /usr/bin) to start

  • SSH server. You can connect to the container using user 'guest' and password 'guest'
  • Cassandra
  • Zookeeper server
  • Kafka server
startup_script.sh

Connect, create Cassandra table, open notebook and start streaming

Connect as user 'guest' and go to 'host' folder (shared with the host)

su guest

Start Jupyter notebook

notebook

and connect from your browser at port host:8888 (where 'host' is the IP for your host. If run locally on your computer, this should be 127.0.0.1 or 192.168.99.100, check Docker documentation)

Start Kafka producer

Open kafkaSendDataPy.ipynb and run all cells.

Start Kafka receiver

Open kafkaReceiveAndSaveToCassandraPy.ipynb and run cells up to start streaming. Check in subsequent cells that Cassandra collects data properly.

Connect to Spark UI

It is available in your browser at port 4040

Container configuration details

The container is based on CentOS 6 Linux distribution. The main steps of the building process are

  • Install some common Linux tools (wget, unzip, tar, ssh tools, ...), and Java (1.8)
  • Create a guest user (UID important for sharing folders with host!, see below), and install Spark and sbt, Kafka, Anaconda and Jupyter notbooks for the guest user
  • Go back to root user, and install startup script (for starting SSH and Cassandra services), sentenv.sh script to set up environment variables (JAVA, Kafka, Spark, ...), spark-default.conf, and Cassandra

User UID

In the Dockerfile, the line

RUN useradd guest -u 1000

creates the user under which the container will be run as a guest user. The username is 'guest', with password 'guest', and the '-u' parameter sets the linux UID for that user.

In order to make sharing of folders easier between the container and your host, make sure this UID matches your user UID on the host. You can see what your host UID is with

echo $UID

Build and running the container from scratch

Clone this repository

git clone https://github.com/Yannael/kafka-sparkstreaming-cassandra

Build

From Dockerfile folder, run

docker build -t kafka-sparkstreaming-cassandra .

It may take about 30 minutes to complete.

Run

docker run -v `pwd`:/home/guest/host -p 4040:4040 -p 8888:8888 -p 23:22 -ti --privileged kafka-sparkstreaming-cassandra

kafka-sparkstreaming-cassandra's People

Contributors

yannael avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

kafka-sparkstreaming-cassandra's Issues

Can't connect to Cassandra

Hello, I am trying to run this but I am having trouble to connect to Cassandra. Do you know what might be the issue? FYI I am building from scratch.

/home/guest/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
310 raise Py4JJavaError(
311 "An error occurred while calling {0}{1}{2}.\n".
--> 312 format(target_id, ".", name), value)
313 else:
314 raise Py4JError(

Py4JJavaError: An error occurred while calling o250.load.
: java.io.IOException: Failed to open native connection to Cassandra at {127.0.0.1}:9042
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:163)

litpc45.ulb.ac.be timed out

Seems that the host litpc45.ulb.ac.be is not available at this time. Could you provide another location for the package?
It would be better than compiling it by myself ๐Ÿ˜

I've tried with different internet providers and seems that is truly down.
Also: http://downforeveryoneorjustme.com/litpc45.ulb.ac.be and even and exhaustive nmap scan shows a host down.

The step is located at line 26 of the Dockerfile:

#Install Spark
#Spark 2.0.0, precompiled with : mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Dyarn.version=2.6.0 -DskipTests -Dscala-2.11 -Phive -Phive-thriftserver clean package
RUN wget http://litpc45.ulb.ac.be/spark-2.0.0_Hadoop-2.6_Scala-2.11.tgz
RUN tar xvzf spark-2.0.0_Hadoop-2.6_Scala-2.11.tgz

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.