
community-application-code's Introduction

SPA Assignment for WILP

Kafka - Spark Streaming - Cassandra Community Streaming Application


GitHub link to the API used : https://github.com/lukePeavey/quotable
Endpoint used as a post source : https://api.quotable.io/random
  1. A user creates a post in real time (data is scraped from the API above) -->
  2. Posts are pushed to a topic by a Kafka producer -->
  3. Posts are read by a Kafka consumer -->
  4. Posts are taken as input by the Spark streaming application -->
  5. Spark performs analytical operations on the received data in real time -->
  6. The processed data is stored in a highly available Cassandra database -->
  7. The data is retrieved from Cassandra for further analysis.
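As a sketch of step 1, a response from the quotable API can be mapped to the post record that gets pushed to Kafka. The input field names (`content`, `author`, `tags`) follow the API linked above; the output record shape and the `user` field are illustrative assumptions, not the project's actual schema:

```python
import json
import time

def to_post(quote: dict, user: str = "demo-user") -> dict:
    """Map a quotable API response to a post record for Kafka.

    Output shape is an illustrative assumption, not the project's schema.
    """
    return {
        "user": user,
        "text": quote.get("content", ""),
        "author": quote.get("author", "unknown"),
        "tags": quote.get("tags", []),
        "created_at": int(time.time()),
    }

# A canned response, so the sketch runs without hitting the network.
sample = {"content": "Be yourself.", "author": "Oscar Wilde", "tags": ["wisdom"]}
post = to_post(sample)
print(json.dumps(post))
```

In the real pipeline the input would come from an HTTP call to the random endpoint instead of the canned dict.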

Contributors :

1. Sanka Mahesh Sai
2. Snigdha Tarua
3. Chandra Sekhar Gupta Aravapalli

Assignment Description :


  1. Out of the two main workflows defined in Assignment 1:
  • Social Media Posts complete workflow
  • Messenger/conversations between users/user groups
    Choose any one workflow and implement it using open-source technologies such as Kafka, Spark, Flink, etc., in one programming language (Python/Java).
  2. Create a streaming analytics pipeline and a dashboard that shows real-time insights of the application. Note: based on your workflow, decide which data points would be valuable to gather and generate insights from.

  3. Submit both sets of code

  • Of the working project
  • Of the analytics pipeline
    separately, along with a link to a short 5-10 minute video that helps explain how the integration between the different system subcomponents works. Proper flow needs to be shown between the different classes defined for the workflow and data pipeline.

Overview :

This is an example of building a Kafka + Spark streaming application from scratch.

When planning this project, we thought Twitter would be a great source of streamed data, and it would be easy to perform simple transformations on it. So for this example, we have :

  1. A fake Twitter data generator
  2. A stream of tweets sent to a Kafka queue
  3. A Spark job that pulls the tweets from the Kafka cluster and performs analysis
  4. A Cassandra table where the processed data is saved
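Item 1 can be sketched as a small generator. The user names, vocabulary, and field names below are made up for illustration, not taken from the project's code:

```python
import random
import time

# Canned fragments for generating fake tweets; purely illustrative.
USERS = ["alice", "bob", "carol"]
WORDS = ["kafka", "spark", "cassandra", "streaming", "data"]

def fake_tweet() -> dict:
    """Return one fake tweet as a dict, ready for JSON serialization."""
    return {
        "user": random.choice(USERS),
        "text": " ".join(random.choice(WORDS) for _ in range(5)),
        "timestamp": int(time.time()),
    }

tweet = fake_tweet()
print(tweet)
```

Calling `fake_tweet()` in a loop with a short `time.sleep` between iterations simulates a live stream of posts.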

Architecture :

Data source (posts) --> Kafka --> Spark --> Cassandra  

To do this, we are going to set up an environment that includes :

  • A single-node Kafka cluster
  • A single-node Hadoop cluster
  • Cassandra and Spark

Download and extract both archives. Spark version : spark-3.1.2-bin-hadoop3.2.tgz. Kafka version : kafka-2.8.0-src.tgz.

Install Kafka :

"Installing" Kafka is done by downloading the code from one of the several mirrors. After finding the latest binaries from the downloads page, choose one of the mirror sites and wget it into your home directory.


~$ tar -xvf kafka-2.8.0-src.tgz
~$ mv kafka-2.8.0-src kafka
~$ sudo apt install openjdk-8-jdk -y
~$ java -version
~$ pip3 install kafka-python
~$ pip3 list | grep kafka
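With kafka-python installed, a minimal producer for the posts might look like the sketch below. The topic name `posts` and the localhost broker address are assumptions; `send_posts` is not called at import time because it needs a running broker:

```python
import json

def serialize_post(post: dict) -> bytes:
    """Encode a post dict as UTF-8 JSON bytes, the format Kafka carries."""
    return json.dumps(post).encode("utf-8")

def send_posts(posts, bootstrap="localhost:9092", topic="posts"):
    """Push posts to Kafka. Broker address and topic name are assumptions;
    requires a broker to be running, so it is not invoked at import time."""
    from kafka import KafkaProducer  # kafka-python, installed above
    producer = KafkaProducer(bootstrap_servers=bootstrap,
                             value_serializer=serialize_post)
    for post in posts:
        producer.send(topic, post)
    producer.flush()

print(serialize_post({"user": "alice", "text": "hello"}))
```

A consumer-side sketch would mirror this with `KafkaConsumer` and a matching `value_deserializer`.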

Install Spark :

Download from https://spark.apache.org/downloads.html, and make sure you choose the option for Hadoop 2.7 or later (unless you use an earlier Hadoop version).

Unpack it, rename it

~$ tar -xvf Downloads/spark-3.1.2-bin-hadoop3.2.tgz
~$ mv spark-3.1.2-bin-hadoop3.2 spark
~$ pip3 install pyspark
~$ pip3 list | grep spark
~$ export PATH=$PATH:/home/<USER>/spark/bin
~$ export PYSPARK_PYTHON=python3
~$ pyspark

Using Python version ....
SparkSession available as 'spark'.
>>> 


Run code :

  1. Start ZooKeeper
$ bin/zookeeper-server-start.sh config/zookeeper.properties
  2. Start the Kafka server
$ bin/kafka-server-start.sh config/server.properties
  3. pip install kafka-python

  4. pip install pyspark

  5. python producer.py

  6. python consumer.py

  7. Create tables in Cassandra

  8. Navigate to the extracted Spark folder

  9. Run the spark-submit job

$ ./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 --master local ~/Desktop/stream_assignment/stream/om/trail_spark.py 100
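The analysis inside the Spark job is not shown here, so as an assumed stand-in for the kind of aggregation it might perform, counting posts per author over one micro-batch can be sketched in pure Python:

```python
from collections import Counter

def posts_per_author(batch):
    """Count how many posts each author produced in one micro-batch."""
    return Counter(post["author"] for post in batch)

# A hypothetical micro-batch of consumed posts.
batch = [
    {"author": "Oscar Wilde", "text": "Be yourself."},
    {"author": "Seneca", "text": "Luck is what happens when..."},
    {"author": "Oscar Wilde", "text": "No good deed goes unpunished."},
]
print(posts_per_author(batch))
```

In the actual pipeline the equivalent would be a `groupBy("author").count()` over a streaming DataFrame, with the result written to Cassandra.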

