
spark_kafka's Introduction

PySpark dependencies to set up on macOS

    1. SPARK_HOME (e.g. echo $SPARK_HOME, output: ./my_projects/spark-3.0.1-bin-hadoop2.7)
    2. PYSPARK_PYTHON --> By default, Python is not part of the Spark environment; setting PYSPARK_PYTHON establishes the link between the Python and Spark environments. (e.g. echo $PYSPARK_PYTHON, output: ./pyspark)
    3. PYTHONPATH --> (e.g. echo $PYTHONPATH, output: /Users/prammitr/Documents/Doc/my_projects/spark-3.0.1-bin-hadoop2.7/python:/Users/prammitr/Documents/Doc/my_projects/spark-3.0.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip)

** Check setup: this should return a valid value (e.g. echo $SPARK_HOME/python/lib/, output: /Users/prammitr/Documents/Doc/my_projects/spark-3.0.1-bin-hadoop2.7/python/lib/)

Test Local Setup: Run spark-shell (e.g. /Users/my_projects/spark-3.0.1-bin-hadoop2.7/bin/spark-shell)

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

scala>
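
Once at the scala> prompt, a quick sanity check of the setup (this uses only the SparkSession that spark-shell creates automatically; the values shown are what a 3.0.1 install should report):

    // spark-shell exposes a ready-made SparkSession named `spark`
    spark.version            // res0: String = 3.0.1
    spark.range(10).count()  // res1: Long = 10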

How to quit Scala REPL?

scala> :quit
(base) ✔ ~

Test Local Setup: Run pyspark (e.g. /Users/my_projects/spark-3.0.1-bin-hadoop2.7/bin/pyspark)

11:44 $ /Users/my_projects/spark-3.0.1-bin-hadoop2.7/bin/pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Python version 2.7.16 (default, Jun 5 2020 22:59:21)
SparkSession available as 'spark'.

Set up IntelliJ IDEA with Scala as the Spark programming language and SBT as the build tool.

Steps:

    1. Install IntelliJ IDEA (Community edition or otherwise)
    2. Start IntelliJ
    3. Install the Scala plugin: IntelliJ -> Configure -> Plugins -> search for "Scala" -> Install -> restart the IDE
    4. IntelliJ -> Open

Sample Code: github.com/LearningJournal/Spark-Streaming-in-Scala

Learning notes: if you see an error like "Error: Could not find or load main class", make sure to import the project from external sources in IntelliJ as an SBT project. Steps: File > New > Project from Existing Sources > (select) Import project from external model > (select) sbt > Next > Finish.
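
For reference, a minimal build.sbt for such an SBT-based Spark project might look like the one below (the project name and Scala version are illustrative and follow the Spark 3.0.1 / Scala 2.12 setup used above; the repository's actual file may differ):

    // build.sbt
    name := "spark-streaming-demo"
    version := "0.1"
    scalaVersion := "2.12.10"

    // Structured Streaming ships inside spark-sql
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql" % "3.0.1"
    )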

Kafka Installation

https://kafka.apache.org/quickstart

1. Download Kafka binary

1.1. zookeeper.properties -> Used by the ZooKeeper server

  • Changes --> dataDir=../kafka-logs/zookeeper

1.2. server.properties -> Kafka broker

  • Changes --> Uncomment listeners=PLAINTEXT://:9092
  • Update log.dirs=../kafka-logs/server-0

2. Set KAFKA_HOME and add the Kafka bin directory to your PATH (e.g. in your shell profile)

2.1 Start ZooKeeper --> bin/zookeeper-server-start.sh config/zookeeper.properties

2.2 Start Kafka Server --> bin/kafka-server-start.sh config/server.properties

2.3 Create a "topic" --> bin/kafka-topics.sh --create --topic quickstart-events --bootstrap-server localhost:9092 --> Output: Created topic quickstart-events.

2.4 Kafka Producer --> kafka-console-producer.sh --topic quickstart-events --bootstrap-server localhost:9092

Hello Spark Streaming1
Hello Spark Streaming2

2.5 Kafka Consumer --> kafka-console-consumer.sh --topic quickstart-events --from-beginning --bootstrap-server localhost:9092

Hello Spark Streaming1
Hello Spark Streaming2

Netcat on macOS

nc -l 9999

Project -> Spark Streaming Word Count Listener

Requirements:

    1. Netcat on a terminal (nc -l 9999)
    2. Code StreamingWC.scala (see the sketch after this list)
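
A minimal sketch of what StreamingWC.scala could look like, assuming a Structured Streaming word count over the socket source on port 9999 (the object name and options are illustrative; the repository's code may differ):

    import org.apache.spark.sql.SparkSession

    object StreamingWC {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("StreamingWC")
          .master("local[2]")
          .getOrCreate()
        import spark.implicits._

        // Read lines from the netcat listener started with: nc -l 9999
        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load()

        // Split each line into words and keep a running count per word
        val counts = lines.as[String]
          .flatMap(_.split(" "))
          .groupBy($"value".as("word"))
          .count()

        // Print the full result table to the console after every micro-batch
        val query = counts.writeStream
          .outputMode("complete")
          .format("console")
          .start()

        query.awaitTermination()
      }
    }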

Demo process ->

  • Step 1: From a terminal -> nc -l 9999
  • Step 2: IntelliJ -> Scala/SBT project -> write the code (HelloSparkSQL.scala)
  • Step 3: Start the program from IntelliJ
  • Step 4: Type some text in the terminal
  • Step 5: Observe the output

Batch: 0

+---------+-----+
|     word|count|
+---------+-----+

Batch: 6

+---------+-----+
|     word|count|
+---------+-----+
|    Spark|    4|
|    Hello|    3|
| Sparking|    1|
|Streaming|    2|
|       is|    1|
+---------+-----+

Adding the Kafka Spark Streaming Maven package dependency

Steps:

  • 1: From the Apache Maven repository, search for spark-sql-kafka-0-10_2.12
  • 2: Add that to build.sbt (follow the example in the build.sbt file used for KafkaStreamDemo.scala)
  • 3: Make sure this package and its dependencies are available to the Spark master and executors. How to do that?
  • Add the following line to spark-defaults.conf: spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0
  • Alternatively, use the --packages option with spark-submit.
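
With that package on the classpath, a minimal sketch of consuming the quickstart-events topic from Structured Streaming might look like the following (the topic and bootstrap server match the Kafka quickstart steps above; this only illustrates the pattern and is not the repository's KafkaStreamDemo.scala verbatim):

    import org.apache.spark.sql.SparkSession

    object KafkaStreamDemo {
      def main(args: Array[String]): Unit = {
        // The spark-sql-kafka-0-10 package must already be on the classpath,
        // e.g. via the spark-defaults.conf entry or --packages option above.
        val spark = SparkSession.builder()
          .appName("KafkaStreamDemo")
          .master("local[2]")
          .getOrCreate()

        // Subscribe to the topic created in step 2.3
        val df = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "quickstart-events")
          .option("startingOffsets", "earliest")
          .load()

        // Kafka delivers key/value as binary; cast value to a readable string
        val messages = df.selectExpr("CAST(value AS STRING) AS message")

        val query = messages.writeStream
          .outputMode("append")
          .format("console")
          .start()

        query.awaitTermination()
      }
    }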
