- SPARK_HOME --> Points to the Spark installation directory. (e.g. echo $SPARK_HOME Output: ./my_projects/spark-3.0.1-bin-hadoop2.7)
- PYSPARK_PYTHON --> By default, Python is not part of the Spark environment; setting PYSPARK_PYTHON tells Spark which Python interpreter to use, linking the Python and Spark environments. (e.g. echo $PYSPARK_PYTHON Output: ./pyspark)
- PYTHONPATH --> Must include Spark's python directory and the bundled Py4J zip so the pyspark modules can be imported. (e.g. echo $PYTHONPATH Output: /Users/prammitr/Documents/Doc/my_projects/spark-3.0.1-bin-hadoop2.7/python:/Users/prammitr/Documents/Doc/my_projects/spark-3.0.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip)
** Check Setup: This should return a valid path (echo $SPARK_HOME/python/lib/ Output: /Users/prammitr/Documents/Doc/my_projects/spark-3.0.1-bin-hadoop2.7/python/lib/)
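To double-check that these variables are actually visible to the JVM, a minimal sketch that can be pasted into any Scala REPL (including the spark-shell started in the next step); the paths above are examples only:

```scala
// Sanity check: print the environment variables this guide relies on.
Seq("SPARK_HOME", "PYSPARK_PYTHON", "PYTHONPATH").foreach { name =>
  println(s"$name=${sys.env.getOrElse(name, "<not set>")}")
}
```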
Test Local Setup: Run spark-shell (e.g. /Users/my_projects/spark-3.0.1-bin-hadoop2.7/bin/spark-shell)
(Spark ASCII welcome banner, version 3.0.1)
scala>
scala> :quit
Then run pyspark the same way: /Users/my_projects/spark-3.0.1-bin-hadoop2.7/bin/pyspark
(Spark ASCII welcome banner, version 3.0.1)
Using Python version 2.7.16 (default, Jun 5 2020 22:59:21)
SparkSession available as 'spark'.
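Once either shell is up, a one-line job confirms the local master actually executes work; a minimal sketch using the `spark` session that spark-shell creates automatically:

```scala
// Smoke test inside spark-shell: run a tiny job on the local master.
spark.range(1, 6).selectExpr("sum(id) AS total").show()
// Expected:
// +-----+
// |total|
// +-----+
// |   15|
// +-----+
```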
- Install IntelliJ (Community or otherwise)
- Start IntelliJ
- Scala Plugin Install: Click IntelliJ -> Configure -> Plugins -> Search for "Scala" -> Install -> Restart IDE.
- IntelliJ -> Open
Then make sure to import the project from an external source in IntelliJ as an SBT project. Steps: File > New > Project from Existing Sources > (select) Import project from external model > (select) sbt > click Next > Finish
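For reference, a minimal build.sbt such an SBT project might contain; the project name and exact versions are assumptions and should match whatever the actual project uses:

```scala
// Hypothetical minimal build.sbt for a Spark 3.0.1 / Scala 2.12 project.
name := "spark-streaming-demo"   // assumed project name
version := "0.1"
scalaVersion := "2.12.10"

// spark-sql pulls in spark-core; "provided" keeps it out of the assembled jar
// when the job is run with spark-submit.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.1" % "provided"
```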
- Kafka Quickstart: https://kafka.apache.org/quickstart
- Changes to config/zookeeper.properties --> dataDir=../kafka-logs/zookeeper
- Changes to config/server.properties --> Uncomment listeners=PLAINTEXT://:9092; Update log.dirs=../kafka-logs/server-0
2.3 Create topic --> bin/kafka-topics.sh --create --topic quickstart-events --bootstrap-server localhost:9092 --> Created topic quickstart-events.
2.4 Kafka Producer --> bin/kafka-console-producer.sh --topic quickstart-events --bootstrap-server localhost:9092
Hello Spark Streaming1
Hello Spark Streaming2
2.5 Kafka Consumer --> bin/kafka-console-consumer.sh --topic quickstart-events --from-beginning --bootstrap-server localhost:9092
Hello Spark Streaming1
Hello Spark Streaming2
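The console producer/consumer pair above is the quickest smoke test; the same messages can also be produced programmatically. A minimal sketch, assuming the kafka-clients library is on the classpath (it is not part of the setup above):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Send the same two test messages to the quickstart-events topic.
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("quickstart-events", "Hello Spark Streaming1"))
producer.send(new ProducerRecord[String, String]("quickstart-events", "Hello Spark Streaming2"))
producer.close()
```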
- Netcat on terminal (nc -l 9999)
- Code StreamingWC.scala (a sketch of this file appears after the sample output below)
- Step1 : From Terminal -> nc -l 9999
- Step2 : IntelliJ -> Scala/SBT Project -> Write code StreamingWC.scala
- Step3 : Start the program from IntelliJ
- Step4 : Type some words into the netcat terminal
- Step5 : Observe the output
+---------+-----+
|     word|count|
+---------+-----+
|    Spark|    4|
|    Hello|    3|
| Sparking|    1|
|Streaming|    2|
|       is|    1|
+---------+-----+
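A sketch of what StreamingWC.scala might contain, assuming the standard Structured Streaming socket source and console sink; the object name and options are illustrative, not the exact project code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamingWC {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StreamingWC")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Lines arrive from the netcat listener started with `nc -l 9999`.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Split each line into words and keep a running count per word.
    val counts = lines
      .select(explode(split($"value", "\\s+")).as("word"))
      .groupBy("word")
      .count()

    // "complete" mode reprints the full word-count table on every trigger.
    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```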
- 1: From the Apache Maven repo, search for spark-sql-kafka-0-10_2.12
- 2: Add that dependency to build.sbt (follow the example from the KafkaStreamDemo.scala project's build.sbt file)
- 3: Make sure this package and its dependencies are available to the Spark master and executors. How to do that?
- Add the following line to conf/spark-defaults.conf
- spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 ** Alternatively, use the --packages option with "spark-submit"
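Putting the pieces together, a minimal sketch of reading the quickstart-events topic from Structured Streaming once the Kafka package is available; the object name is an assumption (see KafkaStreamDemo.scala for the actual project code):

```scala
// build.sbt line for step 2 above (the version should match the Spark version in use):
// libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.0.1"

import org.apache.spark.sql.SparkSession

object KafkaStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaStreamSketch")
      .master("local[*]")
      .getOrCreate()

    // Read the topic populated by the console producer; key/value arrive as binary.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "quickstart-events")
      .option("startingOffsets", "earliest")
      .load()

    // Cast the binary columns to strings and print each micro-batch to the console.
    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```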