yabola / spark-util

This project forked from linmingqiang/sparkstreaming


:boom: :alien: :hotsprings: :rocket: Encapsulates Spark/Spark Streaming to read Kafka with low-level integration (offsets stored in ZooKeeper), and wraps other component tools such as HBase, Flink, Kudu, ES, and Spark Streaming 1.6 with Kafka 0.10 (SSL). :boom: :alien: :hotsprings: :rocket:

Home Page: https://github.com/LinMingQiang/spark-util/blob/master/README.md


spark-util's Introduction

🎉Spark-Util

There are many components in the big data ecosystem, and Spark, as a mainstream big data technology, is often combined with other components in real-world development, yet there is no reasonably complete official toolkit for this. This project encapsulates some commonly used integrations, such as SparkStreaming-Kafka and Spark-HBase. It is well suited to simple projects, or as reading material for newcomers. The code below is used in production and runs smoothly; no bugs have been found so far. If you run into a bug or have a new idea while using it, please leave a message here. ------LinMingQiang

Recent updates:

StreamingDynamicContext ( https://github.com/LinMingQiang/spark-kafka ):
Implements streaming computation without using StreamingContext. StreamingContext executes jobs at a strict, fixed interval, so when the job time is far shorter than the batch time, a lot of time is spent sleeping while waiting for the next batch to start (see the StreamingContext source code for details). The design of StreamingDynamicContext borrows from StreamingContext, but job submission does not use a queue to accumulate pending jobs: when a job finishes, the user can choose to run the next batch immediately or to keep waiting for a specified duration.

StreamingKafkaContext: you can still use StreamingContext for streaming computation; this API encapsulates reading Kafka data.
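The following is a minimal sketch of the StreamingDynamicContext scheduling idea described above. It is illustrative only, not the real API: the names runBatch and shouldThrottle are made up.

```scala
import java.util.concurrent.TimeUnit

// Minimal sketch of the dynamic-batch idea; NOT the real StreamingDynamicContext API.
object DynamicBatchLoop {

  val minIntervalMs = 5000L // only used when the caller chooses to wait between batches

  def main(args: Array[String]): Unit = {
    while (true) {
      val start = System.currentTimeMillis()
      runBatch() // read one Kafka batch, compute, commit offsets
      val elapsed = System.currentTimeMillis() - start

      // Unlike StreamingContext's fixed interval, the caller decides here whether to
      // start the next batch immediately or to keep waiting for a specified duration.
      val waitMs = math.max(0L, minIntervalMs - elapsed)
      if (waitMs > 0 && shouldThrottle()) TimeUnit.MILLISECONDS.sleep(waitMs)
    }
  }

  def runBatch(): Unit = { /* read Kafka, compute, update offsets */ }
  def shouldThrottle(): Boolean = false // e.g. throttle only when the last batch was nearly empty
}
```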

Support


| spark version | scala version | Kafka version | hbase 1.0+ | es 2.3.0 | kudu 1.3.0 | SSL |
| --- | --- | --- | --- | --- | --- | --- |
| spark 1.3.x | 2.10 | 0.8 | 👌 | 🌟 | 🍆 | NO |
| spark 1.6.x | 2.10 | 0.8 | 🐤 | 🎅 | 🌽 | NO |
| spark 1.6.x | 2.10 | 0.10+ | 🐤 | 🎅 | 🌽 | YES |
| spark 2.0.x | 2.10/2.11 | 0.10+ | 😃 | 🍒 | 🍑 | YES |

POINT


  • Supports dynamically adjusting the streaming batch interval (unlike Spark Streaming's fixed-length batch interval).
  • Supports resetting topics during streaming, so data sources can be added or removed dynamically in production.
  • Adds rate control via KafkaRateController to control the read rate. Since Spark Streaming is not used, some of its rate-control parameters are unavailable and have to be computed manually; see the sketch after this list.
  • Provides spark-streaming-kafka-0-10_2.10 for Spark 1.6 to support Kafka SSL.
  • Supports rdd.updateOffset to manage offsets.
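A minimal sketch of how a per-partition rate limit could be computed manually. This is only an assumption about the general idea (capping each partition's untilOffset for a batch), not the actual KafkaRateController implementation; the OffsetRange case class and limit method are illustrative.

```scala
// Illustrative sketch: cap how many messages one batch reads per partition,
// similar in spirit to spark.streaming.kafka.maxRatePerPartition.
case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

object SimpleRateController {
  /** Clamp each partition's untilOffset so a batch reads at most maxRatePerPartition * batchSeconds records. */
  def limit(ranges: Seq[OffsetRange], maxRatePerPartition: Long, batchSeconds: Long): Seq[OffsetRange] = {
    val maxPerPartition = maxRatePerPartition * batchSeconds
    ranges.map { r =>
      val capped = math.min(r.untilOffset, r.fromOffset + maxPerPartition)
      r.copy(untilOffset = capped)
    }
  }
}
```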


🎃 Table of contents

Spark kafka

  • Encapsulates StreamingDynamicContext: dynamically adjusts the streaming batch interval, unlike Spark Streaming, where the batch interval is fixed.
  • With StreamingDynamicContext you can change your topics and the way Kafka RDDs are fetched while the streaming program is running, without restarting it.
  • Adds spark-streaming-kafka-0-10_2.10 for Spark Streaming 1.6 with Kafka 0.10, to support SSL.
  • Encapsulates the direct way for Spark/Spark Streaming to read Kafka data (low-level integration, offsets stored in ZooKeeper); provides rdd.updateOffset to manage offsets in ZK manually, plus many configuration parameters to control how Kafka data is read.
  • Supports adding new partitions to a topic.
  • Supports an operator for writing RDD data to Kafka.
  • Supports Kafka SSL (provides an integration API for Spark 1.6 + Kafka 0.10).
  • Adds the parameter 'kafka.consumer.from' to decide dynamically whether to read Kafka data from the latest offset or from the last consumption point.
  • Provides support for Spark 2.x with Kafka 0.10+ (the 0.8 API differs significantly from 0.10).
  • https://github.com/LinMingQiang/spark-util/tree/spark-kafka-0-8_1.6 or https://github.com/LinMingQiang/spark-kafka
  val kp = SparkKafkaContext.getKafkaParam(brokers, groupId, "consum", "earliest")
  val skc = new SparkKafkaContext(kp, sparkconf)
  val kafkadataRdd = skc.kafkaRDD(topics, last, msgHandle)
  // ...do something
  kafkadataRdd.updateOffsets(groupId) // update offsets to ZooKeeper

Spark Hbase

  • Scans HBase data into an RDD according to a Scan condition:
    scan -> RDD[T]
  • Batch-gets from HBase based on the data of an RDD:
    RDD[T] -> Get -> RDD[U]
  • Batch-writes RDD data to HBase:
    RDD[T] -> Put -> HBase
  • Batch-updates RDD data with HBase data:
    RDD[T] -> Get -> Combine -> RDD[U]
  • Batch-updates RDD data with HBase data and writes the result back to HBase:
    RDD[T] -> Get -> Combine -> Put -> HBase
   val conf = new SparkConf().setMaster("local").setAppName("test")
   val sc = new SparkContext(conf)
   val hc = new SparkHBaseContext(sc, zk)         // zk: ZooKeeper quorum of the HBase cluster
   hc.hbaseRDD(tablename, f).foreach { println }  // read the table into an RDD and print it
   hc.scanHbaseRDD(tablename, new Scan(), f)      // read with an explicit Scan condition
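For the RDD[T] -> Put -> HBase path, here is a minimal sketch that uses the plain HBase 1.0+ client API rather than this project's SparkHBaseContext; the table, column family, and record type are made-up placeholders.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// Illustrative sketch: write an RDD of (rowKey, value) pairs to HBase, one connection per partition.
def writeToHBase(rdd: RDD[(String, String)], zkQuorum: String, table: String): Unit = {
  rdd.foreachPartition { rows =>
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", zkQuorum)
    val conn = ConnectionFactory.createConnection(conf)
    val tbl = conn.getTable(TableName.valueOf(table))
    try {
      rows.foreach { case (rowKey, value) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        tbl.put(put)
      }
    } finally {
      tbl.close()
      conn.close()
    }
  }
}
```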

Spark ES Util

sc.esRDD("testindex/testtype", query)
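A slightly fuller sketch, assuming esRDD is the method provided by the elasticsearch-hadoop (elasticsearch-spark) connector; the node address, index/type, and query below are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // provides sc.esRDD via implicits (elasticsearch-spark connector)

val conf = new SparkConf()
  .setAppName("es-read")
  .set("es.nodes", "localhost") // placeholder ES node(s)
  .set("es.port", "9200")
val sc = new SparkContext(conf)

// Returns an RDD of (document id, source fields) pairs.
val query = """{"query":{"match_all":{}}}"""
val rdd = sc.esRDD("testindex/testtype", query)
rdd.take(10).foreach(println)
```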

Spark Kudu

Flink kafka

  • This is a simple example: read Kafka data, compute a WordCount, and write the result to HBase. A rough sketch of the Kafka/WordCount part is shown below.
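A minimal sketch of the Kafka-to-WordCount part using Flink's Scala DataStream API. It is an illustration, not the project's example: the HBase sink is omitted, connector class names such as FlinkKafkaConsumer010 depend on the Flink/Kafka version, and the topic and broker values are placeholders.

```scala
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010

object FlinkKafkaWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092") // placeholder broker list
    props.setProperty("group.id", "wordcount")               // placeholder group id

    val source = new FlinkKafkaConsumer010[String]("test", new SimpleStringSchema(), props)

    env.addSource(source)
      .flatMap(_.toLowerCase.split("\\s+")) // split each Kafka message into words
      .map((_, 1))
      .keyBy(0)
      .sum(1)
      .print() // the real example would replace this with an HBase sink

    env.execute("flink-kafka-wordcount")
  }
}
```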

Splunk

  • Splunk is a log display and monitoring system.
  • Covers the installation and use of Splunk.

Kafka Util

  • A utility class for operating Kafka. It records each topic's offsets by day, mainly for day-level or hour-level recomputation; a rough sketch of the idea follows below.
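A minimal sketch of the idea of snapshotting a topic's end offsets so a later day-level recomputation can start from them. It uses the standard Kafka consumer API, not this project's utility class; the persistence step (ZooKeeper, a database, ...) is only hinted at in a comment.

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

// Illustrative sketch: fetch the current end offset of every partition of a topic.
def snapshotEndOffsets(brokers: String, topic: String): Map[TopicPartition, Long] = {
  val props = new Properties()
  props.put("bootstrap.servers", brokers)
  props.put("group.id", "offset-snapshot")
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  val consumer = new KafkaConsumer[String, String](props)
  try {
    val partitions = consumer.partitionsFor(topic).asScala
      .map(p => new TopicPartition(p.topic(), p.partition())).asJava
    val offsets = consumer.endOffsets(partitions).asScala
      .map { case (tp, off) => tp -> off.longValue() }.toMap
    // hypothetical persistence step, e.g. saveSnapshot(today, offsets)
    offsets
  } finally consumer.close()
}
```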

Hbase Util

  • A utility class for operating HBase. It queries the region information of an HBase table, used to manually split oversized regions; a rough sketch follows below.
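A minimal sketch of listing a table's regions with their start and end keys, to decide which ones to split manually. It uses the standard HBase client API, not this project's utility class; the ZooKeeper quorum and table name are placeholders.

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.util.Bytes

// Illustrative sketch: print each region's name and key range for a table.
def printRegions(zkQuorum: String, table: String): Unit = {
  val conf = HBaseConfiguration.create()
  conf.set("hbase.zookeeper.quorum", zkQuorum)
  val conn = ConnectionFactory.createConnection(conf)
  try {
    val locator = conn.getRegionLocator(TableName.valueOf(table))
    locator.getAllRegionLocations.asScala.foreach { loc =>
      val info = loc.getRegionInfo
      println(s"${info.getRegionNameAsString} " +
        s"start=${Bytes.toStringBinary(info.getStartKey)} end=${Bytes.toStringBinary(info.getEndKey)}")
    }
  } finally conn.close()
}
```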

Database Util

  • Provides connection tools for each database, including ES, HBase, MySQL, and MongoDB.

Elasticsearch Shade

  • Resolves dependency conflicts between ES and the Spark and Hadoop related packages.

RabbitMQ Util

