
zdh_server's Introduction

Tech Stack

  • spark 2.4.4
  • hadoop 3.1.x
  • hive > 2.3.3
  • kafka 1.x,2.x
  • scala 2.11.12
  • java 1.8
  • hbase < 1.3.6 (optional)

Notes

zdh_server has been renamed to zdh_spark.
zdh is split into two parts, front-end configuration and back-end data ETL processing; this project contains only the ETL processing.
For the front-end configuration project, see https://github.com/zhaoyachao/zdh_web
zdh_web and zdh_server are kept in sync, and major versions remain compatible: if zdh_web is on version 1.0, any zdh_server 1.x release works with it.
For secondary development, use the dev branch. dev is merged into master only after tests pass, so master may not be the latest, but it is guaranteed to be usable.

Online Demo

http://zycblog.cn:8081/login
Username: zyc
Password: 123456

Server resources are limited, so the demo is UI-only and does not include the data-processing backend. Please go easy on it.

Project Overview

Data-collection ETL: data is extracted via the Spark platform and processed with the project's ETL functions.
To add a new data source, extend the common ZdhDataSources interface and override a few of its functions, as sketched below.
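
A minimal sketch of such an extension. The hook names getDS/writeDS are inferred from the project's log output (writeDS appears in the HbaseDataSources logs quoted in the issues below); the actual ZdhDataSources signatures may differ:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Illustrative only: a data source that delegates to Spark's built-in JDBC reader/writer.
    object DemoJdbcDataSources extends ZdhDataSources {

      // Read hook (name assumed): build a DataFrame from the source-specific options.
      override def getDS(spark: SparkSession, options: Map[String, String]): DataFrame =
        spark.read.format("jdbc").options(options).load()

      // Write hook (writeDS is the name seen in the project logs): persist the processed DataFrame.
      override def writeDS(spark: SparkSession, df: DataFrame, options: Map[String, String]): Unit =
        df.write.format("jdbc").options(options).mode("append").save()
    }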

Build and Packaging

The project is managed with Maven.
Packaging command, run in the project root directory:
Windows: mvn package -Dmaven.test.skip=true

The jars the project needs are generated automatically under the zdh_spark-xxxx-RELEASE directory.

Deployment

1 Copy zdh_spark-xxxx-RELEASE to the server (Linux)
2 Copy the jars under zdh_spark-xxxx-RELEASE/copy_spark_jars into the jars directory under the Spark home
3 Edit zdh_spark-xxxx-RELEASE/conf/datasources.properties
4 Edit zdh_spark-xxxx-RELEASE/conf/log4j.properties
5 Set the SPARK_HOME environment variable
6 Launch via the start_server.sh script (a command-line sketch of these steps follows)
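
A command-line sketch of the six steps above; all paths are illustrative, not prescribed by the project:

    # 1-2: copy the release to the server, then the bundled jars into Spark
    scp -r zdh_spark-xxxx-RELEASE user@server:/opt/
    cp /opt/zdh_spark-xxxx-RELEASE/copy_spark_jars/*.jar $SPARK_HOME/jars/
    # 3-4: adjust the configuration files
    vi /opt/zdh_spark-xxxx-RELEASE/conf/datasources.properties
    vi /opt/zdh_spark-xxxx-RELEASE/conf/log4j.properties
    # 5-6: point SPARK_HOME at your Spark installation and start the service
    export SPARK_HOME=/opt/spark
    cd /opt/zdh_spark-xxxx-RELEASE && ./start_server.sh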

Startup Script

Note: the project relies on log4j.properties, which must be placed separately on the driver machine; the job is launched in client mode.
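
For orientation, a client-mode submission along these lines; this is a sketch, not the shipped start_server.sh. The main class com.zyc.SystemInit is taken from the stack trace quoted in the issues below, and the jar name and paths are illustrative:

    $SPARK_HOME/bin/spark-submit \
      --master yarn \
      --deploy-mode client \
      --class com.zyc.SystemInit \
      --driver-java-options "-Dlog4j.configuration=file:/opt/zdh_spark-xxxx-RELEASE/conf/log4j.properties" \
      /opt/zdh_spark-xxxx-RELEASE/zdh_server.jar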

Stop Script

 kill `ps -ef |grep SparkSubmit |grep zdh_server |awk -F ' ' '{print $2}'`

Contact

FAQ

When connecting to TiDB, add the following to the zdh_server startup configuration file:
spark.tispark.pd.addresses 192.168.1.100:2379
spark.sql.extensions org.apache.spark.sql.TiExtensions
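
These are the standard TiSpark settings; the same values can also be passed as --conf flags at submission time, for example:

    spark-submit ... \
      --conf spark.tispark.pd.addresses=192.168.1.100:2379 \
      --conf spark.sql.extensions=org.apache.spark.sql.TiExtensions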

Release Notes

  • v5.1.1 Fix the HTTP data source

  • v5.3.0 Optimize the pom file

  • v5.3.4 Support fetching tasks from a message queue

  • v5.3.5 Upgrade the priority-queue version

  • v5.3.6 No code changes; released only to keep version numbers in step

  • v5.4.0 Fix redisson and hutool missing from the startup jars

zdh_server's People

Contributors

zhaoyachao


zdh_server's Issues

Demo environment not accessible

The demo environment can no longer be accessed. I saw earlier that the code would be updated, but it doesn't seem to have been updated yet.

Error when starting zdh_server

[screenshot]
Hello, I get this error after running and don't know the cause; could you take a look? Thanks.
I'm using the branch-v4.5 branch.

Version question

Hi, the web side is running version 4.7.16; which server version matches it?

Debugging question

Hi! My task progress is no longer updating. How should I debug and troubleshoot this?
[screenshot]

zdh_server startup error

Hello, I compiled following the steps, but running the startup script at the end produced an error; see the screenshot.
[screenshot]

A compatibility error; I tried many versions without success

2024-07-25 18:23:35:720[INFO]: Initializing SparkSession
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
at org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:1117)
at org.apache.spark.sql.SparkSession$Builder.enableHiveSupport(SparkSession.scala:866)
at com.zyc.common.SparkBuilder$.initSparkSession(SparkBuilder.scala:34)
at com.zyc.common.SparkBuilder$.getSparkSession(SparkBuilder.scala:44)
at com.zyc.SystemInit$.main(SystemInit.scala:27)
at com.zyc.SystemInit.main(SystemInit.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 3.1.2
at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:174)
at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:139)
at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:100)
at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:368)

I also tried switching hive to 3.1.2, but that didn't work either.

Task run failure...

[screenshot]
The server in the screenshot ran only one task, but I got: scheduling task ID: 836968828486291456, task instance ID: 836968879132512256, task execution time: 2021-04-28 22:28:54.0, log [ERROR]: [scheduling platform]:[ETL] JOB, started sending the single-source ETL processing request; on exception, check whether the zdh_server service is running normally, or check the network. postRequest -- IO error! The same task had succeeded before. Why is this?

Exception when Spark reads MySQL data and writes it to HBase

With the original HBase version 1.3.6, MySQL data could not be written to HBase; the error said a class could not be found.
After upgrading HBase to 2.1.0, the write to HBase fails after Spark reads the MySQL data.
The error log is as follows:
2022-02-24 12:08:09:174[INFO]: [data collection]:[HBASE]: checking whether table exists: t1
2022-02-24 12:08:09:179[INFO]: [data collection]:[HBASE]: table exists; checking column family: cf1
2022-02-24 12:08:09:186[INFO]: [data collection]:[HBASE]: tableDescriptor:'t1', {NAME => 'cf1', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}
2022-02-24 12:08:09:190[INFO]: Got an error when resolving hostNames. Falling back to /default-rack for all
2022-02-24 12:08:09:189[INFO]: [data collection]:[HBASE]:[WRITE]:writeDS:=====start=======
2022-02-24 12:08:10:191[INFO]: Got an error when resolving hostNames. Falling back to /default-rack for all
2022-02-24 12:08:10:201[INFO]: Code generated in 308.11907 ms
2022-02-24 12:08:10:263[INFO]: [data collection]:[HBASE]:[WRITE]:DataFrame:=====MapPartitionsRDD[3] at rdd at HbaseDataSources.scala:214
2022-02-24 12:08:10:294[INFO]: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
2022-02-24 12:08:10:299[INFO]: Using output committer class org.apache.hadoop.mapred.FileOutputCommitter
2022-02-24 12:08:10:301[INFO]: File Output Committer Algorithm version is 2
2022-02-24 12:08:10:301[INFO]: FileOutputCommitter skip cleanup temporary folders under output directory:false, ignore cleanup failures: false
2022-02-24 12:08:10:301[WARN]: Output Path is null in setupJob()
2022-02-24 12:08:10:325[INFO]: Starting job: runJob at SparkHadoopWriter.scala:78
2022-02-24 12:08:10:341[INFO]: Got job 0 (runJob at SparkHadoopWriter.scala:78) with 1 output partitions
2022-02-24 12:08:10:342[INFO]: Final stage: ResultStage 0 (runJob at SparkHadoopWriter.scala:78)
2022-02-24 12:08:10:342[INFO]: Parents of final stage: List()
2022-02-24 12:08:10:344[INFO]: Missing parents: List()
spark.rdd.scope.noOverride===true
spark.jobGroup.id===946377927967121408

spark.rdd.scope==={"id":"6","name":"saveAsHadoopDataset"}
spark.job.description===mysql2hbase_2022-02-24 12:07:58_946377927967121408
spark.job.interruptOnCancel===false
=====jobStart.properties:{spark.rdd.scope.noOverride=true, spark.jobGroup.id=946377927967121408_, spark.rdd.scope={"id":"6","name":"saveAsHadoopDataset"}, spark.job.description=mysql2hbase_2022-02-24 12:07:58_946377927967121408, spark.job.interruptOnCancel=false}
Process:null
2022-02-24 12:08:10:348[INFO]: Submitting ResultStage 0 (MapPartitionsRDD[4] at map at HbaseDataSources.scala:215), which has no missing parents
2022-02-24 12:08:10:348[ERROR]: Listener ServerSparkListener threw an exception
scala.MatchError: null
at com.zyc.common.ServerSparkListener.onJobStart(ServerSparkListener.scala:32)
at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:37)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)

Contributing a code change

[screenshot]

In the web project, modify pom.xml and add:


        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>${hive-jdbc.version}</version>
            <scope>compile</scope>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>log4j-over-slf4j</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.eclipse.jetty.aggregate</groupId>
                    <artifactId>jetty-all</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

In the web project's DBUtil file, add the following at line 167 (to handle a JDBC hive data source):


            // for a JDBC hive data source, list the database tables
            if (url.startsWith("jdbc:hive2")) {
                preparedStatement = connection.prepareStatement("SHOW TABLES");
                resultSet = preparedStatement.executeQuery();
                while (resultSet.next()) {
                    String tableName = resultSet.getString(1); // table name is in the first column
                    result.add(tableName);
                }
                return result;
            }
        
        
In the server project's DataSources, change line 839 as follows (the alias seems to be the problem here; I just commented that part out and then it worked):

if (!inPut.toLowerCase.equals("hbase") && !inputOptions.asInstanceOf[Map[String,String]].getOrElse("url","").toLowerCase.contains("jdbc:hive2:"))

I could not find where you had implemented this and could not get hive over JDBC working, so I added it myself. I also have another question for you:

[screenshot]

When the data source type is HIVE, the server cannot find tableName, so I added it in Dsi_Info, and then ran into:
[screenshot]
Your hivedatasource implementation seems to use Spark on Hive. I do not really understand the configuration here; it always resolves to the defaults. Do I need to change some configuration files on my server so that Spark can find Hive at startup?
