
kyligence / spark

This project forked from apache/spark


Customized Spark for KAP use; check out the kyspark branch.

License: Apache License 2.0

Shell 0.37% Batchfile 0.04% R 1.83% Makefile 0.01% C 0.01% Java 6.82% Scala 66.87% JavaScript 0.32% CSS 0.04% HTML 0.06% PowerShell 0.01% Python 14.26% Roff 0.03% ANTLR 0.09% PLpgSQL 0.50% Thrift 0.01% Dockerfile 0.02% Jupyter Notebook 6.10% ReScript 0.01% HiveQL 2.63%

spark's Introduction

Apache Spark

Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.

https://spark.apache.org/


Online Documentation

You can find the latest Spark documentation, including a programming guide, on the project web page. This README file only contains basic setup instructions.

Building Spark

Spark is built using Apache Maven. To build Spark and its example programs, run:

./build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)

More detailed documentation is available from the project site, at "Building Spark".

For general development tips, including info on developing Spark using an IDE, see "Useful Developer Tools".

Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

./bin/spark-shell

Try the following command, which should return 1,000,000,000:

scala> spark.range(1000 * 1000 * 1000).count()

Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

./bin/pyspark

And run the following command, which should also return 1,000,000,000:

>>> spark.range(1000 * 1000 * 1000).count()

Example Programs

Spark also comes with several sample programs in the examples directory. To run one of them, use ./bin/run-example <class> [params]. For example:

./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit examples to a cluster. This can be a mesos:// or spark:// URL, "yarn" to run on YARN, "local" to run locally with one thread, or "local[N]" to run locally with N threads. You can also use an abbreviated class name if the class is in the examples package. For instance:

MASTER=spark://host:7077 ./bin/run-example SparkPi

Many of the example programs print usage help if no params are given.

Running Tests

Testing first requires building Spark. Once Spark is built, tests can be run using:

./dev/run-tests

Please see the guidance on how to run tests for a module, or individual tests.

There is also a Kubernetes integration test; see resource-managers/kubernetes/integration-tests/README.md

A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at "Specifying the Hadoop Version and Enabling YARN" for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions.
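
For example, a build that enables YARN and Hive support against a specific Hadoop version typically looks like the following (the hadoop.version value is illustrative; use the version your cluster actually runs):

./build/mvn -Pyarn -Phive -Phive-thriftserver -Dhadoop.version=2.7.7 -DskipTests clean package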

Configuration

Please refer to the Configuration Guide in the online documentation for an overview on how to configure Spark.

Contributing

Please review the Contribution to Spark guide for information on how to get started contributing to the project.

spark's People

Contributors

angerszhuuuu, ankurdave, beliefer, cloud-fan, dongjoon-hyun, gatorsmile, gengliangwang, huaxingao, hyukjinkwon, joshrosen, liancheng, luciferyang, marmbrus, maropu, mateiz, maxgekk, mengxr, pwendell, rxin, sarutak, srowen, tdas, ueshin, viirya, wangyum, yanboliang, yaooqinn, yhuai, zhengruifeng, zsxwing


spark's Issues

performance issue when gc blocks

related issue: https://github.com/Kyligence/KAP/issues/3406

During a running load test, if the environment's network is suddenly saturated, blocks waiting to be cleaned up pile up quickly.
With the current Spark implementation, the more blocks accumulate, the slower the cleanup becomes:

Evidence 1: (screenshot omitted)

Evidence 2 (jstack):

"block-manager-slave-async-thread-pool-11" daemon prio=10 tid=0x00007ff93c57a800 nid=0x6bfb runnable [0x00007ff936fcc000]
java.lang.Thread.State: RUNNABLE
at scala.collection.mutable.HashMap$$anonfun$iterator$1.apply(HashMap.scala:97)
at scala.collection.mutable.HashMap$$anonfun$iterator$1.apply(HashMap.scala:97)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.IterableLike$class.copyToArray(IterableLike.scala:254)
at scala.collection.AbstractIterable.copyToArray(Iterable.scala:54)
at scala.collection.TraversableOnce$class.copyToArray(TraversableOnce.scala:278)
at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:104)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:286)
at scala.collection.AbstractTraversable.toArray(Traversable.scala:104)
at org.apache.spark.storage.BlockInfoManager.entries(BlockInfoManager.scala:399)
- locked <0x00000006f18127c0> (a org.apache.spark.storage.BlockInfoManager)
at org.apache.spark.storage.BlockManager.removeBroadcast(BlockManager.scala:1358)
at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply$mcI$sp(BlockManagerSlaveEndpoint.scala:66)
at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:66)
at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:66)
at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:82)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
at java.lang.Thread.run(Thread.java:748)

Evidence 3: http://blog.csdn.net/qq_33160722/article/details/60583286

So this needs to be improved right away.

Result after the fix: (screenshot omitted)
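
The pattern behind the slowdown: each per-broadcast removal snapshots and scans every block entry while holding the BlockInfoManager lock, so N removals over N accumulated blocks cost roughly O(N^2). The toy Scala model below (simplified, invented classes, not Spark's actual code) illustrates this:

// Toy model of why block cleanup slows down as pending blocks accumulate:
// each removeBroadcast-style call copies and scans ALL entries under the lock,
// so the cost per removal grows with the total number of blocks.
import scala.collection.mutable

final case class ToyBlockId(name: String) {
  def belongsTo(broadcastId: Long): Boolean = name.startsWith(s"broadcast_${broadcastId}_")
}

class ToyBlockInfoManager {
  private val infos = mutable.HashMap[ToyBlockId, AnyRef]()

  def add(id: ToyBlockId): Unit = synchronized { infos(id) = new Object }

  // Mirrors the pattern seen in the jstack above (BlockInfoManager.entries):
  // copy the whole map under the lock, O(total blocks) per call.
  def entries: Seq[(ToyBlockId, AnyRef)] = synchronized { infos.toArray.toSeq }

  def remove(id: ToyBlockId): Unit = synchronized { infos.remove(id) }
}

object RemoveBroadcastDemo extends App {
  val mgr = new ToyBlockInfoManager
  (1L to 5000L).foreach(b => mgr.add(ToyBlockId(s"broadcast_${b}_piece0")))

  // Removing each broadcast rescans everything: roughly O(N^2) overall,
  // which is why cleanup falls further behind once blocks pile up.
  (1L to 5000L).foreach { b =>
    mgr.entries.collect { case (id, _) if id.belongsTo(b) => id }.foreach(mgr.remove)
  }
}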

disable callsite collections

To show, on the job page, which class and which line of code issued each job: (screenshot omitted)

Spark uses getStackTrace to collect this information, but the call is expensive. As the comparison below shows, with the same Sparder engine, turning this feature on or off makes a difference of roughly 20 QPS. (screenshot omitted)

The class name and line number are not very valuable to us, so we are considering disabling the feature outright. (screenshot omitted)
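
As a rough illustration of the direction, the stack walk could sit behind a switch. The sketch below is hypothetical (the property name spark.callsite.enabled and the helper object are invented for illustration, not the actual patch):

// Minimal sketch of putting call-site collection behind a flag so the
// expensive stack walk can be skipped entirely.
object CallSiteSketch {
  private val enabled: Boolean =
    sys.props.getOrElse("spark.callsite.enabled", "false").toBoolean

  def shortCallSite(): String =
    if (!enabled) {
      "<call site disabled>"   // no getStackTrace, no per-job overhead
    } else {
      // The stack walk below is the costly part under high QPS.
      Thread.currentThread().getStackTrace
        .dropWhile(e => e.getClassName.startsWith("java.") || e.getClassName.startsWith("scala."))
        .headOption
        .map(e => s"${e.getClassName}.${e.getMethodName} (${e.getFileName}:${e.getLineNumber})")
        .getOrElse("<unknown>")
    }
}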

Consider deleting shuffle data immediately after the shuffle completes

The change can use https://github.com/Kyligence/spark/pull/68/files as a reference.

The problem to solve is DataFrame reuse: if the shuffle is deleted as soon as the previous query on the DF finishes, and the underlying ShuffleExchangeExec / ShuffledRowRDD reuse the same RDD and the same shuffleId from the same ShuffleDependency, errors occur (a reference-counting guard for this case is sketched below).

The error shows up in DAGScheduler during registerMapOutput:

java.util.NoSuchElementException: key not found: 1425

The spark-it integration suite also needs to pass.
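
One hypothetical way to make eager cleanup safe under DF reuse is to reference-count shuffleIds and only remove a shuffle once its last user finishes. The sketch below is illustrative only: ShuffleRefCounter and the doRemove callback are invented names, and the real removal would go through Spark's internal shuffle / block-manager cleanup path.

// Reference-counting guard so a shuffle shared by a reused DataFrame is only
// removed after the last query depending on it has finished.
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

class ShuffleRefCounter(doRemove: Int => Unit) {
  private val refs = new ConcurrentHashMap[Int, AtomicInteger]()

  // Called when a query starts depending on a shuffleId (e.g. a reused DF).
  def acquire(shuffleId: Int): Unit = {
    var counter = refs.get(shuffleId)
    if (counter == null) {
      refs.putIfAbsent(shuffleId, new AtomicInteger(0))
      counter = refs.get(shuffleId)
    }
    counter.incrementAndGet()
  }

  // Called when a query finishes; only the last user triggers the removal,
  // e.g. by asking the block manager master to drop the shuffle blocks.
  def release(shuffleId: Int): Unit = {
    val counter = refs.get(shuffleId)
    if (counter != null && counter.decrementAndGet() == 0) {
      refs.remove(shuffleId)
      doRemove(shuffleId)
    }
  }
}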

The FileSystem used by HDFS append cannot refresh its token

Root cause: all of Spark's code runs inside a doAs for the remote user that Spark created, so the currentUser seen by the HDFS append path differs from Spark's, and Spark's token renewal never reaches it.
Proposed approach: save the UGI created by Spark into SparkEnv, and have the HDFS append path use that UGI.
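
A minimal sketch of that idea, assuming the UGI Spark created is stored somewhere globally reachable (SparkUgiHolder below is a hypothetical holder, not an existing Spark API). Running the append inside doAs keeps the FileSystem's user consistent with the user whose delegation tokens Spark renews.

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

object SparkUgiHolder {
  // Set once at startup from the UGI Spark itself created (hypothetical).
  @volatile var ugi: UserGroupInformation = UserGroupInformation.getCurrentUser
}

object HdfsAppendWithUgi {
  def append(conf: Configuration, path: String, bytes: Array[Byte]): Unit = {
    // Appending inside the saved UGI: the FileSystem now belongs to the same
    // user whose tokens Spark keeps renewing.
    SparkUgiHolder.ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = {
        val fs = FileSystem.get(conf)
        val out = fs.append(new Path(path))
        try out.write(bytes) finally out.close()
      }
    })
  }
}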

Optimize Spark projection

In org.apache.spark.sql.catalyst.expressions.UnsafeProjection#toBoundExprs, every call to BindReferences.bindReference triggers an implicit conversion from Seq[Expression] to AttributeSeq, and the AttributeSeq then has to build a lookup map. This happens once per column, so with many columns the operation becomes quite expensive.

Because the implicit conversion runs on every call, the AttributeSeq cannot retain any intermediate results and has to re-initialize its fields from scratch each time; initializing these fields is what consumes the time. (profiling screenshots omitted)
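
A hedged sketch of the optimization idea, using Catalyst's internal BindReferences / AttributeSeq APIs purely for illustration (not the actual patch): build the AttributeSeq once and reuse it for every column, so its lookup structures are initialized a single time.

import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSeq, BindReferences, Expression}

object BindOnceSketch {
  def bindAll(exprs: Seq[Expression], inputSchema: Seq[Attribute]): Seq[Expression] = {
    // One explicit conversion up front; every bindReference call below reuses
    // the same AttributeSeq instead of rebuilding it per column.
    val attrSeq: AttributeSeq = new AttributeSeq(inputSchema)
    exprs.map(e => BindReferences.bindReference(e, attrSeq))
  }
}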

fix block miss when a DF is reused

While building TPC-H SF50 with small-scale data, we found that when an executor uses too much memory and is killed by YARN, the job fails because the broadcast block can no longer be found.

Kerberos issue

In KAP, Tomcat and the Spark driver run in the same process, and the driver is directed by the ApplicationMaster to renew tokens, so the driver-side event cancellation is removed here.

Spark cannot read Hive tables created with Parquet

Root Cause

The build did not package hadoop-bundle, so the Parquet writer is missing.

Fix Design evidence

Add hadoop-bundle to the pom.

Dev test evidence: covered by UT/IT?

QA needed? (Y/N)

Test suggestions (other features affected) for QA
