
kyligence / spark

This project forked from apache/spark


Customized Spark for KAP use; check out the kyspark branch.

License: Apache License 2.0

Shell 0.37% Batchfile 0.04% R 1.83% Makefile 0.01% C 0.01% Java 6.82% Scala 66.87% JavaScript 0.32% CSS 0.04% HTML 0.06% PowerShell 0.01% Python 14.26% Roff 0.03% ANTLR 0.09% PLpgSQL 0.50% Thrift 0.01% Dockerfile 0.02% Jupyter Notebook 6.10% ReScript 0.01% HiveQL 2.63%

spark's Introduction

Apache Spark

Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.

https://spark.apache.org/


Online Documentation

You can find the latest Spark documentation, including a programming guide, on the project web page. This README file only contains basic setup instructions.

Building Spark

Spark is built using Apache Maven. To build Spark and its example programs, run:

./build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)

More detailed documentation is available from the project site, at "Building Spark".

For general development tips, including info on developing Spark using an IDE, see "Useful Developer Tools".

Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

./bin/spark-shell

Try the following command, which should return 1,000,000,000:

scala> spark.range(1000 * 1000 * 1000).count()

Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

./bin/pyspark

And run the following command, which should also return 1,000,000,000:

>>> spark.range(1000 * 1000 * 1000).count()

Example Programs

Spark also comes with several sample programs in the examples directory. To run one of them, use ./bin/run-example <class> [params]. For example:

./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit examples to a cluster. This can be a mesos:// or spark:// URL, "yarn" to run on YARN, "local" to run locally with one thread, or "local[N]" to run locally with N threads. You can also use an abbreviated class name if the class is in the examples package. For instance:

MASTER=spark://host:7077 ./bin/run-example SparkPi

Many of the example programs print usage help if no params are given.

Running Tests

Testing first requires building Spark. Once Spark is built, tests can be run using:

./dev/run-tests

Please see the guidance on how to run tests for a module, or individual tests.

There is also a Kubernetes integration test; see resource-managers/kubernetes/integration-tests/README.md

A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at "Specifying the Hadoop Version and Enabling YARN" for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions.
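
For example, a build that enables YARN and Hive support against a specific Hadoop version typically looks like the following (the hadoop.version value is illustrative; use the version your cluster actually runs):

./build/mvn -Pyarn -Phive -Phive-thriftserver -Dhadoop.version=2.7.7 -DskipTests clean package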

Configuration

Please refer to the Configuration Guide in the online documentation for an overview on how to configure Spark.

Contributing

Please review the Contribution to Spark guide for information on how to get started contributing to the project.

spark's People

Contributors

angerszhuuuu, ankurdave, beliefer, cloud-fan, dongjoon-hyun, gatorsmile, gengliangwang, huaxingao, hyukjinkwon, joshrosen, liancheng, luciferyang, marmbrus, maropu, mateiz, maxgekk, mengxr, pwendell, rxin, sarutak, srowen, tdas, ueshin, viirya, wangyum, yanboliang, yaooqinn, yhuai, zhengruifeng, zsxwing


spark's Issues

performance issue when gc blocks

related issue: https://github.com/Kyligence/KAP/issues/3406

During a running load test, if the environment's network is suddenly saturated, blocks waiting to be cleaned up pile up quickly.
With the current Spark implementation, the more blocks accumulate, the slower the cleanup becomes:

Evidence 1: (screenshot omitted)

Evidence 2 (jstack):

"block-manager-slave-async-thread-pool-11" daemon prio=10 tid=0x00007ff93c57a800 nid=0x6bfb runnable [0x00007ff936fcc000]
java.lang.Thread.State: RUNNABLE
at scala.collection.mutable.HashMap$$anonfun$iterator$1.apply(HashMap.scala:97)
at scala.collection.mutable.HashMap$$anonfun$iterator$1.apply(HashMap.scala:97)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.IterableLike$class.copyToArray(IterableLike.scala:254)
at scala.collection.AbstractIterable.copyToArray(Iterable.scala:54)
at scala.collection.TraversableOnce$class.copyToArray(TraversableOnce.scala:278)
at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:104)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:286)
at scala.collection.AbstractTraversable.toArray(Traversable.scala:104)
at org.apache.spark.storage.BlockInfoManager.entries(BlockInfoManager.scala:399)
- locked <0x00000006f18127c0> (a org.apache.spark.storage.BlockInfoManager)
at org.apache.spark.storage.BlockManager.removeBroadcast(BlockManager.scala:1358)
at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply$mcI$sp(BlockManagerSlaveEndpoint.scala:66)
at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:66)
at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:66)
at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:82)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
at java.lang.Thread.run(Thread.java:748)

Evidence 3: http://blog.csdn.net/qq_33160722/article/details/60583286

So this needs to be improved right away.

Result after the fix: (screenshot omitted)
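
The pattern behind the slowdown: each per-broadcast removal snapshots and scans every block entry while holding the BlockInfoManager lock, so N removals over N accumulated blocks cost roughly O(N^2). The toy Scala model below (simplified, invented classes, not Spark's actual code) illustrates this:

// Toy model of why block cleanup slows down as pending blocks accumulate:
// each removeBroadcast-style call copies and scans ALL entries under the lock,
// so the cost per removal grows with the total number of blocks.
import scala.collection.mutable

final case class ToyBlockId(name: String) {
  def belongsTo(broadcastId: Long): Boolean = name.startsWith(s"broadcast_${broadcastId}_")
}

class ToyBlockInfoManager {
  private val infos = mutable.HashMap[ToyBlockId, AnyRef]()

  def add(id: ToyBlockId): Unit = synchronized { infos(id) = new Object }

  // Mirrors the pattern seen in the jstack above (BlockInfoManager.entries):
  // copy the whole map under the lock, O(total blocks) per call.
  def entries: Seq[(ToyBlockId, AnyRef)] = synchronized { infos.toArray.toSeq }

  def remove(id: ToyBlockId): Unit = synchronized { infos.remove(id) }
}

object RemoveBroadcastDemo extends App {
  val mgr = new ToyBlockInfoManager
  (1L to 5000L).foreach(b => mgr.add(ToyBlockId(s"broadcast_${b}_piece0")))

  // Removing each broadcast rescans everything: roughly O(N^2) overall,
  // which is why cleanup falls further behind once blocks pile up.
  (1L to 5000L).foreach { b =>
    mgr.entries.collect { case (id, _) if id.belongsTo(b) => id }.foreach(mgr.remove)
  }
}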

disable callsite collections

To show, on the job page, which class and which line of code issued each job: (screenshot omitted)

Spark uses getStackTrace to collect this information, but the call is expensive. As the comparison below shows, with the same Sparder engine, turning this feature on or off makes a difference of roughly 20 QPS. (screenshot omitted)

The class name and line number are not very valuable to us, so we are considering disabling the feature outright. (screenshot omitted)
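
As a rough illustration of the direction, the stack walk could sit behind a switch. The sketch below is hypothetical (the property name spark.callsite.enabled and the helper object are invented for illustration, not the actual patch):

// Minimal sketch of putting call-site collection behind a flag so the
// expensive stack walk can be skipped entirely.
object CallSiteSketch {
  private val enabled: Boolean =
    sys.props.getOrElse("spark.callsite.enabled", "false").toBoolean

  def shortCallSite(): String =
    if (!enabled) {
      "<call site disabled>"   // no getStackTrace, no per-job overhead
    } else {
      // The stack walk below is the costly part under high QPS.
      Thread.currentThread().getStackTrace
        .dropWhile(e => e.getClassName.startsWith("java.") || e.getClassName.startsWith("scala."))
        .headOption
        .map(e => s"${e.getClassName}.${e.getMethodName} (${e.getFileName}:${e.getLineNumber})")
        .getOrElse("<unknown>")
    }
}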

Consider deleting shuffle data immediately after the shuffle completes

The change can use https://github.com/Kyligence/spark/pull/68/files as a reference.

The problem to solve is DataFrame reuse: if the shuffle is deleted as soon as the previous query on the DF finishes, and the underlying ShuffleExchangeExec / ShuffledRowRDD reuse the same RDD and the same shuffleId from the same ShuffleDependency, errors occur (a reference-counting guard for this case is sketched below).

The error shows up in DAGScheduler during registerMapOutput:

java.util.NoSuchElementException: key not found: 1425

The spark-it integration suite also needs to pass.
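
One hypothetical way to make eager cleanup safe under DF reuse is to reference-count shuffleIds and only remove a shuffle once its last user finishes. The sketch below is illustrative only: ShuffleRefCounter and the doRemove callback are invented names, and the real removal would go through Spark's internal shuffle / block-manager cleanup path.

// Reference-counting guard so a shuffle shared by a reused DataFrame is only
// removed after the last query depending on it has finished.
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

class ShuffleRefCounter(doRemove: Int => Unit) {
  private val refs = new ConcurrentHashMap[Int, AtomicInteger]()

  // Called when a query starts depending on a shuffleId (e.g. a reused DF).
  def acquire(shuffleId: Int): Unit = {
    var counter = refs.get(shuffleId)
    if (counter == null) {
      refs.putIfAbsent(shuffleId, new AtomicInteger(0))
      counter = refs.get(shuffleId)
    }
    counter.incrementAndGet()
  }

  // Called when a query finishes; only the last user triggers the removal,
  // e.g. by asking the block manager master to drop the shuffle blocks.
  def release(shuffleId: Int): Unit = {
    val counter = refs.get(shuffleId)
    if (counter != null && counter.decrementAndGet() == 0) {
      refs.remove(shuffleId)
      doRemove(shuffleId)
    }
  }
}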

The FileSystem used by HDFS append cannot refresh its token

Root cause: all of Spark's code runs inside a doAs for the remote user that Spark created, so the currentUser seen by the HDFS append path differs from Spark's, and Spark's token renewal never reaches it.
Proposed approach: save the UGI created by Spark into SparkEnv, and have the HDFS append path use that UGI.
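
A minimal sketch of that idea, assuming the UGI Spark created is stored somewhere globally reachable (SparkUgiHolder below is a hypothetical holder, not an existing Spark API). Running the append inside doAs keeps the FileSystem's user consistent with the user whose delegation tokens Spark renews.

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

object SparkUgiHolder {
  // Set once at startup from the UGI Spark itself created (hypothetical).
  @volatile var ugi: UserGroupInformation = UserGroupInformation.getCurrentUser
}

object HdfsAppendWithUgi {
  def append(conf: Configuration, path: String, bytes: Array[Byte]): Unit = {
    // Appending inside the saved UGI: the FileSystem now belongs to the same
    // user whose tokens Spark keeps renewing.
    SparkUgiHolder.ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = {
        val fs = FileSystem.get(conf)
        val out = fs.append(new Path(path))
        try out.write(bytes) finally out.close()
      }
    })
  }
}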

Optimize Spark projection

In org.apache.spark.sql.catalyst.expressions.UnsafeProjection#toBoundExprs, every call to BindReferences.bindReference triggers an implicit conversion from Seq[Expression] to AttributeSeq, and the AttributeSeq then has to build a lookup map. This happens once per column, so with many columns the operation becomes quite expensive.

Because the implicit conversion runs on every call, the AttributeSeq cannot retain any intermediate results and has to re-initialize its fields from scratch each time; initializing these fields is what consumes the time. (profiling screenshots omitted)
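
A hedged sketch of the optimization idea, using Catalyst's internal BindReferences / AttributeSeq APIs purely for illustration (not the actual patch): build the AttributeSeq once and reuse it for every column, so its lookup structures are initialized a single time.

import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSeq, BindReferences, Expression}

object BindOnceSketch {
  def bindAll(exprs: Seq[Expression], inputSchema: Seq[Attribute]): Seq[Expression] = {
    // One explicit conversion up front; every bindReference call below reuses
    // the same AttributeSeq instead of rebuilding it per column.
    val attrSeq: AttributeSeq = new AttributeSeq(inputSchema)
    exprs.map(e => BindReferences.bindReference(e, attrSeq))
  }
}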

fix block miss when a DF is reused

While building TPC-H SF50 with small-scale data, we found that when an executor uses too much memory and is killed by YARN, the job fails because the broadcast block can no longer be found.

Kerberos issue

In KAP, Tomcat and the Spark driver run in the same process, and the driver is directed by the ApplicationMaster to renew tokens, so the driver-side event cancellation is removed here.

Spark cannot read Hive tables created with Parquet

Root Cause

The build did not package hadoop-bundle, so the Parquet writer is missing.

Fix Design evidence

Add hadoop-bundle to the pom.

Dev test evidence: covered by UT/IT?

QA needed? (Y/N)

Test suggestions (other features affected) for QA
