spark-overflow

A collection of Apache Spark information, solutions, and debugging tips. Feel free to open a PR to contribute what you have seen or experienced with Apache Spark.

Knowledge

Spark executor memory(ref)

spark-submit --verbose(ref)

  • Always add the --verbose option to spark-submit to print the following information:
    • All default properties
    • Command line options
    • Settings from spark conf file

Spark Executor on YARN(ref)

  • The following memory-related settings apply on YARN:
  • Memory available to YARN containers on each node - yarn.nodemanager.resource.memory-mb
  • Memory overhead added to each executor's heap - spark.yarn.executor.memoryOverhead
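The relation between these settings can be sketched as plain arithmetic. This is a minimal sketch, not Spark's own code: it assumes the default overhead is max(384 MB, 10% of executor memory), as documented for Spark 1.4 to 2.x (older releases used 7%), and it ignores YARN rounding requests up to its minimum allocation size. The function names are hypothetical.

```python
# Sketch: memory requested from YARN per Spark executor.
# Assumption: default overhead = max(384 MB, 10% of executor memory).
# YARN's rounding to yarn.scheduler.minimum-allocation-mb is ignored here.

MIN_OVERHEAD_MB = 384    # assumed minimum overhead
OVERHEAD_FACTOR = 0.10   # assumed default fraction of executor memory


def container_request_mb(executor_memory_mb, memory_overhead_mb=None):
    """Return executor heap plus overhead, i.e. the size requested from YARN."""
    if memory_overhead_mb is None:
        memory_overhead_mb = max(MIN_OVERHEAD_MB,
                                 int(executor_memory_mb * OVERHEAD_FACTOR))
    return executor_memory_mb + memory_overhead_mb


def executors_per_node(node_memory_mb, executor_memory_mb, memory_overhead_mb=None):
    """How many executors fit in yarn.nodemanager.resource.memory-mb on one node."""
    return node_memory_mb // container_request_mb(executor_memory_mb,
                                                  memory_overhead_mb)


if __name__ == "__main__":
    # --executor-memory 8g on a node offering 64 GB to YARN
    print(container_request_mb(8192))
    print(executors_per_node(65536, 8192))
```

If executors are killed by YARN for exceeding memory limits, passing an explicit `memory_overhead_mb` to this model shows how raising spark.yarn.executor.memoryOverhead reduces the number of executors that fit on a node.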

Tuning

Tune the shuffle partitions

  • Tune spark.sql.shuffle.partitions (default 200), which sets the number of partitions used when shuffling data for joins and aggregations

Avoid using jets3t 1.9(ref)

  • It is the default jar bundled with Hadoop 2.0
  • It has inexplicably terrible performance

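Use reduceByKey() instead of groupByKey()

  • reduceByKey combines the values for each key within every partition before the shuffle, so less data crosses the network

  • groupByKey shuffles every key-value pair and only groups them afterwards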
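The difference in shuffle volume can be illustrated with a small pure-Python model. This is a simulation under stated assumptions, not the Spark API: partitions are modelled as lists of (key, value) pairs, and all function names are hypothetical.

```python
# Sketch: count how many records would cross the shuffle boundary.
# Partitions are plain lists of (key, value) pairs; this is not Spark code.

def shuffle_records_group_by_key(partitions):
    """groupByKey-style: every (key, value) pair is shuffled unchanged."""
    return [pair for part in partitions for pair in part]


def shuffle_records_reduce_by_key(partitions, func):
    """reduceByKey-style: combine values per key inside each partition first."""
    shuffled = []
    for part in partitions:
        combined = {}
        for key, value in part:
            combined[key] = func(combined[key], value) if key in combined else value
        shuffled.extend(combined.items())
    return shuffled


def final_reduce(shuffled, func):
    """Merge the shuffled records into one value per key (the reduce side)."""
    result = {}
    for key, value in shuffled:
        result[key] = func(result[key], value) if key in result else value
    return result


if __name__ == "__main__":
    parts = [[("a", 1), ("b", 1), ("a", 1)],
             [("a", 1), ("b", 1), ("b", 1)]]
    add = lambda x, y: x + y
    print(len(shuffle_records_group_by_key(parts)))        # records shuffled
    print(len(shuffle_records_reduce_by_key(parts, add)))  # fewer records shuffled
```

Both paths produce the same per-key totals; only the number of records moved between partitions differs.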

GC policy(ref)

  • G1GC is a newer garbage collector you can use
  • Enable it with -XX:+UseG1GC

Join a large Table with a small table(ref)

  • The default is ShuffledHashJoin; the problem is that all the data of the large table is shuffled
  • Use BroadcastHashJoin
    • It broadcasts the small table to all workers
    • Set spark.sql.autoBroadcastJoinThreshold; tables smaller than this size in bytes are broadcast
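The idea behind a broadcast hash join can be sketched in plain Python. This is an illustrative model, not Spark's implementation: the small table becomes an in-memory lookup that every partition of the large table probes locally, so the large side never moves. The function name and data layout are hypothetical.

```python
# Sketch: broadcast hash join over lists of (key, value) pairs.
# The small table is turned into a dict once and shared with every partition.

def broadcast_hash_join(large_partitions, small_table):
    """Inner-join each partition of the large side against the small side."""
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)

    joined = []
    for part in large_partitions:          # each partition stays where it is
        for key, value in part:
            for small_value in lookup.get(key, []):
                joined.append((key, value, small_value))
    return joined


if __name__ == "__main__":
    large = [[(1, "x"), (2, "y")], [(1, "z"), (3, "w")]]
    small = [(1, "one"), (2, "two")]
    print(broadcast_hash_join(large, small))
```

The model also shows why the threshold matters: the lookup must fit in memory on every worker, so only genuinely small tables are good candidates.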

Use foreachPartition

  • If your task involves expensive setup, use foreachPartition instead of foreach so the setup runs once per partition rather than once per record
  • For example: a DB connection, a remote call, etc.
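The saving can be shown with a small pure-Python model that counts how often a connection is opened. `FakeConnection` is a hypothetical stand-in for a real DB client, and partitions are plain lists; this is a sketch of the pattern, not Spark code.

```python
# Sketch: per-record setup versus per-partition setup.

class FakeConnection:
    """Hypothetical stand-in for a DB connection; counts how often it is opened."""
    opened = 0

    def __init__(self):
        FakeConnection.opened += 1
        self.rows = []

    def write(self, row):
        self.rows.append(row)

    def close(self):
        pass


def write_per_record(partitions):
    """foreach-style: a new connection for every single record."""
    for part in partitions:
        for row in part:
            conn = FakeConnection()
            conn.write(row)
            conn.close()


def write_per_partition(partitions):
    """foreachPartition-style: one connection reused for a whole partition."""
    for part in partitions:
        conn = FakeConnection()
        try:
            for row in part:
                conn.write(row)
        finally:
            conn.close()


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5]]
    write_per_record(data)
    print(FakeConnection.opened)   # one connection per record
    FakeConnection.opened = 0
    write_per_partition(data)
    print(FakeConnection.opened)   # one connection per partition
```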

Data Serialization

  • The default Java serialization is slow
  • Use Kryo
    • conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

Solution

java.io.IOException: No space left on device

  • This means /tmp is full; check spark.local.dir in spark-defaults.conf
  • How to fix? Mount more disk space and list the directories in spark.local.dir
    • spark.local.dir /data/disk1/tmp,/data/disk2/tmp,/data/disk3/tmp,/data/disk4/tmp
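Before changing the setting, it can help to check how much space each configured directory actually has. The helper below is a minimal sketch using only the Python standard library; the function name is hypothetical and it simply parses a comma-separated value like the one above.

```python
# Sketch: report free space for each directory in a spark.local.dir value.
import os
import shutil


def free_space_report(local_dirs):
    """Map each directory in a comma-separated list to its free bytes.

    Directories that do not exist are reported as None.
    """
    report = {}
    for path in local_dirs.split(","):
        path = path.strip()
        if not path:
            continue
        if os.path.isdir(path):
            report[path] = shutil.disk_usage(path).free
        else:
            report[path] = None
    return report


if __name__ == "__main__":
    dirs = "/data/disk1/tmp,/data/disk2/tmp,/data/disk3/tmp,/data/disk4/tmp"
    for path, free in free_space_report(dirs).items():
        print(path, "missing" if free is None else f"{free / 2**30:.1f} GiB free")
```

Run it on each worker node, since spark.local.dir refers to local disks on the machines where executors run.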

java.lang.OutOfMemoryError: GC overhead limit exceeded(ref)

  • Too much time is spent in GC; you can check this in the Spark metrics
  • How to fix?
    • Increase executor heap size by --executor-memory
    • Increase spark.storage.memoryFraction
    • Change the GC policy (e.g. use G1GC)
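As a rough illustration, the first and last fixes can be combined into a spark-submit command line. The helper below is a hypothetical sketch that only assembles arguments; it assumes the G1GC flag is passed to executors through spark.executor.extraJavaOptions, and the memory value and application jar name are placeholders.

```python
# Sketch: assemble spark-submit arguments for a larger heap and G1GC.
import shlex


def gc_tuning_args(app, executor_memory="8g", use_g1gc=True):
    """Return the argument list for spark-submit; nothing is executed."""
    args = ["spark-submit", "--verbose", "--executor-memory", executor_memory]
    if use_g1gc:
        # Assumption: executor JVM flags go through spark.executor.extraJavaOptions
        args += ["--conf", "spark.executor.extraJavaOptions=-XX:+UseG1GC"]
    args.append(app)
    return args


if __name__ == "__main__":
    # "my-app.jar" is a placeholder application name
    print(" ".join(shlex.quote(a) for a in gc_tuning_args("my-app.jar")))
```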

shutting down ActorSystem [sparkDriver] java.lang.OutOfMemoryError: Java heap space(ref)

  • OOM on Spark driver
  • Usually happens when you fetch a huge amount of data to the driver (client)
  • Spark SQL and Streaming are typical workloads that need a large heap on the driver
  • How to fix?
    • Increase --driver-memory

java.lang.NoClassDefFoundError(ref)

  • The code compiles, but the error occurs at run time
  • How to fix?
    • Use --jars to upload JARs and place them on the classpath of your application
    • Use --packages to include a comma-separated list of Maven coordinates of JARs.
      Example: --packages com.google.code.gson:gson:2.6.2
      This example adds the gson JAR to both the executor and driver classpaths

Serialization stack error

  • The error message looks like:
    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: com.spark.demo.MyClass Serialization stack:
    -object not serializable (class: com.spark.demo.MyClass, value: com.spark.demo.MyClass@6951e281)
    -element of array (index: 0)
    -array (class [Ljava.lang.Object;, size 6)
  • How to fix?
    • Make com.spark.demo.MyClass implement java.io.Serializable

java.io.FileNotFoundException: spark-assembly.jar does not exist

  • How to fix?
  1. Upload spark-assembly.jar to Hadoop (HDFS)
  2. Set spark.yarn.jar; there are two ways to configure it
    • Add --conf spark.yarn.jar when launching spark-submit
    • Set spark.yarn.jar on the SparkConf in your Spark driver.

java.io.IOException: Resource spark-assembly.jar changed on src filesystem (ref)

  • spark-assembly.jar exists in HDFS, but you still get the assembly jar changed error.
  • How to fix?
  1. Upload spark-assembly.jar to Hadoop (HDFS)
  2. Set spark.yarn.jar; there are two ways to configure it
    • Add --conf spark.yarn.jar when launching spark-submit
    • Set spark.yarn.jar on the SparkConf in your Spark driver.
