ehiggs / spark-terasort

Spark Terasort

License: Apache License 2.0

Java 42.99% Shell 18.30% Scala 38.70%

terasort spark

spark-terasort's Introduction

TeraSort benchmark for Spark

This is an example Spark program for running TeraSort benchmarks. It is based on work from Reynold Xin's branch, but it is not the same TeraSort program that currently holds the record. That program is here.

Building

mvn install

The default is to link against Spark 2.4.4 jars (released September 2019). If you plan to run on an older version of Spark (e.g. 1.6), try building with -Dspark.version=1.6. If possible, it's probably a better idea to update to a more recent version of Spark.

Running

cd to your Spark install.

Generate data

./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraGen \
path/to/spark-terasort/target/spark-terasort-1.2-SNAPSHOT-jar-with-dependencies.jar \
1g file://$HOME/data/terasort_in
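
To get a feel for what a size argument such as 1g means, here is a minimal Java sketch that converts a size string into a record count. It assumes the standard TeraSort record layout of 100 bytes (a 10-byte key plus a 90-byte value) and decimal multipliers (1g = 10^9 bytes); the decimal assumption agrees with the "num records: 10000000" line reported for a 1g run in the issues further down. The class and method names are hypothetical and are not taken from this repository.

```java
import java.util.Locale;

public class TeraSize {
    // Assumed standard TeraSort record: 10-byte key + 90-byte value.
    static final long RECORD_BYTES = 100;

    // Parse sizes such as "1g" or "500m" using decimal multipliers (an assumption).
    static long sizeToBytes(String size) {
        String t = size.trim().toLowerCase(Locale.ROOT);
        if (t.isEmpty()) {
            throw new IllegalArgumentException("empty size");
        }
        long multiplier;
        switch (t.charAt(t.length() - 1)) {
            case 'k': multiplier = 1_000L; break;
            case 'm': multiplier = 1_000_000L; break;
            case 'g': multiplier = 1_000_000_000L; break;
            case 't': multiplier = 1_000_000_000_000L; break;
            default:  return Long.parseLong(t); // no unit suffix: plain bytes
        }
        return Long.parseLong(t.substring(0, t.length() - 1)) * multiplier;
    }

    // Number of whole records that fit in the requested size.
    static long recordCount(String size) {
        return sizeToBytes(size) / RECORD_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(recordCount("1g")); // prints 10000000
    }
}
```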

Sort the data

./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraSort \
path/to/spark-terasort/target/spark-terasort-1.2-SNAPSHOT-jar-with-dependencies.jar \
file://$HOME/data/terasort_in file://$HOME/data/terasort_out

Validate the data

./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraValidate \
path/to/spark-terasort/target/spark-terasort-1.2-SNAPSHOT-jar-with-dependencies.jar \
file://$HOME/data/terasort_out file://$HOME/data/terasort_validate

Known issues

Performance

This TeraSort doesn't use the partitioning scheme that Hadoop's TeraSort uses, so performance is not very good. I could copy the partitioning code from the Hadoop tree verbatim, but I thought it would be more appropriate to rewrite more of it in Scala.

I haven't pulled in the DaytonaPartitioner from the record-holding sort yet because it's pretty intertwined with the rest of the code and, AFAIK, it's not really idiomatic Spark.
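
For readers unfamiliar with the idea, a sampling-based range partitioner can be sketched roughly as follows: sample some keys, pick evenly spaced split points from the sorted sample, then route each key to a partition by binary search. This is a simplified illustration in Java, not the code from Hadoop, from this repository, or from the record-holding sort; all names are hypothetical, and Hadoop's own implementation differs in its details.

```java
import java.util.Arrays;

public class RangePartitionSketch {
    // Compare keys lexicographically, treating each byte as unsigned.
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Choose numPartitions - 1 evenly spaced split points from a non-empty key sample.
    static byte[][] splitPoints(byte[][] sample, int numPartitions) {
        byte[][] sorted = sample.clone();
        Arrays.sort(sorted, RangePartitionSketch::compareKeys);
        byte[][] splits = new byte[numPartitions - 1][];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted[(int) ((long) i * sorted.length / numPartitions)];
        }
        return splits;
    }

    // The partition index is the number of split points <= key (binary search).
    static int partitionFor(byte[] key, byte[][] splits) {
        int lo = 0;
        int hi = splits.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareKeys(key, splits[mid]) >= 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        byte[][] sample = { {9}, {1}, {5}, {3}, {7}, {2}, {8}, {4} };
        byte[][] splits = splitPoints(sample, 4); // split points 3, 5, 8
        System.out.println(partitionFor(new byte[] {0}, splits)); // prints 0
        System.out.println(partitionFor(new byte[] {6}, splits)); // prints 2
    }
}
```

Because every key in partition i sorts before every key in partition i + 1, sorting each partition independently yields a globally sorted output, and a representative sample keeps the partitions roughly equal in size.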

Functionality on native file systems

TeraValidate can read the file parts in the wrong order on native file systems (e.g. if you run Spark on your laptop, on Lustre, Panasas, etc.). HDFS apparently always returns the files in alphanumeric order, so most Hadoop users aren't affected. I thought I had fixed this in the TeraInputFormat, but I have been able to reproduce it since migrating the code from my Spark terasort branch.
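
One way to make a validation step independent of the listing order is to sort the part file names explicitly before comparing the keys at part boundaries. The Java sketch below illustrates that idea only; it is not the fix used in this repository. The part file names are hypothetical, and keys are shown as ASCII strings for brevity, whereas real TeraSort keys are 10 raw bytes compared as unsigned values.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class PartOrderSketch {
    // Each map value holds {firstKey, lastKey} for one part file.
    // Returns true if, visiting parts in sorted name order, the last key of
    // each part does not exceed the first key of the next part.
    static boolean boundariesInOrder(Map<String, String[]> firstAndLastKeyByPart) {
        String previousLast = null;
        // TreeSet sorts names as strings; zero-padded names sort numerically too.
        for (String name : new TreeSet<>(firstAndLastKeyByPart.keySet())) {
            String[] firstAndLast = firstAndLastKeyByPart.get(name);
            if (previousLast != null && previousLast.compareTo(firstAndLast[0]) > 0) {
                return false;
            }
            previousLast = firstAndLast[1];
        }
        return true;
    }

    public static void main(String[] args) {
        // A HashMap iterates in no particular order, like a native file system listing.
        Map<String, String[]> parts = new HashMap<>();
        parts.put("part-r-00002", new String[] {"MMM", "ZZZ"});
        parts.put("part-r-00000", new String[] {"AAA", "FFF"});
        parts.put("part-r-00001", new String[] {"FFG", "MMA"});
        System.out.println(boundariesInOrder(parts)); // prints true
    }
}
```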

Contributing

PRs are very welcome!

spark-terasort's Issues

Why doesn't TeraSort have a sampling job?

Oozie workflow executing a jar throws "java.lang.NoSuchMethodError: scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef" Error

Hi,

I'm using Spark 2.1.0 and Scala 2.11.8. I developed the application, and spark-submit works fine with the application jar. But when I set up an Oozie workflow and trigger spark-submit using a shell action (version 0.2), I get the "java.lang.NoSuchMethodError: scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef" error. Please let me know which configurations I should check.

TeraSort: Getting "Final app status: FAILED, exitCode: 16"

I'm using HDP-2.6.4 (with Spark 2.2.0).
The application runs and produces sorted files (which I verified using TeraValidate).
At the end, an annoying "status: FAILED" error appears.

I don't recall this behavior with HDP 2.4.3 (Spark 1.6.0).

IMHO, the problem/solution is described here:
https://stackoverflow.com/questions/41468833/why-does-spark-exit-with-exitcode-16/41479296

Terasort error

Hi,
I am trying to run the Spark TeraSort app on a RoCE network. The TeraGen code works fine, but the TeraSort code fails with the following error message:

All of these error messages come from the Spark job history server.

TeraGenerate not output to file

Having looked at TeraGenerate.scala it seems the output is not written to file even though the README example includes an output file parameter. Is this correct?

Works out of the box but how to extract and interpret the numbers?

Fellow Experts,

Super pleased to get it up and running in no time. But how do I:

extract and interpret the numbers, and tell whether they are good or bad?
tell whether Spark has been properly tuned?

Thanks in advance.

./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraSort \
path/to/spark-terasort/target/spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar \
file://$HOME/data/terasort_in file://$HOME/data/terasort_out

val dataset = sc.newAPIHadoopFile[Array[Byte], Array[Byte], TeraInputFormat](inputFile)

Scala runtime error

Hi,

#!/usr/bin/env bash

spark-submit --queue "priority" --master yarn --deploy-mode client --class com.github.ehiggs.spark.terasort.TeraGen /tmp/spark-terasort/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar 1g /teraInput

spark-submit --queue "priority" --master yarn --deploy-mode client --class com.github.ehiggs.spark.terasort.TeraSort /tmp/spark-terasort/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar /teraInput /teraSort_out

spark-submit --queue "priority" --master yarn --deploy-mode client --class com.github.ehiggs.spark.terasort.TeraValidate /tmp/spark-terasort/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar /teraSort_out /teraSort_valid

Running the above succeeded, although it ended with an exception that doesn't seem to have been handled.

num records: 10000000
checksum: 4c48c175ea9cd9
Exception in thread "main" java.lang.NoSuchMethodError: scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef;
	at com.github.ehiggs.spark.terasort.TeraValidate$.validate(TeraValidate.scala:108)
	at com.github.ehiggs.spark.terasort.TeraValidate$.main(TeraValidate.scala:61)
	at com.github.ehiggs.spark.terasort.TeraValidate.main(TeraValidate.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Have you seen this?
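
A NoSuchMethodError on scala.runtime.ObjectRef.create commonly points to a Scala binary version mismatch: the static create method exists in scala-library 2.11 and later, so a jar compiled against 2.11 will fail this way if the runtime classpath supplies an older scala-library such as 2.10. Whether that is the cause here is an assumption. A small Java probe like the one below, run with the same classpath as the failing job, can show whether the method is visible; the class and helper names are hypothetical.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

public class ClasspathProbe {
    // Returns true if className is loadable and exposes a public static
    // method with the given name and parameter types.
    static boolean hasStaticMethod(String className, String method, Class<?>... params) {
        try {
            Method m = Class.forName(className).getMethod(method, params);
            return Modifier.isStatic(m.getModifiers());
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The descriptor in the error message takes a single java.lang.Object argument.
        boolean available = hasStaticMethod("scala.runtime.ObjectRef", "create", Object.class);
        System.out.println("ObjectRef.create available: " + available);
    }
}
```

If the probe reports false on the cluster but true on the build machine, rebuilding the jar against the Scala version that the cluster's Spark distribution was built with is the usual next step.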