
spark-terasort's Introduction

TeraSort benchmark for Spark


This is an example Spark program for running TeraSort benchmarks. It is based on work from Reynold Xin's branch, but it is not the same TeraSort program that currently holds the record. That program is here.

Building

mvn install

The default is to link against Spark 2.4.4 jars (released September 2019). If you plan to run on an older version of Spark (e.g. 1.6), try building with -Dspark.version=1.6. If possible, it's probably a better idea to just update to a more recent version of Spark.

Running

cd to your Spark install.

Generate data

./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraGen \
  path/to/spark-terasort/target/spark-terasort-1.2-SNAPSHOT-jar-with-dependencies.jar \
  1g file://$HOME/data/terasort_in
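
The first argument after the jar is the amount of data to generate. As a rough guide to what a size such as 1g means in records, here is a minimal sketch in plain Scala. It assumes the standard 100-byte TeraSort record (10-byte key, 90-byte value) and decimal suffixes. One of the issue reports further down this page shows a 1g run validating 10000000 records, which is consistent with that reading. The suffix table and the names TeraGenSize, sizeToBytes and recordCount are illustrative assumptions, not the project's own parser.

```scala
object TeraGenSize {
  // Standard TeraSort record: 10-byte key + 90-byte value.
  val RecordBytes = 100L

  // Assumed decimal multipliers; only the "g" case is backed by
  // the 1g -> 10000000 records run reported for this project.
  private val multipliers: Map[Char, Long] = Map(
    'k' -> 1000L,
    'm' -> 1000000L,
    'g' -> 1000000000L,
    't' -> 1000000000000L
  )

  // Convert a size string such as "1g" or "500" to a byte count.
  def sizeToBytes(size: String): Long = {
    val s = size.trim.toLowerCase
    require(s.nonEmpty, "size must not be empty")
    multipliers.get(s.last) match {
      case Some(mult) => s.init.toLong * mult
      case None       => s.toLong
    }
  }

  // Number of whole records that fit in the requested size.
  def recordCount(size: String): Long = sizeToBytes(size) / RecordBytes

  def main(args: Array[String]): Unit =
    println(s"1g is about ${recordCount("1g")} records")
}
```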

Sort the data

./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraSort \
  path/to/spark-terasort/target/spark-terasort-1.2-SNAPSHOT-jar-with-dependencies.jar \
  file://$HOME/data/terasort_in file://$HOME/data/terasort_out

Validate the data

./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraValidate \
  path/to/spark-terasort/target/spark-terasort-1.2-SNAPSHOT-jar-with-dependencies.jar \
  file://$HOME/data/terasort_out file://$HOME/data/terasort_validate

Known issues

Performance

This terasort doesn't use the partitioning scheme that Hadoop's TeraSort uses, so its performance is not very good. I could copy the partitioning code from the Hadoop tree verbatim, but I thought it would be more appropriate to rewrite more of it in Scala.

I haven't pulled the DaytonaPartitioner from the record-holding sort yet because it's pretty intertwined with the rest of the code and AFAIK it's not really idiomatic Spark.
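
To show why the partitioning scheme matters for a distributed sort, here is a minimal sketch of range partitioning in plain Scala. Keys are assigned to partitions by comparing them against sorted split points, so sorting each partition on its own and concatenating the partitions in order yields globally sorted output. This is neither Hadoop's TeraSort partitioner nor the DaytonaPartitioner, only the general idea. The object and method names are hypothetical, and choosing split points from the full key set stands in for the sampling a real job would do.

```scala
object RangePartitionSketch {
  // Compare keys byte by byte, treating each byte as unsigned.
  def compareKeys(a: Array[Byte], b: Array[Byte]): Int =
    a.zip(b).iterator
      .map { case (x, y) => (x & 0xff) - (y & 0xff) }
      .find(_ != 0)
      .getOrElse(a.length - b.length)

  // Pick numPartitions - 1 evenly spaced split points from a sorted sample.
  def splitPoints(sample: Seq[Array[Byte]], numPartitions: Int): Vector[Array[Byte]] = {
    require(sample.nonEmpty && numPartitions > 0, "need a sample and at least one partition")
    val sorted = sample.sortWith((x, y) => compareKeys(x, y) < 0).toVector
    (1 until numPartitions).map(i => sorted(i * sorted.length / numPartitions)).toVector
  }

  // Partition index = number of split points that are <= key.
  def partitionFor(key: Array[Byte], splits: Vector[Array[Byte]]): Int =
    splits.count(s => compareKeys(s, key) <= 0)

  // Partition, sort each partition independently, then concatenate in order.
  def sortViaPartitions(keys: Seq[Array[Byte]], numPartitions: Int): Seq[Array[Byte]] = {
    val splits = splitPoints(keys, numPartitions)
    val buckets = keys.groupBy(k => partitionFor(k, splits))
    (0 until numPartitions).flatMap { p =>
      buckets.getOrElse(p, Seq.empty[Array[Byte]])
        .sortWith((x, y) => compareKeys(x, y) < 0)
    }
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq("pear", "apple", "zebra", "mango", "kiwi", "fig")
      .map(_.getBytes("US-ASCII"))
    val sorted = sortViaPartitions(keys, 3).map(k => new String(k, "US-ASCII"))
    println(sorted.mkString(" "))
  }
}
```

With split points that divide the key space evenly, each partition receives a similar share of the data. Poorly chosen split points leave some partitions much larger than others, which is one way a partitioning scheme can hurt sort performance.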

Functionality on native file systems

TeraValidate can read the file parts in the wrong order on native file systems (e.g. if you run Spark on your laptop, on Lustre, Panasas, etc). HDFS apparently always returns the files in alphanumeric order so most Hadoop users aren't affected. I thought I fixed this in the TeraInputFormat, but I was able to reproduce it since migrating the code from my Spark terasort branch.
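
One way to sidestep the ordering problem is to sort the part files by name before reading them. The sketch below does this with java.io.File only. The project itself reads through a Hadoop input format, so where such a sort would belong in the real code is an assumption, as are the part- file name prefix and the names PartFileOrder and sortedPartFiles.

```scala
import java.io.File

object PartFileOrder {
  // Return the part files in a directory sorted by file name, so that
  // part-r-00000 comes before part-r-00001 regardless of the order in
  // which the underlying file system lists them.
  def sortedPartFiles(dir: File): Seq[File] = {
    // listFiles() returns null if dir is not a readable directory.
    val listed = Option(dir.listFiles()).getOrElse(Array.empty[File])
    listed
      .filter(f => f.isFile && f.getName.startsWith("part-"))
      .sortBy(_.getName)
      .toSeq
  }

  def main(args: Array[String]): Unit =
    args.headOption.foreach { path =>
      sortedPartFiles(new File(path)).foreach(f => println(f.getName))
    }
}
```

Sorting by name works here because the part numbers are zero-padded, so lexicographic order matches numeric order.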

Contributing

PRs are very welcome!

spark-terasort's People

Contributors

cgumpert, ehiggs, franklsf95, michaelkamprath, pspoerri, zjffdu


spark-terasort's Issues

Oozie workflow executing a jar throws "java.lang.NoSuchMethodError: scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef" Error

Hi,

I'm using Spark 2.1.0 and Scala 2.11.8. I developed the application, and spark-submit works fine with the application jar. But when I set up an Oozie workflow and trigger spark-submit from a shell action (version 0.2), I get the "java.lang.NoSuchMethodError: scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef" error. Please let me know which configurations I should check.

Terasort error

Hi,
I am trying to run the Spark terasort app on a RoCE network. The TeraGen code works fine, but the TeraSort code fails with the following error messages:

[screenshots: error2, error3]

All of these error messages come from the Spark job history server.

TeraGenerate not output to file

Having looked at TeraGenerate.scala it seems the output is not written to file even though the README example includes an output file parameter. Is this correct?

Add to maven central

Wouldn't it be awesome if your example could be run in just one line from spark-submit?

terasort supports only Hadoop file system. It does not accept Linux in memory file system.

Terasort only accepts an HDFS path; it does not accept an OS path to sort. Is that because of newAPIHadoopFile?

Terasort syntax:

./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraSort \
  path/to/spark-terasort/target/spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar \
  file://$HOME/data/terasort_in file://$HOME/data/terasort_out

val dataset = sc.newAPIHadoopFile[Array[Byte], Array[Byte], TeraInputFormat](inputFile)

Scala runtime error

Hi,

#!/usr/bin/env bash

spark-submit --queue "priority" --master yarn --deploy-mode client --class com.github.ehiggs.spark.terasort.TeraGen /tmp/spark-terasort/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar 1g /teraInput

spark-submit --queue "priority" --master yarn --deploy-mode client --class com.github.ehiggs.spark.terasort.TeraSort /tmp/spark-terasort/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar /teraInput /teraSort_out

spark-submit --queue "priority" --master yarn --deploy-mode client --class com.github.ehiggs.spark.terasort.TeraValidate /tmp/spark-terasort/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar /teraSort_out /teraSort_valid

Running the above succeeded, although it ended with an exception that doesn't seem to have been handled.

num records: 10000000
checksum: 4c48c175ea9cd9
Exception in thread "main" java.lang.NoSuchMethodError: scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef;
	at com.github.ehiggs.spark.terasort.TeraValidate$.validate(TeraValidate.scala:108)
	at com.github.ehiggs.spark.terasort.TeraValidate$.main(TeraValidate.scala:61)
	at com.github.ehiggs.spark.terasort.TeraValidate.main(TeraValidate.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Have you seen this?
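
A likely cause, offered as an assumption rather than a confirmed diagnosis of this report, is a Scala binary version mismatch: scala.runtime.ObjectRef.create was added in Scala 2.11, so a jar compiled against 2.11 or later fails with this NoSuchMethodError on a Scala 2.10 runtime. The small sketch below prints the Scala version that a JVM is actually running, which can then be compared with the version the jar was built against. The object name ScalaRuntimeCheck and the helper binaryVersion are hypothetical.

```scala
object ScalaRuntimeCheck {
  // Reduce a full version such as "2.11.8" to its binary version "2.11".
  def binaryVersion(full: String): String =
    full.split('.').take(2).mkString(".")

  def main(args: Array[String]): Unit = {
    // Version of the scala-library found on the runtime classpath.
    val running = scala.util.Properties.versionNumberString
    println(s"Scala runtime: $running (binary version ${binaryVersion(running)})")
  }
}
```

If the binary version printed on the cluster differs from the one used at build time, rebuilding against the cluster's Spark and Scala versions is the usual remedy.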
