
hibench's Introduction

HiBench Suite

The big data micro benchmark suite


OVERVIEW

HiBench is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilization. It contains a set of Hadoop, Spark and streaming workloads, including Sort, WordCount, TeraSort, Repartition, Sleep, SQL, PageRank, Nutch indexing, Bayes, Kmeans, NWeight and enhanced DFSIO, etc. It also contains several streaming workloads for Spark Streaming, Flink, Storm and Gearpump.

Getting Started

Workloads

There are 29 workloads in HiBench in total. The workloads are divided into 6 categories: micro, ml (machine learning), sql, graph, websearch and streaming.
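
As a rough sketch for orientation: each workload ships a prepare script that generates its input data and per-framework run scripts, and results are appended to the report file. The paths below follow the bin/workloads/<category>/<workload>/ layout of recent HiBench releases and may differ in your version, so treat them as assumptions rather than the definitive layout.

    # Generate WordCount input data on HDFS, then run the Hadoop and Spark variants.
    bin/workloads/micro/wordcount/prepare/prepare.sh
    bin/workloads/micro/wordcount/hadoop/run.sh
    bin/workloads/micro/wordcount/spark/run.sh
    # Durations and throughput are appended to report/hibench.report.
    cat report/hibench.report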

Micro Benchmarks:

  1. Sort (sort)

    This workload sorts its text input data, which is generated using RandomTextWriter.

  2. WordCount (wordcount)

    This workload counts the occurrence of each word in the input data, which is generated using RandomTextWriter. It is representative of another typical class of real-world MapReduce jobs: extracting a small amount of interesting data from a large data set.

  3. TeraSort (terasort)

    TeraSort is a standard benchmark created by Jim Gray. Its input data is generated by the Hadoop TeraGen example program.

  4. Repartition (micro/repartition)

    This workload benchmarks shuffle performance. Input data is generated by Hadoop TeraGen. The workload randomly selects the post-shuffle partition for each record, then performs the shuffle write and read, repartitioning the records evenly. Two parameters provide options to eliminate data source and sink I/O: hibench.repartition.cacheinmemory (default: false) and hibench.repartition.disableOutput (default: false), which control whether to 1) cache the input in memory first and 2) write the result to storage. An illustrative configuration snippet follows this list.

  5. Sleep (sleep)

    This workload sleeps for a number of seconds in each task to test the framework scheduler.

  6. enhanced DFSIO (dfsioe)

    Enhanced DFSIO tests the HDFS throughput of the Hadoop cluster by generating a large number of tasks performing writes and reads simultaneously. It measures the average I/O rate of each map task, the average throughput of each map task, and the aggregated throughput of the HDFS cluster. Note: this benchmark doesn't have a corresponding Spark implementation.
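
As mentioned for the Repartition workload above, the two options controlling data source and sink I/O are plain HiBench properties. A hedged illustration of how they might appear in a workload conf file (the values here are only examples; both default to false):

    hibench.repartition.cacheinmemory true
    hibench.repartition.disableOutput true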

Machine Learning:

  1. Bayesian Classification (Bayes)

    Naive Bayes is a simple multiclass classification algorithm with the assumption of independence between every pair of features. This workload is implemented in spark.mllib and uses automatically generated documents whose words follow the Zipfian distribution. The dict used for text generation is also from the default Linux file /usr/share/dict/linux.words.

  2. K-means clustering (Kmeans)

    This workload tests K-means (a well-known clustering algorithm for knowledge discovery and data mining) clustering in spark.mllib. The input data set is generated by GenKMeansDataset based on Uniform Distribution and Gaussian Distribution. There is also an optimized K-means implementation based on DAL (Intel Data Analytics Library), which is available in the dal module of sparkbench.

  3. Gaussian Mixture Model (GMM)

    Gaussian Mixture Model represents a composite distribution whereby points are drawn from one of k Gaussian sub-distributions, each with its own probability. It is implemented in spark.mllib. The input data set is generated by GenKMeansDataset based on Uniform Distribution and Gaussian Distribution.

  4. Logistic Regression (LR)

    Logistic Regression (LR) is a popular method to predict a categorical response. This workload is implemented in spark.mllib with the LBFGS optimizer, and the input data set is generated by LogisticRegressionDataGenerator based on a random balanced decision tree. It contains three different kinds of data types: categorical data, continuous data, and binary data.

  5. Alternating Least Squares (ALS)

    The alternating least squares (ALS) algorithm is a well-known algorithm for collaborative filtering. This workload is implemented in spark.mllib and the input data set is generated by RatingDataGenerator for a product recommendation system.

  6. Gradient Boosted Trees (GBT)

    Gradient-boosted trees (GBT) is a popular regression method using ensembles of decision trees. This workload is implemented in spark.mllib and the input data set is generated by GradientBoostedTreeDataGenerator.

  7. XGBoost (XGBoost)

    XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. This workload is implemented with XGBoost4J-Spark API in spark.mllib and the input data set is generated by GradientBoostedTreeDataGenerator.

  8. Linear Regression (Linear)

    Linear Regression (Linear) is a workload implemented in spark.ml with ElasticNet. The input data set is generated by LinearRegressionDataGenerator.

  9. Latent Dirichlet Allocation (LDA)

    Latent Dirichlet allocation (LDA) is a topic model which infers topics from a collection of text documents. This workload is implemented in spark.mllib and the input data set is generated by LDADataGenerator.

  10. Principal Components Analysis (PCA)

    Principal component analysis (PCA) is a statistical method to find a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible. PCA is used widely in dimensionality reduction. This workload is implemented in spark.ml. The input data set is generated by PCADataGenerator.

  11. Random Forest (RF)

    Random forests (RF) are ensembles of decision trees. Random forests are one of the most successful machine learning models for classification and regression. They combine many decision trees in order to reduce the risk of overfitting. This workload is implemented in spark.mllib and the input data set is generated by RandomForestDataGenerator.

  12. Support Vector Machine (SVM)

    Support Vector Machine (SVM) is a standard method for large-scale classification tasks. This workload is implemented in spark.mllib and the input data set is generated by SVMDataGenerator.

  13. Singular Value Decomposition (SVD)

    Singular value decomposition (SVD) factorizes a matrix into three matrices. This workload is implemented in spark.mllib and its input data set is generated by SVDDataGenerator.

SQL:

  1. Scan (scan), 2. Join (join), 3. Aggregate (aggregation)

    These workloads are developed based on the SIGMOD 09 paper "A Comparison of Approaches to Large-Scale Data Analysis" and HIVE-396. They contain Hive queries (Aggregation and Join) performing the typical OLAP queries described in the paper. The input is automatically generated Web data with hyperlinks following the Zipfian distribution.

Websearch Benchmarks:

  1. PageRank (pagerank)

    This workload benchmarks the PageRank algorithm implemented in the Spark-MLlib/Hadoop examples (a search engine ranking benchmark included in Pegasus 2.0). The data source is generated from Web data whose hyperlinks follow the Zipfian distribution.

  2. Nutch indexing (nutchindexing)

    Large-scale search indexing is one of the most significant uses of MapReduce. This workload tests the indexing sub-system of Nutch, a popular open source (Apache project) search engine. The workload uses automatically generated Web data whose hyperlinks and words both follow the Zipfian distribution with corresponding parameters. The dict used to generate the Web page texts is the default Linux dict file.

Graph Benchmark:

  1. NWeight (nweight)

    NWeight is an iterative graph-parallel algorithm implemented with Spark GraphX and Pregel. The algorithm computes associations between two vertices that are n hops away.

Streaming Benchmarks:

  1. Identity (identity)

    This workload reads input data from Kafka and then writes the results back to Kafka immediately; there is no complex business logic involved.

  2. Repartition (streaming/repartition)

    This workload reads input data from Kafka and changes the level of parallelism by creating more or fewer partitions. It tests the efficiency of data shuffle in the streaming frameworks.

  3. Stateful Wordcount (wordcount)

    This workload counts words cumulatively received from Kafka every few seconds. It tests stateful operator performance and the Checkpoint/Acker cost in the streaming frameworks.

  4. Fixwindow (fixwindow)

    This workload performs a window-based aggregation. It tests the performance of the window operation in the streaming frameworks.

Supported Hadoop/Spark/Flink/Storm/Gearpump releases:

  • Hadoop: Apache Hadoop 3.0.x, 3.1.x, 3.2.x, 2.x, CDH5, HDP
  • Spark: Spark 2.4.x, Spark 3.0.x, Spark 3.1.x
  • Flink: 1.0.3
  • Storm: 1.0.1
  • Gearpump: 0.8.1
  • Kafka: 0.8.2.2



hibench's Issues

Nutchindex fails

FATAL indexer.Indexer: Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:76)
at org.apache.nutch.indexer.Indexer.run(Indexer.java:97)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.Indexer.main(Indexer.java:106)

Check #77 for more details. Let's use this ticket to track the status.
Thanks @Jay7

shell script exits with unbound variable error

In ./HiBench/bin/hibench-config.sh, the first part of the global paths section checks to see if HADOOP_HOME is set and if it is then it sets the conf, bin, and jar directory locations for hadoop.

HADOOP_EXECUTABLE=
HADOOP_CONF_DIR=
HADOOP_EXAMPLES_JAR=

if [ -n "$HADOOP_HOME" ]; then
HADOOP_EXECUTABLE=$HADOOP_HOME/bin/hadoop
HADOOP_CONF_DIR=$HADOOP_HOME/conf
HADOOP_EXAMPLES_JAR=$HADOOP_HOME/hadoop-examples*.jar
else

If HADOOP_HOME is not set, then the script is supposed to try to guess what the directories are, and if that fails, display an error to the user about the need to set these paths. However, instead of that happening, the script exits with an unbound variable error.

The cause of this problem is a change made in pull request #29. In that pull request, set -u was placed at the top of the shell script. This causes the shell script to exit with a non-zero exit code if it encounters an unbound variable.

I think there are two solutions here. One is to remove the set -u. I do not prefer that option because set -u is useful for catching basic problems in the script. The other option, which I prefer, is to first set the HADOOP_HOME variable using the printenv command. The printenv command returns either the value of the environment variable if it is set or an empty string if it is not. If this is done, then the if [ -n "$HADOOP_HOME" ]; check will behave as expected.

Here's what this would look like.

HADOOP_EXECUTABLE=
HADOOP_CONF_DIR=
HADOOP_EXAMPLES_JAR=

HADOOP_HOME=`printenv HADOOP_HOME`
HADOOP_EXECUTABLE=`printenv HADOOP_EXECUTABLE`
HIBENCH_HOME=`printenv HIBENCH_HOME`
HIBENCH_CONF=`printenv HIBENCH_CONF`
HIVE_HOME=`printenv HIVE_HOME`
MAHOUT_HOME=`printenv MAHOUT_HOME`
NUTCH_HOME=`printenv NUTCH_HOME`
DATATOOLS=`printenv DATATOOLS`

if [ -n "$HADOOP_HOME" ]; then
HADOOP_EXECUTABLE=$HADOOP_HOME/bin/hadoop
HADOOP_CONF_DIR=$HADOOP_HOME/conf
HADOOP_EXAMPLES_JAR=$HADOOP_HOME/hadoop-examples*.jar
else
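
A third option, sketched here only as a possibility and untested, is to rely on Bash's default-value parameter expansion, which also works under set -u:

# ${HADOOP_HOME:-} expands to the variable's value, or to an empty string when it
# is unset, so set -u does not abort the script before the check below runs.
HADOOP_HOME=${HADOOP_HOME:-}

if [ -n "$HADOOP_HOME" ]; then
HADOOP_EXECUTABLE=$HADOOP_HOME/bin/hadoop
HADOOP_CONF_DIR=$HADOOP_HOME/conf
HADOOP_EXAMPLES_JAR=$HADOOP_HOME/hadoop-examples*.jar
fi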

Let me know your thoughts on this. I can of course take care of the fix via a pull request if that's desired.

no main manifest attribute in "datatools.jar"

I downloaded the latest HiBench/tree/yarn. When attempting to run any of the tests which use datatools.jar, it complains that it is not a valid jar. When I do java -jar datatools.jar, it complains there is "no main manifest attribute, in datatools.jar".

HiBench hivebench external table field delimiter ',' problem

The hivebench external table is delimited by the comma ','. However, the 'useragent' field contains commas in the raw data, like "Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/xxx", so Hive splits the field in the wrong place: the original useragent data spills into the next field, and all of the following fields are shifted and wrong. For example:
48.230.80.233 wyfctppjxtyhbcbngouswjzsekwdzqiaaapmomt 1976-04-18 0.44219965 Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML like Gecko) Chrome/xxx LBY LBY-AR NULL

HiBench hivebench run-join.sh variables not set

run-join.sh, lines 32-33:
echo "set mapred.map.tasks=$NUM_OF_MAP;">>$DIR/hive-benchmark/rankings_uservisits_join.hive
echo "set mapred.reduce.tasks=$NUM_OF_RED;">>$DIR/hive-benchmark/rankings_uservisits_join.hive
The variables $NUM_OF_MAP and $NUM_OF_RED are not set, so the reduce number will be set to 1 and override the corresponding configuration in mapred-site.xml in Hadoop.

README.md in HiBench 2.2.1 is not updated

I am trying to use HiBench to benchmark CDH4.6 in YARN mode. Does HiBench 2.2.1 support the YARN model?
I downloaded HiBench 2.2.1 and discovered the README.md is not updated. I found it difficult to run it on YARN. Would you update the README.md file, please?

HiBench on Amazon EMR and S3

Hi there,

I am hoping to use HiBench for a school project to test the IO throughput of using Amazon S3 as the datasource for Amazon EMR, vs. using HDFS on the EMR instances as the datasource. However, I am running into trouble pointing DFSIO Enhanced to read and write from and to S3.

Here is what I've done:

on Hibench/bin/hibench-config.sh,
export DATA_HDFS=s3://<mybucket>

However, when I try running bash run-write.sh under the dfsio/bin directory to start the DFSIO-e write test (I did not run prepare.sh beforehand because to my understanding you do not need to run it prior to the write test), I get the following error:

15/04/02 05:59:37 INFO dfsioe.TestDFSIOEnh: creating control file: 100 mega bytes, 20 files java.lang.IllegalArgumentException: Wrong FS: s3://cmpt886-testbucket/benchmarks/TestDFSIO-Enh/io_control, expected: hdfs://172.31.41.39:9000 at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:647) at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:191) at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:102) at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:595) at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:591) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:591) at org.apache.hadoop.fs.dfsioe.TestDFSIOEnh.createControlFile(TestDFSIOEnh.java:636) at org.apache.hadoop.fs.dfsioe.TestDFSIOEnh.run(TestDFSIOEnh.java:598) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) at org.apache.hadoop.fs.dfsioe.TestDFSIOEnh.main(TestDFSIOEnh.java:624) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.util.RunJar.main(RunJar.java:212) 15/04/02 05:59:43 INFO fs.EmrFileSystem: Consistency enabled, using com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2 as filesystem implementation

I also tried setting the value in <property><name>fs.default.name</name><value>hdfs://172.31.41.39:9000</value></property> from hdfs://.. to s3://<bucketname> but then I get the following error:

15/04/02 05:55:48 INFO mapreduce.Cluster: Failed to use org.apache.hadoop.mapred.YarnClientProtocolProvider due to error: Error in instantiating YarnClient java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75) at org.apache.hadoop.mapred.JobClient.init(JobClient.java:470) at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:449) at org.apache.hadoop.fs.dfsioe.TestDFSIOEnh.run(TestDFSIOEnh.java:578) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) at org.apache.hadoop.fs.dfsioe.TestDFSIOEnh.main(TestDFSIOEnh.java:624) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

Is it possible to point DFSIOe to S3? And if so, can you tell me what configuration settings to change?

Honto

Paths in bin/functions/load-config.py are hard-coded

Hi,
when I run workload/pagerank/prepare/prepare.sh, I get this error:
This filename pattern "/usr/hdp/2.2.4.2-2/hadoop//share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar" is required to match only one file.
However, there's no file found, please fix it.

The path name is concatenated in bin/functions/load-config.py, line 269:
HibenchConf["hibench.hadoop.examples.jar"] = OneAndOnlyOneFile(HibenchConf['hibench.hadoop.home'] + "/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar")

It would be nice if the second part of the path could be set via variables, too.

NutchIndexing run.sh fails with "WritableName can't load class: org.apache.nutch.parse.ParseText"

Hello,

When I execute NutchIndexing, prepare.sh runs successfully, and generates input folders: crawldb, indexes, linkdb, segments. However, when I execute run.sh, all tasks fail with the following error message:

Error: java.lang.RuntimeException: java.io.IOException: WritableName can't load class: org.apache.nutch.parse.ParseText
    at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2030)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1960)
    at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1810)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1759)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1773)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:49)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:167)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:408)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.io.IOException: WritableName can't load class: org.apache.nutch.parse.ParseText
    at org.apache.hadoop.io.WritableName.getClass(WritableName.java:77)
    at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2028)
    ... 14 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.nutch.parse.ParseText not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1626)
    at org.apache.hadoop.io.WritableName.getClass(WritableName.java:75)
    ... 15 more

Release version: 3.0
Hadoop version: 2.2.0

Thanks!

hivebench

How do I run hivebench to view results according to the SIGMOD paper? I also ran hive --service metastore to establish concurrency as stated. The command /HiBench/hivebench/conf$ configure.sh returns:
configure.sh: command not found
I am benchmarking hivebench for my Master's project. Any input as to what I am doing wrong is appreciated.

shell scripts fail to check exit code after command execution

Many of the HiBench shell scripts do not check the exit code after the hadoop command is executed. This means that if the hadoop command fails, the shell script can still exit with a code of 0. This is problematic if the HiBench shell scripts are called by a different program: the caller does not have an easy way to tell whether the script completed successfully or not.

For example, consider /terasort/bin/run.sh. Here's the hadoop command.

$HADOOP_EXECUTABLE jar $HADOOP_EXAMPLES_JAR terasort -D mapred.reduce.tasks=$NUM_REDS $INPUT_HDFS $OUTPUT_HDFS

If, however, the exit code is checked, then the shell script can exit with a non-zero exit code when the command fails. The following is an example.

$HADOOP_EXECUTABLE jar $HADOOP_EXAMPLES_JAR terasort -D mapred.reduce.tasks=$NUM_REDS $INPUT_HDFS $OUTPUT_HDFS

if [ $? -ne 0 ]
then
exit 1
fi

I propose that all the shell scripts be modified to check the exit code after a hadoop command is executed. If the command's exit code is not zero, then the shell script should exit with a non-zero exit code.
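
A minimal sketch of how such a check could be factored into a shared helper (the run_or_fail name and wrapper are hypothetical, not something HiBench currently provides):

# Hypothetical helper: run a command and exit with its code if it fails.
run_or_fail() {
  "$@"
  local rc=$?
  if [ $rc -ne 0 ]; then
    echo "ERROR: command failed with exit code $rc: $*" >&2
    exit $rc
  fi
}

run_or_fail $HADOOP_EXECUTABLE jar $HADOOP_EXAMPLES_JAR terasort -D mapred.reduce.tasks=$NUM_REDS $INPUT_HDFS $OUTPUT_HDFS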

Number of mappers is always 10 per host which is default for wordcount

Let me introduce my scenario first..

I want to run wordcount on 350 GB with 1400 mappers.
Hence I configured NUM_MAPS=1400 and DataSize=350 GB in bytes, with a 256 MB block size.

But the prepare job runs with only 70 maps, as I have 7 nodes in the cluster.

This is because the randomtextwriter job, by default, uses 10 maps per host:
int numMapsPerHost = conf.getInt("mapreduce.randomtextwriter.mapsperhost", 10);

Currently I worked around it as follows (268435456 bytes = 256 MB per map, and 1400 maps / 7 hosts = 200 maps per host):
$HADOOP_EXECUTABLE jar $HADOOP_EXAMPLES_JAR randomtextwriter \
$COMPRESS_OPT \
-D mapreduce.randomtextwriter.bytespermap=268435456 -D mapreduce.randomtextwriter.mapsperhost=200 \
$INPUT_HDFS

Can we fix this in HiBench itself?

error ./prepare.sh: line 25: INPUT_HDFS: unbound variable

Hi
I am trying to use HiBench 4.0 for some performance benchmarking of HDP Hadoop 2.6.0.2.2.4.2-2
I have set the following properties (I don't have Spark, I just want to test MapReduce):
hibench.hadoop.home /usr/lib/hadoop
hibench.spark.home /PATH/TO/YOUR/SPARK/ROOT
hibench.hdfs.master hdfs://10.0.2.15:50070

But I get the following error. Any idea why? What configuration am I missing? The README in HiBench 4.0 doesn't mention any other required configuration.

[root@sandbox prepare]# ./prepare.sh
Parsing conf: /root/HiBench/conf/00-default-properties.conf
Parsing conf: /root/HiBench/conf/10-data-scale-profile.conf
Parsing conf: /root/HiBench/conf/99-user_defined_properties.conf
Parsing conf: /root/HiBench/workloads/sort/conf/00-sort-default.conf
Parsing conf: /root/HiBench/workloads/sort/conf/10-sort-userdefine.conf
Traceback (most recent call last):
File "/root/HiBench/bin/functions/load-config.py", line 440, in
load_config(conf_root, workload_root, workload_folder)
File "/root/HiBench/bin/functions/load-config.py", line 154, in load_config
generate_optional_value()
File "/root/HiBench/bin/functions/load-config.py", line 209, in generate_optional_value
if hadoop_version[0] != '1': # hadoop2? or CDH's MR1?
IndexError: string index out of range
/root/HiBench/bin/functions/workload-functions.sh: line 33: .: filename argument required
.: usage: . filename [arguments]
start HadoopPrepareSort bench
./prepare.sh: line 25: INPUT_HDFS: unbound variable

Data lost during Input data preparation

If the Java heap size for the mapper and reducer is too small, some data may be lost during data preparation.
E.g.: with the mapper heap size set to 220 MB and the reducer heap size set to 320 MB, the generated input data is only about one third of the normal size for the following scale:
USERVISITS=800000000
PAGES=20000000

how to rebuild nutch-1.2.jar??

Hi ,
I want to modify the IndexingMapReduce.java file from nutch-indexing, but I'm not able to recompile it back into the nutch-1.2.jar file. When I ran the provided build.xml file, it complained that it can't find the package org.apache.hadoop.*. But I already configured my HADOOP_HOME in hibench-config.sh. Is there anything else I need to provide in my build?
Any help would be appreciated.

Nhung

The datatools.jar is not compatible with JDK 6

The datatools.jar is built with JDK 7, which cannot be run in JDK 6. A solution is to let users build the datatools.jar themselves by running "ant" in HiBench-master/common/autogen.

Mkdirs failed to create /HiBench/KMeans/Input-comp/samples-seeds

I have set up a Hadoop 5-node cluster using EC2 instances. When I run prepare.sh in /intel-hadoop-HiBench-4aa2ffa/kmeans/bin, the following errors show. Could someone help me understand why it failed to create /HiBench/KMeans/Input-comp/samples-seeds? Thanks.

[ec2-user@ip-10-0-1-131 bin]$ ./prepare.sh
========== preparing kmeans data ==========
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

rmr: DEPRECATED: Please use 'rm -r' instead.
14/01/03 05:02:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
rmr: `/HiBench/KMeans/Input-comp': No such file or directory
Generating Mahout KMeans Input Dataset
2014-01-03 05:02:59,654 INFO kmeans.GenKMeansDataset (GenKMeansDataset.java:run(533)) - KMeans Clustering Input Dataset : Synthetic
2014-01-03 05:03:13,461 INFO kmeans.GenKMeansDataset (GenKMeansDataset.java:run(548)) - Successfully generated Sample Generator Seeds
2014-01-03 05:03:13,465 INFO kmeans.GenKMeansDataset (GenKMeansDataset.java:run(549)) - samples seeds are:
2014-01-03 05:03:13,466 INFO kmeans.GenKMeansDataset (GenKMeansDataset.java:run(561)) - mean: (892.7784851910154,668.1388900417477,12.37156387273286,948.2218936234988,848.0531762771017) std: (21.53698108434338,-83.7640042027302,51.71172927473776,-57.58010311095332,86.77593136548654)
2014-01-03 05:03:13,466 INFO kmeans.GenKMeansDataset (GenKMeansDataset.java:run(561)) - mean: (669.8506783150071,724.2558217497868,311.4969971723691,65.90094885376918,969.4373030238024) std: (-85.60713480071416,-29.45824491614144,-46.377340121271395,32.597608559791894,35.05404780813254)
2014-01-03 05:03:13,467 INFO kmeans.GenKMeansDataset (GenKMeansDataset.java:run(561)) - mean: (541.0456390904941,24.720629481739586,265.5405521391898,258.2318954521949,769.7897698977738) std: (74.03358121686949,-77.7441591963003,89.44221601480075,84.50375569884392,-17.255177753375108)
2014-01-03 05:03:13,467 INFO kmeans.GenKMeansDataset (GenKMeansDataset.java:run(561)) - mean: (324.95794488552923,968.4790509173768,610.001973472676,552.982355217928,407.83145278236987) std: (-49.56892373451986,34.59048486779125,55.73981450339204,-86.84406597819148,91.95550842982595)
2014-01-03 05:03:13,467 INFO kmeans.GenKMeansDataset (GenKMeansDataset.java:run(561)) - mean: (893.9474562466451,235.56754623307097,69.62132354623508,759.6767497398067,784.0298583978015) std: (36.07391483435313,-0.22671060641496865,51.07963977868485,8.086985681433973,83.5688609276213)
2014-01-03 05:03:13,468 INFO kmeans.GenKMeansDataset (GenKMeansDataset.java:run(561)) - mean: (275.39192371158396,821.0238890004695,315.40209652639606,323.3966916446639,423.77630078524044) std: (0.8027806191321929,-67.66455949637484,25.185752594614215,47.57919003073934,48.33609529044088)
2014-01-03 05:03:13,468 INFO kmeans.GenKMeansDataset (GenKMeansDataset.java:run(561)) - mean: (230.320922733655,707.1101364142328,412.6527687666621,498.5858038853449,707.0196528757224) std: (86.50377128662385,75.53783765391611,-37.23788639173273,7.994959388079764,13.387875739680794)
2014-01-03 05:03:13,468 INFO kmeans.GenKMeansDataset (GenKMeansDataset.java:run(561)) - mean: (96.43690487109747,810.5655581247838,202.0019767724055,114.56999817219904,751.2805146674057) std: (-20.354396723495242,-21.375714239878945,-10.145690002325907,21.59747104097754,66.54075151227013)
2014-01-03 05:03:13,469 INFO kmeans.GenKMeansDataset (GenKMeansDataset.java:run(561)) - mean: (989.1506528216164,479.3619309604332,52.78522943823172,195.9863637719068,626.7163340702299) std: (98.35236908314263,46.11528185013191,20.443426585859626,-56.07126643187359,-27.35117053158757)
2014-01-03 05:03:13,469 INFO kmeans.GenKMeansDataset (GenKMeansDataset.java:run(561)) - mean: (871.335195347875,933.0810540394993,723.8022526648174,99.58823315381548,104.78267156298404) std: (-83.10245357203998,-99.84546140675654,27.81218603556492,9.200946461175732,47.15305511901062)
2014-01-03 05:03:13,634 WARN conf.Configuration (Configuration.java:warnOnceIfDeprecated(981)) - mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
2014-01-03 05:03:13,634 WARN conf.Configuration (Configuration.java:warnOnceIfDeprecated(981)) - mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
2014-01-03 05:03:13,634 WARN conf.Configuration (Configuration.java:warnOnceIfDeprecated(981)) - mapred.output.compression.codec is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.codec
2014-01-03 05:03:13,637 INFO kmeans.GenKMeansDataset (GenKMeansDataset.java:run(580)) - mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/ec2-user/hadoop-2.0.0-cdh4.5.0/share/hadoop/common/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/ec2-user/mahout-0.7-cdh4.5.0/mahout-examples-0.7-cdh4.5.0-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/ec2-user/mahout-0.7-cdh4.5.0/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/ec2-user/mahout-0.7-cdh4.5.0/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2014-01-03 05:03:13,838 WARN util.NativeCodeLoader (NativeCodeLoader.java:(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-01-03 05:03:13,992 INFO kmeans.GenKMeansDataset (GenKMeansDataset.java:run(585)) - Generate K-Means input Dataset : samples = 300000sample dimension5 cluster num =10
2014-01-03 05:03:13,992 INFO kmeans.GenKMeansDataset (GenKMeansDataset.java:run(589)) - Start producing samples...
Exception in thread "main" java.io.IOException: Mkdirs failed to create /HiBench/KMeans/Input-comp/samples-seeds
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:867)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:766)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:755)
at org.apache.hadoop.io.SequenceFile$Writer.(SequenceFile.java:1092)
at org.apache.mahout.clustering.kmeans.GenKMeansDataset$SampleProducer.createNewFile(GenKMeansDataset.java:173)
at org.apache.mahout.clustering.kmeans.GenKMeansDataset$GaussianSampleGenerator.writeSeeds(GenKMeansDataset.java:306)
at org.apache.mahout.clustering.kmeans.GenKMeansDataset$GaussianSampleGenerator.produceSamples(GenKMeansDataset.java:322)
at org.apache.mahout.clustering.kmeans.GenKMeansDataset.run(GenKMeansDataset.java:590)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.mahout.clustering.kmeans.GenKMeansDataset.main(GenKMeansDataset.java:601)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

Mismatch in filenames for word list file

The current README states that "the dict used for text generation is also from the default linux file /usr/share/dict/linux.words". This is in fact not the case: the file RawData.java has '/usr/share/dict/words' hardcoded as the path.

This should be fixed in the README documentation. But even better, it would be nice if this path were configurable, because there are several variants of this file in different Linux distros.

Encountered problems with Hibench and question about concurrency

Hello,

I've been using Hadoop and HiBench for 2.5 months and I have experienced some problems while working with them. Now it looks like everything is OK and all the benchmarks run, BUT I still have some problems with nutchindexing and hivebench when I run them more than once in parallel (concurrently). That's why I need your valuable help!

This is my bin/hibench-config.sh:
export JAVA_HOME=/home/hduser/jdk1.7.0_51
export HADOOP_HOME=/home/hduser/hadoop2.5.1
export HADOOP_EXECUTABLE=/home/hduser/hadoop2.5.1/bin/hadoop
export HADOOP_CONF_DIR=/home/hduser/hadoop2.5.1/etc/hadoop
export HADOOP_EXAMPLES_JAR=/home/hduser/hadoop2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar
export MAPRED_EXECUTABLE=/home/hduser/hadoop2.5.1/bin/mapred

Set the variable below only in YARN mode

export HADOOP_JOBCLIENT_TESTS_JAR=/home/hduser/hadoop2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.5.1-tests.jar

export HADOOP_MAPRED_HOME=/home/hduser/hadoop2.5.1
export HADOOP_VERSION=hadoop2 # set it to hadoop1 to enable MR1, hadoop2 to enable MR2

Here are the possible bugs I noticed and (almost) fixed in order to run all the benchmarks:

  1. At first, the dfsioe benchmark did not work because in the dfsioe/conf/configure.sh file these lines didn't work properly and didn't assign any value to the variables:
    MAP_JAVA_OPTS=`cat $HADOOP_CONF_DIR/mapred-site.xml | grep "mapreduce.map.java.opts" | awk -F\< '{print $5}' | awk -F\> '{print $NF}'`
    RED_JAVA_OPTS=`cat $HADOOP_CONF_DIR/mapred-site.xml | grep "mapreduce.reduce.java.opts" | awk -F\< '{print $5}' | awk -F\> '{print $NF}'`

I don't know if I did something wrong, but when I fixed that, dfsioe worked! So, how else should this be fixed?
(Something that was not mentioned here: https://github.com/intel-hadoop/HiBench, and that was not required just to run Hadoop, is to set memory limits in mapred-site.xml and yarn-site.xml.)

  2. bayes and kmeans use different Mahout versions, so in order to run both of them I keep 2 different HiBench folders: I run bayes from one folder and the rest of the benchmarks from the other. I feel that this is a little bit clumsy, so should I edit bin/hibench-config.sh or handle it some other way?

About the concurrent run:

  1. I have noticed that nutchindexing uses a temp file which is erased at the end of each run, so I think this is one reason nutchindexing can't run more than once concurrently. Also, sometimes the benchmark "breaks" and can't run properly again, so I delete everything in common/hibench/nutchindexing/* and run mvn process-sources again. Is there any solution for this, please?
    FATAL indexer.Indexer: Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.indexer.Indexer.index(Indexer.java:76)
    at org.apache.nutch.indexer.Indexer.run(Indexer.java:97)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.Indexer.main(Indexer.java:106)

  2. About hivebench: I ran the suggested command "hive --service metastore", but it doesn't give any different results than without running it. How can hivebench run concurrently, as you mention in your paper?
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:444)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:626)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:570)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
    Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.Session
    HiveMetaStoreClient
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1453)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java
    :63)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.ja
    va:73)
    at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2664)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2683)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:425)
    ... 7 more
    Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.
    java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1451)
    ... 12 more
    Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database
    . JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection

pool (set lazyInit to true if you expect to start your database after your app). Original Exception: --

java.sql.SQLException: Failed to start database 'metastore_db' with class loader sun.misc.Launcher$AppC
lassLoader@5947e54e, see the next exception for details.
at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown Source)

etc.

I am looking forward to your answer!

Thanks.

Hivebench: how to generate data without compression

I generate data for the hivebench:

cd ~/HiBench-master/hivebench
bin/prepare.sh

I would like to have the data without compression.
The generated data are compressed,
whether COMPRESS_GLOBAL in bin/hibench-config.sh is set to 0 or 1.

hdfs dfs -text /user/hanalite/hibench/Hive/Input/rankings/part-00000 | head
15/01/29 10:56:31 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
15/01/29 10:56:31 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
15/01/29 10:56:31 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
15/01/29 10:56:31 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
15/01/29 10:56:31 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
0 nbizrgdziebsaecsecujfjcqtvnpcnxxwiopmddorcxnlijdizgoi,65738,16
48 czoqdnkqnonnkmjlzfsntboumupseoahz,9451,29
96 jxgpkxbhomwgxrjpqdn,6678,20

Any tip for me?

Guidance on dfsioe settings--need to size for larger than RAM available on nodes?

When running TestDFSIO in the past, I recall guidance on sizing the data to make sure it wasn't simply being served out of OS cache by making the data size larger than available RAM (to prevent inflated read stats). Is this guidance still at play with dfsioe? I didn't see it mentioned anywhere unless I missed it.

Also, is there any general guidance on configuring tests to scale with cluster size? (I didn't see much, but again perhaps I missed it.)

Thanks!

org.apache.hadoop.hdfs.HftpFileSystem could not be instantiated

The Kmeans workload always fails in the prepare phase with the following errors:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hibench-2.2.1/common/mahout-distribution-0.7-cdh4/examples/target/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hibench-2.2.1/common/mahout-distribution-0.7-cdh4/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hibench-2.2.1/common/mahout-distribution-0.7-cdh4/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Exception in thread "main" java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.HftpFileSystem could not be instantiated: java.lang.IllegalAccessError: tried to access method org.apache.hadoop.fs.DelegationTokenRenewer.(Ljava/lang/Class;)V from class org.apache.hadoop.hdfs.HftpFileSystem
at java.util.ServiceLoader.fail(ServiceLoader.java:224)
at java.util.ServiceLoader.access$100(ServiceLoader.java:181)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:377)
at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2275)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2286)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:87)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2342)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2324)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:351)
at org.apache.hadoop.fs.FileSystem.getLocal(FileSystem.java:322)
at org.apache.hadoop.util.GenericOptionsParser.validateFiles(GenericOptionsParser.java:383)
at org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:281)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:422)
at org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:168)
at org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:151)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
at org.apache.mahout.clustering.kmeans.GenKMeansDataset.main(GenKMeansDataset.java:604)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.IllegalAccessError: tried to access method org.apache.hadoop.fs.DelegationTokenRenewer.(Ljava/lang/Class;)V from class org.apache.hadoop.hdfs.HftpFileSystem
at org.apache.hadoop.hdfs.HftpFileSystem.(HftpFileSystem.java:84)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at java.lang.Class.newInstance(Class.java:374)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
... 21 more

exec "$HADOOP_EXECUTABLE" --config $HADOOP_CONF_DIR jar ${DATATOOLS} org.apache.mahout.clustering.kmeans.GenKMeansDataset -libjars $MAHOUT_HOME/examples/target/mahout-examples-0.7-job.jar ${COMPRESS_OPT} ${OPTION}

CDH 4.3 installed with Cloudera Manager 4.6
HiBench 2.2.1
RHEL6.3 64-bit

Thanks.
Vincent

dfsioe yarn error

When running dfsioe/run-write.sh, it failed with the following error:


13/09/09 15:16:03 INFO mapreduce.Job: Task Id : attempt_1378468570219_0008_m_000000_2, Status : FAILED
Error: java.io.IOException: Mkdirs failed to create /benchmarks/TestDFSIO-Enh/io_data
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:434)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:420)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:840)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:821)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:783)
        at org.apache.hadoop.fs.dfsioe.TestDFSIOEnh$WriteMapperEnh.doIO(TestDFSIOEnh.java:209)
        at org.apache.hadoop.fs.dfsioe.IOMapperBase.map(IOMapperBase.java:123)
        at org.apache.hadoop.fs.dfsioe.IOMapperBase.map(IOMapperBase.java:41)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:399)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1375)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)

But /benchmarks/TestDFSIO-Enh/io_input had been created successfully.

NutchIndexing

Hi,
I am seeing only 1 reduce task created on my 3-node cluster. I have 576 map tasks which finish in under 10 minutes, and only 1 reduce task, which, after failing twice (after 3 hours, reaching 100% and then erroring with "failed to report status for 600 seconds"), finally finishes on the 3rd try. Is it expected to have only 1 reduce task?

Kmeans problem--Class not found

Running on hadoop, using /export1/srini/dsperftest/HA/install/hadoop/journalnode/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /export1/dsperf/HiBench-master/common/mahout-distribution-0.7-hadoop1/examples/target/mahout-examples-0.7-job.jar
14/10/30 14:59:33 WARN driver.MahoutDriver: Unable to add class: org.apache.mahout.clustering.kmeans.GenKMeansDataset
java.lang.ClassNotFoundException: org.apache.mahout.clustering.kmeans.GenKMeansDataset
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at org.apache.mahout.driver.MahoutDriver.addClass(MahoutDriver.java:236)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:119)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

HiBench Installation Guide

I downloaded HiBench to my PC. I can't find any executable file to install HiBench. In addition, my Hadoop cluster is on AWS. Do I have to install HiBench on AWS? Where can I find the HiBench installation guide / instructions?

Thanks.

David

The configuration of DATA_HDFS causes a problem in Hammer

HiBench allows configuration of the HDFS path of HiBench's data in bin/hibench-config.sh:

hdfs data path

DATA_HDFS=

However, if it is configured other than the default configuration (/HiBench), Hammer will have problem.

In hammer/conf/configure.sh, HAMMER_HDFS_BASE=${DATA_HDFS}/hammer.
While in hammer/bin/dbgen.sh, HAMMER_HDFS_BASE=/HiBench/hammer is hard-coded. The user configuration is not honored in dbgen.sh, so an inconsistency occurs.
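
A minimal sketch of the fix, assuming dbgen.sh should simply mirror configure.sh:

# In hammer/bin/dbgen.sh, derive the base path from the user configuration
# instead of hard-coding /HiBench/hammer:
HAMMER_HDFS_BASE=${DATA_HDFS}/hammer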

Before this issue is fixed, not setting DATA_HDFS is a temporary workaround.

dfsioe YARN error while generating Input Data

I am getting the following error message while generating input data for the DFSIOE benchmark.

HiBench : 2.2 , yarn branch
JAVA : jdk1.7.0_45
Hadoop : 2.3.0
myHadoop : 2.1.0

15/03/26 16:50:40 INFO dfsioe.TestDFSIOEnh: maximum concurrent maps = 2
15/03/26 16:50:40 INFO dfsioe.TestDFSIOEnh: creating control file: 200 mega bytes, 256 files
java.io.IOException: Mkdirs failed to create /benchmarks/TestDFSIO-Enh/io_control
at org.apache.hadoop.fs.dfsioe.TestDFSIOEnh.createControlFile(TestDFSIOEnh.java:648)
at org.apache.hadoop.fs.dfsioe.TestDFSIOEnh.run(TestDFSIOEnh.java:598)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.dfsioe.TestDFSIOEnh.main(TestDFSIOEnh.java:624)

kmeans & target=spark/python: storage.MemoryStore: Not enough space

Running kmeans with the spark/python target, I see many instances of lost executors, like this:

15/05/28 09:46:09 WARN TaskSetManager: Lost task 16.0 in stage 15.0 (TID 479, anders-41): ExecutorLostFailure (executor 11 lost)
15/05/28 09:46:09 WARN TaskSetManager: Lost task 22.0 in stage 15.0 (TID 482, anders-41): ExecutorLostFailure (executor 11 lost)
15/05/28 09:46:09 WARN TaskSetManager: Lost task 24.0 in stage 15.0 (TID 484, anders-41): ExecutorLostFailure (executor 11 lost)
...
15/05/28 09:46:26 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@anders-41:47915/user/Executor#707073176] with ID 15
15/05/28 09:46:26 INFO BlockManagerMasterActor: Registering block manager anders-41:41984 with 2.1 GB RAM, BlockManagerId(15, anders-41, 41984)

I caught one stderr file (/var/log/hadoop-yarn/container/application_1432738106848_0062/container_1432738106848_0062_01_000005/stderr) with this:
15/05/28 09:29:29 INFO storage.BlockManager: Found block rdd_5_17 locally
15/05/28 09:29:30 WARN storage.MemoryStore: Not enough space to cache rdd_15_17 in memory! (computed 14.6 MB so far)
15/05/28 09:29:30 INFO storage.MemoryStore: Memory use = 1334.2 MB (blocks) + 778.6 MB (scratch space shared across 7 thread(s)) = 2.1 GB. Storage limit = 2.1 GB.
15/05/28 09:29:31 INFO executor.Executor: Finished task 17.0 in stage 6.0 (TID 190). 2474 bytes result sent to driver

I changed both executor.memory and spark.driver.memory to 4G in hadoop-hdfs/HiBench/HiBench-master/conf/99-user_defined_properties.conf.
spark-submit is run with "--executor-memory 4G --driver-memory 4G"

In hadoop-hdfs/HiBench/HiBench-master/report/kmeans/spark/python/conf/../bench.log, however, I still see 2.1 GB being used:
15/05/28 11:10:24 INFO MemoryStore: MemoryStore started with capacity 2.1 GB

(I presume the 2.1 is because 2 GiB ~= 2.1 GB (2.147483648) )

org.apache.mahout.clustering.kmeans.GenKMeansDataset not running.

When running the prepare script under the kmeans/conf directory, GenKMeansDataset results in the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/uncommons/maths/random/MersenneTwisterRNG
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:201)
Caused by: java.lang.ClassNotFoundException: org.uncommons.maths.random.MersenneTwisterRNG
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 3 more

Building the datatools.jar file from build.xml doesn't include this class. Any help would be appreciated.

Thanks.
Vinayak.

Can't unzip HiBench-Master.zip and HiBench-etl-recomm.zip after download.

Can't unzip HiBench-Master.zip and HiBench-etl-recomm.zip after download.

Have tried with Windows zip utility and 7-zip.

Error message:

"Windows cannot open the folder
The Compressed (zipped) Folder ... is invalid "

7-Zip (version 9.20) has had mixed success (it worked only on one Windows 7 system).

ERROR: number of words should be greater than 0

I use HiBench as my benchmark tool. When I run the bayes and nutchindexing tests, this error appears at the beginning:

14/04/14 13:05:15 INFO HiBench.NutchData: Initializing Nutch data generator...
curIndex: 2012, total: 2013
ERROR: number of words should be greater than 0

The HiBench version is 2.2, Hadoop is 1.2.1, the JDK is 1.6.0.26, and Ubuntu is 12.04.

By the way, I checked my Ubuntu system, and it has /usr/share/dict/words, but /usr/share/dict/linux.words is not found. I do not know if it matters.

`TEMP_HDFS`: unbound variable

The location is pagerank/bin/run.sh, line 37.
I think the definition of $TEMP_HDFS should be added to the conf files.
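
For illustration only, such a definition might derive the temp location from the existing DATA_HDFS setting (the exact path is an assumption, not taken from HiBench):

# Illustrative only: a possible default for the pagerank temp directory.
TEMP_HDFS=${DATA_HDFS}/Pagerank/temp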

HiveBench Data Loaded into HDFS

Hello,
I have some questions regarding hivebench that I need clarity on. Your help is greatly appreciated.

  1. What is the specific data that is loaded into HDFS in hivebench after you run ./prepare.sh? Is it the 600,000 HTML files stated in the SIGMOD 09 paper? What is the size (how many gigabytes or kilobytes per HTML file, or the average size of these 600,000 HTML files)?

  2. When you run run-aggregation.sh, does the "Time taken" depict the aggregated throughput? If not, please help me understand. Here is the output:
rm: Cannot remove directory "hdfs://master:8020/user/hive/warehouse/uservisits_aggre", use -rmr instead
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in jar:file:/home/ubuntu/hive/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/ubuntu/hive_job_log_ubuntu_201502200212_1471946535.txt
OK
Time taken: 4.318 seconds
OK
Time taken: 0.519 seconds
OK
Time taken: 0.023 seconds
OK
Time taken: 0.731 seconds
OK
Time taken: 0.033 seconds
  3. Can you explain the number after the timestamp, that is (103, 135, ...)? What is its significance?
2015-02-20 02:12:45,103 Stage-1 map = 0%, reduce = 0%
2015-02-20 02:12:48,135 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.75 sec
2015-02-20 02:12:49,140 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.75 sec
2015-02-20 02:12:50,149 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.75 sec
Appreciate your help.
Thank you

NutchIndexing - File not found exception

I am trying to run the nutchindexing benchmark but I see the following errors when I run the prepare.sh script:

14/03/03 12:42:04 INFO mapred.JobClient: Task Id : attempt_201402281004_0023_m_000028_0, Status : FAILED
java.lang.NullPointerException
at HiBench.NutchData$CreateNutchPages.map(NutchData.java:349)
at HiBench.NutchData$CreateNutchPages.map(NutchData.java:296)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)

attempt_201402281004_0023_m_000028_0: java.io.FileNotFoundException: File does not exist: urls-0/data
attempt_201402281004_0023_m_000028_0: at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:824)

Seems like it is looking for the urls-0/data file, but cannot find it.

I have Hadoop HDFS and MapReduce running on my cluster. I installed Intel HiBench and I was able to successfully run the sort, pagerank and hivebench benchmarks. But I am running into issues when doing the nutchindexing benchmark as above.

Can someone help?

Thanks,
Madhura

Hadoop 1.2.1, HiBench 3.0.0 and Mahout 0.7 compatible?

Hi there,

I would like to check whether HiBench 3.0.0 is compatible with Hadoop 1.2.1. I noticed the HiBench documentation mentions that HiBench is tested against Hadoop 1.0.4 and 2.2.0. What about Hadoop 1.2.1?

I have issues when running HiBench 3.0.0 against Hadoop 1.2.1 and am wondering if this might be the cause.

Thanks a lot in advance!
Gina

HiBench Hive

We have been able to make all of the tests in HiBench work with Hadoop 2.2.0 on AWS, using their AMI version 3.0.3, except for the hivebench/run-join.sh.

When that test is run, it errors out with the Java error listed below. The test runs and completes, but when the output is appended to the hibench.report file, it shows 0 for "Input_data_size", "Throughput(bytes/s)" and "Throughput/node".

Ignore unrecognized file: uservisits
Exception in thread "main" java.io.IOException: Unable to initialize History Viewer

Entire Output:
========== running hive-join bench ==========
JAVA_HOME=/usr/java/latest/
HADOOP_HOME=/home/hadoop/
HADOOP_EXECUTABLE=/home/hadoop/bin/hadoop
HADOOP_CONF_DIR=/home/hadoop/conf/
HADOOP_EXAMPLES_JAR=/home/hadoop/hadoop-examples.jar
MAPRED_EXECUTABLE=/home/hadoop/bin/mapred
14/07/07 20:57:57 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
rm: `/user/hive/warehouse/rankings_uservisits_join': No such file or directory
DEPRECATED: Use of this script to execute mapred command is deprecated.
Instead use the mapred command for it.

14/07/07 20:57:59 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
14/07/07 20:58:00 INFO client.RMProxy: Connecting to ResourceManager at /172.31.13.157:9022
Ignore unrecognized file: uservisits
Exception in thread "main" java.io.IOException: Unable to initialize History Viewer
at org.apache.hadoop.mapreduce.jobhistory.HistoryViewer.<init>(HistoryViewer.java:89)
at org.apache.hadoop.mapreduce.tools.CLI.viewHistory(CLI.java:463)
at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:306)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1237)
Caused by: java.io.IOException: Unable to initialize History Viewer
at org.apache.hadoop.mapreduce.jobhistory.HistoryViewer.<init>(HistoryViewer.java:83)
... 5 more
DEPRECATED: Use of this script to execute mapred command is deprecated.
Instead use the mapred command for it.

14/07/07 20:58:01 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
14/07/07 20:58:02 INFO client.RMProxy: Connecting to ResourceManager at /172.31.13.157:9022
Ignore unrecognized file: rankings
Exception in thread "main" java.io.IOException: Unable to initialize History Viewer
at org.apache.hadoop.mapreduce.jobhistory.HistoryViewer.<init>(HistoryViewer.java:89)
at org.apache.hadoop.mapreduce.tools.CLI.viewHistory(CLI.java:463)
at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:306)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1237)
Caused by: java.io.IOException: Unable to initialize History Viewer
at org.apache.hadoop.mapreduce.jobhistory.HistoryViewer.<init>(HistoryViewer.java:83)
... 5 more

Launching nutchindexing on CDH5

Hi.

I am using CDH5 5.0.2, which is the latest version.

I have downloaded the latest HiBench source.

All benchmark suites (wordcount, terasort, kmeans, hivebench, etc.) operate pretty well except nutchindexing.

When I ran nutchindexing/bin/prepare.sh, the preparation job appeared to finish without errors.

After that, when I run nutchindexing/bin/run.sh, I get the error logs below:


========== running nutchindex data ==========
JAVA_HOME=/usr/lib/jvm/java-7-oracle-cloudera
HADOOP_HOME=/opt/cloudera/parcels/CDH
HADOOP_EXECUTABLE=/opt/cloudera/parcels/CDH/bin/hadoop
HADOOP_CONF_DIR=/etc/alternatives/hadoop-conf
HADOOP_EXAMPLES_JAR=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
MAPRED_EXECUTABLE=/opt/cloudera/parcels/CDH/bin/mapred
nutchindexing/bin/run.sh: line 26: check-compress: command not found
Deleted /HiBench/Nutch/Input/indexes
/root/HiBench/bin/../nutchindexing/nutch-1.2-cdh5
JAVA_HOME=/usr/lib/jvm/java-7-oracle-cloudera
HADOOP_HOME=/opt/cloudera/parcels/CDH
HADOOP_EXECUTABLE=/opt/cloudera/parcels/CDH/bin/hadoop
HADOOP_CONF_DIR=/etc/alternatives/hadoop-conf
HADOOP_EXAMPLES_JAR=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
MAPRED_EXECUTABLE=/opt/cloudera/parcels/CDH/bin/mapred
14/07/01 11:15:22 INFO indexer.Indexer: Indexer: starting at 2014-07-01 11:15:22
14/07/01 11:15:22 INFO indexer.IndexerMapReduce: IndexerMapReduce: crawldb: /HiBench/Nutch/Input/crawldb
14/07/01 11:15:22 INFO indexer.IndexerMapReduce: IndexerMapReduce: linkdb: /HiBench/Nutch/Input/linkdb
14/07/01 11:15:22 INFO indexer.IndexerMapReduce: IndexerMapReduces: adding segment: /HiBench/Nutch/Input/segments/*
14/07/01 11:15:23 INFO client.RMProxy: Connecting to ResourceManager at swat3-33/172.23.33.1:8032
14/07/01 11:15:23 INFO client.RMProxy: Connecting to ResourceManager at swat3-33/172.23.33.1:8032
14/07/01 11:15:24 INFO mapred.FileInputFormat: Total input paths to process : 432
14/07/01 11:15:25 INFO mapreduce.JobSubmitter: number of splits:432
14/07/01 11:15:25 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1404170587682_0022
14/07/01 11:15:25 INFO impl.YarnClientImpl: Submitted application application_1404170587682_0022
14/07/01 11:15:25 INFO mapreduce.Job: The url to track the job: http://swat3-33:8088/proxy/application_1404170587682_0022/
14/07/01 11:15:25 INFO mapreduce.Job: Running job: job_1404170587682_0022
14/07/01 11:15:32 INFO mapreduce.Job: Job job_1404170587682_0022 running in uber mode : false
14/07/01 11:15:32 INFO mapreduce.Job: map 0% reduce 0%
14/07/01 11:15:35 INFO mapreduce.Job: Task Id : attempt_1404170587682_0022_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
... 17 more
Caused by: java.lang.NullPointerException
at org.apache.nutch.plugin.PluginManifestParser.parsePluginFolder(PluginManifestParser.java:87)
at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:71)
at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:95)
at org.apache.nutch.indexer.IndexingFilters.<init>(IndexingFilters.java:60)
at org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:61)
... 22 more

^C^C^C


Any comments about this error?

Thank you in advance.

How long will it take to run "prepare.sh"?

I am running HiBench to verify the performance of Spark SQL. The "prepare.sh" step has taken more than 3 hours and hasn't finished yet. This is my console output:

yang@xxxxx:~/HiBench/bin$ ./run-all.sh
Prepare join ...
Exec script: /home/yang/HiBench/workloads/join/prepare/prepare.sh
Parsing conf: /home/yang/HiBench/conf/00-default-properties.conf
Parsing conf: /home/yang/HiBench/conf/10-data-scale-profile.conf
Parsing conf: /home/yang/HiBench/conf/99-user_defined_properties.conf
Parsing conf: /home/yang/HiBench/workloads/join/conf/00-join-default.conf
Parsing conf: /home/yang/HiBench/workloads/join/conf/10-join-userdefine.conf
Probing spark verison, may last long at first time...
start HadoopPrepareJoin bench
hdfs rm -r: /home/yang/hadoop/bin/hadoop --config /home/yang/hadoop/etc/hadoop fs -rm -r -skipTrash hdfs://andromeda:9000/HiBench/Join/Input
15/05/27 16:32:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
rm: `hdfs://xxxxx:9000/HiBench/Join/Input': No such file or directory
Pages:120000, USERVISITS:1000000
Submit MapReduce Job: /home/yang/hadoop/bin/hadoop --config /home/yang/hadoop/etc/hadoop jar /home/yang/HiBench/src/autogen/target/autogen-4.0-SNAPSHOT-jar-with-dependencies.jar HiBench.DataGen -t hive -b hdfs://xxxxx:9000/HiBench/Join -n Input -m 12 -r 6 -p 120000 -v 1000000 -o sequence
15/05/27 16:32:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/05/27 16:32:49 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/05/27 16:32:50 INFO mapreduce.Job: Running job: job_1432692021703_0006

And on my Hadoop web page, the log shows the following:
2015-05-27 16:42:12,268 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-05-27 16:42:12,269 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).

So, is there anything wrong with my HDFS configuration?

Enhanced DFSIO failed

When run_read.sh or run_write.sh is almost finished, the following problem occurs:
(standard_in) 1: syntax error
/root/hibench/bin/../conf/funcs.sh: line 56: /root/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar: Permission denied
(standard_in) 1: syntax error

Line 56 in funcs.sh is: local nodes=$MAPRED_EXECUTABLE job -list-active-trackers | wc -l

When I execute wc -l in /root/hadoop-2.5.2/share/hadoop/mapreduce/ myself, no problem occurs.

The permissions of hadoop-mapreduce-examples-2.5.2.jar are:
-rw-rw-rw-. 1 10021 10021 270323 Nov 14 18:53 hadoop-mapreduce-examples-2.5.2.jar
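
A note on the quoted line: backticks tend to be stripped by the issue formatter, so line 56 presumably uses command substitution to count the active trackers. A minimal way to reproduce that step by hand, assuming $MAPRED_EXECUTABLE points at your mapred binary, is:

# hypothetical stand-alone check of what funcs.sh line 56 appears to compute
MAPRED_EXECUTABLE=/root/hadoop-2.5.2/bin/mapred   # assumption: path taken from this issue's setup
nodes=$("$MAPRED_EXECUTABLE" job -list-active-trackers | wc -l)
echo "active trackers: $nodes"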
