
Comments (19)

yliu318 avatar yliu318 commented on June 17, 2024

Hi, we haven't run into this problem. Could you please give more detailed information about it? It seems you haven't successfully uploaded your data into HDFS. Please check this and describe the problem in more detail.

Thanks,
Yan

from hibench.

yliu318 avatar yliu318 commented on June 17, 2024

And the NutchIndexing benchmark requires an external input dataset; please refer to step 2.0.1 of the README file in the HiBench-2.1 package.

Thanks,
Yan


prashanthig avatar prashanthig commented on June 17, 2024

As per step 2.0.1 in the README file, I have to download a dataset from Dropbox which is 16 GB in size. I am not able to download such a big file. Could you please give me another sample file of a smaller size (around 1 GB) for the input?
Help in this regard is highly appreciated.
Help in this regard is highly appreciated.

Thanks in advance.

Prashanthi


yliu318 avatar yliu318 commented on June 17, 2024

Hi,

At present, this is the only dataset we have for NutchIndexing. Admittedly, it is really tough and time-consuming to download. We are now planning to implement a data generator for NutchIndexing, which you may be able to use in a future HiBench release. But for now, we are sorry to say that the dataset stored in Dropbox is the only one available.

Thanks,
Yan


prashanthig avatar prashanthig commented on June 17, 2024

Thanks for the response.

Can I create a text file containing some 1000 URLs and use it as input for this by placing it in the wikinutch folder?


yliu318 avatar yliu318 commented on June 17, 2024

Of course you can create your own files, as long as they fit the input format of this benchmark.

Thanks,
Yan


prashanthig avatar prashanthig commented on June 17, 2024

When I did that, it copied my data file from the wikinutch folder into HDFS. But it still says "Total input paths to process is 0".

Is there a specific input format for this benchmark?


AllanY avatar AllanY commented on June 17, 2024

Hi,

No, that will not make the nutch benchmark work. Nutch basically includes three functional components, i.e., crawling (inject, generate, fetch, parse, update, invert), indexing, and searching. Our nutchindexing benchmark only measures the most representative and important part of Nutch – indexing. So the input of nutchindexing should be the result of a Nutch crawl, not simply a URL list as you assumed. If you want to build your own input data for nutchindexing, please use Nutch to crawl a given part of the web and make sure to put the resulting data into the location specified in the README. Thanks
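The crawl phases listed above (inject, generate, fetch, parse, update, invert) map onto individual Nutch 1.x subcommands. A dry-run sketch of that sequence follows; the segment directory name is hypothetical, exact subcommand names can vary between Nutch versions, and this script only prints the commands rather than running them (running them needs a Nutch installation):

```shell
#!/bin/sh
# Dry-run of the step-by-step Nutch crawl whose output feeds nutchindexing.
# Each command is printed and recorded in nutch_steps.txt, not executed.
SEGMENT="crawl/segments/20120823120000"  # hypothetical segment directory

: > nutch_steps.txt
for step in \
    "bin/nutch inject crawl/crawldb urls" \
    "bin/nutch generate crawl/crawldb crawl/segments" \
    "bin/nutch fetch $SEGMENT" \
    "bin/nutch parse $SEGMENT" \
    "bin/nutch updatedb crawl/crawldb $SEGMENT" \
    "bin/nutch invertlinks crawl/linkdb -dir crawl/segments"
do
    echo "$step" | tee -a nutch_steps.txt
done
```

The one-shot `bin/nutch crawl` command mentioned later in this thread wraps roughly this same sequence.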

Btw, do the other benchmarks work on your Hadoop cluster?

Cheers
Lionel


yliu318 avatar yliu318 commented on June 17, 2024

Hi,

The input of nutchindexing is the result of crawling a copy of the Wikipedia dump with the Nutch crawler.
In our experience the crawling process took a very long time, so we basically do not repeat it in the prepare job and just copy the crawler result from local storage.

When we prepared the input data, we downloaded the full image of Wikipedia, set up a local mirror, and then crawled it.
The command we used was "bin/nutch crawl urls -dir crawl -depth 3 -topN 50".

You can find how to download the Wikipedia dump here: http://en.wikipedia.org/wiki/Wikipedia:Database_download
Then you can set up a local mirror using the dump.

But it is really time-consuming. We still suggest you download our data.
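The preparation procedure above can be sketched as follows. The mirror URL is hypothetical, the dump download and the crawl itself are shown commented out (the dump is tens of GB and the crawl needs a Nutch installation); only the seed-list setup actually runs:

```shell
#!/bin/sh
# Sketch of the data-preparation procedure described above.
# 1. Download a Wikipedia dump (very large; standard dump filename):
#    wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
# 2. Serve the articles from a local web server, e.g. a mirror at a
#    hypothetical http://localhost/wiki/, and point the Nutch seed list at it:
mkdir -p urls
echo "http://localhost/wiki/" > urls/seed.txt  # hypothetical mirror URL
# 3. Run the crawl exactly as in the comment above (requires Nutch):
#    bin/nutch crawl urls -dir crawl -depth 3 -topN 50
```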

Thanks,
Yan


prashanthig avatar prashanthig commented on June 17, 2024

Hi. I am just trying to see how the benchmark works for nutchindexing. I am trying it on a 20 GB virtual machine where 12–13 GB is already occupied. So please let me know if there is another way to generate a small dump, perform the crawl, and give its output as the input for indexing.

Thanks,
Prashanthi


AllanY avatar AllanY commented on June 17, 2024

Hi,

Given your description, I suggest you use Nutch directly to crawl the wiki site (or other sites you are interested in), and then use that data for the nutchindexing benchmark.
HiBench-2.1 does not support variable-size Nutch data generation. If you really need it, you can wait until the end of September, when we expect to publish the next version of HiBench.

Cheers
Lionel


prashanthig avatar prashanthig commented on June 17, 2024

Hi,
As suggested by Yan, I am trying to run the crawler. In the given command "bin/nutch crawl urls -dir crawl -depth 3 -topN 50", what should the urls directory contain?

Thanks,
prashanthi.


prashanthig avatar prashanthig commented on June 17, 2024

Is the urls directory the path where the fetcher segments are created?

When I run the above with an empty urls directory, I get the exception below:

Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.crawl.Generator.generate(Generator.java:526)

Please help me resolve this issue.

Thanks in Advance.


yliu318 avatar yliu318 commented on June 17, 2024

Hi,

We are sorry for the late reply. We are not quite sure what your problem is and are trying to find out why. We suspect you should not use an empty URL list. Please give that a try.

Thanks,
Yan


AllanY avatar AllanY commented on June 17, 2024

urls here is the path to a directory containing the file with all the seed URLs, which will be injected into the crawldb for subsequent crawling. Using an empty directory leads to problems because Nutch cannot find where to start.
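A minimal seed setup along these lines might look like the sketch below: the urls directory holds a plain-text file of seed URLs, one per line (the filename seed.txt and the example sites are illustrative; the file name itself does not matter to Nutch):

```shell
#!/bin/sh
# Create the seed-URL directory expected by "bin/nutch crawl urls ...".
mkdir -p urls
cat > urls/seed.txt <<'EOF'
http://en.wikipedia.org/
http://nutch.apache.org/
EOF
# With seeds in place, the crawl from earlier in the thread can start
# (requires a Nutch installation):
#   bin/nutch crawl urls -dir crawl -depth 3 -topN 50
```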

Cheers
Lionel


prashanthig avatar prashanthig commented on June 17, 2024

Hi,

When I ran it with a list of URLs in a file, it gave me the exception below:
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)

Thanks.


AllanY avatar AllanY commented on June 17, 2024

Hi,

Sorry, with so little information it's hard for us to understand what has happened in your working environment.

One thing to note: if you do not mean to measure or compare your Hadoop cluster (with others), it is not necessary to run HiBench. In your case, instead of preparing data and running Nutch through HiBench-2.1, you could set up an independent Nutch on your cluster and then directly run the Nutch processes, i.e., crawling, indexing, searching, etc.

Finally, if you really want to prepare your own data and run Nutch in HiBench-2.1, we strongly suggest you carefully read the Nutch (and HiBench-2.1) documentation and command help first. You may run into many problems if you do not follow the instructions.

Cheers
Lionel


prashanthig avatar prashanthig commented on June 17, 2024

I have now run the crawling and indexing successfully.

Thanks a lot for the help


adrian-wang avatar adrian-wang commented on June 17, 2024

Closed as resolved

