yahooarchive / anthelion

Anthelion is a plugin for Apache Nutch to crawl semantic annotations within HTML pages.

Home Page: https://labs.yahoo.com/publications/6702/focused-crawling-structured-data

License: Apache License 2.0

Java 96.64% XSLT 0.08% Shell 0.65% HTML 2.63%

anthelion's Introduction

nutch-anth

Anthelion is a Nutch plugin for focused crawling of semantic data. The project is open source and released under the Apache License 2.0.

Note: This project contains the complete Nutch 1.6 distribution. The plugin itself can be found in /src/plugin/parse-anth

Table of Contents

  • Nutch-Anthelion Plugin
    • Plugin Overview
    • Usage and Development
    • Some Results
    • 3rd Party Libraries
  • Anthelion
  • References

Nutch-Anthelion Plugin

The plugin uses an online learning approach to predict data-rich web pages from the context of a page and from feedback gathered while extracting metadata from previously seen pages [1].
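To make the predict-then-feedback idea concrete, here is a minimal sketch of an online linear classifier. It is not the classifier that ships with Anthelion: the class name `OnlineLinkClassifier`, the URL-token features, and the logistic update rule are all assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical online classifier for illustration; not Anthelion's implementation. */
final class OnlineLinkClassifier {
    private final Map<String, Double> weights = new HashMap<>();
    private final double learningRate;

    OnlineLinkClassifier(double learningRate) {
        this.learningRate = learningRate;
    }

    /** Splits a URL into lowercase alphanumeric tokens used as binary features. */
    static String[] tokens(String url) {
        return url.toLowerCase().split("[^a-z0-9]+");
    }

    /** Estimated probability that the page behind the URL contains semantic data. */
    double score(String url) {
        double sum = 0.0;
        for (String t : tokens(url)) {
            if (!t.isEmpty()) {
                sum += weights.getOrDefault(t, 0.0);
            }
        }
        return 1.0 / (1.0 + Math.exp(-sum));
    }

    /** Feedback step: one logistic-regression gradient update once the page has been parsed. */
    void feedback(String url, boolean containedSemanticData) {
        double error = (containedSemanticData ? 1.0 : 0.0) - score(url);
        for (String t : tokens(url)) {
            if (!t.isEmpty()) {
                weights.merge(t, learningRate * error, Double::sum);
            }
        }
    }

    public static void main(String[] args) {
        OnlineLinkClassifier classifier = new OnlineLinkClassifier(0.5);
        // Example URLs are made up; labels stand in for the parser's extraction result.
        classifier.feedback("http://example.org/product/1", true);
        classifier.feedback("http://example.org/login", false);
        System.out.println(classifier.score("http://example.org/product/2"));
        System.out.println(classifier.score("http://example.org/login?next=home"));
    }
}
```

The model is updated one page at a time, so no separate training phase is needed; each parsed page immediately influences the scores of URLs discovered later.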

Plugin Overview

To perform the focused crawling the plugin implements three extensions:

  1. AnthelionScoringFilter (implements the ScoringFilter interface): wraps the Anthelion online classifier to classify newly discovered outlinks as relevant or not. This extension assigns a score to each outlink, which is then used in the Generate stage, i.e., the URLs for the next fetch cycle are selected based on the score. This extension also pushes feedback to the classifier for already parsed web pages. The online classifier can be configured and tuned (see Usage and Development).

  2. WdcParser (implements the Parser interface): This extension parses the web page content and tries to extract semantic data. The parser is an adaptation of an existing Nutch parser plugin implemented in [2]. The parser is based on the any23 library and can extract Microdata, Microformats and RDFa annotations from HTML. The extracted triples are stored in the Content field.

  3. TripleExtractor (implements the IndexingFilter interface): This extension adds new fields to the index that can later be used for querying.
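The score-based selection in the Generate stage can be pictured with a small sketch. The class `TopNSelector` and its method are hypothetical and do not use the Nutch API; they only illustrate picking the best-scored URLs for the next fetch cycle, in the spirit of the -topN option.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; Nutch's Generator works on the CrawlDb instead. */
final class TopNSelector {
    /** Returns up to n URLs with the highest scores, best first (n must not be negative). */
    static List<String> selectTopN(Map<String, Double> scoredUrls, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scoredUrls.entrySet());
        Comparator<Map.Entry<String, Double>> byScoreDescending =
                Map.Entry.<String, Double>comparingByValue().reversed();
        entries.sort(byScoreDescending);

        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Double> entry : entries.subList(0, Math.min(n, entries.size()))) {
            selected.add(entry.getKey());
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up outlinks with scores as a classifier might assign them.
        Map<String, Double> outlinks = new HashMap<>();
        outlinks.put("http://example.org/product/2", 0.81);
        outlinks.put("http://example.org/login", 0.12);
        outlinks.put("http://example.org/event/7", 0.64);
        System.out.println(selectTopN(outlinks, 2));
    }
}
```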

An overview of the complete crawling process using the Anthelion plugin is given in the following figure.

[Figure: Anthelion Architecture]

Usage and Development

As mentioned at the beginning of this document, the project contains the complete Nutch 1.6 code, including the plugin. If you download the complete project, no further changes or settings are needed. If you want only the plugin, download nutch-anth.zip from the root of the repository and go to step 2 of the configuration. If you want to contribute to the plugin or use the sources with another version of Nutch, follow these instructions:

  1. Download the /src/plugin/parse-anth folder and copy it into your Nutch plugins directory.

  2. Enable the plugin in conf/nutch-site.xml by adding parse-anth to the plugin.includes property.

  3. Copy the properties from nutch-anth.xml to conf/nutch-site.xml.

    3.1. Download the baseline.properties file and set the property anth.scoring.classifier.PropsFilePath in conf/nutch-site.xml to point to the file. This file contains all configuration for the online classifier.

  4. For ant to compile and deploy the plugin, edit src/plugin/build.xml and add the following line to the deploy target:

    <ant dir="parse-anth" target="deploy"/>
  5. Add the following lines in conf/parse-plugins.xml:

    <mimeType name="text/html">
        <plugin id="parse-anth" />
    </mimeType>

    <mimeType name="application/xhtml+xml">
        <plugin id="parse-anth" />
    </mimeType>
  6. Add the following line to the aliases section of conf/parse-plugins.xml:

    <alias name="parse-anth" extension-id="com.yahoo.research.parsing.WdcParser" />
  7. Copy the lib folder into the root of the Nutch distribution.

  8. Run mvn package inside the anthelion folder. This will create the jar "Anthelion-1.0.0-jar-with-dependencies.jar". Copy the jar to src/plugin/parse-anth/lib.

  9. Add the following field in conf/schema.xml (also add it to the Solr schema.xml, if you are using Solr):

    <field name="containsSem" type="text_general" stored="true" indexed="true"/>
  10. Run ant in the root of your folder.
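A common stumbling block in step 2 is that plugin.includes holds a regular expression over plugin ids, so parse-anth has to be added as another alternative. The sketch below checks this with plain java.util.regex. The property value shown is a hypothetical example, not the one from your nutch-site.xml, and the full-match semantics are an assumption about how Nutch applies the expression.

```java
import java.util.regex.Pattern;

/** Illustration only: checks a plugin id against a plugin.includes-style expression. */
final class PluginIncludesCheck {
    /** True if the plugin id fully matches the given regular expression. */
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Hypothetical property value; use the value from your own configuration.
        String before = "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)";
        String after = before + "|parse-anth";

        System.out.println("before: " + isIncluded(before, "parse-anth"));
        System.out.println("after:  " + isIncluded(after, "parse-anth"));
    }
}
```

If the plugin does not appear to load after the build, checking the expression this way is a quick first step before looking at the build output.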

Some Results

To evaluate the focused crawler we measure the precision of the crawl, i.e., the ratio of the number of crawled web pages that contain semantic data to the total number of crawled web pages. So far, we have evaluated three different seed samples with several different configurations. An overview is given in the following table.

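| #seeds | Nutch options | Standard scoring: #total pages | Standard scoring: #sem pages | Standard scoring: precision | Anthelion scoring: #total pages | Anthelion scoring: #sem pages | Anthelion scoring: precision |
|---|---|---|---|---|---|---|---|
| 2 | -depth 3 -topN 15 | 17 | 2 | 0.12 | 22 | 7 | 0.32 |
| 10 | -depth 8 -topN 15 | 99 | 2 | 0.02 | 49 | 11 | 0.22 |
| 1000 | -depth 4 -topN 1000 | 3200 | 212 | 0.07 | 2910 | 1469 | 0.50 |
| 1000 | -depth 5 -topN 2000 | 8240 | 511 | 0.06 | 9781 | 7587 | 0.78 |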
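The precision values in the table follow directly from the page counts. A minimal sketch of the calculation, using the counts of the last row (the class and method names are made up for this example):

```java
import java.util.Locale;

/** Illustration of the precision measure: pages with semantic data / all crawled pages. */
final class CrawlPrecision {
    static double precision(int semPages, int totalPages) {
        if (totalPages <= 0) {
            throw new IllegalArgumentException("totalPages must be positive");
        }
        return (double) semPages / totalPages;
    }

    public static void main(String[] args) {
        // Counts taken from the row with 1000 seeds, -depth 5 -topN 2000.
        double standard = precision(511, 8240);
        double anthelion = precision(7587, 9781);
        System.out.println(String.format(Locale.ROOT, "standard scoring:  %.2f", standard));
        System.out.println(String.format(Locale.ROOT, "anthelion scoring: %.2f", anthelion));
    }
}
```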

The pairwise comparison is given in the following chart:

[Chart: pairwise comparison of standard and Anthelion scoring]

3rd Party Libraries

The Anthelion plugin uses several 3rd party open source libraries and tools. Here we summarize the tools used, their purpose, and the licenses under which they're released.

  1. This project includes the sources of Apache Nutch 1.6 (Apache License 2.0 - http://www.apache.org/licenses/LICENSE-2.0)

  2. Apache Any23 1.2 (Apache License 2.0 - http://www.apache.org/licenses/LICENSE-2.0)

    • Used for extraction of semantic annotation from HTML.
    • https://any23.apache.org/
    • More information about the 3rd party dependencies used in the any23 library can be found here
  3. The classes com.yahoo.research.parsing.WdcParser and com.yahoo.research.parsing.FilterableTripleHandler are modified versions of existing Nutch plugins (Apache License 2.0 - http://www.apache.org/licenses/LICENSE-2.0)

  4. For the libraries and tools used in Anthelion, please check the Anthelion [README file](https://github.com/yahoo/anthelion/blob/master/anthelion/README.md).

Anthelion

For more details about the Anthelion project please check the Anthelion [README file](https://github.com/yahoo/anthelion/blob/master/anthelion/README.md).

References

[1]. Meusel, Robert, Peter Mika, and Roi Blanco. "Focused Crawling for Structured Data." Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. ACM, 2014.

[2]. Hellmann, Sebastian, et al. "Knowledge Base Creation, Enrichment and Repair." Linked Open Data--Creating Knowledge Out of Interlinked Data. Springer International Publishing, 2014. 45-69.

Troubleshooting (TODO)

anthelion's People

Contributors

aaroncritchley, petarr, robertmeusel


anthelion's Issues

HDFS file path containing ':' (colon) seems to throw an exception

I downloaded the whole source code and built it successfully. When I tried to run a crawl test:
bin/crawl urls/ TestCrawl/ http://localhost:8983/solr/nutch 2
I ran into this URI path name issue.
hadoop.log.zip

I have attached the log file. It seems the issue with special characters in HDFS file path names is still there?

2016-01-03 13:27:08,405 INFO fetcher.Fetcher - Fetcher: starting at 2016-01-03 13:27:08
2016-01-03 13:27:08,405 INFO fetcher.Fetcher - Fetcher: segment: TestCrawl/segments/drwxr-xr-xnn4nstevennstaffnn136nJannn3n13:24n20160103090925
2016-01-03 13:27:08,406 INFO fetcher.Fetcher - Fetcher Timelimit set for : 1451809628406
2016-01-03 13:27:08,631 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-01-03 13:27:08,677 ERROR fetcher.Fetcher - Fetcher: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: drwxr-xr-xnn4nstevennstaffnn136nJannn3n13:24n20160103090925
at org.apache.hadoop.fs.Path.initialize(Path.java:148)
at org.apache.hadoop.fs.Path.<init>(Path.java:126)
at org.apache.hadoop.fs.Path.<init>(Path.java:50)
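One plausible reading of the exception message: the text before the ':' in the segment name is taken as a URI scheme, which leaves a path that does not start with '/'. The sketch below reproduces the same message with plain java.net.URI. It does not call Hadoop, and the assumption that Hadoop's Path splits the string at the colon in this way is not confirmed by the log.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Illustration only: shows when java.net.URI rejects a scheme combined with a relative path. */
final class RelativePathInAbsoluteUri {
    /** Returns the reason reported by java.net.URI, or null if the URI is accepted. */
    static String reasonFor(String scheme, String path) {
        try {
            new URI(scheme, null, path, null, null);
            return null;
        } catch (URISyntaxException e) {
            return e.getReason();
        }
    }

    public static void main(String[] args) {
        // Segment name from the log, split at the colon (the split itself is an assumption).
        String beforeColon = "drwxr-xr-xnn4nstevennstaffnn136nJannn3n13";
        String afterColon = "24n20160103090925";
        System.out.println(reasonFor(beforeColon, afterColon));
        System.out.println(reasonFor(beforeColon, "/" + afterColon));
    }
}
```

The segment name in the log looks like directory-listing output rather than a timestamp, so the colon most likely comes from the "13:24" time field that ended up in the path.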

Hi

image
what's the meaning

Add Desc

"Anthelion is a Nutch plugin for focused crawling of semantic data."

Propose Anthelion for Nutch Trunk

Hi Folks,
Are you interested in proposing Anthelion for integration into the Nutch trunk source code?
I think I've spoken with a few of you over on the Any23 ML and I am very glad to see you publish the outcome of the work as source code.
What do you guys think? I am one of the main developers of Nutch and Any23 so I am very interested in seeing this plugin available for more to use.

nutch 2.2 not compatible

When I follow instructions 1-9 with Nutch 2.2, compilation fails because some classes cannot be found; they exist only in the Nutch 1.6 library:

compile:
[echo] Compiling plugin: parse-anth
[javac] Compiling 13 source files to NUTCH2.2_ROOT/build/parse-anth/classes
[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/indexing/TripleExtractor.java:23: error: cannot find symbol
[javac] import org.apache.nutch.crawl.CrawlDatum;
[javac] ^
[javac] symbol: class CrawlDatum
[javac] location: package org.apache.nutch.crawl
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/indexing/TripleExtractor.java:24: error: cannot find symbol
[javac] import org.apache.nutch.crawl.Inlinks;
[javac] ^
[javac] symbol: class Inlinks
[javac] location: package org.apache.nutch.crawl
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/indexing/TripleExtractor.java:45: error: cannot find symbol
[javac] public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) {
[javac] ^
[javac] symbol: class CrawlDatum
[javac] location: class TripleExtractor
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/indexing/TripleExtractor.java:45: error: cannot find symbol
[javac] public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) {
[javac] ^
[javac] symbol: class Inlinks
[javac] location: class TripleExtractor
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:42: error: cannot find symbol
[javac] import org.apache.nutch.parse.HtmlParseFilters;
[javac] ^
[javac] symbol: class HtmlParseFilters
[javac] location: package org.apache.nutch.parse
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:45: error: cannot find symbol
[javac] import org.apache.nutch.parse.ParseData;
[javac] ^
[javac] symbol: class ParseData
[javac] location: package org.apache.nutch.parse
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:46: error: cannot find symbol
[javac] import org.apache.nutch.parse.ParseImpl;
[javac] ^
[javac] symbol: class ParseImpl
[javac] location: package org.apache.nutch.parse
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:47: error: cannot find symbol
[javac] import org.apache.nutch.parse.ParseResult;
[javac] ^
[javac] symbol: class ParseResult
[javac] location: package org.apache.nutch.parse
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:48: error: cannot find symbol
[javac] import org.apache.nutch.parse.ParseStatus;
[javac] ^
[javac] symbol: class ParseStatus
[javac] location: package org.apache.nutch.parse
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:143: error: cannot find symbol
[javac] private HtmlParseFilters htmlParseFilters;
[javac] ^
[javac] symbol: class HtmlParseFilters
[javac] location: class WdcParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:177: error: cannot find symbol
[javac] public ParseResult getParse(Content content) {
[javac] ^
[javac] symbol: class ParseResult
[javac] location: class WdcParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:30: error: cannot find symbol
[javac] import org.apache.nutch.crawl.CrawlDatum;
[javac] ^
[javac] symbol: class CrawlDatum
[javac] location: package org.apache.nutch.crawl
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:31: error: cannot find symbol
[javac] import org.apache.nutch.crawl.Inlinks;
[javac] ^
[javac] symbol: class Inlinks
[javac] location: package org.apache.nutch.crawl
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:34: error: cannot find symbol
[javac] import org.apache.nutch.parse.ParseData;
[javac] ^
[javac] symbol: class ParseData
[javac] location: package org.apache.nutch.parse
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:139: error: cannot find symbol
[javac] public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
[javac] ^
[javac] symbol: class ParseData
[javac] location: class AnthelionScoringFilter
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:140: error: cannot find symbol
[javac] Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust, int allCount)
[javac] ^
[javac] symbol: class CrawlDatum
[javac] location: class AnthelionScoringFilter
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:140: error: cannot find symbol
[javac] Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust, int allCount)
[javac] ^
[javac] symbol: class CrawlDatum
[javac] location: class AnthelionScoringFilter
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:139: error: cannot find symbol
[javac] public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
[javac] ^
[javac] symbol: class CrawlDatum
[javac] location: class AnthelionScoringFilter
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:182: error: cannot find symbol
[javac] public void injectedScore(Text url, CrawlDatum datum) throws ScoringFilterException {
[javac] ^
[javac] symbol: class CrawlDatum
[javac] location: class AnthelionScoringFilter
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:188: error: cannot find symbol
[javac] public void initialScore(Text url, CrawlDatum datum) throws ScoringFilterException {
[javac] ^
[javac] symbol: class CrawlDatum
[javac] location: class AnthelionScoringFilter
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:199: error: cannot find symbol
[javac] public float generatorSortValue(Text url, CrawlDatum datum, float initSort) throws ScoringFilterException {
[javac] ^
[javac] symbol: class CrawlDatum
[javac] location: class AnthelionScoringFilter
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:205: error: cannot find symbol
[javac] public void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) throws ScoringFilterException {
[javac] ^
[javac] symbol: class CrawlDatum
[javac] location: class AnthelionScoringFilter
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:211: error: cannot find symbol
[javac] public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List inlinked)
[javac] ^
[javac] symbol: class CrawlDatum
[javac] location: class AnthelionScoringFilter
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:211: error: cannot find symbol
[javac] public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List inlinked)
[javac] ^
[javac] symbol: class CrawlDatum
[javac] location: class AnthelionScoringFilter
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:211: error: cannot find symbol
[javac] public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List inlinked)
[javac] ^
[javac] symbol: class CrawlDatum
[javac] location: class AnthelionScoringFilter
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:218: error: cannot find symbol
[javac] public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse,
[javac] ^
[javac] symbol: class CrawlDatum
[javac] location: class AnthelionScoringFilter
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:218: error: cannot find symbol
[javac] public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse,
[javac] ^
[javac] symbol: class CrawlDatum
[javac] location: class AnthelionScoringFilter
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:219: error: cannot find symbol
[javac] Inlinks inlinks, float initScore) throws ScoringFilterException {
[javac] ^
[javac] symbol: class Inlinks
[javac] location: class AnthelionScoringFilter
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:112: error: cannot find symbol
[javac] private HtmlParseFilters htmlParseFilters;
[javac] ^
[javac] symbol: class HtmlParseFilters
[javac] location: class HtmlParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:116: error: cannot find symbol
[javac] public ParseResult getParse(Content content) {
[javac] ^
[javac] symbol: class ParseResult
[javac] location: class HtmlParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/indexing/TripleExtractor.java:38: error: TripleExtractor is not abstract and does not override abstract method filter(NutchDocument,String,WebPage) in IndexingFilter
[javac] public class TripleExtractor implements IndexingFilter {
[javac] ^
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/indexing/TripleExtractor.java:50: error: cannot find symbol
[javac] containsSem = parse.getData().getMeta(WdcParser.META_CONTAINS_SEM);
[javac] ^
[javac] symbol: method getData()
[javac] location: variable parse of type Parse
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:74: error: WdcParser is not abstract and does not override abstract method getParse(String,WebPage) in Parser
[javac] public class WdcParser implements Parser {
[javac] ^
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:194: error: cannot find symbol
[javac] return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
[javac] ^
[javac] symbol: class ParseStatus
[javac] location: class WdcParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:210: error: no suitable method found for autoDetectClues(Content,boolean)
[javac] detector.autoDetectClues(content, true);
[javac] ^
[javac] method EncodingDetector.autoDetectClues(ByteBuffer,Utf8,String,boolean) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method EncodingDetector.autoDetectClues(WebPage,boolean) is not applicable
[javac] (actual argument Content cannot be converted to WebPage by method invocation conversion)
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:212: error: no suitable method found for guessEncoding(Content,String)
[javac] String encoding = detector.guessEncoding(content, defaultCharEncoding);
[javac] ^
[javac] method EncodingDetector.guessEncoding(String,String) is not applicable
[javac] (actual argument Content cannot be converted to String by method invocation conversion)
[javac] method EncodingDetector.guessEncoding(WebPage,String) is not applicable
[javac] (actual argument Content cannot be converted to WebPage by method invocation conversion)
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:223: error: cannot find symbol
[javac] return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
[javac] ^
[javac] symbol: class ParseStatus
[javac] location: class WdcParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:225: error: cannot find symbol
[javac] return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
[javac] ^
[javac] symbol: class ParseStatus
[javac] location: class WdcParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:227: error: cannot find symbol
[javac] return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
[javac] ^
[javac] symbol: class ParseStatus
[javac] location: class WdcParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:230: error: cannot find symbol
[javac] return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
[javac] ^
[javac] symbol: class ParseStatus
[javac] location: class WdcParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:282: error: cannot find symbol
[javac] ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
[javac] ^
[javac] symbol: class ParseStatus
[javac] location: class WdcParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:282: error: cannot find symbol
[javac] ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
[javac] ^
[javac] symbol: class ParseStatus
[javac] location: class WdcParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:282: error: cannot find symbol
[javac] ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
[javac] ^
[javac] symbol: variable ParseStatus
[javac] location: class WdcParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:284: error: cannot find symbol
[javac] status.setMinorCode(ParseStatus.SUCCESS_REDIRECT);
[javac] ^
[javac] symbol: variable ParseStatus
[javac] location: class WdcParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:288: error: cannot find symbol
[javac] ParseData parseData = new ParseData(status, title, outlinks, content.getMetadata(), metadata);
[javac] ^
[javac] symbol: class ParseData
[javac] location: class WdcParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:288: error: cannot find symbol
[javac] ParseData parseData = new ParseData(status, title, outlinks, content.getMetadata(), metadata);
[javac] ^
[javac] symbol: class ParseData
[javac] location: class WdcParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:289: error: cannot find symbol
[javac] ParseResult parseResult = ParseResult.createParseResult(content.getUrl(), new ParseImpl(text, parseData));
[javac] ^
[javac] symbol: class ParseResult
[javac] location: class WdcParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:289: error: cannot find symbol
[javac] ParseResult parseResult = ParseResult.createParseResult(content.getUrl(), new ParseImpl(text, parseData));
[javac] ^
[javac] symbol: class ParseImpl
[javac] location: class WdcParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:289: error: cannot find symbol
[javac] ParseResult parseResult = ParseResult.createParseResult(content.getUrl(), new ParseImpl(text, parseData));
[javac] ^
[javac] symbol: variable ParseResult
[javac] location: class WdcParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:292: error: cannot find symbol
[javac] parse.getData().getContentMeta().set(META_CONTAINS_SEM, Boolean.toString(containsSem));
[javac] ^
[javac] symbol: method getData()
[javac] location: variable parse of type Parse
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:298: error: cannot find symbol
[javac] entry.getValue().getData().getParseMeta().set(Nutch.CACHING_FORBIDDEN_KEY, cachingPolicy);
[javac] ^
[javac] symbol: method getData()
[javac] location: class Parse
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:372: error: cannot find symbol
[javac] System.out.println("data: " + parse.getData());
[javac] ^
[javac] symbol: method getData()
[javac] location: variable parse of type Parse
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:376: error: cannot find symbol
[javac] String contains = parse.getData().getMeta(META_CONTAINS_SEM);
[javac] ^
[javac] symbol: method getData()
[javac] location: variable parse of type Parse
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/parsing/WdcParser.java:420: error: cannot find symbol
[javac] this.htmlParseFilters = new HtmlParseFilters(getConf());
[javac] ^
[javac] symbol: class HtmlParseFilters
[javac] location: class WdcParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/DOMContentUtils.java:388: error: cannot find symbol
[javac] URL url = URLUtil.resolveURL(base, target);
[javac] ^
[javac] symbol: method resolveURL(URL,String)
[javac] location: class URLUtil
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:54: error: AnthelionScoringFilter is not abstract and does not override abstract method indexerScore(String,NutchDocument,WebPage,float) in ScoringFilter
[javac] public class AnthelionScoringFilter implements ScoringFilter {
[javac] ^
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:110: error: cannot find symbol
[javac] containsSem = Boolean.parseBoolean(parse.getData().getMeta(WdcParser.META_CONTAINS_SEM));
[javac] ^
[javac] symbol: method getData()
[javac] location: variable parse of type Parse
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:112: error: cannot find symbol
[javac] semFather = Boolean.parseBoolean(parse.getData().getMeta(WdcParser.META_CONTAINS_SEM_FATHER));
[javac] ^
[javac] symbol: method getData()
[javac] location: variable parse of type Parse
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:129: error: cannot find symbol
[javac] parse.getData().getContentMeta().set(WdcParser.META_CONTAINS_SEM_FATHER_FOR_SUB, Boolean.toString(containsSem));
[javac] ^
[javac] symbol: method getData()
[javac] location: variable parse of type Parse
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:98: error: method does not override or implement a method from a supertype
[javac] @Override
[javac] ^
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:146: error: cannot find symbol
[javac] for (Entry<Text, CrawlDatum> target : targets) {
[javac] ^
[javac] symbol: class CrawlDatum
[javac] location: class AnthelionScoringFilter
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:138: error: method does not override or implement a method from a supertype
[javac] @Override
[javac] ^
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:181: error: method does not override or implement a method from a supertype
[javac] @Override
[javac] ^
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:187: error: method does not override or implement a method from a supertype
[javac] @Override
[javac] ^
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:198: error: method does not override or implement a method from a supertype
[javac] @Override
[javac] ^
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:204: error: method does not override or implement a method from a supertype
[javac] @Override
[javac] ^
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:210: error: method does not override or implement a method from a supertype
[javac] @Override
[javac] ^
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/com/yahoo/research/scoring/AnthelionScoringFilter.java:217: error: method does not override or implement a method from a supertype
[javac] @Override
[javac] ^
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:44: error: HtmlParser is not abstract and does not override abstract method getParse(String,WebPage) in Parser
[javac] public class HtmlParser implements Parser {
[javac] ^
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:123: error: cannot find symbol
[javac] return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
[javac] ^
[javac] symbol: class ParseStatus
[javac] location: class HtmlParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:138: error: no suitable method found for autoDetectClues(Content,boolean)
[javac] detector.autoDetectClues(content, true);
[javac] ^
[javac] method EncodingDetector.autoDetectClues(ByteBuffer,Utf8,String,boolean) is not applicable
[javac] (actual and formal argument lists differ in length)
[javac] method EncodingDetector.autoDetectClues(WebPage,boolean) is not applicable
[javac] (actual argument Content cannot be converted to WebPage by method invocation conversion)
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:140: error: no suitable method found for guessEncoding(Content,String)
[javac] String encoding = detector.guessEncoding(content, defaultCharEncoding);
[javac] ^
[javac] method EncodingDetector.guessEncoding(String,String) is not applicable
[javac] (actual argument Content cannot be converted to String by method invocation conversion)
[javac] method EncodingDetector.guessEncoding(WebPage,String) is not applicable
[javac] (actual argument Content cannot be converted to WebPage by method invocation conversion)
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:149: error: cannot find symbol
[javac] return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
[javac] ^
[javac] symbol: class ParseStatus
[javac] location: class HtmlParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:151: error: cannot find symbol
[javac] return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
[javac] ^
[javac] symbol: class ParseStatus
[javac] location: class HtmlParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:153: error: cannot find symbol
[javac] return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
[javac] ^
[javac] symbol: class ParseStatus
[javac] location: class HtmlParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:156: error: cannot find symbol
[javac] return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
[javac] ^
[javac] symbol: class ParseStatus
[javac] location: class HtmlParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:187: error: cannot find symbol
[javac] ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
[javac] ^
[javac] symbol: class ParseStatus
[javac] location: class HtmlParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:187: error: cannot find symbol
[javac] ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
[javac] ^
[javac] symbol: class ParseStatus
[javac] location: class HtmlParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:187: error: cannot find symbol
[javac] ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
[javac] ^
[javac] symbol: variable ParseStatus
[javac] location: class HtmlParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:189: error: cannot find symbol
[javac] status.setMinorCode(ParseStatus.SUCCESS_REDIRECT);
[javac] ^
[javac] symbol: variable ParseStatus
[javac] location: class HtmlParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:193: error: cannot find symbol
[javac] ParseData parseData = new ParseData(status, title, outlinks,
[javac] ^
[javac] symbol: class ParseData
[javac] location: class HtmlParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:193: error: cannot find symbol
[javac] ParseData parseData = new ParseData(status, title, outlinks,
[javac] ^
[javac] symbol: class ParseData
[javac] location: class HtmlParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:195: error: cannot find symbol
[javac] ParseResult parseResult = ParseResult.createParseResult(content.getUrl(),
[javac] ^
[javac] symbol: class ParseResult
[javac] location: class HtmlParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:196: error: cannot find symbol
[javac] new ParseImpl(text, parseData));
[javac] ^
[javac] symbol: class ParseImpl
[javac] location: class HtmlParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:195: error: cannot find symbol
[javac] ParseResult parseResult = ParseResult.createParseResult(content.getUrl(),
[javac] ^
[javac] symbol: variable ParseResult
[javac] location: class HtmlParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:199: error: cannot find symbol
[javac] ParseResult filteredParse = this.htmlParseFilters.filter(content, parseResult,
[javac] ^
[javac] symbol: class ParseResult
[javac] location: class HtmlParser
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:203: error: cannot find symbol
[javac] entry.getValue().getData().getParseMeta().set(Nutch.CACHING_FORBIDDEN_KEY,
[javac] ^
[javac] symbol: method getData()
[javac] location: class Parse
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:281: error: cannot find symbol
[javac] System.out.println("data: "+parse.getData());
[javac] ^
[javac] symbol: method getData()
[javac] location: variable parse of type Parse
[javac] NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/HtmlParser.java:289: error: cannot find symbol
[javac] this.htmlParseFilters = new HtmlParseFilters(getConf());
[javac] ^
[javac] symbol: class HtmlParseFilters
[javac] location: class HtmlParser
[javac] Note: NUTCH2.2_ROOT/src/plugin/parse-anth/src/java/org/apache/nutch/parse/html/DOMBuilder.java uses unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 89 errors
[javac] 1 warning

BUILD FAILED
NUTCH2.2_ROOT/build.xml:108: The following error occurred while executing this line:
NUTCH2.2_ROOT/src/plugin/build.xml:29: The following error occurred while executing this line:
NUTCH2.2_ROOT/src/plugin/build-plugin.xml:117: Compile failed; see the compiler error output for details.
