ia-hadoop-tools's People

Contributors

aaronbinns, aloisius, ikreymer, kngenie, nlevitt, sebastian-nagel, vinaygo, vinaygoel

ia-hadoop-tools's Issues

Reduce logging when transforming WARC to WAT/WET

When transforming WARC to WAT/WET (org.archive.hadoop.jobs.WEATGenerator), the ExtractingResourceProducer logs every WARC record extensively:

Nov 21, 2019 9:51:26 AM org.archive.extract.ExtractingResourceProducer getNext
INFO: Extracting (class org.archive.resource.warc.WARCResource) with (class org.archive.resource.http.HTTPRequestResourceFactory)

Nov 21, 2019 9:51:26 AM org.archive.extract.ExtractingResourceProducer getNext
INFO: Extracting (class org.archive.resource.warc.WARCResource) with (class org.archive.resource.http.HTTPResponseResourceFactory)

Nov 21, 2019 9:51:26 AM org.archive.extract.ExtractingResourceProducer getNext
INFO: Extracting (class org.archive.resource.http.HTTPResponseResource) with (class org.archive.resource.html.HTMLResourceFactory)

The log messages are not really informative and add up to 40 MB per processed WARC file. For one monthly crawl (Common Crawl) the logs occupy 5-7 TiB on HDFS, an unneeded waste of resources.
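One possible way to silence these messages is sketched below. The timestamp layout of the log lines suggests java.util.logging, and the sketch assumes the logger is named after the class (org.archive.extract.ExtractingResourceProducer); both points are assumptions, not confirmed by the project. The class QuietExtractionLogging and its method quiet() are hypothetical names used only for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietExtractionLogging {
    // Assumption: the logger is named after the class that emits the messages.
    static final String PRODUCER_LOGGER =
            "org.archive.extract.ExtractingResourceProducer";

    // Keep a strong reference: java.util.logging holds loggers weakly, so a
    // level set on an otherwise unreferenced logger can be lost after GC.
    private static final Logger PRODUCER = Logger.getLogger(PRODUCER_LOGGER);

    /** Raise the threshold so that per-record INFO messages are dropped. */
    public static void quiet() {
        PRODUCER.setLevel(Level.WARNING);
    }

    public static void main(String[] args) {
        quiet();
        PRODUCER.info("Extracting ... (dropped after quiet())");
        PRODUCER.warning("warnings are still reported");
    }
}
```

Calling quiet() once at task start-up (for example, when the mapper is configured) would be enough in this sketch; the same effect could likely be reached with a logging.properties entry instead of code.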

Add attribute `property` of HTML meta elements

(reported by Christian Lund on Common Crawl Google group)

For HTML meta elements, only the attributes name, rel, content and http-equiv are extracted. The attribute property is missing, which leads to unpaired, value-only items in the WAT file

"HTML-Metadata":{"Head":{"Metas":[...,{"content":"website"},...]}}

e.g., for Open Graph properties

<meta property="og:type" content="website"/>

property is an RDFa attribute and is not part of the HTML standard. However, it's widely used. The WAT specification describes the data contained in HTML-Metadata as "attributes and values of HTML head elements: title, base, style, link, meta and script". There is no explicit restriction to attributes covered by one of the HTML standards.
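The requested change can be pictured as extending an attribute allow-list. The following is a minimal sketch, not the extractor's actual code: MetaAttributeFilter, EXTRACTED and extract() are hypothetical names, and the real implementation may structure this differently.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class MetaAttributeFilter {
    // The four attributes named in the issue, plus the proposed "property".
    static final List<String> EXTRACTED =
            Arrays.asList("name", "rel", "content", "http-equiv", "property");

    /** Keep only the allow-listed attributes of one meta element. */
    static Map<String, String> extract(Map<String, String> attributes) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            // HTML attribute names are case-insensitive.
            String key = e.getKey().toLowerCase(Locale.ROOT);
            if (EXTRACTED.contains(key)) {
                kept.put(key, e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // <meta property="og:type" content="website"/>
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("property", "og:type");
        meta.put("content", "website");
        System.out.println(extract(meta));
    }
}
```

With "property" in the list, the content value stays paired with its key rather than appearing as a value-only item.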

WEATGenerator to recover from failed tasks with partial uploads

If a task of WEATGenerator fails while uploading the resulting WAT and WET files (e.g., due to a task timeout), an unpaired WAT or WET file may remain. This causes restarted tasks to fail:

17/06/30 19:20:51 INFO mapreduce.Job: Task Id : attempt_1497855985973_0182_m_000097_1001, Status : FAILED
Error: java.io.IOException: org.apache.hadoop.fs.FileAlreadyExistsException: s3a://commoncrawl/crawl-data/CC-MAIN-2017-26/segments/1498128320063.74/wat/CC-MAIN-20170623133357-20170623153357-00390.warc.wat.gz already exists
        at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:126)
        at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:48)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: s3a://commoncrawl/crawl-data/CC-MAIN-2017-26/segments/1498128320063.74/wat/CC-MAIN-20170623133357-20170623153357-00390.warc.wat.gz already exists
        at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:633)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:925)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:803)
        at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:97)
        ... 9 more

and makes it necessary to manually remove the unpaired file and restart the job with a new list of WARC files to be converted to WAT/WET. Manual intervention is slow and error-prone. Ideally, WEATGenerator should log an unpaired file and overwrite it.
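The "log and overwrite" behaviour can be sketched as follows. To keep the example runnable without a Hadoop cluster, it uses java.nio.file on the local filesystem as a stand-in; in Hadoop the analogous call would be FileSystem.create(Path, boolean overwrite) with overwrite set to true. OverwriteLeftover and createOverwriting() are hypothetical names, and this is not the project's actual fix.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class OverwriteLeftover {
    /**
     * Open the target for writing. If a file from an earlier failed attempt
     * is present, report it and truncate it instead of failing.
     */
    static OutputStream createOverwriting(Path target) throws IOException {
        if (Files.exists(target)) {
            System.err.println("Overwriting leftover output: " + target);
        }
        return Files.newOutputStream(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);
    }

    public static void main(String[] args) throws IOException {
        // Simulate an unpaired file left behind by a failed task attempt.
        Path wat = Files.createTempFile("example", ".warc.wat.gz");
        Files.write(wat, "partial".getBytes(StandardCharsets.UTF_8));

        // The restarted attempt replaces it rather than throwing.
        try (OutputStream out = createOverwriting(wat)) {
            out.write("complete".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(new String(Files.readAllBytes(wat), StandardCharsets.UTF_8));
        Files.delete(wat);
    }
}
```

Overwriting is safe here only because a restarted task regenerates the same output from the same WARC input; if that assumption did not hold, deleting and logging the leftover before writing would be the more cautious choice.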
