commoncrawl / ia-hadoop-tools
This project forked from aloisius/ia-hadoop-tools
Web archiving tools on Hadoop
When transforming WARC to WAT/WET (org.archive.hadoop.jobs.WEATGenerator) the ExtractingResourceProducer logs every WARC record extensively:
Nov 21, 2019 9:51:26 AM org.archive.extract.ExtractingResourceProducer getNext
INFO: Extracting (class org.archive.resource.warc.WARCResource) with (class org.archive.resource.http.HTTPRequestResourceFactory)
Nov 21, 2019 9:51:26 AM org.archive.extract.ExtractingResourceProducer getNext
INFO: Extracting (class org.archive.resource.warc.WARCResource) with (class org.archive.resource.http.HTTPResponseResourceFactory)
Nov 21, 2019 9:51:26 AM org.archive.extract.ExtractingResourceProducer getNext
INFO: Extracting (class org.archive.resource.http.HTTPResponseResource) with (class org.archive.resource.html.HTMLResourceFactory)
The log messages carry little information and add up to as much as 40 MB per processed WARC file. For one monthly crawl (Common Crawl) the logs occupy 5-7 TiB on HDFS, an unneeded waste of resources.
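One possible way to silence these messages is sketched below. It assumes that ExtractingResourceProducer logs through java.util.logging with a logger named after the class (the two-line format above looks like the default SimpleFormatter output, but this has not been checked against the source). The class name QuietExtractorLogging and the method quiet() are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietExtractorLogging {
    // Assumption: the producer's logger is named after its class.
    // A strong reference is kept because java.util.logging only holds
    // loggers weakly, so a level set on an unreferenced logger can be lost.
    private static final Logger EXTRACTOR_LOGGER =
            Logger.getLogger("org.archive.extract.ExtractingResourceProducer");

    /** Raise the threshold so that INFO records are dropped. */
    public static void quiet() {
        EXTRACTOR_LOGGER.setLevel(Level.WARNING);
    }
}
```

Calling QuietExtractorLogging.quiet() once per task JVM, before the first record is processed, would be enough under these assumptions. Warnings and errors would still be logged.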
(reported by Christian Lund on Common Crawl Google group)
For HTML meta elements, only the attributes name, rel, content and http-equiv are extracted. The attribute property is missing, which leads to unpaired, value-only items in the WAT file:
"HTML-Metadata":{"Head":{"Metas":[...,{"content":"website"},...]}}
e.g., for Open Graph properties:
<meta property="og:type" content="website"/>
property is an RDFa attribute and is not part of the HTML standard. However, it is widely used. The WAT specification describes the data contained in HTML-Metadata as "attributes and values of HTML head elements: title, base, style, link, meta and script". There is no explicit restriction to attributes covered by one of the HTML standards.
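The effect of the missing attribute can be illustrated with a small stand-alone sketch. This is not the extractor's actual code: MetaAttributeFilter, extract and the two lists are hypothetical names, and the sketch assumes attribute names have already been lower-cased.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetaAttributeFilter {
    // Attributes reported as extracted today.
    static final List<String> CURRENT =
            Arrays.asList("name", "rel", "content", "http-equiv");
    // Same list with the RDFa attribute "property" added.
    static final List<String> WITH_PROPERTY =
            Arrays.asList("name", "rel", "content", "http-equiv", "property");

    /** Copy only the attributes named in keep, in the order given by keep. */
    static Map<String, String> extract(Map<String, String> attrs, List<String> keep) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String key : keep) {
            String value = attrs.get(key);
            if (value != null) {
                out.put(key, value);
            }
        }
        return out;
    }
}
```

For the Open Graph example, filtering with CURRENT keeps only content=website, which is the unpaired item seen in the WAT file. Filtering with WITH_PROPERTY keeps both content=website and property=og:type.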
If a task of WEATGenerator fails while uploading the resulting WAT and WET files (e.g., due to a task timeout), an unpaired WAT or WET file may remain. This causes restarted tasks to fail:
17/06/30 19:20:51 INFO mapreduce.Job: Task Id : attempt_1497855985973_0182_m_000097_1001, Status : FAILED
Error: java.io.IOException: org.apache.hadoop.fs.FileAlreadyExistsException: s3a://commoncrawl/crawl-data/CC-MAIN-2017-26/segments/1498128320063.74/wat/CC-MAIN-20170623133357-20170623153357-00390.warc.wat.gz already exists
at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:126)
at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:48)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: s3a://commoncrawl/crawl-data/CC-MAIN-2017-26/segments/1498128320063.74/wat/CC-MAIN-20170623133357-20170623153357-00390.warc.wat.gz already exists
at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:633)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:925)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:803)
at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:97)
... 9 more
and makes it necessary to manually remove the unpaired file and restart the job with a new list of WARC files to be converted to WAT/WET. Manual interaction is slow and error-prone. Ideally, WEATGenerator should log an unpaired file and overwrite it.
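The "log and overwrite" behaviour can be sketched on the local file system with java.nio.file as a stand-in. WEATGenerator itself writes through Hadoop's FileSystem API, where the create(Path, boolean overwrite) overload plays the analogous role. The class OverwriteLeftover and the method openForOverwrite are hypothetical and not part of the project.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.logging.Logger;

public class OverwriteLeftover {
    private static final Logger LOG =
            Logger.getLogger(OverwriteLeftover.class.getName());

    /**
     * Open target for writing. If a file from an earlier failed attempt
     * is still there, log it and truncate it instead of failing.
     */
    static OutputStream openForOverwrite(Path target) throws IOException {
        if (Files.exists(target)) {
            LOG.warning("Overwriting leftover output file: " + target);
        }
        return Files.newOutputStream(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);
    }
}
```

With this pattern a restarted task replaces the unpaired file and leaves a log entry, so no manual cleanup is needed. Whether overwriting is safe depends on no other task attempt writing the same path at the same time.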