archive-commons's Issues
WAT extractor produces invalid WARC records.
While examining WAT files produced by the WAT extractor tool, I noticed that the end-of-record boundary is incorrect.
From the WARC draft spec:
warc-file   = 1*warc-record
warc-record = header CRLF
              block CRLF CRLF
header      = version warc-fields
version     = "WARC/0.18" CRLF
warc-fields = *named-field CRLF
block       = *OCTET
But in:
Only a single CRLF is written after the block. Also, that single CRLF is included in the Content-Length value, which also violates the WARC spec: Content-Length should count only the bytes of the block itself, excluding the record-terminating CRLF CRLF sequence.
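For reference, here is a minimal sketch of what a conforming writer does (illustrative names only, not the actual WARCRecordWriter code): Content-Length covers only the block bytes, and the record is terminated by a CRLF CRLF that is not counted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Illustrative sketch, not the actual WARCRecordWriter: emit one WARC record
// whose Content-Length covers only the block, terminated by CRLF CRLF.
public class WarcRecordSketch {
    static final String CRLF = "\r\n";

    public static byte[] writeRecord(String namedFields, byte[] block) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(("WARC/0.18" + CRLF).getBytes(StandardCharsets.US_ASCII)); // version
        out.write(namedFields.getBytes(StandardCharsets.US_ASCII));          // named fields
        // Content-Length counts the block bytes only -- never the trailing CRLFs.
        out.write(("Content-Length: " + block.length + CRLF).getBytes(StandardCharsets.US_ASCII));
        out.write(CRLF.getBytes(StandardCharsets.US_ASCII));                 // end of warc-fields
        out.write(block);                                                    // block
        out.write((CRLF + CRLF).getBytes(StandardCharsets.US_ASCII));        // record terminator
        return out.toByteArray();
    }
}
```

The current writer instead emits a single CRLF and folds it into Content-Length; both problems need fixing together, since readers skip exactly Content-Length bytes and then expect CRLF CRLF.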
option for ArchiveJSONViewLoader to extract HTTP body
Is it planned to include an option for the ArchiveJSONViewLoader to extract the complete HTTP body such that it can be analysed with Pig, too?
FileBackedOutputStreams not cleaned-up
In most places where a FileBackedOutputStream is created, the 'resetOnFinalize' flag is omitted, so it defaults to 'false'; ergo the files are not deleted when the JVM exits, and we get tons and tons of the temp backing files accumulating in /tmp.
See:
src/main/java/org/archive/extract/WATExtractorOutput.java
src/main/java/org/archive/format/gzip/GZIPMemberWriter.java
src/main/java/org/archive/format/warc/WARCRecordWriter.java
Oddly enough, the flag is set to 'true' in this file:
src/main/java/org/archive/format/gzip/GZIPMemberWriterCommittedOutputStream.java
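For anyone fixing this: Guava's two-argument constructor, new FileBackedOutputStream(threshold, true), arranges for the backing file to be cleaned up on finalization. A stdlib-only sketch of the same cleanup idea (names are mine, not from the codebase):

```java
import java.io.File;
import java.io.IOException;

// Stdlib-only sketch of the cleanup that 'resetOnFinalize' provides: ensure
// the temp backing file is removed at JVM exit instead of piling up in /tmp.
public class TempFileCleanup {
    public static File createBackingFile() throws IOException {
        File f = File.createTempFile("fbos-", ".tmp");
        f.deleteOnExit(); // without cleanup like this, /tmp fills with orphaned files
        return f;
    }
}
```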
HTTPImportMapper --soft-fails deletes file in HDFS if it isn't found on remote server.
When using the HTTPImport feature of the archive-commons library, you can unexpectedly delete files from HDFS by enabling --soft-fails.
Suppose you have a file foo.warc.gz in HDFS. You then try to import a remote file with the same name via HTTP, such as http://example.org/foo.warc.gz. Now suppose that the remote web server is having a problem and sends a 404 or 500 response.
By default, without --soft-fails, the HTTPImport code will see the non-200 response and the task will fail. Your local copy of foo.warc.gz is safe.
However, if you enable --soft-fails, then the non-200 is essentially ignored: the HTTPImportMapper proceeds to delete the copy of foo.warc.gz in HDFS, then tries to re-get the file from the remote server, which of course fails. Doh!
I was bitten by this when importing a large list of (w)arc files from some unreliable repository servers. I was using --soft-fails because the remote servers were flaky and I didn't want one failed transfer to abort the entire job. My plan was to run the big import job, and if some of the files failed to transfer because of remote server troubles, simply run the job again to pick up more files... wash, rinse, repeat.
Much to my surprise, if file foo.warc.gz was transferred successfully on pass 1, but on pass 2 the remote server sent a 500, then foo.warc.gz was deleted from HDFS! Doh!
It looks like HTTPImportMapper has a logic bug: when --soft-fails is enabled, it ignores the remote server's non-200 response, but instead of returning, it proceeds to remove the local file (in HDFS) and then tries to re-get it from the remote server, which of course fails.
Here's a simple way to see the bug in action:
- Have a file foo in HDFS
- Use HTTPImport to import http://archive.org/404/foo with --soft-fails
- Notice your copy of foo is deleted because the server sent a 404.
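The shape of the fix can be sketched like this (class and method names are hypothetical, not the real HTTPImportMapper API): on a non-200 response with soft-fails enabled, return early, before the existing HDFS copy is touched.

```java
// Hypothetical sketch of the fixed control flow; names do not match the real
// HTTPImportMapper. The key point: on a non-200 with soft-fails enabled,
// return early -- before the existing HDFS copy is deleted.
public class SoftFailSketch {
    public static String importFile(int httpStatus, boolean softFails, String existingHdfsCopy) {
        if (httpStatus != 200) {
            if (softFails) {
                return existingHdfsCopy; // keep the local copy; skip delete + re-get
            }
            throw new RuntimeException("HTTP " + httpStatus); // hard failure: task fails
        }
        // 200 OK: safe to replace -- delete the existing copy, then fetch the remote file.
        return "fetched";
    }
}
```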
Unable to build
I am currently unable to build archive-commons. Here is how you can (I think) reproduce the problem. This is as of commit b359080:
$ rm -rf ~/.m2/repository/
$ cd archive-commons
$ mvn package
[INFO] Scanning for projects...
[INFO] Reactor build order:
[INFO] archive-surt
[INFO] archive-commons
[INFO] ia-tools
[INFO] Unnamed - org.archive:archive-commons-parent:pom:0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] Building archive-surt
[INFO] task-segment: [package]
[INFO] ------------------------------------------------------------------------
[INFO] [resources:resources {execution: default-resources}]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/egh/d/software/archive-commons/archive-surt/src/main/resources
Downloading: http://builds.archive.org:8080/maven2/org/archive/heritrix/heritrix-commons/3.1.0-SNAPSHOT/heritrix-commons-3.1.0-SNAPSHOT.pom
10K downloaded (heritrix-commons-3.1.0-SNAPSHOT.pom)
[WARNING] Unable to get resource 'org.archive.heritrix:heritrix-commons:pom:3.1.0-SNAPSHOT' from repository internetarchive (http://builds.archive.org:8080/maven2): Error retrieving checksum file for org/archive/heritrix/heritrix-commons/3.1.0-SNAPSHOT/heritrix-commons-3.1.0-SNAPSHOT.pom
Downloading: http://builds.archive.org:8080/maven2/org/archive/heritrix/heritrix-commons/3.1.0-SNAPSHOT/heritrix-commons-3.1.0-SNAPSHOT.jar
443K downloaded (heritrix-commons-3.1.0-SNAPSHOT.jar)
[WARNING] Unable to get resource 'org.archive.heritrix:heritrix-commons:jar:3.1.0-SNAPSHOT' from repository internetarchive (http://builds.archive.org:8080/maven2): Error retrieving checksum file for org/archive/heritrix/heritrix-commons/3.1.0-SNAPSHOT/heritrix-commons-3.1.0-SNAPSHOT.jar
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] Failed to resolve artifact.
Missing:
----------
1) org.archive.heritrix:heritrix-commons:jar:3.1.0-SNAPSHOT
Try downloading the file manually from the project website.
Then, install it using the command:
mvn install:install-file -DgroupId=org.archive.heritrix -DartifactId=heritrix-commons -Dversion=3.1.0-SNAPSHOT -Dpackaging=jar -Dfile=/path/to/file
Alternatively, if you host your own repository you can deploy the file there:
mvn deploy:deploy-file -DgroupId=org.archive.heritrix -DartifactId=heritrix-commons -Dversion=3.1.0-SNAPSHOT -Dpackaging=jar -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]
Path to dependency:
1) org.archive:archive-surt:jar:1.0-SNAPSHOT
2) org.archive.heritrix:heritrix-commons:jar:3.1.0-SNAPSHOT
----------
1 required artifact is missing.
for artifact:
org.archive:archive-surt:jar:1.0-SNAPSHOT
from the specified remote repositories:
central (http://repo1.maven.org/maven2),
internetarchive (http://builds.archive.org:8080/maven2)
[INFO] ------------------------------------------------------------------------
[INFO] For more information, run Maven with the -e switch
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 4 seconds
[INFO] Finished at: Mon Jul 02 15:05:20 PDT 2012
[INFO] Final Memory: 8M/45M
[INFO] ------------------------------------------------------------------------
WAT generation ignores HTML5 audio and video tags.
It appears that WAT generation doesn't know about the HTML5 tags related to audio and video (e.g. <audio>, <video>, and their <source> children), so the URLs those tags reference are not captured as outlinks.
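As a rough illustration of what is being missed (this regex approach is mine, not the extractor's actual HTML parsing), outlink extraction for these tags would need to pick up src/poster attributes on elements like <audio>, <video>, <source>, and <track>:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough illustration only -- a real fix belongs in the extractor's HTML parser.
// Finds the first src/poster URL in each HTML5 audio/video-related tag.
public class Html5LinkSketch {
    private static final Pattern MEDIA_LINK = Pattern.compile(
            "<(?:audio|video|source|track)\\b[^>]*?\\b(?:src|poster)\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = MEDIA_LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```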