archive-commons's Issues

WAT extractor produces invalid WARC records.

While examining WAT files produced by the WAT extractor tool, I noticed that the end-of-record boundary is written incorrectly.

From the WARC draft spec:

    warc-file = 1*warc-record
    warc-record = header CRLF
                  block CRLF CRLF
    header = version warc-fields
    version = "WARC/0.18" CRLF
    warc-fields = *named-field CRLF
    block = *OCTET

But in:

private void writeRecord(OutputStream out, HttpHeaders headers,

only a single CRLF is written after the block. That single CRLF is also counted in the Content-Length value, which violates the WARC spec as well: Content-Length should count only the bytes in the block itself, not the record-terminating CRLF CRLF sequence.
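
For illustration, here is a minimal sketch of what a conforming write could look like. This is not the project's actual writeRecord: the class name WarcRecordSketch, the simplified signature, and the two header fields are assumptions made for the example, and a real record carries more named fields than shown here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the archive-commons implementation.
class WarcRecordSketch {
    private static final byte[] CRLF = {'\r', '\n'};

    /**
     * Writes one record: version line, named fields, the blank line that
     * ends the header, the block, and then the CRLF CRLF terminator.
     * Content-Length counts only the bytes of the block.
     */
    static void writeRecord(OutputStream out, String type, byte[] block) throws IOException {
        StringBuilder header = new StringBuilder();
        header.append("WARC/0.18\r\n");
        header.append("WARC-Type: ").append(type).append("\r\n");
        // block.length only: the terminating CRLF CRLF is not counted.
        header.append("Content-Length: ").append(block.length).append("\r\n");
        header.append("\r\n"); // blank line separating header and block
        out.write(header.toString().getBytes(StandardCharsets.UTF_8));
        out.write(block);
        out.write(CRLF); // record terminator, first CRLF
        out.write(CRLF); // record terminator, second CRLF
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeRecord(buf, "metadata", "hello".getBytes(StandardCharsets.UTF_8));
        System.out.write(buf.toByteArray());
        System.out.flush();
    }
}
```

The point of the sketch is that the length is taken from the block before anything else is appended, so the terminator can never leak into Content-Length.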

FileBackedOutputStreams not cleaned-up

In most places where a FileBackedOutputStream is created, the 'resetOnFinalize' flag is left out, so it defaults to 'false'. As a result, the temp backing files are not deleted when the JVM exits, and they pile up in /tmp.

See:
src/main/java/org/archive/extract/WATExtractorOutput.java
src/main/java/org/archive/format/gzip/GZIPMemberWriter.java
src/main/java/org/archive/format/warc/WARCRecordWriter.java

Oddly enough, the flag is set to 'true' in this file:
src/main/java/org/archive/format/gzip/GZIPMemberWriterCommittedOutputStream.java
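
Independent of how the flag is set, one way to avoid relying on finalization is to delete the backing file explicitly once the data has been consumed. The sketch below does not use FileBackedOutputStream at all; it only illustrates the explicit-cleanup pattern with java.nio.file, and the class and method names are made up for the example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of explicit temp-file cleanup; not archive-commons code.
class TempBackingFileSketch {
    /**
     * Spools data to a temp file and always deletes that file afterwards,
     * even if writing or later processing throws. The path is returned only
     * so a caller can check that the file is gone.
     */
    static Path spoolAndCleanUp(byte[] data) throws IOException {
        Path tmp = Files.createTempFile("backing-", ".tmp");
        try {
            try (OutputStream out = Files.newOutputStream(tmp)) {
                out.write(data);
            }
            // ... read the spooled bytes back and copy them to their destination ...
        } finally {
            Files.deleteIfExists(tmp); // cleanup does not depend on GC or JVM exit
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        Path p = spoolAndCleanUp(new byte[] {1, 2, 3});
        System.out.println("still exists: " + Files.exists(p));
    }
}
```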

HTTPImportMapper --soft-fails deletes file in HDFS if it isn't found on remote server.

When using the HTTPImport feature of the archive-commons library, you can unexpectedly delete files from HDFS by enabling --soft-fails.

Suppose you have a file foo.warc.gz in HDFS. You then try to import a remote file with the same name via http, such as http://example.org/foo.warc.gz. Now suppose that the remote web server is having a problem and sends a 404 or 500 response.

By default, without --soft-fails, the HTTPImport code will see the non-200 and the task will fail. Your local copy of foo.warc.gz is safe.

However, if you enable --soft-fails, then the non-200 is essentially ignored, the HTTPImportMapper proceeds to delete the copy of foo.warc.gz in HDFS, then it tries to re-get the file from the remote server, which of course fails. Doh!

I was bitten by this when importing a large list of (w)arc files from some unreliable repository servers. I was using --soft-fails because the remote servers were flaky and I didn't want one failed transfer to abort the entire job. My plan was to run the big import job, and if some of the files failed to transfer because of remote server troubles, simply run the job again to pick up more files...wash, rinse, repeat.

Much to my surprise, when foo.warc.gz was transferred successfully on pass 1 but the remote server sent a 500 on pass 2, foo.warc.gz was deleted from HDFS! Doh!

It looks like HTTPImportMapper has a logic bug where if --soft-fails is enabled, it ignores the remote server's non-200 response, but instead of returning, it proceeds to remove the local file (in HDFS) and then tries to re-get it from the remote server, which of course fails.

Here's a simple way to see the bug in action:

  1. Have a file foo in HDFS
  2. Use HTTPImport to import http://archive.org/404/foo with --soft-fails
  3. Notice your copy of foo is deleted because the server sent a 404.
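
The behavior I would expect can be sketched as follows. This is not the actual HTTPImportMapper code: the class, the Result enum, and the use of a Set as a stand-in for HDFS are all assumptions for the example. The only point is the ordering: return on a non-200 before touching the existing copy.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the expected control flow; not the real mapper.
class SoftFailImportSketch {
    /** Outcome of one import attempt. */
    enum Result { IMPORTED, SKIPPED, FAILED }

    /**
     * Handles one remote file. The store (a stand-in for HDFS) is modified
     * only after the remote server has answered 200.
     */
    static Result importFile(Set<String> store, String name, int httpStatus, boolean softFails) {
        if (httpStatus != 200) {
            // With soft-fails the error is tolerated, but we return here
            // instead of falling through to the delete below.
            return softFails ? Result.SKIPPED : Result.FAILED;
        }
        store.remove(name); // replace the old copy only now
        store.add(name);    // stands in for fetching and writing the new copy
        return Result.IMPORTED;
    }

    public static void main(String[] args) {
        Set<String> store = new HashSet<>();
        store.add("foo");
        Result r = importFile(store, "foo", 404, true);
        System.out.println(r + ", foo still present: " + store.contains("foo"));
    }
}
```

With this ordering, re-running a soft-failing job against a flaky server can add files but never removes ones that were already imported.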

Unable to build

I am currently unable to build archive-commons. Here is how you can (I think) reproduce the problem. This is as of commit b359080.

$ rm -rf ~/.m2/repository/
$ cd archive-commons
$ mvn package
[INFO] Scanning for projects...
[INFO] Reactor build order: 
[INFO]   archive-surt
[INFO]   archive-commons
[INFO]   ia-tools
[INFO]   Unnamed - org.archive:archive-commons-parent:pom:0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] Building archive-surt
[INFO]    task-segment: [package]
[INFO] ------------------------------------------------------------------------
[INFO] [resources:resources {execution: default-resources}]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/egh/d/software/archive-commons/archive-surt/src/main/resources
Downloading: http://builds.archive.org:8080/maven2/org/archive/heritrix/heritrix-commons/3.1.0-SNAPSHOT/heritrix-commons-3.1.0-SNAPSHOT.pom
10K downloaded  (heritrix-commons-3.1.0-SNAPSHOT.pom)
[WARNING] Unable to get resource 'org.archive.heritrix:heritrix-commons:pom:3.1.0-SNAPSHOT' from repository internetarchive (http://builds.archive.org:8080/maven2): Error retrieving checksum file for org/archive/heritrix/heritrix-commons/3.1.0-SNAPSHOT/heritrix-commons-3.1.0-SNAPSHOT.pom
Downloading: http://builds.archive.org:8080/maven2/org/archive/heritrix/heritrix-commons/3.1.0-SNAPSHOT/heritrix-commons-3.1.0-SNAPSHOT.jar
443K downloaded  (heritrix-commons-3.1.0-SNAPSHOT.jar)
[WARNING] Unable to get resource 'org.archive.heritrix:heritrix-commons:jar:3.1.0-SNAPSHOT' from repository internetarchive (http://builds.archive.org:8080/maven2): Error retrieving checksum file for org/archive/heritrix/heritrix-commons/3.1.0-SNAPSHOT/heritrix-commons-3.1.0-SNAPSHOT.jar
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] Failed to resolve artifact.

Missing:
----------
1) org.archive.heritrix:heritrix-commons:jar:3.1.0-SNAPSHOT

  Try downloading the file manually from the project website.

  Then, install it using the command: 
      mvn install:install-file -DgroupId=org.archive.heritrix -DartifactId=heritrix-commons -Dversion=3.1.0-SNAPSHOT -Dpackaging=jar -Dfile=/path/to/file

  Alternatively, if you host your own repository you can deploy the file there: 
      mvn deploy:deploy-file -DgroupId=org.archive.heritrix -DartifactId=heritrix-commons -Dversion=3.1.0-SNAPSHOT -Dpackaging=jar -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]

  Path to dependency: 
    1) org.archive:archive-surt:jar:1.0-SNAPSHOT
    2) org.archive.heritrix:heritrix-commons:jar:3.1.0-SNAPSHOT

----------
1 required artifact is missing.

for artifact: 
  org.archive:archive-surt:jar:1.0-SNAPSHOT

from the specified remote repositories:
  central (http://repo1.maven.org/maven2),
  internetarchive (http://builds.archive.org:8080/maven2)



[INFO] ------------------------------------------------------------------------
[INFO] For more information, run Maven with the -e switch
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 4 seconds
[INFO] Finished at: Mon Jul 02 15:05:20 PDT 2012
[INFO] Final Memory: 8M/45M
[INFO] ------------------------------------------------------------------------
