
Comments (6)

machawk1 commented on May 25, 2024

Add jwattools as a potential means of validation. Some of the generated WARCs are perfectly valid. The troublemaking resources/captures/transactions need to be identified, and their handlers in the code repaired.

from warcreate.

machawk1 commented on May 25, 2024

This is hit-or-miss depending on the webpage and is tied to #7, so it is difficult to debug. Concrete examples with good, isolated testing methods are needed.


machawk1 commented on May 25, 2024

René Voorburg reported:

I tried to create a WARC for http://www.faz.net/. However, when I tried to create a CDX index for it, I got an error. I tried both the CDX creator from Wayback (1.6) and the one from OpenWayback. The error is below. I used Chrome on Mac OS X.

23-jan-2015 16:10:37 org.archive.io.ArchiveReader$ArchiveRecordIterator hasNext

WARNING: Trying skip of failed record cleanup of {WARC-Type=metadata, reader-identifier=/home/kbuser/20150123091322216.warc, WARC-Date=2015-01-23T09:13:26Z, absolute-offset=1735, Content-Length=92184, WARC-Record-ID=urn:uuid:6fef2a49-a9ba-4b40-9f4a-5ca5db1fd5c6, WARC-Target-URI=http://www.faz.net/, WARC-Concurrent-To=urn:uuid:dddc4ba2-c1e1-459b-8d0d-a98a20b87e96, Content-Type=application/warc-fields}: Unexpected character 61(Expecting d)

23-jan-2015 16:10:37 org.archive.io.ArchiveReader$ArchiveRecordIterator hasNext

WARNING: Trying skip of failed record cleanup of {WARC-Type=metadata, reader-identifier=/home/kbuser/20150123091322216.warc, WARC-Date=2015-01-23T09:13:26Z, absolute-offset=1735, Content-Length=92184, WARC-Record-ID=urn:uuid:6fef2a49-a9ba-4b40-9f4a-5ca5db1fd5c6, WARC-Target-URI=http://www.faz.net/, WARC-Concurrent-To=urn:uuid:dddc4ba2-c1e1-459b-8d0d-a98a20b87e96, Content-Type=application/warc-fields}: Unexpected character 6c(Expecting d)

23-jan-2015 16:10:37 org.archive.io.ArchiveReader$ArchiveRecordIterator next

WARNING: Bad Record. Trying skip (Current offset 94220): Unexpected character 6c(Expecting d)

Exception in thread "main" java.lang.RuntimeException: After retry (Offset 94220)

            at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:533)

            at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:459)

            at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:40)

            at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:29)

            at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:53)

            at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:52)

            at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:52)

            at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:209)

Caused by: java.io.IOException: Unexpected character 2d(Expecting d)

            at org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:79)

            at org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:69)

            at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:189)

            at org.archive.io.ArchiveReader.get(ArchiveReader.java:139)

            at org.archive.io.ArchiveReader$ArchiveRecordIterator.innerNext(ArchiveReader.java:583)

            at org.archive.io.ArchiveReader$ArchiveRecordIterator.exceptionNext(ArchiveReader.java:558)

            at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:526)

            ... 7 more

Null message body; hope that's ok
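For readers of the trace above: the character codes in the messages appear to be hexadecimal, so "Expecting d" would mean the reader wanted a carriage return (0x0d) at the end of the record and found ordinary text instead. One common way a writer ends up in that state is by declaring Content-Length as a character count rather than a byte count. That is only an assumption here; the thread does not confirm it as the cause. A minimal Python sketch of both points, where `decode_hex_char` is a hypothetical helper and the sample block text is made up for illustration:

```python
def decode_hex_char(code: str) -> str:
    """Interpret a hex character code from the log as a character."""
    return chr(int(code, 16))


# Codes copied from the log messages, plus the expected one ("d").
for code in ("61", "6c", "2d", "d"):
    print(code, repr(decode_hex_char(code)))

# Hypothetical block text containing one non-ASCII character.
block = "Frankfurter Allgemeine: Über uns"
char_count = len(block)                  # what a naive string length reports
byte_count = len(block.encode("utf-8"))  # what Content-Length must state

# If the shorter count is written as Content-Length, a reader skips too few
# bytes and lands inside the block instead of on the trailing CRLF CRLF.
print(char_count, byte_count)
```

Whether WARCreate computed lengths this way is not stated in this thread; the sketch only shows why a mismatch of this kind produces "unexpected character" errors in a strict reader.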


machawk1 commented on May 25, 2024

Using https://sbforge.org/display/JWAT/JWAT might be a good solution, at least for testing.


shawnmjones commented on May 25, 2024

Some of them work fine. Some do not validate in pywb 0.33.0.

Here is the output from pywb for the attached file:

2016-11-30 13:04:23,294: [INFO]: Copied /Users/shawnjones/tmp/20161128215640666.warc to /Users/shawnjones/playback/collections/testcoll/archive
    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: 67348
    Remainder: b'ARC/1.0\r\n'
Error: Invalid WARC record, first line: WARC-Type: request

20161128215640666.warc.txt
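The pywb warning above ("Record not followed by newline, perhaps Content-Length is invalid") suggests an isolated check that could help pin down which records are mis-framed. The following is a rough sketch, not pywb's or JWAT's actual logic: `check_warc_framing` is a hypothetical helper that assumes an uncompressed WARC and only verifies that each record's declared Content-Length is followed by the CRLF CRLF record terminator.

```python
import io

CRLF = b"\r\n"


def check_warc_framing(stream):
    """Yield (offset, warc_type, problem) per record; problem is None if OK.

    Stops at the first mis-framed record, because later offsets can no
    longer be trusted once a Content-Length is wrong.
    """
    while True:
        offset = stream.tell()
        version = stream.readline()
        if not version:
            return  # clean end of file
        if not version.startswith(b"WARC/"):
            yield offset, None, f"expected a WARC/ version line, got {version[:20]!r}"
            return
        # Read header lines up to the blank line that ends the header block.
        headers = {}
        for line in iter(stream.readline, b""):
            if line in (CRLF, b"\n"):
                break
            name, _, value = line.partition(b":")
            headers[name.strip().lower()] = value.strip()
        warc_type = headers.get(b"warc-type", b"?").decode("ascii", "replace")
        try:
            length = int(headers[b"content-length"])
        except (KeyError, ValueError):
            yield offset, warc_type, "missing or non-numeric Content-Length"
            return
        stream.seek(length, io.SEEK_CUR)  # skip the declared block
        trailer = stream.read(4)
        if trailer != CRLF + CRLF:
            yield offset, warc_type, f"block not followed by CRLF CRLF, found {trailer!r}"
            return
        yield offset, warc_type, None


if __name__ == "__main__":
    # Two synthetic records; the second declares a Content-Length that is too short.
    good = b"WARC/1.0\r\nWARC-Type: request\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
    bad = b"WARC/1.0\r\nWARC-Type: metadata\r\nContent-Length: 3\r\n\r\nhello\r\n\r\n"
    for offset, warc_type, problem in check_warc_framing(io.BytesIO(good + bad)):
        print(offset, warc_type, problem or "ok")
```

To try it on a real capture, pass a file opened in binary mode, e.g. `open(path, "rb")`. The offset reported for the first failing record is the place to start looking at how that record type is written.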


machawk1 commented on May 25, 2024

Generated test WARCs. They now validate. Thanks, @N0taN3rd

