Comments (6)
Add jwattools as a potential means of validation. Some WARCs generated are perfectly valid. The troublemaking resources/captures/transactions need to be identified and have their handlers repaired in the code.
from warcreate.
This is hit-or-miss depending on the webpage and is tied to #7, which makes it difficult to debug. Some concrete examples are needed, along with good, isolated testing methods.
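As one possible isolated test, the sketch below walks an uncompressed WARC file record by record and checks that each block of the declared Content-Length is followed by the CRLF CRLF record terminator. It is a minimal illustration, not a full validator and not part of WARCreate; the names `check_warc` and `make_record` are hypothetical.

```python
import io

RECORD_END = b"\r\n\r\n"


def check_warc(stream):
    """Return a list of (record_offset, problem) for an uncompressed WARC.

    Scanning stops at the first problem, because later offsets cannot be
    trusted once a declared length is wrong.
    """
    problems = []
    while True:
        offset = stream.tell()
        version = stream.readline()
        if not version:
            break  # clean end of file
        if not version.startswith(b"WARC/"):
            problems.append((offset, "expected version line, got %r" % version[:40]))
            break
        # Read named header fields until the blank line that ends the header.
        length = None
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                try:
                    length = int(value.strip())
                except ValueError:
                    length = None
        if length is None:
            problems.append((offset, "missing or non-numeric Content-Length"))
            break
        block = stream.read(length)
        if len(block) != length:
            problems.append((offset, "file ends before Content-Length bytes were read"))
            break
        trailer = stream.read(len(RECORD_END))
        if trailer != RECORD_END:
            problems.append((offset, "block not followed by CRLF CRLF, got %r" % trailer))
            break
    return problems


def make_record(payload, declared=None):
    """Build a tiny test record; `declared` overrides the real length."""
    declared = len(payload) if declared is None else declared
    header = b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: %d\r\n\r\n" % declared
    return header + payload + RECORD_END
```

A quick check is `check_warc(io.BytesIO(make_record(b"hello", declared=6)))`, which reports a problem at offset 0 because the declared length is one byte too long. For a real capture, the stream would be opened with `open(path, "rb")`.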
René Voorburg reported:
I tried to create a WARC for http://www.faz.net/. However, when I tried to create a CDX index for it, I got an error. I tried it with both the CDX creator from Wayback (1.6) and OpenWayback. The error is below. I used Chrome on Mac OS X.
23-jan-2015 16:10:37 org.archive.io.ArchiveReader$ArchiveRecordIterator hasNext
WARNING: Trying skip of failed record cleanup of {WARC-Type=metadata, reader-identifier=/home/kbuser/20150123091322216.warc, WARC-Date=2015-01-23T09:13:26Z, absolute-offset=1735, Content-Length=92184, WARC-Record-ID=urn:uuid:6fef2a49-a9ba-4b40-9f4a-5ca5db1fd5c6, WARC-Target-URI=http://www.faz.net/, WARC-Concurrent-To=urn:uuid:dddc4ba2-c1e1-459b-8d0d-a98a20b87e96, Content-Type=application/warc-fields}: Unexpected character 61(Expecting d)
23-jan-2015 16:10:37 org.archive.io.ArchiveReader$ArchiveRecordIterator hasNext
WARNING: Trying skip of failed record cleanup of {WARC-Type=metadata, reader-identifier=/home/kbuser/20150123091322216.warc, WARC-Date=2015-01-23T09:13:26Z, absolute-offset=1735, Content-Length=92184, WARC-Record-ID=urn:uuid:6fef2a49-a9ba-4b40-9f4a-5ca5db1fd5c6, WARC-Target-URI=http://www.faz.net/, WARC-Concurrent-To=urn:uuid:dddc4ba2-c1e1-459b-8d0d-a98a20b87e96, Content-Type=application/warc-fields}: Unexpected character 6c(Expecting d)
23-jan-2015 16:10:37 org.archive.io.ArchiveReader$ArchiveRecordIterator next
WARNING: Bad Record. Trying skip (Current offset 94220): Unexpected character 6c(Expecting d)
Exception in thread "main" java.lang.RuntimeException: After retry (Offset 94220)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:533)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:459)
at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:40)
at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:29)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:53)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:52)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:52)
at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:209)
Caused by: java.io.IOException: Unexpected character 2d(Expecting d)
at org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:79)
at org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:69)
at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:189)
at org.archive.io.ArchiveReader.get(ArchiveReader.java:139)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.innerNext(ArchiveReader.java:583)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.exceptionNext(ArchiveReader.java:558)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:526)
... 7 more
Null message body; hope that's ok
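The numbers in the messages above look like character codes printed in hexadecimal. That reading is an assumption, not something confirmed in this thread. Under it, the short sketch below decodes the reported values; `describe` is a hypothetical helper.

```python
def describe(hex_code):
    """Return the repr of the character whose code is given as a hex string."""
    return repr(chr(int(hex_code, 16)))


# Values taken from the log: three "unexpected" characters and the expected one.
for code in ("61", "6c", "2d", "d"):
    print(code, "->", describe(code))
# prints:
# 61 -> 'a'
# 6c -> 'l'
# 2d -> '-'
# d -> '\r'
```

If this reading is right, the reader expected a carriage return at the end of the record and found ordinary text instead, which would be consistent with a declared Content-Length that does not match the actual record block.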
Using https://sbforge.org/display/JWAT/JWAT might be a good solution, at least for testing.
Some of them work fine. Some do not validate in pywb 0.33.0.
Here is the output from pywb for the attached file:
2016-11-30 13:04:23,294: [INFO]: Copied /Users/shawnjones/tmp/20161128215640666.warc to /Users/shawnjones/playback/collections/testcoll/archive
WARNING: Record not followed by newline, perhaps Content-Length is invalid
Offset: 67348
Remainder: b'ARC/1.0\r\n'
Error: Invalid WARC record, first line: WARC-Type: request
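Since pywb reports a byte offset, one way to narrow this down is to look at the raw bytes on either side of that offset in the WARC file. The sketch below is a minimal helper for that; `peek` is a hypothetical name. The thread does not say whether the reported offset marks the start of the failing record or the reader's current position, so the window may need adjusting.

```python
def peek(path, offset, before=32, after=32):
    """Return (bytes_before, bytes_after) around `offset` in the file at `path`."""
    start = max(0, offset - before)
    with open(path, "rb") as handle:
        handle.seek(start)
        data = handle.read((offset - start) + after)
    split = offset - start
    return data[:split], data[split:]


# Example call, using the file name and offset from the output above:
# left, right = peek("20161128215640666.warc", 67348)
# print(left)
# print(right)
```

Printing both halves as `bytes` keeps the CR and LF characters visible, which makes it easier to see whether a record terminator is missing or a block is longer or shorter than its header declares.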
Generated test WARCs. They now validate. Thanks, @N0taN3rd