Comments (8)
That error message is saying that warcio thinks the file is invalid. Is it valid? If you can send me a copy, I'll look at it for you.
from warcio.
Looks to me like the metadata records in this WAT are missing a \r\n at the end of the body -- there are supposed to be 2 pairs, and there's only 1. So yes, it's invalid.
This is not that unusual in the WARC community; standards conformance has historically been hit-or-miss.
If this is a common problem, I think warcio ought to tolerate this weirdness.
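To make the diagnosis above concrete, here is a minimal sketch of the check in plain Python. It assumes a single uncompressed record held in memory and does not use warcio's own parser; `trailing_crlf_pairs` and `make_record` are hypothetical helper names introduced only for illustration. It locates the end of the headers, skips `Content-Length` bytes of body, and counts the `\r\n` pairs that follow.

```python
import re

CRLF = b"\r\n"


def trailing_crlf_pairs(record: bytes) -> int:
    """Count the CRLF pairs that follow the body of one uncompressed WARC record."""
    # The header block ends at the first blank line (CRLF CRLF).
    header_end = record.index(CRLF + CRLF) + 4
    match = re.search(
        rb"^Content-Length:[ \t]*(\d+)", record[:header_end], re.I | re.M
    )
    if match is None:
        raise ValueError("record has no Content-Length header")
    # The body is exactly Content-Length bytes long.
    tail = record[header_end + int(match.group(1)):]
    count = 0
    while tail.startswith(CRLF):
        count += 1
        tail = tail[2:]
    return count


def make_record(body: bytes, terminator: bytes) -> bytes:
    """Build a tiny synthetic metadata record (illustration only)."""
    header = (
        b"WARC/1.0\r\n"
        b"WARC-Type: metadata\r\n"
        b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n\r\n"
    )
    return header + body + terminator


good = make_record(b'{"a": 1}', b"\r\n\r\n")  # two pairs, as expected after the body
bad = make_record(b'{"a": 1}', b"\r\n")       # one pair, like the WAT in this issue
print(trailing_crlf_pairs(good), trailing_crlf_pairs(bad))
```

A record like `bad` is the case described here: the body is followed by only one `\r\n` pair instead of two.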
from warcio.
Thanks for checking it out.
The WAT is generated by archive-metadata-extractor, so I suppose the options are either to fix archive-metadata-extractor or to add tolerance in warcio. Is there a potential downside to making warcio tolerate the missing \r\n pair?
https://webarchive.jira.com/wiki/spaces/Iresearch/pages/14057510/archive-metadata-extractor.jar
Alternatively, are there other tools for going from WARC/ARC to WAT besides archive-metadata-extractor?
from warcio.
Which version of archive-metadata-extractor are you running? If it's the one linked from
https://webarchive.jira.com/wiki/spaces/Iresearch/pages/14057510/archive-metadata-extractor.jar
notice the comment at the bottom explaining that a newer version fixes this bug.
from warcio.
@wumpus Thanks so much again for the help and sorry we overlooked the comment about the bug.
I know it's not directly related to warcio, but we are unsure how to invoke webarchive-commons the same way we used to invoke the archive-metadata-extractor.jar from the command line to generate a WAT and weren't able to find docs on that. Any quick tips?
We appreciate it.
from warcio.
Building webarchive-commons generates two jar files under the target directory:
webarchive-commons-1.1.5-IA.jar
webarchive-commons-jar-with-dependencies.jar
When executing
java -jar webarchive-commons-jar-with-dependencies.jar
it throws this error message
no main manifest attribute, in webarchive-commons-jar-with-dependencies.jar
Any suggestions?
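For context on that error: `java -jar` needs a `Main-Class` entry in the jar's `META-INF/MANIFEST.MF`, and "no main manifest attribute" means the entry is absent. A minimal Python sketch for confirming this is below; `jar_main_class` is a hypothetical helper name, and the sketch ignores manifest continuation lines, which is an acceptable simplification only for short class names.

```python
import zipfile


def jar_main_class(jar_path):
    """Return the Main-Class value from a jar's manifest, or None if absent."""
    with zipfile.ZipFile(jar_path) as jar:
        try:
            manifest = jar.read("META-INF/MANIFEST.MF").decode("utf-8")
        except KeyError:
            # The jar has no manifest at all.
            return None
    for line in manifest.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "main-class":
            return value.strip()
    return None
```

If this returns `None` for the jar, a class can still be run by naming it explicitly with `java -cp <jar> <fully.qualified.ClassName>`. Which class generates WATs is not stated in this thread, so that name is left as a placeholder.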
from warcio.
Thanks @wumpus for looking into this. I don't know that this is particularly common, and it's very old code from IA that's generating these WATs. IA might have an updated version of these files; I would recommend checking with them.
I suppose `warcio recompress` might be able to fix these types of errors, but I haven't really seen this issue in general, so closing for now.
from warcio.