Comments (3)
Side note: NO_BLOCK_DIGEST_TYPES is actually a string, not a tuple as certainly intended. That won't have any negative effect though unless someone is writing custom records with a WARC-Type that is a substring of warcinfo.
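To illustrate the string-versus-tuple point, here is a minimal sketch. The helper `skips_block_digest` and the two variable names are hypothetical and are not warcio's actual code; they only show how Python's `in` operator behaves differently on a string (substring test) and on a tuple (membership test).

```python
# Parentheses alone do not make a tuple; the trailing comma does.
as_string = ('warcinfo')   # this is just the str 'warcinfo'
as_tuple = ('warcinfo',)   # this is a one-element tuple


def skips_block_digest(record_type, no_digest_types):
    """Hypothetical check, for illustration only (not warcio's API)."""
    return record_type in no_digest_types


# With the string, any substring of 'warcinfo' matches.
print(skips_block_digest('info', as_string))
# With the tuple, only an exact match does.
print(skips_block_digest('info', as_tuple))
print(skips_block_digest('warcinfo', as_tuple))
```

With the string form, a custom WARC-Type such as `info` or `warc` would silently be treated like `warcinfo`; with the tuple form it would not.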
from warcio.
The ('warcinfo') aspect of the bug is old. It was (accidentally) introduced in 25194bc ("warcwriter: if writing a record with no Content-Length, buffer and compute length, as well as digests, as needed").
You're right that any record type that is a substring of the string 'warcinfo' will match; fortunately, there is no such standard record type.
As for that explicit code that stops writing block digests for warcinfo records, that code was in the original checkin, when the code was split out of webrecorder. I'm kind of surprised that I didn't notice it while working on "warcio test".
I dug around a bit more. a791617 is not the commit that introduced this behaviour. It appears that it was already like that back when warcinfo records were first added in webrecorder/pywb@d40edfc2; at the time, digests were only written for responses as far as I can tell.
I can't really think of any good reason why block digests shouldn't simply always be present for verification purposes. The only exception would be zero-length records, where the digest is useless, but those are so rare and odd that worrying about them isn't worth it. I'll create a PR shortly.
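As a rough sketch of what "always present" could look like, the snippet below computes a digest label for any record block, whatever the record type. It assumes SHA-1 with base32 encoding and a `sha1:` prefix, a convention commonly seen in WARC files; the function name `block_digest` is hypothetical and is not part of warcio.

```python
import base64
import hashlib


def block_digest(block: bytes) -> str:
    """Return a 'sha1:<base32>' label for a record block.

    Hypothetical helper for illustration; the algorithm and the
    label format are assumptions, not warcio's implementation.
    """
    raw = hashlib.sha1(block).digest()
    return 'sha1:' + base64.b32encode(raw).decode('ascii')


# Same call for every record type, including warcinfo.
print(block_digest(b'software: example\r\n'))

# Every zero-length block hashes to the same value, which is why
# the digest carries no information for empty records.
print(block_digest(b'') == block_digest(b''))
```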