Comments (5)
Here's a list of things that I think warcio should validate before even emitting an `ArcWarcRecord`:

- Does the stream begin with `WARC/`, a supported spec version, and CRLF?
- Does every following line end with CRLF?
- Once an empty line has been read: is every mandatory header (`WARC-Record-ID`, `Content-Length`, `WARC-Date`, and `WARC-Type`) present with a valid value?
Then, upon reading from `raw_stream` (directly or via `content_stream`):

- Can enough bytes be read from the stream, in accordance with the `Content-Length` header?
- Is that data followed by CRLFCRLF?
If any of these points fail, an exception should be raised.
As far as I can tell, only the first point is partially (no CRLF check) implemented so far.
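For illustration, the checks above could be sketched roughly like this. All names here (`WARCFormatError`, `parse_record_header`, the version list) are hypothetical and not warcio's API, and continuation lines are ignored for brevity:

```python
import io

class WARCFormatError(Exception):
    """Raised when the stream violates the WARC format (hypothetical)."""

# Versions this sketch accepts; extend as needed.
SUPPORTED_VERSIONS = (b"WARC/1.0", b"WARC/1.1")
MANDATORY = ("WARC-Record-ID", "Content-Length", "WARC-Date", "WARC-Type")

def read_strict_line(stream):
    """Read one line and insist that it ends with CRLF."""
    line = stream.readline()
    if not line.endswith(b"\r\n"):
        raise WARCFormatError("line not terminated by CRLF: %r" % line)
    return line[:-2]

def parse_record_header(stream):
    """Validate the version line and mandatory headers of one record."""
    version = read_strict_line(stream)
    if version not in SUPPORTED_VERSIONS:
        raise WARCFormatError("bad version line: %r" % version)
    headers = {}
    while True:
        line = read_strict_line(stream)
        if line == b"":  # empty line ends the header block
            break
        name, sep, value = line.partition(b":")
        if not sep:
            raise WARCFormatError("header line without colon: %r" % line)
        headers[name.decode("utf-8").strip()] = value.strip().decode("utf-8")
    for required in MANDATORY:
        if required not in headers:
            raise WARCFormatError("missing mandatory header: %s" % required)
    return headers
```

A real implementation would additionally validate header values, handle continuation lines, and perform the `Content-Length`/CRLFCRLF checks on the payload.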
I took a quick look at how this could be implemented. warcio uses the same code for WARC and HTTP header parsing, `warcio.statusandheaders.StatusAndHeadersParser`. Unfortunately, HTTP servers are horrible at complying with the specifications, but some clients (browsers) 'fix' those inconsistencies instead of throwing bricks at the developer, so other clients are forced to be equally flexible when parsing HTTP (e.g. accepting wrong line breaks). Such flexibility is undesirable for WARC parsing, in my opinion.
I'll take a shot at adding a `strict` kwarg to the parser which requires CRLF line endings, UTF-8 encoding without fallback to ISO-8859-1, colons on all lines except continuation lines, etc. This could then be enabled for WARC parsing but not for HTTP. I'm not sure about the name of that kwarg, since at least the encoding part does not apply to HTTP even when strictly following the spec, but we can sort that out later in the PR. :-)
One interface thing to keep in mind is that looping over an iterator cannot be continued if the iterator raises.
That's why warcio's digest verification has a complicated interface, with 4 options: don't check (default), record problems but carry on, record problems and print something to stderr but carry on, and finally raise on problems.
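That four-way pattern can be sketched generically like this (illustrative names only, not warcio's actual interface; `pairs` stands in for whatever produces records and detected problems):

```python
import sys
from enum import Enum

class OnProblem(Enum):
    IGNORE = "ignore"  # don't check (default)
    RECORD = "record"  # remember problems but carry on
    WARN = "warn"      # remember, print to stderr, carry on
    RAISE = "raise"    # raise; iteration cannot be resumed afterwards

class CheckedIterator:
    """Yield records while collecting or raising on per-record problems."""

    def __init__(self, pairs, on_problem=OnProblem.IGNORE):
        self.pairs = pairs          # iterable of (record, problem_or_None)
        self.on_problem = on_problem
        self.problems = []          # inspect after (partial) iteration

    def __iter__(self):
        for record, problem in self.pairs:
            if problem is not None and self.on_problem is not OnProblem.IGNORE:
                if self.on_problem is OnProblem.RAISE:
                    raise ValueError(problem)
                self.problems.append(problem)
                if self.on_problem is OnProblem.WARN:
                    print(problem, file=sys.stderr)
            yield record
```

The key point is that only the RAISE mode aborts the loop; the other modes let the caller finish the file and examine `problems` afterwards.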
Right. But should the iterator be resumable if the underlying stream is not a valid WARC file, as in the examples above? For digest verification, it makes sense to log mismatches and check the rest of the file. Similarly, content type mismatches or invalid payload data (e.g. corrupted images) can and should be handled downstream. But that isn't really possible or sensible if the file isn't parsable WARC in the first place.
Generic recovery from such a situation isn't possible either. I've had a case in the past where a record in the middle of a WARC was truncated for unknown reasons. Fortunately, the file used per-record compression, so some nasty processing allowed me to find the next record and then produce a new file without the offending record. But that's not possible in the general case because the file might be compressed as a whole rather than per record or, even worse, the gzip member boundaries might be offset entirely from the WARC record boundaries. You can't simply decompress everything and then search for `\r\n\r\nWARC/1.0\r\n` either, because that string could appear within a record, too.
I suppose it makes sense to split the points mentioned above into two categories.
First, there are hard parsing errors. These are errors that are absolutely impossible to recover from. For example, if a file doesn't start with `WARC/1.1` + CRLF (or another valid known version), or there isn't an empty line marking the end of the headers, it simply cannot be parsed. A `Content-Length` header would also have to be present, and after reading that many bytes of payload, there must be a double CRLF.
Second, there are softer parsing errors. Examples of this include header names that aren't valid UTF-8, missing any of the other required header fields, header lines (that aren't continuation lines) missing a colon, or LF instead of CRLF as line endings.
I'm not sure about these. Part of me wants to handle these exactly like hard errors. Accepting files that don't conform to the spec leads to exactly the same issue as we have with HTTP or IRC: over time, more and more content would exist that is not conformant, and so everyone has to adapt to also handle these non-conformant cases without any proper documentation of this stuff. I feel like this is even more problematic in the context of archival formats, which are meant to be preserved forever.
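One way the two categories could surface in code is as an exception hierarchy, so callers can opt into leniency for soft errors only. This is purely a design sketch with made-up names:

```python
class WARCParseError(Exception):
    """Base class for WARC parsing problems (hypothetical)."""

class HardParseError(WARCParseError):
    """Unrecoverable: no valid version line, no blank line ending the
    headers, missing Content-Length, or no CRLFCRLF after the payload."""

class SoftParseError(WARCParseError):
    """Spec violation a lenient parser could paper over: non-UTF-8
    header names, other missing mandatory headers, a non-continuation
    header line without a colon, or bare LF line endings."""

# Illustrative problem keywords; a real parser would raise directly.
_HARD = {"no-version", "no-header-end", "no-content-length", "no-trailing-crlfcrlf"}

def classify(problem):
    """Map a problem keyword to its exception class."""
    return HardParseError if problem in _HARD else SoftParseError
```

A strict mode would then treat both classes identically, while a lenient mode could catch `SoftParseError` and continue.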
By the way, I have a not-quite-finished `develop-warcio-test` branch in the repo that is capable of complaining about soft parsing errors. There are tons of WARCs out there with problems.