Comments (5)

JustAnotherArchivist commented on September 22, 2024

Here's a list of things that I think warcio should validate before even emitting an ArcWarcRecord:

  • Does the stream begin with WARC/, a supported spec version, and CRLF?
  • Does every following line end with CRLF?
  • Once an empty line has been read: Is every mandatory header (WARC-Record-ID, Content-Length, WARC-Date, and WARC-Type) present with a valid value?

Then, upon reading from raw_stream (directly or via content_stream):

  • Can enough bytes be read from the stream, in accordance with the Content-Length header?
  • Is that data followed by CRLFCRLF?

If any of these points fail, an exception should be raised.

As far as I can tell, only the first point is partially implemented so far (there is no CRLF check).
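
For concreteness, here's a minimal sketch of those checks in Python, written against a plain byte stream rather than warcio's internals. The function name and exception class are made up, value validation (e.g. the WARC-Date format) is omitted, and continuation lines aren't handled:

    import re

    MANDATORY = {'WARC-Record-ID', 'Content-Length', 'WARC-Date', 'WARC-Type'}
    # Known spec versions; extend as needed.
    VERSION_RE = re.compile(br'^WARC/(1\.0|1\.1)\r\n$')


    class WarcFormatError(Exception):
        pass


    def read_strict_record(stream):
        """Read one record from an uncompressed WARC byte stream, raising
        WarcFormatError on any of the structural problems listed above."""
        # 1. The record must start with 'WARC/', a known version, and CRLF.
        line = stream.readline()
        if not VERSION_RE.match(line):
            raise WarcFormatError('missing or malformed version line: %r' % line)

        # 2. Every header line must end with CRLF; a bare CRLF ends the block.
        headers = {}
        while True:
            line = stream.readline()
            if not line.endswith(b'\r\n'):
                raise WarcFormatError('header line not terminated by CRLF: %r' % line)
            if line == b'\r\n':
                break
            name, sep, value = line[:-2].partition(b':')
            if not sep:
                raise WarcFormatError('header line without a colon: %r' % line)
            headers[name.strip().decode('utf-8')] = value.strip().decode('utf-8')

        # 3. All mandatory headers must be present.
        missing = MANDATORY - headers.keys()
        if missing:
            raise WarcFormatError('missing mandatory headers: %s' % ', '.join(sorted(missing)))
        if not headers['Content-Length'].isdigit():
            raise WarcFormatError('invalid Content-Length: %r' % headers['Content-Length'])

        # 4. Exactly Content-Length payload bytes must follow, then CRLFCRLF.
        length = int(headers['Content-Length'])
        payload = stream.read(length)
        if len(payload) != length:
            raise WarcFormatError('truncated payload: wanted %d bytes, got %d' % (length, len(payload)))
        if stream.read(4) != b'\r\n\r\n':
            raise WarcFormatError('record not terminated by CRLFCRLF')

        return headers, payload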

JustAnotherArchivist commented on September 22, 2024

I took a quick look at how this could be implemented. warcio uses the same code for WARC and HTTP header parsing, warcio.statusandheaders.StatusAndHeadersParser. Unfortunately, HTTP servers are horrible at complying with the specifications, but some clients (browsers) 'fix' those inconsistencies instead of throwing bricks at the developer, and so other clients are forced to be similarly flexible when parsing HTTP (e.g. accepting wrong line breaks). Such flexibility is undesirable for WARC parsing in my opinion.

I'll take a shot at adding a strict kwarg to the parser that requires CRLF line endings, UTF-8 encoding without fallback to ISO-8859-1, a colon on every header line except continuation lines, etc. Then this can be enabled for the WARC parsing but not the HTTP one. I'm not sure about the name of that kwarg, since at least the encoding part does not apply to HTTP even when strictly following the spec, but we can sort that out later in the PR. :-)
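
As a rough sketch of the split such a kwarg would control (standalone helpers, not warcio's actual StatusAndHeadersParser API):

    def decode_header_bytes(raw, strict=False):
        """Decode one raw header line.

        strict=True  -> WARC mode: UTF-8 only, raise on invalid bytes.
        strict=False -> HTTP mode: tolerate bad bytes via ISO-8859-1 fallback.
        """
        if strict:
            return raw.decode('utf-8')
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            return raw.decode('iso-8859-1')


    def split_header_line(line, strict=False):
        """Split 'Name: value'; in strict mode a missing colon is an error.
        (Continuation lines, which start with whitespace, are handled before
        this point.)"""
        name, sep, value = line.partition(':')
        if strict and not sep:
            raise ValueError('header line without a colon: %r' % line)
        return name.strip(), value.strip()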

wumpus commented on September 22, 2024

One interface thing to keep in mind is that looping over an iterator cannot be continued if the iterator raises.

That's why warcio's digest verification has a complicated interface, with 4 options: don't check (default), record problems but carry on, record problems and print something to stderr but carry on, and finally raise on problems.
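Roughly this shape, as a generic sketch (not warcio's actual digest-checking API; parse_records and verify_digest stand in for the real machinery):

    import sys


    def iter_records(stream, on_problem='ignore', problems=None):
        """on_problem: 'ignore' (default), 'record' (append to the caller-supplied
        problems list), 'warn' (record and print to stderr), or 'raise'. Only
        'raise' ends the loop for good, because a generator cannot be resumed
        after it raises."""
        for record in parse_records(stream):      # hypothetical parser
            problem = verify_digest(record)       # hypothetical check; None if OK
            if problem is not None and on_problem != 'ignore':
                if problems is not None:
                    problems.append(problem)
                if on_problem == 'warn':
                    print(problem, file=sys.stderr)
                elif on_problem == 'raise':
                    raise ValueError(problem)
            yield record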

JustAnotherArchivist commented on September 22, 2024

Right. But should the iterator be resumable if the underlying stream is not a valid WARC file like in the examples above? For digest verification, it makes sense to log digest mismatches and check the rest of the file. Similarly, content type mismatches or invalid payload data (e.g. corrupted images) can and should be handled downstream. But that isn't really possible or sensible if the file isn't a parsable WARC in the first place.

Generic recovery from such a situation isn't possible either. I've had a case in the past where a record in the middle of a WARC was truncated for unknown reasons. Fortunately, the file used per-record compression, so some nasty processing allowed me to find the next record and then produce a new file without the offending record. But that's not possible in the general case because the file might be compressed as a whole rather than per record or, even worse, gzip member boundaries might be offset entirely from the WARC record boundaries. You can't simply decompress everything and then search for '\r\n\r\nWARC/1.0\r\n' either, because that string could appear within a record, too.
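
For the per-record case, one possible approach looks roughly like the following sketch: scan for the gzip magic bytes and accept an offset only if that member decompresses to something starting with a version line. The function name is made up, and false positives inside compressed data are exactly why this stays a heuristic and cannot work for whole-file compression:

    import zlib

    GZIP_MAGIC = b'\x1f\x8b\x08'


    def find_next_record_offset(data, start=0):
        """Return the offset of the next plausible per-record gzip member whose
        content begins with a WARC version line, or None if there is none."""
        pos = data.find(GZIP_MAGIC, start)
        while pos != -1:
            d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect a gzip wrapper
            try:
                head = d.decompress(data[pos:pos + 1024], 64)
                if head.startswith(b'WARC/'):
                    return pos
            except zlib.error:
                pass  # magic bytes occurred inside compressed data; keep scanning
            pos = data.find(GZIP_MAGIC, pos + 1)
        return None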

I suppose it makes sense to split the points mentioned above into two categories.

First, there are hard parsing errors. These are errors that are absolutely impossible to recover from. For example, if a file doesn't start with WARC/1.1 + CRLF (or other valid known version) or there isn't an empty line marking the end of the headers, it simply cannot be parsed. A Content-Length header would also have to be present, and after reading that many bytes from the payload, there must be a double CRLF.

Second, there are softer parsing errors. Examples include header names that aren't valid UTF-8, missing any of the other required header fields, header lines (that aren't continuation lines) missing a colon, or LF instead of CRLF as line endings.

I'm not sure about these. Part of me wants to handle them exactly like hard errors. Accepting files that don't conform to the spec leads to exactly the same issue as we have with HTTP or IRC: over time, more and more non-conformant content would exist, and everyone would have to adapt to handle these non-conformant cases without any proper documentation of this stuff. I feel like this is even more problematic in the context of archival formats, which are meant to be preserved forever.

wumpus commented on September 22, 2024

btw I have a not-quite-finished develop-warcio-test branch in the repo that is capable of complaining about soft parsing errors. There are tons of WARCs out there with problems.
