
Comments (3)

ikreymer commented on May 26, 2024

The field is useful because it allows revisit records, so one could have a revisit of a resource or metadata record, for example.

The revisit record also has the payload digest, which matches that of the original.
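
To illustrate the matching, here is a minimal sketch using only the Python standard library. It assumes the `sha1:` plus Base32 labelling commonly seen in WARC files, and it assumes the payload of an HTTP record is everything after the first blank line (no transfer-encoding handling). The helper names `labelled_sha1` and `http_digests` are made up for this example and are not warcio API.

```python
import base64
import hashlib
from typing import Tuple


def labelled_sha1(data: bytes) -> str:
    """Return a digest in the 'sha1:<Base32>' form (an assumed convention)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def http_digests(block: bytes) -> Tuple[str, str]:
    """Return (block digest, payload digest) for an HTTP record block.

    Assumption: the payload is whatever follows the first blank line.
    """
    _headers, _sep, body = block.partition(b"\r\n\r\n")
    return labelled_sha1(block), labelled_sha1(body)


# Two captures of the same body; only the Date header differs.
original = b"HTTP/1.1 200 OK\r\nDate: Mon, 01 Jan 2024 00:00:00 GMT\r\n\r\nhello"
later = b"HTTP/1.1 200 OK\r\nDate: Tue, 02 Jan 2024 00:00:00 GMT\r\n\r\nhello"

orig_block, orig_payload = http_digests(original)
later_block, later_payload = http_digests(later)

# The block digests differ, the payload digests match, so the later
# capture could be written as a revisit of the original.
print(orig_block != later_block, orig_payload == later_payload)
```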

The example resource record actually includes the WARC-Payload-Digest:
http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1-1_latestdraft.pdf

Don't think any changes are needed here, and not really sure what the 'well-defined payload' is supposed to mean (when is a payload not well defined?)

from warcio.

wumpus commented on May 26, 2024

My testing code has a list of record types that have a well-defined payload, in the WARC sense. I don't think it means an http payload.

  • optional: warcinfo, response, resource, request, revisit, conversion
  • prohibited for: metadata, continuation
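
A check along the lines of the list above could look roughly like this. It is only a sketch of one reading of the standard, not the actual testing code; the set and function names are hypothetical, and it assumes the header names passed in are already normalized to their canonical spelling.

```python
# Hypothetical names; the two sets mirror the list above, which is one
# reading of the standard and not a quotation from it.
PAYLOAD_DIGEST_OPTIONAL = {
    "warcinfo", "response", "resource", "request", "revisit", "conversion",
}
PAYLOAD_DIGEST_PROHIBITED = {"metadata", "continuation"}


def payload_digest_problem(record_type, headers):
    """Return a message if the record conflicts with the reading above, else None.

    Assumption: `headers` is a dict keyed by canonical WARC header names.
    """
    if record_type not in PAYLOAD_DIGEST_OPTIONAL | PAYLOAD_DIGEST_PROHIBITED:
        return "unknown record type: %s" % record_type
    has_digest = "WARC-Payload-Digest" in headers
    if has_digest and record_type in PAYLOAD_DIGEST_PROHIBITED:
        return "WARC-Payload-Digest not expected on %s records" % record_type
    return None


print(payload_digest_problem("metadata", {"WARC-Payload-Digest": "sha1:..."}))
print(payload_digest_problem("resource", {"WARC-Payload-Digest": "sha1:..."}))
```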

One of the points of this testing is to flush out disagreements about what the standard says.


JustAnotherArchivist commented on May 26, 2024

Good point, revisit records for resource and metadata duplicates would indeed be useful.

I'm not sure either what "record with a well-defined payload" is supposed to mean exactly. I interpret it as "a record that has a Content-Type for which a definition of 'payload' has been specified". If that interpretation is correct, only HTTP records should have a payload digest since that's the only definition given in the spec. If it instead means "a record that contains data for which there is a common understanding of what its payload is", then I agree that it would also cover a number of other content types. Perhaps a discussion on https://github.com/iipc/warc-specifications is in order here for the details.

That said, I believe there is an issue here. I should probably have been more specific in the original report. In qwarc, I write all dependencies of the crawl to a metadata WARC using resource records. These dependencies include a Python script and may also include arbitrary files used by the script. The script is written using an application/x-python content type, which is not officially defined anywhere but common enough that it's clear what its contents – and therefore its payload – are. Since it is impossible to reliably guess the content type, qwarc doesn't attempt to do so for the files; instead, they are written using application/octet-stream. Now, I suppose there are (at least) two ways one could look at this type. The first is that it's a generic type that on its own doesn't have any meaning since it could be any type of data. This is what I see it as, and by this interpretation, I don't think it can be considered to have a "well-defined" payload. An alternative view is that application/octet-stream is a container; while it may contain any type of data, on its own, it's just a stream of bytes, and as such, it does have a well-defined payload.
(Minor addition: per RFC 2046, application/octet-stream may also have padding, which I think should not be considered part of the payload even if it is seen as a container type. But since this is only about bit padding to full bytes, I don't think that's of concern.)

Another scenario: what if someone stores a WARC record within a WARC record? An example where this could happen is if a crawler writes a resource record in addition to request/response records, which contains the decoded HTTP body (i.e. transfer encoding removed etc.). There is at least one tool (crocoite) which does something like this with HTML pages, storing the rendered DOM in a resource record, so this is not unreasonable. What would the payload be in that case?
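
To make the decoded-body case concrete, here is a small sketch (standard library only) showing that a digest over the body as transferred and a digest over the decoded body differ, so which of the two counts as "the payload" matters. `decode_chunked` is a deliberately minimal, hypothetical helper that ignores trailers and drops chunk extensions; the `sha1:` plus Base32 labelling is an assumed convention.

```python
import base64
import hashlib


def labelled_sha1(data: bytes) -> str:
    """Return a digest in the 'sha1:<Base32>' form (an assumed convention)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def decode_chunked(body: bytes) -> bytes:
    """Minimal chunked decoder: ignores trailers, drops chunk extensions."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        size_field = body[pos:line_end].split(b";", 1)[0]
        size = int(size_field.decode("ascii"), 16)
        if size == 0:
            break
        start = line_end + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that ends the chunk
    return bytes(out)


# Body as it appeared on the wire (chunked) versus after decoding.
wire_body = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
decoded = decode_chunked(wire_body)

print(labelled_sha1(wire_body))
print(labelled_sha1(decoded))
```

A resource record holding `decoded` and a response record holding `wire_body` would therefore carry different digests even though they describe the same content.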

