Comments (3)
The field is useful to have on revisit records, so that one could have a revisit of a resource or metadata record, for example.
The revisit record also has the payload digest, which matches that of the original.
The example resource record actually includes the WARC-Payload-Digest:
http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1-1_latestdraft.pdf
I don't think any changes are needed here, and I'm not really sure what the 'well-defined payload' is supposed to mean (when is a payload not well defined?).
from warcio.
My testing code has a list of record types that have a well-defined payload, in the WARC sense. I don't think it means an http payload.
- optional: warcinfo, response, resource, request, revisit, conversion
- prohibited for: metadata, continuation
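As a rough illustration, the list above could be encoded as a small check like the following. This is only a sketch of one tester's reading of the standard, not the spec text and not warcio's actual test code; the function name `check_payload_digest` and the plain-dict header representation are assumptions made for the example.

```python
# Record-type policy taken from the list in the comment above.
# This mirrors one reading of the standard, not the standard itself.
PAYLOAD_DIGEST_OPTIONAL = {
    "warcinfo", "response", "resource", "request", "revisit", "conversion",
}
PAYLOAD_DIGEST_PROHIBITED = {"metadata", "continuation"}


def check_payload_digest(record_type, headers):
    """Return a list of problems found for one record (hypothetical helper).

    `headers` is a plain dict of WARC header names to values.
    """
    # WARC header names are matched case-insensitively here.
    has_digest = any(name.lower() == "warc-payload-digest" for name in headers)
    if record_type in PAYLOAD_DIGEST_PROHIBITED and has_digest:
        return [f"WARC-Payload-Digest not allowed on {record_type} records"]
    if record_type not in PAYLOAD_DIGEST_OPTIONAL | PAYLOAD_DIGEST_PROHIBITED:
        return [f"unknown record type {record_type!r}"]
    return []


print(check_payload_digest("metadata", {"WARC-Payload-Digest": "sha1:EXAMPLE"}))
print(check_payload_digest("resource", {"WARC-Payload-Digest": "sha1:EXAMPLE"}))
```

Keeping the two sets as data makes a disagreement about the standard easy to express: moving a record type from one set to the other changes the verdict without touching the logic.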
One of the points of this testing is to flush out disagreements about what the standard says.
Good point, revisit records for resource and metadata duplicates would indeed be useful.
I'm not sure either what "record with a well-defined payload" is supposed to mean exactly. I interpret it as "a record that has a Content-Type for which a definition of 'payload' has been specified". If that interpretation is correct, only HTTP records should have a payload digest since that's the only definition given in the spec. If it instead means "a record that contains data for which there is a common understanding of what its payload is", then I agree that it would also cover a number of other content types. Perhaps a discussion on https://github.com/iipc/warc-specifications is in order here for the details.
That said, I believe there is an issue here. I should probably have been more specific in the original report. In qwarc, I write all dependencies of the crawl to a metadata WARC using resource records. These dependencies include a Python script and may also include arbitrary files used by the script. The script is written using an application/x-python content type, which is not officially defined anywhere but common enough that it's clear what its contents – and therefore its payload – are. Since it is impossible to reliably guess the content type, qwarc doesn't attempt to do so for the files; instead, they are written using application/octet-stream.
Now, I suppose there are (at least) two ways one could look at this type. The first is that it's a generic type that on its own doesn't have any meaning since it could be any type of data. This is what I see it as, and by this interpretation, I don't think it can be considered to have a "well-defined" payload. An alternative view is that application/octet-stream is a container; while it may contain any type of data, on its own, it's just a stream of bytes, and as such, it does have a well-defined payload.
(Minor addition: per RFC 2046, application/octet-stream may also have padding, which I think should not be considered part of the payload even if it is seen as a container type. But since this is only about bit padding to full bytes, I don't think that's of concern.)
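To make the difference between the two readings concrete, here is a minimal sketch of how a digest would be computed in each case. It assumes SHA-1 with a base32 label (`sha1:...`), a form commonly seen in WARC files, and it takes the HTTP payload to be everything after the first blank line, ignoring transfer and content encodings. The helper names and sample bytes are made up for the example.

```python
import base64
import hashlib


def warc_digest(data: bytes) -> str:
    """Format a SHA-1 digest as 'sha1:<base32>' (assumed labelling convention)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def http_payload(block: bytes) -> bytes:
    """Return the bytes after the first blank line, i.e. the HTTP message body.

    Simplification: no handling of chunked transfer encoding or compression.
    """
    _headers, _sep, body = block.partition(b"\r\n\r\n")
    return body


# Hypothetical record blocks for illustration.
http_block = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
octet_block = b"\x00\x01arbitrary bytes"

# HTTP record: the block digest and the payload digest cover different bytes.
print("block  :", warc_digest(http_block))
print("payload:", warc_digest(http_payload(http_block)))

# application/octet-stream under the "container" reading: the payload is the
# whole block, so a payload digest would just repeat the block digest.
print("octet  :", warc_digest(octet_block))
```

Under the first reading (no well-defined payload), the last record would carry only a block digest and no WARC-Payload-Digest at all.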
Another scenario: what if someone stores a WARC record within a WARC record? An example where this could happen is if a crawler writes a resource record in addition to request/response records, which contains the decoded HTTP body (i.e. transfer encoding removed etc.). There is at least one tool (crocoite) which does something like this with HTML pages, storing the rendered DOM in a resource record, so this is not unreasonable. What would the payload be in that case?
Related Issues (20)
- get_test_file missing from the PyPI release
- Offline tests
- extract entire warc file?
- warcio check does not raise error when GZip records are truncated
- `capture_http` fails in tests, but works otherwise
- Record not followed by newline (conversion error)
- Warcio does not support replay of sites hosted on NCSA 1.5
- Issues with encoding of http-answers
- Documentation: Clarify that capture_http writer with filename has no get_stream methood
- warcio.exceptions.ArchiveLoadFailed: Unknown archive format
- Empty WARC files when deploying warcio on Airflow
- Trying to write to closed file when using `requests.Session`
- Patching WARCs using warcio
- warcio cannot write wet files
- webrecorder fails to open IA warc file on MacOS X Ventura 13.2.1
- wget warc status code?
- doc bugs linking to source code files
- "warcio check" incorrectly reporting payload digest failures for non-HTTP WARCs
- warcio accepts a bare LF everywhere a CRLF is required by the spec
- "warcio check" does not warn of illegal characters in field names or values, including LF