Comments (3)
This is something of an edge case. The whitespace was at one point used to indicate multi-line headers (which have since been deprecated, though warcio still supports them). I'm not sure the whitespace is still significant from a parsing perspective.
Similar to #128, perhaps there could be a 'raw' mode flag that preserves the whitespace here, if desired, when capturing HTTP traffic.
from warcio.
FWIW, I've never seen an HTTP server that returns a header like this, so (I hope) it's not very common :)
from warcio.
The whitespace on the line with the field-name has never been semantically significant, as far as I know. Neither the whitespace after the colon nor the whitespace at the end of the line is part of the actual field value content. And even with continuation lines: the optional whitespace at the end of a line, the CRLF, and the leading space/tab on the continuation line are together equivalent to a single space.
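To illustrate the equivalence described above, here is a minimal Python sketch. It is not warcio's parser: the `semantic_value` helper and the `X-Example` header are hypothetical names chosen for illustration, and the regex assumes the obsolete line folding rule as I understand it from RFC 7230 (CRLF followed by at least one space or tab).

```python
import re
from typing import Tuple

# Assumption: a fold is optional trailing whitespace, CRLF, then one or
# more leading spaces/tabs on the continuation line.
_FOLD = re.compile(r"[ \t]*\r\n[ \t]+")


def semantic_value(raw_header: str) -> Tuple[str, str]:
    """Return (field-name, field-value) with folds and padding removed.

    Hypothetical helper for illustration only; it discards exactly the
    bytes that a 'raw' mode would need to preserve.
    """
    name, _, rest = raw_header.partition(":")
    # Each fold collapses to a single space.
    unfolded = _FOLD.sub(" ", rest)
    # Whitespace after the colon and at the end of the line is not part
    # of the value.
    return name, unfolded.strip(" \t\r\n")


folded = "X-Example:   foo  \r\n\t bar  \r\n"
plain = "X-Example: foo bar\r\n"

# Different bytes on the wire, same semantic field value.
same = semantic_value(folded) == semantic_value(plain)
```

Both inputs yield the same name and value, which is why a parser that only cares about semantics can normalize them, while an archiving tool that wants byte-exact capture cannot.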
But yeah, same as #128, this is about correctly preserving the data sent by the server, not the semantic meaning. I've suggested a possible solution there because they are indeed very similar and have essentially the same root cause.
Yeah, it is fortunately not very common, but I have seen it before, sadly enough. There are a lot of weird HTTP servers out there that operate at the edges of or beyond the specifications...
from warcio.