I discovered today that warcio mangles the HTTP header data when it isn't pure ASCII.

I'd argue that <a class="issue-link js-issue-link" data-error-text="Failed to load tit

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Yeah, I think as <a class="user-mention notranslate" data-hovercard-type="user" data-h

warcio mangles non-ASCII HTTP headers,about webrecorder/warcio

JustAnotherArchivist commented on July 2, 2024 1

I'd argue that #45 does not actually fix #38 at all. Reading a WARC record and then writing it again should produce identical record body data (though it may change the WARC headers to an equivalent representation), i.e. the HTTP headers should not be reencoded in that scenario either. Essentially, reading the response record from a WARC is equivalent to reading from a network connection with the actual server.

I think the only sane approach is to avoid the round trip to the mapping and back to bytes entirely. I haven't thought this through completely, but my idea right now would be to have StatusAndHeadersParser record the raw data read from .readline() and then set the StatusAndHeaders object's headers_buff attribute directly to the concatenation of that. That would solve #35 (properly), this, and #129 in one go, and it should also prevent any further bugs like this on future header parsing changes.

from warcio.

ikreymer commented on July 2, 2024

Yeah, this is tricky...

First, the API isn't quite correct, it should really be:

headers = warcio.StatusAndHeaders("HTTP/1.0 200 OK",
          {b"Date": b"Thu, 27 May 2021 22:03:54 GMT",
           b"Content-Length": b"0",
           b"X-non-utf8-value": b"\xff"
          })
record = writer.create_warc_record('http://example.org/', 'response', http_headers=headers, payload=payload)

but it seems like the result is the same.

The %-encoding was implemented to deal with ambiguity, as even though ISO-8859-1 is supported to be the encoding for headers, servers can and do use UTF-8, and probably other encodings. In python its standard to represent headers as Unicode strings, so with Python 3, this is an issue. The %-encoding was designed to remove that ambiguity when serializing headers.

The default %-encoding via urllib.parse.quote uses utf-8 though, so that converts it to what you're seeing.

urllib.parse.quote('\xff')
'%C3%BF'

It seems like the implementation was geared towards content-disposition and similar uses of UTF-8.
Following https://datatracker.ietf.org/doc/html/rfc5987#section-3.2.2 and https://datatracker.ietf.org/doc/html/rfc8187#section-3.2.3

One option is to add the charset prefix, but not sure this is correct either (this is for multi-value headers, I think)

X-non-utf8-value: *=UTF-8''%C3%BF

or

X-non-utf8-value: *=ISO-8859-1''%FF

Probably it should just be:

X-non-utf8-value: %FF

Maybe an encoding param is also needed here...

from warcio.

JustAnotherArchivist commented on July 2, 2024

First, the API isn't quite correct, it should really be: [using http_headers]

Hmm, yeah, but there is even a test for a single payload with the headers included in the test suite, so I assumed that usage was also fine:

warcio/test/test_writer.py

Lines 356 to 368 in aa702cb

    
           # ============================================================================ 
        
           @sample_record('response_1-buff', RESPONSE_RECORD) 
        
           def sample_response_from_buff(builder): 
        
               payload = '\ 
        
           HTTP/1.0 200 OK\r\n\ 
        
           Content-Type: text/plain; charset="UTF-8"\r\n\ 
        
           Custom-Header: somevalue\r\n\ 
        
           \r\n\ 
        
           some\ntext'.encode('utf-8') 
        
               return builder.create_warc_record('http://example.com/', 'response', 
        
                                                payload=BytesIO(payload), 
        
                                                length=len(payload))

Unfortunately, the documentation for warcio's APIs is lacking, so it's often hard to tell what is and isn't supported usage.

[ambiguity with headers, encoding the value somehow]

I disagree. Yes, there is ambiguity, but that only matters when trying to consume HTTP. The data written to WARC response records is 'the full HTTP response received over the network' (or '... request sent ...' for request records). As such, no manipulation should ever happen to that data. Your proposed percent encoding would be equivalent per the HTTP spec, but it still wouldn't be what was sent/received over the network. In other words, representing the headers like that in the StatusAndHeaders mapping for further consumption by the client is fine, but having that in the WARC itself is a problem.

from warcio.

ikreymer commented on July 2, 2024

First, the API isn't quite correct, it should really be: [using http_headers]

Hmm, yeah, but there is even a test for a single payload with the headers included in the test suite, so I assumed that usage was also fine:

You're right, sorry, I forgot about that support, its been a while :)
...

[ambiguity with headers, encoding the value somehow]

I disagree. Yes, there is ambiguity, but that only matters when trying to consume HTTP. The data written to WARC response records is 'the full HTTP response received over the network' (or '... request sent ...' for request records). As such, no manipulation should ever happen to that data. Your proposed percent encoding would be equivalent per the HTTP spec, but it still wouldn't be what was sent/received over the network. In other words, representing the headers like that in the StatusAndHeaders mapping for further consumption by the client is fine, but having that in the WARC itself is a problem.

Yeah, it's tricky because warcio serves a dual role here, both for parsing data from existing WARCs, where the HTTP semantics are important, as well as writing raw data. An option could be to add a flag, say raw mode, that uses to_bytes instead of to_ascii_bytes at:
https://github.com/webrecorder/warcio/blob/master/warcio/statusandheaders.py#L118
I'm not sure if it should be the default, don't want to break the fix for #38

I don't have a lot of time to focus on this now unfortunately, but if you have other suggestions, or want to open a PR, can review it

from warcio.

manueldeprada commented on July 2, 2024

@JustAnotherArchivist is right. Warcio fails to decode and reencode again this file for example:
test.warc.gz
In particular, a record from the file has this headers:

HTTP/1.1 200 OK
Date: Mon, 26 Mar 2012 07:43:46 GMT
Server: Apache/2.0.63 (Unix) mod_ssl/2.0.63 OpenSSL/0.9.7a mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 PHP/5.2.13
X-Powered-By: PHP/5.2.13
Set-Cookie: ³ÒÚÍ×=%96%A4n%9Bu%B0%A9f%A5%7Cl%7D%98; expires=Sun, 06-May-2012 23:43:46 GMT; path=/
Content-Length: 7881
Connection: close
Content-Type: text/html

Notice the Set-Cookie header, where the name before the "=" has non-ascii characters. I agree this is not compliant nor common, but warcio should be able to deal with all sorts of WARC files archived in different manner. Instead, this error happens:

'ascii' codec can't encode character '\ufeff' in position 146: ordinal not in range(128)

in

warcio/warcio/statusandheaders.py

Line 179 in aa702cb

string = string.encode('ascii')

from warcio.

wumpus commented on July 2, 2024

I agree that this is a real bug. I don't think I have this tested in the "warcio check" branch, either! If you look at the WARC standard it's a bit complicated to figure out what you're really supposed to do in this case, so there should be no surprise that there are buggy implementations out there. But it's very obvious what is meant.

from warcio.

ikreymer commented on July 2, 2024

Yeah, I think as @JustAnotherArchivist points out in #128 (comment), probably the only real solution is to avoid changing the encoding at all and always treat the headers as bytes, and let the browser handle it for replay. The initial idea was to be in line with how Python 3 treats headers (as strings), and %-encode any non-ascii characters, but that clearly will not work if we want to allow invalid headers as they are, which probably does make sense.

from warcio.

manueldeprada commented on July 2, 2024

I created a pull request with a temporary fix so at least WARCIO doesn't terminate when this happens.

from warcio.

ssairanen commented on July 2, 2024

Yes, this is serious problem. I've encountered warcio breaking when trying to convert old arcs to warcs. The issue #88 is one example how warcio breaks. It happens in

warcio/warcio/statusandheaders.py

Lines 203 to 206 in aa702cb

    
           # %-encode value in ; name="value" 
        
               new_value = self.ENCODE_HEADER_RX.sub(do_encode, curr_value) 
        
               if new_value == curr_value: 
        
                   new_value = quote(curr_value)

.sub happens, but afterwards quoting doesn't (because curr_value isn't new_value) and afterwards this header breaks with UnicodeError on the line

warcio/warcio/statusandheaders.py

Line 179 in aa702cb

string = string.encode('ascii')

I also suggest that headers should just stay as they are, because it's not warcios job to fix the multitude of problems some web servers might cause in the headers. Warcio can ofc be used to fix the headers if there are functional problems in playback, but in main usage it shouldn't try to fix everything.

(Even though RFCs defining HTTP headers have stuff like "Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.")

from warcio.

warcio mangles non-ASCII HTTP headers about warcio HOT 9 OPEN

Comments (9)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

	# ============================================================================
	@sample_record('response_1-buff', RESPONSE_RECORD)
	def sample_response_from_buff(builder):
	payload = '\
	HTTP/1.0 200 OK\r\n\
	Content-Type: text/plain; charset="UTF-8"\r\n\
	Custom-Header: somevalue\r\n\
	\r\n\
	some\ntext'.encode('utf-8')

	return builder.create_warc_record('http://example.com/', 'response',
	payload=BytesIO(payload),
	length=len(payload))

	# %-encode value in ; name="value"
	new_value = self.ENCODE_HEADER_RX.sub(do_encode, curr_value)
	if new_value == curr_value:
	new_value = quote(curr_value)