Comments (6)
Hm, yeah, I think you're right, this is a mistake! This should be UTF-8, since it's warcinfo content and not WARC or HTTP headers.
from warcio.
OK.
As for WARC headers, they can contain UTF-8, and it looks like that's already done correctly? But it's not tested.
And for HTTP headers, https://tools.ietf.org/html/rfc5987 says there's a funky encoding to use if you want non-Latin-1 characters. I have no idea whether user agents typically decode them or not. If they do, then someone has to re-encode them before writing them into the WARC? And the WARC standard is silent on this issue?
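For reference, here is a minimal sketch of the RFC 5987 extended-parameter form, using only the Python standard library. The helper names `rfc5987_param` and `decode_rfc5987_value` are made up for illustration and are not part of warcio; the sample filename is also just an example.

```python
from urllib.parse import quote, unquote

# Characters RFC 5987 allows unescaped (attr-char), besides letters and digits.
ATTR_CHAR_PUNCT = "!#$&+-.^_`|~"

def rfc5987_param(name: str, value: str) -> str:
    """Build an extended parameter such as filename*=UTF-8''... (hypothetical helper)."""
    encoded = quote(value, safe=ATTR_CHAR_PUNCT, encoding="utf-8")
    return f"{name}*=UTF-8''{encoded}"

def decode_rfc5987_value(ext_value: str) -> str:
    """Decode the part after 'name*=': charset'language'percent-encoded-value."""
    charset, _language, encoded = ext_value.split("'", 2)
    return unquote(encoded, encoding=charset)

header = "Content-Disposition: attachment; " + rfc5987_param("filename", "naïve file.txt")
# The resulting header line is pure ASCII, so it also encodes cleanly as Latin-1.
header_bytes = header.encode("iso-8859-1")
```

With this form the header value stays ASCII on the wire, so whether it survives a WARC round trip no longer depends on the header encoding; the open question is only whether anything decodes it again.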
from warcio.
Yeah, this is all very confusing, unfortunately. Currently, both WARC and HTTP headers are encoded as ISO-8859-1/Latin-1:
https://github.com/webrecorder/warcio/blob/master/warcio/statusandheaders.py#L148
I think the HTTP headers should be whatever the original encoding was, which is probably ISO-8859-1.
PEP 3333 also requires ISO-8859-1 for HTTP headers by default w/o the special encoding, so I think this is correct.
For WARC headers, it seems that it should be using UTF-8 as the default, based on this (slightly ambiguous wording):
"Although UTF-8 characters are allowed, the 'encoded-word' mechanism of [RFC2047] may also be used when writing WARC fields and shall also be understood by WARC reading software."
Heritrix seems to use UTF-8, while warcprox uses ISO-8859-1/Latin-1.
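To make the 'encoded-word' mechanism concrete, here is a rough sketch using Python's standard `email.header` module. The helper names `to_encoded_word` and `from_encoded_word` are hypothetical, and nothing here reflects what warcio, Heritrix, or warcprox actually do.

```python
from email.header import Header, decode_header

def to_encoded_word(value: str) -> str:
    """Encode a non-ASCII value as an RFC 2047 encoded-word (ASCII only)."""
    return Header(value, charset="utf-8").encode()

def from_encoded_word(value: str) -> str:
    """Decode a value that may contain RFC 2047 encoded-words."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Unencoded segments come back with charset None; treat them as ASCII.
            chunk = chunk.decode(charset or "ascii")
        parts.append(chunk)
    return "".join(parts)

encoded = to_encoded_word("Zürich crawl")
decoded = from_encoded_word(encoded)
```

A reader that has to accept both raw UTF-8 and encoded-words could run every WARC field value through something like `from_encoded_word` after decoding the bytes as UTF-8.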
In conclusion, I think what we should have is:
- warcinfo contents as UTF-8
- WARC headers as UTF-8
- http headers as ISO-8859-1/Latin-1
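The three rules above could be sketched roughly like this. It is only an illustration: `headers_to_bytes` and the sample header values are invented for the example and are not warcio's actual API.

```python
# Encodings proposed in the list above.
WARCINFO_ENCODING = "utf-8"
WARC_HEADER_ENCODING = "utf-8"
HTTP_HEADER_ENCODING = "iso-8859-1"

def headers_to_bytes(first_line, headers, encoding):
    """Serialize a status/version line plus (name, value) pairs with CRLF endings."""
    lines = [first_line] + [f"{name}: {value}" for name, value in headers]
    return ("\r\n".join(lines) + "\r\n\r\n").encode(encoding)

warc_block = headers_to_bytes(
    "WARC/1.0",
    [("WARC-Type", "response"), ("WARC-Filename", "κόσμος.warc.gz")],
    WARC_HEADER_ENCODING,
)
http_block = headers_to_bytes(
    "HTTP/1.1 200 OK",
    [("Content-Type", "text/plain"), ("X-Note", "café")],
    HTTP_HEADER_ENCODING,
)
warcinfo_payload = "operator: Zoë\r\n".encode(WARCINFO_ENCODING)
```

Note that the HTTP block would still raise UnicodeEncodeError for values outside Latin-1, so this split alone does not settle what to do with such headers.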
from warcio.
Yes, I agree with all that.
I'd like to eventually produce a small website and a corresponding .warc.gz which can be used as a 'torture test' for warc writers, readers, and playback. But that's an issue for another day.
from warcio.
Maybe my problem is connected to this issue? I'm getting "UnicodeEncodeError: 'latin-1' codec can't encode characters" when running warcio recompress on some ARC/WARC files (most files from the same crawl are OK, but some are not).
This line in the HTTP headers seems to be the culprit:
Content-Disposition: attachment; filename=거북거북.vcf
Traceback (most recent call last):
  File "/usr/local/bin/warcio", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.4/dist-packages/warcio/cli.py", line 47, in main
    cmd.func(cmd)
  File "/usr/local/lib/python3.4/dist-packages/warcio/cli.py", line 94, in call
    self.load_and_write(stream, cmd.output)
  File "/usr/local/lib/python3.4/dist-packages/warcio/cli.py", line 111, in load_and_write
    writer.write_record(record)
  File "/usr/local/lib/python3.4/dist-packages/warcio/warcwriter.py", line 347, in write_record
    self._write_warc_record(self.out, record)
  File "/usr/local/lib/python3.4/dist-packages/warcio/warcwriter.py", line 232, in _write_warc_record
    self._set_header_buff(record)
  File "/usr/local/lib/python3.4/dist-packages/warcio/warcwriter.py", line 224, in _set_header_buff
    headers_buff = record.http_headers.to_bytes(self.header_filter, 'iso-8859-1')
  File "/usr/local/lib/python3.4/dist-packages/warcio/statusandheaders.py", line 155, in to_bytes
    return self.to_str(filter_func).encode(encoding) + b'\r\n'
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 336-339: ordinal not in range(256)
Here are some files I have problems with: http://veebiarhiiv.digar.ee/20180115warcencodeissue/
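The failure can be reproduced outside warcio with a few lines of plain Python. This is a sketch under assumptions: the Cyrillic filename is a stand-in, not the header from the files above, and `encode_with_fallback` is a hypothetical workaround, not something warcio provides.

```python
# Raw header bytes as a server might send them: UTF-8 inside an HTTP header.
raw = "Content-Disposition: attachment; filename=файл.vcf".encode("utf-8")

# If the header is decoded as UTF-8 on read, re-encoding as Latin-1 fails on write.
as_utf8 = raw.decode("utf-8")
try:
    as_utf8.encode("iso-8859-1")
    failed = False
except UnicodeEncodeError:
    failed = True

# Decoding as Latin-1 instead maps every byte to one code point, so the
# original bytes survive a decode/encode round trip unchanged.
as_latin1 = raw.decode("iso-8859-1")
round_tripped = as_latin1.encode("iso-8859-1")

def encode_with_fallback(text: str) -> bytes:
    """Hypothetical workaround: prefer Latin-1, fall back to UTF-8 if that fails."""
    try:
        return text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return text.encode("utf-8")
```

The fallback keeps recompression from crashing, at the cost of a record whose HTTP headers are no longer uniformly Latin-1; the Latin-1 round trip avoids that but shows mojibake if the string is ever displayed.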
from warcio.
That's certainly utf-8 in an http header. https://tools.ietf.org/html/rfc5987 says that's not allowed, but we shouldn't be surprised that it happens.
from warcio.