
Comments (6)

ikreymer commented on May 26, 2024

Hm, yeah, I think you're right, this is a mistake! This should be UTF-8, as it's warcinfo content and not WARC or HTTP headers.

from warcio.

wumpus commented on May 26, 2024

OK.

As for WARC headers, they can contain UTF-8, and it looks like that's already handled correctly? But it's not tested.

And for http headers, https://tools.ietf.org/html/rfc5987 says there's a funky encoding to use if you want non-latin-1. I have no idea if useragents typically decode them or not. If they do, then someone's got to encode them before printing into the warc? The warc standard is silent about this issue?
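For reference, the RFC 5987 encoding mentioned above can be sketched with the standard library. This is a minimal illustration, not something warcio does; the helper names `encode_rfc5987` and `decode_rfc5987` are hypothetical, and the optional language tag is left empty.

```python
from urllib.parse import quote, unquote

# Characters RFC 5987 allows unescaped in an ext-value (besides letters and digits).
ATTR_CHARS = "!#$&+-.^_`|~"

def encode_rfc5987(value, charset="utf-8"):
    """Hypothetical helper: build an ext-value of the form charset''percent-encoded."""
    return "{}''{}".format(charset.upper(), quote(value, safe=ATTR_CHARS, encoding=charset))

def decode_rfc5987(ext_value):
    """Hypothetical helper: reverse encode_rfc5987 (the language tag is ignored)."""
    charset, _language, encoded = ext_value.split("'", 2)
    return unquote(encoded, encoding=charset)

# The resulting header value is pure ASCII, so it also encodes as ISO-8859-1.
header_value = "attachment; filename*=" + encode_rfc5987("naïve file.txt")
header_bytes = header_value.encode("iso-8859-1")
```

Whether a given user agent or crawler decodes or re-encodes such values is a separate question that this sketch does not answer.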


ikreymer commented on May 26, 2024

Yeah, this is all very confusing, unfortunately. Currently, both WARC and HTTP headers are encoded as ISO-8859-1/Latin-1:
https://github.com/webrecorder/warcio/blob/master/warcio/statusandheaders.py#L148

I think the HTTP headers should be whatever the original encoding was, which is probably ISO-8859-1.

PEP 3333 also requires ISO-8859-1 for HTTP headers by default w/o the special encoding, so I think this is correct.

For WARC headers, it seems that it should be using UTF-8 as the default, based on this (slightly ambiguous wording):

Although UTF-8 characters are allowed, the 'encoded-word' mechanism of [RFC2047] may also be used when writing WARC fields and shall also be understood by WARC reading software.

Heritrix seems to use UTF-8, while warcprox uses ISO-8859-1/Latin-1.
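To make the 'encoded-word' mechanism from the quoted wording concrete, here is a small sketch using Python's `email.header` module, which implements RFC 2047. It only illustrates the encoding itself; the sample value is made up, and nothing here reflects how Heritrix, warcprox, or warcio write WARC fields.

```python
from email.header import Header, decode_header, make_header

# Made-up field value containing a non-ASCII character.
original = "Käse crawl"

# Encode as an RFC 2047 encoded-word; the result is plain ASCII.
encoded = Header(original, "utf-8").encode()

# A reader that understands encoded-words can recover the original text.
decoded = str(make_header(decode_header(encoded)))
```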

In conclusion, I think what we should have is:

  • warcinfo contents as UTF-8
  • WARC headers as UTF-8
  • http headers as ISO-8859-1/Latin-1
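The split proposed in the list above can be sketched as follows. This is only an illustration of the three encodings side by side: the constants, the `encode_block` helper, and the sample field values are all hypothetical and are not warcio's API.

```python
# Encodings from the list above; names are hypothetical.
WARCINFO_ENCODING = "utf-8"
WARC_HEADER_ENCODING = "utf-8"
HTTP_HEADER_ENCODING = "iso-8859-1"

def encode_block(lines, encoding):
    """Join lines with CRLF, encode them, and terminate with a final CRLF."""
    return "\r\n".join(lines).encode(encoding) + b"\r\n"

# The same character (é) ends up as different bytes depending on the block.
warc_headers = encode_block(
    ["WARC-Type: response", "WARC-Filename: café.warc.gz"], WARC_HEADER_ENCODING
)
http_headers = encode_block(["HTTP/1.1 200 OK", "X-Note: café"], HTTP_HEADER_ENCODING)
warcinfo = encode_block(["operator: José"], WARCINFO_ENCODING)
```

Here `é` is written as the two bytes `\xc3\xa9` in the WARC headers and warcinfo content, but as the single byte `\xe9` in the HTTP headers.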


wumpus commented on May 26, 2024

Yes, I agree with all that.

I'd like to eventually produce a small website and a corresponding .warc.gz which can be used as a 'torture test' for warc writers, readers, and playback. But that's an issue for another day.


zuups commented on May 26, 2024

Maybe my problem is connected with this issue? I'm getting "UnicodeEncodeError: 'latin-1' codec can't encode characters" when running warcio recompress on some arc/warc files (most files from the same crawl are OK, but some are not).

This line in the HTTP headers seems to be the culprit:
Content-Disposition: attachment; filename=거뢁거뢁.vcf

Traceback (most recent call last):
  File "/usr/local/bin/warcio", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.4/dist-packages/warcio/cli.py", line 47, in main
    cmd.func(cmd)
  File "/usr/local/lib/python3.4/dist-packages/warcio/cli.py", line 94, in call
    self.load_and_write(stream, cmd.output)
  File "/usr/local/lib/python3.4/dist-packages/warcio/cli.py", line 111, in load_and_write
    writer.write_record(record)
  File "/usr/local/lib/python3.4/dist-packages/warcio/warcwriter.py", line 347, in write_record
    self._write_warc_record(self.out, record)
  File "/usr/local/lib/python3.4/dist-packages/warcio/warcwriter.py", line 232, in _write_warc_record
    self._set_header_buff(record)
  File "/usr/local/lib/python3.4/dist-packages/warcio/warcwriter.py", line 224, in _set_header_buff
    headers_buff = record.http_headers.to_bytes(self.header_filter, 'iso-8859-1')
  File "/usr/local/lib/python3.4/dist-packages/warcio/statusandheaders.py", line 155, in to_bytes
    return self.to_str(filter_func).encode(encoding) + b'\r\n'
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 336-339: ordinal not in range(256)

Here are some files I have problems with: http://veebiarhiiv.digar.ee/20180115warcencodeissue/
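The same error can be reproduced without warcio by encoding that header line directly. This is a minimal sketch; it only assumes the header text is held as a Python `str` exactly as quoted above.

```python
# The header line quoted above, as a Python str.
header_line = "Content-Disposition: attachment; filename=거뢁거뢁.vcf"

try:
    header_line.encode("iso-8859-1")
    failed = None
except UnicodeEncodeError as exc:
    # exc.start and exc.end delimit the run of characters that could not be encoded.
    failed = header_line[exc.start:exc.end]

print(repr(failed))
```

The four characters reported here correspond to the four-character range (336-339) in the traceback, where the offset is counted from the start of the full header block rather than from this single line.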


wumpus commented on May 26, 2024

That's certainly utf-8 in an http header. https://tools.ietf.org/html/rfc5987 says that's not allowed, but we shouldn't be surprised that it happens.
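One way a writer could tolerate such headers is to try ISO-8859-1 first and fall back to UTF-8 only when that fails. The sketch below is an assumption about a possible approach, not a description of warcio's behavior; `encode_http_header_line` is a hypothetical helper.

```python
def encode_http_header_line(line):
    """Hypothetical helper: return (bytes, codec_used) for one header line.

    ISO-8859-1 is tried first; UTF-8 is used only if the line contains
    characters outside the Latin-1 range.
    """
    try:
        return line.encode("iso-8859-1"), "iso-8859-1"
    except UnicodeEncodeError:
        return line.encode("utf-8"), "utf-8"

latin1_bytes, latin1_codec = encode_http_header_line("X-Note: café")
utf8_bytes, utf8_codec = encode_http_header_line(
    "Content-Disposition: attachment; filename=거뢁거뢁.vcf"
)
```

The trade-off is that a reader cannot tell from the bytes alone which codec was used, so the fallback avoids the crash but does not settle what the stored header should mean.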

