Comments (3)
Looks like this is related to the Content-Disposition re-encoding that warcio is doing...
But what is it that you're trying to do? Remove the non-ASCII headers altogether?
I believe to_ascii_bytes() already catches encoding errors, so that should not throw.
The issue you have can actually be reproduced with just:
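from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

# warcfilebasename: path to the WARC file, without its extension
with open(warcfilebasename + ".warc", 'rb') as f_in:
    with open(warcfilebasename + ".warc.gz", 'wb') as f_out:
        writer = WARCWriter(f_out, gzip=True)
        for record in ArchiveIterator(f_in):
            writer.write_record(record)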
The issue happens because:
Content-Disposition: attachment; filename="BIZKAIKO-AIZKORA-TXAPELKETA-2ª-eus.pdf";
is canonicalized to:
Content-Disposition: attachment; filename*=UTF-8''BIZKAIKO-AIZKORA-TXAPELKETA-2%C2%AA-eus.pdf;
automatically, but for some reason the size is not adjusted.
This happens during write_record; the iterator does not do the adjustment on read.
It should adjust the size, though.
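To make the size mismatch concrete, here is a minimal sketch of the rewrite using only the Python standard library. It is not warcio's actual code: canonicalize_filename is a hypothetical helper, and it assumes the original header bytes were UTF-8 and that the "size" in question is a length field covering the header block. It only shows that the percent-encoded form is longer than the raw form, so any previously recorded length goes stale after the rewrite.

```python
from urllib.parse import quote

# Header as it appears in the source record (assumed to be UTF-8 on disk).
original = 'Content-Disposition: attachment; filename="BIZKAIKO-AIZKORA-TXAPELKETA-2ª-eus.pdf";'
filename = "BIZKAIKO-AIZKORA-TXAPELKETA-2ª-eus.pdf"


def canonicalize_filename(name):
    """Hypothetical helper: build the filename*=UTF-8''... form of the header.

    This mimics the rewrite described above; it is NOT warcio's implementation.
    """
    # quote() percent-encodes the UTF-8 bytes of any non-ASCII character.
    return "Content-Disposition: attachment; filename*=UTF-8''" + quote(name) + ";"


rewritten = canonicalize_filename(filename)

# Compare the on-the-wire byte lengths before and after the rewrite.
original_len = len(original.encode("utf-8"))
rewritten_len = len(rewritten.encode("ascii"))
delta = rewritten_len - original_len

print(rewritten)
print("header grew by", delta, "bytes")
```

Since the rewritten header is longer, a length computed from the original bytes no longer matches what gets written, which is consistent with the adjustment needing to happen in write_record.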
from warcio.
Now fixed on develop, thanks!
and deployed in 1.7.2