Comments (5)
Wow, I'd totally forgotten about this!
Seems like there's a hook in the underlying Python library to spot this case: https://docs.python.org/3/library/zlib.html#zlib.Decompress.eof
Decompress.eof
A boolean indicating whether the end of the compressed data stream has been reached.
This makes it possible to distinguish between a properly formed compressed stream, and an incomplete or truncated one.
New in version 3.3.
But it's not clear to me how to weave that in here...
warcio/warcio/archiveiterator.py
Lines 108 to 140 in aa702cb
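As a rough illustration of how the eof flag behaves, here is a minimal sketch using only the standard library. The helper name gzip_member_is_complete is made up for this example and is not part of warcio; it assumes the input holds a single gzip member.

```python
import gzip
import zlib


def gzip_member_is_complete(data: bytes) -> bool:
    """Return True if `data` holds one complete gzip member.

    Hypothetical helper for illustration only; not part of warcio.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    decompressor.decompress(data)
    # eof only becomes True once the end-of-stream marker has been seen,
    # so it stays False when the input was cut short.
    return decompressor.eof


payload = gzip.compress(b"WARC/1.0\r\n" * 100)
print(gzip_member_is_complete(payload))        # complete stream
print(gzip_member_is_complete(payload[:-10]))  # last 10 bytes cut off
```

Note that decompress() does not raise on the truncated input; the only signal is that eof never flips to True.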
from warcio.
This came up recently in IIPC Slack when trying to diagnose why warchaeology was reporting a corrupted WARC file, and warcio was not. It appeared that the WARC file was truncated as a result of a browsertrix-crawler container exiting abnormally, and not closing the GZIP file properly...
In case it's helpful to have a test script (which doesn't emit a warning that I can see):
    from warcio.archiveiterator import ArchiveIterator

    with open('test.warc.gz', 'rb') as stream:
        for i, record in enumerate(ArchiveIterator(stream)):
            print(i, record.rec_headers.get_header('WARC-Target-URI'))
            if record.rec_type == 'response':
                content = record.content_stream().read()
And here's a test file: test.warc.gz
gunzip, on the other hand, does notice:
$ gunzip --test test.warc.gz
gunzip: truncated input
gunzip: test.warc.gz: uncompress failed
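For comparison, a similar whole-file check can be sketched in Python with the standard gzip module, which raises EOFError when the compressed data ends before the end-of-stream marker. The function name gzip_test is an assumption made for this example, not an existing warcio or gzip API.

```python
import gzip
import io


def gzip_test(fileobj) -> bool:
    """Read a (possibly multi-member) gzip stream to the end.

    Hypothetical helper for illustration only. Returns False if the
    stream ends before the end-of-stream marker, True otherwise.
    """
    try:
        with gzip.GzipFile(fileobj=fileobj) as gz:
            # Read and discard the decompressed data in chunks.
            while gz.read(64 * 1024):
                pass
    except EOFError:
        # The gzip module raises EOFError on a truncated stream.
        return False
    return True


# Two concatenated gzip members, similar in shape to a record-compressed WARC.
good = gzip.compress(b"a" * 1000) + gzip.compress(b"b" * 1000)
print(gzip_test(io.BytesIO(good)))
print(gzip_test(io.BytesIO(good[:-10])))
```

Like gunzip --test, this only reports whether the whole file is intact; it does not say which member was cut short.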
@edsu which record in test.warc.gz is the truncated one? And where can I find warchaeology? Thanks.
I believe it's the last record. If you try to gunzip the file, you should see the error right at the end.
I'm not really familiar with it but here is the warchaeology repo: https://github.com/nlnwa/warchaeology
@edsu thanks for adding a simple test and @anjackson for looking up the .eof property!
With that, I think detecting this case can be done as follows:
diff --git a/warcio/archiveiterator.py b/warcio/archiveiterator.py
index 484b7f0..451f182 100644
--- a/warcio/archiveiterator.py
+++ b/warcio/archiveiterator.py
@@ -113,7 +113,13 @@ class ArchiveIterator(six.Iterator):
                 yield self.record
-            except EOFError:
+            except EOFError as e:
+                if self.reader.decompressor:
+                    if not self.reader.decompressor.eof:
+                        sys.stderr.write("warning: final record appears to be truncated")
+
                 empty_record = True
             self.read_to_end()
But what should the desired behavior be more generally?
- for warcio check, seems like it should return an error. The gunzip behavior is definitely not desirable, as that fails to unzip any record even if only the last one is invalid.
- for indexing, it seems like the indexing should still succeed, and maybe print the warning? There are other recoverable errors that are also logged, such as Content-Length mismatches. Should it still return a 1
It sort of depends on how the WARC is being used:
- If the goal is to detect whether a WARC is valid after transfer, this is definitely an error and should be detected.
- If the goal is to index a WARC that already exists, this is more of a warning, since not much can be done at that point, and we definitely don't want to invalidate the whole WARC just because of the last record.
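One way to picture the two modes is a sketch like the following, which works on raw gzip members rather than on warcio's reader. The strict flag, the TruncatedMemberError class, and the iter_gzip_members function are all hypothetical names invented for this example; nothing here is an existing warcio API.

```python
import gzip
import sys
import zlib


class TruncatedMemberError(Exception):
    """Hypothetical error type for this sketch; not part of warcio."""


def iter_gzip_members(data: bytes, strict: bool = False):
    """Yield the decompressed bytes of each complete gzip member in `data`.

    strict=True  -> a truncated final member raises (validation use case).
    strict=False -> a truncated final member only logs a warning, and the
                    complete members before it are still yielded (indexing).
    """
    while data:
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        body = decompressor.decompress(data)
        if not decompressor.eof:
            message = "final record appears to be truncated"
            if strict:
                raise TruncatedMemberError(message)
            sys.stderr.write("warning: " + message + "\n")
            return
        yield body
        # Bytes after the end-of-stream marker belong to the next member.
        data = decompressor.unused_data


members = gzip.compress(b"one") + gzip.compress(b"two")
print(list(iter_gzip_members(members)))       # both members are intact
print(list(iter_gzip_members(members[:-4])))  # second member is cut short
```

In the lenient mode the earlier records remain usable, which matches the point about not invalidating the whole WARC because of its last record.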