Comments (5)

anjackson commented on September 21, 2024

Wow, I'd totally forgotten about this!

Seems like there's a hook in the underlying Python library to spot this case: https://docs.python.org/3/library/zlib.html#zlib.Decompress.eof

Decompress.eof
A boolean indicating whether the end of the compressed data stream has been reached.
This makes it possible to distinguish between a properly formed compressed stream, and an incomplete or truncated one.
New in version 3.3.

But it's not clear to me how to weave that in here...
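
As a minimal sketch of what the `eof` attribute reports, independent of warcio: the helper name `is_truncated_gzip` below is hypothetical, and the example assumes the input holds a single gzip member.

```python
import gzip
import zlib


def is_truncated_gzip(data: bytes) -> bool:
    """Return True if `data` ends before the gzip end-of-stream marker.

    Assumes `data` contains a single gzip member.
    """
    # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Truncated input does not raise here; it simply returns partial output
    decompressor.decompress(data)
    return not decompressor.eof


complete = gzip.compress(b"WARC/1.0\r\n" * 100)
truncated = complete[:-10]  # cut off the trailer and a bit of the body

print(is_truncated_gzip(complete))   # False
print(is_truncated_gzip(truncated))  # True
```

Note that decompressing truncated input does not raise an exception by itself, which is why the flag has to be checked explicitly.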

while True:
    try:
        self.record = self._next_record(self.next_line)
        if raise_invalid_gzip:
            self._raise_invalid_gzip_err()
        yield self.record
    except EOFError:
        empty_record = True
    self.read_to_end()
    if self.reader.decompressor:
        # if another gzip member, continue
        if self.reader.read_next_member():
            continue
        # if empty record, then we're done
        elif empty_record:
            break
        # otherwise, probably a gzip
        # containing multiple non-chunked records
        # raise this as an error
        else:
            raise_invalid_gzip = True
    # non-gzip, so we're done
    elif empty_record:
        break
self.close()

from warcio.

edsu commented on September 21, 2024

This came up recently in IIPC Slack when trying to diagnose why warchaeology was reporting a corrupted WARC file and warcio was not. It appeared that the WARC file was truncated as a result of a browsertrix-crawler container exiting abnormally and not closing the GZIP file properly...

In case it's helpful to have a test script (which doesn't emit a warning that I can see):

from warcio.archiveiterator import ArchiveIterator

with open('test.warc.gz', 'rb') as stream:
    for i, record in enumerate(ArchiveIterator(stream)):
        print(i, record.rec_headers.get_header('WARC-Target-URI'))
        if record.rec_type == 'response':
            content = record.content_stream().read()

And here's a test file: test.warc.gz

gunzip on the other hand does notice:

$ gunzip --test test.warc.gz
gunzip: truncated input
gunzip: test.warc.gz: uncompress failed

wumpus commented on September 21, 2024

@edsu which record in test.warc.gz is the truncated one? And where can I find warchaeology? Thanks.

edsu commented on September 21, 2024

I believe it's the last record. If you try to gunzip the file, you should see the error right at the end.
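
One way to confirm which member is cut short, as a rough sketch: walk the gzip members with the standard `zlib` module and report the offset of the first one that never reaches end-of-stream. The function name `scan_gzip_members` is hypothetical, and this assumes the usual WARC convention of one record per gzip member; garbage between members would raise `zlib.error` instead.

```python
import gzip
import zlib


def scan_gzip_members(data: bytes):
    """Yield (offset, complete) for each gzip member found in `data`."""
    offset = 0
    while offset < len(data):
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        decompressor.decompress(data[offset:])
        yield offset, decompressor.eof
        if not decompressor.eof:
            # Input ran out before this member's end-of-stream marker
            break
        # unused_data holds the bytes that follow the finished member
        offset = len(data) - len(decompressor.unused_data)


# Three members standing in for three records, with the last one cut short
members = [gzip.compress(b"record %d" % i) for i in range(3)]
data = b"".join(members)[:-5]

for index, (offset, complete) in enumerate(scan_gzip_members(data)):
    print(index, offset, "ok" if complete else "truncated")
```

Slicing `data[offset:]` copies the remaining bytes on every pass, so for large files a streaming read would be a better fit; the sketch favors brevity.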

I'm not really familiar with it but here is the warchaeology repo: https://github.com/nlnwa/warchaeology

ikreymer commented on September 21, 2024

@edsu thanks for adding a simple test and @anjackson for looking up the .eof property!

With that, I think detecting this case can be done as follows:

diff --git a/warcio/archiveiterator.py b/warcio/archiveiterator.py
index 484b7f0..451f182 100644
--- a/warcio/archiveiterator.py
+++ b/warcio/archiveiterator.py
@@ -113,7 +113,13 @@ class ArchiveIterator(six.Iterator):
 
                 yield self.record
 
-            except EOFError:
+            except EOFError as e:
+                if self.reader.decompressor:
+                    if not self.reader.decompressor.eof:
+                        sys.stderr.write("warning: final record appears to be truncated")
+
                 empty_record = True
 
             self.read_to_end()

But what should the desired behavior be more generally?

  • for warcio check, it seems like it should return an error
  • the gunzip behavior is definitely not desirable, as it fails to unzip any record even if only the last one is invalid.
  • for indexing, it seems like the indexing should still succeed, and maybe print the warning? There are other recoverable errors that are also logged, such as Content-Length mismatches. Should it still return a 1?

It sort of depends on how the WARC is being used:

  • If the goal is to detect whether a WARC is valid after transfer, this is definitely an error and should be detected.
  • If the goal is to index a WARC that already exists, this is more of a warning, since not much can be done at that point, and we definitely don't want to invalidate the whole WARC just because of the last record.
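
The two use cases could be served by a single switch, roughly as sketched below. This is not warcio code: `read_members`, the `strict` flag, and `TruncatedGzipError` are hypothetical names, and the sketch works on raw gzip members rather than parsed WARC records.

```python
import gzip
import sys
import zlib


class TruncatedGzipError(Exception):
    """Hypothetical error type, not part of warcio."""


def read_members(data: bytes, strict: bool = False):
    """Yield the decompressed payload of each complete gzip member.

    strict=True  -> raise on a truncated member (validation use case)
    strict=False -> warn on stderr and stop (indexing use case)
    """
    offset = 0
    while offset < len(data):
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        payload = decompressor.decompress(data[offset:])
        if not decompressor.eof:
            message = "gzip member at offset %d appears to be truncated" % offset
            if strict:
                raise TruncatedGzipError(message)
            sys.stderr.write("warning: " + message + "\n")
            return
        yield payload
        offset = len(data) - len(decompressor.unused_data)


members = [gzip.compress(bytes([65 + i]) * 50) for i in range(3)]
data = b"".join(members)[:-5]  # last member is cut short

# Lenient mode keeps the two intact members and only warns about the third
print(len(list(read_members(data))))
```

In strict mode the same input still yields the intact members first and raises only when the truncated one is reached, so a validating caller can report how far it got.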
