
webrecorder / warcio


Streaming WARC/ARC library for fast web archive IO

Home Page: https://pypi.python.org/pypi/warcio

License: Apache License 2.0

Python 98.09% Arc 1.91%
web-archives web-archiving warc pywb python

warcio's Issues

Support ZStd Compression for WARCs

ArchiveTeam has been using WARCs with ZStd compression (https://github.com/ArchiveTeam/wget-lua/releases/tag/v1.20.3-at.20200401.01), so it would be good for warcio to also support Zstd.

Support for ZStd could involve the following:

  1. Support for reading zstd-compressed WARCs
  2. Support for writing zstd-compressed WARCs with a passed-in dictionary (or a default dictionary)
  3. Training/creating a zstd dictionary based on one or more WARCs

Item 1 is definitely needed for interoperability, to be able to read WARCs produced by other tools. Items 2 and 3 are more experimental and will help warcio keep up with evolving compression options.
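For item 1, a minimal reading sketch using the third-party zstandard package might look like this (assuming a stream-compressed .warc.zst without the embedded custom dictionary that ArchiveTeam's format can carry in a skippable frame):

import zstandard  # third-party: pip install zstandard

from warcio.archiveiterator import ArchiveIterator

with open('example.warc.zst', 'rb') as raw:
    dctx = zstandard.ZstdDecompressor()
    # stream_reader yields decompressed bytes, which ArchiveIterator
    # can consume as if it were an uncompressed WARC
    with dctx.stream_reader(raw) as stream:
        for record in ArchiveIterator(stream):
            print(record.rec_type,
                  record.rec_headers.get_header('WARC-Target-URI'))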

Different encodings when reading / writing headers?

Reading a WARC record (using ArchiveIterator) with a unicode character outside the range of iso-8859-1 in the HTTP headers works fine, but writing it again (using WARCWriter) gives this error:

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 336: ordinal not in range(256).

This is the header line causing the problem in this case:

Content-disposition: attachment; filename="Lancement du Système d’Échange Local (SEL).pdf"

with the ’ (U+2019) in d’Échange causing the problem.

This is the code that parses the headers, attempting utf-8 first:

    def decode_header(line):
        try:
            # attempt to decode as utf-8 first
            return to_native_str(line, 'utf-8')
        except:
            # if fails, default to ISO-8859-1
            return to_native_str(line, 'iso-8859-1')

These are the lines of code that write headers, hardcoded to latin-1 (iso-8859-1):

def _set_header_buff(self, record):
    headers_buff = record.http_headers.to_bytes(self.header_filter, 'iso-8859-1')
    record.http_headers.headers_buff = headers_buff

If headers are by default read in utf-8, wouldn't it make sense to write them as utf-8 as well?
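A sketch of one symmetric approach (not warcio's current behavior): encode each header line as iso-8859-1 where possible, and fall back to utf-8 for lines that could only have been decoded as utf-8, so that whatever decode_header produced round-trips:

def encode_header_line(line):
    # hypothetical helper mirroring decode_header, not warcio API
    try:
        # iso-8859-1 keeps ASCII and latin-1 headers byte-identical
        return line.encode('iso-8859-1')
    except UnicodeEncodeError:
        # headers that only decoded as utf-8 are written back as utf-8
        return line.encode('utf-8')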

Threadpool executor creates zero byte warc files

Using ThreadPoolExecutor to create WARC files creates files with zero bytes. I have provided test code below.
I have provided the test code below.

#!/usr/bin/env python3

from warcio.capture_http import capture_http
import requests
import concurrent.futures


def save_warc(url, ofile):
    with capture_http(ofile):
        requests.get(url)


with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    executor.submit(save_warc, "https://example.com", "com.warc.gz")
    executor.submit(save_warc, "https://example.org", "org.warc.gz")

Ensure http headers added automatically only if explicitly requested.

Currently, on load, the http_headers block is always set automatically to default HTTP headers when parsing records. This is incorrect for non-HTTP WARC records.

Instead, by default, only add HTTP headers for response, request, and revisit records when length > 0; otherwise set http_headers=None.

Sometimes it is useful to auto-generate the HTTP headers for other record types, for example for replay. This can now be enabled with a new ensure_http_headers=True flag, which will auto-create HTTP headers suitable for replay, with status 200 and content type and content length set.
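Assuming the flag is exposed on ArchiveIterator as described, usage might look like this sketch:

from warcio.archiveiterator import ArchiveIterator

with open('example.warc.gz', 'rb') as stream:
    # auto-create replay-suitable HTTP headers even for non-HTTP records
    for record in ArchiveIterator(stream, ensure_http_headers=True):
        print(record.rec_type, record.http_headers)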

Multiple cookies are problematic when parsing WARC files

When reading a WARC file that contains a 'Set-Cookie' header with multiple cookies on subsequent lines, the parsing logic breaks each line on the first colon. That is fine for ordinary headers, but when a line is actually a continuation of the cookies from the previous line, it is incorrectly added to the http_headers property as a new header.

Having played with the warcio code here (https://github.com/webrecorder/warcio/blob/master/warcio/statusandheaders.py#L262) and adding the following:

while line:
    if line.startswith('Set-Cookie:'):
        print('Testing Set-Cookie line -> ', line)

I can see that the cookies on subsequent lines are not picked up, which is expected of course, but I always like to test my hypothesis before just assuming.

Here are the HTTP headers from one of my WARC files:

HTTP/1.1 200 OK 
Cache-Control: private
Content-Length: 25858
Content-Type: text/html; charset=utf-8
Vary: Accept-Encoding
Server: Microsoft-IIS/10.0
Set-Cookie: ASP.NET_SessionId=xxx; path=/; secure; HttpOnly
COOKIE_A=xxx|False; domain=xxx; expires=Tue, 03-Oct-2028 23:42:12 GMT; path=/; secure; HttpOnly
COOKIE_B=xxx;Path=/;HttpOnly;Domain=xxx
X-Frame-Options: SAMEORIGIN
Date: Sat, 06 Oct 2018 23:42:11 GMT

This is the code I used to access the headers:

>>> for header in a.items[0].record.http_headers.headers:
...     print(header)
...
('Cache-Control', 'private')
('Content-Length', '25858')
('Content-Type', 'text/html; charset=utf-8')
('Vary', 'Accept-Encoding')
('Server', 'Microsoft-IIS/10.0')
('Set-Cookie', 'ASP.NET_SessionId=xxx; path=/; secure; HttpOnly')
('COOKIE_A=xxx|False; domain=xxx; expires=Tue, 03-Oct-2028 23', '42:12 GMT; path=/; secure; HttpOnly')
('X-Frame-Options', 'SAMEORIGIN')
('Date', 'Sat, 06 Oct 2018 23:42:11 GMT')

I'm looking into how pywb 0.33 did this before this was extracted to see if there's a difference in behavior.
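One possible heuristic (a sketch, not what warcio currently does) is to validate each field name against the RFC 7230 token character set and treat anything else as a continuation of the previous value. A "name" such as COOKIE_A=xxx|False; domain=xxx; expires=Tue, 03-Oct-2028 23 contains '=', ';' and spaces, which are not token characters, so it would be folded into the preceding Set-Cookie value:

import re

# RFC 7230 "token" characters permitted in a header field-name
TOKEN_RE = re.compile(r"^[!#$%&'*+\-.^_`|~0-9A-Za-z]+$")

def parse_header_lines(raw_lines):
    headers = []
    for line in raw_lines:
        name, sep, value = line.partition(':')
        if sep and TOKEN_RE.match(name):
            headers.append((name, value.strip()))
        elif headers:
            # not a valid field-name: fold into the previous header's value
            prev_name, prev_value = headers[-1]
            headers[-1] = (prev_name, prev_value + ', ' + line.strip())
    return headers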

record.content_stream().read() alters the record and causes a write out to fail

(Using code from #57)
Calling record.content_stream().read() before writing the record changes the record in such a way that the file subsequently written out is incorrect and mangled.

import pytest

from io import BytesIO
from tempfile import NamedTemporaryFile

from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

def test_identity_correct ():
    """ read(write(record)) should yield record """
    with NamedTemporaryFile () as fd:
        payload = b'foobar'
        writer = WARCWriter (fd, gzip=False)
        httpHeaders = StatusAndHeaders('GET / HTTP/1.1', {}, is_http_request=True)
        warcHeaders = {'Foo': 'Bar'}
        record = writer.create_warc_record ('http://example.com/', 'request',
                payload=BytesIO(payload),
                warc_headers_dict=warcHeaders, http_headers=httpHeaders)
        writer.write_record (record)

        fd.seek (0)
        rut = next (ArchiveIterator (fd))
        golden = record
        assert rut.rec_type == golden.rec_type
        assert rut.rec_headers == golden.rec_headers
        assert rut.content_type == golden.content_type
        assert rut.length == golden.length
        assert rut.http_headers == golden.http_headers
        assert rut.raw_stream.read() == payload

def test_identity_fail ():
    """ read(write(record)) should yield record """
    with NamedTemporaryFile () as fd:
        payload = b'foobar'
        writer = WARCWriter (fd, gzip=False)
        httpHeaders = StatusAndHeaders('GET / HTTP/1.1', {}, is_http_request=True)
        warcHeaders = {'Foo': 'Bar'}
        record = writer.create_warc_record ('http://example.com/', 'request',
                payload=BytesIO(payload),
                warc_headers_dict=warcHeaders, http_headers=httpHeaders)
        record.content_stream().read()
        writer.write_record (record)

        fd.seek (0)
        rut = next (ArchiveIterator (fd))
        golden = record
        assert rut.rec_type == golden.rec_type
        assert rut.rec_headers == golden.rec_headers
        assert rut.content_type == golden.content_type
        assert rut.length == golden.length
        assert rut.http_headers == golden.http_headers
        assert rut.raw_stream.read() == payload

test_identity_correct()
print("Write Worked")
test_identity_fail()
print("Write 2 Worked")

Output:

Write Worked
Traceback (most recent call last):
  File "./test2.py", line 57, in <module>
    test_identity_fail()
  File "./test2.py", line 53, in test_identity_fail
    assert rut.raw_stream.read() == payload
AssertionError
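Until this is fixed, one possible workaround (a sketch under the same test setup) is to buffer the consumed content and rebuild the record before writing:

# rebuild the record from the already-consumed content before writing
content = record.content_stream().read()
record = writer.create_warc_record('http://example.com/', 'request',
                                   payload=BytesIO(content),
                                   warc_headers_dict=warcHeaders,
                                   http_headers=httpHeaders)
writer.write_record(record)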

configuring warc capture

I'm excited to adopt warcio in a project but I'm stuck.

Following the warcio WARC write examples, the WARC files I create do not contain styles, images, fonts, or videos. However, the user experience on webrecorder.io does contain those elements.

I'm not sure if I should be passing args, kwargs, params or a filter_function to achieve my desired result but examining the tests and source has left me without a clue!

I'd be very grateful if you could give me a hint or point me towards some samples.

Incorrect WARC-Payload-Digest values when transfer encoding is present

Per WARC/1.0 spec section 5.9:

The payload of an application/http block is its ‘entity-body’ (per [RFC2616]).

The entity-body is the HTTP body without transfer encoding, per section 4.3 of RFC 2616. (In the newer RFC 723x family it is called "payload body" instead and defined in section 3.3 of RFC 7230.)

Just to be clear to avoid confusion: this is the definition of the payload; the WARC record should still contain the exact response sent by the server with transfer encoding intact. But when calculating the WARC-Payload-Digest, the transfer encoding must be stripped.

warcio (like many other tools) passes the response data directly into the payload digester without removing transfer encoding. This means that it produces an invalid WARC-Payload-Digest when the HTTP body is transfer-encoded.
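A sketch of stripping the transfer encoding before digesting, reusing warcio's own ChunkedDataReader (the class content_stream() already uses for de-chunking):

import hashlib
from io import BytesIO

from warcio.bufferedreaders import ChunkedDataReader

# de-chunk the entity-body first, then digest; this digests b'Wikipedia'
raw_body = b'4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n'
entity_body = ChunkedDataReader(BytesIO(raw_body)).read()
payload_digest = hashlib.sha1(entity_body).hexdigest()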

non-streaming interface would be useful

Right now the only interface for getting at the record content is record.content_stream().read(), which is streaming, so I can't do it twice. If I'm passing a record around in a program and want to access its content in multiple places, I've ended up wrapping warcio's record in a class that has a .content() method.

That seems odd. Other packages like Requests offer both streaming and non-streaming interfaces.

Obviously we'd want to preserve streaming behavior -- pure streaming code should continue to not buffer all of the content in memory. One way to do that would be to save all of the content in memory only if .content() is called before .content_stream().read(), and make calling .content() after calling content_stream().read() raise an exception.
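As a stopgap, a wrapper along these lines (hypothetical, not part of warcio) provides a repeatable, non-streaming accessor:

class BufferedRecord:
    """Hypothetical wrapper adding a non-streaming content() accessor."""

    def __init__(self, record):
        self.record = record
        self._content = None

    def content(self):
        # buffer the full payload once; later calls reuse the buffer
        if self._content is None:
            self._content = self.record.content_stream().read()
        return self._content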

Plans for adding type annotations?

Hi all,

Are type annotations on the roadmap at all? Would you take PRs for them? If so, would you prefer the comment-based, Python 2-compatible annotation scheme, or the syntax introduced by PEP 526 (which means dropping support for anything older than Python 3.6)?

PS: It's always a great feeling when you know there's a (well maintained) Python library out there tailored exactly to your current needs. Thank you!

Facing issue while custom writing without http_headers

>>> type(content)
<class 'bytes'>

>>> record = writer.create_warc_record("https://www.xxxxxx.html",record_type="response", payload=BytesIO(content))

Traceback (most recent call last):
  ...
    raise StatusAndHeadersParserException(msg, full_statusline)
warcio.statusandheaders.StatusAndHeadersParserException: Expected Status Line starting with ['HTTP/1.0', 'HTTP/1.1'] - Found:   <!DOCTYPE html>
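For a 'response' record, warcio tries to parse an HTTP header block out of the payload, and here the payload is a bare HTML body. A sketch of two possible fixes (the status line and content type below are illustrative assumptions): supply the HTTP headers explicitly, or write a 'resource' record, which carries no HTTP block.

from io import BytesIO

from warcio.statusandheaders import StatusAndHeaders

# option 1: supply the HTTP header block explicitly
http_headers = StatusAndHeaders('200 OK',
                                [('Content-Type', 'text/html; charset=utf-8')],
                                protocol='HTTP/1.1')
record = writer.create_warc_record('https://www.example.com/page.html', 'response',
                                   payload=BytesIO(content),
                                   http_headers=http_headers)

# option 2: write a 'resource' record, which has no HTTP header block
record = writer.create_warc_record('https://www.example.com/page.html', 'resource',
                                   payload=BytesIO(content))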

Add an iterator of HTTP exchanges

I can see many use cases where it would be useful to iterate over WARC records and yield related HTTP request and response records together as a tuple. I understand that WARC does not guarantee the presence of the pair in the same file or in any specific order, but in a typical archival collection we might find them close enough; this iterator could be based on best-effort matching.

half_exchanges = {}

for record in ArchiveIterator(stream):
    # Filter out any non-HTTP(S) records
    uri = record.rec_headers.get_header('WARC-Target-URI') or ''
    if uri.startswith(('http:', 'https:')):
        rec_id = None
        if record.rec_type == 'request':
            rec_id = record.rec_headers.get_header('WARC-Concurrent-To')
        elif record.rec_type == 'response':
            rec_id = record.rec_headers.get_header('WARC-Record-ID')

        if rec_id:
            if rec_id not in half_exchanges:
                half_exchanges[rec_id] = record
            else:
                if record.rec_type == 'request':
                    req = record
                    res = half_exchanges[rec_id]
                else:
                    req = half_exchanges[rec_id]
                    res = record
                # Remove the paired bookkeeping entry and yield the pair
                del half_exchanges[rec_id]
                yield (req, res)

The above code is one possible way to implement it: we keep track of unpaired records in a dictionary keyed by the identifier that glues corresponding request and response records together. Once a pair is matched, we delete the bookkeeping entry and yield the pair. This rudimentary approach should be taken with a pinch of salt, however, as records that are never paired leak memory indefinitely. For WARCs with chaotic ordering the memory requirement may grow with the bookkeeping dictionary, but related WARC records are generally placed in close proximity, in which case the bookkeeping should stay small.

Also, this code does not deal with revisit records yet; something should be done for those too.

AttributeError: 'brotli.Decompressor' object has no attribute 'unused_data'

Hello, I'm using warcio to read a WARC archive containing brotli-encoded HTTP responses, like so:

with open(sys.argv[1],'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response':
            continue
        print(record.content_stream().read())

This gives me the following error, because the brotli Decompressor object has no unused_data attribute:

Traceback (most recent call last):
  File "load_warc.py", line 22, in <module>
    print(record.content_stream().read())
  File "/home/sebastian/.asdf/installs/python/3.7.2/lib/python3.7/site-packages/warcio/recordloader.py", line 34, in content_stream
    return ChunkedDataReader(self.raw_stream, decomp_type=encoding)
  File "/home/sebastian/.asdf/installs/python/3.7.2/lib/python3.7/site-packages/warcio/bufferedreaders.py", line 284, in __init__
    super(ChunkedDataReader, self).__init__(stream, **kwargs)
  File "/home/sebastian/.asdf/installs/python/3.7.2/lib/python3.7/site-packages/warcio/bufferedreaders.py", line 72, in __init__
    self._init_decomp(decomp_type)
  File "/home/sebastian/.asdf/installs/python/3.7.2/lib/python3.7/site-packages/warcio/bufferedreaders.py", line 89, in _init_decomp
    self.decompressor = self.DECOMPRESSORS[decomp_type.lower()]()
  File "/home/sebastian/.asdf/installs/python/3.7.2/lib/python3.7/site-packages/warcio/bufferedreaders.py", line 31, in brotli_decompressor
    decomp.unused_data = None
AttributeError: 'brotli.Decompressor' object has no attribute 'unused_data'

What am I missing?

Error reading WAT files

When I try to use warcio to read WAT files generated by the archive-metadata-extractor tool, it gives me this error message:

    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: -97518
    Remainder: b'WARC/1.0\r\n'
Traceback (most recent call last):
  File "/home/.local/lib/python3.7/site-packages/warcio/recordloader.py", line 220, in _detect_type_load_headers
    rec_headers = self.warc_parser.parse(stream, statusline)
  File "/home/.local/lib/python3.7/site-packages/warcio/statusandheaders.py", line 264, in parse
    raise StatusAndHeadersParserException(msg, full_statusline)
warcio.statusandheaders.StatusAndHeadersParserException: Expected Status Line starting with ['WARC/1.1', 'WARC/1.0', 'WARC/0.17', 'WARC/0.18'] - Found: WARC-Type: metadata

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/.vscode/extensions/ms-python.python-2020.1.58038/pythonFiles/ptvsd_launcher.py", line 43, in <module>
    main(ptvsdArgs)
  File "/home/.vscode/extensions/ms-python.python-2020.1.58038/pythonFiles/lib/python/old_ptvsd/ptvsd/__main__.py", line 432, in main
    run()
  File "/home/.vscode/extensions/ms-python.python-2020.1.58038/pythonFiles/lib/python/old_ptvsd/ptvsd/__main__.py", line 316, in run_file
    runpy.run_path(target, run_name='__main__')
  File "/usr/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/warcio_reader.py", line 4, in <module>
    for record in ArchiveIterator(stream):
  File "/home/.local/lib/python3.7/site-packages/warcio/archiveiterator.py", line 110, in _iterate_records
    self.record = self._next_record(self.next_line)
  File "/home/.local/lib/python3.7/site-packages/warcio/archiveiterator.py", line 262, in _next_record
    self.check_digests)
  File "/home/.local/lib/python3.7/site-packages/warcio/recordloader.py", line 88, in parse_record_stream
    known_format))
  File "/home/.local/lib/python3.7/site-packages/warcio/recordloader.py", line 225, in _detect_type_load_headers
    raise ArchiveLoadFailed(msg + str(se.statusline))
warcio.exceptions.ArchiveLoadFailed: Invalid WARC record, first line: WARC-Type: metadata

This is the code snippet I used to read WAT files:

from warcio.archiveiterator import ArchiveIterator

with open('file.wat.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'metadata':
            print(record.rec_headers.get_header('WARC-Target-URI'))

Use scrapy together with warcio

I am interested in downloading a list of files through scrapy and saving them in a WARC file; is that possible? The documentation only shows how to do it with the requests module and a simple request, so if there is a way to do this, maybe it could also be added as a usage example in the documentation.

UTF-8 characters in Link header parameters raises exception

Link headers can be supplied with extra parameters that currently are not correctly handled by the percent encoding in the StatusAndHeaders class:

>>> from warcio.statusandheaders import StatusAndHeaders
>>> bad_header = '''Link: <https://www.albawaba.com/ar/node/1299230>; rel="shortlink", <https://www.addustour.com/articles/1089185-رونالدو-الأغلى-في-التاريخ-الصورة-بـ875-ألف-يورو?s=6226fa042a39b111646918198b4656d6>; rel="canonical"'''
>>> StatusAndHeaders(None, [(bad_header[:4], bad_header[6:])])
UnicodeEncodeError

I believe this is a perfectly reasonable header and I have seen examples of this bug on several occasions.

UnicodeEncodeError when using 'warcio recompress'

I was using the warcio recompress command line tool to fix some incorrect (not individually compressed per record) WARC files and stumbled onto a UnicodeEncodeError exception. I assume the reason for this bug is that the WARCs I used contain Cyrillic and Greek characters; however, I don't suppose that is the expected behavior.

The WARCs that I used can be found here. Specifically, kremlin.warc.gz and primeminister.warc.xz are the WARC files in question.

This is the exact error that I've gotten:

Exception Details:
Traceback (most recent call last):
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/statusandheaders.py", line 168, in to_ascii_bytes
    string = string.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 130-136: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/cli.py", line 105, in __call__
    count = self.load_and_write(stream, cmd.output)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/cli.py", line 145, in load_and_write
    writer.write_record(record)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/warcwriter.py", line 368, in write_record
    self._write_warc_record(self.out, record)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/warcwriter.py", line 248, in _write_warc_record
    self._set_header_buff(record)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/warcwriter.py", line 240, in _set_header_buff
    headers_buff = record.http_headers.to_ascii_bytes(self.header_filter)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/statusandheaders.py", line 172, in to_ascii_bytes
    string = string.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4517-4520: ordinal not in range(128)

odd and surprising things discovered while writing DNS records

I was adding dns records to my crawler and ran across a few odd things:

from io import BytesIO
from warcio.warcwriter import WARCWriter

payload = '''\
20170509000739
google.com. 10 IN A 172.217.6.78
google.com. 10 IN A 172.217.6.78
google.com. 10 IN A 172.217.6.78
'''

payload = payload.encode('utf-8')

with open('test_dns.warc', 'wb') as f:
    writer = WARCWriter(f, gzip=False)

    # oddness #1 -- programming error leads to negative Content-Length                                                                                         
    # (error is that dns things are 'resource' not 'response' according to WARC 1.0 standard)                                                                           
    # recommend: raising an exception for negative Content-Length                                                                                                      
    record = writer.create_warc_record('dns:www.google.com', 'response', payload=BytesIO(payload),
                                       warc_content_type='text/dns')
    writer.write_record(record)

    # surprise #2 -- if I don't specify length=, I get a length of 0.                                                                                          
    # recommend: this should just work                                                                                                                         
    record = writer.create_warc_record('dns:www.google.com', 'resource', payload=BytesIO(payload),
                                       warc_content_type='text/dns')
    writer.write_record(record)

    # specify length, this one looks OK                                                                                                                        
    record = writer.create_warc_record('dns:www.google.com', 'resource', payload=BytesIO(payload),
                                       warc_content_type='text/dns', length=len(payload))
    writer.write_record(record)

Do you think either is worth patching? Or am I using it wrong? Should it also have thrown an exception for oddness #1 because the library expected to find http headers in the payload and didn't?

Forcing the requests.get() to be in stream mode

I've been trying to create a WARC file with warcio by passing in a requests response object. However, when requests.get() is not in stream mode, warcio does not write the response content to the payload of the WARC record. Is there a way to solve this other than switching the request to stream mode?
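A sketch of one workaround for non-streaming responses: since resp.content has already been buffered, wrap it in BytesIO and rebuild the HTTP header block by hand (the helper below is hypothetical, and note that headers such as Content-Encoding or Transfer-Encoding may no longer match the decoded body):

from io import BytesIO

import requests
from warcio.statusandheaders import StatusAndHeaders

def record_from_response(writer, resp):
    # rebuild the HTTP header block from the requests response
    http_headers = StatusAndHeaders('%d %s' % (resp.status_code, resp.reason),
                                    list(resp.headers.items()),
                                    protocol='HTTP/1.1')
    # resp.content is fully buffered when stream=False, so it can
    # back the payload via BytesIO
    return writer.create_warc_record(resp.url, 'response',
                                     payload=BytesIO(resp.content),
                                     http_headers=http_headers)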

Add an option to split warc files?

When working with large WARC files, it is sometimes necessary to split a WARC file into chunks. Can we add a CLI option to split a WARC file into n chunks, or into chunks with n records each?
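Until such an option exists, here is a sketch of the records-per-chunk variant built on the existing reading and writing APIs (the output naming scheme is an assumption):

from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

def split_warc(path, records_per_chunk=1000):
    out = None
    with open(path, 'rb') as stream:
        for i, record in enumerate(ArchiveIterator(stream)):
            if i % records_per_chunk == 0:
                if out:
                    out.close()
                # start a new chunk, e.g. input.warc.gz.00000.warc.gz
                out = open('%s.%05d.warc.gz' % (path, i // records_per_chunk), 'wb')
                writer = WARCWriter(out, gzip=True)
            writer.write_record(record)
    if out:
        out.close()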

Include a title attribute to applicable warc records

Should we extract titles of applicable records (such as HTML pages) and make them available as an attribute? I can see some usefulness to this, but I understand that it would add some processing time. While the same can be done in applications using the warcio package, if its usefulness is widespread we might as well move the functionality into warcio itself.

Option to read the optional headers (languages-cld2, fetchTimeMs, charset-detected)

There seems to be no way of reading the optional headers that appear between the metadata and request headers.

fetchTimeMs: 1313
charset-detected: UTF-8
languages-cld2: {"reliable":true,"text-bytes":29834,"languages":[{"code":"de","code-iso-639-3":"deu","text-covered":0.99,"score":990.0,"name":"GERMAN"}]}

From CC-MAIN-20200216182139-20200216212139-00000.warc.gz
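In Common Crawl's WARCs these fields appear to be the payload of the metadata record (Content-Type: application/warc-fields), so one sketch for reading them is to parse that record's content stream (the field parsing below is an assumption):

from warcio.archiveiterator import ArchiveIterator

with open('CC-MAIN-20200216182139-20200216212139-00000.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'metadata':
            fields = record.content_stream().read().decode('utf-8')
            meta = dict(line.split(': ', 1)
                        for line in fields.splitlines() if ': ' in line)
            print(meta.get('fetchTimeMs'), meta.get('charset-detected'))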

Confusing documentation around request filter

The documented example of filtering out responses that don't have a 200 status fails with:

Traceback (most recent call last):
  File "/usr/lib/python3.7/http/client.py", line 457, in read
    n = self.readinto(b)
  File "/usr/lib/python3.7/http/client.py", line 509, in readinto
    self._close_conn()
  File "/usr/lib/python3.7/http/client.py", line 411, in _close_conn
    fp.close()
  File "/home/baali/projects/pyenv/lib/python3.7/site-packages/warcio/capture_http.py", line 65, in close
    self.recorder.done()
  File "/home/baali/projects/pyenv/lib/python3.7/site-packages/warcio/capture_http.py", line 185, in done
    request, response = self.filter_func(request, response, self)
  File "create_warc.py", line 43, in filter_records
    if response.http_headers.get_statuscode() != '200':
AttributeError: 'RequestRecorder' object has no attribute 'http_headers'

And indeed, the RequestRecorder object doesn't have an http_headers attribute.

WARC-Payload-Digest should only be written for HTTP records

The WARC/1.1 specification states that:

The WARC-Payload-Digest field may be used on WARC records with a well-defined payload and shall not be used on records without a well-defined payload. (Section 5.9)

While a payload can certainly be defined for other data as well, the spec only does so for HTTP (cf. #74). However, warcio writes a payload digest indiscriminately for any record that isn't a warcinfo or revisit. I'm writing resource records with a number of content types which don't have a payload in the HTTP sense, including application/x-python, application/octet-stream, and text/plain. Of course, in principle, one could also write request and response records for something other than HTTP (e.g. DNS queries), which may or may not have a "well-defined payload".

I think that warcio should only write the payload digest for records with an HTTP Content-Type header.

Add close_decompressor() to BufferedReader

The change introduced to fix #34 added a close() to ArchiveIterator and disabled closing of the underlying stream in BufferedReader. This was incorrect: instead, BufferedReader.close() should continue to close the underlying stream, with an option not to close it (a close_stream=False flag on the BufferedReader constructor). ArchiveIterator would then not close the underlying stream, as before.

Simpler fix: just add a separate close_decompressor() method to BufferedReader; ArchiveIterator calls close_decompressor(), not close(), and close() continues to behave as before, closing the stream.

warcio doesn't verify digests on read

I was experimenting with injecting digests from my crawler, so that digests aren't computed twice, and noticed that records with a bad WARC-Payload-Digest don't raise an exception on read. No code for it, so I suppose this is a feature request.

The check should be disable-able, and "warcio index" and recompress ought to have a command line flag to ignore digest errors.

Lacking this feature, I don't think that warcio currently has any test to ensure that it's correctly computing digests.

is 'latin-1' charset for warcinfo payload correct?

warcwriter:create_warcinfo_record does this to the payload lines:

        warcinfo.write(line.encode('latin-1'))

Is that correct? I looked through the 1.0 draft standard, and it appears to say that utf-8 can appear anywhere, with no mention of latin-1.

If latin-1 is correct, the call needs an errors= handler for the case where Python 3 users pass in strings that won't encode as latin-1.

In warc 1.0, uri was specified to always have "<" and ">"

regarding #42 ... on reading the WARC 1.1 spec my eye was drawn to this line:

NOTE: in WARC 1.0 standard (ISO 28500:2009), uri was defined as “<” <’URI’ per RFC 3986> “>”. This rule has been changed to meet requests from implementers.

And indeed the WARC 1.0 standard does specify that all uris have "<" and ">" around them. But in the examples, WARC-Target-URI does not have "<" and ">". (The examples are informative only.)

It appears that the only header affected by this specification bug is WARC-Target-URI. In WARC 1.1 both this field and the new-in-1.1 WARC-Refers-To-Target-URI are explicitly said to not have "<" ">" around the uri, while other uris explicitly do have the "<" ">" (e.g. WARC-Record-ID, WARC-Refers-To, ...)

No action needed, but, this does reinforce that #42 adding a workaround for wget is a good idea. Other tools might have chosen to do this after reading the 1.0 standard.

headers as bytes

I'm working on emitting headers that are as unprocessed as possible. aiohttp's response object has a raw_headers attribute that's a list of 2-tuples (good), and the values are bytes, not str. That ought to be good, right? But the warcio code appears to assume that headers are str (Python 3).

While I could certainly decode the headers before calling warcio, and the charset iso-8859-1 is safe for round-tripping like that, I'm wondering if taking str is a bad interface. You aren't doing RFC 5987 (https://tools.ietf.org/html/rfc5987) processing, for example, if non-ISO-8859-1 codepoints are in the str.

order of request/response pairs

Is there a reason why WARCWriter will output the response before the request when using write_request_response_pair? Wouldn't it be more natural to put the request first?

Very minor nitpick.

Using print to report a warning or error seems brittle

I've been attempting to use warcio in Hadoop streaming jobs. This went rather wrong, because Hadoop streaming mode uses stdin/stdout and warcio prints to stdout under certain error conditions:

else:
    print(str(e))
    return b''

This seems brittle/clumsy. Surely this should raise an exception if it's serious, and/or use the standard logging framework (as appropriate)?
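A sketch of the logging-based alternative (the helper and logger name are illustrative assumptions, not warcio code):

import logging

logger = logging.getLogger('warcio')

def read_or_empty(stream, size):
    # report failures via the logging framework instead of writing to stdout
    try:
        return stream.read(size)
    except Exception as e:
        logger.warning(str(e))
        return b''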

(Sadly, I still have not got to the bottom of why a WARC file that works fine when processed via warcio index managed to throw this error when processed via Hadoop, but that's a separate issue)

Do not allow writing records which content_stream() has been modified as it results in partial or empty content

Related to #64:

When writing a record into an archive, one can first read out its content via the streaming API and then write the record (now with partial or empty content) into the archive, resulting in silent loss of content. We clearly do not, or at least should not, want that: an exception or a warning should be raised instead of losing content silently.

from warcio.warcwriter import WARCWriter

filename = 'test.warc.gz'

# Write WARC INFO record with custom headers
output_file = open(filename, 'wb')
writer = WARCWriter(output_file, gzip=True, warc_version='WARC/1.1')

# Custom information and stuff
info_headers = {'custom': 'stuff'}
info_record = writer.create_warcinfo_record(filename, info_headers)

custom_headers_raw = info_record.content_stream().read(6)  # TODO: After this writing should not be allowed

writer.write_record(info_record)

output_file.close()

Result (notice the partial payload):

WARC/1.1
WARC-Type: warcinfo
WARC-Record-ID: <urn:uuid:67a981ea-fece-49b9-834a-e3b660042cf5>
WARC-Filename: test.warc.gz
WARC-Date: 2019-08-15T10:44:53.778034Z
Content-Type: application/warc-fields
Content-Length: 15

: stuff



read(write(record)) != record

Records read back from a file just written should be equal to the Python object written. This is something I discovered while writing tests for an application using warcio. Test case:

import pytest

from io import BytesIO
from tempfile import NamedTemporaryFile

from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

def test_identity ():
    """ read(write(record)) should yield record """
    with NamedTemporaryFile () as fd:
        payload = b'foobar'
        writer = WARCWriter (fd, gzip=True)
        httpHeaders = StatusAndHeaders('GET / HTTP/1.1', {}, is_http_request=True)
        warcHeaders = {'Foo': 'Bar'}
        record = writer.create_warc_record ('http://example.com/', 'request',
                payload=BytesIO(payload),
                warc_headers_dict=warcHeaders, http_headers=httpHeaders)
        writer.write_record (record)

        fd.seek (0)
        rut = next (ArchiveIterator (fd))
        golden = record
        assert rut.rec_type == golden.rec_type
        assert rut.rec_headers == golden.rec_headers
        assert rut.content_type == golden.content_type
        assert rut.length == golden.length
        assert rut.http_headers == golden.http_headers
        assert rut.raw_stream.read() == payload

results in the following assertion failure:

E           AssertionError: assert StatusAndHead...ngth', '24')]) == StatusAndHeade...ngth', '24')])
E             Full diff:
E             - StatusAndHeaders(protocol = 'WARC/1.0', statusline = '', headers = [('Foo', 'Bar'), ('WARC-Type', 'request'), ('WARC-Record-ID', '<urn:uuid:2eb39603-c759-4865-9e5b-2a3cd9c81c92>'), ('WARC-Target-URI', 'http://example.com/'), ('WARC-Date', '2018-12-04T15:27:45Z'), ('WARC-Payload-Digest', 'sha1:RBB5P6JECYQR32PLXFR76THCQESZGKDY'), ('WARC-Block-Digest', 'sha1:HVUJ5SESVATOLVXZZTFORJY44V5BW7YB'), ('Content-Type', 'application/http; msgtype=request'), ('Content-Length', '24')])
E             ?                                         ^^^^^^^^^^   --------------
E             + StatusAndHeaders(protocol = '', statusline = 'WARC/1.0', headers = [('Foo', 'Bar'), ('WARC-Type', 'request'), ('WARC-Record-ID', '<urn:uuid:2eb39603-c759-4865-9e5b-2a3cd9c81c92>'), ('WARC-Target-URI', 'http://example.com/'), ('WARC-Date', '2018-12-04T15:27:45Z'), ('WARC-Payload-Digest', 'sha1:RBB5P6JECYQR32PLXFR76THCQESZGKDY'), ('WARC-Block-Digest', 'sha1:HVUJ5SESVATOLVXZZTFORJY44V5BW7YB'), ('Content-Type', 'application/http; msgtype=request'), ('Content-Length', '24')])
E             ?                            +++++++++++++++++             ^^^^^^^

Using warcio with scrapy - what does the payload need to look like?

Hello,

I'd like to use the warcio library with scrapy and saw the other thread about it as well as the code.
The difference for me is that I've got some logic inside my spider where I'd like to create different WARC files, and until they reach a certain size they need to stay in memory before being written to disk.

I believe I've almost got it working, but I can't figure out how to build up the payload.
That's what I currently have:

def write_response_to_memory(self, response):
    '''Writes a `response` object from Scrapy as a Warc record. '''
    response_url = w3lib.url.safe_download_url(response.url)

    # Create the payload string
    payload_temp = io.StringIO()

    for h_name in response.headers:
        payload_temp.write('%s: %s\n' % (h_name, response.headers[h_name]))
    
    payload_temp.write('\r\n')
    payload_temp.write(response.text)

    headers = []
    headers.append(tuple(('WARC-Type', 'response')))
    headers.append(tuple(('WARC-Date', self.now_iso_format())))
    headers.append(tuple(('Content-Length', str(payload_temp.tell()))))
    headers.append(tuple(('Content-Type', str(response.headers.get('Content-Type', '')))))

    http_headers_temp = StatusAndHeaders('200 OK', headers, protocol='HTTP/1.0')
    
    record = self.warcdict['test'].create_warc_record(response_url, 'response',
                                payload=payload_temp.getvalue(),
                                http_headers=http_headers_temp)

Which results in:

File "/home/hpn/TestScripts/warc/scrapywarc/scrapywarc/spiders/scrapywarc.py", line 123, in write_response_to_memory
http_headers=http_headers_temp)
File "/home/hpn/live/ccd/lib/python3.7/site-packages/warcio/recordbuilder.py", line 117, in create_warc_record
self.ensure_digest(record, block=False, payload=True)
File "/home/hpn/live/ccd/lib/python3.7/site-packages/warcio/recordbuilder.py", line 191, in ensure_digest
for buf in self._iter_stream(record.raw_stream):
File "/home/hpn/live/ccd/lib/python3.7/site-packages/warcio/recordbuilder.py", line 218, in _iter_stream
buf = stream.read(BUFF_SIZE)
AttributeError: 'str' object has no attribute 'read'

Is there any documentation of what the payload needs to look like?
Or is there an alternative way to pass my content and headers to create_warc_record?
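The payload must be a binary file-like object (something with a .read() method), and when http_headers are passed separately, the payload should contain only the body, not the header block. A sketch of a possible fix (the status line and the bytes-decoding of Scrapy's headers are assumptions):

import io

import w3lib.url
from warcio.statusandheaders import StatusAndHeaders

def build_record(self, response):
    response_url = w3lib.url.safe_download_url(response.url)

    # Scrapy stores header names and values as bytes; decode them for warcio
    headers = [(name.decode('latin-1'), value.decode('latin-1'))
               for name, values in response.headers.items()
               for value in values]
    http_headers = StatusAndHeaders('%d OK' % response.status, headers,
                                    protocol='HTTP/1.1')

    # the payload is just the body bytes, wrapped in a binary stream
    return self.warcdict['test'].create_warc_record(
        response_url, 'response',
        payload=io.BytesIO(response.body),
        http_headers=http_headers)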

Provide API for parsed warcinfo payload in conjunction with the raw form

Related to #64:
When one creates a warcinfo record, custom headers must be provided in a dictionary-like format:

info_headers = {'custom': 'stuff'}
info_record = writer.create_warcinfo_record(filename, info_headers)

But when reading a warcinfo record, one must parse the payload oneself:

info_rec = next(archive_it)
assert info_rec.rec_type == 'warcinfo'
custom_headers_raw = info_rec.content_stream().read()
info_rec_payload = dict(r.split(': ', maxsplit=1) for r in custom_headers_raw.decode('UTF-8')
                        .strip().split('\r\n') if len(r) > 0)

There should be an API like the following:

info_headers = {'custom': 'stuff'}
info_record = writer.create_warcinfo_record(filename, info_headers)

"""...write out and open the archive for reading..."""

info_rec = next(archive_it)
assert info_rec.rec_type == 'warcinfo'
custom_headers = info_rec.info_content_dict()

assert info_headers == custom_headers

ARCHeadersParser splits on space, cause errors with spaces in uri's

We were using the cdxj-indexer to re-index our W/ARCs and ran across this error. In our older ARCs there are URIs that contain spaces, and the cdxj-indexer was failing on these ARCs.

warcio.recordloader.ArchiveLoadFailed: Unknown archive format, first line: ['http://www.melforsenate.org/index.cfm?FuseAction=Home.Email&EmailTitle=Martinez', 'Targets', 'Castor', 'Backer&EmailURL=http%3A%2F%2Fwww%2Emelforsenate%2Eorg%2Findex%2Ecfm%3FFuseaction%3DArticles%2EView%26Article%5Fid%3D187&IsPopUp=True', '65.36.164.67', '20040925001013', 'text/html', '149']

Probable memory leak in ArchiveIterator

In ArchiveIterator, chunks of data (16384 bytes) are decompressed using decompressor = zlib.decompressobj(); however, once decompression is done, decompressor.flush() is never called, which leaks memory when reading large files or a large number of files in Python 2.7. Please look into it, @ikreymer.

No block digest written for warcinfo records

I noticed today that warcio doesn't generate a block digest for warcinfo records:

NO_BLOCK_DIGEST_TYPES = ('warcinfo')

This seems to have been introduced in a791617, but I was unable to figure out why. The spec permits block digests on any record (whereas a payload digest would make no sense on a warcinfo record, given the content type normally used), and it seems good practice to me to always store a digest to allow for integrity checks.

Warc tester

I built a thing that tests a WARC for standards conformance. The CLI is similar to "warcio check". It's 440 lines of code so far, and likely to be around 1,000 when done.

It will need an extended testing and tweaking period while it's tested against everything in the ecosystem that generates warcs. Discussion might be ... vigorous. I'm currently labeling things as "not standard conforming", "following/not following recommendations", and "comments". Hopefully not too many hairs will be split.

Does this belong in warcio? My hope is that it will be commonly used; with luck that means that the entire web archiving ecosystem will keep warcio installed and part of their testing processes.

Incorrect WARC-Profile for revisit records when using WARC/1.1

warcio writes a WARC-Profile header value of http://netpreserve.org/warc/1.0/revisit/identical-payload-digest on revisit records regardless of the WARC version used. For 1.1, it should be http://netpreserve.org/warc/1.1/revisit/identical-payload-digest instead (section 6.7.2 in the WARC/1.1 specification).

Unfortunately, it isn't even possible to override this through warc_headers_dict because you then end up with two headers. This works instead after creating the record:

record.rec_headers.replace_header('WARC-Profile', 'http://netpreserve.org/warc/1.1/revisit/identical-payload-digest')

ArchiveIterator is adding bytes to payload HTTP header without updating Content-length

While using ArchiveIterator to filter some WARC records, I noticed that it adds some bytes to the HTTP header block without updating the Content-Length (record.length) in memory. Given this code:

with open(warcfilebasename + ".warc", 'rb') as f_in:
    with open(warcfilebasename + ".warc.gz", 'wb') as f_out:
        writer = WARCWriter(f_out, gzip=True)
        try:
            for record in ArchiveIterator(f_in):
                if record.http_headers:
                    if record.http_headers.get_header('Transfer-Encoding') == "chunked":
                        continue
                    try:
                        record.http_headers.to_ascii_bytes()
                    except UnicodeEncodeError:
                        # if header is non-ascii, create a new header with status code only;
                        # content length and content type will be filled before writing
                        record.http_headers = StatusAndHeaders(record.http_headers.get_statuscode(), [])
                writer.write_record(record)
        except:
            pass

and this input WARC generated by wget:
awt.zip

It generates this WARC, with a wrong Content-Length in record 'urn:uuid:47ef9267-a4cc-47fa-a1b2-ddc6e746216d':
awt.warc.gz

This WARC crashes warcio index awt.warc.gz; the previous one doesn't.

If I add record.length = None before writer.write_record(record) in my code, warcio recalculates the increased WARC Content-Length before writing the output, and warcio index then works.

The question is: why does ArchiveIterator add content to the HTTP header block when reading? And if that is necessary, why doesn't it update the Content-Length?

Is this related to #57 ?

error checking around record creation?

Given the whitespace-related header bug that crept into the August 2018 Common Crawl crawl, it would be nice if it were somewhat difficult to create broken WARC files using warcio.

I see a couple of possible issues:

  • The programmer could pass in http_headers that have trailing CR or LF
  • The programmer could pass in warc_headers or warc_headers_dict that have trailing CR or LF
  • These bad things could happen in warcwriter.py in create_warc_record()
  • These bad things could happen in recordloader.py in the ArcWarcRecord constructor
