
Streaming WARC/ARC library for fast web archive IO

Home Page: https://pypi.python.org/pypi/warcio

License: Apache License 2.0


warcio's Introduction

WARCIO: WARC (and ARC) Streaming Library


Background

This library provides a fast, standalone way to read and write the WARC format commonly used in web archives. It supports Python 2.7+ and Python 3.4+ (six is the only external dependency).

warcio supports reading and writing of WARC files compliant with both the WARC 1.0 and WARC 1.1 ISO standards.

Install with: pip install warcio

This library is a spin-off of the WARC reading and writing component of the pywb high-fidelity replay library, a key component of Webrecorder.

The library is designed for fast, low-level access to web archival content, oriented around a stream of WARC records rather than files.

Reading WARC Records

A key feature of the library is the ability to iterate over a stream of WARC records using the ArchiveIterator.

It includes the following features:

  • Reading a WARC 1.0, WARC 1.1 or ARC stream
  • On the fly ARC to WARC record conversion
  • Decompressing and de-chunking HTTP payload content stored in WARC/ARC files.

For example, the following prints the URL for each WARC response record:

from warcio.archiveiterator import ArchiveIterator

with open('path/to/file', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            print(record.rec_headers.get_header('WARC-Target-URI'))

The stream object could be a file on disk or a remote network stream. The ArchiveIterator reads the WARC content in a single pass. The record is represented by an ArcWarcRecord object which contains the format (ARC or WARC), record type, the record headers, http headers (if any), and raw stream for reading the payload.

class ArcWarcRecord(object):
    def __init__(self, *args):
        (self.format, self.rec_type, self.rec_headers, self.raw_stream,
         self.http_headers, self.content_type, self.length) = args

Reading WARC Content

The raw_stream can be used to read the rest of the payload directly. A special ArcWarcRecord.content_stream() function provides a stream that automatically decompresses and de-chunks the HTTP payload, if it is compressed and/or transfer-encoding chunked.
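For instance, a minimal sketch (the file path is a placeholder) that reads the decoded payload of each response record:

from warcio.archiveiterator import ArchiveIterator

with open('path/to/file.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            # content_stream() decodes any Content-Encoding and
            # de-chunks any Transfer-Encoding; raw_stream would
            # return the payload exactly as stored
            body = record.content_stream().read()
            print(len(body))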

ARC Files

The library provides support for reading (but not writing) ARC files. The ARC format is legacy, but it is important to support it in a consistent manner. The ArchiveIterator can equally iterate over ARC and WARC files to emit ArcWarcRecord objects. The special arc2warc option converts ARC records to WARC records on the fly, allowing them to be accessed using the same API.

(Special WARCIterator and ARCIterator subclasses of ArchiveIterator are also available to read only WARC or only ARC files).
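For example, a short sketch (the file name is a placeholder) that parses a stream strictly as WARC:

from warcio.archiveiterator import WARCIterator

with open('example.warc.gz', 'rb') as stream:
    # WARCIterator parses the stream as WARC only;
    # use ARCIterator for ARC-only input
    for record in WARCIterator(stream):
        print(record.rec_type)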

WARC and ARC Streaming

For example, here is a snippet for reading an ARC and a WARC using the same API.

The example streams a WARC and ARC file over HTTP using requests, printing the warcinfo record (or ARC header) and any response records (or all ARC records) that contain HTML:

import requests
from warcio.archiveiterator import ArchiveIterator

def print_records(url):
    resp = requests.get(url, stream=True)

    for record in ArchiveIterator(resp.raw, arc2warc=True):
        if record.rec_type == 'warcinfo':
            print(record.raw_stream.read())

        elif record.rec_type == 'response':
            if record.http_headers.get_header('Content-Type') == 'text/html':
                print(record.rec_headers.get_header('WARC-Target-URI'))
                print(record.content_stream().read())
                print('')

# WARC
print_records('https://archive.org/download/ExampleArcAndWarcFiles/IAH-20080430204825-00000-blackbook.warc.gz')


# ARC with arc2warc
print_records('https://archive.org/download/ExampleArcAndWarcFiles/IAH-20080430204825-00000-blackbook.arc.gz')

Writing WARC Records

Starting with 1.6, warcio introduces a way to capture HTTP/S traffic directly to a WARC file, by monkey-patching Python's http.client library.

This approach works well with the popular requests library often used to fetch HTTP/S content. Note that requests must be imported after the capture_http module.

Quick Start to Writing a WARC

Fetching the url https://example.com/ while capturing the response and request into a gzip compressed WARC file named example.warc.gz can be done with the following four lines:

from warcio.capture_http import capture_http
import requests  # requests must be imported after capture_http

with capture_http('example.warc.gz'):
    requests.get('https://example.com/')

The WARC example.warc.gz will contain two records (the response is written first, then the request).

To write to a default in-memory buffer (BufferWARCWriter), don't specify a filename and use with capture_http() as writer: instead.

Additional requests made within the capture_http context will be appended to the WARC as expected.

The WARC-IP-Address header will also be added for each record if the IP address is available.

The following example (similar to a unit test from the test suite) demonstrates the resulting records created with capture_http:

with capture_http() as writer:
    requests.get('http://example.com/')
    requests.get('https://google.com/')

expected = [('http://example.com/', 'response', True),
            ('http://example.com/', 'request', True),
            ('https://google.com/', 'response', True),
            ('https://google.com/', 'request', True),
            ('https://www.google.com/', 'response', True),
            ('https://www.google.com/', 'request', True)
           ]

actual = [
    (record.rec_headers['WARC-Target-URI'],
     record.rec_type,
     'WARC-IP-Address' in record.rec_headers)

    for record in ArchiveIterator(writer.get_stream())
]

assert actual == expected

Customizing WARC Writing

The library provides a simple and extensible interface for writing standards-compliant WARC files.

The library comes with a basic WARCWriter class for writing to a single WARC file and BufferWARCWriter for writing to an in-memory buffer. The BaseWARCWriter can be extended to support more complex operations.

(There is no support for writing legacy ARC files)

For more flexibility, such as to use a custom WARCWriter class, the above example can be written as:

from warcio.capture_http import capture_http
from warcio import WARCWriter
import requests  # requests *must* be imported after capture_http

with open('example.warc.gz', 'wb') as fh:
    warc_writer = WARCWriter(fh)
    with capture_http(warc_writer):
        requests.get('https://example.com/')

WARC/1.1 Support

By default, warcio creates WARC 1.0 records for maximum compatibility with existing tools. To create WARC/1.1 records, simply specify the warc version as follows:

with capture_http('example.warc.gz', warc_version='1.1'):
    ...
WARCWriter(fh, warc_version='1.1')
...

When using WARC 1.1, the main difference is that the WARC-Date timestamp header will be written with microsecond precision, while WARC 1.0 only supports second precision.

WARC 1.0:

WARC/1.0
...
WARC-Date: 2018-12-26T10:11:12Z

WARC 1.1:

WARC/1.1
...
WARC-Date: 2018-12-26T10:11:12.456789Z

Filtering HTTP Capture

When capturing via HTTP, it is possible to provide a custom filter function, which is used to determine whether a particular pair of request and response records should be written to the WARC file or skipped.

The filter function is called with the request and response records before they are written, and can be used to substitute different records (for example, a revisit instead of a response), or to skip writing altogether by returning None, None, as shown below:

def filter_records(request, response, request_recorder):
    # return None, None to indicate records should be skipped
    if response.http_headers.get_statuscode() != '200':
        return None, None

    # the response record can be replaced with a revisit record
    elif check_for_dedup():
        response = create_revisit_record(...)

    return request, response

with capture_http('example.warc.gz', filter_records):
    requests.get('https://example.com/')

Please refer to test/test_capture_http.py for additional examples of capturing requests traffic to WARC.

Manual/Advanced WARC Writing

Before 1.6, this was the primary method for fetching a URL and then writing it to a WARC. This process is a bit more verbose, but it provides full control over WARC creation and avoids monkey-patching.

The following example loads http://example.com/, creates a WARC response record, and writes it, gzip compressed, to example.warc.gz. The block and payload digests are computed automatically.

from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

import requests

with open('example.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)

    resp = requests.get('http://example.com/',
                        headers={'Accept-Encoding': 'identity'},
                        stream=True)

    # get raw headers from urllib3
    headers_list = resp.raw.headers.items()

    http_headers = StatusAndHeaders('200 OK', headers_list, protocol='HTTP/1.0')

    record = writer.create_warc_record('http://example.com/', 'response',
                                        payload=resp.raw,
                                        http_headers=http_headers)

    writer.write_record(record)

The library also includes additional semantics for:
  • Creating warcinfo and revisit records
  • Writing response and request records together
  • Writing custom WARC records
  • Reading a full WARC record from a stream
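As a brief illustration of the first item, here is a minimal sketch of creating and writing a warcinfo record (the filename and info fields are illustrative only):

from warcio.warcwriter import BufferWARCWriter

writer = BufferWARCWriter(gzip=True)

info = writer.create_warcinfo_record('example.warc.gz',
                                     {'software': 'warcio example'})
writer.write_record(info)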

Please refer to warcwriter.py and test/test_writer.py for additional examples.

WARCIO CLI: Indexing and Recompression

The library currently ships with a few simple command line tools.

Index

The warcio index command will print a simple index of the records in the WARC file as newline-delimited JSON (NDJSON).

WARC header fields to include in the index can be specified via the -f flag, and are included in the JSON block (in order, for convenience).

warcio index ./test/data/example-iana.org-chunked.warc -f warc-type,warc-target-uri,content-length
{"warc-type": "warcinfo", "content-length": "137"}
{"warc-type": "response", "warc-target-uri": "http://www.iana.org/", "content-length": "7566"}
{"warc-type": "request", "warc-target-uri": "http://www.iana.org/", "content-length": "76"}

HTTP header fields can be included by prefixing them with http:. The special field offset refers to the record's byte offset within the WARC file.

warcio index ./test/data/example-iana.org-chunked.warc -f offset,content-type,http:content-type,warc-target-uri
{"offset": "0", "content-type": "application/warc-fields"}
{"offset": "405", "content-type": "application/http;msgtype=response", "http:content-type": "text/html; charset=UTF-8", "warc-target-uri": "http://www.iana.org/"}
{"offset": "8379", "content-type": "application/http;msgtype=request", "warc-target-uri": "http://www.iana.org/"}

(Note: this library does not produce CDX or CDXJ format indexes often associated with web archives. To create these indexes, please see the cdxj-indexer tool, which extends warcio indexing to provide this functionality.)

Check

The warcio check command will check the payload and block digests of WARC records, if possible. An exit value of 1 indicates a failure. warcio check -v will print verbose output for each record in the WARC file.
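For example, using the sample file from the indexing examples above:

warcio check -v ./test/data/example-iana.org-chunked.warc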

Recompress

The recompress command allows for re-compressing or normalizing WARC (or ARC) files to a record-compressed, gzipped WARC file.

Each WARC record is compressed individually and concatenated. This is the 'canonical' WARC storage format used by Webrecorder and other web archiving institutions, and usually stored with a .warc.gz extension.

It can be used to:

  • Compress an uncompressed WARC
  • Convert any ARC file to a compressed WARC
  • Fix an improperly compressed WARC file (e.g. a WARC compressed entirely instead of by record)

warcio recompress ./input.arc.gz ./output.warc.gz

Extract

The extract command provides a way to extract the WARC and HTTP headers and/or the payload of a WARC record to stdout. Given a WARC filename and an offset, extract prints the (decompressed) record at that offset to stdout.

Specifying --payload or --headers will output only the payload or only the WARC + HTTP headers (if any), respectively.

warcio extract [--payload | --headers] filename offset
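For example, using the offset of the response record shown in the index output above:

warcio extract --payload ./test/data/example-iana.org-chunked.warc 405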

License

warcio is licensed under the Apache 2.0 License and is part of the Webrecorder project.

See NOTICE and LICENSE for details.


warcio's Issues

AttributeError: 'brotli.Decompressor' object has no attribute 'unused_data'

Hello, I'm using warcio to read a WARC archive containing brotli-encoded HTTP responses, like so:

import sys

from warcio.archiveiterator import ArchiveIterator

with open(sys.argv[1], 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response':
            continue
        print(record.content_stream().read())

This gives me the following error, because there is no unused_data attribute in brotli.

Traceback (most recent call last):
  File "load_warc.py", line 22, in <module>
    print(record.content_stream().read())
  File "/home/sebastian/.asdf/installs/python/3.7.2/lib/python3.7/site-packages/warcio/recordloader.py", line 34, in content_stream
    return ChunkedDataReader(self.raw_stream, decomp_type=encoding)
  File "/home/sebastian/.asdf/installs/python/3.7.2/lib/python3.7/site-packages/warcio/bufferedreaders.py", line 284, in __init__
    super(ChunkedDataReader, self).__init__(stream, **kwargs)
  File "/home/sebastian/.asdf/installs/python/3.7.2/lib/python3.7/site-packages/warcio/bufferedreaders.py", line 72, in __init__
    self._init_decomp(decomp_type)
  File "/home/sebastian/.asdf/installs/python/3.7.2/lib/python3.7/site-packages/warcio/bufferedreaders.py", line 89, in _init_decomp
    self.decompressor = self.DECOMPRESSORS[decomp_type.lower()]()
  File "/home/sebastian/.asdf/installs/python/3.7.2/lib/python3.7/site-packages/warcio/bufferedreaders.py", line 31, in brotli_decompressor
    decomp.unused_data = None
AttributeError: 'brotli.Decompressor' object has no attribute 'unused_data'

What am I missing?

Using print to report a warning or error seems brittle

I've been attempting to use warcio in Hadoop streaming jobs. This went rather wrong because Hadoop streaming mode uses stdin/stdout and warcio prints to stdout under certain error conditions:

else:
    print(str(e))
    return b''

This seems brittle/clumsy. Surely this should raise an exception if it's serious, and/or use the standard logging framework (as appropriate)?

(Sadly, I still have not got to the bottom of why a WARC file that works fine when processed via warcio index managed to throw this error when processed via Hadoop, but that's a separate issue)

Multiple cookies are problematic when parsing WARC files

When reading a WARC file that contains a 'Set-Cookie' header with multiple cookies on subsequent lines, the parsing logic breaks each line on the first colon. That is fine for headers, but when a line is actually a continuation of cookies from the previous line, the cookies are incorrectly added to the http_headers property.

Having played with the warcio code here (https://github.com/webrecorder/warcio/blob/master/warcio/statusandheaders.py#L262) and adding the following:

while line:
    if line.startswith('Set-Cookie:'):
        print('Testing Set-Cookie line -> ', line)

I can see that the cookies on subsequent lines are not picked up, which is expected of course, but I always like to test my hypothesis before just assuming.

Here are the HTTP headers that I have in a WARC file:

HTTP/1.1 200 OK 
Cache-Control: private
Content-Length: 25858
Content-Type: text/html; charset=utf-8
Vary: Accept-Encoding
Server: Microsoft-IIS/10.0
Set-Cookie: ASP.NET_SessionId=xxx; path=/; secure; HttpOnly
COOKIE_A=xxx|False; domain=xxx; expires=Tue, 03-Oct-2028 23:42:12 GMT; path=/; secure; HttpOnly
COOKIE_B=xxx;Path=/;HttpOnly;Domain=xxx
X-Frame-Options: SAMEORIGIN
Date: Sat, 06 Oct 2018 23:42:11 GMT

This is the code I used to access the headers:

>>> for header in a.items[0].record.http_headers.headers:
...     print(header)
...
('Cache-Control', 'private')
('Content-Length', '25858')
('Content-Type', 'text/html; charset=utf-8')
('Vary', 'Accept-Encoding')
('Server', 'Microsoft-IIS/10.0')
('Set-Cookie', 'ASP.NET_SessionId=xxx; path=/; secure; HttpOnly')
('COOKIE_A=xxx|False; domain=xxx; expires=Tue, 03-Oct-2028 23', '42:12 GMT; path=/; secure; HttpOnly')
('X-Frame-Options', 'SAMEORIGIN')
('Date', 'Sat, 06 Oct 2018 23:42:11 GMT')

I'm looking into how pywb 0.33 did this before this was extracted to see if there's a difference in behavior.

Add an iterator of HTTP exchanges

I can see many use cases where it would be useful to iterate over the WARC records and yield related HTTP request and response records together as a tuple. I understand that WARC does not guarantee the presence of the pair in the same file or in any specific order, but in a typical archival collection we might find them close enough. This iterator could be based on best-effort matching.

from warcio.archiveiterator import ArchiveIterator

def iter_exchanges(stream):
    half_exchanges = {}

    for record in ArchiveIterator(stream):
        # Filter any non-HTTP/HTTPS records out
        uri = record.rec_headers.get_header('WARC-Target-URI', '')
        if uri.startswith(('http:', 'https:')):
            id = None
            if record.rec_type == 'request':
                id = record.rec_headers.get_header('WARC-Concurrent-To')
            elif record.rec_type == 'response':
                id = record.rec_headers.get_header('WARC-Record-ID')

            if id:
                if id not in half_exchanges:
                    half_exchanges[id] = record
                else:
                    if record.rec_type == 'request':
                        req = record
                        res = half_exchanges[id]
                    else:
                        req = half_exchanges[id]
                        res = record
                    # Remove temporary record that is paired and yield the pair
                    del half_exchanges[id]
                    yield (req, res)

The above code is one possible way to implement it, in which we keep track of unpaired records in a dictionary keyed by the identifier that glues corresponding request and response records together. Once a pair is matched, we delete the bookkeeping entry and yield the pair. This rudimentary approach, however, should be taken with a pinch of salt, as there is a potential memory leak for records that are never paired. For WARCs with chaotic ordering, the memory requirement may grow due to the growth of bookkeeping, but related WARC records are generally placed in close proximity, in which case the bookkeeping should stay small.

Also, this code does not talk about dealing with revisit records for now, but something should be done for those too.

In warc 1.0, uri was specified to always have "<" and ">"

regarding #42 ... on reading the WARC 1.1 spec my eye was drawn to this line:

NOTE: in WARC 1.0 standard (ISO 28500:2009), uri was defined as “<” <’URI’ per RFC 3986> “>”. This rule has been changed to meet requests from implementers.

And indeed the WARC 1.0 standard does specify that all uris have "<" and ">" around them. But in the examples, WARC-Target-URI does not have "<" and ">". (The examples are informative only.)

It appears that the only header affected by this specification bug is WARC-Target-URI. In WARC 1.1 both this field and the new-in-1.1 WARC-Refers-To-Target-URI are explicitly said to not have "<" ">" around the uri, while other uris explicitly do have the "<" ">" (e.g. WARC-Record-ID, WARC-Refers-To, ...)

No action needed, but, this does reinforce that #42 adding a workaround for wget is a good idea. Other tools might have chosen to do this after reading the 1.0 standard.

Ensure http headers added automatically only if explicitly requested.

Currently, on load, the http_headers block is always set automatically to default HTTP headers when parsing records. This is incorrect for non-HTTP WARC records.

Instead, by default, only add HTTP headers for response, request, and revisit records if length > 0; otherwise set http_headers=None.

Sometimes it is useful to auto-generate the http headers for other record types, for example, for replay. This can now be enabled with a new ensure_http_headers=True flag, which will auto-create http headers suitable for replay, with status 200 and content type and content length set.
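A short sketch of how this flag might be used when reading (a minimal example, assuming the flag is exposed on ArchiveIterator as described above):

from warcio.archiveiterator import ArchiveIterator

with open('example.warc.gz', 'rb') as stream:
    # ensure_http_headers=True auto-creates replay-ready HTTP headers
    # (status 200, content type and length) for records lacking them
    for record in ArchiveIterator(stream, ensure_http_headers=True):
        print(record.rec_type, record.http_headers)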

Use scrapy together with warcio

I am interested in downloading a list of files through scrapy and saving them in a WARC file; is that possible? The documentation only shows how to do it with the requests module and a simple request, so if there is a way to do this, maybe it could also be added as a usage example in the documentation.

Using warcio with scrapy - what does the payload need to look like?

Hello,

I'd like to use the warcio library with scrapy and saw the other thread about it as well as the code.
The difference for me is that I've got some logic inside my spider where I'd like to create different WARC files, which need to stay in memory until they reach a certain size before being written to disk.

I believe I've almost got it working, but I can't figure out how to build up the payload.
That's what I currently have:

def write_response_to_memory(self, response):
    '''Writes a `response` object from Scrapy as a Warc record. '''
    response_url = w3lib.url.safe_download_url(response.url)

    # Create the payload string
    payload_temp = io.StringIO()

    for h_name in response.headers:
        payload_temp.write('%s: %s\n' % (h_name, response.headers[h_name]))
    
    payload_temp.write('\r\n')
    payload_temp.write(response.text)

    headers = []
    headers.append(tuple(('WARC-Type', 'response')))
    headers.append(tuple(('WARC-Date', self.now_iso_format())))
    headers.append(tuple(('Content-Length', str(payload_temp.tell()))))
    headers.append(tuple(('Content-Type', str(response.headers.get('Content-Type', '')))))

    http_headers_temp = StatusAndHeaders('200 OK', headers, protocol='HTTP/1.0')
    
    record = self.warcdict['test'].create_warc_record(response_url, 'response',
                                payload=payload_temp.getvalue(),
                                http_headers=http_headers_temp)

Which results into:

File "/home/hpn/TestScripts/warc/scrapywarc/scrapywarc/spiders/scrapywarc.py", line 123, in write_response_to_memory
http_headers=http_headers_temp)
File "/home/hpn/live/ccd/lib/python3.7/site-packages/warcio/recordbuilder.py", line 117, in create_warc_record
self.ensure_digest(record, block=False, payload=True)
File "/home/hpn/live/ccd/lib/python3.7/site-packages/warcio/recordbuilder.py", line 191, in ensure_digest
for buf in self._iter_stream(record.raw_stream):
File "/home/hpn/live/ccd/lib/python3.7/site-packages/warcio/recordbuilder.py", line 218, in _iter_stream
buf = stream.read(BUFF_SIZE)
AttributeError: 'str' object has no attribute 'read'

Is there any documentation of what the payload needs to look like?
Or is there an alternative way to add my content and header to create_warc_record?

headers as bytes

I'm working through trying to emit headers that are as unprocessed as possible. aiohttp's response object has a raw_headers variable that's a list of 2-tuples (good), and the values are bytes, not str. That ought to be good, right? But the warcio code appears to assume that the headers are str (Python 3).

While I could certainly decode the headers before calling warcio, and the charset iso-8859-1 is safe for round-tripping like that, I'm just wondering if taking str is a bad interface. You aren't doing https://tools.ietf.org/html/rfc5987 processing, for example, if non-ISO-8859-1 codepoints are in the str.

ARCHeadersParser splits on space, cause errors with spaces in uri's

We were using the cdxj-indexer to re-index our W/ARCs and ran across this error. In our older ARCs, there are URIs that contain spaces. The cdxj-indexer was failing on these ARCs.

warcio.recordloader.ArchiveLoadFailed: Unknown archive format, first line: ['http://www.melforsenate.org/index.cfm?FuseAction=Home.Email&EmailTitle=Martinez', 'Targets', 'Castor', 'Backer&EmailURL=http%3A%2F%2Fwww%2Emelforsenate%2Eorg%2Findex%2Ecfm%3FFuseaction%3DArticles%2EView%26Article%5Fid%3D187&IsPopUp=True', '65.36.164.67', '20040925001013', 'text/html', '149']

warcio doesn't verify digests on read

I was experimenting with injecting digests from my crawler, so that digests aren't computed twice, and noticed that records with a bad WARC-Payload-Digest don't raise an exception on read. No code for it, so I suppose this is a feature request.

The check should be disable-able, and "warcio index" and recompress ought to have a command line flag to ignore digest errors.

Lacking this feature, I don't think that warcio currently has any test to ensure that it's correctly computing digests.
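Pending such a feature, here is a rough user-side sketch of checking payload digests (assuming the sha1/base32 digests warcio writes by default, and hashing the raw payload bytes the way warcio currently does):

import base64
import hashlib

from warcio.archiveiterator import ArchiveIterator

with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        expected = record.rec_headers.get_header('WARC-Payload-Digest')
        if not expected or not expected.startswith('sha1:'):
            continue
        # raw_stream is positioned just past the HTTP headers here
        sha1 = hashlib.sha1(record.raw_stream.read())
        actual = 'sha1:' + base64.b32encode(sha1.digest()).decode('ascii')
        if actual != expected:
            print('digest mismatch:',
                  record.rec_headers.get_header('WARC-Record-ID'))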

record.content_stream().read() alters the record and causes a write out to fail

(Using code from #57)
Calling record.content_stream().read() before writing the record causes the record to be changed in such a way that the file it writes out is incorrect and mangled.

import pytest

from io import BytesIO
from tempfile import NamedTemporaryFile

from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

def test_identity_correct ():
    """ read(write(record)) should yield record """
    with NamedTemporaryFile () as fd:
        payload = b'foobar'
        writer = WARCWriter (fd, gzip=False)
        httpHeaders = StatusAndHeaders('GET / HTTP/1.1', {}, is_http_request=True)
        warcHeaders = {'Foo': 'Bar'}
        record = writer.create_warc_record ('http://example.com/', 'request',
                payload=BytesIO(payload),
                warc_headers_dict=warcHeaders, http_headers=httpHeaders)
        writer.write_record (record)

        fd.seek (0)
        rut = next (ArchiveIterator (fd))
        golden = record
        assert rut.rec_type == golden.rec_type
        assert rut.rec_headers == golden.rec_headers
        assert rut.content_type == golden.content_type
        assert rut.length == golden.length
        assert rut.http_headers == golden.http_headers
        assert rut.raw_stream.read() == payload

def test_identity_fail ():
    """ read(write(record)) should yield record """
    with NamedTemporaryFile () as fd:
        payload = b'foobar'
        writer = WARCWriter (fd, gzip=False)
        httpHeaders = StatusAndHeaders('GET / HTTP/1.1', {}, is_http_request=True)
        warcHeaders = {'Foo': 'Bar'}
        record = writer.create_warc_record ('http://example.com/', 'request',
                payload=BytesIO(payload),
                warc_headers_dict=warcHeaders, http_headers=httpHeaders)
        record.content_stream().read()
        writer.write_record (record)

        fd.seek (0)
        rut = next (ArchiveIterator (fd))
        golden = record
        assert rut.rec_type == golden.rec_type
        assert rut.rec_headers == golden.rec_headers
        assert rut.content_type == golden.content_type
        assert rut.length == golden.length
        assert rut.http_headers == golden.http_headers
        assert rut.raw_stream.read() == payload

test_identity_correct()
print("Write Worked")
test_identity_fail()
print("Write 2 Worked")

Output:

Write Worked
Traceback (most recent call last):
  File "./test2.py", line 57, in <module>
    test_identity_fail()
  File "./test2.py", line 53, in test_identity_fail
    assert rut.raw_stream.read() == payload
AssertionError

Add an option to split warc files?

When working with large WARC files, it is sometimes necessary to split a WARC file into chunks. Can we add a CLI option to split a WARC file into n chunks, or into chunks with n records each? (A user-level sketch follows.)
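In the meantime, a rough sketch of record-count splitting in user code (a hypothetical helper, not an existing warcio CLI option; the output naming scheme is made up):

from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

def split_warc(path, records_per_chunk=1000):
    # start a new gzipped output file every records_per_chunk records
    out = writer = None
    with open(path, 'rb') as stream:
        for i, record in enumerate(ArchiveIterator(stream)):
            if i % records_per_chunk == 0:
                if out:
                    out.close()
                out = open('%s.%05d.warc.gz' % (path, i // records_per_chunk), 'wb')
                writer = WARCWriter(out, gzip=True)
            writer.write_record(record)
    if out:
        out.close()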

non-streaming interface would be useful

Right now the only interface for getting at the record content is record.content_stream().read(), which is streaming. I can't do that twice. So if I'm passing a record around in a program and want to access the record content in multiple places, I've ended up wrapping warcio's record with a class that has a .content() method.

That seems odd. Other packages like Requests offer both streaming and non-streaming interfaces.

Obviously we'd want to preserve streaming behavior -- pure streaming code should continue to not buffer all of the content in memory. One way to do that would be to save all of the content in memory only if .content() is called before .content_stream().read(), and make calling .content() after calling content_stream().read() raise an exception.
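For reference, a minimal sketch of the wrapper approach described above (a hypothetical class, not part of warcio):

class CachedRecord:
    # wraps an ArcWarcRecord to provide repeatable content access
    def __init__(self, record):
        self.record = record
        self._content = None

    def content(self):
        # buffer the full payload in memory on first access
        if self._content is None:
            self._content = self.record.content_stream().read()
        return self._content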

Do not allow writing records whose content_stream() has been read, as it results in partial or empty content

Related to #64:

When writing a record into an archive, one can read out its content via the streaming API and then write the record (now with partial or empty content) into the archive, resulting in loss of content. We clearly do not, or at least should not, want that. An Exception or a Warning should be raised instead of losing content silently.

from warcio.warcwriter import WARCWriter

filename = 'test.warc.gz'

# Write WARC INFO record with custom headers
output_file = open(filename, 'wb')
writer = WARCWriter(output_file, gzip=True, warc_version='WARC/1.1')

# Custom information and stuff
info_headers = {'custom': 'stuff'}
info_record = writer.create_warcinfo_record(filename, info_headers)

custom_headers_raw = info_record.content_stream().read(6)  # TODO: After this writing should not be allowed

writer.write_record(info_record)

output_file.close()

Result (notice the partial payload):

WARC/1.1
WARC-Type: warcinfo
WARC-Record-ID: <urn:uuid:67a981ea-fece-49b9-834a-e3b660042cf5>
WARC-Filename: test.warc.gz
WARC-Date: 2019-08-15T10:44:53.778034Z
Content-Type: application/warc-fields
Content-Length: 15

: stuff



configuring warc capture

I'm excited to adopt warcio in a project but I'm stuck.

Following the warcio WARC write examples, the WARC files I create do not contain styles, images, fonts, or videos. However, the user experience on webrecorder.io does contain those elements.

I'm not sure if I should be passing args, kwargs, params or a filter_function to achieve my desired result but examining the tests and source has left me without a clue!

I'd be very grateful if you could give me a hint or point me towards some samples.

Warc tester

I built a thing that tests a warc for standards conformance. The cli is similar to "warcio check". It's 440 lines of code so far, likely to be around 1,000 when done.

It will need an extended testing and tweaking period while it's tested against everything in the ecosystem that generates warcs. Discussion might be ... vigorous. I'm currently labeling things as "not standard conforming", "following/not following recommendations", and "comments". Hopefully not too many hairs will be split.

Does this belong in warcio? My hope is that it will be commonly used; with luck that means that the entire web archiving ecosystem will keep warcio installed and part of their testing processes.

ArchiveIterator is adding bytes to payload HTTP header without updating Content-length

Using ArchiveIterator to filter some WARC records, I noticed that it adds some bytes to the HTTP header without updating the Content-length (record.length) in memory. Given this code:

with open(warcfilebasename + ".warc", 'rb') as f_in:
        with open(warcfilebasename + ".warc.gz", 'wb') as f_out:
            writer = WARCWriter(f_out, gzip=True)
            try:
                for record in ArchiveIterator(f_in):
                    if record.http_headers:
                        if record.http_headers.get_header('Transfer-Encoding') == "chunked":
                            continue
                        try:
                            record.http_headers.to_ascii_bytes()
                        except UnicodeEncodeError:
                            # if header is non ascii, create a new header, with status code only
                            # content length and content type will be filled before writing
                            record.http_headers = StatusAndHeaders(record.http_headers.get_statuscode(), [])
                    writer.write_record(record)
            except:
                pass

and this input WARC generated by wget:
awt.zip

It generates this WARC with wrong Content-length in the record 'urn:uuid:47ef9267-a4cc-47fa-a1b2-ddc6e746216d':
awt.warc.gz

This WARC crashes using warcio index awt.warc.gz, the previous one doesn't.

If I add record.length = None before writer.write_record(record) to my code, warcio recalculates the increased WARC content length before writing the output and then it works with warcio index.

The issue is, why does ArchiveIterator add content to the HTTP header when reading? And if that is necessary, why doesn't it update the content length?

Is this related to #57 ?

No block digest written for warcinfo records

I noticed today that warcio doesn't generate a block digest for warcinfo records:

NO_BLOCK_DIGEST_TYPES = ('warcinfo')

This seems to have been introduced in a791617, but I was unable to figure out why. The spec permits block digests on any record (whereas a payload digest would make no sense on a warcinfo record due to the content type used normally), and it seems good practice to me to always store a digest to allow for integrity checks.

Facing issue while custom writing without http_headers

>>> type(content)
<class 'bytes'>

>>> record = writer.create_warc_record("https://www.xxxxxx.html",record_type="response", payload=BytesIO(content))

>>> raise StatusAndHeadersParserException(msg, full_statusline)
warcio.statusandheaders.StatusAndHeadersParserException: Expected Status Line starting with ['HTTP/1.0', 'HTTP/1.1'] - Found:   <!DOCTYPE html>

error checking around record creation?

Given this whitespace-related header bug that crept into the August 2018 Common Crawl crawl, it would be nice if it were somewhat difficult to create broken WARC files using warcio.

I see a couple of possible issues:

  • The programmer could pass in http_headers that have trailing CR or LF
  • The programmer could pass in warc_headers or warc_headers_dict that have trailing CR or LF
  • These bad things could happen in warcwriter.py in create_warc_record()
  • These bad things could happen in recordloader.py in the ArcWarcRecord constructor

Support ZStd Compression for WARCs

ArchiveTeam has been using WARCs with ZStd compression (https://github.com/ArchiveTeam/wget-lua/releases/tag/v1.20.3-at.20200401.01), so it would be good for warcio to also support Zstd.

Support for ZStd could involve the following:

  1. Support for reading ZStd WARCs
  2. Support for writing ZStd WARCs with a passed-in dictionary (or a default dictionary)
  3. Training/creating a ZStd dictionary based on one or more WARCs

Item 1 is definitely needed to support interoperability and to be able to read other WARCs. Items 2 and 3 are a bit more experimental and will help warcio keep up with evolving compression options.

Confusing documentation around request filter

The example case of filtering out responses that don't have a 200 status fails with:

Traceback (most recent call last):
  File "/usr/lib/python3.7/http/client.py", line 457, in read
    n = self.readinto(b)
  File "/usr/lib/python3.7/http/client.py", line 509, in readinto
    self._close_conn()
  File "/usr/lib/python3.7/http/client.py", line 411, in _close_conn
    fp.close()
  File "/home/baali/projects/pyenv/lib/python3.7/site-packages/warcio/capture_http.py", line 65, in close
    self.recorder.done()
  File "/home/baali/projects/pyenv/lib/python3.7/site-packages/warcio/capture_http.py", line 185, in done
    request, response = self.filter_func(request, response, self)
  File "create_warc.py", line 43, in filter_records
    if response.http_headers.get_statuscode() != '200':
AttributeError: 'RequestRecorder' object has no attribute 'http_headers'

And indeed, the RequestRecorder object doesn't have an http_headers attribute.

WARC-Payload-Digest should only be written for HTTP records

The WARC/1.1 specification states that:

The WARC-Payload-Digest field may be used on WARC records with a well-defined payload and shall not be used on records without a well-defined payload. (Section 5.9)

While a payload can certainly be defined for other data as well, the spec only does so for HTTP (cf. #74). However, warcio writes a payload digest indiscriminately for any record that isn't a warcinfo or revisit. I'm writing resource records with a number of content types which don't have a payload in the HTTP sense, including application/x-python, application/octet-stream, and text/plain. Of course, in principle, one could also write request and response records for something else than HTTP (e.g. DNS queries) which may or may not have a "well-defined payload".

I think that warcio should only write the payload digest for records with an HTTP Content-Type header.

Error reading WAT files

When I try to use warcio to read WAT files generated by the archive-metadata-extractor tool, it gives me this error message:

    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: -97518
    Remainder: b'WARC/1.0\r\n'
Traceback (most recent call last):
  File "/home/.local/lib/python3.7/site-packages/warcio/recordloader.py", line 220, in _detect_type_load_headers
    rec_headers = self.warc_parser.parse(stream, statusline)
  File "/home/.local/lib/python3.7/site-packages/warcio/statusandheaders.py", line 264, in parse
    raise StatusAndHeadersParserException(msg, full_statusline)
warcio.statusandheaders.StatusAndHeadersParserException: Expected Status Line starting with ['WARC/1.1', 'WARC/1.0', 'WA
RC/0.17', 'WARC/0.18'] - Found: WARC-Type: metadata

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/.vscode/extensions/ms-python.python-2020.1.58038/pythonFiles/ptvsd_launcher.py", line 43, in <module>
    main(ptvsdArgs)
  File "/home/.vscode/extensions/ms-python.python-2020.1.58038/pythonFiles/lib/python/old_ptvsd/ptvsd/__main__.py", line 432, in main
    run()
  File "/home/.vscode/extensions/ms-python.python-2020.1.58038/pythonFiles/lib/python/old_ptvsd/ptvsd/__main__.py", line 316, in run_file
    runpy.run_path(target, run_name='__main__')
  File "/usr/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/warcio_reader.py", line 4, in <module>
    for record in ArchiveIterator(stream):
  File "/home/.local/lib/python3.7/site-packages/warcio/archiveiterator.py", line 110, in _iterate_records
    self.record = self._next_record(self.next_line)
  File "/home/.local/lib/python3.7/site-packages/warcio/archiveiterator.py", line 262, in _next_record
    self.check_digests)
  File "/home/.local/lib/python3.7/site-packages/warcio/recordloader.py", line 88, in parse_record_stream
    known_format))
  File "/home/.local/lib/python3.7/site-packages/warcio/recordloader.py", line 225, in _detect_type_load_headers
    raise ArchiveLoadFailed(msg + str(se.statusline))
warcio.exceptions.ArchiveLoadFailed: Invalid WARC record, first line: WARC-Type: metadata

This is the code snippet I used to read WAT files:

from warcio.archiveiterator import ArchiveIterator

with open('file.wat.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'metadata':
            print(record.rec_headers.get_header('WARC-Target-URI'))

Forcing the requests.get() to be in stream mode

I've been trying to create a WARC file using warcio by passing a response object; however, when requests.get() is not in stream mode, warcio does not write the response content to the payload of the WARC record. Is there a way to solve this, other than changing the request to stream mode?

Plans for adding type annotations?

Hi all,

Are type annotations on the roadmap at all? Would you take PRs for it? If so, would you prefer the comment-based, Python 2 compatible annotation scheme, or the syntax introduced by PEP 526 (which means dropping support for anything older than Py3.6)?

PS: It's always a great feeling when you know there's a (well maintained) Python library out there tailored exactly to your current needs. Thank you!

Probable memory leak in ArchiveIterator

In ArchiveIterator, chunks of data (16384 bytes) are decompressed using decompressor=zlib.decompressobj(); however, once decompression is done, decompressor.flush() is not called, which leaks memory when reading large files or a large number of files in Python 2.7. Please look into it @ikreymer

is 'latin-1' charset for warcinfo payload correct?

warcwriter:create_warcinfo_record does this to the payload lines:

        warcinfo.write(line.encode('latin-1'))

Is that correct? I looked through the 1.0 draft standard, and it appears to say that UTF-8 can appear anywhere, with no mention of latin-1.

If latin-1 is correct, it needs errors='something' in case python3 users send in stuff that won't encode latin-1.

Include a title attribute to applicable warc records

Should we extract the titles of applicable records (such as HTML pages) and make them available as an attribute? I can see some usefulness to this, but I understand that it would add some additional processing time. While the same can be done in applications using the warcio package, if its usefulness is widespread, we might as well move the functionality into warcio itself.

Add close_decompressor() to BufferedReader

The change introduced to fix #34 added a close() to ArchiveIterator and disabled the close on the underlying stream in BufferedReader. This was incorrect: instead, BufferedReader.close() should continue to close the underlying stream, with a new close_stream=False option on the BufferedReader constructor to opt out. ArchiveIterator will not close the underlying stream, as before.

Simpler fix: just add a separate close_decompressor() method to BufferedReader, and have ArchiveIterator call close_decompressor(), not close(). close() continues to behave as before, closing the stream.

Option to read the optional headers (languages-cld2, fetchTimeMs, charset-detected)

There seems to be no way of reading the optional headers that appear between the metadata and request headers.

fetchTimeMs: 1313
charset-detected: UTF-8
languages-cld2: {"reliable":true,"text-bytes":29834,"languages":[{"code":"de","code-iso-639-3":"deu","text-covered":0.99,"score":990.0,"name":"GERMAN"}]}

From CC-MAIN-20200216182139-20200216212139-00000.warc.gz

Incorrect WARC-Profile for revisit records when using WARC/1.1

warcio writes a WARC-Profile header value of http://netpreserve.org/warc/1.0/revisit/identical-payload-digest on revisit records regardless of the WARC version used. For 1.1, it should be http://netpreserve.org/warc/1.1/revisit/identical-payload-digest instead (section 6.7.2 in the WARC/1.1 specification).

Unfortunately, it isn't even possible to override this through warc_headers_dict because you then end up with two headers. This works instead after creating the record:

record.rec_headers.replace_header('WARC-Profile', 'http://netpreserve.org/warc/1.1/revisit/identical-payload-digest')

Provide API for parsed warcinfo payload in conjunction with the raw form

Related to #64:
When one creates a warcinfo record, it is required to provide the custom headers in a dictionary-like format:

info_headers = {'custom': 'stuff'}
info_record = writer.create_warcinfo_record(filename, info_headers)

But when reading a warcinfo record one must parse the dictionary oneself:

info_rec = next(archive_it)
assert info_rec.rec_type == 'warcinfo'
custom_headers_raw = info_rec.content_stream().read()
info_rec_payload = dict(r.split(': ', maxsplit=1) for r in custom_headers_raw.decode('UTF-8')
                        .strip().split('\r\n') if len(r) > 0)

There should be an API like the following:

info_headers = {'custom': 'stuff'}
info_record = writer.create_warcinfo_record(filename, info_headers)

"""...write out and open the archive for reading..."""

info_rec = next(archive_it)
assert info_rec.rec_type == 'warcinfo'
custom_headers = info_rec.info_content_dict()

assert info_headers == custom_headers

UnicodeEncodeError when using 'warcio recompress'

I was using the warcio recompress command line tool to fix some incorrectly compressed (not individually compressed by record) WARC files and stumbled onto a UnicodeEncodeError exception. I assume the reason for this bug is that the WARCs I used contain Cyrillic and Greek characters. However, I don't suppose that is the expected behavior.

The WARCs that I used can be found here. Specifically, kremlin.warc.gz and primeminister.warc.xz are the WARC files in question.

This is the exact error that I've gotten:

Exception Details:
Traceback (most recent call last):
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/statusandheaders.py", line 168, in to_ascii_bytes
    string = string.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 130-136: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/cli.py", line 105, in __call__
    count = self.load_and_write(stream, cmd.output)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/cli.py", line 145, in load_and_write
    writer.write_record(record)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/warcwriter.py", line 368, in write_record
    self._write_warc_record(self.out, record)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/warcwriter.py", line 248, in _write_warc_record
    self._set_header_buff(record)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/warcwriter.py", line 240, in _set_header_buff
    headers_buff = record.http_headers.to_ascii_bytes(self.header_filter)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/statusandheaders.py", line 172, in to_ascii_bytes
    string = string.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4517-4520: ordinal not in range(128)

Threadpool executor creates zero byte warc files

Using ThreadPoolExecutor to create WARC files creates files with zero bytes.
I have provided the test code below.

#!/usr/bin/env python3

from warcio.capture_http import capture_http
import requests
import concurrent.futures


def save_warc(url, ofile):
    with capture_http(ofile):
        requests.get(url)


with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        executor.submit(save_warc, "https://example.com", "com.warc.gz")
        executor.submit(save_warc, "https://example.org", "org.warc.gz")

UTF-8 characters in Link header parameters raises exception

Link headers can be supplied with extra parameters that currently are not correctly handled by the percent encoding in the StatusAndHeaders class:

>>> from warcio.statusandheaders import StatusAndHeaders
>>> bad_header = '''Link: <https://www.albawaba.com/ar/node/1299230>; rel="shortlink", <https://www.addustour.com/articles/1089185-رونالدو-الأغلى-في-التاريخ-الصورة-بـ875-ألف-يورو?s=6226fa042a39b111646918198b4656d6>; rel="canonical"'''
>>> StatusAndHeaders(None, [(bad_header[:4], bad_header[6:])])
UnicodeEncodeError

I believe this is a perfectly reasonable header and I have seen examples of this bug on several occasions.

Different encoding reading / writing headers?

Reading a WARC record (using ArchiveIterator) with a unicode character outside the range of iso-8859-1 in the HTTP headers is fine, but writing it again (using WARCWriter) gives the error

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 336: ordinal not in range(256).

This is the header line causing the problem in this case:

Content-disposition: attachment; filename="Lancement du Système d’Échange Local (SEL).pdf"

with the in d’Échange causing the problem.

This code probably parses the headers using utf-8:

    def decode_header(line):
        try:
            # attempt to decode as utf-8 first
            return to_native_str(line, 'utf-8')
        except:
            # if fails, default to ISO-8859-1
            return to_native_str(line, 'iso-8859-1')

These are the lines of code that write headers hardcoded in latin-1:

def _set_header_buff(self, record):
    headers_buff = record.http_headers.to_bytes(self.header_filter, 'iso-8859-1')
    record.http_headers.headers_buff = headers_buff

If headers are by default read in utf-8, wouldn't it make sense to write them as utf-8 as well?

order of request/response pairs

Is there a reason why WARCWriter will output the response before the request when using write_request_response_pair? Wouldn't it be more natural to put the request first?

Very minor nitpick.

odd and surprising things discovered while writing DNS records

I was adding dns records to my crawler and ran across a few odd things:

from io import BytesIO
from warcio.warcwriter import WARCWriter

payload = '''\
20170509000739
google.com. 10 IN A 172.217.6.78
google.com. 10 IN A 172.217.6.78
google.com. 10 IN A 172.217.6.78
'''

payload = payload.encode('utf-8')

with open('test_dns.warc', 'wb') as f:
    writer = WARCWriter(f, gzip=False)

    # oddness #1 -- programming error leads to negative Content-Length
    # (error is that dns things are 'resource' not 'response' according to the WARC 1.0 standard)
    # recommend: raising an exception for negative Content-Length
    record = writer.create_warc_record('dns:www.google.com', 'response', payload=BytesIO(payload),
                                       warc_content_type='text/dns')
    writer.write_record(record)

    # surprise #2 -- if I don't specify length=, I get a length of 0.
    # recommend: this should just work
    record = writer.create_warc_record('dns:www.google.com', 'resource', payload=BytesIO(payload),
                                       warc_content_type='text/dns')
    writer.write_record(record)

    # specify length, this one looks OK
    record = writer.create_warc_record('dns:www.google.com', 'resource', payload=BytesIO(payload),
                                       warc_content_type='text/dns', length=len(payload))
    writer.write_record(record)

Do you think either is worth patching? Or am I using it wrong? Should it also have thrown an exception for oddness #1 because the library expected to find http headers in the payload and didn't?

Incorrect WARC-Payload-Digest values when transfer encoding is present

Per WARC/1.0 spec section 5.9:

The payload of an application/http block is its ‘entity-body’ (per [RFC2616]).

The entity-body is the HTTP body without transfer encoding, per section 4.3 of RFC 2616. (In the newer RFC 723x family, it's called "payload body" instead and is defined in section 3.3 of RFC 7230.)

Just to be clear to avoid confusion: this is the definition of the payload; the WARC record should still contain the exact response sent by the server with transfer encoding intact. But when calculating the WARC-Payload-Digest, the transfer encoding must be stripped.

warcio (like many other tools) passes the response data directly into the payload digester without removing transfer encoding. This means that it produces an invalid WARC-Payload-Digest when the HTTP body is transfer-encoded.
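For illustration, a rough sketch of the intended calculation, de-chunking before hashing (this reuses warcio's internal ChunkedDataReader for the transfer decoding, so treat it as an assumption rather than a supported API):

import hashlib

from warcio.bufferedreaders import ChunkedDataReader

def payload_digest(raw_stream, chunked=True):
    # strip chunked transfer encoding, but keep any content
    # encoding intact, per the entity-body definition
    stream = ChunkedDataReader(raw_stream) if chunked else raw_stream
    sha1 = hashlib.sha1()
    for buf in iter(lambda: stream.read(8192), b''):
        sha1.update(buf)
    return sha1.hexdigest()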

read(write(record)) != record

Records read back from a file just written should be equal to the Python object written. This is something I discovered while writing tests for an application using warcio. Test case:

import pytest

from io import BytesIO
from tempfile import NamedTemporaryFile

from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

def test_identity ():
    """ read(write(record)) should yield record """
    with NamedTemporaryFile () as fd:
        payload = b'foobar'
        writer = WARCWriter (fd, gzip=True)
        httpHeaders = StatusAndHeaders('GET / HTTP/1.1', {}, is_http_request=True)
        warcHeaders = {'Foo': 'Bar'}
        record = writer.create_warc_record ('http://example.com/', 'request',
                payload=BytesIO(payload),
                warc_headers_dict=warcHeaders, http_headers=httpHeaders)
        writer.write_record (record)

        fd.seek (0)
        rut = next (ArchiveIterator (fd))
        golden = record
        assert rut.rec_type == golden.rec_type
        assert rut.rec_headers == golden.rec_headers
        assert rut.content_type == golden.content_type
        assert rut.length == golden.length
        assert rut.http_headers == golden.http_headers
        assert rut.raw_stream.read() == payload

results in the following assertion failure:

E           AssertionError: assert StatusAndHead...ngth', '24')]) == StatusAndHeade...ngth', '24')])
E             Full diff:
E             - StatusAndHeaders(protocol = 'WARC/1.0', statusline = '', headers = [('Foo', 'Bar'), ('WARC-Type', 'request'), ('WARC-Record-ID', '<urn:uuid:2eb39603-c759-4865-9e5b-2a3cd9c81c92>'), ('WARC-Target-URI', 'http://example.com/'), ('WARC-Date', '2018-12-04T15:27:45Z'), ('WARC-Payload-Digest', 'sha1:RBB5P6JECYQR32PLXFR76THCQESZGKDY'), ('WARC-Block-Digest', 'sha1:HVUJ5SESVATOLVXZZTFORJY44V5BW7YB'), ('Content-Type', 'application/http; msgtype=request'), ('Content-Length', '24')])
E             ?                                         ^^^^^^^^^^   --------------
E             + StatusAndHeaders(protocol = '', statusline = 'WARC/1.0', headers = [('Foo', 'Bar'), ('WARC-Type', 'request'), ('WARC-Record-ID', '<urn:uuid:2eb39603-c759-4865-9e5b-2a3cd9c81c92>'), ('WARC-Target-URI', 'http://example.com/'), ('WARC-Date', '2018-12-04T15:27:45Z'), ('WARC-Payload-Digest', 'sha1:RBB5P6JECYQR32PLXFR76THCQESZGKDY'), ('WARC-Block-Digest', 'sha1:HVUJ5SESVATOLVXZZTFORJY44V5BW7YB'), ('Content-Type', 'application/http; msgtype=request'), ('Content-Length', '24')])
E             ?                            +++++++++++++++++             ^^^^^^^
