webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
Home Page: https://pypi.python.org/pypi/warcio
License: Apache License 2.0
ArchiveTeam has been using WARCs with Zstandard (zstd) compression (https://github.com/ArchiveTeam/wget-lua/releases/tag/v1.20.3-at.20200401.01), so it would be good for warcio to also support zstd.
Support for zstd could involve the following:
Reading a WARC record (using `ArchiveIterator`) with a Unicode character outside the range of ISO-8859-1 in the HTTP headers is fine, but writing it again (using `WARCWriter`) gives the error:

```
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 336: ordinal not in range(256)
```

This is the header line causing the problem in this case:

```
Content-disposition: attachment; filename="Lancement du Système d’Échange Local (SEL).pdf"
```

with the `’` in `d’Échange` causing the problem.

This code probably parses the headers using `utf-8`:
```python
def decode_header(line):
    try:
        # attempt to decode as utf-8 first
        return to_native_str(line, 'utf-8')
    except:
        # if fails, default to ISO-8859-1
        return to_native_str(line, 'iso-8859-1')
```
These are the lines of code that write headers hardcoded in latin-1:

```python
def _set_header_buff(self, record):
    headers_buff = record.http_headers.to_bytes(self.header_filter, 'iso-8859-1')
    record.http_headers.headers_buff = headers_buff
```

If headers are by default read as `utf-8`, wouldn't it make sense to write them as `utf-8` as well?
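The asymmetry can be reproduced with plain Python string handling (a minimal sketch; the filename is made up):

```python
# A header value containing ’ (U+2019), which UTF-8 handles but Latin-1 cannot
value = 'attachment; filename="d\u2019\u00c9change.pdf"'

# Reading path: UTF-8 round-trips the value without trouble
assert value.encode('utf-8').decode('utf-8') == value

# Writing path: the hardcoded iso-8859-1 encode fails on U+2019
# (É, U+00C9, is in Latin-1; the curly apostrophe is not)
try:
    value.encode('iso-8859-1')
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```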
Using a ThreadPoolExecutor to create WARC files produces files with zero bytes. I have provided the test code below.
```python
#!/usr/bin/env python3
from warcio.capture_http import capture_http
import requests
import concurrent.futures

def save_warc(url, ofile):
    with capture_http(ofile):
        requests.get(url)

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    executor.submit(save_warc, "https://example.com", "com.warc.gz")
    executor.submit(save_warc, "https://example.org", "org.warc.gz")
```
Currently, on load the `http_headers` block is always set automatically to default HTTP headers when parsing records. This is incorrect for non-HTTP WARC records.
Instead, by default only add HTTP headers for `response`, `request`, and `revisit` records if length > 0, otherwise set `http_headers=None`.
Sometimes it is useful to auto-generate the HTTP headers for other record types, for example for replay. This can now be enabled with a new `ensure_http_headers=True` flag, which will auto-create HTTP headers suitable for replay, with status 200, and content type and content length set.
When reading a WARC file that contains a 'Set-Cookie' header with multiple cookies present on subsequent lines, the parsing logic breaks each line on the first colon. That is fine for headers, but when a line is actually a continuation of cookies from the previous line, the fragments are incorrectly added to the `http_headers` property.
Having played with the warcio code here (https://github.com/webrecorder/warcio/blob/master/warcio/statusandheaders.py#L262) and adding the following:

```python
while line:
    if line.startswith('Set-Cookie:'):
        print('Testing Set-Cookie line -> ', line)
```

I can see that the cookies on subsequent lines are not picked up, which is expected of course, but I always like to test my hypothesis before just assuming.
Here are the HTTP headers that I have in a WARC file:

```
HTTP/1.1 200 OK
Cache-Control: private
Content-Length: 25858
Content-Type: text/html; charset=utf-8
Vary: Accept-Encoding
Server: Microsoft-IIS/10.0
Set-Cookie: ASP.NET_SessionId=xxx; path=/; secure; HttpOnly
COOKIE_A=xxx|False; domain=xxx; expires=Tue, 03-Oct-2028 23:42:12 GMT; path=/; secure; HttpOnly
COOKIE_B=xxx;Path=/;HttpOnly;Domain=xxx
X-Frame-Options: SAMEORIGIN
Date: Sat, 06 Oct 2018 23:42:11 GMT
```
This is the code I used to access the headers:

```python
>>> for header in a.items[0].record.http_headers.headers:
...     print(header)
...
('Cache-Control', 'private')
('Content-Length', '25858')
('Content-Type', 'text/html; charset=utf-8')
('Vary', 'Accept-Encoding')
('Server', 'Microsoft-IIS/10.0')
('Set-Cookie', 'ASP.NET_SessionId=xxx; path=/; secure; HttpOnly')
('COOKIE_A=xxx|False; domain=xxx; expires=Tue, 03-Oct-2028 23', '42:12 GMT; path=/; secure; HttpOnly')
('X-Frame-Options', 'SAMEORIGIN')
('Date', 'Sat, 06 Oct 2018 23:42:11 GMT')
```
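The mangled `COOKIE_A` tuple is exactly what naive first-colon splitting produces; a minimal stdlib sketch of the failure mode:

```python
# A cookie continuation line from the headers above; splitting on the first
# colon mis-parses the expiry timestamp as a header name/value boundary
line = 'COOKIE_A=xxx|False; domain=xxx; expires=Tue, 03-Oct-2028 23:42:12 GMT; path=/; secure; HttpOnly'
name, value = line.split(':', 1)
print((name, value.strip()))
# -> ('COOKIE_A=xxx|False; domain=xxx; expires=Tue, 03-Oct-2028 23', '42:12 GMT; path=/; secure; HttpOnly')
```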
I'm looking into how pywb 0.33 did this before this was extracted to see if there's a difference in behavior.
(Using code from #57)
Calling `record.content_stream().read()` before writing the record causes the record to be changed in such a way that the file it writes out is incorrect and mangled.
```python
import pytest
from io import BytesIO
from tempfile import NamedTemporaryFile
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

def test_identity_correct():
    """ read(write(record)) should yield record """
    with NamedTemporaryFile() as fd:
        payload = b'foobar'
        writer = WARCWriter(fd, gzip=False)
        httpHeaders = StatusAndHeaders('GET / HTTP/1.1', {}, is_http_request=True)
        warcHeaders = {'Foo': 'Bar'}
        record = writer.create_warc_record('http://example.com/', 'request',
                                           payload=BytesIO(payload),
                                           warc_headers_dict=warcHeaders,
                                           http_headers=httpHeaders)
        writer.write_record(record)
        fd.seek(0)
        rut = next(ArchiveIterator(fd))
        golden = record
        assert rut.rec_type == golden.rec_type
        assert rut.rec_headers == golden.rec_headers
        assert rut.content_type == golden.content_type
        assert rut.length == golden.length
        assert rut.http_headers == golden.http_headers
        assert rut.raw_stream.read() == payload

def test_identity_fail():
    """ read(write(record)) should yield record """
    with NamedTemporaryFile() as fd:
        payload = b'foobar'
        writer = WARCWriter(fd, gzip=False)
        httpHeaders = StatusAndHeaders('GET / HTTP/1.1', {}, is_http_request=True)
        warcHeaders = {'Foo': 'Bar'}
        record = writer.create_warc_record('http://example.com/', 'request',
                                           payload=BytesIO(payload),
                                           warc_headers_dict=warcHeaders,
                                           http_headers=httpHeaders)
        record.content_stream().read()
        writer.write_record(record)
        fd.seek(0)
        rut = next(ArchiveIterator(fd))
        golden = record
        assert rut.rec_type == golden.rec_type
        assert rut.rec_headers == golden.rec_headers
        assert rut.content_type == golden.content_type
        assert rut.length == golden.length
        assert rut.http_headers == golden.http_headers
        assert rut.raw_stream.read() == payload

test_identity_correct()
print("Write Worked")
test_identity_fail()
print("Write 2 Worked")
```
Output:

```
Write Worked
Traceback (most recent call last):
  File "./test2.py", line 57, in <module>
    test_identity_fail()
  File "./test2.py", line 53, in test_identity_fail
    assert rut.raw_stream.read() == payload
AssertionError
```
I'm excited to adopt warcio in a project, but I'm stuck.
Following the warcio WARC write examples, the WARC files I create do not contain styles, images, fonts, or videos. However, the user experience on webrecorder.io does contain those elements.
I'm not sure if I should be passing args, kwargs, params, or a filter_function to achieve my desired result, but examining the tests and source has left me without a clue!
I'd be very grateful if you could give me a hint or point me towards some samples.
Per WARC/1.0 spec section 5.9:

> The payload of an application/http block is its ‘entity-body’ (per [RFC2616]).

The entity-body is the HTTP body without transfer encoding, per section 4.3 of RFC 2616. (In the newer RFC 723x family, it's called the "payload body" instead and is defined in section 3.3 of RFC 7230.)
To be clear and avoid confusion: this is the definition of the payload; the WARC record should still contain the exact response sent by the server with transfer encoding intact. But when calculating the WARC-Payload-Digest, the transfer encoding must be stripped.
warcio (like many other tools) passes the response data directly into the payload digester without removing transfer encoding. This means that it produces an invalid WARC-Payload-Digest when the HTTP body is transfer-encoded.
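A minimal sketch of what "stripping the transfer encoding" means for digest computation (hand-rolled chunked decoding with the stdlib only; this is not warcio's actual code path):

```python
import hashlib

def dechunk(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked transfer-encoded body into the entity-body."""
    out = b''
    pos = 0
    while True:
        crlf = body.index(b'\r\n', pos)
        size = int(body[pos:crlf].split(b';')[0], 16)  # chunk size, ignoring extensions
        if size == 0:
            return out
        out += body[crlf + 2:crlf + 2 + size]
        pos = crlf + 2 + size + 2  # skip chunk data plus its trailing CRLF

raw = b'3\r\nfoo\r\n3\r\nbar\r\n0\r\n\r\n'   # chunked body as sent on the wire
entity = dechunk(raw)                       # the entity-body, which should be digested
# Digesting the raw block instead of the entity-body yields a different value:
assert hashlib.sha1(entity).digest() != hashlib.sha1(raw).digest()
```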
Right now the only interface for getting at the record content is `record.content_stream().read()`, which is streaming. I can't do that twice. So if I'm passing a record around in a program and want to access the record content in multiple places, I've ended up wrapping warcio's record with a class that has a `.content()` method.
That seems odd. Other packages like Requests offer both streaming and non-streaming interfaces.
Obviously we'd want to preserve streaming behavior -- pure streaming code should continue to not buffer all of the content in memory. One way to do that would be to save all of the content in memory only if `.content()` is called before `.content_stream().read()`, and make calling `.content()` after `.content_stream().read()` raise an exception.
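The wrapper described above might look like this (a hypothetical sketch including the proposed exception-on-misuse behavior; `CachedRecord` is not part of warcio):

```python
class CachedRecord:
    """Hypothetical wrapper adding a non-streaming content() to a warcio record."""

    def __init__(self, record):
        self._record = record
        self._content = None
        self._streamed = False

    def content(self):
        if self._streamed:
            raise RuntimeError('content() called after content_stream() was consumed')
        if self._content is None:
            # Buffer once; subsequent calls return the cached bytes
            self._content = self._record.content_stream().read()
        return self._content

    def content_stream(self):
        self._streamed = True
        return self._record.content_stream()
```

Pure streaming callers keep the no-buffering behavior; only a `content()` call pays the memory cost.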
How can I retrieve a record based on its WARC-Target-URI once indexes are created on the WARC file?
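`warcio index` emits one JSON line per record; if the index includes `offset` (and `url`), one way to fetch a single record is to find the matching line and seek to that offset before iterating. A sketch (the index line, filename, and field names shown here are assumptions):

```python
import json

# A hypothetical index line with offset/url fields, as `warcio index` can emit
index_line = '{"offset": "405", "url": "http://example.com/", "warc-type": "response"}'
entry = json.loads(index_line)

if entry['url'] == 'http://example.com/':
    offset = int(entry['offset'])
    # With a real WARC file you would then read just that one record:
    # with open('archive.warc.gz', 'rb') as fh:
    #     fh.seek(offset)
    #     record = next(iter(ArchiveIterator(fh)))
```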
Hi all,
Are type annotations on the roadmap at all? Would you take PRs for them? If so, would you prefer the comment-based, Python 2 compatible annotation scheme, or the syntax introduced by PEP 526 (which means dropping support for anything older than Python 3.6)?
PS: It's always a great feeling when you know there's a (well maintained) Python library out there tailored exactly to your current needs. Thank you!
```python
>>> type(content)
<class 'bytes'>
>>> record = writer.create_warc_record("https://www.xxxxxx.html", record_type="response", payload=BytesIO(content))
    raise StatusAndHeadersParserException(msg, full_statusline)
warcio.statusandheaders.StatusAndHeadersParserException: Expected Status Line starting with ['HTTP/1.0', 'HTTP/1.1'] - Found: <!DOCTYPE html>
```
I can see many use cases where it would be useful to be able to iterate over the WARC records and yield related HTTP Request and Response records together as a tuple. I understand that WARC does not guarantee presence of the pair in the same file or in any specific order, but in a typical archival collection we might find them close enough. This iterator could be based on the best-effort attempts.
```python
def iter_http_exchanges(stream):
    half_exchanges = {}

    for record in ArchiveIterator(stream):
        # Filter any non-HTTP/HTTPS records out
        uri = record.rec_headers.get_header('WARC-Target-URI')
        if uri and uri.startswith(('http:', 'https:')):
            id = None
            if record.rec_type == 'request':
                id = record.rec_headers.get_header('WARC-Concurrent-To')
            elif record.rec_type == 'response':
                id = record.rec_headers.get_header('WARC-Record-ID')
            if id:
                if id not in half_exchanges:
                    half_exchanges[id] = record
                else:
                    if record.rec_type == 'request':
                        req = record
                        res = half_exchanges[id]
                    else:
                        req = half_exchanges[id]
                        res = record
                    # Remove temporary record that is paired and yield the pair
                    del half_exchanges[id]
                    yield (req, res)
```
The above code is one possible way to implement it, in which we keep track of unpaired records in a dictionary keyed by the identifier that glues corresponding request and response records together. Once a pair is matched, we can delete that bookkeeping entry and return the pair. This rudimentary approach, however, should be taken with a pinch of salt, as there is potential for a memory leak from records that are never paired. For WARCs with chaotic ordering the memory requirement may grow due to the growth of the bookkeeping, but related WARC records are generally placed in close proximity, in which case the bookkeeping should stay small.
Also, this code does not deal with revisit records for now, but something should be done for those too.
Hello, I'm using warcio to read a WARC archive containing brotli-encoded HTTP responses, like so:

```python
with open(sys.argv[1], 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response':
            continue
        print(record.content_stream().read())
```

This gives me the following error, because there is no attribute `unused_data` in `brotli`.
```
Traceback (most recent call last):
  File "load_warc.py", line 22, in <module>
    print(record.content_stream().read())
  File "/home/sebastian/.asdf/installs/python/3.7.2/lib/python3.7/site-packages/warcio/recordloader.py", line 34, in content_stream
    return ChunkedDataReader(self.raw_stream, decomp_type=encoding)
  File "/home/sebastian/.asdf/installs/python/3.7.2/lib/python3.7/site-packages/warcio/bufferedreaders.py", line 284, in __init__
    super(ChunkedDataReader, self).__init__(stream, **kwargs)
  File "/home/sebastian/.asdf/installs/python/3.7.2/lib/python3.7/site-packages/warcio/bufferedreaders.py", line 72, in __init__
    self._init_decomp(decomp_type)
  File "/home/sebastian/.asdf/installs/python/3.7.2/lib/python3.7/site-packages/warcio/bufferedreaders.py", line 89, in _init_decomp
    self.decompressor = self.DECOMPRESSORS[decomp_type.lower()]()
  File "/home/sebastian/.asdf/installs/python/3.7.2/lib/python3.7/site-packages/warcio/bufferedreaders.py", line 31, in brotli_decompressor
    decomp.unused_data = None
AttributeError: 'brotli.Decompressor' object has no attribute 'unused_data'
```
What am I missing?
When I try to use warcio to read WAT files generated by the archive-metadata-extractor tool, it gives me this error message:

```
WARNING: Record not followed by newline, perhaps Content-Length is invalid
Offset: -97518
Remainder: b'WARC/1.0\r\n'
```
```
Traceback (most recent call last):
  File "/home/.local/lib/python3.7/site-packages/warcio/recordloader.py", line 220, in _detect_type_load_headers
    rec_headers = self.warc_parser.parse(stream, statusline)
  File "/home/.local/lib/python3.7/site-packages/warcio/statusandheaders.py", line 264, in parse
    raise StatusAndHeadersParserException(msg, full_statusline)
warcio.statusandheaders.StatusAndHeadersParserException: Expected Status Line starting with ['WARC/1.1', 'WARC/1.0', 'WARC/0.17', 'WARC/0.18'] - Found: WARC-Type: metadata

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/.vscode/extensions/ms-python.python-2020.1.58038/pythonFiles/ptvsd_launcher.py", line 43, in <module>
    main(ptvsdArgs)
  File "/home/.vscode/extensions/ms-python.python-2020.1.58038/pythonFiles/lib/python/old_ptvsd/ptvsd/__main__.py", line 432, in main
    run()
  File "/home/.vscode/extensions/ms-python.python-2020.1.58038/pythonFiles/lib/python/old_ptvsd/ptvsd/__main__.py", line 316, in run_file
    runpy.run_path(target, run_name='__main__')
  File "/usr/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/warcio_reader.py", line 4, in <module>
    for record in ArchiveIterator(stream):
  File "/home/.local/lib/python3.7/site-packages/warcio/archiveiterator.py", line 110, in _iterate_records
    self.record = self._next_record(self.next_line)
  File "/home/.local/lib/python3.7/site-packages/warcio/archiveiterator.py", line 262, in _next_record
    self.check_digests)
  File "/home/.local/lib/python3.7/site-packages/warcio/recordloader.py", line 88, in parse_record_stream
    known_format))
  File "/home/.local/lib/python3.7/site-packages/warcio/recordloader.py", line 225, in _detect_type_load_headers
    raise ArchiveLoadFailed(msg + str(se.statusline))
warcio.exceptions.ArchiveLoadFailed: Invalid WARC record, first line: WARC-Type: metadata
```
This is the code snippet I used to read WAT files:

```python
from warcio.archiveiterator import ArchiveIterator

with open('file.wat.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'metadata':
            print(record.rec_headers.get_header('WARC-Target-URI'))
```
I am interested in downloading a list of files through scrapy to save them in a WARC file; is that possible? The documentation only shows how to do it with the `requests` module and a simple request, so if there is any way to do this, maybe it could also be added as a usage example in the documentation.
Link headers can be supplied with extra parameters that currently are not correctly handled by the percent encoding in the StatusAndHeaders class:

```python
>>> from warcio.statusandheaders import StatusAndHeaders
>>> bad_header = '''Link: <https://www.albawaba.com/ar/node/1299230>; rel="shortlink", <https://www.addustour.com/articles/1089185-رونالدو-الأغلى-في-التاريخ-الصورة-بـ875-ألف-يورو?s=6226fa042a39b111646918198b4656d6>; rel="canonical"'''
>>> StatusAndHeaders(None, [(bad_header[:4], bad_header[6:])])
UnicodeEncodeError
```

I believe this is a perfectly reasonable header, and I have seen examples of this bug on several occasions.
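One possible mitigation, sketched with the stdlib: percent-encode the codepoints that Latin-1 cannot represent before the value reaches the Latin-1 encoder (whether warcio should do this internally is exactly the question; the URL here is shortened for illustration):

```python
from urllib.parse import quote

value = '<https://example.com/articles/1089185-رونالدو>; rel="canonical"'

try:
    value.encode('latin-1')
except UnicodeEncodeError:
    # Percent-encode non-ASCII as UTF-8 octets; keep the Link-header
    # punctuation readable via the safe set
    value = quote(value, safe=' <>;/:?=&",')

encoded = value.encode('latin-1')  # now succeeds; every byte is ASCII
```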
I was using the `warcio recompress` command-line tool to fix some incorrect (not individually compressed) WARC files and I stumbled onto a `UnicodeEncodeError` exception. I assume the reason for this bug is that the WARCs I used contain Cyrillic and Greek characters. However, I don't suppose that is the expected behavior.
The WARCs that I used can be found here. Specifically, `kremlin.warc.gz` and `primeminister.warc.xz` are the WARC files in question.
This is the exact error that I've gotten:
Exception Details:

```
Traceback (most recent call last):
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/statusandheaders.py", line 168, in to_ascii_bytes
    string = string.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 130-136: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/cli.py", line 105, in __call__
    count = self.load_and_write(stream, cmd.output)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/cli.py", line 145, in load_and_write
    writer.write_record(record)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/warcwriter.py", line 368, in write_record
    self._write_warc_record(self.out, record)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/warcwriter.py", line 248, in _write_warc_record
    self._set_header_buff(record)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/warcwriter.py", line 240, in _set_header_buff
    headers_buff = record.http_headers.to_ascii_bytes(self.header_filter)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/statusandheaders.py", line 172, in to_ascii_bytes
    string = string.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4517-4520: ordinal not in range(128)
```
I tried to set up Travis to test my develop-warcio-test branch; see https://travis-ci.org/wumpus/warcio
Jinja2 3.0.0a1 uses f-strings. This repo's Travis config tests Python 3.4/3.5, which do not support f-strings. You can pin an earlier version.
Python 2.7 installing Jinja2 via setuptools hits a similar issue.
I was adding DNS records to my crawler and ran across a few odd things:

```python
from io import BytesIO
from warcio.warcwriter import WARCWriter

payload = '''\
20170509000739
google.com. 10 IN A 172.217.6.78
google.com. 10 IN A 172.217.6.78
google.com. 10 IN A 172.217.6.78
'''
payload = payload.encode('utf-8')

with open('test_dns.warc', 'wb') as f:
    writer = WARCWriter(f, gzip=False)

    # oddness #1 -- programming error leads to negative Content-Length
    # (error is that dns things are 'resource' not 'response' according to WARC 1.0 standard)
    # recommend: raising an exception for negative Content-Length
    record = writer.create_warc_record('dns:www.google.com', 'response', payload=BytesIO(payload),
                                       warc_content_type='text/dns')
    writer.write_record(record)

    # surprise #2 -- if I don't specify length=, I get a length of 0.
    # recommend: this should just work
    record = writer.create_warc_record('dns:www.google.com', 'resource', payload=BytesIO(payload),
                                       warc_content_type='text/dns')
    writer.write_record(record)

    # specify length, this one looks OK
    record = writer.create_warc_record('dns:www.google.com', 'resource', payload=BytesIO(payload),
                                       warc_content_type='text/dns', length=len(payload))
    writer.write_record(record)
```

Do you think either is worth patching? Or am I using it wrong? Should it also have thrown an exception for oddness #1, because the library expected to find HTTP headers in the payload and didn't?
I've been trying to create a WARC file using warcio by passing a response object; however, when requests.get() is not in stream mode, warcio does not write the response content to the payload of the WARC record. Is there a way to solve this other than changing the request to stream mode?
When working with large WARC files, it is sometimes necessary to split a WARC file into chunks. Can we add a CLI option to split a WARC file into n chunks, or into chunks of n records each?
Should we extract the titles of applicable records (such as HTML pages) and make them available as an attribute? I can see some usefulness to this, but I understand that it will add some additional processing time. While the same can be done in applications using the `warcio` package, if its usefulness is widespread, we might as well move the functionality into `warcio` itself.
There seems to be no way of reading the optional headers that appear between the metadata and request headers:

```
fetchTimeMs: 1313
charset-detected: UTF-8
languages-cld2: {"reliable":true,"text-bytes":29834,"languages":[{"code":"de","code-iso-639-3":"deu","text-covered":0.99,"score":990.0,"name":"GERMAN"}]}
```

From CC-MAIN-20200216182139-20200216212139-00000.warc.gz
Brought up in issue #74: it appears that most tools and the WARC standard disagree about how to compute digests when there is a transfer encoding (i.e. chunked). `warcio check` should be extended to compute both digests and make useful comments about the situation.
An example case of filtering out responses that don't have a `200` status fails with:
```
Traceback (most recent call last):
  File "/usr/lib/python3.7/http/client.py", line 457, in read
    n = self.readinto(b)
  File "/usr/lib/python3.7/http/client.py", line 509, in readinto
    self._close_conn()
  File "/usr/lib/python3.7/http/client.py", line 411, in _close_conn
    fp.close()
  File "/home/baali/projects/pyenv/lib/python3.7/site-packages/warcio/capture_http.py", line 65, in close
    self.recorder.done()
  File "/home/baali/projects/pyenv/lib/python3.7/site-packages/warcio/capture_http.py", line 185, in done
    request, response = self.filter_func(request, response, self)
  File "create_warc.py", line 43, in filter_records
    if response.http_headers.get_statuscode() != '200':
AttributeError: 'RequestRecorder' object has no attribute 'http_headers'
```
And the `RequestRecorder` object doesn't have an `http_headers` attribute.
The WARC/1.1 specification states that:

> The WARC-Payload-Digest field may be used on WARC records with a well-defined payload and shall not be used on records without a well-defined payload. (Section 5.9)
While a payload can certainly be defined for other data as well, the spec only does so for HTTP (cf. #74). However, warcio writes a payload digest indiscriminately for any record that isn't a `warcinfo` or `revisit`. I'm writing `resource` records with a number of content types which don't have a payload in the HTTP sense, including `application/x-python`, `application/octet-stream`, and `text/plain`. Of course, in principle, one could also write `request` and `response` records for something other than HTTP (e.g. DNS queries) which may or may not have a "well-defined payload".
I think that warcio should only write the payload digest for records with an HTTP `Content-Type` header.
The change introduced to fix #34 added a close() to ArchiveIterator and disabled the close of the underlying stream in BufferedReader. This was incorrect: instead, BufferedReader.close() should continue to close the underlying stream, with a `close_stream=False` option added to the BufferedReader constructor so that ArchiveIterator will not close the underlying stream, as before.
Simpler fix: just add a separate `close_decompressor()` method to BufferedReader; ArchiveIterator calls `close_decompressor()`, not close(). close() continues to behave as before, closing the stream.
I was experimenting with injecting digests from my crawler, so that digests aren't computed twice, and noticed that records with a bad WARC-Payload-Digest don't raise an exception on read. There's no code for it, so I suppose this is a feature request.
The check should be disable-able, and `warcio index` and `warcio recompress` ought to have a command-line flag to ignore digest errors.
Lacking this feature, I don't think that warcio currently has any test to ensure that it's correctly computing digests.
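The missing check itself is small; a sketch of verifying a WARC-Payload-Digest value, assuming the usual sha1/base32 form that warcio writes (the function names are made up):

```python
import base64
import hashlib

def payload_digest(payload: bytes) -> str:
    """Compute a WARC-style digest string: 'sha1:' + base32(raw sha1 bytes)."""
    return 'sha1:' + base64.b32encode(hashlib.sha1(payload).digest()).decode('ascii')

def digest_matches(payload: bytes, stated: str) -> bool:
    """Recompute the digest over the payload and compare with the stated header value."""
    return payload_digest(payload) == stated

good = payload_digest(b'foobar')
assert digest_matches(b'foobar', good)
assert not digest_matches(b'foobaz', good)   # a corrupted payload would be caught
```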
`warcwriter:create_warcinfo_record` does this to the payload lines:

```python
warcinfo.write(line.encode('latin-1'))
```

Is that correct? I looked through the 1.0 draft standard, and it appears to say that UTF-8 can appear anywhere, with no mention of latin-1.
If latin-1 is correct, it needs `errors='something'` in case Python 3 users send in stuff that won't encode as latin-1.
Regarding #42 ... on reading the WARC 1.1 spec, my eye was drawn to this line:

> NOTE: in WARC 1.0 standard (ISO 28500:2009), uri was defined as “<” <’URI’ per RFC 3986> “>”. This rule has been changed to meet requests from implementers.

And indeed the WARC 1.0 standard does specify that all URIs have "<" and ">" around them. But in the examples, WARC-Target-URI does not have "<" and ">". (The examples are informative only.)
It appears that the only header affected by this specification bug is WARC-Target-URI. In WARC 1.1 both this field and the new-in-1.1 WARC-Refers-To-Target-URI are explicitly said to not have "<" ">" around the uri, while other uris explicitly do have the "<" ">" (e.g. WARC-Record-ID, WARC-Refers-To, ...)
No action needed, but this does reinforce that the workaround for wget added in #42 is a good idea. Other tools might have chosen to do this after reading the 1.0 standard.
I'm working through trying to emit headers that are as unprocessed as possible. aiohttp's response object has a raw_headers variable that's a list of 2-tuples (good), and the values are bytes, not str. That ought to be good, right? But the warcio code appears to assume that the headers are str (Python 3).
While I could certainly decode the headers before calling warcio, and the charset ISO-8859-1 is safe for round-tripping like that, I'm just wondering if taking str is a bad interface. You aren't doing RFC 5987 (https://tools.ietf.org/html/rfc5987) processing, for example, if non-ISO-8859-1 codepoints are in the str.
Is there a reason why WARCWriter will output the response before the request when using write_request_response_pair? Wouldn't it be more natural to put the request first?
Very minor nitpick.
I've been attempting to use warcio in Hadoop streaming jobs. This went rather wrong, because Hadoop streaming mode uses stdin/stdout and warcio prints to stdout under certain error conditions (warcio/bufferedreaders.py, lines 142 to 144 in ed7ebfd).
This seems brittle/clumsy. Surely this should raise an exception if it's serious, and/or use the standard logging framework (as appropriate)?
(Sadly, I still have not got to the bottom of why a WARC file that works fine when processed via `warcio index` manages to throw this error when processed via Hadoop, but that's a separate issue.)
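The print() could be routed through the standard logging framework instead, so that Hadoop's stdout channel stays clean. A sketch of the suggestion (the in-memory handler target here is just for demonstration):

```python
import io
import logging

# Send warnings to a handler of the application's choosing, never to stdout
log_target = io.StringIO()
logger = logging.getLogger('warcio.bufferedreaders')
logger.addHandler(logging.StreamHandler(log_target))

# In place of the print() in bufferedreaders.py:
logger.warning('Record not followed by newline, perhaps Content-Length is invalid')

assert 'Content-Length is invalid' in log_target.getvalue()
```

A library that logs through its own named logger lets callers raise the level, redirect, or silence it entirely without touching library code.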
I am creating some test cases for https://github.com/oduwsdl/ipwb and want to use the feature of the WARC/1.1 specification that allows WARC-Date precision on the sub-second scale.
The sample WARCs I have generated process fine with warcio unless I use `WARC/1.1` as the first line of a WARC record. Are there plans to allow records using this version of the spec to be processed by warcio?
Related to #64:
When writing a record into an archive, one can read out its content via the streaming API and then write the record (now with partial or empty content) into the archive, resulting in the loss of content. We clearly do not, or at least should not, want that. An exception or a warning should be raised instead of losing content silently.
```python
from warcio.warcwriter import WARCWriter

filename = 'test.warc.gz'

# Write WARC INFO record with custom headers
output_file = open(filename, 'wb')
writer = WARCWriter(output_file, gzip=True, warc_version='WARC/1.1')

# Custom information and stuff
info_headers = {'custom': 'stuff'}
info_record = writer.create_warcinfo_record(filename, info_headers)
custom_headers_raw = info_record.content_stream().read(6)  # TODO: After this, writing should not be allowed
writer.write_record(info_record)
output_file.close()
```
Result (notice the partial payload):

```
WARC/1.1
WARC-Type: warcinfo
WARC-Record-ID: <urn:uuid:67a981ea-fece-49b9-834a-e3b660042cf5>
WARC-Filename: test.warc.gz
WARC-Date: 2019-08-15T10:44:53.778034Z
Content-Type: application/warc-fields
Content-Length: 15

: stuff
```
Records read back from a file just written should be equal to the Python object written. This is something I discovered while writing tests for an application using warcio. Test case:
```python
import pytest
from io import BytesIO
from tempfile import NamedTemporaryFile
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

def test_identity():
    """ read(write(record)) should yield record """
    with NamedTemporaryFile() as fd:
        payload = b'foobar'
        writer = WARCWriter(fd, gzip=True)
        httpHeaders = StatusAndHeaders('GET / HTTP/1.1', {}, is_http_request=True)
        warcHeaders = {'Foo': 'Bar'}
        record = writer.create_warc_record('http://example.com/', 'request',
                                           payload=BytesIO(payload),
                                           warc_headers_dict=warcHeaders,
                                           http_headers=httpHeaders)
        writer.write_record(record)
        fd.seek(0)
        rut = next(ArchiveIterator(fd))
        golden = record
        assert rut.rec_type == golden.rec_type
        assert rut.rec_headers == golden.rec_headers
        assert rut.content_type == golden.content_type
        assert rut.length == golden.length
        assert rut.http_headers == golden.http_headers
        assert rut.raw_stream.read() == payload
```
results in the following assertion failure:

```
E AssertionError: assert StatusAndHead...ngth', '24')]) == StatusAndHeade...ngth', '24')])
E Full diff:
E - StatusAndHeaders(protocol = 'WARC/1.0', statusline = '', headers = [('Foo', 'Bar'), ('WARC-Type', 'request'), ('WARC-Record-ID', '<urn:uuid:2eb39603-c759-4865-9e5b-2a3cd9c81c92>'), ('WARC-Target-URI', 'http://example.com/'), ('WARC-Date', '2018-12-04T15:27:45Z'), ('WARC-Payload-Digest', 'sha1:RBB5P6JECYQR32PLXFR76THCQESZGKDY'), ('WARC-Block-Digest', 'sha1:HVUJ5SESVATOLVXZZTFORJY44V5BW7YB'), ('Content-Type', 'application/http; msgtype=request'), ('Content-Length', '24')])
E ?                             ^^^^^^^^^^   --------------
E + StatusAndHeaders(protocol = '', statusline = 'WARC/1.0', headers = [('Foo', 'Bar'), ('WARC-Type', 'request'), ('WARC-Record-ID', '<urn:uuid:2eb39603-c759-4865-9e5b-2a3cd9c81c92>'), ('WARC-Target-URI', 'http://example.com/'), ('WARC-Date', '2018-12-04T15:27:45Z'), ('WARC-Payload-Digest', 'sha1:RBB5P6JECYQR32PLXFR76THCQESZGKDY'), ('WARC-Block-Digest', 'sha1:HVUJ5SESVATOLVXZZTFORJY44V5BW7YB'), ('Content-Type', 'application/http; msgtype=request'), ('Content-Length', '24')])
E ?                             +++++++++++++++++ ^^^^^^^
```
How can I patch the capture_http method so that the order of imports is irrelevant? This ordering is quite problematic when importing a class (such as a web scraping class/tool) which handles all the WARC writing for you.
There must be an easier way than including the ordered import statements in every single script that imports this scraping class/tool.
Hello,
I'd like to use the warcio library with scrapy and saw the other thread about it, as well as the code.
The difference for me is that I've got some logic inside my spider where I'd like to create different WARC files that, until they reach a certain size, need to stay in memory before being written to disk.
I believe I've almost got it working, but I can't figure out how to build up the payload.
That's what I currently have:
```python
def write_response_to_memory(self, response):
    '''Writes a `response` object from Scrapy as a WARC record.'''
    response_url = w3lib.url.safe_download_url(response.url)

    # Create the payload string
    payload_temp = io.StringIO()
    for h_name in response.headers:
        payload_temp.write('%s: %s\n' % (h_name, response.headers[h_name]))
    payload_temp.write('\r\n')
    payload_temp.write(response.text)

    headers = []
    headers.append(('WARC-Type', 'response'))
    headers.append(('WARC-Date', self.now_iso_format()))
    headers.append(('Content-Length', str(payload_temp.tell())))
    headers.append(('Content-Type', str(response.headers.get('Content-Type', ''))))
    http_headers_temp = StatusAndHeaders('200 OK', headers, protocol='HTTP/1.0')

    record = self.warcdict['test'].create_warc_record(response_url, 'response',
                                                      payload=payload_temp.getvalue(),
                                                      http_headers=http_headers_temp)
```
Which results in:
File "/home/hpn/TestScripts/warc/scrapywarc/scrapywarc/spiders/scrapywarc.py", line 123, in write_response_to_memory
http_headers=http_headers_temp)
File "/home/hpn/live/ccd/lib/python3.7/site-packages/warcio/recordbuilder.py", line 117, in create_warc_record
self.ensure_digest(record, block=False, payload=True)
File "/home/hpn/live/ccd/lib/python3.7/site-packages/warcio/recordbuilder.py", line 191, in ensure_digest
for buf in self._iter_stream(record.raw_stream):
File "/home/hpn/live/ccd/lib/python3.7/site-packages/warcio/recordbuilder.py", line 218, in _iter_stream
buf = stream.read(BUFF_SIZE)
AttributeError: 'str' object has no attribute 'read'
Is there any documentation of what the payload needs to look like?
Or is there an alternative way to add my content and header to create_warc_record?
Related to #64:
When one creates a warcinfo record, it is required to provide custom headers in a dictionary-like format:
info_headers = {'custom': 'stuff'}
info_record = writer.create_warcinfo_record(filename, info_headers)
But when reading a warcinfo record one must parse the dictionary oneself:
info_rec = next(archive_it)
assert info_rec.rec_type == 'warcinfo'
custom_headers_raw = info_rec.content_stream().read()
info_rec_payload = dict(r.split(': ', maxsplit=1)
                        for r in custom_headers_raw.decode('UTF-8').strip().split('\r\n')
                        if len(r) > 0)
There should be an API like the following:
info_headers = {'custom': 'stuff'}
info_record = writer.create_warcinfo_record(filename, info_headers)
"""...write out and open the archive for reading..."""
info_rec = next(archive_it)
assert info_rec.rec_type == 'warcinfo'
custom_headers = info_rec.info_content_dict()
assert info_headers == custom_headers
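Until such an API exists, the parsing can at least be hidden behind a small helper; the function name below is hypothetical, not part of warcio:

```python
def parse_warc_fields(raw_bytes):
    """Parse an application/warc-fields payload (e.g. the bytes returned by
    info_rec.content_stream().read()) into a plain dict."""
    text = raw_bytes.decode('utf-8').strip()
    # Each non-empty line is "Name: value"; split only on the first ': '
    return dict(line.split(': ', 1) for line in text.split('\r\n') if line)
```

This mirrors what a built-in `info_content_dict()` could do internally.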
We were using the cdxj-indexer to re-index our WARCs/ARCs and ran across this error. In our older ARCs, there are URIs that contain spaces. The cdxj-indexer was failing on these ARCs.
warcio.recordloader.ArchiveLoadFailed: Unknown archive format, first line: ['http://www.melforsenate.org/index.cfm?FuseAction=Home.Email&EmailTitle=Martinez', 'Targets', 'Castor', 'Backer&EmailURL=http%3A%2F%2Fwww%2Emelforsenate%2Eorg%2Findex%2Ecfm%3FFuseaction%3DArticles%2EView%26Article%5Fid%3D187&IsPopUp=True', '65.36.164.67', '20040925001013', 'text/html', '149']
In ArchiveIterator, chunks of data (16384 bytes) are decompressed using decompressor = zlib.decompressobj(); however, once decompression is done, decompressor.flush() is never called, which leaks memory when reading large files or a large number of files in Python 2.7. Please look into it @ikreymer
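For illustration, the pattern being requested looks like the following sketch (function name and chunking are mine, not warcio's): once the stream is exhausted, `flush()` returns any remaining data and lets the decompressor release its internal buffers.

```python
import zlib


def decompress_chunks(chunks):
    """Decompress gzip data arriving in chunks, flushing at the end."""
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip wrapper
    d = zlib.decompressobj(zlib.MAX_WBITS | 16)
    out = [d.decompress(chunk) for chunk in chunks]
    # flush() yields any remaining data and frees internal buffers --
    # the step the issue reports as missing.
    out.append(d.flush())
    return b''.join(out)
```
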
I noticed today that warcio doesn't generate a block digest for warcinfo records:
warcio/recordbuilder.py, line 30 (at commit 8e3ceb7)
This seems to have been introduced in a791617, but I was unable to figure out why. The spec permits block digests on any record (whereas a payload digest would make no sense on a warcinfo record due to the content type used normally), and it seems good practice to me to always store a digest to allow for integrity checks.
I built a thing that tests a warc for standards conformance. The cli is similar to "warcio check". It's 440 lines of code so far, likely to be around 1,000 when done.
It will need an extended testing and tweaking period while it's tested against everything in the ecosystem that generates warcs. Discussion might be ... vigorous. I'm currently labeling things as "not standard conforming", "following/not following recommendations", and "comments". Hopefully not too many hairs will be split.
Does this belong in warcio? My hope is that it will be commonly used; with luck that means that the entire web archiving ecosystem will keep warcio installed and part of their testing processes.
warcio writes a WARC-Profile header value of http://netpreserve.org/warc/1.0/revisit/identical-payload-digest on revisit records regardless of the WARC version used. For WARC/1.1, it should be http://netpreserve.org/warc/1.1/revisit/identical-payload-digest instead (section 6.7.2 in the WARC/1.1 specification).
Unfortunately, it isn't even possible to override this through warc_headers_dict because you then end up with two WARC-Profile headers. As a workaround, this works after creating the record:
record.rec_headers.replace_header('WARC-Profile', 'http://netpreserve.org/warc/1.1/revisit/identical-payload-digest')
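A version-aware sketch of what the writer could do instead; the mapping and function name below are mine, derived only from the two profile URIs in the spec:

```python
# Hypothetical mapping from WARC version to the matching revisit profile URI
# (WARC/1.1 spec, section 6.7.2)
REVISIT_PROFILES = {
    '1.0': 'http://netpreserve.org/warc/1.0/revisit/identical-payload-digest',
    '1.1': 'http://netpreserve.org/warc/1.1/revisit/identical-payload-digest',
}


def revisit_profile(warc_version):
    """Return the identical-payload-digest profile URI for a WARC version."""
    return REVISIT_PROFILES[warc_version]
```
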
warcio uses a default Content-Type value of application/warc-record for WARC records. This MIME type is not documented or specified anywhere; the WARC spec only mentions application/warc as the MIME type for WARC files and application/warc-fields for warcinfo and metadata records (though it is ambiguous on whether that is required or recommended).
Prior to the update, I used to write the records manually with custom headers. Now that the capture_http method avoids the manual writing, I want to use it, but does it support adding custom headers?
Using ArchiveIterator to filter some WARC records, I noticed that it adds some bytes to the HTTP header without updating the Content-Length (record.length) in memory. Given this code:
with open(warcfilebasename + ".warc", 'rb') as f_in:
    with open(warcfilebasename + ".warc.gz", 'wb') as f_out:
        writer = WARCWriter(f_out, gzip=True)
        try:
            for record in ArchiveIterator(f_in):
                if record.http_headers:
                    if record.http_headers.get_header('Transfer-Encoding') == "chunked":
                        continue
                    try:
                        record.http_headers.to_ascii_bytes()
                    except UnicodeEncodeError:
                        # if header is non-ascii, create a new header, with status code only
                        # content length and content type will be filled before writing
                        record.http_headers = StatusAndHeaders(record.http_headers.get_statuscode(), [])
                writer.write_record(record)
        except:
            pass
and this input WARC generated by wget:
awt.zip
It generates this WARC with a wrong Content-Length in the record 'urn:uuid:47ef9267-a4cc-47fa-a1b2-ddc6e746216d':
awt.warc.gz
This WARC crashes when using warcio index awt.warc.gz; the previous one doesn't.
If I add record.length = None before writer.write_record(record) in my code, warcio recalculates the increased WARC content length before writing the output, and then it works with warcio index.
The issue is, why does ArchiveIterator add content to the HTTP header when reading? And if that is necessary, why doesn't it update the content length?
Is this related to #57 ?
Per discussion in #6, encode WARC headers as UTF-8. Attempt to decode them as UTF-8 as well, and fall back to ISO-8859-1.
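A sketch of the symmetric behavior being proposed; the decode side mirrors the snippet quoted in the issue, and the encode side is the proposed change (plain functions here, not warcio's actual methods):

```python
def decode_header(line):
    """Decode a raw header line: try UTF-8 first, fall back to ISO-8859-1."""
    try:
        return line.decode('utf-8')
    except UnicodeDecodeError:
        # legacy records: every byte is valid ISO-8859-1, so this cannot fail
        return line.decode('iso-8859-1')


def encode_header(text):
    """Encode symmetrically with decoding, so a read/write round trip
    preserves characters like the ’ in d’Échange."""
    return text.encode('utf-8')
```

With UTF-8 on both sides, the `UnicodeEncodeError` on latin-1 disappears; the trade-off is that re-written records may differ byte-for-byte from latin-1-era originals.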
Given the whitespace-related header bug that crept into the August 2018 Common Crawl crawl, it would be nice if it were somewhat difficult to create broken WARC files using warcio.
I see a couple of possible issues: