Comments (5)
Support for reading and writing WARC 1.1 added in warcio 1.6.0
from warcio.
@sebastian-nagel yeah, I think you're right. While we've been cautious about starting to write 1.1 WARCs, we should definitely support reading WARC/1.1. We can look into this soon.
In selectively importing parts of the warcio API, I can persuade the below to process examples with the above scenario:
from warcio.archiveiterator import ArchiveIterator
from warcio.recordloader import ArcWarcRecordLoader

ArcWarcRecordLoader.WARC_TYPES.append('WARC/1.1')
warc11 = '(pathtomywarc)'
with open(warc11, 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            print(record.rec_headers.get_header('WARC-Date'))
...but this is a dirty hack and does not account for the other features of the 1.1 spec. With that also in mind, WARC-Dates that are invalid per the WARC/1.0 spec but legal per WARC/1.1 (e.g., 2014-02-10T00:00:01.000000002Z) do not throw any sort of validation error when processed with warcio 95d5dcd.
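To make the WARC-Date difference concrete, here is a minimal sketch of a version-aware check. It is not part of warcio: the function name and both patterns are assumptions for illustration. The sketch assumes WARC/1.0 dates have second precision and WARC/1.1 dates may carry one to nine fractional-second digits, and it ignores any reduced-precision forms.

```python
import re

# Assumption: WARC/1.0 dates are second-precision UTC timestamps (YYYY-MM-DDThh:mm:ssZ).
# Assumption: WARC/1.1 additionally allows 1 to 9 fractional-second digits.
WARC_DATE_PATTERNS = {
    "WARC/1.0": re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"),
    "WARC/1.1": re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d{1,9})?Z"),
}


def date_valid_for(version, warc_date):
    """Hypothetical helper: check a WARC-Date string against the pattern for a version."""
    try:
        pattern = WARC_DATE_PATTERNS[version]
    except KeyError:
        raise ValueError("unsupported WARC version: %r" % version)
    # fullmatch() rejects trailing or leading characters around the timestamp.
    return pattern.fullmatch(warc_date) is not None


if __name__ == "__main__":
    sample = "2014-02-10T00:00:01.000000002Z"
    for version in WARC_DATE_PATTERNS:
        print(version, date_valid_for(version, sample))
```

A check like this could be run on the value returned by record.rec_headers.get_header('WARC-Date'), using the version declared by the record.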
My above question (plans?) still remains. I am hoping to finally get around to integrating warcio into ipwb for oduwsdl/ipwb#380 and oduwsdl/ipwb#374.
Similar issue: warcio index fails on a WARC file of version 1.1:
warcio.recordloader.ArchiveLoadFailed: Unknown archive format, first line: ['WARC/1.1']
The mentioned work-around (add WARC/1.1 to WARC_TYPES) is not applicable to command-line tools.
The only 1.1 feature I plan to use is the WARC-Refers-To-Date header in revisit records. Warcio does not seem to have issues with unknown headers. If there is already partial support for WARC/1.1 (simply because the differences from 1.0 are small), why not claim to support it?
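The "Unknown archive format" error above reports the version line at the start of the first record. As a rough diagnostic that does not depend on warcio, the sketch below reads that line from a plain or gzip-compressed file using only the standard library. The function name is hypothetical, and the stream is assumed to be seekable and opened in binary mode.

```python
import gzip
import io


def sniff_warc_version(stream):
    """Hypothetical helper: return the first line of a WARC stream, e.g. 'WARC/1.1'."""
    head = stream.read(2)
    stream.seek(0)  # assumes a seekable stream, such as a file opened with 'rb'
    if head == b"\x1f\x8b":  # gzip magic number
        stream = gzip.GzipFile(fileobj=stream)
    first_line = stream.readline().strip()
    return first_line.decode("ascii", errors="replace")


if __name__ == "__main__":
    # In-memory stand-in for the start of a WARC file.
    demo = io.BytesIO(b"WARC/1.1\r\nWARC-Type: response\r\n\r\n")
    print(sniff_warc_version(demo))
```

For a file on disk, the same function can be called inside `with open(path, 'rb') as stream:` before passing the file to a command-line tool.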
Any news on WARC 1.1 support?