Comments (6)
Fixed together with some other issues. New binaries should be on PyPI in a few minutes.
from chatnoir-resiliparse.
The question is: is this expected behaviour or not? It's invalid input and you would want some sort of error thrown.
EDIT: ah, I see. The content reader hangs. That shouldn't happen.
The goal, as I understood it, is not just resilience of large-scale data processing jobs with respect to, e.g., extreme or invalid HTML files, but also resilience against errors occurring in other parts of the processing pipeline. It would be wasteful if a million-WARC file processing job failed because of a single corrupt WARC file.
At any rate, can recoverable errors be logged (on demand)?
EDIT: This comment relates to the previous one on whether this was expected behavior.
Of course. But resilience also means that you should be able to react to errors. With the fix, the processing pipeline just continues without errors, even if the GZip stream is truncated, which is fine I believe (it shouldn't hang in any case, which is one of the major issues I've had with previous pipelines and the whole reason Resiliparse has TimeGuard and MemoryGuard). In fact, I wonder if this error should be logged at all or if it should be up to the user to detect this kind of issue. As a user, you could compare the stream content length with the Content-Length header or verify the record digests if you worry about truncated records. So yes, throwing an unexpected exception wouldn't be desirable here, I would say.
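A minimal sketch of the user-side checks mentioned above, using only the Python standard library rather than FastWARC's own API. The function names are hypothetical, and the digest check assumes the common `sha1:<base32>` form of the WARC digest headers:

```python
import base64
import hashlib
import zlib
from typing import Tuple


def read_gzip_tolerant(data: bytes) -> Tuple[bytes, bool]:
    """Decompress one gzip member without raising on truncation.

    Returns (payload, complete). `complete` is False if the stream ended
    before the gzip trailer was seen, i.e. the input was cut short.
    """
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # | 16 selects gzip framing
    try:
        payload = decomp.decompress(data)
    except zlib.error:
        # Corrupt (not merely truncated) input: report nothing usable.
        return b"", False
    return payload, decomp.eof


def length_matches(declared_length: int, body: bytes) -> bool:
    """Compare the bytes actually read with the declared Content-Length."""
    return len(body) == declared_length


def digest_matches(header_value: str, body: bytes) -> bool:
    """Check a 'sha1:<base32>' digest header value against the body."""
    algorithm, _, expected = header_value.partition(":")
    if algorithm.lower() != "sha1":
        raise ValueError("this sketch only handles sha1 digests")
    actual = base64.b32encode(hashlib.sha1(body).digest()).decode("ascii")
    return actual == expected.strip().upper()
```

With checks like these, a truncated record shows up as a length or digest mismatch in user code, while the reader itself neither hangs nor raises.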
Regarding logging, I guess we should not have expectations about whatever goes on in different parts of operations that involve processing WARC files at scale. Rather, if we have knowledge of an error, then it makes sense to tell the user about it, albeit maybe only on demand.
So, what's the most common wish users have from their tools?
Silent by default, and noisy on demand? Or the other way around?
If more extensive logging is introduced, it creates lots of extra plumbing (e.g., where does the tool store the logs and can this be adjusted, logging server connections in case of distributed usage, etc.). But in the long run, such facilities might be asked for, anyway, given the professional context of resilient large-scale processing that is the target audience of this tool.
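For illustration, the "silent by default, noisy on demand" option could be sketched with Python's standard `logging` module. This is not how Resiliparse or FastWARC behave; the logger name and both helper functions are hypothetical:

```python
import logging
import sys

# Hypothetical logger name. The NullHandler keeps the library silent
# unless the application opts in by attaching a real handler.
log = logging.getLogger("warc_pipeline_sketch")
log.addHandler(logging.NullHandler())


def report_recoverable(record_id: str, reason: str) -> None:
    """Record a recoverable error; produces no output by default."""
    log.warning("recoverable error in %s: %s", record_id, reason)


def enable_logging(stream=sys.stderr, level: int = logging.WARNING) -> logging.Handler:
    """Opt in: send messages at `level` and above to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setLevel(level)
    log.addHandler(handler)
    log.setLevel(level)
    return handler
```

Where the messages end up (file, stderr, a log server) stays the caller's decision, which would keep most of the plumbing out of the tool itself.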
For performance reasons, I would refrain from adding intensive logging at the moment.