Comments (5)
FYI: I haven't forgotten you, but I will probably not have time for this before next week.
from chatnoir-resiliparse.
I believe this comes down to two things:
- Slight inefficiencies in the buffer (re-)allocation in `BufferedReader.read()` for larger documents
- Measurement biases due to unequal document sizes

I've pushed some changes to address the first issue (see 070c5be; should be up as version 0.14.7 soon).
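For context, the inefficiency pattern here is the classic cost of growing a flat buffer one chunk at a time. A minimal, hypothetical sketch (not the actual Cython `BufferedReader` internals) of the difference between amortized growth via `bytearray` and repeated `bytes` concatenation:

```python
# Hypothetical sketch of the (re-)allocation issue, NOT the real
# BufferedReader code. Rebuilding a bytes object chunk by chunk copies
# the whole accumulated buffer on every iteration (quadratic overall),
# while bytearray over-allocates geometrically (amortized linear).

def read_all_amortized(chunks):
    buf = bytearray()
    for chunk in chunks:
        buf += chunk          # amortized O(len(chunk)) per extend
    return bytes(buf)

def read_all_quadratic(chunks):
    buf = b''
    for chunk in chunks:
        buf = buf + chunk     # copies everything accumulated so far
    return buf
```

Larger documents exercise this path more, which would match a slowdown that shows up mainly for bigger records.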
The second issue seems to stem from the peculiarity that later documents in the CC WARC seem to be larger for some reason. Here's the updated test case (with proper encoding detection):
```python
import os
import psutil
import time

from fastwarc.warc import ArchiveIterator, is_http, WarcRecordType
from resiliparse.parse.html import HTMLTree
from resiliparse.parse.encoding import detect_encoding

file = open('CC-MAIN-20231005012006-20231005042006-00899.warc.gz', 'rb')
archive_iterator = ArchiveIterator(
    file,
    record_types=WarcRecordType.response,
    parse_http=True,
    func_filter=is_http,
)

s = time.monotonic()
doc_size = 0
process = psutil.Process(os.getpid())
for idx, record in enumerate(archive_iterator):
    raw = record.reader.read()
    doc_size += len(raw)
    str(HTMLTree.parse_from_bytes(raw, detect_encoding(raw)))
    if (idx + 1) % 5000 == 0:
        t = time.monotonic() - s
        print(f'- Mem: {process.memory_info().rss / (1024 ** 2):.3f}MB, Time: {t:.3f}s')
        print(f'  Avg doc size: {doc_size / 5000:,.0f} bytes ({doc_size / t:,.0f} bytes/s)')
        doc_size = 0
        s = time.monotonic()
```
and this is the output I get:
```
- Mem: 53.504MB, Time: 12.782s
  Avg doc size: 76,857 bytes (30,064,370 bytes/s)
- Mem: 62.117MB, Time: 17.384s
  Avg doc size: 132,829 bytes (38,204,150 bytes/s)
- Mem: 62.492MB, Time: 16.595s
  Avg doc size: 137,767 bytes (41,508,993 bytes/s)
- Mem: 62.867MB, Time: 16.600s
  Avg doc size: 136,393 bytes (41,083,262 bytes/s)
- Mem: 67.742MB, Time: 18.536s
  Avg doc size: 145,028 bytes (39,121,232 bytes/s)
- Mem: 71.680MB, Time: 21.965s
  Avg doc size: 179,860 bytes (40,942,038 bytes/s)
- Mem: 71.680MB, Time: 21.583s
  Avg doc size: 179,028 bytes (41,473,469 bytes/s)
```
The rate per actual processed byte seems quite constant to me.
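The per-byte rate can be sanity-checked by recomputing it from the printed block stats (avg doc size × 5000 docs ÷ elapsed time); small deviations from the printed bytes/s figures are rounding, since the script divides the exact byte count:

```python
def throughput(avg_doc_size, n_docs, seconds):
    # Bytes processed per second over one 5000-document block.
    return avg_doc_size * n_docs / seconds

first = throughput(76_857, 5000, 12.782)    # first block, ~3.0e7 bytes/s
last = throughput(179_028, 5000, 21.583)    # last block,  ~4.1e7 bytes/s
```

Later blocks take longer mostly because they contain larger documents, not because per-byte processing slows down.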
Is it the WARC reader or the HTML parser?
It's caused by the HTML parser.
I'm trying to get to the bottom of this, and while I can reproduce the behaviour, I suspect the cause is not in Resiliparse, at least not in the HTML parser.
Memory usage does grow and the loop times go up as well, but not in a strictly monotonic fashion, and it goes back to normal after the loop. You can see this with a modified version of your test code:
```python
import os
import psutil
import time

from fastwarc.warc import ArchiveIterator as FastWarcArchiveIterator
from fastwarc.warc import WarcRecordType
from fastwarc.warc import is_http
from resiliparse.parse.html import HTMLTree

for r in range(3):
    print(f'-------\nRound {r + 1}\n-------')
    file = open('CC-MAIN-20231005012006-20231005042006-00899.warc.gz', 'rb')
    archive_iterator = FastWarcArchiveIterator(
        file,
        record_types=WarcRecordType.response,
        parse_http=True,
        func_filter=is_http,
    )
    s = time.monotonic()
    process = psutil.Process(os.getpid())
    for idx, record in enumerate(archive_iterator):
        raw = record.reader.read()
        try:
            str(HTMLTree.parse(raw.decode('utf-8')))
        except:
            pass
        if (idx + 1) % 5000 == 0:
            print(f'In loop {round(process.memory_info().rss / (1024 ** 2), 3)}MB, {round(time.monotonic() - s, 3)}s')
            s = time.monotonic()
    print(f'After loop {round(process.memory_info().rss / (1024 ** 2), 3)}MB')
```
This prints stats only every 5k iterations and once after the loop. The whole thing is run three times. These are the results:
```
-------
Round 1
-------
In loop 38.895MB, 3.704s
In loop 51.941MB, 6.022s
In loop 48.191MB, 5.294s
In loop 51.715MB, 5.336s
In loop 51.039MB, 5.36s
In loop 52.727MB, 6.726s
In loop 59.426MB, 6.596s
After loop 59.426MB
-------
Round 2
-------
In loop 59.426MB, 3.462s
In loop 56.117MB, 6.315s
In loop 56.305MB, 5.681s
In loop 55.559MB, 5.651s
In loop 59.508MB, 5.465s
In loop 57.93MB, 6.182s
In loop 66.922MB, 6.151s
After loop 66.922MB
-------
Round 3
-------
In loop 66.922MB, 3.314s
In loop 66.922MB, 5.8s
In loop 57.469MB, 5.459s
In loop 57.453MB, 5.344s
In loop 61.523MB, 5.485s
In loop 58.355MB, 6.403s
In loop 65.293MB, 6.268s
After loop 65.293MB
```
When I comment out the HTML parsing part, I get similar results, just with smaller numbers:
```
-------
Round 1
-------
In loop 29.07MB, 1.395s
In loop 28.703MB, 2.032s
In loop 28.996MB, 2.069s
In loop 26.387MB, 2.067s
In loop 28.926MB, 2.194s
In loop 28.824MB, 2.687s
In loop 28.512MB, 2.722s
After loop 28.215MB
-------
Round 2
-------
In loop 27.094MB, 1.339s
In loop 27.316MB, 2.067s
In loop 28.434MB, 2.096s
In loop 26.07MB, 2.1s
In loop 28.52MB, 2.174s
In loop 28.613MB, 2.717s
In loop 28.441MB, 2.727s
After loop 28.734MB
-------
Round 3
-------
In loop 26.672MB, 1.383s
In loop 28.172MB, 2.092s
In loop 28.465MB, 2.11s
In loop 26.027MB, 2.087s
In loop 28.547MB, 2.259s
In loop 28.727MB, 2.786s
In loop 28.379MB, 2.74s
After loop 28.105MB
```
The problem could still be in the WARC iterator, but it doesn't matter whether I call `record.reader.read()` or not.
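A variant of the loop that drives only the iterator, with the payload left unread, would look roughly like this (hypothetical sketch; it assumes, as the behaviour above suggests, that the iterator skips any unread payload when advancing to the next record):

```python
def mb(rss_bytes):
    # Convert a byte count to MiB for the stats line.
    return rss_bytes / (1024 ** 2)

def iterate_only(path, report_every=5000):
    """Drive the FastWarc iterator without reading payloads or parsing HTML."""
    import os
    import time
    import psutil
    from fastwarc.warc import ArchiveIterator, WarcRecordType, is_http

    process = psutil.Process(os.getpid())
    with open(path, 'rb') as file:
        s = time.monotonic()
        it = ArchiveIterator(file, record_types=WarcRecordType.response,
                             parse_http=True, func_filter=is_http)
        for idx, record in enumerate(it):
            # record.reader.read() intentionally omitted
            if (idx + 1) % report_every == 0:
                print(f'In loop {mb(process.memory_info().rss):.3f}MB, '
                      f'{time.monotonic() - s:.3f}s')
                s = time.monotonic()
```

If this variant shows the same memory and timing profile as the read-but-don't-parse run, the iterator itself is the remaining suspect.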