
Comments (5)

phoerious commented on September 22, 2024

FYI: I haven't forgotten about this, but I probably won't have time for it before next week.

from chatnoir-resiliparse.

phoerious commented on September 22, 2024

I believe this comes down to two things:

  1. Slight inefficiencies in the buffer (re-)allocation in BufferedReader.read() for larger documents
  2. Measurement biases due to unequal document sizes

I've pushed some changes to address the first issue (see 070c5be; this should be released as version 0.14.7 soon).
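For illustration, the usual way to avoid such reallocation overhead is geometric buffer growth, which amortises copies over many reads. This is only a minimal sketch of that pattern; the class and its names are hypothetical and not the actual BufferedReader internals:

```python
class GrowBuffer:
    """Illustrative byte buffer that grows geometrically to amortise reallocations."""

    def __init__(self, initial_capacity: int = 4096):
        self._buf = bytearray(initial_capacity)
        self._len = 0

    def write(self, data: bytes):
        need = self._len + len(data)
        if need > len(self._buf):
            # Double the capacity instead of growing by one chunk at a time,
            # so the number of copies is O(log n) rather than O(n).
            new_cap = len(self._buf)
            while new_cap < need:
                new_cap *= 2
            new_buf = bytearray(new_cap)
            new_buf[:self._len] = self._buf[:self._len]
            self._buf = new_buf
        self._buf[self._len:need] = data
        self._len = need

    def getvalue(self) -> bytes:
        return bytes(self._buf[:self._len])
```

With this strategy, appending many small chunks touches the allocator only a handful of times, which matters for larger documents read in pieces.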

The second issue stems from the peculiarity that later documents in the CC WARC file happen to be larger for some reason. Here's the updated test case (with proper encoding detection):

import os
import psutil
import time

from fastwarc.warc import ArchiveIterator, is_http, WarcRecordType
from resiliparse.parse.html import HTMLTree
from resiliparse.parse.encoding import detect_encoding

file = open('CC-MAIN-20231005012006-20231005042006-00899.warc.gz', 'rb')

# Iterate only over HTTP response records.
archive_iterator = ArchiveIterator(
    file,
    record_types=WarcRecordType.response,
    parse_http=True,
    func_filter=is_http,
)
s = time.monotonic()
doc_size = 0
process = psutil.Process(os.getpid())
for idx, record in enumerate(archive_iterator):
    raw = record.reader.read()
    doc_size += len(raw)
    # Parse with detected encoding and serialise the tree back to a string.
    str(HTMLTree.parse_from_bytes(raw, detect_encoding(raw)))

    # Report memory usage and throughput every 5,000 records.
    if (idx + 1) % 5000 == 0:
        t = time.monotonic() - s
        print(f'- Mem: {process.memory_info().rss / (1024 ** 2):.3f}MB, Time: {t:.3f}s')
        print(f'  Avg doc size: {doc_size / 5000:,.0f} bytes ({doc_size / t:,.0f} bytes/s)')
        doc_size = 0
        s = time.monotonic()

and this is the output I get:

- Mem: 53.504MB, Time: 12.782s
  Avg doc size: 76,857 bytes (30,064,370 bytes/s)
- Mem: 62.117MB, Time: 17.384s
  Avg doc size: 132,829 bytes (38,204,150 bytes/s)
- Mem: 62.492MB, Time: 16.595s
  Avg doc size: 137,767 bytes (41,508,993 bytes/s)
- Mem: 62.867MB, Time: 16.600s
  Avg doc size: 136,393 bytes (41,083,262 bytes/s)
- Mem: 67.742MB, Time: 18.536s
  Avg doc size: 145,028 bytes (39,121,232 bytes/s)
- Mem: 71.680MB, Time: 21.965s
  Avg doc size: 179,860 bytes (40,942,038 bytes/s)
- Mem: 71.680MB, Time: 21.583s
  Avg doc size: 179,028 bytes (41,473,469 bytes/s)

The rate per actual processed byte seems quite constant to me.
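To double-check that claim, the per-byte throughput can be recomputed from the printed batch averages (5,000 documents per batch; the numbers below are taken from the first and last rows of the output above):

```python
# (avg doc size in bytes, batch time in seconds) for the first and last batch
rows = [(76_857, 12.782), (179_028, 21.583)]
rates = [size * 5000 / t for size, t in rows]
for r in rates:
    print(f'{r:,.0f} bytes/s')
```

Apart from the warm-up in the first batch, the rates stay around 40 MB/s, so the growing loop times track the growing document sizes rather than a slowdown in the parser.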


phoerious commented on September 22, 2024

Is it the WARC reader or the HTML parser?


prnake commented on September 22, 2024

It's caused by the HTML parser.


phoerious commented on September 22, 2024

I'm trying to get to the bottom of this, and I can reproduce the behaviour, but I suspect the cause is not in Resiliparse, at least not in the HTML parser.

Memory usage does grow and the loop times go up as well, but not strictly monotonically, and memory returns to normal after the loop. You can see this with a modified version of your test code:

import os
import psutil
import time

from fastwarc.warc import ArchiveIterator as FastWarcArchiveIterator
from fastwarc.warc import WarcRecordType
from fastwarc.warc import is_http
from resiliparse.parse.html import HTMLTree

for r in range(3):
    print(f'-------\nRound {r + 1}\n-------')
    file = open('CC-MAIN-20231005012006-20231005042006-00899.warc.gz', 'rb')

    # Iterate only over HTTP response records.
    archive_iterator = FastWarcArchiveIterator(
        file,
        record_types=WarcRecordType.response,
        parse_http=True,
        func_filter=is_http,
    )
    s = time.monotonic()
    process = psutil.Process(os.getpid())
    for idx, record in enumerate(archive_iterator):
        raw = record.reader.read()
        try:
            str(HTMLTree.parse(raw.decode('utf-8')))
        except UnicodeDecodeError:
            # Skip documents that are not valid UTF-8.
            pass

        # Report memory usage and elapsed time every 5,000 records.
        if (idx + 1) % 5000 == 0:
            print(f'In loop {round(process.memory_info().rss / (1024 ** 2), 3)}MB, {round(time.monotonic() - s, 3)}s')
            s = time.monotonic()

    print(f'After loop {round(process.memory_info().rss / (1024 ** 2), 3)}MB')

This prints stats only every 5k iterations and once after the loop. The whole thing is run three times. These are the results:

-------
Round 1
-------
In loop 38.895MB, 3.704s
In loop 51.941MB, 6.022s
In loop 48.191MB, 5.294s
In loop 51.715MB, 5.336s
In loop 51.039MB, 5.36s
In loop 52.727MB, 6.726s
In loop 59.426MB, 6.596s
After loop 59.426MB
-------
Round 2
-------
In loop 59.426MB, 3.462s
In loop 56.117MB, 6.315s
In loop 56.305MB, 5.681s
In loop 55.559MB, 5.651s
In loop 59.508MB, 5.465s
In loop 57.93MB, 6.182s
In loop 66.922MB, 6.151s
After loop 66.922MB
-------
Round 3
-------
In loop 66.922MB, 3.314s
In loop 66.922MB, 5.8s
In loop 57.469MB, 5.459s
In loop 57.453MB, 5.344s
In loop 61.523MB, 5.485s
In loop 58.355MB, 6.403s
In loop 65.293MB, 6.268s
After loop 65.293MB

When I comment out the HTML parsing part, I get similar results, just with smaller numbers:

-------
Round 1
-------
In loop 29.07MB, 1.395s
In loop 28.703MB, 2.032s
In loop 28.996MB, 2.069s
In loop 26.387MB, 2.067s
In loop 28.926MB, 2.194s
In loop 28.824MB, 2.687s
In loop 28.512MB, 2.722s
After loop 28.215MB
-------
Round 2
-------
In loop 27.094MB, 1.339s
In loop 27.316MB, 2.067s
In loop 28.434MB, 2.096s
In loop 26.07MB, 2.1s
In loop 28.52MB, 2.174s
In loop 28.613MB, 2.717s
In loop 28.441MB, 2.727s
After loop 28.734MB
-------
Round 3
-------
In loop 26.672MB, 1.383s
In loop 28.172MB, 2.092s
In loop 28.465MB, 2.11s
In loop 26.027MB, 2.087s
In loop 28.547MB, 2.259s
In loop 28.727MB, 2.786s
In loop 28.379MB, 2.74s
After loop 28.105MB

The problem could still be in the WARC iterator, but the behaviour is the same whether I call record.reader.read() or not.
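One way to narrow this down further would be to compare RSS against what tracemalloc reports: tracemalloc only tracks Python-heap allocations, so if RSS grows while its numbers stay flat, the growth lives in native allocations (e.g. inside the C++ parser or in the allocator's free lists). A minimal sketch, with dummy allocations standing in for the WARC loop:

```python
import tracemalloc

tracemalloc.start()

# Stand-in for the record loop: allocate and drop some byte buffers.
for _ in range(1000):
    raw = b'x' * 100_000
    del raw

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f'Python heap: current={current / 1024:.1f}KB, peak={peak / 1024:.1f}KB')

# If RSS (as reported by psutil) keeps growing while tracemalloc's current
# size stays flat, the retained memory is not held by Python objects.
```

Running the same comparison inside the actual record loop would show which side of the C/Python boundary the retained memory sits on.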

