
Comments (5)

phoerious commented on September 22, 2024

FYI: I haven't forgotten about this, but I probably won't have time for it before next week.

from chatnoir-resiliparse.

phoerious commented on September 22, 2024

I believe this comes down to two things:

  1. Slight inefficiencies in the buffer (re-)allocation in BufferedReader.read() for larger documents
  2. Measurement biases due to unequal document sizes

I've pushed some changes to address the first issue (see 070c5be; this should be released as version 0.14.7 soon).
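For illustration, the usual way to avoid such reallocation overhead is geometric buffer growth, which amortises copies over many reads. This is only a minimal sketch of that pattern; the class and its names are hypothetical and not the actual BufferedReader internals:

```python
class GrowBuffer:
    """Illustrative byte buffer that grows geometrically to amortise reallocations."""

    def __init__(self, initial_capacity: int = 4096):
        self._buf = bytearray(initial_capacity)
        self._len = 0

    def write(self, data: bytes):
        need = self._len + len(data)
        if need > len(self._buf):
            # Double the capacity instead of growing by one chunk at a time,
            # so the number of copies is O(log n) rather than O(n).
            new_cap = len(self._buf)
            while new_cap < need:
                new_cap *= 2
            new_buf = bytearray(new_cap)
            new_buf[:self._len] = self._buf[:self._len]
            self._buf = new_buf
        self._buf[self._len:need] = data
        self._len = need

    def getvalue(self) -> bytes:
        return bytes(self._buf[:self._len])
```

With this strategy, appending many small chunks touches the allocator only a handful of times, which matters for larger documents read in pieces.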

The second issue stems from the peculiarity that later documents in the CC WARC file happen to be larger for some reason. Here's the updated test case (with proper encoding detection):

import os
import psutil
import time

from fastwarc.warc import ArchiveIterator, is_http, WarcRecordType
from resiliparse.parse.html import HTMLTree
from resiliparse.parse.encoding import detect_encoding

file = open('CC-MAIN-20231005012006-20231005042006-00899.warc.gz', 'rb')

# Iterate only over HTTP response records.
archive_iterator = ArchiveIterator(
    file,
    record_types=WarcRecordType.response,
    parse_http=True,
    func_filter=is_http,
)
s = time.monotonic()
doc_size = 0
process = psutil.Process(os.getpid())
for idx, record in enumerate(archive_iterator):
    raw = record.reader.read()
    doc_size += len(raw)
    # Parse with detected encoding and serialise the tree back to a string.
    str(HTMLTree.parse_from_bytes(raw, detect_encoding(raw)))

    # Report memory usage and throughput every 5,000 records.
    if (idx + 1) % 5000 == 0:
        t = time.monotonic() - s
        print(f'- Mem: {process.memory_info().rss / (1024 ** 2):.3f}MB, Time: {t:.3f}s')
        print(f'  Avg doc size: {doc_size / 5000:,.0f} bytes ({doc_size / t:,.0f} bytes/s)')
        doc_size = 0
        s = time.monotonic()

and this is the output I get:

- Mem: 53.504MB, Time: 12.782s
  Avg doc size: 76,857 bytes (30,064,370 bytes/s)
- Mem: 62.117MB, Time: 17.384s
  Avg doc size: 132,829 bytes (38,204,150 bytes/s)
- Mem: 62.492MB, Time: 16.595s
  Avg doc size: 137,767 bytes (41,508,993 bytes/s)
- Mem: 62.867MB, Time: 16.600s
  Avg doc size: 136,393 bytes (41,083,262 bytes/s)
- Mem: 67.742MB, Time: 18.536s
  Avg doc size: 145,028 bytes (39,121,232 bytes/s)
- Mem: 71.680MB, Time: 21.965s
  Avg doc size: 179,860 bytes (40,942,038 bytes/s)
- Mem: 71.680MB, Time: 21.583s
  Avg doc size: 179,028 bytes (41,473,469 bytes/s)

The rate per actual processed byte seems quite constant to me.
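To double-check that claim, the per-byte throughput can be recomputed from the printed batch averages (5,000 documents per batch; the numbers below are taken from the first and last rows of the output above):

```python
# (avg doc size in bytes, batch time in seconds) for the first and last batch
rows = [(76_857, 12.782), (179_028, 21.583)]
rates = [size * 5000 / t for size, t in rows]
for r in rates:
    print(f'{r:,.0f} bytes/s')
```

Apart from the warm-up in the first batch, the rates stay around 40 MB/s, so the growing loop times track the growing document sizes rather than a slowdown in the parser.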


phoerious commented on September 22, 2024

Is it the WARC reader or the HTML parser?


prnake commented on September 22, 2024

It's caused by the HTML parser.


phoerious commented on September 22, 2024

I'm trying to get to the bottom of this, and I can reproduce the behaviour, but I suspect the cause is not in Resiliparse, at least not in the HTML parser.

Memory usage does grow and the loop times go up as well, but not strictly monotonically, and memory returns to normal after the loop. You can see this with a modified version of your test code:

import os
import psutil
import time

from fastwarc.warc import ArchiveIterator as FastWarcArchiveIterator
from fastwarc.warc import WarcRecordType
from fastwarc.warc import is_http
from resiliparse.parse.html import HTMLTree

for r in range(3):
    print(f'-------\nRound {r + 1}\n-------')
    file = open('CC-MAIN-20231005012006-20231005042006-00899.warc.gz', 'rb')

    # Iterate only over HTTP response records.
    archive_iterator = FastWarcArchiveIterator(
        file,
        record_types=WarcRecordType.response,
        parse_http=True,
        func_filter=is_http,
    )
    s = time.monotonic()
    process = psutil.Process(os.getpid())
    for idx, record in enumerate(archive_iterator):
        raw = record.reader.read()
        try:
            str(HTMLTree.parse(raw.decode('utf-8')))
        except UnicodeDecodeError:
            # Skip documents that are not valid UTF-8.
            pass

        # Report memory usage and elapsed time every 5,000 records.
        if (idx + 1) % 5000 == 0:
            print(f'In loop {round(process.memory_info().rss / (1024 ** 2), 3)}MB, {round(time.monotonic() - s, 3)}s')
            s = time.monotonic()

    print(f'After loop {round(process.memory_info().rss / (1024 ** 2), 3)}MB')

This prints stats only every 5k iterations and once after the loop. The whole thing is run three times. These are the results:

-------
Round 1
-------
In loop 38.895MB, 3.704s
In loop 51.941MB, 6.022s
In loop 48.191MB, 5.294s
In loop 51.715MB, 5.336s
In loop 51.039MB, 5.36s
In loop 52.727MB, 6.726s
In loop 59.426MB, 6.596s
After loop 59.426MB
-------
Round 2
-------
In loop 59.426MB, 3.462s
In loop 56.117MB, 6.315s
In loop 56.305MB, 5.681s
In loop 55.559MB, 5.651s
In loop 59.508MB, 5.465s
In loop 57.93MB, 6.182s
In loop 66.922MB, 6.151s
After loop 66.922MB
-------
Round 3
-------
In loop 66.922MB, 3.314s
In loop 66.922MB, 5.8s
In loop 57.469MB, 5.459s
In loop 57.453MB, 5.344s
In loop 61.523MB, 5.485s
In loop 58.355MB, 6.403s
In loop 65.293MB, 6.268s
After loop 65.293MB

When I comment out the HTML parsing part, I get similar results, just with smaller numbers:

-------
Round 1
-------
In loop 29.07MB, 1.395s
In loop 28.703MB, 2.032s
In loop 28.996MB, 2.069s
In loop 26.387MB, 2.067s
In loop 28.926MB, 2.194s
In loop 28.824MB, 2.687s
In loop 28.512MB, 2.722s
After loop 28.215MB
-------
Round 2
-------
In loop 27.094MB, 1.339s
In loop 27.316MB, 2.067s
In loop 28.434MB, 2.096s
In loop 26.07MB, 2.1s
In loop 28.52MB, 2.174s
In loop 28.613MB, 2.717s
In loop 28.441MB, 2.727s
After loop 28.734MB
-------
Round 3
-------
In loop 26.672MB, 1.383s
In loop 28.172MB, 2.092s
In loop 28.465MB, 2.11s
In loop 26.027MB, 2.087s
In loop 28.547MB, 2.259s
In loop 28.727MB, 2.786s
In loop 28.379MB, 2.74s
After loop 28.105MB

The problem could still be in the WARC iterator, but the behaviour is the same whether I call record.reader.read() or not.
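One way to narrow this down further would be to compare RSS against what tracemalloc reports: tracemalloc only tracks Python-heap allocations, so if RSS grows while its numbers stay flat, the growth lives in native allocations (e.g. inside the C++ parser or in the allocator's free lists). A minimal sketch, with dummy allocations standing in for the WARC loop:

```python
import tracemalloc

tracemalloc.start()

# Stand-in for the record loop: allocate and drop some byte buffers.
for _ in range(1000):
    raw = b'x' * 100_000
    del raw

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f'Python heap: current={current / 1024:.1f}KB, peak={peak / 1024:.1f}KB')

# If RSS (as reported by psutil) keeps growing while tracemalloc's current
# size stays flat, the retained memory is not held by Python objects.
```

Running the same comparison inside the actual record loop would show which side of the C/Python boundary the retained memory sits on.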

