Comments (28)
@lexborisov @phoerious I'm wondering if we are using a version of lexbor with threading turned on in resiliparse? If so, then running in multiprocesses might not be helpful. I found a mention of threads in lexbor/lexbor@51930b4 ?
The lexbor project does not use multithreading for HTML parsing or anything at all. At the same time, the project is completely thread-safe.
from chatnoir-resiliparse.
Your report motivated me to further optimise the extractor by minimising the time the GIL is held, so you can speed up execution even with Python threads. Now the GIL is required only for the initial call and a few constant-time setup operations. The extraction itself is done entirely without locking.
Before:
Using Python 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0].
Benchmarking 10,000 extractions...
----------------------------------
sequential: 2.824s (3,541.2 extractions/s)
threading, 24 workers: 3.853s (2,595.7 extractions/s)
threadpoolexecutor, 24 workers: 3.610s (2,770.1 extractions/s)
After:
Using Python 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0].
Benchmarking 10,000 extractions...
----------------------------------
sequential: 2.752s (3,633.8 extractions/s)
threading, 24 workers: 2.264s (4,417.9 extractions/s)
threadpoolexecutor, 24 workers: 1.936s (5,165.3 extractions/s)
A simple ThreadPoolExecutor is now about as fast as a ProcessPoolExecutor and about 1.4x faster than a sequential run. A proper multiprocessing Pool is of course still around 6.4x faster.
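The effect described above can be reproduced with any C extension that releases the GIL while it works. As a stdlib-only illustration (zlib, not Resiliparse; the buffer size is an assumption), zlib.compress drops the GIL while compressing larger buffers, so threads can actually overlap the work:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter

# a buffer large enough that zlib releases the GIL while compressing
data = b'x' * 2_000_000

def compress(_):
    return len(zlib.compress(data, 6))

start = perf_counter()
sizes_seq = [compress(i) for i in range(32)]
t_seq = perf_counter() - start

start = perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    sizes_par = list(pool.map(compress, range(32)))
t_par = perf_counter() - start

print(f'sequential: {t_seq:.3f}s, threaded: {t_par:.3f}s')
```

On most multi-core machines the threaded run finishes noticeably faster; with a GIL-bound pure-Python function body it would not.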
Dask bags run on a multiprocessing scheduler, and there is about a 200µs overhead per task. Note that this was on a fairly small sample of 181 extractions. The Dask multiprocessing scheduler also serialises functions, and I wonder if the Cython code adds more overhead. It will be interesting to see how the results differ when I get closer to my more real-world samples of around 20k, and with a different DAG, e.g. 20 tasks running 1,000 HTML files vs. 200 tasks running 1 HTML file.
I'm incredibly impressed with the speed and accuracy of extract_plain_text(tree, main_content=True) for article extraction. For me it's definitely worth digging into how it's done and working on adding title, authors, and a published date.
I added a print(self.elapsed_time) right after the second-to-last line in your snippet above. When I time sequential runs, I get < 0.001 seconds. You can also add the print at the end of __call__, which makes hardly a difference.
I timed the parallel execution with cProfile and it spends 2256ms in <method 'acquire' of '_thread.lock' objects>, which hints at really inefficient parallelization. Second place goes to tokenize with 623ms.
I just replaced the call to bagged.compute() with a call to Pool.map:
from multiprocessing import Pool

with Pool() as p:
    start = time.time()
    bagged = p.map(self.parallel_extract, sequence)
    self.elapsed_time = time.time() - start
which brings the extraction time down from 2.9 seconds to 0.2 seconds.
Output:
┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Library ┃ Accuracy ┃ Precision ┃ Recall ┃ FScore ┃ Mean Similarity ┃ Items/sec ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ resiliparse │ 0.6525 │ 0.8529 │ 0.9852 │ 0.904 │ 0.8819 │ 948.5001 │
└─────────────┴──────────┴───────────┴────────┴────────┴─────────────────┴───────────┘
Still only a speedup of around 2x, but I guess that's I/O-bound from loading the HTML sources into Dask bags.
One more thing I just noticed: you should be using time.monotonic() or time.perf_counter() / time.process_time() (for higher resolution) instead of time.time() when measuring execution timings.
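The difference is easy to see in a quick sketch: time.time() follows the wall clock (and can jump if the system clock is adjusted), while time.perf_counter() is monotonic and high-resolution, and time.process_time() counts CPU time only:

```python
from time import perf_counter, process_time, sleep

start_wall = perf_counter()   # monotonic, high-resolution elapsed time
start_cpu = process_time()    # CPU time only; ignores sleeps and I/O waits

sleep(0.1)                    # simulate waiting on I/O

wall = perf_counter() - start_wall
cpu = process_time() - start_cpu
print(f'wall: {wall:.3f}s, cpu: {cpu:.3f}s')  # cpu stays near zero here
```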
Is part of it written in C++/Cython? That might explain it. I'm using Dask and Dask bags, which I think add some significant overhead, or possibly prevent parallelism in the Cython code.
Depends on how you run it in parallel. If you start separate Python processes, they should all be equally fast. If you run it in threads, you may experience slowdown due to GIL locking. The latter might be avoidable perhaps, but I'd need a proper test case.
BTW interesting benchmarking suite. I used the ScrapingHub suite for developing and evaluating Resiliparse's accuracy, but this one is new to me.
So far, it's a pretty basic rule-based and precision-oriented extractor, but there is work underway for future improvements. But it's great to see that it's still the fastest of the bunch.
Have you figured this out yet? I am also seeing a slowdown when running in parallel.
Have you figured this out yet? I am also seeing a slowdown when running in parallel.
Not yet. I'm taking a look over the next day or two, but I suspect that in my case, with Dask, it's the overhead of tasks and how non-Python code is serialised with bags. The first thing I'll do is try running on a larger dataset and doing more than one extraction in each task. If that doesn't work, I'll have to reassess, but using Cython parallel processing is another option. Or this could be a case where parallelisation isn't necessary.
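The "more than one extraction per task" idea can be sketched with the stdlib alone: handing each task a chunk of documents amortises the per-task scheduling and serialisation overhead. The names below are illustrative, and extract_one is a hypothetical stand-in for a real extraction:

```python
from multiprocessing import Pool

def extract_one(html):
    return len(html)  # hypothetical stand-in for a real extraction

def extract_chunk(chunk):
    # one task processes many documents, so scheduling and pickling
    # overhead is paid once per chunk instead of once per document
    return [extract_one(html) for html in chunk]

def chunked(seq, size):
    return [seq[i:i + size] for i in range(0, len(seq), size)]

if __name__ == '__main__':
    docs = ['<html>%s</html>' % ('x' * i) for i in range(1000)]
    with Pool() as p:
        chunk_results = p.map(extract_chunk, chunked(docs, 100))
    results = [r for chunk in chunk_results for r in chunk]
    print(len(results))
```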
Or it could be some blocking stuff done here in fastwarc/resiliparse itself? https://github.com/chatnoir-eu/chatnoir-resiliparse/tree/1f4b92d2a5f49f903e5f1b3ac087b5ac8db69aac/resiliparse_inc
I am having trouble reproducing this. I was able to reproduce the threading slowdown once after starting Resiliparse directly from my IDE, but that vanished after a clean rebuild of everything with debugging and line tracing turned off. When started from the command line, I could not reproduce it at all.
Here are my benchmarking results for a simple HTML file using different parallel joblib backends:
Using Python 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0].
Benchmarking 10,000 extractions...
----------------------------------
sequential: 3.070s (3,257.8 extractions/s)
multiprocessing, 24 workers: 0.274s (36,437.4 extractions/s)
loky, 24 workers: 0.498s (20,066.1 extractions/s)
threading, 24 workers: 3.227s (3,098.9 extractions/s)
The threading backend is about as fast as the sequential run, which is to be expected (it's a little slower due to the added overhead). The multiprocessing backend is 10-12x faster.
I get the same result when I pull a build directly from PyPI instead of using a local build.
Strange. I can reproduce the slowdown with your benchmarking tool:
┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Library ┃ Accuracy ┃ Precision ┃ Recall ┃ FScore ┃ Mean Similarity ┃ Items/sec ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ resiliparse │ 0.6525 │ 0.8529 │ 0.9852 │ 0.904 │ 0.8819 │ 62.3391 │
└─────────────┴──────────┴───────────┴────────┴────────┴─────────────────┴───────────┘
The results are (surprisingly) slightly better if I change the dask compute scheduler to threads:
┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Library ┃ Accuracy ┃ Precision ┃ Recall ┃ FScore ┃ Mean Similarity ┃ Items/sec ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ resiliparse │ 0.6525 │ 0.8529 │ 0.9852 │ 0.904 │ 0.8819 │ 220.0468 │
└─────────────┴──────────┴───────────┴────────┴────────┴─────────────────┴───────────┘
The default seems to be processes, which should be faster in theory. synchronous gives slightly worse results than threading:
┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Library ┃ Accuracy ┃ Precision ┃ Recall ┃ FScore ┃ Mean Similarity ┃ Items/sec ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ resiliparse │ 0.6525 │ 0.8529 │ 0.9852 │ 0.904 │ 0.8819 │ 180.6137 │
└─────────────┴──────────┴───────────┴────────┴────────┴─────────────────┴───────────┘
Compare this to running it without any scheduler at all:
┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Library ┃ Accuracy ┃ Precision ┃ Recall ┃ FScore ┃ Mean Similarity ┃ Items/sec ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ resiliparse │ 0.6525 │ 0.8529 │ 0.9852 │ 0.904 │ 0.8819 │ 493.7656 │
└─────────────┴──────────┴───────────┴────────┴────────┴─────────────────┴───────────┘
Dask must be introducing significant overhead somewhere.
I extended my own benchmarker with a custom multiprocessing Pool scheduler as well as a ThreadPoolExecutor and a ProcessPoolExecutor, in addition to what joblib has to offer. The multiprocessing Pool is comparable to joblib's multiprocessing scheduler, and the ThreadPoolExecutor is about as fast as joblib's threading backend. The ProcessPoolExecutor is significantly worse than a multiprocessing Pool, probably because it dispatches every item as a separate task:
Using Python 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0].
Benchmarking 10,000 extractions...
----------------------------------
sequential: 2.829s (3,534.2 extractions/s)
multiprocessing, 24 workers: 0.368s (27,205.6 extractions/s)
loky, 24 workers: 0.886s (11,288.3 extractions/s)
threading, 24 workers: 3.082s (3,244.8 extractions/s)
multiprocessingpool, 24 workers: 0.311s (32,114.7 extractions/s)
processpoolexecutor, 24 workers: 1.752s (5,706.5 extractions/s)
threadpoolexecutor, 24 workers: 2.790s (3,584.1 extractions/s)
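One hedged guess at the ProcessPoolExecutor gap: its map() defaults to chunksize=1, so every single item becomes a separate inter-process message, whereas multiprocessing.Pool.map picks a larger chunk size automatically. Passing an explicit chunksize may close much of the difference:

```python
from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

if __name__ == '__main__':
    with ProcessPoolExecutor() as ex:
        # chunksize=1 (the default) would send one task per item;
        # a larger chunksize batches items per worker round-trip
        results = list(ex.map(square, range(10_000), chunksize=500))
    print(results[:3])  # [0, 1, 4]
```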
I think the problem is not in Resiliparse. When I print self.elapsed_time in your extractor and replace the Resiliparse extractor implementation with return '', I get the following timings:
Sequential: 0.0071163177490234375
Parallel: 2.8729026317596436
Compare that to the timings when I add the real implementation back in (with main content extraction, which is the most expensive extraction):
Sequential: 0.3654904365539551
Parallel: 2.890957832336426
So Resiliparse adds 0.36 and 0.02 seconds, respectively. The rest is overhead of your benchmarking implementation. I suspect that reading and distributing the HTML inputs is really expensive, therefore bottlenecking the entire execution. Perhaps a simpler multiprocessing backend would be better.
Here are the timings of my benchmarks also with main content extraction turned on:
Using Python 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0].
Benchmarking 10,000 extractions...
----------------------------------
sequential: 4.542s (2,201.9 extractions/s)
multiprocessing, 24 workers: 0.430s (23,244.4 extractions/s)
loky, 24 workers: 0.953s (10,493.8 extractions/s)
threading, 24 workers: 3.603s (2,775.4 extractions/s)
multiprocessingpool, 24 workers: 0.424s (23,611.7 extractions/s)
processpoolexecutor, 24 workers: 1.878s (5,325.0 extractions/s)
threadpoolexecutor, 24 workers: 3.376s (2,962.1 extractions/s)
FWIW, here's my benchmark implementation. The sample HTML was taken from issue #24:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from multiprocessing import Pool
import os
import sys
from joblib import delayed, parallel_backend, Parallel
from time import monotonic
from tqdm import tqdm
from resiliparse.parse import detect_encoding
from resiliparse.parse.html import HTMLTree
from resiliparse.extract.html2text import extract_plain_text
html_byte = b'\n\n\n\n\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />\r\n<meta http-equiv="X-UA-Compatible" content="IE=9">\r\n<link rel="stylesheet" type="text/css" href="https://firgraf.oh.gov.hu/include/style.css" media="screen" />\r\n<title>Int\xc3\xa9zm\xc3\xa9nyi adatok</title>\r\n<!-- Global site tag (gtag.js) - Google Analytics -->\r\n<script async src="https://www.googletagmanager.com/gtag/js?id=UA-198540847-1"></script>\r\n<script>\r\n window.dataLayer = window.dataLayer || [];\r\n function gtag(){dataLayer.push(arguments);}\r\n gtag(\'js\', new Date());\r\n gtag(\'config\', \'UA-198540847-1\');\r\n</script>\r\n</head>\r\n<body>\r\n<table width="80%" cellpadding="0" cellspacing="0" align="center" style="border:3px solid;\r\nborder-radius:8px; border: 3px solid #0994dc; background-color:#FFFFFF">\r\n <tr>\r\n <td valign="top" rowspan="2" bgcolor=\'#FFFFFF\'></td>\r\n <td align=\'center\' height=\'70\' bgcolor=\'#FFFFFF\' style=\'font: bold small-caps 28px monospace;\'><img src=\'https://firgraf.oh.gov.hu/images/firgraf_logo.png\' width=\'1200\'></td>\r\n </tr>\r\n <tr>\r\n <td valign="top" align=\'center\' bgcolor="#FFFFFF">\r\n \r\n <table>\r\n\t<tr>\r\n\t <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/index.php">Kezd\xc5\x91lap</a></td>\r\n\t <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/kkk.php">K\xc3\xa9pz\xc3\xa9si \xc3\xa9s kimeneti k\xc3\xb6vetelm\xc3\xa9nyek</a></td>\r\n\t <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/int.php">Int\xc3\xa9zm\xc3\xa9nyi adatok</a></td>\r\n\t <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/torzs.php">T\xc3\xb6rzsadatok</a></td>\r\n\t <td class="menu"><a class="menu" 
href="https://firgraf.oh.gov.hu/prg/gyorslista.php">Gyorslist\xc3\xa1k</a></td>\r\n\t <td class="menu"><a class="menu" href="http://www.felvi.hu/hivataliugyek/">Vissza a felvi.hu-ra</a></td>\r\n\t</tr>\r\n </table>\r\n </td>\r\n </tr>\r\n <tr>\r\n <td bgcolor=\'#ffffff\'>\r\n \r\n </td>\r\n <td colspan="2" style="padding: 0.5em">\r\n <div align="center"><font size="4" color="#000000">Int\xc3\xa9zm\xc3\xa9nyi adatok</font></div><hr>\r\n <div align=\'left\' valign=\'top\'><form name=\'hataly\' method=\'get\' action=\'/prg/int.php?nyilvantartottszakid=36318\'><a href=\'/prg/int.php?hatalyvalt=hat\xc3\xa1lyoss\xc3\xa1g+bekapcsol\xc3\xa1sa&nyilvantartottszakid=36318\'>[A hat\xc3\xa1lyoss\xc3\xa1gi sz\xc5\xb1r\xc5\x91k bekapcsol\xc3\xa1sa.]</a></form>\n</div><form name=form1 method=post action=\'/prg/int.php?nyilvantartottszakid=36318\'><div align=\'left\' valign=\'top\'>\xe2\x96\xa0 <a href=\'kkk.php?graf=MSZKSMU\'>KKK teljes gr\xc3\xa1f</a> \xe2\x96\xa0 <a href=\'int.php?adatmod=nyilvszak&szervezetid=36\'>SZTE nyilv\xc3\xa1ntartott k\xc3\xa9pz\xc3\xa9sei</a><br>A gr\xc3\xa1fban a csom\xc3\xb3pontokra kattintva b\xc5\x91vebb inform\xc3\xa1ci\xc3\xb3 olvashat\xc3\xb3 az adott csom\xc3\xb3pontr\xc3\xb3l.<br>Gr\xc3\xa1fn\xc3\xa9zet: <select name=grafnezet>\n<option value="resz">csak a nyilv\xc3\xa1ntartott r\xc3\xa9szgr\xc3\xa1fot</option>\n<option value="mind">a teljes gr\xc3\xa1fban a nyilv\xc3\xa1ntartott r\xc3\xa9szgr\xc3\xa1fot</option>\n</select> mutatja.<br>A gr\xc3\xa1fban a ny\xc3\xadl kezdete \xc3\xa9s v\xc3\xa9ge k\xc3\xb6z\xc3\xb6tti minim\xc3\xa1lis t\xc3\xa1vols\xc3\xa1g: <select name=grafminlen>\n<option value="0">legkisebb</option>\n<option selected value="1">1 egys\xc3\xa9g</option>\n<option value="2">2 egys\xc3\xa9g</option>\n<option value="3">3 egys\xc3\xa9g</option>\n<option value="4">4 egys\xc3\xa9g</option>\n<option value="5">5 egys\xc3\xa9g</option>\n</select> (A nagyobb \xc3\xa9rt\xc3\xa9k szell\xc5\x91sebb\xc3\xa9 teszi az \xc3\xa1br\xc3\xa1t.)<br> 
<button type=\'submit\' style="background-color:#E5E5E5; color:#000000; font-size: 12px;" name=\'muv\' value=\'n\xc3\xa9zetet friss\xc3\xadt\'>n\xc3\xa9zetet friss\xc3\xadt</button> </div><br><table width=\'100%\' align=\'center\' border=\'0\'><tr><td width=\'50%\' align=\'left\' valign=\'top\'><a href=\'/prg/int.php?nyilvantartottszakid=36317\'>\xc2\xab el\xc5\x91z\xc5\x91: szoci\xc3\xa1lis munka (36317)</a></td><td width=\'50%\' align=\'right\'><a href=\'/prg/int.php?nyilvantartottszakid=6150\'>k\xc3\xb6vetkez\xc5\x91: szoci\xc3\xa1lpedag\xc3\xb3gia (6150) \xc2\xbb</a></td></tr></table>\n<br><div align=\'left\' valign=\'top\'><b><a href=\'torzsadat.php?tabla=szervezet&sid=70\'>(SZTE) Szegedi Tudom\xc3\xa1nyegyetem</a> - <a href=\'torzsadat.php?tabla=nyilvantartottszak&sid=21715\'>(MSZKSMU) szoci\xc3\xa1lis munka [36318]</a></b></div><br><div align=\'left\' valign=\'top\'><?xml version="1.0" encoding="UTF-8" standalone="no"?>\n<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"\n "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">\n<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n -->\n<!-- Title: MSZKSMU Pages: 1 -->\n<svg width="340pt" height="116pt"\n viewBox="0.00 0.00 340.00 116.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">\n<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 112)">\n<title>MSZKSMU</title>\n<polygon fill="#ffffff" stroke="transparent" points="-4,4 -4,-112 336,-112 336,4 -4,4"/>\n<g id="clust1" class="cluster">\n<title>cluster_vegzettseg</title>\n<polygon fill="none" stroke="#ffff00" points="231,-8 231,-62 324,-62 324,-8 231,-8"/>\n</g>\n<!-- START -->\n<g id="node1" class="node">\n<title>START</title>\n<ellipse fill="#d3d3d3" stroke="#d3d3d3" cx="27" cy="-63" rx="27" ry="18"/>\n<text text-anchor="middle" x="27" y="-60.8" font-family="Times,serif" font-size="9.00" fill="#000000">START</text>\n</g>\n<!-- MSZKSMU -->\n<g id="node2" class="node">\n<title>MSZKSMU</title>\n<g 
id="a_node2"><a xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=414" xlink:title="MSZKSMU\\nszoci\xc3\xa1lis munka">\n<polygon fill="#e0ffff" stroke="#e0ffff" points="164,-81 91,-81 91,-45 164,-45 164,-81"/>\n<text text-anchor="middle" x="127.5" y="-65.8" font-family="Times,serif" font-size="9.00" fill="#000000">MSZKSMU</text>\n<text text-anchor="middle" x="127.5" y="-55.8" font-family="Times,serif" font-size="9.00" fill="#000000">szoci\xc3\xa1lis munka</text>\n</a>\n</g>\n</g>\n<!-- START->MSZKSMU -->\n<g id="edge1" class="edge">\n<title>START->MSZKSMU</title>\n<path fill="none" stroke="#0000ff" stroke-width="2" d="M54.1967,-63C62.3906,-63 71.6286,-63 80.7147,-63"/>\n<polygon fill="#0000ff" stroke="#0000ff" stroke-width="2" points="80.8451,-66.5001 90.8451,-63 80.845,-59.5001 80.8451,-66.5001"/>\n</g>\n<!-- MSPCKSM -->\n<g id="node3" class="node">\n<title>MSPCKSM</title>\n<g id="a_node3"><a xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=5710" xlink:title="MSPCKSM\\nklinikai szoci\xc3\xa1lis munka">\n<polygon fill="#ffe4e1" stroke="#ffe4e1" points="328,-108 227,-108 227,-72 328,-72 328,-108"/>\n<text text-anchor="middle" x="277.5" y="-92.8" font-family="Times,serif" font-size="9.00" fill="#000000">MSPCKSM</text>\n<text text-anchor="middle" x="277.5" y="-82.8" font-family="Times,serif" font-size="9.00" fill="#000000">klinikai szoci\xc3\xa1lis munka</text>\n</a>\n</g>\n</g>\n<!-- MSZKSMU->MSPCKSM -->\n<g id="edge3" class="edge">\n<title>MSZKSMU->MSPCKSM</title>\n<path fill="none" stroke="#000000" d="M164.1941,-69.6049C179.9274,-72.4369 198.7348,-75.8223 216.4633,-79.0134"/>\n<polygon fill="#000000" stroke="#000000" points="216.2835,-82.5372 226.7454,-80.8642 217.5237,-75.6479 216.2835,-82.5372"/>\n</g>\n<!-- 1287 -->\n<g id="node4" class="node">\n<title>1287</title>\n<g id="a_node4"><a 
xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=vegzettseg&idmezo=vegzettsegid&id=1287" xlink:title="MMSAZMO\\nokleveles\\nszoci\xc3\xa1lis munk\xc3\xa1s">\n<polygon fill="#ffff00" stroke="#ffff00" points="316,-54 239,-54 239,-16 316,-16 316,-54"/>\n<text text-anchor="middle" x="277.5" y="-42.8" font-family="Times,serif" font-size="9.00" fill="#000000">MMSAZMO</text>\n<text text-anchor="middle" x="277.5" y="-32.8" font-family="Times,serif" font-size="9.00" fill="#000000">okleveles</text>\n<text text-anchor="middle" x="277.5" y="-22.8" font-family="Times,serif" font-size="9.00" fill="#000000">szoci\xc3\xa1lis munk\xc3\xa1s</text>\n</a>\n</g>\n</g>\n<!-- MSZKSMU->1287 -->\n<g id="edge2" class="edge">\n<title>MSZKSMU->1287</title>\n<path fill="none" stroke="#ff0000" d="M164.1941,-56.1504C183.6481,-52.519 207.8022,-48.0103 228.805,-44.0897"/>\n<polygon fill="#ff0000" stroke="#ff0000" points="229.6399,-47.4944 238.8279,-42.2188 228.3554,-40.6133 229.6399,-47.4944"/>\n<text text-anchor="middle" x="195.5" y="-54.6" font-family="Times,serif" font-size="8.00" fill="#ff0000">START</text>\n</g>\n</g>\n</svg>\n</div><br><br><div align=\'left\' valign=\'top\'><b>Nyilv\xc3\xa1ntartott szak:</div></b><table border=\'1\' cellpadding=\'2\' cellspacing=\'0\'><tr><td align=\'left\' valign=\'top\'><b>nyilv. szak ID</b></td><td align=\'left\' valign=\'top\'><b>k\xc3\xb3d</b></td><td align=\'left\' valign=\'top\'><b>n\xc3\xa9v</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g kezdete</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>meghird. kezdete</b></td><td align=\'left\' valign=\'top\'><b>meghird. 
v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>telephely</b></td><td align=\'left\' valign=\'top\'><b>nyelv</b></td><td align=\'left\' valign=\'top\'><b>munkarend</b></td></tr>\n<tr><td align=\'left\' valign=\'top\'><a href=\'torzsadat.php?tabla=nyilvantartottszak&sid=21715\'>36318</a></td><td align=\'left\' valign=\'top\'>MSZKSMU</td><td align=\'left\' valign=\'top\'>szoci\xc3\xa1lis munka</td><td align=\'left\' valign=\'top\'>2020-01-01</td><td align=\'left\' valign=\'top\'></td><td align=\'left\' valign=\'top\'>2020-01-01</td><td align=\'left\' valign=\'top\'></td><td align=\'left\' valign=\'top\'>Szeged</td><td align=\'left\' valign=\'top\'>magyar</td><td align=\'left\' valign=\'top\'>levelez\xc5\x91</td></tr>\n</table><div align=\'left\' valign=\'top\'><b>Nyilv\xc3\xa1ntartott k\xc3\xa9pz\xc3\xa9si elemek:</b></div><table border=\'1\' cellpadding=\'2\' cellspacing=\'0\'>\n<tr><td align=\'left\' valign=\'top\'><b>k\xc3\xb3d</b></td><td align=\'left\' valign=\'top\'><b>n\xc3\xa9v</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g kezdete</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>meghird. kezdete</b></td><td align=\'left\' valign=\'top\'><b>meghird. 
v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>t\xc3\xadpus</b></td><td align=\'left\' valign=\'top\'><b>minimum kredit</b></td><td align=\'left\' valign=\'top\'><b>maximum kredit</b></td></tr><tr><td align=\'left\' valig=\'top\'><a href=\'torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=5710\'>MSPCKSM</a></td><td align=\'left\' valig=\'top\'>klinikai szoci\xc3\xa1lis munka</td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>specializ\xc3\xa1ci\xc3\xb3</td><td align=\'left\' valig=\'top\'>35</td><td align=\'left\' valig=\'top\'>40</td></tr><tr><td align=\'left\' valig=\'top\'><a href=\'torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=414\'>MSZKSMU</a></td><td align=\'left\' valig=\'top\'>szoci\xc3\xa1lis munka</td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>szak</td><td align=\'left\' valig=\'top\'>120</td><td align=\'left\' valig=\'top\'>120</td></tr></table></form>\r\n </td>\r\n </tr>\r\n <tr>\r\n <td colspan="2" bgcolor=\'#0994dc\' width="100%">\r\n <table width="100%">\r\n\t<tr>\r\n\t <td align=\'left\'>\r\n\t <font size=\'1\' color=\'#ffffff\'>Az adatb\xc3\xa1zis 2022-09-24 hajnalban friss\xc3\xbclt.</font>\r\n\t </td>\r\n\t <td align="right">\r\n\t <font size=\'1\' color=\'#ffffff\'>K\xc3\xa9sz\xc3\xbclt az EKOP-1.A.1-08/C-2009-0009 "Az Oktat\xc3\xa1si Hivatal k\xc3\xb6zigazgat\xc3\xa1si szolg\xc3\xa1ltat\xc3\xa1sainak elektroniz\xc3\xa1l\xc3\xa1sa" projekt keret\xc3\xa9ben. © 2012.</font>\r\n\t </td>\r\n\t</tr>\r\n </td>\r\n </tr>\r\n</table>\r\n</body>\r\n</html>\r\n\n'
encoding = detect_encoding(html_byte)

ITERATIONS = 10000


def extract(*_):
    tree = HTMLTree.parse_from_bytes(html_byte, encoding)
    extract_plain_text(tree, main_content=True)


def run_bench(backend='sequential', workers=None):
    workers = workers or os.cpu_count()
    start = monotonic()
    if backend == 'sequential':
        for _ in tqdm(range(ITERATIONS), mininterval=0.3, leave=False):
            extract()
    elif backend == 'multiprocessingpool':
        with Pool(workers) as p:
            p.map(extract, tqdm(range(ITERATIONS), mininterval=0.3, leave=False))
    elif backend == 'processpoolexecutor':
        with ProcessPoolExecutor(workers) as p:
            p.map(extract, tqdm(range(ITERATIONS), mininterval=0.3, leave=False))
    elif backend == 'threadpoolexecutor':
        with ThreadPoolExecutor(workers) as p:
            p.map(extract, tqdm(range(ITERATIONS), mininterval=0.3, leave=False))
    else:
        with parallel_backend(backend, n_jobs=workers):
            Parallel()(delayed(extract)() for _ in tqdm(range(ITERATIONS), mininterval=0.3, leave=False))
    time_taken = monotonic() - start
    desc = backend
    if backend != 'sequential':
        desc = f'{backend}, {workers} workers'
    print(f'{desc}: {time_taken:.3f}s ({ITERATIONS / time_taken:,.1f} extractions/s)')


print(f'Using Python {sys.version}.')
print(f'Benchmarking {ITERATIONS:,} extractions...')
print('----------------------------------')

run_bench()
run_bench('multiprocessing')
run_bench('loky')
run_bench('threading')
run_bench('multiprocessingpool')
run_bench('processpoolexecutor')
run_bench('threadpoolexecutor')
I think the problem is not in Resiliparse. When I print self.elapsed_time in your extractor and replace the Resiliparse extractor implementation with return '', I get the following timings:
Sequential: 0.0071163177490234375
Parallel: 2.8729026317596436
Compare that to the timings when I add the real implementation back in (with main content extraction, which is the most expensive extraction):
Sequential: 0.3654904365539551
Parallel: 2.890957832336426
So Resiliparse adds 0.36 and 0.02 seconds, respectively. The rest is overhead of your benchmarking implementation. I suspect that reading and distributing the HTML inputs is really expensive, bottlenecking the entire execution. Perhaps a simpler multiprocessing backend would be better.
Unlikely: the slowdown grows when I pass more extractions, and only the actual extraction is timed. Also, the slowdown is not seen with other libraries like Goose, which do speed up. Resiliparse is the only one that shows a slowdown.
I did some further benchmarking with some minimal Cython examples; those do get a speedup when run the same way, so I can rule out Cython being an issue with Dask.
Yes, it adds up, but that could already be passing the html parameter to the extraction function or spawning the executor itself. As I said, replacing the whole implementation with return '' should see a significant speedup if the problem were Resiliparse, but there is barely a difference.
Just to be clear, this is my extractor implementation, which takes about 2.9 seconds on its own:
from wee_cli.extractors import BaseExtractor


class ResiliparseExtract(BaseExtractor):
    name = 'resiliparse'

    @staticmethod
    def extract(html):
        return ''
Adding the Resiliparse extractor back in makes almost no difference in execution time. I am pretty sure the other libraries are slowed down as well, but their sequential extraction times are long enough that you don't notice it that much. Hence you still get an overall speedup (although a lower one than might be expected). Resiliparse is probably the only library of the bunch whose sequential timing is faster than your overhead.
How are you timing that? If I add the extractor as above, it times it at 0.00015997886657714844s.
I suspect your timing includes the overhead of loading the HTML files into an iterable. If you look at the timings from my benchmark tool, only the extraction times are included:
def extract_in_parallel(self):
    """
    Runs the extracts in parallel using dask.
    This isn't the most efficient way with dask, but the sequence is
    loaded first to get the timings as close as possible.
    :return:
    """
    sequence = []
    for path in Path('datasets/scrappinghub_aeb/html').glob('*.html.gz'):
        with gzip.open(path, 'rt', encoding='utf8') as f:
            html = f.read()
        item_id = path.stem.split('.')[0]
        sequence.append({
            'item_id': item_id,
            'html': html
        })
    start = time.time()
    bagged = db.from_sequence(sequence) \
        .map(self.parallel_extract)
    bagged = bagged.compute()
    self.elapsed_time = time.time() - start
    self.extracts = {item['item_id']: {'articleBody': item['articleBody']} for item in bagged}
https://github.com/Nootka-io/wee-benchmarking-tool/blob/main/wee_cli/extractors/__init__.py
However, I added a Dask bag backend to the benchmarking tool, and it runs quicker than sequentially. The only difference is that the function called from map references a global variable, which might give skewed results across all the parallel backends, since the object doesn't need to be serialised.
Benchmarking 10,000 extractions...
----------------------------------
sequential: 5.175s (1,932.2 extractions/s)
multiprocessing, 16 workers: 0.642s (15,568.2 extractions/s)
loky, 16 workers: 1.145s (8,736.9 extractions/s)
threading, 16 workers: 2.914s (3,432.2 extractions/s)
multiprocessingpool, 16 workers: 0.551s (18,133.1 extractions/s)
processpoolexecutor, 16 workers: 1.797s (5,565.0 extractions/s)
threadpoolexecutor, 16 workers: 2.567s (3,895.2 extractions/s)
dask_bag, 16 workers: 1.718s (5,819.8 extractions/s)
NVM, I slipped up with my own tool and forgot to run it in parallel. Yes, that's likely the overhead of serialising the HTML object.
The profiler output is very sparse, since a lot of the work happens in native functions, but if you ask me, I would assume that the pickler is probably the bottleneck. You should definitely try to distribute the tasks to workers before you start running and timing them, similar to what Pool does. Spawning an individual process for each task and feeding it an individually pickled object is going to be really inefficient.
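The pickling cost is straightforward to measure in isolation; a rough stdlib sketch, where the document size is an assumption:

```python
import pickle
from time import perf_counter

html = '<html>' + 'x' * 500_000 + '</html>'  # assumed typical document size

start = perf_counter()
for _ in range(1000):
    blob = pickle.dumps(html)  # paid once per task with per-item dispatch
elapsed = perf_counter() - start

print(f'{len(blob):,} bytes per pickle, {elapsed:.3f}s for 1,000 pickles')
```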
Thanks, working through it has been a big help. I'm going to add a few other parallel processing options to my benchmarking tool, and likely move my real-world example away from Dask.
Try to avoid serialisation as much as possible and try to pre-load strings into memory to be less dependent on the disk and page cache.
I suppose this issue has resolved itself, so I'm closing it, but feel free to add any new discoveries.
Thank you for all this great discussion. The reduced GIL locking for threading is awesome. When will it be pushed to the official PyPI? Thank you for all your hard work.
It's already pushed. You can get 0.13.7 from PyPI.
Related Issues (11)
- Problem with LZ4F_cctx and LZ4G_dctx HOT 1
- Fastwarc: CLI may index gzipped WARC records with erroneous length 0 HOT 3
- yum install HOT 3
- Installing fastwarc via `pip install` fails if compilation is required or requested HOT 3
- pipx run resiliparse faild: ModuleNotFoundError: No module named 'joblib' HOT 5
- pipx run fastwarc check faild: binascii.Error: Non-base32 digit found HOT 9
- resiliparse crashes in colab HOT 8
- Trouble building in Python 3.11 HOT 6
- Fix HTTP status code parsing (reason phrase may contain spaces) HOT 2
- FastWARC: BufferedReader may hang up on truncated gzipped WARC file HOT 6