chatnoir-eu / chatnoir-resiliparse
A robust web archive analytics toolkit
Home Page: https://resiliparse.chatnoir.eu
License: Apache License 2.0
The ArchiveIterator, or rather the underlying stream_io.BufferedReader, hangs when reading a truncated gzipped WARC file (e.g. an incomplete download). The issue can be reproduced by reading clipped.warc.gz, see iipc/jwarc#17. The stack during the hang (instead of ftell, I've also observed stream_io.FileStream.read() on top of _refill_working_buf()):
#3 0x00007f98a34f8705 in __GI__IO_ftell (fp=0x19b3790) at ioftell.c:38
#4 0x00007f98a2764766 in __pyx_f_8fastwarc_9stream_io_10GZipStream__refill_working_buf (__pyx_v_self=0x7f98a19fad60, __pyx_v_size=16384)
at fastwarc/stream_io.cpp:4944
#5 0x00007f98a276d500 in __pyx_f_8fastwarc_9stream_io_10GZipStream_read (__pyx_v_self=0x7f98a19fad60, __pyx_v_out="", __pyx_v_size=16384)
at fastwarc/stream_io.cpp:5191
#6 0x00007f98a27645bc in __pyx_f_8fastwarc_9stream_io_14BufferedReader__fill_buf (__pyx_v_self=0x7f98a19fb9a0) at fastwarc/stream_io.cpp:9201
#7 0x00007f98a276ce6b in __pyx_f_8fastwarc_9stream_io_14BufferedReader_read (__pyx_v_self=0x7f98a19fb9a0, __pyx_skip_dispatch=<optimized out>,
__pyx_optional_args=<optimized out>) at fastwarc/stream_io.cpp:9684
#8 0x00007f98a2765d75 in __pyx_pf_8fastwarc_9stream_io_14BufferedReader_4read (__pyx_v_size=16384, __pyx_v_self=0x7f98a19fb9a0)
at fastwarc/stream_io.cpp:9840
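As a workaround while the hang exists, a truncated download can be detected before it is handed to the reader. The following is a minimal sketch using only the Python standard library; the helper name is_complete_gzip_member is hypothetical and not part of FastWARC. It checks only the first gzip member, whereas compressed WARC files usually contain one member per record.

```python
import gzip
import zlib


def is_complete_gzip_member(data: bytes) -> bool:
    """Return True if `data` begins with one complete gzip member."""
    # wbits=31 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=31)
    decompressor.decompress(data)
    # `eof` is only set once the end of the compressed stream was reached.
    return decompressor.eof


# Build a small gzip member and a clipped copy of it for demonstration.
complete = gzip.compress(b"WARC/1.0\r\n" * 100)
truncated = complete[: len(complete) // 2]

print(is_complete_gzip_member(complete))
print(is_complete_gzip_member(truncated))
```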
AttributeError: module 'setuptools.config.pyprojecttoml' has no attribute '_BetaConfiguration'. Did you mean: 'read_configuration'?
The package fails to build in Docker on Apple Silicon.
It builds fine on Linux, and also natively on macOS outside of Docker.
pip install fastwarc==0.14.5
Collecting fastwarc==0.14.5
Using cached FastWARC-0.14.5.tar.gz (42 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [18 lines of output]
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/opt/conda/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-build-env-jytdl1uv/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-build-env-jytdl1uv/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-jytdl1uv/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 311, in run_setup
exec(code, locals())
File "<string>", line 34, in <module>
AttributeError: module 'setuptools.config.pyprojecttoml' has no attribute '_BetaConfiguration'. Did you mean: 'read_configuration'?
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
$ pipx run --verbose fastwarc check /tmp/warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
pipx >(setup:729): pipx version is 1.0.0
pipx >(setup:730): Default python interpreter is '/home/user/.local/pipx/venvs/pipx/bin/python'
pipx >(needs_upgrade:69): Time since last upgrade of shared libs, in seconds: 1561898. Upgrade will be run by pipx if greater than 2592000.
pipx >(run_subprocess:172): running /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/python -c import sysconfig; print(sysconfig.get_path('purelib'))
pipx >(run:103): Reusing cached venv /home/user/.local/pipx/.cache/7a73b1e86637c39
pipx >(run_subprocess:172): running /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/python -c import sysconfig; print(sysconfig.get_path('purelib'))
pipx >(exec_app:387): exec_app: /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc check /tmp/warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
0 records were verified successfully.
1 records were skipped without digest.
Error in sys.excepthook:
Traceback (most recent call last):
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc", line 8, in <module>
sys.exit(main())
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/fastwarc/cli.py", line 138, in check
for v in pbar:
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "fastwarc/tools.pyx", line 178, in verify_digests
File "fastwarc/warc.pyx", line 922, in fastwarc.warc.WarcRecord.verify_block_digest
File "fastwarc/warc.pyx", line 934, in fastwarc.warc.WarcRecord.verify_block_digest
File "fastwarc/warc.pyx", line 872, in fastwarc.warc.WarcRecord._verify_digest
File "/usr/lib/python3.9/base64.py", line 231, in b32decode
raise binascii.Error('Non-base32 digit found') from None
binascii.Error: Non-base32 digit found
Original exception was:
Traceback (most recent call last):
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc", line 8, in <module>
sys.exit(main())
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/fastwarc/cli.py", line 138, in check
for v in pbar:
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "fastwarc/tools.pyx", line 178, in verify_digests
File "fastwarc/warc.pyx", line 922, in fastwarc.warc.WarcRecord.verify_block_digest
File "fastwarc/warc.pyx", line 934, in fastwarc.warc.WarcRecord.verify_block_digest
File "fastwarc/warc.pyx", line 872, in fastwarc.warc.WarcRecord._verify_digest
File "/usr/lib/python3.9/base64.py", line 231, in b32decode
raise binascii.Error('Non-base32 digit found') from None
binascii.Error: Non-base32 digit found
$
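The traceback ends in base64.b32decode, which rejects any character outside the Base32 alphabet (A-Z and 2-7). The log does not show what the digest header in this file actually contained, so the following is only a sketch under the assumption that the value was hex-encoded. The helper decode_digest is hypothetical and not part of FastWARC.

```python
import base64
import binascii
import hashlib


def decode_digest(encoded: str) -> bytes:
    """Decode a digest value that is either Base32 or hex encoded."""
    try:
        return base64.b32decode(encoded)
    except binascii.Error:
        # Not valid Base32: assume hex here (raises ValueError if it is not).
        return bytes.fromhex(encoded)


sha1 = hashlib.sha1(b"hello").digest()
b32_value = base64.b32encode(sha1).decode("ascii")
hex_value = sha1.hex()

print(decode_digest(b32_value) == sha1)
print(decode_digest(hex_value) == sha1)
```

An uppercase hex string that happens to use only the characters 2-7 and A-F would be misread as Base32, so a real implementation would need a stricter rule, for example based on the expected digest length.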
I'm having this error both when trying to install from pip and from this repo:
fastwarc/warc.cpp:1189:3: error: ‘LZ4F_cctx’ does not name a type; did you mean ‘LZ4F_cctx_s’?
LZ4F_cctx *cctx;
^~~~~~~~~
LZ4F_cctx_s
fastwarc/warc.cpp:1190:3: error: ‘LZ4F_dctx’ does not name a type; did you mean ‘LZ4F_dctx_s’?
LZ4F_dctx *dctx;
^~~~~~~~~
LZ4F_dctx_s
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
Does this package support Python 3.7? I am using a distributed cluster that only supports Python 3.7.
Hello, Thank you for the wonderful project!
I have a question about DOM manipulation and DOMNode. The documentation warns against using DOMNode instances after DOM tree manipulation:
Warning
A DOMNode object is valid only for as long as its parent tree has not been modified or deallocated. Thus, DO NOT use existing instances after any sort of DOM tree manipulation! Doing so may result in Python crashes or (worse) security vulnerabilities due to dangling pointers (use after free). This is a known Lexbor limitation for which there is no workaround at the moment.
I am currently working on an HTML extractor that involves many DOM manipulations and DOMNode accesses, for example:
sibling = next_sibling.next
p.append_child(next_sibling)
next_sibling = sibling
I think that having to look up each DOMNode again after every DOM manipulation would make some kinds of work hard. Is there a concrete example of manipulations or accesses that are safe, or of specific cases where access after manipulation causes an error or segfault? Thank you!
pip3 install --no-binary fastwarc fastwarc
pip3 install fastwarc (no binaries provided for ARM CPUs)
The error message indicates that fastwarc is now too interconnected with resiliparse:
ERROR: Command errored out with exit status 1:
...
from resiliparse_common.string_util cimport str_to_lower, strip_str, strip_c_str
^
------------------------------------------------------------
fastwarc/warc.pyx:32:0: 'resiliparse_common/string_util.pxd' not found
Building from a checkout of chatnoir-resiliparse via pip3 wheel -e fastwarc also succeeds on ARM-based systems.
Hello,
I'm trying to build Resiliparse 0.13.7 from source, and I'm getting this error. Can you tell me which library Resiliparse is expecting to get html.h from? I suspect I'm missing a dependency.
resiliparse/extract/html2text.cpp:869:10: fatal error: html.h: No such file or directory
#include "html.h"
^~~~~~~~
Thanks,
Dave
Thanks for developing fastwarc! It's been a great tool while I've been exploring text extraction from Common Crawl with Python.
I'm interested in using it as part of a much larger pipeline, but I want to enable chunked processing and I'm curious whether this is possible. My current understanding is that the ArchiveIterator gives me a nice handle to process all WarcRecords sequentially. What I'd like to be able to do is something like:
archv = ArchiveChunked(open(...), ...)
recs = archv[N:N+10] # select 10 records starting at N
Doing this should allow me to leverage batch processing functionality and distribute the processing across multiple cores.
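Until something like the ArchiveChunked idea exists, sequential iteration can be grouped into fixed-size batches with the standard library. This is a minimal sketch: iter_batches is a hypothetical helper, and range(7) stands in for any iterable of records. Whether a record object stays usable after the iterator has advanced should be checked in the FastWARC documentation; extracting plain data from each record before batching avoids the question.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # source exhausted
        yield batch


# Stand-in data; in a real pipeline this would be the record iterator.
for batch in iter_batches(range(7), 3):
    print(batch)
```

Each batch can then be handed to a worker pool. Note that this gives sequential chunks only, not the random access that the slicing syntax above would imply.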
$ pip install --no-binary resiliparse resiliparse
DEPRECATION: --no-binary currently disables reading from the cache of locally built wheels. In the future --no-binary will not influence the wheel cache. pip 23.1 will enforce this behaviour change. A possible replacement is to use the --no-cache-dir option. You can use the flag --use-feature=no-binary-enable-wheel-cache to test the upcoming behaviour. Discussion can be found at https://github.com/pypa/pip/issues/11453
Collecting resiliparse
Using cached Resiliparse-0.13.7.tar.gz (601 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Installing backend dependencies ... done
Preparing metadata (pyproject.toml) ... done
Collecting fastwarc==0.13.7
Using cached FastWARC-0.13.7-cp311-cp311-linux_x86_64.whl
Collecting brotli
Using cached Brotli-1.0.9-cp311-cp311-linux_x86_64.whl
Requirement already satisfied: click in ./venv311/lib/python3.11/site-packages (from fastwarc==0.13.7->resiliparse) (8.0.4)
Requirement already satisfied: tqdm in ./venv311/lib/python3.11/site-packages (from fastwarc==0.13.7->resiliparse) (4.64.1)
Building wheels for collected packages: resiliparse
Building wheel for resiliparse (pyproject.toml) ... error
error: subprocess-exited-with-error
× Building wheel for resiliparse (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [50 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-cpython-311
creating build/lib.linux-x86_64-cpython-311/resiliparse
copying resiliparse/cli.py -> build/lib.linux-x86_64-cpython-311/resiliparse
copying resiliparse/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse
creating build/lib.linux-x86_64-cpython-311/resiliparse/beam
copying resiliparse/beam/coders.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
copying resiliparse/beam/textio.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
copying resiliparse/beam/warcio.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
copying resiliparse/beam/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
copying resiliparse/beam/elasticsearch.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
copying resiliparse/beam/fileio.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
creating build/lib.linux-x86_64-cpython-311/resiliparse/extract
copying resiliparse/extract/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse/extract
creating build/lib.linux-x86_64-cpython-311/resiliparse/parse
copying resiliparse/parse/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
copying resiliparse/__init__.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse
copying resiliparse/itertools.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse
copying resiliparse/process_guard.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse
copying resiliparse/extract/__init__.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/extract
copying resiliparse/extract/html2text.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/extract
copying resiliparse/parse/__init__.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
copying resiliparse/parse/html.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
copying resiliparse/parse/http.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
copying resiliparse/parse/lang.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
copying resiliparse/parse/encoding.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
copying resiliparse/parse/lang_profiles.h -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
copying resiliparse/parse/encoding.h -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
copying resiliparse/parse/html.h -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
running build_ext
building 'resiliparse.itertools' extension
creating build/temp.linux-x86_64-cpython-311
creating build/temp.linux-x86_64-cpython-311/resiliparse
x86_64-linux-gnu-gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/mnt/Data/Projects/nootka_io/sentry_dragon/venv311/include -I/usr/include/python3.11 -c resiliparse/itertools.cpp -o build/temp.linux-x86_64-cpython-311/resiliparse/itertools.o -std=c++17 -O3 -Wall -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function
x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -g -fwrapv -O2 build/temp.linux-x86_64-cpython-311/resiliparse/itertools.o -L/usr/lib/x86_64-linux-gnu -o build/lib.linux-x86_64-cpython-311/resiliparse/itertools.cpython-311-x86_64-linux-gnu.so -std=c++17
building 'resiliparse.extract.html2text' extension
creating build/temp.linux-x86_64-cpython-311/resiliparse/extract
x86_64-linux-gnu-gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I./resiliparse/parse -I/mnt/Data/Projects/nootka_io/sentry_dragon/venv311/include -I/usr/include/python3.11 -c resiliparse/extract/html2text.cpp -o build/temp.linux-x86_64-cpython-311/resiliparse/extract/html2text.o -std=c++17 -O3 -Wall -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function
In file included from /usr/include/lexbor/css/css.h:14,
from resiliparse/extract/html2text.cpp:864:
/usr/include/lexbor/css/stylesheet.h: In function ‘lxb_css_stylesheet_t* lxb_css_stylesheet_create(lexbor_mraw_t*)’:
/usr/include/lexbor/css/stylesheet.h:33:30: error: invalid conversion from ‘void*’ to ‘lxb_css_stylesheet_t*’ {aka ‘lxb_css_stylesheet*’} [-fpermissive]
33 | return lexbor_mraw_calloc(mraw, sizeof(lxb_css_stylesheet_t));
| ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| void*
error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for resiliparse
Failed to build resiliparse
ERROR: Could not build wheels for resiliparse, which is required to install pyproject.toml-based projects
The field WarcRecord.http_headers could include the HTTP status code, or the status code could be provided as an extra attribute of WarcRecord.
When reading a record, it is not easy to see which status code a response had. For example, as far as I can see, I cannot filter for 301 redirects only (or keep only 200 responses for further processing). The other HTTP headers are parsed, but the HTTP status line is not, although it has a simple format (e.g. HTTP/1.X XXX Description) that could be integrated into the existing HTTP header parsing. I also found no simple way, comparable to .reader, to access the raw HTTP communication.
Example:
>>> record.headers
{'WARC-Type': 'response', 'WARC-Target-URI': 'http://vgperson.com/robots.txt', 'WARC-Date': '2021-08-09T13:25:55Z', 'WARC-Payload-Digest': 'sha1:OLD2B4B3YRYMAUJAQNATRPULWOOXO3YP', 'WARC-IP-Address': '85.214.122.46', 'WARC-Record-ID': '<urn:uuid:5eeb5cea-38e6-4904-a4f9-077a162bc0d6>', 'Content-Type': 'application/http; msgtype=response', 'Content-Length': '454'}
>>> record.http_headers
{'Date': 'Mon, 09 Aug 2021 13:25:53 GMT', 'Server': 'Apache', 'Location': 'https://vgperson.com/robots.txt', 'Content-Length': '239', 'Connection': 'close', 'Content-Type': 'text/html; charset=iso-8859-1'}
>>> content = record.reader.read()
>>> assert len(content) == record.content_length # content only includes the real content, no access to HTTP stuff
>>> print(content)
b'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>301 Moved Permanently</title>\n</head><body>\n<h1>Moved Permanently</h1>\n<p>The document has moved <a href="https://vgperson.com/robots.txt">here</a>.</p>\n</body></html>\n'
HTTP communication:
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://vgperson.com/robots.txt
WARC-Date: 2021-08-09T13:25:55Z
WARC-Payload-Digest: sha1:OLD2B4B3YRYMAUJAQNATRPULWOOXO3YP
WARC-IP-Address: 85.214.122.46
WARC-Record-ID: <urn:uuid:5eeb5cea-38e6-4904-a4f9-077a162bc0d6>
Content-Type: application/http; msgtype=response
Content-Length: 454
HTTP/1.1 301 Moved Permanently
Date: Mon, 09 Aug 2021 13:25:53 GMT
Server: Apache
Location: https://vgperson.com/robots.txt
Content-Length: 239
Connection: close
Content-Type: text/html; charset=iso-8859-1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://vgperson.com/robots.txt">here</a>.</p>
</body></html>
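To illustrate how simple the status line format is, here is a minimal sketch of a parser for the line shown in the dump above. StatusLine and parse_status_line are hypothetical names, not FastWARC API, and the sketch does not show how the raw line would be obtained from a record.

```python
from typing import NamedTuple


class StatusLine(NamedTuple):
    version: str
    status_code: int
    reason: str


def parse_status_line(line: bytes) -> StatusLine:
    """Parse a line such as b'HTTP/1.1 301 Moved Permanently'."""
    text = line.decode("iso-8859-1").rstrip("\r\n")
    parts = text.split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/"):
        raise ValueError(f"not an HTTP status line: {text!r}")
    # The reason phrase is optional and may itself contain spaces.
    reason = parts[2] if len(parts) == 3 else ""
    return StatusLine(parts[0], int(parts[1]), reason)


status = parse_status_line(b"HTTP/1.1 301 Moved Permanently\r\n")
print(status.status_code)
```

With a status code available, filtering redirects becomes a plain comparison such as status.status_code == 301.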
Trying this piece of html... Is there something I can do to upgrade the underlying parser? I recall reading this...
from resiliparse.parse import detect_encoding
from resiliparse.parse.html import HTMLTree
from resiliparse.extract.html2text import extract_plain_text
html_byte = b'\n\n\n\n\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />\r\n<meta http-equiv="X-UA-Compatible" content="IE=9">\r\n<link rel="stylesheet" type="text/css" href="https://firgraf.oh.gov.hu/include/style.css" media="screen" />\r\n<title>Int\xc3\xa9zm\xc3\xa9nyi adatok</title>\r\n<!-- Global site tag (gtag.js) - Google Analytics -->\r\n<script async src="https://www.googletagmanager.com/gtag/js?id=UA-198540847-1"></script>\r\n<script>\r\n window.dataLayer = window.dataLayer || [];\r\n function gtag(){dataLayer.push(arguments);}\r\n gtag(\'js\', new Date());\r\n gtag(\'config\', \'UA-198540847-1\');\r\n</script>\r\n</head>\r\n<body>\r\n<table width="80%" cellpadding="0" cellspacing="0" align="center" style="border:3px solid;\r\nborder-radius:8px; border: 3px solid #0994dc; background-color:#FFFFFF">\r\n <tr>\r\n <td valign="top" rowspan="2" bgcolor=\'#FFFFFF\'></td>\r\n <td align=\'center\' height=\'70\' bgcolor=\'#FFFFFF\' style=\'font: bold small-caps 28px monospace;\'><img src=\'https://firgraf.oh.gov.hu/images/firgraf_logo.png\' width=\'1200\'></td>\r\n </tr>\r\n <tr>\r\n <td valign="top" align=\'center\' bgcolor="#FFFFFF">\r\n \r\n <table>\r\n\t<tr>\r\n\t <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/index.php">Kezd\xc5\x91lap</a></td>\r\n\t <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/kkk.php">K\xc3\xa9pz\xc3\xa9si \xc3\xa9s kimeneti k\xc3\xb6vetelm\xc3\xa9nyek</a></td>\r\n\t <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/int.php">Int\xc3\xa9zm\xc3\xa9nyi adatok</a></td>\r\n\t <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/torzs.php">T\xc3\xb6rzsadatok</a></td>\r\n\t <td class="menu"><a class="menu" 
href="https://firgraf.oh.gov.hu/prg/gyorslista.php">Gyorslist\xc3\xa1k</a></td>\r\n\t <td class="menu"><a class="menu" href="http://www.felvi.hu/hivataliugyek/">Vissza a felvi.hu-ra</a></td>\r\n\t</tr>\r\n </table>\r\n </td>\r\n </tr>\r\n <tr>\r\n <td bgcolor=\'#ffffff\'>\r\n \r\n </td>\r\n <td colspan="2" style="padding: 0.5em">\r\n <div align="center"><font size="4" color="#000000">Int\xc3\xa9zm\xc3\xa9nyi adatok</font></div><hr>\r\n <div align=\'left\' valign=\'top\'><form name=\'hataly\' method=\'get\' action=\'/prg/int.php?nyilvantartottszakid=36318\'><a href=\'/prg/int.php?hatalyvalt=hat\xc3\xa1lyoss\xc3\xa1g+bekapcsol\xc3\xa1sa&nyilvantartottszakid=36318\'>[A hat\xc3\xa1lyoss\xc3\xa1gi sz\xc5\xb1r\xc5\x91k bekapcsol\xc3\xa1sa.]</a></form>\n</div><form name=form1 method=post action=\'/prg/int.php?nyilvantartottszakid=36318\'><div align=\'left\' valign=\'top\'>\xe2\x96\xa0 <a href=\'kkk.php?graf=MSZKSMU\'>KKK teljes gr\xc3\xa1f</a> \xe2\x96\xa0 <a href=\'int.php?adatmod=nyilvszak&szervezetid=36\'>SZTE nyilv\xc3\xa1ntartott k\xc3\xa9pz\xc3\xa9sei</a><br>A gr\xc3\xa1fban a csom\xc3\xb3pontokra kattintva b\xc5\x91vebb inform\xc3\xa1ci\xc3\xb3 olvashat\xc3\xb3 az adott csom\xc3\xb3pontr\xc3\xb3l.<br>Gr\xc3\xa1fn\xc3\xa9zet: <select name=grafnezet>\n<option value="resz">csak a nyilv\xc3\xa1ntartott r\xc3\xa9szgr\xc3\xa1fot</option>\n<option value="mind">a teljes gr\xc3\xa1fban a nyilv\xc3\xa1ntartott r\xc3\xa9szgr\xc3\xa1fot</option>\n</select> mutatja.<br>A gr\xc3\xa1fban a ny\xc3\xadl kezdete \xc3\xa9s v\xc3\xa9ge k\xc3\xb6z\xc3\xb6tti minim\xc3\xa1lis t\xc3\xa1vols\xc3\xa1g: <select name=grafminlen>\n<option value="0">legkisebb</option>\n<option selected value="1">1 egys\xc3\xa9g</option>\n<option value="2">2 egys\xc3\xa9g</option>\n<option value="3">3 egys\xc3\xa9g</option>\n<option value="4">4 egys\xc3\xa9g</option>\n<option value="5">5 egys\xc3\xa9g</option>\n</select> (A nagyobb \xc3\xa9rt\xc3\xa9k szell\xc5\x91sebb\xc3\xa9 teszi az \xc3\xa1br\xc3\xa1t.)<br> 
<button type=\'submit\' style="background-color:#E5E5E5; color:#000000; font-size: 12px;" name=\'muv\' value=\'n\xc3\xa9zetet friss\xc3\xadt\'>n\xc3\xa9zetet friss\xc3\xadt</button> </div><br><table width=\'100%\' align=\'center\' border=\'0\'><tr><td width=\'50%\' align=\'left\' valign=\'top\'><a href=\'/prg/int.php?nyilvantartottszakid=36317\'>\xc2\xab el\xc5\x91z\xc5\x91: szoci\xc3\xa1lis munka (36317)</a></td><td width=\'50%\' align=\'right\'><a href=\'/prg/int.php?nyilvantartottszakid=6150\'>k\xc3\xb6vetkez\xc5\x91: szoci\xc3\xa1lpedag\xc3\xb3gia (6150) \xc2\xbb</a></td></tr></table>\n<br><div align=\'left\' valign=\'top\'><b><a href=\'torzsadat.php?tabla=szervezet&sid=70\'>(SZTE) Szegedi Tudom\xc3\xa1nyegyetem</a> - <a href=\'torzsadat.php?tabla=nyilvantartottszak&sid=21715\'>(MSZKSMU) szoci\xc3\xa1lis munka [36318]</a></b></div><br><div align=\'left\' valign=\'top\'><?xml version="1.0" encoding="UTF-8" standalone="no"?>\n<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"\n "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">\n<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n -->\n<!-- Title: MSZKSMU Pages: 1 -->\n<svg width="340pt" height="116pt"\n viewBox="0.00 0.00 340.00 116.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">\n<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 112)">\n<title>MSZKSMU</title>\n<polygon fill="#ffffff" stroke="transparent" points="-4,4 -4,-112 336,-112 336,4 -4,4"/>\n<g id="clust1" class="cluster">\n<title>cluster_vegzettseg</title>\n<polygon fill="none" stroke="#ffff00" points="231,-8 231,-62 324,-62 324,-8 231,-8"/>\n</g>\n<!-- START -->\n<g id="node1" class="node">\n<title>START</title>\n<ellipse fill="#d3d3d3" stroke="#d3d3d3" cx="27" cy="-63" rx="27" ry="18"/>\n<text text-anchor="middle" x="27" y="-60.8" font-family="Times,serif" font-size="9.00" fill="#000000">START</text>\n</g>\n<!-- MSZKSMU -->\n<g id="node2" class="node">\n<title>MSZKSMU</title>\n<g 
id="a_node2"><a xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=414" xlink:title="MSZKSMU\\nszoci\xc3\xa1lis munka">\n<polygon fill="#e0ffff" stroke="#e0ffff" points="164,-81 91,-81 91,-45 164,-45 164,-81"/>\n<text text-anchor="middle" x="127.5" y="-65.8" font-family="Times,serif" font-size="9.00" fill="#000000">MSZKSMU</text>\n<text text-anchor="middle" x="127.5" y="-55.8" font-family="Times,serif" font-size="9.00" fill="#000000">szoci\xc3\xa1lis munka</text>\n</a>\n</g>\n</g>\n<!-- START->MSZKSMU -->\n<g id="edge1" class="edge">\n<title>START->MSZKSMU</title>\n<path fill="none" stroke="#0000ff" stroke-width="2" d="M54.1967,-63C62.3906,-63 71.6286,-63 80.7147,-63"/>\n<polygon fill="#0000ff" stroke="#0000ff" stroke-width="2" points="80.8451,-66.5001 90.8451,-63 80.845,-59.5001 80.8451,-66.5001"/>\n</g>\n<!-- MSPCKSM -->\n<g id="node3" class="node">\n<title>MSPCKSM</title>\n<g id="a_node3"><a xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=5710" xlink:title="MSPCKSM\\nklinikai szoci\xc3\xa1lis munka">\n<polygon fill="#ffe4e1" stroke="#ffe4e1" points="328,-108 227,-108 227,-72 328,-72 328,-108"/>\n<text text-anchor="middle" x="277.5" y="-92.8" font-family="Times,serif" font-size="9.00" fill="#000000">MSPCKSM</text>\n<text text-anchor="middle" x="277.5" y="-82.8" font-family="Times,serif" font-size="9.00" fill="#000000">klinikai szoci\xc3\xa1lis munka</text>\n</a>\n</g>\n</g>\n<!-- MSZKSMU->MSPCKSM -->\n<g id="edge3" class="edge">\n<title>MSZKSMU->MSPCKSM</title>\n<path fill="none" stroke="#000000" d="M164.1941,-69.6049C179.9274,-72.4369 198.7348,-75.8223 216.4633,-79.0134"/>\n<polygon fill="#000000" stroke="#000000" points="216.2835,-82.5372 226.7454,-80.8642 217.5237,-75.6479 216.2835,-82.5372"/>\n</g>\n<!-- 1287 -->\n<g id="node4" class="node">\n<title>1287</title>\n<g id="a_node4"><a 
xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=vegzettseg&idmezo=vegzettsegid&id=1287" xlink:title="MMSAZMO\\nokleveles\\nszoci\xc3\xa1lis munk\xc3\xa1s">\n<polygon fill="#ffff00" stroke="#ffff00" points="316,-54 239,-54 239,-16 316,-16 316,-54"/>\n<text text-anchor="middle" x="277.5" y="-42.8" font-family="Times,serif" font-size="9.00" fill="#000000">MMSAZMO</text>\n<text text-anchor="middle" x="277.5" y="-32.8" font-family="Times,serif" font-size="9.00" fill="#000000">okleveles</text>\n<text text-anchor="middle" x="277.5" y="-22.8" font-family="Times,serif" font-size="9.00" fill="#000000">szoci\xc3\xa1lis munk\xc3\xa1s</text>\n</a>\n</g>\n</g>\n<!-- MSZKSMU->1287 -->\n<g id="edge2" class="edge">\n<title>MSZKSMU->1287</title>\n<path fill="none" stroke="#ff0000" d="M164.1941,-56.1504C183.6481,-52.519 207.8022,-48.0103 228.805,-44.0897"/>\n<polygon fill="#ff0000" stroke="#ff0000" points="229.6399,-47.4944 238.8279,-42.2188 228.3554,-40.6133 229.6399,-47.4944"/>\n<text text-anchor="middle" x="195.5" y="-54.6" font-family="Times,serif" font-size="8.00" fill="#ff0000">START</text>\n</g>\n</g>\n</svg>\n</div><br><br><div align=\'left\' valign=\'top\'><b>Nyilv\xc3\xa1ntartott szak:</div></b><table border=\'1\' cellpadding=\'2\' cellspacing=\'0\'><tr><td align=\'left\' valign=\'top\'><b>nyilv. szak ID</b></td><td align=\'left\' valign=\'top\'><b>k\xc3\xb3d</b></td><td align=\'left\' valign=\'top\'><b>n\xc3\xa9v</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g kezdete</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>meghird. kezdete</b></td><td align=\'left\' valign=\'top\'><b>meghird. 
v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>telephely</b></td><td align=\'left\' valign=\'top\'><b>nyelv</b></td><td align=\'left\' valign=\'top\'><b>munkarend</b></td></tr>\n<tr><td align=\'left\' valign=\'top\'><a href=\'torzsadat.php?tabla=nyilvantartottszak&sid=21715\'>36318</a></td><td align=\'left\' valign=\'top\'>MSZKSMU</td><td align=\'left\' valign=\'top\'>szoci\xc3\xa1lis munka</td><td align=\'left\' valign=\'top\'>2020-01-01</td><td align=\'left\' valign=\'top\'></td><td align=\'left\' valign=\'top\'>2020-01-01</td><td align=\'left\' valign=\'top\'></td><td align=\'left\' valign=\'top\'>Szeged</td><td align=\'left\' valign=\'top\'>magyar</td><td align=\'left\' valign=\'top\'>levelez\xc5\x91</td></tr>\n</table><div align=\'left\' valign=\'top\'><b>Nyilv\xc3\xa1ntartott k\xc3\xa9pz\xc3\xa9si elemek:</b></div><table border=\'1\' cellpadding=\'2\' cellspacing=\'0\'>\n<tr><td align=\'left\' valign=\'top\'><b>k\xc3\xb3d</b></td><td align=\'left\' valign=\'top\'><b>n\xc3\xa9v</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g kezdete</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>meghird. kezdete</b></td><td align=\'left\' valign=\'top\'><b>meghird. 
v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>t\xc3\xadpus</b></td><td align=\'left\' valign=\'top\'><b>minimum kredit</b></td><td align=\'left\' valign=\'top\'><b>maximum kredit</b></td></tr><tr><td align=\'left\' valig=\'top\'><a href=\'torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=5710\'>MSPCKSM</a></td><td align=\'left\' valig=\'top\'>klinikai szoci\xc3\xa1lis munka</td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>specializ\xc3\xa1ci\xc3\xb3</td><td align=\'left\' valig=\'top\'>35</td><td align=\'left\' valig=\'top\'>40</td></tr><tr><td align=\'left\' valig=\'top\'><a href=\'torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=414\'>MSZKSMU</a></td><td align=\'left\' valig=\'top\'>szoci\xc3\xa1lis munka</td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>szak</td><td align=\'left\' valig=\'top\'>120</td><td align=\'left\' valig=\'top\'>120</td></tr></table></form>\r\n </td>\r\n </tr>\r\n <tr>\r\n <td colspan="2" bgcolor=\'#0994dc\' width="100%">\r\n <table width="100%">\r\n\t<tr>\r\n\t <td align=\'left\'>\r\n\t <font size=\'1\' color=\'#ffffff\'>Az adatb\xc3\xa1zis 2022-09-24 hajnalban friss\xc3\xbclt.</font>\r\n\t </td>\r\n\t <td align="right">\r\n\t <font size=\'1\' color=\'#ffffff\'>K\xc3\xa9sz\xc3\xbclt az EKOP-1.A.1-08/C-2009-0009 "Az Oktat\xc3\xa1si Hivatal k\xc3\xb6zigazgat\xc3\xa1si szolg\xc3\xa1ltat\xc3\xa1sainak elektroniz\xc3\xa1l\xc3\xa1sa" projekt keret\xc3\xa9ben. © 2012.</font>\r\n\t </td>\r\n\t</tr>\r\n </td>\r\n </tr>\r\n</table>\r\n</body>\r\n</html>\r\n\n'
encoding = detect_encoding(html_byte)
tree = HTMLTree.parse_from_bytes(html_byte, encoding)
str(tree)
It would be incredibly useful for this library to include type annotations and to declare itself as a PEP 561 compliant stub package.
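For context, PEP 561 asks a package that ships inline type information to include a py.typed marker file in its package directory. A minimal sketch for checking whether an installed package does so, using only the standard library, is shown here. The helper name ships_type_marker is made up for this example and is not part of resiliparse or FastWARC.

```python
import importlib.util
from pathlib import Path


def ships_type_marker(package_name: str) -> bool:
    """Return True if the installed package contains a PEP 561 py.typed marker."""
    try:
        spec = importlib.util.find_spec(package_name)
    except (ImportError, ValueError):
        return False
    if spec is None or spec.submodule_search_locations is None:
        # Not installed, or a single-file module without a package directory.
        return False
    return any(
        (Path(location) / "py.typed").is_file()
        for location in spec.submodule_search_locations
    )


# Prints False for packages that are missing or that ship no marker file.
for name in ("resiliparse", "fastwarc"):
    print(name, ships_type_marker(name))
```

A type checker such as mypy only picks up a package's own annotations when this marker is present, which is why the request mentions PEP 561 explicitly.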
It seems like resiliparse does not compile under Ubuntu 18. It fails with this error message:
building 'fastwarc.warc' extension
creating build/temp.linux-x86_64-cpython-37
creating build/temp.linux-x86_64-cpython-37/fastwarc
gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/include/python3.7m -c fastwarc/warc.cpp -o build/temp.linux-x86_64-cpython-37/fastwarc/warc.o -std=c++17 -O3 -Wall -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
fastwarc/warc.cpp:1348:3: error: ‘LZ4F_cctx’ does not name a type; did you mean ‘LZ4F_cctx_s’?
LZ4F_cctx *cctx;
^~~~~~~~~
LZ4F_cctx_s
fastwarc/warc.cpp:1349:3: error: ‘LZ4F_dctx’ does not name a type; did you mean ‘LZ4F_dctx_s’?
LZ4F_dctx *dctx;
^~~~~~~~~
LZ4F_dctx_s
error: command '/usr/bin/gcc' failed with exit code 1
----------------------------------------
ERROR: Failed building wheel for fastwarc
It seems like the lz4 package that comes from the Ubuntu 18 package repository is the wrong version?
When I install lz4 from source, it works:
git clone https://github.com/lz4/lz4
cd lz4
make
make install
Since installing lz4 from source resolves the problem, this might not have the highest priority.
The fastwarc command-line tool "index" indexes some records of a gzipped WARC file with an erroneous record length of zero:
$> wget http://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2021/09/CC-NEWS-20210930113548-00741.warc.gz
$> fastwarc index -fwarc-type,warc-target-uri,offset,length CC-NEWS-20210930113548-00741.warc.gz \
| grep -F '"length": "0"'
{"warc-type": "response", "warc-target-uri": "https://www.themarketsdaily.com/2021/09/30/ishares-sp-500-etf-nysearcaivv-sees-strong-trading-volume.html", "offset": "232757027", "length": "0"}
{"warc-type": "response", "warc-target-uri": "https://www.timeturk.com/yasam/baskan-buyukkilic-10-milyon-tl-yatirim-yapilan-yeralti-carsisi-nda-incelemede-bulundu-esnafi-ziyaret-etti/haber-1703634", "offset": "278528237", "length": "0"}
{"warc-type": "response", "warc-target-uri": "https://www.sondakika.com/haber/haber-yayinci-tevfik-rauf-baysal-vefat-etti-14429565/", "offset": "1044381471", "length": "0"}
See also the discussion in #11; however, fewer records are affected here. With the uncompressed WARC file, the error is not reproducible.
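The grep filter above matches on the raw text of each line. A slightly more robust variant, sketched here in Python, parses each line of the index output as JSON and compares the length field numerically. The function name zero_length_records is made up for this example, and the second sample line is a hypothetical healthy record added for contrast.

```python
import json


def zero_length_records(lines):
    """Yield parsed index entries whose "length" field is zero."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # The index output shown above encodes numbers as strings.
        if int(entry.get("length", -1)) == 0:
            yield entry


sample = [
    '{"warc-type": "response", "warc-target-uri": "https://www.sondakika.com/haber/haber-yayinci-tevfik-rauf-baysal-vefat-etti-14429565/", "offset": "1044381471", "length": "0"}',
    # Hypothetical record with a non-zero length, for contrast.
    '{"warc-type": "response", "warc-target-uri": "https://example.com/", "offset": "0", "length": "512"}',
]

for entry in zero_length_records(sample):
    print(entry["offset"], entry["warc-target-uri"])
```

To run it against real output, the index command's standard output can be piped into a script that passes sys.stdin to zero_length_records.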
After benchmarking resiliparse's HTML2Text extraction, extract_plain_text(tree, main_content=True), it seems that the extract_plain_text method is significantly slower when run in parallel than when run sequentially.
sequentially : 508.147 items/sec
parallel : 62.7322 items/sec
I ran the benchmarking with a tool I wrote, https://github.com/Nootka-io/wee-benchmarking-tool. I'll work on pulling out a minimal example.
It seems strange to me, and I am not sure where to begin profiling or debugging. Other libraries see little improvement, but resiliparse is the only one showing a dramatic drop, although it's still the fastest.
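Until a minimal example from the actual benchmarking tool is available, a comparison of this kind could be sketched roughly as follows. This is not the code of the linked tool. It assumes a thread pool as the parallel mechanism (the report does not say which one was used), and the workload is a placeholder string operation, so that the sketch runs without resiliparse installed.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def items_per_sec(func, items, workers=None):
    """Apply func to every item and return the throughput in items per second.

    workers=None runs sequentially; otherwise a thread pool of that size is used.
    """
    start = time.perf_counter()
    if workers is None:
        for item in items:
            func(item)
    else:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() forces all results to be computed before the timer stops.
            list(pool.map(func, items))
    elapsed = time.perf_counter() - start
    return len(items) / elapsed if elapsed > 0 else float("inf")


def extract(html):
    # Placeholder workload; swap in the extraction call from the report.
    return html.lower()


docs = ["<p>Hello World</p>"] * 1000  # hypothetical input documents
print("sequential:", items_per_sec(extract, docs))
print("parallel  :", items_per_sec(extract, docs, workers=4))
```

For very cheap per-item work, the scheduling overhead of any pool can dominate the measurement, so a drop in items/sec under parallel execution is not necessarily specific to one library.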
Hi,
Thanks for the very nice package.
Do you know which dependencies should be installed with yum?
I am struggling to build fastWARC from source within a lambda container. Here is my Dockerfile.
FROM public.ecr.aws/lambda/python:3.8
RUN yum groupinstall "Development Tools" -y
RUN yum install python3-devel -y
RUN yum install -y zlib-devel lz4-devel liblexbor-devel uchardet-devel
RUN pip3 install --no-binary fastwarc fastwarc --target "${LAMBDA_TASK_ROOT}"
COPY app.py ${LAMBDA_TASK_ROOT}
CMD [ "app.handler" ]
This is the error message
ERROR: Command errored out with exit status 1:
command: /var/lang/bin/python3.8 /var/lang/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /tmp/tmparkimzwm
cwd: /tmp/pip-install-1hzfg9i1/fastwarc_fcfee32f14f34b609444e2992925ac95
Complete output (26 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.8
creating build/lib.linux-x86_64-3.8/fastwarc
copying fastwarc/cli.py -> build/lib.linux-x86_64-3.8/fastwarc
copying fastwarc/__init__.py -> build/lib.linux-x86_64-3.8/fastwarc
copying fastwarc/stream_io.pxd -> build/lib.linux-x86_64-3.8/fastwarc
copying fastwarc/warc.pxd -> build/lib.linux-x86_64-3.8/fastwarc
copying fastwarc/__init__.pxd -> build/lib.linux-x86_64-3.8/fastwarc
running build_ext
building 'fastwarc.warc' extension
creating build/temp.linux-x86_64-3.8
creating build/temp.linux-x86_64-3.8/fastwarc
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/var/lang/include/python3.8 -c fastwarc/warc.cpp -o build/temp.linux-x86_64-3.8/fastwarc/warc.o -std=c++17 -O3 -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function -fpermissive -Wno-c++11-narrowing
g++ -pthread -shared -Wl,-rpath=/var/lang/lib build/temp.linux-x86_64-3.8/fastwarc/warc.o -L/var/lang/lib -o build/lib.linux-x86_64-3.8/fastwarc/warc.cpython-38-x86_64-linux-gnu.so -std=c++17
building 'fastwarc.stream_io' extension
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/var/lang/include/python3.8 -c fastwarc/stream_io.cpp -o build/temp.linux-x86_64-3.8/fastwarc/stream_io.o -std=c++17 -O3 -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function -fpermissive -Wno-c++11-narrowing
fastwarc/stream_io.cpp: In function ‘int __pyx_pf_8fastwarc_9stream_io_9LZ4Stream_2__cinit__(__pyx_obj_8fastwarc_9stream_io_LZ4Stream*, PyObject*, PyObject*, PyObject*)’:
fastwarc/stream_io.cpp:7441:23: error: ‘struct LZ4F_preferences_t’ has no member named ‘favorDecSpeed’
__pyx_v_self->prefs.favorDecSpeed = __pyx_t_4;
^~~~~~~~~~~~~
At global scope:
cc1plus: warning: unrecognized command line option ‘-Wno-c++11-narrowing’
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Failed building wheel for fastwarc
Many thanks!
MRE:
from resiliparse.parse.html import HTMLTree
str(HTMLTree.parse("<svg><template>\n"))
It causes a segmentation fault, and the trace shows that it was caused by lxb_html_serialize_node_cb.
CC @lexborisov
I am getting this error inside an ubuntu:23.04 Docker container:
pip install fastwarc
Collecting fastwarc
Downloading FastWARC-0.14.5.tar.gz (42 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.6/42.6 kB 2.0 MB/s eta 0:00:00
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [18 lines of output]
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/usr/lib/python3/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-build-env-l7hwcul9/overlay/local/lib/python3.11/dist-packages/setuptools/build_meta.py", line 355, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-build-env-l7hwcul9/overlay/local/lib/python3.11/dist-packages/setuptools/build_meta.py", line 325, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-l7hwcul9/overlay/local/lib/python3.11/dist-packages/setuptools/build_meta.py", line 341, in run_setup
exec(code, locals())
File "<string>", line 34, in <module>
AttributeError: module 'setuptools.config.pyprojecttoml' has no attribute '_BetaConfiguration'. Did you mean: 'read_configuration'?
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
I'm trying to use Resiliparse to process CC, but I find that single-process memory usage grows slowly and parsing of web pages gets slower over time. I can't find the source of the problem with a memory analysis tool; it feels like some sort of memory leak.
from fastwarc.warc import ArchiveIterator as FastWarcArchiveIterator
from fastwarc.warc import WarcRecordType, WarcRecord
from fastwarc.stream_io import FastWARCError
from fastwarc.warc import is_http
from resiliparse.parse.html import HTMLTree
import os
import psutil
import time

file = open("../CC-MAIN-20231129041834-20231129071834-00864.warc.gz", "rb")
archive_iterator = FastWarcArchiveIterator(
    file,
    record_types=WarcRecordType.response,
    parse_http=True,
    func_filter=is_http,
)
idx = 0
s = time.time()
process = psutil.Process(os.getpid())
for record in archive_iterator:
    raw = record.reader.read()
    try:
        # Parse and serialize the page; pages that are not valid UTF-8 are skipped.
        str(HTMLTree.parse(raw.decode("utf-8")))
    except Exception:
        pass
    idx += 1
    if idx % 1000 == 0:
        # Print resident memory in MiB and the time taken for the last 1000 records.
        print(process.memory_info().rss / 1024**2, time.time() - s)
        s = time.time()
And here is the output (resident memory in MiB, then seconds per 1000 records):
91.0625 0.5237786769866943
108.921875 0.5350120067596436
132.015625 0.6991021633148193
132.828125 0.6604471206665039
144.25 0.6309971809387207
149.703125 0.9664597511291504
163.6875 1.0105879306793213
168.65625 0.9966611862182617
175.578125 1.1011531352996826
219.890625 0.9911088943481445
233.671875 1.1710970401763916
238.421875 1.0304219722747803
238.453125 0.9967031478881836
242.5625 1.016160011291504
246.40625 1.0688650608062744
246.40625 1.0952439308166504
246.40625 1.100167989730835
248.84375 1.077517032623291
248.84375 1.0870559215545654
248.84375 1.0127251148223877
248.84375 1.1437797546386719
253.125 1.0540661811828613
255.75 1.1424810886383057
255.75 1.0953960418701172
261.203125 1.1094980239868164
262.96875 1.293619155883789
265.6875 1.2223970890045166
265.6875 1.2639102935791016
265.6875 1.2537767887115479
265.6875 1.1750257015228271
266.203125 1.2749638557434082
266.203125 1.2159233093261719
272.484375 1.2846689224243164
275.515625 1.2536969184875488
275.515625 1.1752068996429443
275.515625 1.156343936920166
275.515625 1.249042272567749
276.765625 1.1668882369995117
Hello,
I was wondering if you had any input on what would be the best way to traverse and remove the contents of a nested span?
Nested span to traverse:
<span normalizedcite="<span class="citation no-link">98 T.C. 141</span>">98 T.C. 141</span>
Span to remove or replace:
"<span class="citation no-link">98 T.C. 141</span>"
Using query_selector_all does not seem to match the .normalizedcite class.
Thank you
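One observation on the snippet above: normalizedcite appears there as an attribute name, not as a class, so a class selector such as .normalizedcite would not be expected to match it; in CSS terms, the attribute selector span[normalizedcite] is the matching form. The sketch below illustrates the idea with Python's standard html.parser module instead of resiliparse. It assumes that the attribute value is entity-escaped in the real page source (the snippet as pasted has unescaped quotes inside the attribute), and the names CiteCollector and strip_markup are made up for this example.

```python
import html
import re
from html.parser import HTMLParser

TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(value: str) -> str:
    """Remove any tags from an attribute value, keeping only its text."""
    return TAG_RE.sub("", html.unescape(value)).strip()


class CiteCollector(HTMLParser):
    """Collect the cleaned normalizedcite attribute of every span element."""

    def __init__(self):
        super().__init__()
        self.cites = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            value = dict(attrs).get("normalizedcite")
            if value is not None:
                self.cites.append(strip_markup(value))


# Assumption: this is how the attribute looks in the raw page source.
doc = (
    '<span normalizedcite="&lt;span class=&quot;citation no-link&quot;&gt;'
    '98 T.C. 141&lt;/span&gt;">98 T.C. 141</span>'
)

collector = CiteCollector()
collector.feed(doc)
print(collector.cites)
```

The same two steps (select by attribute, then strip the markup from the attribute value) should carry over to a DOM-based parser, where the cleaned value could be written back to the element.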
user@box:~$ pipx run resiliparse
Traceback (most recent call last):
File "/home/user/.local/pipx/.cache/42f25da10f76b98/bin/resiliparse", line 5, in <module>
from resiliparse.cli import main
File "/home/user/.local/pipx/.cache/42f25da10f76b98/lib/python3.9/site-packages/resiliparse/cli.py", line 18, in <module>
from joblib import Parallel, delayed
ModuleNotFoundError: No module named 'joblib'
user@box:~$ pipx install resiliparse
installed package resiliparse 0.11.1, installed using Python 3.9.2
These apps are now globally available
- resiliparse
done! ✨ 🌟 ✨
user@box:~$ resiliparse
Traceback (most recent call last):
File "/home/user/.local/bin/resiliparse", line 5, in <module>
from resiliparse.cli import main
File "/home/user/.local/pipx/venvs/resiliparse/lib/python3.9/site-packages/resiliparse/cli.py", line 18, in <module>
from joblib import Parallel, delayed
ModuleNotFoundError: No module named 'joblib'
user@box:~$
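Both tracebacks show resiliparse/cli.py importing joblib, which was not installed alongside the package. A quick way to confirm which modules are missing from an environment, sketched with the standard library only, is shown here. The helper name missing_modules is made up for this example.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# joblib is the module named in the traceback above.
print(missing_modules(["joblib"]))
```

Run with the interpreter of the pipx-managed environment, this prints ['joblib'] if the module is absent. One possible workaround in that case is to add it to the existing environment with pipx inject resiliparse joblib.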