chatnoir-eu / chatnoir-resiliparse
A robust web archive analytics toolkit
Home Page: https://resiliparse.chatnoir.eu
License: Apache License 2.0
user@box:~$ pipx run resiliparse
Traceback (most recent call last):
File "/home/user/.local/pipx/.cache/42f25da10f76b98/bin/resiliparse", line 5, in <module>
from resiliparse.cli import main
File "/home/user/.local/pipx/.cache/42f25da10f76b98/lib/python3.9/site-packages/resiliparse/cli.py", line 18, in <module>
from joblib import Parallel, delayed
ModuleNotFoundError: No module named 'joblib'
user@box:~$ pipx install resiliparse
installed package resiliparse 0.11.1, installed using Python 3.9.2
These apps are now globally available
- resiliparse
done! ✨ 🌟 ✨
user@box:~$ resiliparse
Traceback (most recent call last):
File "/home/user/.local/bin/resiliparse", line 5, in <module>
from resiliparse.cli import main
File "/home/user/.local/pipx/venvs/resiliparse/lib/python3.9/site-packages/resiliparse/cli.py", line 18, in <module>
from joblib import Parallel, delayed
ModuleNotFoundError: No module named 'joblib'
user@box:~$
pip3 install --no-binary fastwarc fastwarc
pip3 install fastwarc
(no binaries provided for ARM CPUs)
The error message indicates that fastwarc is now too interconnected with resiliparse:
ERROR: Command errored out with exit status 1:
...
from resiliparse_common.string_util cimport str_to_lower, strip_str, strip_c_str
^
------------------------------------------------------------
fastwarc/warc.pyx:32:0: 'resiliparse_common/string_util.pxd' not found
Building from a checkout of chatnoir-resiliparse via pip3 wheel -e fastwarc also succeeds on ARM-based systems.
Hello,
I'm trying to build Resiliparse 0.13.7 from source, and I'm getting this error. Can you tell me which library Resiliparse is expecting to get html.h from? I suspect I'm missing a dependency.
resiliparse/extract/html2text.cpp:869:10: fatal error: html.h: No such file or directory
#include "html.h"
^~~~~~~~
Thanks,
Dave
It seems that Resiliparse does not compile under Ubuntu 18. It fails with this error message:
building 'fastwarc.warc' extension
creating build/temp.linux-x86_64-cpython-37
creating build/temp.linux-x86_64-cpython-37/fastwarc
gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/include/python3.7m -c fastwarc/warc.cpp -o build/temp.linux-x86_64-cpython-37/fastwarc/warc.o -std=c++17 -O3 -Wall -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
fastwarc/warc.cpp:1348:3: error: ‘LZ4F_cctx’ does not name a type; did you mean ‘LZ4F_cctx_s’?
LZ4F_cctx *cctx;
^~~~~~~~~
LZ4F_cctx_s
fastwarc/warc.cpp:1349:3: error: ‘LZ4F_dctx’ does not name a type; did you mean ‘LZ4F_dctx_s’?
LZ4F_dctx *dctx;
^~~~~~~~~
LZ4F_dctx_s
error: command '/usr/bin/gcc' failed with exit code 1
----------------------------------------
ERROR: Failed building wheel for fastwarc
It seems that the lz4 package in the Ubuntu 18 repository is the wrong version.
When I install lz4 from source, the build works:
git clone https://github.com/lz4/lz4
cd lz4
make
make install
Since installing lz4 from source resolves the problem, this might not have the highest priority.
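For anyone hitting the same LZ4F_cctx or favorDecSpeed errors, it may help to check which liblz4 the system actually provides before building. The following is only a minimal sketch: the helper names decode_lz4_version and installed_lz4_version are made up for illustration, and it inspects the shared library found by the dynamic loader, which is assumed (not guaranteed) to match the development headers the compiler sees.

```python
import ctypes
import ctypes.util
from typing import Optional, Tuple


def decode_lz4_version(number: int) -> Tuple[int, int, int]:
    """Split LZ4's packed version number (major*10000 + minor*100 + patch)."""
    return number // 10000, (number // 100) % 100, number % 100


def installed_lz4_version() -> Optional[Tuple[int, int, int]]:
    """Return the version of the liblz4 found by the loader, or None."""
    name = ctypes.util.find_library("lz4")
    if name is None:
        return None  # no liblz4 visible to the dynamic loader
    try:
        lib = ctypes.CDLL(name)
        version_number = lib.LZ4_versionNumber  # exported by lz4.h
    except (OSError, AttributeError):
        return None  # library not loadable, or too old to export the symbol
    version_number.restype = ctypes.c_int
    return decode_lz4_version(version_number())


if __name__ == "__main__":
    print("liblz4 version:", installed_lz4_version())
```

If this reports an old release (or None), building lz4 from source as described above is a plausible workaround.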
Hi,
Thanks for the very nice package.
Do you know which dependencies should be installed with yum?
I am struggling to build FastWARC from source within a Lambda container. Here is my Dockerfile:
FROM public.ecr.aws/lambda/python:3.8
RUN yum groupinstall "Development Tools" -y
RUN yum install python3-devel -y
RUN yum install -y zlib-devel lz4-devel liblexbor-devel uchardet-devel
RUN pip3 install --no-binary fastwarc fastwarc --target "${LAMBDA_TASK_ROOT}"
COPY app.py ${LAMBDA_TASK_ROOT}
CMD [ "app.handler" ]
This is the error message
ERROR: Command errored out with exit status 1:
command: /var/lang/bin/python3.8 /var/lang/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /tmp/tmparkimzwm
cwd: /tmp/pip-install-1hzfg9i1/fastwarc_fcfee32f14f34b609444e2992925ac95
Complete output (26 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.8
creating build/lib.linux-x86_64-3.8/fastwarc
copying fastwarc/cli.py -> build/lib.linux-x86_64-3.8/fastwarc
copying fastwarc/__init__.py -> build/lib.linux-x86_64-3.8/fastwarc
copying fastwarc/stream_io.pxd -> build/lib.linux-x86_64-3.8/fastwarc
copying fastwarc/warc.pxd -> build/lib.linux-x86_64-3.8/fastwarc
copying fastwarc/__init__.pxd -> build/lib.linux-x86_64-3.8/fastwarc
running build_ext
building 'fastwarc.warc' extension
creating build/temp.linux-x86_64-3.8
creating build/temp.linux-x86_64-3.8/fastwarc
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/var/lang/include/python3.8 -c fastwarc/warc.cpp -o build/temp.linux-x86_64-3.8/fastwarc/warc.o -std=c++17 -O3 -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function -fpermissive -Wno-c++11-narrowing
g++ -pthread -shared -Wl,-rpath=/var/lang/lib build/temp.linux-x86_64-3.8/fastwarc/warc.o -L/var/lang/lib -o build/lib.linux-x86_64-3.8/fastwarc/warc.cpython-38-x86_64-linux-gnu.so -std=c++17
building 'fastwarc.stream_io' extension
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/var/lang/include/python3.8 -c fastwarc/stream_io.cpp -o build/temp.linux-x86_64-3.8/fastwarc/stream_io.o -std=c++17 -O3 -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function -fpermissive -Wno-c++11-narrowing
fastwarc/stream_io.cpp: In function ‘int __pyx_pf_8fastwarc_9stream_io_9LZ4Stream_2__cinit__(__pyx_obj_8fastwarc_9stream_io_LZ4Stream*, PyObject*, PyObject*, PyObject*)’:
fastwarc/stream_io.cpp:7441:23: error: ‘struct LZ4F_preferences_t’ has no member named ‘favorDecSpeed’
__pyx_v_self->prefs.favorDecSpeed = __pyx_t_4;
^~~~~~~~~~~~~
At global scope:
cc1plus: warning: unrecognized command line option ‘-Wno-c++11-narrowing’
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Failed building wheel for fastwarc
Many thanks!
Thanks for developing FastWARC! It's been a great tool while I've been exploring text extraction from Common Crawl with Python.
I'm interested in using it as part of a much larger pipeline, but I want to enable chunked processing and I'm curious whether this is possible. My understanding at the moment is that the ArchiveIterator gives me a nice handle to process all WarcRecords sequentially. What I think I'd like to be able to do is something like:
archv = ArchiveChunked(open(...), ...)
recs = archv[N:N+10] # select 10 records starting at N
Doing this should allow me to leverage batch processing functionality and distribute the processing across multiple cores.
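One way to approximate this without random access is to group a sequential iterator into fixed-size batches. The sketch below is an assumption-laden illustration, not FastWARC functionality: iter_chunks and its extract callback are hypothetical names, and it works on any iterable of records. The callback copies whatever is needed from each record as soon as it is read, which sidesteps the question of whether a record stays readable after the iterator has advanced (something to confirm in the FastWARC documentation).

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

R = TypeVar("R")  # record type yielded by the underlying iterator
T = TypeVar("T")  # whatever the extract callback returns


def iter_chunks(records: Iterable[R], size: int,
                extract: Callable[[R], T]) -> Iterator[List[T]]:
    """Yield lists of up to `size` extracted items, in original order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # Apply extract lazily, one record at a time, as records are consumed.
    extracted = (extract(record) for record in records)
    while True:
        chunk = list(islice(extracted, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Stand-in for a record iterator; integers play the role of records.
    for batch in iter_chunks(range(7), 3, extract=lambda r: r * 2):
        print(batch)
```

Each yielded batch is a plain list, so it can be handed to a worker pool. Unlike the archv[N:N+10] idea, this still reads the archive front to back; it does not jump to record N.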
Does this package support Python 3.7? I am using a distributed cluster that only supports Python 3.7.
AttributeError: module 'setuptools.config.pyprojecttoml' has no attribute '_BetaConfiguration'. Did you mean: 'read_configuration'?
Fails to build in Docker on Apple Silicon.
Builds fine on Linux and also outside of Docker on native macOS.
pip install fastwarc==0.14.5
Collecting fastwarc==0.14.5
Using cached FastWARC-0.14.5.tar.gz (42 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [18 lines of output]
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/opt/conda/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-build-env-jytdl1uv/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-build-env-jytdl1uv/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-jytdl1uv/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 311, in run_setup
exec(code, locals())
File "<string>", line 34, in <module>
AttributeError: module 'setuptools.config.pyprojecttoml' has no attribute '_BetaConfiguration'. Did you mean: 'read_configuration'?
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
$ pipx run --verbose fastwarc check /tmp/warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
pipx >(setup:729): pipx version is 1.0.0
pipx >(setup:730): Default python interpreter is '/home/user/.local/pipx/venvs/pipx/bin/python'
pipx >(needs_upgrade:69): Time since last upgrade of shared libs, in seconds: 1561898. Upgrade will be run by pipx if greater than 2592000.
pipx >(run_subprocess:172): running /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/python -c import sysconfig; print(sysconfig.get_path('purelib'))
pipx >(run:103): Reusing cached venv /home/user/.local/pipx/.cache/7a73b1e86637c39
pipx >(run_subprocess:172): running /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/python -c import sysconfig; print(sysconfig.get_path('purelib'))
pipx >(exec_app:387): exec_app: /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc check /tmp/warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
0 records were verified successfully.
1 records were skipped without digest.
Error in sys.excepthook:
Traceback (most recent call last):
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc", line 8, in <module>
sys.exit(main())
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/fastwarc/cli.py", line 138, in check
for v in pbar:
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "fastwarc/tools.pyx", line 178, in verify_digests
File "fastwarc/warc.pyx", line 922, in fastwarc.warc.WarcRecord.verify_block_digest
File "fastwarc/warc.pyx", line 934, in fastwarc.warc.WarcRecord.verify_block_digest
File "fastwarc/warc.pyx", line 872, in fastwarc.warc.WarcRecord._verify_digest
File "/usr/lib/python3.9/base64.py", line 231, in b32decode
raise binascii.Error('Non-base32 digit found') from None
binascii.Error: Non-base32 digit found
Original exception was:
Traceback (most recent call last):
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc", line 8, in <module>
sys.exit(main())
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/fastwarc/cli.py", line 138, in check
for v in pbar:
File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "fastwarc/tools.pyx", line 178, in verify_digests
File "fastwarc/warc.pyx", line 922, in fastwarc.warc.WarcRecord.verify_block_digest
File "fastwarc/warc.pyx", line 934, in fastwarc.warc.WarcRecord.verify_block_digest
File "fastwarc/warc.pyx", line 872, in fastwarc.warc.WarcRecord._verify_digest
File "/usr/lib/python3.9/base64.py", line 231, in b32decode
raise binascii.Error('Non-base32 digit found') from None
binascii.Error: Non-base32 digit found
$
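The traceback above ends in base64.b32decode, which suggests the record carried a digest value that is not valid base32. As a rough illustration of how a checker could tolerate that, here is a sketch under stated assumptions: the digest header has the form "algorithm:value" (as in the sha1:... values seen in WARC headers), and the value is either base32 or hexadecimal. The function verify_digest is hypothetical and is not FastWARC's implementation.

```python
import base64
import hashlib
from typing import Iterator


def _decode_candidates(encoded: str) -> Iterator[bytes]:
    """Yield every plausible raw digest for the encoded value."""
    decoders = (lambda s: base64.b32decode(s.upper()), bytes.fromhex)
    for decode in decoders:
        try:
            yield decode(encoded)
        except ValueError:  # binascii.Error is a ValueError subclass
            continue


def verify_digest(header_value: str, block: bytes) -> bool:
    """Return True if `block` hashes to the digest named in the header."""
    algorithm, _, encoded = header_value.partition(":")
    try:
        actual = hashlib.new(algorithm.lower(), block).digest()
    except (ValueError, TypeError):
        return False  # unknown or unsupported hash algorithm
    # Malformed values yield no candidates and count as a failed check.
    return any(candidate == actual for candidate in _decode_candidates(encoded))


if __name__ == "__main__":
    data = b"example block"
    good = "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode()
    print(verify_digest(good, data))         # base32-encoded digest
    print(verify_digest("sha1:!!!!", data))  # malformed digest value
```

The design choice here is to report a malformed digest as a verification failure rather than letting the decoding exception abort the whole run.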
The field WarcRecord.http_headers could include the HTTP status code, or it could be provided as an extra attribute of WarcRecord.
When reading a record, it is not easy to see which status code a response had. For example, if I want to keep only 301 redirects (or only 200 responses for further processing), I am not able to do this, as far as I can see. The other HTTP headers are parsed, but not the HTTP status line, which has a simple format (e.g. HTTP/1.X XXX Description) that could be integrated into the existing HTTP header parsing. I also found no simple way, like .reader, to access the raw HTTP communication.
Example:
>>> record.headers
{'WARC-Type': 'response', 'WARC-Target-URI': 'http://vgperson.com/robots.txt', 'WARC-Date': '2021-08-09T13:25:55Z', 'WARC-Payload-Digest': 'sha1:OLD2B4B3YRYMAUJAQNATRPULWOOXO3YP', 'WARC-IP-Address': '85.214.122.46', 'WARC-Record-ID': '<urn:uuid:5eeb5cea-38e6-4904-a4f9-077a162bc0d6>', 'Content-Type': 'application/http; msgtype=response', 'Content-Length': '454'}
>>> record.http_headers
{'Date': 'Mon, 09 Aug 2021 13:25:53 GMT', 'Server': 'Apache', 'Location': 'https://vgperson.com/robots.txt', 'Content-Length': '239', 'Connection': 'close', 'Content-Type': 'text/html; charset=iso-8859-1'}
>>> content = record.reader.read()
>>> assert len(content) == record.content_length # content only includes the real content, no access to HTTP stuff
>>> print(content)
b'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>301 Moved Permanently</title>\n</head><body>\n<h1>Moved Permanently</h1>\n<p>The document has moved <a href="https://vgperson.com/robots.txt">here</a>.</p>\n</body></html>\n'
HTTP communication:
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://vgperson.com/robots.txt
WARC-Date: 2021-08-09T13:25:55Z
WARC-Payload-Digest: sha1:OLD2B4B3YRYMAUJAQNATRPULWOOXO3YP
WARC-IP-Address: 85.214.122.46
WARC-Record-ID: <urn:uuid:5eeb5cea-38e6-4904-a4f9-077a162bc0d6>
Content-Type: application/http; msgtype=response
Content-Length: 454
HTTP/1.1 301 Moved Permanently
Date: Mon, 09 Aug 2021 13:25:53 GMT
Server: Apache
Location: https://vgperson.com/robots.txt
Content-Length: 239
Connection: close
Content-Type: text/html; charset=iso-8859-1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://vgperson.com/robots.txt">here</a>.</p>
</body></html>
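To make the request concrete, the status line in the dump above (HTTP/1.1 301 Moved Permanently) could be parsed roughly as follows. This is a standalone sketch using only the standard library; StatusLine and parse_status_line are illustrative names and do not refer to anything in FastWARC.

```python
from typing import NamedTuple


class StatusLine(NamedTuple):
    version: str
    code: int
    reason: str


def parse_status_line(line: bytes) -> StatusLine:
    """Parse a line such as b'HTTP/1.1 301 Moved Permanently\\r\\n'."""
    text = line.decode("iso-8859-1").rstrip("\r\n")
    parts = text.split(" ", 2)  # version, code, optional reason phrase
    if len(parts) < 2 or not parts[0].startswith("HTTP/"):
        raise ValueError(f"not an HTTP status line: {text!r}")
    version, code = parts[0], parts[1]
    if len(code) != 3 or not code.isdigit():
        raise ValueError(f"invalid status code: {code!r}")
    reason = parts[2] if len(parts) == 3 else ""
    return StatusLine(version, int(code), reason)


if __name__ == "__main__":
    status = parse_status_line(b"HTTP/1.1 301 Moved Permanently\r\n")
    if status.code == 301:
        print("redirect:", status)
```

With the code available as an integer, filtering for 301 or 200 responses becomes a simple comparison.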
I'm having this error both when trying to install from pip and from this repo:
fastwarc/warc.cpp:1189:3: error: ‘LZ4F_cctx’ does not name a type; did you mean ‘LZ4F_cctx_s’?
LZ4F_cctx *cctx;
^~~~~~~~~
LZ4F_cctx_s
fastwarc/warc.cpp:1190:3: error: ‘LZ4F_dctx’ does not name a type; did you mean ‘LZ4F_dctx_s’?
LZ4F_dctx *dctx;
^~~~~~~~~
LZ4F_dctx_s
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
I am getting this error inside ubuntu:23.04 docker
pip install fastwarc
Collecting fastwarc
Downloading FastWARC-0.14.5.tar.gz (42 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.6/42.6 kB 2.0 MB/s eta 0:00:00
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [18 lines of output]
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/usr/lib/python3/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-build-env-l7hwcul9/overlay/local/lib/python3.11/dist-packages/setuptools/build_meta.py", line 355, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-build-env-l7hwcul9/overlay/local/lib/python3.11/dist-packages/setuptools/build_meta.py", line 325, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-l7hwcul9/overlay/local/lib/python3.11/dist-packages/setuptools/build_meta.py", line 341, in run_setup
exec(code, locals())
File "<string>", line 34, in <module>
AttributeError: module 'setuptools.config.pyprojecttoml' has no attribute '_BetaConfiguration'. Did you mean: 'read_configuration'?
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
$ pip install --no-binary resiliparse resiliparse
DEPRECATION: --no-binary currently disables reading from the cache of locally built wheels. In the future --no-binary will not influence the wheel cache. pip 23.1 will enforce this behaviour change. A possible replacement is to use the --no-cache-dir option. You can use the flag --use-feature=no-binary-enable-wheel-cache to test the upcoming behaviour. Discussion can be found at https://github.com/pypa/pip/issues/11453
Collecting resiliparse
Using cached Resiliparse-0.13.7.tar.gz (601 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Installing backend dependencies ... done
Preparing metadata (pyproject.toml) ... done
Collecting fastwarc==0.13.7
Using cached FastWARC-0.13.7-cp311-cp311-linux_x86_64.whl
Collecting brotli
Using cached Brotli-1.0.9-cp311-cp311-linux_x86_64.whl
Requirement already satisfied: click in ./venv311/lib/python3.11/site-packages (from fastwarc==0.13.7->resiliparse) (8.0.4)
Requirement already satisfied: tqdm in ./venv311/lib/python3.11/site-packages (from fastwarc==0.13.7->resiliparse) (4.64.1)
Building wheels for collected packages: resiliparse
Building wheel for resiliparse (pyproject.toml) ... error
error: subprocess-exited-with-error
× Building wheel for resiliparse (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [50 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-cpython-311
creating build/lib.linux-x86_64-cpython-311/resiliparse
copying resiliparse/cli.py -> build/lib.linux-x86_64-cpython-311/resiliparse
copying resiliparse/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse
creating build/lib.linux-x86_64-cpython-311/resiliparse/beam
copying resiliparse/beam/coders.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
copying resiliparse/beam/textio.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
copying resiliparse/beam/warcio.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
copying resiliparse/beam/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
copying resiliparse/beam/elasticsearch.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
copying resiliparse/beam/fileio.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
creating build/lib.linux-x86_64-cpython-311/resiliparse/extract
copying resiliparse/extract/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse/extract
creating build/lib.linux-x86_64-cpython-311/resiliparse/parse
copying resiliparse/parse/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
copying resiliparse/__init__.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse
copying resiliparse/itertools.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse
copying resiliparse/process_guard.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse
copying resiliparse/extract/__init__.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/extract
copying resiliparse/extract/html2text.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/extract
copying resiliparse/parse/__init__.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
copying resiliparse/parse/html.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
copying resiliparse/parse/http.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
copying resiliparse/parse/lang.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
copying resiliparse/parse/encoding.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
copying resiliparse/parse/lang_profiles.h -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
copying resiliparse/parse/encoding.h -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
copying resiliparse/parse/html.h -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
running build_ext
building 'resiliparse.itertools' extension
creating build/temp.linux-x86_64-cpython-311
creating build/temp.linux-x86_64-cpython-311/resiliparse
x86_64-linux-gnu-gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/mnt/Data/Projects/nootka_io/sentry_dragon/venv311/include -I/usr/include/python3.11 -c resiliparse/itertools.cpp -o build/temp.linux-x86_64-cpython-311/resiliparse/itertools.o -std=c++17 -O3 -Wall -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function
x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -g -fwrapv -O2 build/temp.linux-x86_64-cpython-311/resiliparse/itertools.o -L/usr/lib/x86_64-linux-gnu -o build/lib.linux-x86_64-cpython-311/resiliparse/itertools.cpython-311-x86_64-linux-gnu.so -std=c++17
building 'resiliparse.extract.html2text' extension
creating build/temp.linux-x86_64-cpython-311/resiliparse/extract
x86_64-linux-gnu-gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I./resiliparse/parse -I/mnt/Data/Projects/nootka_io/sentry_dragon/venv311/include -I/usr/include/python3.11 -c resiliparse/extract/html2text.cpp -o build/temp.linux-x86_64-cpython-311/resiliparse/extract/html2text.o -std=c++17 -O3 -Wall -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function
In file included from /usr/include/lexbor/css/css.h:14,
from resiliparse/extract/html2text.cpp:864:
/usr/include/lexbor/css/stylesheet.h: In function ‘lxb_css_stylesheet_t* lxb_css_stylesheet_create(lexbor_mraw_t*)’:
/usr/include/lexbor/css/stylesheet.h:33:30: error: invalid conversion from ‘void*’ to ‘lxb_css_stylesheet_t*’ {aka ‘lxb_css_stylesheet*’} [-fpermissive]
33 | return lexbor_mraw_calloc(mraw, sizeof(lxb_css_stylesheet_t));
| ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| void*
error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for resiliparse
Failed to build resiliparse
ERROR: Could not build wheels for resiliparse, which is required to install pyproject.toml-based projects
It would be incredibly useful for this library to include type annotations and to declare itself as a PEP 561 compliant stub package.
Trying this piece of html... Is there something I can do to upgrade the underlying parser? I recall reading this...
from resiliparse.parse import detect_encoding
from resiliparse.parse.html import HTMLTree
from resiliparse.extract.html2text import extract_plain_text
html_byte = b'\n\n\n\n\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />\r\n<meta http-equiv="X-UA-Compatible" content="IE=9">\r\n<link rel="stylesheet" type="text/css" href="https://firgraf.oh.gov.hu/include/style.css" media="screen" />\r\n<title>Int\xc3\xa9zm\xc3\xa9nyi adatok</title>\r\n<!-- Global site tag (gtag.js) - Google Analytics -->\r\n<script async src="https://www.googletagmanager.com/gtag/js?id=UA-198540847-1"></script>\r\n<script>\r\n window.dataLayer = window.dataLayer || [];\r\n function gtag(){dataLayer.push(arguments);}\r\n gtag(\'js\', new Date());\r\n gtag(\'config\', \'UA-198540847-1\');\r\n</script>\r\n</head>\r\n<body>\r\n<table width="80%" cellpadding="0" cellspacing="0" align="center" style="border:3px solid;\r\nborder-radius:8px; border: 3px solid #0994dc; background-color:#FFFFFF">\r\n <tr>\r\n <td valign="top" rowspan="2" bgcolor=\'#FFFFFF\'></td>\r\n <td align=\'center\' height=\'70\' bgcolor=\'#FFFFFF\' style=\'font: bold small-caps 28px monospace;\'><img src=\'https://firgraf.oh.gov.hu/images/firgraf_logo.png\' width=\'1200\'></td>\r\n </tr>\r\n <tr>\r\n <td valign="top" align=\'center\' bgcolor="#FFFFFF">\r\n \r\n <table>\r\n\t<tr>\r\n\t <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/index.php">Kezd\xc5\x91lap</a></td>\r\n\t <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/kkk.php">K\xc3\xa9pz\xc3\xa9si \xc3\xa9s kimeneti k\xc3\xb6vetelm\xc3\xa9nyek</a></td>\r\n\t <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/int.php">Int\xc3\xa9zm\xc3\xa9nyi adatok</a></td>\r\n\t <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/torzs.php">T\xc3\xb6rzsadatok</a></td>\r\n\t <td class="menu"><a class="menu" 
href="https://firgraf.oh.gov.hu/prg/gyorslista.php">Gyorslist\xc3\xa1k</a></td>\r\n\t <td class="menu"><a class="menu" href="http://www.felvi.hu/hivataliugyek/">Vissza a felvi.hu-ra</a></td>\r\n\t</tr>\r\n </table>\r\n </td>\r\n </tr>\r\n <tr>\r\n <td bgcolor=\'#ffffff\'>\r\n \r\n </td>\r\n <td colspan="2" style="padding: 0.5em">\r\n <div align="center"><font size="4" color="#000000">Int\xc3\xa9zm\xc3\xa9nyi adatok</font></div><hr>\r\n <div align=\'left\' valign=\'top\'><form name=\'hataly\' method=\'get\' action=\'/prg/int.php?nyilvantartottszakid=36318\'><a href=\'/prg/int.php?hatalyvalt=hat\xc3\xa1lyoss\xc3\xa1g+bekapcsol\xc3\xa1sa&nyilvantartottszakid=36318\'>[A hat\xc3\xa1lyoss\xc3\xa1gi sz\xc5\xb1r\xc5\x91k bekapcsol\xc3\xa1sa.]</a></form>\n</div><form name=form1 method=post action=\'/prg/int.php?nyilvantartottszakid=36318\'><div align=\'left\' valign=\'top\'>\xe2\x96\xa0 <a href=\'kkk.php?graf=MSZKSMU\'>KKK teljes gr\xc3\xa1f</a> \xe2\x96\xa0 <a href=\'int.php?adatmod=nyilvszak&szervezetid=36\'>SZTE nyilv\xc3\xa1ntartott k\xc3\xa9pz\xc3\xa9sei</a><br>A gr\xc3\xa1fban a csom\xc3\xb3pontokra kattintva b\xc5\x91vebb inform\xc3\xa1ci\xc3\xb3 olvashat\xc3\xb3 az adott csom\xc3\xb3pontr\xc3\xb3l.<br>Gr\xc3\xa1fn\xc3\xa9zet: <select name=grafnezet>\n<option value="resz">csak a nyilv\xc3\xa1ntartott r\xc3\xa9szgr\xc3\xa1fot</option>\n<option value="mind">a teljes gr\xc3\xa1fban a nyilv\xc3\xa1ntartott r\xc3\xa9szgr\xc3\xa1fot</option>\n</select> mutatja.<br>A gr\xc3\xa1fban a ny\xc3\xadl kezdete \xc3\xa9s v\xc3\xa9ge k\xc3\xb6z\xc3\xb6tti minim\xc3\xa1lis t\xc3\xa1vols\xc3\xa1g: <select name=grafminlen>\n<option value="0">legkisebb</option>\n<option selected value="1">1 egys\xc3\xa9g</option>\n<option value="2">2 egys\xc3\xa9g</option>\n<option value="3">3 egys\xc3\xa9g</option>\n<option value="4">4 egys\xc3\xa9g</option>\n<option value="5">5 egys\xc3\xa9g</option>\n</select> (A nagyobb \xc3\xa9rt\xc3\xa9k szell\xc5\x91sebb\xc3\xa9 teszi az \xc3\xa1br\xc3\xa1t.)<br> 
<button type=\'submit\' style="background-color:#E5E5E5; color:#000000; font-size: 12px;" name=\'muv\' value=\'n\xc3\xa9zetet friss\xc3\xadt\'>n\xc3\xa9zetet friss\xc3\xadt</button> </div><br><table width=\'100%\' align=\'center\' border=\'0\'><tr><td width=\'50%\' align=\'left\' valign=\'top\'><a href=\'/prg/int.php?nyilvantartottszakid=36317\'>\xc2\xab el\xc5\x91z\xc5\x91: szoci\xc3\xa1lis munka (36317)</a></td><td width=\'50%\' align=\'right\'><a href=\'/prg/int.php?nyilvantartottszakid=6150\'>k\xc3\xb6vetkez\xc5\x91: szoci\xc3\xa1lpedag\xc3\xb3gia (6150) \xc2\xbb</a></td></tr></table>\n<br><div align=\'left\' valign=\'top\'><b><a href=\'torzsadat.php?tabla=szervezet&sid=70\'>(SZTE) Szegedi Tudom\xc3\xa1nyegyetem</a> - <a href=\'torzsadat.php?tabla=nyilvantartottszak&sid=21715\'>(MSZKSMU) szoci\xc3\xa1lis munka [36318]</a></b></div><br><div align=\'left\' valign=\'top\'><?xml version="1.0" encoding="UTF-8" standalone="no"?>\n<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"\n "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">\n<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n -->\n<!-- Title: MSZKSMU Pages: 1 -->\n<svg width="340pt" height="116pt"\n viewBox="0.00 0.00 340.00 116.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">\n<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 112)">\n<title>MSZKSMU</title>\n<polygon fill="#ffffff" stroke="transparent" points="-4,4 -4,-112 336,-112 336,4 -4,4"/>\n<g id="clust1" class="cluster">\n<title>cluster_vegzettseg</title>\n<polygon fill="none" stroke="#ffff00" points="231,-8 231,-62 324,-62 324,-8 231,-8"/>\n</g>\n<!-- START -->\n<g id="node1" class="node">\n<title>START</title>\n<ellipse fill="#d3d3d3" stroke="#d3d3d3" cx="27" cy="-63" rx="27" ry="18"/>\n<text text-anchor="middle" x="27" y="-60.8" font-family="Times,serif" font-size="9.00" fill="#000000">START</text>\n</g>\n<!-- MSZKSMU -->\n<g id="node2" class="node">\n<title>MSZKSMU</title>\n<g 
id="a_node2"><a xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=414" xlink:title="MSZKSMU\\nszoci\xc3\xa1lis munka">\n<polygon fill="#e0ffff" stroke="#e0ffff" points="164,-81 91,-81 91,-45 164,-45 164,-81"/>\n<text text-anchor="middle" x="127.5" y="-65.8" font-family="Times,serif" font-size="9.00" fill="#000000">MSZKSMU</text>\n<text text-anchor="middle" x="127.5" y="-55.8" font-family="Times,serif" font-size="9.00" fill="#000000">szoci\xc3\xa1lis munka</text>\n</a>\n</g>\n</g>\n<!-- START->MSZKSMU -->\n<g id="edge1" class="edge">\n<title>START->MSZKSMU</title>\n<path fill="none" stroke="#0000ff" stroke-width="2" d="M54.1967,-63C62.3906,-63 71.6286,-63 80.7147,-63"/>\n<polygon fill="#0000ff" stroke="#0000ff" stroke-width="2" points="80.8451,-66.5001 90.8451,-63 80.845,-59.5001 80.8451,-66.5001"/>\n</g>\n<!-- MSPCKSM -->\n<g id="node3" class="node">\n<title>MSPCKSM</title>\n<g id="a_node3"><a xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=5710" xlink:title="MSPCKSM\\nklinikai szoci\xc3\xa1lis munka">\n<polygon fill="#ffe4e1" stroke="#ffe4e1" points="328,-108 227,-108 227,-72 328,-72 328,-108"/>\n<text text-anchor="middle" x="277.5" y="-92.8" font-family="Times,serif" font-size="9.00" fill="#000000">MSPCKSM</text>\n<text text-anchor="middle" x="277.5" y="-82.8" font-family="Times,serif" font-size="9.00" fill="#000000">klinikai szoci\xc3\xa1lis munka</text>\n</a>\n</g>\n</g>\n<!-- MSZKSMU->MSPCKSM -->\n<g id="edge3" class="edge">\n<title>MSZKSMU->MSPCKSM</title>\n<path fill="none" stroke="#000000" d="M164.1941,-69.6049C179.9274,-72.4369 198.7348,-75.8223 216.4633,-79.0134"/>\n<polygon fill="#000000" stroke="#000000" points="216.2835,-82.5372 226.7454,-80.8642 217.5237,-75.6479 216.2835,-82.5372"/>\n</g>\n<!-- 1287 -->\n<g id="node4" class="node">\n<title>1287</title>\n<g id="a_node4"><a 
xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=vegzettseg&idmezo=vegzettsegid&id=1287" xlink:title="MMSAZMO\\nokleveles\\nszoci\xc3\xa1lis munk\xc3\xa1s">\n<polygon fill="#ffff00" stroke="#ffff00" points="316,-54 239,-54 239,-16 316,-16 316,-54"/>\n<text text-anchor="middle" x="277.5" y="-42.8" font-family="Times,serif" font-size="9.00" fill="#000000">MMSAZMO</text>\n<text text-anchor="middle" x="277.5" y="-32.8" font-family="Times,serif" font-size="9.00" fill="#000000">okleveles</text>\n<text text-anchor="middle" x="277.5" y="-22.8" font-family="Times,serif" font-size="9.00" fill="#000000">szoci\xc3\xa1lis munk\xc3\xa1s</text>\n</a>\n</g>\n</g>\n<!-- MSZKSMU->1287 -->\n<g id="edge2" class="edge">\n<title>MSZKSMU->1287</title>\n<path fill="none" stroke="#ff0000" d="M164.1941,-56.1504C183.6481,-52.519 207.8022,-48.0103 228.805,-44.0897"/>\n<polygon fill="#ff0000" stroke="#ff0000" points="229.6399,-47.4944 238.8279,-42.2188 228.3554,-40.6133 229.6399,-47.4944"/>\n<text text-anchor="middle" x="195.5" y="-54.6" font-family="Times,serif" font-size="8.00" fill="#ff0000">START</text>\n</g>\n</g>\n</svg>\n</div><br><br><div align=\'left\' valign=\'top\'><b>Nyilv\xc3\xa1ntartott szak:</div></b><table border=\'1\' cellpadding=\'2\' cellspacing=\'0\'><tr><td align=\'left\' valign=\'top\'><b>nyilv. szak ID</b></td><td align=\'left\' valign=\'top\'><b>k\xc3\xb3d</b></td><td align=\'left\' valign=\'top\'><b>n\xc3\xa9v</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g kezdete</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>meghird. kezdete</b></td><td align=\'left\' valign=\'top\'><b>meghird. 
v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>telephely</b></td><td align=\'left\' valign=\'top\'><b>nyelv</b></td><td align=\'left\' valign=\'top\'><b>munkarend</b></td></tr>\n<tr><td align=\'left\' valign=\'top\'><a href=\'torzsadat.php?tabla=nyilvantartottszak&sid=21715\'>36318</a></td><td align=\'left\' valign=\'top\'>MSZKSMU</td><td align=\'left\' valign=\'top\'>szoci\xc3\xa1lis munka</td><td align=\'left\' valign=\'top\'>2020-01-01</td><td align=\'left\' valign=\'top\'></td><td align=\'left\' valign=\'top\'>2020-01-01</td><td align=\'left\' valign=\'top\'></td><td align=\'left\' valign=\'top\'>Szeged</td><td align=\'left\' valign=\'top\'>magyar</td><td align=\'left\' valign=\'top\'>levelez\xc5\x91</td></tr>\n</table><div align=\'left\' valign=\'top\'><b>Nyilv\xc3\xa1ntartott k\xc3\xa9pz\xc3\xa9si elemek:</b></div><table border=\'1\' cellpadding=\'2\' cellspacing=\'0\'>\n<tr><td align=\'left\' valign=\'top\'><b>k\xc3\xb3d</b></td><td align=\'left\' valign=\'top\'><b>n\xc3\xa9v</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g kezdete</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>meghird. kezdete</b></td><td align=\'left\' valign=\'top\'><b>meghird. 
v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>t\xc3\xadpus</b></td><td align=\'left\' valign=\'top\'><b>minimum kredit</b></td><td align=\'left\' valign=\'top\'><b>maximum kredit</b></td></tr><tr><td align=\'left\' valig=\'top\'><a href=\'torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=5710\'>MSPCKSM</a></td><td align=\'left\' valig=\'top\'>klinikai szoci\xc3\xa1lis munka</td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>specializ\xc3\xa1ci\xc3\xb3</td><td align=\'left\' valig=\'top\'>35</td><td align=\'left\' valig=\'top\'>40</td></tr><tr><td align=\'left\' valig=\'top\'><a href=\'torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=414\'>MSZKSMU</a></td><td align=\'left\' valig=\'top\'>szoci\xc3\xa1lis munka</td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>szak</td><td align=\'left\' valig=\'top\'>120</td><td align=\'left\' valig=\'top\'>120</td></tr></table></form>\r\n </td>\r\n </tr>\r\n <tr>\r\n <td colspan="2" bgcolor=\'#0994dc\' width="100%">\r\n <table width="100%">\r\n\t<tr>\r\n\t <td align=\'left\'>\r\n\t <font size=\'1\' color=\'#ffffff\'>Az adatb\xc3\xa1zis 2022-09-24 hajnalban friss\xc3\xbclt.</font>\r\n\t </td>\r\n\t <td align="right">\r\n\t <font size=\'1\' color=\'#ffffff\'>K\xc3\xa9sz\xc3\xbclt az EKOP-1.A.1-08/C-2009-0009 "Az Oktat\xc3\xa1si Hivatal k\xc3\xb6zigazgat\xc3\xa1si szolg\xc3\xa1ltat\xc3\xa1sainak elektroniz\xc3\xa1l\xc3\xa1sa" projekt keret\xc3\xa9ben. © 2012.</font>\r\n\t </td>\r\n\t</tr>\r\n </td>\r\n </tr>\r\n</table>\r\n</body>\r\n</html>\r\n\n'
encoding = detect_encoding(html_byte)
tree = HTMLTree.parse_from_bytes(html_byte, encoding)
str(tree)
The "index" command of the fastwarc command-line tool reports an erroneous record length of zero for some records of a gzipped WARC file:
$> wget http://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2021/09/CC-NEWS-20210930113548-00741.warc.gz
$> fastwarc index -fwarc-type,warc-target-uri,offset,length CC-NEWS-20210930113548-00741.warc.gz \
| grep -F '"length": "0"'
{"warc-type": "response", "warc-target-uri": "https://www.themarketsdaily.com/2021/09/30/ishares-sp-500-etf-nysearcaivv-sees-strong-trading-volume.html", "offset": "232757027", "length": "0"}
{"warc-type": "response", "warc-target-uri": "https://www.timeturk.com/yasam/baskan-buyukkilic-10-milyon-tl-yatirim-yapilan-yeralti-carsisi-nda-incelemede-bulundu-esnafi-ziyaret-etti/haber-1703634", "offset": "278528237", "length": "0"}
{"warc-type": "response", "warc-target-uri": "https://www.sondakika.com/haber/haber-yayinci-tevfik-rauf-baysal-vefat-etti-14429565/", "offset": "1044381471", "length": "0"}
See also the discussion in #11; however, fewer records are affected here. With the uncompressed WARC file, the error is not reproducible.
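As an alternative to the grep filter above, the affected records can be picked out of the index output in Python. This is a minimal sketch that only assumes the JSON-lines format shown in the output above (string values, one object per line); the helper name zero_length_records is hypothetical and not part of fastwarc.

```python
import json


def zero_length_records(index_lines):
    """Yield index entries whose "length" field equals zero.

    `index_lines` is any iterable of JSON lines as printed by
    `fastwarc index -fwarc-type,warc-target-uri,offset,length ...`.
    """
    for line in index_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        # The index prints numbers as strings, so convert before comparing.
        if int(entry.get("length", -1)) == 0:
            yield entry


# Illustrative input only, modelled on the output format above.
sample = [
    '{"warc-type": "response", "offset": "232757027", "length": "0"}',
    '{"warc-type": "response", "offset": "1000", "length": "5120"}',
]
for entry in zero_length_records(sample):
    print(entry["offset"])
```

To use it on real output, the index command can be piped into a script that passes sys.stdin to zero_length_records.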
The ArchiveIterator, or rather the underlying stream_io.BufferedReader, hangs when reading a truncated gzipped WARC file (e.g. an incomplete download). The issue can be reproduced by reading clipped.warc.gz, see iipc/jwarc#17. The stack during the hang (instead of ftell, I've also observed stream_io.FileStream.read() on top of _refill_working_buf()):
#3 0x00007f98a34f8705 in __GI__IO_ftell (fp=0x19b3790) at ioftell.c:38
#4 0x00007f98a2764766 in __pyx_f_8fastwarc_9stream_io_10GZipStream__refill_working_buf (__pyx_v_self=0x7f98a19fad60, __pyx_v_size=16384)
at fastwarc/stream_io.cpp:4944
#5 0x00007f98a276d500 in __pyx_f_8fastwarc_9stream_io_10GZipStream_read (__pyx_v_self=0x7f98a19fad60, __pyx_v_out="", __pyx_v_size=16384)
at fastwarc/stream_io.cpp:5191
#6 0x00007f98a27645bc in __pyx_f_8fastwarc_9stream_io_14BufferedReader__fill_buf (__pyx_v_self=0x7f98a19fb9a0) at fastwarc/stream_io.cpp:9201
#7 0x00007f98a276ce6b in __pyx_f_8fastwarc_9stream_io_14BufferedReader_read (__pyx_v_self=0x7f98a19fb9a0, __pyx_skip_dispatch=<optimized out>,
__pyx_optional_args=<optimized out>) at fastwarc/stream_io.cpp:9684
#8 0x00007f98a2765d75 in __pyx_pf_8fastwarc_9stream_io_14BufferedReader_4read (__pyx_v_size=16384, __pyx_v_self=0x7f98a19fb9a0)
at fastwarc/stream_io.cpp:9840
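Until the hang is fixed, one possible workaround is to check a file for truncation before handing it to the ArchiveIterator. The sketch below uses only Python's standard gzip module, not fastwarc; the function name gzip_is_truncated is hypothetical. It relies on gzip raising EOFError when a stream ends before its end-of-stream marker, and it decompresses the whole file once, so it is not cheap for large archives.

```python
import gzip
import io


def gzip_is_truncated(fileobj, chunk_size=16384):
    """Return True if the gzip stream ends before its end-of-stream marker.

    `fileobj` is a binary file-like object. Multi-member gzip files, as
    used for WARC records, are read member by member by the gzip module.
    """
    try:
        with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
            # Decompress and discard everything; only the EOF behaviour matters.
            while gz.read(chunk_size):
                pass
    except EOFError:
        return True
    return False


# Self-contained demonstration with in-memory data.
complete = gzip.compress(bytes(range(256)) * 400)
clipped = complete[: len(complete) // 2]
print(gzip_is_truncated(io.BytesIO(complete)))
print(gzip_is_truncated(io.BytesIO(clipped)))
```

For a file on disk, the same check would be called with an object opened via open(path, "rb").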
After running some benchmarks on Resiliparse's HTML2Text extraction, extract_plain_text(tree, main_content=True), it seems the extract_plain_text method is significantly slower in parallel than sequentially.
sequentially : 508.147 items/sec
parallel : 62.7322 items/sec
I ran the benchmarking with a tool I wrote, https://github.com/Nootka-io/wee-benchmarking-tool. I'll work on pulling out a minimal example.
It seems strange to me, and I'm not sure where to begin profiling/debugging. Other libraries see little improvement, but resiliparse is the only one showing a dramatic drop, although it's still the fastest.
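A minimal example for this kind of comparison could look roughly like the sketch below. It is not the benchmarking tool linked above: the extract function is a hypothetical stand-in for the real extraction call, and the use of a thread pool is an assumption, since the report does not say how the parallel run was set up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def extract(doc):
    # Hypothetical stand-in for the real extraction call; replace as needed.
    return " ".join(doc.split())


def items_per_sec(fn, items, executor=None):
    """Apply `fn` to all items and return (throughput, results).

    Runs sequentially when `executor` is None, otherwise via executor.map().
    """
    start = time.perf_counter()
    if executor is None:
        results = [fn(item) for item in items]
    else:
        results = list(executor.map(fn, items))
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    return len(results) / elapsed, results


docs = ["<p>some   sample   text</p>"] * 1000

seq_rate, seq_results = items_per_sec(extract, docs)
with ThreadPoolExecutor(max_workers=4) as pool:
    par_rate, par_results = items_per_sec(extract, docs, pool)

print(f"sequentially : {seq_rate:.1f} items/sec")
print(f"parallel     : {par_rate:.1f} items/sec")
```

Comparing both runs on identical input, and checking that the results match, helps separate scheduling overhead from time spent in the extraction itself.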