
chatnoir-eu / chatnoir-resiliparse

A robust web archive analytics toolkit

Home Page: https://resiliparse.chatnoir.eu

License: Apache License 2.0

Languages: Python 30.80%, Cython 54.44%, CMake 0.30%, C 13.37%, C++ 0.80%, Dockerfile 0.29%
Topics: python, web, warc, bigdata, cython, cpp, extraction, webarchive, htmlparser

chatnoir-resiliparse's People

Contributors

cclauss, jmfrees, mam10eks, niklasdeckers, phoerious, querela, sebastian-nagel

chatnoir-resiliparse's Issues

pipx run resiliparse failed: ModuleNotFoundError: No module named 'joblib'

user@box:~$ pipx run resiliparse
Traceback (most recent call last):
  File "/home/user/.local/pipx/.cache/42f25da10f76b98/bin/resiliparse", line 5, in <module>
    from resiliparse.cli import main
  File "/home/user/.local/pipx/.cache/42f25da10f76b98/lib/python3.9/site-packages/resiliparse/cli.py", line 18, in <module>
    from joblib import Parallel, delayed
ModuleNotFoundError: No module named 'joblib'
user@box:~$ pipx install resiliparse
  installed package resiliparse 0.11.1, installed using Python 3.9.2
  These apps are now globally available
    - resiliparse
done! ✨ 🌟 ✨
user@box:~$ resiliparse
Traceback (most recent call last):
  File "/home/user/.local/bin/resiliparse", line 5, in <module>
    from resiliparse.cli import main
  File "/home/user/.local/pipx/venvs/resiliparse/lib/python3.9/site-packages/resiliparse/cli.py", line 18, in <module>
    from joblib import Parallel, delayed
ModuleNotFoundError: No module named 'joblib'
user@box:~$ 

Installing fastwarc via `pip install` fails if compilation is required or requested

  • applies to fastwarc 0.6.6 and 0.7.0 (0.6.5 successfully installed)
  • seen on Ubuntu 20.04 and 21.04
  • on amd64 with pip3 install --no-binary fastwarc fastwarc
  • or on aarch64 with pip3 install fastwarc (no binaries provided for ARM CPUs)

The error message indicates that fastwarc is now too tightly coupled to resiliparse:

  ERROR: Command errored out with exit status 1:
...  
  from resiliparse_common.string_util cimport str_to_lower, strip_str, strip_c_str
  ^
  ------------------------------------------------------------
  
  fastwarc/warc.pyx:32:0: 'resiliparse_common/string_util.pxd' not found

Building from a checkout of chatnoir-resiliparse via pip3 wheel -e fastwarc also succeeds on ARM-based systems.

fatal error: html.h: No such file or directory

Hello,

I'm trying to build Resiliparse 0.13.7 from source, and I'm getting this error. Can you tell me which library Resiliparse is expecting to get html.h from? I suspect I'm missing a dependency.

resiliparse/extract/html2text.cpp:869:10: fatal error: html.h: No such file or directory
#include "html.h"
^~~~~~~~

Thanks,
Dave

Resiliparse does not compile under Ubuntu 18

It seems that Resiliparse does not compile under Ubuntu 18. The build fails with this error message:

  building 'fastwarc.warc' extension
  creating build/temp.linux-x86_64-cpython-37
  creating build/temp.linux-x86_64-cpython-37/fastwarc
  gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/include/python3.7m -c fastwarc/warc.cpp -o build/temp.linux-x86_64-cpython-37/fastwarc/warc.o -std=c++17 -O3 -Wall -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function
  cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
  fastwarc/warc.cpp:1348:3: error: ‘LZ4F_cctx’ does not name a type; did you mean ‘LZ4F_cctx_s’?
     LZ4F_cctx *cctx;
     ^~~~~~~~~
     LZ4F_cctx_s
  fastwarc/warc.cpp:1349:3: error: ‘LZ4F_dctx’ does not name a type; did you mean ‘LZ4F_dctx_s’?
     LZ4F_dctx *dctx;
     ^~~~~~~~~
     LZ4F_dctx_s
  error: command '/usr/bin/gcc' failed with exit code 1
  ----------------------------------------
  ERROR: Failed building wheel for fastwarc

Could it be that the lz4 package from the Ubuntu 18 repository is the wrong version?

When I install lz4 from source, it works:

git clone https://github.com/lz4/lz4
cd lz4
make
make install

Since installing lz4 from source resolves the problem, this might not have the highest priority.
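
To see which liblz4 a system actually provides, the library's own version number can be queried. The following is a minimal sketch, assuming a shared liblz4 is on the loader path. The helper names are hypothetical, and the minimum version that fastwarc needs is not stated in this report. Note that this inspects the runtime library, whereas the compile error above comes from the installed headers, which normally (but not necessarily) belong to the same version.

```python
import ctypes
import ctypes.util
from typing import Optional, Tuple


def decode_lz4_version(number: int) -> Tuple[int, int, int]:
    """Split LZ4's packed version number (major*10000 + minor*100 + release)."""
    major, rest = divmod(number, 10000)
    minor, release = divmod(rest, 100)
    return major, minor, release


def system_lz4_version() -> Optional[Tuple[int, int, int]]:
    """Return the version of the shared liblz4 found by the loader, or None."""
    path = ctypes.util.find_library("lz4")
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
        # LZ4_versionNumber() is declared in lz4.h as `int LZ4_versionNumber(void)`.
        lib.LZ4_versionNumber.restype = ctypes.c_int
        return decode_lz4_version(lib.LZ4_versionNumber())
    except (OSError, AttributeError):
        # Library could not be loaded or is too old to export the symbol.
        return None


print(system_lz4_version())
```

If this prints an older version than the one built from source, the packaged lz4 is the likely culprit.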

yum install

Hi,

Thanks for the very nice package.

Do you know which dependencies should be installed with yum?
I am struggling to build FastWARC from source within a Lambda container. Here is my Dockerfile:

FROM public.ecr.aws/lambda/python:3.8

RUN yum groupinstall "Development Tools" -y
RUN yum install python3-devel -y
RUN yum install -y zlib-devel lz4-devel liblexbor-devel uchardet-devel 
RUN pip3 install --no-binary fastwarc fastwarc --target "${LAMBDA_TASK_ROOT}"

COPY app.py ${LAMBDA_TASK_ROOT}
CMD [ "app.handler" ]

This is the error message:

  ERROR: Command errored out with exit status 1:
   command: /var/lang/bin/python3.8 /var/lang/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /tmp/tmparkimzwm
       cwd: /tmp/pip-install-1hzfg9i1/fastwarc_fcfee32f14f34b609444e2992925ac95
  Complete output (26 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-3.8
  creating build/lib.linux-x86_64-3.8/fastwarc
  copying fastwarc/cli.py -> build/lib.linux-x86_64-3.8/fastwarc
  copying fastwarc/__init__.py -> build/lib.linux-x86_64-3.8/fastwarc
  copying fastwarc/stream_io.pxd -> build/lib.linux-x86_64-3.8/fastwarc
  copying fastwarc/warc.pxd -> build/lib.linux-x86_64-3.8/fastwarc
  copying fastwarc/__init__.pxd -> build/lib.linux-x86_64-3.8/fastwarc
  running build_ext
  building 'fastwarc.warc' extension
  creating build/temp.linux-x86_64-3.8
  creating build/temp.linux-x86_64-3.8/fastwarc
  gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/var/lang/include/python3.8 -c fastwarc/warc.cpp -o build/temp.linux-x86_64-3.8/fastwarc/warc.o -std=c++17 -O3 -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function -fpermissive -Wno-c++11-narrowing
  g++ -pthread -shared -Wl,-rpath=/var/lang/lib build/temp.linux-x86_64-3.8/fastwarc/warc.o -L/var/lang/lib -o build/lib.linux-x86_64-3.8/fastwarc/warc.cpython-38-x86_64-linux-gnu.so -std=c++17
  building 'fastwarc.stream_io' extension
  gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/var/lang/include/python3.8 -c fastwarc/stream_io.cpp -o build/temp.linux-x86_64-3.8/fastwarc/stream_io.o -std=c++17 -O3 -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function -fpermissive -Wno-c++11-narrowing
  fastwarc/stream_io.cpp: In function ‘int __pyx_pf_8fastwarc_9stream_io_9LZ4Stream_2__cinit__(__pyx_obj_8fastwarc_9stream_io_LZ4Stream*, PyObject*, PyObject*, PyObject*)’:
  fastwarc/stream_io.cpp:7441:23: error: ‘struct LZ4F_preferences_t’ has no member named ‘favorDecSpeed’
     __pyx_v_self->prefs.favorDecSpeed = __pyx_t_4;
                         ^~~~~~~~~~~~~
  At global scope:
  cc1plus: warning: unrecognized command line option ‘-Wno-c++11-narrowing’
  error: command 'gcc' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for fastwarc

Many thanks!

Random or Chunked Reading

Thanks for developing fastwarc! It's been a great tool while I've been exploring text extraction from Common Crawl with Python.

I'm interested in using it as part of a much larger pipeline, but I want to enable chunked processing and I'm curious whether this is possible. My understanding at the moment is that the ArchiveIterator gives me a nice handle to process all WarcRecords sequentially. What I'd like to be able to do is something like:

archv = ArchiveChunked(open(...), ...)
recs = archv[N:N+10] # select 10 records starting at N

Doing this should allow me to leverage batch processing functionality and distribute the processing across multiple cores.
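
Until something like this exists, a rough approximation is possible on top of any sequential record iterator. The sketch below is not a FastWARC API: `select` and `batches` are hypothetical helpers built on `itertools.islice`, and selecting a window still reads past all earlier records, so it is not true random access.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def select(records: Iterable[T], start: int, count: int) -> List[T]:
    """Return `count` items beginning at position `start` (skips sequentially)."""
    return list(islice(records, start, start + count))


def batches(records: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to `size` items until the input is exhausted."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Assumed usage with FastWARC (file name is a placeholder):
#
#   from fastwarc.warc import ArchiveIterator
#   with open("example.warc.gz", "rb") as stream:
#       for batch in batches(ArchiveIterator(stream), 10):
#           ...  # hand the batch to a worker
```

Because the archive is read as a stream, it is probably safest to read each record's payload before the iterator advances. Whether records remain usable afterwards is an assumption to check against the FastWARC documentation.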

setuptools.config.pyprojecttoml has no attribute _BetaConfiguration

AttributeError: module 'setuptools.config.pyprojecttoml' has no attribute '_BetaConfiguration'. Did you mean: 'read_configuration'?

This fails to build in Docker on Apple silicon.

It builds fine on Linux and also outside of Docker on native macOS.

pip install fastwarc==0.14.5
Collecting fastwarc==0.14.5
  Using cached FastWARC-0.14.5.tar.gz (42 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [18 lines of output]
      Traceback (most recent call last):
        File "/opt/conda/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/opt/conda/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/conda/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-build-env-jytdl1uv/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-build-env-jytdl1uv/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
          self.run_setup()
        File "/tmp/pip-build-env-jytdl1uv/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 311, in run_setup
          exec(code, locals())
        File "<string>", line 34, in <module>
      AttributeError: module 'setuptools.config.pyprojecttoml' has no attribute '_BetaConfiguration'. Did you mean: 'read_configuration'?
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

pipx run fastwarc check failed: binascii.Error: Non-base32 digit found

$ pipx run --verbose fastwarc check /tmp/warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
pipx >(setup:729): pipx version is 1.0.0
pipx >(setup:730): Default python interpreter is '/home/user/.local/pipx/venvs/pipx/bin/python'
pipx >(needs_upgrade:69): Time since last upgrade of shared libs, in seconds: 1561898. Upgrade will be run by pipx if greater than 2592000.
pipx >(run_subprocess:172): running /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/python -c import sysconfig; print(sysconfig.get_path('purelib'))
pipx >(run:103): Reusing cached venv /home/user/.local/pipx/.cache/7a73b1e86637c39
pipx >(run_subprocess:172): running /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/python -c import sysconfig; print(sysconfig.get_path('purelib'))
pipx >(exec_app:387): exec_app: /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc check /tmp/warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
0 records were verified successfully.                           
1 records were skipped without digest.
Error in sys.excepthook:
Traceback (most recent call last):
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/fastwarc/cli.py", line 138, in check
    for v in pbar:
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "fastwarc/tools.pyx", line 178, in verify_digests
  File "fastwarc/warc.pyx", line 922, in fastwarc.warc.WarcRecord.verify_block_digest
  File "fastwarc/warc.pyx", line 934, in fastwarc.warc.WarcRecord.verify_block_digest
  File "fastwarc/warc.pyx", line 872, in fastwarc.warc.WarcRecord._verify_digest
  File "/usr/lib/python3.9/base64.py", line 231, in b32decode
    raise binascii.Error('Non-base32 digit found') from None
binascii.Error: Non-base32 digit found

Original exception was:
Traceback (most recent call last):
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/fastwarc/cli.py", line 138, in check
    for v in pbar:
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "fastwarc/tools.pyx", line 178, in verify_digests
  File "fastwarc/warc.pyx", line 922, in fastwarc.warc.WarcRecord.verify_block_digest
  File "fastwarc/warc.pyx", line 934, in fastwarc.warc.WarcRecord.verify_block_digest
  File "fastwarc/warc.pyx", line 872, in fastwarc.warc.WarcRecord._verify_digest
  File "/usr/lib/python3.9/base64.py", line 231, in b32decode
    raise binascii.Error('Non-base32 digit found') from None
binascii.Error: Non-base32 digit found
$
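
A possible explanation, which this report does not confirm, is that the file stores its block digest hex-encoded rather than Base32-encoded, so that `b32decode` rejects digits such as 0, 1, 8, and 9. The following is a minimal sketch of a decoder that tolerates both encodings. `decode_digest` is a hypothetical helper, not part of FastWARC.

```python
import base64
import binascii
import hashlib
from typing import Tuple


def decode_digest(header_value: str) -> Tuple[str, bytes]:
    """Decode a '<algorithm>:<digest>' header value given as hex or Base32."""
    algo, sep, encoded = header_value.partition(":")
    if not sep:
        raise ValueError("expected '<algorithm>:<digest>'")
    algo = algo.strip().lower()
    encoded = encoded.strip()
    # Raw digest length for this algorithm; unknown names raise ValueError.
    size = hashlib.new(algo).digest_size

    # A hex digest is exactly twice as long as the raw digest.
    if len(encoded) == 2 * size:
        try:
            return algo, bytes.fromhex(encoded)
        except ValueError:
            pass  # not hex after all, fall through to Base32

    # Base32 input may come without '=' padding, so restore it first.
    padded = encoded.upper() + "=" * (-len(encoded) % 8)
    try:
        raw = base64.b32decode(padded)
    except binascii.Error as exc:
        raise ValueError("digest is neither valid hex nor valid Base32") from exc
    if len(raw) != size:
        raise ValueError("digest has an unexpected length")
    return algo, raw


algo, raw = decode_digest("sha1:OLD2B4B3YRYMAUJAQNATRPULWOOXO3YP")
print(algo, len(raw))  # sha1 20
```

Dispatching on length first avoids misreading a hex string that happens to contain only Base32-legal characters.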

Fix HTTP status code parsing (reason phrase may contain spaces)

The field WarcRecord.http_headers could include the HTTP status code, or the status code could be provided as a separate attribute of WarcRecord.

When reading a record, it is not easy to see which status code a response had. For example, as far as I can see, I cannot filter for only 301 redirects (or keep only 200 responses for further processing). The other HTTP headers are parsed, but the HTTP status line is not. It has a simple format, e.g. HTTP/1.X XXX Description, which could be integrated into the existing HTTP header parsing. I also found no simple way, comparable to .reader, to access the raw HTTP communication.

Example:

>>> record.headers
{'WARC-Type': 'response', 'WARC-Target-URI': 'http://vgperson.com/robots.txt', 'WARC-Date': '2021-08-09T13:25:55Z', 'WARC-Payload-Digest': 'sha1:OLD2B4B3YRYMAUJAQNATRPULWOOXO3YP', 'WARC-IP-Address': '85.214.122.46', 'WARC-Record-ID': '<urn:uuid:5eeb5cea-38e6-4904-a4f9-077a162bc0d6>', 'Content-Type': 'application/http; msgtype=response', 'Content-Length': '454'}
>>> record.http_headers
{'Date': 'Mon, 09 Aug 2021 13:25:53 GMT', 'Server': 'Apache', 'Location': 'https://vgperson.com/robots.txt', 'Content-Length': '239', 'Connection': 'close', 'Content-Type': 'text/html; charset=iso-8859-1'}
>>> content = record.reader.read()
>>> assert len(content) == record.content_length  # content only includes the real content, no access to HTTP stuff
>>> print(content)
b'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>301 Moved Permanently</title>\n</head><body>\n<h1>Moved Permanently</h1>\n<p>The document has moved <a href="https://vgperson.com/robots.txt">here</a>.</p>\n</body></html>\n'

HTTP communication:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://vgperson.com/robots.txt
WARC-Date: 2021-08-09T13:25:55Z
WARC-Payload-Digest: sha1:OLD2B4B3YRYMAUJAQNATRPULWOOXO3YP
WARC-IP-Address: 85.214.122.46
WARC-Record-ID: <urn:uuid:5eeb5cea-38e6-4904-a4f9-077a162bc0d6>
Content-Type: application/http; msgtype=response
Content-Length: 454

HTTP/1.1 301 Moved Permanently
Date: Mon, 09 Aug 2021 13:25:53 GMT
Server: Apache
Location: https://vgperson.com/robots.txt
Content-Length: 239
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://vgperson.com/robots.txt">here</a>.</p>
</body></html>
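
The status line in the record above could be parsed along these lines. This is only a sketch of the idea, not FastWARC's implementation. `StatusLine` and `parse_status_line` are hypothetical names. Splitting at most twice keeps a reason phrase such as "Moved Permanently" intact even though it contains a space.

```python
from typing import NamedTuple


class StatusLine(NamedTuple):
    version: str
    status_code: int
    reason: str


def parse_status_line(line: bytes) -> StatusLine:
    """Parse an HTTP status line such as b'HTTP/1.1 301 Moved Permanently'."""
    text = line.decode("iso-8859-1").rstrip("\r\n")
    # At most two splits: version, status code, and the rest as reason phrase.
    parts = text.split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/"):
        raise ValueError("not an HTTP status line: %r" % text)
    reason = parts[2] if len(parts) == 3 else ""  # the reason phrase is optional
    return StatusLine(parts[0], int(parts[1]), reason)


print(parse_status_line(b"HTTP/1.1 301 Moved Permanently\r\n"))
# StatusLine(version='HTTP/1.1', status_code=301, reason='Moved Permanently')
```

With such a value available on the record, filtering for redirects would reduce to comparing `status_code` against 301.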

Problem with LZ4F_cctx and LZ4F_dctx

I'm getting this error both when installing with pip and when building from this repo:

fastwarc/warc.cpp:1189:3: error: ‘LZ4F_cctx’ does not name a type; did you mean ‘LZ4F_cctx_s’?
LZ4F_cctx *cctx;
^~~~~~~~~
LZ4F_cctx_s
fastwarc/warc.cpp:1190:3: error: ‘LZ4F_dctx’ does not name a type; did you mean ‘LZ4F_dctx_s’?
LZ4F_dctx *dctx;
^~~~~~~~~
LZ4F_dctx_s
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Cannot install on Python 3.11 in Ubuntu Docker

I am getting this error inside an ubuntu:23.04 Docker container:

pip install fastwarc
Collecting fastwarc
 Downloading FastWARC-0.14.5.tar.gz (42 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.6/42.6 kB 2.0 MB/s eta 0:00:00
 Installing build dependencies ... done
 Getting requirements to build wheel ... error
 error: subprocess-exited-with-error

 × Getting requirements to build wheel did not run successfully.
 │ exit code: 1
 ╰─> [18 lines of output]
     Traceback (most recent call last):
       File "/usr/lib/python3/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
         main()
       File "/usr/lib/python3/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
         json_out['return_val'] = hook(**hook_input['kwargs'])
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       File "/usr/lib/python3/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
         return hook(config_settings)
                ^^^^^^^^^^^^^^^^^^^^^
       File "/tmp/pip-build-env-l7hwcul9/overlay/local/lib/python3.11/dist-packages/setuptools/build_meta.py", line 355, in get_requires_for_build_wheel
         return self._get_build_requires(config_settings, requirements=['wheel'])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       File "/tmp/pip-build-env-l7hwcul9/overlay/local/lib/python3.11/dist-packages/setuptools/build_meta.py", line 325, in _get_build_requires
         self.run_setup()
       File "/tmp/pip-build-env-l7hwcul9/overlay/local/lib/python3.11/dist-packages/setuptools/build_meta.py", line 341, in run_setup
         exec(code, locals())
       File "<string>", line 34, in <module>
     AttributeError: module 'setuptools.config.pyprojecttoml' has no attribute '_BetaConfiguration'. Did you mean: 'read_configuration'?
     [end of output]

 note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Trouble building in Python 3.11

$ pip install --no-binary resiliparse resiliparse

DEPRECATION: --no-binary currently disables reading from the cache of locally built wheels. In the future --no-binary will not influence the wheel cache. pip 23.1 will enforce this behaviour change. A possible replacement is to use the --no-cache-dir option. You can use the flag --use-feature=no-binary-enable-wheel-cache to test the upcoming behaviour. Discussion can be found at https://github.com/pypa/pip/issues/11453
Collecting resiliparse
  Using cached Resiliparse-0.13.7.tar.gz (601 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting fastwarc==0.13.7
  Using cached FastWARC-0.13.7-cp311-cp311-linux_x86_64.whl
Collecting brotli
  Using cached Brotli-1.0.9-cp311-cp311-linux_x86_64.whl
Requirement already satisfied: click in ./venv311/lib/python3.11/site-packages (from fastwarc==0.13.7->resiliparse) (8.0.4)
Requirement already satisfied: tqdm in ./venv311/lib/python3.11/site-packages (from fastwarc==0.13.7->resiliparse) (4.64.1)
Building wheels for collected packages: resiliparse
  Building wheel for resiliparse (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for resiliparse (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [50 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-311
      creating build/lib.linux-x86_64-cpython-311/resiliparse
      copying resiliparse/cli.py -> build/lib.linux-x86_64-cpython-311/resiliparse
      copying resiliparse/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse
      creating build/lib.linux-x86_64-cpython-311/resiliparse/beam
      copying resiliparse/beam/coders.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
      copying resiliparse/beam/textio.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
      copying resiliparse/beam/warcio.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
      copying resiliparse/beam/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
      copying resiliparse/beam/elasticsearch.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
      copying resiliparse/beam/fileio.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
      creating build/lib.linux-x86_64-cpython-311/resiliparse/extract
      copying resiliparse/extract/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse/extract
      creating build/lib.linux-x86_64-cpython-311/resiliparse/parse
      copying resiliparse/parse/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
      copying resiliparse/__init__.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse
      copying resiliparse/itertools.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse
      copying resiliparse/process_guard.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse
      copying resiliparse/extract/__init__.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/extract
      copying resiliparse/extract/html2text.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/extract
      copying resiliparse/parse/__init__.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
      copying resiliparse/parse/html.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
      copying resiliparse/parse/http.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
      copying resiliparse/parse/lang.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
      copying resiliparse/parse/encoding.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
      copying resiliparse/parse/lang_profiles.h -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
      copying resiliparse/parse/encoding.h -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
      copying resiliparse/parse/html.h -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
      running build_ext
      building 'resiliparse.itertools' extension
      creating build/temp.linux-x86_64-cpython-311
      creating build/temp.linux-x86_64-cpython-311/resiliparse
      x86_64-linux-gnu-gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/mnt/Data/Projects/nootka_io/sentry_dragon/venv311/include -I/usr/include/python3.11 -c resiliparse/itertools.cpp -o build/temp.linux-x86_64-cpython-311/resiliparse/itertools.o -std=c++17 -O3 -Wall -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function
      x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -g -fwrapv -O2 build/temp.linux-x86_64-cpython-311/resiliparse/itertools.o -L/usr/lib/x86_64-linux-gnu -o build/lib.linux-x86_64-cpython-311/resiliparse/itertools.cpython-311-x86_64-linux-gnu.so -std=c++17
      building 'resiliparse.extract.html2text' extension
      creating build/temp.linux-x86_64-cpython-311/resiliparse/extract
      x86_64-linux-gnu-gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I./resiliparse/parse -I/mnt/Data/Projects/nootka_io/sentry_dragon/venv311/include -I/usr/include/python3.11 -c resiliparse/extract/html2text.cpp -o build/temp.linux-x86_64-cpython-311/resiliparse/extract/html2text.o -std=c++17 -O3 -Wall -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function
      In file included from /usr/include/lexbor/css/css.h:14,
                       from resiliparse/extract/html2text.cpp:864:
      /usr/include/lexbor/css/stylesheet.h: In function ‘lxb_css_stylesheet_t* lxb_css_stylesheet_create(lexbor_mraw_t*)’:
      /usr/include/lexbor/css/stylesheet.h:33:30: error: invalid conversion from ‘void*’ to ‘lxb_css_stylesheet_t*’ {aka ‘lxb_css_stylesheet*’} [-fpermissive]
         33 |     return lexbor_mraw_calloc(mraw, sizeof(lxb_css_stylesheet_t));
            |            ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            |                              |
            |                              void*
      error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for resiliparse
Failed to build resiliparse
ERROR: Could not build wheels for resiliparse, which is required to install pyproject.toml-based projects

resiliparse crashes in colab

Trying this piece of html... Is there something I can do to upgrade the underlying parser? I recall reading this...

from resiliparse.parse import detect_encoding
from resiliparse.parse.html import HTMLTree
from resiliparse.extract.html2text import extract_plain_text
html_byte = b'\n\n\n\n\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />\r\n<meta http-equiv="X-UA-Compatible" content="IE=9">\r\n<link rel="stylesheet" type="text/css" href="https://firgraf.oh.gov.hu/include/style.css" media="screen" />\r\n<title>Int\xc3\xa9zm\xc3\xa9nyi adatok</title>\r\n<!-- Global site tag (gtag.js) - Google Analytics -->\r\n<script async src="https://www.googletagmanager.com/gtag/js?id=UA-198540847-1"></script>\r\n<script>\r\n  window.dataLayer = window.dataLayer || [];\r\n  function gtag(){dataLayer.push(arguments);}\r\n  gtag(\'js\', new Date());\r\n  gtag(\'config\', \'UA-198540847-1\');\r\n</script>\r\n</head>\r\n<body>\r\n<table width="80%" cellpadding="0" cellspacing="0" align="center" style="border:3px solid;\r\nborder-radius:8px; border: 3px solid #0994dc; background-color:#FFFFFF">\r\n  <tr>\r\n    <td valign="top" rowspan="2" bgcolor=\'#FFFFFF\'></td>\r\n    <td align=\'center\' height=\'70\' bgcolor=\'#FFFFFF\' style=\'font: bold small-caps 28px monospace;\'><img src=\'https://firgraf.oh.gov.hu/images/firgraf_logo.png\' width=\'1200\'></td>\r\n  </tr>\r\n  <tr>\r\n    <td valign="top" align=\'center\' bgcolor="#FFFFFF">\r\n      \r\n      <table>\r\n\t<tr>\r\n\t  <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/index.php">Kezd\xc5\x91lap</a></td>\r\n\t  <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/kkk.php">K\xc3\xa9pz\xc3\xa9si \xc3\xa9s kimeneti k\xc3\xb6vetelm\xc3\xa9nyek</a></td>\r\n\t  <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/int.php">Int\xc3\xa9zm\xc3\xa9nyi adatok</a></td>\r\n\t  <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/torzs.php">T\xc3\xb6rzsadatok</a></td>\r\n\t  <td class="menu"><a class="menu" 
href="https://firgraf.oh.gov.hu/prg/gyorslista.php">Gyorslist\xc3\xa1k</a></td>\r\n\t  <td class="menu"><a class="menu" href="http://www.felvi.hu/hivataliugyek/">Vissza a felvi.hu-ra</a></td>\r\n\t</tr>\r\n      </table>\r\n    </td>\r\n  </tr>\r\n  <tr>\r\n    <td bgcolor=\'#ffffff\'>\r\n      &nbsp;\r\n    </td>\r\n    <td colspan="2" style="padding: 0.5em">\r\n      <div align="center"><font size="4" color="#000000">Int\xc3\xa9zm\xc3\xa9nyi adatok</font></div><hr>\r\n      <div align=\'left\' valign=\'top\'><form name=\'hataly\' method=\'get\' action=\'/prg/int.php?nyilvantartottszakid=36318\'><a href=\'/prg/int.php?hatalyvalt=hat\xc3\xa1lyoss\xc3\xa1g+bekapcsol\xc3\xa1sa&nyilvantartottszakid=36318\'>[A hat\xc3\xa1lyoss\xc3\xa1gi sz\xc5\xb1r\xc5\x91k bekapcsol\xc3\xa1sa.]</a></form>\n</div><form name=form1 method=post action=\'/prg/int.php?nyilvantartottszakid=36318\'><div align=\'left\' valign=\'top\'>\xe2\x96\xa0 <a href=\'kkk.php?graf=MSZKSMU\'>KKK teljes gr\xc3\xa1f</a> \xe2\x96\xa0 <a href=\'int.php?adatmod=nyilvszak&szervezetid=36\'>SZTE nyilv\xc3\xa1ntartott k\xc3\xa9pz\xc3\xa9sei</a><br>A gr\xc3\xa1fban a csom\xc3\xb3pontokra kattintva b\xc5\x91vebb inform\xc3\xa1ci\xc3\xb3 olvashat\xc3\xb3 az adott csom\xc3\xb3pontr\xc3\xb3l.<br>Gr\xc3\xa1fn\xc3\xa9zet:   <select name=grafnezet>\n<option value="resz">csak a nyilv\xc3\xa1ntartott r\xc3\xa9szgr\xc3\xa1fot</option>\n<option value="mind">a teljes gr\xc3\xa1fban a nyilv\xc3\xa1ntartott r\xc3\xa9szgr\xc3\xa1fot</option>\n</select> mutatja.<br>A gr\xc3\xa1fban a ny\xc3\xadl kezdete \xc3\xa9s v\xc3\xa9ge k\xc3\xb6z\xc3\xb6tti minim\xc3\xa1lis t\xc3\xa1vols\xc3\xa1g:   <select name=grafminlen>\n<option value="0">legkisebb</option>\n<option selected value="1">1 egys\xc3\xa9g</option>\n<option value="2">2 egys\xc3\xa9g</option>\n<option value="3">3 egys\xc3\xa9g</option>\n<option value="4">4 egys\xc3\xa9g</option>\n<option value="5">5 egys\xc3\xa9g</option>\n</select> (A nagyobb \xc3\xa9rt\xc3\xa9k 
szell\xc5\x91sebb\xc3\xa9 teszi az \xc3\xa1br\xc3\xa1t.)<br> <button type=\'submit\'  style="background-color:#E5E5E5; color:#000000; font-size: 12px;" name=\'muv\' value=\'n\xc3\xa9zetet friss\xc3\xadt\'>n\xc3\xa9zetet friss\xc3\xadt</button> </div><br><table width=\'100%\' align=\'center\' border=\'0\'><tr><td width=\'50%\' align=\'left\' valign=\'top\'><a href=\'/prg/int.php?nyilvantartottszakid=36317\'>\xc2\xab el\xc5\x91z\xc5\x91: szoci\xc3\xa1lis munka (36317)</a></td><td width=\'50%\' align=\'right\'><a href=\'/prg/int.php?nyilvantartottszakid=6150\'>k\xc3\xb6vetkez\xc5\x91: szoci\xc3\xa1lpedag\xc3\xb3gia (6150) \xc2\xbb</a></td></tr></table>\n<br><div align=\'left\' valign=\'top\'><b><a href=\'torzsadat.php?tabla=szervezet&sid=70\'>(SZTE) Szegedi Tudom\xc3\xa1nyegyetem</a> - <a href=\'torzsadat.php?tabla=nyilvantartottszak&sid=21715\'>(MSZKSMU) szoci\xc3\xa1lis munka [36318]</a></b></div><br><div align=\'left\' valign=\'top\'><?xml version="1.0" encoding="UTF-8" standalone="no"?>\n<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"\n "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">\n<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n -->\n<!-- Title: MSZKSMU Pages: 1 -->\n<svg width="340pt" height="116pt"\n viewBox="0.00 0.00 340.00 116.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">\n<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 112)">\n<title>MSZKSMU</title>\n<polygon fill="#ffffff" stroke="transparent" points="-4,4 -4,-112 336,-112 336,4 -4,4"/>\n<g id="clust1" class="cluster">\n<title>cluster_vegzettseg</title>\n<polygon fill="none" stroke="#ffff00" points="231,-8 231,-62 324,-62 324,-8 231,-8"/>\n</g>\n<!-- START -->\n<g id="node1" class="node">\n<title>START</title>\n<ellipse fill="#d3d3d3" stroke="#d3d3d3" cx="27" cy="-63" rx="27" ry="18"/>\n<text text-anchor="middle" x="27" y="-60.8" font-family="Times,serif" font-size="9.00" fill="#000000">START</text>\n</g>\n<!-- MSZKSMU -->\n<g 
id="node2" class="node">\n<title>MSZKSMU</title>\n<g id="a_node2"><a xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=414" xlink:title="MSZKSMU\\nszoci\xc3\xa1lis munka">\n<polygon fill="#e0ffff" stroke="#e0ffff" points="164,-81 91,-81 91,-45 164,-45 164,-81"/>\n<text text-anchor="middle" x="127.5" y="-65.8" font-family="Times,serif" font-size="9.00" fill="#000000">MSZKSMU</text>\n<text text-anchor="middle" x="127.5" y="-55.8" font-family="Times,serif" font-size="9.00" fill="#000000">szoci\xc3\xa1lis munka</text>\n</a>\n</g>\n</g>\n<!-- START&#45;&gt;MSZKSMU -->\n<g id="edge1" class="edge">\n<title>START&#45;&gt;MSZKSMU</title>\n<path fill="none" stroke="#0000ff" stroke-width="2" d="M54.1967,-63C62.3906,-63 71.6286,-63 80.7147,-63"/>\n<polygon fill="#0000ff" stroke="#0000ff" stroke-width="2" points="80.8451,-66.5001 90.8451,-63 80.845,-59.5001 80.8451,-66.5001"/>\n</g>\n<!-- MSPCKSM -->\n<g id="node3" class="node">\n<title>MSPCKSM</title>\n<g id="a_node3"><a xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=5710" xlink:title="MSPCKSM\\nklinikai szoci\xc3\xa1lis munka">\n<polygon fill="#ffe4e1" stroke="#ffe4e1" points="328,-108 227,-108 227,-72 328,-72 328,-108"/>\n<text text-anchor="middle" x="277.5" y="-92.8" font-family="Times,serif" font-size="9.00" fill="#000000">MSPCKSM</text>\n<text text-anchor="middle" x="277.5" y="-82.8" font-family="Times,serif" font-size="9.00" fill="#000000">klinikai szoci\xc3\xa1lis munka</text>\n</a>\n</g>\n</g>\n<!-- MSZKSMU&#45;&gt;MSPCKSM -->\n<g id="edge3" class="edge">\n<title>MSZKSMU&#45;&gt;MSPCKSM</title>\n<path fill="none" stroke="#000000" d="M164.1941,-69.6049C179.9274,-72.4369 198.7348,-75.8223 216.4633,-79.0134"/>\n<polygon fill="#000000" stroke="#000000" points="216.2835,-82.5372 226.7454,-80.8642 217.5237,-75.6479 216.2835,-82.5372"/>\n</g>\n<!-- 1287 -->\n<g id="node4" class="node">\n<title>1287</title>\n<g id="a_node4"><a 
xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=vegzettseg&idmezo=vegzettsegid&id=1287" xlink:title="MMSAZMO\\nokleveles\\nszoci\xc3\xa1lis munk\xc3\xa1s">\n<polygon fill="#ffff00" stroke="#ffff00" points="316,-54 239,-54 239,-16 316,-16 316,-54"/>\n<text text-anchor="middle" x="277.5" y="-42.8" font-family="Times,serif" font-size="9.00" fill="#000000">MMSAZMO</text>\n<text text-anchor="middle" x="277.5" y="-32.8" font-family="Times,serif" font-size="9.00" fill="#000000">okleveles</text>\n<text text-anchor="middle" x="277.5" y="-22.8" font-family="Times,serif" font-size="9.00" fill="#000000">szoci\xc3\xa1lis munk\xc3\xa1s</text>\n</a>\n</g>\n</g>\n<!-- MSZKSMU&#45;&gt;1287 -->\n<g id="edge2" class="edge">\n<title>MSZKSMU&#45;&gt;1287</title>\n<path fill="none" stroke="#ff0000" d="M164.1941,-56.1504C183.6481,-52.519 207.8022,-48.0103 228.805,-44.0897"/>\n<polygon fill="#ff0000" stroke="#ff0000" points="229.6399,-47.4944 238.8279,-42.2188 228.3554,-40.6133 229.6399,-47.4944"/>\n<text text-anchor="middle" x="195.5" y="-54.6" font-family="Times,serif" font-size="8.00" fill="#ff0000">START</text>\n</g>\n</g>\n</svg>\n</div><br><br><div align=\'left\' valign=\'top\'><b>Nyilv\xc3\xa1ntartott szak:</div></b><table border=\'1\' cellpadding=\'2\' cellspacing=\'0\'><tr><td align=\'left\' valign=\'top\'><b>nyilv. szak ID</b></td><td align=\'left\' valign=\'top\'><b>k\xc3\xb3d</b></td><td align=\'left\' valign=\'top\'><b>n\xc3\xa9v</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g kezdete</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>meghird. kezdete</b></td><td align=\'left\' valign=\'top\'><b>meghird. 
v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>telephely</b></td><td align=\'left\' valign=\'top\'><b>nyelv</b></td><td align=\'left\' valign=\'top\'><b>munkarend</b></td></tr>\n<tr><td align=\'left\' valign=\'top\'><a href=\'torzsadat.php?tabla=nyilvantartottszak&sid=21715\'>36318</a></td><td align=\'left\' valign=\'top\'>MSZKSMU</td><td align=\'left\' valign=\'top\'>szoci\xc3\xa1lis munka</td><td align=\'left\' valign=\'top\'>2020-01-01</td><td align=\'left\' valign=\'top\'></td><td align=\'left\' valign=\'top\'>2020-01-01</td><td align=\'left\' valign=\'top\'></td><td align=\'left\' valign=\'top\'>Szeged</td><td align=\'left\' valign=\'top\'>magyar</td><td align=\'left\' valign=\'top\'>levelez\xc5\x91</td></tr>\n</table><div align=\'left\' valign=\'top\'><b>Nyilv\xc3\xa1ntartott k\xc3\xa9pz\xc3\xa9si elemek:</b></div><table border=\'1\' cellpadding=\'2\' cellspacing=\'0\'>\n<tr><td align=\'left\' valign=\'top\'><b>k\xc3\xb3d</b></td><td align=\'left\' valign=\'top\'><b>n\xc3\xa9v</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g kezdete</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>meghird. kezdete</b></td><td align=\'left\' valign=\'top\'><b>meghird. 
v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>t\xc3\xadpus</b></td><td align=\'left\' valign=\'top\'><b>minimum kredit</b></td><td align=\'left\' valign=\'top\'><b>maximum kredit</b></td></tr><tr><td align=\'left\' valig=\'top\'><a href=\'torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=5710\'>MSPCKSM</a></td><td align=\'left\' valig=\'top\'>klinikai szoci\xc3\xa1lis munka</td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>specializ\xc3\xa1ci\xc3\xb3</td><td align=\'left\' valig=\'top\'>35</td><td align=\'left\' valig=\'top\'>40</td></tr><tr><td align=\'left\' valig=\'top\'><a href=\'torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=414\'>MSZKSMU</a></td><td align=\'left\' valig=\'top\'>szoci\xc3\xa1lis munka</td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>szak</td><td align=\'left\' valig=\'top\'>120</td><td align=\'left\' valig=\'top\'>120</td></tr></table></form>\r\n    </td>\r\n  </tr>\r\n  <tr>\r\n    <td colspan="2" bgcolor=\'#0994dc\' width="100%">\r\n      <table width="100%">\r\n\t<tr>\r\n\t  <td align=\'left\'>\r\n\t      <font size=\'1\' color=\'#ffffff\'>Az adatb\xc3\xa1zis 2022-09-24 hajnalban friss\xc3\xbclt.</font>\r\n\t  </td>\r\n\t  <td align="right">\r\n\t    <font size=\'1\' color=\'#ffffff\'>K\xc3\xa9sz\xc3\xbclt az EKOP-1.A.1-08/C-2009-0009  "Az Oktat\xc3\xa1si Hivatal k\xc3\xb6zigazgat\xc3\xa1si szolg\xc3\xa1ltat\xc3\xa1sainak elektroniz\xc3\xa1l\xc3\xa1sa" projekt keret\xc3\xa9ben. &copy; 2012.</font>\r\n\t  </td>\r\n\t</tr>\r\n    </td>\r\n  </tr>\r\n</table>\r\n</body>\r\n</html>\r\n\n'
encoding = detect_encoding(html_byte)
tree = HTMLTree.parse_from_bytes(html_byte, encoding)
str(tree)

FastWARC: CLI may index gzipped WARC records with erroneous length 0

The fastwarc command-line tool "index" indexes some records of a gzipped WARC file with an erroneous record length of zero:

$> wget http://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2021/09/CC-NEWS-20210930113548-00741.warc.gz

$> fastwarc index -fwarc-type,warc-target-uri,offset,length CC-NEWS-20210930113548-00741.warc.gz \
    | grep -F '"length": "0"'
{"warc-type": "response", "warc-target-uri": "https://www.themarketsdaily.com/2021/09/30/ishares-sp-500-etf-nysearcaivv-sees-strong-trading-volume.html", "offset": "232757027", "length": "0"}
{"warc-type": "response", "warc-target-uri": "https://www.timeturk.com/yasam/baskan-buyukkilic-10-milyon-tl-yatirim-yapilan-yeralti-carsisi-nda-incelemede-bulundu-esnafi-ziyaret-etti/haber-1703634", "offset": "278528237", "length": "0"}
{"warc-type": "response", "warc-target-uri": "https://www.sondakika.com/haber/haber-yayinci-tevfik-rauf-baysal-vefat-etti-14429565/", "offset": "1044381471", "length": "0"}

See also the discussion in #11; however, fewer records are affected here. With the uncompressed WARC file, the error is not reproducible.
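To collect the affected records without grep, the JSON lines printed by the index command can be filtered in Python. The following is a minimal sketch that assumes only the output format shown above (one JSON object per line, with numbers encoded as strings). The function name zero_length_records and the two sample lines are made up for illustration and are not part of FastWARC.

```python
import json


def zero_length_records(index_lines):
    """Yield index entries whose "length" field is zero."""
    for line in index_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # The index output encodes numeric fields as strings, e.g. "0".
        if int(entry.get("length", -1)) == 0:
            yield entry


# Hypothetical sample lines in the same shape as the CLI output.
sample = [
    '{"warc-type": "response", "warc-target-uri": "https://example.org/a", "offset": "100", "length": "0"}',
    '{"warc-type": "response", "warc-target-uri": "https://example.org/b", "offset": "200", "length": "512"}',
]

suspect = list(zero_length_records(sample))
for entry in suspect:
    print(entry["offset"], entry["warc-target-uri"])
```

In practice, the input could be sys.stdin with the index command piped into the script, which keeps the offsets available for inspecting the records afterwards.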

FastWARC: BufferedReader may hang up on truncated gzipped WARC file

The ArchiveIterator, or rather the underlying stream_io.BufferedReader, may hang when reading a truncated gzipped WARC file (e.g. an incomplete download). The issue can be reproduced by reading clipped.warc.gz, see iipc/jwarc#17. The stack during the hang (instead of ftell, I've also observed stream_io.FileStream.read() on top of _refill_working_buf()):

#3  0x00007f98a34f8705 in __GI__IO_ftell (fp=0x19b3790) at ioftell.c:38
#4  0x00007f98a2764766 in __pyx_f_8fastwarc_9stream_io_10GZipStream__refill_working_buf (__pyx_v_self=0x7f98a19fad60, __pyx_v_size=16384)
    at fastwarc/stream_io.cpp:4944
#5  0x00007f98a276d500 in __pyx_f_8fastwarc_9stream_io_10GZipStream_read (__pyx_v_self=0x7f98a19fad60, __pyx_v_out="", __pyx_v_size=16384)
    at fastwarc/stream_io.cpp:5191
#6  0x00007f98a27645bc in __pyx_f_8fastwarc_9stream_io_14BufferedReader__fill_buf (__pyx_v_self=0x7f98a19fb9a0) at fastwarc/stream_io.cpp:9201
#7  0x00007f98a276ce6b in __pyx_f_8fastwarc_9stream_io_14BufferedReader_read (__pyx_v_self=0x7f98a19fb9a0, __pyx_skip_dispatch=<optimized out>, 
    __pyx_optional_args=<optimized out>) at fastwarc/stream_io.cpp:9684
#8  0x00007f98a2765d75 in __pyx_pf_8fastwarc_9stream_io_14BufferedReader_4read (__pyx_v_size=16384, __pyx_v_self=0x7f98a19fb9a0)
    at fastwarc/stream_io.cpp:9840
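Until the hang is fixed, a truncated download can be detected up front with the Python standard library. The sketch below does not use or reproduce FastWARC's code. It walks the concatenated gzip members with zlib and reports truncation when the input ends before the current member is complete, which is the point where a reader has to stop rather than try to refill its buffer again. The function name gzip_is_truncated and the in-memory sample data are assumptions for illustration.

```python
import gzip
import zlib


def gzip_is_truncated(data, chunk_size=16384):
    """Return True if `data` ends before the last gzip member is complete."""
    pos = 0
    while pos < len(data):
        # MAX_WBITS | 16 selects gzip framing; use one decompressor per member.
        decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
        while not decomp.eof:
            chunk = data[pos:pos + chunk_size]
            if not chunk:
                # Input exhausted in the middle of a member: give up here
                # instead of attempting another read.
                return True
            decomp.decompress(chunk)
            # Bytes after the end of the member belong to the next member.
            pos += len(chunk) - len(decomp.unused_data)
    return False


# Two concatenated members, similar in layout to a record-wise gzipped WARC file.
full = gzip.compress(b"a" * 1000) + gzip.compress(b"b" * 1000)
clipped = full[:-5]

print(gzip_is_truncated(full))
print(gzip_is_truncated(clipped))
```

For large files, the same loop can read chunks from an open file object instead of slicing a bytes object, so the whole archive never has to be held in memory.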

Interesting Benchmarks running resiliparse 'HTML2Text' sequentially vs parallel

After benchmarking resiliparse's "HTML2Text" function extract_plain_text(tree, main_content=True), it seems that extract_plain_text is significantly slower when run in parallel than when run sequentially.

sequentially : 508.147 items/sec
parallel : 62.7322 items/sec

I ran the benchmarking with a tool I wrote, https://github.com/Nootka-io/wee-benchmarking-tool. I'll work on pulling out a minimal example.

This seems strange to me, and I'm not sure where to begin profiling or debugging. Other libraries see little improvement from running in parallel, but resiliparse is the only one showing a dramatic drop, although it is still the fastest.
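As a starting point for a minimal example, a throughput comparison can be written with the standard library alone. This is only a sketch under assumptions: it is not the benchmarking tool linked above, fake_extract is a placeholder standing in for the real extraction call, and a thread pool is used because the report does not say whether "parallel" means threads or processes.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def items_per_sec(map_fn, func, items):
    """Time map_fn(func, items) and return (items per second, results)."""
    start = time.perf_counter()
    results = list(map_fn(func, items))
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    return len(results) / elapsed, results


def fake_extract(html):
    # Placeholder workload; replace with the function under test.
    return html.replace("<p>", "").replace("</p>", "")


docs = ["<p>doc %d</p>" % i for i in range(1000)]

seq_rate, seq_out = items_per_sec(map, fake_extract, docs)
with ThreadPoolExecutor(max_workers=4) as pool:
    par_rate, par_out = items_per_sec(pool.map, fake_extract, docs)

print("sequential: %.1f items/sec" % seq_rate)
print("parallel:   %.1f items/sec" % par_rate)
```

Passing the map function as a parameter keeps both measurements on the same code path, so any difference comes from the executor. A process pool's map method can be swapped in the same way, although on platforms that spawn worker processes the calling code then needs an `if __name__ == "__main__":` guard.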
