chatnoir-eu / chatnoir-resiliparse

A robust web archive analytics toolkit

Home Page: https://resiliparse.chatnoir.eu

License: Apache License 2.0

Python 30.73% Cython 54.48% CMake 0.30% C 13.33% C++ 0.80% Dockerfile 0.37%
python web warc bigdata cython cpp extraction webarchive htmlparser

chatnoir-resiliparse's Issues

FastWARC: BufferedReader may hang up on truncated gzipped WARC file

The ArchiveIterator, or rather the underlying stream_io.BufferedReader, may hang when reading a truncated gzipped WARC file (e.g. an incomplete download). The issue can be reproduced by reading clipped.warc.gz, see iipc/jwarc#17. The stack during the hang (instead of ftell, I've also observed stream_io.FileStream.read() on top of _refill_working_buf()):

#3  0x00007f98a34f8705 in __GI__IO_ftell (fp=0x19b3790) at ioftell.c:38
#4  0x00007f98a2764766 in __pyx_f_8fastwarc_9stream_io_10GZipStream__refill_working_buf (__pyx_v_self=0x7f98a19fad60, __pyx_v_size=16384)
    at fastwarc/stream_io.cpp:4944
#5  0x00007f98a276d500 in __pyx_f_8fastwarc_9stream_io_10GZipStream_read (__pyx_v_self=0x7f98a19fad60, __pyx_v_out="", __pyx_v_size=16384)
    at fastwarc/stream_io.cpp:5191
#6  0x00007f98a27645bc in __pyx_f_8fastwarc_9stream_io_14BufferedReader__fill_buf (__pyx_v_self=0x7f98a19fb9a0) at fastwarc/stream_io.cpp:9201
#7  0x00007f98a276ce6b in __pyx_f_8fastwarc_9stream_io_14BufferedReader_read (__pyx_v_self=0x7f98a19fb9a0, __pyx_skip_dispatch=<optimized out>, 
    __pyx_optional_args=<optimized out>) at fastwarc/stream_io.cpp:9684
#8  0x00007f98a2765d75 in __pyx_pf_8fastwarc_9stream_io_14BufferedReader_4read (__pyx_v_size=16384, __pyx_v_self=0x7f98a19fb9a0)
    at fastwarc/stream_io.cpp:9840
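
A truncated input such as clipped.warc.gz can also be detected up front with Python's standard zlib module, independently of FastWARC. The following is a minimal sketch, not FastWARC code: the helper name gzip_is_truncated is hypothetical, and it decompresses the whole input in memory, so it is only suitable for small test files.

```python
import gzip
import zlib


def gzip_is_truncated(data: bytes) -> bool:
    """Return True if (possibly multi-member) gzip data ends in the middle of a member."""
    pos = 0
    while pos < len(data):
        # wbits=31 makes zlib expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        decomp.decompress(data[pos:])
        if not decomp.eof:
            # Input ran out before the member's trailer was read.
            return True
        # Continue with the next member, which starts at the unused bytes.
        pos = len(data) - len(decomp.unused_data)
    return False


# Two concatenated gzip members, similar to a record-wise compressed WARC.
two_members = gzip.compress(b"record one") + gzip.compress(b"record two")
print(gzip_is_truncated(two_members))        # complete input
print(gzip_is_truncated(two_members[:-5]))   # last member cut short
```

Corrupt (rather than merely truncated) data raises zlib.error in this sketch, which keeps the two failure modes apart.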

setuptools.config.pyprojecttoml has no attribute _BetaConfiguration

AttributeError: module 'setuptools.config.pyprojecttoml' has no attribute '_BetaConfiguration'. Did you mean: 'read_configuration'?

Fails to build in Docker on Apple silicon.

Builds fine on Linux and also outside of Docker on native macOS.

pip install fastwarc==0.14.5
Collecting fastwarc==0.14.5
  Using cached FastWARC-0.14.5.tar.gz (42 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [18 lines of output]
      Traceback (most recent call last):
        File "/opt/conda/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/opt/conda/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/conda/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-build-env-jytdl1uv/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-build-env-jytdl1uv/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
          self.run_setup()
        File "/tmp/pip-build-env-jytdl1uv/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 311, in run_setup
          exec(code, locals())
        File "<string>", line 34, in <module>
      AttributeError: module 'setuptools.config.pyprojecttoml' has no attribute '_BetaConfiguration'. Did you mean: 'read_configuration'?
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

pipx run fastwarc check failed: binascii.Error: Non-base32 digit found

$ pipx run --verbose fastwarc check /tmp/warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
pipx >(setup:729): pipx version is 1.0.0
pipx >(setup:730): Default python interpreter is '/home/user/.local/pipx/venvs/pipx/bin/python'
pipx >(needs_upgrade:69): Time since last upgrade of shared libs, in seconds: 1561898. Upgrade will be run by pipx if greater than 2592000.
pipx >(run_subprocess:172): running /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/python -c import sysconfig; print(sysconfig.get_path('purelib'))
pipx >(run:103): Reusing cached venv /home/user/.local/pipx/.cache/7a73b1e86637c39
pipx >(run_subprocess:172): running /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/python -c import sysconfig; print(sysconfig.get_path('purelib'))
pipx >(exec_app:387): exec_app: /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc check /tmp/warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
0 records were verified successfully.                           
1 records were skipped without digest.
Error in sys.excepthook:
Traceback (most recent call last):
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/fastwarc/cli.py", line 138, in check
    for v in pbar:
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "fastwarc/tools.pyx", line 178, in verify_digests
  File "fastwarc/warc.pyx", line 922, in fastwarc.warc.WarcRecord.verify_block_digest
  File "fastwarc/warc.pyx", line 934, in fastwarc.warc.WarcRecord.verify_block_digest
  File "fastwarc/warc.pyx", line 872, in fastwarc.warc.WarcRecord._verify_digest
  File "/usr/lib/python3.9/base64.py", line 231, in b32decode
    raise binascii.Error('Non-base32 digit found') from None
binascii.Error: Non-base32 digit found

Original exception was:
Traceback (most recent call last):
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/fastwarc/cli.py", line 138, in check
    for v in pbar:
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "fastwarc/tools.pyx", line 178, in verify_digests
  File "fastwarc/warc.pyx", line 922, in fastwarc.warc.WarcRecord.verify_block_digest
  File "fastwarc/warc.pyx", line 934, in fastwarc.warc.WarcRecord.verify_block_digest
  File "fastwarc/warc.pyx", line 872, in fastwarc.warc.WarcRecord._verify_digest
  File "/usr/lib/python3.9/base64.py", line 231, in b32decode
    raise binascii.Error('Non-base32 digit found') from None
binascii.Error: Non-base32 digit found
$
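
The traceback shows base64.b32decode rejecting the digest value. One possible explanation, which is an assumption and not confirmed by the output above, is that the WARC file stores its digests hex-encoded rather than Base32-encoded. The sketch below accepts both encodings by looking at the length of the encoded value; the function names decode_digest and verify_digest are hypothetical and not part of FastWARC.

```python
import base64
import hashlib


def decode_digest(encoded: str, digest_size: int) -> bytes:
    """Decode a digest value that is either hex- or Base32-encoded."""
    # A hex string is exactly twice as long as the raw digest,
    # which differs from the Base32 length for SHA-1, SHA-256 and MD5.
    if len(encoded) == 2 * digest_size:
        try:
            return bytes.fromhex(encoded)
        except ValueError:
            pass  # Not valid hex, fall through to Base32.
    return base64.b32decode(encoded)


def verify_digest(header_value: str, block: bytes) -> bool:
    """Check a header value of the form '<algorithm>:<digest>' against the block bytes."""
    algorithm, _, encoded = header_value.partition(":")
    hasher = hashlib.new(algorithm.lower(), block)
    return decode_digest(encoded, hasher.digest_size) == hasher.digest()
```

Both `verify_digest("sha1:" + hashlib.sha1(b"x").hexdigest(), b"x")` and the Base32 form of the same digest are accepted by this sketch.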

Problem with LZ4F_cctx and LZ4F_dctx

I'm having this error both when trying to install from pip and from this repo:

fastwarc/warc.cpp:1189:3: error: ‘LZ4F_cctx’ does not name a type; did you mean ‘LZ4F_cctx_s’?
LZ4F_cctx *cctx;
^~~~~~~~~
LZ4F_cctx_s
fastwarc/warc.cpp:1190:3: error: ‘LZ4F_dctx’ does not name a type; did you mean ‘LZ4F_dctx_s’?
LZ4F_dctx *dctx;
^~~~~~~~~
LZ4F_dctx_s
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

DOM Tree Manipulation and DOMNode

Hello, Thank you for the wonderful project!

I have a question about DOM manipulation and DOMNode. The documentation warns against using an instance of DOMNode after DOM tree manipulation:

Warning

A DOMNode object is valid only for as long as its parent tree has not been modified or deallocated. Thus, DO NOT use existing instances after any sort of DOM tree manipulation! Doing so may result in Python crashes or (worse) security vulnerabilities due to dangling pointers (use after free). This is a known Lexbor limitation for which there is no workaround at the moment.

I am currently working on an HTML extractor that involves many DOM manipulations and DOMNode accesses, for example:

sibling = next_sibling.next
p.append_child(next_sibling)
next_sibling = sibling

If I need to look up each DOMNode again after every DOM manipulation, some kinds of work become hard to do. Is there a concrete example of safe manipulations and accesses, or of specific cases where accessing a node after manipulation will cause an error or segfault? Thank you!

Installing fastwarc via `pip install` fails if compilation is required or requested

  • applies to fastwarc 0.6.6 and 0.7.0 (0.6.5 successfully installed)
  • seen on Ubuntu 20.04 and 21.04
  • on amd64 with pip3 install --no-binary fastwarc fastwarc
  • or on aarch64 with pip3 install fastwarc (no binaries provided for ARM CPUs)

The error message indicates that fastwarc is now too interconnected with resiliparse

  ERROR: Command errored out with exit status 1:
...  
  from resiliparse_common.string_util cimport str_to_lower, strip_str, strip_c_str
  ^
  ------------------------------------------------------------
  
  fastwarc/warc.pyx:32:0: 'resiliparse_common/string_util.pxd' not found

Building from a checkout of chatnoir-resiliparse via pip3 wheel -e fastwarc succeeds also on ARM-based systems.

fatal error: html.h: No such file or directory

Hello,

I'm trying to build Resiliparse 0.13.7 from source, and I'm getting this error. Can you tell me which library Resiliparse is expecting to get html.h from? I suspect I'm missing a dependency.

resiliparse/extract/html2text.cpp:869:10: fatal error: html.h: No such file or directory
#include "html.h"
^~~~~~~~

Thanks,
Dave

Random or Chunked Reading

Thanks for developing fastwarc! It's been a great tool while I've been exploring text extraction from Common Crawl with Python.

I'm interested in using it as part of a much larger pipeline, but I want to enable chunked processing and I'm curious whether this is possible. My understanding at the moment is that the ArchiveIterator gives me a nice handle to process all WarcRecords sequentially. What I think I'd like to be able to do is something like:

archv = ArchiveChunked(open(...), ...)
recs = archv[N:N+10] # select 10 records starting at N

Doing this should allow me to leverage batch processing functionality and distribute the processing across multiple cores.
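
For a purely sequential iterator, the slicing described above can be approximated with itertools.islice from the standard library. This is a minimal sketch over any iterable; take_slice and in_batches are hypothetical helper names, and passing an ArchiveIterator as the input is an assumption. Skipping to record N this way still reads through the preceding records, so it is not true random access, and whether collected records stay readable after the iterator has advanced should be checked against the FastWARC documentation.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def take_slice(records: Iterable[T], start: int, count: int) -> List[T]:
    """Return `count` items starting at index `start`, consuming the items before them."""
    return list(islice(records, start, start + count))


def in_batches(records: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to `size` items until the input is exhausted."""
    it = iter(records)
    while batch := list(islice(it, size)):
        yield batch
```

Each batch yielded by in_batches could then be handed to a worker, for example through multiprocessing, provided the items are picklable.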

Trouble building in Python 3.11

$ pip install --no-binary resiliparse resiliparse

DEPRECATION: --no-binary currently disables reading from the cache of locally built wheels. In the future --no-binary will not influence the wheel cache. pip 23.1 will enforce this behaviour change. A possible replacement is to use the --no-cache-dir option. You can use the flag --use-feature=no-binary-enable-wheel-cache to test the upcoming behaviour. Discussion can be found at https://github.com/pypa/pip/issues/11453
Collecting resiliparse
  Using cached Resiliparse-0.13.7.tar.gz (601 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting fastwarc==0.13.7
  Using cached FastWARC-0.13.7-cp311-cp311-linux_x86_64.whl
Collecting brotli
  Using cached Brotli-1.0.9-cp311-cp311-linux_x86_64.whl
Requirement already satisfied: click in ./venv311/lib/python3.11/site-packages (from fastwarc==0.13.7->resiliparse) (8.0.4)
Requirement already satisfied: tqdm in ./venv311/lib/python3.11/site-packages (from fastwarc==0.13.7->resiliparse) (4.64.1)
Building wheels for collected packages: resiliparse
  Building wheel for resiliparse (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for resiliparse (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [50 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-311
      creating build/lib.linux-x86_64-cpython-311/resiliparse
      copying resiliparse/cli.py -> build/lib.linux-x86_64-cpython-311/resiliparse
      copying resiliparse/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse
      creating build/lib.linux-x86_64-cpython-311/resiliparse/beam
      copying resiliparse/beam/coders.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
      copying resiliparse/beam/textio.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
      copying resiliparse/beam/warcio.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
      copying resiliparse/beam/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
      copying resiliparse/beam/elasticsearch.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
      copying resiliparse/beam/fileio.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
      creating build/lib.linux-x86_64-cpython-311/resiliparse/extract
      copying resiliparse/extract/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse/extract
      creating build/lib.linux-x86_64-cpython-311/resiliparse/parse
      copying resiliparse/parse/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
      copying resiliparse/__init__.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse
      copying resiliparse/itertools.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse
      copying resiliparse/process_guard.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse
      copying resiliparse/extract/__init__.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/extract
      copying resiliparse/extract/html2text.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/extract
      copying resiliparse/parse/__init__.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
      copying resiliparse/parse/html.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
      copying resiliparse/parse/http.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
      copying resiliparse/parse/lang.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
      copying resiliparse/parse/encoding.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
      copying resiliparse/parse/lang_profiles.h -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
      copying resiliparse/parse/encoding.h -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
      copying resiliparse/parse/html.h -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
      running build_ext
      building 'resiliparse.itertools' extension
      creating build/temp.linux-x86_64-cpython-311
      creating build/temp.linux-x86_64-cpython-311/resiliparse
      x86_64-linux-gnu-gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/mnt/Data/Projects/nootka_io/sentry_dragon/venv311/include -I/usr/include/python3.11 -c resiliparse/itertools.cpp -o build/temp.linux-x86_64-cpython-311/resiliparse/itertools.o -std=c++17 -O3 -Wall -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function
      x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -g -fwrapv -O2 build/temp.linux-x86_64-cpython-311/resiliparse/itertools.o -L/usr/lib/x86_64-linux-gnu -o build/lib.linux-x86_64-cpython-311/resiliparse/itertools.cpython-311-x86_64-linux-gnu.so -std=c++17
      building 'resiliparse.extract.html2text' extension
      creating build/temp.linux-x86_64-cpython-311/resiliparse/extract
      x86_64-linux-gnu-gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I./resiliparse/parse -I/mnt/Data/Projects/nootka_io/sentry_dragon/venv311/include -I/usr/include/python3.11 -c resiliparse/extract/html2text.cpp -o build/temp.linux-x86_64-cpython-311/resiliparse/extract/html2text.o -std=c++17 -O3 -Wall -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function
      In file included from /usr/include/lexbor/css/css.h:14,
                       from resiliparse/extract/html2text.cpp:864:
      /usr/include/lexbor/css/stylesheet.h: In function ‘lxb_css_stylesheet_t* lxb_css_stylesheet_create(lexbor_mraw_t*)’:
      /usr/include/lexbor/css/stylesheet.h:33:30: error: invalid conversion from ‘void*’ to ‘lxb_css_stylesheet_t*’ {aka ‘lxb_css_stylesheet*’} [-fpermissive]
         33 |     return lexbor_mraw_calloc(mraw, sizeof(lxb_css_stylesheet_t));
            |            ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            |                              |
            |                              void*
      error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for resiliparse
Failed to build resiliparse
ERROR: Could not build wheels for resiliparse, which is required to install pyproject.toml-based projects

Fix HTTP status code parsing (reason phrase may contain spaces)

The field WarcRecord.http_headers could include the HTTP status code or it could be provided as an extra attribute to WarcRecord.

When reading a record, it is not easy to see which status code a response had. For example, if I want to keep only 301 redirects (or only 200 responses for further processing), I cannot do this as far as I can see. The other HTTP headers are parsed, but not the HTTP status line, which has a simple format, e.g. HTTP/1.X XXX Description, and could be integrated into the existing HTTP header parsing. I also found no simple way, like .reader, to access the raw HTTP communication.

Example:

>>> record.headers
{'WARC-Type': 'response', 'WARC-Target-URI': 'http://vgperson.com/robots.txt', 'WARC-Date': '2021-08-09T13:25:55Z', 'WARC-Payload-Digest': 'sha1:OLD2B4B3YRYMAUJAQNATRPULWOOXO3YP', 'WARC-IP-Address': '85.214.122.46', 'WARC-Record-ID': '<urn:uuid:5eeb5cea-38e6-4904-a4f9-077a162bc0d6>', 'Content-Type': 'application/http; msgtype=response', 'Content-Length': '454'}
>>> record.http_headers
{'Date': 'Mon, 09 Aug 2021 13:25:53 GMT', 'Server': 'Apache', 'Location': 'https://vgperson.com/robots.txt', 'Content-Length': '239', 'Connection': 'close', 'Content-Type': 'text/html; charset=iso-8859-1'}
>>> content = record.reader.read()
>>> assert len(content) == record.content_length  # content only includes the real content, no access to HTTP stuff
>>> print(content)
b'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>301 Moved Permanently</title>\n</head><body>\n<h1>Moved Permanently</h1>\n<p>The document has moved <a href="https://vgperson.com/robots.txt">here</a>.</p>\n</body></html>\n'

HTTP communication:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://vgperson.com/robots.txt
WARC-Date: 2021-08-09T13:25:55Z
WARC-Payload-Digest: sha1:OLD2B4B3YRYMAUJAQNATRPULWOOXO3YP
WARC-IP-Address: 85.214.122.46
WARC-Record-ID: <urn:uuid:5eeb5cea-38e6-4904-a4f9-077a162bc0d6>
Content-Type: application/http; msgtype=response
Content-Length: 454

HTTP/1.1 301 Moved Permanently
Date: Mon, 09 Aug 2021 13:25:53 GMT
Server: Apache
Location: https://vgperson.com/robots.txt
Content-Length: 239
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://vgperson.com/robots.txt">here</a>.</p>
</body></html>
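
A status line such as "HTTP/1.1 301 Moved Permanently" from the record above can be split so that spaces in the reason phrase are preserved. The following is a minimal sketch, not the FastWARC implementation; StatusLine and parse_status_line are hypothetical names. The key point is to split on at most two spaces, so that everything after the status code stays together as the reason phrase.

```python
from typing import NamedTuple


class StatusLine(NamedTuple):
    version: str
    status_code: int
    reason: str


def parse_status_line(line: bytes) -> StatusLine:
    """Parse an HTTP status line; the reason phrase may be empty or contain spaces."""
    # maxsplit=2 keeps a multi-word reason phrase in one piece.
    parts = line.decode("iso-8859-1").strip().split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/"):
        raise ValueError(f"not an HTTP status line: {line!r}")
    reason = parts[2] if len(parts) > 2 else ""
    # int() raises ValueError if the status code is not numeric.
    return StatusLine(parts[0], int(parts[1]), reason)
```

With the parsed value, filtering for redirects reduces to a comparison such as `parse_status_line(line).status_code == 301`.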

resiliparse crashes in colab

Trying this piece of html... Is there something I can do to upgrade the underlying parser? I recall reading this...

from resiliparse.parse import detect_encoding
from resiliparse.parse.html import HTMLTree
from resiliparse.extract.html2text import extract_plain_text
html_byte = b'\n\n\n\n\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />\r\n<meta http-equiv="X-UA-Compatible" content="IE=9">\r\n<link rel="stylesheet" type="text/css" href="https://firgraf.oh.gov.hu/include/style.css" media="screen" />\r\n<title>Int\xc3\xa9zm\xc3\xa9nyi adatok</title>\r\n<!-- Global site tag (gtag.js) - Google Analytics -->\r\n<script async src="https://www.googletagmanager.com/gtag/js?id=UA-198540847-1"></script>\r\n<script>\r\n  window.dataLayer = window.dataLayer || [];\r\n  function gtag(){dataLayer.push(arguments);}\r\n  gtag(\'js\', new Date());\r\n  gtag(\'config\', \'UA-198540847-1\');\r\n</script>\r\n</head>\r\n<body>\r\n<table width="80%" cellpadding="0" cellspacing="0" align="center" style="border:3px solid;\r\nborder-radius:8px; border: 3px solid #0994dc; background-color:#FFFFFF">\r\n  <tr>\r\n    <td valign="top" rowspan="2" bgcolor=\'#FFFFFF\'></td>\r\n    <td align=\'center\' height=\'70\' bgcolor=\'#FFFFFF\' style=\'font: bold small-caps 28px monospace;\'><img src=\'https://firgraf.oh.gov.hu/images/firgraf_logo.png\' width=\'1200\'></td>\r\n  </tr>\r\n  <tr>\r\n    <td valign="top" align=\'center\' bgcolor="#FFFFFF">\r\n      \r\n      <table>\r\n\t<tr>\r\n\t  <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/index.php">Kezd\xc5\x91lap</a></td>\r\n\t  <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/kkk.php">K\xc3\xa9pz\xc3\xa9si \xc3\xa9s kimeneti k\xc3\xb6vetelm\xc3\xa9nyek</a></td>\r\n\t  <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/int.php">Int\xc3\xa9zm\xc3\xa9nyi adatok</a></td>\r\n\t  <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/torzs.php">T\xc3\xb6rzsadatok</a></td>\r\n\t  <td class="menu"><a class="menu" 
href="https://firgraf.oh.gov.hu/prg/gyorslista.php">Gyorslist\xc3\xa1k</a></td>\r\n\t  <td class="menu"><a class="menu" href="http://www.felvi.hu/hivataliugyek/">Vissza a felvi.hu-ra</a></td>\r\n\t</tr>\r\n      </table>\r\n    </td>\r\n  </tr>\r\n  <tr>\r\n    <td bgcolor=\'#ffffff\'>\r\n      &nbsp;\r\n    </td>\r\n    <td colspan="2" style="padding: 0.5em">\r\n      <div align="center"><font size="4" color="#000000">Int\xc3\xa9zm\xc3\xa9nyi adatok</font></div><hr>\r\n      <div align=\'left\' valign=\'top\'><form name=\'hataly\' method=\'get\' action=\'/prg/int.php?nyilvantartottszakid=36318\'><a href=\'/prg/int.php?hatalyvalt=hat\xc3\xa1lyoss\xc3\xa1g+bekapcsol\xc3\xa1sa&nyilvantartottszakid=36318\'>[A hat\xc3\xa1lyoss\xc3\xa1gi sz\xc5\xb1r\xc5\x91k bekapcsol\xc3\xa1sa.]</a></form>\n</div><form name=form1 method=post action=\'/prg/int.php?nyilvantartottszakid=36318\'><div align=\'left\' valign=\'top\'>\xe2\x96\xa0 <a href=\'kkk.php?graf=MSZKSMU\'>KKK teljes gr\xc3\xa1f</a> \xe2\x96\xa0 <a href=\'int.php?adatmod=nyilvszak&szervezetid=36\'>SZTE nyilv\xc3\xa1ntartott k\xc3\xa9pz\xc3\xa9sei</a><br>A gr\xc3\xa1fban a csom\xc3\xb3pontokra kattintva b\xc5\x91vebb inform\xc3\xa1ci\xc3\xb3 olvashat\xc3\xb3 az adott csom\xc3\xb3pontr\xc3\xb3l.<br>Gr\xc3\xa1fn\xc3\xa9zet:   <select name=grafnezet>\n<option value="resz">csak a nyilv\xc3\xa1ntartott r\xc3\xa9szgr\xc3\xa1fot</option>\n<option value="mind">a teljes gr\xc3\xa1fban a nyilv\xc3\xa1ntartott r\xc3\xa9szgr\xc3\xa1fot</option>\n</select> mutatja.<br>A gr\xc3\xa1fban a ny\xc3\xadl kezdete \xc3\xa9s v\xc3\xa9ge k\xc3\xb6z\xc3\xb6tti minim\xc3\xa1lis t\xc3\xa1vols\xc3\xa1g:   <select name=grafminlen>\n<option value="0">legkisebb</option>\n<option selected value="1">1 egys\xc3\xa9g</option>\n<option value="2">2 egys\xc3\xa9g</option>\n<option value="3">3 egys\xc3\xa9g</option>\n<option value="4">4 egys\xc3\xa9g</option>\n<option value="5">5 egys\xc3\xa9g</option>\n</select> (A nagyobb \xc3\xa9rt\xc3\xa9k 
szell\xc5\x91sebb\xc3\xa9 teszi az \xc3\xa1br\xc3\xa1t.)<br> <button type=\'submit\'  style="background-color:#E5E5E5; color:#000000; font-size: 12px;" name=\'muv\' value=\'n\xc3\xa9zetet friss\xc3\xadt\'>n\xc3\xa9zetet friss\xc3\xadt</button> </div><br><table width=\'100%\' align=\'center\' border=\'0\'><tr><td width=\'50%\' align=\'left\' valign=\'top\'><a href=\'/prg/int.php?nyilvantartottszakid=36317\'>\xc2\xab el\xc5\x91z\xc5\x91: szoci\xc3\xa1lis munka (36317)</a></td><td width=\'50%\' align=\'right\'><a href=\'/prg/int.php?nyilvantartottszakid=6150\'>k\xc3\xb6vetkez\xc5\x91: szoci\xc3\xa1lpedag\xc3\xb3gia (6150) \xc2\xbb</a></td></tr></table>\n<br><div align=\'left\' valign=\'top\'><b><a href=\'torzsadat.php?tabla=szervezet&sid=70\'>(SZTE) Szegedi Tudom\xc3\xa1nyegyetem</a> - <a href=\'torzsadat.php?tabla=nyilvantartottszak&sid=21715\'>(MSZKSMU) szoci\xc3\xa1lis munka [36318]</a></b></div><br><div align=\'left\' valign=\'top\'><?xml version="1.0" encoding="UTF-8" standalone="no"?>\n<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"\n "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">\n<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n -->\n<!-- Title: MSZKSMU Pages: 1 -->\n<svg width="340pt" height="116pt"\n viewBox="0.00 0.00 340.00 116.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">\n<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 112)">\n<title>MSZKSMU</title>\n<polygon fill="#ffffff" stroke="transparent" points="-4,4 -4,-112 336,-112 336,4 -4,4"/>\n<g id="clust1" class="cluster">\n<title>cluster_vegzettseg</title>\n<polygon fill="none" stroke="#ffff00" points="231,-8 231,-62 324,-62 324,-8 231,-8"/>\n</g>\n<!-- START -->\n<g id="node1" class="node">\n<title>START</title>\n<ellipse fill="#d3d3d3" stroke="#d3d3d3" cx="27" cy="-63" rx="27" ry="18"/>\n<text text-anchor="middle" x="27" y="-60.8" font-family="Times,serif" font-size="9.00" fill="#000000">START</text>\n</g>\n<!-- MSZKSMU -->\n<g 
id="node2" class="node">\n<title>MSZKSMU</title>\n<g id="a_node2"><a xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=414" xlink:title="MSZKSMU\\nszoci\xc3\xa1lis munka">\n<polygon fill="#e0ffff" stroke="#e0ffff" points="164,-81 91,-81 91,-45 164,-45 164,-81"/>\n<text text-anchor="middle" x="127.5" y="-65.8" font-family="Times,serif" font-size="9.00" fill="#000000">MSZKSMU</text>\n<text text-anchor="middle" x="127.5" y="-55.8" font-family="Times,serif" font-size="9.00" fill="#000000">szoci\xc3\xa1lis munka</text>\n</a>\n</g>\n</g>\n<!-- START&#45;&gt;MSZKSMU -->\n<g id="edge1" class="edge">\n<title>START&#45;&gt;MSZKSMU</title>\n<path fill="none" stroke="#0000ff" stroke-width="2" d="M54.1967,-63C62.3906,-63 71.6286,-63 80.7147,-63"/>\n<polygon fill="#0000ff" stroke="#0000ff" stroke-width="2" points="80.8451,-66.5001 90.8451,-63 80.845,-59.5001 80.8451,-66.5001"/>\n</g>\n<!-- MSPCKSM -->\n<g id="node3" class="node">\n<title>MSPCKSM</title>\n<g id="a_node3"><a xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=5710" xlink:title="MSPCKSM\\nklinikai szoci\xc3\xa1lis munka">\n<polygon fill="#ffe4e1" stroke="#ffe4e1" points="328,-108 227,-108 227,-72 328,-72 328,-108"/>\n<text text-anchor="middle" x="277.5" y="-92.8" font-family="Times,serif" font-size="9.00" fill="#000000">MSPCKSM</text>\n<text text-anchor="middle" x="277.5" y="-82.8" font-family="Times,serif" font-size="9.00" fill="#000000">klinikai szoci\xc3\xa1lis munka</text>\n</a>\n</g>\n</g>\n<!-- MSZKSMU&#45;&gt;MSPCKSM -->\n<g id="edge3" class="edge">\n<title>MSZKSMU&#45;&gt;MSPCKSM</title>\n<path fill="none" stroke="#000000" d="M164.1941,-69.6049C179.9274,-72.4369 198.7348,-75.8223 216.4633,-79.0134"/>\n<polygon fill="#000000" stroke="#000000" points="216.2835,-82.5372 226.7454,-80.8642 217.5237,-75.6479 216.2835,-82.5372"/>\n</g>\n<!-- 1287 -->\n<g id="node4" class="node">\n<title>1287</title>\n<g id="a_node4"><a 
xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=vegzettseg&idmezo=vegzettsegid&id=1287" xlink:title="MMSAZMO\\nokleveles\\nszoci\xc3\xa1lis munk\xc3\xa1s">\n<polygon fill="#ffff00" stroke="#ffff00" points="316,-54 239,-54 239,-16 316,-16 316,-54"/>\n<text text-anchor="middle" x="277.5" y="-42.8" font-family="Times,serif" font-size="9.00" fill="#000000">MMSAZMO</text>\n<text text-anchor="middle" x="277.5" y="-32.8" font-family="Times,serif" font-size="9.00" fill="#000000">okleveles</text>\n<text text-anchor="middle" x="277.5" y="-22.8" font-family="Times,serif" font-size="9.00" fill="#000000">szoci\xc3\xa1lis munk\xc3\xa1s</text>\n</a>\n</g>\n</g>\n<!-- MSZKSMU&#45;&gt;1287 -->\n<g id="edge2" class="edge">\n<title>MSZKSMU&#45;&gt;1287</title>\n<path fill="none" stroke="#ff0000" d="M164.1941,-56.1504C183.6481,-52.519 207.8022,-48.0103 228.805,-44.0897"/>\n<polygon fill="#ff0000" stroke="#ff0000" points="229.6399,-47.4944 238.8279,-42.2188 228.3554,-40.6133 229.6399,-47.4944"/>\n<text text-anchor="middle" x="195.5" y="-54.6" font-family="Times,serif" font-size="8.00" fill="#ff0000">START</text>\n</g>\n</g>\n</svg>\n</div><br><br><div align=\'left\' valign=\'top\'><b>Nyilv\xc3\xa1ntartott szak:</div></b><table border=\'1\' cellpadding=\'2\' cellspacing=\'0\'><tr><td align=\'left\' valign=\'top\'><b>nyilv. szak ID</b></td><td align=\'left\' valign=\'top\'><b>k\xc3\xb3d</b></td><td align=\'left\' valign=\'top\'><b>n\xc3\xa9v</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g kezdete</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>meghird. kezdete</b></td><td align=\'left\' valign=\'top\'><b>meghird. 
v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>telephely</b></td><td align=\'left\' valign=\'top\'><b>nyelv</b></td><td align=\'left\' valign=\'top\'><b>munkarend</b></td></tr>\n<tr><td align=\'left\' valign=\'top\'><a href=\'torzsadat.php?tabla=nyilvantartottszak&sid=21715\'>36318</a></td><td align=\'left\' valign=\'top\'>MSZKSMU</td><td align=\'left\' valign=\'top\'>szoci\xc3\xa1lis munka</td><td align=\'left\' valign=\'top\'>2020-01-01</td><td align=\'left\' valign=\'top\'></td><td align=\'left\' valign=\'top\'>2020-01-01</td><td align=\'left\' valign=\'top\'></td><td align=\'left\' valign=\'top\'>Szeged</td><td align=\'left\' valign=\'top\'>magyar</td><td align=\'left\' valign=\'top\'>levelez\xc5\x91</td></tr>\n</table><div align=\'left\' valign=\'top\'><b>Nyilv\xc3\xa1ntartott k\xc3\xa9pz\xc3\xa9si elemek:</b></div><table border=\'1\' cellpadding=\'2\' cellspacing=\'0\'>\n<tr><td align=\'left\' valign=\'top\'><b>k\xc3\xb3d</b></td><td align=\'left\' valign=\'top\'><b>n\xc3\xa9v</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g kezdete</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>meghird. kezdete</b></td><td align=\'left\' valign=\'top\'><b>meghird. 
v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>t\xc3\xadpus</b></td><td align=\'left\' valign=\'top\'><b>minimum kredit</b></td><td align=\'left\' valign=\'top\'><b>maximum kredit</b></td></tr><tr><td align=\'left\' valig=\'top\'><a href=\'torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=5710\'>MSPCKSM</a></td><td align=\'left\' valig=\'top\'>klinikai szoci\xc3\xa1lis munka</td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>specializ\xc3\xa1ci\xc3\xb3</td><td align=\'left\' valig=\'top\'>35</td><td align=\'left\' valig=\'top\'>40</td></tr><tr><td align=\'left\' valig=\'top\'><a href=\'torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=414\'>MSZKSMU</a></td><td align=\'left\' valig=\'top\'>szoci\xc3\xa1lis munka</td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>szak</td><td align=\'left\' valig=\'top\'>120</td><td align=\'left\' valig=\'top\'>120</td></tr></table></form>\r\n    </td>\r\n  </tr>\r\n  <tr>\r\n    <td colspan="2" bgcolor=\'#0994dc\' width="100%">\r\n      <table width="100%">\r\n\t<tr>\r\n\t  <td align=\'left\'>\r\n\t      <font size=\'1\' color=\'#ffffff\'>Az adatb\xc3\xa1zis 2022-09-24 hajnalban friss\xc3\xbclt.</font>\r\n\t  </td>\r\n\t  <td align="right">\r\n\t    <font size=\'1\' color=\'#ffffff\'>K\xc3\xa9sz\xc3\xbclt az EKOP-1.A.1-08/C-2009-0009  "Az Oktat\xc3\xa1si Hivatal k\xc3\xb6zigazgat\xc3\xa1si szolg\xc3\xa1ltat\xc3\xa1sainak elektroniz\xc3\xa1l\xc3\xa1sa" projekt keret\xc3\xa9ben. &copy; 2012.</font>\r\n\t  </td>\r\n\t</tr>\r\n    </td>\r\n  </tr>\r\n</table>\r\n</body>\r\n</html>\r\n\n'
encoding = detect_encoding(html_byte)
tree = HTMLTree.parse_from_bytes(html_byte, encoding)
str(tree)

Resiliparse does not compile under Ubuntu 18

Resiliparse does not seem to compile under Ubuntu 18. The build fails with this error message:

  building 'fastwarc.warc' extension
  creating build/temp.linux-x86_64-cpython-37
  creating build/temp.linux-x86_64-cpython-37/fastwarc
  gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/include/python3.7m -c fastwarc/warc.cpp -o build/temp.linux-x86_64-cpython-37/fastwarc/warc.o -std=c++17 -O3 -Wall -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function
  cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
  fastwarc/warc.cpp:1348:3: error: ‘LZ4F_cctx’ does not name a type; did you mean ‘LZ4F_cctx_s’?
     LZ4F_cctx *cctx;
     ^~~~~~~~~
     LZ4F_cctx_s
  fastwarc/warc.cpp:1349:3: error: ‘LZ4F_dctx’ does not name a type; did you mean ‘LZ4F_dctx_s’?
     LZ4F_dctx *dctx;
     ^~~~~~~~~
     LZ4F_dctx_s
  error: command '/usr/bin/gcc' failed with exit code 1
  ----------------------------------------
  ERROR: Failed building wheel for fastwarc

It seems the lz4 package from the Ubuntu 18 repository is the wrong version?

When I install lz4 from source, it works:

git clone https://github.com/lz4/lz4
cd lz4
make
make install

Since installing lz4 from source resolves the problem, this might not have the highest priority.

FastWARC: CLI may index gzipped WARC records with erroneous length 0

The fastwarc command-line tool "index" indexes some records of a gzipped WARC file with an erroneous record length of zero:

$> wget http://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2021/09/CC-NEWS-20210930113548-00741.warc.gz

$> fastwarc index -fwarc-type,warc-target-uri,offset,length CC-NEWS-20210930113548-00741.warc.gz \
    | grep -F '"length": "0"'
{"warc-type": "response", "warc-target-uri": "https://www.themarketsdaily.com/2021/09/30/ishares-sp-500-etf-nysearcaivv-sees-strong-trading-volume.html", "offset": "232757027", "length": "0"}
{"warc-type": "response", "warc-target-uri": "https://www.timeturk.com/yasam/baskan-buyukkilic-10-milyon-tl-yatirim-yapilan-yeralti-carsisi-nda-incelemede-bulundu-esnafi-ziyaret-etti/haber-1703634", "offset": "278528237", "length": "0"}
{"warc-type": "response", "warc-target-uri": "https://www.sondakika.com/haber/haber-yayinci-tevfik-rauf-baysal-vefat-etti-14429565/", "offset": "1044381471", "length": "0"}

See also the discussion in #11; however, fewer records are affected here. With the uncompressed WARC file, the error is not reproducible.
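The grep filter above can also be expressed in Python, which makes it easier to collect the affected offsets for later inspection. This is a minimal sketch that assumes the index output is one JSON object per line with string values, as in the excerpt above; the function name zero_length_records and the example.com URLs are hypothetical.

```python
import json


def zero_length_records(lines):
    """Yield index entries whose reported record length is "0"."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # The index output shown above encodes numbers as strings.
        if entry.get("length") == "0":
            yield entry


# Hypothetical sample lines in the same shape as the index output.
sample = [
    '{"warc-type": "response", "warc-target-uri": "https://example.com/a", "offset": "232757027", "length": "0"}',
    '{"warc-type": "response", "warc-target-uri": "https://example.com/b", "offset": "100", "length": "512"}',
]
hits = list(zero_length_records(sample))
for hit in hits:
    print(hit["offset"], hit["warc-target-uri"])
```

In practice the lines would come from the saved output of the index command rather than from an inline list.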

Interesting benchmarks running Resiliparse 'HTML2text' sequentially vs. in parallel

After benchmarking Resiliparse's "HTML2text" (extract_plain_text(tree, main_content=True)), it seems the extract_plain_text method is significantly slower in parallel than sequentially.

sequentially : 508.147 items/sec
parallel : 62.7322 items/sec

I ran the benchmarking with a tool I wrote, https://github.com/Nootka-io/wee-benchmarking-tool. I'll work on pulling out a minimal example.

This seems strange to me, and I'm not sure where to begin profiling or debugging. Other libraries see little improvement, but Resiliparse is the only one showing a dramatic drop, although it is still the fastest.
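Until a minimal example is available, a comparison like the one described above could be sketched as follows. Everything here is an assumption for illustration: fake_extract is a hypothetical stand-in for the real extraction call, the thread pool is only one possible meaning of "parallel" (the report does not say whether threads or processes were used), and items_per_sec is not part of the linked benchmarking tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_extract(html: str) -> str:
    # Hypothetical stand-in for the real text extraction call.
    return " ".join(html.split())


def items_per_sec(func, items, executor=None):
    """Run func over items, sequentially or via executor.map, and return throughput."""
    start = time.perf_counter()
    if executor is None:
        results = [func(item) for item in items]
    else:
        results = list(executor.map(func, items))
    elapsed = time.perf_counter() - start
    return len(results) / elapsed if elapsed > 0 else float("inf")


docs = ["<p>hello   world</p>"] * 2000

sequential = items_per_sec(fake_extract, docs)
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = items_per_sec(fake_extract, docs, pool)

print(f"sequential: {sequential:.1f} items/sec")
print(f"parallel:   {parallel:.1f} items/sec")
```

Swapping fake_extract for the real call and the thread pool for the actual parallel backend should show whether the drop comes from the library or from the scheduling overhead of the harness.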

yum install

Hi,

Thanks for the very nice package.

Do you know which dependencies should be installed with yum?
I am struggling to build FastWARC from source within a Lambda container. Here is my Dockerfile:

FROM public.ecr.aws/lambda/python:3.8

RUN yum groupinstall "Development Tools" -y
RUN yum install python3-devel -y
RUN yum install -y zlib-devel lz4-devel liblexbor-devel uchardet-devel 
RUN pip3 install --no-binary fastwarc fastwarc --target "${LAMBDA_TASK_ROOT}"

COPY app.py ${LAMBDA_TASK_ROOT}
CMD [ "app.handler" ]

This is the error message:

  ERROR: Command errored out with exit status 1:
   command: /var/lang/bin/python3.8 /var/lang/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /tmp/tmparkimzwm
       cwd: /tmp/pip-install-1hzfg9i1/fastwarc_fcfee32f14f34b609444e2992925ac95
  Complete output (26 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-3.8
  creating build/lib.linux-x86_64-3.8/fastwarc
  copying fastwarc/cli.py -> build/lib.linux-x86_64-3.8/fastwarc
  copying fastwarc/__init__.py -> build/lib.linux-x86_64-3.8/fastwarc
  copying fastwarc/stream_io.pxd -> build/lib.linux-x86_64-3.8/fastwarc
  copying fastwarc/warc.pxd -> build/lib.linux-x86_64-3.8/fastwarc
  copying fastwarc/__init__.pxd -> build/lib.linux-x86_64-3.8/fastwarc
  running build_ext
  building 'fastwarc.warc' extension
  creating build/temp.linux-x86_64-3.8
  creating build/temp.linux-x86_64-3.8/fastwarc
  gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/var/lang/include/python3.8 -c fastwarc/warc.cpp -o build/temp.linux-x86_64-3.8/fastwarc/warc.o -std=c++17 -O3 -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function -fpermissive -Wno-c++11-narrowing
  g++ -pthread -shared -Wl,-rpath=/var/lang/lib build/temp.linux-x86_64-3.8/fastwarc/warc.o -L/var/lang/lib -o build/lib.linux-x86_64-3.8/fastwarc/warc.cpython-38-x86_64-linux-gnu.so -std=c++17
  building 'fastwarc.stream_io' extension
  gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/var/lang/include/python3.8 -c fastwarc/stream_io.cpp -o build/temp.linux-x86_64-3.8/fastwarc/stream_io.o -std=c++17 -O3 -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function -fpermissive -Wno-c++11-narrowing
  fastwarc/stream_io.cpp: In function ‘int __pyx_pf_8fastwarc_9stream_io_9LZ4Stream_2__cinit__(__pyx_obj_8fastwarc_9stream_io_LZ4Stream*, PyObject*, PyObject*, PyObject*)’:
  fastwarc/stream_io.cpp:7441:23: error: ‘struct LZ4F_preferences_t’ has no member named ‘favorDecSpeed’
     __pyx_v_self->prefs.favorDecSpeed = __pyx_t_4;
                         ^~~~~~~~~~~~~
  At global scope:
  cc1plus: warning: unrecognized command line option ‘-Wno-c++11-narrowing’
  error: command 'gcc' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for fastwarc

Many thanks!

svg caused lexbor to crash

MRE:

from resiliparse.parse.html import HTMLTree
str(HTMLTree.parse("<svg><template>\n"))

It causes a segmentation fault, and the trace shows that it was caused by lxb_html_serialize_node_cb.

CC @lexborisov

Cannot install on Python 3.11 in Ubuntu Docker

I am getting this error inside an ubuntu:23.04 Docker container:

pip install fastwarc
Collecting fastwarc
 Downloading FastWARC-0.14.5.tar.gz (42 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.6/42.6 kB 2.0 MB/s eta 0:00:00
 Installing build dependencies ... done
 Getting requirements to build wheel ... error
 error: subprocess-exited-with-error

 × Getting requirements to build wheel did not run successfully.
 │ exit code: 1
 ╰─> [18 lines of output]
     Traceback (most recent call last):
       File "/usr/lib/python3/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
         main()
       File "/usr/lib/python3/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
         json_out['return_val'] = hook(**hook_input['kwargs'])
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       File "/usr/lib/python3/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
         return hook(config_settings)
                ^^^^^^^^^^^^^^^^^^^^^
       File "/tmp/pip-build-env-l7hwcul9/overlay/local/lib/python3.11/dist-packages/setuptools/build_meta.py", line 355, in get_requires_for_build_wheel
         return self._get_build_requires(config_settings, requirements=['wheel'])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       File "/tmp/pip-build-env-l7hwcul9/overlay/local/lib/python3.11/dist-packages/setuptools/build_meta.py", line 325, in _get_build_requires
         self.run_setup()
       File "/tmp/pip-build-env-l7hwcul9/overlay/local/lib/python3.11/dist-packages/setuptools/build_meta.py", line 341, in run_setup
         exec(code, locals())
       File "<string>", line 34, in <module>
     AttributeError: module 'setuptools.config.pyprojecttoml' has no attribute '_BetaConfiguration'. Did you mean: 'read_configuration'?
     [end of output]

 note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Steady memory growth while working on web pages

I'm trying to use Resiliparse to process Common Crawl (CC) data, but single-process memory usage grows slowly and page processing gets slower over time. I can't find the source of the problem with a memory analysis tool; it feels like some sort of memory leak.

from fastwarc.warc import ArchiveIterator as FastWarcArchiveIterator
from fastwarc.warc import WarcRecordType, WarcRecord
from fastwarc.stream_io import FastWARCError
from fastwarc.warc import is_http

from resiliparse.parse.html import HTMLTree

import os
import psutil
import time


file = open("../CC-MAIN-20231129041834-20231129071834-00864.warc.gz", "rb")

archive_iterator = FastWarcArchiveIterator(
    file,
    record_types=WarcRecordType.response,
    parse_http=True,
    func_filter=is_http,
)
idx = 0
s = time.time()
process = psutil.Process(os.getpid())
for record in archive_iterator:
    raw = record.reader.read()
    try:
        str(HTMLTree.parse(raw.decode("utf-8")))
    except Exception:
        # Skip records that are not valid UTF-8 or fail to parse
        pass
    

    idx += 1
    if idx % 1000 == 0:
        print(process.memory_info().rss / 1024**2, time.time() - s)
        s = time.time()

And here is the output:

91.0625 0.5237786769866943
108.921875 0.5350120067596436
132.015625 0.6991021633148193
132.828125 0.6604471206665039
144.25 0.6309971809387207
149.703125 0.9664597511291504
163.6875 1.0105879306793213
168.65625 0.9966611862182617
175.578125 1.1011531352996826
219.890625 0.9911088943481445
233.671875 1.1710970401763916
238.421875 1.0304219722747803
238.453125 0.9967031478881836
242.5625 1.016160011291504
246.40625 1.0688650608062744
246.40625 1.0952439308166504
246.40625 1.100167989730835
248.84375 1.077517032623291
248.84375 1.0870559215545654
248.84375 1.0127251148223877
248.84375 1.1437797546386719
253.125 1.0540661811828613
255.75 1.1424810886383057
255.75 1.0953960418701172
261.203125 1.1094980239868164
262.96875 1.293619155883789
265.6875 1.2223970890045166
265.6875 1.2639102935791016
265.6875 1.2537767887115479
265.6875 1.1750257015228271
266.203125 1.2749638557434082
266.203125 1.2159233093261719
272.484375 1.2846689224243164
275.515625 1.2536969184875488
275.515625 1.1752068996429443
275.515625 1.156343936920166
275.515625 1.249042272567749
276.765625 1.1668882369995117
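To put numbers on the trend, the two columns above (RSS in MiB and seconds per 1000 records) can be summarized with a few lines of Python. This is a rough sketch: the function name summarize is hypothetical, and the inline log uses only the first two and the last line of the output above.

```python
def summarize(log_text):
    """Summarize 'rss_mib seconds' lines printed once per batch."""
    rows = [
        tuple(float(value) for value in line.split())
        for line in log_text.splitlines()
        if line.strip()
    ]
    rss = [row[0] for row in rows]
    secs = [row[1] for row in rows]
    return {
        "batches": len(rows),
        # Difference between the last and the first RSS reading.
        "rss_growth_mib": rss[-1] - rss[0],
        # How much longer the last batch took than the first one.
        "slowdown_factor": secs[-1] / secs[0],
    }


log = """\
91.0625 0.5237786769866943
108.921875 0.5350120067596436
276.765625 1.1668882369995117
"""
summary = summarize(log)
print(summary)
```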

Nested Span

Hello,

I was wondering if you had any input on what would be the best way to traverse and remove the contents of a nested span?

Nested span to traverse:

<span normalizedcite="<span class="citation no-link">98 T.C. 141</span>">98 T.C. 141</span>

Span to remove or replace:

"<span class="citation no-link">98 T.C. 141</span>"

Using query_selector_all does not seem to match the .normalizedcite class.

Thank you
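One thing that may explain the selector result: in the markup above, normalizedcite is an attribute rather than a class, so a class selector such as .normalizedcite would not match it; the attribute form is span[normalizedcite]. The attribute value also contains unescaped double quotes, so an HTML parser will probably not treat it as a single attribute. A minimal sketch, assuming it is acceptable to clean the raw markup with a regular expression before parsing (the helper name strip_nested_cite is hypothetical and not part of Resiliparse):

```python
import re

# Matches the normalizedcite attribute when its value is a nested <span>...</span>.
NESTED_CITE = re.compile(
    r'\s+normalizedcite="<span\b[^>]*>.*?</span>"',
    re.DOTALL,
)


def strip_nested_cite(markup: str) -> str:
    """Remove the nested-span attribute so the outer span becomes well-formed."""
    return NESTED_CITE.sub("", markup)


raw = '<span normalizedcite="<span class="citation no-link">98 T.C. 141</span>">98 T.C. 141</span>'
cleaned = strip_nested_cite(raw)
print(cleaned)
```

The cleaned string can then be handed to the HTML parser as usual. To replace the attribute instead of dropping it, pass a replacement string to sub.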

pipx run resiliparse failed: ModuleNotFoundError: No module named 'joblib'

user@box:~$ pipx run resiliparse
Traceback (most recent call last):
  File "/home/user/.local/pipx/.cache/42f25da10f76b98/bin/resiliparse", line 5, in <module>
    from resiliparse.cli import main
  File "/home/user/.local/pipx/.cache/42f25da10f76b98/lib/python3.9/site-packages/resiliparse/cli.py", line 18, in <module>
    from joblib import Parallel, delayed
ModuleNotFoundError: No module named 'joblib'
user@box:~$ pipx install resiliparse
  installed package resiliparse 0.11.1, installed using Python 3.9.2
  These apps are now globally available
    - resiliparse
done! ✨ 🌟 ✨
user@box:~$ resiliparse
Traceback (most recent call last):
  File "/home/user/.local/bin/resiliparse", line 5, in <module>
    from resiliparse.cli import main
  File "/home/user/.local/pipx/venvs/resiliparse/lib/python3.9/site-packages/resiliparse/cli.py", line 18, in <module>
    from joblib import Parallel, delayed
ModuleNotFoundError: No module named 'joblib'
user@box:~$ 
