mikwielgus / forum-dl

Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC

License: MIT License

Python 100.00%
python scraper forum discourse phpbb simplemachines data-fetching internet-archiving warc

forum-dl's People

Contributors

mikwielgus, pabs3

Forkers

br8km

forum-dl's Issues

vBulletin errors

I would like to use forum-dl to generate a list of links from a given forum that I could then send to SingleFile to generate HTML pages of all posts in the thread.

I am using this command:

forum-dl -g --no-boards --no-files https://forum.com/forums/showthread.php?12345-title-of-the-post/page19

When I run this command against a vBulletin forum, it does not list all 19 pages of the thread as I would expect, only the one page that I entered:

https://forum.com/forums/showthread.php
https://forum.com/forums/showthread.php?12345-title-of-the-post/page19
https://forum.com/

This happens no matter which page in the forum I pass into forum-dl.
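As a stopgap for the SingleFile use case, the page URLs can be generated by hand. The following is only a sketch, not something forum-dl does: it assumes the thread uses the `/pageN` suffix visible in the URL above, that page 1 also accepts an explicit `/page1`, and that the last page number (19 here) is already known. The helper name `thread_page_urls` is made up for illustration.

```python
import re


def thread_page_urls(page_url: str, last_page: int) -> list[str]:
    """Return URLs for pages 1..last_page of the thread that page_url belongs to."""
    # Strip a trailing "/pageN" suffix, if present, to get the thread's base URL.
    base = re.sub(r"/page\d+$", "", page_url)
    # Assumption: every page, including the first, is reachable as "<base>/pageN".
    return [f"{base}/page{n}" for n in range(1, last_page + 1)]


urls = thread_page_urls(
    "https://forum.com/forums/showthread.php?12345-title-of-the-post/page19", 19
)
print("\n".join(urls))
```

The printed list can then be fed to SingleFile one URL per line.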

When I add -v to the above command, I get the following output:

DEBUG:root:Attempting GET https://forum.com/forums/showthread.php {} {}
https://forum.com/forums/showthread.php
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): forum.com:443
DEBUG:urllib3.connectionpool:https://forum.com:443 "GET /forums/showthread.php HTTP/1.1" 200 None
DEBUG:root:Attempting GET https://forum.com/forums/showthread.php?12345-title-of-the-post/page19 {} {}
https://forum.com/forums/showthread.php?12345-title-of-the-post/page19
DEBUG:urllib3.connectionpool:https://forum.com:443 "GET /forums/showthread.php?12345-title-of-the-post/page19 HTTP/1.1" 200 None
DEBUG:root:Attempting GET https://forum.com/forums/showthread.php {} {}
DEBUG:root:Attempting GET https://forum.com/forums/showthread.php {} {}
DEBUG:root:Attempting GET https://forum.com/forums/showthread.php?12345-title-of-the-post/page19 {} {}
DEBUG:root:Attempting GET https://forum.com/forums/ {} {}
https://forum.com/forums/
DEBUG:urllib3.connectionpool:https://forum.com:443 "GET /forums/ HTTP/1.1" 200 None
DEBUG:root:Attempting GET https://forum.com/forums/showthread.php?12345-title-of-the-post/page19 {} {}
DEBUG:root:Attempting GET https://forum.com/forums/ {} {}

I tried running the command to output the files to a directory:

forum-dl --files-output="test/" https://forum.com/forums/showthread.php?12345-title-of-the-post/page19

I got the following error:

INFO:root:GET https://forum.com/forums/showthread.php {} {}
INFO:root:GET https://forum.com/forums/showthread.php?12345-title-of-the-post/page19 {} {}
INFO:root:GET https://forum.com/ {} {}
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.10.11/bin/forum-dl", line 8, in <module>
    sys.exit(main())
  File "/home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/forum_dl/__init__.py", line 34, in main
    forumdl.download(
  File "/home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/forum_dl/forumdl.py", line 24, in download
    self.download_url(
  File "/home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/forum_dl/forumdl.py", line 48, in download_url
    writer.write(url)
  File "/home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/forum_dl/writers/common.py", line 78, in write
    self.write_board(base_node)
  File "/home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/forum_dl/writers/common.py", line 103, in write_board
    self._write_board_object(board)
  File "/home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/forum_dl/writers/common.py", line 235, in _write_board_object
    sys.stdout.write(f"{self._serialize_entry(entry)}\n")
  File "/home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/forum_dl/writers/jsonl.py", line 10, in _serialize_entry
    return entry.json(models_as_dict=False)
  File "/home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/typing_extensions.py", line 2562, in wrapper
    return __arg(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/pydantic/main.py", line 950, in json
    raise TypeError('The `models_as_dict` argument is no longer supported; use a model serializer instead.')
TypeError: The `models_as_dict` argument is no longer supported; use a model serializer instead.
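The wording of this TypeError matches what pydantic 2.x raises when its deprecated `json()` method receives `models_as_dict`, so a likely cause is that pydantic 2.x is installed while this forum-dl release still calls the 1.x-style API. A small check of the installed version, as a sketch (the helper names are illustrative, and the diagnosis itself is an assumption):

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Extract the major component from a version string such as '2.1.1'."""
    return int(version.split(".")[0])


def installed_major(package: str):
    """Return the installed major version of a package, or None if it is absent."""
    try:
        return major_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


pydantic_major = installed_major("pydantic")
if pydantic_major is None:
    print("pydantic is not installed in this environment")
elif pydantic_major >= 2:
    print("pydantic 2.x is installed; the 1.x-style json(models_as_dict=...) call will fail")
else:
    print("pydantic 1.x is installed; the error probably has another cause")
```

If the check reports 2.x, installing a 1.x release of pydantic in the same environment may work around the problem until forum-dl itself is updated.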

--

Result of pip3 --version

pip 23.2.1 from /home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/pip (python 3.10)

Result of uname -a

Linux computername 5.19.0-46-generic #47-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun 16 13:30:11 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Result of cat /etc/os-release

PRETTY_NAME="Ubuntu 22.10"
NAME="Ubuntu"
VERSION_ID="22.10"
VERSION="22.10 (Kinetic Kudu)"
VERSION_CODENAME=kinetic
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=kinetic
LOGO=ubuntu-logo

Too many open files when trying to download a large forum

Hello,

I tried to back up a large forum as WARC (using -f warc as output). However, after approximately 1.1 GB of download, every request started failing with a "Too many open files" error.

It looks like some connections or files are never closed?

WARNING:root:Traceback (most recent call last):
File "/home/username/.local/lib/python3.10/site-packages/urllib3/connection.py", line 198, in _new_conn
sock = connection.create_connection(
File "/home/username/.local/lib/python3.10/site-packages/urllib3/util/connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
OSError: [Errno 24] Too many open files

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/username/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 793, in urlopen
response = self._make_request(
File "/home/username/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 491, in _make_request
raise new_e
File "/home/username/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 467, in _make_request
self._validate_conn(conn)
File "/home/username/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1099, in _validate_conn
conn.connect()
File "/home/username/.local/lib/python3.10/site-packages/urllib3/connection.py", line 616, in connect
self.sock = sock = self._new_conn()
File "/home/username/.local/lib/python3.10/site-packages/urllib3/connection.py", line 213, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f03ae62f550>: Failed to establish a new connection: [Errno 24] Too many open files

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/username/.local/lib/python3.10/site-packages/requests/adapters.py", line 589, in send
File "/home/username/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 847, in urlopen
retries = retries.increment(
File "/home/username/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='redacted', port=443): Max retries exceeded with url: [redacted] (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f03ae62f550>: Failed to establish a new connection: [Errno 24] Too many open files'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/username/.local/lib/python3.10/site-packages/forum_dl/extractors/common.py", line 360, in _fetch_thread_posts
self.thread_state = yield from self._fetch_thread_page_posts(
File "/home/username/.local/lib/python3.10/site-packages/forum_dl/extractors/common.py", line 439, in _fetch_thread_page_posts
response = self._session.get(state.url)
File "/home/username/.local/lib/python3.10/site-packages/forum_dl/session.py", line 75, in get
response = self.try_get(
File "/home/username/.local/lib/python3.10/site-packages/forum_dl/session.py", line 132, in try_get
response = retrying_get(url, params=params, headers=headers, **kwargs)
File "/home/username/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 330, in wrapped_f
return self(f, *args, **kw)
File "/home/username/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 467, in __call__
do = self.iter(retry_state=retry_state)
File "/home/username/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 368, in iter
result = action(retry_state)
File "/home/username/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 410, in exc_check
raise retry_exc.reraise()
File "/home/username/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 183, in reraise
raise self.last_attempt.result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/home/username/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 470, in __call__
result = fn(*args, **kwargs)
File "/home/username/.local/lib/python3.10/site-packages/forum_dl/session.py", line 130, in retrying_get
return self._do_get(url, params=params, headers=headers, **kwargs)
File "/home/username/.local/lib/python3.10/site-packages/forum_dl/session.py", line 164, in _do_get
return self._session.get(
File "/home/username/.local/lib/python3.10/site-packages/requests/sessions.py", line 602, in get
return self.request("GET", url, **kwargs)
File "/home/username/.local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/home/username/.local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/home/username/.local/lib/python3.10/site-packages/requests/adapters.py", line 622, in send
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='redacted', port=443): Max retries exceeded with url: [redacted] (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f03ae62f550>: Failed to establish a new connection: [Errno 24] Too many open files'))
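Errno 24 means the process has reached its per-process limit on open file descriptors. To tell a descriptor leak from a limit that is merely too low, it can help to watch the descriptor count while the download runs and, as a stopgap, raise the soft limit. The sketch below uses only the standard library; the function names and the value 4096 are illustrative assumptions, and none of this is part of forum-dl.

```python
import os
import resource


def open_fd_count() -> int:
    """Approximate number of file descriptors currently open in this process."""
    # Linux exposes them under /proc/self/fd; /dev/fd is a fallback on other Unixes.
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    return len(os.listdir(fd_dir))


def raise_soft_nofile_limit(desired: int = 4096) -> tuple[int, int]:
    """Raise the soft RLIMIT_NOFILE towards `desired`, never above the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = desired if hard == resource.RLIM_INFINITY else min(desired, hard)
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = raise_soft_nofile_limit()
print(f"open descriptors: {open_fd_count()}, soft limit: {soft}, hard limit: {hard}")
```

If the descriptor count keeps growing with every fetched page, raising the limit only postpones the failure and the leak itself needs fixing.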

The term 'forum-dl' is not recognized

Hello! Any time I run
forum-dl "https://www.kanyetothe.com/threads/whats-on-your-mind-also-on-ktt2.4044058/"

in PowerShell, it gives me this error.

The term 'forum-dl' is not recognized as the name of a cmdlet, function, script file, or operable program.
Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:1
+ forum-dl "https://www.kanyetothe.com/threads/whats-on-your-mind-also- ...
+ ~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (forum-dl:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

Any idea what I am doing wrong? For reference, I installed this via "pip install forum-dl" which worked, but when I run the forum-dl command from that folder it gives me the error.
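A common cause of this message is that the directory where pip places console scripts is not on PATH. The sketch below, run with the same Python that pip installed into, prints where scripts go and whether that directory is on PATH. It is a diagnostic under that assumption, not a confirmed explanation of this report; an install made with `pip install --user` may use a different scripts directory than the one shown.

```python
import os
import shutil
import sysconfig


def scripts_dir() -> str:
    """Directory where this interpreter's default install scheme puts console scripts."""
    return sysconfig.get_path("scripts")


def on_path(directory: str) -> bool:
    """Check whether a directory appears among the entries of the PATH variable."""
    wanted = os.path.normcase(os.path.normpath(directory))
    entries = os.environ.get("PATH", "").split(os.pathsep)
    return any(os.path.normcase(os.path.normpath(e)) == wanted for e in entries if e)


print("forum-dl resolved to:", shutil.which("forum-dl"))
print("scripts directory:", scripts_dir())
print("scripts directory on PATH:", on_path(scripts_dir()))
```

If the last line prints False, adding the scripts directory to PATH and opening a new PowerShell window should make the command visible.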

Install

/Documents/forum_dl-0.3.0$ pip3 install forum-dl
/usr/bin/pip3:6: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
from pkg_resources import load_entry_point
ERROR: Could not find a version that satisfies the requirement forum-dl (from versions: none)
ERROR: No matching distribution found for forum-dl

Errors in identifying extractor

I've run into a couple of cases where forum-dl can't identify the type of forum. What if we could specify an extractor as an option when running forum-dl?

dateutil

pip complains about not finding dateutil.
PyPI calls it python-dateutil.

Install issue requires 3.8.10

Within the local repo, I ran:

python3 -m pip install .

Output:

  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
    ERROR: Package 'forum-dl-0.3.0' requires a different Python: 3.8.10 not in '>=3.10.11'
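The error says the package metadata requires Python `>=3.10.11`, while the interpreter running pip is 3.8.10. A quick way to check whether a given interpreter meets that bound is sketched below; the minimum version is copied from the error message, and the function name is illustrative.

```python
import sys

# Minimum version taken from the '>=3.10.11' in the pip error message.
REQUIRED = (3, 10, 11)


def satisfies(version, minimum=REQUIRED) -> bool:
    """Compare the (major, minor, micro) part of a version against the minimum."""
    return tuple(version[:3]) >= minimum


print("running:", ".".join(map(str, sys.version_info[:3])))
print("meets requirement:", satisfies(sys.version_info))
```

When several interpreters are installed, running pip through the newer one (for example `python3.10 -m pip install .`) selects the right version, provided that interpreter is at least 3.10.11.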
    

Error on install:

When I try installing from pip, I get the following error:

ERROR: Could not find a version that satisfies the requirement forum-dl (from versions: none)
ERROR: No matching distribution found for forum-dl

When I try to install from a local clone, I get the following error:

ERROR: File "setup.py" not found. Directory cannot be installed in editable mode: /path/to/forum-dl
(A "pyproject.toml" file was found, but editable mode currently requires a setup.py based build.)

For pip, I've tried pip and pip3. I generally need to use pip3 to install things from pip for python3. I get the same error using both commands.

Results of uname -a

Linux computername 5.15.0-79-generic #86~20.04.2-Ubuntu SMP Mon Jul 17 23:27:17 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/os-release

NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

pip3 --version

pip 20.0.2 from /usr/lib/python3/dist-packages/pip (python 3.8)

Thanks for any help. And let me know if you need any additional data.

Problem with Discourse extractor?

Hey @mikwielgus, thanks for this project; I have been looking for a way to crawl a large Discourse site. I am having the problem below and am not sure whether it is a Python bug or PEBCAK.

python3.10 -m pip install forum-dl
forum-dl "https://community.example.com/" -o ~/Downloads/community.jsonl --textify -f jsonl
{
    "generator": "forum-dl",
    "version": "0.1.0",
    "extractor": "discourse",
    "download_time": "2023-05-21T22:32:10.732919+00:00",
    "type": "post",
    "item": {
        "path": [
            "1",
            "12973"
        ],
        "url": "https://community.example.com/t/slug/<built-in function id>",
        "origin": "https://community.example.com/t/slug/12973.json",
...

The issue is that the URL has <built-in function id> after the topic slug instead of the item path/id. I guess it has to do with this line which appears to clash with the built-in Python id function...
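The symptom can be reproduced in isolation. When no variable named `id` is in scope, the name resolves to Python's built-in `id` function, and formatting that function into a string yields exactly the text seen in the URL. The functions below are a standalone illustration with made-up names, not forum-dl's actual code.

```python
def buggy_url(base: str, slug: str) -> str:
    # No local "id" exists here, so the name refers to the built-in function.
    return f"{base}/t/{slug}/{id}"


def fixed_url(base: str, slug: str, topic_id: int) -> str:
    # Passing the identifier explicitly under a non-clashing name avoids the problem.
    return f"{base}/t/{slug}/{topic_id}"


print(buggy_url("https://community.example.com", "slug"))
print(fixed_url("https://community.example.com", "slug", 12973))
```

Using a name such as `topic_id` instead of `id` also means a missing assignment fails loudly with a NameError rather than silently producing a wrong URL.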

Set Date to Post date.

For mbox, the date in the header should be set to the actual post date instead of the date that it was scraped.
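A minimal sketch of the requested behaviour using the standard library's email package: the Date header is formatted from the post's own timestamp rather than from the current time. The function name, its parameters, and the sample values are hypothetical and do not reflect forum-dl's writer code.

```python
from datetime import datetime, timezone
from email.message import EmailMessage
from email.utils import format_datetime


def post_to_message(author: str, subject: str, body: str, posted_at: datetime) -> EmailMessage:
    """Build a message whose Date header reflects when the post was written."""
    msg = EmailMessage()
    msg["From"] = author
    msg["Subject"] = subject
    # RFC 5322 date string derived from the post time, not the scrape time.
    msg["Date"] = format_datetime(posted_at)
    msg.set_content(body)
    return msg


posted = datetime(2023, 5, 21, 22, 32, 10, tzinfo=timezone.utc)
message = post_to_message("alice@example.com", "Hello", "First post.", posted)
print(message["Date"])
```

The scrape time could still be preserved in a separate custom header if it is useful for archival purposes.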
