Tool to help scrape, mirror, and push content to an s3 website, and then queue it in Pocket.
The 'mirror site' also includes a `.png` screenshot and a `.pdf` "print to PDF" version.
Additional ability to consume links via Slack Bot or Google Tasks API (see `settings.cfg.example`).
- scrape site using firefox (selenium) + load `.xpi` plugins
- parse article using newspaper3k (text)
- take screenshot
- print `.pdf`
- format `.html`
- upload files to `s3` bucket under
- queue item in Pocket (`-p/--push-pocket`)
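The capture steps above could be wired together roughly as follows. This is a minimal sketch, not the tool's actual code: `artifact_paths` and `capture` are hypothetical names, the output naming scheme is invented for illustration, the `bin/` paths assume the folder layout this README describes, and it assumes Selenium 4 and newspaper3k are installed. It saves the raw page source rather than a formatted `.html`.

```python
import base64
from pathlib import Path
from urllib.parse import urlparse


def artifact_paths(url: str, out_dir: str = "out") -> dict:
    """Hypothetical naming scheme: one folder per host, one file stem per URL path."""
    parsed = urlparse(url)
    stem = parsed.path.strip("/").replace("/", "-") or "index"
    base = Path(out_dir) / parsed.netloc
    return {ext: base / f"{stem}.{ext}" for ext in ("png", "pdf", "html", "txt")}


def capture(url: str, xpi_dir: str = "xpi", out_dir: str = "out") -> dict:
    """Load the page in Firefox and write screenshot, PDF, HTML, and article text."""
    # Imported lazily so artifact_paths() works without these packages installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.service import Service
    from newspaper import Article

    options = Options()
    options.add_argument("-headless")
    options.binary_location = "bin/firefox/firefox"  # assumed layout
    service = Service(executable_path="bin/geckodriver")  # assumed layout
    driver = webdriver.Firefox(options=options, service=service)
    try:
        # Load every plugin found in xpi/ as a temporary add-on.
        for xpi in sorted(Path(xpi_dir).glob("*.xpi")):
            driver.install_addon(str(xpi.resolve()), temporary=True)
        driver.get(url)
        paths = artifact_paths(url, out_dir)
        paths["png"].parent.mkdir(parents=True, exist_ok=True)
        driver.save_screenshot(str(paths["png"]))
        # print_page() returns the PDF as a base64-encoded string.
        paths["pdf"].write_bytes(base64.b64decode(driver.print_page()))
        html = driver.page_source
        paths["html"].write_text(html, encoding="utf-8")
    finally:
        driver.quit()

    # Reuse the HTML Firefox already fetched instead of downloading again.
    article = Article(url)
    article.download(input_html=html)
    article.parse()
    paths["txt"].write_text(article.text, encoding="utf-8")
    return paths
```

Parsing the HTML that the browser rendered, rather than letting newspaper3k fetch the page itself, means the text reflects whatever the loaded plugins changed on the page.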
Similar to scrape url, but accepts URLs from a Slack bot. The Slack bot runs synchronously (it blocks), so it is best run as a systemd service.
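A blocking bot of this kind might look like the sketch below. It assumes the `slack_bolt` package in Socket Mode, which fits the `bot_token`/`app_token` pair in the settings but is not confirmed by this README; `extract_urls`, `run_bot`, and the `handle_url` callback are hypothetical names.

```python
import re

# Slack wraps links in angle brackets, optionally with a label: <https://a.b|label>
_SLACK_LINK = re.compile(r"<(https?://[^|>\s]+)(?:\|[^>]*)?>")


def extract_urls(text: str) -> list:
    """Return the URLs found in a Slack message body, in order."""
    return _SLACK_LINK.findall(text)


def run_bot(bot_token: str, app_token: str, handle_url) -> None:
    """Listen for messages and pass each URL to handle_url. Blocks until stopped."""
    # Imported lazily so extract_urls() works without slack_bolt installed.
    from slack_bolt import App
    from slack_bolt.adapter.socket_mode import SocketModeHandler

    app = App(token=bot_token)

    @app.event("message")
    def on_message(event, say):
        for url in extract_urls(event.get("text", "")):
            handle_url(url)
            say(f"queued {url}")

    # start() does not return, which is why a supervisor such as systemd is useful.
    SocketModeHandler(app, app_token).start()
```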
(TODO: dockerize this with something like selenium firefox as a base image.)
Probably doesn't work on Windows without a few tweaks to path handling.
- Python 3.9
- poetry
- firefox (tested on `89.0.1`)
- geckodriver (tested on `0.30.0`)
- firefox dependencies (varies by system). Example for Debian 11:
- libgtk-3-0
- gconf-service
- libasound2
- libatk1.0-0
- libc6
- libcairo2
- libcups2
- libdbus-1-3
- libexpat1
- libfontconfig1
- libgcc1
- libgconf-2-4
- libgdk-pixbuf2.0-0
- libglib2.0-0
- libnspr4
- libpango-1.0-0
- libpangocairo-1.0-0
- libstdc++6
- libx11-6
- libx11-xcb1
- libxcb1
- libxcomposite1
- libxcursor1
- libxdamage1
- libxext6
- libxfixes3
- libxi6
- libxrandr2
- libxrender1
- libxss1
- libxtst6
- ca-certificates
- fonts-liberation
- libnss3
- lsb-release
- xdg-utils
- wget
- geckodriver Supported platforms
- Firefox Releases - e.g. `https://archive.mozilla.org/pub/firefox/releases/{{ firefox_version }}/linux-x86_64/en-US/firefox-{{ firefox_version }}.tar.bz2`
- geckodriver Releases - e.g. `https://github.com/mozilla/geckodriver/releases/download/v{{ geckodriver_version }}/geckodriver-v{{ geckodriver_version }}-linux64.tar.gz`
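The URL templates above can be filled in and fetched with a few lines of Python. This is a sketch only: `firefox_url`, `geckodriver_url`, and `fetch_and_unpack` are hypothetical helpers, and unpacking into `bin/` assumes the archives extract to `firefox/` and `geckodriver` respectively.

```python
import tarfile
import urllib.request
from pathlib import Path

FIREFOX_URL = (
    "https://archive.mozilla.org/pub/firefox/releases/"
    "{v}/linux-x86_64/en-US/firefox-{v}.tar.bz2"
)
GECKODRIVER_URL = (
    "https://github.com/mozilla/geckodriver/releases/download/"
    "v{v}/geckodriver-v{v}-linux64.tar.gz"
)


def firefox_url(version: str) -> str:
    """Fill the Firefox release template with a concrete version."""
    return FIREFOX_URL.format(v=version)


def geckodriver_url(version: str) -> str:
    """Fill the geckodriver release template with a concrete version."""
    return GECKODRIVER_URL.format(v=version)


def fetch_and_unpack(url: str, dest_dir: str = "bin") -> None:
    """Download a release tarball and extract it into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive, _ = urllib.request.urlretrieve(url)
    # The default read mode detects gzip and bzip2 compression automatically.
    with tarfile.open(archive) as tar:
        tar.extractall(dest)
```

For the versions this README was tested with, that would be `fetch_and_unpack(firefox_url("89.0.1"))` and `fetch_and_unpack(geckodriver_url("0.30.0"))`.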
folder | description |
---|---|
xpi/ | firefox plugins that get loaded into selenium, e.g. bypass-paywall-chrome |
bin/ | expects geckodriver and a compatible firefox/firefox binary |
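A quick pre-flight check for the `bin/` layout can save a confusing Selenium traceback later. The helper below is a hypothetical sketch; it only checks that the two expected files exist and are executable, not that their versions are compatible.

```python
import os
from pathlib import Path


def missing_binaries(bin_dir: str = "bin") -> list:
    """Return the expected binaries that are absent or not executable."""
    expected = [
        Path(bin_dir) / "geckodriver",
        Path(bin_dir) / "firefox" / "firefox",
    ]
    return [
        str(path)
        for path in expected
        if not (path.is_file() and os.access(path, os.X_OK))
    ]


if __name__ == "__main__":
    for path in missing_binaries():
        print(f"missing or not executable: {path}")
```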
poetry install
- Obtain a pocket consumer key
- update `pocket_consumer_key` in `settings.cfg`
- use `./readcli pocket gen-access-token` to get an access token
- update `pocket_access_token` in `settings.cfg`
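Once both values are set, queuing a URL comes down to one authenticated POST. The sketch below assumes the publicly documented Pocket v3 `add` endpoint and uses only the standard library; whether this tool calls the endpoint directly or through a client library is not stated here, and `build_add_request` and `queue_in_pocket` are hypothetical names.

```python
import json
import urllib.request

POCKET_ADD_URL = "https://getpocket.com/v3/add"


def build_add_request(url: str, consumer_key: str, access_token: str) -> urllib.request.Request:
    """Build (but do not send) the request that adds url to the Pocket queue."""
    body = json.dumps(
        {"url": url, "consumer_key": consumer_key, "access_token": access_token}
    ).encode("utf-8")
    return urllib.request.Request(
        POCKET_ADD_URL,
        data=body,
        headers={
            "Content-Type": "application/json; charset=UTF-8",
            "X-Accept": "application/json",  # ask for a JSON response
        },
        method="POST",
    )


def queue_in_pocket(url: str, consumer_key: str, access_token: str) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_add_request(url, consumer_key, access_token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect without network access.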
- set up an s3 bucket website (e.g. with domain name)
- create IAM user with s3 bucket permissions
- update in `settings.cfg`:
  - `bucket_name`
  - `bucket_is_domain_alias`
  - `aws_access_key_id`
  - `aws_secret_access_key`
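With those values in place, an upload could look like the sketch below. It assumes `boto3` is installed; `content_type_for`, `upload_artifacts`, and the `prefix` argument are hypothetical, and the key layout is illustrative rather than the tool's own. `bucket_is_domain_alias` is not used here.

```python
import mimetypes
from pathlib import Path


def content_type_for(path) -> str:
    """Guess a Content-Type so the S3 website serves files instead of downloading them."""
    guessed, _ = mimetypes.guess_type(str(path))
    return guessed or "application/octet-stream"


def upload_artifacts(files, bucket_name, prefix, aws_access_key_id, aws_secret_access_key):
    """Upload each file under prefix/ and return the object keys that were written."""
    # Imported lazily so content_type_for() works without boto3 installed.
    import boto3

    s3 = boto3.client(
        "s3",
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
    )
    keys = []
    for file in files:
        key = f"{prefix.strip('/')}/{Path(file).name}"
        s3.upload_file(
            str(file),
            bucket_name,
            key,
            ExtraArgs={"ContentType": content_type_for(file)},
        )
        keys.append(key)
    return keys
```

Setting `ContentType` explicitly matters for a bucket website: without it, browsers may offer the `.html` or `.pdf` as a download rather than rendering it.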
- set up a slack bot
- update in `settings.cfg`:
  - `bot_token`
  - `app_token`
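Before a first run it can help to confirm that every value named above is filled in. The sketch below assumes `settings.cfg` is an INI-style file readable by `configparser`, which this README does not confirm; because the section names are not shown here, it searches all sections. `missing_settings` and `load_settings` are hypothetical helpers.

```python
import configparser

REQUIRED_KEYS = (
    "pocket_consumer_key",
    "pocket_access_token",
    "bucket_name",
    "bucket_is_domain_alias",
    "aws_access_key_id",
    "aws_secret_access_key",
    "bot_token",
    "app_token",
)


def missing_settings(parser: configparser.ConfigParser) -> list:
    """Return required keys that are absent or empty in every section."""
    present = set()
    for section in parser.sections():
        for key, value in parser.items(section):
            if value.strip():
                present.add(key)
    return [key for key in REQUIRED_KEYS if key not in present]


def load_settings(path: str = "settings.cfg") -> configparser.ConfigParser:
    """Read the settings file, failing loudly if it does not exist."""
    # Interpolation is disabled so secrets containing '%' are read verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path):
        raise FileNotFoundError(path)
    return parser
```

If the Slack bot is not used, `bot_token` and `app_token` can presumably be dropped from `REQUIRED_KEYS`.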