🕵️‍♀️ Naively watch websites for changes at regular intervals.

License: MIT License

website-watcher's Introduction

🕵️‍♀️ website-watcher

🗒 Summary

This script watches a website, saves its contents to a specified text file, compares this file's contents to the website contents at the next visit and sends an e-mail if there are differences.

Please note: This will only work for static websites, which are completely rendered on the server. To parse dynamic, JavaScript-powered websites, like Single Page Apps, you would need a tool like Selenium WebDriver. If you're interested, please refer to my blog article about "Building a cloud-native web scraper using 8 different AWS services".

🖊 Description

I made it to repeatedly check a specific webpage where university exam results get published, so that I get notified almost instantly. Another application could be watching the postal service's shipment tracking or the like.

The script is very simple: it visits a website, saves the entire HTML code into a local file and compares that file's contents to the potentially new page contents at the next visit. If there is a difference, you are notified via e-mail.

You can specify a threshold for how many single-character changes should actually be considered a change. Some webpages display the current time at the bottom right, for example, which you probably want to ignore: if the time is displayed like 6:45 pm, then a threshold of at least 5 would result in ignoring these changes.

To save memory and CPU time while idle (although it would use very little), the script runs only once per invocation and exits as soon as it has finished one website visit. To make it run repeatedly, you have to set up a cron job that simply executes the script.
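The visit, compare and notify cycle described above can be sketched roughly as follows. This is a minimal illustration and not the script's actual code: the function names `has_changed` and `check_once` are made up here, and the length-based comparison is an assumption based on the issue "Use difflib instead of length to detect changes" further down this page.

```python
from pathlib import Path


def has_changed(old: str, new: str, tolerance: int = 0) -> bool:
    """Report a change only if the lengths differ by more than `tolerance`."""
    return abs(len(new) - len(old)) > tolerance


def check_once(new_content: str, state_file: Path, tolerance: int = 0) -> bool:
    """Compare freshly fetched content with the stored copy, then store it.

    Returns True if a notification should be sent.
    """
    # A missing state file is treated as empty content (first run).
    if state_file.exists():
        old_content = state_file.read_text(encoding="utf-8")
    else:
        old_content = ""
    # Assumption: the new content always replaces the stored copy.
    state_file.write_text(new_content, encoding="utf-8")
    return has_changed(old_content, new_content, tolerance)
```

Under these assumptions the very first call reports a change, because the stored copy starts out empty, and a page whose length is unchanged is never reported.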

⚙️ Requirements

  • Python >= 3.9
  • Cron jobs

▶️ Usage

  • Clone project: git clone https://github.com/n1try/website-watcher-script
  • sudo pip3 install -r requirements.txt
  • chmod +x watcher.py
  • Create a cron job for your user account with crontab -e and add, for instance, @hourly ~/dev/watcher.py -u https://kit.edu -t 5 --adapter email -r [email protected]. This will visit kit.edu hourly and send an e-mail in case of changes, while ignoring changes of 5 characters or fewer.
  • See python3 watcher.py -h for information on all available parameters.
  • 👉 New: See batch.sh for information on how to watch multiple websites at once

Options

  • -u URL (required): URL of the website to watch
  • -t TOLERANCE: Tolerance in characters, i.e. changes with a difference of less than or equal to TOLERANCE characters will be ignored and not trigger a notification
  • -x XPATH: An XPath query to restrict watching to certain parts of a website. Only child elements of the element matching the query will be considered while watching
  • -i XPATH_IGNORE: A list of XPath queries to exclude certain parts of a website. Multiple queries can be given, separated by spaces, e.g. -i "//script" "//style".
  • -ua USER_AGENT: A custom user agent header to set in requests, e.g. for pretending to be a browser. The shortcut firefox is available to pose as Firefox 84 on Windows 10
  • --adapter ADAPTER: Which sending adapter to use (see below)
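To illustrate how -x and -i interact, here is a minimal sketch that first drops the ignored elements and then keeps only the subtree matched by the main query. It uses the standard library's ElementTree, which needs well-formed markup and supports only a subset of XPath (queries start with `.//` instead of `//`). The function name `select_watched_markup` is hypothetical, and the script itself may well rely on a full HTML parser instead.

```python
import xml.etree.ElementTree as ET


def select_watched_markup(markup: str, xpath: str, xpath_ignore=()) -> str:
    """Return the serialized subtree matching `xpath`, minus ignored elements."""
    root = ET.fromstring(markup)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    for query in xpath_ignore:
        for element in root.findall(query):
            parent = parents.get(element)
            # Skip elements that an earlier query already removed.
            if parent is not None and element in list(parent):
                # Note: remove() also discards text directly following the element.
                parent.remove(element)
    target = root.find(xpath)
    if target is None:
        raise ValueError(f"no element matches {xpath!r}")
    return ET.tostring(target, encoding="unicode")
```

For example, calling it with `".//div[@id='content']"` and `[".//script"]` keeps the content div but strips any script elements inside it, so a timestamp in a separate footer no longer influences the comparison.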

👀 Please note

When running the script for the first time, you will get an e-mail saying that there were changes, since there is a difference between the empty file and the entire website HTML code.

🔌 Adapters

Multiple send methods are supported in the form of adapters. To choose one, supply --adapter (e.g. --adapter email) as an argument to watcher.py.

To write your own adapter, you need to implement the abstract SendAdapter class. See adapters/email.py for an example.
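As a rough idea of what a custom adapter might look like, here is a self-contained sketch. The base class below is a hypothetical stand-in, not the project's real SendAdapter: its shape (an `args` attribute and a `send` method returning a success flag) is only inferred from the traceback in the "Python3.8 breaking" issue further down. `FileAdapter` and its `out_path` argument are invented for illustration.

```python
from abc import ABC, abstractmethod
from argparse import Namespace


class SendAdapter(ABC):
    """Hypothetical stand-in for the project's abstract adapter base class."""

    def __init__(self, args: Namespace):
        self.args = args

    @abstractmethod
    def send(self, data: str) -> bool:
        """Deliver the notification text; return True on success."""


class FileAdapter(SendAdapter):
    """Example adapter that appends each notification to a text file."""

    def send(self, data: str) -> bool:
        try:
            with open(self.args.out_path, "a", encoding="utf-8") as handle:
                handle.write(data + "\n")
        except OSError:
            return False
        return True
```

Check adapters/email.py for the real interface before copying this, since the actual base class may require additional methods.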

E-Mail (email)

This adapter, which is also the default one, will send an e-mail to notify about changes. It either uses local sendmail or a specified SMTP server.

Options

  -r RECIPIENT_ADDRESS          – Recipient e-mail address (required)
  -s SENDER_ADDRESS             – Sender e-mail address
  --subject SUBJECT             – E-Mail subject
  --sendmail_path SENDMAIL_PATH – Path to Sendmail binary
  --smtp                        – If set, SMTP is used instead of local Sendmail.
  --smtp_host SMTP_HOST         – SMTP server host name to send mails with – only required if "--smtp" is set
  --smtp_port SMTP_PORT         – SMTP server port – only required if "--smtp" is set
  --smtp_username SMTP_USERNAME – SMTP server login username – only required if "--smtp" is set
  --smtp_password SMTP_PASSWORD – SMTP server login password – only required if "--smtp" is set
  --disable_tls                 – If set, SMTP connection is unencrypted (TLS disabled) – only relevant if "--smtp" is set
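To show how the SMTP options fit together, the following sketch builds a message and sends it with Python's standard smtplib. It is an illustration under assumptions, not the adapter's actual code: the function names are hypothetical, and mapping --disable_tls to skipping STARTTLS is a guess at the behaviour.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Assemble a plain-text notification e-mail."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_via_smtp(message, host, port, username=None, password=None, disable_tls=False):
    """Send `message` through an SMTP server, upgrading to TLS unless disabled."""
    with smtplib.SMTP(host, port) as smtp:
        if not disable_tls:
            smtp.starttls()
        if username and password:
            smtp.login(username, password)
        smtp.send_message(message)
```

A call such as `send_via_smtp(build_message(...), "smtp.example.org", 587, "user", "secret")` would need a reachable server; the host name and credentials here are placeholders.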

Telepush (telepush)

This adapter will send a push notification via Telegram using Telepush. You have to register for the bot first to get a token. To do so, send a message to TelepushBot (Telepush was formerly called MiddlemanBot).

Options

  -r RECIPIENT_TOKEN            – Recipient token (required)
  -s SENDER                     – Sender name
  --webhook_url WEBHOOK_URL     – URL of the Telepush bot instance

Gotify (gotify)

This adapter will send a push notification via Gotify. First, you have to register a new app in Gotify and get its key as an authorization token.

Options

  --gotify_key GOTIFY_KEY       – Gotify app key / token (required)
  --gotify_url GOTIFY_URL       – Gotify server instance address (required)
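For orientation, a Gotify notification boils down to one HTTP POST to the server's /message endpoint, authenticated with the app token. The sketch below only builds the request with the standard library; the function name, the default priority and the title text are assumptions, and the adapter may construct its request differently.

```python
import json
import urllib.request


def build_gotify_request(gotify_url, gotify_key, title, message, priority=5):
    """Build (but do not send) a POST request for Gotify's /message endpoint."""
    body = json.dumps({"title": title, "message": message, "priority": priority})
    return urllib.request.Request(
        gotify_url.rstrip("/") + "/message",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Gotify-Key": gotify_key},
        method="POST",
    )
```

The request would then be sent with `urllib.request.urlopen(request, timeout=10)`.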

Ntfy.sh (ntfy)

This adapter will send a push notification via ntfy.sh.

Options

  --ntfy_topic NTFY_TOPIC       – Ntfy topic to publish to (required)
  --ntfy_url NTFY_URL           – Ntfy server instance address (optional)
  --ntfy_token NTFY_TOKEN       – Ntfy access token (if the server requires authentication) (optional)
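Publishing to ntfy amounts to an HTTP POST of the message text to the topic URL, with an optional bearer token. The following sketch builds such a request using only the standard library. The function name is hypothetical, and using https://ntfy.sh as the default server is an assumption about what the optional --ntfy_url falls back to.

```python
import urllib.request


def build_ntfy_request(topic, message, ntfy_url="https://ntfy.sh", token=None):
    """Build (but do not send) a publish request for an ntfy topic."""
    request = urllib.request.Request(
        f"{ntfy_url.rstrip('/')}/{topic}",
        data=message.encode("utf-8"),
        method="POST",
    )
    if token:
        # Access tokens are passed as a standard bearer credential.
        request.add_header("Authorization", f"Bearer {token}")
    return request
```

Passing the result to `urllib.request.urlopen` would actually publish the message.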

WebSub (websub)

This adapter will send a ping to a WebSub Hub (e.g. pubsubhubbub.superfeedr.com as a hosted service or Switchboard as a self-hosted hub). However, a check whether the target resource is actually a publisher for that hub is skipped. You should verify that yourself.

Options

  --hub_url HUB_URL             – URL of the WebSub hub to publish to (required)
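A publisher ping is typically a small form-encoded POST to the hub. The WebSub specification does not standardize this step, so the sketch below assumes the widely used PubSubHubbub convention of sending hub.mode=publish together with hub.url; a given hub may expect something different. The function name is hypothetical.

```python
import urllib.parse
import urllib.request


def build_publish_ping(hub_url, topic_url):
    """Build (but do not send) a 'content updated' ping for a WebSub hub."""
    form = urllib.parse.urlencode({"hub.mode": "publish", "hub.url": topic_url})
    return urllib.request.Request(
        hub_url,
        data=form.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Here `topic_url` would be the watched website's URL, and the hub then notifies its subscribers.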

Sub Process (subprocess)

This adapter allows executing arbitrary shell commands with the watch result included as environment variables (WATCHER_URL and WATCHER_DIFF).

Example

python watcher.py \
  -u https://kit.edu \
  --adapter subprocess \
  --cmd 'echo $WATCHER_DIFF characters changed at $WATCHER_URL > /tmp/watcher.txt'

Options

  --cmd CMD                     – A shell command to execute in case of a change (required)
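The mechanism behind this adapter can be sketched with the standard subprocess module: the command runs in a shell whose environment carries the two variables. The function name `run_on_change` is hypothetical, and the adapter's real implementation may differ in details such as output handling.

```python
import os
import subprocess


def run_on_change(cmd: str, url: str, diff: int) -> subprocess.CompletedProcess:
    """Run `cmd` in a shell with WATCHER_URL and WATCHER_DIFF set."""
    env = {**os.environ, "WATCHER_URL": url, "WATCHER_DIFF": str(diff)}
    return subprocess.run(cmd, shell=True, env=env, capture_output=True, text=True)
```

Because the variables exist only in the child process's environment, they have to be expanded by the child shell. When typing the command in an interactive shell, single-quote it so that your own shell does not expand `$WATCHER_DIFF` and `$WATCHER_URL` to empty strings first.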

Stdout / Log (stdout)

This adapter simply prints a message (either as plain text or in JSON) to the console.

Options

  --log_format LOG_FORMAT       – Format of the logged message (default: 'plain')
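The two output formats could look roughly like this. The plain text mirrors the message string visible in the traceback further down this page, while the JSON key names and the format value "json" are assumptions made for illustration.

```python
import json


def format_log_message(url: str, diff: int, log_format: str = "plain") -> str:
    """Render a change notification as plain text or as a JSON object."""
    if log_format == "json":
        return json.dumps({"url": url, "diff": diff})
    if log_format == "plain":
        return f"Difference is {diff} characters.\n{url}"
    raise ValueError(f"unsupported log format: {log_format!r}")
```

JSON output is convenient when another program consumes the log, for example via a pipe.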

🧩 Website Examples

  1. Go to the front page of eBay Kleinanzeigen (https://www.ebay-kleinanzeigen.de)
  2. Use F12 to open your browser's dev tools and switch to the Network tab
  3. Enter your search query, location and radius and hit Search
  4. Right-click the first request of type html and status code 301 and copy its URL (starts with https://www.ebay-kleinanzeigen.de/s-suchanfrage.html)
  5. Watch it: python3 watcher.py -u "<URL_FROM_STEP_4>" -ua firefox -x "//div[@id='srchrslt-content']" --adapter stdout

🧑‍💻 Developer Notes

Tests

$ python3 -m unittest discover . '*_test.py'

↗️ Contributing

Feel free to contribute! All contributions that add value to the project are welcome. Please check the issues section for bug reports and feature requests.

📓 License

MIT @ Ferdinand Mütsch

website-watcher's People

Contributors

coveritytest, dependabot[bot], mrtc, muety, n-gao, rimorres

website-watcher's Issues

Bug in batch.sh

I don't know why yet, but when using

"xpath": "//div[@class='l-splitpage-content position-relative srpold']",

in example/many.json, batch.sh crashes.

Python3.8 breaking

This used to work on python3.7, but I moved to an OS which runs python3.8. Now I get this:

$ python3 watcher.py -u https://www.thisworddoesnotexist.com --adapter email -r [email protected] -s [email protected] --subject "EXAMPLE" --smtp_username [email protected] --smtp_password EXAMPLE --smtp


> /media/jorxster/NTFS_2/1_GIT/website-watcher-script/adapters/email.py(34)send()
-> smtp = smtplib.SMTP(self.args.smtp_host, self.args.smtp_port)
Traceback (most recent call last):
  File "watcher.py", line 90, in <module>
    main(*parser.parse_known_args())
  File "watcher.py", line 68, in main
    ok = adapter.send('Difference is %s characters.\n%s' % (str(diff), args.url))
  File "/media/jorxster/NTFS_2/1_GIT/website-watcher-script/adapters/email.py", line 34, in send
    smtp = smtplib.SMTP(self.args.smtp_host, self.args.smtp_port)
  File "/usr/lib/python3.8/smtplib.py", line 253, in __init__
    (code, msg) = self.connect(host, port)
  File "/usr/lib/python3.8/smtplib.py", line 339, in connect
    self.sock = self._get_socket(host, port, self.timeout)
  File "/usr/lib/python3.8/smtplib.py", line 308, in _get_socket
    return socket.create_connection((host, port), timeout,
  File "/usr/lib/python3.8/socket.py", line 808, in create_connection
    raise err
  File "/usr/lib/python3.8/socket.py", line 796, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

cheers,
Jordan

Use difflib instead of length to detect changes

Currently you only register changes that result in a change of the length of the HTML source code. But it could happen that additions and deletions cancel out to a net length change of 0, in which case a real change is missed (a false negative). You also don't register changes from Test to test. Would you consider using difflib instead? It would also have the benefit that you could send the diff with the mail, instead of just a link to the URL.
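A minimal sketch of the difflib-based approach suggested here, using only the standard library. The function name `content_diff` and the file labels are made up for illustration; this is not code from the project.

```python
import difflib


def content_diff(old: str, new: str) -> list:
    """Return a unified diff of the two documents; empty if they are identical."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )
```

An empty result means nothing changed, and a non-empty one can be joined with newlines and put into the notification body. Unlike a length comparison, this catches edits that leave the length unchanged, such as Test becoming test.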

Predicting Tolerance

How about embedding a light-weight statistical model that can study a website for say, an hour, generate enough data to predict a suitable tolerance score to automatically detect changes?
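One very simple way to approach this idea, sketched under assumptions: sample the page length several times during an observation window, look at how much it fluctuates between consecutive samples, and pick a tolerance slightly above the typical fluctuation. The function name, the mean-plus-k-standard-deviations rule and the default k are all illustrative choices, not something the project implements.

```python
import math
import statistics


def suggest_tolerance(lengths, k: float = 3.0) -> int:
    """Suggest a tolerance from page lengths sampled at regular intervals."""
    # Absolute change in length between consecutive samples.
    deltas = [abs(b - a) for a, b in zip(lengths, lengths[1:])]
    if not deltas:
        return 0
    spread = statistics.pstdev(deltas)
    return math.ceil(statistics.mean(deltas) + k * spread)
```

A page that never changes length yields a tolerance of 0, while a page with a ticking clock or counter yields a value just above its usual jitter.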

SSL Doesn't support

Hello,

Your code does not support SSL?

It is giving out error.

requests.exceptions.SSLError: HTTPSConnectionPool(host='XXXXXX.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError(1, '_ssl.c:510: error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure'),))

XPath queries to only include certain parts of a document

Currently, any change on the website that is "bigger" than n characters is considered a change. However, this might not be useful for all websites. For instance, some websites show a timestamp or the last page's rendering time in the footer, which continuously changes.

Instead, more advanced mechanisms for change detection are desirable. For instance, one might be able to specify XPath queries to define, which sub-trees to inspect in HTML pages. For non-HTML pages, one could think of applying Regex matching to explicitly include or exclude certain parts of the document for change detection.
