Python-driven web crawler and scraper. Uses BeautifulSoup to gather all URLs from a target page, and initiates a crawl from a start URL, considering Whitelist/Blacklist criteria that are populated in crawl.py
Python 100.00%
python_crawler's Introduction
Crawl.py
Crawl.py is a threaded web crawler that crawls a specified domain (or collection of domains) and collects metadata
about each page it visits.
This crawler can be used to analyze the performance of a web server, analyze relationships between pages, or traverse
a site and scrape pages.
Data collected includes:
- url & parsed url (scheme/netloc/path/etc...)
- page load time in milliseconds
- page size in bytes
- link addresses on the page
- number of links on the page
- number of links within the page domain
- number of links targeting external domains
Links on each page are recorded in the 'url_canonical' table.
Visits to each link are recorded in the 'visit_metadata' table.
Relationships between a visited link and all the links in the response are recorded in the 'page_rel' table.
Planned functionality:
- addition of an optional 'scrape page' function
- global toggle to allow already-visited URLs to be requeued.
- usage guidelines and documentation
Known issues:
- URL encoding is sometimes incorrect, which may cause duplication issues and does result in errors opening requests