arghya721 / pysitecrawler

PySiteCrawler is a Python library for web crawling and data extraction. It's designed for exploring web pages, extracting text, and managing links efficiently. You can easily store scraped data in .txt files for analysis. Future updates will add more traversal strategies.

Home Page: https://pypi.org/project/PySiteCrawler/

bfs bfs-algorithm bfs-search chromedriver crawler dfs dfs-algorithm dfs-search geckodriver library


PySiteCrawler - A Simple Web Crawling Library

PySiteCrawler is a Python library designed for web crawling and data extraction, offering a simple and efficient way to explore web pages, extract text content, and manage links during the crawling process. The library is designed to provide versatile traversal methods, with additional traversal strategies planned for future updates. All scraped data is conveniently stored in .txt files for easy access and analysis.

Features

  • Breadth-First Search Crawling: Seamlessly traverse websites using a breadth-first search strategy.

  • Depth-First Search Crawling: Efficiently explore websites using a depth-first search strategy.

  • Text Extraction: Extract text content and titles from HTML pages for further analysis.

  • Headless Browsing: Use either GeckoDriver or ChromeDriver for headless browsing.
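The two crawling strategies above differ only in visit order: BFS explores pages level by level, while DFS follows each chain of links to its end before backtracking. A minimal sketch of the difference over an in-memory link graph (hypothetical data, not the library's internals):

```python
from collections import deque

# Hypothetical link graph: page -> outgoing links.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c"],
    "/c": [],
}

def bfs_order(start):
    """Visit pages level by level (breadth-first)."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

def dfs_order(start):
    """Follow each branch to its end before backtracking (depth-first)."""
    seen, order, stack = set(), [], [start]
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        # Reverse so links are explored in their original order.
        stack.extend(reversed(LINKS[page]))
    return order

print(bfs_order("/"))  # ['/', '/a', '/b', '/c']
print(dfs_order("/"))  # ['/', '/a', '/c', '/b']
```

Note how BFS reaches both top-level pages before any deeper one, while DFS descends through "/a" to "/c" before visiting "/b".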

Prerequisites

Before using PySiteCrawler, ensure that you have the following prerequisites in place:

  • Python: PySiteCrawler requires Python 3.6 or higher. You can download the latest version of Python from python.org.

  • WebDriver Setup:

    • GeckoDriver: For Firefox browser automation, download the latest GeckoDriver from the Mozilla geckodriver releases page and make sure it is available in your system's PATH.
    • ChromeDriver: For Chrome browser automation, download the latest ChromeDriver from the ChromeDriver download page and make sure it is available in your system's PATH.

Installation

You can easily install PySiteCrawler using pip:

pip install PySiteCrawler

Classes and Functions

BFSWebCrawler

The BFSWebCrawler class provides the following methods:

  • __init__(base_url, geckodriver_path=None, chromedriver_path=None, max_depth=None, headless=False): Initialize the BFSWebCrawler instance.
  • crawl(): Perform a breadth-first search crawl on the specified website.

DFSWebCrawler

The DFSWebCrawler class provides the following methods:

  • __init__(base_url, geckodriver_path=None, chromedriver_path=None, max_depth=None, headless=False): Initialize the DFSWebCrawler instance.
  • crawl(): Perform a depth-first search crawl on the specified website.

Note: The Selenium driver used for crawling has a default timeout of 10 seconds per page.

Usage

Here's a quick example of how to use PySiteCrawler to perform a breadth-first search crawl on a website:

from PySiteCrawler.crawler.bfs_web_crawler import BFSWebCrawler

# Initialize a BFSWebCrawler
crawler = BFSWebCrawler("https://example.com", max_depth=2,
                         geckodriver_path=r"path/to/geckodriver")
crawler.crawl()

You can also pass the chromedriver_path parameter during initialization to crawl with ChromeDriver. (GeckoDriver is recommended, as ChromeDriver can fail to load some websites correctly in headless mode.)

from PySiteCrawler.crawler.dfs_web_crawler import DFSWebCrawler

# Initialize a DFSWebCrawler
crawler = DFSWebCrawler("https://example.com", max_depth=2,
                         chromedriver_path=r"path/to/chromedriver")
crawler.crawl()

After a crawl, the extracted text of every visited page is available in the website_text dictionary, keyed by URL:

text = crawler.website_text["https://example.com"]

Use any visited URL as the key to retrieve that page's extracted text.
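Because website_text is a plain dictionary, the results can be processed like any mapping. A quick sketch (the sample data below is hypothetical; in real use it would come from crawler.website_text after a crawl):

```python
# Hypothetical shape of crawler.website_text after a crawl: URL -> extracted text.
website_text = {
    "https://example.com": "Example Domain. This domain is for use in examples.",
    "https://example.com/about": "About this site.",
}

# Word count per visited page.
word_counts = {url: len(text.split()) for url, text in website_text.items()}
print(word_counts)
```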

Parameters

  • base_url: The starting URL for web crawling.
  • max_depth (optional): The maximum depth of crawling. Default is None (no limit).
  • geckodriver_path (optional): Path to the GeckoDriver executable for Firefox. Default is None, in which case ChromeDriver is used.
  • chromedriver_path (optional): Path to the ChromeDriver executable for Chrome. Default is None, in which case GeckoDriver is used.
  • headless (optional): If True, the browser runs in headless mode (no GUI display); if False, the browser GUI is visible. Default is False.
  • disable_file_generation (optional): If False (the default), the text content of each visited page is written to a .txt file; set to True to skip file generation.

Note: The base_url parameter and either geckodriver_path or chromedriver_path are required for PySiteCrawler to work correctly. Specify the WebDriver path that matches your preferred browser: if geckodriver_path is provided, GeckoDriver is used; if chromedriver_path is provided, ChromeDriver is used. GeckoDriver is recommended, as ChromeDriver may fail to load some websites correctly in headless mode.
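When file generation is enabled, each page's text ends up in its own .txt file. A rough sketch of that behavior (the filename scheme and helper below are hypothetical, not the library's exact implementation):

```python
import re
import tempfile
from pathlib import Path

def save_page_text(url, text, out_dir):
    """Write one page's extracted text to a .txt file named after its URL."""
    # Replace characters that are unsafe in filenames with underscores.
    safe_name = re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_")
    path = Path(out_dir) / f"{safe_name}.txt"
    path.write_text(text, encoding="utf-8")
    return path

out_dir = tempfile.mkdtemp()
saved = save_page_text("https://example.com/about", "About this site.", out_dir)
print(saved.name)  # https_example_com_about.txt
```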

Contribution

Contributions are welcome! If you encounter any issues or have suggestions for improvements, please feel free to open an issue or submit a pull request.

