scraping-boilerplate

📍 Scrape with ease using this boilerplate!

⚙️ Developed with the software and tools below:

Python


📚 Table of Contents

  • 📍 Overview
  • 📂 Project Structure
  • 🧩 Modules
  • 🚀 Getting Started
  • 🤖 Using scraping-boilerplate


📍 Overview

This codebase provides a web crawling and data extraction tool for scraping websites using regular expressions. The main script initiates the crawl from a CSV input file, using a Parser object to parse web data, and saves the crawler's progress as it goes.

The Crawler class crawls websites and processes the extracted data, while the Parser class uses regular expressions to extract new emails, phone numbers, LinkedIn URLs, and Facebook URLs from a website's HTML.

The value proposition of this project is to automate the scraping process and extract relevant data efficiently and accurately, and, most importantly, to provide a boilerplate for future scraping projects.
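
For orientation, here is a minimal sketch of how main.py wires these pieces together. The keyword arguments input_file, output_folder, and rate_limit, and the file names shown, are assumptions for illustration; the module summaries below only state that the Crawler takes a parser, an input file, an output folder, and a rate limit.

from utils.Crawler import Crawler
from utils.Parser import Parser

try:
    # Build a Parser and hand it to the Crawler (argument names are assumed)
    my_parser = Parser()
    my_crawler = Crawler(
        parser=my_parser,         # the Parser used to extract data from HTML
        input_file="input.csv",   # hypothetical CSV listing websites to crawl
        output_folder="output",   # hypothetical folder for results and progress
        rate_limit=1,             # assumed delay between requests, in seconds
    )
    # Crawl at most 5 pages and write the extracted data to the output file
    my_crawler.crawl(limit=5)
except Exception as e:
    # main.py prints an error message when any exception occurs
    print(f"An error occurred: {e}")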


📂 Project Structure
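
Based on the module paths listed under Modules below (plus requirements.txt from the installation step), the repository is laid out as follows:

scraping-boilerplate/
├── main.py
├── requirements.txt
└── utils/
    ├── Crawler.py
    ├── Parser.py
    └── utilities.py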


🧩 Modules

Root

main.py: Initiates the web crawler from the provided CSV input file, using a Parser object to parse web data. It limits the crawl to 5 pages (configurable via the limit argument) and saves the crawler's progress, printing an error message if any exception occurs.

utils/Crawler.py: Defines the Crawler class, which crawls websites and extracts data from them. The class takes parameters such as the parser to use, the input file, the output folder, and the rate limit, and provides methods to compute header indexes, process extracted data, log progression, and save the crawling progress. Its crawl method scrapes websites and writes the extracted data to an output file.

utils/Parser.py: Defines the Parser class, which uses regular expressions to extract new emails, phone numbers, LinkedIn URLs, and Facebook URLs from a website's HTML. The extraction methods are decorated with @set_func_headers to specify their headers and order, and the class also has methods to get the headers and to extract content from HTML by running all extractor methods in order.

utils/utilities.py: Contains two utility functions: a decorator that assigns attributes such as headers, func_type, and order to a function, and a function that normalizes phone numbers based on predefined rules, such as removing spaces and adding country codes (both are sketched just below).
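
Since the boilerplate revolves around this decorator, here is a minimal sketch of what utils/utilities.py plausibly contains. The name and signature of set_func_headers match its usage in the extractor example further below; the decorator body, the name normalize_phone_number, and the +33 rule are assumptions for illustration, not the project's actual code.

def set_func_headers(headers, func_type, order):
    """Attach metadata to a function so the Parser can discover and order it."""
    def decorator(func):
        func.headers = headers      # output column names the function produces
        func.func_type = func_type  # e.g. "extractor"
        func.order = order          # position in the extraction pipeline
        return func
    return decorator


def normalize_phone_number(phone):
    """Hypothetical normalizer: strip separators and prepend a country code."""
    digits = phone.replace(" ", "").replace(".", "").replace("-", "")
    if digits.startswith("0"):
        # Assumed rule: replace a leading 0 with the +33 country code
        digits = "+33" + digits[1:]
    return digits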

🚀 Getting Started

🖥 Installation

  1. Clone the scraping-boilerplate repository:
git clone https://github.com/bvelitchkine/scraping-boilerplate
  2. Change to the project directory:
cd scraping-boilerplate
  3. Install the dependencies:
pip install -r requirements.txt

🤖 Using scraping-boilerplate

Add your own extractors to the Parser class, following the same structure as the LinkedIn and Facebook extractors. For instance:

@set_func_headers(headers=["linkedin_urls"], func_type="extractor", order=3)
def __extract_linkedin_urls(self, html_page_content):
    """
    Extract linkedin urls from the website HTML.

    Args:
        html_page_content: The content of the website HTML page

    Returns:
        A dictionary with the key 'linkedin_urls' and the urls found as value
    """
    # Regular expression pattern to match linkedin urls
    linkedin_regex_pattern = (
        r"https?:\/\/(www\.)?linkedin\.com\/[a-zA-Z%\däëüïöâêûîôàèùìòé\-_,\/]{4,}"
    )

    # Extract linkedin_urls from the website HTML using regular expressions
    matches = re.finditer(linkedin_regex_pattern, html_page_content, re.IGNORECASE)
    return {"linkedin_urls": [match.group() for match in matches]}

Change the limit of the crawler in the main.py script. It's currently set to 5, but you can change it to any number you want.

my_crawler.crawl(limit=5)

Then, run the main.py script to start the crawler:

python main.py
