
Web scraper, crawler and parser in C#. Designed as a simple, declarative and scalable web scraping solution.

License: GNU General Public License v3.0

C# 66.97% HTML 33.03%
Topics: crawler, parser, webcrawler, webscraping, datamining, scraper, parsing, scraping, scraping-api, scraping-data

webreaper's Introduction

logo

WebReaper

NuGet build status

Overview

WebReaper is a declarative, high-performance web scraper, crawler and parser in C#, designed as a simple, extensible and scalable web scraping solution. Easily crawl any website, parse the data, and save the structured result to a file, a database, or pretty much anywhere you want.

It provides a simple yet extensible API to make web scraping a breeze.

📋 Example:

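A minimal sketch of a typical scrape, assembled only from the builder calls shown later in this document (the start URL and selectors are borrowed from the single page application example below):

using WebReaper.Builders;

var engine = await new ScraperEngineBuilder()
    .Get("https://www.alexpavlov.dev/blog")      // start URL
    .Follow(".text-gray-900.transition")         // CSS selector for links to crawl
    .Parse(new()
    {
        new("title", ".text-3xl.font-bold"),
        new("text", ".max-w-max.prose.prose-dark")
    })
    .WriteToJsonFile("output.json")              // sink for the parsed results
    .LogToConsole()
    .BuildAsync();

await engine.RunAsync();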

Install

dotnet add package WebReaper

Requirements

.NET 7

Features

  • ⚡ High crawling speed due to parallelism and asynchrony
  • 🗒 Declarative and easy to use
  • 💾 Saving data to any data storage, such as a JSON or CSV file, MongoDB, CosmosDB, Redis, etc.
  • 🌎 Scalable: run your web scraper on any cloud VMs, serverless functions, on-prem servers, etc.
  • 🙏 Crawling and parsing Single Page Applications with Puppeteer
  • 🖥 Proxy support
  • 🌀 Extensible: replace out-of-the-box implementations with your own

Usage examples

  • Data mining
  • Gathering data for machine learning
  • Online price change monitoring and price comparison
  • News aggregation
  • Product review scraping (to watch the competition)
  • Tracking online presence and reputation

API overview

Parsing Single Page Applications

Parsing single page applications is super simple: just use the GetWithBrowser and/or FollowWithBrowser methods. In this case, Puppeteer is used to load the pages.

using WebReaper.Builders;

var engine = await new ScraperEngineBuilder()
    .GetWithBrowser("https://www.alexpavlov.dev/blog")
    .FollowWithBrowser(".text-gray-900.transition")
    .Parse(new()
    {
        new("title", ".text-3xl.font-bold"),
        new("text", ".max-w-max.prose.prose-dark")
    })
    .WriteToJsonFile("output.json")
    .PageCrawlLimit(10)
    .WithParallelismDegree(30)
    .LogToConsole()
    .BuildAsync();

await engine.RunAsync();

Additionally, you can run JavaScript on dynamic pages as they are loaded with the headless browser. To do that, add page actions such as .ScrollToEnd():

using WebReaper.Core.Builders;

var engine = await new ScraperEngineBuilder()
    .GetWithBrowser("https://www.reddit.com/r/dotnet/", actions => actions
        .ScrollToEnd()
        .Build())
    .Follow("a.SQnoC3ObvgnGjWt90zD9Z._2INHSNB8V5eaWp4P0rY_mE")
    .Parse(new()
    {
        new("title", "._eYtD2XCVieq6emjKBH3m"),
        new("text", "._3xX726aBn29LDbsDtzr_6E._1Ap4F5maDtT1E1YuCiaO0r.D3IL3FD0RFy_mkKLPwL4")
    })
    .WriteToJsonFile("output.json")
    .LogToConsole()
    .BuildAsync();

await engine.RunAsync();

Console.ReadLine();

This can be helpful if the required content is loaded only after user interactions such as clicks, scrolls, etc.

Persist the progress locally

If you want to persist the visited links and the job queue locally, so that you can resume crawling where you left off, use the ScheduleWithTextFile and TrackVisitedLinksInFile methods:

var engine = await new ScraperEngineBuilder()
    .WithLogger(logger)
    .Get("https://rutracker.org/forum/index.php?c=33")
    .Follow("#cf-33 .forumlink>a")
    .Follow(".forumlink>a")
    .Paginate("a.torTopic", ".pg")
    .Parse(new()
    {
        new("name", "#topic-title"),
        new("category", "td.nav.t-breadcrumb-top.w100.pad_2>a:nth-child(3)"),
        new("subcategory", "td.nav.t-breadcrumb-top.w100.pad_2>a:nth-child(5)"),
        new("torrentSize", "div.attach_link.guest>ul>li:nth-child(2)"),
        new("torrentLink", ".magnet-link", "href"),
        new("coverImageUrl", ".postImg", "src")
    })
    .WriteToJsonFile("result.json")
    .IgnoreUrls(blackList) // blackList: a collection of URLs to skip, defined elsewhere
    .ScheduleWithTextFile("jobs.txt", "progress.txt")
    .TrackVisitedLinksInFile("links.txt")
    .BuildAsync();

Authorization

If you need to authorize before parsing the website, you can call the SetCookies method on the Scraper, which has to fill the CookieContainer with all cookies required for authorization. You are responsible for performing the login operation with your credentials; the Scraper only uses the cookies that you provide.

var engine = await new ScraperEngineBuilder()
    .WithLogger(logger)
    .Get("https://rutracker.org/forum/index.php?c=33")
    .SetCookies(cookies =>
    {
        cookies.Add(new Cookie("AuthToken", "123"));
    })
    ...

How to disable headless mode

If you scrape pages with a browser using the GetWithBrowser and FollowWithBrowser methods, the default mode is headless, meaning that you won't see the browser during scraping. However, seeing the browser can be useful for debugging or troubleshooting. To disable headless mode, use the .HeadlessMode(false) method call.

var engine = await new ScraperEngineBuilder()
    .GetWithBrowser("https://www.reddit.com/r/dotnet/", actions => actions
        .ScrollToEnd()
        .Build())
    .HeadlessMode(false)
    ...

How to clean scraped data from the previous web scraping run

You may want to remove the data received during the previous run so that you can start your web scraping from scratch. In this case, use the dataCleanupOnStart parameter when adding a new sink:

var engine = await new ScraperEngineBuilder()
    .Get("https://www.reddit.com/r/dotnet/")
    .WriteToJsonFile("output.json", dataCleanupOnStart: true)

The dataCleanupOnStart parameter is present for all sinks, e.g. MongoDbSink, RedisSink, CosmosSink, etc.
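For instance, a CSV sink can be cleaned up the same way. This is a minimal sketch assuming a WriteToCsvFile builder method analogous to WriteToJsonFile above (the exact method name may differ between WebReaper versions):

var engine = await new ScraperEngineBuilder()
    .Get("https://www.reddit.com/r/dotnet/")
    // dataCleanupOnStart: true removes data written by the previous run before scraping starts
    .WriteToCsvFile("output.csv", dataCleanupOnStart: true)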

How to clean visited links from the previous web scraping run

To clean up the list of visited links, just pass true for the dataCleanupOnStart parameter:

var engine = await new ScraperEngineBuilder()
    .Get("https://www.reddit.com/r/dotnet/")
    .TrackVisitedLinksInFile("visited.txt", dataCleanupOnStart: true)

How to clean job queue from the previous web scraping run

The job queue is a queue of tasks scheduled for the web scraper. To clean up the job queue, pass the dataCleanupOnStart parameter set to true:

var engine = await new ScraperEngineBuilder()
    .Get("https://www.reddit.com/r/dotnet/")
    .WithTextFileScheduler("jobs.txt", "currentJob.txt", dataCleanupOnStart: true)

Distributed web scraping with Serverless approach

In the Examples folder you can find a project called WebReaper.AzureFuncs. It demonstrates the use of WebReaper with Azure Functions and consists of two serverless functions:

StartScraping

First, this function uses the ScraperConfigBuilder to build the scraper configuration.

Second, it writes the first web scraping job, containing the start URL (startUrl), to the Azure Service Bus queue, as sketched below.
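As a rough sketch (not the project's actual code), sending the initial job message with the Azure.Messaging.ServiceBus client could look like the following; the queue name and the shape of the job payload are assumptions:

using System.Text.Json;
using Azure.Messaging.ServiceBus;

// Assumed: the connection string comes from the function app's configuration.
string connectionString = Environment.GetEnvironmentVariable("ServiceBusConnectionString")!;

await using var client = new ServiceBusClient(connectionString);
ServiceBusSender sender = client.CreateSender("jobs"); // assumed queue name

// Hypothetical payload: the first job only needs the start URL.
var firstJob = new { Url = "https://rutracker.org/forum/index.php?c=33" };
await sender.SendMessageAsync(new ServiceBusMessage(JsonSerializer.Serialize(firstJob)));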

WebReaperSpider

This Azure Function is triggered by messages sent to the Azure Service Bus queue. Each message represents a web scraping job.

First, this function builds the spider that is going to execute the job from the queue.

Second, it executes the job by loading the page, parsing the content, saving the results to the database, etc.

Finally, it iterates through the newly discovered jobs and sends them to the job queue.

Extensibility

Adding a new sink to persist your data

Out of the box there are four sinks you can send your parsed data to: ConsoleSink, CsvFileSink, JsonFileSink, and CosmosSink (Azure Cosmos DB).

You can easily add your own by implementing the IScraperSink interface:

public interface IScraperSink
{
    public Task EmitAsync(ParsedData data);
}

Here is an example of the Console sink:

public class ConsoleSink : IScraperSink
{
    public Task EmitAsync(ParsedData parsedItem)
    {
        Console.WriteLine(parsedItem.Data.ToString());
        return Task.CompletedTask;
    }
}

Adding your sink to the scraper is simple: just call the AddSink method on the ScraperEngineBuilder:

var engine = await new ScraperEngineBuilder()
    .AddSink(new ConsoleSink())
    .Get("https://rutracker.org/forum/index.php?c=33")
    .Follow("#cf-33 .forumlink>a")
    .Follow(".forumlink>a")
    .Paginate("a.torTopic", ".pg")
    .Parse(new() {
        new("name", "#topic-title"),
    })
    .BuildAsync();

For other ways to extend the functionality, see the next section.

Interfaces

  • IScheduler - reads from and writes to the job queue. By default, an in-memory queue is used, but you can provide your own implementation.
  • IVisitedLinkTracker - tracks visited links. The default implementation is an in-memory tracker; you can provide your own for Redis, MongoDB, etc.
  • IPageLoader - loader that takes a URL and returns the HTML of the page as a string.
  • IContentParser - takes HTML and a schema and returns a JSON representation (JObject).
  • ILinkParser - takes HTML as a string and returns the page links.
  • IScraperSink - represents a data store for writing the results of web scraping. Takes a JObject as a parameter.
  • ISpider - a spider that does the crawling, parsing, and saving of the data.

Main entities

  • Job - a record that represents a job for the spider
  • LinkPathSelector - represents a selector for links to be crawled

Repository structure

  • WebReaper - the core library for web scraping.
  • WebReaper.ScraperWorkerService - example of using the WebReaper library in a .NET Worker Service project.
  • WebReaper.DistributedScraperWorkerService - example of using the WebReaper library in a distributed way with Azure Service Bus.
  • WebReaper.AzureFuncs - example of using the WebReaper library with a serverless approach using Azure Functions.
  • WebReaper.ConsoleApplication - example of using the WebReaper library in a console application.

See the LICENSE file for license rights and limitations (GNU GPLv3).

webreaper's People

Contributors

fossabot, justynhunter, pavlovtech


webreaper's Issues

Save crawled links and job queue to MongoDB

Is your feature request related to a problem? Please describe.
There is a sink for MongoDB, so it would be convenient to store the crawled links and current job queue in MongoDB as well.

Describe the solution you'd like
Implement the ICrawledLinkTracker interface for MongoDB to store the crawled links in MongoDB.

Duplicates

Hello and thanks for your work. It took me a while to figure out how to use it, but now it works and it's quite fast.
I just have a question: is there a way to automatically skip duplicates? Maybe by keeping a list of already-seen URLs?

Thank you <3

Save parsed results to Redis

Is your feature request related to a problem? Please describe.
WebReaper should support Redis for saving the web scraping results.

Describe the solution you'd like
Implementation of the ISink interface for Redis

Additional context
Use StackExchange.Redis package

Enhancement - Enable custom parser HtmlAgilityPack / JSON

Is your feature request related to a problem? Please describe.
I'm always frustrated when I cannot scrape a JSON endpoint (for example, the one provided by WordPress websites), or when I cannot figure out how to scrape HTML from the limited documentation of the current parser implementation, being accustomed to HtmlAgilityPack myself.

Describe the solution you'd like
Enable JSON crawling and/or a custom HTML parser implementation.

Describe alternatives you've considered
I have considered using the postprocess hook, but then many of the core features of the library would not be used at all.

Save crawled links and job queue to Azure CosmosDB

Is your feature request related to a problem? Please describe.
There is a sink for CosmosDB, so it would be convenient to store the crawled links in CosmosDB as well.

Describe the solution you'd like
Implement the ICrawledLinkTracker interface for CosmosDB to store the crawled links in CosmosDB.

SetCookies feature is not working as expected

Describe the bug
When I pass the login cookies via .SetCookies in ScraperEngineBuilder as in the example, the login is not set properly.

To Reproduce
Steps to reproduce the behavior:

  1. Extract the login cookies from the Selenium Chrome browser
  2. Inject the CookieCollection into Puppeteer via .SetCookies on the ScraperEngineBuilder as in the example
  3. Run it with headless mode disabled to see the login status (make sure parallelismDegree is set to 1)
  4. The target page loads three times and then throws the error

Expected behavior
The target page should load with the login applied.

Desktop (please complete the following information):

  • OS: Windows 11
  • Browser: Chrome
  • Version: 119.0.6045.200

Additional context
Please have a look at PuppeteerPageLoader.cs, line 63.
I think the URL must be loaded prior to setting the cookie.

Please enable JS and disable any ad blocker message returned

This is a very nice package, thank you. When using it to review the number of cars on the page below, I get the message "Please enable JS and disable any ad blocker".

Is this a limitation of the code, or is there something I can change with the headless browser?

Steps to reproduce the behavior:

var x = new ScraperEngineBuilder()
    .GetWithBrowser("https://shift.com/cars/", actions => actions
        .ScrollToEnd()
        .Build())
    .Parse(new()
    {
        new("action", "html")
    })
    .WriteToJsonFile(@"c://Oxford//output123.json")
    .LogToConsole()
    .Build()
    .Run();

[Recommendation] Separate the HTML parser framework from the core library

I always follow a pattern: Create separate projects for major dependencies.

For a crawler, we generally use HtmlAgilityPack and AngleSharp as HTML parsers. In the beginning, we use them rarely, but later on we use them in many places and they become a major dependency. I see that your core library Exoscan depends on HtmlAgilityPack. I recommend creating a separate project named Exoscan.HtmlAgilityPack and putting the concrete implementation in that project.

I usually prefer AngleSharp, so if there were Exoscan.HtmlAgilityPack and Exoscan.AngleSharp package options, I would choose Exoscan.AngleSharp, because I already use AngleSharp in my current project for other parsing tasks.

I think you chose the current design to keep things simple, and I respect that. My opinion is just a recommendation for flexibility.

Enhancement - End Engine's task once it's done scraping and reached all the target pages available.

Hello,

First of all - I very much appreciate the work you did for this project. I've just tried the library and man it's cool, works like magic and is very configurable.

As far as I understand, the main use case is the long-running engine to collect lots of data from the website and store it somewhere.
However, it might also be very useful when there's a finite amount of data to retrieve and this has to be done in finite time, say, navigating a few pages, parsing some data, and returning it. And as far as I can tell, that's very hard to achieve here.

The two ways I found to get data and parse it into an object are either using Subscribe() and then deserializing the JObject, or implementing my own IScraperSink and storing the data there for further use. I'm fine with both solutions and I've tested them; they both work perfectly. However, once the engine starts, it never stops, even when there's nothing left to parse, because nobody closes the Channel; it stays open forever and the AsyncEnumerable never ends.

Therefore, I propose a change where the engine stores the current parse status in the form of a tree, and once all the leaf pages are of the PageCategory TargetPage type, we close the channel and let Parallel.ForEachAsync stop its execution, returning from the engine and allowing callers to actually await engine execution before retrieving results. It might not be perfect, and I'm sure it's not; it's just the first idea that came to mind, and maybe you have different ideas.

Please let me know what you think about this and whether you have plans and time for this enhancement.

Thank you,
Bogdan

System.NullReferenceException

First of all, thanks for the project.
When I use the code "var engine = await new ScraperEngineBuilder()" as in the example, I get a "System.NullReferenceException" error.
What could be the problem?
