
Web scraper, crawler and parser in C#. Designed as a simple, declarative and scalable web scraping solution.

License: GNU General Public License v3.0

C# 66.97% HTML 33.03%
Topics: crawler, parser, webcrawler, webscraping, datamining, scraper, parsing, scraping, scraping-api, scraping-data

webreaper's Introduction

logo

WebReaper

NuGet build status

Overview

WebReaper is a declarative, high-performance web scraper, crawler and parser in C#, designed as a simple, extensible and scalable web scraping solution. Easily crawl any website, parse the data, and save the structured result to a file, a database, or pretty much anywhere you want.

It provides a simple yet extensible API to make web scraping a breeze.

📋 Example:

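A minimal sketch of a typical scrape, assembled only from the builder calls shown later in this document (the start URL and selectors are borrowed from the single page application example below):

using WebReaper.Builders;

var engine = await new ScraperEngineBuilder()
    .Get("https://www.alexpavlov.dev/blog")      // start URL
    .Follow(".text-gray-900.transition")         // CSS selector for links to crawl
    .Parse(new()
    {
        new("title", ".text-3xl.font-bold"),
        new("text", ".max-w-max.prose.prose-dark")
    })
    .WriteToJsonFile("output.json")              // sink for the parsed results
    .LogToConsole()
    .BuildAsync();

await engine.RunAsync();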

Install

dotnet add package WebReaper

Requirements

.NET 7

Features

  • ⚡ High crawling speed due to parallelism and asynchrony
  • 🗒 Declarative and easy to use
  • 💾 Saving data to any data storage, such as a JSON or CSV file, MongoDB, CosmosDB, Redis, etc.
  • 🌎 Scalable: run your web scraper on any cloud VMs, serverless functions, on-prem servers, etc.
  • 🙏 Crawling and parsing Single Page Applications with Puppeteer
  • 🖥 Proxy support
  • 🌀 Extensible: replace out-of-the-box implementations with your own

Usage examples

  • Data mining
  • Gathering data for machine learning
  • Online price change monitoring and price comparison
  • News aggregation
  • Product review scraping (to watch the competition)
  • Tracking online presence and reputation

API overview

Parsing Single Page Applications

Parsing single page applications is super simple: just use the GetWithBrowser and/or FollowWithBrowser methods. In this case, Puppeteer is used to load the pages.

using WebReaper.Builders;

var engine = await new ScraperEngineBuilder()
    .GetWithBrowser("https://www.alexpavlov.dev/blog")
    .FollowWithBrowser(".text-gray-900.transition")
    .Parse(new()
    {
        new("title", ".text-3xl.font-bold"),
        new("text", ".max-w-max.prose.prose-dark")
    })
    .WriteToJsonFile("output.json")
    .PageCrawlLimit(10)
    .WithParallelismDegree(30)
    .LogToConsole()
    .BuildAsync();

await engine.RunAsync();

Additionally, you can run JavaScript on dynamic pages as they are loaded with the headless browser. To do that, add page actions such as .ScrollToEnd():

using WebReaper.Core.Builders;

var engine = await new ScraperEngineBuilder()
    .GetWithBrowser("https://www.reddit.com/r/dotnet/", actions => actions
        .ScrollToEnd()
        .Build())
    .Follow("a.SQnoC3ObvgnGjWt90zD9Z._2INHSNB8V5eaWp4P0rY_mE")
    .Parse(new()
    {
        new("title", "._eYtD2XCVieq6emjKBH3m"),
        new("text", "._3xX726aBn29LDbsDtzr_6E._1Ap4F5maDtT1E1YuCiaO0r.D3IL3FD0RFy_mkKLPwL4")
    })
    .WriteToJsonFile("output.json")
    .LogToConsole()
    .BuildAsync();

await engine.RunAsync();

Console.ReadLine();

This can be helpful if the required content is loaded only after user interactions such as clicks, scrolls, etc.

Persist the progress locally

If you want to persist the visited links and the job queue locally, so that you can resume crawling where you left off, use the ScheduleWithTextFile and TrackVisitedLinksInFile methods:

var engine = await new ScraperEngineBuilder()
    .WithLogger(logger)
    .Get("https://rutracker.org/forum/index.php?c=33")
    .Follow("#cf-33 .forumlink>a")
    .Follow(".forumlink>a")
    .Paginate("a.torTopic", ".pg")
    .Parse(new()
    {
        new("name", "#topic-title"),
        new("category", "td.nav.t-breadcrumb-top.w100.pad_2>a:nth-child(3)"),
        new("subcategory", "td.nav.t-breadcrumb-top.w100.pad_2>a:nth-child(5)"),
        new("torrentSize", "div.attach_link.guest>ul>li:nth-child(2)"),
        new("torrentLink", ".magnet-link", "href"),
        new("coverImageUrl", ".postImg", "src")
    })
    .WriteToJsonFile("result.json")
    .IgnoreUrls(blackList) // blackList: a collection of URLs to skip, defined elsewhere
    .ScheduleWithTextFile("jobs.txt", "progress.txt")
    .TrackVisitedLinksInFile("links.txt")
    .BuildAsync();

Authorization

If you need to authorize before parsing the website, you can call the SetCookies method on the Scraper, which has to fill the CookieContainer with all cookies required for authorization. You are responsible for performing the login operation with your credentials; the Scraper only uses the cookies that you provide.

var engine = await new ScraperEngineBuilder()
    .WithLogger(logger)
    .Get("https://rutracker.org/forum/index.php?c=33")
    .SetCookies(cookies =>
    {
        cookies.Add(new Cookie("AuthToken", "123"));
    })
    ...

How to disable headless mode

If you scrape pages with a browser using the GetWithBrowser and FollowWithBrowser methods, the default mode is headless, meaning that you won't see the browser during scraping. However, seeing the browser can be useful for debugging or troubleshooting. To disable headless mode, use the .HeadlessMode(false) method call.

var engine = await new ScraperEngineBuilder()
    .GetWithBrowser("https://www.reddit.com/r/dotnet/", actions => actions
        .ScrollToEnd()
        .Build())
    .HeadlessMode(false)
    ...

How to clean scraped data from the previous web scraping run

You may want to remove the data received during the previous run so that you can start your web scraping from scratch. In this case, use the dataCleanupOnStart parameter when adding a new sink:

var engine = await new ScraperEngineBuilder()
    .Get("https://www.reddit.com/r/dotnet/")
    .WriteToJsonFile("output.json", dataCleanupOnStart: true)

The dataCleanupOnStart parameter is present for all sinks, e.g. MongoDbSink, RedisSink, CosmosSink, etc.
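For instance, a CSV sink can be cleaned up the same way. This is a minimal sketch assuming a WriteToCsvFile builder method analogous to WriteToJsonFile above (the exact method name may differ between WebReaper versions):

var engine = await new ScraperEngineBuilder()
    .Get("https://www.reddit.com/r/dotnet/")
    // dataCleanupOnStart: true removes data written by the previous run before scraping starts
    .WriteToCsvFile("output.csv", dataCleanupOnStart: true)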

How to clean visited links from the previous web scraping run

To clean up the list of visited links, just pass true for the dataCleanupOnStart parameter:

var engine = await new ScraperEngineBuilder()
    .Get("https://www.reddit.com/r/dotnet/")
    .TrackVisitedLinksInFile("visited.txt", dataCleanupOnStart: true)

How to clean job queue from the previous web scraping run

The job queue is a queue of tasks scheduled for the web scraper. To clean up the job queue, pass the dataCleanupOnStart parameter set to true:

var engine = await new ScraperEngineBuilder()
    .Get("https://www.reddit.com/r/dotnet/")
    .WithTextFileScheduler("jobs.txt", "currentJob.txt", dataCleanupOnStart: true)

Distributed web scraping with Serverless approach

In the Examples folder you can find a project called WebReaper.AzureFuncs. It demonstrates the use of WebReaper with Azure Functions and consists of two serverless functions:

StartScraping

First, this function uses the ScraperConfigBuilder to build the scraper configuration.

Second, it writes the first web scraping job, containing the start URL (startUrl), to the Azure Service Bus queue, as sketched below.
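As a rough sketch (not the project's actual code), sending the initial job message with the Azure.Messaging.ServiceBus client could look like the following; the queue name and the shape of the job payload are assumptions:

using System.Text.Json;
using Azure.Messaging.ServiceBus;

// Assumed: the connection string comes from the function app's configuration.
string connectionString = Environment.GetEnvironmentVariable("ServiceBusConnectionString")!;

await using var client = new ServiceBusClient(connectionString);
ServiceBusSender sender = client.CreateSender("jobs"); // assumed queue name

// Hypothetical payload: the first job only needs the start URL.
var firstJob = new { Url = "https://rutracker.org/forum/index.php?c=33" };
await sender.SendMessageAsync(new ServiceBusMessage(JsonSerializer.Serialize(firstJob)));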

WebReaperSpider

This Azure Function is triggered by messages sent to the Azure Service Bus queue. Each message represents a web scraping job.

First, this function builds the spider that is going to execute the job from the queue.

Second, it executes the job by loading the page, parsing the content, saving the results to the database, etc.

Finally, it iterates through the newly discovered jobs and sends them to the job queue.

Extensibility

Adding a new sink to persist your data

Out of the box there are four sinks you can send your parsed data to: ConsoleSink, CsvFileSink, JsonFileSink, and CosmosSink (Azure Cosmos DB).

You can easily add your own by implementing the IScraperSink interface:

public interface IScraperSink
{
    public Task EmitAsync(ParsedData data);
}

Here is an example of the Console sink:

public class ConsoleSink : IScraperSink
{
    public Task EmitAsync(ParsedData parsedItem)
    {
        Console.WriteLine(parsedItem.Data.ToString());
        return Task.CompletedTask;
    }
}

Adding your sink to the scraper is simple: just call the AddSink method on the ScraperEngineBuilder:

var engine = await new ScraperEngineBuilder()
    .AddSink(new ConsoleSink())
    .Get("https://rutracker.org/forum/index.php?c=33")
    .Follow("#cf-33 .forumlink>a")
    .Follow(".forumlink>a")
    .Paginate("a.torTopic", ".pg")
    .Parse(new() {
        new("name", "#topic-title"),
    })
    .BuildAsync();

For other ways to extend the functionality, see the next section.

Interfaces

  • IScheduler - reads from and writes to the job queue. By default, an in-memory queue is used, but you can provide your own implementation.
  • IVisitedLinkTracker - tracks visited links. The default implementation is an in-memory tracker; you can provide your own for Redis, MongoDB, etc.
  • IPageLoader - loader that takes a URL and returns the HTML of the page as a string.
  • IContentParser - takes HTML and a schema and returns a JSON representation (JObject).
  • ILinkParser - takes HTML as a string and returns the page links.
  • IScraperSink - represents a data store for writing the results of web scraping. Takes a JObject as a parameter.
  • ISpider - a spider that does the crawling, parsing, and saving of the data.

Main entities

  • Job - a record that represents a job for the spider
  • LinkPathSelector - represents a selector for links to be crawled

Repository structure

  • WebReaper - the core library for web scraping.
  • WebReaper.ScraperWorkerService - example of using the WebReaper library in a .NET Worker Service project.
  • WebReaper.DistributedScraperWorkerService - example of using the WebReaper library in a distributed way with Azure Service Bus.
  • WebReaper.AzureFuncs - example of using the WebReaper library with a serverless approach using Azure Functions.
  • WebReaper.ConsoleApplication - example of using the WebReaper library in a console application.

See the LICENSE file for license rights and limitations (GNU GPLv3).

webreaper's People

Contributors

fossabot, justynhunter, pavlovtech


webreaper's Issues

Save crawled links and job queue to MongoDB

Is your feature request related to a problem? Please describe.
There is a sink for MongoDB, so it would be convenient to store the crawled links and current job queue in MongoDB as well.

Describe the solution you'd like
Implement the ICrawledLinkTracker interface for MongoDB to store the crawled links in MongoDB.

Duplicates

Hello and thanks for your work. It took me a while to figure out how to use it, but now it works and it's quite fast.
I just have a question: is there a way to automatically skip duplicates? Maybe by keeping a list of already-seen URLs?

Thank you <3

Save parsed results to Redis

Is your feature request related to a problem? Please describe.
WebReaper should support Redis for saving the web scraping results.

Describe the solution you'd like
Implementation of the ISink interface for Redis

Additional context
Use StackExchange.Redis package

Enhancement - Enable custom parser HtmlAgilityPack / JSON

Is your feature request related to a problem? Please describe.
I'm always frustrated when I cannot scrape a JSON endpoint (for example, the one provided by WordPress websites), or when I cannot figure out how to scrape HTML from the limited documentation of the current parser implementation, being accustomed to HtmlAgilityPack myself.

Describe the solution you'd like
Enable JSON crawling and/or a custom HTML parser implementation.

Describe alternatives you've considered
I have considered using the postprocess hook, but then many of the core features of the library would not be used at all.

Save crawled links and job queue to Azure CosmosDB

Is your feature request related to a problem? Please describe.
There is a sink for CosmosDB, so it would be convenient to store the crawled links in CosmosDB as well.

Describe the solution you'd like
Implement the ICrawledLinkTracker interface for CosmosDB to store the crawled links in CosmosDB.

SetCookies feature is not working as expected

Describe the bug
When I pass the login cookies via .SetCookies in ScraperEngineBuilder as in the example, the login is not set properly.

To Reproduce
Steps to reproduce the behavior:

  1. Extract the login cookies from the Selenium Chrome browser
  2. Inject the CookieCollection into Puppeteer via .SetCookies on the ScraperEngineBuilder as in the example
  3. Run it with headless mode disabled to see the login status (make sure parallelismDegree is set to 1)
  4. The target page loads three times and then throws the error

Expected behavior
The target page should load with the login applied.

Desktop (please complete the following information):

  • OS: Windows 11
  • Browser: Chrome
  • Version: 119.0.6045.200

Additional context
Please have a look at PuppeteerPageLoader.cs, line 63.
I think the URL must be loaded prior to setting the cookie.

Please enable JS and disable any ad blocker message returned

This is a very nice package, thank you. When using it to review the number of cars on the page below, I get the message "Please enable JS and disable any ad blocker".

Is this a limitation of the code, or is there something I can change with the headless browser?

Steps to reproduce the behavior:

var x = new ScraperEngineBuilder()
    .GetWithBrowser("https://shift.com/cars/", actions => actions
        .ScrollToEnd()
        .Build())
    .Parse(new()
    {
        new("action", "html")
    })
    .WriteToJsonFile(@"c://Oxford//output123.json")
    .LogToConsole()
    .Build()
    .Run();

[Recommendation] Separate the HTML parser framework from the core library

I always follow a pattern: Create separate projects for major dependencies.

For a crawler, we generally use HtmlAgilityPack and AngleSharp as HTML parsers. In the beginning, we use them rarely, but later on we use them in many places and they become a major dependency. I see that your core library Exoscan depends on HtmlAgilityPack. I recommend creating a separate project named Exoscan.HtmlAgilityPack and putting the concrete implementation in that project.

I usually prefer AngleSharp, so if there were Exoscan.HtmlAgilityPack and Exoscan.AngleSharp package options, I would choose Exoscan.AngleSharp, because I already use AngleSharp in my current project for other parsing tasks.

I think you chose the current design to keep things simple, and I respect that. My opinion is just a recommendation for flexibility.

Enhancement - End Engine's task once it's done scraping and reached all the target pages available.

Hello,

First of all - I very much appreciate the work you did for this project. I've just tried the library and man it's cool, works like magic and is very configurable.

As far as I understand, the main use case is the long-running engine to collect lots of data from the website and store it somewhere.
However, it might also be very useful when there's a finite amount of data to retrieve and this has to be done in finite time, say, navigating a few pages, parsing some data, and returning it. And as far as I can tell, that's very hard to achieve here.

The two ways I found to get data and parse it into an object are either using Subscribe() and then deserializing the JObject, or implementing my own IScraperSink and storing the data there for further use. I'm fine with both solutions and I've tested them; they both work perfectly. However, once the engine starts, it never stops, even when there's nothing left to parse, because nobody closes the Channel; it stays open forever and the AsyncEnumerable never ends.

Therefore, I propose a change where the engine stores the current parse status in the form of a tree, and once all the leaf pages are of the PageCategory TargetPage type, we close the channel and let Parallel.ForEachAsync stop its execution, returning from the engine and allowing callers to actually await engine execution before retrieving results. It might not be perfect, and I'm sure it's not; it's just the first idea that came to mind, and maybe you have different ideas.

Please let me know what you think about this and whether you have plans and time for this enhancement.

Thank you,
Bogdan

System.NullReferenceException

First of all, thanks for the project.
When I use the code "var engine = await new ScraperEngineBuilder()" as in the example, I get a "System.NullReferenceException" error.
What could be the problem?
