Home Page: https://webreaper.io

License: GNU General Public License v3.0

WebReaper

Overview

A declarative, high-performance web scraper in C#. Easily crawl any web site, parse the data, and save the structured result to a file, a database, etc.

Install

dotnet add package WebReaper

Requirements

.NET 6

โ— This is work in progress! API is not stable and will change.

📋 Example:

new Scraper()
    .WithStartUrl("https://rutracker.org/forum/index.php?c=33")
    .FollowLinks("#cf-33 .forumlink>a") // first-level links
    .FollowLinks(".forumlink>a")        // second-level links
    .FollowLinks("a.torTopic", ".pg")   // third-level links to target pages
    .Parse(new Schema {
        new("name", "#topic-title"),
        new("category", "td.nav.t-breadcrumb-top.w100.pad_2>a:nth-child(3)"),
        new Url("torrentLink", ".magnet-link"), // get a link from an <a> HTML tag (href attribute)
        new Image("coverImageUrl", ".postImg")  // get the image link from an HTML <img> tag (src attribute)
    })
    .WriteToJsonFile("result.json")
    .Run(10); // 10 = degree of parallelism

Features:

  • ⚡ It's extremely fast due to parallelism and asynchrony
  • 🗒 Declarative parsing with a structured schema
  • 💾 Saving data to any sink, such as a JSON or CSV file, MongoDB, CosmosDB, Redis, etc.
  • 🌎 Distributed crawling support: run your web scraper on any cloud VMs, serverless functions, on-prem servers, etc.
  • 🐙 Crawling and parsing Single Page Applications as well as static pages
  • 🌀 Automatic retries

Usage examples

  • Data mining
  • Gathering data for machine learning
  • Online price change monitoring and price comparison
  • News aggregation
  • Product review scraping (to watch the competition)
  • Gathering real estate listings
  • Tracking online presence and reputation
  • Web mashup and web data integration
  • MAP compliance
  • Lead generation

API overview

SPA parsing example

Parsing single-page applications is simple: just specify PageType.Dynamic.

scraper = new Scraper()
    .WithLogger(logger)
    .WithStartUrl("https://rutracker.org/forum/index.php?c=33", PageType.Dynamic)
    .FollowLinks("#cf-33 .forumlink>a", PageType.Dynamic)
    .FollowLinks(".forumlink>a", PageType.Dynamic)
    .FollowLinks("a.torTopic", ".pg", PageType.Dynamic)
    .Parse(new Schema {
        new("name", "#topic-title"),
        new("category", "td.nav.t-breadcrumb-top.w100.pad_2>a:nth-child(3)"),
        new("subcategory", "td.nav.t-breadcrumb-top.w100.pad_2>a:nth-child(5)"),
        new("torrentSize", "div.attach_link.guest>ul>li:nth-child(2)"),
        new Url("torrentLink", ".magnet-link"),
        new Image("coverImageUrl", ".postImg")
    })
    .WriteToJsonFile("result.json")
    .WriteToCsvFile("result.csv")
    .IgnoreUrls(blackList);

Additionally, you can run any JavaScript on dynamic pages as they are loaded with a headless browser. To do that, pass the JavaScript as the third parameter:

.WithStartUrl("https://rutracker.org/forum/index.php?c=33", PageType.Dynamic, "alert('startPage')")
.FollowLinks("#cf-33 .forumlink>a", PageType.Dynamic, "alert('first level page')")
.FollowLinks(".forumlink>a", PageType.Dynamic, "alert('second level page')")
.FollowLinks("a.torTopic", ".pg", PageType.Dynamic, "alert('third level page')")

This can be helpful if the required content is loaded only after user interactions such as clicks, scrolls, etc.
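
For example, a minimal sketch (the scroll script here is only illustrative, not a WebReaper requirement) that scrolls dynamic target pages to the bottom so lazily loaded content is rendered before parsing:

.FollowLinks("a.torTopic", ".pg", PageType.Dynamic,
    "window.scrollTo(0, document.body.scrollHeight);") // illustrative: scroll to the bottom before parsing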

Authorization

If you need to authorize before parsing the web site, call the Authorize method on the Scraper; it has to return a CookieContainer with all the cookies required for authorization. You are responsible for performing the login with your credentials; the Scraper only uses the cookies that you provide.

scraper = new Scraper()
    .WithLogger(logger)
    .WithStartUrl("https://rutracker.org/forum/index.php?c=33")
    .Authorize(() =>
    {
        var container = new CookieContainer();
        container.Add(new Cookie("AuthToken", "123", "/", "rutracker.org")); // the cookie needs a path and domain
        return container;
    });
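
If the cookies come from a login form, one possible way to obtain them (a sketch only; the login URL and form field names below are hypothetical) is to post your credentials with an HttpClient that shares a CookieContainer, and return that container from Authorize:

scraper = new Scraper()
    .WithStartUrl("https://rutracker.org/forum/index.php?c=33")
    .Authorize(() =>
    {
        // Hypothetical login endpoint and form field names.
        var cookies = new CookieContainer();
        var handler = new HttpClientHandler { CookieContainer = cookies };
        using var client = new HttpClient(handler);

        var form = new FormUrlEncodedContent(new Dictionary<string, string>
        {
            ["login_username"] = "user",
            ["login_password"] = "password"
        });

        // Cookies set by the server during login are captured in the container.
        client.PostAsync("https://rutracker.org/forum/login.php", form)
              .GetAwaiter().GetResult();

        return cookies;
    });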

Distributed web scraping with a serverless approach

In the Examples folder, you can find a project called WebReaper.AzureFuncs that demonstrates the use of WebReaper with Azure Functions. It consists of two serverless functions:

StartScraping

First, this function uses ScraperConfigBuilder to build the scraper configuration, e.g.:

var config = new ScraperConfigBuilder()
    .WithLogger(_logger)
    .WithStartUrl("https://rutracker.org/forum/index.php?c=33")
    .FollowLinks("#cf-33 .forumlink>a")
    .FollowLinks(".forumlink>a")
    .FollowLinks("a.torTopic", ".pg")
    .WithScheme(new Schema
    {
        new("name", "#topic-title"),
        new("category", "td.nav.t-breadcrumb-top.w100.pad_2>a:nth-child(3)"),
        new("subcategory", "td.nav.t-breadcrumb-top.w100.pad_2>a:nth-child(5)"),
        new("torrentSize", "div.attach_link.guest>ul>li:nth-child(2)"),
        new Url("torrentLink", ".magnet-link"),
        new Image("coverImageUrl", ".postImg")
    })
    .Build();

Second, this function writes the first web scraping job with the start URL to the Azure Service Bus queue:

var jobQueueWriter = new AzureJobQueueWriter("connectionString", "jobqueue");

await jobQueueWriter.WriteAsync(new Job(
    config.ParsingScheme!,
    config.BaseUrl,
    config.StartUrl!,
    ImmutableQueue.Create(config.LinkPathSelectors.ToArray()),
    DepthLevel: 0));

WebReaperSpider

This Azure Function is triggered by messages sent to the Azure Service Bus queue. Each message represents a web scraping job.

First, this function builds the spider that will execute the job from the queue:

var spider = new SpiderBuilder()
    .WithLogger(log)
    .IgnoreUrls(blackList)
    .WithLinkTracker(LinkTracker)
    .AddSink(CosmosSink)
    .Build();

Second, it executes the job by loading the page, parsing the content, saving it to the database, etc.:

var newJobs = await spider.CrawlAsync(job);

The CrawlAsync method returns new jobs that are produced as a result of handling the current job.

Finally, it iterates through these new jobs and sends them to the job queue:

foreach(var newJob in newJobs)
{
    log.LogInformation($"Adding to the queue: {newJob.Url}");
    await outputSbQueue.AddAsync(SerializeToJson(newJob));
}

Extensibility

Adding a new sink to persist your data

Out of the box, there are four sinks you can send your parsed data to: ConsoleSink, CsvFileSink, JsonFileSink, and CosmosSink (Azure Cosmos DB).

You can easily add your own by implementing the IScraperSink interface:

public interface IScraperSink
{
    public Task EmitAsync(JObject scrapedData);
}

Here is an example of the Console sink:

public class ConsoleSink : IScraperSink
{
    public Task EmitAsync(JObject scrapedData)
    {
        Console.WriteLine(scrapedData.ToString());
        return Task.CompletedTask;
    }
}

The scrapedData parameter is a JSON object that contains the fields you specified in your schema.
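
As another illustration (a sketch only, not part of the library), a sink could append each scraped item to a file as one JSON line:

public class JsonLinesFileSink : IScraperSink
{
    private readonly string path;

    public JsonLinesFileSink(string path) => this.path = path;

    public async Task EmitAsync(JObject scrapedData)
    {
        // Append the item as a single JSON line. This sketch does no locking,
        // so it assumes the sink is not called concurrently.
        await File.AppendAllTextAsync(path,
            scrapedData.ToString(Newtonsoft.Json.Formatting.None) + Environment.NewLine);
    }
}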

Adding your sink to the Scraper is simple: just call the AddSink method on the Scraper:

scraper = new Scraper()
    .AddSink(new ConsoleSink())
    .WithStartUrl("https://rutracker.org/forum/index.php?c=33")
    .FollowLinks("#cf-33 .forumlink>a")
    .FollowLinks(".forumlink>a")
    .FollowLinks("a.torTopic", ".pg")
    .Parse(new Schema {
        new("name", "#topic-title"),
    });

For other ways to extend the functionality, see the next section.

Interfaces

| Interface | Description |
|---|---|
| IJobQueueReader | Reads from the job queue. By default, an in-memory queue is used, but you can provide your own implementation for RabbitMQ, an Azure Service Bus queue, etc. |
| IJobQueueWriter | Writes to the job queue. By default, an in-memory queue is used, but you can provide your own implementation for RabbitMQ, an Azure Service Bus queue, etc. (see the sketch after this table) |
| ICrawledLinkTracker | Tracks visited links. The default implementation is an in-memory tracker. You can provide your own for Redis, MongoDB, etc. |
| IPageLoader | Loader that takes a URL and returns the HTML of the page as a string |
| IContentParser | Takes HTML and a schema and returns a JSON representation (JObject) |
| ILinkParser | Takes HTML as a string and returns the page links |
| IScraperSink | Represents a data store for writing the results of web scraping. Takes a JObject as a parameter |
| ISpider | A spider that does the crawling, parsing, and saving of the data |
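
As a sketch of how a custom queue might be plugged in (assuming IJobQueueWriter exposes a WriteAsync(Job) method, as used in the Azure Functions example above; the actual interface may differ), a writer backed by System.Threading.Channels could look like this:

using System.Threading.Channels;

public class ChannelJobQueueWriter : IJobQueueWriter
{
    private readonly Channel<Job> channel;

    public ChannelJobQueueWriter(Channel<Job> channel) => this.channel = channel;

    // Assumption: the interface has a single WriteAsync(Job) member.
    public async Task WriteAsync(Job job) =>
        await channel.Writer.WriteAsync(job);
}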

Main entities

  • Job - a record that represents a job for the spider
  • LinkPathSelector - represents a selector for links to be crawled
  • PageCategory - an enum, calculated automatically based on the job's fields. Possible values:
    • TransitPage - any page on the path to a target page that you want to parse
    • PageWithPagination - a page with pagination, such as a catalog of goods, blog posts with pagination, etc.
    • TargetPage - a page that you want to scrape and save the result from

Repository structure

| Project | Description |
|---|---|
| WebReaper.Core | Library for web scraping |
| ScraperWorkerService | Example of using the WebReaper library in a .NET Worker Service project |
| DistributedScraperWorkerService | Example of using the WebReaper library in a distributed way with Azure Service Bus |
| WebReaper.AzureFuncs | Example of using the WebReaper library with a serverless approach using Azure Functions |

Coming soon:

  • NuGet package
  • Azure Functions for distributed crawling
  • Parsing lists
  • Loading pages with a headless browser and flexible SPA page manipulations (clicks, scrolls, etc.)
  • Add flexible conditions for ignoring or allowing certain pages
  • Breadth-first traversal with priority channels
  • Save auth cookies to Redis
  • REST API example for web scraping
  • Proxy support
  • Sitemap crawling support
  • Ports to NodeJS and Go

Features under consideration

  • Saving logs to Seq
  • Add LogTo method with Console and File support
  • Site API support
  • CRON for scheduling
  • Request auto throttling
  • Add a Bloom filter to avoid revisiting the same URLs
  • Improve architecture and refactor
  • Subscribe to logs with lambda expression

See the LICENSE file for license rights and limitations (GNU GPLv3).
