
Serritor

Serritor is an open source web crawler framework built upon Selenium and written in Java. It can be used to crawl dynamic web pages that require JavaScript to render data.

Using Serritor in your build

Maven

Add the following dependency to your pom.xml:

<dependency>
    <groupId>com.github.peterbencze</groupId>
    <artifactId>serritor</artifactId>
    <version>2.1.1</version>
</dependency>

Gradle

Add the following dependency to your build.gradle:

compile group: 'com.github.peterbencze', name: 'serritor', version: '2.1.1'

Manual dependencies

The standalone JAR files are available on the releases page.

Documentation

  • The Wiki contains usage information and examples
  • The Javadoc is available here

Quickstart

The Crawler abstract class provides a skeletal implementation of a crawler to minimize the effort required to create your own. The extending class should implement the crawling logic.

Below you can find a simple example that is enough to get you started:

public class MyCrawler extends Crawler {

    private final UrlFinder urlFinder;

    public MyCrawler(final CrawlerConfiguration config) {
        super(config);

        // A helper class that is intended to make it easier to find URLs on web pages
        urlFinder = UrlFinder.createDefault();
    }

    @Override
    protected void onResponseSuccess(final ResponseSuccessEvent event) {
        // Crawl every URL found on the page
        urlFinder.findAllInResponse(event.getCompleteCrawlResponse())
                .stream()
                .map(CrawlRequest::createDefault)
                .forEach(this::crawl);

        // ...
    }
}

By default, the crawler uses the HtmlUnit headless browser:

// Create the configuration
CrawlerConfiguration config = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFilterEnabled(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(CrawlRequest.createDefault("http://example.com"))
        .build();

// Create the crawler using the configuration above
MyCrawler crawler = new MyCrawler(config);

// Start crawling with HtmlUnit
crawler.start();

Of course, you can also use other browsers. Currently Chrome and Firefox are supported.

// Create the configuration
CrawlerConfiguration config = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFilterEnabled(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(CrawlRequest.createDefault("http://example.com"))
        .build();

// Create the crawler using the configuration above
MyCrawler crawler = new MyCrawler(config);

// Start crawling with Chrome
crawler.start(Browser.CHROME);
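
To crawl with Firefox instead, pass the corresponding browser constant to start; the configuration does not change. The constant name below is assumed to mirror Browser.CHROME, so verify it against the Javadoc:

// Start crawling with Firefox (constant name assumed to mirror Browser.CHROME)
crawler.start(Browser.FIREFOX);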

That's it! In just a few lines you can create a crawler that crawls every link it finds, while filtering duplicate and offsite requests. You also get access to the WebDriver, so you can use all the features that Selenium provides.
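
As a minimal sketch of that WebDriver access, you could read data from the rendered page with plain Selenium calls inside the onResponseSuccess callback shown earlier. The getWebDriver() accessor used below is an assumption; check the Javadoc for the exact method your Serritor version exposes.

// Imports needed (add at the top of MyCrawler.java)
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

// Inside MyCrawler.onResponseSuccess, after queueing the URLs that were found:

// Assumption: the Crawler class exposes the underlying Selenium WebDriver
// through an accessor such as getWebDriver(); see the Javadoc for the exact name
WebDriver driver = getWebDriver();

// Any standard Selenium call works on the rendered page
String pageTitle = driver.getTitle();
List<WebElement> links = driver.findElements(By.tagName("a"));
System.out.println(pageTitle + ": found " + links.size() + " link(s)");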
