Tenteikura

A minimal C# multithreaded web crawler

Usage

From a user's point of view, all that is needed to start the crawler is a call to the Crawl method on a Crawler instance. Crawler's constructor takes a Cache instance as a parameter, which in turn requires a starting URL and a target directory to be instantiated.

    String targetDirectory = @"C:\tenteikura_cache";
    Uri startingURL        = new Uri("http://www.andreadallera.com");
    Cache cache            = new Cache(startingURL, targetDirectory);
    Crawler crawler        = new Crawler(cache);
    crawler.Crawl(startingURL); //starts the crawler at http://www.andreadallera.com

Crawler's constructor also takes an optional boolean parameter (default false) which, if true, instructs the crawler to follow links outside the starting URI's domain:

    new Crawler(cache, true);  //will follow urls outside the starting URI's domain
    new Crawler(cache, false); //will fetch only pages inside the starting URI's domain
    new Crawler(cache);        //same as above

Run this way, the crawler only keeps the downloaded pages in the Cache object, which is an IEnumerable of Page objects:

    foreach(Page page in cache) 
    {
        Console.WriteLine(page.Title);  //page title
        Console.WriteLine(page.HTML);   //page full HTML
        Console.WriteLine(page.Uri);    //page URI object
        Console.WriteLine(page.Hash);   //a hash of the URI's AbsoluteUri
        foreach(Uri link in page.Links) 
        {
            //the page exposes an IEnumerable<Uri> containing all the links found on the page itself
            Console.WriteLine(link.AbsoluteUri);
        }
    }

Crawler exposes two events - NewPageFetched and WorkComplete:

    //fired when a valid page not already in the cache is downloaded
    crawler.NewPageFetched += (page) => {
        //do something with the fetched page
    };
    //fired when the crawler has no more pages left to fetch
    crawler.WorkComplete += () => {
        //shut down the application, or forward to the GUI, or whatever
    };

If you want to persist the fetched pages, a very rudimentary file-system-backed storage option is available via the Persister class:

    Persister persister = new Persister(targetDirectory, startingURL);
    crawler.NewPageFetched += (page) => {
        persister.save(page);
    };

Persister will save each page, in a subdirectory of targetDirectory named after startingURL.Authority, as two files: one, named page.Hash + ".link", contains the page's absolute URI; the other, named page.Hash, contains the full page HTML.
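For example, with the targetDirectory and startingURL used above, the cache directory would look roughly like this (the hash values are illustrative placeholders):

    C:\tenteikura_cache\
        www.andreadallera.com\
            1a2b3c4d        <- full HTML of one page
            1a2b3c4d.link   <- absolute URI of that page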

There is an example console application in Tenteikura.Example.
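Putting the pieces together, here is a rough sketch of a complete console program using the API exactly as shown above (the Tenteikura namespace and the ManualResetEvent-based wait are assumptions for illustration, not taken from the project):

    using System;
    using System.Threading;
    using Tenteikura; //assumed namespace for Cache, Crawler, Persister and Page

    class Program
    {
        static void Main()
        {
            String targetDirectory = @"C:\tenteikura_cache";
            Uri startingURL        = new Uri("http://www.andreadallera.com");
            Cache cache            = new Cache(startingURL, targetDirectory);
            Crawler crawler        = new Crawler(cache);
            Persister persister    = new Persister(targetDirectory, startingURL);
            ManualResetEvent done  = new ManualResetEvent(false);

            //persist every newly fetched page as it arrives
            crawler.NewPageFetched += (page) => {
                Console.WriteLine(page.Uri.AbsoluteUri);
                persister.save(page);
            };
            //signal the main thread once the crawl is finished
            crawler.WorkComplete += () => done.Set();

            crawler.Crawl(startingURL);
            done.WaitOne(); //block until WorkComplete fires
        }
    }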

TO DO

There is a hard dependency between Cache and Persister at the moment: Cache expects the pages under the targetDirectory + startingUri.Authority path to be in the same format as the ones saved by Persister, while the loading strategy should be injected (and ideally provided by Persister itself).
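As a rough sketch of what that injection could look like (the names here are hypothetical, not part of the current codebase), Cache could receive a loader abstraction instead of reading the directory itself:

    //hypothetical abstraction - not part of the current codebase
    public interface IPageLoader
    {
        //yields the previously persisted pages for the given starting URI
        IEnumerable<Page> Load(string targetDirectory, Uri startingUri);
    }

    //Persister would implement IPageLoader, and Cache would take one, e.g.:
    //Cache cache = new Cache(startingURL, targetDirectory, loader);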

Persister should use a more effective storage strategy, perhaps backed by an RDBMS or a document store.

The pages are fetched in random order, so there is no traversal priority strategy of any kind.
