Abot

Please star this project!!

C# web crawler built for speed and flexibility.

Abot is an open source C# web crawler framework built for speed and flexibility. It takes care of the low-level plumbing (multithreading, HTTP requests, scheduling, link parsing, etc.). You just register for events to process the page data. You can also plug in your own implementations of core interfaces to take complete control over the crawl process. Abot NuGet package versions >= 2.0 target .NET Standard 2.0 and versions < 2.0 target .NET Framework 4.0, which makes it highly compatible with many .NET Framework/Core implementations.

What's So Great About It?
  • Open Source under the Apache 2.0 license (free for commercial and personal use)
  • It's fast, really fast!!
  • Easily customizable (Pluggable architecture allows you to decide what gets crawled and how)
  • Heavily unit tested (High code coverage)
  • Very lightweight (not over engineered)
  • No out of process dependencies (no databases, no installed services, etc...)
Links of Interest
Use AbotX for more powerful extensions/wrappers




Quick Start

Installing Abot
  • Install Abot using NuGet
PM> Install-Package Abot
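
If you prefer the .NET CLI over the Package Manager Console, the equivalent command should be:

dotnet add package Abot
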
Using Abot
using System;
using System.Threading.Tasks;
using Abot2.Core;
using Abot2.Crawler;
using Abot2.Poco;
using Serilog;

namespace TestAbotUse
{
    class Program
    {
        static async Task Main(string[] args)
        {
            Log.Logger = new LoggerConfiguration()
                .MinimumLevel.Information()
                .WriteTo.Console()
                .CreateLogger();

            Log.Logger.Information("Demo starting up!");

            await DemoSimpleCrawler();
            await DemoSinglePageRequest();
        }

        private static async Task DemoSimpleCrawler()
        {
            var config = new CrawlConfiguration
            {
                MaxPagesToCrawl = 10, //Only crawl 10 pages
                MinCrawlDelayPerDomainMilliSeconds = 3000 //Wait this many millisecs between requests
            };
            var crawler = new PoliteWebCrawler(config);

            crawler.PageCrawlCompleted += PageCrawlCompleted;//Several events available...

            var crawlResult = await crawler.CrawlAsync(new Uri("http://!!!!!!!!YOURSITEHERE!!!!!!!!!.com"));
        }

        private static async Task DemoSinglePageRequest()
        {
            var pageRequester = new PageRequester(new CrawlConfiguration(), new WebContentExtractor());

            var crawledPage = await pageRequester.MakeRequestAsync(new Uri("http://google.com"));
            Log.Logger.Information("{result}", new
            {
                url = crawledPage.Uri,
                status = Convert.ToInt32(crawledPage.HttpResponseMessage.StatusCode)
            });
        }

        private static void PageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
        {
            var httpStatus = e.CrawledPage.HttpResponseMessage.StatusCode;
            var rawPageText = e.CrawledPage.Content.Text;
        }
    }
}

Abot Configuration

Abot's Abot2.Poco.CrawlConfiguration class has a ton of configuration options. You can see what effect each config value has on the crawl by looking at the code comments.

var crawlConfig = new CrawlConfiguration();
crawlConfig.CrawlTimeoutSeconds = 100;
crawlConfig.MaxConcurrentThreads = 10;
crawlConfig.MaxPagesToCrawl = 1000;
crawlConfig.UserAgentString = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36";
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue1", "1111");
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue2", "2222");
etc...
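
The configured instance is then passed to the crawler's constructor, just like in the Quick Start example above:

var crawler = new PoliteWebCrawler(crawlConfig);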

Abot Events

Register for events and create processing methods

crawler.PageCrawlStarting += crawler_ProcessPageCrawlStarting;
crawler.PageCrawlCompleted += crawler_ProcessPageCrawlCompleted;
crawler.PageCrawlDisallowed += crawler_PageCrawlDisallowed;
crawler.PageLinksCrawlDisallowed += crawler_PageLinksCrawlDisallowed;
void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
	PageToCrawl pageToCrawl = e.PageToCrawl;
	Console.WriteLine($"About to crawl link {pageToCrawl.Uri.AbsoluteUri} which was found on page {pageToCrawl.ParentUri.AbsoluteUri}");
}

void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
	CrawledPage crawledPage = e.CrawledPage;	
	if (crawledPage.HttpRequestException != null || crawledPage.HttpResponseMessage.StatusCode != HttpStatusCode.OK)
		Console.WriteLine($"Crawl of page failed {crawledPage.Uri.AbsoluteUri}");
	else
		Console.WriteLine($"Crawl of page succeeded {crawledPage.Uri.AbsoluteUri}");

	if (string.IsNullOrEmpty(crawledPage.Content.Text))
		Console.WriteLine($"Page had no content {crawledPage.Uri.AbsoluteUri}");

	var angleSharpHtmlDocument = crawledPage.AngleSharpHtmlDocument; //AngleSharp parser
}

void crawler_PageLinksCrawlDisallowed(object sender, PageLinksCrawlDisallowedArgs e)
{
	CrawledPage crawledPage = e.CrawledPage;
	Console.WriteLine($"Did not crawl the links on page {crawledPage.Uri.AbsoluteUri} due to {e.DisallowedReason}");
}

void crawler_PageCrawlDisallowed(object sender, PageCrawlDisallowedArgs e)
{
	PageToCrawl pageToCrawl = e.PageToCrawl;
	Console.WriteLine($"Did not crawl page {pageToCrawl.Uri.AbsoluteUri} due to {e.DisallowedReason}");
}

Custom objects and the dynamic crawl bag

Add any number of custom objects to the dynamic crawl bag or page bag. These objects will be available in the CrawlContext.CrawlBag object, PageToCrawl.PageBag object or CrawledPage.PageBag object.

var crawler = new PoliteWebCrawler();
crawler.CrawlBag.MyFoo1 = new Foo();
crawler.CrawlBag.MyFoo2 = new Foo();
crawler.PageCrawlStarting += crawler_ProcessPageCrawlStarting;
...
void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
    //Get your Foo instances from the CrawlContext object
    var foo1 = e.CrawlContext.CrawlBag.MyFoo1;
    var foo2 = e.CrawlContext.CrawlBag.MyFoo2;

    //Also add a dynamic value to the PageToCrawl or CrawledPage
    e.PageToCrawl.PageBag.Bar = new Bar();
}

Cancellation

CancellationTokenSource cancellationTokenSource = new CancellationTokenSource();

var crawler = new PoliteWebCrawler();
var result = await crawler.CrawlAsync(new Uri("addurihere"), cancellationTokenSource);
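
To stop the crawl, call Cancel() on the token source from another thread or handler. CancellationTokenSource.CancelAfter can also be used to put a hard time limit on a crawl, for example:

cancellationTokenSource.CancelAfter(TimeSpan.FromMinutes(5)); //Request cancellation after 5 minutes
//...or cancel immediately from elsewhere (e.g. a UI handler)
cancellationTokenSource.Cancel();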





Customizing Crawl Behavior

Abot was designed to be as pluggable as possible. This allows you to easily alter the way it works to suit your needs.

The easiest way to change Abot's behavior for common features is to change the config values that control them. See the Quick Start page for examples on the different ways Abot can be configured.

CrawlDecision Callbacks/Delegates

Sometimes you don't want to create a class and go through the ceremony of extending a base class or implementing the interface directly. For all you lazy developers out there, Abot provides a shorthand way to easily add your custom crawl decision logic. NOTE: The ICrawlDecisionMaker's corresponding method is called first, and if it does not "allow" a decision, these callbacks will not be called.

var crawler = new PoliteWebCrawler();

crawler.ShouldCrawlPageDecisionMaker = (pageToCrawl, crawlContext) => 
{
	var decision = new CrawlDecision{ Allow = true };
	if(pageToCrawl.Uri.Authority == "google.com")
		return new CrawlDecision{ Allow = false, Reason = "Dont want to crawl google pages" };
	
	return decision;
};

crawler.ShouldDownloadPageContentDecisionMaker = (crawledPage, crawlContext) =>
{
	var decision = new CrawlDecision{ Allow = true };
	if (!crawledPage.Uri.AbsoluteUri.Contains(".com"))
		return new CrawlDecision { Allow = false, Reason = "Only download raw page content for .com tlds" };

	return decision;
};

crawler.ShouldCrawlPageLinksDecisionMaker = (crawledPage, crawlContext) =>
{
	var decision = new CrawlDecision{ Allow = true };
	if (crawledPage.Content.Bytes.Length < 100)
		return new CrawlDecision { Allow = false, Reason = "Just crawl links in pages that have at least 100 bytes" };

	return decision;
};

Custom Implementations

PoliteWebCrawler is the master of orchestrating the crawl. Its job is to coordinate all the utility classes to "crawl" a site. PoliteWebCrawler accepts an alternate implementation for all its dependencies through its constructor.

var crawler = new PoliteWebCrawler(
    new CrawlConfiguration(),
    new YourCrawlDecisionMaker(),
    new YourThreadMgr(),
    new YourScheduler(),
    new YourPageRequester(),
    new YourHyperLinkParser(),
    new YourMemoryManager(),
    new YourDomainRateLimiter(),
    new YourRobotsDotTextFinder());

Passing null for any implementation will use the default. The example below will use your custom implementation for the IPageRequester and IHyperLinkParser but will use the default for all others.

var crawler = new PoliteWebCrawler(
    null,
    null,
    null,
    null,
    new YourPageRequester(),
    new YourHyperLinkParser(),
    null,
    null,
    null);

The following are explanations of each interface that PoliteWebCrawler relies on to do the real work.

ICrawlDecisionMaker

The callback/delegate shortcuts are great to add a small amount of logic but if you are doing anything more heavy you will want to pass in your custom implementation of ICrawlDecisionMaker. The crawler calls this implementation to see whether a page should be crawled, whether the page's content should be downloaded and whether a crawled page's links should be crawled.

CrawlDecisionMaker.cs is the default ICrawlDecisionMaker used by Abot. This class takes care of common checks like making sure the config value MaxPagesToCrawl is not exceeded. Most users will only need to create a class that extends CrawlDecisionMaker and just add their custom logic. However, you are completely free to create a class that implements ICrawlDecisionMaker and pass it into PoliteWebCrawler's constructor.

/// <summary>
/// Determines what pages should be crawled, whether the raw content should be downloaded and if the links on a page should be crawled
/// </summary>
public interface ICrawlDecisionMaker
{
	/// <summary>
	/// Decides whether the page should be crawled
	/// </summary>
	CrawlDecision ShouldCrawlPage(PageToCrawl pageToCrawl, CrawlContext crawlContext);

	/// <summary>
	/// Decides whether the page's links should be crawled
	/// </summary>
	CrawlDecision ShouldCrawlPageLinks(CrawledPage crawledPage, CrawlContext crawlContext);

	/// <summary>
	/// Decides whether the page's content should be downloaded
	/// </summary>
	CrawlDecision ShouldDownloadPageContent(CrawledPage crawledPage, CrawlContext crawlContext);
}
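
For example, here is a minimal sketch that reuses the default checks but refuses to crawl PDF links. It only implements the three members listed above and assumes the default CrawlDecisionMaker lives in Abot2.Core and has a parameterless constructor; treat it as illustrative, not as the shipped implementation.

using System;
using Abot2.Core;
using Abot2.Poco;

public class NoPdfDecisionMaker : ICrawlDecisionMaker
{
    //Reuse the default checks (MaxPagesToCrawl, etc.) and only add one veto of our own
    private readonly CrawlDecisionMaker _defaults = new CrawlDecisionMaker();

    public CrawlDecision ShouldCrawlPage(PageToCrawl pageToCrawl, CrawlContext crawlContext)
    {
        if (pageToCrawl.Uri.AbsoluteUri.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
            return new CrawlDecision { Allow = false, Reason = "Not crawling pdf links" };

        return _defaults.ShouldCrawlPage(pageToCrawl, crawlContext);
    }

    public CrawlDecision ShouldCrawlPageLinks(CrawledPage crawledPage, CrawlContext crawlContext)
        => _defaults.ShouldCrawlPageLinks(crawledPage, crawlContext);

    public CrawlDecision ShouldDownloadPageContent(CrawledPage crawledPage, CrawlContext crawlContext)
        => _defaults.ShouldDownloadPageContent(crawledPage, crawlContext);
}

It would then be passed in the ICrawlDecisionMaker position of the constructor shown above, e.g. new PoliteWebCrawler(new CrawlConfiguration(), new NoPdfDecisionMaker(), null, null, null, null, null, null, null).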
IThreadManager

The IThreadManager interface deals with the multithreading details. It is used by the crawler to manage concurrent http requests.

TaskThreadManager.cs is the default IThreadManager used by Abot.

/// <summary>
/// Handles the multithreading implementation details
/// </summary>
public interface IThreadManager : IDisposable
{
	/// <summary>
	/// Max number of threads to use.
	/// </summary>
	int MaxThreads { get; }

	/// <summary>
	/// Will perform the action asynchronously on a separate thread
	/// </summary>
	/// <param name="action">The action to perform</param>
	void DoWork(Action action);

	/// <summary>
	/// Whether there are running threads
	/// </summary>
	bool HasRunningThreads();

	/// <summary>
	/// Abort all running threads
	/// </summary>
	void AbortAll();
}
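
As a rough illustration only, the members above could be satisfied with Task.Run and a semaphore. The namespace of IThreadManager is assumed to be Abot2.Util (by analogy with the Abot.Util namespace that appears in an error message quoted further down), and the default TaskThreadManager does considerably more than this sketch.

using System;
using System.Threading;
using System.Threading.Tasks;
using Abot2.Util;

public class SimpleTaskThreadManager : IThreadManager
{
    private readonly SemaphoreSlim _slots;
    private int _running;

    public SimpleTaskThreadManager(int maxThreads)
    {
        MaxThreads = maxThreads;
        _slots = new SemaphoreSlim(maxThreads, maxThreads);
    }

    public int MaxThreads { get; }

    public void DoWork(Action action)
    {
        _slots.Wait(); //Block until one of the MaxThreads slots is free
        Interlocked.Increment(ref _running);
        Task.Run(() =>
        {
            try { action(); }
            finally
            {
                Interlocked.Decrement(ref _running);
                _slots.Release();
            }
        });
    }

    public bool HasRunningThreads() => Volatile.Read(ref _running) > 0;

    //Tasks cannot be forcibly aborted; a real implementation would signal cancellation instead
    public void AbortAll() { }

    public void Dispose() => _slots.Dispose();
}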
IScheduler

The IScheduler interface deals with managing what pages need to be crawled. The crawler gives the links it finds to, and gets the pages to crawl from, the IScheduler implementation. A common use case for writing your own implementation might be to distribute crawls across multiple machines, which could be managed by a DistributedScheduler.

Scheduler.cs is the default IScheduler used by the crawler and by default is constructed with in-memory collections that track which pages have been crawled and which still need to be crawled.

/// <summary>
/// Handles managing the priority of what pages need to be crawled
/// </summary>
public interface IScheduler
{
	/// <summary>
	/// Count of remaining items that are currently scheduled
	/// </summary>
	int Count { get; }

	/// <summary>
	/// Schedules the param to be crawled
	/// </summary>
	void Add(PageToCrawl page);

	/// <summary>
	/// Schedules the param to be crawled
	/// </summary>
	void Add(IEnumerable<PageToCrawl> pages);

	/// <summary>
	/// Gets the next page to crawl
	/// </summary>
	PageToCrawl GetNext();

	/// <summary>
	/// Clear all currently scheduled pages
	/// </summary>
	void Clear();
}
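
As a small illustration of the members listed above, here is a scheduler sketch that hands out pages whose URL contains a keyword before everything else. It is written only against the interface as shown here (the shipped interface and the Abot2 namespaces are assumed), and unlike the default Scheduler it does not de-duplicate already-seen URLs.

using System.Collections.Generic;
using Abot2.Core;
using Abot2.Poco;

public class KeywordPriorityScheduler : IScheduler
{
    private readonly Queue<PageToCrawl> _priority = new Queue<PageToCrawl>();
    private readonly Queue<PageToCrawl> _normal = new Queue<PageToCrawl>();
    private readonly string _keyword;

    public KeywordPriorityScheduler(string keyword) => _keyword = keyword;

    public int Count => _priority.Count + _normal.Count;

    public void Add(PageToCrawl page)
    {
        //Pages whose url mentions the keyword jump the queue
        if (page.Uri.AbsoluteUri.Contains(_keyword))
            _priority.Enqueue(page);
        else
            _normal.Enqueue(page);
    }

    public void Add(IEnumerable<PageToCrawl> pages)
    {
        foreach (var page in pages)
            Add(page);
    }

    public PageToCrawl GetNext()
    {
        if (_priority.Count > 0)
            return _priority.Dequeue();
        return _normal.Count > 0 ? _normal.Dequeue() : null;
    }

    public void Clear()
    {
        _priority.Clear();
        _normal.Clear();
    }
}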
IPageRequester

The IPageRequester interface deals with making the raw http requests.

PageRequester.cs is the default IPageRequester used by the crawler.

public interface IPageRequester : IDisposable
{
	/// <summary>
	/// Make an http web request to the url and download its content
	/// </summary>
	Task<CrawledPage> MakeRequestAsync(Uri uri);

	/// <summary>
	/// Make an http web request to the url and download its content based on the param func decision
	/// </summary>
	Task<CrawledPage> MakeRequestAsync(Uri uri, Func<CrawledPage, CrawlDecision> shouldDownloadContent);
}
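
A common pattern is to decorate the default requester rather than replace it. The sketch below wraps any IPageRequester (for example the PageRequester built in the Quick Start section) and logs how long each request took via Serilog; the TimingPageRequester name is made up for illustration.

using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Abot2.Core;
using Abot2.Poco;
using Serilog;

public class TimingPageRequester : IPageRequester
{
    private readonly IPageRequester _inner;

    public TimingPageRequester(IPageRequester inner) => _inner = inner;

    public async Task<CrawledPage> MakeRequestAsync(Uri uri)
    {
        var timer = Stopwatch.StartNew();
        var page = await _inner.MakeRequestAsync(uri); //Delegate the actual http work
        timer.Stop();
        Log.Logger.Information("Request to {url} took {ms} ms", uri, timer.ElapsedMilliseconds);
        return page;
    }

    public async Task<CrawledPage> MakeRequestAsync(Uri uri, Func<CrawledPage, CrawlDecision> shouldDownloadContent)
    {
        var timer = Stopwatch.StartNew();
        var page = await _inner.MakeRequestAsync(uri, shouldDownloadContent);
        timer.Stop();
        Log.Logger.Information("Request to {url} took {ms} ms", uri, timer.ElapsedMilliseconds);
        return page;
    }

    public void Dispose() => _inner.Dispose();
}

It would be passed in the IPageRequester position of the constructor shown earlier, e.g. new TimingPageRequester(new PageRequester(new CrawlConfiguration(), new WebContentExtractor())).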
IHyperLinkParser

The IHyperLinkParser interface deals with parsing the links out of raw html.

AngleSharpHyperLinkParser.cs is the default IHyperLinkParser used by the crawler. It uses the well-known AngleSharp library to do the HTML parsing. AngleSharp supports CSS-style selectors, much like jQuery, but all in C#.

/// <summary>
/// Handles parsing hyperlinks out of the raw html
/// </summary>
public interface IHyperLinkParser
{
	/// <summary>
	/// Parses html to extract hyperlinks, converts each into an absolute url
	/// </summary>
	IEnumerable<Uri> GetLinks(CrawledPage crawledPage);
}
IMemoryManager

The IMemoryManager handles memory monitoring. This feature is still experimental and could be removed in a future release if found to be unreliable.

MemoryManager.cs is the default implementation used by the crawler.

/// <summary>
/// Handles memory monitoring/usage
/// </summary>
public interface IMemoryManager : IMemoryMonitor, IDisposable
{
	/// <summary>
	/// Whether the current process that is hosting this instance is allocated/using above the param value of memory in mb
	/// </summary>
	bool IsCurrentUsageAbove(int sizeInMb);

	/// <summary>
	/// Whether there is at least the param value of available memory in mb
	/// </summary>
	bool IsSpaceAvailable(int sizeInMb);
}
IDomainRateLimiter

The IDomainRateLimiter handles domain rate limiting. It will handle determining how much time needs to elapse before it is ok to make another http request to the domain.

DomainRateLimiter.cs is the default implementation used by the crawler.

/// <summary>
/// Rate limits or throttles on a per domain basis
/// </summary>
public interface IDomainRateLimiter
{
	/// <summary>
	/// If the domain of the param has been flagged for rate limiting, it will be rate limited according to the configured minimum crawl delay
	/// </summary>
	void RateLimit(Uri uri);

	/// <summary>
	/// Add a domain entry so that domain may be rate limited according to the param minimum crawl delay
	/// </summary>
	void AddDomain(Uri uri, long minCrawlDelayInMillisecs);
}
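
A simplified sketch of the contract above (the Abot2.Core namespace is assumed): it records the last request time per domain and sleeps just long enough to honour that domain's minimum delay. The default DomainRateLimiter is what Abot actually uses and is more sophisticated than this.

using System;
using System.Collections.Concurrent;
using System.Threading;
using Abot2.Core;

public class SimpleDomainRateLimiter : IDomainRateLimiter
{
    private readonly ConcurrentDictionary<string, long> _delayMillis = new ConcurrentDictionary<string, long>();
    private readonly ConcurrentDictionary<string, DateTime> _lastRequest = new ConcurrentDictionary<string, DateTime>();

    public void AddDomain(Uri uri, long minCrawlDelayInMillisecs)
        => _delayMillis[uri.Authority] = minCrawlDelayInMillisecs;

    public void RateLimit(Uri uri)
    {
        if (!_delayMillis.TryGetValue(uri.Authority, out var delay))
            return; //Domain was never flagged for rate limiting

        if (_lastRequest.TryGetValue(uri.Authority, out var last))
        {
            var wait = TimeSpan.FromMilliseconds(delay) - (DateTime.UtcNow - last);
            if (wait > TimeSpan.Zero)
                Thread.Sleep(wait); //Block until the minimum crawl delay has elapsed
        }

        _lastRequest[uri.Authority] = DateTime.UtcNow;
    }
}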
IRobotsDotTextFinder

The IRobotsDotTextFinder is responsible for retrieving the robots.txt file for every domain (if isRespectRobotsDotTextEnabled="true") and building the robots.txt abstraction which implements the IRobotsDotText interface.

RobotsDotTextFinder.cs is the default implementation used by the crawler.

/// <summary>
/// Finds and builds the robots.txt file abstraction
/// </summary>
public interface IRobotsDotTextFinder
{
	/// <summary>
	/// Finds the robots.txt file using the rootUri
	/// </summary>
	IRobotsDotText Find(Uri rootUri);
}






Abot's Issues

Nuget package

Would it be possible to make a NuGet package?
I'll try drafting one that you can use as a base.

Does the crawler use the disk to store the visited pages?

Hello

Please excuse me, I don't know where else to ask this: how does the crawler keep track of visited pages? I am wondering whether the whole "count" is kept in memory, or whether the crawler uses temporary files to store the hashes of the visited pages, and if so, where those files are on disk.

Thank you very much, and sorry for using this channel.

Regards.

MaxPagesToCrawl set to 0 for infinite crawl?

To crawl all pages of a site I currently set MaxPagesToCrawl to 1000000.
Another approach would be to treat 0 as an infinite crawl.

Could you change the code to skip the check when MaxPagesToCrawl is set to zero?

Excessive memory consumption

As can be seen by the amount of code to detect and attempt to handle excessive memory use, there is an issue where the crawler allocates memory faster than it can be released. A quick hunt through the code showed 3 causes:

  1. Not implementing IDisposable where required; not wrapping IDisposable objects in using{} blocks. No memory leaks were detected, but the GC is straining to keep up with the extra work of queuing up finalization in GEN1 and calling the finalizers in GEN2.

  2. Due to the nature of a crawler, a significant amount of memory is quickly allocated and discarded, i.e. for page content. The default "client mode" garbage collector can't keep up and needs to be changed to the server-mode GC, which can be done in app.config >> \configuration\runtime\gcServer[@enabled=true]

  3. Because the event "PageCrawlCompletedAsync" is asynchronous, there is no callback to dispose the relevant objects.

Include Abot.xml in NuGet package

Right now we don't get all the nice comments in the source code as intellisense documentation when using the NuGet package.

You should output the "XML documentation file" in the build of Abot and include the generated Abot.xml in the NuGet package

HtmlEntity.DeEntitize() throws error and stops crawl

[2014-02-18 01:23:07,009] [4 ] [INFO ] - Page crawl complete, Status:[200] Url:[http://www.dmoz.org/Games/Video_Games/Roleplaying/N/Numen/] Parent:[http://www.dmoz.org/Games/Video_Games/N/] - [Abot.Crawler.WebCrawler]
[2014-02-18 01:23:07,009] [4 ] [FATAL] - Error occurred during processing of page [http://www.dmoz.org/Games/Video_Games/Roleplaying/N/Numen/] - [Abot.Crawler.WebCrawler]
[2014-02-18 01:23:07,009] [4 ] [FATAL] - System.Collections.Generic.KeyNotFoundException: The given key was not present in the dictionary.
at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
at HtmlAgilityPack.HtmlEntity.DeEntitize(String text)
at Abot.Core.HapHyperLinkParser.GetLinks(HtmlNodeCollection nodes)
at Abot.Core.HapHyperLinkParser.GetHrefValues(CrawledPage crawledPage)
at Abot.Core.HyperLinkParser.GetLinks(CrawledPage crawledPage)
at Abot.Crawler.WebCrawler.ParsePageLinks(CrawledPage crawledPage)
at Abot.Crawler.WebCrawler.ProcessPage(PageToCrawl pageToCrawl) - [Abot.Crawler.WebCrawler]
[2014-02-18 01:23:07,009] [4 ] [INFO ] - Hard crawl stop requested for site [http://www.dmoz.org/]! - [Abot.Crawler.WebCrawler]
[2014-02-18 01:23:07,009] [4 ] [INFO ] - Crawl complete for site [http://www.dmoz.org/]: [02:38:29.0024196] - [Abot.Crawler.WebCrawler]

Add auto retry

Add auto retry based on http status and configurable maxretrycount

Allow customization of HttpWebRequest before request is made

Sometimes I'd like to add a header to the request before it's made, e.g. add If-Modified-Since header, or add authentication headers, etc.

I know I can create a class derived from PageRequester and override BuildRequestObject(), but adding a request customization delegate would be a much lower barrier.

HasRobotsNoFollow does not support "none"

In the function HasRobotsNoFollow(...), you check the robots meta tag like this:
return robotsMeta != null && robotsMeta.ToLower().Contains("nofollow");

But there is also the robots meta tag value NONE, which is equivalent to "NOINDEX, NOFOLLOW".
The test should be:
return robotsMeta != null && (robotsMeta.ToLower().Contains("nofollow") || robotsMeta.ToLower().Contains("none"));

internal / external link detection is wrong?

Function _isInternalDecisionMaker falsely detects that the link is external

protected Func<Uri, Uri, bool> _isInternalDecisionMaker = (uriInQuestion, rootUri) => uriInQuestion.Authority == rootUri.Authority;

E.g.:
http://docs.mysite.com/P1.html has a link such as http://www.mysite.com/

rootUri.Authority = docs.mysite.com
uriInQuestion.Authority = www.mysite.com

=> _isInternalDecisionMaker returns False

It should return True, because docs and www are just subdomains of the same site, so www.mysite.com is internal.

To detect whether a link is internal or external, you could use the domainname-parser lib.
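
As a purely illustrative sketch, the default check quoted above could be relaxed by comparing only the last two host labels, so that docs.mysite.com and www.mysite.com count as internal. This naive version breaks on multi-part suffixes such as .co.uk, which is exactly why the issue suggests a domain-parsing library; how the Func is plugged into the crawler depends on the Abot version in use.

Func<Uri, Uri, bool> isInternal = (uriInQuestion, rootUri) =>
{
    string ParentDomain(Uri uri)
    {
        var labels = uri.Host.Split('.');
        return labels.Length <= 2
            ? uri.Host
            : labels[labels.Length - 2] + "." + labels[labels.Length - 1]; //last two labels only
    }

    return string.Equals(ParentDomain(uriInQuestion), ParentDomain(rootUri),
        StringComparison.OrdinalIgnoreCase);
};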

How to download crawled pages?

I'm a noob. I followed the tutorial and the console is running fine with messages, but I don't really know how to use the crawled pages' content. Can you help me?

Execute javascript

I really like your library and have used it for a couple of projects. Now I want to crawl a page that uses javascript to generate html. Is this possible? It should capture the html after a couple of seconds.

Add auto-retry functionality

Create an AutoRetryCount config value that, when greater than 0, will automatically retry failed (non-200) requests that many times.

Also consider a CrawlDecisionMaker.ShouldRetry() which would allow a way to apply custom logic of when to retry and when not to.

Add pause, stop, resume

Add functionality that will allow a crawl to be continued from where it was stopped or paused.

Pause/Resume

I am not sure if this feature has been added, but I would like to pause and resume the crawl while holding on to an instance of the web crawler. Also, pause to disk and resume from disk.

Add option to get charset of page from body which is enclosed by `'` character

I noticed in the WebContentExtractor class of the Abot crawler, in the GetCharsetFromBody method, that when it comes to parsing the charset of a page from the page's body, it only handles the case where the charset attribute is enclosed in " characters.
For example this works and charset is correctly recognized:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1250" />

But in situation that meta tag uses ' characters, for example

<meta http-equiv='Content-Type' content='text/html; charset=utf-8' />

the charset is not correctly recognized and the text later extracted from the page contains weird characters.

I suggest editing the following method, especially this part:

if (meta != null)
{
     int start_ind = meta.IndexOf("charset=");
     int end_ind = -1;
     if (start_ind != -1)
     {
            end_ind = meta.IndexOf("\"", start_ind);
            if (end_ind != -1)
            {
                   int start = start_ind + 8;
                   charset = meta.Substring(start, end_ind - start + 1);
                   charset = charset.TrimEnd(new Char[] { '>', '"' });
            }
     }
}

and change it to this:

if (meta != null)
{
     Match match = Regex.Match(meta, @"<meta.*charset=(.+)\/>");
     if (match.Success)
     {
          string match_str = match.Groups[1].Value;

          int end_ind = match_str.IndexOf('"');
          if (end_ind == -1)
               end_ind = match_str.IndexOf('\'');

          if (end_ind != -1)
               charset = match_str.Remove(end_ind);
     }
}

so that it will correctly recognize both ways of writing the meta tag.

HtmlAgilityPack conflict with nuget version

Abot.1.2.3.1026, the current version on NuGet as of 06 March 2014, uses HtmlAgilityPack v1.4.7 while the only available version of HtmlAgilityPack on NuGet is 1.4.6. This causes a version conflict within the application.

Robots.txt empty disallow "Disallow:" is treated as "Disallow: /"

I am trying to crawl this website with isRespectRobotsDotTextEnabled set to true: http://artofprogress.com/

This is triggering the PageCrawlDisallowed event with "[Disallowed by robots.txt file]" as the DisallowedReason.

As far as I can tell, the robots.txt file doesn't prevent any page from being crawled. Here is the complete text of the robots.txt file:

User-agent: *
Disallow:

This is happening on other websites (incidentally, all WordPress sites) as well.

Allow seeding the crawler with a list of URLs to crawl

I need to crawl a site multiple times to get updated content on a regular basis.

In order to save on bandwidth and avoid unnecessary work, I have customized the PageRequester class by deriving from it so that I can add an If-Modified-Since header to only request pages that have been modified since the last crawl. (I know I could have just added a delegate for ShouldDownloadPageContent to make the decision, but it's much cleaner to use HTTP headers to tell the server not to send the response if the page wasn't modified than to let the server send the response and have the crawler ignore it.)

The problem I'm having is that let's say the root page wasn't modified. In this case it's not going to be fetched and no links will ever be scheduled for crawling. This is going to stop the crawl immediately.

My request is to allow the crawler to be seeded with a list of URLs to crawl. In the first crawl session, I'm going to supply only the root Uri. In each subsequent crawl session I'm going to supply all URLs that have been crawled before for scheduling. This should allow the crawler to continue crawling those links even if any page containing those links isn't crawled due to not being modified.

When I looked I found that if I get access to the Scheduler object on the CrawlContext I can schedule those links. But the problem is that the CrawlContext is not available before starting the crawl; it's only passed when events/delegates are raised/invoked. I can certainly try to inherit from WebCrawler (or PoliteWebCrawler) and expose the context/scheduler or a method to seed the crawler with a list of URLs.

I guess my request is either:

  1. Expose the CrawlContext on the IWebCrawler interface,
  2. Expose the IScheduler interface through the IWebCrawler interface (doesn't feel right), or
  3. Add an override to the IWebCrawler.Crawl() method that takes IEnumerable<Uri>

Add a priority queue implementation for url scheduling

It would be great to have a priority queue implementing Abot.Core.IPagesToCrawlRepository, allowing URLs to be prioritised dynamically. There are generic .NET implementations of priority queues in the public domain (hopefully avoiding reinventing the wheel), although there doesn't appear to be one in the .NET Framework.

Support X-Robots-Tag HTTP header

Some HTML pages are served with the X-Robots-Tag HTTP header with values like these:
HTTP/1.1 200 OK
Date: Sun, 02 March 2014 21:42:43 GMT
(…)
X-Robots-Tag: noindex,nofollow
(…)

Some information: Robots meta tag and X-Robots-Tag HTTP header specifications

Can Abot check this X-Robots-Tag HTTP header?

When a page can't be accessed, it seems Abot doesn't retry; even worse, it stops the whole crawling process

Hello,
The maxRetryCount is very useful for me. I tested it today.
I found one problem: when a page can't be accessed, it seems Abot doesn't retry; even worse, it stops the whole crawling process. Is there something I didn't configure well?
I added two config values:
maxRetryCount="3" minRetryDelayInMilliseconds="5000"

Any help is appreciated.

Below is the log:

[2014-12-31 18:24:55,174] [6 ] [INFO ] - Page crawl complete, Status:[NA] Url:[(site Uri)] Parent:[] Retry:[0] - [AbotLogger]
[2014-12-31 18:24:55,174] [6 ] [FATAL] - Error occurred during processing of page [(site Uri)] - [AbotLogger]
[2014-12-31 18:24:55,174] [20 ] [ERROR] - Crawl of page failed (site Uri) - [Abot.Crawler.PoliteWebCrawler]
[2014-12-31 18:24:55,174] [6 ] [FATAL] - System.NullReferenceException: Object reference not set to an instance of an object
at Abot.Crawler.WebCrawler.ShouldRecrawlPage(CrawledPage crawledPage)
at Abot.Crawler.WebCrawler.ProcessPage(PageToCrawl pageToCrawl) - [AbotLogger]
[2014-12-31 18:24:55,174] [1 ] [INFO ] - Hard crawl stop requested for site [http://www.amazon.com/Best-Sellers-Clothing/zgbs/apparel/ref=zg_bs_unv_a_1_1040660_1]! - [AbotLogger]
[2014-12-31 18:24:55,174] [1 ] [INFO ] - Crawl complete for site [http://www.amazon.com/Best-Sellers-Clothing/zgbs/apparel/ref=zg_bs_unv_a_1_1040660_1]: Crawled [219] pages in [00:20:18.1264665] - [AbotLogger]

Abot 1.2.3 with multiple threads is not stopping with CancellationTokenSource

I am using the latest Abot 1.2.3 via NuGet.
Maybe I am doing something wrong but the following code does not stop the crawl process:

    CancellationTokenSource cts = new CancellationTokenSource();

    private void btnStart_Click(object sender, EventArgs e)
    {
        BackgroundWorker bgw = new BackgroundWorker();
        bgw.DoWork += bgw_DoWork;
        bgw.RunWorkerCompleted += bgw_RunWorkerCompleted;
        bgw.RunWorkerAsync();
    }

    void bgw_RunWorkerCompleted(object sender, RunWorkerCompletedEventArgs e)
    {
        //never reaches this point
    }

    void bgw_DoWork(object sender, DoWorkEventArgs e)
    {
        PoliteWebCrawler crawler = new PoliteWebCrawler();
        crawler.PageCrawlCompletedAsync += crawler_PageCrawlCompletedAsync;
        CrawlResult result = crawler.Crawl(new Uri("http://www.finanzen-forum.net"), cts);
    }

    void crawler_PageCrawlCompletedAsync(object sender, PageCrawlCompletedArgs e)
    {

    }

    private void btnStop_Click(object sender, EventArgs e)
    {
        cts.Cancel();
    }

The problem only occurs when you have set multiple threads (in my case 20 threads) and you have to let the crawler run a bit (15 seconds was enough to reproduce the problem).

If you are using 1 thread, it works.

I know that you have to wait a bit, depending on how many threads you have started, but I have waited for over a minute for Abot to end all 20 threads. No success.

PageBag entries by http requester are deleted

I created a TimingPageRequester:

public class TimingPageRequester : PageRequester
{
    ...

    public override CrawledPage MakeRequest(Uri uri, Func<CrawledPage, CrawlDecision> shouldDownloadContent)
    {
        var timer = Stopwatch.StartNew();
        var crawledPage = base.MakeRequest(uri, shouldDownloadContent);
        timer.Stop();
        crawledPage.PageBag.WebTime = timer.ElapsedMilliseconds;
        return crawledPage;
    }
}

However, later on in the completed event, I get a null reference exception, since WebTime is not present on the page's page bag. The reason for this is that the PageToCrawl data is merged into the CrawledPage object in WebCrawler.CrawlThePage(). This is done using AutoMapper, which can't merge dynamic properties by default, so the whole expando object is simply treated as a scalar dynamic property and overwritten.

I was able to overcome this by adding the following code to WebCrawler.CrawlThePage():

...
var srcBag = crawledPage.PageBag as IDictionary<string, object>;
var dstBag = pageToCrawl.PageBag as IDictionary<string, object>;
if (srcBag != null && dstBag != null)
{
    foreach (var entry in srcBag)
        dstBag.Add(entry);
}
AutoMapper.Mapper.Map(pageToCrawl, crawledPage);

Sorry for not submitting a pull request, I couldn't build the solution - I'm using VS2013 and it blew up when I tried to open the main .sln file. Besides, my code could probably use a cleanup.

(I'm using the 1.2.3 branch)

Problem with the crawler code from the tutorial

With this code:
CrawlConfiguration crawlConfig = new CrawlConfiguration();
crawlConfig.CrawlTimeoutSeconds = 100;
crawlConfig.MaxConcurrentThreads = 10;
crawlConfig.MaxPagesToCrawl = 1000;
crawlConfig.UserAgentString = "abot v1.0 http://code.google.com/p/abot";
//crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue1", "1111");
//crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue2", "2222");
PoliteWebCrawler crawler = new PoliteWebCrawler(crawlConfig, null, null, null, null, null, null, null);

This is the error:
Error 1 The best overloaded method match for 'Abot.Crawler.PoliteWebCrawler.PoliteWebCrawler(Abot.Poco.CrawlConfiguration, Abot.Core.ICrawlDecisionMaker, Abot.Util.IThreadManager, Abot.Core.IScheduler, Abot.Core.IPageRequester, Abot.Core.IHyperLinkParser, Abot.Util.IMemoryManager, Abot.Core.IDomainRateLimiter, Abot.Core.IRobotsDotTextFinder)' has some invalid arguments

"System.AggregateException" in mscorlib.dll

Hi,

I am getting a really strange error message when running the latest abot (1.2.3.1005) as a windows service on my server. This error does not occur on my development machine.

I have attached the VS 2012 remote debugger and got the following exception. (Too bad it's external code, therefore I cannot see where exactly the error occurs.):

"System.AggregateException" in mscorlib.dll

A Task's exception(s) were not observed either by Waiting on the Task or accessing its Exception property. As a result, the unobserved exception was rethrown by the finalizer thread.

I haven't changed much on my windows service code but I recently updated Abot via nuget. Maybe this problem is Abot related.
Another point is that the error does not occur when running Abot with a single thread. Running with 10 threads, the windows service crashes.

How to reproduce the problem?
(It only occurs on my windows server. Not on my development machine)

  • Tell Abot to use 10 Threads
  • Start the crawl process, wait 5 seconds, stop it, start the crawl process - ERROR

Switch in App.config file to enable or disable the display of the Abot configuration?

Abot configuration parameters are displayed by default (PrintConfigValues(...) in the Crawl function).

Is it possible to have a switch in the App.config file to enable or disable the display of the Abot configuration?

For example, in debug mode I would set it to "1" to display the Abot configuration, and in production mode I would set it to "0".

AutoRetry bad requests feature

We need to have config values for IsAutoRetryEnabled and AutoRetryCount. If set, they will make sure the crawl is not stopped until all retries have been attempted.

Unable to crawl https site

I am trying to run Abot.Demo on
https://focus.kontur.ru

I added the site certificate with
yes | certmgr -ssl -v https://focus.kontur.ru

The Abot.Demo program gives me a "Max. redirections exceeded." exception
and the following line in the log:
[2014-11-15 08:24:41,678] [1] [INFO ] - Page crawl complete, Status:[302] Url:[https://focus.kontur.ru/] Parent:[https://focus.kontur.ru/] - [AbotLogger]

I use mono 3.10.1 on linux

What is the problem, and how to overcome it?

Require dynamic object scoped IWebCrawler.Crawl(Uri uri);

It would be useful to have a dynamic object that is available only for the duration of the Crawl call e.g.

IWebCrawler.Crawl(Uri uri, dynamic localConfig)

I am currently using the CrawlBag but it's a little bit messy, as I want to pass a business object to the crawler that really should only be valid for that single call to Crawl; subsequent calls will pass different localConfig objects. These objects handle the building and processing of the DOM according to my business logic and the construction of extracted hierarchical data.

I can see the _crawlContext persists for the lifetime of the IWebCrawler, which is great as I need some configuration valid for the entire IWebCrawler existence i.e. multiple subsequent calls to Crawl, but I also need configuration scoped only to an individual call to Crawl which I'd imagine is best controlled as a method parameter. Let me know if there is a better way of accomplishing my tasks or if you need better info.

Use a named logger for all logging

someone wrote to me....

With the default Abot App.config file, all logging is done to a single file appender.
I feel it pollutes the normal application log file, and would instead prefer this to be stored in its own log file.

To do that you can easily create a special logger like this :
ILog AppLogger = LogManager.GetLogger("MyApplicationLogger");
And in the App.config file, you can configure the system to route this logger to a dedicated appender.







