sjdirect / abot Goto Github PK

View Code? Open in Web Editor NEW

2.2K 199.0 554.0 7.09 MB

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.

License: Apache License 2.0

C# 100.00%

c-sharp abot abot-nuget crawler web-crawler parsing spider spiders pluggable unit-testing

abot's Issues

Create an IWebContentExtractor that renders javascript

Could use phantom js or possibly some low level .net libs that power the web browser plugin

Require dynamic object scoped IWebCrawler.Crawl(Uri uri);

It would be useful to have a dynamic object that is available only for the duration of the Crawl call e.g.

IWebCrawler.Crawl(Uri uri, dynamic localConfig)

I am currently using the CrawlBag but it's a little bit messy as I want to pass a business object to the crawler that really should only be valid for that single call to Crawl, subsequent calls will pass different localConfig objects. These objects handle the building and processing of the DOM according to my business logic and construction of extracted hiererchical data.

I can see the _crawlContext persists for the lifetime of the IWebCrawler, which is great as I need some configuration valid for the entire IWebCrawler existence i.e. multiple subsequent calls to Crawl, but I also need configuration scoped only to an individual call to Crawl which I'd imagine is best controlled as a method parameter. Let me know if there is a better way of accomplishing my tasks or if you need better info.

Nuget package

Would it be possible to make a Nuget package?
I'll try drafting one you can base it on

Check if the latest html agility pack nuget package has fixed the stackoverflow issues

Has this been solved so I can stop bunding the hap dll???

https://code.google.com/p/abot/issues/detail?id=77&can=1&q=htmlagilitypack

Which related projects for this crawler exists?

http://stackoverflow.com/questions/26925181/how-to-store-crawled-htmls

Something about dynamic page rank to prevent cycling?
Something for analyzing history of content changes?
Something for website structure analysis?

CrawlContext.CrawledUrls count is showing crawled and scheduled count instead of actual crawled

This count is actually showing how many pages have been crawled or scheduled. It was by design but I can see why its confusing.

MaxPagesToCrawl set to 0 for infinite crawl?

To crawl all pages of sites actually I set MaxPagesToCrawl to 1000000.
Another approach would be to set to 0 for infinite crawl.

Can you also make a change to the code to ignore the check if MaxPagesToCrawl set to zero?

Excessive memory consumption

As can be seen by the amount of code to detect and attempt to handle excessive memory use, there is an issue where the the crawler allocates memory faster than it can be released. I quick hunt through the code showed 3 causes:

Not implementing IDisposable where required; Not wrapping IDisposable objects in using{} blocks. No memory leaks were detected, but the GC is straining to keep up with extra work of queuing up finalization in GEN1 and and calling the finalizers in GEN2.
Due to the nature of a crawler, a significant amount of memory is quickly allocated and discarded, i.e. for page content. The default "client mode" garbage collector can't keep up, and needs to be changed the server-mode GC, which can be done in app.config >> \configuration\runtime\gcServer[@enabled=true]
Because the event "PageCrawlCompletedAsync" is asynchronous, there is no callback to dispose the relevant objects.

Allow customization of HttpWebRequest before request is made

Sometimes I'd like to add a header to the request before it's made, e.g. add If-Modified-Since header, or add authentication headers, etc.

I know I can create a class derived from PageRequester and override BuildRequestObject(), but adding a request customization delegate would be a much lower barrier.

VS2010 and VS2012 crashes

When I open Abot.sln both VS2010 and VS2012 crashes.. Known issue..?

Improve CrawlDecision

Add a constructor to set Allow and Reason.

I would've made a pull request, but appearantly my Git-mojo is working against me.

Here's the code:
https://github.com/LordMike/abot/commit/b09193d4b012db206c5606b6e7624cc25f3e756e

Create a PageCrawlFailed/PageCrawlFailedAsync event

That may be used for retries or special logging/ratelimiting

Pause/Resume

I am not sure if this feature has been added however I would like to pause and resume the search while holding on to an instance of the web crawler. Also pause to disk and resume from disk the web crawler.

Use a named logger for all logging

someone wrote to me....

With default Abot App.config file, all logging is done to a single file appender.
I feel it pollutes the normal application log file, and would instead prefer this to be stored in its own log file.

To do that you can easily create a special logger like this :
ILog AppLogger = LogManager.GetLogger("MyApplicationLogger");
And in App.config file, you can configure the system to use a dedicated appender for this logger like this :

Does the crawler use the disk to store the visited pages?

Hello

Please, I don't know where ask this thing, I would to ask you what is it the way of the crawler at the time to manage the visited page. I am wonder if all the "count" is on the memory of is the crawler uses temporal files to store the hash of the visited pages, and where are the files in the disk.

Thank you very much. And sorry for use this channel.

Regards.

How to download crawled pages?

I'm a noob. I followed the tuturial and the console is running right with messages. But I don't really know how to use crawled pages' content. Can u help me?

Abot crawer is not working when IsRespectRobotsDotTextEnabled is set to True

Abot crawler is not crawling the website : http://www.percona.com when i set IsRespectRobotsDotTextEnabled = true in the configuration.
I even validated the robots.txt file of www.percona.com.

Add pause, stop, resume

Add functionality that will allow a crawl to be continued from where it was stopped or paused.

Unable to crawl https site

I am trying to run Abot.Demo on
https://focus.kontur.ru

I added site certificate with
yes | certmgr -ssl -v https://focus.kontur.ru

Abot.Demo program gives me "Max. redirections exceeded." exception.
and the following line in log:
[2014-11-15 08:24:41,678] [1] [INFO ] - Page crawl complete, Status:[302] Url:[https://focus.kontur.ru/] Parent:[https://focus.kontur.ru/] - [AbotLogger]

I use mono 3.10.1 on linux

What is the problem, and how to overcome it?

Execute javascript

I really like you library and used it for a couple of projects. Now I want to crawl a page that uses javascript to generate html. Is this possible? It should capture the html after a couple of seconds.

Updating Nuget created a second entry crawlBehavior & politeness in app.config

Support X-Robots-Tag HTTP header

In some HTML pages there is the X-Robots-Tag HTTP header with this values :
HTTP/1.1 200 OK
Date: Sun, 02 March 2014 21:42:43 GMT
(…)
X-Robots-Tag: noindex,nofollow
(…)

Some informations : Robots meta tag and X-Robots-Tag HTTP header specifications

Does Abot can check this X-Robots-Tag HTTP header?

internal / external link detection is wrong?

Function _isInternalDecisionMaker falsely detects that the link is external

protected Func<Uri, Uri, bool> _isInternalDecisionMaker = (uriInQuestion, rootUri) => uriInQuestion.Authority == rootUri.Authority;

EG :
http://docs.mysite.com/P1.html has a link such as http://www.mysite.com/

rootUri.Authority = docs.mysite.com
uriInQuestion.Authority = www.mysite.com

=> _isInternalDecisionMaker return False

Or is True because. docs or www are subdomains. www.mysite.com is internal.

To detect if link is internal or external, you can use domainname-parser lib.

AutoRetry bad requests feature

We need to have a config for IsAutoRetryEnabled and AutoRetryCount. If set it will make sure to not stop the crawl until all retries have been met.

CrawlDecisionMaker, Scheduler and WebCrawler do not respect the PagetoCrawl.IsRetry property

Persist urls crawled / urls to crawl / downloaded content in an external storage, like a db

It would be great if I could just fire up a mysql db to have the crawler store essential data like
urls to crawl
urls crawled
url outlinks
url inlinks
url content

Abot 1.2.3 with multiple threads is not stopping with CancellationTokenSource

I am using the latest Abot 1.2.3 via NuGet.
Maybe I am doing something wrong but the following code does not stop the crawl process:

    CancellationTokenSource cts = new CancellationTokenSource();

    private void btnStart_Click(object sender, EventArgs e)
    {
        BackgroundWorker bgw = new BackgroundWorker();
        bgw.DoWork += bgw_DoWork;
        bgw.RunWorkerCompleted += bgw_RunWorkerCompleted;
        bgw.RunWorkerAsync();
    }

    void bgw_RunWorkerCompleted(object sender, RunWorkerCompletedEventArgs e)
    {
        //never reaches this point
    }

    void bgw_DoWork(object sender, DoWorkEventArgs e)
    {
        PoliteWebCrawler crawler = new PoliteWebCrawler();
        crawler.PageCrawlCompletedAsync += crawler_PageCrawlCompletedAsync;
        CrawlResult result = crawler.Crawl(new Uri("http://www.finanzen-forum.net"), cts);
    }

    void crawler_PageCrawlCompletedAsync(object sender, PageCrawlCompletedArgs e)
    {

    }

    private void btnStop_Click(object sender, EventArgs e)
    {
        cts.Cancel();
    }

The problem only occurs when you have set multiple threads (in my case 20 threads) and you have to let the crawler run a bit (15 seconds was enough to reproduce the problem).

If you are using 1 thread, it works.

I know that you have to wait a bit, depending on how much threads you have started but I have waited for over a minute for Abot to end all 20 threads. No success.

sorry, resosaise header-content of crawling content

i have problems with parsing, each link, i thing a-href, wants to use, a header-content callback, must more for stable, for right content crawling...

best regards

HasRobotsNoFollow does not support "none"

In the function HasRobotsNoFollow(...), you check robot meta tag like this
return robotsMeta != null && robotsMeta.ToLower().Contains("nofollow");

But there is the following robots meta tag value NONE equivalent to "NOINDEX, NOFOLLOW".
The test must be
return robotsMeta != null && (robotsMeta.ToLower().Contains("nofollow") || robotsMeta.ToLower().Contains("none"));

Documentation For v2.0

Be sure to include

ImplementationContainer
ImplementationContainer.ImplementationBag

Include Abot.xml in NuGet package

Right now we don't get all the nice comments in the source code as intellisense documentation when using the NuGet package.

You should output the "XML documentation file" in the build of Abot and include the generated Abot.xml in the NuGet package

Switch in App.config file to set or not the display of Abot configuration?

Abot Configuration parameters is displayed by default. ( PrintConfigValues(....) in Crawl function);

Is it possible to have a switch in App.config file to set or not the display of Abot configuration?

For ex. In debug mode I set to "1" to display Abot Configuration, In production mode I set it to "0".

Add a priority queue implementation for url scheduling

It would be great to have a priority queue implementing Abot.Core.IPagesToCrawlRepository, allowing url's to be prioritised dynamically. There are generic .NET implementations of priority queues in the public domain (hopefully avoid reinventing the wheel) although there doesn't appear to be one in the .NET framework.

Abot is not able to crawl a password protected web site

When we are trying to crawl password protected site (with basic authentication) crawler fails with message that page has no content. Url That I am passing is http://username:password@siteurl

Allow seeding the crawler with a list of URLs to crawl

I need to crawl a site multiple times to get updated content on a regular basis.

In order to save on bandwidth and avoid unnecessary work, I have customized the PageRequester class by deriving from it so that I can add an If-Modified-Since header to only request pages that have been modified since the last crawl. (I know I could have just added a delegate for ShouldDownloadPageContent to make the decision, it's much cleaner to use HTTP headers to tell the server not to send the response if the page wasn't modified, than to let the server send the response but have the crawler ignore it).

The problem I'm having is that let's say the root page wasn't modified. In this case it's not going to be fetched and no links will ever be scheduled for crawling. This is going to stop the crawl immediately.

My request is to allow the crawler to be seeded with a list of URLs to crawl. In the first crawl session, I'm going to supply only the root Uri. In each subsequent crawl session I'm going to supply all URLs that have been crawled before for scheduling. This should allow the crawler to continue crawling those links even if any page containing those links isn't crawled due to not being modified.

When I looked I found that if I get access to the Scheduler object on the CrawlContext I can schedule those links. But the problem is that the CrawlContext is not available before starting the crawl; it's only passed when events/delegates are raised/invoked. I can certainly try to inherit from WebCrawler (or PoliteWebCrawler) and expose the context/scheduler or a method to seed the crawler with a list of URLs.

I guess my request is either:

Expose the CrawlContext on the IWebCrawler interface,
Expose the IScheduler interface through the IWebCrawler interface (doesn't feel right), or
Add an override to the IWebCrawler.Crawl() method that takes IEnumerable<Uri>

Add option to get charset of page from body which is enclosed by `'` character

I noticed in WebContentExtractor class of Abot Crawler, in GetCharsetFromBody method, that when it comes to parsing charset of page from page's body, it only counts with the case that charset attribute is enclosed in " characters.
For example this works and charset is correctly recognized:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1250" />

But in situation that meta tag uses ' characters, for example

<meta http-equiv='Content-Type' content='text/html; charset=utf-8' />

the charset is not correctly recognized and whole text later extracted from page contains weird characters.

I suggest to edit following method, especially this part:

if (meta != null)
{
     int start_ind = meta.IndexOf("charset=");
     int end_ind = -1;
     if (start_ind != -1)
     {
            end_ind = meta.IndexOf("\"", start_ind);
            if (end_ind != -1)
            {
                   int start = start_ind + 8;
                   charset = meta.Substring(start, end_ind - start + 1);
                   charset = charset.TrimEnd(new Char[] { '>', '"' });
            }
     }
}

and change it to this:

if (meta != null)
{
     Match match = Regex.Match(meta, @"<meta.*charset=(.+)\/>");
     if (match.Success)
     {
          string match_str = match.Groups[1].Value;

          int end_ind = match_str.IndexOf('"');
          if (end_ind == -1)
               end_ind = match_str.IndexOf('\'');

          if (end_ind != -1)
               charset = match_str.Remove(end_ind);
     }
}

so now it will correctly recognize both cases of writing the meta tag.

Robots.txt empty disallow "Disallow:" is treated as "Disallow: /"

I am trying to crawl this website with isRespectRobotsDotTextEnabled set to true: http://artofprogress.com/

This is triggering the PageCrawlDisallowed event with "[Disallowed by robots.txt file]" as the DisallowedReason.

As far as I can tell, the robots.txt file doesn't prevent any page from being crawled. Here is complete text of the robots.txt file:

User-agent: *
Disallow:

This is happening on other websites (incidentally, all WordPress sites) as well.

Add auto-retry functionality

Create AutoRetryCount config value that when is greater than 0 will automatically retry failed (non-200) requests that many times.

Also consider a CrawlDecisionMaker.ShouldRetry() which would allow a way to apply custom logic of when to retry and when not to.

MinCrawlDelayPerDomain ignored when cancellation token is sent

MinCrawlDelayPerDomain ignored when cancellation token is sent in crawl(uri, cancellationtoken) call.

HtmlAgilityPack conflict with nuget version

Abot.1.2.3.1026 the current version on nuget as of 06th March 2014 uses the HtmlAgilityPack v1.4.7 while the only available version of the HtmlAgilityPack on nuget is 1.4.6. This cause a version conflict within the application.

Require the user to change default user agent and robots user agent string

Or suggest it in the documentation. Maybe print a warning or throw an exception of they are not changed.

"System.AggregateException" in mscorlib.dll

Hi,

I am getting a really strange error message when running the latest abot (1.2.3.1005) as a windows service on my server. This error does not occur on my development machine.

I have attached the VS 2012 remote debugger and got the following exception. (Too bad it's exteral code. Therefore I cannot see where the error occurs exactly.):

"System.AggregateException" in mscorlib.dll

A Task's exception(s) were not observed either by Waiting on the Task or accessing its Exception property. As a result, the unobserved exception was rethrown by the finalizer thread.

I haven't changed much on my windows service code but I recently updated Abot via nuget. Maybe this problem is Abot related.
Another point is that the error does not occur when running Abot with a single thread. Running with 10 threads, the windows service crashes.

How to reproduce the problem?
(It only occurs on my windows server. Not on my development machine)

Tell Abot to use 10 Threads
Start the crawl process, wait 5 seconds, stop it, start the crawl process - ERROR

OnDiskCrawledUrlRepository crawls duplicate links

OnDiskCrawledUrlRepository, the problem is easy to replicate. Just add some logs to record all the crawled urls, and you find tons of repeated ones. My seed URL was http://finance.sina.com.cn/

MaxMemoryUsage config does nothing

MaxMemoryUsageInMb config does nothing, was never hooked up

HtmlEntity.DeEntitize() throws error and stops crawl

[2014-02-18 01:23:07,009] [4 ] [INFO ] - Page crawl complete, Status:[200] Url:[http://www.dmoz.org/Games/Video_Games/Roleplaying/N/Numen/] Parent:[http://www.dmoz.org/Games/Video_Games/N/] - [Abot.Crawler.WebCrawler]
[2014-02-18 01:23:07,009] [4 ] [FATAL] - Error occurred during processing of page [http://www.dmoz.org/Games/Video_Games/Roleplaying/N/Numen/] - [Abot.Crawler.WebCrawler]
[2014-02-18 01:23:07,009] [4 ] [FATAL] - System.Collections.Generic.KeyNotFoundException: The given key was not present in the dictionary.
at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
at HtmlAgilityPack.HtmlEntity.DeEntitize(String text)
at Abot.Core.HapHyperLinkParser.GetLinks(HtmlNodeCollection nodes)
at Abot.Core.HapHyperLinkParser.GetHrefValues(CrawledPage crawledPage)
at Abot.Core.HyperLinkParser.GetLinks(CrawledPage crawledPage)
at Abot.Crawler.WebCrawler.ParsePageLinks(CrawledPage crawledPage)
at Abot.Crawler.WebCrawler.ProcessPage(PageToCrawl pageToCrawl) - [Abot.Crawler.WebCrawler]
[2014-02-18 01:23:07,009] [4 ] [INFO ] - Hard crawl stop requested for site [http://www.dmoz.org/]! - [Abot.Crawler.WebCrawler]
[2014-02-18 01:23:07,009] [4 ] [INFO ] - Crawl complete for site [http://www.dmoz.org/]: [02:38:29.0024196] - [Abot.Crawler.WebCrawler]

Add auto retry

Add auto retry based on http status and configurable maxretrycount

PageBag entries by http requester are deleted

I created a TimingPageRequester:

public class TimingPageRequester : PageRequester
{
    ...

    public override CrawledPage MakeRequest(Uri uri, Func<CrawledPage, CrawlDecision> shouldDownloadContent)
    {
        var timer = Stopwatch.StartNew();
        var crawledPage = base.MakeRequest(uri, shouldDownloadContent);
        timer.Stop();
        crawledPage.PageBag.WebTime = timer.ElapsedMilliseconds;
        return crawledPage;
    }
}

However, later on in the complete event, I get a null ref exp, since WebTime is not present on the page's page bag. The reason for this is because the PageToCrawl data is merged into the CrawledPage object in WebCrawler.CrawlThePage(). This is using automapper, which can't merge dynamic properties by default, so the whole expando object simply treated as a scalar dynamic property is overwritten.

I was able to overcome this by adding the following code to WebCrawler.CrawlThePage():

...
var srcBag = crawledPage.PageBag as IDictionary<string, object>;
var dstBag = pageToCrawl.PageBag as IDictionary<string, object>;
if (srcBag != null && dstBag != null)
{
    foreach (var entry in srcBag)
        dstBag.Add(entry);
}
AutoMapper.Mapper.Map(pageToCrawl, crawledPage);

Sorry for not submitting a pull request, I couldn't build the solution - I'm using VS2013 and it blew up when I tried to open the main .sln file. Besides my code couldn't probably use a cleanup.

(I'm using the 1.2.3 branch)

problem with the crawler as the tutorial said

With this code:
CrawlConfiguration crawlConfig = new CrawlConfiguration();
crawlConfig.CrawlTimeoutSeconds = 100;
crawlConfig.MaxConcurrentThreads = 10;
crawlConfig.MaxPagesToCrawl = 1000;
crawlConfig.UserAgentString = "abot v1.0 http://code.google.com/p/abot";
//crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue1", "1111");
//crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue2", "2222");
PoliteWebCrawler crawler = new PoliteWebCrawler(crawlConfig, null, null, null, null, null, null, null);

This is the error:
Error 1 The best overloaded method match for 'Abot.Crawler.PoliteWebCrawler.PoliteWebCrawler(Abot.Poco.CrawlConfiguration, Abot.Core.ICrawlDecisionMaker, Abot.Util.IThreadManager, Abot.Core.IScheduler, Abot.Core.IPageRequester, Abot.Core.IHyperLinkParser, Abot.Util.IMemoryManager, Abot.Core.IDomainRateLimiter, Abot.Core.IRobotsDotTextFinder)' has some invalid arguments

How to sign in a page during the crawl?

I have a website which need username/password to sign in, how? Thank you :)

when a page can't be accessed, it seems Abot doesn't retry, even worse, it stop the whole crawling process

Hello,
The maxRetryCount is very useful for me. I test it today.
I found one problem, when a page can't be accessed, it seems Abot doesn't retry, even worse, it stop the whole crawling process. is there something I didn't configure well?
I add two config value
maxRetryCount="3" minRetryDelayInMilliseconds="5000"

Any help is appreciated.

here under is the log,

[2014-12-31 18:24:55,174] [6 ] [INFO ] - Page crawl complete, Status:[NA] Url:[(site Uri)] Parent:[] Retry:[0] - [AbotLogger]
[2014-12-31 18:24:55,174] [6 ] [FATAL] - Error occurred during processing of page [(site Uri)] - [AbotLogger]
[2014-12-31 18:24:55,174] [20 ] [ERROR] - Crawl of page failed (site Uri) - [Abot.Crawler.PoliteWebCrawler]
[2014-12-31 18:24:55,174] [6 ] [FATAL] - System.NullReferenceException: Object reference not set to an instance of an object
在 Abot.Crawler.WebCrawler.ShouldRecrawlPage(CrawledPage crawledPage)
在 Abot.Crawler.WebCrawler.ProcessPage(PageToCrawl pageToCrawl) - [AbotLogger]
[2014-12-31 18:24:55,174] [1 ] [INFO ] - Hard crawl stop requested for site [http://www.amazon.com/Best-Sellers-Clothing/zgbs/apparel/ref=zg_bs_unv_a_1_1040660_1]! - [AbotLogger]
[2014-12-31 18:24:55,174] [1 ] [INFO ] - Crawl complete for site [http://www.amazon.com/Best-Sellers-Clothing/zgbs/apparel/ref=zg_bs_unv_a_1_1040660_1]: Crawled [219] pages in [00:20:18.1264665] - [AbotLogger]

sjdirect / abot Goto Github PK

abot's Issues

Recommend Projects

Recommend Topics

Recommend Org