sjdirect / abot
Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
License: Apache License 2.0
Could use PhantomJS, or possibly some of the low-level .NET libraries that power the web browser plugin.
It would be useful to have a dynamic object that is available only for the duration of the Crawl call e.g.
IWebCrawler.Crawl(Uri uri, dynamic localConfig)
I am currently using the CrawlBag, but it's a bit messy, since I want to pass a business object to the crawler that should only be valid for that single call to Crawl; subsequent calls will pass different localConfig objects. These objects handle the building and processing of the DOM according to my business logic, and the construction of extracted hierarchical data.
I can see the _crawlContext persists for the lifetime of the IWebCrawler, which is great as I need some configuration valid for the entire IWebCrawler existence i.e. multiple subsequent calls to Crawl, but I also need configuration scoped only to an individual call to Crawl which I'd imagine is best controlled as a method parameter. Let me know if there is a better way of accomplishing my tasks or if you need better info.
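For what it's worth, the current CrawlBag workaround looks roughly like this (a sketch only; the proposed localConfig parameter does not exist, and the exact CrawlBag access pattern is an assumption):

```csharp
// Current workaround sketch: stuff per-crawl state into the dynamic
// CrawlBag before each call and read it back in event handlers.
// myBusinessObject is a placeholder for the caller's business object.
crawler.CrawlBag.LocalConfig = myBusinessObject; // only meaningful for this call
CrawlResult result = crawler.Crawl(siteUri);

// vs. the requested API, scoped cleanly to a single call:
// IWebCrawler.Crawl(Uri uri, dynamic localConfig)
```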
Would it be possible to make a Nuget package?
I'll try drafting one you can base it on
Has this been solved, so I can stop bundling the HtmlAgilityPack (HAP) dll?
https://code.google.com/p/abot/issues/detail?id=77&can=1&q=htmlagilitypack
http://stackoverflow.com/questions/26925181/how-to-store-crawled-htmls
Something about dynamic page rank to prevent cycling?
Something for analyzing history of content changes?
Something for website structure analysis?
This count actually shows how many pages have been crawled or scheduled. It was by design, but I can see why it's confusing.
To crawl all pages of a site, I currently set MaxPagesToCrawl to 1000000.
Another approach would be to treat 0 as an infinite crawl.
Can you also change the code to skip the check when MaxPagesToCrawl is set to zero?
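A minimal sketch of what such a check might look like (hypothetical helper name; the actual Abot logic lives in its decision-making code):

```csharp
using System;

class MaxPagesCheckSketch
{
    // Hypothetical helper: treat MaxPagesToCrawl == 0 as "no limit",
    // otherwise enforce the configured cap.
    public static bool LimitReached(int maxPagesToCrawl, int pagesCrawled)
        => maxPagesToCrawl > 0 && pagesCrawled >= maxPagesToCrawl;

    static void Main()
    {
        Console.WriteLine(LimitReached(0, 5000000)); // False: 0 means infinite crawl
        Console.WriteLine(LimitReached(1000, 1000)); // True: configured cap reached
    }
}
```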
As can be seen from the amount of code that detects and attempts to handle excessive memory use, there is an issue where the crawler allocates memory faster than it can be released. A quick hunt through the code showed three causes:
Not implementing IDisposable where required, and not wrapping IDisposable objects in using blocks. No memory leaks were detected, but the GC is straining to keep up with the extra work of queuing up finalization in Gen 1 and calling the finalizers in Gen 2.
Due to the nature of a crawler, a significant amount of memory is quickly allocated and discarded, e.g. for page content. The default workstation ("client mode") garbage collector can't keep up and needs to be changed to the server-mode GC, which can be done in app.config under \configuration\runtime\gcServer[@enabled=true].
Because the event "PageCrawlCompletedAsync" is asynchronous, there is no callback to dispose the relevant objects.
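For reference, enabling the server GC is a one-line change in app.config (this is standard .NET runtime configuration, not Abot-specific):

```xml
<configuration>
  <runtime>
    <!-- Use the server-mode garbage collector, which handles high
         allocate-and-discard workloads much better than the default
         workstation ("client mode") GC. -->
    <gcServer enabled="true"/>
  </runtime>
</configuration>
```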
Sometimes I'd like to add a header to the request before it's made, e.g. add If-Modified-Since header, or add authentication headers, etc.
I know I can create a class derived from PageRequester and override BuildRequestObject(), but adding a request customization delegate would be a much lower barrier.
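For anyone with the same need, a rough sketch of the derivation approach (this assumes BuildRequestObject(Uri) is protected virtual, as described above; the class name, constructor signature, and field are illustrative):

```csharp
using System;
using System.Net;
using Abot.Core;
using Abot.Poco;

// Sketch only: adds an If-Modified-Since header to every request
// by overriding the request-building hook in a derived PageRequester.
public class ConditionalGetPageRequester : PageRequester
{
    private readonly DateTime _lastCrawlUtc;

    public ConditionalGetPageRequester(CrawlConfiguration config, DateTime lastCrawlUtc)
        : base(config)
    {
        _lastCrawlUtc = lastCrawlUtc;
    }

    protected override HttpWebRequest BuildRequestObject(Uri uri)
    {
        HttpWebRequest request = base.BuildRequestObject(uri);
        request.IfModifiedSince = _lastCrawlUtc; // sets the If-Modified-Since header
        return request;
    }
}
```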
When I open Abot.sln, both VS2010 and VS2012 crash. Known issue?
Add a constructor to set Allow and Reason.
I would've made a pull request, but apparently my Git mojo is working against me.
Here's the code:
https://github.com/LordMike/abot/commit/b09193d4b012db206c5606b6e7624cc25f3e756e
That could be used for retries or special logging/rate limiting.
I am not sure if this feature has been added, but I would like to pause and resume the crawl while holding on to an instance of the web crawler, and also to pause the crawler to disk and resume it from disk.
Someone wrote to me:
With default Abot App.config file, all logging is done to a single file appender.
I feel it pollutes the normal application log file, and would instead prefer this to be stored in its own log file.
To do that you can easily create a special logger like this:
ILog AppLogger = LogManager.GetLogger("MyApplicationLogger");
And in the App.config file, you can configure the system to use a dedicated appender for this logger like this:
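A dedicated appender plus a non-additive logger section might look like this (standard log4net configuration; appender name, file name, and pattern are illustrative):

```xml
<log4net>
  <appender name="MyApplicationAppender" type="log4net.Appender.RollingFileAppender">
    <file value="application.log"/>
    <appendToFile value="true"/>
    <layout type="log4net.Layout.PatternLayout">
      <conversionPattern value="[%date] [%thread] [%-5level] - %message - [%logger]%newline"/>
    </layout>
  </appender>
  <!-- additivity="false" routes "MyApplicationLogger" only to the
       dedicated file, so it does not also flow to Abot's root appender. -->
  <logger name="MyApplicationLogger" additivity="false">
    <level value="INFO"/>
    <appender-ref ref="MyApplicationAppender"/>
  </logger>
</log4net>
```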
Hello,
I don't know where else to ask this. I would like to ask how the crawler manages visited pages. I am wondering whether all the bookkeeping is kept in memory, or whether the crawler uses temporary files to store the hashes of the visited pages, and if so, where those files are on disk.
Thank you very much, and sorry for using this channel.
Regards.
I'm a noob. I followed the tutorial and the console is running fine with messages. But I don't really know how to use the crawled pages' content. Can you help me?
The Abot crawler is not crawling the website http://www.percona.com when I set IsRespectRobotsDotTextEnabled = true in the configuration.
I even validated the robots.txt file of www.percona.com.
Add functionality that will allow a crawl to be continued from where it was stopped or paused.
I am trying to run Abot.Demo on
https://focus.kontur.ru
I added site certificate with
yes | certmgr -ssl -v https://focus.kontur.ru
The Abot.Demo program gives me a "Max. redirections exceeded." exception
and the following line in the log:
[2014-11-15 08:24:41,678] [1] [INFO ] - Page crawl complete, Status:[302] Url:[https://focus.kontur.ru/] Parent:[https://focus.kontur.ru/] - [AbotLogger]
I use Mono 3.10.1 on Linux.
What is the problem, and how can I overcome it?
I really like your library and have used it for a couple of projects. Now I want to crawl a page that uses JavaScript to generate HTML. Is this possible? It should capture the HTML after a couple of seconds.
Updating the NuGet package created second crawlBehavior & politeness entries in app.config.
Some pages are served with the X-Robots-Tag HTTP header carrying values like these:
HTTP/1.1 200 OK
Date: Sun, 02 March 2014 21:42:43 GMT
(…)
X-Robots-Tag: noindex,nofollow
(…)
Some information: see the Robots meta tag and X-Robots-Tag HTTP header specifications.
Can Abot check this X-Robots-Tag HTTP header?
Function _isInternalDecisionMaker falsely detects that the link is external
protected Func<Uri, Uri, bool> _isInternalDecisionMaker = (uriInQuestion, rootUri) => uriInQuestion.Authority == rootUri.Authority;
E.g.:
http://docs.mysite.com/P1.html has a link such as http://www.mysite.com/
rootUri.Authority = docs.mysite.com
uriInQuestion.Authority = www.mysite.com
=> _isInternalDecisionMaker returns False
But it should be True, because docs and www are just subdomains of the same site; www.mysite.com is internal.
To detect whether a link is internal or external, you could use the domainname-parser library.
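A naive sketch of an alternative decision delegate that compares only the last two host labels, so sibling subdomains count as internal. This is an illustration, not a complete fix: correct handling of multi-part TLDs like .co.uk needs a public-suffix list, which is exactly what the domainname-parser library mentioned above provides.

```csharp
using System;
using System.Linq;

public static class InternalLinkSketch
{
    // Naive "registrable domain": the last two labels of the host.
    // docs.mysite.com and www.mysite.com both yield "mysite.com".
    public static string RegistrableDomain(Uri uri)
    {
        var labels = uri.Host.Split('.');
        return string.Join(".", labels.Skip(Math.Max(0, labels.Length - 2)));
    }

    public static bool IsInternal(Uri uriInQuestion, Uri rootUri)
        => RegistrableDomain(uriInQuestion) == RegistrableDomain(rootUri);

    static void Main()
    {
        Console.WriteLine(IsInternal(
            new Uri("http://www.mysite.com/"),
            new Uri("http://docs.mysite.com/P1.html"))); // True
    }
}
```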
We need to have a config for IsAutoRetryEnabled and AutoRetryCount. If set it will make sure to not stop the crawl until all retries have been met.
CrawlDecisionMaker, Scheduler and WebCrawler do not respect the PageToCrawl.IsRetry property.
It would be great if I could just fire up a mysql db to have the crawler store essential data like
urls to crawl
urls crawled
url outlinks
url inlinks
url content
I am using the latest Abot 1.2.3 via NuGet.
Maybe I am doing something wrong but the following code does not stop the crawl process:
CancellationTokenSource cts = new CancellationTokenSource();
private void btnStart_Click(object sender, EventArgs e)
{
BackgroundWorker bgw = new BackgroundWorker();
bgw.DoWork += bgw_DoWork;
bgw.RunWorkerCompleted += bgw_RunWorkerCompleted;
bgw.RunWorkerAsync();
}
void bgw_RunWorkerCompleted(object sender, RunWorkerCompletedEventArgs e)
{
//never reaches this point
}
void bgw_DoWork(object sender, DoWorkEventArgs e)
{
PoliteWebCrawler crawler = new PoliteWebCrawler();
crawler.PageCrawlCompletedAsync += crawler_PageCrawlCompletedAsync;
CrawlResult result = crawler.Crawl(new Uri("http://www.finanzen-forum.net"), cts);
}
void crawler_PageCrawlCompletedAsync(object sender, PageCrawlCompletedArgs e)
{
}
private void btnStop_Click(object sender, EventArgs e)
{
cts.Cancel();
}
The problem only occurs when you have configured multiple threads (in my case 20) and you let the crawler run a bit (15 seconds was enough to reproduce the problem).
With 1 thread, it works.
I know you have to wait a bit, depending on how many threads you have started, but I have waited for over a minute for Abot to end all 20 threads. No success.
I have problems with parsing: for each link (each a-href, I think) I want to use a header/content callback; this needs to be more stable for correct content crawling.
Best regards
In the function HasRobotsNoFollow(...), you check the robots meta tag like this:
return robotsMeta != null && robotsMeta.ToLower().Contains("nofollow");
But the robots meta tag value NONE is equivalent to "NOINDEX, NOFOLLOW".
The test should be:
return robotsMeta != null && (robotsMeta.ToLower().Contains("nofollow") || robotsMeta.ToLower().Contains("none"));
Be sure to include
ImplementationContainer
ImplementationContainer.ImplementationBag
Right now we don't get all the nice comments in the source code as intellisense documentation when using the NuGet package.
You should output the "XML documentation file" in the build of Abot and include the generated Abot.xml in the NuGet package
Abot's configuration parameters are printed by default (PrintConfigValues(...) in the Crawl function).
Is it possible to have a switch in the App.config file to enable or disable the display of the Abot configuration?
For example, in debug mode I would set it to "1" to display the configuration, and in production mode I would set it to "0".
It would be great to have a priority queue implementing Abot.Core.IPagesToCrawlRepository, allowing URLs to be prioritised dynamically. There are generic .NET implementations of priority queues in the public domain (hopefully avoiding reinventing the wheel), although there doesn't appear to be one in the .NET Framework.
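A sketch of what such an adapter might look like. The member names (Add/GetNext/Clear/Count) are assumptions about the IPagesToCrawlRepository interface, and SimplePriorityQueue stands in for any existing public-domain .NET priority queue:

```csharp
using System;
using Abot.Core;
using Abot.Poco;

// Sketch: wraps a generic priority queue behind Abot's
// IPagesToCrawlRepository so pages can be dequeued by priority
// instead of FIFO order.
public class PriorityPagesToCrawlRepository : IPagesToCrawlRepository
{
    private readonly SimplePriorityQueue<PageToCrawl> _queue
        = new SimplePriorityQueue<PageToCrawl>();
    private readonly Func<PageToCrawl, int> _prioritizer;

    public PriorityPagesToCrawlRepository(Func<PageToCrawl, int> prioritizer)
    {
        // e.g. prioritize by crawl depth, domain, or content freshness
        _prioritizer = prioritizer;
    }

    public void Add(PageToCrawl page) => _queue.Enqueue(page, _prioritizer(page));
    public PageToCrawl GetNext() => _queue.Count > 0 ? _queue.Dequeue() : null;
    public void Clear() => _queue.Clear();
    public int Count() => _queue.Count;
}
```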
When we try to crawl a password-protected site (with basic authentication), the crawler fails with a message that the page has no content. The URL I am passing is http://username:password@siteurl
I need to crawl a site multiple times to get updated content on a regular basis.
In order to save bandwidth and avoid unnecessary work, I have customized the PageRequester class by deriving from it, so that I can add an If-Modified-Since header and only request pages that have been modified since the last crawl. (I know I could have just added a delegate for ShouldDownloadPageContent to make the decision, but it's much cleaner to use HTTP headers to tell the server not to send the response if the page wasn't modified than to let the server send the response and have the crawler ignore it.)
The problem I'm having: say the root page wasn't modified. In that case it's not going to be fetched, no links will ever be scheduled for crawling, and the crawl stops immediately.
My request is to allow the crawler to be seeded with a list of URLs to crawl. In the first crawl session, I'm going to supply only the root Uri. In each subsequent crawl session I'm going to supply all URLs that have been crawled before for scheduling. This should allow the crawler to continue crawling those links even if any page containing those links isn't crawled due to not being modified.
When I looked, I found that if I get access to the Scheduler object on the CrawlContext, I can schedule those links. But the problem is that the CrawlContext is not available before starting the crawl; it's only passed when events/delegates are raised/invoked. I can certainly try to inherit from WebCrawler (or PoliteWebCrawler) and expose the context/scheduler, or a method to seed the crawler with a list of URLs.
I guess my request is one of:
- expose the CrawlContext on the IWebCrawler interface,
- expose the IScheduler interface through the IWebCrawler interface (doesn't feel right), or
- an overload of the IWebCrawler.Crawl() method that takes an IEnumerable<Uri> of seed URLs.
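One possible shape for the seeding overload, sketched in a WebCrawler subclass (hypothetical: it assumes access to the protected crawl context/scheduler described above, and an Add method on the scheduler):

```csharp
// Sketch of a hypothetical seeding overload on a WebCrawler subclass:
// push previously crawled URLs into the scheduler so they are
// re-visited even when the pages linking to them return 304.
public CrawlResult Crawl(Uri rootUri, IEnumerable<Uri> seedUris)
{
    foreach (Uri seed in seedUris)
    {
        _crawlContext.Scheduler.Add(new PageToCrawl(seed)
        {
            ParentUri = rootUri,
            IsInternal = true
        });
    }
    return Crawl(rootUri); // then run the normal crawl
}
```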
I noticed in the WebContentExtractor class of the Abot crawler, in the GetCharsetFromBody method, that when parsing the page's charset from the page body, it only handles the case where the charset attribute value is enclosed in " characters.
For example this works and charset is correctly recognized:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1250" />
But when the meta tag uses ' characters, for example
<meta http-equiv='Content-Type' content='text/html; charset=utf-8' />
the charset is not correctly recognized, and the text later extracted from the page contains garbled characters.
I suggest editing the method, especially this part:
if (meta != null)
{
int start_ind = meta.IndexOf("charset=");
int end_ind = -1;
if (start_ind != -1)
{
end_ind = meta.IndexOf("\"", start_ind);
if (end_ind != -1)
{
int start = start_ind + 8;
charset = meta.Substring(start, end_ind - start + 1);
charset = charset.TrimEnd(new Char[] { '>', '"' });
}
}
}
and change it to this:
if (meta != null)
{
Match match = Regex.Match(meta, @"<meta.*charset=(.+)\/>");
if (match.Success)
{
string match_str = match.Groups[1].Value;
int end_ind = match_str.IndexOf('"');
if (end_ind == -1)
end_ind = match_str.IndexOf('\'');
if (end_ind != -1)
charset = match_str.Remove(end_ind);
}
}
so that it correctly recognizes both ways of writing the meta tag.
I am trying to crawl this website with IsRespectRobotsDotTextEnabled set to true: http://artofprogress.com/
This is triggering the PageCrawlDisallowed event with "[Disallowed by robots.txt file]" as the DisallowedReason.
As far as I can tell, the robots.txt file doesn't prevent any page from being crawled. Here is complete text of the robots.txt file:
User-agent: *
Disallow:
This is happening on other websites (incidentally, all WordPress sites) as well.
Create AutoRetryCount config value that when is greater than 0 will automatically retry failed (non-200) requests that many times.
Also consider a CrawlDecisionMaker.ShouldRetry() which would allow a way to apply custom logic of when to retry and when not to.
MinCrawlDelayPerDomain ignored when cancellation token is sent in crawl(uri, cancellationtoken) call.
Abot 1.2.3.1026, the current version on NuGet as of 6 March 2014, uses HtmlAgilityPack v1.4.7, while the only available version of HtmlAgilityPack on NuGet is 1.4.6. This causes a version conflict within the application.
Or suggest it in the documentation. Maybe print a warning or throw an exception if they are not changed.
Hi,
I am getting a really strange error message when running the latest abot (1.2.3.1005) as a windows service on my server. This error does not occur on my development machine.
I have attached the VS 2012 remote debugger and got the following exception. (Too bad it's external code, so I cannot see exactly where the error occurs.):
"System.AggregateException" in mscorlib.dll
A Task's exception(s) were not observed either by Waiting on the Task or accessing its Exception property. As a result, the unobserved exception was rethrown by the finalizer thread.
I haven't changed much on my windows service code but I recently updated Abot via nuget. Maybe this problem is Abot related.
Another point is that the error does not occur when running Abot with a single thread. Running with 10 threads, the windows service crashes.
How to reproduce the problem?
(It only occurs on my windows server. Not on my development machine)
With OnDiskCrawledUrlRepository, the problem is easy to replicate. Just add some logging to record all crawled URLs, and you will find tons of repeated ones. My seed URL was http://finance.sina.com.cn/
The MaxMemoryUsageInMb config value does nothing; it was never hooked up.
[2014-02-18 01:23:07,009] [4 ] [INFO ] - Page crawl complete, Status:[200] Url:[http://www.dmoz.org/Games/Video_Games/Roleplaying/N/Numen/] Parent:[http://www.dmoz.org/Games/Video_Games/N/] - [Abot.Crawler.WebCrawler]
[2014-02-18 01:23:07,009] [4 ] [FATAL] - Error occurred during processing of page [http://www.dmoz.org/Games/Video_Games/Roleplaying/N/Numen/] - [Abot.Crawler.WebCrawler]
[2014-02-18 01:23:07,009] [4 ] [FATAL] - System.Collections.Generic.KeyNotFoundException: The given key was not present in the dictionary.
at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
at HtmlAgilityPack.HtmlEntity.DeEntitize(String text)
at Abot.Core.HapHyperLinkParser.GetLinks(HtmlNodeCollection nodes)
at Abot.Core.HapHyperLinkParser.GetHrefValues(CrawledPage crawledPage)
at Abot.Core.HyperLinkParser.GetLinks(CrawledPage crawledPage)
at Abot.Crawler.WebCrawler.ParsePageLinks(CrawledPage crawledPage)
at Abot.Crawler.WebCrawler.ProcessPage(PageToCrawl pageToCrawl) - [Abot.Crawler.WebCrawler]
[2014-02-18 01:23:07,009] [4 ] [INFO ] - Hard crawl stop requested for site [http://www.dmoz.org/]! - [Abot.Crawler.WebCrawler]
[2014-02-18 01:23:07,009] [4 ] [INFO ] - Crawl complete for site [http://www.dmoz.org/]: [02:38:29.0024196] - [Abot.Crawler.WebCrawler]
Add auto retry based on http status and configurable maxretrycount
I created a TimingPageRequester:
public class TimingPageRequester : PageRequester
{
...
public override CrawledPage MakeRequest(Uri uri, Func<CrawledPage, CrawlDecision> shouldDownloadContent)
{
var timer = Stopwatch.StartNew();
var crawledPage = base.MakeRequest(uri, shouldDownloadContent);
timer.Stop();
crawledPage.PageBag.WebTime = timer.ElapsedMilliseconds;
return crawledPage;
}
}
However, later on in the completed event, I get a null reference exception, since WebTime is not present on the page's PageBag. The reason is that the PageToCrawl data is merged into the CrawledPage object in WebCrawler.CrawlThePage() using AutoMapper, which can't merge dynamic properties by default, so the whole ExpandoObject is treated as a scalar dynamic property and simply overwritten.
I was able to overcome this by adding the following code to WebCrawler.CrawlThePage():
...
var srcBag = crawledPage.PageBag as IDictionary<string, object>;
var dstBag = pageToCrawl.PageBag as IDictionary<string, object>;
if (srcBag != null && dstBag != null)
{
foreach (var entry in srcBag)
dstBag.Add(entry);
}
AutoMapper.Mapper.Map(pageToCrawl, crawledPage);
Sorry for not submitting a pull request; I couldn't build the solution. I'm using VS2013 and it blew up when I tried to open the main .sln file. Besides, my code could probably use a cleanup.
(I'm using the 1.2.3 branch)
With this code:
CrawlConfiguration crawlConfig = new CrawlConfiguration();
crawlConfig.CrawlTimeoutSeconds = 100;
crawlConfig.MaxConcurrentThreads = 10;
crawlConfig.MaxPagesToCrawl = 1000;
crawlConfig.UserAgentString = "abot v1.0 http://code.google.com/p/abot";
//crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue1", "1111");
//crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue2", "2222");
PoliteWebCrawler crawler = new PoliteWebCrawler(crawlConfig, null, null, null, null, null, null, null);
This is the error:
Error 1 The best overloaded method match for 'Abot.Crawler.PoliteWebCrawler.PoliteWebCrawler(Abot.Poco.CrawlConfiguration, Abot.Core.ICrawlDecisionMaker, Abot.Util.IThreadManager, Abot.Core.IScheduler, Abot.Core.IPageRequester, Abot.Core.IHyperLinkParser, Abot.Util.IMemoryManager, Abot.Core.IDomainRateLimiter, Abot.Core.IRobotsDotTextFinder)' has some invalid arguments
I have a website which needs a username/password to sign in. How can I crawl it? Thank you :)
Hello,
The maxRetryCount is very useful for me. I tested it today.
I found one problem: when a page can't be accessed, it seems Abot doesn't retry; even worse, it stops the whole crawling process. Is there something I didn't configure correctly?
I added two config values:
maxRetryCount="3" minRetryDelayInMilliseconds="5000"
Any help is appreciated.
Below is the log:
[2014-12-31 18:24:55,174] [6 ] [INFO ] - Page crawl complete, Status:[NA] Url:[(site Uri)] Parent:[] Retry:[0] - [AbotLogger]
[2014-12-31 18:24:55,174] [6 ] [FATAL] - Error occurred during processing of page [(site Uri)] - [AbotLogger]
[2014-12-31 18:24:55,174] [20 ] [ERROR] - Crawl of page failed (site Uri) - [Abot.Crawler.PoliteWebCrawler]
[2014-12-31 18:24:55,174] [6 ] [FATAL] - System.NullReferenceException: Object reference not set to an instance of an object
at Abot.Crawler.WebCrawler.ShouldRecrawlPage(CrawledPage crawledPage)
at Abot.Crawler.WebCrawler.ProcessPage(PageToCrawl pageToCrawl) - [AbotLogger]
[2014-12-31 18:24:55,174] [1 ] [INFO ] - Hard crawl stop requested for site [http://www.amazon.com/Best-Sellers-Clothing/zgbs/apparel/ref=zg_bs_unv_a_1_1040660_1]! - [AbotLogger]
[2014-12-31 18:24:55,174] [1 ] [INFO ] - Crawl complete for site [http://www.amazon.com/Best-Sellers-Clothing/zgbs/apparel/ref=zg_bs_unv_a_1_1040660_1]: Crawled [219] pages in [00:20:18.1264665] - [AbotLogger]