Comments (6)

sjdirect commented on August 20, 2024

Can you submit a snippet that shows how/where you are setting the dynamic value and where you expect to get the value back out?

from abotx.

MossP commented on August 20, 2024

Sorry for the delay; I wanted to be sure the error wasn't being triggered elsewhere.

Here's an example of it going awry for me:

globalCrawlEngine.parallelCrawler.CrawlerInstanceCreatedAsync += (sender, e) =>
{
    e.Crawler.PageCrawlStartingAsync += ProcessPageCrawlStarting;
    e.Crawler.PageCrawlCompletedAsync += ProcessPageCrawlCompleted;
    e.Crawler.PageCrawlDisallowedAsync += PageCrawlDisallowed;
    e.Crawler.PageLinksCrawlDisallowedAsync += PageLinksCrawlDisallowed;

    e.Crawler.CrawlBag.siteID = e.SiteToCrawl.SiteBag.siteID;
    e.Crawler.CrawlBag.siteGrabID = e.SiteToCrawl.SiteBag.siteGrabID;

    var testID1 = e.Crawler.CrawlBag.siteID;
    var testUri1 = e.SiteToCrawl.Uri.AbsoluteUri;
};

The variables all match up here. I can add a breakpoint and see that the URL matches the ID that has been passed in. But later in the code, when the PageCrawlStartingAsync event is triggered, the variables no longer match:

public void ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
    var testID2 = e.CrawlContext.CrawlBag.siteID;
    var testUri2 = e.PageToCrawl.Uri.AbsoluteUri;
}

Although testID1 and testID2 match (they are newly generated for each crawl), the host in testUri1 differs from the host in testUri2.

As mentioned originally, though, I can only trigger this error after a fresh compile, and only if two sites are added in quick succession via the following code:

SiteToCrawl siteAddition = new SiteToCrawl();
siteAddition.Uri = new Uri(siteGrab.domain);
siteAddition.SiteBag.siteGrabID = siteGrab.id;
siteAddition.SiteBag.siteID = siteGrab.siteID;

siteAddition.SiteBag.baseUri = siteAddition.Uri;
_siteToCrawlProviderX.AddSitesToCrawl(new List<SiteToCrawl> { siteAddition });
globalCrawlEngine.parallelCrawler.StartAsync();

All of the data at this point is also matching and correct.

sjdirect commented on August 20, 2024

Can you try something for me? Subscribe to the non-async version of the event and see if that solves your issue.

globalCrawlEngine.parallelCrawler.CrawlerInstanceCreated

instead of

globalCrawlEngine.parallelCrawler.CrawlerInstanceCreatedAsync

I suspect that, since the initialization of those values happens asynchronously, the ProcessPageCrawlStarting event may fire while the CrawlerInstanceCreatedAsync handler is only halfway done.

MossP commented on August 20, 2024

Thanks for getting back to me.

OK, I tried that and it wasn't the cause. I did find a dirty way around it, but then realised that it was inconsistent. It appears that rootUri and originalRootUri are often null the first time through and are then set to the first site's URL on the second pass. The crawler always takes this URL as its base URL for the crawl, so the isInternal check comes back incorrect. I have now overridden this check and test it manually in the decision maker, and I also pass the base URL down in the SiteBag so I can manually re-add it during the instanceCreated function. This seems to work, but it is a very dirty fix and sometimes drops out (different results on each test, even from a fresh compile).

I am about to try an even dirtier fix: adding a delay in the init of the parallel crawler to prevent any additions if the singleton crawler has only recently been set up. I'll let you know how I get on, and if you have any other ideas, I'm all ears.

sjdirect commented on August 20, 2024

When I try to set this up with your snippets, I'm still unable to repro. Could you create a fully runnable example, i.e. a console app that registers everything and adds a condition in SiteCrawlStarting that detects when the values are mismatched, so I can set a breakpoint there and wait for the condition to happen?
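
A minimal sketch of such a mismatch condition, assuming the base URI stored in the SiteBag and the URI of the page being crawled are both absolute. The CrawlMismatchCheck class and IsHostMismatch method are hypothetical names used only for illustration; they are not part of Abot or AbotX, and the sketch relies only on System.Uri.

```csharp
using System;

// Hypothetical helper, not part of Abot/AbotX.
public static class CrawlMismatchCheck
{
    // Returns true when the page being crawled is on a different host than
    // the base URI that was stored for the site when it was added.
    // Assumes both URIs are absolute (Uri.Host throws for relative URIs).
    public static bool IsHostMismatch(Uri expectedBaseUri, Uri pageUri)
    {
        // A missing URI is itself a condition worth stopping on.
        if (expectedBaseUri == null || pageUri == null)
            return true;

        // Host names are case-insensitive, so compare accordingly.
        return !string.Equals(expectedBaseUri.Host, pageUri.Host,
                              StringComparison.OrdinalIgnoreCase);
    }
}
```

In an event handler, the result could guard a conditional breakpoint or a call to System.Diagnostics.Debugger.Break(), passing the baseUri kept in the bag and the URI of the page. Note that this strict comparison treats subdomains such as www.example.com and example.com as different hosts, so it may need loosening for sites that redirect between them.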

sjdirect commented on August 20, 2024

Closing this issue since I cannot repro and have not heard from the reporter in almost two weeks.
