
Falling into Crawl Traps (cobweb, OPEN)

fuzzygroup commented on June 21, 2024
Falling into Crawl Traps

Comments (10)

stewartmckee commented on June 21, 2024

As a quick solution you could exclude the page 4 URL? I'll look at it in more detail when I get back home.

stewartmckee commented on June 21, 2024

Yeah, if you add the URL for page 4 to your external_urls config option, it should exclude that page, and assuming the subsequent pages are only crawled because the crawler has gone into page 4, it will stop at that point. A better solution would be giving you the ability to prevent the subsequent crawl of links within the current page as you are processing it. That would mean you could detect this issue, maybe due to there being no results in the page, and mark it as not to be processed.
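
For reference, a minimal sketch of that exclusion, assuming the standalone CobwebCrawler interface and that external_urls takes a list of URL patterns (the exact option handling in the gem may differ):

    require 'cobweb'

    # Treat the broken pagination URL as if it were external so it is never
    # followed. The specific "?p=4" URL is only an example for this site.
    crawler = CobwebCrawler.new(
      :cache => 600,
      :external_urls => ["https://www.udemy.com/courses/photography/mobile-photography/all-courses/?p=4"]
    )

    crawler.crawl("https://www.udemy.com/courses/photography/mobile-photography/all-courses/") do |page|
      puts "crawled: #{page[:url]}"
    end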

fuzzygroup commented on June 21, 2024

Hi Stewart,

I'm more than willing to take a stab at fixing this. Mind giving me a pointer as to where to best start so I don't make a hash of your nice work?

stewartmckee commented on June 21, 2024

I think the best solution would possibly be to include the internal_links data in the hash passed to the block; that way, in your code, you can add and remove items to and from the list, giving you control over what next steps the crawler will take. That would mean moving the yield above the internal_links.each call on line 121, and changing "internal_links.each" to "content[:internal_links].each" so that updates made within the block are used.
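
Roughly, the reordering would look something like this (an illustrative sketch based on the description above, not the actual cobweb source; process_page and queue_link are stand-ins for whatever the crawler really does at that point):

    # Before (roughly): links are queued before the block ever sees the page.
    #   internal_links.each { |link| queue_link(link) }
    #   yield content if block_given?

    def process_page(content)
      # After the change: yield first, then iterate the list held in the
      # content hash, so any links the block adds or removes are respected.
      yield content if block_given?
      content[:internal_links].each do |link|
        queue_link(link)   # queue_link stands in for cobweb's real queuing step
      end
    end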

So in your scenario, you would detect there are no items returned in the page body and remove the "?p=" link.
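
In that case the crawl block might look something like this (hypothetical: the "no results" check and the content keys are placeholders for whatever the real markup requires):

    crawler.crawl(start_url) do |content|
      # Crude "empty results page" check -- a real check would parse the HTML
      # and look for actual course entries rather than a marker string.
      unless content[:body] =~ /course-card/
        # Drop the bogus "next page" links so the crawler never follows them.
        content[:internal_links].reject! { |link| link.include?("?p=") }
      end
    end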

That's my thinking just now; I haven't had a chance to try it out yet though.

fuzzygroup commented on June 21, 2024

So I've been thinking about the solution you propose and I feel like quite a jerk. I was about to tell you that this was a Cobweb-level issue -- but it's not. The site itself is buggy:

https://www.udemy.com/courses/photography/mobile-photography/all-courses/

and it has a link to ?p=2. So you go to:

https://www.udemy.com/courses/photography/mobile-photography/all-courses/?p=2

and it has a link to ?p=3. So you go to:

https://www.udemy.com/courses/photography/mobile-photography/all-courses/?p=3

and it has a link to ?p=4, which doesn't exist -- YEP -- this is a site-level bug and it HAS to be handled in my application-specific logic. The site generates an infinite succession of pagination offsets even though there are only 3 pages. Sheesh. Your code is doing things exactly right; you probably knew that and I should have dug deeper before even raising the issue. Apologies.

The only question is how this gets handled in an extensible way without my having to maintain my own fork -- and I don't have a great answer for this. The only thing that comes to mind is some kind of is_link_valid? method, but given that this is JavaScript pagination, even knowing that a link is invalid is hard. Another possibility is maybe not following <link> tags at all -- but then I see no way to navigate this particular succession of content.

Kudos, by the way, for supporting <link> tags. I don't think I've ever put that into any crawlers I wrote from scratch.
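
Purely to illustrate the idea (is_link_valid? does not exist in cobweb; this is a hypothetical option, and the depth cap is just one crude policy):

    options = {
      # Hypothetical hook: let the application veto individual links before
      # the crawler queues them.
      :is_link_valid => lambda do |link|
        page = link[/\?p=(\d+)/, 1]
        page.nil? || page.to_i <= 3   # assumes we somehow know only 3 pages exist
      end
    }

The obvious weakness, as noted above, is that with JavaScript-driven pagination the application often cannot tell up front whether a link is valid at all.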

I was able to confirm that one other gem, spidr, has this exact same problem, which doubly confirms that the site is actually buggy. Interestingly, Google doesn't have this issue -- perhaps because they simply aren't indexing the pagination, but that's odd for Google.

This is really a problem of identifying duplicate content. An easy way to solve this might be to look at a signature on just the HTML from the <body> tag forward. I took two of these pages that were technically invalid -- ?p=4 and ?p=5 -- and did a wget on them. Then I diffed them, and the only difference was the single <link> tag that was invalid. Then I noticed that the <link> tag was in the <head> element. So I removed the HTML up to the <body> tag and diffed them again, and at that point they were the same page, i.e. duplicate content.

So one possible approach might be to keep a SHA hash of the content from <body> forward and compare it to see if this has already been processed -- or matches the last thing processed (much, much harder in a threaded crawler; been there; fought that battle).
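
A sketch of that idea (my assumption of how it could look, not anything cobweb does today; page_html is a placeholder for the fetched page):

    require 'digest'
    require 'set'

    # Hash only the markup from the <body> tag forward, so head-only
    # differences (like the bogus <link> tag) don't hide duplicates.
    def body_digest(html)
      body = html[/<body.*/mi] || html
      Digest::SHA256.hexdigest(body)
    end

    seen = Set.new

    # Inside the crawl loop: skip queuing links from pages already seen.
    page_html = "<html><head></head><body>...</body></html>"   # placeholder
    digest = body_digest(page_html)
    if seen.include?(digest)
      # duplicate content -- don't follow this page's links
    else
      seen.add(digest)
      # process the page and queue its links as usual
    end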

Thoughts? I can certainly hack in a fix for my own needs, but it likely means maintaining a fork indefinitely, which always sucks, and identifying duplicate content is really a core crawler issue. Let me know if you're interested in this. I really like cobweb and it's the first open source crawler I've found that really meets my needs, so I'm willing to help get this addressed if you're interested in the help.

Thank you so much.

fuzzygroup commented on June 21, 2024

Hey Stewart -- any thoughts about what I wrote up above? I haven't forked anything yet and I'd really prefer not to because I truly think that this belongs in the core.

stewartmckee commented on June 21, 2024

There isn't any issue with forking; it's the normal process for contributing to open source projects. If you fork, you can work on adding functionality, documentation or anything within the codebase. Make sure you work in a feature branch, and then when you are ready you can open a pull request against the core repository; we can then work out whether anything needs to change, and then it gets merged into core.

I'll have a quick go at it today and let you know how it goes. Otherwise, feel free to fork and make changes; as long as it's in a branch it's easy to merge back into core.

fuzzygroup commented on June 21, 2024

I think there is properly done forking -- and you're correct about that -- and there is forking where your goal is to fix something for your own needs. My concern is that this feels to me like a core issue that needs to be addressed by enhancing the core structure, and if someone unfamiliar with the code does it (me), they're likely to screw it up badly.

fuzzygroup commented on June 21, 2024

Just to let you know, Stewart, I am tackling this, and I think I have a fairly elegant approach (at least in my head; implementation is where it gets icky).

fuzzygroup commented on June 21, 2024

On my fork I've placed a wiki entry with the start of my proposed change: https://github.com/fuzzygroup/cobweb/wiki/Duplicate-Content-Detection
