
Comments (7)

stevenwaterman avatar stevenwaterman commented on May 30, 2024 1

The program should only be sending GET requests, surely? In that case there shouldn't be any side effects on a properly configured site, since it won't change state in response to GET requests. I can see how that could be an issue for misconfigured sites, though.

It'd be fine for it to be a very hidden option, it just seemed crazy that it wasn't there when it seems like a fairly fundamental part of accessing/navigating a website.

The site I wanted to use it on was my own, and authentication was enabled due to large amounts of sensitive information on the site, which was like a knowledge base. I was attempting to crawl the site to reduce the amount of duplicated information and reorganize the site to be more natural to navigate. I don't have an example to hand that you could use for testing, sorry.

from seomacroscope.

nazuke avatar nazuke commented on May 30, 2024

Many thanks for the suggestion @motherlymuppet,

So far, I have no plans to support crawling sites that require form-based logins. However, it would very likely be reasonably straightforward to add an option for this.

One thing to bear in mind here is that crawling a site behind this type of login may have unintended side effects.

For example, if there are links that perform actions like "delete this page", or similar, then SEO Macroscope will merrily follow these links too.

This is also one of the reasons why GoogleBot et al will not crawl sites as a particular user.


nazuke avatar nazuke commented on May 30, 2024

Hi @motherlymuppet, following up, I took a look at how Screaming Frog handles this situation.

They too include a dire warning about data loss when using forms-based logins.

Cookie support itself may be fine though.

Do you happen to have an example site that absolutely requires the setting of cookies in order to crawl it properly please?

many thanks


nazuke avatar nazuke commented on May 30, 2024

Thanks @motherlymuppet, that feedback helps a lot.

This is one of those cases where the real world doesn't always match the specs: some websites will have regular links with side effects that are potentially damaging to the user when clicked. Generally that's because the site always expects a human to be logged in, not a robot that will "click" everything it can reach on the page.

For example, SEO Macroscope would not know to not click this link:

<a href="/very/important/docs/delete/123">Delete this doc</a>

Under the hood, things are a little convoluted. The only HTTP methods used by the application are HEAD and GET.

In as many cases as possible, HEAD is used to probe a URL, with a subsequent GET where necessary.

You can see the rough flow that occurs for each fetched document here:

https://github.com/nazuke/SEOMacroscope/blob/master/SEOMacroscopeSeriesOne/src/MacroscopeDocument/MacroscopeDocument.cs

...in the public async Task<bool> Execute () method.

In fact, I just recently added an option to force GETs on web servers that don't service HEAD requests properly. The whole web is hack piled upon hack ;-)
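The HEAD-then-GET flow described above can be sketched roughly like this in Python (an illustrative sketch only; SEO Macroscope itself is written in C#, and the real logic lives in the Execute() method linked above): probe each URL with HEAD first, then fall back to GET when the server rejects HEAD, or skip HEAD entirely when a force-GET option is set.

```python
import urllib.error
import urllib.request


def build_probe_request(url: str, force_get: bool = False) -> urllib.request.Request:
    """Build a HEAD probe request, or a GET when force_get is set."""
    method = "GET" if force_get else "HEAD"
    return urllib.request.Request(url, method=method)


def probe(url: str, opener=urllib.request.urlopen, force_get: bool = False):
    """Try HEAD first; on 405/501 retry the same URL with a GET."""
    try:
        return opener(build_probe_request(url, force_get))
    except urllib.error.HTTPError as err:
        # Servers that don't service HEAD properly often answer
        # 405 Method Not Allowed or 501 Not Implemented.
        if err.code in (405, 501) and not force_get:
            return opener(build_probe_request(url, force_get=True))
        raise
```

The `opener` parameter is only there so the fallback logic can be exercised without touching the network.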

HTTP Basic Authentication should already work in most cases; but as I don't get as much time as I'd like to work on this, forms-based authentication hasn't yet made it onto my TODO list. Hm, I don't actually have a forms-based authentication website to test with at the moment either...

You make some great points though, and this will be something that I'll be taking a look at soon.

many thanks!


nazuke avatar nazuke commented on May 30, 2024

Hi again @motherlymuppet,

At a quick glance, it appears that cookie support itself is reasonably trivial.

So, the next detail would be the login process itself.

Does your login form use a GET like this:

https://www.company.com/login?username=bob&password=secret

or a POST to an endpoint somewhat like this:

https://www.company.com/login

with the credentials in the body?

In both cases, this type of process would normally require the login page's URL and the credentials to be entered before the crawl takes place; alternatively, a form-field pattern could be used, with the credentials prompted for during the crawl.

Either way, the login page would be requested first, in order for the resultant session cookie to be captured.
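That capture-the-session-cookie flow might look roughly like this in Python (a hypothetical sketch, not anything SEO Macroscope implements; the field names `username` and `password` are placeholders for whatever the form-field pattern matches): POST the credentials once, let a cookie jar capture the resulting session cookie, then reuse the same opener for every crawl request.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def encode_login_form(username: str, password: str) -> bytes:
    """Encode credentials as an application/x-www-form-urlencoded body."""
    return urllib.parse.urlencode(
        {"username": username, "password": password}).encode()


def make_authenticated_opener(login_url: str, username: str, password: str):
    """POST the login form once; the jar captures the Set-Cookie header
    from the response, and the opener replays it on every later request."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    opener.open(login_url, data=encode_login_form(username, password))
    return opener, jar
```

Every document fetched through the returned opener then carries the captured session cookie automatically.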

thanks!


stevenwaterman avatar stevenwaterman commented on May 30, 2024

It's a POST endpoint, but that shouldn't matter. What I had in mind was a simple text field in the options where you could paste the cookie. I don't expect SEO Macroscope to navigate to the login page for me or guide me through it or anything like that, and I'd prefer it didn't, for security reasons.

I can use the login form myself in a web browser, then take the cookie from the developer menu. All you need to do then is provide the box to put the cookie into, and attach that cookie to all outgoing requests.

That would provide complete flexibility across all login methods, and anyone trying to solve this problem is probably advanced enough to go to the developer menu and grab a cookie.

I don't mean to be patronising if this is already obvious to you, but thought I'd give an example of what I mean:

  • Go to the network tab of the developer menu
  • Navigate to a new page on GitHub
  • On the right, you'll see the Cookie: field in the request headers. If you send a request to GitHub with that cookie attached, GitHub will respond as if you're logged in as you. I'm not 100% sure which parts matter for GitHub specifically, but it's probably user_session and _gh_session.
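The suggested option boils down to very little code. A minimal Python sketch (hypothetical, since no such option exists in SEO Macroscope yet): take the cookie string pasted from the browser's developer tools and attach it verbatim to every outgoing request.

```python
import urllib.request


def with_pasted_cookie(url: str, cookie: str) -> urllib.request.Request:
    """Build a GET request carrying the user-supplied Cookie header."""
    req = urllib.request.Request(url, method="GET")
    # The pasted string is forwarded as-is; the crawler never needs to
    # know which login method produced it.
    req.add_header("Cookie", cookie)
    return req
```

Because the cookie is opaque to the crawler, this works the same whether the site logs in via GET, POST, or anything else.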


benhadad avatar benhadad commented on May 30, 2024

I own several websites that require visitors to accept the use of cookies. This agreement is the "form", but it grants the user nothing beyond access to the website. This is now a very common use case in the EU, and increasingly in the US. I've noticed that SEOMacroscope fails on these websites.

