Comments (7)
The program should only be sending GET requests, surely? In which case there shouldn't be any effects on the site if it's configured properly and not changing state based on GET requests. I can see how that would be an issue for misconfigured sites though.
It'd be fine for it to be a very hidden option, it just seemed crazy that it wasn't there when it seems like a fairly fundamental part of accessing/navigating a website.
The site I wanted to use it on was my own, and authentication was enabled due to large amounts of sensitive information on the site, which was like a knowledge base. I was attempting to crawl the site to reduce the amount of duplicated information and reorganize the site to be more natural to navigate. I don't have an example to hand that you could use for testing, sorry.
from seomacroscope.
Many thanks for the suggestion @motherlymuppet,
So far, I haven't planned to support crawling sites that require form-based logins. However, it would very likely be reasonably straightforward to add an option for this.
One thing to bear in mind here is that crawling a site behind this type of login may have unintended side-effects.
For example, if there are links that perform actions like "delete this page", or similar, then SEO Macroscope will merrily follow these links too.
This is also one of the reasons why GoogleBot et al will not crawl sites as a particular user.
from seomacroscope.
Hi @motherlymuppet, following up, I took a look at how Screaming Frog handles this situation.
They too include a dire warning about data loss when using forms-based logins.
Cookie support itself may be fine though.
Do you happen to have an example site that absolutely requires the setting of cookies in order to crawl it properly please?
many thanks
from seomacroscope.
Thanks @motherlymuppet, that feedback helps a lot.
This is one of those cases where things in the real world don't always match the specs: some websites have regular links with side-effects that are potentially damaging to the user when clicked. Generally this is because those sites always expect a human to be logged in, not a robot that will "click" everything it can reach on the page.
For example, SEO Macroscope would not know to not click this link:
<a href="/very/important/docs/delete/123">Delete this doc</a>
Under the hood, things are a little convoluted. The only HTTP methods used by the application are HEAD and GET.
In as many cases as possible, HEAD is used to probe a URL, with a subsequent GET where necessary.
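To illustrate the probing flow described above (SEO Macroscope itself is written in C#; this is just a hedged Python sketch of the idea, not the application's actual code), the crawler can try HEAD first and fall back to GET when the server rejects or mishandles HEAD:

```python
import urllib.error
import urllib.request

# Statuses suggesting the server doesn't service HEAD properly,
# so the crawler should retry the same URL with a GET.
HEAD_FALLBACK_STATUSES = (405, 501)

def should_fall_back_to_get(method: str, status: int) -> bool:
    """Decide whether a failed probe warrants a GET retry."""
    return method == "HEAD" and status in HEAD_FALLBACK_STATUSES

def probe(url: str, timeout: float = 10.0) -> int:
    """Probe a URL with HEAD first, falling back to GET when needed.

    Returns the final HTTP status code observed for the URL.
    """
    status = 0
    for method in ("HEAD", "GET"):
        req = urllib.request.Request(url, method=method)
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            status = err.code
            if not should_fall_back_to_get(method, status):
                break
    return status
```

The "force GET" option mentioned below amounts to skipping the HEAD attempt entirely for hosts known to misbehave.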
You can see the rough flow that occurs for each fetched document in the public async Task<bool> Execute () method.
In fact, I just recently added an option to force GETs on web servers that don't service HEAD requests properly. The whole web is hack piled upon hack ;-)
So far, HTTP Basic Authentication should work in most cases; but as I don't get as much time as I'd like to work on this, forms-based authentication has so far not been on my TODO list. Hm, I don't actually have a forms-based authentication website to test with at the moment either...
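For reference, HTTP Basic Authentication is simple enough to sketch in a few lines of Python (illustrative only, not the application's code): the Authorization header carries the base64 of "username:password" per RFC 7617.

```python
import base64

def basic_auth_header(username: str, password: str) -> tuple:
    """Build the HTTP Basic Authorization header (RFC 7617):
    "Basic " + base64("username:password")."""
    token = base64.b64encode(
        f"{username}:{password}".encode("utf-8")).decode("ascii")
    return ("Authorization", f"Basic {token}")
```

Attaching this header to every request is all Basic Authentication requires, which is why it is so much easier to support than a forms-based login flow.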
You make some great points though, and this will be something that I'll be taking a look at soon.
many thanks!
from seomacroscope.
Hi again @motherlymuppet,
At a quick glance, it appears that cookie support itself is reasonably trivial.
So, the next detail would be the login process itself.
Does your login form use a GET like this:
https://www.company.com/login?username=bob&password=secret
or a POST to an endpoint somewhat like this:
https://www.company.com/login
with the credentials in the body?
In either case, this type of process would normally require the login page's URL and the credentials to be entered before the crawl takes place. Alternatively, a form field pattern would be required, with the credentials being prompted for during the crawl.
Either way, the login page would be requested first, in order for the resultant session cookie to be captured.
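That login-first flow could look roughly like this in Python (an illustrative sketch only; the field names "username" and "password" are assumptions, since every login form differs):

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

def build_login_body(username: str, password: str,
                     user_field: str = "username",
                     password_field: str = "password") -> bytes:
    """Encode the credentials as a form-urlencoded POST body.
    The field names are configurable, since every login form differs."""
    return urllib.parse.urlencode(
        {user_field: username, password_field: password}).encode("utf-8")

def login(login_url: str, username: str, password: str):
    """POST the credentials to the login page first; the CookieJar
    captures the resultant session cookie, and the returned opener
    attaches it to every subsequent crawl request."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    opener.open(login_url, data=build_login_body(username, password))
    return opener, jar
```

A GET-style login (credentials in the query string) would work the same way, just with the credentials appended to the URL instead of sent in the body.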
thanks!
from seomacroscope.
It's a POST endpoint, but that shouldn't matter. What I had in mind was a simple text field in the options where you could paste the cookie. I don't expect SEO Macroscope to navigate me to the login page or guide me through it or anything like that, and I'd prefer it didn't, for security reasons.
I can use the login form myself in a web browser, then take the cookie from the developer menu. All you need to do then is provide the box to put the cookie into, and attach that cookie to all outgoing requests.
That would provide complete flexibility across all login methods, and anyone trying to solve this problem is probably advanced enough to go to the developer menu and grab a cookie.
I don't mean to be patronising if this is already obvious to you, but thought I'd give an example of what I mean:
- Go to the network tab of the developer menu
- Navigate to a new page on github
- On the right, in the request headers, you'll see the Cookie: field. If you send a request to github with that cookie attached, github will respond as if you're logged in as you. I'm not 100% sure which bits are the important ones for github specifically, but it's probably user_session and _gh_session.
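The paste-a-cookie approach described above is about as small as a feature gets; a hedged Python sketch of the idea (not the application's actual C# code):

```python
import urllib.request

def make_cookie_opener(raw_cookie: str) -> urllib.request.OpenerDirector:
    """Build an opener that sends a user-pasted Cookie header
    (copied from the browser's developer tools) with every request."""
    opener = urllib.request.build_opener()
    # addheaders are attached to every request made through this opener.
    opener.addheaders.append(("Cookie", raw_cookie.strip()))
    return opener
```

Once the crawler routes all of its requests through such an opener, the site sees the crawl as the already-logged-in user, regardless of how the login itself was performed.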
from seomacroscope.
I have several websites that I own that require the user to accept the use of cookies; this agreement is the "form", but it grants the user nothing beyond access to the website. This is now a very common use case in the EU, and now in the US too. I've just noticed that SEO Macroscope fails on these websites.
from seomacroscope.