Hi! I noticed that for relative URLs within sub URLs, they are resolved back to the ro

process_response does not properly resolve urls about courlan HOT 13 CLOSED

adbar commented on September 28, 2024

process_response does not properly resolve urls

from courlan.

Comments (13)

Ristellise commented on September 28, 2024

@feltcat I believe @Ristellise diagnosed exactly the opposite problem but both of you are right. In any case, the way relative URLs are handled could be improved, which would also have a positive impact on Trafilatura.

To be clearer, it's an issue with both packages. courlan doesn't properly resolve full URLs, while trafilatura sends only base URLs to courlan.


Ristellise commented on September 28, 2024

Noticed that the improper link resolution is related to courlan as well. I'll PR there as well.


Ristellise commented on September 28, 2024

Gonna close the issue since I fixed it myself. I suspect incorrect versions of certain packages were causing the issues.


Ristellise commented on September 28, 2024

Sorry, I just checked again and it seems like it's a mess to untangle the whole issue, but I'm reopening the issue as it still persists.


adbar commented on September 28, 2024

Hi @Ristellise, feel free to write a PR for courlan. I believe the interplay of these two functions is the problem:


feltcat commented on September 28, 2024

An option could be to make both functions take a full URL instead of a base URL.
Currently fix_relative_urls('example.com/foo', 'bar') and fix_relative_urls('example.com/foo', '/bar') both return 'example.com/foo/bar'. It could be changed so that the latter returns 'example.com/bar' instead, and then you could always pass a full URL to it to handle cases like this.
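
For illustration, here is a minimal hand-written sketch of the rule proposed above. `resolve_link` is a hypothetical name, not part of courlan's API. The sketch assumes the page URL carries a scheme, and it deliberately ignores fully absolute links and protocol-relative links (`//host/path`).

```python
from urllib.parse import urlsplit


def resolve_link(page_url: str, link: str) -> str:
    """Hypothetical helper sketching the proposed rule (not courlan's API)."""
    parts = urlsplit(page_url)
    origin = f"{parts.scheme}://{parts.netloc}"
    if link.startswith("/"):
        # Root-relative link: discard the path of the page URL.
        return origin + link
    # Path-relative link: append to the full page URL.
    return page_url.rstrip("/") + "/" + link


print(resolve_link("https://example.com/foo", "bar"))   # https://example.com/foo/bar
print(resolve_link("https://example.com/foo", "/bar"))  # https://example.com/bar
```

With a rule like this, the caller can always pass the full page URL, and the leading slash alone decides whether the path of that URL is kept.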


adbar commented on September 28, 2024

@feltcat I believe @Ristellise diagnosed exactly the opposite problem but both of you are right. In any case, the way relative URLs are handled could be improved, which would also have a positive impact on Trafilatura.


feltcat commented on September 28, 2024

Am I understanding correctly that the base_url that gets passed to extract_links would be something like https://www.example.com, not https://www.example.com/sub_url? If that is the case, then that's why a link to sub_sub_url would resolve to https://www.example.com/sub_sub_url. If instead https://www.example.com/sub_url was passed as the base_url, then that should solve @Ristellise's issue with sub_sub_url, but then the changes to fix_relative_urls that I suggested above would need to be made as well so that it would still handle /sub_sub_url (with the starting slash) correctly.


adbar commented on September 28, 2024

That's correct, two changes are actually necessary:

Pass the actual page URL instead of base_url to extract_links
Add support for URLs of the form /sub_sub_url
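
To show why the first change matters, here is a small sketch under stated assumptions. The URLs are the placeholders used in this thread, the base is derived with `urlsplit` purely for illustration (how Trafilatura actually computes `base_url` is not shown here), and the string joins are naive stand-ins for the real resolution step.

```python
from urllib.parse import urlsplit

page_url = "https://www.example.com/sub_url/"
parts = urlsplit(page_url)

# Reducing the page URL to scheme and host drops the directory part.
base_url = f"{parts.scheme}://{parts.netloc}"
print(base_url)    # https://www.example.com
print(parts.path)  # /sub_url/

link = "sub_sub_url"

# Naive joins, for illustration only.
from_base = base_url + "/" + link
from_page = page_url + link
print(from_base)  # https://www.example.com/sub_sub_url
print(from_page)  # https://www.example.com/sub_url/sub_sub_url
```

Once only the base is passed along, nothing downstream can recover `/sub_url/`, which is why the page URL has to reach the link-resolution step.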


feltcat commented on September 28, 2024

It looks like fix_relative_urls could be replaced with urllib.parse.urljoin:

>>> from urllib.parse import urljoin
>>> urljoin('https://www.example.com/dir/subdir/file.html', '/absolute')
'https://www.example.com/absolute'
>>> urljoin('https://www.example.com/dir/subdir/file.html', 'relative')
'https://www.example.com/dir/subdir/relative'
>>> urljoin('https://www.example.com/dir/subdir/', 'relative')
'https://www.example.com/dir/subdir/relative'
>>> urljoin('https://www.example.com/dir/subdir', 'relative')
'https://www.example.com/dir/relative'

This is correct as long as directories have a slash at the end. @Ristellise do you know of any real-world examples where a directory wouldn't have a slash at the end, but should still behave like the example in your original post?

If I change fix_relative_urls to just return urljoin(baseurl, url), all the unit tests in test_fix_relative still pass, but I get a different failure in test_extraction, line 813.

Edit: Just realised that this doesn't account for fully absolute URLs passed to the function, but some extra code could be added to account for that.

Edit 2: Here's a version that passes all the existing tests:

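from urllib.parse import urljoin, urlparse, urlunparse

def fix_relative_urls(baseurl: str, url: str) -> str:
    "Prepend protocol and host information to relative links."
    # Strings starting with "{" are returned unchanged.
    if url.startswith('{'):
        return url
    base_p = urlparse(baseurl)
    url_p = urlparse(url)
    # The link names a different host than the base URL.
    if url_p.netloc not in [base_p.netloc, '']:
        # Fully absolute link: keep it as is.
        if url_p.scheme:
            return url
        # Protocol-relative link (//host/path): default to http.
        return urlunparse(url_p._replace(scheme='http'))
    # Same host or no host: resolve against the base URL.
    return urljoin(baseurl, url)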

@adbar let me know if you'd like me to submit a PR with this change and some extra tests (I have the tests ready now as well).
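
For readers who want to try it, here is a sketch of how the version above behaves on a few inputs. The function body is copied from the comment; the sample URLs (`other.org`, `{placeholder}`, and so on) are made up for illustration and are not taken from the courlan test suite.

```python
from urllib.parse import urljoin, urlparse, urlunparse


def fix_relative_urls(baseurl: str, url: str) -> str:
    "Prepend protocol and host information to relative links."
    if url.startswith('{'):
        return url
    base_p = urlparse(baseurl)
    url_p = urlparse(url)
    if url_p.netloc not in [base_p.netloc, '']:
        if url_p.scheme:
            return url
        return urlunparse(url_p._replace(scheme='http'))
    return urljoin(baseurl, url)


page = 'https://www.example.com/dir/page.html'

# Path-relative link: resolved against the directory of the page.
print(fix_relative_urls(page, 'relative'))   # https://www.example.com/dir/relative
# Root-relative link: resolved against the host.
print(fix_relative_urls(page, '/absolute'))  # https://www.example.com/absolute
# Absolute link to another host: returned unchanged.
print(fix_relative_urls(page, 'https://other.org/page'))  # https://other.org/page
# Protocol-relative link to another host: gets an http scheme.
print(fix_relative_urls(page, '//other.org/page'))  # http://other.org/page
# String starting with "{": returned unchanged.
print(fix_relative_urls(page, '{placeholder}'))  # {placeholder}
```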


Ristellise commented on September 28, 2024

do you know of any real-world examples where a directory wouldn't have a slash at the end, but should still behave like the example in your original post?

I kinda don't. I've been trying to find an IRL example of a site that exhibits this kind of behaviour, but I can't find one that is popular enough. I only encountered the issue on one particular site, and I'd rather not post it as it isn't safe for all audiences.

Looked online to see some "Test sites" but I haven't seen a site that specifically mentions testing against URIs.

I might consider building a site just for that...

Re-reading my issue, I think it was also worded wrongly. I apologise. It should be https://www.example.com/sub_url/ instead. The website uses sub_sub_url/ to redirect the user to another subpage and sub_page_url to direct to a page in the directory. The updated example is as follows:

https://www.example.com/sub_url/

"sub_sub_url/" -> "https://www.example.com/sub_url/sub_sub_url/"
"sub_page_url" -> "https://www.example.com/sub_url/sub_page_url"

I hope this clears up confusion!


feltcat commented on September 28, 2024

Re-reading my issue, I think it was also worded wrongly. I apologise. It should be https://www.example.com/sub_url/ instead. The website uses sub_sub_url/ to redirect the user to another subpage and sub_page_url to direct to a page in the directory. The updated example is as follows:
https://www.example.com/sub_url/

"sub_sub_url/" -> "https://www.example.com/sub_url/sub_sub_url/"
"sub_page_url" -> "https://www.example.com/sub_url/sub_page_url"
I hope this clears up confusion!

Ah, great. My code should fix your issue then, at least on the courlan side. There will still need to be changes on the trafilatura side to pass full URLs to the courlan functions as well though.
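
As a quick check of the updated example, the sketch below calls `urljoin` directly, since the function posted earlier falls through to it for relative links on the same host. It assumes the full page URL, trailing slash included, is what gets passed in.

```python
from urllib.parse import urljoin

page_url = "https://www.example.com/sub_url/"

print(urljoin(page_url, "sub_sub_url/"))
# https://www.example.com/sub_url/sub_sub_url/
print(urljoin(page_url, "sub_page_url"))
# https://www.example.com/sub_url/sub_page_url

# Without the trailing slash, "sub_url" is treated as a file name and replaced.
print(urljoin("https://www.example.com/sub_url", "sub_page_url"))
# https://www.example.com/sub_page_url
```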


adbar commented on September 28, 2024

A note concerning Trafilatura: it would be best not to break things while making the transition, meaning we will have to test the changes before a release.
