Comments (13)

Ristellise avatar Ristellise commented on September 28, 2024

@feltcat I believe @Ristellise diagnosed exactly the opposite problem but both of you are right. In any case, the way relative URLs are handled could be improved, which would also have a positive impact on Trafilatura.

To be clearer, it's an issue with both packages: courlan doesn't properly resolve full URLs, while trafilatura sends only base URLs to courlan.

from courlan.

Ristellise avatar Ristellise commented on September 28, 2024

Noticed that the improper link resolution is related to courlan as well. I'll PR there as well.

Ristellise avatar Ristellise commented on September 28, 2024

Gonna close the issue since I fixed it myself. I suspect incorrect versions of certain packages were causing the issues.

Ristellise avatar Ristellise commented on September 28, 2024

Sorry, I just checked again and it seems like a mess to untangle the whole issue, but I'm reopening it as the problem still persists.

adbar avatar adbar commented on September 28, 2024

Hi @Ristellise, feel free to write a PR for courlan, I believe the interplay of these two functions is the problem:

feltcat avatar feltcat commented on September 28, 2024

An option could be to make both functions take a full URL instead of a base URL.
Currently fix_relative_urls('example.com/foo', 'bar') and fix_relative_urls('example.com/foo', '/bar') both return 'example.com/foo/bar'. It could be changed so that the latter returns 'example.com/bar' instead, and then you could always pass a full URL to it to handle cases like this.
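To make the proposal concrete, here is a minimal sketch of the behaviour described above. The helper name resolve_sketch is hypothetical and is not courlan's code; it also assumes the page URL carries a scheme, unlike the shortened 'example.com/foo' examples.

```python
from urllib.parse import urlsplit


def resolve_sketch(page_url: str, link: str) -> str:
    """Hypothetical helper illustrating the proposed behaviour (not courlan's code)."""
    parts = urlsplit(page_url)
    if link.startswith("/"):
        # Leading slash: resolve from the site root, dropping the page's path.
        return f"{parts.scheme}://{parts.netloc}{link}"
    # No leading slash: keep the current behaviour and append to the full URL.
    return page_url.rstrip("/") + "/" + link


print(resolve_sketch("https://example.com/foo", "bar"))   # https://example.com/foo/bar
print(resolve_sketch("https://example.com/foo", "/bar"))  # https://example.com/bar
```

Note that appending to a URL without a trailing slash is what the thread describes as the current behaviour; standard URL resolution would treat 'foo' as a file and replace it.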

adbar avatar adbar commented on September 28, 2024

@feltcat I believe @Ristellise diagnosed exactly the opposite problem but both of you are right. In any case, the way relative URLs are handled could be improved, which would also have a positive impact on Trafilatura.

feltcat avatar feltcat commented on September 28, 2024

Am I understanding correctly that the base_url that gets passed to extract_links would be something like https://www.example.com, not https://www.example.com/sub_url? If that is the case, then that's why a link to sub_sub_url would resolve to https://www.example.com/sub_sub_url. If instead https://www.example.com/sub_url was passed as the base_url, then that should solve @Ristellise's issue with sub_sub_url, but then the changes to fix_relative_urls that I suggested above would need to be made as well so that it would still handle /sub_sub_url (with the starting slash) correctly.

adbar avatar adbar commented on September 28, 2024

That's correct, two changes are actually necessary:

  1. Pass the actual page URL instead of base_url to extract_links
  2. Add support for URLs of the form /sub_sub_url
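A rough illustration of why both changes matter, using urljoin from the Python standard library as a stand-in resolver. This is only a sketch: it does not call extract_links or any courlan function, and the variable names are made up for the example.

```python
from urllib.parse import urljoin

base_url = "https://www.example.com"           # bare host, as discussed in the thread
page_url = "https://www.example.com/sub_url/"  # the page the links were found on

# Resolving against the bare host loses the directory the page lives in.
from_base = urljoin(base_url, "sub_sub_url")

# Change 1: resolving against the page URL keeps the directory.
from_page = urljoin(page_url, "sub_sub_url")

# Change 2: a link with a leading slash is resolved from the site root.
root_relative = urljoin(page_url, "/sub_sub_url")

print(from_base)      # https://www.example.com/sub_sub_url
print(from_page)      # https://www.example.com/sub_url/sub_sub_url
print(root_relative)  # https://www.example.com/sub_sub_url
```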

feltcat avatar feltcat commented on September 28, 2024

It looks like fix_relative_urls could be replaced with urllib.parse.urljoin:

>>> from urllib.parse import urljoin
>>> urljoin('https://www.example.com/dir/subdir/file.html', '/absolute')
'https://www.example.com/absolute'
>>> urljoin('https://www.example.com/dir/subdir/file.html', 'relative')
'https://www.example.com/dir/subdir/relative'
>>> urljoin('https://www.example.com/dir/subdir/', 'relative')
'https://www.example.com/dir/subdir/relative'
>>> urljoin('https://www.example.com/dir/subdir', 'relative')
'https://www.example.com/dir/relative'

This is correct as long as directories have a slash at the end. @Ristellise do you know of any real-world examples where a directory wouldn't have a slash at the end, but should still behave like the example in your original post?

If I change fix_relative_urls to just return urljoin(baseurl, url), all the unit tests in test_fix_relative still pass, but I get a different failure in test_extraction, line 813.

Edit: Just realised that this doesn't account for fully absolute URLs passed to the function, but some extra code could be added to account for that.

Edit 2: Here's a version that passes all the existing tests:

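from urllib.parse import urljoin, urlparse, urlunparse

def fix_relative_urls(baseurl: str, url: str) -> str:
    "Prepend protocol and host information to relative links."
    # Links starting with '{' are returned unchanged.
    if url.startswith('{'):
        return url
    base_p = urlparse(baseurl)
    url_p = urlparse(url)
    # The link names a different host: do not join it with baseurl.
    if url_p.netloc not in [base_p.netloc, '']:
        if url_p.scheme:
            return url
        # Host but no scheme ('//host/path'): add a default scheme.
        return urlunparse(url_p._replace(scheme='http'))
    # Same host or no host: let urljoin resolve the link against baseurl.
    return urljoin(baseurl, url)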

@adbar let me know if you'd like me to submit a PR with this change and some extra tests (I have the tests ready now as well).

Ristellise avatar Ristellise commented on September 28, 2024

do you know of any real-world examples where a directory wouldn't have a slash at the end, but should still behave like the example in your original post?

I kinda don't. I've been trying to find an IRL example of a site that exhibits this kind of behaviour, but I can't find one that is popular enough. I only encountered the issue on one particular site, and I'd rather not post it as it isn't safe for all audiences.

Looked online to see some "Test sites" but I haven't seen a site that specifically mentions testing against URIs.

I might consider building a site just for that...

Re-reading my issue, I think it was also worded wrongly, I apologise. It should be https://www.example.com/sub_url/ instead. The website uses sub_sub_url/ to direct the user to another subpage, while sub_page_url directs to a page in the directory. The updated example is as follows:

https://www.example.com/sub_url/

"sub_sub_url/" -> "https://www.example.com/sub_url/sub_sub_url/"
"sub_page_url" -> "https://www.example.com/sub_url/sub_page_url"

I hope this clears up confusion!

feltcat avatar feltcat commented on September 28, 2024

Re-reading my issue, I think it was also worded wrongly, I apologise. It should be https://www.example.com/sub_url/ instead. The website uses sub_sub_url/ to direct the user to another subpage, while sub_page_url directs to a page in the directory. The updated example is as follows:

https://www.example.com/sub_url/

"sub_sub_url/" -> "https://www.example.com/sub_url/sub_sub_url/"
"sub_page_url" -> "https://www.example.com/sub_url/sub_page_url"

I hope this clears up confusion!

Ah, great. My code should fix your issue then, at least on the courlan side. There will still need to be changes on the trafilatura side to pass full URLs to the courlan functions as well though.
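As a rough check of that claim, the function posted earlier in this thread can be run against the updated example. The function body is copied from the thread; the extra inputs (other.org and the '{privacy-link}' placeholder) are made up for illustration and are not taken from courlan's test suite.

```python
from urllib.parse import urljoin, urlparse, urlunparse


def fix_relative_urls(baseurl: str, url: str) -> str:
    "Prepend protocol and host information to relative links."
    if url.startswith('{'):
        return url
    base_p = urlparse(baseurl)
    url_p = urlparse(url)
    if url_p.netloc not in [base_p.netloc, '']:
        if url_p.scheme:
            return url
        return urlunparse(url_p._replace(scheme='http'))
    return urljoin(baseurl, url)


page = 'https://www.example.com/sub_url/'

# The two links from the updated example.
print(fix_relative_urls(page, 'sub_sub_url/'))  # https://www.example.com/sub_url/sub_sub_url/
print(fix_relative_urls(page, 'sub_page_url'))  # https://www.example.com/sub_url/sub_page_url

# A link with a leading slash resolves from the site root.
print(fix_relative_urls(page, '/sub_sub_url'))  # https://www.example.com/sub_sub_url

# Made-up inputs: absolute URL, scheme-less URL with a host, '{' placeholder.
print(fix_relative_urls(page, 'https://other.org/page'))  # https://other.org/page
print(fix_relative_urls(page, '//other.org/page'))        # http://other.org/page
print(fix_relative_urls(page, '{privacy-link}'))          # {privacy-link}
```

This only exercises the courlan-side function in isolation; it says nothing about which URL trafilatura actually passes in.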

adbar avatar adbar commented on September 28, 2024

A note concerning Trafilatura: it would be best not to break things while making the transition, meaning we will have to test the changes before a release.
