Comments (13)
To be clearer, it's an issue with both packages: courlan doesn't properly resolve full URLs, while trafilatura sends only base URLs to courlan.
from courlan.
Noticed that the improper link resolution is related to courlan as well. I'll PR there as well.
from courlan.
Gonna close the issue since I fixed it myself. I suspect incorrect versions of certain packages were causing the issues.
from courlan.
Sorry, I just checked again and it seems like a mess to untangle the whole issue, but I'm reopening it as the problem still persists.
from courlan.
Hi @Ristellise, feel free to write a PR for courlan. I believe the interplay of these two functions is the problem:
- https://github.com/adbar/courlan/blob/master/courlan/core.py#L121
- https://github.com/adbar/courlan/blob/master/courlan/urlutils.py#L115
from courlan.
An option could be to make both functions take a full URL instead of a base URL.
Currently fix_relative_urls('example.com/foo', 'bar') and fix_relative_urls('example.com/foo', '/bar') both return 'example.com/foo/bar'. It could be changed so that the latter returns 'example.com/bar' instead, and then you could always pass a full URL to it to handle cases like this.
from courlan.
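The behaviour proposed above can be sketched in a few lines. This is only an illustration on scheme-less strings like the ones in the example; `proposed_fix_relative_urls` is a hypothetical name and not courlan's actual implementation.

```python
def proposed_fix_relative_urls(baseurl: str, url: str) -> str:
    """Hypothetical sketch of the proposed behaviour (not courlan's code)."""
    # Everything before the first slash is taken to be the host.
    host = baseurl.split("/", 1)[0]
    if url.startswith("/"):
        # Root-relative link: resolve against the host only.
        return host + url
    # Plain relative link: append to the full base.
    return baseurl.rstrip("/") + "/" + url


print(proposed_fix_relative_urls("example.com/foo", "bar"))   # example.com/foo/bar
print(proposed_fix_relative_urls("example.com/foo", "/bar"))  # example.com/bar
```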
@feltcat I believe @Ristellise diagnosed exactly the opposite problem but both of you are right. In any case, the way relative URLs are handled could be improved, which would also have a positive impact on Trafilatura.
from courlan.
Am I understanding correctly that the base_url that gets passed to extract_links would be something like https://www.example.com, not https://www.example.com/sub_url? If that is the case, then that's why a link to sub_sub_url would resolve to https://www.example.com/sub_sub_url. If instead https://www.example.com/sub_url was passed as the base_url, then that should solve @Ristellise's issue with sub_sub_url, but then the changes to fix_relative_urls that I suggested above would need to be made as well so that it would still handle /sub_sub_url (with the starting slash) correctly.
from courlan.
That's correct, two changes are actually necessary:
- Pass the actual page URL instead of base_url to extract_links
- Add support for URLs of the form /sub_sub_url
from courlan.
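As a rough sketch of what these two changes amount to, the snippet below resolves every link against the full page URL rather than the bare host. It uses only the standard library; `LinkCollector` and `collect_links` are hypothetical names for illustration and do not reflect how extract_links is actually implemented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values and resolve them against the full page URL."""

    def __init__(self, page_url: str) -> None:
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin handles both "sub_sub_url" and "/sub_sub_url".
                self.links.append(urljoin(self.page_url, value))


def collect_links(html: str, page_url: str) -> list:
    """Return all anchor targets in html as absolute URLs."""
    parser = LinkCollector(page_url)
    parser.feed(html)
    return parser.links


html = '<a href="sub_sub_url">a</a> <a href="/sub_sub_url">b</a>'
print(collect_links(html, "https://www.example.com/sub_url/"))
```

With the page URL as the base, the first link stays under /sub_url/ while the second one (leading slash) resolves against the site root.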
It looks like fix_relative_urls could be replaced with urllib.parse.urljoin:
>>> from urllib.parse import urljoin
>>> urljoin('https://www.example.com/dir/subdir/file.html', '/absolute')
'https://www.example.com/absolute'
>>> urljoin('https://www.example.com/dir/subdir/file.html', 'relative')
'https://www.example.com/dir/subdir/relative'
>>> urljoin('https://www.example.com/dir/subdir/', 'relative')
'https://www.example.com/dir/subdir/relative'
>>> urljoin('https://www.example.com/dir/subdir', 'relative')
'https://www.example.com/dir/relative'
This is correct as long as directories have a slash at the end. @Ristellise do you know of any real-world examples where a directory wouldn't have a slash at the end, but should still behave like the example in your original post?
If I change fix_relative_urls to just return urljoin(baseurl, url), all the unit tests in test_fix_relative still pass, but I get a different failure in test_extraction, line 813.
Edit: Just realised that this doesn't account for fully absolute URLs passed to the function, but some extra code could be added to account for that.
Edit 2: Here's a version that passes all the existing tests:
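from urllib.parse import urljoin, urlparse, urlunparse

def fix_relative_urls(baseurl: str, url: str) -> str:
    "Prepend protocol and host information to relative links."
    # Links starting with '{' are returned unchanged.
    if url.startswith('{'):
        return url
    base_p = urlparse(baseurl)
    url_p = urlparse(url)
    # Link points to a different host: keep it as-is if it has a scheme,
    # otherwise (e.g. '//host/path') default to http.
    if url_p.netloc not in [base_p.netloc, '']:
        if url_p.scheme:
            return url
        return urlunparse(url_p._replace(scheme='http'))
    # Same host or no host: resolve against the base URL.
    return urljoin(baseurl, url)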
@adbar let me know if you'd like me to submit a PR with this change and some extra tests (I have the tests ready now as well).
from courlan.
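A few checks of the kind such extra tests might contain, as a sketch only. The function is repeated from the comment above so the block runs on its own; the example URLs are made up for illustration and are not taken from courlan's test suite.

```python
from urllib.parse import urljoin, urlparse, urlunparse


def fix_relative_urls(baseurl: str, url: str) -> str:
    "Prepend protocol and host information to relative links."
    if url.startswith('{'):
        return url
    base_p = urlparse(baseurl)
    url_p = urlparse(url)
    if url_p.netloc not in [base_p.netloc, '']:
        if url_p.scheme:
            return url
        return urlunparse(url_p._replace(scheme='http'))
    return urljoin(baseurl, url)


base = 'https://www.example.com/sub_url/'

# Relative link: resolved inside the current directory.
assert fix_relative_urls(base, 'sub_sub_url/') == 'https://www.example.com/sub_url/sub_sub_url/'
# Root-relative link: resolved against the host.
assert fix_relative_urls(base, '/sub_sub_url') == 'https://www.example.com/sub_sub_url'
# Absolute link to another host: returned unchanged.
assert fix_relative_urls(base, 'https://other.org/page') == 'https://other.org/page'
# Protocol-relative link to another host: gets an http scheme.
assert fix_relative_urls(base, '//other.org/page') == 'http://other.org/page'
# Links starting with '{' are passed through.
assert fix_relative_urls(base, '{privacy}') == '{privacy}'
```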
do you know of any real-world examples where a directory wouldn't have a slash at the end, but should still behave like the example in your original post?
I kinda don't. I've been trying to find an IRL example of a site that exhibits this kind of behaviour, but I can't find one that is popular enough. I only encountered the issue on one particular site and I'd rather not post it as it isn't safe for all audiences.
Looked online to see some "Test sites" but I haven't seen a site that specifically mentions testing against URIs.
I might consider building a site just for that...
Re-reading my issue, I think it was also worded wrongly. I apologise. It should be https://www.example.com/sub_url/ instead. The website uses sub_sub_url/ to redirect the user to another subpage, while sub_page_url directs to a page in the directory. The updated example is as follows:
https://www.example.com/sub_url/
"sub_sub_url/" -> "https://www.example.com/sub_url/sub_sub_url/"
"sub_page_url" -> "https://www.example.com/sub_url/sub_page_url"
I hope this clears up confusion!
from courlan.
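For reference, the standard library's urljoin already produces this mapping when the page URL ends with a slash. A minimal check, assuming nothing beyond the URLs given in the updated example:

```python
from urllib.parse import urljoin

page_url = "https://www.example.com/sub_url/"

# Both links are relative, so they resolve inside /sub_url/.
for link in ("sub_sub_url/", "sub_page_url"):
    print(repr(link), "->", urljoin(page_url, link))
```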
Ah, great. My code should fix your issue then, at least on the courlan side. There will still need to be changes on the trafilatura side to pass full URLs to the courlan functions as well though.
from courlan.
A note concerning Trafilatura: it would be best not to break things while making the transition, meaning we will have to test the changes before a release.
from courlan.