🔭 Currently working on gathering texts on the Web and detecting word trends
🖩 First programs written on a TI-83 Plus in TI-BASIC
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
Home Page: https://adrien.barbaresi.eu/blog/easy-content-aware-url-filtering.html
License: Apache License 2.0
In order to facilitate the generation of download lists, add a function to the store along these lines: call get_download_urls() repeatedly until there are URLs to download.
while True:
    bufferlist = self.get_download_urls()
    if bufferlist or self.done:
        break
    sleep(sleep_time)
return bufferlist
Hi! I noticed that relative URLs within sub-URLs are resolved back to the root URL. Let's assume I have a site domain such as https://www.example.com, where within that site there is a URL that looks like this: https://www.example.com/sub_url. Within that sub-URL, there is an <a> tag which looks like this: <a href="sub_sub_url">super_SubURL</a>. Currently this URL will be resolved as https://www.example.com/sub_sub_url (take note that there isn't any forward slash in the href). As such, it should instead be resolved to https://www.example.com/sub_url/sub_sub_url.
I hacked on the code a bit and the easiest solution would be to change process_response's process_links function to use response.url instead of base_url. However, I'm not sure what else would break.
EDIT: I modified the code a bit and it will break some other URLs too. I'll have a proper look into it and probably do a PR.
In reference to the nav filter, courlan will not extract links containing a /page/ path segment. Also, I think page and tag|category should be handled separately. I need to get all blog posts on my website, which are paginated, but I don't want to get tags and categories.
Either by modifying the manifest file or by providing a full link.
RobotFileParser: is_not_crawlable(link) + can_fetch(crawler, link) → is_doable(link)
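A possible reading of this, as a minimal sketch: combine courlan's is_not_crawlable() heuristic with the standard library's RobotFileParser. The robots.txt fetching step and the user agent string shown here are assumptions, not existing courlan API.

from urllib.robotparser import RobotFileParser
from courlan import is_not_crawlable

def is_doable(link, robots_url, agent="crawler"):
    # fetch and parse the site's robots.txt (robots_url and agent are placeholders)
    parser = RobotFileParser(robots_url)
    parser.read()
    # crawlable according to both courlan's heuristic and the robots.txt rules
    return not is_not_crawlable(link) and parser.can_fetch(agent, link)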
Add __all__ to configure explicit exports; this will address warnings from code linters.
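For illustration only, this could look like the following, for instance in the package's __init__.py; the exact set of exported names is an assumption:

__all__ = ["check_url", "clean_url", "sample_urls", "validate_url"]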
Pages of the type .../index.php?abc=d are often quite similar to /, which is relevant for web crawling. Add a function to determine if a given URL is possibly the homepage of a website.
Related to #59.
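A minimal sketch of such a check, with the function name and the heuristic as placeholders rather than a proposed implementation:

from urllib.parse import urlsplit

def is_probable_homepage(url):
    # ignore the query string, cf. .../index.php?abc=d above
    path = urlsplit(url).path.rstrip("/")
    return path in ("", "/index.html", "/index.htm", "/index.php")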
The langinfo.py file contains a list of potential language and country codes. It could be replaced by including the pycountry package and loading it during init. See also the list of ISO language codes in Python.
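A hedged sketch of what loading the codes from pycountry could look like; whether lowercase two-letter codes are the format courlan needs is an assumption:

import pycountry

# two-letter language codes (not every language entry has an alpha_2 code)
LANGUAGE_CODES = {lang.alpha_2 for lang in pycountry.languages if hasattr(lang, "alpha_2")}
# two-letter country codes, lowercased to match lowercase URL path elements
COUNTRY_CODES = {country.alpha_2.lower() for country in pycountry.countries}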
RST syntax seems to be broken on GitHub until further notice.
Remove tldextract and replace it with tld to reduce the total number of package dependencies, as mentioned in adbar/trafilatura#41.
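For reference, a rough sketch of the equivalence between the two packages, assuming the registered (first-level) domain is what courlan needs:

import tldextract
from tld import get_fld

url = "https://www.example.co.uk/path"
tldextract.extract(url).registered_domain  # 'example.co.uk' with tldextract
get_fld(url)                               # 'example.co.uk' with tld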
Write functions to add persistence to the UrlStore (see the sketch below):
- .write(): write to disk
- .load(): load from file
- .add(): combine two stores (?)

When getting a 4XX HTTP response code, wait longer before sending URLs down the line.
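A minimal pickle-based sketch of the .write() and .load() helpers requested above; the urldict attribute and the serialization format are assumptions, not the store's actual interface:

import pickle

def write(self, filename):
    # dump the store's internal URL dictionary to disk
    with open(filename, "wb") as outputfh:
        pickle.dump(self.urldict, outputfh)

def load(self, filename):
    # restore a previously written store
    with open(filename, "rb") as inputfh:
        self.urldict = pickle.load(inputfh)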
The clean_url function fails when a URL contains an apostrophe. I tried to quote/encode the URL, but it wouldn't parse & clean it correctly.
The clean_url function should accept escaped/encoded URLs or better handle characters such as an apostrophe.
So far Courlan will only output links related to HTML documents. Let users define a custom list of extensions to override this behavior.
So far one has to use validate_url(url)[0] is not None.
The sampling function may not always work as it should; a working example:
>>> from courlan import sample_urls
>>> my_urls = ['https://example.org/' + str(x) for x in range(100)]
>>> my_sample = list(sample_urls(my_urls, 10))
Only support Python versions 3.6+ in the future and see if the code can be improved or cleaned along the way.
Example to search the code: https://github.com/adbar/courlan/search?l=Python&q=%22Python+3.%22
The signal module interferes with use in distributed queues: adbar/trafilatura#325
It can be made optional to allow for use without the URL dump on exit.
Example: https://usr:pwd@example.org/. urllib.parse doesn't break apart usr:pwd and example.org in netloc. Check if this is relevant and potentially add the corresponding functionality so that the extracted hostname is example.org.
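For reference, the standard library already exposes the pieces separately: urlsplit keeps the credentials inside netloc but provides the host on its own through the hostname attribute.

from urllib.parse import urlsplit

parts = urlsplit("https://usr:pwd@example.org/")
parts.netloc    # 'usr:pwd@example.org'
parts.hostname  # 'example.org'
parts.username  # 'usr'
parts.password  # 'pwd'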
e.g. considered as the number of slashes/folders.
I wish to make the license more permissive for future versions and to change it to Apache 2.0.
@feltcat You're the only other contributor at this stage; do you agree with the change?
A domain abc.com gets conflated with a subdomain abc.xyz.com, although they are two different websites.
Originally mentioned in adbar/trafilatura#291
Steps to reproduce the bug:
>>> from courlan import extract_links
>>> extract_links('<html><body><a href="https://knoema.com/o/data-engineer-india"/><a href="https://knoema.recruitee.com/"/></body></html>', base_url="https://knoema.com", external_bool=False)
{'https://knoema.com/o/data-engineer-india', 'https://knoema.recruitee.com'}