🔭 Currently working on gathering texts on the Web and detecting word trends
🖩 First programs written on a TI-83 Plus in TI-BASIC
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
Home Page: https://adrien.barbaresi.eu/blog/easy-content-aware-url-filtering.html
License: Apache License 2.0
In order to facilitate the generation of download lists, add a function to the store along these lines: call get_download_urls() repeatedly until there are URLs to download.
while True:
    bufferlist = self.get_download_urls()
    if bufferlist or self.done:
        break
    sleep(sleep_time)
return bufferlist
Hi! I noticed that relative URLs within sub-URLs are resolved back to the root URL. Let's assume I have a site domain such as https://www.example.com, where within that site there is a URL that looks like this: https://www.example.com/sub_url. Within that sub-URL, there is an <a> tag which looks like this: <a href="sub_sub_url">super_SubURL</a>. Currently this URL will be resolved as https://www.example.com/sub_sub_url (take note that there isn't any forward slash in the href). As such, it should instead be resolved to https://www.example.com/sub_url/sub_sub_url.
I hacked on the code a bit and the easiest solution would be to change process_response's process_links function to use response.url instead of base_url. However, I'm not sure what else would break.
EDIT: I modified the code a bit and it will break some other URLs too. I'll have a proper look into it and probably do a PR.
In reference to the nav filter, courlan will not extract links containing a /page/ path segment. Also, I think page and tag|category should be handled separately. I need to get all blog posts on my website, which are paginated, but I don't want to get tags and categories.
Either by modifying the manifest file or by providing a full link.
RobotFileParser: is_not_crawlable(link) + can_fetch(crawler, link) → is_doable(link)
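A possible reading of this, as a minimal sketch: combine courlan's is_not_crawlable() heuristic with the standard library's RobotFileParser. The robots.txt fetching step and the user agent string shown here are assumptions, not existing courlan API.

from urllib.robotparser import RobotFileParser
from courlan import is_not_crawlable

def is_doable(link, robots_url, agent="crawler"):
    # fetch and parse the site's robots.txt (robots_url and agent are placeholders)
    parser = RobotFileParser(robots_url)
    parser.read()
    # crawlable according to both courlan's heuristic and the robots.txt rules
    return not is_not_crawlable(link) and parser.can_fetch(agent, link)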
Add __all__ to configure explicit exports; this will address warnings from code linters.
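For illustration only, this could look like the following, for instance in the package's __init__.py; the exact set of exported names is an assumption:

__all__ = ["check_url", "clean_url", "sample_urls", "validate_url"]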
Pages of the type .../index.php?abc=d are often quite similar to /, which is relevant for web crawling. Add a function to determine if a given URL is possibly the homepage of a website.
Related to #59.
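A minimal sketch of such a check, with the function name and the heuristic as placeholders rather than a proposed implementation:

from urllib.parse import urlsplit

def is_probable_homepage(url):
    # ignore the query string, cf. .../index.php?abc=d above
    path = urlsplit(url).path.rstrip("/")
    return path in ("", "/index.html", "/index.htm", "/index.php")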
The langinfo.py file contains a list of potential language and country codes. It could be replaced by including the pycountry package and loading it during init. See also the list of ISO language codes in Python.
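A hedged sketch of what loading the codes from pycountry could look like; whether lowercase two-letter codes are the format courlan needs is an assumption:

import pycountry

# two-letter language codes (not every language entry has an alpha_2 code)
LANGUAGE_CODES = {lang.alpha_2 for lang in pycountry.languages if hasattr(lang, "alpha_2")}
# two-letter country codes, lowercased to match lowercase URL path elements
COUNTRY_CODES = {country.alpha_2.lower() for country in pycountry.countries}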
RST syntax seems to be broken on GitHub until further notice.
Remove tldextract and replace it with tld to reduce the total number of package dependencies, as mentioned in adbar/trafilatura#41.
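For reference, a rough sketch of the equivalence between the two packages, assuming the registered (first-level) domain is what courlan needs:

import tldextract
from tld import get_fld

url = "https://www.example.co.uk/path"
tldextract.extract(url).registered_domain  # 'example.co.uk' with tldextract
get_fld(url)                               # 'example.co.uk' with tld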
Write functions to add persistence to the UrlStore (see the sketch below):
- .write(): write to disk
- .load(): load from file
- .add(): combine two stores (?)

When getting a 4XX HTTP response code, wait longer before sending URLs down the line.
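A minimal pickle-based sketch of the .write() and .load() helpers requested above; the urldict attribute and the serialization format are assumptions, not the store's actual interface:

import pickle

def write(self, filename):
    # dump the store's internal URL dictionary to disk
    with open(filename, "wb") as outputfh:
        pickle.dump(self.urldict, outputfh)

def load(self, filename):
    # restore a previously written store
    with open(filename, "rb") as inputfh:
        self.urldict = pickle.load(inputfh)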
The clean_url function fails when a URL contains an apostrophe. I tried to quote/encode the URL, but it wouldn't parse & clean it correctly.
The clean_url function should accept escaped/encoded URLs or better handle characters such as an apostrophe.
So far Courlan will only output links related to HTML documents. Let users define a custom list of extensions to override this behavior.
So far one has to use validate_url(url)[0] is not None.
The sampling function may not always work as it should; a working example:
>>> from courlan import sample_urls
>>> my_urls = ['https://example.org/' + str(x) for x in range(100)]
>>> my_sample = list(sample_urls(my_urls, 10))
Only support Python versions 3.6+ in the future and see if the code can be improved or cleaned along the way.
Example to search the code: https://github.com/adbar/courlan/search?l=Python&q=%22Python+3.%22
The signal module interferes with use in distributed queues: adbar/trafilatura#325
It can be made optional to allow for use without the URL dump on exit.
Example: https://usr:pwd@example.org/. urllib.parse doesn't break apart usr:pwd and example.org in netloc. Check if this is relevant and potentially add the corresponding functionality so that the extracted hostname is example.org.
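For reference, the standard library already exposes the pieces separately: urlsplit keeps the credentials inside netloc but provides the host on its own through the hostname attribute.

from urllib.parse import urlsplit

parts = urlsplit("https://usr:pwd@example.org/")
parts.netloc    # 'usr:pwd@example.org'
parts.hostname  # 'example.org'
parts.username  # 'usr'
parts.password  # 'pwd'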
e.g. considered as the number of slashes/folders.
I wish to make the license more permissive for future versions and to change it to Apache 2.0.
@feltcat You're the only other contributor at this stage; do you agree with the change?
A domain abc.com gets conflated with a subdomain abc.xyz.com, although they are two different websites.
Originally mentioned in adbar/trafilatura#291
Steps to reproduce the bug:
>>> from courlan import extract_links
>>> extract_links('<html><body><a href="https://knoema.com/o/data-engineer-india"/><a href="https://knoema.recruitee.com/"/></body></html>', base_url="https://knoema.com", external_bool=False)
{'https://knoema.com/o/data-engineer-india', 'https://knoema.recruitee.com'}