adbar / courlan

Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters

Home Page: https://adrien.barbaresi.eu/blog/easy-content-aware-url-filtering.html

License: Apache License 2.0

Python 100.00%
url url-parsing crawler tld uri url-validation url-parser recon crawling

courlan's Introduction

Hi there! 👋

Activity

🔭  Currently working on gathering texts on the Web and detecting word trends

Programming experience

🖩  First programs written on a TI-83 Plus in TI-BASIC

courlan's People

Contributors

adbar, feltcat, naz-theori, sourcery-ai[bot]

courlan's Issues

UrlStore: add blocking convenience function around get_download_urls()

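To facilitate the generation of download lists, add a function to the store that repeatedly calls get_download_urls() until there are URLs to download, along these lines:

from time import sleep

# Proposed UrlStore method; the name and the default sleep_time are placeholders.
def get_download_urls_blocking(self, sleep_time=1.0):
    while True:
        bufferlist = self.get_download_urls()
        # Stop as soon as URLs are available or the store is exhausted.
        if bufferlist or self.done:
            break
        sleep(sleep_time)
    return bufferlist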

process_response does not properly resolve urls

Hi! I noticed that relative URLs found on sub-pages are resolved against the root URL.

Let's assume I have a site at https://www.example.com,
and within that site there is a page at https://www.example.com/sub_url.

On that page there is an <a> tag that looks like this: <a href="super_sub_url">super_SubURL</a>. Currently this URL is resolved as https://www.example.com/super_sub_url. (Note that there is no leading forward slash in the href.) It should instead be resolved to https://www.example.com/sub_url/super_sub_url.

I hacked on the code a bit, and the easiest solution would be to change process_response's process_links function to use response.url instead of base_url. However, I'm not sure what else would break.

EDIT: Modified the code a bit and it will break some other URLs too. I'll properly have a look into it and probably do a PR.
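
For reference, standard relative-reference resolution (RFC 3986, as implemented by Python's urllib.parse.urljoin) depends on the URL of the page and on whether that URL ends with a slash. The following is a minimal sketch using the example URLs from this report. It is not courlan's own code, and it only illustrates what resolving against the page URL instead of the site root would yield.

```python
from urllib.parse import urljoin

href = "super_sub_url"  # relative link without a leading slash

# Resolved against the site root.
from_root = urljoin("https://www.example.com", href)

# Resolved against the page URL without a trailing slash:
# the last path segment ("sub_url") is replaced.
from_page = urljoin("https://www.example.com/sub_url", href)

# Resolved against the page URL with a trailing slash:
# the link is appended below "sub_url/".
from_dir = urljoin("https://www.example.com/sub_url/", href)

print(from_root)  # https://www.example.com/super_sub_url
print(from_page)  # https://www.example.com/super_sub_url
print(from_dir)   # https://www.example.com/sub_url/super_sub_url
```

Under these rules, the nested result is only produced when the page URL ends with a slash, which may explain why switching to response.url changes the outcome for some other URLs as well.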

Courlan does not load `/page/` links

Regarding the navigation filter: courlan will not extract links containing a /page/ path segment. Also, I think page and tag|category should be handled separately. I need to get all blog posts on my website, which are paginated, but I don't want to get tags and categories.
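
One way to picture the requested separation is to classify pagination and tag/category links with two distinct patterns. This is a rough sketch only: the regular expressions and the function name classify_nav_url are hypothetical and do not reflect how courlan's filter is implemented.

```python
import re
from urllib.parse import urlsplit

# Hypothetical patterns, chosen for illustration only.
PAGINATION = re.compile(r"/page/\d+/?$", re.IGNORECASE)
TAXONOMY = re.compile(r"/(?:tags?|categor(?:y|ies))/", re.IGNORECASE)


def classify_nav_url(url: str) -> str:
    """Label a URL as 'taxonomy', 'pagination' or 'other' based on its path."""
    path = urlsplit(url).path
    if TAXONOMY.search(path):
        return "taxonomy"
    if PAGINATION.search(path):
        return "pagination"
    return "other"


print(classify_nav_url("https://example.org/blog/page/2/"))
print(classify_nav_url("https://example.org/tag/python/"))
print(classify_nav_url("https://example.org/blog/my-post"))
```

With such a split, a crawl could keep the "pagination" links while discarding the "taxonomy" ones.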

Add `is_homepage()` heuristic

Pages of the type .../index.php?abc=d are often quite similar to / which is relevant for web crawling.

Add a function to determine if a given URL is possibly the homepage of a website.
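
A possible starting point, sketched under the assumption that only the URL path matters and that the query string is ignored. The function name looks_like_homepage and the pattern are hypothetical and are not part of courlan's API.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern: an empty path, "/", or a typical index file or "home" segment.
HOMEPAGE_PATH = re.compile(
    r"^/?(?:(?:index|default|home)(?:\.[a-z]{3,5})?)?/?$", re.IGNORECASE
)


def looks_like_homepage(url: str) -> bool:
    """Return True if the URL path suggests the homepage of a website."""
    path = urlsplit(url).path  # the query string (e.g. ?abc=d) is ignored
    return bool(HOMEPAGE_PATH.match(path))


print(looks_like_homepage("https://example.org/index.php?abc=d"))
print(looks_like_homepage("https://example.org/blog/post-1"))
```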

Bug: clean_url fails on apostrophe in urls

The clean_url function fails when a URL contains an apostrophe. I tried to quote/encode the URL, but it wouldn't parse & clean it correctly.

The clean_url function should accept escaped/encoded URLs or better handle characters such as an apostrophe.

Investigate sampling issue

The sampling function may not always work as expected. A working example to start from:

>>> from courlan import sample_urls
>>> my_urls = ['https://example.org/' + str(x) for x in range(100)]
>>> my_sample = list(sample_urls(my_urls, 10))

Add support for username in netloc?

Example: https://usr:pwd@example.org/.

urllib.parse doesn't break apart usr:pwd and example.org in netloc.

Check if this is relevant and potentially add the corresponding functionality so that the extracted hostname is example.org.
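
As a point of comparison, the result object returned by urllib.parse.urlsplit keeps the credentials in netloc but exposes the individual parts as attributes. A minimal sketch, assuming a URL of the form described above:

```python
from urllib.parse import urlsplit

parts = urlsplit("https://usr:pwd@example.org/")

print(parts.netloc)    # usr:pwd@example.org
print(parts.hostname)  # example.org
print(parts.username)  # usr
print(parts.password)  # pwd
```

Whether courlan should rely on the hostname attribute or strip the user information itself is left open here.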

Offer IRI to URI conversion

Normalization of international links (IRIs) should go further towards valid URIs:

  1. NFC conversion
  2. "All non-ASCII code points in the IRI should next be encoded as UTF-8, and the resulting bytes percent-encoded, to produce a valid URI."

For example with Python's urlencode.
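
The two steps above can be sketched with the standard library. This sketch uses urllib.parse.quote rather than urlencode, since quote percent-encodes a whole string while urlencode builds query strings. The function name iri_to_uri and the set of characters left untouched are assumptions made for illustration. Non-ASCII host names are percent-encoded as well here, whereas a fuller implementation would probably convert them with IDNA.

```python
import unicodedata
from urllib.parse import quote

# ASCII characters left as-is: URI delimiters plus "%" so that
# existing percent-escapes are not encoded a second time.
SAFE_ASCII = "!#$%&'()*+,/:;=?@[]~"


def iri_to_uri(iri: str) -> str:
    """Apply NFC normalization, then percent-encode the UTF-8 bytes of unsafe characters."""
    normalized = unicodedata.normalize("NFC", iri)
    return quote(normalized, safe=SAFE_ASCII)


# Precomposed and decomposed forms of "é" give the same URI after NFC.
print(iri_to_uri("https://example.org/caf\u00e9"))   # https://example.org/caf%C3%A9
print(iri_to_uri("https://example.org/cafe\u0301"))  # https://example.org/caf%C3%A9
```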

Change license to Apache 2.0

I wish to make the license more permissive for future versions and to change it to Apache 2.0.

@feltcat You're the only other contributor at this stage, do you agree with the change?

Domain/subdomain confusion in link extraction

A domain abc.com gets conflated with a subdomain abc.xyz.com although they are two different websites.

Originally mentioned in adbar/trafilatura#291

Steps to reproduce the bug:

>>> from courlan import extract_links
>>> extract_links('<html><body><a href="https://knoema.com/o/data-engineer-india"/><a href="https://knoema.recruitee.com/"/></body></html>', base_url="https://knoema.com", external_bool=False)
{'https://knoema.com/o/data-engineer-india', 'https://knoema.recruitee.com'}
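
The expected distinction can be illustrated by comparing host names instead of matching on a shared substring. This is a naive sketch: the function is_internal is hypothetical, it treats subdomains of the base host as internal (an assumption), and it ignores public suffix handling.

```python
from urllib.parse import urlsplit


def is_internal(url: str, base_url: str) -> bool:
    """Return True if url is on the same host as base_url or on one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    base_host = (urlsplit(base_url).hostname or "").lower()
    if not host or not base_host:
        return False
    # The "." prefix prevents knoema.recruitee.com from matching knoema.com.
    return host == base_host or host.endswith("." + base_host)


print(is_internal("https://knoema.com/o/data-engineer-india", "https://knoema.com"))  # True
print(is_internal("https://knoema.recruitee.com/", "https://knoema.com"))             # False
```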
