
python-urltools's Introduction


Some functions to parse and normalize URLs.

The main focus of this library is to make it possible to work on all segments of a URL. A core feature, which the standard library does not provide, is splitting a domain name correctly by using the Public Suffix List (see below).

Functions

Normalize

>>> urltools.normalize("Http://exAMPLE.com./foo")
http://example.com/foo

Rules that are applied to normalize a URL:

  • lowercase the scheme
  • lowercase the host (also works with IDNs)
  • remove default port
  • remove ':' without port
  • remove DNS root label
  • unquote path, query, fragment
  • collapse path (remove '//', '/./', '/../')
  • sort query params and remove params without value

normalize uses the splitting and normalization functions described below. The hostname is not lowercased by normalize_host. That already happens earlier, in the split_host step, to make splitting malformed netlocs easier.
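
Several of the rules above can be approximated with the standard library alone. The following is a minimal sketch, not the urltools implementation: normalize_sketch, collapse_path and DEFAULT_PORTS are hypothetical names, only http and https default ports are assumed, userinfo is ignored, and unquoting and IDN handling are left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: only these two schemes have a known default port.
DEFAULT_PORTS = {"http": 80, "https": 443}


def collapse_path(path):
    """Remove empty, '.' and '..' segments from a path."""
    kept = []
    for segment in path.split("/"):
        if segment in ("", "."):
            continue
        if segment == "..":
            if kept:
                kept.pop()
            continue
        kept.append(segment)
    collapsed = "/" + "/".join(kept)
    # Keep a trailing slash if the input had one.
    if kept and path.endswith("/"):
        collapsed += "/"
    return collapsed


def normalize_sketch(url):
    """Apply a subset of the normalization rules listed above."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # .hostname is already lowercased; strip the DNS root label.
    host = (parts.hostname or "").rstrip(".")
    netloc = host
    # .port is None for a missing port and for a bare ':'.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = "%s:%d" % (host, parts.port)
    # parse_qsl drops parameters without a value by default.
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit(
        (scheme, netloc, collapse_path(parts.path), query, parts.fragment)
    )


print(normalize_sketch("Http://exAMPLE.com./foo"))
```

Two URLs can then be compared by normalizing both and testing the results for equality.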

Parse

The result of parse and extract is a URL named tuple that contains scheme, username, password, subdomain, domain, tld, port, path, query, fragment and the original url itself.

>>> urltools.parse("http://example.co.uk/foo/bar?x=1#abc")
URL(scheme='http', username='', password='', subdomain='', domain='example',
tld='co.uk', port='', path='/foo/bar', query='x=1', fragment='abc',
url='http://example.co.uk/foo/bar?x=1#abc')

If the scheme is missing, parse interprets the URL as relative.

>>> urltools.parse("www.example.co.uk/abc")
URL(scheme='', username='', password='', subdomain='', domain='', tld='',
port='', path='www.example.co.uk/abc', query='', fragment='',
url='www.example.co.uk/abc')

Extract

extract does not care about relative URLs and always tries to extract as much information as possible.

>>> urltools.extract("www.example.co.uk/abc")
URL(scheme='', username='', password='', subdomain='www', domain='example',
tld='co.uk', port='', path='/abc', query='', fragment='',
url='www.example.co.uk/abc')
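
The difference between the two behaviours can be illustrated with the standard library. This is only a sketch of the idea, not how urltools implements extract: split_guessing_netloc is a hypothetical helper that retries the split with a leading "//" when neither scheme nor netloc was found. A genuinely relative path such as "foo/bar" would be misread as a host by this approach.

```python
from urllib.parse import urlsplit


def split_guessing_netloc(url):
    """Split a URL, treating a scheme-less input as starting with a host."""
    parts = urlsplit(url)
    if not parts.scheme and not parts.netloc:
        # '//' marks the start of a network location for urlsplit.
        parts = urlsplit("//" + url)
    return parts


strict = urlsplit("www.example.co.uk/abc")
guessed = split_guessing_netloc("www.example.co.uk/abc")
print(strict.netloc, strict.path)
print(guessed.netloc, guessed.path)
```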

Additional functions

Besides the main functions described above, urltools has some more functions to manipulate segments of a URL or to create new URLs.

  • construct a new URL from parts

      >>> construct(URL('http', '', '', '', 'example', 'com', '', '/abc', 'x=1',
      ... 'foo', None))
      'http://example.com/abc?x=1#foo'
    
  • compare two urls to check if they are the same

      >>> compare("http://examPLe.com:80/abc?x=&b=1",
      ... "http://eXAmple.com/abc?b=1")
      True
    
  • encode (IDNA, see RFC 3490)

      >>> urltools.encode("http://müller.de")
      'http://xn--mller-kva.de/'
    
  • normalize_host decodes IDNA encoded segments of a DNS name

      >>> normalize_host('xn--mller-kva.de')
      u'müller.de'
      >>> normalize_host('xn--e1afmkfd.xn--p1ai')
      u'пример.рф'
    
  • normalize_path

      >>> normalize_path("/a/b/../../c")
      '/c'
    
  • normalize_query

      >>> normalize_query("x=1&y=&z=3")
      'x=1&z=3'
    
  • normalize_fragment unquotes fragments except for the characters '+', '#' and space

  • unquote a string. Optionally, you can specify a list of characters that are not unquoted

      >>> unquote('foo%23bar')
      'foo#bar'
      >>> unquote('foo%23bar', ['#'])
      'foo%23bar'
    
  • split is basically the same as urlparse.urlparse in Python 2.7 or urllib.parse.urlparse in Python 3.4. In Python 2.7 it handles some malformed URLs better than urlparse. Differences from urlparse in Python 3.4 have not been analyzed.

      >>> split("http://www.example.com/abc?x=1&y=2#foo")
      SplitResult(scheme='http', netloc='www.example.com', path='/abc',
      query='x=1&y=2', fragment='foo')
    
  • split_netloc splits a network location (netloc) to username, password, host and port

      >>> split_netloc("foo:bar@www.example.com:8080")
      ('foo', 'bar', 'www.example.com', '8080')
    
  • split_host uses the Public Suffix List to split a domain name correctly

      >>> split_host("www.example.ac.at")
      ('www', 'example', 'ac.at')
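
The IDNA conversions shown for encode and normalize_host can be reproduced with Python's built-in idna codec, which implements RFC 3490. The sketch below is for illustration only and is not taken from urltools; to_ascii and to_unicode are hypothetical helper names, and they operate on a bare host name rather than a full URL.

```python
def to_ascii(host):
    """Encode a Unicode host name to its IDNA (ASCII) form."""
    return host.encode("idna").decode("ascii")


def to_unicode(host):
    """Decode an IDNA-encoded host name back to Unicode."""
    return host.encode("ascii").decode("idna")


print(to_ascii("müller.de"))
print(to_unicode("xn--e1afmkfd.xn--p1ai"))
```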
    

Public Suffix List

urltools uses the Public Suffix List (PSL) to split domain names correctly. E.g. the TLD of example.co.uk would be .co.uk and not .uk. It is not possible to decide "how big" the TLD is without a lookup in this list.
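
To make the lookup concrete, here is a minimal sketch of suffix-based splitting. It is not the urltools implementation: split_host_sketch is a hypothetical name, SUFFIXES is a tiny hand-picked stand-in for the real list, and the wildcard and exception rules of the PSL are ignored.

```python
# Assumption: a tiny stand-in for the real Public Suffix List.
SUFFIXES = {"uk", "co.uk", "at", "ac.at", "com", "de"}


def split_host_sketch(host, suffixes=SUFFIXES):
    """Split a host into (subdomain, domain, tld) using the longest known suffix."""
    labels = host.lower().rstrip(".").split(".")
    # Candidates are tried from longest to shortest, so 'co.uk' wins over 'uk'.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                # The host itself is a public suffix.
                return ("", "", candidate)
            return (".".join(labels[:i - 1]), labels[i - 1], candidate)
    # No known suffix: treat the whole host as the domain.
    return ("", ".".join(labels), "")


print(split_host_sketch("www.example.ac.at"))
print(split_host_sketch("example.co.uk"))
```

Without the "co.uk" entry in the set, the second call would fall back to "uk" and report "co" as the domain, which is exactly the mistake the list prevents.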

A local copy of the PSL is recommended. Otherwise the list is downloaded on every import of urltools. The path of the local copy has to be set in the environment variable PUBLIC_SUFFIX_LIST:

export PUBLIC_SUFFIX_LIST=/path/to/effective_tld_names.dat
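
As a rough sketch of what reading such a local copy could look like (this is an assumption for illustration, not the loader used by urltools, and load_suffix_rules is a hypothetical name): in the PSL file format, blank lines and lines starting with "//" are skipped, and each rule is the text up to the first whitespace.

```python
import os


def load_suffix_rules(path=None):
    """Read PSL rules from a local file named by PUBLIC_SUFFIX_LIST."""
    path = path or os.environ.get("PUBLIC_SUFFIX_LIST")
    if not path:
        raise RuntimeError("PUBLIC_SUFFIX_LIST is not set")
    rules = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and '//' comments.
            if not line or line.startswith("//"):
                continue
            # A rule ends at the first whitespace.
            rules.add(line.split()[0])
    return rules
```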

For more information about how PSL works see http://publicsuffix.org/

Installation

You can install urltools from the Python Package Index (PyPI):

pip install urltools

... or get the latest version directly from GitHub:

pip install -e git+https://github.com/rbaier/python-urltools.git#egg=urltools

The second option is not recommended because some features might be in an experimental state.

There is (or should be) a git tag for each version that was released on PyPI.

Testing

tox and pytest are used for testing. Simply install tox and run it:

pip install tox
tox
