urlextract's Introduction

URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text, based on locating TLDs.

How does it work

It tries to find any occurrence of a TLD in the given text. When a TLD is found, it starts from that position and expands the boundaries to both sides, searching for a "stop character" (usually whitespace, a comma, or a single or double quote).

A DNS check option is also available to reject invalid domain names.
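
For example, a minimal sketch of the DNS check; the check_dns keyword of find_urls is assumed here (available in recent releases), and the "bogus" domain is made up:

from urlextract import URLExtract

extractor = URLExtract()
# check_dns is an assumed keyword: candidates whose domain does not resolve
# in DNS are rejected from the results.
urls = extractor.find_urls(
    "Real: janlipovsky.cz, bogus: surely-not-registered-abc123.cz",
    check_dns=True,
)
print(urls)  # expected: ['janlipovsky.cz']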

NOTE: The list of TLDs is downloaded from iana.org to keep you up to date with new TLDs.

Installation

The package is available on PyPI - you can install it via pip:

pip install urlextract

Documentation

Online documentation is published at http://urlextract.readthedocs.io/

Requirements

  • IDNA for converting links to IDNA format
  • uritools for domain name validation
  • platformdirs for determining user's cache directory
  • dnspython to cache DNS results

    pip install idna
    pip install uritools
    pip install platformdirs
    pip install dnspython

Or you can install the requirements with `requirements.txt`:

pip install -r requirements.txt

Run tox

Install tox:

pip install tox

Then run it:

tox

Example

You can look at the command-line program at the end of urlextract.py, but everything you need to know is this:

from urlextract import URLExtract

extractor = URLExtract()
urls = extractor.find_urls("Text with URLs. Let's have URL janlipovsky.cz as an example.")
print(urls) # prints: ['janlipovsky.cz']

Or you can get a generator over URLs in the text:

from urlextract import URLExtract

extractor = URLExtract()
example_text = "Text with URLs. Let's have URL janlipovsky.cz as an example."

for url in extractor.gen_urls(example_text):
    print(url) # prints: janlipovsky.cz

Or, if you just want to check whether there is at least one URL, you can do:

from urlextract import URLExtract

extractor = URLExtract()
example_text = "Text with URLs. Let's have URL janlipovsky.cz as an example."

if extractor.has_urls(example_text):
    print("Given text contains some URL")

If you want to have an up-to-date list of TLDs, you can use update():

from urlextract import URLExtract

extractor = URLExtract()
extractor.update()

or the update_when_older() method:

from urlextract import URLExtract

extractor = URLExtract()
extractor.update_when_older(7) # updates when the list is older than 7 days

Known issues

Since a TLD can be not only an abbreviation but also a meaningful word, we might see "false matches" when searching for URLs in HTML pages. A false match can occur, for example, in CSS or JS when you refer to an HTML element using its classes.

Example HTML code:

<p class="bold name">Jan</p>
<style>
  p.bold.name {
    font-weight: bold;
  }
</style>

If this HTML snippet is passed to urlextract.find_urls(), it will return p.bold.name as a URL. This behavior of urlextract is correct, because .name is a valid TLD, so urlextract sees bold.name as a valid domain name and p as a valid sub-domain.
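
If that is a problem for your input, one user-side workaround (a sketch only, not part of the library) is to post-filter the results, for example keeping only matches that carry an explicit scheme:

from urlextract import URLExtract

extractor = URLExtract()
html = '<p class="bold name">Jan</p>'
# Workaround sketch: keep only matches with an explicit scheme, which drops
# CSS selectors such as "p.bold.name".
urls = [u for u in extractor.find_urls(html) if u.startswith(("http://", "https://"))]
print(urls)  # expected: []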

License

This piece of code is licensed under The MIT License.

urlextract's People

Contributors

dmascialino, dpretty, elliotwutingfeng, hugovk, iwangpeng, jayvdb, karlicoss, keyz182, khoben, lipoja, martinbasti, mimi89999, supernothing, yossi, za

urlextract's Issues

Doesn't check for valid termination

For the following input:

from urlextract import URLExtract

extractor = URLExtract()
text="""
http://httpbin.org/status/204, http://httpbin.org/status/204.
"""
urls = extractor.find_urls(text)
print(urls)

The output generated is:
['http://httpbin.org/status/204,', 'http://httpbin.org/status/204.']

The characters [.,?!-] are not valid terminal symbols for a URL and thus should be stripped.
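
A possible user-side workaround (a sketch only) is to strip such trailing punctuation from the returned strings:

from urlextract import URLExtract

# Workaround sketch: strip trailing punctuation that is rarely part of a URL.
extractor = URLExtract()
text = "http://httpbin.org/status/204, http://httpbin.org/status/204."
urls = [url.rstrip(".,?!-") for url in extractor.find_urls(text)]
print(urls)  # ['http://httpbin.org/status/204', 'http://httpbin.org/status/204']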

urlextract does not have a stdin option

Most text manipulation tools have the option to work on a file OR from stdin.

stdin is not an option in this tool.

I added a simple CLI option to my version that functioned this way:

    if args.input_file:
        with open(args.input_file, "r", encoding="UTF-8") as f:
            content = f.read()

    elif args.stdin:
        content = ' '.join(sys.stdin.readlines())

    for url in urlextract.find_urls(content, args.unique):
        print(url)
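
For reference, a self-contained sketch of such a wrapper (not the project's actual CLI; the argument names are made up for illustration):

import argparse
import sys

from urlextract import URLExtract

parser = argparse.ArgumentParser(description="Extract URLs from a file or stdin.")
parser.add_argument("input_file", nargs="?", help="file to read; stdin is used when omitted")
parser.add_argument("-u", "--unique", action="store_true", help="print each URL only once")
args = parser.parse_args()

if args.input_file:
    with open(args.input_file, "r", encoding="UTF-8") as f:
        content = f.read()
else:
    # Fall back to reading the whole of stdin when no file is given.
    content = sys.stdin.read()

extractor = URLExtract()
for url in extractor.find_urls(content, only_unique=args.unique):
    print(url)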

URLExtract installs module "version" out of "urlextract" namespace

URLExtract installs version.py as a separate module outside the urlextract namespace.

$ pip3 install urlextract
$ python3
>>> import version
>>> version
<module 'version' from '/tmp/venv/lib/python3.6/site-packages/version.py'>

This is quite unexpected; pip packages should use namespaces to avoid mixing with other projects. Currently it conflicts with the namespace of the https://pypi.org/project/version/ project.

Expected is having at least:

>>> from urlextract import version

or better

>>> import urlextract
>>> urlextract.__version__

[Bug] MemoryError

I fed it a 800.000Mb text file and it broke with that error. I also used the -u option.
Running it with Python 3.7 on Win10 x64.

[Bug] ValueError with text from a reference

What a great library! I'm parsing PDF -> text files from arXiv, and a common motif that crashes the program is shown in the minimal example below:

from urlextract import URLExtract
extractor = URLExtract()

text = "et.al.[10]"
extractor.find_urls(text)

with the traceback

Traceback (most recent call last):
  File "failure.py", line 5, in <module>
    extractor.find_urls(text)
  File "/home/hoppeta/.pyenv/versions/3.6.0/lib/python3.6/site-packages/urlextract/urlextract_core.py", line 756, in find_urls
    return list(urls)
  File "/home/hoppeta/.pyenv/versions/3.6.0/lib/python3.6/site-packages/urlextract/urlextract_core.py", line 739, in gen_urls
    tmp_url = self._complete_url(text, offset + tld_pos, tld)
  File "/home/hoppeta/.pyenv/versions/3.6.0/lib/python3.6/site-packages/urlextract/urlextract_core.py", line 560, in _complete_url
    if not self._is_domain_valid(complete_url, tld):
  File "/home/hoppeta/.pyenv/versions/3.6.0/lib/python3.6/site-packages/urlextract/urlextract_core.py", line 632, in _is_domain_valid
    host = url_parts.gethost()
  File "/home/hoppeta/.pyenv/versions/3.6.0/lib/python3.6/site-packages/uritools/split.py", line 157, in gethost
    raise ValueError('Invalid host %r' % host)
ValueError: Invalid host 'et.al.[10]'

Release New Version to PyPI

In light of the latest changes made to this project, would it be possible to release a new version of URLExtract to PyPI? The most recent release on PyPI was on February 24, while there have been several changes made since then.

Add ability to read local TLDs file rather than download

Hi. Thanks for creating URLExtract. It works well for me locally, but when I deploy a Flask app that uses URLExtract to AWS ElasticBeanstalk (kind of like a PaaS), I get an error.

The problem is that AWS ElasticBeanstalk implements good web server security practices, which include making the home directory, and the deployed code's directory, read-only. They will have done this to make it harder for a web app vulnerability to result in a file being written to disk on the web server.

However, this presents a problem for URLExtract - if it's unable to write to the module's __file__ directory or $HOME, it throws an exception because it has nowhere to store the TLDs file.

To get around this, it'd be great if there was an option to package the TLDs file as part of the deployed app and pass the full path to this file to URLExtract at initialisation. If passed a file path like this, URLExtract would use the file and not try to download a version from iana.org.

This approach also has the added benefit of being more robust - your app does not depend on being able to download a file from iana.org in order to work. It also allows you to guarantee that two environments are exactly the same - they will have the same TLDs file (great for testing). It's also more secure - pulling down files from third-party sites at runtime is considered risky by some; in fact, in some commercial environments this is simply not possible, as production outbound internet access is blocked for security reasons.
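
A sketch of what this could look like, assuming the cache_dir keyword argument that later releases accept; the directory path is hypothetical and is imagined to ship with the app, already containing the downloaded TLD list:

from urlextract import URLExtract

# Sketch of the requested behaviour, assuming a cache_dir keyword (present
# in newer releases). The bundled directory already holds the TLD list, so
# nothing is fetched from iana.org and nothing is written at runtime.
extractor = URLExtract(cache_dir="/var/app/current/data")
urls = extractor.find_urls("The deployed app still finds example.com in text.")
print(urls)  # expected: ['example.com']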

URL surrounded with parenthesis

Hi there,
I simply tested the code below, and surprisingly it returned an empty list:

from urlextract import URLExtract
extractor = URLExtract()
urls = extractor.find_urls("Let's have URL (stackoverflow.com) as an example.")
print(urls)

pypidb issues

Continuing from #63 , these are the known issues (list will grow).

As a general rule, the higher priority issues are where urlextract doesn't extract valuable URLs, or extracts truncated URLs. Returning extra junk around URLs or extra URLs is problematic, but I can trim/remove junk. I can't fix data I don't have.

Others I think are harder and may not be in urlextract scope:

  • Lots of annoying invalid .py domains filtered out by DNS checking, such as setup.py which is assumed to be https://setup.py, https://manifest.py, etc. This is a significant performance problem for the first few requests, as they are DNS negatives which need to get cached, and they slow down urlextract also. Lots of other country codes occasionally correlate with file extensions, such as https://manifest.in/ and http://readme.md/. This could be handled in dns_cache by seeding the DNS cache with known invalid entries. urlextract could help with domain name filtering.
  • Relative urls jayvdb/pypidb#38 This would be a huge enhancement to URLExtract, but requires adding a completely different extraction algorithm.
  • DOS/Maximum results #69
  • http://docs.red-dove.com/cfg/python.html e.target is really common, appearing in <script> blocks, but I am not sure it would be useful to exclude urls found in script tags via https://pypi.org/project/config
  • {{ in url ; pydevd-pycharm
    DEBUG    pypidb._pypi:_pypi.py:313 processing Webpage: https://ci.appveyor.com/project/fabioz/pydev-debugger
    DEBUG    pypidb._pypi:_pypi.py:379 @@ ran <function _url_extractor_wrapper at 0x7f03e2f1b5e0> on text size 7901 for 8 urls !!
    DEBUG    pypidb._pypi:_pypi.py:384 extracted ['account.name', 'https://help.appveyor.com/', 'https://js.stripe.com/v2/', 'https://status.appveyor.com/', 'https://www.appveyor.com/docs/', 'https://www.appveyor.com/docs/server/', 'https://www.appveyor.com/updates/', 'https://www.gravatar.com/avatar/{{Session.user().gravatarHash}}?d=https%3a%2f%2fci.appveyor.com%2fassets%2fimages%2fuser.png&s=40']
    
  • backticks are not trimmed, related to #13
    'git://github.com/ingydotnet/package-py.git``' 
    
    so I use
    _scm_url_cleaner.py:                repo = repo.strip("`")
    

dns_cache

I've created https://github.com/jayvdb/dns-cache which caches negative responses, which is quite helpful when using the recently added DNS checking in URLExtract.

Should I add dns_cache to dns_cache_install? Or just mention it in the README for users who want more control?

Also there is a fairly serious problem with the dnspython "socket" resolver on Windows during negative responses.
rthalley/dnspython#416

However the AttributeError caused there should be caught at https://github.com/lipoja/URLExtract/blob/1eb9ad5/urlextract/urlextract_core.py#L564 , so the logging there is the only bit which can be improved.

We can also improve the logging by catching socket.gaierror and giving it a better log entry.

Extraction is not deterministic

If you run that:

for _ in {1..100}; do python3 -c 'from urlextract import URLExtract; ex = URLExtract(); print(ex.find_urls(" lesswrong.com, "))'; done

You'd get ['lesswrong.com'] sometimes, and sometimes nothing. This doesn't reproduce if you just call it within Python in a loop, so I suspect it has to do with the Python hash function changing between interpreter runs.

I'd imagine it would be hard to test too (because once you have a running Python test process, the hash order is fixed), but perhaps we could add extra tests that do spawn separate Python instances.

I might look at it later and try to fix it; just wanted to leave it here before I forget.

P.S. tested on latest (0.9) version
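
A possible test along those lines (a sketch only): each subprocess gets its own hash seed, so a hash-order dependency would show up as differing outputs across runs.

import subprocess
import sys

# Sketch of a regression test: run the extraction in fresh interpreters and
# check that every run produces the same output.
code = (
    "from urlextract import URLExtract; "
    "print(URLExtract().find_urls(' lesswrong.com, '))"
)
outputs = {
    subprocess.run([sys.executable, "-c", code], capture_output=True, text=True).stdout
    for _ in range(20)
}
assert len(outputs) == 1, outputs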

Race condition in cache loading

Appreciate your work on this library! When using urlextract in a concurrent environment, I encountered a race condition when loading the cache file. As a simple test case:

from concurrent.futures import ThreadPoolExecutor
from urlextract import URLExtract

t = ThreadPoolExecutor(max_workers=8)

def test():
    a = URLExtract()
    a.update()

for i in [t.submit(test) for _ in range(100)]:
    print(i.result())

In my environment, this results in:

~/race.py in test()
      8 def test(): 
      9     a = URLExtract()
---> 10     a.update()
     11                                       
     12 for i in [t.submit(test) for _ in range(100)]:

~/venv/lib/python3.6/site-packages/urlextract/urlextract_core.py in update(self)
    146             return False
    147                                          
--> 148         self._reload_tlds_from_file()
    149
    150         return True

~/venv/lib/python3.6/site-packages/urlextract/urlextract_core.py in _reload_tlds_from_file(self)
    114         """
    115
--> 116         tlds = sorted(self._load_cached_tlds(), key=len, reverse=True)
    117         re_escaped = [re.escape(str(tld)) for tld in tlds]
    118         self._tlds_re = re.compile('|'.join(re_escaped))

~/venv/lib/python3.6/site-packages/urlextract/cachefile.py in _load_cached_tlds(self)
    216
    217                 set_of_tlds.add("." + tld)
--> 218                 set_of_tlds.add("." + idna.decode(tld))
    219
    220         return set_of_tlds

~/venv/lib/python3.6/site-packages/idna/core.py in decode(s, strict, uts46, std3_rules)
    390         trailing_dot = True
    391     for label in labels:
--> 392         s = ulabel(label)
    393         if s:
    394             result.append(s)

~/venv/lib/python3.6/site-packages/idna/core.py in ulabel(label)
    309
    310     label = label.decode('punycode')
--> 311     check_label(label)
    312     return label
    313

~/venv/lib/python3.6/site-packages/idna/core.py in check_label(label)
    237         label = label.decode('utf-8')
    238     if len(label) == 0:
--> 239         raise IDNAError('Empty Label')
    240
    241     check_nfc(label)

I'm not sure if urlextract is meant to be concurrency-safe, but if so, maybe using something like filelock would be appropriate here.
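
Until the library guards the cache file itself, a user-side sketch is to serialize the update across threads; an inter-process file lock (e.g. the filelock package) would be the analogous fix across processes:

import threading

from urlextract import URLExtract

_update_lock = threading.Lock()

def make_extractor():
    extractor = URLExtract()
    # Sketch: allow only one thread at a time to rewrite the shared cache file.
    with _update_lock:
        extractor.update()
    return extractor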

URL returned twice

A URL was unexpectedly returned twice:

>>> from urlextract import URLExtract
>>> ex = URLExtract()
>>> ex.find_urls('https://i2.wp.com/siliconfilter.com/wp-content/uploads/2011/06/Techmeme100-top-20.jpg')
['https://i2.wp.com/siliconfilter.com/wp-content/uploads/2011/06/Techmeme100-top-20.jpg', 'https://i2.wp.com/siliconfilter.com/wp-content/uploads/2011/06/Techmeme100-top-20.jpg']

Doesn't extract from markdown format properly

Code:

from urlextract import URLExtract

extractor = URLExtract()
text="""
[http://httpbin.org/status/200](http://httpbin.org/status/200)
"""
urls = extractor.find_urls(text)
print(urls)

Output:
['[http://httpbin.org/status/200](http://httpbin.org/status/200)', '[http://httpbin.org/status/200](http://httpbin.org/status/200)']

Python 2 Support

Is there a python 2 version of this? Would love to use this for a python 2 project I am working on.

Also, are there unit tests? It would be great to see what test cases are used for validation.

similar to this:
https://mathiasbynens.be/demo/url-regex

thanks, keep up the great work!

list of URL exceptions

Hello! Is there a way to define a list of URL exceptions? For example, I want youtube.com/channel/UC22n-g_flDuGDvUgFPkAgIw to be considered a URL, but I don't want asp.net to be detected as a URL. Is it possible to achieve this?
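
There is no built-in exception list as far as I know, but a post-filtering sketch works; the blocked set below is just an example, and the filter assumes schemeless results (it compares only the part before the first "/"):

from urlextract import URLExtract

extractor = URLExtract()
# Hypothetical exception list of domains that should never be reported.
blocked = {"asp.net"}
text = "See youtube.com/channel/UC22n-g_flDuGDvUgFPkAgIw but ignore asp.net."
urls = [u for u in extractor.find_urls(text) if u.split("/")[0].lower() not in blocked]
print(urls)  # expected: ['youtube.com/channel/UC22n-g_flDuGDvUgFPkAgIw']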

[request] URLExtract.has_url

Thanks for the package. I was very happy to find it. For runtime efficiency, I would like to request a has_url(text) method. The point is that in this use case, I only need to know if there is at least one URL in the input string, and so I want a boolean value returned. I am currently using find_urls(text) and this works, but this is obviously less efficient in the case of a large input string.
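
Until such a method exists, the generator already allows an early-exit check (a sketch):

from urlextract import URLExtract

extractor = URLExtract()
# Sketch: gen_urls is lazy, so asking for the first item stops scanning as
# soon as one URL is found, unlike find_urls which collects them all.
has_url = next(extractor.gen_urls("Some long text with janlipovsky.cz inside."), None) is not None
print(has_url)  # True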

[bug] wrong TLD detected, resulting in parse failure with "[email protected]"

First, the traceback is:
a=extractor.find_urls("http://[email protected]:51733/hn35/")

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python\Python36\lib\site-packages\urlextract\urlextract_core.py", line 683, in find_urls
    urls = OrderedDict.fromkeys(urls) if only_unique else urls
  File "C:\Python\Python36\lib\site-packages\urlextract\urlextract_core.py", line 645, in gen_urls
    if tld_pos != -1 and validated:
  File "C:\Python\Python36\lib\site-packages\urlextract\urlextract_core.py", line 413, in _complete_url
    complete_url, tld_pos-start_pos, tld)
  File "C:\Python\Python36\lib\site-packages\urlextract\urlextract_core.py", line 530, in _is_domain_valid
AttributeError: 'IPv4Address' object has no attribute 'split'

My attempt to fix this bug is in urlextract_core.py line 104:

        after_tld_chars = set(string.whitespace)
        after_tld_chars |= {'/', '\"', '\'', '<', '>', '?', ':', '.', ','}
        # after_tld_chars |= {'/', '\"', '\'', '<', '>', '?', ':', ','}
        # get left enclosure characters

I remove "." from ater_tld_chars,and follows is my oppions:

when I trace in the urlextract_core,I find the tld you search is the end of domain.(For exmple
the "a.edu.cn " 'tld is cn whether it 's edu.cn in tldextract.)
If i understand correctly, when you get tld from "test.com.cn" you should take "cn" as tld ,but it's "com" in fact .In given case ,the url's tld will be "edu",but in fact it's an IPV4.
So whether remove "." from after_tld_chars will work ?Or there is any better way to solve this question.
thx

URL not extracted if `.` at end of url

Hi! Thanks for an amazing library. I've successfully used it to extract URLs from 2400 documents, but there are just a few where extraction isn't happening.

The following content is similar to many other documents that I've extracted from, which is why I was surprised it wasn't working.

Here are some example docs:

https://www.sec.gov/Archives/edgar/data/1636023/000117184317003392/0001171843-17-003392.txt
https://www.sec.gov/Archives/edgar/data/1636023/000117184317003392/exh_101.htm
https://www.sec.gov/Archives/edgar/data/1636023/000117184317003392/fsd_053117.htm

Each of the above docs contains the url www.westrock.com

I finally figured out that a period at the end of the URL was causing it not to be detected.
I tried futzing with set_after_tld_chars and set_stop_chars but didn't get anywhere.

I used content.replace('.com.', '.com') as a pre-processing workaround (but obviously this isn't particularly scalable as there are lots of other domain extensions)

Thanks again for an amazing library.

Unable to extract URL within quotes

Testcase:

from urlextract import URLExtract

extractor = URLExtract()
text="""
`https://coala.io/200`
"""
urls = extractor.find_urls(text)
print(urls)

Output:
['`https://coala.io/200`']

[Bug] Wrong with python2

URLExtract/urlextract/cachefile.py
line:14 import urllib.request
urllib.request is not supported in Python 2.

Considering that a url could contain "@"

Hi,

Are you using a regex for URL extraction?

python
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> from urlextract import URLExtract

>>> extractor = URLExtract()

>>> urls = extractor.find_urls("https://medium.com/@eon01/docker-rancher-efs-glusterfs-minikube-sns-sqs-microservices-and-containerd-b4c5c9c7cc0c")

>>> print(urls)
['https://medium.com']

find_urls method doesn't return anything

It returns an empty list when I use the find_urls method on strings.


Python 2.7.15rc1 (default, Apr 15 2018, 21:51:34) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from urlextract import URLExtract
>>> extract = URLExtract()
>>> urls = extract.find_urls("This is an example... https://github.com/lipoja/URLExtract/issues")
>>> print urls
[]
>>> 

Email addresses are detected as URLs

Individual email addresses are detected as URLs. For example:

Email addresses should not be classified as URLs. Thanks.


For anyone else who runs into this issue, for now I'm using an oversimplified regex to approximately filter out such values:

import re

urls = [url for url in urls if not re.fullmatch(r'[^@]+@[^@]+\.[^@]+', url)]

Guarantee order of returned URLs

As a user, I find it important that URLs are always returned in the order they exist (or exist first) in the input text.

Using set() unfortunately violates this requirement.

urls = set(urls) if only_unique else urls

>>> url_extractor.find_urls('yahoo.com msn.com', only_unique=False)
['yahoo.com', 'msn.com']  # in order
>>> url_extractor.find_urls('yahoo.com msn.com', only_unique=True)
['msn.com', 'yahoo.com']  # NOT IN ORDER!

This thread covers many alternatives. Please consider one of these options. For example, per this answer the implementation of find_urls could then become:

urls = self.gen_urls(text)
urls = OrderedDict.fromkeys(urls) if only_unique else urls
return list(urls)

Adding several unit tests to always ensure the ordering would also be meaningful. Thanks.

Wrong result for several urls

Some links whose domain suffix consists of two parts (e.g. org.uk, com.au) are duplicated in the result:

>>> from urlextract import URLExtract
>>> ue = URLExtract()
>>> ue.find_urls('ukrainian news pravda.com.ua')
['pravda.com.ua', 'pravda.com.ua']

Support of config file

It would be good if the urlextract CLI could load configuration from a config file, where all currently supported setter methods could be configured, e.g. stop characters (left, right) and email extraction.

Trailing enclosure characters not always excluded

I have a case where a trailing character is not excluded like expected:

extractor.find_urls(r"{\url{http://www.google.com}}")
returns ['http://www.google.com']

extractor.find_urls(r"{\url{http://www.google.com/file.pdf}}")
returns ['http://www.google.com/file.pdf}']

I am running version 0.10.

[Question] Force urlextract to use proxy

I'm using urlextract in Jupyter, where my scripts set a custom proxy for connecting to the internet. Setting the proxy up for the requests package is fine by just passing it the proxy settings (screenshot omitted).

Is there anything similar to this in urlextract? Because right now, when it attempts to download the TLDs it reports an error saying it can't connect, and subsequently running the examples fails with an error as well (screenshot omitted).

Cheers for any help!
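
The TLD list is fetched with urllib.request (see cachefile.py), which honours the standard proxy environment variables, so one hedged workaround is to set those before creating the extractor; the proxy URL below is a placeholder:

import os

from urlextract import URLExtract

# Sketch: urllib.request (used for the TLD download) picks up the standard
# proxy environment variables by default.
os.environ["HTTP_PROXY"] = "http://proxy.example.com:8080"
os.environ["HTTPS_PROXY"] = "http://proxy.example.com:8080"

extractor = URLExtract()
extractor.update()  # downloads the TLD list through the proxy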

Incomplete URL extracted

An incomplete URL was extracted from a larger string. Here is the minimal code:

>>> import urlextract
>>> urlextract.__version__
'0.10'
>>> url_extractor = urlextract.URLExtract()
>>> input_url = "https://www.semanticscholar.org/paper/When-You-Eat-Matters%3A-60-Years-of-Franz-Halberg's-Cornelissen/3c57ee642835494a8b46f3ff4799f746d3a4607f"
>>> url_extractor.find_urls(input_url)
['https://www.semanticscholar.org/paper/When-You-Eat-Matters%3A-60-Years-of-Franz-Halberg']

I would like for the full URL to have been extracted instead. I don't see why it should stop after '. Thanks.

urlextract does not support IP address based URLs

$ echo this is a test of http://1.1.1.1/neatstuff using urlextract > urltest
$ echo this is a test of http://fun.domain.com/neatstuff using urlextract >> urltest
$ urlextract urltest
http://fun.domain.com/neatstuff
$

URLs containing IPs instead of hostnames are not extracted.

None of the cache directory is writable

We are using URLExtract in one of our python projects. The python script works fine locally but when uploaded as AWS lambda the script fails as none of the cache directories is writable.

Ideally, there should be a way to provide the cache directory path as a parameter to the URLExtract constructor itself, with the default being whatever it currently is.

Not found URL (stop char missing)

If there is no valid left stop char and there is no space in the string, a valid URL is not found.
Example string:
スマホの方はこちらをクリック➡https://line.me/R/ti/p/%40pnd3781y

More efficient use of memory

    urls = list(self.gen_urls(text))
    return urls if not only_unique else list(set(urls))

The above has an issue in that it needlessly creates the initial list even when a set is desired. This can be wasteful of memory. I didn't realize this before, but a better form is:

    urls = self.gen_urls(text)
    urls = set(urls) if only_unique else urls
    return list(urls)

The above form prevents creating a list of possibly non-unique URLs if only the unique ones are desired.

Missing URLs using find_urls

I have run into some text (email spam) that find_urls fails to extract all URLs from. Example input:

One night's accommodation, double occupancy http://example.com/gxhcht-5kdpwgk3/ 

$8000 http://example.com/gxhchu-5kdpwgk4/ 	
Value $166.00 http://example.com/gxhchv-5kdpwgk5/ 	
 https://content.idassociates.ca/images/shopico_new/spacer.png 	 http://example.com/gxhchw-5kdpwgk6/ 	

Like in the South - Deluxe Room http://example.com/gxhchy-5kdpwgk8/ 

$20308 http://example.com/gxhchz-5kdpwgk9/ 	
Value $406.19 http://example.com/gxhci0-5kdpwgk6/ 	
 https://content.idassociates.ca/images/shopico_new/spacer.png 	
 http://example.com/gxhci1-5kdpwgk7/ 	
Camping de la rivière Nicolet http://example.com/gxhci2-5kdpwgk8/ 

Accommodations / Cottage http://example.com

There are 12 URLs, but urlextract finds only 7 of them. Found URLs:

http://example.com/gxhcht-5kdpwgk3/
http://example.com/gxhchu-5kdpwgk4/
http://example.com/gxhchv-5kdpwgk5/
https://content.idassociates.ca/images/shopico_new/spacer.png
http://example.com/gxhci1-5kdpwgk7/
http://example.com/gxhci2-5kdpwgk8/
http://example.com

The behavior is really strange. For example, if I remove the following URL from input: https://content.idassociates.ca/images/shopico_new/spacer.png, all the remaining 11 URLs are found.
EDIT: Also sorry for not posting a smaller test input, but all bigger modifications led to the module working properly.

Used version 0.10

[Bug] IPv4Address object has no attribute split

Issue

Ran into this issue on v0.10 while running this over a large sample of strings. I believe it has something to do with having a domain nested in the URL.

'IPv4Address' object has no attribute 'split'
Traceback (most recent call last):
...
  File "/usr/local/lib/python3.6/site-packages/urlextract/urlextract_core.py", line 568, in find_urls
    return list(urls)
  File "/usr/local/lib/python3.6/site-packages/urlextract/urlextract_core.py", line 542, in gen_urls
    tmp_url = self._complete_url(text, offset + tld_pos, tld)
  File "/usr/local/lib/python3.6/site-packages/urlextract/urlextract_core.py", line 351, in _complete_url
    if not self._is_domain_valid(complete_url, tld):
  File "/usr/local/lib/python3.6/site-packages/urlextract/urlextract_core.py", line 438, in _is_domain_valid
    host_parts = host.split('.')
AttributeError: 'IPv4Address' object has no attribute 'split'

Steps to reproduce

Below is a minimal test case:

import urlextract
urlextract.URLExtract().find_urls("http://0.0.0.0/a.io")

Filename extracted as URL

From the following input (which is a legit archive filename):

PAYMENT EUR 1,420.00.zip

URL is extracted using find_urls:

1,420.00.zip
