robinst / autolink-java

Java library to extract links (URLs, email addresses) from plain text; fast, small and smart

License: MIT License

Java 100.00%

autolink extraction java-library linkify links url

autolink-java's Issues

Linkify domain-only links (without scheme or www)

Thanks for your library.

Would you consider adding a LinkType for just domains (eg: example.com)?

There are some cases where this is considered a link and needs to be detected/extracted (eg: Gmail and Twitter detect domains like these as links and turn them into an anchor tag).

Thanks again.

URLs with consecutive "https://https://" are parsed as-is

Hi,

URLs with a doubled scheme such as "https://https://" are extracted as-is. Can the duplicate "https://" be excluded? A unit test case is failing for such a URL.

Thx
Vin

URL parsing gets stuck on a non-clickable URL

Hi Author,

I love this framework; it is very nice and robust. I recently ran into a case where parsing a non-clickable URL yields only a partial URL.

Can you suggest a workaround for this issue?

input : https://us[.]quarantine[.]abc[.]com/notify/
Output : https://us
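
One possible workaround, sketched under the assumption that the "[.]" notation is the only obfuscation in the input: normalize the text before handing it to the extractor. The class and method names below are hypothetical and not part of autolink-java.

```java
final class Refang {
    // Hypothetical helper: turns the defanged "[.]" back into "." so that a
    // link extractor sees an ordinary URL.
    static String refang(String text) {
        return text.replace("[.]", ".");
    }

    public static void main(String[] args) {
        System.out.println(refang("https://us[.]quarantine[.]abc[.]com/notify/"));
    }
}
```

Note that the indexes of any extracted links then refer to the normalized string, not to the original input.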

Links in HTML source followed by "), ">, ') or '> returns a wrong end index

I tried to extract links from https://fr.yahoo.com/?p=us for a test.

Some of the returned links contained too many characters at the end, for example:

I'm trying to understand UrlScanner's code, but I'm not sure I'll be able to fix it.

I'll send a pull request later if I manage to fix it.

xss attacks questions

If I ensure that the input text is free of html, is there any vulnerability to xss attacks?

(I don't have too much of an understanding of this type of attack, I just read that it's a potential problem with "linkifying" code).

Thanks.

Don't autolink if authority is only "end" characters

See commonmark/commonmark-java#99, the following examples should not result in any links:

http://.
http://"
http://<space>

Note that http:// and http://. are valid URLs according to RFC 3986, because authority can be zero or more unreserved characters. But we don't autolink http:// on its own or the trailing . of http://example.org.
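
The rule can be sketched as a check that the text after the scheme contains at least one character that is allowed to end a link. This is only an illustration: the class name and the set of "end" characters are assumptions, and the library's actual scanner may use a different set.

```java
final class AuthorityCheck {
    // Assumed set of characters that may not end a link (hypothetical).
    private static final String END_CHARS = ".,:;?!\"'";

    // authorityStart is the index right after "://".
    static boolean hasLinkableAuthority(CharSequence text, int authorityStart) {
        for (int i = authorityStart; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                return false; // reached the end of the candidate without usable content
            }
            if (END_CHARS.indexOf(c) < 0) {
                return true; // found a character that can legitimately be part of a link
            }
        }
        return false; // nothing, or only "end" characters, after the scheme
    }
}
```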

Issue in extracting links if they are just separated by commas

I have been trying to extract the URLs from
String res = " https://www.gooogle.com,https://facebook.com and www.googlle.com";

Now this gives me the result.

But the required result must be

I think this library does not extract links when they are separated only by commas.
Can anyone help with this?

Thank You.

Links with non-ASCII characters are not always extracted

This URL is extracted:

https://www.bücher.de

But this one is not:

www.bücher.de

Adapt autolink-java to replace rinku in JRuby

Hello! I am working on getting the Discourse app to run in JRuby, and need to replace its dependency on rinku.

There are two ways we typically do this:

Port the extension. This shouldn't be difficult, but it seems like you may have already done this work?
Wrap a JVM library.

The latter would be preferable, since all we'd need to write is a bit of Ruby to wrap your library.

However, there are a few things that would make this integrate better with JRuby:

CharSequence is great, but the API produces String eventually. This means JRuby's byte[]-based Ruby strings need an extra conversion step, which will obviously slow down the rendering of a large document.
Compatibility with rinku. I'm not sure how to map the features of rinku to autolink-java and will need some tips here.

Here's a quick and dirty rinku-like wrapper based on your example code from README. It can serve as a place to start discussing: https://github.com/headius/jruby-autolink

Discourse on JRuby work: https://meta.discourse.org/t/getting-discourse-running-on-jruby/81273/14
Issue to make a JRuby port of rinku: vmg/rinku#75

Stop URL on < or >

code to reproduce:

String input = "wow <p>http://test.com</p> such linked";
LinkExtractor linkExtractor = LinkExtractor.builder().build();
Iterable<LinkSpan> links = linkExtractor.extractLinks(input);
String result = Autolink.renderLinks(input, links, (link, text, sb) -> {
    sb.append("<a href=\"");
    sb.append(text, link.getBeginIndex(), link.getEndIndex());
    sb.append("\">");
    sb.append(text, link.getBeginIndex(), link.getEndIndex());
    sb.append("</a>");
});
System.out.println(result);

expect:

wow <p><a href="http://test.com">http://test.com</a></p> such linked

actual:

wow <p><a href="http://test.com</p>">http://test.com</p></a> such linked

This seems to be a bug, do we support the skip_tags feature of rinku?
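
Until the extractor itself stops at angle brackets, a caller could trim each reported range as a post-processing step. The sketch below is an assumption about how such a step might look, not the library's behavior; the class and method names are hypothetical, and begin/end stand for the values of getBeginIndex() and getEndIndex().

```java
final class AngleBracketTrim {
    // Returns a new exclusive end index that stops at the first '<' or '>'
    // inside [begin, end), or the original end if there is none.
    static int trimEnd(CharSequence text, int begin, int end) {
        for (int i = begin; i < end; i++) {
            char c = text.charAt(i);
            if (c == '<' || c == '>') {
                return i;
            }
        }
        return end;
    }
}
```

Applied to the input above, the range covering http://test.com</p> shrinks to http://test.com. This does not implement anything like rinku's skip_tags; it only cuts the link at the first bracket.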

Version 0.10.2 broke binary compatibility

Version 0.10.1 was published with Java 7 as baseline.

$ jarviz bytecode show --gav org.nibor.autolink:autolink:0.10.1
subject: autolink-0.10.1.jar
Unversioned classes. Bytecode version: 51 (Java 7) total: 19

Version 0.10.2 was published with Java 9 as baseline.

$ jarviz bytecode show --gav org.nibor.autolink:autolink:0.10.2
subject: autolink-0.10.2.jar
Unversioned classes. Bytecode version: 53 (Java 9) total: 20

I understand the library wants to move to newer Java versions; as a matter of fact, version 0.11.0 also requires Java 9 as a minimum (👍). However, bumping bytecode compatibility in a build/patch release, even pre-1.0.0, well ... that was unexpected.

Please consider reverting to Java 7 if and only if a 0.10.3 were to be released in the future.
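
For readers without jarviz, the same number can be read directly from a class file: the first four bytes are the magic number 0xCAFEBABE, followed by a two-byte minor and a two-byte major version. The sketch below assumes the bytes of one .class entry have already been read from the jar; the class name is hypothetical.

```java
import java.nio.ByteBuffer;

final class ClassFileVersion {
    // Reads the major version from the header of a class file.
    static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8) {
            throw new IllegalArgumentException("too short to be a class file");
        }
        ByteBuffer buf = ByteBuffer.wrap(classBytes); // big-endian by default
        if (buf.getInt(0) != 0xCAFEBABE) {
            throw new IllegalArgumentException("missing class file magic number");
        }
        return buf.getShort(6) & 0xFFFF; // bytes 4-5 are minor, 6-7 are major
    }

    // Major 51 corresponds to Java 7 and 53 to Java 9, as in the output above.
    static int javaRelease(int major) {
        return major - 44;
    }
}
```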

Possible code injection

When using autolink on a text including a link like this one

www.google.com"onclick="alert('gotcha!')

And render the output as it is suggested in the example:

String result = Autolink.renderLinks(input, links, (link, text, sb) -> {
    sb.append("<a href=\"");
    sb.append(text, link.getBeginIndex(), link.getEndIndex());
    sb.append("\">");
    sb.append(text, link.getBeginIndex(), link.getEndIndex());
    sb.append("</a>");
});

the output will be

<a href="www.google.com"onclick="alert('gotcha!')">www.google.com"onclick="alert('gotcha!')</a>"

which is strictly speaking invalid HTML, but browsers will still execute the click handler. See https://jsfiddle.net/vLjLLo8n/2/ to try it out.

I understand that appending a subsequence to the StringBuilder is more efficient than providing the link as a String, but to make this secure, you would need to get the substring and perform encoding on it.

So, for example using OWASP Java Encoder, the rendering needs to be done like this:

String result = Autolink.renderLinks(input, links, (link, text, sb) -> {
    String linkString = new StringBuilder().append(text, link.getBeginIndex(), link.getEndIndex()).toString();

    sb.append("<a href=\"");
    sb.append(Encode.forHtmlAttribute(linkString));
    sb.append("\">");
    sb.append(Encode.forHtml(linkString));
    sb.append("</a>");
});

resulting in a safe output:

<a href="www.google.com&#34;onclick=&#34;alert(&#39;gotcha!&#39;)">www.google.com&#34;onclick=&#34;alert(&#39;gotcha!&#39;)</a>

The easiest fix for this particular problem would probably be for autolink not to include ~~single or~~ double quotes, or any other character that is not legal in a URL.
(EDIT: single quotes are legal characters)

A possibly breaking API change would be to provide the linkString as part of the LinkSpan interface.
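
If adding a dependency is not an option, the encoding step can be approximated with a small hand-written escaper. This is a minimal sketch with a hypothetical class name, not a replacement for a vetted encoder such as the OWASP one mentioned above, and it does not validate the URL scheme.

```java
final class HtmlEscape {
    // Escapes the five characters that can break out of HTML text or a
    // quoted attribute value.
    static String escape(CharSequence s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&#34;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

In the renderer, the escaped string would be appended in place of the raw subsequence, both for the href value and for the link text.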

Some url without http and www domain

This kind of link doesn't work, for example: github.com/robinst/autolink-java/

Add jlink-compatible Java9/Jigsaw module-info

The commonmark-ext-autolink extension in https://github.com/commonmark/commonmark-java depends on this library. I quickly studied the classes and it seems that internal.Scanner is the only class that might be needed outside this module, but then again, the LinkExtractor doesn't seem to be designed to be extended. My suggestion is just to provide a simple module-info.java along the lines of:

module org.nibor.autolink {
    exports org.nibor.autolink;
}

For building and deploying multi-release jars, I've seen Maven projects use maven-jar-plugin or biz.aQute.bnd.bnd-maven-plugin.

URL containing a single quote in middle results in unexpected ending

With the following input: http://example.org/"_(foo)

Currently the extracted link is this: http://example.org/"_(foo

The closing parenthesis is not included, even though it's balanced. The reason is that we check all "unfinished" brackets and quotes in one condition at the end of the loop instead of just when the corresponding character happened. So when we get to ), we're still in the "unfinished" state because of the single quote.
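
The per-character approach can be illustrated with a simplified end finder in which each closing character consults only its own counter. This is a sketch under assumptions (hypothetical class name, only round brackets and quotes handled, an assumed set of trailing punctuation) and not the library's UrlScanner.

```java
final class BalancedEnd {
    // Returns the exclusive end index of a link that starts at begin.
    static int findEnd(CharSequence text, int begin) {
        int parens = 0;
        boolean singleQuote = false;
        boolean doubleQuote = false;
        int last = begin - 1; // last index that is allowed to end the link
        for (int i = begin; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                break;
            }
            switch (c) {
                case '(':
                    parens++; // an opening bracket never ends a link
                    break;
                case ')':
                    parens--;
                    if (parens < 0) {
                        return last + 1; // closer without opener belongs to the surrounding text
                    }
                    last = i; // decided by the bracket counter alone, quotes are ignored here
                    break;
                case '\'':
                    singleQuote = !singleQuote;
                    if (!singleQuote) last = i; // only a closing quote may end the link
                    break;
                case '"':
                    doubleQuote = !doubleQuote;
                    if (!doubleQuote) last = i;
                    break;
                case '.': case ',': case ':': case ';': case '?': case '!':
                    break; // may appear inside a link but not end it
                default:
                    last = i;
            }
        }
        return last + 1;
    }
}
```

With this structure an unfinished quote no longer prevents a balanced closing parenthesis from being included.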

Dealing with | symbol

Hey Robin,
We're using your library; however, I'd like to modify it to account for | symbols and www.
We have templates with URLs such as
http://test.com|test site and sometimes we get things like www.blah

The lib extracts the first as http://test.com|test and doesn't deal with the second.

I'll fix this myself, but are you interested in pull requests related to these?

Cheers

Potentially misparsed URL

Using:
LinkExtractor.builder().linkTypes(setOf(LinkType.WWW, LinkType.URL)).build()
and then extractor.extractLinks()

"man...http://i.imgur.com/rPRnI.jpg"
is parsed as a URL, which seems wrong to me (not sure about the specs though)

support of git/github links automatic linking

In the context of gitbucket/gitbucket#1323 and extending commonmark-java, I'd like to add the possibility to discover links from:

git SHA1 references
issues & PR ids

The "autolinked references" I'd like to detect are those described in the GitHub documentation.

I am currently implementing this, but it requires some internal changes in the project, because several scanners could then be triggered by the same characters (for example '@' or [a-zA-Z]). So I'd like to know whether you would accept such changes. The changes I have in mind:

registration of Scanners would be ordered (to establish some precedence)
each Scanner would be responsible for knowing which characters trigger it (inversion of responsibility between Scanner and LinkExtractor)
for each scan (i.e. LinkIterator#setNext()), the first Scanner that returns a Link would win
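
The three proposed changes could look roughly like the sketch below. Everything in it is hypothetical (the Scanner contract, the Match type, and the two example scanners); it only illustrates ordered registration, scanner-owned trigger characters, and "first match wins", and it is not the library's internal API.

```java
import java.util.ArrayList;
import java.util.List;

final class OrderedScanners {
    // Hypothetical result type: a typed half-open range [begin, end).
    static final class Match {
        final String type;
        final int begin;
        final int end;
        Match(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    // Hypothetical contract: each scanner knows its own trigger characters.
    interface Scanner {
        boolean triggersOn(char c);
        Match scan(CharSequence text, int index); // null if nothing matches at index
    }

    private final List<Scanner> scanners = new ArrayList<>();

    // Registration order establishes precedence.
    OrderedScanners register(Scanner scanner) {
        scanners.add(scanner);
        return this;
    }

    // Returns the first match at or after from, or null if there is none.
    Match next(CharSequence text, int from) {
        for (int i = from; i < text.length(); i++) {
            char c = text.charAt(i);
            for (Scanner s : scanners) {
                if (s.triggersOn(c)) {
                    Match m = s.scan(text, i);
                    if (m != null) {
                        return m; // the first scanner that answers wins
                    }
                }
            }
        }
        return null;
    }

    // Example scanner: '#' followed by digits, e.g. an issue id.
    static Scanner issueScanner() {
        return new Scanner() {
            public boolean triggersOn(char c) { return c == '#'; }
            public Match scan(CharSequence text, int index) {
                int j = index + 1;
                while (j < text.length() && Character.isDigit(text.charAt(j))) j++;
                return j > index + 1 ? new Match("ISSUE", index, j) : null;
            }
        };
    }

    // Example scanner on the same trigger: '#' followed by letters or digits.
    static Scanner tagScanner() {
        return new Scanner() {
            public boolean triggersOn(char c) { return c == '#'; }
            public Match scan(CharSequence text, int index) {
                int j = index + 1;
                while (j < text.length() && Character.isLetterOrDigit(text.charAt(j))) j++;
                return j > index + 1 ? new Match("TAG", index, j) : null;
            }
        };
    }
}
```

Because both example scanners trigger on '#', the registration order decides whether "#42" is reported as an issue id or as a generic tag.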

Do not insert HtmlTag when there is already a tag

When the input text is already properly formatted as an HTML link

eg. input= "wow, so example: <a href="http://test.com">http://test.com</a>"

The output would be:
wow, so example: <a href="<a href="http://test.com">http://test.com</a>"><a href="http://test.com">http://test.com</a></a>

It would be great if it detected links that are already properly formatted as HTML and did not add the tags in that case.

Extract Phone numbers (request)

No doubt this library is great, and I appreciate the effort and time you put into writing it.

But would it be possible to add a feature for extracting phone or mobile numbers?

~ Thanks

Creole links contain the link text as well

Hello,

With Creole-style links, the extracted link contains the link text as well.

Here is the creole format:
[[link_address|link text]]

e.g.
[[http://www.google.com/|this is a link to google]]

In this case, the link will be http://www.google.com/|this
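
One way to work around this on the caller's side is to pull the address out of the Creole markup before (or instead of) running generic link extraction. The pattern and class name below are assumptions for illustration, covering only the [[address|text]] and [[address]] forms.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CreoleLink {
    // [[address|text]] or [[address]]; group 1 is the address, group 2 the optional text.
    private static final Pattern LINK =
            Pattern.compile("\\[\\[([^|\\]]+)(?:\\|([^\\]]*))?\\]\\]");

    // Returns the address of the first Creole link in text, or null if there is none.
    static String firstAddress(CharSequence text) {
        Matcher m = LINK.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```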

Thanks,
Tamás