robinst / autolink-java

Java library to extract links (URLs, email addresses) from plain text; fast, small and smart

License: MIT License

Java 100.00%
autolink extraction java-library linkify links url

Contributors: andyklimczak, dependabot[bot], divyagh, mtddk, robinst

autolink-java's Issues

Linkify domain-only links (without scheme or www)

Thanks for your library.

Would you consider adding a LinkType for just domains (eg: example.com)?

There are some cases where this is considered a link and needs to be detected/extracted (eg: Gmail and Twitter detect domains like these as links and turn them into an anchor tag).

Thanks again.

Links in HTML source followed by "), ">, ') or '> return a wrong end index

I tried to extract links from https://fr.yahoo.com/?p=us for a test.

Some of the returned links contained too many characters at the end, for example :

I'm trying to understand UrlScanner's code, but I'm not sure I'll be able to fix it.

I'll send a pull request later if I manage to fix it.

XSS attack question

If I ensure that the input text is free of HTML, is there any vulnerability to XSS attacks?

(I don't have much of an understanding of this type of attack; I just read that it's a potential problem with "linkifying" code.)

Thanks.

Adapt autolink-java to replace rinku in JRuby

Hello! I am working on getting the Discourse app to run in JRuby, and need to replace its dependency on rinku.

There are two ways we typically do this:

  • Port the extension. This shouldn't be difficult, but it seems like you may have already done this work?
  • Wrap a JVM library.

The latter would be preferable, since all we'd need to write is a bit of Ruby to wrap your library.

However, there are a few things that would make this integrate better with JRuby:

  • CharSequence is great, but the API produces String eventually. This means JRuby's byte[]-based Ruby strings need an extra conversion step, which will obviously slow down the rendering of a large document.
  • Compatibility with rinku. I'm not sure how to map the features of rinku to autolink-java and will need some tips here.

Here's a quick and dirty rinku-like wrapper based on the example code from your README. It can serve as a starting point for discussion: https://github.com/headius/jruby-autolink

Discourse on JRuby work: https://meta.discourse.org/t/getting-discourse-running-on-jruby/81273/14
Issue to make a JRuby port of rinku: vmg/rinku#75

Stop URL on < or >

code to reproduce:

String input = "wow <p>http://test.com</p> such linked";
LinkExtractor linkExtractor = LinkExtractor.builder().build();
Iterable<LinkSpan> links = linkExtractor.extractLinks(input);
String result = Autolink.renderLinks(input, links, (link, text, sb) -> {
    sb.append("<a href=\"");
    sb.append(text, link.getBeginIndex(), link.getEndIndex());
    sb.append("\">");
    sb.append(text, link.getBeginIndex(), link.getEndIndex());
    sb.append("</a>");
});
System.out.println(result);

expect:

wow <p><a href="http://test.com">http://test.com</a></p> such linked

actual:

wow <p><a href="http://test.com</p>">http://test.com</p></a> such linked

This seems to be a bug. Does the library support something like rinku's skip_tags feature?
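
As a caller-side stopgap, the reported span could be cut at the first angle bracket before rendering. The following is a minimal sketch using only the JDK; `AngleBracketTrim` and `trimmedEnd` are hypothetical names, not part of autolink-java, and the begin/end indices stand in for what `LinkSpan.getBeginIndex()` and `getEndIndex()` returned in the report above.

```java
final class AngleBracketTrim {
    /**
     * Returns the exclusive end index of the link after cutting it at the
     * first '<' or '>' found in [begin, end). Hypothetical helper, not part
     * of autolink-java.
     */
    static int trimmedEnd(CharSequence text, int begin, int end) {
        for (int i = begin; i < end; i++) {
            char c = text.charAt(i);
            if (c == '<' || c == '>') {
                return i;
            }
        }
        return end;
    }

    public static void main(String[] args) {
        String input = "wow <p>http://test.com</p> such linked";
        int begin = input.indexOf("http://");
        int end = input.indexOf(" such"); // the over-long span from the report
        System.out.println(input.substring(begin, trimmedEnd(input, begin, end)));
    }
}
```

This only shortens spans that were already found; it does not skip the contents of existing tags the way skip_tags does.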

Version 0.10.2 broke binary compatibility

Version 0.10.1 was published with Java 7 as baseline.

$ jarviz bytecode show --gav org.nibor.autolink:autolink:0.10.1
subject: autolink-0.10.1.jar
Unversioned classes. Bytecode version: 51 (Java 7) total: 19

Version 0.10.2 was published with Java 9 as baseline.

$ jarviz bytecode show --gav org.nibor.autolink:autolink:0.10.2
subject: autolink-0.10.2.jar
Unversioned classes. Bytecode version: 53 (Java 9) total: 20

I understand the library wants to move to newer Java versions; as a matter of fact, version 0.11.0 also requires Java 9 as a minimum (👍). However, bumping bytecode compatibility in a build/patch release, even pre-1.0.0, well ... that was unexpected.

Please consider reverting to Java 7 if and only if a 0.10.3 were to be released in the future.
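
For readers unfamiliar with the numbers in the jarviz output: the bytecode version is the major version stored in the first eight bytes of every class file, and for Java 5 and later it equals the Java release plus 44. A minimal sketch of reading it with only the JDK follows; `ClassFileVersion` is a hypothetical helper, and the header bytes in `main` are hand-written for illustration rather than taken from the published jars.

```java
import java.nio.ByteBuffer;

final class ClassFileVersion {
    /** Reads the major version from the first 8 bytes of a class file. */
    static int majorVersion(byte[] header) {
        ByteBuffer buf = ByteBuffer.wrap(header); // big-endian by default
        if (buf.remaining() < 8 || buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file header");
        }
        buf.getShort();                 // minor version, ignored here
        return buf.getShort() & 0xFFFF; // major version as unsigned 16-bit
    }

    /** Maps a major version to the Java release (valid for Java 5 and later). */
    static int javaRelease(int majorVersion) {
        return majorVersion - 44;
    }

    public static void main(String[] args) {
        // Illustrative header: magic, minor 0, major 53.
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 53};
        int major = majorVersion(header);
        System.out.println("major " + major + " -> Java " + javaRelease(major)); // major 53 -> Java 9
    }
}
```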

Possible code injection

When running autolink on text that includes a link like this one

www.google.com"onclick="alert('gotcha!')

and rendering the output as suggested in the example:

String result = Autolink.renderLinks(input, links, (link, text, sb) -> {
    sb.append("<a href=\"");
    sb.append(text, link.getBeginIndex(), link.getEndIndex());
    sb.append("\">");
    sb.append(text, link.getBeginIndex(), link.getEndIndex());
    sb.append("</a>");
});

the output will be

<a href="www.google.com"onclick="alert('gotcha!')">www.google.com"onclick="alert('gotcha!')</a>"

which is strictly speaking invalid HTML, but browsers will still execute the click handler. See https://jsfiddle.net/vLjLLo8n/2/ to try it out.

I understand that appending a subsequence to the StringBuilder is more efficient than providing the link as a String, but to make this secure, you would need to get the substring and perform encoding on it.

So, for example using OWASP Java Encoder, the rendering needs to be done like this:

String result = Autolink.renderLinks(input, links, (link, text, sb) -> {
    String linkString = new StringBuilder().append(text, link.getBeginIndex(), link.getEndIndex()).toString();

    sb.append("<a href=\"");
    sb.append(Encode.forHtmlAttribute(linkString));
    sb.append("\">");
    sb.append(Encode.forHtml(linkString));
    sb.append("</a>");
});

resulting in a safe output:

<a href="www.google.com&#34;onclick=&#34;alert(&#39;gotcha!&#39;)">www.google.com&#34;onclick=&#34;alert(&#39;gotcha!&#39;)</a>

The easiest fix for this particular problem would probably be for autolink not to include single or double quotes, or any other character that is not legal in a URL.
(EDIT: single quotes are legal characters)

A possibly breaking API change would be to provide the linkString as part of the LinkSpan interface.
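
If allocating the intermediate String is the concern, the escaping can also be done character by character straight from the subsequence. The sketch below uses only the JDK and is an illustration of the idea, not a replacement for a vetted encoder such as OWASP Java Encoder; `LinkEscaper`, `appendEscaped`, and `renderAnchor` are hypothetical names, and the begin/end parameters stand in for the LinkSpan indices. It does not validate the URL scheme.

```java
final class LinkEscaper {
    /** Appends text[begin, end) to sb, escaping characters significant in HTML. */
    static void appendEscaped(StringBuilder sb, CharSequence text, int begin, int end) {
        for (int i = begin; i < end; i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&#34;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
    }

    /** Renders an anchor for the span, escaping both the attribute and the body. */
    static String renderAnchor(CharSequence text, int begin, int end) {
        StringBuilder sb = new StringBuilder();
        sb.append("<a href=\"");
        appendEscaped(sb, text, begin, end);
        sb.append("\">");
        appendEscaped(sb, text, begin, end);
        sb.append("</a>");
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "www.google.com\"onclick=\"alert('gotcha!')";
        System.out.println(renderAnchor(input, 0, input.length()));
    }
}
```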

Add jlink-compatible Java9/Jigsaw module-info

The commonmark-ext-autolink extension in https://github.com/commonmark/commonmark-java depends on this library. I quickly studied the classes and it seems that internal.Scanner is the only class that might be needed outside this module, but then again, the LinkExtractor doesn't seem to be designed to be extended. My suggestion is just to provide a simple module-info.java along the lines of:

module org.nibor.autolink {
    exports org.nibor.autolink;
}

For building and deploying multi-release jars, I've seen Maven projects use maven-jar-plugin or biz.aQute.bnd.bnd-maven-plugin.

URL containing a single quote in the middle results in unexpected ending

With the following input: http://example.org/"_(foo)

Currently the extracted link is this: http://example.org/"_(foo

The closing parenthesis is not included, even though it's balanced. The reason is that we check all "unfinished" brackets and quotes in one condition at the end of the loop instead of just when the corresponding character happened. So when we get to ), we're still in the "unfinished" state because of the single quote.
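
The alternative described here, deciding at each closing character from that delimiter's own state, can be sketched roughly as follows. This is a simplified illustration using only the JDK, not the library's UrlScanner: `BalancedEnd` and `linkEnd` are hypothetical names, only parentheses and quotes are tracked, and trailing punctuation is not handled.

```java
final class BalancedEnd {
    /** Returns the exclusive end index of a link starting at begin. */
    static int linkEnd(CharSequence text, int begin) {
        int parenDepth = 0;
        boolean doubleQuote = false;
        boolean singleQuote = false;
        int last = begin; // end of the longest prefix known to be safe to include
        for (int i = begin; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                break;
            } else if (c == '(') {
                parenDepth++; // opener is included only if something follows it
            } else if (c == ')') {
                if (parenDepth == 0) {
                    break; // unbalanced closer ends the link
                }
                parenDepth--;
                last = i + 1; // decided from the paren state alone, quotes ignored
            } else if (c == '"') {
                doubleQuote = !doubleQuote;
                if (!doubleQuote) {
                    last = i + 1; // closing quote completes its pair
                }
            } else if (c == '\'') {
                singleQuote = !singleQuote;
                if (!singleQuote) {
                    last = i + 1;
                }
            } else {
                last = i + 1; // ordinary character
            }
        }
        return last;
    }

    public static void main(String[] args) {
        String input = "http://example.org/\"_(foo)";
        System.out.println(input.substring(0, linkEnd(input, 0)));
    }
}
```

With this approach an unfinished quote earlier in the URL no longer prevents a later balanced `)` from being included.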

Dealing with | symbol

Hey Robin
We're using your library; however, I'd like to modify it to account for | symbols and www.
We have templates with urls such as
http://test.com|test site and sometimes we get things like www.blah

The lib is extracting the first as http://test.com|test and doesn't deal with the second.

I'll fix this myself, but are you interested in pull requests related to these?

Cheers

Support automatic linking of git/GitHub references

In the context of gitbucket/gitbucket#1323 and extending commonmark-java, I'd like to add the ability to discover links from:

  • git SHA1 references
  • issues & PR ids

The "autolinked references" I'd like to detect are those described in the GitHub documentation.

I am currently implementing this, but it requires some internal changes in the project because several scanners could be triggered by the same character set (for example by '@' or [a-zA-Z]). So I'd like to know whether you would accept such changes. The changes I have in mind:

  • registration of Scanners would be ordered (to establish some precedence)
  • each scanner would be responsible for knowing which characters trigger it (inversion of responsibility between Scanner & LinkExtractor)
  • for each scan (i.e. LinkIterator#setNext()), the first Scanner returning a Link would win
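
To make the proposal concrete, here is a rough sketch of the three points using only the JDK. Every type in it is hypothetical and for discussion only: `OrderedScanners`, its nested `Scanner` interface and `Span` class, and the `hashRun` demo factory are not the library's existing internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.IntPredicate;

final class OrderedScanners {
    /** Hypothetical result type: a labelled [begin, end) range. */
    static final class Span {
        final String type;
        final int begin;
        final int end;
        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Hypothetical contract: each scanner knows which characters trigger it. */
    interface Scanner {
        boolean triggersOn(char c);
        Optional<Span> scan(CharSequence input, int triggerIndex);
    }

    private final List<Scanner> scanners = new ArrayList<>();

    /** Registration order establishes precedence. */
    OrderedScanners register(Scanner scanner) {
        scanners.add(scanner);
        return this;
    }

    /** Finds the next span at or after from; the first scanner returning one wins. */
    Optional<Span> next(CharSequence input, int from) {
        for (int i = from; i < input.length(); i++) {
            char c = input.charAt(i);
            for (Scanner scanner : scanners) {
                if (scanner.triggersOn(c)) {
                    Optional<Span> span = scanner.scan(input, i);
                    if (span.isPresent()) {
                        return span;
                    }
                }
            }
        }
        return Optional.empty();
    }

    /** Demo scanner: '#' followed by one or more characters accepted by member. */
    static Scanner hashRun(String type, IntPredicate member) {
        return new Scanner() {
            @Override
            public boolean triggersOn(char c) {
                return c == '#';
            }

            @Override
            public Optional<Span> scan(CharSequence input, int triggerIndex) {
                int end = triggerIndex + 1;
                while (end < input.length() && member.test(input.charAt(end))) {
                    end++;
                }
                return end > triggerIndex + 1
                        ? Optional.of(new Span(type, triggerIndex, end))
                        : Optional.empty();
            }
        };
    }
}
```

Registering `hashRun("issue", Character::isDigit)` before `hashRun("hashtag", Character::isLetter)` means both scanners are consulted on every '#', with the issue scanner taking precedence whenever it produces a span.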

Do not insert HtmlTag when there is already a tag

When the input text is already properly formatted as an HTML link

eg. input= "wow, so example: <a href="http://test.com">http://test.com</a>"

The output would be:
wow, so example: <a href="<a href="http://test.com">http://test.com</a>"><a href="http://test.com">http://test.com</a></a>

It would be great if it detected links that are already properly formatted as HTML and did not add the tags again in that case.
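
One possible caller-side approach is to find the ranges already wrapped in anchor tags and skip any extracted span that overlaps them. The sketch below uses only the JDK; `ExistingAnchors` and its methods are hypothetical names, the begin/end indices stand in for LinkSpan indices, and the regular expression is a rough approximation rather than a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExistingAnchors {
    // Rough pattern for an existing <a ...>...</a> element; not a full HTML parser.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\b[^>]*>.*?</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the [begin, end) ranges of the input already wrapped in anchor tags. */
    static List<int[]> anchorRanges(CharSequence html) {
        List<int[]> ranges = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            ranges.add(new int[] {m.start(), m.end()});
        }
        return ranges;
    }

    /** True if the span [begin, end) overlaps any existing anchor range. */
    static boolean insideAnchor(List<int[]> ranges, int begin, int end) {
        for (int[] r : ranges) {
            if (begin < r[1] && end > r[0]) {
                return true;
            }
        }
        return false;
    }
}
```

A renderer could then append the original text unchanged for spans where `insideAnchor` returns true and wrap only the remaining ones.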

Extract Phone numbers (request)

No doubt this library is great, and I appreciate the time and effort you put into writing it.

Would it be possible to add extraction of phone or mobile numbers?

~ Thanks
