robinst / autolink-java
Java library to extract links (URLs, email addresses) from plain text; fast, small and smart
License: MIT License
Thanks for your library.
Would you consider adding a LinkType for just domains (e.g. example.com)?
There are some cases where this is considered a link and needs to be detected/extracted (eg: Gmail and Twitter detect domains like these as links and turn them into an anchor tag).
Thanks again.
Hi,
URLs with a repeated scheme such as "https://https://" are extracted as they are. Can we exclude the duplicated "https://"? The unit test cases below fail for such URLs.
assertLinked("https://https://abc.com/","|https://abc.com/|");
assertLinked("http://http://abc.com/","|http://abc.com/|");
assertLinked("ftp://ftp://abc.com/","|ftp://abc.com/|");
Thx
Vin
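A possible workaround until the extractor handles this itself is to collapse the repeated scheme before calling extractLinks. The sketch below is an assumption about how a caller might preprocess the input, not part of autolink-java: the class name DuplicateSchemeCleaner and the method collapse are hypothetical, and only http, https and ftp are covered. Because the text is modified, the spans returned afterwards refer to the cleaned text rather than the original input.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of autolink-java.
final class DuplicateSchemeCleaner {
    // A scheme prefix that is immediately followed by the same scheme prefix again.
    private static final Pattern REPEATED_SCHEME =
            Pattern.compile("\\b(https?|ftp)://(?=\\1://)", Pattern.CASE_INSENSITIVE);

    // Removes every scheme prefix that is directly repeated, keeping the last one.
    static String collapse(String text) {
        return REPEATED_SCHEME.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(collapse("https://https://abc.com/")); // prints https://abc.com/
        System.out.println(collapse("ftp://ftp://abc.com/"));     // prints ftp://abc.com/
    }
}
```

Mixed prefixes such as http://https:// are left untouched by this pattern, since it is not obvious which scheme the author intended.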
Hi Author,
I love this framework; it is very nice and robust. I recently found that parsing a non-clickable URL results in a partial URL.
Can you suggest a workaround for this issue?
Input: https://us[.]quarantine[.]abc[.]com/notify/
Output: https://us
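One workaround that might be enough here is to undo the [.] notation before extraction. This is only a sketch under the assumption that [.] is the sole obfuscation in the input; the class Refang and its method are hypothetical names, not library API, and the extracted spans would then refer to the rewritten text.

```java
// Hypothetical helper, not part of autolink-java.
final class Refang {
    // Replaces every "[.]" with a plain dot so the URL is contiguous again.
    static String refang(String text) {
        return text.replace("[.]", ".");
    }

    public static void main(String[] args) {
        // prints https://us.quarantine.abc.com/notify/
        System.out.println(refang("https://us[.]quarantine[.]abc[.]com/notify/"));
    }
}
```

If the non-clickable form has to be preserved in the output, the caller would need to map the spans back to the original text, which this sketch does not attempt.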
I tried to extract links from https://fr.yahoo.com/?p=us for a test.
Some of the returned links contained too many characters at the end, for example:
I'm trying to understand UrlScanner's code, but I'm not sure I'll be able to fix it.
I'll send a pull request later if I manage to fix it.
If I ensure that the input text is free of HTML, is there any vulnerability to XSS attacks?
(I don't have much of an understanding of this type of attack; I just read that it's a potential problem with "linkifying" code.)
Thanks.
See commonmark/commonmark-java#99, the following examples should not result in any links:
http://.
http://"
http://<space>
Note that http:// and http://. are valid URLs according to RFC 3986, because authority can be zero or more unreserved characters. But we don't autolink http:// on its own or the trailing . of http://example.org.
I have been trying to extract the separate URLs from
String res = " https://www.gooogle.com,https://facebook.com and www.googlle.com";
Now this gives me the result.
But the required result must be
It seems this library does not split links that are separated only by commas.
Can anyone help with this?
Thank You.
Hello! I am working on getting the Discourse app to run in JRuby, and need to replace its dependency on rinku.
There are two ways we typically do this:
The latter would be preferable, since all we'd need to write is a bit of Ruby to wrap your library.
However, there are a few things that would make this integrate better with JRuby:
Here's a quick and dirty rinku-like wrapper based on your example code from README. It can serve as a place to start discussing: https://github.com/headius/jruby-autolink
Discourse on JRuby work: https://meta.discourse.org/t/getting-discourse-running-on-jruby/81273/14
Issue to make a JRuby port of rinku: vmg/rinku#75
Code to reproduce:
String input = "wow <p>http://test.com</p> such linked";
LinkExtractor linkExtractor = LinkExtractor.builder().build();
Iterable<LinkSpan> links = linkExtractor.extractLinks(input);
String result = Autolink.renderLinks(input, links, (link, text, sb) -> {
    // Wrap each extracted link in an anchor tag, copying the link text verbatim.
    sb.append("<a href=\"");
    sb.append(text, link.getBeginIndex(), link.getEndIndex());
    sb.append("\">");
    sb.append(text, link.getBeginIndex(), link.getEndIndex());
    sb.append("</a>");
});
System.out.println(result);
Expected:
wow <p><a href="http://test.com">http://test.com</a></p> such linked
Actual:
wow <p><a href="http://test.com</p>">http://test.com</p></a> such linked
This seems to be a bug. Do we support the skip_tags feature of rinku?
Version 0.10.1 was published with Java 7 as baseline.
$ jarviz bytecode show --gav org.nibor.autolink:autolink:0.10.1
subject: autolink-0.10.1.jar
Unversioned classes. Bytecode version: 51 (Java 7) total: 19
Version 0.10.2 was published with Java 9 as baseline.
$ jarviz bytecode show --gav org.nibor.autolink:autolink:0.10.2
subject: autolink-0.10.2.jar
Unversioned classes. Bytecode version: 53 (Java 9) total: 20
I understand the library wants to move to newer Java versions; as a matter of fact, version 0.11.0 also requires Java 9 as a minimum (👍). However, bumping bytecode compatibility in a build/patch release, even pre-1.0.0, well ... that was unexpected.
Please consider reverting to Java 7 if and only if a 0.10.3 were to be released in the future.
When using autolink on a text that includes a link like this one
www.google.com"onclick="alert('gotcha!')
and rendering the output as suggested in the example:
String result = Autolink.renderLinks(input, links, (link, text, sb) -> {
    sb.append("<a href=\"");
    sb.append(text, link.getBeginIndex(), link.getEndIndex());
    sb.append("\">");
    sb.append(text, link.getBeginIndex(), link.getEndIndex());
    sb.append("</a>");
});
the output will be
<a href="www.google.com"onclick="alert('gotcha!')">www.google.com"onclick="alert('gotcha!')</a>"
which is strictly speaking invalid HTML, but browsers will still execute the click handler. See https://jsfiddle.net/vLjLLo8n/2/ to try it out.
I understand that appending a subsequence to the StringBuilder is more efficient than providing the link as a String, but to make this secure, you would need to get the substring and encode it.
So, for example using OWASP Java Encoder, the rendering needs to be done like this:
String result = Autolink.renderLinks(input, links, (link, text, sb) -> {
    // Copy the link out of the input so it can be encoded before it is written.
    String linkString = new StringBuilder().append(text, link.getBeginIndex(), link.getEndIndex()).toString();
    sb.append("<a href=\"");
    sb.append(Encode.forHtmlAttribute(linkString));
    sb.append("\">");
    sb.append(Encode.forHtml(linkString));
    sb.append("</a>");
});
resulting in a safe output:
<a href="www.google.com&#34;onclick=&#34;alert(&#39;gotcha!&#39;)">www.google.com&#34;onclick=&#34;alert(&#39;gotcha!&#39;)</a>
The easiest fix for this particular problem would probably be for autolink not to include single or double quotes, or any other character that is not legal in a URL.
(EDIT: single quotes are legal characters)
A possibly breaking API change would be to provide the linkString as part of the LinkSpan interface.
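For callers who cannot add a dependency such as OWASP Java Encoder, the same idea can be sketched with a small hand-written escaper that works directly on the begin and end indexes of a span. The class HtmlEscape and its methods are hypothetical names introduced for illustration, not part of autolink-java, and the sketch only escapes the five characters shown; it does not validate the URL itself.

```java
// Hypothetical helper, not part of autolink-java or OWASP Java Encoder.
final class HtmlEscape {
    // Escapes the characters that can break out of HTML text or a quoted attribute.
    static String escape(CharSequence text, int begin, int end) {
        StringBuilder sb = new StringBuilder(end - begin);
        for (int i = begin; i < end; i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&#34;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Builds an anchor tag for text[begin, end) with both occurrences escaped.
    static String anchor(CharSequence text, int begin, int end) {
        String escaped = escape(text, begin, end);
        return "<a href=\"" + escaped + "\">" + escaped + "</a>";
    }

    public static void main(String[] args) {
        String input = "www.google.com\"onclick=\"alert('gotcha!')";
        System.out.println(anchor(input, 0, input.length()));
    }
}
```

Inside a renderLinks callback, the anchor call would take link.getBeginIndex() and link.getEndIndex() as its index arguments, so no intermediate substring of the raw link is needed.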
This kind of link doesn't work, for example: github.com/robinst/autolink-java/
The commonmark-ext-autolink extension in https://github.com/commonmark/commonmark-java depends on this library. I quickly studied the classes and it seems that internal.Scanner is the only class that might be needed outside this module, but then again, the LinkExtractor doesn't seem to be designed to be extended. My suggestion is just to provide a simple module-info.java along the lines of:
module org.nibor.autolink {
    exports org.nibor.autolink;
}
For building and deploying multi-release jars, I've seen Maven projects use maven-jar-plugin or biz.aQute.bnd.bnd-maven-plugin.
With the following input: http://example.org/"_(foo)
Currently the extracted link is this: http://example.org/"_(foo
The closing parenthesis is not included, even though it's balanced. The reason is that we check all "unfinished" brackets and quotes in one condition at the end of the loop instead of just when the corresponding character happened. So when we get to ), we're still in the "unfinished" state because of the single quote.
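The difference can be illustrated with a small standalone scanner that decides whether a character may end the link at the moment that character is seen, rather than in one combined condition. This is a simplified sketch and not the library's UrlScanner: the class BalancedEnd and the method findEnd are hypothetical, and only parentheses and double quotes are tracked.

```java
// Hypothetical sketch, not the actual UrlScanner implementation.
final class BalancedEnd {
    // Returns the exclusive end index of a link candidate that starts at index 0.
    static int findEnd(String s) {
        int round = 0;            // number of '(' still waiting for a ')'
        boolean inQuote = false;  // true while a '"' is still unclosed
        int end = 0;              // end of the longest prefix that may be a link
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) {
                break;
            }
            switch (c) {
                case '(':
                    round++;                 // an opening bracket cannot end a link
                    break;
                case ')':
                    if (round == 0) {
                        return end;          // unmatched closer: the link stops before it
                    }
                    round--;
                    end = i + 1;             // matched closer: judged on brackets only
                    break;
                case '"':
                    if (inQuote) {
                        end = i + 1;         // closing quote may end the link
                    }
                    inQuote = !inQuote;
                    break;
                default:
                    end = i + 1;
            }
        }
        return end;
    }

    public static void main(String[] args) {
        String s = "http://example.org/\"_(foo)";
        System.out.println(s.substring(0, findEnd(s)));
    }
}
```

With this structure the unclosed quote no longer prevents the balanced ) from being accepted, because the quote state is only consulted when a quote character is encountered.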
Hey Robin
We're using your library, but I'd like to modify it to account for | symbols and www.
We have templates with URLs such as
http://test.com|test site, and sometimes we get things like www.blah
The lib extracts the first as http://test.com|test and doesn't handle the second.
I'll fix this myself, but are you interested in pull requests related to these?
Cheers
Using:
LinkExtractor.builder().linkTypes(setOf(LinkType.WWW, LinkType.URL)).build()
and then extractor.extractLinks()
"man...http://i.imgur.com/rPRnI.jpg"
is parsed as a URL, which seems wrong to me (not sure about the specs, though)
In the context of gitbucket/gitbucket#1323 and extending commonmark-java, I'd like to add the possibility to discover links from:
The "autolinked references" I'd like to detect are those described in the GitHub documentation.
I am currently implementing this, but it requires some internal changes in the project, because several scanners could be triggered for the same character set (for example for '@' or [a-zA-Z]). Thus I'd like to know whether you would accept such changes. The changes I have in mind:
Scanner & LinkExtractor)
LinkIterator#setNext()), first Scanner answering a Link would win
When the input text is already properly formatted as an HTML link,
e.g. input = "wow, so example: <a href="http://test.com">http://test.com</a>"
The output would be:
wow, so example: <a href="<a href="http://test.com">http://test.com</a>"><a href="http://test.com">http://test.com</a></a>
It would be great if it detected links that are already properly formatted as HTML and didn't add the tags again in this case.
No doubt this library is great, and I appreciate the time and effort you put into writing it.
But would it be possible to add phone number or mobile number extraction?
~ Thanks
Hello,
With Creole-style links, the extracted link contains the link text as well.
Here is the Creole format:
[[link_address|link text]]
e.g.
[[http://www.google.com/|this is a link to google]]
In this case, the link will be http://www.google.com/|this
Thanks,
Tamás
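Until the scanner treats | as a delimiter, a caller could shorten each extracted span at the first pipe character. The sketch below assumes that a literal | never belongs to the intended URL, which is consistent with RFC 3986 not allowing it unencoded; the class PipeTrim and the method endBeforePipe are hypothetical names, not library API.

```java
// Hypothetical post-processing helper, not part of autolink-java.
final class PipeTrim {
    // Returns the index of the first '|' inside text[begin, end), or end if there is none.
    static int endBeforePipe(CharSequence text, int begin, int end) {
        for (int i = begin; i < end; i++) {
            if (text.charAt(i) == '|') {
                return i;
            }
        }
        return end;
    }

    public static void main(String[] args) {
        String text = "[[http://www.google.com/|this is a link to google]]";
        int begin = text.indexOf("http");       // where the extracted span starts
        int end = text.indexOf(" is a link");   // where the extracted span currently ends
        // prints http://www.google.com/
        System.out.println(text.substring(begin, endBeforePipe(text, begin, end)));
    }
}
```

The begin and end values would normally come from getBeginIndex() and getEndIndex() of the extracted span; they are computed by hand here only to keep the example self-contained.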