Comments (11)
Hey! Sorry for the late reply (I was on holidays).
It looks like you're trying to extract links from HTML source? That's not a goal of this library at the moment. It's for extracting links from plain text written by humans, such as markup formats.
If you want to extract links from the text contents of HTML, I recommend using e.g. jsoup to parse the HTML, and then running autolink-java on the text nodes.
Note that rinku (which was the inspiration for this library) tries to detect HTML tags to exclude them. I would be open to having an option to enable such a feature, in case you want to contribute it. See here how this is implemented in rinku.
from autolink-java.
A very dirty solution, but adding a replaceAll("[\"'].*", "") to the resulting URL fixes this problem. Library-wise, it also wouldn't be too difficult to stop whenever one of those two characters is detected.
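For illustration, the workaround could look roughly like the sketch below. The class and method names (QuoteTruncation, truncateAtQuote) are hypothetical and not part of autolink-java; the input strings are made-up examples. Note that the same call also cuts a URL that legitimately contains an apostrophe.

```java
class QuoteTruncation {
    // Hypothetical post-processing helper (not part of autolink-java):
    // drops everything from the first single or double quote onwards.
    static String truncateAtQuote(String url) {
        return url.replaceAll("[\"'].*", "");
    }

    public static void main(String[] args) {
        // Trailing HTML attribute residue is removed.
        System.out.println(truncateAtQuote("https://example.com/page\" class=\"link\">"));
        // A URL with an apostrophe in its path is shortened as well.
        System.out.println(truncateAtQuote("https://en.wiktionary.org/wiki/it's"));
    }
}
```

The second call returns https://en.wiktionary.org/wiki/it, which is the kind of false positive a quote-based cutoff has to accept.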
Ok, I've taken a stab at implementing this.
@manuelleduc and @frmz, feel free to look at the test cases in PR #4 and give feedback. Basically, all the above cases now work as expected, except this one: https://s.yimg.com/os/uh-icons/0.1.16/uh/fonts/uh.eot?);src:url(https://s.yimg.com/os/uh-icons/0.1.16/uh/fonts/uh.eot?#iefix
That works, however I don't understand why you cannot just stop when a " or ' is found. I mean, a standard, properly encoded URL shouldn't contain those two characters. Do you have situations where URLs contain single or double quotes?
@frmz Sure, how about this one?: https://en.wiktionary.org/wiki/it's (note how GitHub recognizes it)
Both RFC 3986 and 1738 allow '. If you can point me to a document that says otherwise, I'd love to read it. The situation for " is less clear, though it seems like they should be treated the same.
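One way to probe the difference between the two characters is to hand both example URLs to a strict parser. The sketch below uses java.net.URI from the JDK, which follows the older RFC 2396 grammar rather than RFC 3986, so it is only an approximation of what the specs say; the class name UriQuoteCheck and the helper isValid are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriQuoteCheck {
    // Returns true if java.net.URI accepts the string without escaping.
    static boolean isValid(String candidate) {
        try {
            new URI(candidate);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The apostrophe is a legal unescaped character for this parser.
        System.out.println(isValid("https://en.wiktionary.org/wiki/it's"));
        // The unescaped double quote is rejected as an illegal character.
        System.out.println(isValid("https://en.wiktionary.org/wiki/\"_\""));
    }
}
```

This only shows how one parser treats the two characters; it says nothing about what people actually type in plain text.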
Yeah, you are right, I was actually thinking about double quotes ("). Single quotes in HTML shouldn't really be used (although they are). Even though I found quite a lot of links with single quotes around them, I did not find any with an unescaped double quote inside, so I would argue that at least " can be safely considered a delimiter.
Not sure. What about this?: https://en.wiktionary.org/wiki/"_"
If you look at the source of this very page, you will see that inside the href the link is correctly escaped as "https://en.wiktionary.org/wiki/%22_%22"; the quotes appear only in the link text, which is not the link itself (of course, a text dump of this page will still have the quotes). The real problem I see is that my browser does not escape them in the URL bar, so " might still be a usable character.
Anyway, I am still fine with the implementation above, which works in most cases, even if a better solution when scraping HTML would probably be something like s/href="([^"]*)"/\1/ to get the link (but I see we might be going off topic here, since the library is a generic implementation).
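As a rough sketch, the href substitution above could be written in Java like this. The class and method names (HrefScraper, extractHrefs) are hypothetical, and the pattern only covers double-quoted attributes, so it is a quick scraping aid under that assumption and not a substitute for an HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefScraper {
    // Same idea as s/href="([^"]*)"/\1/: capture everything between
    // the double quotes that follow href=.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Hypothetical helper: collects the captured attribute values in order.
    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"https://en.wiktionary.org/wiki/%22_%22\">\"_\"</a>";
        System.out.println(extractHrefs(html));
    }
}
```

Because the pattern is anchored on double quotes, an attribute written with single quotes is not matched at all.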
Sure, for HTML this is mostly true, although the following also works (not sure if it is actually valid according to the spec): <a href='https://en.wiktionary.org/wiki/"_"'>test</a>.
But, as I've said in my first comment: This library is not about extracting links from HTML. Use an HTML parser for that. This library is for extracting links from plain text that a user might write, such as markup text or a GitHub comment. If it happens to also work for some forms of HTML, that is fine, but it is not an explicit goal.
Merged PR #4, closing this.
These changes are now released in version 0.4.0.