unbescape / unbescape
Advanced yet easy to use escaping library for Java
License: Apache License 2.0
Any plans to add support for XML attributes? In particular, newline encoding support?
HtmlEscape.escapeHtml5("Æ") returns "&AElig", but should return "&AElig;".
Tracked this down to a comparison in HtmlEscapeSymbols:
Since &AElig; is the lexicographically smallest NCR for HTML 5, it is sorted to index 0 in ncrsOrdered. Thus, NCRS_BY_CODEPOINT[198] is set to 0 for &AElig;. Next comes &AElig (without the semicolon) at index 1 in ncrsOrdered. The comparison detects that NCRS_BY_CODEPOINT[198] is 0 (because it was set to 0), but assumes that this 0 must be NO_NCR, and overwrites the entry. This eventually causes the missing semicolon when escaping Æ.
Maybe NO_NCR should be set to a different value to avoid mixing it up with the valid index 0.
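The sentinel suggestion can be illustrated with a minimal sketch. This is not the actual HtmlEscapeSymbols code: the class name, the buildTable method and the choice of -1 as sentinel are assumptions made for illustration, and the sketch assumes the first NCR in sorted order is the one to keep.

```java
import java.util.Arrays;

final class NcrIndexSketch {
    // Hypothetical sentinel: -1 can never be a valid array index,
    // so it cannot be confused with the NCR stored at index 0.
    static final short NO_NCR = -1;

    // codepointsOfSortedNcrs[i] is the code point that the i-th NCR
    // (in sorted order) stands for. The returned table maps each code
    // point to the index of the first NCR registered for it.
    static short[] buildTable(int[] codepointsOfSortedNcrs, int tableSize) {
        short[] table = new short[tableSize];
        Arrays.fill(table, NO_NCR);
        for (short i = 0; i < codepointsOfSortedNcrs.length; i++) {
            int codepoint = codepointsOfSortedNcrs[i];
            if (table[codepoint] == NO_NCR) {
                table[codepoint] = i; // later NCRs do not overwrite index 0
            }
        }
        return table;
    }
}
```

With a sentinel of 0, the check in the loop could not tell "no entry yet" apart from "entry is the NCR at index 0", which is the confusion described above.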
CssEscape.unescapeCss("here\\\nthere")
outputs
here\
there
but based on the reading of the CSS spec it seems like this should be
herethere
https://www.w3.org/TR/CSS2/syndata.html#escaped-characters
"First, inside a string, a backslash followed by a newline is ignored (i.e., the string is deemed not to contain either the backslash or the newline)"
U+110000 is not a valid code point. It exceeds Plane 16. However, HTML in the wild is often dirty, and we've seen poorly formed HTML documents that contain hexadecimal escapes that are much larger than the valid range.
This particular case:
HtmlEscape.unescapeHtml("&#x110000;");
fails with an IllegalArgumentException thrown by Character.toChars. However, it would be nice if unescaping either ignored these escapes, as it does for unknown symbolic escapes such as HtmlEscape.unescapeHtml("&uumqqq;"); (which yields "&uumqqq;" as output), or replaced them with the Unicode replacement character (U+FFFD).
I'm happy to submit a patch, but I was curious which behavior would be preferred for this edge case.
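The second option (substituting U+FFFD) could look roughly like the sketch below. The class and method names are hypothetical and this is not how unbescape is implemented; it only shows that a validity check before Character.toChars avoids the exception.

```java
final class CodePointSketch {
    // Unicode replacement character, used for out-of-range references.
    static final int REPLACEMENT = 0xFFFD;

    // Converts a parsed numeric reference to text, substituting U+FFFD
    // when the value lies outside the range U+0000..U+10FFFF.
    static String toText(int codepoint) {
        int safe = Character.isValidCodePoint(codepoint) ? codepoint : REPLACEMENT;
        return new String(Character.toChars(safe));
    }
}
```

The other option, leaving the reference untouched, would instead copy the original characters of the escape to the output when the check fails.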
Adding an Automatic-Module-Name to the MANIFEST.MF file will allow us to reserve the module name for future use. See http://blog.joda.org/2017/05/java-se-9-jpms-automatic-modules.html
Completely modularising the library could cause problems in some scenarios today, as the jar file would contain at least one file (module-info.class) at Java 9 bytecode level, which can still be troublesome. See http://blog.joda.org/2018/03/jpms-negative-benefits.html
There are at least two approaches to module naming (project-name-based and reverse-DNS-based, see this), but we will go for project-based as it is simpler, and it is the approach recommended by Mark Reinhold (see this) and adopted by the Spring family of projects.
So the module name will be unbescape.
Hi author!
I have:
{"GetMusicTempResult":"{"ErrorCode":0, "ErrorMessage":"Thành Công", "Data" : [{"album_id" : "035BBCB36","music_url" : "/temp/zmnepemsqntotmndhn2vstt5/langletonthuong_mrsiro.mp3","music_picture" : "/PictureLip/Album/V-PopHotMp3Icon.png","music_title" : " Lặng Lẽ Tổn Thương ","music_performers" : "Mr. Siro","music_listen" : "265","music_lyric" : "(Vậy mà em vẫn cứ ngây thơ
Giờ đây em có vui đâu
Vậy mà em vẫn cứ ngây thơ...)
Tại sao chúng mình, không nhận ra nhau sớm hơn
Để anh có thể mạnh mẽ đến gần em, và
Thì thầm lời hứa, em sẽ không phải khóc
Chẳng như hôm nay, chỉ biết im lặng thôi
Từ giây phút đầu, em nở nụ cười rất tươi
Mà chỉ anh thấy, em vẫn không hạnh phúc, chút nào
Và anh biết, người ở bên cạnh em bây giờ
Không hề quan tâm đến cảm giác em như ngày xưa
ĐK1:
Vậy mà vẫn cứ ngây thơ, em vẫn bên ai dại khờ
Giờ đây em có vui đâu, xé nát tim anh nhiều lần
Anh không chấp nhận nhìn em cố níu tay ai
Anh xót xa trăm ngàn lần
Tình yêu anh muốn trao em
Nhưng chắc duyên ta phải dừng khi chưa bắt đầu
Xin đừng để anh phải thấy em buồn
(Xin đừng để anh phải thấy em buồn)
(Nhìn nụ cười buồn nở trên môi em, mà lòng này chợt nhận ra con tim, đã yêu em thật rồi
When I see you cry, I love you more and more)
-----
Bản thân anh đã nhiều lần tự hỏi
Trong tim em anh chẳng là gì cả
Sao phải quan tâm em quá như là
"Em xem anh là tất cả, niềm tin trong tim"
Nhưng khi anh chợt tỉnh giấc chỉ mình anh mơ
ĐK2:
Vậy mà vẫn cứ ngây thơ, em vẫn bên ai dại khờ
Giờ đây em có vui đâu, xé nát tim anh nhiều lần
Anh không chấp nhận nhìn em cố níu tay ai
Anh xót xa trăm ngàn lần
Tình yêu anh muốn trao em
Nhưng chắc duyên ta phải dừng khi chưa bắt đầu
Xin đừng để anh phải thấy em buồn
(Có lẽ đã không còn cần một người như anh để che chở)
(Và đừng quên anh từng ở bên em...) ","cat_id" : "1","cat_name" : "V-Pop","performers_id" : "C5E60D989"}]}"}
After parse:
{"album_id" : "035BBCB36","music_url" : "/temp/unnxccplluaogwx01dckf5en/langletonthuong_mrsiro.mp3","music_picture" : "/PictureLip/Album/V-PopHotMp3Icon.png","music_title" : " Lặng Lẽ Tổn Thương ","music_performers" : "Mr. Siro","music_listen" : "266","music_lyric" : "(Vậy mà em vẫn cứ ngây thơ
Giờ đây em có vui đâu
Vậy mà em vẫn cứ ngây thơ...)
Tại sao chúng mình, không nhận ra nhau sớm hơn
Để anh có thể mạnh mẽ đến gần em, và
Thì thầm lời hứa, em sẽ không phải khóc
Chẳng như hôm nay, chỉ biết im lặng thôi
Từ giây phút đầu, em nở nụ cười rất tươi
Mà chỉ anh thấy, em vẫn không hạnh phúc, chút nào
Và anh biết, người ở bên cạnh em bây giờ
Không hề quan tâm đến cảm giác em như ngày xưa
ĐK1:
Vậy mà vẫn cứ ngây thơ, em vẫn bên ai dại khờ
Giờ đây em có vui đâu, xé nát tim anh nhiều lần
Anh không chấp nhận nhìn em cố níu tay ai
Anh xót xa trăm ngàn lần
Tình yêu anh muốn trao em
Nhưng chắc duyên ta phải dừng khi chưa bắt đầu
Xin đừng để anh phải thấy em buồn
(Xin đừng để anh phải thấy em buồn)
(Nhìn nụ cười buồn nở trên môi em, mà lòng này chợt nhận ra con tim, đã yêu em thật rồi
When I see you cry, I love you more and more)
-----
Bản thân anh đã nhiều lần tự hỏi
Trong tim em anh chẳng là gì cả
Sao phải quan tâm em quá như là
"Em xem anh là tất cả, niềm tin trong tim"
Nhưng khi anh chợt tỉnh giấc chỉ mình anh mơ
ĐK2:
Vậy mà vẫn cứ ngây thơ, em vẫn bên ai dại khờ
Giờ đây em có vui đâu, xé nát tim anh nhiều lần
Anh không chấp nhận nhìn em cố níu tay ai
Anh xót xa trăm ngàn lần
Tình yêu anh muốn trao em
Nhưng chắc duyên ta phải dừng khi chưa bắt đầu
Xin đừng để anh phải thấy em buồn
(Có lẽ đã không còn cần một người như anh để che chở)
(Và đừng quên anh từng ở bên em...) ","cat_id" : "1","cat_name" : "V-Pop","performers_id" : "C5E60D989"}
The error is at: "Em xem anh là tất cả, niềm tin trong tim"
Please fix the library so that it can parse: "Em xem anh là tất cả, niềm tin trong tim"
The JSON can be checked with: http://jsonlint.com/
This is related to my earlier reported bug #7, but is actually more extreme.
The following statement:
HtmlEscape.unescapeHtml("�");
results in a stack trace that looks similar to the following:
java.lang.ArrayIndexOutOfBoundsException: 476792037
at org.unbescape.html.HtmlEscapeUtil.unescape(HtmlEscapeUtil.java:683)
at org.unbescape.html.HtmlEscape.unescapeHtml(HtmlEscape.java:616)
...
I've tracked this down to the way that unbescape uses a custom string-to-integer conversion which doesn't do bounds checking and can cause integer overflow. Specifically, it's the HtmlEscapeUtil.parseIntFromReference method. This can return a negative number, which unbescape then later uses as an indicator for a so-called DOUBLE_CODEPOINT or a symbolic entity reference that gets translated into two Unicode characters.
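A bounds-checked conversion could be sketched as follows. This is not the library's parseIntFromReference; the class and method names are hypothetical, and returning an empty OptionalInt for bad input is an assumption chosen so that failure can never be mistaken for a negative marker value.

```java
import java.util.OptionalInt;

final class ReferenceParseSketch {
    // Parses the digits of a numeric character reference in the given
    // radix (10 or 16). Returns empty when the text is empty, contains a
    // non-digit, or denotes a value above Character.MAX_CODE_POINT.
    static OptionalInt parseCodePoint(CharSequence digits, int radix) {
        if (digits.length() == 0) {
            return OptionalInt.empty();
        }
        int result = 0;
        for (int i = 0; i < digits.length(); i++) {
            int digit = Character.digit(digits.charAt(i), radix);
            if (digit < 0) {
                return OptionalInt.empty();
            }
            result = result * radix + digit;
            // Checked on every step, so result never grows large enough
            // to overflow an int, however long the input is.
            if (result > Character.MAX_CODE_POINT) {
                return OptionalInt.empty();
            }
        }
        return OptionalInt.of(result);
    }
}
```

Because the running value is capped at 0x10FFFF before each multiplication, the intermediate product stays far below Integer.MAX_VALUE for both radixes.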
Add info to META-INF/MANIFEST.MF, like specification/implementation version.
This should be done by means of the maven-jar-plugin.
The semi-colon will only be used as a field separator in non-English setups, because comma (,) is used as a decimal separator. But in English setups (and in the RFC standard), comma is the standard field separator.
Hi,
Thanks for an interesting library. It looks powerful and provides what I need, so I would like to use it in my project. However, I have a question/request that might not be trivial. Suppose there is a response like this:
"{"foo":"bar: \\\"baz\\\""}"
(Not sure if I put the backslashes properly.) Yes, there are enclosing quotes, and the expected response is as follows: {"foo":"bar: \"baz\""}, but due to the specifics of a server I have no control over, I cannot extract it because of those quotes.
I'm wondering: is it possible to use unbescape to transform either an InputStream or a Reader without writing to an OutputStream or Writer? Why this is necessary: I would like to avoid intermediate storage such as an intermediate JSON string (that may be huge), or any thread-related machinery like PipedInputStream/PipedOutputStream. So, something like this:
final Reader unescapedReader = unescapeJson(escapedReader);
...
processNormalizedJsonReader(unescapedReader);
Stripping the enclosing quotes would not be hard and I think I could handle it myself (for example, stripping the very first enclosing quote before invoking unescapeJson so that unbescape treats the content read as an escaped string, and then detecting the very last, dangling enclosing quote myself).
I tried to implement it myself in the simplest way but didn't account for nested escapes, so my parser fails where there are literals nested at two levels. Here is the question and my failing attempt (see the Edit: section and the StringWrapperInputStream implementation; input streams are not a good fit anyway): http://stackoverflow.com/questions/41745307/how-to-parse-json-which-has-escaped-quotes-with-gson/42120149#42120149
Would it be possible with the library?
Thanks!
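For illustration, a pull-based wrapper of the kind asked about could be sketched with plain JDK classes. This is not an unbescape API: JsonUnescapingReader and its drain helper are hypothetical names, only the standard JSON escapes are handled, and stripping the enclosing quotes is left to the caller.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

final class JsonUnescapingReader extends FilterReader {

    JsonUnescapingReader(Reader in) {
        super(in);
    }

    // Reads one unescaped char, consuming a whole escape sequence if needed.
    @Override
    public int read() throws IOException {
        int c = in.read();
        if (c != '\\') {
            return c; // plain character, or -1 at end of stream
        }
        int escaped = in.read();
        switch (escaped) {
            case 'n': return '\n';
            case 't': return '\t';
            case 'r': return '\r';
            case 'b': return '\b';
            case 'f': return '\f';
            case 'u': return readFourHexDigits();
            case -1:  return '\\';    // dangling backslash at end of input
            default:  return escaped; // quote, backslash, slash: the char itself
        }
    }

    // Decodes the four hex digits that follow a backslash-u escape.
    private int readFourHexDigits() throws IOException {
        int value = 0;
        for (int i = 0; i < 4; i++) {
            int h = in.read();
            int digit = (h < 0) ? -1 : Character.digit((char) h, 16);
            if (digit < 0) {
                throw new IOException("Malformed unicode escape");
            }
            value = (value << 4) | digit;
        }
        return value;
    }

    @Override
    public int read(char[] buffer, int offset, int length) throws IOException {
        int count = 0;
        while (count < length) {
            int c = read();
            if (c == -1) {
                break;
            }
            buffer[offset + count++] = (char) c;
        }
        return (count == 0 && length > 0) ? -1 : count;
    }

    // Demo helper only: reads the whole stream into a String.
    static String drain(Reader reader) {
        StringBuilder sb = new StringBuilder();
        try {
            for (int c = reader.read(); c != -1; c = reader.read()) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

Since each read() consumes at most one escape sequence from the underlying Reader, no intermediate copy of the whole document is built; a JSON parser can be handed the wrapper directly.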
Can I escape only disallowed, unwise and reserved characters?
I need to keep Cyrillic characters as is. Something like a UriEscape.escapeUriPathMinimal() method is needed.
For example:
биеийн жин.jpg -> биеийн%20жин.jpg
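A minimal sketch of the requested behaviour is shown below. The escapePathMinimal name and the exact set of ASCII characters kept as-is are assumptions, not part of UriEscape; note that the result is an IRI-style path rather than a strict ASCII-only URI.

```java
final class MinimalPathEscapeSketch {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // ASCII characters assumed safe inside a path; everything else in the
    // ASCII range (space, quotes, angle brackets, percent, ...) is encoded.
    private static final String SAFE_ASCII =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~!$&'()*+,;=:@/";

    static String escapePathMinimal(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7F || SAFE_ASCII.indexOf(c) >= 0) {
                out.append(c); // non-ASCII (e.g. Cyrillic) and safe ASCII pass through
            } else {
                // Unsafe ASCII is a single byte in UTF-8, so one %XX triplet suffices.
                out.append('%').append(HEX[c >> 4]).append(HEX[c & 0xF]);
            }
        }
        return out.toString();
    }
}
```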
Currently every escape function accepts only charset names. If we could pass a predefined Charset instance, it would run faster.
For example, currently we have this method:
UriEscape.escapeUriPathSegment(String text, String encoding)
This overloaded method is needed:
UriEscape.escapeUriPathSegment(String text, Charset encoding)
PS: the DEFAULT_ENCODING constant should also be a Charset instance, not a String. Can I send a pull request for these changes?
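The overload pattern being requested might be sketched like this. CharsetOverloadSketch and percentEncode are hypothetical stand-ins, not UriEscape methods; the point is only that the String variant delegates to the Charset variant, so callers holding a Charset skip the by-name lookup.

```java
import java.nio.charset.Charset;

final class CharsetOverloadSketch {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Name-based variant: resolves the charset on every call.
    static String percentEncode(String text, String encoding) {
        return percentEncode(text, Charset.forName(encoding));
    }

    // Charset-based variant: no lookup, and no checked
    // UnsupportedEncodingException to deal with.
    static String percentEncode(String text, Charset charset) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (byte b : text.getBytes(charset)) {
            int v = b & 0xFF;
            boolean unreserved = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '.' || v == '_' || v == '~';
            if (unreserved) {
                out.append((char) v);
            } else {
                out.append('%').append(HEX[v >> 4]).append(HEX[v & 0xF]);
            }
        }
        return out.toString();
    }
}
```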
The current level-1 escaping mechanisms for JSON and JavaScript escape the slash (/) symbol in order to prevent code injection attacks when escaped code is used in HTML inside <script> blocks, like:
<script>
var value = "This is a value<\/script>[SOME_INJECTED_CODE]";
</script>
The \/ escape sequence prevents [SOME_INJECTED_CODE] from actually executing.
But in XHTML environments, a similar issue could appear if XHTML escapes are used inside literals. When operating in XHTML mode (Content-Type: application/xhtml+xml), browsers will apply these XHTML escapes before processing the script, so these escapes could be used for closing the literal and even the <script> tag:
<script>
var value = "This is a value&#x22;+[SOME_INJECTED_CODE]+&#x22;";
</script>
The most adequate way to avoid this and make JSON, JavaScript and CSS literal escapes safe to be used both in HTML and XHTML scenarios is to always escape not only /, but also &. So the result of the above would be something like:
<script>
var value = "This is a value\u0026#x22;+[SOME_INJECTED_CODE]+\u0026#x22;";
</script>
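The proposed rule can be sketched in a few lines. This is a simplified illustration, not unbescape's level-1 implementation: the class and method names are hypothetical, and control characters and other escapes a real escaper needs are omitted.

```java
final class ScriptLiteralEscapeSketch {
    // Escapes the characters that could end a double-quoted literal or a
    // script block, plus '&' so that XHTML entities are never seen as such.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '/':  out.append("\\/"); break;     // breaks up a closing script tag
                case '&':  out.append("\\u0026"); break; // breaks up XHTML entity references
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this rule, an injected `&#x22;` reaches the browser as `\u0026#x22;`, which the XHTML parser leaves alone and the JavaScript engine reads back as the literal text `&#x22;`.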
Hi,
would it be possible to unescape everything but &lt;, &gt; and &amp;, as unescaping those would render the resulting HTML invalid?
Is there any option to achieve this?
Many people write code like:
<a href='http://my/input/url'>
Ignoring ' is not secure in that case, even though the RFC defines it that way.
  stays unchanged in resulting unescaped text
Hi Daniel,
I like your concept of multiple levels, which gives developers more flexibility. Consider the following situation (exactly what I ran into):
A huge number of strings will be processed by a MapReduce program (Java), so I only need to escape single characters such as \t and \n, leaving other Unicode characters unescaped. I could simply escape them all, but that would make the data harder to store and process in later steps. What do you think about splitting Level 1 in Java into two finer levels? That is, level 0 would escape only the basic set of non-displayable control characters, and level 1 would remain the same. I modified your source code and it was a very small change.
Joe
As issue title.
JavaScript (ECMAScript 2015+) allows line continuations based on escaping line feed characters, like:
var lit = "Some\
literal\
with more than one\
line";
Given that #18 added support in Unbescape for treating line continuations as escape sequences (and thus unescaping them), it makes sense to do the same for JavaScript.
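A minimal sketch of what unescaping line continuations means for a JavaScript literal follows. It is not the unbescape implementation; the names are hypothetical, and other escape sequences are copied through untouched instead of being decoded.

```java
final class LineContinuationSketch {
    // Removes every backslash that is directly followed by a line
    // terminator (LF, CR, CRLF, U+2028 or U+2029), along with the terminator.
    static String stripLineContinuations(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '\\' && i + 1 < text.length()) {
                char next = text.charAt(i + 1);
                if (next == '\r' && i + 2 < text.length() && text.charAt(i + 2) == '\n') {
                    i += 3; // backslash + CRLF
                    continue;
                }
                if (next == '\n' || next == '\r' || next == 0x2028 || next == 0x2029) {
                    i += 2; // backslash + single line terminator
                    continue;
                }
                // Any other escape: copy both chars, so an escaped backslash
                // followed by a real newline is not mistaken for a continuation.
                out.append(c).append(next);
                i += 2;
                continue;
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

Applied to the literal above, the four source lines collapse into the single string `Someliteralwith more than oneline`.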
In order to be able to compile with JDK versions >= 9, we should set the source and target baseline in the maven-compiler-plugin to Java 6. This should be completely painless in 2018.
My problem is the one I posted on StackOverflow.
Briefly, I think it would be good to have variants of levels 2, 3 and 4 that do not escape level 1 characters. The reason is explained in the SO question.
Alternatively, it could be useful if escapeHtml were overloaded to accept a short[] like HtmlEscapeSymbols.NCRS_BY_CODEPOINT.
Hi. At my company I need to unescape content of structured data in web pages described by https://schema.org/ properties. The data might be HTML escaped. I've been using apache StringUtils so far, but it cannot handle all entities which occur in those Strings.
I've tried unbescape and it works really well! But I wonder if there is any configuration or setting that makes unescaping fail on unrecognized entities. I've noticed there are many options for escaping, but not for unescaping.
The following code returns 55" TV �.
HtmlEscape.unescapeHtml("55" TV �")
The desired response for me would be some kind of Exception saying that � is not a valid entity.
In my use case, not having any unescaped String at all is much preferable to having invalid characters in it (without even knowing about it). And performing extra checks on the input does not feel optimal if I use a library for parsing it anyway.
Let me know if there is any possibility of using unbescape the described way (maybe I missed some method/option).
Thanks,
Milan
In many places, Character.isHighSurrogate() is used inside a specific pattern to determine the code point from a char[] or String, but that's exactly what Character.codePointAt() does. And calling the JVM method will probably be faster.
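The equivalence can be seen in a short sketch. The manual method below is an assumed reconstruction of the kind of pattern meant, not code copied from unbescape.

```java
final class CodePointAtSketch {
    // Hand-written surrogate handling of the kind described above.
    static int manualCodePointAt(CharSequence text, int index) {
        char c1 = text.charAt(index);
        if (Character.isHighSurrogate(c1) && index + 1 < text.length()) {
            char c2 = text.charAt(index + 1);
            if (Character.isLowSurrogate(c2)) {
                return Character.toCodePoint(c1, c2);
            }
        }
        return c1; // BMP char, or an unpaired surrogate returned as-is
    }

    // The JDK equivalent: same result, including for unpaired surrogates.
    static int builtInCodePointAt(CharSequence text, int index) {
        return Character.codePointAt(text, index);
    }
}
```

For char[] input, Character.codePointAt(char[] a, int index, int limit) covers the same case with an explicit upper bound.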
Add OSGi bundle metadata to META-INF/MANIFEST.MF, checking that all packages are exported and that no packages are imported, given unbescape has no runtime dependencies on any packages outside the standard Java runtime.
From: thymeleaf/thymeleaf#769
It should be evaluated whether it would make sense to make all escaping operations filter out characters that are simply not representable in HTML.