unbescape / unbescape

Advanced yet easy to use escaping library for Java

License: Apache License 2.0

Languages: Java 93.20%, HTML 6.74%, CSS 0.06%

Topics: java, escaping, libraries, xml, json, html, url, css, csv

unbescape's Issues

XML attribute escaping / unescaping

Any plans to add support for XML attributes? In particular, newline encoding support?

HTML5 escaping for Æ has no trailing semicolon

HtmlEscape.escapeHtml5("Æ") returns "&AElig", but should return "&AElig;".

Tracked this down to a comparison in HtmlEscapeSymbols:

Since &AElig; is the lexicographically smallest NCR for HTML 5, it is sorted to index 0 in ncrsOrdered. Thus, NCRS_BY_CODEPOINT[198] is set to 0 for &AElig;. Next comes &AElig (without the semicolon) at index 1 in ncrsOrdered. The comparison detects that NCRS_BY_CODEPOINT[198] is 0 (because it was set to 0), but assumes that this 0 must be NO_NCR, and overwrites the entry. This eventually causes the missing semicolon when escaping Æ.

Maybe NO_NCR should be set to a different value to avoid mixing it up with the valid index 0.
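
The collision between a sentinel of 0 and the valid index 0 can be illustrated with a small, self-contained sketch. The class and method names are hypothetical, and the logic is a simplification rather than unbescape's actual HtmlEscapeSymbols code: it assumes a "first entry wins" policy and uses -1 as the "no NCR" marker, so that index 0 is never mistaken for an empty slot.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the library's real data structures.
final class NcrIndexSketch {

    // Assumption: any value that can never be a valid array index.
    static final short NO_NCR = -1;

    // codepointOfNcr[i] is the codepoint that the i-th NCR (in sorted order) stands for.
    static short[] indexByCodepoint(int[] codepointOfNcr, int tableSize) {
        short[] byCodepoint = new short[tableSize];
        // Without this fill the table would start out as all zeros,
        // which is indistinguishable from "points at NCR number 0".
        Arrays.fill(byCodepoint, NO_NCR);
        for (int i = 0; i < codepointOfNcr.length; i++) {
            int cp = codepointOfNcr[i];
            if (byCodepoint[cp] == NO_NCR) {
                byCodepoint[cp] = (short) i; // first NCR for this codepoint wins
            }
        }
        return byCodepoint;
    }
}
```

With two NCRs for codepoint 198 at sorted positions 0 and 1, the table entry stays at 0 instead of being overwritten by 1.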

CSS unescape of slash then newline

 CssEscape.unescapeCss("here\\\nthere")

outputs

here\
there

but based on the reading of the CSS spec it seems like this should be

herethere

https://www.w3.org/TR/CSS2/syndata.html#escaped-characters

"First, inside a string, a backslash followed by a newline is ignored (i.e., the string is deemed not to contain either the backslash or the newline)"

`HtmlEscape.unescapeHtml` throws `IllegalArgumentException` for escapes that are greater than the Unicode range

U+110000 is not a valid code point. It exceeds Plane 16. However, HTML in the wild is often dirty, and we've seen poorly formed HTML documents that contain hexadecimal escapes that are much larger than the valid range.

This particular case:

HtmlEscape.unescapeHtml("&#x110000;");

fails with an IllegalArgumentException thrown by Character.toChars. However, it would be nice if unescaping either ignored these escapes, as it does for unknown symbolic escapes such as HtmlEscape.unescapeHtml("&uumqqq;") (which yields "&uumqqq;" as output), or replaced them with the Unicode replacement character (U+FFFD).

I'm happy to submit a patch, but I was curious which behavior would be preferred for this edge case.
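
The second option (substituting U+FFFD) can be sketched as a small guard in front of Character.toChars. This is only an illustration of the idea, not the library's code; the class and method names are hypothetical.

```java
// Hypothetical helper; shows one possible policy for out-of-range code points.
final class LenientCodePoints {

    static final int REPLACEMENT_CHARACTER = 0xFFFD;

    // Returns the text for a numeric reference's value, falling back to U+FFFD
    // when the value lies outside U+0000..U+10FFFF.
    static String decode(int codePoint) {
        int safe = Character.isValidCodePoint(codePoint) ? codePoint : REPLACEMENT_CHARACTER;
        // Character.toChars never throws here because 'safe' is always in range.
        return new String(Character.toChars(safe));
    }
}
```

Note that Character.isValidCodePoint also accepts surrogate values (U+D800 to U+DFFF); a stricter policy might want to replace those as well.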

Add a JPMS automatic module name

Adding an Automatic-Module-Name to the MANIFEST.MF file will allow us to reserve the module name for future use. See http://blog.joda.org/2017/05/java-se-9-jpms-automatic-modules.html

Completely modularising the library could mean an issue for some scenarios as of today, as the jar file would contain at least one file (module-info.class) with Java 9 bytecode level, which can nowadays be troublesome. See http://blog.joda.org/2018/03/jpms-negative-benefits.html

There are at least two approaches to module naming (project-name-based and reverse-DNS-based, see this), but we will go for the project-based one, as it is simpler, is the approach recommended by Mark Reinhold (see this), and has been adopted by the Spring family of projects.

So the module name will be unbescape.

Parse error

Hi author!

I have:
{"GetMusicTempResult":"{"ErrorCode":0, "ErrorMessage":"Thành Công", "Data" : [{"album_id" : "035BBCB36","music_url" : "/temp/zmnepemsqntotmndhn2vstt5/langletonthuong_mrsiro.mp3","music_picture" : "/PictureLip/Album/V-PopHotMp3Icon.png","music_title" : " Lặng Lẽ Tổn Thương ","music_performers" : "Mr. Siro","music_listen" : "265","music_lyric" : "(Vậy mà em vẫn cứ ngây thơ
Giờ đây em có vui đâu
Vậy mà em vẫn cứ ngây thơ...)

Tại sao chúng mình, không nhận ra nhau sớm hơn
Để anh có thể mạnh mẽ đến gần em, và
Thì thầm lời hứa, em sẽ không phải khóc
Chẳng như hôm nay, chỉ biết im lặng thôi
Từ giây phút đầu, em nở nụ cười rất tươi
Mà chỉ anh thấy, em vẫn không hạnh phúc, chút nào
Và anh biết, người ở bên cạnh em bây giờ
Không hề quan tâm đến cảm giác em như ngày xưa

ĐK1:
Vậy mà vẫn cứ ngây thơ, em vẫn bên ai dại khờ
Giờ đây em có vui đâu, xé nát tim anh nhiều lần
Anh không chấp nhận nhìn em cố níu tay ai
Anh xót xa trăm ngàn lần
Tình yêu anh muốn trao em
Nhưng chắc duyên ta phải dừng khi chưa bắt đầu
Xin đừng để anh phải thấy em buồn
(Xin đừng để anh phải thấy em buồn)

(Nhìn nụ cười buồn nở trên môi em, mà lòng này chợt nhận ra con tim, đã yêu em thật rồi
When I see you cry, I love you more and more)

-----
Bản thân anh đã nhiều lần tự hỏi
Trong tim em anh chẳng là gì cả
Sao phải quan tâm em quá như là
"Em xem anh là tất cả, niềm tin trong tim"
Nhưng khi anh chợt tỉnh giấc chỉ mình anh mơ

ĐK2:
Vậy mà vẫn cứ ngây thơ, em vẫn bên ai dại khờ
Giờ đây em có vui đâu, xé nát tim anh nhiều lần
Anh không chấp nhận nhìn em cố níu tay ai
Anh xót xa trăm ngàn lần
Tình yêu anh muốn trao em
Nhưng chắc duyên ta phải dừng khi chưa bắt đầu
Xin đừng để anh phải thấy em buồn
(Có lẽ đã không còn cần một người như anh để che chở)

(Và đừng quên anh từng ở bên em...) ","cat_id" : "1","cat_name" : "V-Pop","performers_id" : "C5E60D989"}]}"}

After parse:
{"album_id" : "035BBCB36","music_url" : "/temp/unnxccplluaogwx01dckf5en/langletonthuong_mrsiro.mp3","music_picture" : "/PictureLip/Album/V-PopHotMp3Icon.png","music_title" : " Lặng Lẽ Tổn Thương ","music_performers" : "Mr. Siro","music_listen" : "266","music_lyric" : "(Vậy mà em vẫn cứ ngây thơ
Giờ đây em có vui đâu
Vậy mà em vẫn cứ ngây thơ...)

Tại sao chúng mình, không nhận ra nhau sớm hơn
Để anh có thể mạnh mẽ đến gần em, và
Thì thầm lời hứa, em sẽ không phải khóc
Chẳng như hôm nay, chỉ biết im lặng thôi
Từ giây phút đầu, em nở nụ cười rất tươi
Mà chỉ anh thấy, em vẫn không hạnh phúc, chút nào
Và anh biết, người ở bên cạnh em bây giờ
Không hề quan tâm đến cảm giác em như ngày xưa

ĐK1:
Vậy mà vẫn cứ ngây thơ, em vẫn bên ai dại khờ
Giờ đây em có vui đâu, xé nát tim anh nhiều lần
Anh không chấp nhận nhìn em cố níu tay ai
Anh xót xa trăm ngàn lần
Tình yêu anh muốn trao em
Nhưng chắc duyên ta phải dừng khi chưa bắt đầu
Xin đừng để anh phải thấy em buồn
(Xin đừng để anh phải thấy em buồn)

(Nhìn nụ cười buồn nở trên môi em, mà lòng này chợt nhận ra con tim, đã yêu em thật rồi
When I see you cry, I love you more and more)

-----
Bản thân anh đã nhiều lần tự hỏi
Trong tim em anh chẳng là gì cả
Sao phải quan tâm em quá như là
"Em xem anh là tất cả, niềm tin trong tim"
Nhưng khi anh chợt tỉnh giấc chỉ mình anh mơ

ĐK2:
Vậy mà vẫn cứ ngây thơ, em vẫn bên ai dại khờ
Giờ đây em có vui đâu, xé nát tim anh nhiều lần
Anh không chấp nhận nhìn em cố níu tay ai
Anh xót xa trăm ngàn lần
Tình yêu anh muốn trao em
Nhưng chắc duyên ta phải dừng khi chưa bắt đầu
Xin đừng để anh phải thấy em buồn
(Có lẽ đã không còn cần một người như anh để che chở)

(Và đừng quên anh từng ở bên em...) ","cat_id" : "1","cat_name" : "V-Pop","performers_id" : "C5E60D989"}

Error is: "Em xem anh là tất cả, niềm tin trong tim"
Please fix library to parse: "Em xem anh là tất cả, niềm tin trong tim"
Check json by: http://jsonlint.com/

Thanks!

ArrayIndexOutOfBoundsException for numeric character references that exceed Java's `Integer.MAX_VALUE`.

This is related to my earlier reported bug #7, but is actually more extreme.

The following statement:

HtmlEscape.unescapeHtml("&#914351242010;");

results in a stack trace that looks similar to the following:

java.lang.ArrayIndexOutOfBoundsException: 476792037
    at org.unbescape.html.HtmlEscapeUtil.unescape(HtmlEscapeUtil.java:683)
    at org.unbescape.html.HtmlEscape.unescapeHtml(HtmlEscape.java:616)
...

I've tracked this down to the way that unbescape uses a custom string-to-integer conversion, which doesn't do bounds checking and can cause integer overflow. Specifically, the culprit is HtmlEscapeUtil.parseIntFromReference. It can return a negative number, which unbescape later uses as an indicator of a so-called DOUBLE_CODEPOINT, that is, a symbolic entity reference that gets translated into two Unicode characters.
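
The overflow can be reproduced with a naive digit-accumulating parser, and avoided by checking the running value against the maximum code point after every digit. Both methods below are hypothetical stand-ins written for this illustration; they are not copies of HtmlEscapeUtil.parseIntFromReference.

```java
// Hypothetical sketch of the problem and one possible bounds-checked alternative.
final class ReferenceParsing {

    // Naive accumulation: silently wraps around once the value exceeds Integer.MAX_VALUE.
    static int parseNaive(String digits, int radix) {
        int result = 0;
        for (int i = 0; i < digits.length(); i++) {
            result = result * radix + Character.digit(digits.charAt(i), radix);
        }
        return result;
    }

    // Bounds-checked accumulation: returns -1 for anything that is not a code point.
    static int parseCodePointChecked(String digits, int radix) {
        if (digits.isEmpty()) {
            return -1;
        }
        int result = 0;
        for (int i = 0; i < digits.length(); i++) {
            int digit = Character.digit(digits.charAt(i), radix);
            if (digit < 0) {
                return -1; // not a digit in this radix
            }
            // result <= 0x10FFFF here, so this multiplication cannot overflow an int.
            result = result * radix + digit;
            if (result > Character.MAX_CODE_POINT) {
                return -1; // bail out long before the int range is exhausted
            }
        }
        return result;
    }
}
```

For the input from this report, parseNaive("914351242010", 10) wraps around to a negative int, while parseCodePointChecked returns -1 so the caller can decide how to handle the invalid reference.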

Add overloaded method JsonEscape.escapeJson(String,Writer)

Enhance MANIFEST.MF

Add info to META-INF/MANIFEST.MF like specification/implementation version.

This should be done by means of the maven-jar-plugin

Fix description of MSExcel compatible CSVs

The semi-colon will only be used as a field separator in non-English setups, because comma (,) is used as a decimal separator. But in English setups (and in the RFC standard), comma is the standard field separator.

Converting escaped readers to unescaped readers (or vice versa)

Hi,

Thanks for an interesting library. It looks powerful and provides what I need, so I would like to use it in my project. However, I have a question/request that might not be trivial. Suppose there is a response like this:

"{"foo":"bar: \\\"baz\\\""}"

(Not sure if I put the backslashes properly.) Yes, there are enclosing quotes, and the expected response is as follows: {"foo":"bar: \"baz\""}, but due to the specifics of the server, which I have no control over, I cannot extract it because of those quotes.

I'm wondering: is it possible to use unbescape to transform an InputStream or Reader directly, without writing to an OutputStream or Writer? Why this is necessary: I would like to avoid intermediate storage, such as an intermediate JSON string (which may be huge), and any thread-related machinery like PipedInputStream/PipedOutputStream. So, something like this:

final Reader unescapedReader = unescapeJson(escapedReader);
...
processNormalizedJsonReader(unescapedReader);

Stripping the enclosing quotes would not be hard, and I think I could handle it myself (for example, by stripping the very first enclosing quote before invoking unescapeJson, so that unbescape treats the content it reads as an escaped string, and then detecting the very last, dangling enclosing quote myself).

I tried to implement it myself in the simplest way but didn't consider enclosed escapes so my parser fails where there are literals nested at two levels. Here is a question and my failing suggestion (see the Edit: section and the StringWrapperInputStream implementation (input streams are not good anyway)): http://stackoverflow.com/questions/41745307/how-to-parse-json-which-has-escaped-quotes-with-gson/42120149#42120149

Would it be possible with the library?
Thanks!
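
One way to get a streaming "unescaped view" of a Reader without buffering the whole payload is a FilterReader that decodes escapes as characters are pulled through it. The class below is a hypothetical, simplified sketch and not part of unbescape: it handles the standard JSON escapes only, does not strip the enclosing quotes, and does not override skip() or mark().

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch: decodes JSON string escapes on the fly while reading.
final class JsonUnescapingReader extends FilterReader {

    JsonUnescapingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = in.read();
        if (c != '\\') {
            return c; // ordinary character or end of stream (-1)
        }
        int escaped = in.read();
        switch (escaped) {
            case -1:  return '\\'; // dangling backslash: pass it through
            case 'n': return '\n';
            case 't': return '\t';
            case 'r': return '\r';
            case 'b': return '\b';
            case 'f': return '\f';
            case 'u': return readFourHexDigits();
            default:  return escaped; // covers \" \\ \/ and leaves unknown escapes as the bare char
        }
    }

    private int readFourHexDigits() throws IOException {
        int value = 0;
        for (int i = 0; i < 4; i++) {
            int ch = in.read();
            int digit = ch < 0 ? -1 : Character.digit((char) ch, 16);
            if (digit < 0) {
                throw new IOException("Malformed \\u escape");
            }
            value = (value << 4) | digit;
        }
        return value;
    }

    @Override
    public int read(char[] buffer, int offset, int length) throws IOException {
        if (length == 0) {
            return 0;
        }
        int count = 0;
        while (count < length) {
            int c = read();
            if (c < 0) {
                break;
            }
            buffer[offset + count++] = (char) c;
        }
        return count == 0 ? -1 : count;
    }

    // Convenience for trying the reader out on a small in-memory string.
    static String unescapeAll(String escaped) {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new JsonUnescapingReader(new StringReader(escaped))) {
            for (int c = reader.read(); c >= 0; c = reader.read()) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

In the scenario described above, the wrapped reader could be handed straight to a JSON parser once the leading quote has been consumed; detecting the trailing quote remains the caller's job.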

URI escaping only disallowed chars

Can I escape only disallowed, unwise and reserved characters?
I need to keep Cyrillic characters as they are. Something like a UriEscape.escapeUriPathMinimal() method is needed.

for example:
биеийн жин.jpg -> биеийн%20жин.jpg
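
The requested behaviour can be sketched as follows. This is a hypothetical helper, not an existing unbescape method: it assumes that only ASCII characters outside the RFC 3986 path-segment set need percent-encoding, and it passes every non-ASCII character through untouched (so the result is an IRI-style path rather than a strict URI).

```java
// Hypothetical sketch of "minimal" path-segment escaping.
final class MinimalPathEscaper {

    // RFC 3986 unreserved and sub-delims punctuation, plus ':' and '@', all allowed in a path segment.
    private static final String ALLOWED_PUNCTUATION = "-._~!$&'()*+,;=:@";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    static String escapePathSegmentMinimal(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x80 || isAsciiAlphanumeric(c) || ALLOWED_PUNCTUATION.indexOf(c) >= 0) {
                sb.append(c); // non-ASCII (e.g. Cyrillic) and allowed ASCII stay as they are
            } else {
                // c is below 0x80 here, so it is exactly one byte in UTF-8 and in ASCII.
                sb.append('%').append(HEX[c >> 4]).append(HEX[c & 0xF]);
            }
        }
        return sb.toString();
    }

    private static boolean isAsciiAlphanumeric(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }
}
```

Because this targets a single path segment, '/' is escaped as %2F; a variant for whole paths would add '/' to the allowed set.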

Possibility to use java.nio.charset.Charset instance.

Currently all escape functions accept only charset names. If we could pass a predefined Charset instance, it would run faster.

For example:
Currently we have this method:
UriEscape.escapeUriPathSegment(String text, String encoding)

This overloaded method is needed:
UriEscape.escapeUriPathSegment(String text, Charset encoding)

PS: the DEFAULT_ENCODING constant should also be a Charset instance, not a String. Can I send a pull request for these changes?

Add ampersand to level-1 escaping for JSON, JavaScript and CSS literals

The current level-1 escaping mechanisms for JSON and JavaScript escape the slash (/) symbol in order to prevent code injection attacks when escaped code is used in HTML inside <script> blocks, like:

<script>
  var value = "This is a value<\/script>[SOME_INJECTED_CODE]";
</script>

The \/ escape sequence prevents [SOME_INJECTED_CODE] from actually executing.

But in XHTML environments, a similar issue could appear if XHTML escapes are used inside literals. When operating in XHTML mode (Content-Type: application/xhtml+xml), browsers will apply these XHTML escapes before processing the script, so these escapes could be used for closing the literal and even the <script> tag:

<script>
  var value = "This is a value&#x22;+[SOME_INJECTED_CODE]+&#x22;";
</script>

The most adequate way to avoid this and make JSON, JavaScript and CSS literal escapes safe to be used both in HTML and XHTML scenarios is to always escape not only /, but also the &. So the result of the above would be something like:

<script>
  var value = "This is a value\u0026#x22;+[SOME_INJECTED_CODE]+\u0026#x22;";
</script>
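
A minimal sketch of this rule, written as a standalone helper rather than as unbescape's actual level-1 implementation (the class and method names are hypothetical, and only the characters discussed here plus the quote and backslash are handled):

```java
// Hypothetical sketch: escapes a value for use inside a double-quoted
// JavaScript/JSON literal embedded in an HTML or XHTML <script> block.
final class ScriptLiteralEscaper {

    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;     // cannot close the literal
                case '\\': sb.append("\\\\"); break;     // keeps existing backslashes literal
                case '/':  sb.append("\\/"); break;      // "</script>" can no longer appear
                case '&':  sb.append("\\u0026"); break;  // XHTML entities can no longer appear
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With this, the injected &#x22; from the example above reaches the browser as \u0026#x22;, which an XHTML parser leaves alone and the script engine reads back as a plain ampersand followed by #x22;.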

Unescaping lt, gt and amp

Hi,
would it be possible to unescape everything except <, > and &? Unescaping those would render the resulting HTML invalid.
Is there any option to achieve this?

org.unbescape.uri.UriEscape should support escaping '

Many people write code like:

<a href='http://my/input/url'>

Leaving ' unescaped is not secure in this case, even though the RFC allows it.

unescapeHtml does not replace &nbsp with white space

&nbsp stays unchanged in resulting unescaped text

Finer escape levels (Java)

Hi Daniel,

I like your concept of multiple levels, which gives developers more flexibility. Consider the following situation (exactly the one I have run into):
A huge number of strings will be processed by a MapReduce program (Java), so I only need to escape single characters such as \t and \n and leave all other Unicode characters as they are. Surely I could simply escape them all, but this would make the data harder to store and process in later steps. What do you think about splitting Level 1 in Java into two finer levels? That is, level 0 would only escape the basic set of non-displayable control characters, and level 1 would remain the same. I modified your source code and it was a very small change.

Joe
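
The proposed "level 0" could look roughly like the sketch below. It is a hypothetical illustration, not the change Joe made and not an existing unbescape level; it assumes that the backslash and double quote must also be escaped so that the output can be unescaped unambiguously.

```java
// Hypothetical sketch of a "control characters only" Java escape level.
final class ControlOnlyJavaEscaper {

    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '\b': sb.append("\\b"); break;
                case '\t': sb.append("\\t"); break;
                case '\n': sb.append("\\n"); break;
                case '\f': sb.append("\\f"); break;
                case '\r': sb.append("\\r"); break;
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                default:
                    if (c < 0x20 || c == 0x7F) {
                        // Remaining ASCII control characters get a \\uXXXX escape.
                        sb.append(String.format("\\u%04X", (int) c));
                    } else {
                        sb.append(c); // everything else, including non-ASCII, is left alone
                    }
            }
        }
        return sb.toString();
    }
}
```

Tabs and newlines inside a record are neutralised, while text such as "Thành Công" stays readable in the stored output.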

Is there a plan to support sql escape?

As issue title.

Support JavaScript line continuations

JavaScript (ECMAScript 2015+) allows line continuations based on escaping line feed characters, like:

var lit = "Some\
literal\
with more than one\
line";

Given that #18 added support to Unbescape for treating line continuations as escape sequences (and thus unescaping them), it makes sense to do the same for JavaScript.
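
Treating a line continuation as an escape sequence means dropping the backslash together with the line terminator that follows it. The helper below is a hypothetical, standalone sketch of that single rule (it does not decode any other escape sequences, and it is not unbescape's implementation):

```java
// Hypothetical sketch: removes backslash + line terminator pairs from a literal's content.
final class LineContinuations {

    static String strip(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        int length = text.length();
        int i = 0;
        while (i < length) {
            char c = text.charAt(i);
            if (c == '\\' && i + 1 < length) {
                char next = text.charAt(i + 1);
                if (next == '\n') {
                    i += 2; // backslash + LF disappears
                    continue;
                }
                if (next == '\r') {
                    // backslash + CR, or backslash + CRLF, disappears as a unit
                    i += (i + 2 < length && text.charAt(i + 2) == '\n') ? 3 : 2;
                    continue;
                }
                // Any other escape is copied as a pair, so that an escaped
                // backslash followed by a newline is not misread as a continuation.
                sb.append(c).append(next);
                i += 2;
                continue;
            }
            sb.append(c);
            i++;
        }
        return sb.toString();
    }
}
```

Applied to the content of the literal above, this yields "Someliteralwith more than oneline".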

Set compilation baseline to Java 6

In order to be able to compile with versions of the JDK >= 9, we should establish the source and target baseline at the maven-compiler-plugin to Java 6. This should be completely painless in 2018.

Special HtmlEscapeLevel level that does not escape Level 1 characters

My problem is the one I posted on StackOverflow.

Briefly, I think it would be good to have variants of levels 2, 3 and 4 that do not escape level 1 characters. The reason is explained in the SO question.

Alternatively, it could be useful if escapeHtml could be overloaded to accept short[] like HtmlEscapeSymbols.NCRS_BY_CODEPOINT

Possible Exception on unescaping invalid entity

Hi. At my company I need to unescape content of structured data in web pages described by https://schema.org/ properties. The data might be HTML escaped. I've been using apache StringUtils so far, but it cannot handle all entities which occur in those Strings.

I've tried unbescape and it works really well! But I wonder whether there is any configuration or setting that makes unescaping fail on unrecognized entities. I've noticed there are many options for escaping, but none for unescaping.

Example

The following code returns 55" TV �.

HtmlEscape.unescapeHtml("55&quot; TV &#20210525;")

While the desired response for me would be some kind of Exception saying that &#20210525; is not a valid entity.

In my use case, not having any unescaped String at all is much preferable to having some invalid characters in it (without even knowing about it). And performing extra checks on the input simply does not feel optimal if I use a library for parsing it anyway.

Let me know if there is any possibility of using unbescape the described way (maybe I missed some method/option).
Thanks,
Milan
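
Until such an option exists, a strict pre-check is one possible workaround. The sketch below is a hypothetical helper, not part of unbescape: it scans for decimal and hexadecimal numeric references and throws when a value is zero, a surrogate, or beyond U+10FFFF. It does not validate named entities.

```java
import java.math.BigInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: rejects input containing invalid numeric character references.
final class StrictNumericReferenceCheck {

    // Group 1 holds hexadecimal digits, group 2 holds decimal digits.
    private static final Pattern NUMERIC_REFERENCE =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    private static final BigInteger MAX_CODE_POINT = BigInteger.valueOf(Character.MAX_CODE_POINT);

    static void requireValid(String html) {
        Matcher matcher = NUMERIC_REFERENCE.matcher(html);
        while (matcher.find()) {
            boolean hex = matcher.group(1) != null;
            // BigInteger avoids any overflow, however many digits the reference has.
            BigInteger value = new BigInteger(hex ? matcher.group(1) : matcher.group(2), hex ? 16 : 10);
            if (value.signum() == 0 || value.compareTo(MAX_CODE_POINT) > 0 || isSurrogate(value)) {
                throw new IllegalArgumentException(
                        "Invalid numeric character reference: " + matcher.group());
            }
        }
    }

    private static boolean isSurrogate(BigInteger value) {
        int v = value.intValue(); // safe: only called when value <= 0x10FFFF
        return v >= 0xD800 && v <= 0xDFFF;
    }
}
```

Calling requireValid on the input before HtmlEscape.unescapeHtml would turn the example above into an IllegalArgumentException naming &#20210525; instead of a silent replacement character.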
