
unbescape's People

Contributors

b4hand, danielfernandez

unbescape's Issues

HTML5 escaping for Æ has no trailing semicolon

HtmlEscape.escapeHtml5("Æ") returns "&AElig", but should return "&AElig;".

Tracked this down to a comparison in HtmlEscapeSymbols:

Since &AElig; is the lexicographically smallest NCR for HTML 5, it is sorted to index 0 in ncrsOrdered. Thus, NCRS_BY_CODEPOINT[198] is set to 0 for &AElig;. Next comes &AElig (without the semicolon) at index 1 in ncrsOrdered. The comparison detects that NCRS_BY_CODEPOINT[198] is 0 (because it was just set to 0), but assumes that this 0 must be NO_NCR, and overwrites the entry. This eventually causes the missing semicolon when escaping Æ.

Maybe NO_NCR should be set to a different value to avoid mixing it up with the valid index 0.
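
A minimal sketch of the collision described above and of the suggested fix. The class and method names (SentinelSketch, buildIndex) are hypothetical and only mirror this report; this is not unbescape's actual code.

```java
import java.util.Arrays;

// Hypothetical illustration of the reported bug; not unbescape's real code.
final class SentinelSketch {

    /**
     * For each code point, records the index of the FIRST entry in the
     * ordered NCR list that maps to it. 'noNcr' is the "slot is empty" marker.
     */
    static int[] buildIndex(int[] codepointsInNcrOrder, int tableSize, int noNcr) {
        int[] byCodepoint = new int[tableSize];
        Arrays.fill(byCodepoint, noNcr);
        for (int i = 0; i < codepointsInNcrOrder.length; i++) {
            int cp = codepointsInNcrOrder[i];
            // Only claim the slot if it still holds the "empty" marker.
            if (byCodepoint[cp] == noNcr) {
                byCodepoint[cp] = i;
            }
        }
        return byCodepoint;
    }
}
```

With the ordered list { 198, 198 } (standing in for &AElig; at index 0 and &AElig at index 1), a marker of 0 lets the second entry overwrite the first, while a marker such as -1, which can never be a valid index, keeps index 0.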

`HtmlEscape.unescapeHtml` throws `IllegalArgumentException` for escapes that are greater than the Unicode range

U+110000 is not a valid code point. It exceeds Plane 16. However, HTML in the wild is often dirty, and we've seen poorly formed HTML documents that contain hexadecimal escapes that are much larger than the valid range.

This particular case:

HtmlEscape.unescapeHtml("�");

It fails with an IllegalArgumentException thrown by Character.toChars. It would be nice if unescaping instead did one of two things: ignore these escapes, as it already does for unknown symbolic escapes (HtmlEscape.unescapeHtml("&uumqqq;") yields "&uumqqq;"), or replace them with the Unicode replacement character (U+FFFD).

I'm happy to submit a patch, but I was curious which behavior would be preferred for this edge case.
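
For illustration, the second option (substituting U+FFFD) could look roughly like this. CodePointSketch and appendLenient are hypothetical names and not part of unbescape's API; only Character.isValidCodePoint and StringBuilder.appendCodePoint are real JDK calls.

```java
// Hypothetical helper; not part of unbescape's API.
final class CodePointSketch {

    static final char REPLACEMENT = '\uFFFD';

    /** Appends the code point, or U+FFFD if it lies outside U+0000..U+10FFFF. */
    static void appendLenient(StringBuilder out, int codePoint) {
        if (Character.isValidCodePoint(codePoint)) {
            out.appendCodePoint(codePoint);
        } else {
            // Character.toChars would throw IllegalArgumentException here.
            out.append(REPLACEMENT);
        }
    }
}
```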

Add a JPMS automatic module name

Adding an Automatic-Module-Name to the MANIFEST.MF file will allow us to reserve the module name for future use. See http://blog.joda.org/2017/05/java-se-9-jpms-automatic-modules.html

Completely modularising the library could mean an issue for some scenarios as of today, as the jar file would contain at least one file (module-info.class) with Java 9 bytecode level, which can nowadays be troublesome. See http://blog.joda.org/2018/03/jpms-negative-benefits.html

There are at least two approaches to module naming (project-name-based and reverse-DNS-based, see this), but we will go for the project-name-based one, as it is simpler and is the approach recommended by Mark Reinhold (see this) and adopted by the Spring family of projects.

So the module name will be unbescape.

Parse error

Hi author!

I have:
{"GetMusicTempResult":"{"ErrorCode":0, "ErrorMessage":"Thành Công", "Data" : [{"album_id" : "035BBCB36","music_url" : "/temp/zmnepemsqntotmndhn2vstt5/langletonthuong_mrsiro.mp3","music_picture" : "/PictureLip/Album/V-PopHotMp3Icon.png","music_title" : " Lặng Lẽ Tổn Thương ","music_performers" : "Mr. Siro","music_listen" : "265","music_lyric" : "(Vậy mà em vẫn cứ ngây thơ
Giờ đây em có vui đâu
Vậy mà em vẫn cứ ngây thơ...)

Tại sao chúng mình, không nhận ra nhau sớm hơn
Để anh có thể mạnh mẽ đến gần em, và
Thì thầm lời hứa, em sẽ không phải khóc
Chẳng như hôm nay, chỉ biết im lặng thôi
Từ giây phút đầu, em nở nụ cười rất tươi
Mà chỉ anh thấy, em vẫn không hạnh phúc, chút nào
Và anh biết, người ở bên cạnh em bây giờ
Không hề quan tâm đến cảm giác em như ngày xưa

ĐK1:
Vậy mà vẫn cứ ngây thơ, em vẫn bên ai dại khờ
Giờ đây em có vui đâu, xé nát tim anh nhiều lần
Anh không chấp nhận nhìn em cố níu tay ai
Anh xót xa trăm ngàn lần
Tình yêu anh muốn trao em
Nhưng chắc duyên ta phải dừng khi chưa bắt đầu
Xin đừng để anh phải thấy em buồn
(Xin đừng để anh phải thấy em buồn)

(Nhìn nụ cười buồn nở trên môi em, mà lòng này chợt nhận ra con tim, đã yêu em thật rồi
When I see you cry, I love you more and more)

-----
Bản thân anh đã nhiều lần tự hỏi
Trong tim em anh chẳng là gì cả
Sao phải quan tâm em quá như là
"Em xem anh là tất cả, niềm tin trong tim"
Nhưng khi anh chợt tỉnh giấc chỉ mình anh mơ

ĐK2:
Vậy mà vẫn cứ ngây thơ, em vẫn bên ai dại khờ
Giờ đây em có vui đâu, xé nát tim anh nhiều lần
Anh không chấp nhận nhìn em cố níu tay ai
Anh xót xa trăm ngàn lần
Tình yêu anh muốn trao em
Nhưng chắc duyên ta phải dừng khi chưa bắt đầu
Xin đừng để anh phải thấy em buồn
(Có lẽ đã không còn cần một người như anh để che chở)

(Và đừng quên anh từng ở bên em...) ","cat_id" : "1","cat_name" : "V-Pop","performers_id" : "C5E60D989"}]}"}

After parse:
{"album_id" : "035BBCB36","music_url" : "/temp/unnxccplluaogwx01dckf5en/langletonthuong_mrsiro.mp3","music_picture" : "/PictureLip/Album/V-PopHotMp3Icon.png","music_title" : " Lặng Lẽ Tổn Thương ","music_performers" : "Mr. Siro","music_listen" : "266","music_lyric" : "(Vậy mà em vẫn cứ ngây thơ
Giờ đây em có vui đâu
Vậy mà em vẫn cứ ngây thơ...)

Tại sao chúng mình, không nhận ra nhau sớm hơn
Để anh có thể mạnh mẽ đến gần em, và
Thì thầm lời hứa, em sẽ không phải khóc
Chẳng như hôm nay, chỉ biết im lặng thôi
Từ giây phút đầu, em nở nụ cười rất tươi
Mà chỉ anh thấy, em vẫn không hạnh phúc, chút nào
Và anh biết, người ở bên cạnh em bây giờ
Không hề quan tâm đến cảm giác em như ngày xưa

ĐK1:
Vậy mà vẫn cứ ngây thơ, em vẫn bên ai dại khờ
Giờ đây em có vui đâu, xé nát tim anh nhiều lần
Anh không chấp nhận nhìn em cố níu tay ai
Anh xót xa trăm ngàn lần
Tình yêu anh muốn trao em
Nhưng chắc duyên ta phải dừng khi chưa bắt đầu
Xin đừng để anh phải thấy em buồn
(Xin đừng để anh phải thấy em buồn)

(Nhìn nụ cười buồn nở trên môi em, mà lòng này chợt nhận ra con tim, đã yêu em thật rồi
When I see you cry, I love you more and more)

-----
Bản thân anh đã nhiều lần tự hỏi
Trong tim em anh chẳng là gì cả
Sao phải quan tâm em quá như là
"Em xem anh là tất cả, niềm tin trong tim"
Nhưng khi anh chợt tỉnh giấc chỉ mình anh mơ

ĐK2:
Vậy mà vẫn cứ ngây thơ, em vẫn bên ai dại khờ
Giờ đây em có vui đâu, xé nát tim anh nhiều lần
Anh không chấp nhận nhìn em cố níu tay ai
Anh xót xa trăm ngàn lần
Tình yêu anh muốn trao em
Nhưng chắc duyên ta phải dừng khi chưa bắt đầu
Xin đừng để anh phải thấy em buồn
(Có lẽ đã không còn cần một người như anh để che chở)

(Và đừng quên anh từng ở bên em...) ","cat_id" : "1","cat_name" : "V-Pop","performers_id" : "C5E60D989"}

The error is at: "Em xem anh là tất cả, niềm tin trong tim"
Please fix the library so that it can parse: "Em xem anh là tất cả, niềm tin trong tim"
You can check the JSON with: http://jsonlint.com/

Thanks!

ArrayIndexOutOfBoundsException for numeric character references that exceed Java's `Integer.MAX_VALUE`.

This is related to my earlier reported bug #7, but is actually more extreme.

The following statement results in a

HtmlEscape.unescapeHtml("�");

stack trace that looks similar to the following:

java.lang.ArrayIndexOutOfBoundsException: 476792037
    at org.unbescape.html.HtmlEscapeUtil.unescape(HtmlEscapeUtil.java:683)
    at org.unbescape.html.HtmlEscape.unescapeHtml(HtmlEscape.java:616)
...

I've tracked this down to unbescape's custom string-to-integer conversion, which does no bounds checking and can overflow. Specifically, the problem is in HtmlEscapeUtil.parseIntFromReference. It can return a negative number, which unbescape later treats as the indicator for a so-called DOUBLE_CODEPOINT, that is, a symbolic entity reference that gets translated into two Unicode characters.
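
A sketch of a conversion that cannot overflow, assuming the caller already knows where the digits start and end. SafeRefParser and parseCodePoint are hypothetical names, and this is not the code of HtmlEscapeUtil.parseIntFromReference.

```java
// Hypothetical helper; not unbescape's actual parsing code.
final class SafeRefParser {

    /**
     * Parses text[start, end) as an unsigned number in the given radix
     * (10 or 16). Returns -1 if the range is empty, contains a non-digit,
     * or the value exceeds Character.MAX_CODE_POINT (0x10FFFF).
     */
    static int parseCodePoint(CharSequence text, int start, int end, int radix) {
        if (start >= end) {
            return -1;
        }
        int result = 0;
        for (int i = start; i < end; i++) {
            int digit = Character.digit(text.charAt(i), radix);
            if (digit < 0) {
                return -1;
            }
            // result <= 0x10FFFF here, so result * 16 + 15 still fits in an int.
            result = result * radix + digit;
            if (result > Character.MAX_CODE_POINT) {
                return -1; // bail out early, long before an overflow is possible
            }
        }
        return result;
    }
}
```

Because the accumulator is checked after every digit, the result is never negative for valid input, so it cannot be confused with a marker that relies on negative values.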

Enhance MANIFEST.MF

Add info to META-INF/MANIFEST.MF like specification/implementation version.

This should be done by means of the maven-jar-plugin

Fix description of MSExcel compatible CSVs

The semi-colon will only be used as a field separator in non-English setups, because comma (,) is used as a decimal separator. But in English setups (and in the RFC standard), comma is the standard field separator.

Converting escaped readers to unescaped readers (or vice versa)

Hi,

Thanks for an interesting library. It looks powerful and provides what I need, so I would like to use it in my project. However, I have a question/request that might not be trivial. Suppose there is a response like this:

"{"foo":"bar: \\\"baz\\\""}"

(Not sure if I put the backslashes properly.) Yes, there are enclosing quotes, and the expected response is {"foo":"bar: \"baz\""}. The server is outside my control, so I cannot change how it wraps the response, and those quotes prevent me from extracting it directly.

I'm wondering: is it possible to use unbescape to transform an InputStream or a Reader without writing to an OutputStream or Writer? The reason is that I would like to avoid intermediate storage, such as an intermediate parsed JSON string (which may be huge), and any thread-related machinery such as PipedInputStream/PipedOutputStream. So, something like this:

final Reader unescapedReader = unescapeJson(escapedReader);
...
processNormalizedJsonReader(unescapedReader);

Stripping the enclosing quotes would not be hard, and I think I could handle it myself: strip the very first enclosing quote before invoking unescapeJson, so that unbescape treats the content it reads as an escaped string, and then detect the final dangling enclosing quote myself.

I tried to implement it myself in the simplest way but didn't consider enclosed escapes so my parser fails where there are literals nested at two levels. Here is a question and my failing suggestion (see the Edit: section and the StringWrapperInputStream implementation (input streams are not good anyway)): http://stackoverflow.com/questions/41745307/how-to-parse-json-which-has-escaped-quotes-with-gson/42120149#42120149

Would it be possible with the library?
Thanks!
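
To make the request concrete, here is a rough, self-contained sketch of a Reader that undoes one level of JSON string escaping as it is read. JsonUnescapingReader and unescapeAll are hypothetical and not part of unbescape. The sketch reads one char at a time, so the wrapped reader should be buffered, and it does not strip the enclosing quotes.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical class, not part of unbescape.
final class JsonUnescapingReader extends Reader {

    private final Reader in;

    JsonUnescapingReader(Reader in) {
        this.in = in;
    }

    /** Returns the next unescaped char, or -1 at end of input. */
    @Override
    public int read() throws IOException {
        int c = in.read();
        if (c != '\\') {
            return c; // ordinary char, or -1 at end of input
        }
        int e = in.read();
        switch (e) {
            case 'n': return '\n';
            case 't': return '\t';
            case 'r': return '\r';
            case 'b': return '\b';
            case 'f': return '\f';
            case 'u': return readHex4();
            case -1:  return '\\'; // dangling backslash: pass it through
            default:  return e;    // quote, backslash, slash, unknown escapes
        }
    }

    /** Reads the four hex digits that follow a backslash-u escape. */
    private int readHex4() throws IOException {
        int value = 0;
        for (int i = 0; i < 4; i++) {
            int h = in.read();
            int digit = h < 0 ? -1 : Character.digit((char) h, 16);
            if (digit < 0) {
                throw new IOException("Malformed unicode escape");
            }
            value = (value << 4) | digit;
        }
        return value;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int c = read();
            if (c < 0) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    /** Convenience for small inputs: unescapes a whole String via the reader. */
    static String unescapeAll(String escaped) {
        try (Reader r = new JsonUnescapingReader(new StringReader(escaped))) {
            StringBuilder sb = new StringBuilder();
            for (int c = r.read(); c >= 0; c = r.read()) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Nested escapes are handled one level at a time: a backslash-backslash-backslash-quote sequence in the input comes out as backslash-quote, which a downstream JSON parser can then interpret.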

URI escaping only disallowed chars

Can I escape only disallowed, unwise and reserved characters?
I need to keep Cyrillic characters as they are. Something like a UriEscape.escapeUriPathMinimal() method is needed.

for example:
биеийн жин.jpg -> биеийн%20жин.jpg
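
A sketch of what such "minimal" escaping could do for a single path segment, assuming that every non-ASCII character is kept and every ASCII character outside the RFC 3986 "unreserved" set is percent-encoded. MinimalUriEscapeSketch is a hypothetical class, not an existing UriEscape method, and the output is strictly speaking an IRI rather than a URI.

```java
// Hypothetical helper; not an existing unbescape method.
final class MinimalUriEscapeSketch {

    private static final String UNRESERVED =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    /** Percent-encodes ASCII chars that are not 'unreserved'; keeps non-ASCII as is. */
    static String escapeMinimal(String segment) {
        StringBuilder out = new StringBuilder(segment.length() + 8);
        for (int i = 0; i < segment.length(); i++) {
            char c = segment.charAt(i);
            if (c > 0x7F || UNRESERVED.indexOf(c) >= 0) {
                out.append(c);
            } else {
                // c is ASCII here, so a single %XX triplet is enough.
                out.append('%').append(HEX[c >> 4]).append(HEX[c & 0xF]);
            }
        }
        return out.toString();
    }
}
```

Note that this also encodes "/", so it is meant for one segment at a time rather than a whole path.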

Possibility to use java.nio.charset.Charset instance.

Currently every escape function accepts only charset names. If we could pass a predefined Charset instance instead, it would run faster, because the charset would not have to be looked up by name.

For example:
Currently we have this method:
UriEscape.escapeUriPathSegment(String text, String encoding)

This overloaded method needed:
UriEscape.escapeUriPathSegment(String text, Charset encoding)

PS: the DEFAULT_ENCODING constant should also be a Charset instance rather than a String. Can I send a pull request for these changes?
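
The shape of the proposed overload pair, sketched on a deliberately simplified escaper that percent-encodes everything outside the RFC 3986 "unreserved" set. CharsetOverloadSketch is hypothetical and its escaping rules are not those of UriEscape.escapeUriPathSegment; only the delegation pattern is the point.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical class showing the overload pattern only.
final class CharsetOverloadSketch {

    static final Charset DEFAULT_ENCODING = StandardCharsets.UTF_8;
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    /** Name-based entry point: resolves the charset name, then delegates. */
    static String escapePathSegment(String text, String encoding) {
        return escapePathSegment(text, Charset.forName(encoding));
    }

    /** Charset-based overload: callers reuse a Charset they already hold. */
    static String escapePathSegment(String text, Charset encoding) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (byte b : text.getBytes(encoding)) {
            int v = b & 0xFF;
            boolean unreserved = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '.' || v == '_' || v == '~';
            if (unreserved) {
                out.append((char) v);
            } else {
                out.append('%').append(HEX[v >> 4]).append(HEX[v & 0xF]);
            }
        }
        return out.toString();
    }
}
```

Keeping the String overload as a thin wrapper means existing callers keep working, while new callers can pass StandardCharsets.UTF_8 or DEFAULT_ENCODING directly.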

Add ampersand to level-1 escaping for JSON, JavaScript and CSS literals

The current level-1 escaping mechanisms for JSON and JavaScript escape the slash (/) symbol in order to prevent code injection attacks when escaped code is used in HTML inside <script> blocks, like:

<script>
  var value = "This is a value<\/script>[SOME_INJECTED_CODE]";
</script>

The \/ escape sequence prevents [SOME_INJECTED_CODE] from actually executing.

But in XHTML environments, a similar issue could appear if XHTML escapes are used inside literals. When operating in XHTML mode (Content-Type: application/xhtml+xml), browsers will apply these XHTML escapes before processing the script, so these escapes could be used for closing the literal and even the <script> tag:

<script>
  var value = "This is a value&#x22;+[SOME_INJECTED_CODE]+&#x22;";
</script>

The most adequate way to avoid this and make JSON, JavaScript and CSS literal escapes safe to be used both in HTML and XHTML scenarios is to always escape not only /, but also the &. So the result of the above would be something like:

<script>
  var value = "This is a value\u0026#x22;+[SOME_INJECTED_CODE]+\u0026#x22;";
</script>
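
A minimal sketch of the rule proposed above, limited to the characters discussed in this issue plus quotes and the backslash. ScriptLiteralEscapeSketch is a hypothetical class and does not reproduce unbescape's full level-1 escaping.

```java
// Hypothetical helper; covers only the characters discussed in this issue.
final class ScriptLiteralEscapeSketch {

    /** Escapes chars that could end or alter a JavaScript literal embedded in (X)HTML. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\'': out.append("\\'"); break;
                case '\\': out.append("\\\\"); break;
                case '/':  out.append("\\/"); break;     // blocks a closing script tag in HTML
                case '&':  out.append("\\u0026"); break; // blocks XHTML escapes such as &#x22;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```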

Unescaping lt, gt and amp

Hi,
Would it be possible to unescape everything except &lt;, &gt; and &amp;? Unescaping those three would make the resulting HTML invalid.
Is there any option to achieve this?

Finer escape levels (Java)

Hi Daniel,

I like your concept of multiple levels, which gives developers more flexibility. Consider the following situation (exactly what I have run into):
A huge number of strings will be processed by a MapReduce program (Java), so I only need to escape single characters such as \t and \n, and leave all other Unicode characters as they are. I could simply escape them all, but that would make the data harder to store and process in the next steps. What do you think about splitting Level 1 for Java into two finer levels? That is, level 0 would escape only the basic escape set plus non-displayable control characters, and level 1 would remain the same. I modified your source code to do this and it was a very small change.

Joe
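
One reading of the proposed "level 0", sketched as a standalone method: escape quotes, the backslash and control characters, and copy everything else (including non-ASCII) unchanged. JavaLevel0EscapeSketch is a hypothetical class and is not the change mentioned in the issue.

```java
// Hypothetical helper illustrating the proposed level 0.
final class JavaLevel0EscapeSketch {

    /** Escapes quotes, backslash and control chars only; other chars are copied as is. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '\t': out.append("\\t"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\b': out.append("\\b"); break;
                case '\f': out.append("\\f"); break;
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                default:
                    if (Character.isISOControl(c)) {
                        // Remaining control chars get a four-digit hex escape.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }
}
```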

Support JavaScript line continuations

JavaScript (ECMAScript 2015+) allows line continuations based on escaping line feed characters, like:

var lit = "Some\
literal\
with more than one\
line";

Given that #18 added support in Unbescape for treating line continuators as escape sequences (and thus unescaping them), it makes sense to do the same for JavaScript.
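
A sketch of the unescaping step, assuming that a continuation is a backslash immediately followed by LF, CR, CRLF, U+2028 or U+2029, and that it contributes nothing to the resulting string. LineContinuationSketch is a hypothetical class, not unbescape's implementation.

```java
// Hypothetical helper; not unbescape's implementation.
final class LineContinuationSketch {

    /** Drops every backslash + line terminator pair; all other text is copied as is. */
    static String removeContinuations(String literalBody) {
        StringBuilder out = new StringBuilder(literalBody.length());
        int i = 0;
        while (i < literalBody.length()) {
            char c = literalBody.charAt(i);
            if (c != '\\' || i + 1 >= literalBody.length()) {
                out.append(c);
                i++;
                continue;
            }
            char next = literalBody.charAt(i + 1);
            if (next == '\r') {
                // CR LF counts as one line terminator.
                boolean crlf = i + 2 < literalBody.length() && literalBody.charAt(i + 2) == '\n';
                i += crlf ? 3 : 2;
            } else if (next == '\n' || next == '\u2028' || next == '\u2029') {
                i += 2;
            } else {
                // Some other escape: copy both chars so an escaped backslash
                // followed by a newline is not mistaken for a continuation.
                out.append(c).append(next);
                i += 2;
            }
        }
        return out.toString();
    }
}
```

Applied to the body of the literal in the example above, this yields "Someliteralwith more than oneline".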

Set compilation baseline to Java 6

In order to be able to compile with versions of the JDK >= 9, we should establish the source and target baseline at the maven-compiler-plugin to Java 6. This should be completely painless in 2018.

Possible Exception on unescaping invalid entity

Hi. At my company I need to unescape content of structured data in web pages described by https://schema.org/ properties. The data might be HTML escaped. I've been using apache StringUtils so far, but it cannot handle all entities which occur in those Strings.

I've tried unbescape and it works really well! But I wonder whether there is any configuration or setting that makes unescaping fail on unrecognized entities. I've noticed there are many options for escaping, but none for unescaping.

Example

Following code returns 55" TV �.

HtmlEscape.unescapeHtml("55&quot; TV &#20210525;")

While the desired response for me would be some kind of Exception saying that &#20210525; is not a valid entity.

In my use case, getting no unescaped String at all is much preferable to getting one that contains invalid characters without my knowing about it. And performing extra checks on the input myself does not feel optimal if I use a library to parse it anyway.

Let me know if there is any possibility of using unbescape the described way (maybe I missed some method/option).
Thanks,
Milan
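
Until such an option exists, a caller-side pre-check is one possible workaround. The sketch below only validates numeric references and throws if one is above U+10FFFF; StrictNumericRefCheck is a hypothetical class, and it does not look at named entities at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-check; not part of unbescape.
final class StrictNumericRefCheck {

    // Matches decimal (&#123;) and hexadecimal (&#x7B;) numeric references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    /** Throws IllegalArgumentException if any numeric reference exceeds U+10FFFF. */
    static void requireValidNumericRefs(String html) {
        Matcher m = NUMERIC_REF.matcher(html);
        while (m.find()) {
            boolean hex = m.group(1) != null;
            String digits = hex ? m.group(1) : m.group(2);
            long value;
            try {
                value = Long.parseLong(digits, hex ? 16 : 10);
            } catch (NumberFormatException e) {
                value = Long.MAX_VALUE; // too many digits even for a long
            }
            if (value > Character.MAX_CODE_POINT) {
                throw new IllegalArgumentException(
                        "Invalid numeric character reference: " + m.group());
            }
        }
    }
}
```

Calling requireValidNumericRefs on the input before HtmlEscape.unescapeHtml would reject the &#20210525; example above, at the cost of scanning the input twice.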

Add OSGi bundle metadata to MANIFEST

Add OSGi bundle metadata to META-INF/MANIFEST.MF, checking that all packages are exported and that no packages are imported, given unbescape has no runtime dependencies on any packages outside the standard Java runtime.
