mathiasbynens / he
A robust HTML entity encoder/decoder written in JavaScript.
Home Page: https://mths.be/he
License: MIT License
Your site implies an html encoder, when it is really just escaping characters, not html encoding.
Otherwise, if the number is in the range 0xD800 to 0xDFFF or is greater than 0x10FFFF, then this is a parse error. Return a U+FFFD REPLACEMENT CHARACTER.
Some examples:
he.decode('��') → '\uFFFD\uFFFD'
he.decode('�') → '\uFFFD'
Also check out the table in the spec, e.g.:
he.decode('�') → '\uFFFD'
Reported by the amazing @zcorpan in #whatwg.
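The range rule quoted from the spec can be sketched as a small standalone function. This is only an illustration of that one check, not he's actual implementation; the name decodeCodePoint is hypothetical.

```javascript
// Hypothetical helper: maps the parsed value of a numeric character
// reference to a string, applying only the range rule quoted above.
function decodeCodePoint(codePoint) {
  var isSurrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
  if (isSurrogate || codePoint > 0x10FFFF) {
    // Parse error: return U+FFFD REPLACEMENT CHARACTER.
    return '\uFFFD';
  }
  return String.fromCodePoint(codePoint);
}

console.log(decodeCodePoint(0xD800) === '\uFFFD');   // true
console.log(decodeCodePoint(0x110000) === '\uFFFD'); // true
console.log(decodeCodePoint(0x41));                  // A
```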
Currently I'm just decoding before escaping, but I wonder if there's a faster, more elegant solution.
Thanks and great library!
Consider the following Stack Overflow answer to the question How to decode HTML entities using jQuery?
Just do:
var decoded = $('<textarea/>').html(encoded).val();
where encoded is your string containing HTML entities that you wish to decode. This works similarly to the accepted answer, but is safe to use with untrusted user input.
As noted by Mike Samuel, doing this with a <div> instead of a <textarea> with untrusted user input is an XSS vulnerability, even if the <div> is never added to the DOM:
// Shows the alert in Firefox and Safari (and returns an empty string)
$("<div/>").html('<img src="//www.google.com/images/logos/ps_logo2.png" onload=alert(1337)>').text()
However, this attack is not possible against a <textarea> because there are no HTML elements that are permitted content of a <textarea>. Consequently, any HTML tags still present in the 'encoded' string will be automatically entity-encoded by the browser.
// This is safe (and returns the right answer)
$("<textarea/>").html('<img src="//www.google.com/images/logos/ps_logo2.png" onload=alert(1337)>').text()
Previously, the answer just included the first code snippet. I recently edited the answer to note the rationale behind using a textarea instead of a div. However, I'm a little uneasy, because I know that your library exists and is not (as far as I can tell) strictly targeting node users. I find myself wondering why.
I'll probably post a link to this library as an answer (unless you'd like to do so yourself) to that question regardless, since I figure that people who are using node may benefit from having a single solution that is usable both clientside and serverside. But how about everyone else? What reason is there for anyone to serve a 300 line script to serve a purpose that can - it seems to my naive eyes - be done in 50 characters with a clever hack?
Are there any situations at all in which the textarea hack fails (or at least is not guaranteed by spec to succeed)? I confess to being slightly uneasy about it since I don't know where (or for that matter, if) the spec determines the behaviour of browsers when presented with HTML elements containing disallowed children, like
<textarea>
<p>I'm not really supposed to be here.</p>
</textarea>
but from the testing I've done, it seems to work.
Sorry to offload a question like this onto you, but it seems to be right in your area of expertise and is relevant when figuring out to whom this library is useful. (Indeed, if there is something profoundly wrong with the textarea hack, it almost seems worth noting that in this library's README; otherwise, the case for using a library for this purpose at all is unclear.)
When I use this function I get this error...
he.js:202 Uncaught TypeError: Cannot read property 'replace' of undefined
Where is 'regexEscape' being initialized?
Is there a way to ignore emojis, like
he.encode('You\'re so young 😏', { ignoreEmoji: true })
First of all: Great project, you're taking an interesting approach.
Apparently, you have your own version of a bug I had with my entities module (I'm referring to fb55/entities#8):
Decoding ''amp;' will first decode the hexadecimal escape, and afterwards the named entity.
Even worse, &amp; will also be decoded twice; you'll end up with &;.
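One way to avoid this kind of double decoding is to handle numeric and named references in a single replace pass, so that the output of one replacement is never scanned again. The sketch below only illustrates that idea: the tiny namedReferences table, the decodeOnce name, and the sample inputs are all hypothetical, and this is not how he or entities is implemented.

```javascript
// Hypothetical, deliberately tiny lookup table for the illustration.
var namedReferences = { amp: '&', lt: '<', gt: '>', quot: '"' };

// One regex matches hex, decimal, and named references alike, and
// String.prototype.replace never rescans text it has already produced.
var referencePattern = /&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([a-zA-Z][a-zA-Z0-9]*));/g;

function decodeOnce(html) {
  return html.replace(referencePattern, function (match, hex, decimal, name) {
    // No range validation here; this sketch assumes valid code points.
    if (hex) return String.fromCodePoint(parseInt(hex, 16));
    if (decimal) return String.fromCodePoint(parseInt(decimal, 10));
    // Unknown names are left untouched.
    return Object.prototype.hasOwnProperty.call(namedReferences, name)
      ? namedReferences[name]
      : match;
  });
}

console.log(decodeOnce('&#x26;amp;')); // &amp;
console.log(decodeOnce('&amp;amp;'));  // &amp;
```

In both sample inputs only one level of encoding is removed, because the ampersand produced by the first replacement is not matched again.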
Like https://github.com/mathiasbynens/swapcase/blob/8ded201ff6456e72192dd39e6b3e8260ea6762db/scripts/swap-map.js#L46-L53. It’s much nicer, and it avoids the additional process-data.js step by just making it part of export-data.js.
Current behavior:
> he.encode('\x80')
'&#x80;'
&#x80; is an invalid character reference (parse error) but then again, using the raw U+0080 character is just as invalid. The difference is that U+0080 in HTML source gets ignored, while &#x80; becomes € due to the overrides table.
Should we continue to return invalid entities, knowing they might map to a completely different symbol? Or should we not escape any invalid code points in the input? Or should we strip invalid characters from the input?
Should there be a strict option for encode as well (just like there is for decode) which errors in case an invalid character is part of the source?
cc @zcorpan
Thanks so much for this repo!
That is all :)
Why doesn't the decode method handle "%2c" or "%20"?
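Sequences such as %2c and %20 are percent-encoded (URL) escapes rather than HTML character references, which would explain why an HTML entity decoder leaves them alone. A minimal sketch using the standard decodeURIComponent function; the sample string is made up:

```javascript
// Percent-encoding is handled by the built-in URL helpers,
// not by an HTML entity decoder.
var input = 'a%2cb%20c'; // made-up sample input
var decoded = decodeURIComponent(input);
console.log(decoded); // a,b c
```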
Try to use this at client side. Console throws up an error.
Firefox:
SyntaxError: expected expression, got '<'[Learn More] he.js:32:17
Chrome:
Uncaught SyntaxError: Unexpected token < h2.js:32
Hi. Thanks for this great library. I'm having a challenge downloading this library and using it directly (not via npm). I downloaded he.js from here but it seems to have other dependencies, as it errors on line 32 with Unexpected token:
var encodeMap = <%= encodeMap %>;
Those % tags look like they belong server side. How do you download this library directly? What am I doing wrong? Thanks.
Hello,
Have you considered a streaming implementation of he?
Reading the code, it seems like there are a lot of expression types to check, and the streamer would probably need to be code-generated from a description of all the entities.
From your experience with he, do you think that is feasible and that it would make sense?
Thanks
If the string passed into "encode()" is a number, there's an error occurring on line no. 193.
TypeError: string.replace is not a function
string = string.replace(regexEscape, hexEscape);
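Until the library handles this itself, a caller can coerce the value to a string first. The wrapper below is a sketch: encodeAny is a hypothetical name, and the encoder is passed in as a parameter (in practice this would be he.encode) so that the snippet runs on its own with a stand-in encoder.

```javascript
// Hypothetical wrapper: coerces any input to a string before handing it
// to an encoder such as he.encode, which expects a string.
function encodeAny(value, encoder) {
  return encoder(String(value));
}

// Stand-in encoder for demonstration only; it escapes just '<' and '&'.
function demoEncoder(string) {
  return string.replace(/[<&]/g, function (symbol) {
    return '&#x' + symbol.charCodeAt(0).toString(16).toUpperCase() + ';';
  });
}

console.log(encodeAny(42, demoEncoder));    // 42
console.log(encodeAny('a<b', demoEncoder)); // a&#x3C;b
```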
Can we have v1.1.0 exposed for Bower? Thanks!
Hello,
I ran the test for "Markup characters pass through when allowUnsafeSymbols: true" on the demo site, but I got a value different from what is expected in the test:
he.encode('foo\xA9<bar\uD834\uDF06>baz\u2603"qux', { 'allowUnsafeSymbols': true })
// results
'foo©<bar𝌆>baz☃"qux' // actual
'foo©<bar𝌆>baz☃"qux' // expected
underscore.string has a function escapeHtml
https://github.com/epeli/underscore.string#string-functions
jQuery also.
I guess he is more robust (as stated in the intro), but may we have facts?
How about performance? http://jsperf.com/string-escapehtml
See whatwg/html#1257:
&aaa; or <p title="&ersand;"> should be parse errors
Hey, I'm looking for alternatives to he for the browser, any recommendations? It's just for UX, I'd still be using he on the server.
If the character reference is being consumed as part of an attribute, and the last character matched is not a U+003B SEMICOLON character (;), and the next character is either a U+003D EQUALS SIGN character (=) or an alphanumeric ASCII character, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned. However, if this next character is in fact a U+003D EQUALS SIGN character (=), then this is a parse error, because some legacy user agents will misinterpret the markup in those cases.
This issue was brought to you by @zcorpan’s Quality Assurance Services™.
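The attribute rule quoted above boils down to a three-part condition, which can be sketched as a predicate. The function name and parameters are hypothetical and only model the quoted paragraph; matched stands for the characters matched after the ampersand.

```javascript
// Hypothetical predicate for the rule quoted above: returns true when
// the characters matched after '&' must be "unconsumed", i.e. when the
// reference must be left as literal text.
function mustUnconsume(inAttribute, matched, nextChar) {
  var endsWithSemicolon = matched.charAt(matched.length - 1) === ';';
  var nextIsEqualsOrAlnum = /^[=0-9A-Za-z]$/.test(nextChar);
  return inAttribute && !endsWithSemicolon && nextIsEqualsOrAlnum;
}

console.log(mustUnconsume(true, 'not', 'i'));  // true
console.log(mustUnconsume(true, 'not;', 'i')); // false
console.log(mustUnconsume(false, 'not', 'i')); // false
console.log(mustUnconsume(true, 'amp', '='));  // true (also a parse error per the quote)
```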
Currently, some symbols like + get escaped (e.g. &plus;) simply because there exists a named character reference for them. However, it’s not really necessary to encode these symbols, since they’re printable ASCII already.
We should filter these out in data.js and make sure they don’t end up in encodeMap.
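The filtering step proposed here could look roughly like the sketch below. The sample map, the dropPrintableAscii name, and the exact set of symbols treated as unsafe are assumptions made for the illustration, not the contents of data.js.

```javascript
// Made-up sample of symbol-to-reference-name pairs; the real data set
// is much larger.
var sampleEncodeMap = { '+': 'plus', '!': 'excl', '<': 'lt', '&': 'amp', '\xA9': 'copy' };

// Assumed set of symbols that still need escaping even though they are
// printable ASCII.
var unsafeAscii = /["&'<>`]/;

function dropPrintableAscii(map) {
  var result = {};
  Object.keys(map).forEach(function (symbol) {
    var isPrintableAscii = /^[\x20-\x7E]$/.test(symbol);
    // Keep non-ASCII symbols and the unsafe ASCII ones; drop the rest.
    if (!isPrintableAscii || unsafeAscii.test(symbol)) {
      result[symbol] = map[symbol];
    }
  });
  return result;
}

console.log(Object.keys(dropPrintableAscii(sampleEncodeMap))); // [ '<', '&', '©' ]
```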
Posting these here for future reference. he is not going to support these until they’re standardized or supported in more than one browser.
Source: http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-July/012235.html
&aafs; U+206D ACTIVATE ARABIC FORM SHAPING
&ass; U+206B ACTIVATE SYMMETRIC SWAPPING
&iafs; U+206C INHIBIT ARABIC FORM SHAPING
&iss; U+206A INHIBIT SYMMETRIC SWAPPING
&lre; U+202A LEFT-TO-RIGHT EMBEDDING
&lro; U+202D LEFT-TO-RIGHT OVERRIDE
&nads; U+206E NATIONAL DIGIT SHAPES
&nods; U+206F NOMINAL DIGIT SHAPES
&pdf; U+202C POP DIRECTIONAL FORMATTING
&rle; U+202B RIGHT-TO-LEFT EMBEDDING
&rlo; U+202E RIGHT-TO-LEFT OVERRIDE
&zwsp; U+200B ZERO WIDTH SPACE
This would make it possible to use he as part of an HTML parser/validator, for example.
I'm trying to use this library but every time I include it (WordPress Project) I get the following error in the console in both FF & Chrome:
SyntaxError: expected expression, got '<'
http://******************************/js/he.js?ver=0.5.0
Line 32
Cheers!
A minor issue:
he.decode('&abc;', {strict: true}) throws an error with this message: Parse error: named character reference was not terminated by a semicolon, when in fact neither a nor ab are valid legacy named character references and &abc; is terminated by ;. I think an error message to the effect of Parse error: named character reference is not spec-defined would be better in this case.
This and #50 notwithstanding, he has been a great companion to the HTML5 spec as I learn about and write a spec-compliant HTML entity decoder for Swift :)
Hi, good library!
But I don't want to test whether my string is null before passing it to he.encode (lazy boy :)).
The strings coming from SQL LEFT JOIN queries are often null...
So I added, at line 139, just after "var encode = function(string, options) { ... "
if (string === null) return '';
Hello,
Thanks for all your hard work on the library and the awesome documentation!
I did a performance test recently between he.decode() and using this trick to use the browser's <textarea> element to do the conversion for me.
Surprisingly, I found that he.decode() was 2x slower for my string than using the browser's textarea. Here is the code I used to run my benchmarks. The <script> src at the top should be changed to point to your he.js script location:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<script src="/he.js"></script>
<title></title>
</head>
<body>
<script>
var txtArea = document.createElement("textarea");
function decodeHtmlSameTxtArea(html) {
txtArea.innerHTML = html;
return txtArea.value;
}
function decodeHtml(html) {
var txt = document.createElement("textarea");
txt.innerHTML = html;
return txt.value;
}
var count = 100;
var stringToDecode = "hes a's a''s a'''s a''''s b"s b""s b"""s b""""s \\ // '' ::""&*^ < > << >>";
var a = performance.now();
for (var i = 0; i < count; i++) {
var decodedString = decodeHtml(stringToDecode);
}
var b = performance.now();
console.info("Time Taken (using new txtarea each time):", (b - a)/1000, 'seconds.');
var a = performance.now();
for (var i = 0; i < count; i++) {
var decodedString = decodeHtmlSameTxtArea(stringToDecode);
}
var b = performance.now();
console.info("Time Taken (using same txtarea):", (b - a)/1000, 'seconds.');
var a = performance.now();
for (var i = 0; i < count; i++) {
var decodedString = window.he.decode(stringToDecode);
}
var b = performance.now();
console.info("Time Taken (using HtmlEntities library function):", (b - a)/1000, 'seconds.');
</script>
</body>
</html>
Just wanted to point out this interesting comparison. It isn't really an issue so you can close this!
I was testing encode/decode via https://mothereff.in/html-entities while cross-referencing the spec, and I noticed that he is not able to decode certain named references correctly. On the w3 spec page, it lists this example string, I'm &notit; I tell you, which should be parsed into I'm ¬it; I tell you with a parse error. he returns the string un-parsed. It appears that he is not able to parse legacy named references if there are one or more alphanumeric characters after the legacy named reference followed by a semicolon ; character. he parses correctly if the tail of alphanumeric characters ends with a character other than semicolon.
When I tested the module with the following code:
// file: decode.js
const Transform = require('stream').Transform;
const he = require('he');
const parser = new Transform();
parser._transform = function(data, encoding, done) {
this.push(he.decode(data));
done();
};
process.stdin.pipe(parser).pipe(process.stdout);
process.stdout.on('error', process.exit);
and run it by decoding any text file, say, decoding its own:
$ cat decode.js | node decode.js
I got the following error:
./node_modules/he/he.js:232
return html.replace(regexDecode, function($0, $1, $2, $3, $4, $5, $6, $7) {
^
TypeError: html.replace is not a function
at Object.decode (./node_modules/he/he.js:232:15)
and if I changed html.replace to String(html).replace, it fixes the TypeError. Is it a valid bug/fix?
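A likely cause is that process.stdin delivers Buffer chunks by default, and a Buffer has no replace method. Converting each chunk to a string inside the transform is one way around it. In the sketch below, createDecodeStream is a hypothetical name and the decoder is passed in as a parameter (in practice he.decode). Note that a character reference or a multi-byte character split across two chunks would still not be handled by this approach.

```javascript
const { Transform } = require('stream');

// Hypothetical factory: wraps a string decoder (e.g. he.decode) in a
// Transform stream that converts each Buffer chunk to a string first.
function createDecodeStream(decode) {
  return new Transform({
    transform(chunk, encoding, done) {
      // Chunks from process.stdin are Buffers unless an encoding is set.
      const text = Buffer.isBuffer(chunk) ? chunk.toString('utf8') : String(chunk);
      done(null, decode(text));
    }
  });
}

// Usage, assuming he is installed:
//   process.stdin.pipe(createDecodeStream(require('he').decode)).pipe(process.stdout);
```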
Is it by design that '&ampersand' is converted to '&ersand'? If my intuition is correct, it should not be.
gentlemen,
\u2026 is being decoded correctly both from the hellip and mldr named entities.
However, when encoding, \u2026 is encoded into mldr.
I went through quite a few Unicode reference pages, and everywhere the default named entity is given as hellip.
Please consider switching the encoding to the hellip named entity because it is easier to remember and matches the majority of reference sources on the Internet.
thank you
Hello,
I'm using this library in production and get reports of JavaScript errors (in TrackJS.com) from IE & Edge browsers like:
And when I check the error file / line, it's sometimes in he.js, like here :
And sometimes it's in other parts of the code (like in YUI library)... seems completely random, by the way :
The only idea I have is that IE chokes when loading the very long JSON defined in he.js. Or maybe it's the fact that there are very long strings being processed in regexes?
Here's the exact version of he.js I'm using in my project: he.js.zip
I've modified it a bit, trying to cut the very long lines into shorter lines, which didn't fix it.
Have you ever had any issue of this kind?
Thanks a lot for your help.
gnutix
…once whatwg/html#326 (comment) is resolved.
I want to use the new decimal option, but it seems that [email protected] is not published to npm yet.
Hi,
Using he.decode, I need to keep a part of the code encoded. Is it possible ?
For example, I do he.encode on :
<h1>Title</h1>
<pre>
<p>Code</p>
</pre>
So I get :
<h1>Title</h1>
<pre>
<p>Code</p>
</pre>
And here is what I need to get doing a he.decode :
<h1>Title</h1>
<pre>
<p>Code</p>
</pre>
I created a Grunt script to rename files.
In my object collection the original filename is:
0D0001E2-9AB0-C8D7-D1E8-4F264F3492E3/Adrichem - el Templo de Solomón.jpg
But when I use grunt-contrib-copy to read the encoded src, the result is:
0D0001E2-9AB0-C8D7-D1E8-4F264F3492E3/Adrichem - el Templo de Soloḿn.jpg
Why?
Thanks
This is the code:
cdl: {
  files: [{
    expand: true,
    dot: true,
    cwd: 'brain/files',
    src: '**/*.{jpg,JPG,png,PNG,gif,jpeg,webp,tiff,mp3,wav,avi,mp4}',
    dest: '<%= yeoman.app %>/pages',
    rename: function(dest, src) {
      var attachments = grunt.config.get('CDL.attachments'),
          newFilename;
      grunt.log.writeln(['filename:', he.encode(src)]);
      if (typeof attachments[src] !== 'undefined') {
        newFilename = attachments[src].guid + attachments[src].format;
        grunt.log.writeln([newFilename, src.split('/')[1], dest + src.replace(src.split('/')[1], newFilename)]);
        return dest + '/' + src.replace(src.split('/')[1], newFilename);
      }
      return dest + '/' + src;
    }
  }]
}
Suggested by @zcorpan:
13:47:03 <zcorpan> maybe it'd be useful to opt to avoid named references if one wants better compat with old browsers
13:47:40 <zcorpan> e.g. old IE doesn't support &apos;
13:47:56 <zcorpan> nor the 1000s of mathml entities
Currently either the options argument or the global he.{en,de}code.options object is used, but not both.
(This used to work until now in most cases because there was only a single option available for each he function (decode, encode).)
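Using both sources at once amounts to merging the per-call options over the global defaults. A minimal sketch of that idea follows; mergeOptions is a hypothetical helper, the default values are chosen for the demonstration, and this is not necessarily how he resolves options internally.

```javascript
// Hypothetical helper: per-call options win, and any option not passed
// falls back to the global defaults.
function mergeOptions(options, defaults) {
  return Object.assign({}, defaults, options);
}

// Example default values chosen for the demonstration.
var globalDefaults = { isAttributeValue: false, strict: false };

console.log(mergeOptions({ strict: true }, globalDefaults));
// { isAttributeValue: false, strict: true }
console.log(mergeOptions(undefined, globalDefaults));
// { isAttributeValue: false, strict: false }
```

Copying into a fresh object keeps the global defaults from being mutated by a single call.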
he.decode('©123');
// → '©123'
It should be '©123'.
he.decode('∉ ¬i;');
// → '∉ ¬i;'
It should be '∉ ¬i;'.
See regexCharacterReferencesThatHaveASemicolonFreeCharacterReferenceAsSubstring (lolol) in https://github.com/mathiasbynens/mothereff.in/blob/master/ampersands/eff.js.
As per spec, the number should be parsed before mapping against the table, so � should be decoded just in the same way as � / � / ..., that is, replaced with \uFFFD.
Currently it instead returns an actual "unsafe" \u0000 string.
he.decode('�'); // '\uFFFD'
This may not be worth it, but here goes…
E.g. …
→ U+0085 in XHTML, while in HTML it’s U+2026.
http://www.w3.org/TR/xml/#d0e3895
Entities for these symbols are allowed in XML: http://www.w3.org/TR/xml/#NT-Char
I see that there hasn't been a release since Aug 24, 2014.
Any reason for this and will there be a release soon?
In IE10, I get an error when trying to include he via a Browserify bundle.
The error is:
Invalid character
mergedAssets.min.8516d96.js (5,33406)
Which points to here in he:
"caret",Ç:"caron"