aredridel / html5
Event-driven HTML5 Parser in Javascript
Home Page: http://dinhe.net/~aredridel/projects/js/html5/
License: MIT License
It's a nice touch that you can hand a readable stream to parser.parse, but it would be even more flexible if the parser could act as a writable stream, so that the stream piping logic could reside in the calling code.
Seems like this is an "emerging pattern" :)
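A minimal sketch of what such a writable interface could look like, assuming a hypothetical adapter (`createParserSink` is not part of html5) built on Node's `stream.Writable`. It only buffers the input and hands the whole document to a callback at the end; `parseFn` stands in for a call such as `parser.parse(html)`.

```javascript
var stream = require('stream');

// Hypothetical adapter (not part of html5): buffers everything written to it
// and calls parseFn once with the full document when the stream ends.
function createParserSink(parseFn) {
  var chunks = [];
  return new stream.Writable({
    write: function (chunk, encoding, callback) {
      // With default options, string chunks arrive here as Buffers.
      chunks.push(chunk);
      callback();
    },
    final: function (callback) {
      parseFn(Buffer.concat(chunks).toString('utf8'));
      callback();
    }
  });
}

// parseFn stands in for a call such as parser.parse(html).
var received = null;
var sink = createParserSink(function (html) { received = html; });
sink.write('<p>Hello');
sink.end(' world</p>');
```

With an adapter like this the caller owns the piping, for example by piping any readable stream into `sink`. Because it buffers the whole document, it is not incremental; feeding the tokenizer chunk by chunk would need support inside the parser itself.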
When I attempt to require zombie in a simple script, I get the error below. I've included a listing of the current installed npm packages. Any idea what is causing this?
Error: Cannot find module 'jsdom/level2/core'
at resolveModuleFilename (node.js:280:15)
at loadModule (node.js:242:22)
at require (node.js:306:16)
at EventEmitter.HTML5Parser (/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/parser.js:31:12)
at [object Object].appendHtmlToElement (/usr/local/lib/node/.npm/jsdom/0.1.20/package/lib/jsdom/browser/htmltodom.js:86:15)
at Object.innerHTML (/usr/local/lib/node/.npm/jsdom/0.1.20/package/lib/jsdom/browser/index.js:341:27)
at Object.jsdom (/usr/local/lib/node/.npm/jsdom/0.1.20/package/lib/jsdom.js:25:17)
at History. (/usr/local/lib/node/.npm/zombie/0.8.8/package/lib/zombie/history.js:63:24)
at /usr/local/lib/node/.npm/zombie/0.8.8/package/lib/zombie/history.js:2:61
at History. (/usr/local/lib/node/.npm/zombie/0.8.8/package/lib/zombie/history.js:30:16)
1 awt@DEV 2 ~/projects/md2> npm ls installed
npm info it worked if it ends with ok
npm info using [email protected]
npm info using [email protected]
[email protected] A C++ module for node-js that does base64 encoding and decoding. =pkrumins active installed latest remote base conversion base64 base64 encode base64 d
[email protected] Unfancy JavaScript =jashkenas active installed latest remote stable javascript language coffeescript compiler
[email protected] Markup as CoffeeScript. =mauricemach active installed latest remote template view coffeescript
[email protected] High performance middleware framework =creationix =tjholowaychuk active installed remote
[email protected] CSS Object Model implementation and CSS parser =nv active installed latest remote CSS CSSOM parser styleSheet
[email protected] Sinatra inspired web development framework =tjholowaychuk active installed latest remote framework sinatra web rest restful
[email protected] HTML5 HTML parser, including support for SVG and MathML foreign content =aredridel installed remote
[email protected] HTML5 HTML parser, including support for SVG and MathML foreign content =aredridel active installed latest remote
[email protected] Forgiving HTML/XML/RSS Parser in JS for both Node and Browsers =tautologistics active installed latest remote
[email protected] Jade template engine =tjholowaychuk active installed latest remote
[email protected] jQuery: The Write Less, Do More, JavaScript Library =coolaj86 active installed latest remote util dom jquery
[email protected] CommonJS implementation of the DOM intended to be platform independent and as minimal/light as possible while completely adhering to the w3c DOM specifica
[email protected] CommonJS implementation of the DOM intended to be platform independent and as minimal/light as possible while completely adhering to the w3c DOM specifica
[email protected] A super simple utility library for dealing with mime-types =broofa active installed latest remote util mime
[email protected] Command line mjsunit runner which provides an easy way to hook into mjsunit and start running tests immediately.. =tmpvar active installed latest remot
[email protected] Easy unit testing for node.js and the browser. =caolan active installed latest remote
[email protected] A package manager for node =isaacs active installed remote package manager modules install package.json
[email protected] Simplified HTTP request method. =mikeal active installed latest remote
[email protected] Syntactically Awesome Stylesheets (compiles to css) =tjholowaychuk active installed latest remote sass template css view
[email protected] The cross-browser WebSocket =rauchg =Tim-Smart active installed remote
[email protected] Web development, cut-the-crap style. =mauricemach active installed latest remote framework websockets coffeescript
[email protected] Insanely fast, full-stack, headless browser testing using Node.js =assaf active installed latest remote test spec headless full-stack
npm ok
Is there a way to make Parser.parse asynchronous for a string?
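One way to approximate this from calling code, sketched under the assumption that the parse call itself is synchronous: defer it to a later turn of the event loop and report the outcome through a Node-style callback. `parseLater` and `parseFn` are hypothetical names, not part of html5.

```javascript
// Hypothetical helper: defers a synchronous parse call to a later turn of the
// event loop and reports the outcome through a Node-style callback.
function parseLater(parseFn, html, callback) {
  setImmediate(function () {
    var result;
    try {
      result = parseFn(html);
    } catch (err) {
      return callback(err);
    }
    callback(null, result);
  });
}

// Example with a stand-in parse function that just measures the input.
var order = [];
parseLater(function (s) { return s.length; }, '<p>x</p>', function (err, n) {
  order.push('parsed ' + n);
});
order.push('scheduled');
```

This only moves the work to a later tick; the parse still blocks the event loop while it runs.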
In #whatwg, jgraham mentioned it looks like you're using a very out-of-date version of the html5lib test suite. Updating would bring in many more testcases.
On node 0.6.3 and html5 v0.3.5, parsing multiple page fragments with the same parser sometimes causes failures in the treebuilder, which are caused by a pointer-unequal marker element in reconstructActiveFormattingElements and elementInActiveFormattingElements. The problem is mainly triggered if the parsed fragments contain tables.
The following patch converts the pointer equality based check into a node type check, which fixes the problem for me:
--- treebuilder.js 2011-11-28 21:30:17.675749830 +0100
+++ /usr/lib/node_modules/html5/lib/html5/treebuilder.js 2011-11-28 20:21:38.067593170 +0100
@@ -224,9 +224,9 @@
// Step 2 and 3: start with the last element
var i = this.activeFormattingElements.length - 1;
var entry = this.activeFormattingElements[i];
- if(entry == HTML5.Marker || this.open_elements.indexOf(entry) != -1) return;
+ if(entry.type == HTML5.Marker.type || this.open_elements.indexOf(entry) != -1) return;
- while(entry != HTML5.Marker && this.open_elements.indexOf(entry) == -1) {
+ while(entry.type != HTML5.Marker.type && this.open_elements.indexOf(entry) == -1) {
i -= 1;
entry = this.activeFormattingElements[i];
if(!entry) break;
@@ -248,7 +248,7 @@
b.prototype.elementInActiveFormattingElements = function(name) {
var els = this.activeFormattingElements;
for(var i = els.length - 1; i >= 0; i--) {
- if(els[i] == HTML5.Marker) break;
+ if(els[i].type == HTML5.Marker.type) break;
if(els[i].tagName.toLowerCase() == name) return els[i];
}
return false;
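The difference between the two checks can be illustrated with a small sketch. It assumes, as the patch implies, that marker objects carry a `type` property; the concrete values `'Marker'` and `'Element'` are made up for the example.

```javascript
// Two marker objects created independently; they are equal in type but are
// not the same object. The 'Marker' / 'Element' values are assumptions.
var markerA = { type: 'Marker' };
var markerB = { type: 'Marker' };
var element = { type: 'Element', tagName: 'B' };

// Identity check: true only for the very same object.
function isMarkerByIdentity(entry, marker) { return entry === marker; }

// Type check, in the spirit of the patch.
function isMarkerByType(entry, marker) { return entry.type === marker.type; }

console.log(isMarkerByIdentity(markerB, markerA)); // false
console.log(isMarkerByType(markerB, markerA));     // true
console.log(isMarkerByType(element, markerA));     // false
```

One caveat with a type-based check: if neither the marker nor the entries define `type`, both sides are `undefined` and every entry would compare as a marker.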
Do this while minimally involving parser for state transitions between tokenizing modes.
Using the Zombie module, I get this when trying to visit a Facebook group page. It happens with only a few of them.
Zombie: GET http://www.facebook.com/group.php?gid=104369172929110&_fb_noscript=1 => 200
/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/tokenizer.js:62
throw(e);
^
TypeError: Cannot call method 'toLowerCase' of undefined
at Object.endTagFormatting (/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/parser/in_body_phase.js:646:85)
at Object.processEndTag (/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/parser/phase.js:50:36)
at EventEmitter.do_token (/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/parser.js:97:20)
at EventEmitter.<anonymous> (/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/parser.js:112:30)
at EventEmitter.emit (events:27:15)
at EventEmitter.emitToken (/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/tokenizer.js:84:7)
at EventEmitter.emit_current_token (/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/tokenizer.js:813:7)
at EventEmitter.tag_name_state (/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/tokenizer.js:358:8)
at EventEmitter.<anonymous> (/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/tokenizer.js:59:25)
at EventEmitter.emit (events:27:15)
Testcase:
var html5 = require('html5');
var parser = new html5.Parser();
var html = "Lasse Kronér";
parser.parse(html);
console.log(parser.document.innerHTML)
// Actual: <html><head></head><body>Lasse Kroné;r</body></html>
// Expected: <html><head></head><body>Lasse Kronér</body></html>
Attempting to parse a quirky HTML document, I get the following:
/usr/local/lib/node/.npm/html5/0.2.10/package/lib/html5/tokenizer.js:62
throw(e);
^
ReferenceError: HTML5 is not defined
at Object.insert_html_element (/usr/local/lib/node/.npm/html5/0.2.10/package/lib/html5/parser/before_html_phase.js:42:4)
at Object.processStartTag (/usr/local/lib/node/.npm/html5/0.2.10/package/lib/html5/parser/before_html_phase.js:29:7)
at EventEmitter.do_token (/usr/local/lib/node/.npm/html5/0.2.10/package/lib/html5/parser.js:94:20)
at EventEmitter.<anonymous> (/usr/local/lib/node/.npm/html5/0.2.10/package/lib/html5/parser.js:112:30)
at EventEmitter.emit (events:31:17)
at EventEmitter.emitToken (/usr/local/lib/node/.npm/html5/0.2.10/package/lib/html5/tokenizer.js:84:7)
at EventEmitter.emit_current_token (/usr/local/lib/node/.npm/html5/0.2.10/package/lib/html5/tokenizer.js:813:7)
at EventEmitter.tag_name_state (/usr/local/lib/node/.npm/html5/0.2.10/package/lib/html5/tokenizer.js:358:8)
at EventEmitter.<anonymous> (/usr/local/lib/node/.npm/html5/0.2.10/package/lib/html5/tokenizer.js:59:25)
at EventEmitter.emit (events:31:17)
Looks like there's a debug statement there using a library not require()'d in that module.
I tried running tree-construction-test.js in node 0.10, and test case data/tree-construction/tests25.dat-19 failed, first with a few hundred repeats of
(node) warning: Recursive process.nextTick detected. This will break in the next version of node. Please use setImmediate for recursive deferral.
and then finally
RangeError: Maximum call stack size exceeded
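The warning itself points at the usual remedy: use `setImmediate` rather than `process.nextTick` for recursive deferral. A minimal sketch of that pattern follows; `processAll` is a hypothetical helper, and where the recursion actually lives in html5 is not established by this report.

```javascript
// Handles items one at a time, scheduling each following step with
// setImmediate so the event loop can run between steps.
// The first item is handled synchronously.
function processAll(items, handle, done) {
  var i = 0;
  function step() {
    if (i >= items.length) return done();
    handle(items[i++]);
    setImmediate(step);
  }
  step();
}

var seen = [];
var finished = false;
processAll([1, 2, 3], function (x) { seen.push(x); }, function () {
  finished = true;
});
```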
When installing via npm, the doc folder appears to contain an index.html file that my system reports as a symlink to a nonexistent or non-stat'able file.
Normally this wouldn't be a problem, but it causes run.js (https://github.com/DTrejo/run.js) to fall over. I've reported the issue to run.js too.
I'm getting errors from the html5 module, but without any hint at what it was trying to parse. Maybe an issue in the html5 module itself.
Test URL: http://jquery.bassistance.de/qunit/test/
Output I get: https://gist.github.com/766133
This is breaking jsdom. Let me know ASAP if you're going to fix this; otherwise I will publish a hotfix version of jsdom that pins the html5 dependency to the previous version without the problem.
$ git clone https://github.com/aredridel/html5.git && cd html5 && git submodule update --init && npm test
Cloning into 'html5'...
remote: Counting objects: 3493, done.
remote: Compressing objects: 100% (1139/1139), done.
remote: Total 3493 (delta 2353), reused 3440 (delta 2305)
Receiving objects: 100% (3493/3493), 1.21 MiB | 705 KiB/s, done.
Resolving deltas: 100% (2353/2353), done.
Submodule 'tools/ronnjs' (http://github.com/kapouer/ronnjs.git) registered for path 'tools/ronnjs'
Cloning into 'tools/ronnjs'...
remote: Counting objects: 357, done.
remote: Compressing objects: 100% (200/200), done.
remote: Total 357 (delta 159), reused 326 (delta 134)
Receiving objects: 100% (357/357), 195.56 KiB, done.
Resolving deltas: 100% (159/159), done.
Submodule path 'tools/ronnjs': checked out 'f4f3ce7ef546dbc1651f9970eb28ab209afd58e7'
> [email protected] test /home/james/src/html5
> tap test/functional
sh: tap: command not found
npm ERR! [email protected] test: `tap test/functional`
npm ERR! `sh "-c" "tap test/functional"` failed with 127
npm ERR!
npm ERR! Failed at the [email protected] test script.
npm ERR! This is most likely a problem with the html5 package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR! tap test/functional
npm ERR! You can get their info via:
npm ERR! npm owner ls html5
npm ERR! There is likely additional logging output above.
npm ERR!
npm ERR! System Linux 3.1.4-1-ARCH
npm ERR! command "node" "/usr/bin/npm" "test"
npm ERR! cwd /home/james/src/html5
npm ERR! node -v v0.6.5
npm ERR! npm -v 1.1.0-alpha-6
npm ERR! code ELIFECYCLE
npm ERR! message [email protected] test: `tap test/functional`
npm ERR! message `sh "-c" "tap test/functional"` failed with 127
npm ERR!
npm ERR! Additional logging details can be found in:
npm ERR! /home/james/src/html5/npm-debug.log
npm not ok
There is no deps/jquery folder, and there is no Rakefile.
I found a problem with the following code:
var HTML5 = require('html5')
var p = new HTML5.Parser();
HTML5.enableDebug('tokenizer.token')
var str = '<!DOCTYPE html>\n' +
'<html> \n'+
'<body>\n'+
'<script>\n' +
'var a = 1;\n' +
'</script>\n' +
'<a href="#">top</a>\n'
p.parse(str);
The output is:
DEBUG: 'tokenizer.token' { type: 'Doctype',
name: 'html',
publicId: null,
systemId: null,
correct: true }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.token' { type: 'StartTag', name: 'html', data: [] }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: ' ' }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: ' ' }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.token' { type: 'StartTag', name: 'body', data: [] }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.token' { type: 'Characters', data: undefined }
DEBUG: 'tokenizer.token' { type: 'StartTag', name: 'script', data: [] }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.token' { type: 'Characters', data: 'var a = 1;\n' }
DEBUG: 'tokenizer.token' { type: 'Characters', data: 'var a = 1;\n' }
DEBUG: 'tokenizer.token' { type: 'EndTag', name: 'script', data: [] }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.token' { type: 'StartTag',
name: 'a',
data: [ { nodeName: 'href', nodeValue: '#' } ] }
DEBUG: 'tokenizer.token' { type: 'Characters', data: 'top' }
DEBUG: 'tokenizer.token' { type: 'EndTag', name: 'a', data: [] }
DEBUG: 'tokenizer.token' { type: 'EOF', data: 'End of File' }
The tokenizer gives me { type: 'Characters', data: undefined } after { type: 'StartTag', name: 'body', data: [] }, which is wrong.
var html5 = require('html5');
var parser = new html5.Parser();
parser.parse("<p>– Om inget görs åt utsläppen...</p>");
console.log(parser.document.innerHTML)
// Expected: <html><head></head><body><p>– Om inget görs åt utsläppen...</p></body></html>
// Actual: <html><head></head><body><p>– Om inget görs åt utsläppen...</p></body></html>
node version: v0.2.5
jsdom version: 0.1.20
This happens after attempting to run the example:
node.js:63
throw e;
^
Error: Cannot find module 'jsdom/level2/core'
at loadModule (node.js:275:15)
at require (node.js:411:14)
at EventEmitter.HTML5Parser (/home/ubuntu/nvm/v0.2.5/lib/node/.npm/html5/0.2.5/package/lib/html5/parser.js:31:12)
at Object.<anonymous> (/home/ubuntu/html5.js:11:14)
at Module._compile (node.js:462:23)
at Module._loadScriptSync (node.js:469:10)
at Module.loadSync (node.js:338:12)
at Object.runMain (node.js:522:24)
at Array.<anonymous> (node.js:756:12)
at EventEmitter._tickCallback (node.js:55:22)
var util = require('util'),
args = require('argsparser').parse(),
zombie = require('zombie');
var options = {
debug : true,
runScripts : true,
userAgent : 'Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.98 Safari/534.13'
};
var url = 'http://www.despegar.com.ar/search/flights/RoundTrip/bue/mad/2011-04-27/2011-5-14/1/0/0';
zombie.visit(url, options, function(err, browser, status) {
if (err) throw err;
});
throws...
Zombie: GET http://www.despegar.com.ar/search/flights/RoundTrip/bue/mad/2011-04-27/2011-5-14/1/0/0
Zombie: GET http://www.despegar.com.ar/search/flights/RoundTrip/bue/mad/2011-04-27/2011-5-14/1/0/0 => 200
Zombie: GET http://www.despegar.com.ar/Search/js/jquery-1.2.6.min.js
Zombie: GET http://ar.staticontent.com/js-versioned/2.13.4/FrameworkJS/base.js
Zombie: GET http://ar.staticontent.com/js-versioned/2.13.4/FrameworkJS/pkg/FlightsCommonResults.js
Zombie: GET http://ar.staticontent.com/js-versioned/2.13.4/FrameworkJS/pkg/omniture.js
Zombie: GET http://www.despegar.com.ar/search/flights/RoundTrip/bue/mad/2011-04-27/2011-5-14/1/0/0
/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:62
throw(e);
^
Error: undefined: attribute name: "
at Object.createAttribute (/usr/local/lib/node/.npm/jsdom/0.1.23/package/lib/jsdom/level1/core.js:1239:13)
at Object.setAttribute (/usr/local/lib/node/.npm/jsdom/0.1.23/package/lib/jsdom/level1/core.js:937:37)
at TreeBuilder.copyAttributeToElement (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/treebuilder.js:20:11)
at TreeBuilder.createElement (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/treebuilder.js:40:10)
at TreeBuilder.insert_element_normal (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/treebuilder.js:61:21)
at TreeBuilder.insert_element (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/treebuilder.js:52:15)
at Object.addFormattingElement (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser/in_body_phase.js:719:12)
at Object.startTagFormatting (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser/in_body_phase.js:330:7)
at Object.processStartTag (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser/phase.js:41:38)
at Object.startTagOther (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser/in_cell_phase.js:59:37)
I'm using the Tokenizer to chop up HTML snippets, and it is working like a dream. Unfortunately, it has the odd habit of splitting up a script block if it contains tags, even though they are inside literals. I don't know if this is by design or by accident, but I would expect it to handle the whole script block as one Character token.
Here is an example of the problem:
<body>
<script>
var x = "<a href='myPage.htm'>myLink<\/a>"
</script>
</body>
When tokenized, this snippet will result in something in the (simplified) neighbourhood of:
I would have expected 3-5 to be all in one Character node. The end a-tag also results in a ParseError, since it is escaped and thus does not get recognized. So my code ends up deeming this snippet malformed, lacking the end a-tag.
Is this a bug, or is it by design because it should be handled elsewhere (the Parser?). I'm not using the parser because of speed and a desire to keep everything streaming nicely, without having to wait for the whole tree to be built and managed.
My guess is that this thing would count for poorly named selectors in a style tag too, but I haven't tried that out yet.
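For comparison, the HTML tokenization rules treat script content as raw text that ends only at a `</script` end tag followed by whitespace, `/` or `>`. The sketch below illustrates that rule with regular expressions; `firstScriptText` is a hypothetical helper, not part of html5, and it ignores details such as the spec's "script data escaped" states and `>` inside attribute values.

```javascript
// Returns the raw text of the first <script> element in html, or null if
// there is none. Simplified: does not handle '>' inside attribute values or
// the "script data escaped" states.
function firstScriptText(html) {
  var open = /<script\b[^>]*>/i.exec(html);
  if (!open) return null;
  var start = open.index + open[0].length;
  // The raw text ends at "</script" followed by whitespace, "/" or ">".
  var close = /<\/script[\s\/>]/i.exec(html.slice(start));
  var end = close ? start + close.index : html.length;
  return html.slice(start, end);
}

// The snippet from the report: the tags inside the string literal stay part
// of one run of script text.
var sample = '<body>\n<script>\nvar x = "<a href=\'myPage.htm\'>myLink<\\/a>"\n</script>\n</body>';
var scriptText = firstScriptText(sample);
console.log(JSON.stringify(scriptText));
```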
This is not an issue for your project. I just wanted to pass along a comment I just posted on the jsdom GitHub site:
...
I reported on this issue, see #290. Inspired by this post (#328), I also tried the html5 parser. Code snippet:
request(url, function (error, response, body) {
if (error) {
onError(message, error, socket);
} else {
var window = jsdom.jsdom(null, null, { parser : html5 }).createWindow();
var parser = new html5.Parser({ document: window.document });
parser.parse(body);
jsdom.jQueryify(window, jQueryLib, function(window, jquery) {
var $ = window.$;
// Do work...
});
}
});
This pattern can be found on the html5 GitHub site. While I was able to pass in 'features' options (as detailed above), the jsdom.jQueryify function appears to overwrite the feature settings. See jsdom.js (lines 122-4):
122 window.document.implementation.addFeature('FetchExternalResources', ['script']);
123 window.document.implementation.addFeature('ProcessExternalResources', ['script']);
124 window.document.implementation.addFeature('MutationEvents', ["1.0"]);
Naturally, I wanted to set these to false to avoid script processing. My only option was to edit the code, setting all 3 features to false. It would be nice if the global defaultDocumentFeatures function worked as expected or the jQueryify function signature provided for options/features.
...
So using the html5 parser allows me to circumvent the jsdom 'hierarchy request' error. But the integration between html5 and jsdom does not allow me to disable target features, unless I'm missing something.
Hello,
I tried to launch the unit tests on master and v0.2.9 and I get a lot of errors.
On master:
FAILURES: 464/2502 assertions failed (4127ms)
Do you get the same result, or is my setup wrong?
Thank you
Jerome
While working with Zombie, I found that only lowercase <script> tags appear to be parsed. Uppercase <SCRIPT> tags seem to be ignored or not handled as script tags, so JavaScript redirects are not followed as they are in a regular browser. The problem exists in version 0.3.8.
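HTML tag names are ASCII case-insensitive, so any comparison against `script` needs to normalise the name first. A minimal sketch of such a check follows; `isScriptTag` is a hypothetical helper, and the report does not establish whether the faulty comparison sits in html5, jsdom or Zombie.

```javascript
// Compares a tag name against "script" without regard to case.
function isScriptTag(tagName) {
  return String(tagName).toLowerCase() === 'script';
}

console.log(isScriptTag('script')); // true
console.log(isScriptTag('SCRIPT')); // true
console.log(isScriptTag('style'));  // false
```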
var zombie = require("zombie")
var assert = require("assert")
zombie.visit("http://www.tripadvisor.com", function(err, browser, status) {
assert.equal(status, 200)
})
produces
/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:62
throw(e);
^
TypeError: Object [ undefined ] has no method 'getAttribute'
at Object.startTagHtml (/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser/phase.js:67:35)
at Object.processStartTag (/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser/phase.js:41:38)
at Object.processStartTag (/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser/before_html_phase.js:31:20)
at EventEmitter.do_token (/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser.js:94:20)
at EventEmitter.<anonymous> (/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser.js:112:30)
at EventEmitter.emit (events.js:64:17)
at EventEmitter.emitToken (/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:84:7)
at EventEmitter.emit_current_token (/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:814:7)
at EventEmitter.after_attribute_value_state (/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:558:8)
at EventEmitter.<anonymous> (/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:59:25)
For example:
jsdom.jQueryify(window, require.resolve("./jquery.min.js"), function(window, jquery) {
var body = jquery('body').html();
});
I want to be able to write javascript in the current content.
README.md says at the end:
[...] and give it a run:
node test.js
but test.js no longer exists.
var html5 = require('html5');
var parser = new html5.Parser();
parser.parse("
Just nu ligger årets medeltemperatur nästan 0,6 grader.
");
I want to use HTML5 for parsing pages, as part of Zombie.js (http://zombie.labnotes.org), but ran into an issue with the way elements are added to the tree.
When an element gets added to the tree, JSDOM fires a DOMNodeInserted event. It also listens to this event, and when it's fired on a SCRIPT element, loads the script (external) or evaluates it (internal). That's consistent with the way browsers evaluate scripts immediately after they're added to the document.
However, when the DOMNodeInserted event is fired, the SCRIPT element has no contents. I'm guessing the element is added to the tree first, then any child nodes are added to it. I couldn't figure out where this takes place, or how to change it (If I could, I would and send a pull request).
Can you help with that?
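The ordering problem can be reproduced with a toy tree (not jsdom) in which `appendChild` emits an event. The sketch only shows why a listener sees an empty SCRIPT when the element is inserted before its content; whether html5's tree builder can defer the insertion is not established here. All names below are made up for the example.

```javascript
var EventEmitter = require('events').EventEmitter;

// Toy tree, not jsdom: appendChild emits 'inserted' with the child node.
var bus = new EventEmitter();
function node(name) { return { name: name, children: [], text: '' }; }
function appendChild(parent, child) {
  parent.children.push(child);
  bus.emit('inserted', child);
}

// Listener that records what a script contains at insertion time.
var seenText = [];
bus.on('inserted', function (child) {
  if (child.name === 'script') seenText.push(child.text);
});

var body = node('body');

// Order A: insert first, fill later. The listener sees an empty script.
var early = node('script');
appendChild(body, early);
early.text = 'var a = 1;';

// Order B: fill first, insert later. The listener sees the contents.
var late = node('script');
late.text = 'var a = 1;';
appendChild(body, late);

console.log(seenText); // [ '', 'var a = 1;' ]
```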
The page http://www.getdigital.de/ kills the parser with a weird message:
node.js:134
throw e; // process.nextTick error, or 'error' event on first tick
^
" is missing
The same small example code can parse other pages correctly.
Are jsdom, bench, tap, and opts really dependencies? Or are they devDependencies?
This came up because we're considering using this as the default jsdom parser, but then there would be a circular dependency between jsdom -> html5 -> jsdom -> html5 -> ...
Hi Guys,
document.write during document load should write the content inline instead of replacing the body.
Example
MYCONTENT
< script >
document.write('WOOT');
< /script >
Expected Output
MYCONTENTWOOT
Bug Output
WOOT
I get the following error while trying to run html5:
node.js:50
throw e;
^
TypeError: Cannot read property '_bytesRead' of undefined
at Object.readFileSync (fs:118:44)
at Object..js (node.js:332:39)
at Module.load (node.js:255:25)
at loadModule (node.js:242:12)
at require (node.js:272:14)
at Object.<anonymous> (/usr/local/lib/node/.npm/html5/0.2.5/package/lib/html5/constants.js:1076:22)
at Module._compile (node.js:325:23)
at Object..js (node.js:333:12)
at Module.load (node.js:255:25)
at loadModule (node.js:242:12)
Even after reducing my test to a single line it still errors:
var HTML5 = require('html5');
Any help would be appreciated
I'm trying to parse www.tripadvisor.com with "zombie" (for Node.js), which uses html5.
The following exception is thrown:
/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:62
throw(e);
^
TypeError: Object [ undefined ] has no method 'getAttribute'
at Object.startTagHtml (/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/parser/phase.js:66:35)
at Object.processStartTag (/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/parser/phase.js:41:38)
at Object.processStartTag (/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/parser/before_html_phase.js:31:20)
at EventEmitter.do_token (/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/parser.js:100:20)
at EventEmitter.<anonymous> (/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/parser.js:118:30)
at EventEmitter.emit (events.js:64:17)
at EventEmitter.emitToken (/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:84:7)
at EventEmitter.emit_current_token (/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:814:7)
at EventEmitter.after_attribute_value_state (/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:558:8)
at EventEmitter.<anonymous> (/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:59:25)
var util = require('util'),
zombie = require('zombie');
var browser = zombie.visit('http://www.google.com/search?q=twitter', {
runScripts : false,
debug : true,
userAgent : "Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.98 Safari/534.13"
}, function(err, browser, st) {
if (err) throw err;
browser.dump();
});
throws...
Zombie: GET http://www.google.com/search?q=twitter
Zombie: GET http://www.google.com/search?q=twitter => 200
Zombie: GET http://www.google.com/blank.html
/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:62
throw(e);
^
TypeError: Cannot read property '27' of undefined
at EventEmitter.consume_numeric_entity (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:174:32)
at EventEmitter.consume_entity (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:103:16)
at EventEmitter.entity_data_state (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:251:20)
at EventEmitter.<anonymous> (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:59:25)
at EventEmitter.emit (events.js:42:17)
at EventEmitter.pump (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:45:11)
at EventEmitter.tokenize (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:78:21)
at EventEmitter.parse (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser.js:47:17)
at HtmlToDom.appendHtmlToElement (/usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/browser/htmltodom.js:90:50)
at Object.innerHTML (/usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/browser/index.js:334:27)
When trying to run example.js, I get an error:
The "sys" module is now called "util". It should have a similar interface.
It is also a problem with ronn.js but I see that @rtomayko already is changing that in the code
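As the message says, the usual fix is to require `util` wherever `sys` was used. A minimal sketch follows; which calls example.js actually makes on `sys` is not shown in this report, so `inspect` is only an illustration.

```javascript
// 'sys' was the older name of this module; 'util' offers a similar interface.
var util = require('util');

// Example call: format an object for logging.
var formatted = util.inspect({ parsed: true });
console.log(formatted);
```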
Several sites I am looking at have code in them something like this:
<html itemscope itemtype="http://schema.org/NewsArticle" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="msvalidate.01" content="9D28F7743C790DD88F2D9C7375EF7ED5" />
...tons of useless stuff that parses correctly removed for clarity...
<script type="text/javascript">
//<![CDATA[
var shortURL = "";
BitlyCB.alertResponse = function(data) {
var s = '';
var first_result;
// Results are keyed by longUrl, so we need to grab the first one.
for (var r in data.results) {
first_result = data.results[r]; break;
}
for (var key in first_result) {
//s += key + ":" + first_result[key].toString();
if(key == "shortUrl")
{
shortURL = first_result[key].toString();
break;
}
}
PostTwitter();
}
var newstogramURL = "";
window.onload=function(){
if(readCookie("DMUserTrack") != "")
eraseCookie("DMUserTrack");
OnNewsreleaseLoad();
//var userid = readCookie('DMUserTrack');
var csid = 193265181;
newstogramURL = 'http://www.prnewswire.com/templates/PRN_Custom_MutltiVu_Recommendation?t=1362148791373&id=193265181';
//var newstogramURL ='http://www.prnewswire.com/templates/prnwConfig.xml';
var delay = function() { sendToFlash('StartPlay',newstogramURL); };
setTimeout(delay,1000,"JavaScript");
//setTimeout("sendToFlash('StartPlay', 'http://www.prnewswire.com/templates/prnwConfig.xml')",100);
};
function ShortenURL()
{
//Bit.ly function call to shorten url
BitlyClient.call('shorten', {'longUrl': formatURL()}, 'BitlyCB.alertResponse');
}
//]]>
</script>
<!--[if lte IE 6]>
<script type="text/javascript" src="http://www.prnewswire.com/includes/PRN_jquery.pngFix.js"></script>
<script type="text/javascript">
$(document).ready(function(){
$(document).pngFix( );
});
</script>
<![endif]-->
<style type="text/css">
/* Style Definitions */
span.prnews_span
{
font-size:8pt;
font-family:"Arial";
color:black;
}
a.prnews_a
{
color:blue;
}
li.prnews_li
{
font-size:8pt;
font-family:"Arial";
color:black;
}
p.prnews_p
{
font-size:0.62em;
font-family:"Arial";
color:black;
margin:"0in";
}
</style>
<!-- Below script is for Google Analytics -->
<script type="text/javascript">
//<![CDATA[
var _gaq = _gaq || [];
var loc = location.href;
var env = '';
if(env == "dev."){ // for Dev Environment
_gaq.push(['_setAccount', 'UA-21992272-4']);
} else if(env=="stage."){ // For Stage Environment
_gaq.push(['_setAccount', 'UA-21992272-5']);
} else { // For Prodcution Environment
_gaq.push(['_setAccount', 'UA-21992272-1']);
}
//_gaq.push(['_setDomainName', '.www.prnewswire.com']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
//]]>
</script>
<link rel="stylesheet" type="text/css" href="http://ajax.googleapis.com/ajax/libs/jqueryui/1.7.2/themes/dot-luv/jquery-ui.css" />
</head>
jsdom throws an error 3, even though the HTML is valid. I think I have traced it down to the HTML5 parser: it appears that the data for the comment node is not being set correctly, causing subsequent nodes to be assigned to the incorrect parent. Here is a dump of the nodes being generated for the above HTML section. Notice, at about the 22nd line down, the '<![endif]': it appears that conditional comments are not being parsed into data correctly, and this then throws off the rest of the document.
Parent [ HTML ]
Node { data: ' \r\n', type: 'text' }
Parent [ null ]
Node { type: 'tag',
name: 'html',
attribs:
{ itemscope: '',
itemtype: 'http://schema.org/NewsArticle',
xmlns: 'http://www.w3.org/1999/xhtml' },
children:
[ { data: '\r\n', type: 'text' },
{ type: 'tag',
name: 'head',
attribs: {},
children: [Object],
prev: null,
next: [Object],
parent: [Circular] },
{ data: '\t\r\n', type: 'text' },
{ data: '[if lte IE 6]>\r\n<script type="text/javascript" src="http://www.prnewswire.com
/includes/PRN_jquery.pngFix.js"></script>\r\n<script type="text/javascript">\r\n$(document).r
eady(function(){\r\n\t$(document).pngFix( );\r\n\t});\r\n</script>\r\n<![endif]',
type: 'comment' },
{ data: '\r\n', type: 'text' },
{ type: 'style',
name: 'style',
attribs: [Object],
children: [Object],
prev: [Object],
next: [Object],
parent: [Circular] },
{ data: '\r\n\r\n', type: 'text' },
{ data: ' Below script is for Google Analytics ',
type: 'comment' },
{ data: '\r\n', type: 'text' },
{ type: 'script',
name: 'script',
attribs: [Object],
children: [Object],
prev: [Object],
next: null,
parent: [Circular] },
{ data: ' \r\n', type: 'text' } ],
prev: null,
next: null,
parent: null }
Parent [ null ]
Node { data: '\r\n ', type: 'text' }
Parent [ null ]
Node { type: 'tag',
name: 'link',
attribs:
{ rel: 'stylesheet',
type: 'text/css',
href: 'http://ajax.googleapis.com/ajax/libs/jqueryui/1.7.2/themes/dot-luv/jquery-ui.css'
},
children: [],
prev: null,
next: null,
parent: null }
var html5 = require('html5').HTML5;
var p = new html5.Parser();
p.parse('<p><u>foo</p>');
p.parse('<p>bar</p>');
console.log( p.document.innerHTML );
-> <html><head></head><body><p><u>bar</u></p></body></html>
Not 100% sure if this is correct in all cases, but moving the initialization of open_elements and activeFormattingElements to TreeBuilder.prototype.reset and calling it from the constructor fixes this issue for me. Will send a pull request with this change.
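The idea behind that change can be sketched as follows. This is a minimal, hypothetical stand-in for the tree builder, not the library's actual code: only the names `TreeBuilder`, `reset`, `open_elements` and `activeFormattingElements` come from the description above, and `openFormatting` is invented purely for illustration.

```javascript
// Hypothetical, stripped-down TreeBuilder; the real one in html5 does far more.
function TreeBuilder() {
  // The constructor delegates to reset() so both code paths share one initializer.
  this.reset();
}

// Per-parse state is (re)created here so every parse starts from a clean slate.
TreeBuilder.prototype.reset = function () {
  this.open_elements = [];
  this.activeFormattingElements = [];
};

// Invented helper standing in for parsing: records a formatting element
// that was opened but never closed, as with '<p><u>foo</p>'.
TreeBuilder.prototype.openFormatting = function (name) {
  this.open_elements.push(name);
  this.activeFormattingElements.push(name);
};

var tree = new TreeBuilder();
tree.openFormatting('u'); // the first parse leaves <u> in the formatting list
tree.reset();             // assumed to run before the second parse starts
console.log(tree.activeFormattingElements.length);
```

With the state recreated in `reset`, the stale `u` entry can no longer be reconstructed around `bar` during a second `parse` call.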
I wrote code that invokes the html5 parser twice with the same HTML. The following output shows that tokenizer.js still has a bug:
DEBUG: 'tokenizer.content_model=' 0
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'markup_declaration_open_state'
DEBUG: 'tokenizer.state=' 'doctype_state'
DEBUG: 'tokenizer.state=' 'before_doctype_name_state'
DEBUG: 'tokenizer.state=' 'doctype_name_state'
DEBUG: 'tokenizer.token' { type: 'Doctype',
name: 'html',
publicId: null,
systemId: null,
correct: true }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'markup_declaration_open_state'
DEBUG: 'tokenizer.state=' 'comment_start_state'
DEBUG: 'tokenizer.state=' 'comment_state'
DEBUG: 'tokenizer.state=' 'comment_end_dash_state'
DEBUG: 'tokenizer.state=' 'comment_end_state'
DEBUG: 'tokenizer.token' { type: 'Comment', data: 'NewPage' }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'StartTag', name: 'html', data: [] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: ' ' }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: ' ' }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'markup_declaration_open_state'
DEBUG: 'tokenizer.state=' 'comment_start_state'
DEBUG: 'tokenizer.state=' 'comment_state'
DEBUG: 'tokenizer.state=' 'comment_end_dash_state'
DEBUG: 'tokenizer.state=' 'comment_end_state'
DEBUG: 'tokenizer.token' { type: 'Comment',
data: '[if ie 6]>\n\n<![endif]' }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'StartTag', name: 'body', data: [] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'StartTag', name: 'script', data: [] }
DEBUG: 'tokenizer.content_model=' 3
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'close_tag_open_state'
DEBUG: 'tokenizer.content_model=' 0
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'Characters',
data: 'var a = 1;\nvar b = 2;\n' }
DEBUG: 'tokenizer.token' { type: 'EndTag', name: 'script', data: [] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.state=' 'before_attribute_name_state'
DEBUG: 'tokenizer.state=' 'attribute_name_state'
DEBUG: 'tokenizer.state=' 'before_attribute_value_state'
DEBUG: 'tokenizer.state=' 'attribute_value_double_quoted_state'
DEBUG: 'tokenizer.state=' 'after_attribute_value_state'
DEBUG: 'tokenizer.state=' 'before_attribute_name_state'
DEBUG: 'tokenizer.state=' 'attribute_name_state'
DEBUG: 'tokenizer.state=' 'before_attribute_value_state'
DEBUG: 'tokenizer.state=' 'attribute_value_double_quoted_state'
DEBUG: 'tokenizer.state=' 'after_attribute_value_state'
DEBUG: 'tokenizer.token' { type: 'StartTag',
name: 'p',
data:
[ { nodeName: 'class', nodeValue: 'po' },
{ nodeName: 'id', nodeValue: 'p' } ] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: ' ' }
DEBUG: 'tokenizer.token' { type: 'Characters', data: 'element ' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'close_tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'EndTag', name: 'p', data: [] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.token' { type: 'EOF', data: 'End of File' }
DEBUG: 'tokenizer.content_model=' 0
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'markup_declaration_open_state'
DEBUG: 'tokenizer.state=' 'doctype_state'
DEBUG: 'tokenizer.state=' 'before_doctype_name_state'
DEBUG: 'tokenizer.state=' 'doctype_name_state'
DEBUG: 'tokenizer.token' { type: 'Doctype',
name: 'html',
publicId: null,
systemId: null,
correct: true }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'markup_declaration_open_state'
DEBUG: 'tokenizer.state=' 'comment_start_state'
DEBUG: 'tokenizer.state=' 'comment_state'
DEBUG: 'tokenizer.state=' 'comment_end_dash_state'
DEBUG: 'tokenizer.state=' 'comment_end_state'
DEBUG: 'tokenizer.token' { type: 'Comment', data: 'NewPage' }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'StartTag', name: 'html', data: [] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: ' ' }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: ' ' }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'markup_declaration_open_state'
DEBUG: 'tokenizer.state=' 'comment_start_state'
DEBUG: 'tokenizer.state=' 'comment_state'
DEBUG: 'tokenizer.state=' 'comment_end_dash_state'
DEBUG: 'tokenizer.state=' 'comment_end_state'
DEBUG: 'tokenizer.token' { type: 'Comment',
data: '[if ie 6]>\n\n<![endif]' }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'StartTag', name: 'body', data: [] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'StartTag', name: 'script', data: [] }
DEBUG: 'tokenizer.content_model=' 3
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.token' { type: 'Characters',
data: 'var a = 1;\nvar b = 2;\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'close_tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'Characters', data: undefined }
DEBUG: 'tokenizer.token' { type: 'EndTag', name: 'script', data: [] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.state=' 'before_attribute_name_state'
DEBUG: 'tokenizer.state=' 'attribute_name_state'
DEBUG: 'tokenizer.state=' 'before_attribute_value_state'
DEBUG: 'tokenizer.state=' 'attribute_value_double_quoted_state'
DEBUG: 'tokenizer.state=' 'after_attribute_value_state'
DEBUG: 'tokenizer.state=' 'before_attribute_name_state'
DEBUG: 'tokenizer.state=' 'attribute_name_state'
DEBUG: 'tokenizer.state=' 'before_attribute_value_state'
DEBUG: 'tokenizer.state=' 'attribute_value_double_quoted_state'
DEBUG: 'tokenizer.state=' 'after_attribute_value_state'
DEBUG: 'tokenizer.token' { type: 'StartTag',
name: 'p',
data:
[ { nodeName: 'class', nodeValue: 'po' },
{ nodeName: 'id', nodeValue: 'p' } ] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: ' ' }
DEBUG: 'tokenizer.token' { type: 'Characters', data: 'element ' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'close_tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'EndTag', name: 'p', data: [] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.token' { type: 'EOF', data: 'End of File' }
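Inside the script element, the second run takes a different path: the Characters token is emitted before the close tag is tokenized, and an extra Characters token with data: undefined follows.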
Using jsdom, I have two snippets that load the same huge html. One snippet, using html5 as jsdom's parser, takes 70 seconds to process. The other, using just jsdom's parser, takes 10 seconds.
https://gist.github.com/886348#file_html5parser.js : takes 70 seconds, uses html5
https://gist.github.com/886348#file_defaultparser.js : takes 10 seconds, does not use html5
https://gist.github.com/886348#file_get_html.sh : run this quick command to download the test html
Of course, the lag could be within jsdom itself, but I haven't written a case to test that yet. The gist was initially written for a different bug I filed with jsdom, but it works here too.
This is definitely an edge case, since the html is such a large file (it's probably one of the largest wikipedia documents), but I was testing worst-case scenarios, and 70 seconds is a bit much! :)
This issue was mentioned here, but I was told to post it here as well.
When trying to run any tests, it errors out with this:
/usr/local/lib/node/.npm/html5/0.2.5/package/lib/html5/tokenizer.js:62
throw(e);
^
Error
at Object.appendChild (/usr/local/lib/node/.npm/jsdom/0.1.22/package/lib/jsdom/level1/core.js:1312:13)
at Object.insert_html_element (/usr/local/lib/node/.npm/html5/0.2.5/package/lib/html5/parser/before_html_phase.js:41:21)
at Object.processStartTag (/usr/local/lib/node/.npm/html5/0.2.5/package/lib/html5/parser/before_html_phase.js:29:7)
at EventEmitter.do_token (/usr/local/lib/node/.npm/html5/0.2.5/package/lib/html5/parser.js:94:20)
at EventEmitter.<anonymous> (/usr/local/lib/node/.npm/html5/0.2.5/package/lib/html5/parser.js:112:30)
at EventEmitter.emit (events:31:17)
at EventEmitter.emitToken (/usr/local/lib/node/.npm/html5/0.2.5/package/lib/html5/tokenizer.js:84:7)
at EventEmitter.emit_current_token (/usr/local/lib/node/.npm/html5/0.2.5/package/lib/html5/tokenizer.js:813:7)
at EventEmitter.tag_name_state (/usr/local/lib/node/.npm/html5/0.2.5/package/lib/html5/tokenizer.js:358:8)
at EventEmitter.<anonymous> (/usr/local/lib/node/.npm/html5/0.2.5/package
WINDOWS1252 seems to have been renamed as ENTITIES_WINDOWS1252 in constants.js but tokenizer.js still uses the old variable!
Please rename HTML5.WINDOWS1252 to HTML5.ENTITIES_WINDOWS1252 in tokenizer.js. :)
Please refer to the related issue: #21
If I try to serialize out a document, html5 is outputting the doctype at the end, rather than the beginning.
My simple-ish test-case code is:
var jsdom = require('jsdom');
var HTML5 = require('html5');
var window = jsdom.jsdom(null, null, {parser: HTML5, features: { QuerySelector: true }}).createWindow();
var parser = new HTML5.Parser({document: window.document});
parser.parse("<!DOCTYPE html><html><head></head><body></body></html>");
console.log(HTML5.serialize(window.document))
What I see outputted to console is:
<html><head></head><body></body></html><!DOCTYPE html>
I'm not quite certain of the events that lead html5 to the following state, but running capybara's expectations on zombie.js I get this stack trace:
TypeError: Cannot read property 'tagName' of undefined
at EventEmitter.reset_insertion_mode (/usr/local/lib/node/.npm/html5/9999.0.0-LINK-3d23ff1b/package/lib/html5/parser.js:190:24)
at Object.endTagSelect (/usr/local/lib/node/.npm/html5/9999.0.0-LINK-3d23ff1b/package/lib/html5/parser/in_select_phase.js:81:15)
Poking around, I see that Parser.prototype.reset_insertion_mode expects a context parameter but p.endTagSelect does not supply one. I looked for similar patterns in your code and saw that p.prototype.endTagTable in end_table_phase.js is very similar to p.endTagSelect and it supplies this.tree.open_elements.last() to reset_insertion_mode.
I made that change and zombie plowed forward.
Here's the commit: boblail/html5@18c09b9.
I'll send a pull request.
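For readers who want to see the shape of the failure, here is a minimal sketch. The `parser` object below is a hypothetical miniature, not the library's code: only `reset_insertion_mode`, `endTagSelect` and `tree.open_elements` are names from the report, the `last` helper stands in for the `.last()` call mentioned above, and the phase names and the pop-then-reset ordering are assumptions.

```javascript
// Stand-in for the .last() array helper referenced in the report.
function last(arr) {
  return arr[arr.length - 1];
}

// Hypothetical miniature of the parser state.
var parser = {
  tree: {
    open_elements: [{ tagName: 'HTML' }, { tagName: 'BODY' }, { tagName: 'SELECT' }]
  },
  phase: null,

  reset_insertion_mode: function (context) {
    // Reading context.tagName throws a TypeError when context is undefined,
    // which matches the "Cannot read property 'tagName' of undefined" trace.
    this.phase = context.tagName.toLowerCase() === 'select' ? 'inSelect' : 'inBody';
  },

  endTagSelect: function () {
    this.tree.open_elements.pop(); // close the <select>
    // The fix: hand the current innermost open element to reset_insertion_mode.
    this.reset_insertion_mode(last(this.tree.open_elements));
  }
};

parser.endTagSelect();
console.log(parser.phase); // prints "inBody"
```

Calling `this.reset_insertion_mode()` with no argument in the same sketch reproduces the TypeError.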
Trying to run it now gives an error (with node 0.6.x):
$ node jquery-example.js
node.js:201
throw e; // process.nextTick error, or 'error' event on first tick
^
Error: require.paths is removed. Use node_modules folders, or the NODE_PATH environment variable instead.
at Function. (module.js:376:11)
at Object. (html5/doc/jquery-example.js:1:70)
at Module._compile (module.js:432:26)
at Object..js (module.js:450:10)
at Module.load (module.js:351:31)
at Function._load (module.js:310:12)
at Array.0 (module.js:470:10)
at EventEmitter._tickCallback (node.js:192:40)
$ git clone http://dinhe.net/~aredridel/projects/js/html5.git/
Cloning into 'html5'...
fatal: http://dinhe.net/~aredridel/projects/js/html5.git/info/refs not found: did you run git update-server-info on the server?
I ran into this issue when playing with Zombie.js on a Wicket application. The following gist shows a stripped-down test case (using Zombie.js):
https://gist.github.com/775898
The HTML generated by the server in this gist has the same form as the HTML Wicket generates when JavaScript is added by back-end code.
It breaks the npm install of this and jsdom.
Installing with npm fails, though installing @0.2.13 succeeds
$ npm install html5
npm info it worked if it ends with ok
npm info using [email protected]
npm info using [email protected]
npm ERR! gzip "--decompress" "--stdout" "/cygdrive/c/Users/marty.nelson/Document
s/Marty/.nvm/v0.4.2/lib/node/.npm/.cache/html5/0.2.14/package.tgz"
npm ERR! gzip "--decompress" "--stdout" "/cygdrive/c/Users/marty.nelson/Document
s/Marty/.nvm/v0.4.2/lib/node/.npm/.cache/html5/0.2.14/package.tgz" gzip: /cygdri
ve/c/Users/marty.nelson/Documents/Marty/.nvm/v0.4.2/lib/node/.npm/.cache/html5/0
.2.14/package.tgz: unexpected end of file
npm ERR! gzip "--decompress" "--stdout" "/cygdrive/c/Users/marty.nelson/Document
s/Marty/.nvm/v0.4.2/lib/node/.npm/.cache/html5/0.2.14/package.tgz"
npm ERR! Failed unpacking /cygdrive/c/Users/marty.nelson/Documents/Marty/.nvm/v0
.4.2/lib/node/.npm/.cache/html5/0.2.14/package.tgz to /cygdrive/c/Users/marty.ne
lson/Documents/Marty/.nvm/v0.4.2/lib/node/.npm/html5/0.2.14
npm ERR! install failed Error: Failed gzip "--decompress" "--stdout" "/cygdrive/
c/Users/marty.nelson/Documents/Marty/.nvm/v0.4.2/lib/node/.npm/.cache/html5/0.2.
14/package.tgz"
npm ERR! install failed exited with 1
npm ERR! install failed at ChildProcess.<anonymous> (/cygdrive/c/Users/marty
.nelson/Documents/Marty/.nvm/v0.4.2/lib/node/.npm/npm/0.3.15/package/lib/utils/e
xec.js:77:19)
npm ERR! install failed at ChildProcess.emit (events.js:45:17)
npm ERR! install failed at Socket.<anonymous> (child_process.js:151:12)
npm ERR! install failed at Socket.emit (events.js:42:17)
npm ERR! install failed at Array.0 (net.js:800:12)
npm ERR! install failed at EventEmitter._tickCallback (node.js:108:26)
npm info install failed rollback
npm info uninstall [ '[email protected]' ]
npm info preuninstall [email protected]
npm info uninstall [email protected]
npm info postuninstall [email protected]
npm info uninstall [email protected] complete
npm info install failed rolled back
npm ERR! Error: Failed gzip "--decompress" "--stdout" "/cygdrive/c/Users/marty.n
elson/Documents/Marty/.nvm/v0.4.2/lib/node/.npm/.cache/html5/0.2.14/package.tgz"
npm ERR! exited with 1
npm ERR! at ChildProcess.<anonymous> (/cygdrive/c/Users/marty.nelson/Documen
ts/Marty/.nvm/v0.4.2/lib/node/.npm/npm/0.3.15/package/lib/utils/exec.js:77:19)
npm ERR! at ChildProcess.emit (events.js:45:17)
npm ERR! at Socket.<anonymous> (child_process.js:151:12)
npm ERR! at Socket.emit (events.js:42:17)
npm ERR! at Array.0 (net.js:800:12)
npm ERR! at EventEmitter._tickCallback (node.js:108:26)
npm ERR! Report this *entire* log at <http://github.com/isaacs/npm/issues>
npm ERR! or email it to <[email protected]>
npm ERR! Just tweeting a tiny part of the error will not be helpful.
npm ERR! System CYGWIN_NT-6.1-WOW64 1.7.7(0.230/5/3)
npm ERR! argv { remain:
npm ERR! argv [ 'html5',
npm ERR! argv 'jsdom@>= 0.1.3',
npm ERR! argv 'nodeunit@>= 0.5.0' ],
npm ERR! argv cooked: [ 'install', 'html5' ],
npm ERR! argv original: [ 'install', 'html5' ] }
npm not ok
but this is ok:
$ npm install [email protected]
npm info it worked if it ends with ok
npm info using [email protected]
npm info using [email protected]
npm info preinstall [email protected]
npm info install [email protected]
npm info postinstall [email protected]
npm info predeactivate [email protected]
npm info deactivate [email protected]
npm info postdeactivate [email protected]
npm info preactivate [email protected]
npm info activate [email protected]
npm info postactivate [email protected]
npm info build Success: [email protected]
npm ok
any ideas?
The same error as #24, traceback as below:
/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:62
throw(e);
^
TypeError: Object [ undefined ] has no method 'getAttribute'
at Object.startTagHtml (/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/parser/phase.js:66:35)
at Object.processStartTag (/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/parser/phase.js:41:38)
at Object.processStartTag (/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/parser/before_html_phase.js:31:20)
at EventEmitter.do_token (/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/parser.js:99:20)
at EventEmitter.<anonymous> (/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/parser.js:120:30)
at EventEmitter.emit (events.js:67:17)
at EventEmitter.emitToken (/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:88:8)
at EventEmitter.emit_current_token (/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:833:7)
at EventEmitter.after_attribute_value_state (/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:573:8)
at EventEmitter.<anonymous> (/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:59:25)
The first 3 lines of the HTML content are:
<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
Hope this could help to address the issue.
I use DataTables (datatables.net) in an application and wanted to test it using zombie.js.
When parsing with HTML5 I got the error message "TypeError: Cannot read property 'tagName' of undefined" and finally tracked it down to the DataTables initialization code. I tried some examples on the datatables site and all had similar errors.
To reproduce:
$ node
> var zombie = require("zombie");
> zombie.visit("http://datatables.net/examples/basic_init/zero_config.html");
>
/usr/lib/node/.npm/html5/0.2.12/package/lib/html5/tokenizer.js:62
throw(e);
^
TypeError: Cannot read property '_ids' of null
at Object. (/usr/lib/node/.npm/jsdom/0.1.23/package/lib/jsdom/level1/core.js:570:33)
at Object.appendChild (/usr/lib/node/.npm/jsdom/0.1.23/package/lib/jsdom/level2/events.js:317:20)
at TreeBuilder.insert_element_normal (/usr/lib/node/.npm/html5/0.2.12/package/lib/html5/treebuilder.js:62:28)
at TreeBuilder.insert_element (/usr/lib/node/.npm/html5/0.2.12/package/lib/html5/treebuilder.js:52:15)
at Object.startTagBody (/usr/lib/node/.npm/html5/0.2.12/package/lib/html5/parser/after_head_phase.js:46:12)
at Object.processStartTag (/usr/lib/node/.npm/html5/0.2.12/package/lib/html5/parser/phase.js:41:38)
at EventEmitter.do_token (/usr/lib/node/.npm/html5/0.2.12/package/lib/html5/parser.js:94:20)
at EventEmitter. (/usr/lib/node/.npm/html5/0.2.12/package/lib/html5/parser.js:112:30)
at EventEmitter.emit (events:31:17)
at EventEmitter.emitToken (/usr/lib/node/.npm/html5/0.2.12/package/lib/html5/tokenizer.js:84:7)
Could you tell me how to debug further to find the real cause of the error? Thanks and keep up the good work!