html5's Issues

Make the parser/tokenizer a writable stream?

It's a nice touch that you can hand a readable stream to parser.parse, but it would be even more flexible if the parser could act as a writable stream, so that the stream piping logic could reside in the calling code.

Seems like this is an "emerging pattern" :)
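
One way the request could be approximated from calling code is a small adapter around Node's stream.Writable. This is only a sketch: ParserWritable and parseFn are hypothetical names, not part of html5, and the adapter buffers the whole input before parsing rather than tokenizing incrementally.

```javascript
const { Writable } = require('stream');

// Hypothetical adapter: collects everything written to it and hands the
// complete markup to `parseFn` once the stream ends.
class ParserWritable extends Writable {
  constructor(parseFn) {
    super();
    this.parseFn = parseFn; // e.g. (html) => parser.parse(html)
    this.chunks = [];
  }

  _write(chunk, encoding, callback) {
    this.chunks.push(chunk); // with default options, chunk is a Buffer
    callback();
  }

  _final(callback) {
    try {
      // Decode once at the end so multi-byte characters split across
      // chunk boundaries are handled correctly.
      this.parseFn(Buffer.concat(this.chunks).toString('utf8'));
      callback();
    } catch (err) {
      callback(err);
    }
  }
}
```

With something like this, the caller owns the piping, for example readable.pipe(new ParserWritable(function (html) { parser.parse(html); })).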

Error: Cannot find module 'jsdom/level2/core'

When I attempt to require zombie in a simple script, I get the error below. I've included a listing of the current installed npm packages. Any idea what is causing this?

Error: Cannot find module 'jsdom/level2/core'
at resolveModuleFilename (node.js:280:15)
at loadModule (node.js:242:22)
at require (node.js:306:16)
at EventEmitter.HTML5Parser (/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/parser.js:31:12)
at [object Object].appendHtmlToElement (/usr/local/lib/node/.npm/jsdom/0.1.20/package/lib/jsdom/browser/htmltodom.js:86:15)
at Object.innerHTML (/usr/local/lib/node/.npm/jsdom/0.1.20/package/lib/jsdom/browser/index.js:341:27)
at Object.jsdom (/usr/local/lib/node/.npm/jsdom/0.1.20/package/lib/jsdom.js:25:17)
at History. (/usr/local/lib/node/.npm/zombie/0.8.8/package/lib/zombie/history.js:63:24)
at /usr/local/lib/node/.npm/zombie/0.8.8/package/lib/zombie/history.js:2:61
at History. (/usr/local/lib/node/.npm/zombie/0.8.8/package/lib/zombie/history.js:30:16)
1 awt@DEV 2 ~/projects/md2> npm ls installed
npm info it worked if it ends with ok
npm info using [email protected]
npm info using [email protected]
[email protected] A C++ module for node-js that does base64 encoding and decoding. =pkrumins active installed latest remote base conversion base64 base64 encode base64 d
[email protected] Unfancy JavaScript =jashkenas active installed latest remote stable javascript language coffeescript compiler
[email protected] Markup as CoffeeScript. =mauricemach active installed latest remote template view coffeescript
[email protected] High performance middleware framework =creationix =tjholowaychuk active installed remote
[email protected] CSS Object Model implementation and CSS parser =nv active installed latest remote CSS CSSOM parser styleSheet
[email protected] Sinatra inspired web development framework =tjholowaychuk active installed latest remote framework sinatra web rest restful
[email protected] HTML5 HTML parser, including support for SVG and MathML foreign content =aredridel installed remote
[email protected] HTML5 HTML parser, including support for SVG and MathML foreign content =aredridel active installed latest remote
[email protected] Forgiving HTML/XML/RSS Parser in JS for both Node and Browsers =tautologistics active installed latest remote
[email protected] Jade template engine =tjholowaychuk active installed latest remote
[email protected] jQuery: The Write Less, Do More, JavaScript Library =coolaj86 active installed latest remote util dom jquery
[email protected] CommonJS implementation of the DOM intended to be platform independent and as minimal/light as possible while completely adhering to the w3c DOM specifica
[email protected] CommonJS implementation of the DOM intended to be platform independent and as minimal/light as possible while completely adhering to the w3c DOM specifica
[email protected] A super simple utility library for dealing with mime-types =broofa active installed latest remote util mime
[email protected] Command line mjsunit runner which provides an easy way to hook into mjsunit and start running tests immediately.. =tmpvar active installed latest remot
[email protected] Easy unit testing for node.js and the browser. =caolan active installed latest remote
[email protected] A package manager for node =isaacs active installed remote package manager modules install package.json
[email protected] Simplified HTTP request method. =mikeal active installed latest remote
[email protected] Syntactically Awesome Stylesheets (compiles to css) =tjholowaychuk active installed latest remote sass template css view
[email protected] The cross-browser WebSocket =rauchg =Tim-Smart active installed remote
[email protected] Web development, cut-the-crap style. =mauricemach active installed latest remote framework websockets coffeescript
[email protected] Insanely fast, full-stack, headless browser testing using Node.js =assaf active installed latest remote test spec headless full-stack
npm ok

Update to latest html5lib test suite

In #whatwg, jgraham mentioned it looks like you're using a very out-of-date version of the html5lib test suite. Updating would bring in many more testcases.

Patch: Failed comparison of activeFormattingElements entry with Marker

On node 0.6.3 and html5 v0.3.5, parsing multiple page fragments with the same parser sometimes causes failures in the treebuilder, which are caused by a pointer-unequal marker element in reconstructActiveFormattingElements and elementInActiveFormattingElements. The problem is mainly triggered if the parsed fragments contain tables.

The following patch converts the pointer equality based check into a node type check, which fixes the problem for me:

--- treebuilder.js      2011-11-28 21:30:17.675749830 +0100
+++ /usr/lib/node_modules/html5/lib/html5/treebuilder.js        2011-11-28 20:21:38.067593170 +0100
@@ -224,9 +224,9 @@
        // Step 2 and 3: start with the last element
        var i = this.activeFormattingElements.length - 1;
        var entry = this.activeFormattingElements[i];
-       if(entry == HTML5.Marker || this.open_elements.indexOf(entry) != -1) return;
+       if(entry.type == HTML5.Marker.type || this.open_elements.indexOf(entry) != -1) return;

-       while(entry != HTML5.Marker && this.open_elements.indexOf(entry) == -1) {
+       while(entry.type != HTML5.Marker.type && this.open_elements.indexOf(entry) == -1) {
                i -= 1;
                entry = this.activeFormattingElements[i];
                if(!entry) break;
@@ -248,7 +248,7 @@
 b.prototype.elementInActiveFormattingElements = function(name) {
        var els = this.activeFormattingElements;
        for(var i = els.length - 1; i >= 0; i--) {
-               if(els[i] == HTML5.Marker) break;
+               if(els[i].type == HTML5.Marker.type) break;
                if(els[i].tagName.toLowerCase() == name) return els[i];
        }
        return false;
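
To illustrate why the type-based check behaves differently from the pointer comparison, here is a minimal sketch. It assumes (this is a guess at the cause, not confirmed by the report) that two separate marker objects with the same type end up in play; markerA, markerB, byIdentity and byType are made-up names.

```javascript
// Two structurally identical marker objects, as might exist if the marker
// constant is created more than once (an assumption about the cause).
const markerA = { type: 'Marker' };
const markerB = { type: 'Marker' };

// Pointer comparison: only true for the very same object.
const byIdentity = (entry, marker) => entry === marker;

// Type comparison, as in the patch, with a guard for missing entries.
const byType = (entry, marker) => Boolean(entry) && entry.type === marker.type;

console.log(byIdentity(markerB, markerA)); // false
console.log(byType(markerB, markerA));     // true
```

The guard on entry is an addition of this sketch; the patch itself reads entry.type directly.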

Parser error

Using the Zombie module, I get this error when trying to visit a Facebook group page. It happens with only a few of them.

Zombie: GET http://www.facebook.com/group.php?gid=104369172929110&_fb_noscript=1 => 200

/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/tokenizer.js:62
                throw(e);
    ^
TypeError: Cannot call method 'toLowerCase' of undefined
    at Object.endTagFormatting (/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/parser/in_body_phase.js:646:85)
    at Object.processEndTag (/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/parser/phase.js:50:36)
    at EventEmitter.do_token (/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/parser.js:97:20)
    at EventEmitter.<anonymous> (/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/parser.js:112:30)
    at EventEmitter.emit (events:27:15)
    at EventEmitter.emitToken (/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/tokenizer.js:84:7)
    at EventEmitter.emit_current_token (/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/tokenizer.js:813:7)
    at EventEmitter.tag_name_state (/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/tokenizer.js:358:8)
    at EventEmitter.<anonymous> (/usr/local/lib/node/.npm/html5/0.2.12/package/lib/html5/tokenizer.js:59:25)
    at EventEmitter.emit (events:27:15)

&eacute; getting extra semicolon

Testcase:

var html5 = require('html5');
var parser = new html5.Parser();
var html = "Lasse Kron&eacute;r";
parser.parse(html);
console.log(parser.document.innerHTML)
// Actual: <html><head></head><body>Lasse Kroné;r</body></html>
// Expected: <html><head></head><body>Lasse Kronér</body></html>
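
For comparison, the expected behaviour (the terminating semicolon is consumed as part of the character reference) can be sketched as below. This is not the html5 tokenizer's code; SAMPLE_ENTITIES and decodeEntities are hypothetical names, the table holds only a few entries, and references without a trailing semicolon are deliberately left alone.

```javascript
// Tiny illustrative entity table; a real parser uses the full HTML list.
const SAMPLE_ENTITIES = {
  eacute: '\u00e9',
  ndash: '\u2013',
  thinsp: '\u2009',
  amp: '&'
};

// Replace "&name;" with its character, consuming the semicolon as well.
function decodeEntities(text) {
  return text.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(SAMPLE_ENTITIES, name)
      ? SAMPLE_ENTITIES[name]
      : match // unknown names are left untouched
  );
}

console.log(decodeEntities('Lasse Kron&eacute;r')); // Lasse Kronér
```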

HTML5 not defined

Attempting to parse a quirky HTML document, I get the following:

/usr/local/lib/node/.npm/html5/0.2.10/package/lib/html5/tokenizer.js:62
                                throw(e);
    ^
ReferenceError: HTML5 is not defined
    at Object.insert_html_element (/usr/local/lib/node/.npm/html5/0.2.10/package/lib/html5/parser/before_html_phase.js:42:4)
    at Object.processStartTag (/usr/local/lib/node/.npm/html5/0.2.10/package/lib/html5/parser/before_html_phase.js:29:7)
    at EventEmitter.do_token (/usr/local/lib/node/.npm/html5/0.2.10/package/lib/html5/parser.js:94:20)
    at EventEmitter.<anonymous> (/usr/local/lib/node/.npm/html5/0.2.10/package/lib/html5/parser.js:112:30)
    at EventEmitter.emit (events:31:17)
    at EventEmitter.emitToken (/usr/local/lib/node/.npm/html5/0.2.10/package/lib/html5/tokenizer.js:84:7)
    at EventEmitter.emit_current_token (/usr/local/lib/node/.npm/html5/0.2.10/package/lib/html5/tokenizer.js:813:7)
    at EventEmitter.tag_name_state (/usr/local/lib/node/.npm/html5/0.2.10/package/lib/html5/tokenizer.js:358:8)
    at EventEmitter.<anonymous> (/usr/local/lib/node/.npm/html5/0.2.10/package/lib/html5/tokenizer.js:59:25)
    at EventEmitter.emit (events:31:17)

Looks like there's a debug statement there using a library not require()'d in that module.

stack overflow running tests in node 0.10

I tried running tree-construction-test.js in node 0.10, and test case data/tree-construction/tests25.dat-19 failed first with a few hundred repeats of

(node) warning: Recursive process.nextTick detected. This will break in the next version of node. Please use setImmediate for recursive deferral.

and then finally

RangeError: Maximum call stack size exceeded
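
The warning itself points at setImmediate. A minimal sketch of that kind of deferral, independent of the html5 code base (processAll and its parameters are hypothetical), looks like this:

```javascript
// Process items one at a time, yielding to the event loop between items.
// Each step is scheduled with setImmediate, so the call stack does not grow
// and the recursive process.nextTick warning is not triggered.
function processAll(items, handle, done) {
  let index = 0;

  function step() {
    if (index >= items.length) {
      done();
      return;
    }
    handle(items[index++]);
    setImmediate(step);
  }

  step();
}
```

Whether the test runner or the parser is the right place for such a change is not clear from the report.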

can't run tests

$ git clone https://github.com/aredridel/html5.git && cd html5 && git submodule update --init && npm test
Cloning into 'html5'...
remote: Counting objects: 3493, done.
remote: Compressing objects: 100% (1139/1139), done.
remote: Total 3493 (delta 2353), reused 3440 (delta 2305)
Receiving objects: 100% (3493/3493), 1.21 MiB | 705 KiB/s, done.
Resolving deltas: 100% (2353/2353), done.
Submodule 'tools/ronnjs' (http://github.com/kapouer/ronnjs.git) registered for path 'tools/ronnjs'
Cloning into 'tools/ronnjs'...
remote: Counting objects: 357, done.
remote: Compressing objects: 100% (200/200), done.
remote: Total 357 (delta 159), reused 326 (delta 134)
Receiving objects: 100% (357/357), 195.56 KiB, done.
Resolving deltas: 100% (159/159), done.
Submodule path 'tools/ronnjs': checked out 'f4f3ce7ef546dbc1651f9970eb28ab209afd58e7'

> [email protected] test /home/james/src/html5
> tap test/functional

sh: tap: command not found
npm ERR! [email protected] test: `tap test/functional`
npm ERR! `sh "-c" "tap test/functional"` failed with 127
npm ERR! 
npm ERR! Failed at the [email protected] test script.
npm ERR! This is most likely a problem with the html5 package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR!     tap test/functional
npm ERR! You can get their info via:
npm ERR!     npm owner ls html5
npm ERR! There is likely additional logging output above.
npm ERR! 
npm ERR! System Linux 3.1.4-1-ARCH
npm ERR! command "node" "/usr/bin/npm" "test"
npm ERR! cwd /home/james/src/html5
npm ERR! node -v v0.6.5
npm ERR! npm -v 1.1.0-alpha-6
npm ERR! code ELIFECYCLE
npm ERR! message [email protected] test: `tap test/functional`
npm ERR! message `sh "-c" "tap test/functional"` failed with 127
npm ERR! 
npm ERR! Additional logging details can be found in:
npm ERR!     /home/james/src/html5/npm-debug.log
npm not ok

there is no deps/jquery folder; there is no rakefile

Parser Error - wrong token

I found a problem with the following code:

var HTML5 = require('html5')
var p = new HTML5.Parser();
HTML5.enableDebug('tokenizer.token')
var str = '<!DOCTYPE html>\n' +
'<html> \n'+
'<body>\n'+
'<script>\n' +
'var a = 1;\n' +
'</script>\n' +
'<a href="#">top</a>\n'
p.parse(str);

The output is:

DEBUG: 'tokenizer.token' { type: 'Doctype',
name: 'html',
publicId: null,
systemId: null,
correct: true }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.token' { type: 'StartTag', name: 'html', data: [] }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: ' ' }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: ' ' }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.token' { type: 'StartTag', name: 'body', data: [] }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.token' { type: 'Characters', data: undefined }
DEBUG: 'tokenizer.token' { type: 'StartTag', name: 'script', data: [] }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.token' { type: 'Characters', data: 'var a = 1;\n' }
DEBUG: 'tokenizer.token' { type: 'Characters', data: 'var a = 1;\n' }
DEBUG: 'tokenizer.token' { type: 'EndTag', name: 'script', data: [] }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.token' { type: 'StartTag',
name: 'a',
data: [ { nodeName: 'href', nodeValue: '#' } ] }
DEBUG: 'tokenizer.token' { type: 'Characters', data: 'top' }
DEBUG: 'tokenizer.token' { type: 'EndTag', name: 'a', data: [] }
DEBUG: 'tokenizer.token' { type: 'EOF', data: 'End of File' }

The tokenizer emits { type: 'Characters', data: undefined } after { type: 'StartTag', name: 'body', data: [] }, which is wrong.

Parsing of &thinsp; entities (and others)?

var html5 = require('html5');
var parser = new html5.Parser();
parser.parse("<p>&ndash;&thinsp;Om inget görs åt utsläppen...</p>");
console.log(parser.document.innerHTML)
// Expected: <html><head></head><body><p>– Om inget görs åt utsläppen...</p></body></html>
// Actual:   <html><head></head><body><p>– Om inget görs åt utsläppen...</p></body></html>

Error: Cannot find module 'jsdom/level2/core'

node version: v0.2.5
jsdom version: 0.1.20

This happens after attempting to run the example:

node.js:63
    throw e;
    ^
Error: Cannot find module 'jsdom/level2/core'
    at loadModule (node.js:275:15)
    at require (node.js:411:14)
    at EventEmitter.HTML5Parser (/home/ubuntu/nvm/v0.2.5/lib/node/.npm/html5/0.2.5/package/lib/html5/parser.js:31:12)
    at Object.<anonymous> (/home/ubuntu/html5.js:11:14)
    at Module._compile (node.js:462:23)
    at Module._loadScriptSync (node.js:469:10)
    at Module.loadSync (node.js:338:12)
    at Object.runMain (node.js:522:24)
    at Array.<anonymous> (node.js:756:12)
    at EventEmitter._tickCallback (node.js:55:22)

another parse error...

    var util = require('util'),
    args = require('argsparser').parse(),
    zombie = require('zombie');

var options = {
    debug : true,
    runScripts : true,
    userAgent : 'Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.98 Safari/534.13'
};
var url = 'http://www.despegar.com.ar/search/flights/RoundTrip/bue/mad/2011-04-27/2011-5-14/1/0/0';

zombie.visit(url, options, function(err, browser, status) {
    if (err) throw err;
});

throws...

Zombie: GET http://www.despegar.com.ar/search/flights/RoundTrip/bue/mad/2011-04-27/2011-5-14/1/0/0
Zombie: GET http://www.despegar.com.ar/search/flights/RoundTrip/bue/mad/2011-04-27/2011-5-14/1/0/0 => 200
Zombie: GET http://www.despegar.com.ar/Search/js/jquery-1.2.6.min.js
Zombie: GET http://ar.staticontent.com/js-versioned/2.13.4/FrameworkJS/base.js
Zombie: GET http://ar.staticontent.com/js-versioned/2.13.4/FrameworkJS/pkg/FlightsCommonResults.js
Zombie: GET http://ar.staticontent.com/js-versioned/2.13.4/FrameworkJS/pkg/omniture.js
Zombie: GET http://www.despegar.com.ar/search/flights/RoundTrip/bue/mad/2011-04-27/2011-5-14/1/0/0

/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:62
                throw(e);
    ^
Error: undefined: attribute name: "
    at Object.createAttribute (/usr/local/lib/node/.npm/jsdom/0.1.23/package/lib/jsdom/level1/core.js:1239:13)
    at Object.setAttribute (/usr/local/lib/node/.npm/jsdom/0.1.23/package/lib/jsdom/level1/core.js:937:37)
    at TreeBuilder.copyAttributeToElement (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/treebuilder.js:20:11)
    at TreeBuilder.createElement (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/treebuilder.js:40:10)
    at TreeBuilder.insert_element_normal (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/treebuilder.js:61:21)
    at TreeBuilder.insert_element (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/treebuilder.js:52:15)
    at Object.addFormattingElement (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser/in_body_phase.js:719:12)
    at Object.startTagFormatting (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser/in_body_phase.js:330:7)
    at Object.processStartTag (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser/phase.js:41:38)
    at Object.startTagOther (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser/in_cell_phase.js:59:37)

Tokenizer splits a script block if it contains tags, even if they are inside literals.

I'm using the Tokenizer to chop up HTML snippets, and it is working like a dream. Unfortunately, it splits up a script block if the block contains tags, even when they are inside string literals. I don't know if this is by design or by accident, but I would expect it to emit the whole script block as one Characters token.

Here is an example of the problem:

<body>
  <script>
    var x = "<a href='myPage.htm'>myLink<\/a>"
  </script>
</body>

When tokenized, this snippet will result in something in the (simplified) neighbourhood of:

  1. StartTag: body
  2. StartTag: script
  3. Characters: 'var x = "'
  4. StartTag: a
  5. Characters: 'myLink</a>'
  6. EndTag: script
  7. EndTag: body

I would have expected items 3-5 to be combined into one Characters token. The closing a-tag also results in a ParseError, since it is escaped and thus does not get recognized. So my code ends up treating this snippet as malformed because the closing a-tag is missing.

Is this a bug, or is it by design because it should be handled elsewhere (in the Parser?)? I'm not using the parser because of speed and a desire to keep everything streaming nicely, without having to wait for the whole tree to be built and managed.

My guess is that the same would apply to oddly named selectors in a style tag, but I haven't tried that out yet.
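
The expected behaviour can be sketched as a raw-text scan: once a script start tag has been emitted, everything up to the next closing script tag is treated as character data, whatever it contains. This is a simplified illustration, not the html5 tokenizer's code; readScriptText and END_SCRIPT are hypothetical names, and the escaped states that the HTML specification defines for comment-like content inside scripts are ignored.

```javascript
// "</script" followed by whitespace, "/" or ">" ends the script text.
const END_SCRIPT = /<\/script[\t\n\f\r \/>]/i;

// Returns one Characters token covering the script body that starts at
// `start`, plus the offset at which the end tag begins.
function readScriptText(input, start) {
  const rest = input.slice(start);
  const match = END_SCRIPT.exec(rest);
  const length = match ? match.index : rest.length;
  return {
    type: 'Characters',
    data: rest.slice(0, length),
    end: start + length
  };
}

const sample = '<script>var x = "<a href=\'myPage.htm\'>myLink<\\/a>"</script>';
const result = readScriptText(sample, '<script>'.length);
console.log(result.data); // var x = "<a href='myPage.htm'>myLink<\/a>"
```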

Integrate jsdom feature flags

This is not an issue for your project. I just wanted to pass along a comment I posted on the jsdom GitHub site:
...
I reported on this issue, see #290. Inspired by this post (#328), I also tried the html5 parser. Code snippet:

                request(url, function (error, response, body) {
                    if (error) {
                        onError(message, error, socket);
                    } else {
                        var window = jsdom.jsdom(null, null, { parser : html5 }).createWindow();
                        var parser = new html5.Parser({ document: window.document });
                        parser.parse(body);
                        jsdom.jQueryify(window, jQueryLib, function(window, jquery) {
                            var $ = window.$;
                            // Do work...
                        });
                    }
                });

This pattern can be found on the html5 GitHub site. While I was able to pass in 'features' options (as detailed above), the jsdom.jQueryify function appears to overwrite the feature settings. See jsdom.js (lines 122-4):

122 window.document.implementation.addFeature('FetchExternalResources', ['script']);
123 window.document.implementation.addFeature('ProcessExternalResources', ['script']);
124 window.document.implementation.addFeature('MutationEvents', ["1.0"]);

Naturally, I wanted to set these to false to avoid script processing. My only option was to edit the code, setting all 3 features to false. It would be nice if the global defaultDocumentFeatures function worked as expected or if the jQueryify function signature accepted options/features.
...

So using the html5 parser allows me to circumvent the jsdom 'hierarchy request' error. But the integration between html5 and jsdom does not allow me to disable the target features, unless I'm missing something.

Script tags need to be added all at once

Hello,
I tried to run the unit tests on master and v0.2.9, and I get a lot of errors.

on master:
FAILURES: 464/2502 assertions failed (4127ms)

Do you get the same result, or is my setup wrong?

thank you
Jerome

script tag case sensitive (only lowercase recognized)

While working with zombie, I found that only lowercase <script> tags appear to be parsed. Uppercase <SCRIPT> tags seem to be ignored or not handled as script tags, so JavaScript redirects are not followed as they would be in a regular browser. The problem exists in version 0.3.8.
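
HTML tag names are matched case-insensitively, so a comparison against 'script' is normally done after lowercasing the name. A minimal sketch of that normalization (asciiLower and isScriptTag are hypothetical helpers, not html5 functions):

```javascript
// Lowercase only ASCII letters, which is how HTML tag names are normalized.
function asciiLower(name) {
  return name.replace(/[A-Z]/g, (ch) => ch.toLowerCase());
}

// True for "script", "SCRIPT", "Script", and so on.
function isScriptTag(tagName) {
  return asciiLower(tagName) === 'script';
}

console.log(isScriptTag('SCRIPT')); // true
console.log(isScriptTag('style'));  // false
```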

"undefined" has no method "getAttribute"

var zombie = require("zombie")
var assert = require("assert")

zombie.visit("http://www.tripadvisor.com", function(err, browser, status) {                                                
    assert.equal(status, 200)
})

produces

/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:62
                                throw(e);
    ^
TypeError: Object [ undefined ] has no method 'getAttribute'
    at Object.startTagHtml (/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser/phase.js:67:35)
    at Object.processStartTag (/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser/phase.js:41:38)
    at Object.processStartTag (/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser/before_html_phase.js:31:20)
    at EventEmitter.do_token (/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser.js:94:20)
    at EventEmitter.<anonymous> (/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser.js:112:30)
    at EventEmitter.emit (events.js:64:17)
    at EventEmitter.emitToken (/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:84:7)
    at EventEmitter.emit_current_token (/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:814:7)
    at EventEmitter.after_attribute_value_state (/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:558:8)
    at EventEmitter.<anonymous> (/home/dbaumgold/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:59:25)

Entities come out double-encoded

var html5 = require('html5');
var parser = new html5.Parser();
parser.parse("

Just nu ligger årets medeltemperatur nästan 0,6 grader.

");
console.log(parser.document.innerHTML)
//Expected output: Just nu ligger årets medeltemperatur nästan 0,6 grader.

Need to delay adding certain elements to the tree

I want to use HTML5 for parsing pages, as part of Zombie.js (http://zombie.labnotes.org), but ran into an issue with the way elements are added to the tree.

When an element gets added to the tree, JSDOM fires a DOMNodeInserted event. It also listens to this event, and when it's fired on a SCRIPT element, loads the script (external) or evaluates it (internal). That's consistent with the way browsers evaluate scripts immediately after they're added to the document.

However, when the DOMNodeInserted event is fired, the SCRIPT element has no contents. I'm guessing the element is added to the tree first, then any child nodes are added to it. I couldn't figure out where this takes place, or how to change it (If I could, I would and send a pull request).

Can you help with that?
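
The ordering problem can be reproduced without JSDOM. The sketch below uses a made-up miniature tree in which onInserted stands in for the DOMNodeInserted listener; every name in it is hypothetical. It only contrasts attaching a script element before its text is known with attaching it afterwards.

```javascript
// Minimal stand-in for a document; `onInserted` plays the role of the
// DOMNodeInserted listener described above.
function createDocument(onInserted) {
  return { name: '#document', children: [], inDocument: true, onInserted };
}

function createElement(name) {
  return { name, children: [], text: '', inDocument: false };
}

function append(doc, parent, child) {
  parent.children.push(child);
  if (parent.inDocument) {
    child.inDocument = true;
    doc.onInserted(child); // fires as soon as the node joins the tree
  }
}

// Eager: attach the script element first, fill in its text afterwards.
// The listener sees an empty script.
function insertScriptEagerly(doc, source) {
  const script = createElement('script');
  append(doc, doc, script);
  script.text = source;
}

// Deferred: collect the text first, attach once the end tag is reached.
// The listener sees the complete script.
function insertScriptDeferred(doc, source) {
  const script = createElement('script');
  script.text = source;
  append(doc, doc, script);
}
```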

devDependencies vs. dependencies?

Are jsdom, bench, tap, and opts really dependencies? Or are they devDependencies?

This came up because we're considering using this as the default jsdom parser, but then there would be a circular dependency between jsdom -> html5 -> jsdom -> html5 -> ...

document.write cause document to clear

Hi Guys,

document.write called while the document is loading should write the content inline instead of replacing the body.

Example
MYCONTENT
< script >
document.write('WOOT');
< /script >

Expected Output
MYCONTENTWOOT

Bug Output
WOOT

Not compatible with node 0.3.0 yet.

I get the following error while trying to run html5:

node.js:50
    throw e;
    ^
TypeError: Cannot read property '_bytesRead' of undefined
    at Object.readFileSync (fs:118:44)
    at Object..js (node.js:332:39)
    at Module.load (node.js:255:25)
    at loadModule (node.js:242:12)
    at require (node.js:272:14)
    at Object.<anonymous> (/usr/local/lib/node/.npm/html5/0.2.5/package/lib/html5/constants.js:1076:22)
    at Module._compile (node.js:325:23)
    at Object..js (node.js:333:12)
    at Module.load (node.js:255:25)
    at loadModule (node.js:242:12)

Even after reducing my test to a single line it still errors:

var  HTML5 = require('html5');

Any help would be appreciated

Documents with doctypes and attributes on the root node break parser on a previously used document

I'm trying to parse www.tripadvisor.com with "zombie" (for Node.js), which uses html5.
The following exception is thrown:

/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:62
                throw(e);
    ^
TypeError: Object [ undefined ] has no method 'getAttribute'
    at Object.startTagHtml (/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/parser/phase.js:66:35)
    at Object.processStartTag (/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/parser/phase.js:41:38)
    at Object.processStartTag (/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/parser/before_html_phase.js:31:20)
    at EventEmitter.do_token (/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/parser.js:100:20)
    at EventEmitter.<anonymous> (/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/parser.js:118:30)
    at EventEmitter.emit (events.js:64:17)
    at EventEmitter.emitToken (/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:84:7)
    at EventEmitter.emit_current_token (/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:814:7)
    at EventEmitter.after_attribute_value_state (/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:558:8)
    at EventEmitter.<anonymous> (/media/old/home/benigo/Downloads/nodejs/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:59:25)

parse error

var util = require('util'),
    zombie = require('zombie');

var browser = zombie.visit('http://www.google.com/search?q=twitter', {
    runScripts : false,
    debug : true,
    userAgent : "Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.98 Safari/534.13"
}, function(err, browser, st) {
    if (err) throw err;
    browser.dump();
});

throws...

Zombie: GET http://www.google.com/search?q=twitter
Zombie: GET http://www.google.com/search?q=twitter => 200
Zombie: GET http://www.google.com/blank.html

/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:62
                throw(e);
    ^
TypeError: Cannot read property '27' of undefined
    at EventEmitter.consume_numeric_entity (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:174:32)
    at EventEmitter.consume_entity (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:103:16)
    at EventEmitter.entity_data_state (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:251:20)
    at EventEmitter.<anonymous> (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:59:25)
    at EventEmitter.emit (events.js:42:17)
    at EventEmitter.pump (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:45:11)
    at EventEmitter.tokenize (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/tokenizer.js:78:21)
    at EventEmitter.parse (/usr/local/lib/node/.npm/html5/0.2.14/package/lib/html5/parser.js:47:17)
    at HtmlToDom.appendHtmlToElement (/usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/browser/htmltodom.js:90:50)
    at Object.innerHTML (/usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/browser/index.js:334:27)

Conditional comments break parsing

Several sites I am looking at have code in them something like this:

<html itemscope itemtype="http://schema.org/NewsArticle" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="msvalidate.01" content="9D28F7743C790DD88F2D9C7375EF7ED5" />

  ...tons of useless stuff that parses correctly removed for clarity...

<script type="text/javascript">
//<![CDATA[ 
var shortURL = "";
BitlyCB.alertResponse = function(data) {
                var s = '';
                var first_result;
                // Results are keyed by longUrl, so we need to grab the first one.
                for     (var r in data.results) {
                        first_result = data.results[r]; break;
                }
                for (var key in first_result) {
                        //s += key + ":" + first_result[key].toString();
                        if(key == "shortUrl")
                        {
                            shortURL = first_result[key].toString();
                            break;
                        }
                }
                PostTwitter();
        }  
    var newstogramURL = "";
window.onload=function(){
if(readCookie("DMUserTrack") != "")
        eraseCookie("DMUserTrack");
        OnNewsreleaseLoad();
        //var userid = readCookie('DMUserTrack');
    var csid = 193265181;
    newstogramURL = 'http://www.prnewswire.com/templates/PRN_Custom_MutltiVu_Recommendation?t=1362148791373&id=193265181';
    //var newstogramURL ='http://www.prnewswire.com/templates/prnwConfig.xml';
    var delay = function() { sendToFlash('StartPlay',newstogramURL); };
    setTimeout(delay,1000,"JavaScript");
    //setTimeout("sendToFlash('StartPlay', 'http://www.prnewswire.com/templates/prnwConfig.xml')",100); 
        };
function ShortenURL()
{
    //Bit.ly function call to shorten url
    BitlyClient.call('shorten', {'longUrl': formatURL()}, 'BitlyCB.alertResponse');
}
//]]> 
</script>   
<!--[if lte IE 6]>
<script type="text/javascript" src="http://www.prnewswire.com/includes/PRN_jquery.pngFix.js"></script>
<script type="text/javascript">
$(document).ready(function(){
    $(document).pngFix( );
    });
</script>
<![endif]-->
<style type="text/css">
/* Style Definitions */
span.prnews_span
{
font-size:8pt;
font-family:"Arial";
color:black;
}
a.prnews_a
{
color:blue;
}
li.prnews_li
{
font-size:8pt;
font-family:"Arial";
color:black;
}
p.prnews_p
{
font-size:0.62em;
font-family:"Arial";
color:black;
margin:"0in";
}
</style>

<!-- Below script is for Google Analytics -->
<script type="text/javascript">
//<![CDATA[ 
  var _gaq = _gaq || [];
  var loc = location.href;
  var env = '';  
  if(env == "dev."){ // for Dev Environment
    _gaq.push(['_setAccount', 'UA-21992272-4']);
  } else if(env=="stage."){ // For Stage Environment
    _gaq.push(['_setAccount', 'UA-21992272-5']);
  } else { // For Prodcution Environment
    _gaq.push(['_setAccount', 'UA-21992272-1']);
  }
  //_gaq.push(['_setDomainName', '.www.prnewswire.com']);
  _gaq.push(['_trackPageview']);

  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();
//]]> 
</script>
 <link rel="stylesheet" type="text/css" href="http://ajax.googleapis.com/ajax/libs/jqueryui/1.7.2/themes/dot-luv/jquery-ui.css" /> 
  </head>

jsdom throws an error 3, even though the HTML is valid. I think I have traced it down to the HTML5 parser: it appears that the data for the comment node is not being set correctly, causing subsequent nodes to be assigned to the incorrect parent. Here is a dump of the nodes being generated for the above HTML section. Notice, at about the 22nd line down, the '<![endif]': it appears that conditional comments are not being parsed into data correctly, and this then throws off the rest of the document.

Parent [ HTML ]
Node { data: ' \r\n', type: 'text' }
Parent [ null ]
Node { type: 'tag',
  name: 'html',
  attribs: 
   { itemscope: '',
     itemtype: 'http://schema.org/NewsArticle',
     xmlns: 'http://www.w3.org/1999/xhtml' },
  children: 
   [ { data: '\r\n', type: 'text' },
     { type: 'tag',
       name: 'head',
       attribs: {},
       children: [Object],
       prev: null,
       next: [Object],
       parent: [Circular] },
     { data: '\t\r\n', type: 'text' },
     { data: '[if lte IE 6]>\r\n<script type="text/javascript" src="http://www.prnewswire.com
/includes/PRN_jquery.pngFix.js"></script>\r\n<script type="text/javascript">\r\n$(document).r
eady(function(){\r\n\t$(document).pngFix( );\r\n\t});\r\n</script>\r\n<![endif]',
       type: 'comment' },
     { data: '\r\n', type: 'text' },
     { type: 'style',
       name: 'style',
       attribs: [Object],
       children: [Object],
       prev: [Object],
       next: [Object],
       parent: [Circular] },
     { data: '\r\n\r\n', type: 'text' },
     { data: ' Below script is for Google Analytics ',
       type: 'comment' },
     { data: '\r\n', type: 'text' },
     { type: 'script',
       name: 'script',
       attribs: [Object],
       children: [Object],
       prev: [Object],
       next: null,
       parent: [Circular] },
     { data: ' \r\n', type: 'text' } ],
  prev: null,
  next: null,
  parent: null }
Parent [ null ]
Node { data: '\r\n ', type: 'text' }
Parent [ null ]
Node { type: 'tag',
  name: 'link',
  attribs: 
   { rel: 'stylesheet',
     type: 'text/css',
     href: 'http://ajax.googleapis.com/ajax/libs/jqueryui/1.7.2/themes/dot-luv/jquery-ui.css'
 },
  children: [],
  prev: null,
  next: null,
  parent: null }
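
As a point of comparison for the dump above, here is a minimal sketch that assumes the usual rule that a comment starts at `<!--` and runs to the first `-->`. Under that assumption, a conditional comment of this form is a single comment node whose data extends through `<![endif]`. The helper `commentData` and the shortened sample markup are hypothetical and are not part of html5 or jsdom.

```javascript
// Hypothetical helper: returns the data of the first comment in `html`,
// assuming a comment starts at '<!--' and ends at the first '-->'.
function commentData(html) {
  const start = html.indexOf('<!--');
  if (start === -1) return null;
  const end = html.indexOf('-->', start + 4);
  if (end === -1) return null;
  return html.slice(start + 4, end);
}

// Shortened stand-in for the conditional comment in the reported page.
const sample = '<!--[if lte IE 6]>\n<script>pngFix();</script>\n<![endif]-->';
const data = commentData(sample);
console.log(JSON.stringify(data));
```

If this assumption holds for the parser, the comment data itself may be as expected, and the wrong parent assignment would need to be looked for in how the nodes after the comment are inserted.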

activeFormattingElements not reset in treebuilder between parses

var html5 = require('html5').HTML5;
var p = new html5.Parser();

p.parse('<p><u>foo</p>');
p.parse('<p>bar</p>');
console.log( p.document.innerHTML );

-> <html><head></head><body><p><u>bar</u></p></body></html>

Not 100% sure if this is correct in all cases, but moving the initialization of open_elements and activeFormattingElements to TreeBuilder.prototype.reset and calling it from the constructor fixes this issue for me. Will send a pull request with this change.
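
The pattern described above can be sketched roughly as follows. This is a stripped-down, hypothetical `TreeBuilder`, not the library's actual class: only the field names `open_elements` and `activeFormattingElements` and the `reset` method name come from the report, and the `parse` method here is a stand-in that just records formatting elements left open.

```javascript
// Hypothetical, stripped-down TreeBuilder; not the library's actual class.
function TreeBuilder() {
  // All per-parse state is created in reset(), so the constructor and
  // any later re-parse start from the same clean slate.
  this.reset();
}

TreeBuilder.prototype.reset = function () {
  this.open_elements = [];
  this.activeFormattingElements = [];
};

// Stand-in for a parse: records formatting elements that were left open.
TreeBuilder.prototype.parse = function (unclosedFormattingTags) {
  this.reset(); // without this call, entries from the previous parse leak in
  for (const tag of unclosedFormattingTags) {
    this.activeFormattingElements.push(tag);
  }
  return this.activeFormattingElements.slice();
};

const builder = new TreeBuilder();
const first = builder.parse(['u']); // like '<p><u>foo</p>'
const second = builder.parse([]);   // like '<p>bar</p>'
console.log(first, second);
```

Keeping the initialization in one method means the constructor and every subsequent parse share the same setup code, so a new piece of state cannot be forgotten in one of the two places.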

tokenizer still has a bug

I wrote code that invokes the html5 parser twice with the same HTML. The following output shows that tokenizer.js still has a bug:

DEBUG: 'tokenizer.content_model=' 0
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'markup_declaration_open_state'
DEBUG: 'tokenizer.state=' 'doctype_state'
DEBUG: 'tokenizer.state=' 'before_doctype_name_state'
DEBUG: 'tokenizer.state=' 'doctype_name_state'
DEBUG: 'tokenizer.token' { type: 'Doctype',
name: 'html',
publicId: null,
systemId: null,
correct: true }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'markup_declaration_open_state'
DEBUG: 'tokenizer.state=' 'comment_start_state'
DEBUG: 'tokenizer.state=' 'comment_state'
DEBUG: 'tokenizer.state=' 'comment_end_dash_state'
DEBUG: 'tokenizer.state=' 'comment_end_state'
DEBUG: 'tokenizer.token' { type: 'Comment', data: 'NewPage' }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'StartTag', name: 'html', data: [] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: ' ' }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: ' ' }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'markup_declaration_open_state'
DEBUG: 'tokenizer.state=' 'comment_start_state'
DEBUG: 'tokenizer.state=' 'comment_state'
DEBUG: 'tokenizer.state=' 'comment_end_dash_state'
DEBUG: 'tokenizer.state=' 'comment_end_state'
DEBUG: 'tokenizer.token' { type: 'Comment',
data: '[if ie 6]>\n\n<![endif]' }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'StartTag', name: 'body', data: [] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'StartTag', name: 'script', data: [] }
DEBUG: 'tokenizer.content_model=' 3
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'close_tag_open_state'
DEBUG: 'tokenizer.content_model=' 0
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'Characters',
data: 'var a = 1;\nvar b = 2;\n' }
DEBUG: 'tokenizer.token' { type: 'EndTag', name: 'script', data: [] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.state=' 'before_attribute_name_state'
DEBUG: 'tokenizer.state=' 'attribute_name_state'
DEBUG: 'tokenizer.state=' 'before_attribute_value_state'
DEBUG: 'tokenizer.state=' 'attribute_value_double_quoted_state'
DEBUG: 'tokenizer.state=' 'after_attribute_value_state'
DEBUG: 'tokenizer.state=' 'before_attribute_name_state'
DEBUG: 'tokenizer.state=' 'attribute_name_state'
DEBUG: 'tokenizer.state=' 'before_attribute_value_state'
DEBUG: 'tokenizer.state=' 'attribute_value_double_quoted_state'
DEBUG: 'tokenizer.state=' 'after_attribute_value_state'
DEBUG: 'tokenizer.token' { type: 'StartTag',
name: 'p',
data:
[ { nodeName: 'class', nodeValue: 'po' },
{ nodeName: 'id', nodeValue: 'p' } ] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: ' ' }
DEBUG: 'tokenizer.token' { type: 'Characters', data: 'element ' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'close_tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'EndTag', name: 'p', data: [] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.token' { type: 'EOF', data: 'End of File' }
DEBUG: 'tokenizer.content_model=' 0
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'markup_declaration_open_state'
DEBUG: 'tokenizer.state=' 'doctype_state'
DEBUG: 'tokenizer.state=' 'before_doctype_name_state'
DEBUG: 'tokenizer.state=' 'doctype_name_state'

DEBUG: 'tokenizer.token' { type: 'Doctype',
name: 'html',
publicId: null,
systemId: null,
correct: true }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'markup_declaration_open_state'
DEBUG: 'tokenizer.state=' 'comment_start_state'
DEBUG: 'tokenizer.state=' 'comment_state'
DEBUG: 'tokenizer.state=' 'comment_end_dash_state'
DEBUG: 'tokenizer.state=' 'comment_end_state'
DEBUG: 'tokenizer.token' { type: 'Comment', data: 'NewPage' }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'StartTag', name: 'html', data: [] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: ' ' }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: ' ' }
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'markup_declaration_open_state'
DEBUG: 'tokenizer.state=' 'comment_start_state'
DEBUG: 'tokenizer.state=' 'comment_state'
DEBUG: 'tokenizer.state=' 'comment_end_dash_state'
DEBUG: 'tokenizer.state=' 'comment_end_state'
DEBUG: 'tokenizer.token' { type: 'Comment',
data: '[if ie 6]>\n\n<![endif]' }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'StartTag', name: 'body', data: [] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'StartTag', name: 'script', data: [] }
DEBUG: 'tokenizer.content_model=' 3
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.token' { type: 'Characters',
data: 'var a = 1;\nvar b = 2;\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'close_tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'Characters', data: undefined }
DEBUG: 'tokenizer.token' { type: 'EndTag', name: 'script', data: [] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.state=' 'before_attribute_name_state'
DEBUG: 'tokenizer.state=' 'attribute_name_state'
DEBUG: 'tokenizer.state=' 'before_attribute_value_state'
DEBUG: 'tokenizer.state=' 'attribute_value_double_quoted_state'
DEBUG: 'tokenizer.state=' 'after_attribute_value_state'
DEBUG: 'tokenizer.state=' 'before_attribute_name_state'
DEBUG: 'tokenizer.state=' 'attribute_name_state'
DEBUG: 'tokenizer.state=' 'before_attribute_value_state'
DEBUG: 'tokenizer.state=' 'attribute_value_double_quoted_state'
DEBUG: 'tokenizer.state=' 'after_attribute_value_state'
DEBUG: 'tokenizer.token' { type: 'StartTag',
name: 'p',
data:
[ { nodeName: 'class', nodeValue: 'po' },
{ nodeName: 'id', nodeValue: 'p' } ] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: ' ' }
DEBUG: 'tokenizer.token' { type: 'Characters', data: 'element ' }
DEBUG: 'tokenizer.state=' 'tag_open_state'
DEBUG: 'tokenizer.state=' 'close_tag_open_state'
DEBUG: 'tokenizer.state=' 'tag_name_state'
DEBUG: 'tokenizer.token' { type: 'EndTag', name: 'p', data: [] }
DEBUG: 'tokenizer.state=' 'data_state'
DEBUG: 'tokenizer.token' { type: 'SpaceCharacters', data: '\n' }
DEBUG: 'tokenizer.token' { type: 'EOF', data: 'End of File' }

The two runs take different paths through the script element: the second run emits an extra Characters token with data: undefined.

html5 parser is significantly slower than default jsdom parser

Using jsdom, I have two snippets that load the same huge html. One snippet, using html5 as jsdom's parser, takes 70 seconds to process. The other, using just jsdom's parser, takes 10 seconds.

https://gist.github.com/886348#file_html5parser.js : takes 70 seconds, uses html5
https://gist.github.com/886348#file_defaultparser.js : takes 10 seconds, does not use html5
https://gist.github.com/886348#file_get_html.sh : run this quick command to download the test html

Of course, the lag could be within jsdom itself, but I haven't written a case to test that yet. The gist was initially written for a different bug I filed with jsdom, but it works here too.

This is definitely an edge case, since the html is such a large file (it's probably one of the largest wikipedia documents), but I was testing worst-case scenarios, and 70 seconds is a bit much! :)
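
One way to check whether the time is spent in the parser or elsewhere is to time each stage separately. The sketch below is only an illustration: `timeMs` is a hypothetical helper, and the two workloads are placeholders standing in for "tokenize only" and "tokenize and build nodes"; they do not call html5 or jsdom.

```javascript
// Hypothetical helper: runs fn once and returns elapsed milliseconds.
function timeMs(fn) {
  const start = process.hrtime.bigint();
  fn();
  const end = process.hrtime.bigint();
  return Number(end - start) / 1e6;
}

// Placeholder workloads; swap in the real parser calls when measuring.
const html = '<p>x</p>'.repeat(10000);
const parseOnly = () => html.split('<').length;
const parseAndBuild = () => html.split('<').map((s) => ({ raw: s }));

const timings = {
  parseOnly: timeMs(parseOnly),
  parseAndBuild: timeMs(parseAndBuild),
};
console.log(timings);
```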

Parse issue when using Zombie

This issue was first mentioned elsewhere, and I was told to post it here as well.

When trying to run any tests, it errors out with this:

/usr/local/lib/node/.npm/html5/0.2.5/package/lib/html5/tokenizer.js:62
            throw(e);
^
Error
at Object.appendChild (/usr/local/lib/node/.npm/jsdom/0.1.22/package/lib/jsdom/level1/core.js:1312:13)
at Object.insert_html_element (/usr/local/lib/node/.npm/html5/0.2.5/package/lib/html5/parser/before_html_phase.js:41:21)
at Object.processStartTag (/usr/local/lib/node/.npm/html5/0.2.5/package/lib/html5/parser/before_html_phase.js:29:7)
at EventEmitter.do_token (/usr/local/lib/node/.npm/html5/0.2.5/package/lib/html5/parser.js:94:20)
at EventEmitter.<anonymous> (/usr/local/lib/node/.npm/html5/0.2.5/package/lib/html5/parser.js:112:30)
at EventEmitter.emit (events:31:17)
at EventEmitter.emitToken (/usr/local/lib/node/.npm/html5/0.2.5/package/lib/html5/tokenizer.js:84:7)
at EventEmitter.emit_current_token (/usr/local/lib/node/.npm/html5/0.2.5/package/lib/html5/tokenizer.js:813:7)
at EventEmitter.tag_name_state (/usr/local/lib/node/.npm/html5/0.2.5/package/lib/html5/tokenizer.js:358:8)
at EventEmitter.<anonymous> (/usr/local/lib/node/.npm/html5/0.2.5/package

Serialize outputs the doctype at the end / DOM Level2 Unavailable

If I try to serialize out a document, html5 is outputting the doctype at the end, rather than the beginning.

My simple-ish test-case code is:

var jsdom = require('jsdom');
var HTML5 = require('html5');
var window = jsdom.jsdom(null, null, {parser: HTML5, features: { QuerySelector: true }}).createWindow();
var parser = new HTML5.Parser({document: window.document});
parser.parse("<!DOCTYPE html><html><head></head><body></body></html>");
console.log(HTML5.serialize(window.document))

What I see output to the console is:

<html><head></head><body></body></html><!DOCTYPE html>

p.endSelectTag does not pass a context to this.parser.reset_insertion_mode() [w/ fix]

I'm not quite certain of the events that lead html5 to the following state, but running capybara's expectations on zombie.js I get this stack trace:

 TypeError: Cannot read property 'tagName' of undefined
   at EventEmitter.reset_insertion_mode (/usr/local/lib/node/.npm/html5/9999.0.0-LINK-3d23ff1b/package/lib/html5/parser.js:190:24)
   at Object.endTagSelect (/usr/local/lib/node/.npm/html5/9999.0.0-LINK-3d23ff1b/package/lib/html5/parser/in_select_phase.js:81:15)

Poking around, I see that Parser.prototype.reset_insertion_mode expects a context parameter but p.endTagSelect does not supply one. I looked for similar patterns in your code and saw that p.prototype.endTagTable in end_table_phase.js is very similar to p.endTagSelect and it supplies this.tree.open_elements.last() to reset_insertion_mode.

I made that change and zombie plowed forward.

Here's the commit: boblail/html5@18c09b9.

I'll send a pull request.
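
The failure and the fix can be illustrated with a small stand-alone sketch. Everything here is a hypothetical stand-in: `resetInsertionMode` only mimics the `context.tagName` access seen in the stack trace, and `openElements` with `last` imitates `this.tree.open_elements.last()`. It is not the library's code.

```javascript
// Hypothetical stand-ins; names echo the report, bodies are illustrative.
function resetInsertionMode(context) {
  // Mimics the failing access in the stack trace: context.tagName
  return context.tagName.toLowerCase() === 'select' ? 'inSelect' : 'inBody';
}

const openElements = [{ tagName: 'BODY' }, { tagName: 'SELECT' }];
const last = (arr) => arr[arr.length - 1];

let errorWithoutContext = null;
try {
  resetInsertionMode(); // no context, as in the reported endTagSelect
} catch (e) {
  errorWithoutContext = e;
}

// With the last open element passed in, the lookup succeeds.
const modeWithContext = resetInsertionMode(last(openElements));
console.log(errorWithoutContext instanceof TypeError, modeWithContext);
```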

doc/jquery-example.js uses deprecated require.paths

Running it with node 0.6.x now gives an error:

$ node jquery-example.js

node.js:201
throw e; // process.nextTick error, or 'error' event on first tick
^
Error: require.paths is removed. Use node_modules folders, or the NODE_PATH environment variable instead.
at Function. (module.js:376:11)
at Object. (html5/doc/jquery-example.js:1:70)
at Module._compile (module.js:432:26)
at Object..js (module.js:450:10)
at Module.load (module.js:351:31)
at Function._load (module.js:310:12)
at Array.0 (module.js:470:10)
at EventEmitter._tickCallback (node.js:192:40)
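
A possible replacement, in line with the error message, is to require the module by an explicit path rather than adding a directory to `require.paths`. The sketch below is self-contained and uses a throwaway module named `local-lib.js`, which is made up for the example; what doc/jquery-example.js actually loads is not shown in this report.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Create a throwaway module so the example is self-contained.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'require-demo-'));
const modulePath = path.join(dir, 'local-lib.js');
fs.writeFileSync(modulePath, "module.exports = { name: 'local-lib' };\n");

// Old style (no longer available):
//   require.paths.unshift(dir); require('local-lib');
// Replacement: resolve the location explicitly and require that path.
const lib = require(path.join(dir, 'local-lib.js'));
console.log(lib.name);
```

For a script inside the repository, the same idea would typically be written relative to `__dirname`, or the dependency would be installed into a `node_modules` folder so that a bare `require` finds it.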

can't clone repo

$ git clone http://dinhe.net/~aredridel/projects/js/html5.git/
Cloning into 'html5'...
fatal: http://dinhe.net/~aredridel/projects/js/html5.git/info/refs not found: did you run git update-server-info on the server?

Parsing issues with HTML comments

I ran into this issue when playing with Zombie.js on a Wicket application. The following gist shows a stripped-down test case (using Zombie.js):

https://gist.github.com/775898

The HTML generated by the server in this gist has the same form as the HTML that Wicket generates when JavaScript is added by back-end code.

Can't npm install html5

Installing with npm fails, though installing @0.2.13 succeeds

$ npm install html5
npm info it worked if it ends with ok
npm info using [email protected]
npm info using [email protected]
npm ERR! gzip "--decompress" "--stdout" "/cygdrive/c/Users/marty.nelson/Document
s/Marty/.nvm/v0.4.2/lib/node/.npm/.cache/html5/0.2.14/package.tgz"
npm ERR! gzip "--decompress" "--stdout" "/cygdrive/c/Users/marty.nelson/Document
s/Marty/.nvm/v0.4.2/lib/node/.npm/.cache/html5/0.2.14/package.tgz" gzip: /cygdri
ve/c/Users/marty.nelson/Documents/Marty/.nvm/v0.4.2/lib/node/.npm/.cache/html5/0
.2.14/package.tgz: unexpected end of file
npm ERR! gzip "--decompress" "--stdout" "/cygdrive/c/Users/marty.nelson/Document
s/Marty/.nvm/v0.4.2/lib/node/.npm/.cache/html5/0.2.14/package.tgz"
npm ERR! Failed unpacking /cygdrive/c/Users/marty.nelson/Documents/Marty/.nvm/v0
.4.2/lib/node/.npm/.cache/html5/0.2.14/package.tgz to /cygdrive/c/Users/marty.ne
lson/Documents/Marty/.nvm/v0.4.2/lib/node/.npm/html5/0.2.14
npm ERR! install failed Error: Failed gzip "--decompress" "--stdout" "/cygdrive/
c/Users/marty.nelson/Documents/Marty/.nvm/v0.4.2/lib/node/.npm/.cache/html5/0.2.
14/package.tgz"
npm ERR! install failed exited with 1
npm ERR! install failed     at ChildProcess.<anonymous> (/cygdrive/c/Users/marty
.nelson/Documents/Marty/.nvm/v0.4.2/lib/node/.npm/npm/0.3.15/package/lib/utils/e
xec.js:77:19)
npm ERR! install failed     at ChildProcess.emit (events.js:45:17)
npm ERR! install failed     at Socket.<anonymous> (child_process.js:151:12)
npm ERR! install failed     at Socket.emit (events.js:42:17)
npm ERR! install failed     at Array.0 (net.js:800:12)
npm ERR! install failed     at EventEmitter._tickCallback (node.js:108:26)
npm info install failed rollback
npm info uninstall [ '[email protected]' ]
npm info preuninstall [email protected]
npm info uninstall [email protected]
npm info postuninstall [email protected]
npm info uninstall [email protected] complete
npm info install failed rolled back
npm ERR! Error: Failed gzip "--decompress" "--stdout" "/cygdrive/c/Users/marty.n
elson/Documents/Marty/.nvm/v0.4.2/lib/node/.npm/.cache/html5/0.2.14/package.tgz"

npm ERR! exited with 1
npm ERR!     at ChildProcess.<anonymous> (/cygdrive/c/Users/marty.nelson/Documen
ts/Marty/.nvm/v0.4.2/lib/node/.npm/npm/0.3.15/package/lib/utils/exec.js:77:19)
npm ERR!     at ChildProcess.emit (events.js:45:17)
npm ERR!     at Socket.<anonymous> (child_process.js:151:12)
npm ERR!     at Socket.emit (events.js:42:17)
npm ERR!     at Array.0 (net.js:800:12)
npm ERR!     at EventEmitter._tickCallback (node.js:108:26)
npm ERR! Report this *entire* log at <http://github.com/isaacs/npm/issues>
npm ERR! or email it to <[email protected]>
npm ERR! Just tweeting a tiny part of the error will not be helpful.
npm ERR! System CYGWIN_NT-6.1-WOW64 1.7.7(0.230/5/3)
npm ERR! argv { remain:
npm ERR! argv    [ 'html5',
npm ERR! argv      'jsdom@>= 0.1.3',
npm ERR! argv      'nodeunit@>= 0.5.0' ],
npm ERR! argv   cooked: [ 'install', 'html5' ],
npm ERR! argv   original: [ 'install', 'html5' ] }
npm not ok

But this works:

$ npm install [email protected]
npm info it worked if it ends with ok
npm info using [email protected]
npm info using [email protected]
npm info preinstall [email protected]
npm info install [email protected]
npm info postinstall [email protected]
npm info predeactivate [email protected]
npm info deactivate [email protected]
npm info postdeactivate [email protected]
npm info preactivate [email protected]
npm info activate [email protected]
npm info postactivate [email protected]
npm info build Success: [email protected]
npm ok

Any ideas?

"undefined" has no method "getAttribute" -- Run 2

The same error as #24, traceback as below:

/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:62
                                throw(e);
          ^
TypeError: Object [ undefined ] has no method 'getAttribute'
    at Object.startTagHtml (/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/parser/phase.js:66:35)
    at Object.processStartTag (/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/parser/phase.js:41:38)
    at Object.processStartTag (/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/parser/before_html_phase.js:31:20)
    at EventEmitter.do_token (/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/parser.js:99:20)
    at EventEmitter.<anonymous> (/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/parser.js:120:30)
    at EventEmitter.emit (events.js:67:17)
    at EventEmitter.emitToken (/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:88:8)
    at EventEmitter.emit_current_token (/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:833:7)
    at EventEmitter.after_attribute_value_state (/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:573:8)
    at EventEmitter.<anonymous> (/usr/lib/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:59:25)

The first 3 lines of the HTML content are:

<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>

I hope this helps to address the issue.

DataTables cause error when tokenizing

I use DataTables (datatables.net) in an application and wanted to test it using zombie.js.

When parsing with HTML5 I got the error message "TypeError: Cannot read property 'tagName' of undefined" and finally tracked it down to the DataTables initialization code. I tried some examples on the datatables site and all had similar errors.

To reproduce:
$ node
> var zombie = require("zombie");
> zombie.visit("http://datatables.net/examples/basic_init/zero_config.html");
>
/usr/lib/node/.npm/html5/0.2.12/package/lib/html5/tokenizer.js:62
throw(e);
^
TypeError: Cannot read property '_ids' of null
at Object. (/usr/lib/node/.npm/jsdom/0.1.23/package/lib/jsdom/level1/core.js:570:33)
at Object.appendChild (/usr/lib/node/.npm/jsdom/0.1.23/package/lib/jsdom/level2/events.js:317:20)
at TreeBuilder.insert_element_normal (/usr/lib/node/.npm/html5/0.2.12/package/lib/html5/treebuilder.js:62:28)
at TreeBuilder.insert_element (/usr/lib/node/.npm/html5/0.2.12/package/lib/html5/treebuilder.js:52:15)
at Object.startTagBody (/usr/lib/node/.npm/html5/0.2.12/package/lib/html5/parser/after_head_phase.js:46:12)
at Object.processStartTag (/usr/lib/node/.npm/html5/0.2.12/package/lib/html5/parser/phase.js:41:38)
at EventEmitter.do_token (/usr/lib/node/.npm/html5/0.2.12/package/lib/html5/parser.js:94:20)
at EventEmitter. (/usr/lib/node/.npm/html5/0.2.12/package/lib/html5/parser.js:112:30)
at EventEmitter.emit (events:31:17)
at EventEmitter.emitToken (/usr/lib/node/.npm/html5/0.2.12/package/lib/html5/tokenizer.js:84:7)

Could you tell me how to debug further to find the real cause of the error? Thanks and keep up the good work!
