node-gumbo-parser's Introduction

Gumbo Parser

Uses Google's gumbo parser to parse HTML in Node.

var gumbo = require("gumbo-parser");
var tree = gumbo(htmlstring);

Usage

There's only one method: gumbo(htmlstring).

You can also pass in options:

gumbo(htmlstring, {
  // The tab-stop size, for computing positions in source code that uses tabs.
  // default: 8
  tabStop: 8,
  // Whether or not to stop parsing when the first error is encountered.
  // default: false
  stopOnFirstError: true,

  // fragment parsing
  // Option 1: just plain HTML in a 'body' context
  fragment: true,

  // Option 2:
  // gumbo-style fragment parsing:
  // can be any valid tag for the namespace
  fragmentContext: "div",
  // optional, can be 'html', 'svg' or 'mathml'; defaults to 'html'
  fragmentNamespace: "html"
});

returns:

// if you use normal document mode:
{
  document: {
    // the document element (see below)
  },

  root: {
    // the html element (see 'Element' below)
  }
}

// if you use fragment parsing:
{
  childNodes: [
    // list of child nodes
  ]
}
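
Because the return value has a different shape in document mode and in fragment mode, code that accepts either may want to normalize it first. The following is a minimal sketch that assumes only the two shapes shown above. The helper name `topLevelNodes` and the sample objects are hypothetical; the objects are hand-written stand-ins for what gumbo() returns, not real parser output.

```javascript
// Hypothetical helper: return the top-level node list for either result shape.
function topLevelNodes(result) {
  // Fragment mode: { childNodes: [...] }
  if (Array.isArray(result.childNodes)) {
    return result.childNodes;
  }
  // Document mode: { document: {...}, root: {...} }
  return result.document.children;
}

// Hand-written stand-ins for gumbo() results (assumptions, not parser output).
var fragmentResult = {
  childNodes: [{ nodeType: 3, nodeName: "#text", textContent: "hi" }]
};
var htmlElement = { nodeType: 1, nodeName: "html", tagName: "html", children: [] };
var documentResult = {
  document: { nodeType: 9, nodeName: "#document", children: [htmlElement] },
  root: htmlElement
};

console.log(topLevelNodes(fragmentResult)[0].textContent); // "hi"
console.log(topLevelNodes(documentResult)[0].tagName);     // "html"
```

With the real module, the same helper would be called on the value returned by gumbo(htmlstring) or gumbo(htmlstring, { fragment: true }).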

Element:
  nodeName (string) (same as tagName)
  nodeType (number) 1
  tagName (string)  (normalized to lowercase)
  originalTag (string) original text from tag
  originalTagEnd (string) original closing tag from original text, if there was one
  children (array) -> behaves like DOM childNodes rather than DOM children,
                      i.e. all text / comment children are included
  tagNamespace (string) "HTML", "SVG" or "MATHML"
  attributes (array)
  startPos (position) -> if element is inserted by parser, this value is undefined
  endPos (position)

TextNode:
  nodeName (string) #text or #cdata-section
  nodeType (number) 3
  textContent (string)
  originalText (string)
  startPos (position)

  Note: In DOM3, CDATA is marked as nodeType 4. However, after checking that neither
  Firefox, Chrome nor Safari marks CDATA as 4 (they use 3 or 8), and that CDATA is
  gone in DOM4, I decided to stick with the futuristic alternative.

Document:
  nodeName (string) #document
  nodeType (number) 9
  children (array)
  hasDoctype true/false
  name (string)                   -> see below
  publicIdentifier (string)       -> see below
  systemIdentifier (string)       -> see below

CommentNode
  nodeName (string) #comment
  nodeType (number) 8
  textContent (string) content of the comment
  nodeValue (string) same as textContent

Attribute
  name: attribute name
  value: attribute value (currently always string, doh)
  nodeType: (number) 2
  nameStart: (position)
  nameEnd: (position)
  valueStart: (position)
  valueEnd: (position)

Position
  line:   number
  column: number
  offset: number
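
The node shapes above are enough to walk a parsed tree without any DOM library. The sketch below shows two small traversals: one collects text, and one lists elements that have no startPos, which per the Element description means the parser inserted them. The function names and the sample tree are hypothetical; the tree is hand-written to match the documented fields and is not real parser output.

```javascript
// Collect the text of a subtree by walking `children` depth-first.
function collectText(node) {
  if (node.nodeType === 3) {   // TextNode
    return node.textContent;
  }
  if (!node.children) {        // e.g. CommentNode: contributes no text
    return "";
  }
  return node.children.map(function (child) {
    return collectText(child);
  }).join("");
}

// List tag names of elements without startPos (inserted by the parser).
function parserInserted(node, found) {
  found = found || [];
  if (node.nodeType === 1 && node.startPos === undefined) {
    found.push(node.tagName);
  }
  (node.children || []).forEach(function (child) {
    parserInserted(child, found);
  });
  return found;
}

// Hand-written sample tree (an assumption, shaped like the docs above).
var pos = { line: 1, column: 1, offset: 0 };
var sample = {
  nodeType: 1, tagName: "html", children: [
    { nodeType: 1, tagName: "body", children: [
      { nodeType: 1, tagName: "p", startPos: pos, children: [
        { nodeType: 3, nodeName: "#text", textContent: "Hello " },
        { nodeType: 8, nodeName: "#comment", textContent: "ignored" },
        { nodeType: 3, nodeName: "#text", textContent: "world" }
      ] }
    ] }
  ]
};

console.log(collectText(sample));    // "Hello world"
console.log(parserInserted(sample)); // [ 'html', 'body' ]
```

With the real module, `sample` would be replaced by `tree.root` from `gumbo(htmlstring)`.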

About html doctypes

An HTML document will always have the document.name "html". If the doctype contains anything more than that, as in this XHTML 1.0 Strict doctype:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

the first quoted part ends up in document.publicIdentifier, and the second quoted part ends up in document.systemIdentifier. You can read more about this here: http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#syntax-doctype.
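
To make the mapping concrete, here is a small sketch that rebuilds a doctype string from the Document fields listed earlier (hasDoctype, name, publicIdentifier, systemIdentifier). The helper `doctypeString` is hypothetical, the sample objects are hand-written rather than parser output, and the sketch assumes that an absent identifier is an empty string or undefined.

```javascript
// Hypothetical helper: rebuild a doctype string from Document fields.
function doctypeString(doc) {
  if (!doc.hasDoctype) {
    return "";
  }
  var out = "<!DOCTYPE " + doc.name;
  if (doc.publicIdentifier) {
    out += ' PUBLIC "' + doc.publicIdentifier + '"';
    if (doc.systemIdentifier) {
      out += ' "' + doc.systemIdentifier + '"';
    }
  } else if (doc.systemIdentifier) {
    out += ' SYSTEM "' + doc.systemIdentifier + '"';
  }
  return out + ">";
}

// Hand-written stand-ins for the `document` part of a gumbo() result.
var xhtmlDoc = {
  hasDoctype: true,
  name: "html",
  publicIdentifier: "-//W3C//DTD XHTML 1.0 Strict//EN",
  systemIdentifier: "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
};
var html5Doc = { hasDoctype: true, name: "html", publicIdentifier: "", systemIdentifier: "" };

console.log(doctypeString(xhtmlDoc));
console.log(doctypeString(html5Doc)); // "<!DOCTYPE html>"
```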

Untrusted content / XSS cleaning

If you plan to use gumbo-parser to clean user input: the gumbo parser is one of the most well-tested and audited parsers available. Please read this comment from the gumbo-parser authors. There's a node module for XSS cleaning with the gumbo parser. Check out Gumbo-Sanitize!

Node 0.8

Contrary to what I previously said, node-gumbo-parser does build under node 0.8. You might have to run npm update -g npm first, though.

Build and test:

node-gyp configure
node-gyp build
npm test

Changes

0.2.2 Update to use the latest NAN API, so it works with node 4.0

0.2.1 Celebrating some new stuff with a MINOR version change
* Fragment parsing supports fragmentContext and fragmentNamespace
Uses version 0.10.1. Big changes from the gumbo-parser team:
* Fragment parsing (instead of my homebrew fragment parsing, the gumbo C lib now supports fragments)
* Parses all html5lib tests, including template
* 30-40% speed improvement
See all changes here

0.1.13 Upgrade C lib. Uses version 0.9.3: CDATA handling (see note in docs). See all changes here

0.1.12 io.js support! Thanks a lot to MicroMike

0.1.11 Upgrade C lib. Uses version 0.9.2: performance improvements, duplicate attributes, semicolon fix. See all changes here

0.1.10 Visual Studio bugfix. Thanks takenspc

0.1.9 Experimental fragment parsing. Expose node positions from the parser, which also lets the user see whether an element was inserted by the parser or was in the text. Update gumbo parser to a more secure version. Update statement about security.

0.1.8 Fix for BSD build problem

0.1.7 Fixes for build on snow leopard

0.1.6 Adding originalTag, originalTagName and tagNamespace. If the tag is unknown, parse originalTag and set it as the tag.

0.1.5 Updating the gumbo-parser to the latest version. This includes some security fixes, and if you use this for user content, please update.

0.1.4 Temporary workaround for the latest changes in node 0.11, thanks Daniel

0.1.3 Fixes utf-8 bug, thanks Yonatan

0.1.2 Taking the (optional) options argument, providing publicIdentifier and systemIdentifier for the doctype

0.1.1 Fix build on node 0.8

0.1.0 Passing { document: document, root: root } instead of only root

node-gumbo-parser's People

Contributors

akzhan, karlwestin, mike820324, takenspc, thehydroimpulse, yonatan

node-gumbo-parser's Issues

Install error

Tried installing through npm, but got a gyp error:

> [email protected] install /Users/sindresorhus/Downloads/node_modules/gumbo-parser
> node-gyp rebuild

Package gumbo was not found in the pkg-config search path.
Perhaps you should add the directory containing `gumbo.pc'
to the PKG_CONFIG_PATH environment variable
No package 'gumbo' found
gyp: Call to 'pkg-config --libs gumbo' returned exit status 1. while trying to load binding.gyp
gyp ERR! configure error 
gyp ERR! stack Error: `gyp` failed with exit code: 1
gyp ERR! stack     at ChildProcess.onCpExit (/usr/local/lib/node_modules/npm/node_modules/node-gyp/lib/configure.js:415:16)
gyp ERR! stack     at ChildProcess.EventEmitter.emit (events.js:98:17)
gyp ERR! stack     at Process.ChildProcess._handle.onexit (child_process.js:789:12)
gyp ERR! System Darwin 12.4.0
gyp ERR! command "node" "/usr/local/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js" "rebuild"
gyp ERR! cwd /Users/sindresorhus/Downloads/node_modules/gumbo-parser
gyp ERR! node -v v0.10.12
gyp ERR! node-gyp -v v0.10.0
gyp ERR! not ok 
npm ERR! weird error 1
npm ERR! not ok code 0

Same when I tried running node-gyp configure:

Package gumbo was not found in the pkg-config search path.
Perhaps you should add the directory containing `gumbo.pc'
to the PKG_CONFIG_PATH environment variable
No package 'gumbo' found
gyp: Call to 'pkg-config --libs gumbo' returned exit status 1. while trying to load binding.gyp
gyp ERR! configure error 
gyp ERR! stack Error: `gyp` failed with exit code: 1
gyp ERR! stack     at ChildProcess.onCpExit (/usr/local/lib/node_modules/node-gyp/lib/configure.js:424:16)
gyp ERR! stack     at ChildProcess.EventEmitter.emit (events.js:98:17)
gyp ERR! stack     at Process.ChildProcess._handle.onexit (child_process.js:789:12)
gyp ERR! System Darwin 12.4.0
gyp ERR! command "node" "/usr/local/bin/node-gyp" "configure"
gyp ERR! cwd /Users/sindresorhus/Downloads/gumbo-parser
gyp ERR! node -v v0.10.12
gyp ERR! node-gyp -v v0.10.9
gyp ERR! not ok 

Allow options

The gumbo parser takes a couple of options:
{
tab_stop: (tab size, integer)
stop_on_first_error: (bool)
}
It would be nice to be able to access those from JS too.

Gumbo parser

Can anyone tell me how to install gumbo parser on Visual Studio 2012?

parse html start with comment

When I run this:
var html = '<!--comment --><html><body><head></head></body></html>';
var tree = gumbo(html);

tree.root.nodeType is 8. Is that right?
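
A possible workaround for the case described above, sketched under the assumption that the html element is still present in document.children after the leading comment (this has not been verified against the library). The helper `findRootElement` and the sample document are hypothetical; the sample is hand-written and is not real parser output.

```javascript
// Hypothetical helper: return the first element child of the document,
// skipping comments and other non-element nodes.
function findRootElement(doc) {
  var children = doc.children || [];
  for (var i = 0; i < children.length; i++) {
    if (children[i].nodeType === 1) {
      return children[i];
    }
  }
  return null;
}

// Hand-written stand-in for tree.document (an assumption about its shape).
var sampleDocument = {
  nodeType: 9,
  nodeName: "#document",
  children: [
    { nodeType: 8, nodeName: "#comment", textContent: "comment " },
    { nodeType: 1, nodeName: "html", tagName: "html", children: [] }
  ]
};

console.log(findRootElement(sampleDocument).tagName); // "html"
```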

compilation fail on snow leopard

Hey,

When I try to npm install gumbo-parser it seems to get to the point of compiling the parser itself in C, but blows up with the following:

CC(target) Release/obj.target/gumbo/deps/gumbo-parser/src/attribute.o
../deps/gumbo-parser/src/attribute.c: In function ‘gumbo_get_attribute’:
../deps/gumbo-parser/src/attribute.c:30: error: ‘for’ loop initial declaration used outside C99 mode
../deps/gumbo-parser/src/attribute.c:30: warning: comparison between signed and unsigned
make: *** [Release/obj.target/gumbo/deps/gumbo-parser/src/attribute.o] Error 1

I'm sure this is something to do with the GYP build script for the parser, but after trying for quite a while and to no avail to patch it I think it's about time to ask. My guess is the gnu99 requirement being enforced isn't correct, but even if I'm on the right track I can't figure out how to relax that requirement.

Any suggestion would be most welcome.

Thanks, Alex J Burke.

iojs support

Hello,
I have used nan to rewrite the node-gumbo.cc a little bit, the following is the diff link:
mike820324@ff13216

It should install properly in iojs v1.3.0.
Should I send a pull request?

Help wanted: makeshift fragment parsing

I've been thinking a little about simplified fragment parsing, since the gumbo parser hasn't implemented it yet. When they do, we should use their version, but how about doing something in the meantime?

If you're interested in this, check out the feature-fragment branch

git clone https://github.com/karlwestin/node-gumbo-parser.git
cd node-gumbo-parser
git checkout feature-fragment
node-gyp rebuild

and then require("/path/to/repo/node-gumbo")

see example in test/test.js in that branch

let me know if you need any help, any feedback appreciated!

Thanks!
