
nokogumbo's Introduction

Nokogumbo - a Nokogiri interface to the Gumbo HTML5 parser.

NOTICE: End of life

Nokogumbo has been merged into Nokogiri v1.12.0. Please update to Nokogiri v1.12.0 or later, and remove Nokogumbo as an explicit dependency.

Please note that the final release of Nokogumbo, v2.0.5 (2021-03-19), will not support Ruby 3.2.0 and later. For Ruby 3.2 support, please upgrade to Nokogiri v1.12.0 or later and remove Nokogumbo as an explicit dependency.
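
For a project managed with Bundler, the migration might look like the following Gemfile sketch. The version constraint comes from the notice above; the rest of the file is an assumption about a typical setup.

```ruby
# Gemfile (sketch; adapt to the project's existing sources and groups)
source 'https://rubygems.org'

# Nokogiri 1.12.0 and later include the HTML5 parser that Nokogumbo provided.
gem 'nokogiri', '>= 1.12.0'

# Remove the explicit Nokogumbo dependency:
# gem 'nokogumbo'
```

Application code can then use require 'nokogiri' in place of require 'nokogumbo'; parsing entry points such as Nokogiri.HTML5 and Nokogiri::HTML5.fragment are available from Nokogiri itself.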


Summary

Nokogumbo provides the ability for a Ruby program to invoke our version of the Gumbo HTML5 parser and to access the result as a Nokogiri::HTML::Document.

Usage

require 'nokogumbo'
doc = Nokogiri.HTML5(string)

To parse an HTML fragment, a fragment method is provided.

require 'nokogumbo'
doc = Nokogiri::HTML5.fragment(string)

Because HTML is often fetched via the web, a convenience interface to HTTP get is also provided:

require 'nokogumbo'
doc = Nokogiri::HTML5.get(uri)

Parsing options

The document and fragment parsing methods,

  • Nokogiri.HTML5(html, url = nil, encoding = nil, options = {})
  • Nokogiri::HTML5.parse(html, url = nil, encoding = nil, options = {})
  • Nokogiri::HTML5::Document.parse(html, url = nil, encoding = nil, options = {})
  • Nokogiri::HTML5.fragment(html, encoding = nil, options = {})
  • Nokogiri::HTML5::DocumentFragment.parse(html, encoding = nil, options = {})

support options that are different from Nokogiri's.

The three currently supported options are :max_errors, :max_tree_depth and :max_attributes, described below.

Error reporting

Nokogumbo contains an experimental parse error reporting facility. By default, no parse errors are reported but this can be configured by passing the :max_errors option to ::parse or ::fragment.

require 'nokogumbo'
doc = Nokogiri::HTML5.parse('<span/>Hi there!</span foo=bar />', max_errors: 10)
doc.errors.each do |err|
  puts(err)
end

This prints the following.

1:1: ERROR: Expected a doctype token
<span/>Hi there!</span foo=bar />
^
1:1: ERROR: Start tag of nonvoid HTML element ends with '/>', use '>'.
<span/>Hi there!</span foo=bar />
^
1:17: ERROR: End tag ends with '/>', use '>'.
<span/>Hi there!</span foo=bar />
                ^
1:17: ERROR: End tag contains attributes.
<span/>Hi there!</span foo=bar />
                ^

Using max_errors: -1 results in an unlimited number of errors being returned.

The errors returned by #errors are instances of Nokogiri::XML::SyntaxError.

The HTML standard defines a number of standard parse error codes. These error codes only cover the "tokenization" stage of parsing HTML. The parse errors in the "tree construction" stage do not have standardized error codes (yet).

As a convenience to Nokogumbo users, the defined error codes are available via the Nokogiri::XML::SyntaxError#str1 method.

require 'nokogumbo'
doc = Nokogiri::HTML5.parse('<span/>Hi there!</span foo=bar />', max_errors: 10)
doc.errors.each do |err|
  puts("#{err.line}:#{err.column}: #{err.str1}")
end

This prints the following.

1:1: generic-parser
1:1: non-void-html-element-start-tag-with-trailing-solidus
1:17: end-tag-with-trailing-solidus
1:17: end-tag-with-attributes

Note that the first error is generic-parser because it's an error from the tree construction stage and doesn't have a standardized error code.

For the purposes of semantic versioning, the error messages, error locations, and error codes are not part of Nokogumbo's public API. That is, these are subject to change without Nokogumbo's major version number changing. These may be stabilized in the future.
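
Because tree construction errors all share the generic-parser code, a caller may want to separate them from tokenization errors. The following is a minimal sketch, not part of Nokogumbo: split_by_stage is a hypothetical helper, and ParseError is a stand-in for Nokogiri::XML::SyntaxError that models only the line, column, and str1 readers so the example runs without the gem. Since the codes are not part of the public API, treat the comparison string as an assumption that may change.

```ruby
# Stand-in for Nokogiri::XML::SyntaxError (hypothetical; models only the
# readers used in the example above).
ParseError = Struct.new(:line, :column, :str1)

# Split errors into tree-construction errors (reported as 'generic-parser')
# and tokenization errors (reported with a standardized code).
def split_by_stage(errors)
  tree, tokenizer = errors.partition { |err| err.str1 == 'generic-parser' }
  { tree_construction: tree, tokenization: tokenizer }
end

errors = [
  ParseError.new(1, 1, 'generic-parser'),
  ParseError.new(1, 1, 'non-void-html-element-start-tag-with-trailing-solidus'),
  ParseError.new(1, 17, 'end-tag-with-trailing-solidus'),
  ParseError.new(1, 17, 'end-tag-with-attributes')
]

by_stage = split_by_stage(errors)
by_stage.each do |stage, errs|
  puts "#{stage}: #{errs.map(&:str1).join(', ')}"
end
```

With a real document, doc.errors could be passed in place of the hand-built array.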

Maximum tree depth

The maximum depth of the DOM tree parsed by the various parsing methods is configurable by the :max_tree_depth option. If the depth of the tree would exceed this limit, then an ArgumentError is raised.

This limit (which defaults to Nokogumbo::DEFAULT_MAX_TREE_DEPTH = 400) can be removed by giving the option max_tree_depth: -1.

html = '<!DOCTYPE html>' + '<div>' * 1000
doc = Nokogiri.HTML5(html)
# raises ArgumentError: Document tree depth limit exceeded
doc = Nokogiri.HTML5(html, max_tree_depth: -1)

Attribute limit per element

The maximum number of attributes per DOM element is configurable by the :max_attributes option. If a given element would exceed this limit, then an ArgumentError is raised.

This limit (which defaults to Nokogumbo::DEFAULT_MAX_ATTRIBUTES = 400) can be removed by giving the option max_attributes: -1.

html = '<!DOCTYPE html><div ' + (1..1000).map { |x| "attr-#{x}" }.join(' ') + '>'
# "<!DOCTYPE html><div attr-1 attr-2 attr-3 ... attr-1000>"
doc = Nokogiri.HTML5(html)
# raises ArgumentError: Attributes per element limit exceeded
doc = Nokogiri.HTML5(html, max_attributes: -1)
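
Both limits surface as ArgumentError, so code that handles untrusted HTML may prefer to catch the error rather than remove the limit. The following is a minimal sketch under stated assumptions: parse_within_limits is a hypothetical helper, and fake_parse is a stand-in parser used only so the example runs without the gem installed.

```ruby
# Run a parse block and turn a limit violation into a nil document plus the
# error message. The block is expected to call a parsing method, e.g.
#   parse_within_limits { Nokogiri.HTML5(html, max_tree_depth: 100) }
def parse_within_limits
  [yield, nil]
rescue ArgumentError => e
  [nil, e.message]
end

# Stand-in parser (hypothetical): raises like the depth limit described above.
fake_parse = lambda do |html|
  raise ArgumentError, 'Document tree depth limit exceeded' if html.scan('<div>').size > 400
  :parsed_document
end

doc, error = parse_within_limits { fake_parse.call('<div>' * 1000) }
puts "rejected: #{error}" if doc.nil?
```

Note that rescuing ArgumentError this broadly also catches unrelated argument errors raised inside the block, so a real caller may want to inspect the message before deciding how to react.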

HTML Serialization

After parsing HTML, it may be serialized using any of the Nokogiri serialization methods. In particular, #serialize, #to_html, and #to_s will serialize a given node and its children. (This is the equivalent of JavaScript's Element.outerHTML.) Similarly, #inner_html will serialize the children of a given node. (This is the equivalent of JavaScript's Element.innerHTML.)

doc = Nokogiri::HTML5("<!DOCTYPE html><span>Hello world!</span>")
puts doc.serialize
# Prints: <!DOCTYPE html><html><head></head><body><span>Hello world!</span></body></html>

Due to quirks in how HTML is parsed and serialized, it's possible for a DOM tree to be serialized and then re-parsed, resulting in a different DOM. Mostly, this happens with DOMs produced from invalid HTML. Unfortunately, even valid HTML may not survive serialization and re-parsing.

In particular, a newline at the start of pre, listing, and textarea elements is ignored by the parser.

doc = Nokogiri::HTML5(<<-EOF)
<!DOCTYPE html>
<pre>
Content</pre>
EOF
puts doc.at('/html/body/pre').serialize
# Prints: <pre>Content</pre>

In this case, the original HTML is semantically equivalent to the serialized version. If the pre, listing, or textarea content starts with two newlines, the first newline will be stripped on the first parse and the second newline will be stripped on the second, leading to semantically different DOMs. Passing the parameter preserve_newline: true will cause two or more newlines to be preserved. (A single leading newline will still be removed.)

doc = Nokogiri::HTML5(<<-EOF)
<!DOCTYPE html>
<listing>

Content</listing>
EOF
puts doc.at('/html/body/listing').serialize(preserve_newline: true)
# Prints: <listing>
#
# Content</listing>
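
The round-trip behavior can be illustrated with a toy model. This is a sketch only: parse_pre_content and serialize_pre_content are hypothetical stand-ins that mimic the newline rule described above, not the parser or serializer themselves.

```ruby
# Toy parser: drop a single line feed that immediately follows the start tag
# of a pre, listing, or textarea element.
def parse_pre_content(raw)
  raw.sub(/\A\n/, '')
end

# Toy serializer: by default the content is emitted as-is; with
# preserve_newline, an extra newline is emitted when the content itself
# begins with one, so that re-parsing restores the same content.
def serialize_pre_content(content, preserve_newline: false)
  preserve_newline && content.start_with?("\n") ? "\n#{content}" : content
end

first = parse_pre_content("\n\nContent")           # "\nContent"
lossy = parse_pre_content(serialize_pre_content(first))
kept  = parse_pre_content(serialize_pre_content(first, preserve_newline: true))

p first == lossy # false: the second newline was stripped on re-parse
p first == kept  # true: the content survived the round trip
```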

Encodings

Nokogumbo always parses HTML using UTF-8; however, the encoding of the input can be explicitly selected via the optional encoding parameter. This is most useful when the input comes not from a string but from an IO object.

When serializing a document or node, the encoding of the output string can be specified via the :encoding option. Characters that cannot be encoded in the selected encoding will be encoded as HTML numeric entities.

frag = Nokogiri::HTML5.fragment('<span>아는 길도 물어가라</span>')
html = frag.serialize(encoding: 'US-ASCII')
puts html
# Prints: <span>&#xc544;&#xb294; &#xae38;&#xb3c4; &#xbb3c;&#xc5b4;&#xac00;&#xb77c;</span>
frag = Nokogiri::HTML5.fragment(html)
puts frag.serialize
# Prints: <span>아는 길도 물어가라</span>

(There's a bug in all current versions of Ruby that can cause the entity encoding to fail. Of the mandated supported encodings for HTML, the only encoding I'm aware of that has this bug is 'ISO-2022-JP'. I recommend avoiding this encoding.)
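
The general technique of replacing unencodable characters with numeric entities can be sketched in plain Ruby using String#encode with its :fallback option. This is an illustration of the idea, not a claim about how the serializer is implemented; encode_with_entities is a hypothetical helper name.

```ruby
# Encode text into the target encoding, replacing each character that the
# target cannot represent with a hexadecimal numeric character reference.
def encode_with_entities(text, encoding)
  text.encode(encoding, fallback: ->(char) { format('&#x%x;', char.ord) })
end

puts encode_with_entities('<span>아는 길</span>', 'US-ASCII')
# Prints: <span>&#xc544;&#xb294; &#xae38;</span>
```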

Examples

require 'nokogumbo'
puts Nokogiri::HTML5.get('http://nokogiri.org').search('ol li')[2].text

Notes

  • The Nokogiri::HTML5.fragment function takes a string and parses it as an HTML5 document. The <html>, <head>, and <body> elements are removed from this document, and any children of these elements that remain are returned as a Nokogiri::HTML::DocumentFragment.

  • The Nokogiri::HTML5.parse function takes a string and passes it to the gumbo_parse_with_options method, using the default options. The resulting Gumbo parse tree is then walked.

    • If the necessary Nokogiri and libxml2 headers can be found at installation time then an xmlDoc tree is produced and a single Nokogiri Ruby object is constructed to wrap the xmlDoc structure. Nokogiri only produces Ruby objects as necessary, so all searching is done using the underlying libxml2 libraries.
    • If the necessary headers are not present at installation time, then Nokogiri Ruby objects are created for each Gumbo node. Other than memory usage and CPU time, the results should be equivalent.
  • The Nokogiri::HTML5.get function takes care of following redirects, https, and determining the character encoding of the result, based on the rules defined in the HTML5 specification for doing so.

  • Instead of uppercase element names, lowercase element names are produced.

  • Instead of returning unknown as the element name for unknown tags, the original tag name is returned verbatim.

Flavors of Nokogumbo

Nokogumbo uses libxml2, the XML library underlying Nokogiri, to speed up parsing. If the libxml2 headers are not available, then Nokogumbo resorts to using Nokogiri's Ruby API to construct the DOM tree.

Nokogiri can be configured to either use the system library version of libxml2 or use a bundled version. By default (as of Nokogiri version 1.8.4), Nokogiri will use a bundled version.

To prevent differences between versions of libxml2, Nokogumbo will only use libxml2 if the build process can find the exact same version used by Nokogiri. This leads to three possibilities:

  1. Nokogiri is compiled with the bundled libxml2. In this case, Nokogumbo will (by default) use the same version of libxml2.
  2. Nokogiri is compiled with the system libxml2. In this case, if the libxml2 headers are available, then Nokogumbo will (by default) use the system version and headers.
  3. Nokogiri is compiled with the system libxml2 but its headers aren't available at build time for Nokogumbo. In this case, Nokogumbo will use the slower Ruby API.
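
The three cases above can be restated as a small decision function. This is a hypothetical sketch for illustration only; dom_builder_for and its keyword arguments are invented names and do not correspond to anything in the build scripts.

```ruby
# Returns :libxml2 when the fast path would be used and :ruby_api otherwise,
# following the three default cases listed above.
#   nokogiri_libxml2:  :bundled or :system
#   headers_available: whether the system libxml2 headers can be found
#                      (only consulted in the :system case)
def dom_builder_for(nokogiri_libxml2:, headers_available:)
  case nokogiri_libxml2
  when :bundled then :libxml2
  when :system  then headers_available ? :libxml2 : :ruby_api
  else raise ArgumentError, "unknown libxml2 source: #{nokogiri_libxml2.inspect}"
  end
end

p dom_builder_for(nokogiri_libxml2: :system, headers_available: false)
# Prints: :ruby_api
```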

Using libxml2 can be required by passing -- --with-libxml2 to bundle exec rake or to gem install. Using libxml2 can be prohibited by instead passing -- --without-libxml2.

Functionally, the only difference between using libxml2 or not is in the behavior of Nokogiri::XML::Node#line. If it is used, then #line will return the line number of the corresponding node. Otherwise, it will return 0.

Installation

git clone https://github.com/rubys/nokogumbo.git
cd nokogumbo
bundle install
rake gem
gem install pkg/nokogumbo*.gem

Related efforts

  • ruby-gumbo -- a ruby binding for the Gumbo HTML5 parser.
  • lua-gumbo -- a lua binding for the Gumbo HTML5 parser.

nokogumbo's People

Contributors

alexandrebernard, craigbarnes, duvduvfb, flavorjones, fulldecent, grosser, jbotelho2-bb, jeremy, jhawthorn, jirutka, johan-smits, krutten, lis2, lowjoel, mattwildig, mrasu, rafbm, rgrove, robbat2, rubys, squeek502, staugaard, stevecheckoway, vryash

nokogumbo's Issues

Compilation issue

All versions after 1.4.3 fail to compile on my machine (OS X).

It looks like you are using an undeclared constant (GUMBO_NODE_TEMPLATE), maybe?
Here is the output:

➜  ~ gem install nokogumbo -v 1.4.7
Fetching: nokogumbo-1.4.7.gem (100%)
Building native extensions.  This could take a while...
ERROR:  Error installing nokogumbo:
    ERROR: Failed to build gem native extension.

    /Users/staugaard/.rubies/ruby-2.2.2/bin/ruby -r ./siteconf20160104-53286-8m6nee.rb extconf.rb
checking for xmlNewDoc() in -lxml2... yes
checking for nokogiri.h in /Users/staugaard/.gem/ruby/2.2.2/gems/nokogiri-1.6.6.2.1/ext/nokogiri... yes
checking for nokogiri.h in /Users/staugaard/.gem/ruby/2.2.2/gems/nokogiri-1.6.6.2.1/ext/nokogiri... yes
checking for gumbo_parse() in -lgumbo... yes
creating Makefile

make "DESTDIR=" clean

make "DESTDIR="
compiling nokogumbo.c
In file included from nokogumbo.c:28:
/Users/staugaard/.gem/ruby/2.2.2/gems/nokogiri-1.6.6.2.1/ext/nokogiri/nokogiri.h:13:9: warning: '_GNU_SOURCE' macro redefined [-Wmacro-redefined]
#define _GNU_SOURCE
        ^
/Users/staugaard/.rubies/ruby-2.2.2/include/ruby-2.2.0/x86_64-darwin14/ruby/config.h:17:9: note: previous definition is here
#define _GNU_SOURCE 1
        ^
nokogumbo.c:115:12: warning: assigning to 'char *' from 'const char [7]' discards qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
        ns = "xlink:";
           ^ ~~~~~~~~
nokogumbo.c:119:12: warning: assigning to 'char *' from 'const char [5]' discards qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
        ns = "xml:";
           ^ ~~~~~~
nokogumbo.c:123:12: warning: assigning to 'char *' from 'const char [7]' discards qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
        ns = "xmlns:";
           ^ ~~~~~~~~
nokogumbo.c:160:12: error: use of undeclared identifier 'GUMBO_NODE_TEMPLATE'; did you mean 'GUMBO_TAG_TEMPLATE'?
      case GUMBO_NODE_TEMPLATE:
           ^~~~~~~~~~~~~~~~~~~
           GUMBO_TAG_TEMPLATE
/usr/local/include/gumbo.h:172:3: note: 'GUMBO_TAG_TEMPLATE' declared here
  GUMBO_TAG_TEMPLATE,
  ^
nokogumbo.c:160:12: warning: case value not in enumerated type 'GumboNodeType' [-Wswitch]
      case GUMBO_NODE_TEMPLATE:
           ^
nokogumbo.c:110:19: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (int i=0; i < attrs->length; i++) {
                ~ ^ ~~~~~~~~~~~~~
nokogumbo.c:132:47: warning: comparison of integers of different signs: 'unsigned long' and 'int' [-Wsign-compare]
      if (strlen(ns) + strlen(attr->name) + 1 > namelen) {
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~
nokogumbo.c:138:51: warning: implicit conversion loses integer precision: 'unsigned long' to 'int' [-Wshorten-64-to-32]
        namelen = strlen(ns) + strlen(attr->name) + 1;
                ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
nokogumbo.c:153:19: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (int i=0; i < children->length; i++) {
                ~ ^ ~~~~~~~~~~~~~~~~
9 warnings and 1 error generated.
make: *** [nokogumbo.o] Error 1

make failed, exit code 2

Gem files will remain installed in /Users/staugaard/.gem/ruby/2.2.2/gems/nokogumbo-1.4.7 for inspection.
Results logged to /Users/staugaard/.gem/ruby/2.2.2/extensions/x86_64-darwin-14/2.2.0-static/nokogumbo-1.4.7/gem_make.out

Gem fails to compile on Windows

gem install nokogumbo yields:

Fetching: nokogumbo-1.1.9.gem (100%)
Temporarily enhancing PATH to include DevKit...
Building native extensions.  This could take a while...
ERROR:  Error installing nokogumbo:
        ERROR: Failed to build gem native extension.

    C:/Ruby/bin/ruby.exe extconf.rb
checking for xmlNewDoc() in -lxml2... *** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary
libraries and/or headers.  Check the mkmf.log file for more details.  You may
need configuration options.

Provided configuration options:
        --with-opt-dir
        --without-opt-dir
        --with-opt-include
        --without-opt-include=${opt-dir}/include
        --with-opt-lib
        --without-opt-lib=${opt-dir}/lib
        --with-make-prog
        --without-make-prog
        --srcdir=.
        --curdir
        --ruby=C:/Ruby/bin/ruby
        --with-xml2lib
        --without-xml2lib
C:/Ruby/lib/ruby/2.0.0/mkmf.rb:434:in `try_do': The compiler failed to generate an executable file. (RuntimeError)
You have to install development tools first.
        from C:/Ruby/lib/ruby/2.0.0/mkmf.rb:519:in `try_link0'
        from C:/Ruby/lib/ruby/2.0.0/mkmf.rb:534:in `try_link'
        from C:/Ruby/lib/ruby/2.0.0/mkmf.rb:720:in `try_func'
        from C:/Ruby/lib/ruby/2.0.0/mkmf.rb:950:in `block in have_library'
        from C:/Ruby/lib/ruby/2.0.0/mkmf.rb:895:in `block in checking_for'
        from C:/Ruby/lib/ruby/2.0.0/mkmf.rb:340:in `block (2 levels) in postpone'
        from C:/Ruby/lib/ruby/2.0.0/mkmf.rb:310:in `open'
        from C:/Ruby/lib/ruby/2.0.0/mkmf.rb:340:in `block in postpone'
        from C:/Ruby/lib/ruby/2.0.0/mkmf.rb:310:in `open'
        from C:/Ruby/lib/ruby/2.0.0/mkmf.rb:336:in `postpone'
        from C:/Ruby/lib/ruby/2.0.0/mkmf.rb:894:in `checking_for'
        from C:/Ruby/lib/ruby/2.0.0/mkmf.rb:945:in `have_library'
        from extconf.rb:4:in `<main>'


Gem files will remain installed in C:/Ruby/lib/ruby/gems/2.0.0/gems/nokogumbo-1.1.9 for inspection.
Results logged to C:/Ruby/lib/ruby/gems/2.0.0/gems/nokogumbo-1.1.9/ext/nokogumboc/gem_make.out

mkmf.log:

"gcc -o conftest.exe -IC:/Ruby/include/ruby-2.0.0/i386-mingw32 -IC:/Ruby/include/ruby-2.0.0/ruby/backward -IC:/Ruby/include/ruby-2.0.0 -I. -DFD_SETSIZE=2048 -D_WIN32_WINNT=0x0501 -D_FILE_OFFSET_BITS=64   -O3 -fno-omit-frame-pointer -fno-fast-math -g -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wunused-variable -Wpointer-arith -Wwrite-strings -Wdeclaration-after-statement -Wimplicit-function-declaration -std=c99 conftest.c  -L. -LC:/Ruby/lib -L.      -lmsvcrt-ruby200  -lshell32 -lws2_32 -limagehlp -lshlwapi  "
In file included from C:/Ruby/include/ruby-2.0.0/ruby/defines.h:153:0,
                 from C:/Ruby/include/ruby-2.0.0/ruby/ruby.h:70,
                 from C:/Ruby/include/ruby-2.0.0/ruby.h:33,
                 from conftest.c:1:
C:/Ruby/include/ruby-2.0.0/ruby/win32.h: In function 'rb_w32_pow':
C:/Ruby/include/ruby-2.0.0/ruby/win32.h:801:5: warning: implicit declaration of function '_controlfp' [-Wimplicit-function-declaration]
C:/Ruby/include/ruby-2.0.0/ruby/win32.h:802:16: error: '_PC_64' undeclared (first use in this function)
C:/Ruby/include/ruby-2.0.0/ruby/win32.h:802:16: note: each undeclared identifier is reported only once for each function it appears in
C:/Ruby/include/ruby-2.0.0/ruby/win32.h:802:24: error: '_MCW_PC' undeclared (first use in this function)
checked program was:
/* begin */
1: #include "ruby.h"
2: 
3: #include <winsock2.h>
4: #include <windows.h>
5: int main(int argc, char **argv)
6: {
7:   return 0;
8: }
/* end */

Matching libxml2 headers with nokogiri

Build configurations

Nokogiri has (for our purposes) two supported configurations

  1. Compile with an embedded libxml2 (the default and most supported option)
  2. Compile with system libraries

Nokogumbo has two supported configurations

  1. Compile with libxml2 (and nokogiri) headers (this should be the most efficient option)
  2. Compile without the headers present and only use the nokogiri API to construct the DOM tree

Issue

There's no guarantee that the version of libxml2 that nokogumbo builds against in configuration 1 matches the version of libxml2 that nokogiri builds against.

The reason is that ext/nokogumboc/extconf.rb checks for the presence of a system libxml2 (via have_library('xml2', 'xmlNewDoc')) and then searches for nokogiri.h. If nokogiri is built in its default configuration, then there's no guarantee that the system library matches.

Desired goal

I think the best approach is for extconf.rb to perform the following actions.

  1. If nokogiri was compiled with an embedded libxml2 and if the headers it needs (in this case libxml/tree.h and nokogiri.h) are available from the nokogiri gem, then build nokogumbo configuration 1.
  2. If nokogiri was compiled with the system libxml2 and the system library headers (and nokogiri.h) are available, then build nokogumbo in configuration 1.
  3. Otherwise, build nokogumbo in configuration 2.

Blocking issue

I believe that for nokogiri 1.7, we can achieve this using this extconf.rb. Unfortunately, for 1.8, this approach doesn't work because the built libxml2, along with its headers, are deleted after the extension is built. See sparklemotion/nokogiri#1325 for more discussion on this.

One possible way forward (and I haven't tested this because I don't like this solution) would be to test if the libxml2 symbols are available in the nokogiri .so/.bundle and if so, then download the corresponding version of libxml2 and extract the headers.

It would be better if nokogiri either didn't delete the built libraries or first installed all of its headers into the extension directory.

compilation failing on engine yard cloud (gentoo)

I currently have no idea what could be missing.

Note that I also tried with libxml2 installed on the server.

 ~ $ gem install nokogumbo
Fetching: mini_portile-0.6.0.gem (100%)
Successfully installed mini_portile-0.6.0
Fetching: nokogiri-1.6.2.1.gem (100%)
Building native extensions.  This could take a while...
Building nokogiri using packaged libraries.
Building libxml2-2.8.0 for nokogiri with the following patches applied:
    - 0001-Fix-parser-local-buffers-size-problems.patch
    - 0002-Fix-entities-local-buffers-size-problems.patch
    - 0003-Fix-an-error-in-previous-commit.patch
    - 0004-Fix-potential-out-of-bound-access.patch
    - 0005-Detect-excessive-entities-expansion-upon-replacement.patch
    - 0006-Do-not-fetch-external-parsed-entities.patch
    - 0007-Enforce-XML_PARSER_EOF-state-handling-through-the-pa.patch
    - 0008-Improve-handling-of-xmlStopParser.patch
    - 0009-Fix-a-couple-of-return-without-value.patch
    - 0010-Keep-non-significant-blanks-node-in-HTML-parser.patch
    - 0011-Do-not-fetch-external-parameter-entities.patch
************************************************************************
IMPORTANT!  Nokogiri builds and uses a packaged version of libxml2.

If this is a concern for you and you want to use the system library
instead, abort this installation process and reinstall nokogiri as
follows:

    gem install nokogiri -- --use-system-libraries

If you are using Bundler, tell it to use the option:

    bundle config build.nokogiri --use-system-libraries
    bundle install

However, note that nokogiri does not necessarily support all versions
of libxml2.

For example, libxml2-2.9.0 and higher are currently known to be broken
and thus unsupported by nokogiri, due to compatibility problems and
XPath optimization bugs.
************************************************************************
Building libxslt-1.1.28 for nokogiri with the following patches applied:
    - 0001-Adding-doc-update-related-to-1.1.28.patch
    - 0002-Fix-a-couple-of-places-where-f-printf-parameters-wer.patch
    - 0003-Initialize-pseudo-random-number-generator-with-curre.patch
    - 0004-EXSLT-function-str-replace-is-broken-as-is.patch
    - 0006-Fix-str-padding-to-work-with-UTF-8-strings.patch
    - 0007-Separate-function-for-predicate-matching-in-patterns.patch
    - 0008-Fix-direct-pattern-matching.patch
    - 0009-Fix-certain-patterns-with-predicates.patch
    - 0010-Fix-handling-of-UTF-8-strings-in-EXSLT-crypto-module.patch
    - 0013-Memory-leak-in-xsltCompileIdKeyPattern-error-path.patch
    - 0014-Fix-for-bug-436589.patch
    - 0015-Fix-mkdir-for-mingw.patch
************************************************************************
IMPORTANT!  Nokogiri builds and uses a packaged version of libxslt.

If this is a concern for you and you want to use the system library
instead, abort this installation process and reinstall nokogiri as
follows:

    gem install nokogiri -- --use-system-libraries

If you are using Bundler, tell it to use the option:

    bundle config build.nokogiri --use-system-libraries
    bundle install
************************************************************************
Successfully installed nokogiri-1.6.2.1
Fetching: nokogumbo-1.1.5.gem (100%)
Building native extensions.  This could take a while...
ERROR:  Error installing nokogumbo:
    ERROR: Failed to build gem native extension.

    /usr/bin/ruby21 extconf.rb
checking for xmlNewDoc() in -lxml2... yes
checking for nokogiri.h in /home/deploy/.gem/ruby/2.1.0/gems/nokogiri-1.6.2.1/ext/nokogiri... yes
checking for nokogiri.h in /home/deploy/.gem/ruby/2.1.0/gems/nokogiri-1.6.2.1/ext/nokogiri... yes
checking for gumbo_parse() in -lgumbo... no
creating Makefile

make "DESTDIR="
compiling vector.c
compiling tokenizer.c
compiling utf8.c
compiling nokogumbo.c
In file included from nokogumbo.c:28:0:
/home/deploy/.gem/ruby/2.1.0/gems/nokogiri-1.6.2.1/ext/nokogiri/nokogiri.h:13:0: warning: "_GNU_SOURCE" redefined
/usr/include/ruby-2.1.0/x86_64-linux/ruby/config.h:17:0: note: this is the location of the previous definition
compiling error.c
compiling parser.c
compiling tag.c
compiling attribute.c
compiling util.c
compiling string_piece.c
compiling string_buffer.c
compiling char_ref.c
linking shared-object nokogumboc.so
nokogumbo.o: In function `parse':
nokogumbo.c:(.text+0x26a): undefined reference to `Nokogiri_wrap_xml_document'
collect2: ld returned 1 exit status
make: *** [nokogumboc.so] Error 1


Gem files will remain installed in /home/deploy/.gem/ruby/2.1.0/gems/nokogumbo-1.1.5 for inspection.
Results logged to /home/deploy/.gem/ruby/2.1.0/gems/nokogumbo-1.1.5/ext/nokogumboc/gem_make.out

Bundle install fails on kali linux with nokogumbo gem.

I am trying to install Dradis Community Edition on Kali Linux (running in a virtual machine on macOS) by following these guides:

https://dradisframework.com/ce/documentation/install_kali.html

and

https://dradisframework.com/ce/documentation/install_git.html

when I run the command:

./bin/setup

I am getting the error:

Fetching nokogumbo 2.0.1
Installing nokogumbo 2.0.1 with native extensions
Gem::Ext::BuildError: ERROR: Failed to build gem native extension.

    current directory: /root/dradis-ce/gems/nokogumbo-2.0.1/ext/nokogumbo
/usr/bin/ruby2.5 -I /usr/local/lib/site_ruby/2.5.0 -r ./siteconf20190224-6660-1caz3w7.rb extconf.rb
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary
libraries and/or headers.  Check the mkmf.log file for more details.  You may
need configuration options.

Provided configuration options:
    --with-opt-dir
    --without-opt-dir
    --with-opt-include
    --without-opt-include=${opt-dir}/include
    --with-opt-lib
    --without-opt-lib=${opt-dir}/lib
    --with-make-prog
    --without-make-prog
    --srcdir=.
    --curdir
    --ruby=/usr/bin/$(RUBY_BASE_NAME)2.5
/usr/local/lib/site_ruby/2.5.0/rubygems/dependency.rb:313:in `to_specs': Could not find 'nokogiri' (= 1.8.4) - did find: [nokogiri-1.8.5]
(Gem::MissingSpecVersionError)
Checked in 'GEM_PATH=/root/dradis-ce', execute `gem env` for more information
    from /usr/local/lib/site_ruby/2.5.0/rubygems/dependency.rb:323:in `to_spec'
    from /usr/local/lib/site_ruby/2.5.0/rubygems/specification.rb:1033:in `find_by_name'
    from extconf.rb:9:in `<main>'

extconf failed, exit code 1

Gem files will remain installed in /root/dradis-ce/gems/nokogumbo-2.0.1 for inspection.
Results logged to /root/dradis-ce/extensions/x86_64-linux/2.5.0/nokogumbo-2.0.1/gem_make.out

An error occurred while installing nokogumbo (2.0.1), and Bundler cannot continue.
Make sure that `gem install nokogumbo -v '2.0.1' --source 'https://rubygems.org/'` succeeds before bundling.

In Gemfile:
  sanitize was resolved to 5.0.0, which depends on
    nokogumbo

== Command ["bundle install"] failed ==

I tried several solutions, such as bundle install and bundle update, but I get the same error as above whenever I run anything with Bundler. I also tried installing nokogumbo directly with:

gem install nokogumbo -v '2.0.1' --source 'https://rubygems.org/'

and then tried again, but I still get the same error. Why is this happening?

Serialization options for html

A few days ago, I wrote an implementation for serializing HTML5 according to the "Serializing HTML fragments" section of the HTML standard. I need to write some more tests, but there are a few design choices that I'd like to lay out.

  1. How should we integrate with Nokogiri's serialization API?
  2. What should we do about pre, listing, and textarea?

Integration with Nokogiri's serialization API

Nokogiri has a bewildering number of different interfaces for serializing a Nokogiri::XML::Node, each of which has several forms for arguments:

  • #inner_html(options = {}) calls to_html(options) on each child and joins the results
  • #serialize(options, &block) (alternatively #serialize(encoding = nil, save_with = nil, &block) which is equivalent to passing an options hash/keyword arguments with keys :encoding and :save_with) which calls write_to io, options, &block where io is a new StringIO
  • #to_html(options = {}) which calls to_format SaveOptions::DEFAULT_HTML, options
  • #to_s which (for non-XML documents) calls to_html
  • #write_html_to(io, options = {}) which calls write_format_to SaveOptions::DEFAULT_HTML, io, options
  • #write_to(io, options, &block) (alternatively #write_to(io, encoding, save_with, &block)) which yields options[:save_with] (defaulting to XML::Node::SaveOptions::FORMAT) and then calls native_write_to(io, encoding, indent_string, config.options), where indent_string doesn't matter for HTML and config is the SaveOptions

There are also private methods

  • #to_format(save_option, options) which calls serialize(options) with options[:save_with] = save_option unless options[:save_with] already exists; old versions of libxml2 cause this to call dump_html instead
  • #write_format_to(save_option, io, options) which calls write_to io, options with options[:save_with] = save_option unless options[:save_with] already exists; old versions of libxml2 cause this to call dump_html instead
  • #dump_html calls htmlNodeDump from libxml2
  • #native_write_to(io, encoding, indent_string, config_options) calls xmlSaveToIO from libxml2

Except for some broken versions of libxml2, everything eventually calls #native_write_to. The key save_with option to control formatting is XML::Node::SaveOptions::FORMAT (corresponding to XML_SAVE_FORMAT).
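
To make the defaulting rule concrete, here is a minimal sketch in plain Ruby. The two constants are local copies that mirror libxml2's xmlSaveOption values (XML_SAVE_FORMAT is 1 << 0 and XML_SAVE_AS_HTML is 1 << 6); with_default_save_options and format_requested? are hypothetical helpers that imitate the "set :save_with unless it already exists" step described above, not Nokogiri's actual code.

```ruby
# Local copies of two save flags (values mirror libxml2's xmlSaveOption enum).
FORMAT  = 1 << 0
AS_HTML = 1 << 6

# Apply a default :save_with only when the caller did not supply one.
# Returns a new hash and leaves the caller's options untouched.
def with_default_save_options(default, options = {})
  options.merge(save_with: options[:save_with] || default)
end

# True when the FORMAT bit is set in the effective save options.
def format_requested?(options)
  (options[:save_with] & FORMAT) != 0
end

defaults = with_default_save_options(AS_HTML | FORMAT)
explicit = with_default_save_options(AS_HTML | FORMAT, save_with: AS_HTML)

p format_requested?(defaults) # true: the default adds FORMAT
p format_requested?(explicit) # false: an explicit :save_with wins
```

Under this model, a caller who passes an explicit :save_with without the FORMAT bit bypasses the default, which is why the choice of where the default is applied matters for the questions below.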

In table form:

Method            Calls             Sets default SaveOptions  Ultimate default SaveOptions
#inner_html       #to_html          -                         AS_HTML|FORMAT
#serialize        #write_to         -                         FORMAT
#to_html          #to_format        AS_HTML|FORMAT            AS_HTML|FORMAT
#to_s             #to_html          -                         AS_HTML|FORMAT
#write_html_to    #write_format_to  AS_HTML|FORMAT            AS_HTML|FORMAT
#write_to         #native_write_to  FORMAT                    FORMAT
#to_format        #serialize        -                         FORMAT
#write_format_to  #write_to         -                         FORMAT

As long as neither AS_XML nor AS_XHTML is set and a node's document is an HTML_DOCUMENT_NODE, the output will be written as HTML.

When output as HTML, the only thing I can see FORMAT controlling is whether newlines are added after (some, but not all) elements. The questions are where do we want to modify XML::Node to perform HTML5 serialization and do we want to preserve this FORMAT default?

In particular #inner_html should probably follow the standard for serialization by default which means no additional newlines.

#write_to is the natural place to patch but as you can see from the table, #to_html and #write_html_to both add FORMAT.

pre, listing, and textarea ignore leading newlines!

The parsing rules for these elements say that if the token following their start tag is a line feed, then the line feed is ignored "as an authoring convenience."

As an informative example in the standard, after parsing

<pre>

Hello.</pre>

and then serializing and reparsing, the pre element's text content is just Hello. (the first parse keeps one leading newline, and the round trip drops it).
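
The round trip can be sketched in plain Ruby. This is a toy model under stated assumptions, not Gumbo or Nokogiri code: parsed_text, serialize and round_trip are hypothetical helpers, escaping is ignored, and the only rule modelled is "drop one line feed right after the start tag of pre, listing, or textarea".

```ruby
# Elements whose start tag swallows one immediately following line feed.
LF_SKIPPING = %w[pre listing textarea].freeze

# Text content a parser would store for the raw characters between the tags.
def parsed_text(name, raw)
  LF_SKIPPING.include?(name) ? raw.sub(/\A\n/, '') : raw
end

# Serialize the text as-is, without re-adding the swallowed line feed.
def serialize(name, text)
  "<#{name}>#{text}</#{name}>"
end

# Parse, serialize, and reparse; returns the text content after each parse.
def round_trip(name, raw)
  first  = parsed_text(name, raw)
  markup = serialize(name, first)
  inner  = markup[/\A<#{name}>(.*)<\/#{name}>\z/m, 1]
  second = parsed_text(name, inner)
  [first, second]
end

first, second = round_trip('pre', "\n\nHello.")
p first   # "\nHello."
p second  # "Hello."
```

For an element outside that list, such as div, both parses return the same text, which is why the surprise is limited to these three elements.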

I'm inclined to follow this behavior for #inner_html but it's likely surprising to clients that well-formed HTML doesn't round-trip through serialization and reparsing.

What should we do here? Should some of these follow the standard? Should we introduce a new API?

"Symbol not found: _kGumboDefaultOptions (LoadError)" when building gem

I tried to install nokogumbo following the instructions in the readme, but rake gem fails with the following output:

$ rake gem
mkdir -p tmp/x86_64-darwin12.4.0/nokogumboc/1.9.3
cd tmp/x86_64-darwin12.4.0/nokogumboc/1.9.3
/Users/rgrove/.rbenv/versions/1.9.3-p448/bin/ruby -I. ../../../../ext/nokogumboc/extconf.rb
checking for xmlNewDoc() in -lxml2... yes
checking for nokogiri.h in /Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/gems/1.9.1/gems/nokogiri-1.6.0/ext/nokogiri... no
Activating libxml2 2.8.0 (from /Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/gems/1.9.1/gems/nokogiri-1.6.0/ports/i686-apple-darwin11/libxml2/2.8.0)...
Activating libxslt 1.1.26 (from /Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/gems/1.9.1/gems/nokogiri-1.6.0/ports/i686-apple-darwin11/libxslt/1.1.26)...
checking for libxml/parser.h... yes
checking for libxslt/xslt.h... yes
checking for libexslt/exslt.h... yes
checking for iconv_open() in iconv.h... no
checking for iconv_open() in -liconv... yes
checking for xmlParseDoc() in -lxml2... yes
checking for xsltParseStylesheetDoc() in -lxslt... yes
checking for exsltFuncRegister() in -lexslt... yes
checking for xmlHasFeature()... yes
checking for xmlFirstElementChild()... yes
checking for xmlRelaxNGSetParserStructuredErrors()... yes
checking for xmlRelaxNGSetParserStructuredErrors()... yes
checking for xmlRelaxNGSetValidStructuredErrors()... yes
checking for xmlSchemaSetValidStructuredErrors()... yes
checking for xmlSchemaSetParserStructuredErrors()... yes
creating Makefile
checking for nokogiri.h in /Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/gems/1.9.1/gems/nokogiri-1.6.0/ext/nokogiri... yes
checking for gumbo_parse() in -lgumbo... no
creating Makefile
cd -
cd tmp/x86_64-darwin12.4.0/nokogumboc/1.9.3
make
compiling ../../../../ext/nokogumboc/nokogumbo.c
In file included from ../../../../ext/nokogumboc/nokogumbo.c:28:
/Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/gems/1.9.1/gems/nokogiri-1.6.0/ext/nokogiri/nokogiri.h:13:1: warning: "_GNU_SOURCE" redefined
In file included from /Users/rgrove/.rbenv/versions/1.9.3-p448/include/ruby-1.9.1/ruby/ruby.h:24,
                 from /Users/rgrove/.rbenv/versions/1.9.3-p448/include/ruby-1.9.1/ruby.h:32,
                 from ../../../../ext/nokogumboc/nokogumbo.c:21:
/Users/rgrove/.rbenv/versions/1.9.3-p448/include/ruby-1.9.1/x86_64-darwin12.4.0/ruby/config.h:17:1: warning: this is the location of the previous definition
linking shared-object nokogumboc.bundle
cd -
mkdir -p tmp/x86_64-darwin12.4.0/stage/lib
mkdir -p tmp/x86_64-darwin12.4.0/stage/ext/nokogumboc
cp ext/nokogumboc/extconf.rb tmp/x86_64-darwin12.4.0/stage/ext/nokogumboc/extconf.rb
cp ext/nokogumboc/nokogumbo.c tmp/x86_64-darwin12.4.0/stage/ext/nokogumboc/nokogumbo.c
cp lib/nokogumbo.rb tmp/x86_64-darwin12.4.0/stage/lib/nokogumbo.rb
cp LICENSE.txt tmp/x86_64-darwin12.4.0/stage/LICENSE.txt
cp README.md tmp/x86_64-darwin12.4.0/stage/README.md
mkdir -p tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src
cp gumbo-parser/src/attribute.c tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/attribute.c
cp gumbo-parser/src/attribute.h tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/attribute.h
cp gumbo-parser/src/char_ref.c tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/char_ref.c
cp gumbo-parser/src/char_ref.h tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/char_ref.h
cp gumbo-parser/src/error.c tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/error.c
cp gumbo-parser/src/error.h tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/error.h
cp gumbo-parser/src/gumbo.h tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/gumbo.h
cp gumbo-parser/src/insertion_mode.h tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/insertion_mode.h
cp gumbo-parser/src/parser.c tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/parser.c
cp gumbo-parser/src/parser.h tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/parser.h
cp gumbo-parser/src/string_buffer.c tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/string_buffer.c
cp gumbo-parser/src/string_buffer.h tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/string_buffer.h
cp gumbo-parser/src/string_piece.c tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/string_piece.c
cp gumbo-parser/src/string_piece.h tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/string_piece.h
cp gumbo-parser/src/tag.c tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/tag.c
cp gumbo-parser/src/token_type.h tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/token_type.h
cp gumbo-parser/src/tokenizer.c tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/tokenizer.c
cp gumbo-parser/src/tokenizer.h tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/tokenizer.h
cp gumbo-parser/src/tokenizer_states.h tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/tokenizer_states.h
cp gumbo-parser/src/utf8.c tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/utf8.c
cp gumbo-parser/src/utf8.h tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/utf8.h
cp gumbo-parser/src/util.c tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/util.c
cp gumbo-parser/src/util.h tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/util.h
cp gumbo-parser/src/vector.c tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/vector.c
cp gumbo-parser/src/vector.h tmp/x86_64-darwin12.4.0/stage/gumbo-parser/src/vector.h
install -c tmp/x86_64-darwin12.4.0/nokogumboc/1.9.3/nokogumboc.bundle lib/nokogumboc.bundle
cp tmp/x86_64-darwin12.4.0/nokogumboc/1.9.3/nokogumboc.bundle tmp/x86_64-darwin12.4.0/stage/lib/nokogumboc.bundle
/Users/rgrove/.rbenv/versions/1.9.3-p448/bin/ruby test-nokogumbo.rb
/Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/site_ruby/1.9.1/rubygems/core_ext/kernel_require.rb:53:in `require': dlopen(/Users/rgrove/src/nokogumbo/lib/nokogumboc.bundle, 9): Symbol not found: _kGumboDefaultOptions (LoadError)
  Referenced from: /Users/rgrove/src/nokogumbo/lib/nokogumboc.bundle
  Expected in: flat namespace
 in /Users/rgrove/src/nokogumbo/lib/nokogumboc.bundle - /Users/rgrove/src/nokogumbo/lib/nokogumboc.bundle
    from /Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/site_ruby/1.9.1/rubygems/core_ext/kernel_require.rb:53:in `require'
    from /Users/rgrove/src/nokogumbo/lib/nokogumbo.rb:2:in `<top (required)>'
    from /Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/site_ruby/1.9.1/rubygems/core_ext/kernel_require.rb:53:in `require'
    from /Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/site_ruby/1.9.1/rubygems/core_ext/kernel_require.rb:53:in `require'
    from test-nokogumbo.rb:3:in `<main>'
rake aborted!
Command failed with status (1): [/Users/rgrove/.rbenv/versions/1.9.3-p448/b...]
/Users/rgrove/src/nokogumbo/Rakefile:12:in `block in <top (required)>'
Tasks: TOP => gem => test
(See full trace by running task with --trace)

Ruby and Rake versions, in case they're useful (Ruby is running under rbenv):

$ ruby -v && rake --version
ruby 1.9.3p448 (2013-06-27 revision 41675) [x86_64-darwin12.4.0]
rake, version 10.1.0

I get the same failure attempting to build the gem under Ruby 2.0.0-p247.

gem install nokogumbo succeeds, but require 'nokogumbo' results in the same "Symbol not found" error:

irb(main):001:0> require 'nokogumbo'
LoadError: dlopen(/Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/gems/1.9.1/gems/nokogumbo-1.1/lib/nokogumboc.bundle, 9): Symbol not found: _kGumboDefaultOptions
  Referenced from: /Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/gems/1.9.1/gems/nokogumbo-1.1/lib/nokogumboc.bundle
  Expected in: flat namespace
 in /Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/gems/1.9.1/gems/nokogumbo-1.1/lib/nokogumboc.bundle - /Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/gems/1.9.1/gems/nokogumbo-1.1/lib/nokogumboc.bundle
    from /Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/site_ruby/1.9.1/rubygems/core_ext/kernel_require.rb:53:in `require'
    from /Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/site_ruby/1.9.1/rubygems/core_ext/kernel_require.rb:53:in `require'
    from /Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/gems/1.9.1/gems/nokogumbo-1.1/lib/nokogumbo.rb:2:in `<top (required)>'
    from /Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/site_ruby/1.9.1/rubygems/core_ext/kernel_require.rb:133:in `require'
    from /Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/site_ruby/1.9.1/rubygems/core_ext/kernel_require.rb:133:in `rescue in require'
    from /Users/rgrove/.rbenv/versions/1.9.3-p448/lib/ruby/site_ruby/1.9.1/rubygems/core_ext/kernel_require.rb:142:in `require'
    from (irb):1
    from /Users/rgrove/.rbenv/versions/1.9.3-p448/bin/irb:12:in `<main>'

Let me know if I can help by providing any more information!

Template Tag Support

I noticed that you added and then commented out template support in 5bcd5a6. This is a problem when processing documents containing this tag, because the entire tag and all its children are thrown out of the document. Uncommenting the code seems to work fine.

Was there a specific reason you removed this?

Nokogumbo 1.4.12 crashes on Heroku

Deploying Nokogumbo 1.4.12 to Heroku, my apps crash and will not load:

LoadError: /tmp/build_cb6d9f00a9684138f8fafb6ab936331c/vendor/bundle/ruby/2.4.0/gems/nokogiri-1.7.2/ext/nokogiri/nokogiri.so: cannot open shared object file: No such file or directory - /tmp/build_a52ff056ecc70c6737ee489b9e26c490/vendor/bundle/ruby/2.4.0/gems/nokogumbo-1.4.12/lib/nokogumboc.so

Ruby 2.7.0 warnings

nokogumbo-2.0.2/lib/nokogumbo/html5.rb:74: warning: Using the last argument as keyword parameters is deprecated; maybe ** should be added to the call

nokogumbo-2.0.2/lib/nokogumbo/html5.rb:21: warning: The called method `parse' is defined here

nokogumbo/html5.rb:22: warning: Using the last argument as keyword parameters is deprecated; maybe ** should be added to the call

nokogumbo/html5/document.rb:4: warning: The called method `parse' is defined here
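
These warnings appear when a Hash is passed as the last positional argument to a method that declares keyword parameters. A minimal sketch of the usual fix follows; ToyHTML5 and its methods are hypothetical stand-ins, since the actual code at the reported lines is not shown here.

```ruby
# Hypothetical stand-in for a parse method that accepts keyword options.
module ToyHTML5
  def self.parse(html, url = nil, encoding = nil, **options)
    { html: html, url: url, encoding: encoding, options: options }
  end

  # A wrapper that receives an options Hash and forwards it.
  def self.parse_with(html, options = {})
    # Writing `parse(html, nil, nil, options)` is what Ruby 2.7 warns about
    # (and what Ruby 3 rejects); the double splat passes the Hash as keywords.
    parse(html, nil, nil, **options)
  end
end

result = ToyHTML5.parse_with('<p>Hi</p>', max_errors: 10)
p result[:options]  # {:max_errors=>10} (printed as {max_errors: 10} on newer Rubies)
```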

<source> tag should not be closed

The <source> tag is an empty element (https://developer.mozilla.org/de/docs/Web/HTML/Element/source).

However, when I do the following:

html = "<source>"
=> "<source>"
irb(main):062:0> Nokogiri::HTML5.parse("<html><body>#{html}").to_html
=> "<html>\n<head></head>\n<body><source></source></body>\n</html>\n"

The source tag is closed, which causes all sorts of problems.

I assume this is actually an issue with the underlying C library, but I wasn't able to check.
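
For reference, here is a sketch of what an HTML serializer is expected to do with empty (void) elements. It is plain Ruby with a hypothetical serialize_element helper, not the code path Nokogiri or libxml2 actually uses; the element list is the set of void elements in the HTML standard.

```ruby
# Void elements per the HTML standard: they never get an end tag.
VOID_ELEMENTS = %w[
  area base br col embed hr img input link meta source track wbr
].freeze

# Serialize a single element; `inner` is ignored for void elements.
def serialize_element(name, inner = '')
  tag = name.downcase
  return "<#{tag}>" if VOID_ELEMENTS.include?(tag)

  "<#{tag}>#{inner}</#{tag}>"
end

puts serialize_element('source')       # <source>
puts serialize_element('div', 'text')  # <div>text</div>
```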

`Nokogiri::HTML5` does not support line numbers

When I run:

@html =  Nokogiri::HTML(content)
@html.css('a').each { |node| ... }

Each node's line method returns the correct line number. However, Nokogiri::HTML5 does not seem to support this.

1.4.8 fails to compile on OSX (regression)

If I try to install 1.4.8 (the current version) I get a compilation error:

$ gem install nokogumbo
Building native extensions. This could take a while...
ERROR: Error installing nokogumbo:
ERROR: Failed to build gem native extension.

current directory: /Users/stefan/.rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/nokogumbo-1.4.8/ext/nokogumboc
/Users/stefan/.rbenv/versions/2.3.1/bin/ruby -r ./siteconf20160801-29681-fg62gj.rb extconf.rb
checking for xmlNewDoc() in -lxml2... yes
checking for nokogiri.h in /Users/stefan/.rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/nokogiri-1.6.8/ext/nokogiri... yes
checking for nokogiri.h in /Users/stefan/.rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/nokogiri-1.6.8/ext/nokogiri... yes
checking for gumbo_parse() in -lgumbo... no
...

and a moment later:

nokogumbo.c:22:10: fatal error: 'gumbo.h' file not found

Installing 1.4.7 works:

$ gem install nokogumbo:1.4.7
Fetching: nokogumbo-1.4.7.gem (100%)
Building native extensions. This could take a while...
Successfully installed nokogumbo-1.4.7

Ruby is installed via rbenv, and libxml2 via Homebrew.

Fail to build gem

When trying to install nokogumbo on a Ruby 2.0.0 environment (Windows 8 64bit), I have the following error:

$ gem install nokogumbo -v '1.2.0'

Temporarily enhancing PATH to include DevKit...
Building native extensions. This could take a while...
ERROR: Error installing nokogumbo:
ERROR: Failed to build gem native extension.

c:/RailsInstaller/Ruby2.0.0/bin/ruby.exe extconf.rb

checking for xmlNewDoc() in -lxml2... *** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary libraries and/or headers. Check the mkmf.log file for more details. You may need configuration options.

Provided configuration options:
--with-opt-dir
--without-opt-dir
--with-opt-include
--without-opt-include=${opt-dir}/include
--with-opt-lib
--without-opt-lib=${opt-dir}/lib
--with-make-prog
--without-make-prog
--srcdir=.
--curdir
--ruby=c:/RailsInstaller/Ruby2.0.0/bin/ruby
--with-xml2lib
--without-xml2lib
c:/RailsInstaller/Ruby2.0.0/lib/ruby/2.0.0/mkmf.rb:434:in `try_do': The compiler failed to generate an executable file. (RuntimeError)
You have to install development tools first.
from c:/RailsInstaller/Ruby2.0.0/lib/ruby/2.0.0/mkmf.rb:519:in `try_link0'
from c:/RailsInstaller/Ruby2.0.0/lib/ruby/2.0.0/mkmf.rb:534:in `try_link'
from c:/RailsInstaller/Ruby2.0.0/lib/ruby/2.0.0/mkmf.rb:720:in `try_func'
from c:/RailsInstaller/Ruby2.0.0/lib/ruby/2.0.0/mkmf.rb:950:in `block in have_library'
from c:/RailsInstaller/Ruby2.0.0/lib/ruby/2.0.0/mkmf.rb:895:in `block in checking_for'
from c:/RailsInstaller/Ruby2.0.0/lib/ruby/2.0.0/mkmf.rb:340:in `block (2 levels) in postpone'
from c:/RailsInstaller/Ruby2.0.0/lib/ruby/2.0.0/mkmf.rb:310:in `open'
from c:/RailsInstaller/Ruby2.0.0/lib/ruby/2.0.0/mkmf.rb:340:in `block in postpone'
from c:/RailsInstaller/Ruby2.0.0/lib/ruby/2.0.0/mkmf.rb:310:in `open'
from c:/RailsInstaller/Ruby2.0.0/lib/ruby/2.0.0/mkmf.rb:336:in `postpone'
from c:/RailsInstaller/Ruby2.0.0/lib/ruby/2.0.0/mkmf.rb:894:in `checking_for'
from c:/RailsInstaller/Ruby2.0.0/lib/ruby/2.0.0/mkmf.rb:945:in `have_library'
from extconf.rb:4:in `<main>'

Gem files will remain installed in c:/RailsInstaller/Ruby2.0.0/lib/ruby/gems/2.0.0/gems/nokogumbo-1.2.0 for inspection.
Results logged to c:/RailsInstaller/Ruby2.0.0/lib/ruby/gems/2.0.0/gems/nokogumbo-1.2.0/ext/nokogumboc/gem_make.out

wraps closed div tags around everything

this seems pretty broken :(

Nokogiri::HTML5.parse('<div />aaa').to_s
"<?xml version=\"1.0\"?>\n<html>\n  <head/>\n  <body>\n    <div>aaa</div>\n  </body>\n</html>\n"

... nokogiri does it just fine:

Nokogiri::HTML.parse('<div />aaa').to_s
"<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n<div></div>aaa</body></html>\n"

@rubys any ideas how I can fix this?
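
For context: as far as I understand the HTML standard, the trailing slash in <div /> is ignored for non-void elements, so the div stays open and the following text becomes its child. The sketch below is a deliberately tiny tree builder in plain Ruby that models only that one rule. build_tree is a hypothetical helper, attributes and error recovery are not handled, and it is not how Gumbo is implemented.

```ruby
# Void elements per the HTML standard; only these close themselves.
VOID_TAGS = %w[area base br col embed hr img input link meta source track wbr].freeze

# Returns the children of the root as nested [name, children] arrays
# interleaved with text strings.
def build_tree(html)
  root  = ['#root', []]
  stack = [root]
  html.scan(/<(\/?)([a-z0-9]+)\s*(\/?)>|([^<]+)/i) do |close, name, _slash, text|
    if text
      stack.last[1] << text
    elsif close == '/'
      # Pop back to the nearest open element with this name, if any.
      idx = stack.rindex { |node| node[0] == name.downcase }
      stack.slice!(idx..-1) if idx && idx > 0
    else
      node = [name.downcase, []]
      stack.last[1] << node
      # The self-closing slash (_slash) is deliberately ignored here.
      stack.push(node) unless VOID_TAGS.include?(name.downcase)
    end
  end
  root[1]
end

p build_tree('<div />aaa')      # [["div", ["aaa"]]]
p build_tree('<div></div>aaa')  # [["div", []], "aaa"]
```

Under that reading, getting the text outside the div requires an explicit end tag in the input, as in the second call.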

Markup errors not reported

Prior to 1.5 (e.g., 1.4.13), Nokogumbo would report markup errors such as the following:

doc = Nokogiri::HTML5.parse('<!DOCTYPE html> <html')
doc.errors #==> [#<Nokogiri::XML::SyntaxError: 1:22: ERROR: @1:22: Tokenizer error with an unimplemented error message.
<!DOCTYPE html> <html
                     ^>] 

With 1.5, this error is no longer reported. Most likely due to a change in Gumbo, but unfortunate.

Compile failure with system gumbo, usage of internal Gumbo data types & structures.

If I have gumbo-0.10.1 installed system-wide (emerge dev-libs/gumbo), then the gumbo sources are not copied locally, and nokogumboc fails to compile due to lack of header files and using Gumbo internals.

The quickest workaround is to always copy over the headers (minimal set: error.h, insertion_mode.h, parser.h, string_buffer.h, token_type.h), but that could break in the future if they get out of sync with the system copy.

Support Nokogiri::HTML.parse method signature

Nokogiri::HTML5.parse accepts a single argument, a String. However, Nokogiri::HTML.parse accepts three: the string or readable, a base URI, and an encoding. I'd be happy if the encoding argument were ignored, but to be a drop-in replacement for Nokogiri::HTML, it should accept a base URI for turning relative URLs into absolute URLs.
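
A rough sketch of the two pieces such a signature would need, using only Ruby's standard library. read_input and absolutize are hypothetical helper names and not part of Nokogiri or Nokogumbo; they only illustrate accepting a string or readable and resolving a relative URL against a base URI.

```ruby
require 'uri'
require 'stringio'

# Accept either a String or anything that responds to #read (an IO, StringIO...).
def read_input(string_or_io)
  string_or_io.respond_to?(:read) ? string_or_io.read : string_or_io.to_s
end

# Resolve href against base_uri; leave it untouched when no base is given.
def absolutize(href, base_uri)
  return href if base_uri.nil?

  URI.join(base_uri, href).to_s
end

puts read_input(StringIO.new('<p>hi</p>'))
puts absolutize('img/a.png', 'http://example.com/docs/index.html')
# http://example.com/docs/img/a.png
```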

Segfault on parse using 1.4.10

I noticed that right after we upgraded to the latest (1.4.10) we are now getting segfaults when parsing certain HTML files as part of our rails asset precompile. For some reason, I can't recreate the segfault on my Mac, but it happens consistently on our CI box running RedHat 6. I can dig deeper to figure out what the actual HTML input is, but I wanted to check if you had some idea of what would cause the segfault first.

In case it helps, the library that's calling nokogumbo is something we maintain, so we can modify it if necessary: https://github.com/uniite/web-components-rails/blob/master/lib/web_components_rails/html_import_processor.rb#L49

Here's the output from the segfault:

/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/nokogumbo-1.4.10/lib/nokogumbo.rb:24: [BUG] Segmentation fault at 0x007fb35c7ab000
ruby 2.1.7p400 (2015-08-18 revision 51632) [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0086 p:---- s:0437 e:000436 CFUNC  :parse
c:0085 p:0072 s:0433 e:000432 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/nokogumbo-1.4.10/lib/noko
c:0084 p:0009 s:0429 e:000428 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/nokogumbo-1.4.10/lib/noko
c:0083 p:0017 s:0421 e:000420 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/web_components_rails-1.2.
c:0082 p:0077 s:0412 e:000411 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/web_components_rails-1.2.
c:0081 p:0011 s:0407 e:000406 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/web_components_rails-1.2.
c:0080 p:0057 s:0403 e:000402 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0079 p:0023 s:0396 e:000395 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc [FINISH]
c:0078 p:---- s:0392 e:000391 CFUNC  :reverse_each
c:0077 p:0044 s:0389 e:000388 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0076 p:0351 s:0382 e:000381 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0075 p:0077 s:0363 e:000362 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0074 p:0042 s:0358 e:000357 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0073 p:0138 s:0349 e:000348 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0072 p:0014 s:0341 e:000337 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc [FINISH]
c:0071 p:---- s:0334 e:000333 CFUNC  :yield
c:0070 p:0011 s:0332 e:000331 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0069 p:0010 s:0328 e:000327 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0068 p:0093 s:0325 e:000322 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0067 p:0109 s:0315 E:001ad0 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0066 p:0057 s:0301 e:000300 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0065 p:0023 s:0294 e:000293 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc [FINISH]
c:0064 p:---- s:0290 e:000289 CFUNC  :reverse_each
c:0063 p:0044 s:0287 e:000286 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0062 p:0351 s:0280 e:000279 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0061 p:0077 s:0261 e:000260 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0060 p:0042 s:0256 E:0000e8 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0059 p:0138 s:0247 E:002150 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0058 p:0014 s:0239 e:000235 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc [FINISH]
c:0057 p:---- s:0232 e:000231 CFUNC  :yield
c:0056 p:0011 s:0230 e:000229 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0055 p:0043 s:0226 e:000225 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0054 p:0037 s:0219 e:000218 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0053 p:0018 s:0211 e:000210 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0052 p:0099 s:0207 e:000206 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0051 p:0010 s:0200 e:000199 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0050 p:0039 s:0196 e:000195 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc [FINISH]
c:0049 p:---- s:0191 e:000190 CFUNC  :each
c:0048 p:0031 s:0188 e:000187 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0047 p:0029 s:0184 e:000183 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0046 p:0028 s:0179 e:000178 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0045 p:0039 s:0175 e:000174 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc [FINISH]
c:0044 p:---- s:0170 e:000169 CFUNC  :each
c:0043 p:0031 s:0167 e:000166 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0042 p:0029 s:0163 e:000162 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0041 p:0028 s:0158 e:000157 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0040 p:0039 s:0154 e:000153 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc [FINISH]
c:0039 p:---- s:0149 e:000148 CFUNC  :each
c:0038 p:0031 s:0146 e:000145 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0037 p:0029 s:0142 e:000141 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0036 p:0028 s:0137 e:000136 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0035 p:0039 s:0133 e:000132 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc [FINISH]
c:0034 p:---- s:0128 e:000127 CFUNC  :each
c:0033 p:0031 s:0125 e:000124 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0032 p:0029 s:0121 e:000120 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc [FINISH]
c:0031 p:---- s:0116 e:000115 CFUNC  :each
c:0030 p:0011 s:0113 E:000460 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc [FINISH]
c:0029 p:---- s:0110 e:000109 CFUNC  :each
c:0028 p:0040 s:0107 E:0014e0 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0027 p:0097 s:0103 E:001ca8 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0026 p:0049 s:0096 E:0021d8 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sproc
c:0025 p:0010 s:0089 e:000088 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/non-stupid-digest-assets-
c:0024 p:0012 s:0084 e:000083 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-rails-3.2.0/lib
c:0023 p:0036 s:0082 e:000081 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/rake/
c:0022 p:0007 s:0077 e:000076 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-rails-3.2.0/lib [FINISH]
c:0021 p:---- s:0075 e:000074 CFUNC  :call
c:0020 p:0028 s:0070 e:000069 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/task [FINISH]
c:0019 p:---- s:0067 e:000066 CFUNC  :each
c:0018 p:0113 s:0064 e:000063 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/task
c:0017 p:0075 s:0060 e:000059 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/task
c:0016 p:0014 s:0058 e:000057 METHOD /tools/ruby-2.1.7/lib/ruby/2.1.0/monitor.rb:211
c:0015 p:0025 s:0055 e:000054 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/task
c:0014 p:0036 s:0048 e:000047 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/task
c:0013 p:0033 s:0043 e:000042 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/appl
c:0012 p:0009 s:0036 e:000035 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/appl [FINISH]
c:0011 p:---- s:0033 e:000032 CFUNC  :each
c:0010 p:0039 s:0030 e:000029 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/appl
c:0009 p:0025 s:0028 e:000027 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/appl
c:0008 p:0007 s:0024 e:000023 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/appl
c:0007 p:0019 s:0021 e:000020 BLOCK  /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/appl
c:0006 p:0006 s:0019 e:000018 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/appl
c:0005 p:0007 s:0015 e:000014 METHOD /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/appl
c:0004 p:0021 s:0012 e:000011 TOP    /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/exe/rake:27 [FINISH]
c:0003 p:---- s:0010 e:000009 CFUNC  :load
c:0002 p:0135 s:0006 E:001be8 EVAL   /ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/bin/rake:23 [FINISH]
c:0001 p:0000 s:0002 E:001528 TOP    [FINISH]

/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/bin/rake:23:in `<main>'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/bin/rake:23:in `load'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/exe/rake:27:in `<top (required)>'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/application.rb:77:in `run'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/application.rb:178:in `standard_exception_handling'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/application.rb:80:in `block in run'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/application.rb:102:in `top_level'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/application.rb:117:in `run_with_threads'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/application.rb:108:in `block in top_level'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/application.rb:108:in `each'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/application.rb:108:in `block (2 levels) in top_level'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/application.rb:152:in `invoke_task'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/task.rb:173:in `invoke'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/task.rb:180:in `invoke_with_call_chain'
/tools/ruby-2.1.7/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/task.rb:187:in `block in invoke_with_call_chain'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/task.rb:243:in `execute'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/task.rb:243:in `each'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/task.rb:248:in `block in execute'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/rake-11.3.0/lib/rake/task.rb:248:in `call'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-rails-3.2.0/lib/sprockets/rails/task.rb:67:in `block (2 levels) in define'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/rake/sprocketstask.rb:147:in `with_logger'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-rails-3.2.0/lib/sprockets/rails/task.rb:68:in `block (3 levels) in define'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/non-stupid-digest-assets-1.0.8/lib/non-stupid-digest-assets.rb:26:in `compile'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/manifest.rb:185:in `compile'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/manifest.rb:140:in `find'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/legacy.rb:104:in `logical_paths'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/legacy.rb:104:in `each'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/legacy.rb:105:in `block in logical_paths'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/legacy.rb:105:in `each'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:227:in `stat_tree'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:209:in `stat_directory'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:209:in `each'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:212:in `block in stat_directory'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:231:in `block in stat_tree'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:227:in `stat_tree'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:209:in `stat_directory'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:209:in `each'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:212:in `block in stat_directory'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:231:in `block in stat_tree'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:227:in `stat_tree'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:209:in `stat_directory'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:209:in `each'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:212:in `block in stat_directory'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:231:in `block in stat_tree'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:227:in `stat_tree'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:209:in `stat_directory'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:209:in `each'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:212:in `block in stat_directory'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/path_utils.rb:228:in `block in stat_tree'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/legacy.rb:114:in `block (2 levels) in logical_paths'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/manifest.rb:142:in `block in find'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/base.rb:73:in `find_all_linked_assets'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/base.rb:66:in `find_asset'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/cached_environment.rb:47:in `load'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/cached_environment.rb:47:in `yield'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/cached_environment.rb:20:in `block in initialize'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/loader.rb:44:in `load'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/loader.rb:317:in `fetch_asset_from_dependency_cache'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/loader.rb:60:in `block in load'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/loader.rb:134:in `load_from_unloaded'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/processor_utils.rb:56:in `call_processors'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/processor_utils.rb:56:in `reverse_each'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/processor_utils.rb:57:in `block in call_processors'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/processor_utils.rb:75:in `call_processor'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/bundle.rb:24:in `call'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/utils.rb:196:in `dfs'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/bundle.rb:23:in `block in call'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/cached_environment.rb:47:in `load'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/cached_environment.rb:47:in `yield'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/cached_environment.rb:20:in `block in initialize'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/loader.rb:44:in `load'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/loader.rb:317:in `fetch_asset_from_dependency_cache'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/loader.rb:60:in `block in load'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/loader.rb:134:in `load_from_unloaded'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/processor_utils.rb:56:in `call_processors'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/processor_utils.rb:56:in `reverse_each'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/processor_utils.rb:57:in `block in call_processors'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/sprockets-3.7.0/lib/sprockets/processor_utils.rb:75:in `call_processor'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/web_components_rails-1.2.2/lib/web_components_rails/html_import_processor.rb:16:in `call'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/web_components_rails-1.2.2/lib/web_components_rails/html_import_processor.rb:37:in `call'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/web_components_rails-1.2.2/lib/web_components_rails/html_import_processor.rb:49:in `process_imports'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/nokogumbo-1.4.10/lib/nokogumbo.rb:87:in `fragment'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/nokogumbo-1.4.10/lib/nokogumbo.rb:24:in `parse'
/ci-workspace/rails-project/vendor/bundle/ruby/2.1.0/gems/nokogumbo-1.4.10/lib/nokogumbo.rb:24:in `parse'

-- C level backtrace information -------------------------------------------

nokogumbo not installing

Setup

Using Cygwin on Windows 10 with rvm.

I had trouble installing nokogiri, so I had to install the libxml2 and libxslt libraries and headers (and, I think, pkg-config) and then run gem install nokogiri -- --use-system-libraries --with-xml2-include=C:/cygwin64/usr/include/libxml2 --with-xslt-include=C:/cygwin64/usr/include/libxslt

This installed great.

Problem

Running gem install nokogumbo -v '1.4.13' produces this error in C:\cygwin64\home\Lee\.rvm\gems\ruby-2.5.5\extensions\x86_64-cygwin\2.5.0\nokogumbo-1.4.13\gem_make.out

current directory: /home/Lee/.rvm/gems/ruby-2.5.5/gems/nokogumbo-1.4.13/ext/nokogumboc
/home/Lee/.rvm/rubies/ruby-2.5.5/bin/ruby.exe -I /home/Lee/.rvm/rubies/ruby-2.5.5/lib/ruby/site_ruby/2.5.0 -r ./siteconf20190828-35793-1yqbuou.rb extconf.rb
checking for xmlNewDoc() in -lxml2... yes
checking for nokogiri.h in /home/Lee/.rvm/gems/ruby-2.5.5/gems/nokogiri-1.6.8.1/ext/nokogiri... yes
checking for nokogiri.h in /home/Lee/.rvm/gems/ruby-2.5.5/gems/nokogiri-1.6.8.1/ext/nokogiri... yes
checking for gumbo_parse() in -lgumbo... no
checking for GumboErrorType with error.h... not found
checking for GumboInsertionMode with insertion_mode.h... not found
checking for GumboParser with parser.h... not found
checking for GumboStringBuffer with string_buffer.h... not found
checking for GumboTokenType with token_type.h... not found
creating Makefile

current directory: /home/Lee/.rvm/gems/ruby-2.5.5/gems/nokogumbo-1.4.13/ext/nokogumboc
make "DESTDIR=" clean

current directory: /home/Lee/.rvm/gems/ruby-2.5.5/gems/nokogumbo-1.4.13/ext/nokogumboc
make "DESTDIR="
compiling attribute.c
compiling char_ref.c
char_ref.rl: In function ‘consume_numeric_ref’:
char_ref.rl:128:3: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
   if (!error) {
   ^~~
char_ref.rl:137:3: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
   for (int i = 0; kCharReplacements[i].from_char != -1; ++i) {
   ^~~
.....
.....
..... errors continue.......

mkmf.txt output is

have_library: checking for xmlNewDoc() in -lxml2... -------------------- yes

"gcc -o conftest.exe -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/x86_64-cygwin -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/ruby/backward -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0 -I. -D_XOPEN_SOURCE -D_GNU_SOURCE   -O3 -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wno-tautological-compare -Wno-parentheses-equality -Wno-constant-logical-operand -Wno-self-assign -Wunused-variable -Wimplicit-int -Wpointer-arith -Wwrite-strings -Wdeclaration-after-statement -Wimplicit-function-declaration -Wdeprecated-declarations -Wmisleading-indentation -Wno-packed-bitfield-compat -Wsuggest-attribute=noreturn -Wsuggest-attribute=format -Wimplicit-fallthrough=0 -Wduplicated-cond -Wrestrict -std=c99 conftest.c  -L. -L/home/Lee/.rvm/rubies/ruby-2.5.5/lib -L. -fstack-protector     -lruby250  -lpthread -ldl -lcrypt  "
checked program was:
/* begin */
1: #include "ruby.h"
2: 
3: int main(int argc, char **argv)
4: {
5:   return 0;
6: }
/* end */

"gcc -o conftest.exe -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/x86_64-cygwin -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/ruby/backward -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0 -I. -D_XOPEN_SOURCE -D_GNU_SOURCE   -O3 -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wno-tautological-compare -Wno-parentheses-equality -Wno-constant-logical-operand -Wno-self-assign -Wunused-variable -Wimplicit-int -Wpointer-arith -Wwrite-strings -Wdeclaration-after-statement -Wimplicit-function-declaration -Wdeprecated-declarations -Wmisleading-indentation -Wno-packed-bitfield-compat -Wsuggest-attribute=noreturn -Wsuggest-attribute=format -Wimplicit-fallthrough=0 -Wduplicated-cond -Wrestrict -std=c99 conftest.c  -L. -L/home/Lee/.rvm/rubies/ruby-2.5.5/lib -L. -fstack-protector     -lruby250 -lxml2  -lpthread -ldl -lcrypt  "
conftest.c: In function ‘t’:
conftest.c:13:57: error: ‘xmlNewDoc’ undeclared (first use in this function)
 int t(void) { void ((*volatile p)()); p = (void ((*)()))xmlNewDoc; return !p; }
                                                     ^~~~~~~~~
conftest.c:13:57: note: each undeclared identifier is reported only once for each function it appears in
conftest.c: At top level:
cc1: warning: unrecognized command line option ‘-Wno-self-assign’
cc1: warning: unrecognized command line option ‘-Wno-constant-logical-operand’
cc1: warning: unrecognized command line option ‘-Wno-parentheses-equality’
checked program was:
/* begin */
 1: #include "ruby.h"
 2: 
 3: /*top*/
 4: extern int t(void);
 5: int main(int argc, char **argv)
 6: {
 7:   if (argc > 1000000) {
 8:     printf("%p", &t);
 9:   }
10: 
11:   return 0;
12: }
13: int t(void) { void ((*volatile p)()); p = (void ((*)()))xmlNewDoc; return !p; }
/* end */

"gcc -o conftest.exe -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/x86_64-cygwin -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/ruby/backward -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0 -I. -D_XOPEN_SOURCE -D_GNU_SOURCE   -O3 -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wno-tautological-compare -Wno-parentheses-equality -Wno-constant-logical-operand -Wno-self-assign -Wunused-variable -Wimplicit-int -Wpointer-arith -Wwrite-strings -Wdeclaration-after-statement -Wimplicit-function-declaration -Wdeprecated-declarations -Wmisleading-indentation -Wno-packed-bitfield-compat -Wsuggest-attribute=noreturn -Wsuggest-attribute=format -Wimplicit-fallthrough=0 -Wduplicated-cond -Wrestrict -std=c99 conftest.c  -L. -L/home/Lee/.rvm/rubies/ruby-2.5.5/lib -L. -fstack-protector     -lruby250 -lxml2  -lpthread -ldl -lcrypt  "
checked program was:
/* begin */
 1: #include "ruby.h"
 2: 
 3: /*top*/
 4: extern int t(void);
 5: int main(int argc, char **argv)
 6: {
 7:   if (argc > 1000000) {
 8:     printf("%p", &t);
 9:   }
10: 
11:   return 0;
12: }
13: extern void xmlNewDoc();
14: int t(void) { xmlNewDoc(); return 0; }
/* end */

--------------------

"pkg-config --exists libxml-2.0"
| pkg-config --libs libxml-2.0
=> "-lxml2 \n"
"gcc -o conftest.exe -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/x86_64-cygwin -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/ruby/backward -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0 -I. -D_XOPEN_SOURCE -D_GNU_SOURCE   -O3 -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wno-tautological-compare -Wno-parentheses-equality -Wno-constant-logical-operand -Wno-self-assign -Wunused-variable -Wimplicit-int -Wpointer-arith -Wwrite-strings -Wdeclaration-after-statement -Wimplicit-function-declaration -Wdeprecated-declarations -Wmisleading-indentation -Wno-packed-bitfield-compat -Wsuggest-attribute=noreturn -Wsuggest-attribute=format -Wimplicit-fallthrough=0 -Wduplicated-cond -Wrestrict -std=c99 conftest.c  -L. -L/home/Lee/.rvm/rubies/ruby-2.5.5/lib -L. -fstack-protector    -lxml2  -lruby250 -lxml2 -lpthread -ldl -lcrypt  "
checked program was:
/* begin */
1: #include "ruby.h"
2: 
3: int main(int argc, char **argv)
4: {
5:   return 0;
6: }
/* end */

| pkg-config --cflags-only-I libxml-2.0
=> "-I/usr/include/libxml2 \n"
| pkg-config --cflags-only-other libxml-2.0
=> "\n"
| pkg-config --libs-only-l libxml-2.0
=> "-lxml2 \n"
package configuration for libxml-2.0
cflags: 
ldflags: 
libs: -lxml2

find_header: checking for nokogiri.h in /home/Lee/.rvm/gems/ruby-2.5.5/gems/nokogiri-1.6.8.1/ext/nokogiri... -------------------- yes

"gcc -E -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/x86_64-cygwin -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/ruby/backward -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0 -I. -I/usr/include/libxml2 -D_XOPEN_SOURCE -D_GNU_SOURCE   -O3 -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wno-tautological-compare -Wno-parentheses-equality -Wno-constant-logical-operand -Wno-self-assign -Wunused-variable -Wimplicit-int -Wpointer-arith -Wwrite-strings -Wdeclaration-after-statement -Wimplicit-function-declaration -Wdeprecated-declarations -Wmisleading-indentation -Wno-packed-bitfield-compat -Wsuggest-attribute=noreturn -Wsuggest-attribute=format -Wimplicit-fallthrough=0 -Wduplicated-cond -Wrestrict -std=c99   conftest.c -o conftest.i"
conftest.c:3:10: fatal error: nokogiri.h: No such file or directory
 #include <nokogiri.h>
          ^~~~~~~~~~~~
compilation terminated.
checked program was:
/* begin */
1: #include "ruby.h"
2: 
3: #include <nokogiri.h>
/* end */

"gcc -E -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/x86_64-cygwin -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/ruby/backward -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0 -I. -I/usr/include/libxml2 -D_XOPEN_SOURCE -D_GNU_SOURCE   -O3 -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wno-tautological-compare -Wno-parentheses-equality -Wno-constant-logical-operand -Wno-self-assign -Wunused-variable -Wimplicit-int -Wpointer-arith -Wwrite-strings -Wdeclaration-after-statement -Wimplicit-function-declaration -Wdeprecated-declarations -Wmisleading-indentation -Wno-packed-bitfield-compat -Wsuggest-attribute=noreturn -Wsuggest-attribute=format -Wimplicit-fallthrough=0 -Wduplicated-cond -Wrestrict -std=c99  -I/home/Lee/.rvm/gems/ruby-2.5.5/gems/nokogiri-1.6.8.1/ext/nokogiri conftest.c -o conftest.i"
In file included from conftest.c:3:0:
/home/Lee/.rvm/gems/ruby-2.5.5/gems/nokogiri-1.6.8.1/ext/nokogiri/nokogiri.h:13:0: warning: "_GNU_SOURCE" redefined
 #define _GNU_SOURCE

In file included from /home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/ruby/ruby.h:24:0,
                 from /home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/ruby.h:33,
                 from conftest.c:1:
/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/x86_64-cygwin/ruby/config.h:17:0: note: this is the location of the previous definition
 #define _GNU_SOURCE 1

In file included from conftest.c:3:0:
/home/Lee/.rvm/gems/ruby-2.5.5/gems/nokogiri-1.6.8.1/ext/nokogiri/nokogiri.h:39:0: warning: "MAYBE_UNUSED" redefined
  #  define MAYBE_UNUSED(name) name __attribute__((unused))

In file included from /home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/ruby/ruby.h:24:0,
                 from /home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/ruby.h:33,
                 from conftest.c:1:
/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/x86_64-cygwin/ruby/config.h:134:0: note: this is the location of the previous definition
 #define MAYBE_UNUSED(x) __attribute__ ((__unused__)) x

cc1: warning: unrecognized command line option ‘-Wno-self-assign’
cc1: warning: unrecognized command line option ‘-Wno-constant-logical-operand’
cc1: warning: unrecognized command line option ‘-Wno-parentheses-equality’
checked program was:
/* begin */
1: #include "ruby.h"
2: 
3: #include <nokogiri.h>
/* end */

--------------------

find_header: checking for nokogiri.h in /home/Lee/.rvm/gems/ruby-2.5.5/gems/nokogiri-1.6.8.1/ext/nokogiri... -------------------- yes

"gcc -E -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/x86_64-cygwin -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/ruby/backward -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0 -I. -I/usr/include/libxml2 -I/home/Lee/.rvm/gems/ruby-2.5.5/gems/nokogiri-1.6.8.1/ext/nokogiri -D_XOPEN_SOURCE -D_GNU_SOURCE   -O3 -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wno-tautological-compare -Wno-parentheses-equality -Wno-constant-logical-operand -Wno-self-assign -Wunused-variable -Wimplicit-int -Wpointer-arith -Wwrite-strings -Wdeclaration-after-statement -Wimplicit-function-declaration -Wdeprecated-declarations -Wmisleading-indentation -Wno-packed-bitfield-compat -Wsuggest-attribute=noreturn -Wsuggest-attribute=format -Wimplicit-fallthrough=0 -Wduplicated-cond -Wrestrict -std=c99   conftest.c -o conftest.i"
In file included from conftest.c:3:0:
/home/Lee/.rvm/gems/ruby-2.5.5/gems/nokogiri-1.6.8.1/ext/nokogiri/nokogiri.h:13:0: warning: "_GNU_SOURCE" redefined
 #define _GNU_SOURCE

In file included from /home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/ruby/ruby.h:24:0,
                 from /home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/ruby.h:33,
                 from conftest.c:1:
/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/x86_64-cygwin/ruby/config.h:17:0: note: this is the location of the previous definition
 #define _GNU_SOURCE 1

In file included from conftest.c:3:0:
/home/Lee/.rvm/gems/ruby-2.5.5/gems/nokogiri-1.6.8.1/ext/nokogiri/nokogiri.h:39:0: warning: "MAYBE_UNUSED" redefined
 #  define MAYBE_UNUSED(name) name __attribute__((unused))

In file included from /home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/ruby/ruby.h:24:0,
                 from /home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/ruby.h:33,
                 from conftest.c:1:
/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/x86_64-cygwin/ruby/config.h:134:0: note: this is the location of the previous definition
 #define MAYBE_UNUSED(x) __attribute__ ((__unused__)) x

cc1: warning: unrecognized command line option ‘-Wno-self-assign’
cc1: warning: unrecognized command line option ‘-Wno-constant-logical-operand’
cc1: warning: unrecognized command line option ‘-Wno-parentheses-equality’
checked program was:
/* begin */
1: #include "ruby.h"
2: 
3: #include <nokogiri.h>
/* end */

--------------------

have_library: checking for gumbo_parse() in -lgumbo... -------------------- no

"gcc -o conftest.exe -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/x86_64-cygwin -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/ruby/backward -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0 -I. -I/usr/include/libxml2 -I/home/Lee/.rvm/gems/ruby-2.5.5/gems/nokogiri-1.6.8.1/ext/nokogiri -D_XOPEN_SOURCE -D_GNU_SOURCE   -O3 -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wno-tautological-compare -Wno-parentheses-equality -Wno-constant-logical-operand -Wno-self-assign -Wunused-variable -Wimplicit-int -Wpointer-arith -Wwrite-strings -Wdeclaration-after-statement -Wimplicit-function-declaration -Wdeprecated-declarations -Wmisleading-indentation -Wno-packed-bitfield-compat -Wsuggest-attribute=noreturn -Wsuggest-attribute=format -Wimplicit-fallthrough=0 -Wduplicated-cond -Wrestrict -std=c99  -DNGLIB conftest.c  -L. -L/home/Lee/.rvm/rubies/ruby-2.5.5/lib -L. -fstack-protector     -lxml2  -lxml2 -lruby250 -lgumbo -lxml2  -lxml2 -lpthread -ldl -lcrypt  "
conftest.c: In function ‘t’:
conftest.c:13:57: error: ‘gumbo_parse’ undeclared (first use in this function)
 int t(void) { void ((*volatile p)()); p = (void ((*)()))gumbo_parse; return !p; }
                                                     ^~~~~~~~~~~
conftest.c:13:57: note: each undeclared identifier is reported only once for each function it appears in
conftest.c: At top level:
cc1: warning: unrecognized command line option ‘-Wno-self-assign’
cc1: warning: unrecognized command line option ‘-Wno-constant-logical-operand’
cc1: warning: unrecognized command line option ‘-Wno-parentheses-equality’
checked program was: 
/* begin */
 1: #include "ruby.h"
 2: 
 3: /*top*/
 4: extern int t(void);
 5: int main(int argc, char **argv)
 6: {
 7:   if (argc > 1000000) {
 8:     printf("%p", &t);
 9:   }
10: 
11:   return 0;
12: }
13: int t(void) { void ((*volatile p)()); p = (void ((*)()))gumbo_parse; return !p; }
/* end */

"gcc -o conftest.exe -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/x86_64-cygwin -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0/ruby/backward -I/home/Lee/.rvm/rubies/ruby-2.5.5/include/ruby-2.5.0 -I. -I/usr/include/libxml2 -I/home/Lee/.rvm/gems/ruby-2.5.5/gems/nokogiri-1.6.8.1/ext/nokogiri -D_XOPEN_SOURCE -D_GNU_SOURCE   -O3 -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wno-tautological-compare -Wno-parentheses-equality -Wno-constant-logical-operand -Wno-self-assign -Wunused-variable -Wimplicit-int -Wpointer-arith -Wwrite-strings -Wdeclaration-after-statement -Wimplicit-function-declaration -Wdeprecated-declarations -Wmisleading-indentation -Wno-packed-bitfield-compat -Wsuggest-attribute=noreturn -Wsuggest-attribute=format -Wimplicit-fallthrough=0 -Wduplicated-cond -Wrestrict -std=c99  -DNGLIB conftest.c  -L. -L/home/Lee/.rvm/rubies/ruby-2.5.5/lib -L. -fstack-protector     -lxml2  -lxml2 -lruby250 -lgumbo -lxml2  -lxml2 -lpthread -ldl -lcrypt  "
/usr/lib/gcc/x86_64-pc-cygwin/7.4.0/../../../../x86_64-pc-cygwin/bin/ld: cannot find -lgumbo
collect2: error: ld returned 1 exit status
checked program was:
/* begin */
 1: #include "ruby.h"
 2: 
 3: /*top*/
 4: extern int t(void);
 5: int main(int argc, char **argv)
 6: {
 7:   if (argc > 1000000) {
 8:     printf("%p", &t);
 9:   }
10: 
11:   return 0;
12: }
13: extern void gumbo_parse();
14: int t(void) { gumbo_parse(); return 0; }
/* end */
--------------------
find_type: checking for GumboErrorType with error.h... -------------------- not found
--------------------
find_type: checking for GumboInsertionMode with insertion_mode.h... -------------------- not found
 --------------------
find_type: checking for GumboParser with parser.h... -------------------- not found
--------------------
find_type: checking for GumboStringBuffer with string_buffer.h... -------------------- not found
--------------------
find_type: checking for GumboTokenType with token_type.h... -------------------- not found
--------------------

I think this may be similar to issue #71 and pull request #86, as this is an earlier version of nokogumbo, but this is a little out of my comfort zone; I just wanted to use rvm with different Ruby versions.

Trying gem install nokogumbo -v '1.4.13' --use-system-libraries --with-xml2-include=C:/cygwin64/usr/include/libxml2 --with-xslt-include=C:/cygwin64/usr/include/libxslt did not fix my issue, although the error was slightly different.

EDIT: Upgraded to nokogiri 1.10.4 with the same result. It doesn't matter which version of nokogumbo I try; I always get the same "checking for gumbo_parse() in -lgumbo... no" error.
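
When a gem_make.out is this long, it can help to pull out only the checks that did not succeed. The following is a minimal sketch, assuming the log uses the "checking for ... yes/no/not found" line format shown above; failed_checks is a hypothetical helper name, not part of mkmf or Nokogumbo.

```ruby
# Collect the extconf checks that did not succeed from gem_make.out-style text.
# Assumes each check is a single line ending in "... no" or "... not found".
def failed_checks(log_text)
  log_text.each_line.map(&:strip).select do |line|
    line.start_with?("checking for ") && line.match?(/\.\.\. (no|not found)\z/)
  end
end

# A few lines copied from the log above.
sample = <<~LOG
  checking for xmlNewDoc() in -lxml2... yes
  checking for gumbo_parse() in -lgumbo... no
  checking for GumboErrorType with error.h... not found
  creating Makefile
LOG

puts failed_checks(sample)
```

For the log in this report, the first failing check is the gumbo_parse() lookup in -lgumbo, which matches the "cannot find -lgumbo" linker message in mkmf.txt.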

Fork gumbo-parser?

Google's gumbo-parser has a few bugs and it's not clear that it's actually maintained any longer. I worked around google/gumbo-parser#371 in #51 but google/gumbo-parser#375 seems potentially bad.

Two options spring to mind:

  1. Incorporate an updated version of gumbo-parser directly in nokogumbo.
  2. Fork gumbo-parser and point the git submodule at the fork.

In either case, fixes could be applied. I suspect option 2 would be easier to do and easier to revert if gumbo starts being updated again.

v2.0.0

With #112, I think I've finished everything that I wanted to get finished for 2.0.0.

The changelog is here but some highlights:

  • Integrate gumbo to fix some long-standing issues
  • Implement the latest version of the HTML standard
  • Add hundreds of tests (with tens of thousands of assertions) to both gumbo itself and to Nokogumbo.
  • Implement all of the error conditions specified in the standard and use the standard-specified code names for those that have them (about half do not have a name yet)
  • Implement proper HTML fragment parsing
  • Implement proper HTML serialization
  • Fix Windows builds and add CI tests for them
  • Fix Gentoo builds and add CI tests for them

Unless there's anything else that needs to be added, I think we should:

  • Change the version number in lib/nokogumbo/version.rb
  • Add a corresponding section in CHANGELOG.md for 2.0.0
  • Tag v2.0.0
  • Push a new gem

Tag the releases

RubyGems reports that this gem is now at version 1.3.0. Can you please tag the releases?

gem build fails on OSX behind proxy

$ gem install nokogumbo -v '1.4.1' results in:

Fetching: nokogumbo-1.4.1.gem (100%)
Building native extensions.  This could take a while...
ERROR:  Error installing nokogumbo:
  ERROR: Failed to build gem native extension.

    /Users/newton10471/.rvm/rubies/ruby-2.1.5/bin/ruby -r ./siteconf20150716-54106-14uit7d.rb extconf.rb
checking for xmlNewDoc() in -lxml2... yes
checking for nokogiri.h in /Users/newton10471/.rvm/gems/ruby-2.1.5@mygemset/gems/nokogiri-1.6.6.2/ext/nokogiri... no
checking if the C compiler accepts ... yes
checking if the C compiler accepts -Wno-error=unused-command-line-argument-hard-error-in-future... no
Building nokogiri using packaged libraries.
checking for gzdopen() in -lz... yes
checking for iconv... yes
************************************************************************
IMPORTANT NOTICE:

Building Nokogiri with a packaged version of libxml2-2.9.2.

Team Nokogiri will keep on doing their best to provide security
updates in a timely manner, but if this is a concern for you and want
to use the system library instead; abort this installation process and
reinstall nokogiri as follows:

    gem install nokogiri -- --use-system-libraries
        [--with-xml2-config=/path/to/xml2-config]
        [--with-xslt-config=/path/to/xslt-config]

If you are using Bundler, tell it to use the option:

    bundle config build.nokogiri --use-system-libraries
    bundle install

Note, however, that nokogiri is not fully compatible with arbitrary
versions of libxml2 provided by OS/package vendors.
************************************************************************
ERROR: Operation timed out - connect(2) for "ftp.xmlsoft.org" port 21
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary
libraries and/or headers.  Check the mkmf.log file for more details.  You may
need configuration options.

Provided configuration options:
  --with-opt-dir
  --without-opt-dir
  --with-opt-include
  --without-opt-include=${opt-dir}/include
  --with-opt-lib
  --without-opt-lib=${opt-dir}/lib
  --with-make-prog
  --without-make-prog
  --srcdir=.
  --curdir
  --ruby=/Users/newton10471/.rvm/rubies/ruby-2.1.5/bin/ruby
  --with-xml2lib
  --without-xml2lib
  --with-libxml-2.0-config
  --without-libxml-2.0-config
  --with-pkg-config
  --without-pkg-config
  --help
  --clean
  --use-system-libraries
  --enable-static
  --disable-static
  --with-zlib-dir
  --without-zlib-dir
  --with-zlib-include
  --without-zlib-include=${zlib-dir}/include
  --with-zlib-lib
  --without-zlib-lib=${zlib-dir}/lib
  --enable-cross-build
  --disable-cross-build
/Users/newton10471/.rvm/gems/ruby-2.1.5@mygemset/gems/mini_portile-0.6.2/lib/mini_portile.rb:319:in `rescue in download_file': Failed to complete download task (RuntimeError)
  from /Users/newton10471/.rvm/gems/ruby-2.1.5@mygemset/gems/mini_portile-0.6.2/lib/mini_portile.rb:309:in `download_file'
  from /Users/newton10471/.rvm/gems/ruby-2.1.5@mygemset/gems/mini_portile-0.6.2/lib/mini_portile.rb:28:in `block in download'
  from /Users/newton10471/.rvm/gems/ruby-2.1.5@mygemset/gems/mini_portile-0.6.2/lib/mini_portile.rb:26:in `each'
  from /Users/newton10471/.rvm/gems/ruby-2.1.5@mygemset/gems/mini_portile-0.6.2/lib/mini_portile.rb:26:in `download'
  from /Users/newton10471/.rvm/gems/ruby-2.1.5@mygemset/gems/mini_portile-0.6.2/lib/mini_portile.rb:106:in `cook'
  from /Users/newton10471/.rvm/gems/ruby-2.1.5@mygemset/gems/nokogiri-1.6.6.2/ext/nokogiri/extconf.rb:278:in `block in process_recipe'
  from /Users/newton10471/.rvm/gems/ruby-2.1.5@mygemset/gems/nokogiri-1.6.6.2/ext/nokogiri/extconf.rb:177:in `tap'
  from /Users/newton10471/.rvm/gems/ruby-2.1.5@mygemset/gems/nokogiri-1.6.6.2/ext/nokogiri/extconf.rb:177:in `process_recipe'
  from /Users/newton10471/.rvm/gems/ruby-2.1.5@mygemset/gems/nokogiri-1.6.6.2/ext/nokogiri/extconf.rb:475:in `<top (required)>'
  from /Users/newton10471/.rvm/rubies/ruby-2.1.5/lib/ruby/site_ruby/2.1.0/rubygems/core_ext/kernel_require.rb:54:in `require'
  from /Users/newton10471/.rvm/rubies/ruby-2.1.5/lib/ruby/site_ruby/2.1.0/rubygems/core_ext/kernel_require.rb:54:in `require'
  from extconf.rb:17:in `<main>'

extconf failed, exit code 1

Gem files will remain installed in /Users/newton10471/.rvm/gems/ruby-2.1.5@mygemset/gems/nokogumbo-1.4.1 for inspection.
Results logged to /Users/newton10471/.rvm/gems/ruby-2.1.5@mygemset/extensions/x86_64-darwin-14/2.1.0-static/nokogumbo-1.4.1/gem_make.out

My https_proxy and http_proxy variables are set to the proper proxies, such that other gems build just fine. 'gem install nokogiri' works. 'curl ftp.xmlsoft.org' returns html as well. It looks to me like whatever is trying to contact ftp.xmlsoft.org is doing so without the proxy settings.
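
One possible explanation, offered only as a sketch: Ruby's standard library resolves proxies per URL scheme, so an ftp:// download consults ftp_proxy rather than http_proxy or https_proxy. The report does not confirm whether mini_portile 0.6.2 looks up a proxy this way at all. The proxy address below is a placeholder, and the IP literal in the control case is a documentation-range address chosen so that no DNS lookup is needed.

```ruby
require "uri"

# Proxy settings like those described above: only HTTP and HTTPS are set.
# "proxy.example:3128" is a placeholder, not taken from the report.
env = {
  "http_proxy"  => "http://proxy.example:3128",
  "https_proxy" => "http://proxy.example:3128",
}

# URI#find_proxy looks up "<scheme>_proxy", so an ftp:// URL only sees ftp_proxy.
proxy_for_ftp = URI("ftp://ftp.xmlsoft.org/").find_proxy(env)

# Control case: an http:// URL picks up http_proxy from the same settings.
proxy_for_http = URI("http://192.0.2.1/").find_proxy(env)

puts "ftp  -> #{proxy_for_ftp.inspect}"
puts "http -> #{proxy_for_http.inspect}"
```

If this is what is happening, setting ftp_proxy as well (or installing nokogiri with --use-system-libraries so nothing is downloaded) might be worth trying.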

Revisiting dropping Ruby 2.0

I had removed the Nokogiri version specification from the Gemfile for Ruby 2.0 a while back and it passed all the tests so I assumed that it was working. What I didn't realize was that Travis had cached 1.6.8 and since the Gemfile didn't specify any particular version of Nokogiri, that one was used.

I just tried setting the minimum version to 1.8 and it fails to build because every Nokogiri release after 1.6.8 requires at least Ruby 2.1. Unfortunately, I made this change because tests on 1.6.8 started failing. I looked into this and it's a bug in Nokogiri.

Here's some HTML that triggers this.

<html></html><!-- foo -->

The buggy line is right here and here's @flavorjones's fix in 1.8.0.

We could monkey patch this in version 1.6.8, but I'd much prefer dropping support for Ruby 2.0 since the latest version of Nokogiri supports Ruby 2.1 and higher and we're all green on Ruby 2.1+. (This is testing the code I wrote to properly support fragment parsing which passes all of the html5lib fragment tests. I still need to write some more tests of the API, but the hard part is all done.)
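
The version conflict can be checked with RubyGems' own comparison classes. This is a minimal sketch; the ">= 2.1" and ">= 1.8" requirements are taken from the description above, not read from any gemspec.

```ruby
require "rubygems"

# Assumption from the discussion: Nokogiri releases after 1.6.8 need Ruby 2.1 or newer.
required_ruby = Gem::Requirement.new(">= 2.1")

ruby_20 = Gem::Version.new("2.0.0")
ruby_21 = Gem::Version.new("2.1.0")

# The minimum Nokogiri version that was tried in the Gemfile.
nokogiri_minimum = Gem::Requirement.new(">= 1.8")
cached_nokogiri  = Gem::Version.new("1.6.8")

ruby_20_ok = required_ruby.satisfied_by?(ruby_20)
ruby_21_ok = required_ruby.satisfied_by?(ruby_21)
cached_ok  = nokogiri_minimum.satisfied_by?(cached_nokogiri)

puts "Ruby 2.0 meets >= 2.1: #{ruby_20_ok}"
puts "Ruby 2.1 meets >= 2.1: #{ruby_21_ok}"
puts "Nokogiri 1.6.8 meets >= 1.8: #{cached_ok}"
```

With the minimum raised to 1.8, the cached 1.6.8 no longer qualifies, and the releases that do qualify cannot be installed on Ruby 2.0.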

Crash with unicode NULL

Nokogumbo can crash with an assertion failure when parsing HTML tags that contain Unicode NULL (U+0000) characters.

Steps to reproduce:

require 'nokogiri'
require 'nokogumbo'

str = "<\u0000/\u0000>" # </>
Nokogiri::HTML5.parse str
Assertion failed: (*c || c == error_location), function find_last_newline, file error.c, line 141.
[1]    64965 abort      irb

See also these strings:

<\u0000/\u0000h\u0000t\u0000m\u0000l\u0000>
<html><body><\u0000/\u0000b\u0000o\u0000d\u0000y\u0000><\u0000/\u0000h\u0000t\u0000m\u0000l\u0000>

Version tested: 1.8.1
Ruby version: 2.3.4
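
Until a fix is available, a caller could remove the NUL characters before handing the string to the parser. This is a stopgap sketch only: strip_nuls is a hypothetical helper, not part of Nokogumbo, and it changes the input rather than letting the parser report the NULs as parse errors.

```ruby
# The three inputs from the report; each contains U+0000 code points.
inputs = [
  "<\u0000/\u0000>",
  "<\u0000/\u0000h\u0000t\u0000m\u0000l\u0000>",
  "<html><body><\u0000/\u0000b\u0000o\u0000d\u0000y\u0000><\u0000/\u0000h\u0000t\u0000m\u0000l\u0000>",
]

# Hypothetical helper: drop every U+0000 before parsing.
def strip_nuls(html)
  html.delete("\u0000")
end

cleaned = inputs.map { |html| strip_nuls(html) }

cleaned.each { |html| puts html }
```

The cleaned strings can then be passed to Nokogiri::HTML5.parse as in the steps above.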

Change default number of errors for 2.0.0

In #65 we decided that 0 was a good default for the maximum number of errors. I'm starting to think that's not a good choice and I think we should change it for 2.0.0. I propose 10 (although any small, nonzero number seems reasonable).

My rationale is that it's useful to be able to check if there were errors at all, even if you don't care about any specific number of them. That is, something like

html = ...
doc = Nokogiri::HTML5.parse(html)
if doc.errors.any?
  puts(doc.errors[0])
end

is nicer than

html = ...
doc = Nokogiri::HTML5.parse(html, max_parse_errors: 1)
if doc.errors.any?
  puts(doc.errors[0])
end

Plus, I think it's unintuitive that if the maximum number of errors is not set, then doc.errors will be empty even if there were parse errors.

10 or 5 errors seem reasonable (and we should document what we choose). It shouldn't lead to a massive amount of memory usage as can happen with an unbounded number of errors.

@rafbm, you raised the original issue. Would 5 or 10 work for you?
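To illustrate why a small cap keeps memory bounded while still letting callers ask whether any error occurred, here is a minimal sketch. It is not Nokogumbo's implementation; the class name `BoundedErrors` and its methods are hypothetical.

```ruby
# Hypothetical collector that keeps at most `max` error messages.
class BoundedErrors
  attr_reader :errors

  def initialize(max)
    @max = max
    @errors = []
  end

  # Record an error only while there is room; later errors are dropped.
  def add(message)
    @errors << message if @errors.length < @max
  end
end

capped = BoundedErrors.new(10)
zero   = BoundedErrors.new(0)
1_000.times do |i|
  capped.add("parse error #{i}")
  zero.add("parse error #{i}")
end

puts capped.errors.length # 10
puts capped.errors.any?   # true
puts zero.errors.any?     # false, even though errors occurred
```

With a cap of 0 the list stays empty no matter how many errors occur, which is the unintuitive behaviour described above; any small nonzero cap makes `errors.any?` meaningful at a fixed memory cost.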

Release a new version

A couple of crashes have been fixed (#129, #128, #126), but no new version has been released on RubyGems for a year. Can you push a new version?

binutils-2.25: `require': nokogiri.so: cannot open shared object file: No such file or directory

Projects using Bundler and Nokogumbo fail to load if built using binutils-2.25. The following error message appears:

/home/infoman/work/dev/binutils-bisect/bundler-test/vendor/bundle/ruby/2.0.0/gems/nokogumbo-1.4.2/lib/nokogumbo.rb:2:in `require': nokogiri.so: cannot open shared object file: No such file or directory - /home/infoman/work/dev/binutils-bisect/bundler-test/vendor/bundle/ruby/2.0.0/extensions/x86-linux/2.0.0/nokogumbo-1.4.2/nokogumboc.so (LoadError)

The binutils bug report was closed as invalid: https://sourceware.org/bugzilla/show_bug.cgi?id=18448

Creating a DocumentFragment results in a slow memory leak

Creating a DocumentFragment from a Nokogumbo document appears to result in a slow memory leak. The larger the HTML input, the larger the leak is. Here's a simple repro case (I tested against Nokogumbo 1.3.0):

#!/usr/bin/env ruby
# encoding: utf-8
require 'nokogumbo'

html = %[
  <p><b>Rome</b> (,  , ) is a city and special <i><a href="comune">comune</a></i> (named "Roma Capitale") in <a href="Italy">Italy</a>. Rome is the capital of Italy and <a href="Regions of Italy">region</a> of <a href="Lazio">Lazio</a>. With 2.9&nbsp;million residents in , it is also the country's largest and most populated <i>comune</i> and <a href="Largest cities of the European Union by population within city limits">fourth-most populous city</a> in the European Union by population within city limits. The <a href="Metropolitan City of Rome">Metropolitan City of Rome</a> has a population of 4.3&nbsp;million residents.<sup class="reference" id="cite_ref-PR_2-1">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-PR_2">2</a>]</sup> The city is located in the central-western portion of the <a href="Italian Peninsula">Italian Peninsula</a>, within Lazio (Latium), along the shores of <a href="Tiber">Tiber</a> river. <a href="Vatican City">Vatican City</a> is an independent country within the city boundaries of Rome, the only existing example of a country within a city: for this reason Rome has been often defined as capital of two states.<sup class="reference" id="cite_ref-3-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-3">3</a>]</sup><sup class="reference" id="cite_ref-4-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-4">4</a>]</sup></p>
  <p><a href="History of Rome">Rome's history</a> spans <a href="List of cities by time of continuous habitation">more than two and a half thousand years</a>. Although Roman tradition states the founding of Rome around 753 BC, the site has been inhabited much earlier, being one of the oldest continuously occupied cities in Europe.<sup class="reference" id="cite_ref-5-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-5">5</a>]</sup> The city's early population originated from a mix of <a href="Latins">Latins</a>, <a href="Etruscan civilization">Etruscans</a> and <a href="Sabines">Sabines</a>. Eventually, the city  successively became the capital of the <a href="Roman Kingdom">Roman Kingdom</a>, the <a href="Roman Republic">Roman Republic</a> and the <a href="Roman Empire">Roman Empire</a>, and is regarded as one of the birthplaces of <a href="Western culture">Western civilization</a>. It is referred to as "Roma Aeterna" (The Eternal City) <sup class="reference" id="cite_ref-6-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-6">6</a>]</sup> and "<a href="Caput Mundi">Caput Mundi</a>" (Capital of the World), two central notions in ancient Roman culture.</p>
  <p>After the <a href="Fall of the Roman Empire">Fall of the Empire</a>, which marked the begin of the <a href="Middle Ages">Middle Ages</a>, Rome slowly fell under the political control of the <a href="Pope">Pope</a>, which had settled in the city since the 1st century AD, until in the 8th century it became the capital of the <a href="Papal States">Papal States</a>, which lasted until 1870.</p>
  <p>Beginning with the <a href="Renaissance">Renaissance</a>, almost all the popes since <a href="Pope Nicholas V">Nicholas V</a> (1422–55) pursued coherently along four hundred years an architectonic and urbanistic program aimed to make of the city the world's artistic and cultural center.<sup class="reference" id="cite_ref-7-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-7">7</a>]</sup> Due to that, Rome became first one of the major centers of the <a href="Italian Renaissance">Italian Renaissance</a>,<sup class="reference" id="cite_ref-8-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-8">8</a>]</sup> and then the birthplace of the <a href="Baroque">Baroque</a> style. Famous artists and architects of the Renaissance and Baroque period made Rome the center of their activity, creating masterpieces throughout the city. In 1871 Rome became the capital of the <a href="Kingdom of Italy (1861–1946)">Kingdom of Italy</a>, and in 1946 that of the <a href="Italian Republic">Italian Republic</a>.</p>
  <p>Rome has the status of a <a href="global city">global city</a>.<sup class="reference" id="cite_ref-lboro.ac.uk_9-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-lboro.ac.uk_9">9</a>]</sup><sup class="reference" id="cite_ref-10-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-10">10</a>]</sup><sup class="reference" id="cite_ref-atkearney.at_11-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-atkearney.at_11">11</a>]</sup> In 2011, Rome was the 18th-most-visited city in the world, 3rd most visited in the <a href="European Union">European Union</a>, and the most popular tourist attraction in Italy.<sup class="reference" id="cite_ref-Caroline Bremner_12-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-Caroline Bremner_12">12</a>]</sup> Its historic centre is listed by <a href="UNESCO">UNESCO</a> as a <a href="World Heritage Site">World Heritage Site</a>.<sup class="reference" id="cite_ref-whc.unesco.org_13-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-whc.unesco.org_13">13</a>]</sup> Monuments and museums such as the <a href="Vatican Museums">Vatican Museums</a> and the <a href="Colosseum">Colosseum</a> are among the world's most visited tourist destinations with both locations receiving millions of tourists a year. Rome hosted the <a href="1960 Summer Olympics">1960 Summer Olympics</a> and is the seat of United Nations' <a href="Food and Agriculture Organization">Food and Agriculture Organization</a> (FAO).</p>
  <p><table id="toc" class="toc" summary="Contents"><tr><td><div id="toctitle"><h2>Table of Contents</h2></div><ul><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Etymology" rel="nofollow" title="#Etymology">#Etymology</a>">Etymology</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23History" rel="nofollow" title="#History">#History</a>">History</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Earliest_history" rel="nofollow" title="#Earliest_history">#Earliest_history</a>">Earliest history</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Legend_of_the_founding_of_Rome" rel="nofollow" title="#Legend_of_the_founding_of_Rome">#Legend_of_the_founding_of_Rome</a>">Legend of the founding of Rome</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Monarchy_republic_empire" rel="nofollow" title="#Monarchy_republic_empire">#Monarchy_republic_empire</a>">Monarchy, republic, empire</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Middle_Ages" rel="nofollow" title="#Middle_Ages">#Middle_Ages</a>">Middle Ages</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Early_modern" rel="nofollow" title="#Early_modern">#Early_modern</a>">Early modern</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Late_modern_and_contemporary" rel="nofollow" title="#Late_modern_and_contemporary">#Late_modern_and_contemporary</a>">Late modern and contemporary</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Government" rel="nofollow" title="#Government">#Government</a>">Government</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Local_government" rel="nofollow" 
title="#Local_government">#Local_government</a>">Local government</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Administrative_and_historical_subdivisions" rel="nofollow" title="#Administrative_and_historical_subdivisions">#Administrative_and_historical_subdivisions</a>">Administrative and historical subdivisions</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Metropolitan_and_regional_government" rel="nofollow" title="#Metropolitan_and_regional_government">#Metropolitan_and_regional_government</a>">Metropolitan and regional government</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23National_government" rel="nofollow" title="#National_government">#National_government</a>">National government</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Geography" rel="nofollow" title="#Geography">#Geography</a>">Geography</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Location" rel="nofollow" title="#Location">#Location</a>">Location</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Topography" rel="nofollow" title="#Topography">#Topography</a>">Topography</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Climate" rel="nofollow" title="#Climate">#Climate</a>">Climate</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Demographics" rel="nofollow" title="#Demographics">#Demographics</a>">Demographics</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Ethnic_groups" rel="nofollow" title="#Ethnic_groups">#Ethnic_groups</a>">Ethnic groups</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Religion" rel="nofollow" 
title="#Religion">#Religion</a>">Religion</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Vatican_City" rel="nofollow" title="#Vatican_City">#Vatican_City</a>">Vatican City</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Pilgrimage" rel="nofollow" title="#Pilgrimage">#Pilgrimage</a>">Pilgrimage</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Cityscape" rel="nofollow" title="#Cityscape">#Cityscape</a>">Cityscape</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Architecture" rel="nofollow" title="#Architecture">#Architecture</a>">Architecture</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Ancient_Rome" rel="nofollow" title="#Ancient_Rome">#Ancient_Rome</a>">Ancient Rome</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Medieval" rel="nofollow" title="#Medieval">#Medieval</a>">Medieval</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Renaissance_and_Baroque" rel="nofollow" title="#Renaissance_and_Baroque">#Renaissance_and_Baroque</a>">Renaissance and Baroque</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Neoclassicism" rel="nofollow" title="#Neoclassicism">#Neoclassicism</a>">Neoclassicism</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Fascist_architecture" rel="nofollow" title="#Fascist_architecture">#Fascist_architecture</a>">Fascist architecture</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Parks_and_gardens" rel="nofollow" title="#Parks_and_gardens">#Parks_and_gardens</a>">Parks and gardens</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Fountains_and_aqueducts" rel="nofollow" 
title="#Fountains_and_aqueducts">#Fountains_and_aqueducts</a>">Fountains and aqueducts</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Statues" rel="nofollow" title="#Statues">#Statues</a>">Statues</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Obelisks_and_columns" rel="nofollow" title="#Obelisks_and_columns">#Obelisks_and_columns</a>">Obelisks and columns</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Bridges" rel="nofollow" title="#Bridges">#Bridges</a>">Bridges</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Catacombs" rel="nofollow" title="#Catacombs">#Catacombs</a>">Catacombs</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Economy" rel="nofollow" title="#Economy">#Economy</a>">Economy</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Education" rel="nofollow" title="#Education">#Education</a>">Education</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Culture" rel="nofollow" title="#Culture">#Culture</a>">Culture</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Entertainment_and_performing_arts" rel="nofollow" title="#Entertainment_and_performing_arts">#Entertainment_and_performing_arts</a>">Entertainment and performing arts</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Tourism" rel="nofollow" title="#Tourism">#Tourism</a>">Tourism</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Cuisine" rel="nofollow" title="#Cuisine">#Cuisine</a>">Cuisine</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Cinema" rel="nofollow" title="#Cinema">#Cinema</a>">Cinema</a></li><li><a href="<a class="tweet-url 
hashtag" href="https://twitter.com/#!/search?q=%23Language" rel="nofollow" title="#Language">#Language</a>">Language</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Sports" rel="nofollow" title="#Sports">#Sports</a>">Sports</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Transport" rel="nofollow" title="#Transport">#Transport</a>">Transport</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23International_entities_organisations_and_involvement" rel="nofollow" title="#International_entities_organisations_and_involvement">#International_entities_organisations_and_involvement</a>">International entities, organisations and involvement</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Twin_towns_sister_cities_and_partner_cities" rel="nofollow" title="#Twin_towns_sister_cities_and_partner_cities">#Twin_towns_sister_cities_and_partner_cities</a>">Twin towns, sister cities and partner cities</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23See_also" rel="nofollow" title="#See_also">#See_also</a>">See also</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23References" rel="nofollow" title="#References">#References</a>">References</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Bibliography" rel="nofollow" title="#Bibliography">#Bibliography</a>">Bibliography</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Documentaries" rel="nofollow" title="#Documentaries">#Documentaries</a>">Documentaries</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23External_links" rel="nofollow" title="#External_links">#External_links</a>">External links</a></li></ul></ul></td></tr></table></p>
  <h2><span class="editsection">&#91;<a href="?section=Etymology" title="Edit section: Etymology">edit</a>&#93;</span> <a name="Etymology"></a><span class="mw-headline" id="Etymology">Etymology</span></h2>
  <p>About the origin of the name <i>Roma</i> several hypotheses have been advanced.<sup class="reference" id="cite_ref-14-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-14">14</a>]</sup> The most important are the following:
  <ul><li>From <i>Rumon</i> or <i>Rumen</i>, archaic name of the <a href="Tiber">Tiber</a>, which in turn has the same root as the Greek verb ῥέω (rhèo) and the Latin verb <i>ruo</i>, which both mean "flow";<sup class="reference" id="cite_ref-15-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-15">15</a>]</sup></li><li>From the <a href="Etruscan language">Etruscan</a> word <i>ruma</i>, whose root is *rum- "teat", with possible reference either to the <a href="Founding of Rome#The legend">totem wolf that adopted and suckled</a> the cognately named twins <a href="Romulus and Remus">Romulus and Remus</a>, or to the shape of the <a href="Palatine Hill">Palatine</a> and <a href="Aventine Hill">Aventine Hills</a>;</li><li>From the Greek word ῤώμη (rh�mē), which means <i>strength</i>.<sup class="reference" id="cite_ref-16-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-16">16</a>]</sup></li></ul></p>
  <h2><span class="editsection">&#91;<a href="?section=History" title="Edit section: Etymology">edit</a>&#93;</span> <a name="History"></a><span class="mw-headline" id="History">History</span></h2>
  <h3><span class="editsection">&#91;<a href="?section=Earliest_history" title="Edit section: Etymology">edit</a>&#93;</span> <a name="Earliest_history"></a><span class="mw-headline" id="Earliest_history">Earliest history</span></h3>
  <p>There is archaeological evidence of human occupation of the Rome area from approximately 14,000 years ago, but the dense layer of much younger debris obscures Palaeolithic and Neolithic sites.<sup class="reference" id="cite_ref-17-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-17">17</a>]</sup> Evidence of stone tools, pottery and stone weapons attest to about 10,000 years of human presence. Several excavations support the view that Rome grew from <a href="pastoralism">pastoral</a> settlements on the <a href="Palatine Hill">Palatine Hill</a> built above the area of the future <a href="Roman Forum">Roman Forum</a>. While some archaeologists argue that Rome was indeed founded in the middle of the 8th century BC (the date of the tradition), the date is subject to controversy.<sup class="reference" id="cite_ref-foundation_18-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-foundation_18">18</a>]</sup> However, the power of the well known tale of Rome's legendary foundation tends to deflect attention from its actual, and much more ancient, origins.</p>
  <h4><span class="editsection">&#91;<a href="?section=Legend_of_the_founding_of_Rome" title="Edit section: Etymology">edit</a>&#93;</span> <a name="Legend_of_the_founding_of_Rome"></a><span class="mw-headline" id="Legend_of_the_founding_of_Rome">Legend of the founding of Rome</span></h4>
  <p><div class="thumb tright"><div class="thumbinner" style="width: 180px;"><a href="File:She-wolf suckles Romulus and Remus.jpg" class="image" title="Capitoline Wolf suckles the infant twins Romulus and Remus."><img src="She-wolf suckles Romulus and Remus.jpg" alt="Capitoline Wolf suckles the infant twins Romulus and Remus." title="Capitoline Wolf suckles the infant twins Romulus and Remus." style="float:right" /></a><div class="thumbcaption"><a href="Capitoline Wolf">Capitoline Wolf</a> suckles the infant twins <a href="Romulus and Remus">Romulus and Remus</a>.</div></div></div> Traditional stories handed down by the <a href="ancient Romans">ancient Romans</a> themselves explain the earliest <a href="History of Rome">history of their city</a> in terms of <a href="legend">legend</a> and <a href="myth">myth</a>. The most familiar of these myths, and perhaps the most famous of all <a href="Roman mythology">Roman myths</a>, is the story of <a href="Romulus and Remus">Romulus and Remus</a>, the twins who were suckled by a <a href="wolf">she-wolf</a>.<sup class="reference" id="cite_ref-livy1797_19-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-livy1797_19">19</a>]</sup> They decided to build a city, but after an argument, <a href="Romulus">Romulus</a> killed his brother. 
According to the Roman <a href="annalist">annalists</a>, this happened on 21 April 753 BC.<sup class="reference" id="cite_ref-awg73_20-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-awg73_20">20</a>]</sup> This legend had to be reconciled with a dual tradition, set earlier in time, that had the <a href="Trojan War">Trojan refugee</a> <a href="Aeneas">Aeneas</a> escape to Italy and found the line of Romans through his son <a href="Ascanius">Iulus</a>, the namesake of the <a href="Julio-Claudian dynasty">Julio-Claudian dynasty</a>.<sup class="reference" id="cite_ref-livy2005_21-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-livy2005_21">21</a>]</sup> This was accomplished by the Roman poet <a href="Virgil">Virgil</a> in the first century BC.</p>
  <h3><span class="editsection">&#91;<a href="?section=Monarchy_republic_empire" title="Edit section: Etymology">edit</a>&#93;</span> <a name="Monarchy_republic_empire"></a><span class="mw-headline" id="Monarchy_republic_empire">Monarchy, republic, empire</span></h3>
  <p>After the legendary foundation by <a href="Romulus">Romulus</a>,<sup class="reference" id="cite_ref-awg73_20-1">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-awg73_20">20</a>]</sup> Rome was ruled for a period of 244 years by a monarchical system, initially with sovereigns of <a href="Latins (Italic tribe)">Latin</a> and <a href="Sabines">Sabine</a> origin, later by <a href="Etruscans">Etruscan</a> kings. The tradition handed down seven kings: Romulus, <a href="Numa Pompilius">Numa Pompilius</a>, <a href="Tullus Hostilius">Tullus Hostilius</a>, <a href="Ancus Marcius">Ancus Marcius</a>, <a href="Tarquinius Priscus">Tarquinius Priscus</a>, <a href="Servius Tullius">Servius Tullius</a> and <a href="Tarquin the Proud">Tarquin the Proud</a>.<sup class="reference" id="cite_ref-awg73_20-2">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-awg73_20">20</a>]</sup></p>
  <p>In 509 BC the Romans expelled from the city the last king and established an oligarchic republic: since then, for Rome began a period characterized by internal struggles between <a href="Patrician (ancient Rome)">patricians</a> (aristocrats) and <a href="Plebs">plebeians</a> (small landowners), and by constant warfare against the populations of central Italy: Etruscans, Latins, <a href="Volsci">Volsci</a>, <a href="Aequi">Aequi</a>.<sup class="reference" id="cite_ref-awg77_22-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-awg77_22">22</a>]</sup> After becoming master of <a href="Latium">Latium</a>, Rome led several wars (against the <a href="Gauls">Gauls</a>, <a href="Osci">Osci</a>-<a href="Samnites">Samnites</a> and the Greek colony of <a href="Taranto">Taranto</a>, allied with <a href="Pyrrhus of Epirus">Pyrrhus</a>, king of <a href="Epirus">Epirus</a>) whose result was the conquest of the <a href="Italian peninsula">Italian peninsula</a>, from the central area up to <a href="Magna Graecia">Magna Graecia</a>.<sup class="reference" id="cite_ref-awg79_23-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-awg79_23">23</a>]</sup></p>
  <p>The third and second century BC saw the establishment of the Roman hegemony over the Mediterranean and the East, through the three <a href="Punic Wars">Punic Wars</a> (264-146 BC) fought against the city of <a href="Carthage">Carthage</a> and the three <a href="Macedonian Wars">Macedonian Wars</a> (212-168 BC) against <a href="Macedonia (ancient kingdom)">Macedonia</a>.<sup class="reference" id="cite_ref-awg8183_24-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-awg8183_24">24</a>]</sup> Then were established the first <a href="Roman province">Roman provinces</a>: <a href="Sicilia (Roman province)">Sicily</a>, <a href="Corsica et Sardinia">Sardinia and Corsica</a>, <a href="Hispania">Spain</a>, <a href="Macedonia (Roman province)">Macedonia</a>, <a href="Achaea (Roman province)">Greece (Achaia)</a>, <a href="Africa (Roman province)">Africa</a>.<sup class="reference" id="cite_ref-awg8185_25-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-awg8185_25">25</a>]</sup></p>
]

200_000.times do |i|
  frag = Nokogiri::HTML5.fragment(html)

  if i % 1000 == 0
    GC.start
    puts "#{i} Memory: #{`ps -o rss= -p #{Process.pid}`.to_i / 1024}MB"
    puts
  end
end

After 100,000 iterations, this process consumes 87MB on my system.

The leak isn't specific to the Nokogiri::HTML5.fragment method. Creating a fragment manually also leaks:

200_000.times do |i|
  doc = Nokogiri::HTML5.parse("<html><body>#{html}")
  frag = doc.fragment
  frag << doc.xpath('/html/body/node()')

  if i % 1000 == 0
    GC.start
    puts "#{i} Memory: #{`ps -o rss= -p #{Process.pid}`.to_i / 1024}MB"
    puts
  end
end

Vanilla Nokogiri doesn't leak:

200_000.times do |i|
  frag = Nokogiri::HTML.fragment(html)

  if i % 1000 == 0
    GC.start
    puts "#{i} Memory: #{`ps -o rss= -p #{Process.pid}`.to_i / 1024}MB"
    puts
  end
end

I tried poking around a bit to see if I could spot an obvious cause, but C isn't my forte.
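The three loops above differ only in the code under test, so the measurement can be factored into a helper. This is a sketch rather than part of the original report: `rss_mb` and `report_memory` are hypothetical names, the `/proc` branch assumes Linux, and the fallback assumes a `ps` that accepts `-o rss= -p`.

```ruby
# Hypothetical helper: resident set size of this process in MB.
def rss_mb
  status = '/proc/self/status'
  if File.readable?(status)
    # Linux: VmRSS is reported in kB.
    File.read(status)[/^VmRSS:\s+(\d+)/, 1].to_i / 1024
  else
    # Elsewhere, fall back to the ps invocation used in the report.
    `ps -o rss= -p #{Process.pid}`.to_i / 1024
  end
end

# Run the block `iterations` times, printing memory use every `every` runs.
# Returns the list of samples so that growth can be compared afterwards.
def report_memory(iterations, every: 1000)
  samples = []
  iterations.times do |i|
    yield
    next unless (i % every).zero?

    GC.start
    samples << rss_mb
    puts "#{i} Memory: #{samples.last}MB"
  end
  samples
end

# Example with a stand-in workload; replace the block body with
# Nokogiri::HTML5.fragment(html) to repeat the measurement above.
samples = report_memory(3000) { 'x' * 1024 }
puts "growth: #{samples.last - samples.first}MB"
```

Comparing the first and last samples gives a rough growth figure that can be set side by side for the Nokogumbo and vanilla Nokogiri runs.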

1.4.1 doesn't compile on mac os

This is on the latest version of Mac OS with the latest Xcode. 1.3.0 compiles just fine.

alexandre@Rails [master] $ gem install nokogumbo
Building native extensions.  This could take a while...
ERROR:  Error installing nokogumbo:
    ERROR: Failed to build gem native extension.

    /Users/alexandre/.rvm/rubies/ruby-2.2.0/bin/ruby -r ./siteconf20150512-58205-nctaoy.rb extconf.rb
checking for xmlNewDoc() in -lxml2... yes
checking for nokogiri.h in /Users/alexandre/.rvm/gems/ruby-2.2.0@lylo/gems/nokogiri-1.6.6.2/ext/nokogiri... yes
checking for nokogiri.h in /Users/alexandre/.rvm/gems/ruby-2.2.0@lylo/gems/nokogiri-1.6.6.2/ext/nokogiri... yes
checking for gumbo_parse() in -lgumbo... yes
creating Makefile

make "DESTDIR=" clean

make "DESTDIR="
compiling nokogumbo.c
In file included from nokogumbo.c:28:
/Users/alexandre/.rvm/gems/ruby-2.2.0@lylo/gems/nokogiri-1.6.6.2/ext/nokogiri/nokogiri.h:13:9: warning: '_GNU_SOURCE' macro redefined [-Wmacro-redefined]
#define _GNU_SOURCE
        ^
/Users/alexandre/.rvm/rubies/ruby-2.2.0/include/ruby-2.2.0/x86_64-darwin14/ruby/config.h:17:9: note: previous definition is here
#define _GNU_SOURCE 1
        ^
nokogumbo.c:115:12: warning: assigning to 'char *' from 'const char [7]' discards qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
        ns = "xlink:";
           ^ ~~~~~~~~
nokogumbo.c:119:12: warning: assigning to 'char *' from 'const char [5]' discards qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
        ns = "xml:";
           ^ ~~~~~~
nokogumbo.c:123:12: warning: assigning to 'char *' from 'const char [7]' discards qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
        ns = "xmlns:";
           ^ ~~~~~~~~
nokogumbo.c:160:12: error: use of undeclared identifier 'GUMBO_NODE_TEMPLATE'; did you mean 'GUMBO_TAG_TEMPLATE'?
      case GUMBO_NODE_TEMPLATE:
           ^~~~~~~~~~~~~~~~~~~
           GUMBO_TAG_TEMPLATE
/usr/local/include/gumbo.h:172:3: note: 'GUMBO_TAG_TEMPLATE' declared here
  GUMBO_TAG_TEMPLATE,
  ^
nokogumbo.c:160:12: warning: case value not in enumerated type 'GumboNodeType' [-Wswitch]
      case GUMBO_NODE_TEMPLATE:
           ^
nokogumbo.c:110:19: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (int i=0; i < attrs->length; i++) {
                ~ ^ ~~~~~~~~~~~~~
nokogumbo.c:132:47: warning: comparison of integers of different signs: 'unsigned long' and 'int' [-Wsign-compare]
      if (strlen(ns) + strlen(attr->name) + 1 > namelen) {
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~
nokogumbo.c:138:51: warning: implicit conversion loses integer precision: 'unsigned long' to 'int' [-Wshorten-64-to-32]
        namelen = strlen(ns) + strlen(attr->name) + 1;
                ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
nokogumbo.c:153:19: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (int i=0; i < children->length; i++) {
                ~ ^ ~~~~~~~~~~~~~~~~
9 warnings and 1 error generated.
make: *** [nokogumbo.o] Error 1

make failed, exit code 2

Gem files will remain installed in /Users/alexandre/.rvm/gems/ruby-2.2.0@lylo/gems/nokogumbo-1.4.1 for inspection.
Results logged to /Users/alexandre/.rvm/gems/ruby-2.2.0@lylo/extensions/x86_64-darwin-14/2.2.0/nokogumbo-1.4.1/gem_make.out

Incorporate gumbo fixes from lua-gumbo

lua-gumbo has made a bunch of improvements to gumbo. We should incorporate their fixes!

HTML5 Validation?

@rubys

Do you know if this gem allows HTML5 validation of any sort? I can't seem to get this to work properly; malformed input still reports no errors:

Nokogiri::HTML5('<html text </html').errors
# => []

special characters are escaped in CDATA segments

Examples speak louder than words:

>> puts Nokogiri::HTML5.fragment('<script> if (a < b) alert(1) </script>')
<script> if (a &lt; b) alert(1) </script>

>> puts Nokogiri::HTML5.fragment('<style> .a > .b { color:red; } </style>')
<style> .a &gt; .b { color:red; } </style>

So if we try to parse html, modify a few elements, and output the result, script and style tags will not survive the trip. (Neither will xmp and plaintext tags, but obviously that's less of a problem.) This escaping doesn't occur with Nokogiri::HTML.fragment.

Gumbo crashes found by AFL

I fuzzed our copy of Gumbo and found two crashes, both caught by assertions.

Crash 1

<b><nobr><r><d><r><ol></b><nobr> results in

Assertion failed: (handled), function handle_in_body, file src/parser.c, line 3149.

This one is related to the adoption algorithm.

Crash 2

<d0tx0i0t><option></d0tX0i0t> results in

Assertion failed: (tag != GUMBO_TAG_UNKNOWN), function node_html_tag_is, file src/parser.c, line 636.

I haven't investigated this one.

Flooding of warning saying: "element span: validity error : ID login_btn already defined"

Hi

I'm using nokogumbo as a parser and I'm getting flooded with warnings/messages saying:
element span: validity error : ID login_btn already defined

I boiled down my code and was able to reproduce it with the following code:
require 'open-uri'
require 'nokogumbo'
file = open('http://www.motogp.com/en/Results+Statistics')
doc = Nokogiri::HTML5(file)

The message is printed twice after running the 'doc = Nokogiri::HTML5(file)' command.

Would it be possible to suppress this message/warning?
Thanks for the help!
Cheers

Gentoo (and Centos?) build re-broken

In #38, there's a reversion of a fix for Gentoo linking. I see the argument that making the linkage work makes sense "upstream", but the problem is that nokogumbo is a dependency of sanitize, for example, which we use in Rails apps through Bundler. Making the Gentoo ebuild of nokogumbo work doesn't help when deploying a Rails app to Gentoo.

@svoop suggested in the pull request discussion that users could set flags themselves. If there's a flag to pass during gem install or via bundle config, that'd be fine, but I haven't been able to determine what those flags would be.

Error and blank output when calling `.to_html` on a parsed document containing a certain meta tag

Nokogiri version: 1.8.2
Nokogumbo version: 1.5.0

Problem

Calling to_html on a parsed document containing a certain meta tag results in:

  • a printed (but not raised) error
  • an empty string erroneously being returned

Quick reproduction

input = <<~HTML
  <html>
    <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8; width=device-width">
    </head>
    <body>
      Hello!
    </body>
  </html>
HTML

Nokogiri::HTML5(input).to_html
# output error : unknown encoding utf-8; width=device-width
# => ""
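The error message suggests that everything after charset= is being treated as the encoding name. As a rough sketch of what a more tolerant extraction could look like, the helper below stops at the first semicolon or whitespace, loosely following the WHATWG rules for extracting a character encoding from a meta element. The name charset_from_meta_content is hypothetical, and this is not how Nokogumbo or libxml2 actually handle the attribute.

```ruby
# Hypothetical helper (not part of Nokogumbo or Nokogiri): extract the charset
# label from the value of a meta content attribute.
# Simplified approximation of the WHATWG extraction rules: the label is either
# a quoted string or a run of characters ending at whitespace or a semicolon.
def charset_from_meta_content(content)
  match = content.match(/charset\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s;]+))/i)
  return nil unless match

  (match[1] || match[2] || match[3]).strip.downcase
end

# The content value from the reproduction above.
puts charset_from_meta_content("text/html; charset=utf-8; width=device-width")
# prints utf-8
```

With a label extracted this way, the trailing "width=device-width" would no longer end up in the encoding name.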

Cannot install nokogumbo with ruby 2.3.0

I cannot install nokogumbo 1.4.1-1.4.7 with Ruby 2.3.0, although it installs fine with previous Ruby versions. Here's the log output:

have_library: checking for xmlNewDoc() in -lxml2... -------------------- yes

"gcc -o conftest -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/x86_64-linux -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/ruby/backward -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0 -I. -I/home/musaffa/.rbenv/versions/2.3.0/include     -O3 -fno-fast-math -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wunused-variable -Wpointer-arith -Wwrite-strings -Wdeclaration-after-statement -Wimplicit-function-declaration -Wdeprecated-declarations -Wno-packed-bitfield-compat -std=c99 conftest.c  -L. -L/home/musaffa/.rbenv/versions/2.3.0/lib -Wl,-R/home/musaffa/.rbenv/versions/2.3.0/lib -L. -L/home/musaffa/.rbenv/versions/2.3.0/lib  -fstack-protector -rdynamic -Wl,-export-dynamic     -Wl,-R/home/musaffa/.rbenv/versions/2.3.0/lib -L/home/musaffa/.rbenv/versions/2.3.0/lib -lruby-static  -lpthread -lgmp -ldl -lcrypt -lm   -lc"
checked program was:
/* begin */
1: #include "ruby.h"
2: 
3: int main(int argc, char **argv)
4: {
5:   return 0;
6: }
/* end */

"gcc -o conftest -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/x86_64-linux -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/ruby/backward -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0 -I. -I/home/musaffa/.rbenv/versions/2.3.0/include     -O3 -fno-fast-math -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wunused-variable -Wpointer-arith -Wwrite-strings -Wdeclaration-after-statement -Wimplicit-function-declaration -Wdeprecated-declarations -Wno-packed-bitfield-compat -std=c99 conftest.c  -L. -L/home/musaffa/.rbenv/versions/2.3.0/lib -Wl,-R/home/musaffa/.rbenv/versions/2.3.0/lib -L. -L/home/musaffa/.rbenv/versions/2.3.0/lib  -fstack-protector -rdynamic -Wl,-export-dynamic     -Wl,-R/home/musaffa/.rbenv/versions/2.3.0/lib -L/home/musaffa/.rbenv/versions/2.3.0/lib -lruby-static -lxml2  -lpthread -lgmp -ldl -lcrypt -lm   -lc"
conftest.c: In function ‘t’:
conftest.c:13:57: error: ‘xmlNewDoc’ undeclared (first use in this function)
 int t(void) { void ((*volatile p)()); p = (void ((*)()))xmlNewDoc; return !p; }
                                                         ^
conftest.c:13:57: note: each undeclared identifier is reported only once for each function it appears in
checked program was:
/* begin */
 1: #include "ruby.h"
 2: 
 3: /*top*/
 4: extern int t(void);
 5: int main(int argc, char **argv)
 6: {
 7:   if (argc > 1000000) {
 8:     printf("%p", &t);
 9:   }
10: 
11:   return 0;
12: }
13: int t(void) { void ((*volatile p)()); p = (void ((*)()))xmlNewDoc; return !p; }
/* end */

"gcc -o conftest -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/x86_64-linux -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/ruby/backward -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0 -I. -I/home/musaffa/.rbenv/versions/2.3.0/include     -O3 -fno-fast-math -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wunused-variable -Wpointer-arith -Wwrite-strings -Wdeclaration-after-statement -Wimplicit-function-declaration -Wdeprecated-declarations -Wno-packed-bitfield-compat -std=c99 conftest.c  -L. -L/home/musaffa/.rbenv/versions/2.3.0/lib -Wl,-R/home/musaffa/.rbenv/versions/2.3.0/lib -L. -L/home/musaffa/.rbenv/versions/2.3.0/lib  -fstack-protector -rdynamic -Wl,-export-dynamic     -Wl,-R/home/musaffa/.rbenv/versions/2.3.0/lib -L/home/musaffa/.rbenv/versions/2.3.0/lib -lruby-static -lxml2  -lpthread -lgmp -ldl -lcrypt -lm   -lc"
checked program was:
/* begin */
 1: #include "ruby.h"
 2: 
 3: /*top*/
 4: extern int t(void);
 5: int main(int argc, char **argv)
 6: {
 7:   if (argc > 1000000) {
 8:     printf("%p", &t);
 9:   }
10: 
11:   return 0;
12: }
13: extern void xmlNewDoc();
14: int t(void) { xmlNewDoc(); return 0; }
/* end */

--------------------

package configuration for libxml-2.0 is not found
find_header: checking for nokogiri.h in /home/musaffa/.rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/nokogiri-1.6.7.1/ext/nokogiri... -------------------- no

"gcc -E -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/x86_64-linux -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/ruby/backward -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0 -I. -I/home/musaffa/.rbenv/versions/2.3.0/include     -O3 -fno-fast-math -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wunused-variable -Wpointer-arith -Wwrite-strings -Wdeclaration-after-statement -Wimplicit-function-declaration -Wdeprecated-declarations -Wno-packed-bitfield-compat -std=c99  conftest.c -o conftest.i"
conftest.c:3:22: fatal error: nokogiri.h: No such file or directory
 #include <nokogiri.h>
                      ^
compilation terminated.
checked program was:
/* begin */
1: #include "ruby.h"
2: 
3: #include <nokogiri.h>
/* end */

"gcc -E -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/x86_64-linux -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/ruby/backward -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0 -I. -I/home/musaffa/.rbenv/versions/2.3.0/include     -O3 -fno-fast-math -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wunused-variable -Wpointer-arith -Wwrite-strings -Wdeclaration-after-statement -Wimplicit-function-declaration -Wdeprecated-declarations -Wno-packed-bitfield-compat -std=c99 -I/home/musaffa/.rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/nokogiri-1.6.7.1/ext/nokogiri conftest.c -o conftest.i"
In file included from conftest.c:3:0:
/home/musaffa/.rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/nokogiri-1.6.7.1/ext/nokogiri/nokogiri.h:13:0: warning: "_GNU_SOURCE" redefined [enabled by default]
 #define _GNU_SOURCE
 ^
In file included from /home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/ruby/ruby.h:24:0,
                 from /home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/ruby.h:33,
                 from conftest.c:1:
/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/x86_64-linux/ruby/config.h:17:0: note: this is the location of the previous definition
 #define _GNU_SOURCE 1
 ^
In file included from conftest.c:3:0:
/home/musaffa/.rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/nokogiri-1.6.7.1/ext/nokogiri/nokogiri.h:19:27: fatal error: libxml/parser.h: No such file or directory
 #include <libxml/parser.h>
                           ^
compilation terminated.
checked program was:
/* begin */
1: #include "ruby.h"
2: 
3: #include <nokogiri.h>
/* end */

--------------------

"gcc -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/x86_64-linux -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/ruby/backward -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0 -I. -I/home/musaffa/.rbenv/versions/2.3.0/include     -O3 -fno-fast-math -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wunused-variable -Wpointer-arith -Wwrite-strings -Wdeclaration-after-statement -Wimplicit-function-declaration -Wdeprecated-declarations -Wno-packed-bitfield-compat -std=c99    -Werror -c conftest.c"
checked program was:
/* begin */
1: #include "ruby.h"
2: 
3: int main() {return 0;}
/* end */

have_library: checking for gzdopen() in -lz... -------------------- yes

"gcc -o conftest -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/x86_64-linux -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/ruby/backward -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0 -I. -I/home/musaffa/.rbenv/versions/2.3.0/include     -O3 -fno-fast-math -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wunused-variable -Wpointer-arith -Wwrite-strings -Wdeclaration-after-statement -Wimplicit-function-declaration -Wdeprecated-declarations -Wno-packed-bitfield-compat -std=c99  -g -DXP_UNIX -O3 -Wall -Wcast-qual -Wwrite-strings -Wconversion -Wmissing-noreturn -Winline conftest.c  -L. -L/home/musaffa/.rbenv/versions/2.3.0/lib -Wl,-R/home/musaffa/.rbenv/versions/2.3.0/lib -L. -L/home/musaffa/.rbenv/versions/2.3.0/lib  -fstack-protector -rdynamic -Wl,-export-dynamic    -lxml2  -Wl,-R/home/musaffa/.rbenv/versions/2.3.0/lib -L/home/musaffa/.rbenv/versions/2.3.0/lib -lruby-static -lz -lxml2  -lpthread -lgmp -ldl -lcrypt -lm   -lc "
checked program was:
/* begin */
 1: #include "ruby.h"
 2: 
 3: #include <zlib.h>
 4: 
 5: /*top*/
 6: extern int t(void);
 7: int main(int argc, char **argv)
 8: {
 9:   if (argc > 1000000) {
10:     printf("%p", &t);
11:   }
12: 
13:   return 0;
14: }
15: int t(void) { void ((*volatile p)()); p = (void ((*)()))gzdopen; return !p; }
/* end */

--------------------

have_iconv?: checking for iconv... -------------------- yes

"gcc -o conftest -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/x86_64-linux -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0/ruby/backward -I/home/musaffa/.rbenv/versions/2.3.0/include/ruby-2.3.0 -I. -I/home/musaffa/.rbenv/versions/2.3.0/include     -O3 -fno-fast-math -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wunused-variable -Wpointer-arith -Wwrite-strings -Wdeclaration-after-statement -Wimplicit-function-declaration -Wdeprecated-declarations -Wno-packed-bitfield-compat -std=c99  -g -DXP_UNIX -O3 -Wall -Wcast-qual -Wwrite-strings -Wconversion -Wmissing-noreturn -Winline conftest.c  -L. -L/home/musaffa/.rbenv/versions/2.3.0/lib -Wl,-R/home/musaffa/.rbenv/versions/2.3.0/lib -L. -L/home/musaffa/.rbenv/versions/2.3.0/lib  -fstack-protector -rdynamic -Wl,-export-dynamic    -lxml2  -Wl,-R/home/musaffa/.rbenv/versions/2.3.0/lib -L/home/musaffa/.rbenv/versions/2.3.0/lib -lruby-static  -lpthread -lgmp -ldl -lcrypt -lm   -lc "
checked program was:
/* begin */
 1: #include "ruby.h"
 2: 
 3: #include <stdlib.h>
 4: #include <iconv.h>
 5: 
 6: int main(void)
 7: {
 8:     iconv_t cd = iconv_open("", "");
 9:     iconv(cd, NULL, NULL, NULL, NULL);
10:     return EXIT_SUCCESS;
11: }
/* end */

--------------------

Use `Nokogiri::XML::Node#line=` to set the line number on nodes

If sparklemotion/nokogiri#1918 (or something like it) is accepted, then https://github.com/rubys/nokogumbo/blob/master/ext/nokogumbo/nokogumbo.c#L320 should be modified to use #line= if the node supports it.

This can be tested using rb_respond_to but apparently there's a bug on Ruby 2.3 https://bugs.ruby-lang.org/issues/12907 that causes that to always return true.

  • sparklemotion/nokogiri#1918 merged
  • Nokogiri v1.11.0 (which should have #line=)
  • Investigate the Ruby 2.3 issue (Ruby 2.3 EOL'd earlier this year so maybe it's fine to ignore this)
  • Use #line= if the node supports it
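For illustration, here is the "use #line= if the node supports it" check written at the Ruby level rather than with rb_respond_to in C. The two Struct classes are stand-ins for nodes from Nokogiri versions with and without the setter, and assign_line is a hypothetical helper name; the real change would live in nokogumbo.c.

```ruby
# Stand-ins for illustration only: one node type with a line setter, one without.
NodeWithLine    = Struct.new(:name, :line)
NodeWithoutLine = Struct.new(:name)

# Hypothetical helper: record the line number only when the node exposes #line=.
# Returns true if the line was set, false if the node does not support it.
def assign_line(node, line)
  return false unless node.respond_to?(:line=)

  node.line = line
  true
end

new_node = NodeWithLine.new("p")
old_node = NodeWithoutLine.new("p")

assign_line(new_node, 42) # sets the line
assign_line(old_node, 42) # silently skipped, no NoMethodError
puts new_node.line.inspect
```

The point of the guard is that the same extension keeps working against older Nokogiri releases that lack the setter.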

Encoding Issue

It looks like the Nokogumbo parser is messing with the encoding more than it should be. The backspace escape character \b (U+0008) is getting converted to an unknown char (�) somewhere along the line, which doesn't happen in Nokogiri. I'm not sure if this is specific to this library, or to the gumbo parser in general, but I'm starting here for now.

# Nokogiri::HTML
> Nokogiri::HTML.parse("<p>※送信締切 2016年5月10日(\b火曜日)AM10:00\n</p>")
=> #<Nokogiri::HTML::Document:0x3ff74eae8758 name="document" children=[#<Nokogiri::XML::DTD:0x3ff74eae8438 name="html">, #<Nokogiri::XML::Element:0x3ff74eae81b8 name="html" children=[#<Nokogiri::XML::Element:0x3ff74eae7fb0 name="body" children=[#<Nokogiri::XML::Element:0x3ff74eae7dd0 name="p" children=[#<Nokogiri::XML::Text:0x3ff74eae7bf0 "※送信締切 2016年5月10日(火曜日)AM10:00\n">]>]>]>]>

Notice in the last text element, the content is fine
#<Nokogiri::XML::Text:0x3ff74eae7bf0 "※送信締切 2016年5月10日(火曜日)AM10:00\n">

# Nokogiri::HTML5
> Nokogiri::HTML5.parse("<p>※送信締切 2016年5月10日(\b火曜日)AM10:00\n</p>")
=> #<Nokogiri::HTML::Document:0x3ff74eae11c4 name="document" children=[#<Nokogiri::XML::Element:0x3ff74eae0ea4 name="html" children=[#<Nokogiri::XML::Element:0x3ff74eae0cc4 name="head">, #<Nokogiri::XML::Element:0x3ff74eae0af8 name="body" children=[#<Nokogiri::XML::Element:0x3ff74eae0918 name="p" children=[#<Nokogiri::XML::Text:0x3ff74eae0698 "※送信締切 2016年5月10日(�火曜日)AM10:00\n">]>]>]>]>

Notice in the last text element, the content is corrupt
#<Nokogiri::XML::Text:0x3ff74eae0698 "※送信締切 2016年5月10日(�火曜日)AM10:00\n">
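As a possible workaround while this is open, the control character could be removed before the string reaches the parser. This is only a sketch: strip_control_chars is a hypothetical helper, and whether dropping these characters is acceptable depends on the application.

```ruby
# Hypothetical pre-processing step (not part of Nokogumbo): drop C0 control
# characters other than tab, line feed, form feed and carriage return, plus
# DEL, before handing the string to the parser.
def strip_control_chars(text)
  text.gsub(/[\x00-\x08\x0B\x0E-\x1F\x7F]/, "")
end

raw = "<p>(\b火曜日)</p>"

puts "\b".ord                  # the escape is U+0008 (backspace)
puts strip_control_chars(raw)  # same markup without the control character
```

The cleaned string can then be passed to Nokogiri::HTML5.parse as usual.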

Nokogumbo crashes on nil input

irb(main):004:0> Nokogiri::HTML5(nil)
/var/lib/gems/1.9.1/gems/nokogumbo-1.1.2/lib/nokogumbo.rb:24: [BUG] Segmentation fault
ruby 1.9.3p448 (2013-06-27 revision 41675) [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0026 p:---- s:0094 b:0094 l:000093 d:000093 CFUNC  :parse
c:0025 p:0094 s:0090 b:0090 l:000089 d:000089 METHOD /var/lib/gems/1.9.1/gems/nokogumbo-1.1.2/lib/nokogumbo.rb:24
c:0024 p:0021 s:0086 b:0086 l:000085 d:000085 METHOD /var/lib/gems/1.9.1/gems/nokogumbo-1.1.2/lib/nokogumbo.rb:8
c:0023 p:0016 s:0082 b:0082 l:001fb8 d:000081 EVAL   (irb):4
c:0022 p:---- s:0080 b:0080 l:000079 d:000079 FINISH
c:0021 p:---- s:0078 b:0078 l:000077 d:000077 CFUNC  :eval
c:0020 p:0028 s:0071 b:0071 l:000070 d:000070 METHOD /usr/lib/ruby/1.9.1/irb/workspace.rb:80
c:0019 p:0033 s:0064 b:0063 l:000062 d:000062 METHOD /usr/lib/ruby/1.9.1/irb/context.rb:254
c:0018 p:0031 s:0058 b:0058 l:001b48 d:000057 BLOCK  /usr/lib/ruby/1.9.1/irb.rb:159
c:0017 p:0042 s:0050 b:0050 l:000049 d:000049 METHOD /usr/lib/ruby/1.9.1/irb.rb:273
c:0016 p:0011 s:0045 b:0045 l:001b48 d:000044 BLOCK  /usr/lib/ruby/1.9.1/irb.rb:156
c:0015 p:0144 s:0041 b:0041 l:000024 d:000040 BLOCK  /usr/lib/ruby/1.9.1/irb/ruby-lex.rb:243
c:0014 p:---- s:0038 b:0038 l:000037 d:000037 FINISH
c:0013 p:---- s:0036 b:0036 l:000035 d:000035 CFUNC  :loop
c:0012 p:0009 s:0033 b:0033 l:000024 d:000032 BLOCK  /usr/lib/ruby/1.9.1/irb/ruby-lex.rb:229
c:0011 p:---- s:0031 b:0031 l:000030 d:000030 FINISH
c:0010 p:---- s:0029 b:0029 l:000028 d:000028 CFUNC  :catch
c:0009 p:0023 s:0025 b:0025 l:000024 d:000024 METHOD /usr/lib/ruby/1.9.1/irb/ruby-lex.rb:228
c:0008 p:0046 s:0022 b:0022 l:001b48 d:001b48 METHOD /usr/lib/ruby/1.9.1/irb.rb:155
c:0007 p:0011 s:0019 b:0019 l:000fd8 d:000018 BLOCK  /usr/lib/ruby/1.9.1/irb.rb:70
c:0006 p:---- s:0017 b:0017 l:000016 d:000016 FINISH
c:0005 p:---- s:0015 b:0015 l:000014 d:000014 CFUNC  :catch
c:0004 p:0183 s:0011 b:0011 l:000fd8 d:000fd8 METHOD /usr/lib/ruby/1.9.1/irb.rb:69
c:0003 p:0039 s:0006 b:0006 l:0008b8 d:0011c8 EVAL   /usr/bin/irb:12
c:0002 p:---- s:0004 b:0004 l:000003 d:000003 FINISH
c:0001 p:0000 s:0002 b:0002 l:0008b8 d:0008b8 TOP   

Doesn’t handle Unicode signature/byte-order-mark

The parser reports the following unexpected errors when the input starts with a Unicode BOM/signature.

1:1: ERROR: Expected a doctype token
<!DOCTYPE html>
^
1:2: ERROR: This is not a legal doctype
<!DOCTYPE html>
 ^

Expected behaviour: Check the first bytes of the document and detect BOM byte sequence. Set the document encoding to the encoding indicated by the BOM sequence (e.g. UTF-8 or UTF-16 LE). Strip the BOM sequence and proceed with parsing the document as normal.

https://encoding.spec.whatwg.org/#decode
https://html.spec.whatwg.org/#writing
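The expected behaviour described above could be sketched in plain Ruby roughly as follows. The strip_bom helper and the BOMS table are hypothetical names used for illustration, not part of Nokogumbo; the check order (UTF-8, then UTF-16BE, then UTF-16LE) follows the Encoding Standard linked above.

```ruby
# Hypothetical lookup table: BOM byte sequences and the encodings they signal.
BOMS = {
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFE\xFF".b     => Encoding::UTF_16BE,
  "\xFF\xFE".b     => Encoding::UTF_16LE
}.freeze

# Hypothetical helper: detect a leading BOM, strip it, and tag the remainder
# with the encoding the BOM indicates. Returns [string, encoding_or_nil].
def strip_bom(input)
  bytes = input.b # binary copy, so the comparison ignores the declared encoding
  BOMS.each do |bom, encoding|
    next unless bytes.start_with?(bom)

    body = bytes.byteslice(bom.bytesize..-1).force_encoding(encoding)
    return [body, encoding]
  end
  [input, nil] # no BOM found; leave the input untouched
end

html, encoding = strip_bom("\xEF\xBB\xBF<!DOCTYPE html>\n<html></html>")
puts encoding.inspect
puts html.inspect
```

The returned string could then be passed to the parser, with the detected encoding used in place of any guessed one.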

Some test cases:

UTF-8 signature mark:

Nokogiri::HTML5.parse(
  "\xEF\xBB\xBF<!DOCTYPE html>\n<html></html>".
  force_encoding('UTF-8'),
  max_errors: 10).
errors.each { |err| puts(err) }

UTF-16 (BE) byte-order-mark:

Nokogiri::HTML5.parse(
    "\xFE\xFF".force_encoding('UTF-16BE') +
    "<!DOCTYPE html>\n<html></html>".
    encode('UTF-16BE', 'UTF-8'),
    max_errors: 10).
errors.each { |err| puts(err) }

UTF-16 (LE) byte-order-mark:

Nokogiri::HTML5.parse(
    "\xFF\xFE".force_encoding('UTF-16LE') +
    "<!DOCTYPE html>\n<html></html>".
    encode('UTF-16LE', 'UTF-8'),
    max_errors: 10).
errors.each { |err| puts(err) }

install failed on gentoo

ruby version: ruby 2.4.4p296 (2018-03-28 revision 63013) [x86_64-linux]

gumbo version(portage): 0.10.1

I can install nokogumbo with portage, but cannot install it with gem or bundle.

I tried reading the ebuild file to work out how to configure bundle, but I have no idea what to set.

I also tried adding --with-ldflags=-Wl,--no-undefined, but it fails as well.

$ gem install nokogumbo
Building native extensions. This could take a while...
ERROR:  Error installing nokogumbo:
        ERROR: Failed to build gem native extension.

    current directory: /home/git/.gem/ruby/2.4.0/gems/nokogumbo-1.5.0/ext/nokogumboc
/usr/bin/ruby24 -r ./siteconf20180628-12916-1nvqzxo.rb extconf.rb
checking for xmlNewDoc() in -lxml2... yes
checking for nokogiri.h in /usr/lib64/ruby/gems/2.4.0/gems/nokogiri-1.8.1/ext/nokogiri... yes
checking for nokogiri.h in /usr/lib64/ruby/gems/2.4.0/gems/nokogiri-1.8.1/ext/nokogiri... yes
checking for gumbo_parse() in -lgumbo... yes
checking for GumboErrorType with error.h... not found
checking for GumboInsertionMode with insertion_mode.h... not found
checking for GumboParser with parser.h... not found
checking for GumboStringBuffer with string_buffer.h... not found
checking for GumboTokenType with token_type.h... not found
creating Makefile

current directory: /home/git/.gem/ruby/2.4.0/gems/nokogumbo-1.5.0/ext/nokogumboc
make "DESTDIR=" clean

current directory: /home/git/.gem/ruby/2.4.0/gems/nokogumbo-1.5.0/ext/nokogumboc
make "DESTDIR="
compiling nokogumbo.c
nokogumbo.c:24:10: fatal error: parser.h: No such file or directory
 #include "parser.h"
          ^~~~~~~~~~
compilation terminated.
make: *** [Makefile:242: nokogumbo.o] Error 1

make failed, exit code 2

Gem files will remain installed in /home/git/.gem/ruby/2.4.0/gems/nokogumbo-1.5.0 for inspection.
Results logged to /home/git/.gem/ruby/2.4.0/extensions/x86_64-linux/2.4.0/nokogumbo-1.5.0/gem_make.out
