
gokogiri's Introduction

Gokogiri

LibXML bindings for the Go programming language.

By Zhigang Chen and Hampton Catlin

This is a major rewrite from v0 in the following places:

  • Separation of XML and HTML
  • Put more burden of memory allocation/deallocation on Go
  • Fragment parsing -- no more deep-copy
  • Serialization
  • Some API adjustments

Installation

# Linux
sudo apt-get install libxml2-dev
# Mac
brew install libxml2

go get github.com/moovweb/gokogiri

Running tests

go test github.com/moovweb/gokogiri/...

Basic example

package main

import (
  "io/ioutil"
  "log"
  "net/http"

  "github.com/moovweb/gokogiri"
)

func main() {
  // fetch and read a web page
  resp, err := http.Get("http://www.google.com")
  if err != nil {
    log.Fatal(err)
  }
  defer resp.Body.Close()

  page, err := ioutil.ReadAll(resp.Body)
  if err != nil {
    log.Fatal(err)
  }

  // parse the web page
  doc, err := gokogiri.ParseHtml(page)
  if err != nil {
    log.Fatal(err)
  }
  // important -- don't forget to free the resources when you're done!
  defer doc.Free()

  // perform operations on the parsed page -- consult the tests for examples
}

gokogiri's People

Contributors

afeld, elimisteve, hamptonmakes, jbowtie, jehiah, joelreymont, mattkanwisher, mdayaram, mier85, mshafrir, sjezewski, voxxit


gokogiri's Issues

Node Line Numbers

It would be really useful if programs detecting errors in their own parsing (as opposed to the underlying XML parsing) could report line numbers for the error by fetching a Node object's position information. E.g., if a required attribute is missing on a node, I want to report the line that node began on so the user can find it. Looking at the pkg docs and sources I haven't come across any functionality to get that information, but please correct me if wrong.

Gokogiri fills what I think is otherwise a huge hole in Go's pkg library (DOM style XML parsing), thanks!

gokogiri on Windows: full steps

1. Install tdm64-gcc (http://downloads.sourceforge.net/project/tdm-gcc/TDM-GCC%20Installer/tdm64-gcc-4.8.1.exe?r=http%3A%2F%2Ftdm-gcc.tdragon.net%2F&ts=1380861880&use_mirror=superb-dca2)

2. Add the gcc path to $PATH (......tdm64-gcc\bin)

3. Install GIMP 2 (http://jaist.dl.sourceforge.net/project/gimp-win/GIMP%20%2B%20GTK%2B%20%28stable%20release%29/GIMP%202.8.6/gimp-2.8.6-setup.exe)

4. Unzip pkg-config to C:\Program Files\GIMP 2\bin (http://ftp.acc.umu.se/pub/gnome/binaries/win64/dependencies/pkg-config_0.23-2_win64.zip)

4.5. Add C:\Program Files\GIMP 2\bin to $PATH

5. Set $PKG_CONFIG_PATH (e.g. e:\mygo\clib)

6. Download
ftp://ftp.zlatkovic.com/libxml/64bit/libxml-2.9.1-win32-x86_64.7z
ftp://ftp.zlatkovic.com/libxml/64bit/iconv-1.14-win32-x86_64.7z
and unzip both to e:\mygo\clib

The directory tree should look like this:

E:\mygo\clib
├─bin
├─include
│  └─libxml2
│      └─libxml
├─lib
│  └─pkgconfig
└─share
    ├─aclocal
    ├─doc
    │  ├─libiconv
    │  └─libxml2-2.9.1
    │      ├─examples
    │      └─html
    │          ├─html
    │          └─tutorial
    │              └─images
    │                  └─callouts
    ├─gtk-doc
    │  └─html
    │      └─libxml2
    └─man
        ├─man1
        └─man3

Edit the paths in libxml-2.0.pc, then copy it to your $PKG_CONFIG_PATH:

prefix=e:/mygo/clib
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include
modules=1

Name: libXML
Version: 2.9.1
Description: libXML library version2.
Requires:
Libs: -L${libdir} -lxml2
Libs.private: -L/usr/local/lib -lz -L/usr/local/lib -liconv -lws2_32
Cflags: -I${includedir}/libxml2 -I/usr/local/include

Done.


To test:

cmd: gcc --version
output:
gcc (rev5, Built by TDM64-GCC project) 4.8.1
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

cmd: pkg-config --libs libxml-2.0
output: -Le:/mygo/clib/lib -lxml2


Now you can run:
go get github.com/moovweb/gokogiri


One last step, and an important one:
copy the DLLs from your clib\bin directory into the directory of your Go executable.

Figuring this out cost me a week. See #47.

Thank you for your help, mdayaram.

Sorry for my poor English. I hope this helps someone.

El capitan issue

Any ideas how to fix this after el capitan upgrade?

[11:53:11] Error (go install): # github.com/blank/feeds/vendor/github.com/moovweb/gokogiri/help
[11:53:11] Error (go install): vendor/github.com/moovweb/gokogiri/help/help.go:6:10: fatal error: 'libxml/tree.h' file not found
[11:53:11] Error (go install): #include <libxml/tree.h>
[11:53:11] Error (go install):          ^
[11:53:11] Error (go install): 1 error generated.

Issues with go get github.com/moovweb/gokogiri/css

Trying to install this library and getting some error messages. Any help would be great!

$ go get github.com/moovweb/gokogiri/css
# github.com/moovweb/rubex
../github.com/moovweb/rubex/chelper.c:161:43: warning: passing 'const OnigUChar *' (aka 'const unsigned char *') to parameter of type 'const char *' converts between pointers to integer types with different sign [-Wpointer-sign]
/usr/include/secure/_string.h:119:34: note: expanded from macro 'strncpy'
# github.com/moovweb/rubex
ld: warning: ignoring file /usr/local/lib/libonig.dylib, file was built for x86_64 which is not the architecture being linked (i386): /usr/local/lib/libonig.dylib
Undefined symbols for architecture i386:
  "_OnigDefaultSyntax", referenced from:
      _NewOnigRegex in chelper.o
  "_OnigEncodingUTF8", referenced from:
      _NewOnigRegex in chelper.o
  "_onig_error_code_to_str", referenced from:
      _NewOnigRegex in chelper.o
      _SearchOnigRegex in chelper.o
  "_onig_foreach_name", referenced from:
      _GetCaptureNames in chelper.o
  "_onig_free", referenced from:
      __cgo_5dffe3ce5698_Cfunc_onig_free in regex.cgo2.o
     (maybe you meant: __cgo_5dffe3ce5698_Cfunc_onig_free)
  "_onig_match", referenced from:
      _MatchOnigRegex in chelper.o
  "_onig_name_to_backref_number", referenced from:
      _LookupOnigCaptureByName in chelper.o
  "_onig_new", referenced from:
      _NewOnigRegex in chelper.o
  "_onig_number_of_captures", referenced from:
      __cgo_5dffe3ce5698_Cfunc_onig_number_of_captures in regex.cgo2.o
     (maybe you meant: __cgo_5dffe3ce5698_Cfunc_onig_number_of_captures)
  "_onig_number_of_names", referenced from:
      __cgo_5dffe3ce5698_Cfunc_onig_number_of_names in regex.cgo2.o
     (maybe you meant: __cgo_5dffe3ce5698_Cfunc_onig_number_of_names)
  "_onig_region_free", referenced from:
      __cgo_5dffe3ce5698_Cfunc_onig_region_free in regex.cgo2.o
     (maybe you meant: __cgo_5dffe3ce5698_Cfunc_onig_region_free)
  "_onig_region_new", referenced from:
      _NewOnigRegex in chelper.o
  "_onig_search", referenced from:
      _SearchOnigRegex in chelper.o
ld: symbol(s) not found for architecture i386
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Crash with custom XPath resolver

There seems to be some memory corruption happening when a custom XPath resolver is set with XPath.SetResolver(VariableScope):

unexpected fault address 0x676e69727493
fatal error: fault
[signal 0xb code=0x1 addr=0x676e69727493 pc=0x4a3c79]

goroutine 9 [running, locked to thread]:
runtime.throw(0x58c660, 0x5)
        /usr/local/go/src/runtime/panic.go:527 +0x90 fp=0xc820036ab8 sp=0xc820036aa0
runtime.sigpanic()
        /usr/local/go/src/runtime/sigpanic_unix.go:27 +0x2ab fp=0xc820036b08 sp=0xc820036ab8
github.com/moovweb/gokogiri/xpath.go_can_resolve_function(0xc820091470, 0x1305830, 0x0, 0x0)
        /home/ubuntu/gopath/src/github.com/moovweb/gokogiri/xpath/util.go:114 +0x99 fp=0xc820036b70 sp=0xc820036b08

I've reduced the repro to the minimal source I could, see crashtest.go and the full crash log for Linux with 4 goroutines at https://gist.github.com/magnushiie/4587ba63b3a37bc9e76e

I could reproduce the crash on both Linux and Windows, using various libxml versions. On Windows I got crashes even when I changed the parallelism to 1 (the crash location was then in Go runtime semaphore code); on Linux I couldn't reproduce this with less than 4 goroutines.

The crash does not happen when the SetResolver line is commented out in the source.

The crash usually happens at the above-mentioned location (go_can_resolve_function), when it's trying to dereference the address of the IsFunctionRegistered VariableScope interface method, but the interface pointer is corrupt. In one debugging session, it contained an 8-byte character string that was the result of the XPath expression.
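
One common way to keep C code from holding a pointer to a Go interface value that the garbage collector does not know about is to hand C an opaque integer handle and look the Go value up again inside the callback. The sketch below is not gokogiri's code, and whether it would resolve the crash described above is an assumption. The names registry, register, lookup, release, and funcSet are hypothetical, and VariableScope is reduced to a single method for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// VariableScope stands in for the resolver interface discussed above.
// Reducing it to one method is a simplification made for this sketch.
type VariableScope interface {
	IsFunctionRegistered(name, ns string) bool
}

// registry maps opaque integer handles to Go values so that only the
// integer, never a Go pointer, has to be stored on the C side.
type registry struct {
	mu     sync.Mutex
	next   uintptr
	values map[uintptr]VariableScope
}

func newRegistry() *registry {
	return &registry{values: make(map[uintptr]VariableScope)}
}

// register stores v and returns a handle; the map entry keeps v reachable.
func (r *registry) register(v VariableScope) uintptr {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.next++
	r.values[r.next] = v
	return r.next
}

// lookup returns the value for a handle, as an exported callback would.
func (r *registry) lookup(h uintptr) (VariableScope, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	v, ok := r.values[h]
	return v, ok
}

// release drops the handle, for example when the XPath context is freed.
func (r *registry) release(h uintptr) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.values, h)
}

// funcSet is a trivial VariableScope used only for this demonstration.
type funcSet map[string]bool

func (f funcSet) IsFunctionRegistered(name, ns string) bool { return f[name] }

func main() {
	reg := newRegistry()
	h := reg.register(funcSet{"upper": true})
	if scope, ok := reg.lookup(h); ok {
		fmt.Println(scope.IsFunctionRegistered("upper", ""))
	}
	reg.release(h)
}
```

The mutex matters here because the report involves several goroutines evaluating XPath expressions at once.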

windows(mingw) duplicate symbol reference err

I got this error when using this library:
C:\Local\Go\local\pkg\windows_386/github.com/moovweb/gokogiri/xml.a(_all.o): duplicate symbol reference: __mingw_snprintf in both github.com/moovweb/gokogiri/html(.text) and github.com/moovweb/gokogi

Running go get ... works fine.

Sax

Do you guys have any plans for SAX? We are starting to need it and might implement it, so I wanted to check whether you had already started on it or had any ideas or concerns.

cannot build, test, or install gokogiri

I've tried several avenues, including what's detailed in the README. Here are the steps I took:

hobbsc@ea:~/incoming/gokogiri 1014:0% make test
make: *** No rule to make target `test'.  Stop.

hobbsc@ea:~/incoming/gokogiri 1015:2% go build
gokogiri.go:4:2: import "gokogiri/html": cannot find package
gokogiri.go:5:2: import "gokogiri/xml": cannot find package

hobbsc@ea:~/incoming/gokogiri 1016:1% go get github.com/moovweb/gokogiri
# pkg-config --cflags libxml-2.0 libxml-2.0
exec: "pkg-config": executable file not found in $PATH

hobbsc@ea:~/incoming/gokogiri 1017:2% make install
make: *** No rule to make target `install'.  Stop.

hobbsc@ea:~/incoming/gokogiri 1018:2% go test
gokogiri.go:4:2: import "gokogiri/html": cannot find package
gokogiri.go:5:2: import "gokogiri/xml": cannot find package

TestDisableOutputEscaping fails in Darwin

Not sure why, seems to work fine on other platforms (windows and linux included).

Below is the output:

gokogiri/xml $ go test .

Testing: Basic Parsing [....]

All (4) tests passed!

Testing: Buffered Parsing [....]

All (4) tests passed!
--- FAIL: TestDisableOutputEscaping (0.00 seconds)
    node_test.go:364: TestDisableOutputEscaping (escaping disabled) Expected: <br/>
        Actual: &lt;br/&gt;
FAIL
FAIL    github.com/moovweb/gokogiri/xml 0.134s

go1 branch

Is the go1 branch still needed?

It was last updated 7 months ago.

go get pulls it instead of master and we don't get the latest version of the code.

would like to develop but cannot build

This has nothing to do with the go1 branch.

biggie:golang joelr$ cd /tmp
biggie:tmp joelr$ git clone git://github.com/moovweb/gokogiri.git
Cloning into 'gokogiri'...
remote: Counting objects: 3502, done.
remote: Compressing objects: 100% (1721/1721), done.
remote: Total 3502 (delta 1845), reused 3351 (delta 1712)
Receiving objects: 100% (3502/3502), 1.18 MiB | 140 KiB/s, done.
Resolving deltas: 100% (1845/1845), done.
biggie:tmp joelr$ cd gokogiri/
biggie:gokogiri joelr$ go build
# gokogiri/xpath
expression.go:4:26: error: libxml/xpath.h: No such file or directory
expression.go:5:35: error: libxml/xpathInternals.h: No such file or directory
biggie:gokogiri joelr$ git branch
* master

performance issue of XmlNode.SetContent()

Hi,

Currently, XmlNode.SetContent() accepts either a string or a []byte parameter for convenience:

func (xmlNode *XmlNode) SetContent(content interface{}) (err error) {
    switch data := content.(type) {
    default:
        err = ERR_UNDEFINED_SET_CONTENT_PARAM
    case string:
        err = xmlNode.SetContent([]byte(data))
    case []byte:
        contentBytes := GetCString(data)
        contentPtr := unsafe.Pointer(&contentBytes[0])
        C.xmlSetContent(unsafe.Pointer(xmlNode), unsafe.Pointer(xmlNode.Ptr), contentPtr)
    }
    return
}

But the content.(type) switch and the extra function call may cost additional resources. Below is my benchmark code:

package fakeOverload

import "testing"

func SetContent(content interface{}) {
    switch data := content.(type) {
    case string:
        SetContent([]byte(data))
    case []byte:
        _ = content
    }
}

func SetContentBytes(bytes []byte) {
    _ = bytes
}

func BenchmarkSetContent(t *testing.B) {
    str := "test"
    for i := 0; i < t.N; i++ {
        SetContent(str)
    }
}

func BenchmarkSetContentBytes(t *testing.B) {
    str := "test"
    for i := 0; i < t.N; i++ {
        SetContentBytes([]byte(str))
    }
}

Below is the result:

PASS
testing: warning: no tests to run
BenchmarkSetContent         50000000           260 ns/op          58 B/op          2 allocs/op
BenchmarkSetContentBytes    200000000           37.9 ns/op         8 B/op          0 allocs/op
ok      _/home/xxxx/benchmark/testFakeOverloadFunc  24.776s

We can see that SetContent() takes nearly 7 times as long and uses about 7 times as much memory per operation as SetContentBytes().

How about using two functions instead:

SetContent(content string)
SetContentBytes(content []byte)

Thank you! :)
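
A minimal sketch of the two-function split suggested above, assuming a hypothetical in-memory node type in place of XmlNode so that it runs without libxml2. The string variant is a thin wrapper, so callers that already hold a []byte skip both the type switch and the extra conversion.

```go
package main

import "fmt"

// node is a hypothetical stand-in for XmlNode; it keeps the content in
// memory instead of calling into libxml2.
type node struct {
	content []byte
}

// SetContentBytes is the primary entry point and needs no type switch.
func (n *node) SetContentBytes(content []byte) {
	// Copy the bytes so later changes by the caller do not affect the node.
	n.content = append(n.content[:0], content...)
}

// SetContent is a convenience wrapper for callers that hold a string.
func (n *node) SetContent(content string) {
	n.SetContentBytes([]byte(content))
}

func main() {
	n := &node{}
	n.SetContent("hello")
	fmt.Println(string(n.content))
	n.SetContentBytes([]byte("world"))
	fmt.Println(string(n.content))
}
```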

Namespace support

Dear team,

thanks for the great library. It nicely solves my xml problem that involves processing xml in an unknown format.

After having a closer look at the sources, I find that namespace support is not available for attributes. I'm thinking about implementing this missing piece, unless you are already in the middle of doing so.

Thanks for your feedback

Jochen

Consider making xmlNode.Serialize public

While the defaults for ToXml and ToHtml are sensible, I need to be able to specify the format flags to match various standards. Rather than proliferate To* functions, it seems reasonable to make Serialize part of the xml.Node interface.

memory leak under heavy load

I am parsing around 200-300 3 KB HTML snippets per second, which in itself proves how cool your lib is ;)
Sadly, it's leaking memory at around 1-2 MB/min. The leak is not constant, though, so I am guessing it could be some kind of error while parsing.

If I can help you fix this, let me know.

thx,

paul

Adding an existing node to a new document

Consider this document:

<body xmlns='http://jabber.org/protocol/httpbind' 
        xmlns:stream='http://etherx.jabber.org/streams' to='127.0.0.1' rid='3' sid='158465d2549542c9'>
    <iq id='2' type='get'>
        <query xmlns='jabber:iq:auth'>
            <username>...</username>
        </query>
    </iq>
</body>

I would like to extract the inner element and put it in a new document.

I want to do this as quickly as possible and I don't know in advance the name of the inner element.

How do I do this?

go get support

I should be able to install the library with go get github.com/moovweb/gokogiri, but this doesn't work. Installing individual parts, like go get github.com/moovweb/gokogiri/html, fails because the import paths are not consistent with each other.

I like gb but only for applications. Libraries should support go get

xmlFreeChars error (on Windows)

Hi,

When I make a call like this:

imgnode.Attr("src")

or

imgnode.Attribute("src").Value()

I get the following error:

unexpected fault address 0x137425ff
throw: fault
[signal 0xc0000005 code=0x0 addr=0x137425ff pc=0x137425ff]

goroutine 1 [syscall]:
github.com/moovweb/gokogiri/xml._Cfunc_xmlFreeChars(0x37a570, 0x201319a0)
        C:/Users/sOph/AppData/Local/Temp/go-build750256355/github.com/moovweb/gokogiri/xml/_obj/_cgo_defun.c:106 +0x31
github.com/moovweb/gokogiri/xml.(*XmlNode).Content(0x20132d30, 0x201319a0, 0x66)
        document.go:842 +0x75
github.com/moovweb/gokogiri/xml.(*AttributeNode).Value(0x200e0948, 0x20155dc0, 0x5cf2e4)
        C:/Go/src/pkg/github.com/moovweb/gokogiri/xml/attribute.go:12 +0x2a

Similar errors also happen when running the tests with go test github.com/moovweb/gokogiri/xml.

The only way to access the attributes right now seems to be something like:

var attribute string
attribute = string((*imgnode.Attribute("src")).ToBuffer(nil))
attribute = strings.Trim(attribute[strings.Index(attribute, "=")+1:], "\"' ")

XSD Validation?

Any chance of adding XSD validation?

Or, if I added it myself (I'm so new to C and cgo it's a bit crazy), what are the odds it would be merged back in?

compilation errors/warnings

When I cross compile my app using gokogiri, I get the following messages:

compiling for linux-386
# github.com/moovweb/gokogiri/xml
../../../github.com/moovweb/gokogiri/xml/attribute.go:4: undefined: XmlNode
../../../github.com/moovweb/gokogiri/xml/cdata.go:4: undefined: XmlNode
../../../github.com/moovweb/gokogiri/xml/element.go:4: undefined: XmlNode
../../../github.com/moovweb/gokogiri/xml/nodeset.go:6: undefined: Document
../../../github.com/moovweb/gokogiri/xml/nodeset.go:7: undefined: Node
../../../github.com/moovweb/gokogiri/xml/nodeset.go:11: undefined: Document
../../../github.com/moovweb/gokogiri/xml/nodeset.go:16: undefined: Node
../../../github.com/moovweb/gokogiri/xml/nodeset.go:20: undefined: Node
../../../github.com/moovweb/gokogiri/xml/nodeset.go:22: undefined: NewNode
../../../github.com/moovweb/gokogiri/xml/text.go:4: undefined: XmlNode
../../../github.com/moovweb/gokogiri/xml/nodeset.go:22: too many errors
compiling for linux-amd64
# github.com/moovweb/gokogiri/xml
../../../github.com/moovweb/gokogiri/xml/attribute.go:4: undefined: XmlNode
../../../github.com/moovweb/gokogiri/xml/cdata.go:4: undefined: XmlNode
../../../github.com/moovweb/gokogiri/xml/element.go:4: undefined: XmlNode
../../../github.com/moovweb/gokogiri/xml/nodeset.go:6: undefined: Document
../../../github.com/moovweb/gokogiri/xml/nodeset.go:7: undefined: Node
../../../github.com/moovweb/gokogiri/xml/nodeset.go:11: undefined: Document
../../../github.com/moovweb/gokogiri/xml/nodeset.go:16: undefined: Node
../../../github.com/moovweb/gokogiri/xml/nodeset.go:20: undefined: Node
../../../github.com/moovweb/gokogiri/xml/nodeset.go:22: undefined: NewNode
../../../github.com/moovweb/gokogiri/xml/text.go:4: undefined: XmlNode
../../../github.com/moovweb/gokogiri/xml/nodeset.go:22: too many errors

Everything works well for darwin tho.

development build

There's got to be something simple that escapes me, but how do you build during development?

go build
# gokogiri/xpath
expression.go:4:26: error: libxml/xpath.h: No such file or directory
expression.go:5:35: error: libxml/xpathInternals.h: No such file or directory

The above doesn't work for me, although go get github.com/.../ works fine.

Encoding support

Gokogiri doesn't seem to support the encoding of some pages, although http://www.xmlsoft.org/encoding.html claims libxml will use iconv on unix systems.
Here's a small test:

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "github.com/moovweb/gokogiri"
)

func get(url string) []byte {
    r, err := http.Get(url)
    if err != nil { panic(err) }
    body, err := ioutil.ReadAll(r.Body)
    if err != nil { panic(err) }
    return body
}

func main() {
    buf := get("http://bbs.chinaunix.net/thread-4080291-1-1.html")
    doc, err := gokogiri.ParseHtml(buf)
    if err != nil { panic(err) }
    fmt.Println("MetaEncoding:", doc.MetaEncoding())
    title, _ := doc.Search("//title")
    fmt.Println(title[0].Content())
}

Output:

~/gtest > go run gokogiritest.go
MetaEncoding: gbk
AIXÉÏlibxml2²»֧³Ögb2312±àÂë-AIX-ChinaUnix.net
~/gtest > go run gokogiritest.go | iconv -f gbk
MetaEncoding: gbk
AIX上libxml2不支持gb2312编码-AIX-ChinaUnix.net

Any idea why it's not working? Did I misunderstand the libxml page?

Parsing XML dies (stays blocked) when doing in parallel

Hello all,
I discovered this while load-testing my app on Windows 7. When multiple goroutines are parsing XML, sooner or later all of them get stuck in xml.Parse() and the CPU load drops to zero.

It's much more likely to happen with go run than when building and running the executable. So far I haven't been able to reproduce it on Linux.

See the attached sample code (no requests, just 10 goroutines running in a loop).
It should time out after 20 seconds (or 10000 iterations), but with go run it usually ends up like this:

Id 8, iter: 571, elapsed: 5.018003
Id 8, iter: 572, elapsed: 5.021004
Id 8, iter: 573, elapsed: 5.023504
Id 8, iter: 574, elapsed: 5.026005
Id 8, iter: 575, elapsed: 5.029006
Done -- this is outputted after 20s - timeout expired

Any idea what can be wrong?
Thanks in advance
gokogiri-load.zip

Namespace decl nodes can be returned by xpath expressions

In my testing I've noticed that an XPath expression can return a namespace declaration node ("foo/namespace::bar", for example).

The generic XmlNode.Content() will return the URL - unfortunately there doesn't seem to be a way to access the prefix. The obvious hack of getting the parent node and searching for the declaration fails because namespace declarations are on their own axis and don't actually have parents.

I don't know what the right solution here is, but we'll possibly want a new type that derives from xml.Node but is aware that it wraps a C.xmlNsPtr.

Thoughts?

Need support for XPath callbacks

I'm currently working on this in a branch, but it's an invasive design change so I'm creating an issue to solicit feedback before creating a pull request. It will result in a new xml.Node call and new xpath interface.

If variable names or non-built in functions are included in an XPath, the evaluator invokes callback functions in order to resolve them. Currently the context object we create leaves them null and exposes no way to set them.

I had to go away and do some research on callbacks, so here's what I think needs to happen (and is mostly implemented in my not-yet-published branch):

  • Create xpath.VariableScope interface. The ResolveVariable and ResolveFunction functions defined by this interface do the actual resolution.
  • Export functions that match the callback definitions. The void* pointer can be unpacked to a VariableScope variable and the appropriate resolve functions invoked.
  • Create corresponding C functions in the xpath.go header that take an XPathContext and void pointer and set the appropriate bits in the XPathContext.
  • Add XPath.SetVariableResolution(v VariableScope) which calls the C function, passing unsafe.Pointer(&v) as the data.
  • Finally, define Node.SearchWithVariables. This takes an extra VariableScope parameter and calls XPath.SetVariableResolution.
  • Create a simple struct that wraps a map and implements VariableScope. This is needed for testing.
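
The Go side of the steps above could be sketched as follows. This is only an illustration of the interface and the map-backed test helper. The method signatures, the XPathFunc type, and the mapScope name are assumptions, since the proposal does not spell them out, and the cgo callback wiring is left out entirely.

```go
package main

import "fmt"

// XPathFunc is a hypothetical signature for a custom XPath function.
type XPathFunc func(args ...interface{}) interface{}

// VariableScope mirrors the interface proposed above. The method
// signatures are assumptions made for this sketch.
type VariableScope interface {
	ResolveVariable(name, ns string) interface{}
	ResolveFunction(name, ns string) XPathFunc
}

// mapScope wraps plain maps and implements VariableScope, which is
// enough for exercising the resolution logic in tests.
type mapScope struct {
	vars  map[string]interface{}
	funcs map[string]XPathFunc
}

// ResolveVariable returns nil when the variable is unknown.
func (m *mapScope) ResolveVariable(name, ns string) interface{} {
	return m.vars[name]
}

// ResolveFunction returns nil when the function is unknown.
func (m *mapScope) ResolveFunction(name, ns string) XPathFunc {
	return m.funcs[name]
}

func main() {
	var scope VariableScope = &mapScope{
		vars: map[string]interface{}{"limit": 3},
		funcs: map[string]XPathFunc{
			"count-args": func(args ...interface{}) interface{} { return len(args) },
		},
	}
	fmt.Println(scope.ResolveVariable("limit", ""))
	if fn := scope.ResolveFunction("count-args", ""); fn != nil {
		fmt.Println(fn("a", "b"))
	}
}
```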

Just typing this out has clarified a couple of details I was unsure about, so it's helped in that regard. Please provide any feedback on the design; I'm sure I'll get plenty of feedback on the actual code once I've created a pull request.

Attempting to cross compile for linux/amd64 on OSX fails using gox

As stated in title. I am unable to cross compile my app after importing gokogiri. Using Gox, and attempting to build for linux/amd64, I get the following error.

--> linux/amd64 error: exit status 1
Stderr: ../github.com/moovweb/gokogiri/gokogiri.go:11:2: C source files not allowed when not using cgo: helper.c
../github.com/moovweb/gokogiri/html/document.go:16:2: C source files not allowed when not using cgo: helper.c

Parsing Fragments That Have Namespaces

I am having trouble parsing a document fragment that contains elements with namespaces. When added to a node as children, the namespaces are removed even though the namespaces are defined in the root element of the document to which I wish to add the fragment.

var fragment []byte = foo.Fragment()
fmt.Println(fragment) // prints name spaced elements
node.Child(fragment)
fmt.Println(node) // the elements are added, but the namespaces are removed.

I have also tried to parse the fragment manually with no success in preserving the namespaces.

I have noticed that you are wrapping the fragment with a root element. Nevertheless, the fragment seems to be parsed in the context of a node or document root, which, in my case, has the namespaces defined.

Thank you.

Implicit declarations of C functions -- error when running go get

When running go get in OSX Mavericks, it appears some people are getting the following errors:

go get -u github.com/moovweb/gokogiri
github.com/moovweb/gokogiri/xml
go/src/github.com/moovweb/gokogiri/xml/helper.c:7:3: warning: implicit declaration of function 'xmlNodeWriteCallback' is invalid in C99 [-Wimplicit-function-declaration]
go/src/github.com/moovweb/gokogiri/xml/helper.c:106:8: warning: initializing 'char *' with an expression of type 'xmlChar *' (aka 'unsigned char *') converts between pointers to integer types with different sign [-Wpointer-sign]
go/src/github.com/moovweb/gokogiri/xml/helper.c:112:4: warning: implicit declaration of function 'xmlUnlinkNodeCallback' is invalid in C99 [-Wimplicit-function-declaration]

Some more info on issue #53

In some versions of Darwin, compilation fails with an "unexpected GOT reloc" error.

I've only been able to reproduce this in certain mac environments, but it looks like for certain cases, trying to compile yields the following error:

gokogiri/xpath(__TEXT/__text): unexpected GOT reloc for non-dynamic symbol exec_xpath_function

I've tried using all versions of Go, and the error is consistent. However, it's not consistent across all Darwin machines.

Specs for one of the machines where the error appears consistently:

$ clang --version
Apple clang version 3.0 (tags/Apple/clang-211.10.1) (based on LLVM 3.0svn)
Target: x86_64-apple-darwin10.8.0
Thread model: posix

$ system_profiler SPSoftwareDataType
Software:

    System Software Overview:

      System Version: Mac OS X Server 10.6.8 (10K549)
      Server Configuration: Advanced
      Kernel Version: Darwin 10.8.0
      Boot Volume: Server HD
      Boot Mode: Normal
      Computer Name: computer
      User Name: user (user)
      Secure Virtual Memory: Not Enabled
      64-bit Kernel and Extensions: Yes
      Time since boot: 34 days 3:36

Specs for a Darwin machine where the problem consistently does NOT appear:

$ clang --version
Apple clang version 2.1 (tags/Apple/clang-163.7.1) (based on LLVM 3.0svn)
Target: x86_64-apple-darwin11.4.2
Thread model: posix

$ system_profiler SPSoftwareDataType
Software:

    System Software Overview:

      System Version: Mac OS X 10.7.5 (11G63)
      Kernel Version: Darwin 11.4.2
      Boot Volume: Macintosh HD
      Boot Mode: Normal
      Computer Name: computer
      User Name: user (user)
      Secure Virtual Memory: Enabled
      64-bit Kernel and Extensions: Yes
      Time since boot: 2 days 16:08

Encoding is passed around as byte array instead of string

There doesn't seem to be any particular reason that encoding names are passed around as byte arrays instead of strings. This results in a lot of unnecessary conversion back and forth (particularly in light of Go and libxml2 both using UTF-8 internally).

I propose we modify the API to rectify this; it will simplify things for the user.

Error when compiling gokogiri with go 1.2

Hi,

I've tried to compile gokogiri with go 1.2 but it fails:

$ go get -u github.com/moovweb/gokogiri/xml
# github.com/moovweb/gokogiri/xml
../../src/github.com/moovweb/gokogiri/xml/document.go:113: const initializer []byte literal is not a constant

$ go version
go version go1.2 linux/amd64

I fixed this issue locally by editing the file like this:

-const DefaultEncodingBytes = []byte(DefaultEncoding)
+var DefaultEncodingBytes = []byte(DefaultEncoding)
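
The change above works because Go constants are limited to booleans, runes, numbers, and strings. A []byte conversion produces a slice, which can only be a variable. A small standalone sketch follows; the value "utf-8" is an assumption, since the issue does not show how DefaultEncoding is defined.

```go
package main

import "fmt"

// DefaultEncoding can remain a constant because it is a string.
// The value "utf-8" is assumed for this sketch.
const DefaultEncoding = "utf-8"

// DefaultEncodingBytes must be a var: a slice is never a constant.
var DefaultEncodingBytes = []byte(DefaultEncoding)

func main() {
	fmt.Println(DefaultEncoding, len(DefaultEncodingBytes))
}
```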

Install fails, go1.0.3, missing newlines at EOF

% go get github.com/moovweb/gokogiri      
# github.com/moovweb/gokogiri/xml
In file included from document.go:4:
helper.h:34:23: error: no newline at end of file

Similarly, when I fix that helper.h file, a subsequent go install fails on html/helper.h and html/helper.c too.

Once I've added the missing final newline, go install succeeds. Very simple fix.

go1.0.3

can't seem to easily build on OS X

The README is obviously outdated since the makefile is gone, but I still didn't manage to build/install on Mountain Lion:

https://gist.github.com/4383203

I installed libxml2 from homebrew and updated the xpath import statement to reflect the path of the brew files.
I tried to build and got some errors.

An updated README would be very much appreciated, since this lib seems very useful.

Thanks,

Matt

fetchNode undeclared or inconsistent definition.

$ go test gokogiri/xpath
# gokogiri/xpath
1: error: 'fetchNode' undeclared (first use in this function)
1: note: each undeclared identifier is reported only once for each function it appears in
FAIL    gokogiri/xpath [build failed]

Looks like the changes introduced by PR #44 cause gokogiri to fail building.

I noticed that the fetchNode function is defined in cgo in the xpath.go file. The error above refers to its use in util.go. I tried adding the function prototype in cgo inside util.go, but then I got this error message:

$ go build .
# gokogiri/xpath
inconsistent definitions for C.fetchNode

Any ideas, @jbowtie ?

I can't find how to remove an attribute

Hi,

I didn't find a way to remove an attribute from a node. It seems that libxml2 has xmlRemoveProp, but I didn't find it in gokogiri by grepping its code. Is there another way to do that?

How do I parse xml with a namespace?

I've read through everything but still don't understand how to parse XML with a namespace. var b parses correctly, but I can't figure out how to parse var a below:

package main

import (
    "github.com/moovweb/gokogiri"
    "github.com/moovweb/gokogiri/xpath"
    "log"
)

func main() {
    log.SetFlags(log.Lshortfile)
    doc, _ := gokogiri.ParseXml([]byte(a))
    defer doc.Free()
    doc.SetNamespace("", "http://example.com/this")
    x := xpath.Compile(".//NodeA/NodeB")
    groups, err := doc.Search(x)
    log.Println(groups)
    if err != nil {
        log.Println(err)
    }
    for i, group := range groups {
        log.Println(i, group)
    }
}

var a = `<?xml version="1.0" ?><NodeA xmlns="http://example.com/this"><NodeB>thisthat</NodeB></NodeA>`
var b = `<?xml version="1.0" ?><NodeA><NodeB>thisthat</NodeB></NodeA>`

Make xml.Nodeset relevant

The current implementation of NodeSet (see xml/nodeset.go) appears to be unused, and frankly it doesn't appear to provide any useful functionality.

I propose this be changed to:

type Nodeset []Node

This can then be extended with some useful functionality. Specifically, I'd add functions to convert to and from a collection of unsafe.Pointer, a C.xmlXPathNodeSet structure, and a C.xmlXPathResultValueTree (which is a nodeset with an extra bit set for libxml).

We can then eliminate some duplicate code internally and simplify some type switches by using a Nodeset. I'd prefer to update the method signatures where appropriate but have no issue with us leaving those as is.
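
As a rough illustration of the proposal, here is a slice-based Nodeset with one of the conversions mentioned above (to a collection of unsafe.Pointer). The Node interface shown is a hypothetical one-method stand-in for xml.Node, and fakeNode exists only so the sketch runs without libxml2. The conversions involving C structures are omitted.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Node is a hypothetical, minimal stand-in for xml.Node.
type Node interface {
	NodePtr() unsafe.Pointer
}

// Nodeset is the slice type proposed above.
type Nodeset []Node

// ToPointers collects the underlying pointer of every node, in order.
func (ns Nodeset) ToPointers() []unsafe.Pointer {
	ptrs := make([]unsafe.Pointer, 0, len(ns))
	for _, n := range ns {
		ptrs = append(ptrs, n.NodePtr())
	}
	return ptrs
}

// fakeNode lets the sketch run without any C library.
type fakeNode struct{ id int }

func (f *fakeNode) NodePtr() unsafe.Pointer { return unsafe.Pointer(f) }

func main() {
	set := Nodeset{&fakeNode{id: 1}, &fakeNode{id: 2}}
	fmt.Println(len(set.ToPointers()))
}
```

Because Nodeset is a plain slice, callers keep len, range, and indexing for free, which is part of what would simplify the type switches mentioned above.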

strange behavior in href attribute, uri with parameters

Hi, I'm trying to parse some links from a page. The links come with parameters, something like:

<a href="/SOMEPATH/QueryAccion.do?ONE=1&TWO=1&TREE=4109914&FOUR=28300" onclick="func()">TEXT</a>

but I get results like this:
<a href="/SOMEPATH/QueryAccion.do?ONE=1=1=4109914=28300" onclick="func()">TEXT</a>]

Could this be due to an encoding problem? Have you encountered a bug like this before?
Best regards

Trim surrounding whitespace for Node.Content()

Hi,

Does gokogiri currently support trimming leading and trailing whitespace from a Node's Content (text content) before it is returned?

For example, I've used a Perl library in the past that allows me to do something like $node->as_text and $node->as_trimmed_text. The latter function automatically removes surrounding whitespace (s/^\s+|\s+$//g) before returning the node's text content.

If this desired behavior is not currently built in, I would happily provide a patch. For example, a new field for type Node could be introduced, say ContentTrimmed.

Thanks!
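In the meantime, a caller can trim the returned text with the standard library. A minimal sketch, assuming the node's text content is already available as a Go string; trimmedContent is a hypothetical helper name, not part of gokogiri.

```go
package main

import (
	"fmt"
	"strings"
)

// trimmedContent removes leading and trailing whitespace from a node's
// text content. The content is passed in as a plain string here.
func trimmedContent(content string) string {
	return strings.TrimSpace(content)
}

func main() {
	raw := "\n   Hello, world \t\n"
	fmt.Printf("%q\n", trimmedContent(raw))
}
```

strings.TrimSpace only touches the ends of the string, so whitespace between words is preserved.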

Enums should be typed

Currently enums are all ints. They should be typed for clarity. In some cases a String function may make sense.

For example,

type SerializationFlag int

Note that this will alter the existing API, so I'd like feedback before implementing this.
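To make the proposal concrete, here is a sketch of a typed flag with a String method. The constant names and values are invented for illustration and are not the library's actual flags.

```go
package main

import (
	"fmt"
	"strings"
)

// SerializationFlag is a typed bit flag, as proposed above.
type SerializationFlag int

// Hypothetical flag names, chosen only for this sketch.
const (
	FormatOutput SerializationFlag = 1 << iota
	NoDeclaration
	NoEmptyTags
)

// String renders the set flags as a readable list such as
// "FormatOutput|NoEmptyTags".
func (f SerializationFlag) String() string {
	if f == 0 {
		return "none"
	}
	known := []struct {
		flag SerializationFlag
		name string
	}{
		{FormatOutput, "FormatOutput"},
		{NoDeclaration, "NoDeclaration"},
		{NoEmptyTags, "NoEmptyTags"},
	}
	var parts []string
	for _, k := range known {
		if f&k.flag != 0 {
			parts = append(parts, k.name)
		}
	}
	if len(parts) == 0 {
		// Convert to int first so formatting does not call String again.
		return fmt.Sprintf("SerializationFlag(%d)", int(f))
	}
	return strings.Join(parts, "|")
}

func main() {
	fmt.Println(FormatOutput | NoEmptyTags)
}
```

With a distinct type, passing an unrelated int variable where a SerializationFlag is expected becomes a compile error, although untyped constants such as a literal 3 are still accepted.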

Memory Leak

Hey, I'm new to Go so please excuse me if the problem is with my code.

When I run the following code, the memory just keeps growing and growing, as if it is being leaked somewhere. It leaks slowly, though: if I run it with a list of 42k URLs, memory just keeps climbing. You should be able to spot it with this URL list: https://gist.github.com/JakeAustwick/82c9d4ce300639a4d275/raw/368c41ce6ba95f03cbc25a188dd3c07646a068b0/gistfile1.txt

Can you spot what I'm doing wrong, or have I found a bug?

package main

// import "github.com/hoisie/redis"
// import "code.google.com/p/go.net/html"
// import "github.com/davecheney/profile"
import "github.com/moovweb/gokogiri"
import "github.com/moovweb/gokogiri/xml"
import "github.com/moovweb/gokogiri/html"
import "github.com/mreiferson/go-httpclient"

// import "github.com/PuerkitoBio/goquery"
import "log"
import "time"

// import "strconv"
import "errors"

// import "net/url"

import "strings"

// import "io"
import "io/ioutil"

// import "bytes"

import "net/http"
import "runtime"

func main() {
    runtime.GOMAXPROCS(runtime.NumCPU())
    WORKER_COUNT := 25

    transport := &httpclient.Transport{
        ConnectTimeout:        3 * time.Second,
        RequestTimeout:        8 * time.Second,
        ResponseHeaderTimeout: 3 * time.Second,
    }
    http.DefaultClient = &http.Client{Transport: transport}
    // QUEUES := []string{"feed"}

    jobChannel := make(chan string)
    for id := 0; id < WORKER_COUNT; id++ {
        go fetchPage(id, jobChannel)
    }
    // Not most efficient way to read file, but isn't gonna use up all
    // available RAM. This isn't the problem.
    content, err := ioutil.ReadFile("urls.txt")
    if err != nil {
        log.Panicln("FILE READ ERROR")
    }
    urls := strings.Split(string(content), "\n")

    for _, url := range urls {
        jobChannel <- url
    }

    // Stop program from ending
    c := make(chan string)
    <-c

}

func fetchPage(id int, jobChannel chan string) {
    for url := range jobChannel {
        resp, err := http.Get(url)
        if err != nil {
            log.Println("http error")
            continue
        }
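        body, err := ioutil.ReadAll(resp.Body)
        // Close the body before checking the error so it is released on every path.
        resp.Body.Close()
        if err != nil {
            log.Println("read error")
            continue
        }
        doc, err := gokogiri.ParseHtml(body)
        if err != nil {
            log.Println("parse error")
            // Skip this URL rather than searching a document that failed to parse.
            continue
        }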

        form, err := BiggestForm(doc)
        if err != nil {
            log.Println(err)
        } else {
            log.Println(form.Name())
        }
        doc.Free()
    }
}

func PostForms(doc *html.HtmlDocument) []xml.Node {
    var forms []xml.Node
    foundForms, err := doc.Search("//form")
    if err != nil {
        return forms
    }
    return foundForms
}

func BiggestForm(doc *html.HtmlDocument) (xml.Node, error) {
    forms := PostForms(doc)
    if len(forms) > 0 {
        high_input_count := 0
        high_index := 0
        for i, f := range forms {
            inputs, _ := f.Search(".//descendant::input")
            textareas, _ := f.Search(".//descendant::textarea")
            fieldCount := len(inputs) + len(textareas)
            if fieldCount > high_input_count {
                high_input_count = fieldCount
                high_index = i
            }
        }
        return forms[high_index], nil
    } else {
        var f xml.Node
        return f, errors.New("No form on the Page")
    }
}

Once again I apologise if it is me doing something wrong.
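One way to make sure cleanup runs on every path, including early exits, is to move the per-URL work into its own function and release resources with defer. The sketch below is not a diagnosis of the growth described above; it only illustrates the pattern. The resource type and processOne are hypothetical stand-ins for a response body or parsed document and the code that uses them.

```go
package main

import (
	"errors"
	"fmt"
)

// resource is a hypothetical stand-in for anything that must be released
// explicitly, such as a response body or a parsed document.
type resource struct {
	name  string
	freed *int // counts how many resources have been released
}

// Free records that the resource was released.
func (r *resource) Free() { *r.freed++ }

// processOne handles a single job. The deferred Free runs whether the
// function returns normally or bails out early with an error.
func processOne(name string, freed *int) error {
	res := &resource{name: name, freed: freed}
	defer res.Free()

	if res.name == "" {
		return errors.New("empty job")
	}
	return nil
}

func main() {
	freed := 0
	jobs := []string{"a", "", "b"}
	for _, job := range jobs {
		if err := processOne(job, &freed); err != nil {
			fmt.Println("skipping:", err)
		}
	}
	fmt.Println("released", freed, "of", len(jobs))
}
```

Using defer directly inside a long-running for loop would postpone every release until the enclosing function returns, which is why the work is wrapped in a separate function here.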

Inject HTML into a node

There should be a way to inject HTML into a node. For instance,

node.String() // ""
node.Inject("<div />")
node.String() // "<div />"

And, furthermore, this new div has to be properly doc'd.

node.FirstElement().Doc() == node.Doc()
// and ensure this happens in C-world too!

xmlNode.Search returns no results when XPath evals to string or number

The current implementation of Search assumes that a NodeSet was returned, and returns nothing if the XPath actually evaluates to a string or number.

Search should either return an interface{} and require type switching, or an error should be returned if not a NodeSet. Alternatively a new API that returns an interface{} can be introduced.
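A sketch of what the type-switching option might look like from the caller's side. Everything here is hypothetical: Node is a placeholder struct, and describe stands in for caller code handling the interface{} value that such an API would return.

```go
package main

import "fmt"

// Node is a placeholder for the library's node type.
type Node struct{ Name string }

// describe handles each kind of value an XPath 1.0 expression can
// produce: a node set, a string, a number, or a boolean.
func describe(result interface{}) (string, error) {
	switch v := result.(type) {
	case []Node:
		return fmt.Sprintf("nodeset of %d", len(v)), nil
	case string:
		return fmt.Sprintf("string %q", v), nil
	case float64:
		return fmt.Sprintf("number %g", v), nil
	case bool:
		return fmt.Sprintf("boolean %t", v), nil
	default:
		return "", fmt.Errorf("unsupported XPath result type %T", result)
	}
}

func main() {
	results := []interface{}{
		[]Node{{Name: "form"}},
		"thisthat",
		float64(2),
		true,
	}
	for _, r := range results {
		text, err := describe(r)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(text)
	}
}
```

The error-returning alternative keeps the existing []Node signature and is simpler for callers who only ever expect node sets, at the cost of making string and number results unreachable through Search.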
