christrenkamp / goxpath

An XPath 1.0 implementation written in the Go programming language.

License: MIT License

Go 99.56% Shell 0.44%

go golang xpath

goxpath's Issues

Exposing the lexer/parser

Hi,

There is a bit of a backstory to this issue, but basically I'm looking to write my own XPath evaluator, mostly because the one in goxpath isn't geared towards what I want, and that translates into a big performance cost. goxpath does have a great lexer/parser, so there is no point in me reinventing the wheel there.

Would you be open to a PR that moves the parser/lexer out of internal so that I can use them in another package?

ExecNode bug?

I have a simple example here of a program which takes a NodeSet containing two "container" Nodes, and runs ExecNode on them with a query searching for the text of the "title" elements.

However, in both cases it seems to always return a NodeSet as if I had run ExecNode from the root node. This leads me to believe ExecNode is always executing the query on the root instead of the node that was provided.

Is my understanding of the expected behavior correct? If so, is this a bug?

package main

import (
	"bytes"
	"fmt"

	"github.com/ChrisTrenkamp/goxpath"
	"github.com/ChrisTrenkamp/goxpath/tree"
	"github.com/ChrisTrenkamp/goxpath/tree/xmltree"
)

func main() {
	var page = `
	<html>
		<body>
			<div class="container">
				<span class="title">title1</span>
			</div>
			<div class="container">
				<span class="title">title2</span>
			</div>
		</body>
	</html>
	`

	var root tree.Node
	var buffer = bytes.NewBuffer([]byte(page))
	root, _ = xmltree.ParseXML(buffer, parseSettings)
	xpExec, _ := goxpath.Parse("//div[@class=\"container\"]")
	containers, _ := xpExec.ExecNode(root)
	for _, c := range containers {
		xpExec, _ = goxpath.Parse("//span[@class=\"title\"]/text()")
		title, _ := xpExec.ExecNode(c)
		fmt.Println(title[0].ResValue()) // "title1" in both cases
	}
}

func parseSettings(s *xmltree.ParseOptions) {
	s.Strict = false
}

Retrieve node name and namespace

I'd like to be able to retrieve a list of nodes and access their name and namespace prefix, not just their value.

For example, I'd like to be able to do something like this:

/rss/channel/*

<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
    <channel>
        <title>This Week in Tech (MP3)</title>
        <itunes:author>TWiT</itunes:author>
        <itunes:subtitle>Some subtitle</itunes:subtitle>
    </channel>
</rss>

Then I'd like to be able to pull the namespace prefix and node name out of each result:

result.value = "This Week in Tech (MP3)"
result.ns = ""
result.name = "title"

result.value = "TWiT"
result.ns = "itunes"
result.name = "author"

result.value = "Some subtitle"
result.ns = "itunes"
result.name = "subtitle"

I'm not sure if this is possible with the current XPRes that we get back now or not.

If we had that in place, perhaps it would also be possible to support this XPath 2.0 syntax for retrieving the element name:

//rss/*/name()

Nice work on this library! Really like that it is capable of querying based on namespace (that is something xmlpath can't do). I'm currently building an RSS/Atom feed parser on top of it.
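
For comparison, here is a minimal sketch of how the prefix and local name can be read with the standard library alone, without goxpath. It relies on encoding/xml's Decoder.RawToken, which leaves the namespace prefix (rather than the resolved URI) in Name.Space. The childInfo type and the channelChildren function are hypothetical names made up for this illustration, and the path match is hard-coded to /rss/channel/*; nothing here is goxpath API.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// sampleFeed mirrors the document from the issue.
const sampleFeed = `<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
    <channel>
        <title>This Week in Tech (MP3)</title>
        <itunes:author>TWiT</itunes:author>
        <itunes:subtitle>Some subtitle</itunes:subtitle>
    </channel>
</rss>`

// childInfo is a hypothetical result type; it is not part of goxpath.
type childInfo struct {
	Prefix, Name, Value string
}

// channelChildren collects the direct children of /rss/channel.
// RawToken does not translate prefixes, so Name.Space holds e.g. "itunes".
func channelChildren(doc string) ([]childInfo, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var path []string // local names of the currently open elements
	var out []childInfo
	atChild := func() bool {
		return len(path) == 3 && path[0] == "rss" && path[1] == "channel"
	}
	for {
		tok, err := dec.RawToken()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			path = append(path, t.Name.Local)
			if atChild() {
				out = append(out, childInfo{Prefix: t.Name.Space, Name: t.Name.Local})
			}
		case xml.CharData:
			// Text directly inside a matched child becomes its value.
			if atChild() {
				out[len(out)-1].Value += string(t)
			}
		case xml.EndElement:
			if len(path) > 0 {
				path = path[:len(path)-1]
			}
		}
	}
}

func main() {
	children, err := channelChildren(sampleFeed)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range children {
		fmt.Printf("value=%q ns=%q name=%q\n", c.Value, c.Prefix, c.Name)
	}
}
```

Note that RawToken also skips the well-formedness checks that Token performs, so this approach trades validation for access to the raw prefix.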

Delete tree node

First of all, following up on #6 -- Is this project being actively maintained?

Now, how can I delete tree node(s), say those matching a given XPath?

Thanks
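
As a point of reference, deleting nodes can be approximated outside goxpath by copying the token stream with encoding/xml and skipping the unwanted subtrees. The sketch below is only that: it matches elements by local name instead of evaluating an XPath expression, and dropElements is a hypothetical helper, not a goxpath function.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// dropElements copies doc while omitting every element whose local name is
// name, together with everything nested inside it.
func dropElements(doc, name string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var buf bytes.Buffer
	enc := xml.NewEncoder(&buf)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		if start, ok := tok.(xml.StartElement); ok && start.Name.Local == name {
			// Skip consumes tokens up to and including the matching end element.
			if err := dec.Skip(); err != nil {
				return "", err
			}
			continue
		}
		if err := enc.EncodeToken(tok); err != nil {
			return "", err
		}
	}
	if err := enc.Flush(); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	in := `<root><keep>a</keep><drop><x>1</x></drop><keep>b</keep></root>`
	out, err := dropElements(in, "drop")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out) // <root><keep>a</keep><keep>b</keep></root>
}
```

The sample input deliberately avoids namespaces, since re-encoding namespaced tokens with xml.Encoder can rewrite prefixes.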

Parser returns wrong tree structure for expressions that begin with functions

The parser isn't behaving correctly when an XPath expression starts with a function. An example expression is current()/oc-platform:state/oc-platform:type = 'TRANSCEIVER', which is available in the public openconfig models https://github.com/openconfig/public/blob/5897507ecdb54453d4457e7dbb0a3d4b7ead4314/release/models/platform/openconfig-platform-transceiver.yang#L557

I noticed two issues. I'm currently testing my fixes for them, but I still want to report them.

Whenever an XPath expression starts with a function, the parser always returns an empty root node in the AST. This occurs even for small expressions like current()='TRANSCEIVER'. After fixing this, I encountered the second issue.
Whenever an XPath expression starts with a function and contains more than one node (e.g. the sample openconfig expression mentioned above), the parser builds an AST with the wrong structure.

When the hell did encoding/xml handle HTML?

This is embarrassing. I never knew that the encoding/xml package could parse HTML files as well. Upon this discovery, I decided to see how goxpath could handle it:

curl -s https://en.wikipedia.org/wiki/XPath | goxpath -u -v '//div[@class="mw-highlight mw-content-ltr"]/pre[contains(.,"xml")]'

Response:


<?xml version="1.0" encoding="utf-8"?>
<wikimedia>
  <projects>
    <project name="Wikipedia" launch="2001-01-05">
      <editions>
        <edition language="English">en.wikipedia.org</edition>
        <edition language="German">de.wikipedia.org</edition>
        <edition language="French">fr.wikipedia.org</edition>
        <edition language="Polish">pl.wikipedia.org</edition>
        <edition language="Spanish">es.wikipedia.org</edition>
      </editions>
    </project>
    <project name="Wiktionary" launch="2002-12-12">
      <editions>
        <edition language="English">en.wiktionary.org</edition>
        <edition language="French">fr.wiktionary.org</edition>
        <edition language="Vietnamese">vi.wiktionary.org</edition>
        <edition language="Turkish">tr.wiktionary.org</edition>
        <edition language="Spanish">es.wiktionary.org</edition>
      </editions>
    </project>
  </projects>
</wikimedia>

Neat! We can even run goxpath on that response.

curl -s https://en.wikipedia.org/wiki/XPath | goxpath -u -v '//div[@class="mw-highlight mw-content-ltr"]/pre[contains(.,"xml")]' | goxpath -u '//edition[@language="English"]'

Response:

<edition language="English">en.wikipedia.org</edition>
<edition language="English">en.wiktionary.org</edition>

This is making me reconsider the way goxpath uses strict validation. goxpath expects the XML declaration to be the first thing in the input. Since it can handle both XML and HTML, maybe this should be changed to just strip off the XML declaration if it's at the beginning of the input and not throw an error if it's not there.
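
The lenient behavior described above can be sketched with encoding/xml alone: read tokens until the first start element, note whether an XML declaration was seen, and raise no error when it is missing. This is an illustration of the idea, not goxpath's actual parsing code; readProlog is a hypothetical helper.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// readProlog reads up to the first start element. It returns that element's
// local name and whether an XML declaration preceded it. A missing
// declaration is not treated as an error.
func readProlog(doc string) (string, bool, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	hadDecl := false
	for {
		tok, err := dec.Token()
		if err != nil {
			// Includes io.EOF when the input holds no element at all.
			return "", hadDecl, err
		}
		switch t := tok.(type) {
		case xml.ProcInst:
			if t.Target == "xml" {
				hadDecl = true // declaration seen; simply move past it
			}
		case xml.StartElement:
			return t.Name.Local, hadDecl, nil
		}
	}
}

func main() {
	inputs := []string{
		`<?xml version="1.0" encoding="utf-8"?><wikimedia></wikimedia>`,
		`<html><body></body></html>`,
	}
	for _, in := range inputs {
		root, hadDecl, err := readProlog(in)
		fmt.Println(root, hadDecl, err)
	}
}
```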

Could it support non-strict html parsing?

It would be more useful if it could support non-strict HTML parsing.
Or is there any tool that could convert bad HTML into a normalized file that goxpath can read?
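
For context, the encoding/xml documentation describes a decoder configuration intended for HTML: Strict set to false, AutoClose set to xml.HTMLAutoClose, and Entity set to xml.HTMLEntity. The sketch below applies that configuration to a small fragment with an unclosed <br> tag. It shows only the standard-library side; whether and how goxpath exposes these settings is not covered here, and startTags is a hypothetical helper.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// startTags decodes doc leniently and returns the local names of all start
// elements in document order.
func startTags(doc string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	dec.Strict = false                // tolerate common HTML deviations
	dec.AutoClose = xml.HTMLAutoClose // e.g. <br> needs no closing tag
	dec.Entity = xml.HTMLEntity       // e.g. &nbsp; is a known entity
	var names []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return names, nil
		}
		if err != nil {
			return nil, err
		}
		if start, ok := tok.(xml.StartElement); ok {
			names = append(names, start.Name.Local)
		}
	}
}

func main() {
	names, err := startTags(`<html><body><p>one<br>two&nbsp;three</p></body></html>`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(names)
}
```

This mode still expects roughly balanced markup; badly broken pages may need a dedicated HTML parser first.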

Preserve self-closing tags as is

Following up with #7

goxpath ... reads and writes XML as it wasn't meant for changing the tree.

I found that goxpath.Marshal will change the self-closing tags like <test /> into a pair <test></test>.

Is there any way I can preserve self-closing tags as is?

In a typical jmx file, there are lots of <hashTree /> nodes (or <collectionProp name="HeaderManager.headers"/> etc.), and I want to preserve them as-is after they are processed and output by goxpath.

Thanks
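
One possible workaround, sketched here under clear assumptions, is to post-process the marshalled text and collapse each empty open/close pair back into a self-closing tag. This is not a goxpath feature: collapseEmpty is a hypothetical helper, it rewrites every empty pair (including elements that were written as <a></a> in the input), it emits <test/> without restoring the original spacing, and it does not look inside comments or CDATA sections.

```go
package main

import (
	"fmt"
	"regexp"
)

// emptyPair matches an opening tag (with optional attributes) immediately
// followed by a closing tag. Go's regexp has no backreferences, so the two
// names are compared in collapseEmpty instead.
var emptyPair = regexp.MustCompile(`<([^\s<>/]+)((?:\s[^<>]*)?)></([^\s<>/]+)>`)

// collapseEmpty rewrites <name attrs></name> as <name attrs/>.
func collapseEmpty(s string) string {
	return emptyPair.ReplaceAllStringFunc(s, func(m string) string {
		parts := emptyPair.FindStringSubmatch(m)
		if parts == nil || parts[1] != parts[3] {
			return m // names differ: not an empty pair, leave untouched
		}
		return "<" + parts[1] + parts[2] + "/>"
	})
}

func main() {
	fmt.Println(collapseEmpty(`<root><this><is>a</is><test></test></this></root>`))
	fmt.Println(collapseEmpty(`<collectionProp name="HeaderManager.headers"></collectionProp>`))
}
```</this></root>`); got != `<root><this><is>a</is><test/></this></root>` {
	panic("unexpected output: " + got)
}
if got := collapseEmpty(`<collectionProp name="HeaderManager.headers"></collectionProp>`); got != `<collectionProp name="HeaderManager.headers"/>` {
	panic("unexpected output: " + got)
}
if got := collapseEmpty(`<a>text</a><hashTree></hashTree>`); got != `<a>text</a><hashTree/>` {
	panic("unexpected output: " + got)
}
</test>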

XML-tagged structs operation

Logging from here.

I'm also looking for some ideas on extending the parser to operate on XML-tagged structs.

Quite agree.

Maybe implement the XML API of Python's ElementTree module as a starting point?

Thanks again for the great work!

goxpath needs a rewrite

goxpath was mainly written for myself because the tools I used were either too complicated and cumbersome or insufficient. I made a lot of decisions early on that I now regret. Some of them were fixable, but others are inherent to its design and cannot be fixed without gutting huge portions of the library. I don't use goxpath as much as I did about half a year ago (and the commit history reflects that), but whenever I do, it has a bunch of flaws that bug the hell out of me. Here's a short list of ones that desperately need to be addressed:

Trees are not streamed

Anyone who has used goxpath on big XML files will notice this. My job is to work with many, many small XML files, so it was never really a concern, but as I start working with bigger XML files, this is starting to rear its ugly head.

The problem, however, is the core Go XML library does not have the ability to read previous elements, so if the tree automatically releases memory in the background, there's no way to get it back without re-opening the file.

Trees are format-agnostic, but the core library depends specifically on XML elements

I should have thought the core library out more thoroughly before putting in explicit XML dependencies. I should have made the data points on tree nodes an interface so other formats can just define their own internal data points without making some ugly translation into XML.

Trees cannot be altered

It's becoming more apparent to me that people want this library for editing trees, not just querying them. It can be done, but the default xmltree package was not made with editing in mind.

The public API is ugly

I focused a lot on the internal library and didn't pay much attention to the public API. It's not as fluent as I would like it to be, and it needs to be more elegant.

Parser doesn't like expressions like `/foo/[current()/bar=ip]`

It seems like the parser doesn't like the expression in the issue title:

func main() {
	x, _ := parser.Parse(`/foo[current()/bar=buz]`)
	fmt.Println(x.Left.Left.Left)
	fmt.Println(x.Left.Left.Left.Left.Left) // the parent is 'op =' which is fine
}

=>
go run *.go
&{{ } 0xc42004e1c0 <nil> 0xc42004e140 0xc42004e1c0}
&{{function current} <nil> <nil> 0xc42004e1c0 <nil>}
The first line is suspicious and the second is weird because current should have a left (or right?), but it doesn't. Writing it the other way around produces a correct tree.

I haven't been bitten by this so atm this is nothing more than a nice to have. I suspect if this becomes an actual problem, I'll find some time to fix it myself.

last() / index after attribute restriction query doesn't parse

Trying to run:

var xmlStr = xml.Header + `<root>
  <text attr="1">This is some text 1.1.</text>
  <text attr="1">This is some text 1.2.</text>
  <text attr="2">This is some text. 2.1</text>
  <text attr="2">This is some text. 2.2</text>
  <text attr="2">This is some text. 2.3</text>
</root>`

var xpExec = goxpath.MustParse(`/root/text[@attr="2"][2]`)

The library panics with: Malformed XPath expression

This is one of the examples in the 1.0 spec, and the result should be "This is some text. 2.2".
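
To make the expected result concrete, here is a small sketch that emulates /root/text[@attr="2"][2] with encoding/xml instead of an XPath engine: the first predicate filters the text elements by attribute, and the second applies a 1-based position to the filtered list. The type and function names are hypothetical, and this only mimics the semantics of this one expression.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

const xmlStr = xml.Header + `<root>
  <text attr="1">This is some text 1.1.</text>
  <text attr="1">This is some text 1.2.</text>
  <text attr="2">This is some text. 2.1</text>
  <text attr="2">This is some text. 2.2</text>
  <text attr="2">This is some text. 2.3</text>
</root>`

type textNode struct {
	Attr  string `xml:"attr,attr"`
	Value string `xml:",chardata"`
}

type rootDoc struct {
	Texts []textNode `xml:"text"`
}

// nthWithAttr emulates /root/text[@attr=want][pos]. The boolean result
// reports whether a node exists at that position.
func nthWithAttr(src, want string, pos int) (string, bool, error) {
	var r rootDoc
	if err := xml.Unmarshal([]byte(src), &r); err != nil {
		return "", false, err
	}
	// First predicate: keep only elements whose attr equals want.
	var filtered []textNode
	for _, t := range r.Texts {
		if t.Attr == want {
			filtered = append(filtered, t)
		}
	}
	// Second predicate: 1-based position within the filtered list.
	if pos < 1 || pos > len(filtered) {
		return "", false, nil
	}
	return filtered[pos-1].Value, true, nil
}

func main() {
	value, ok, err := nthWithAttr(xmlStr, "2", 2)
	fmt.Println(value, ok, err)
}
```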

panic: Malformed XML file

Hi @ChrisTrenkamp,

Please take a look at this simple program that beautifies XML:

func main() {

	parseTree := xmltree.MustParseXML(os.Stdin)
	goxpath.Marshal(parseTree, os.Stdout)
}

This is how I tested it, along with the results:

$ echo '<message><org><cn>Some org-or-other</cn><ph>Wouldnt you like to know</ph></org><contact><fn>Pat</fn><ln>Califia</ln></contact></message>' | go run gxp_XMLBeautify.go
panic: Malformed XML file

goroutine 1 [running]:
panic(0x51df40, 0xc420064530)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/ChrisTrenkamp/goxpath/tree/xmltree.MustParseXML(0x5ee540, 0xc420088000, 0x0, 0x0, 0x0, 0x4806ad, 0xc4200001a0)
        /export/repo/go-arch/src/github.com/ChrisTrenkamp/goxpath/tree/xmltree/xmltree.go:39 +0xb0
main.main()
...

$ echo "<root><this><is>a</is><test /></this></root>" | go run gxp_XMLBeautify.go
panic: Malformed XML file

goroutine 1 [running]:
panic(0x51df40, 0xc42000aa10)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/ChrisTrenkamp/goxpath/tree/xmltree.MustParseXML(0x5ee540, 0xc420034008, 0x0, 0x0, 0x0, 0x4806ad, 0xc4200001a0)
        /export/repo/go-arch/src/github.com/ChrisTrenkamp/goxpath/tree/xmltree/xmltree.go:39 +0xb0
main.main()
...

How can I make it work? Thanks!

Does this lib support HTML?

Select root node

I would expect the path /* to select the root node. However, it always returns no results for me.

This is useful when I want to pull attributes from the root node of a feed, which is sometimes "rss" and in other versions "rdf".

For now I have two paths to cover both cases but it would be cool if I could select the root element via a wildcard.

XSLT current()

I've been doing some work on goxpath, using it to evaluate XPath expressions used in the YANG data modelling language. I've been able to implement the tree.Node interface on top of my own store (I'm not using the XML parsing part), and I've been able to get a number of XPath queries to work this way.

One issue I just hit is that the YANG files use the XSLT current() function, which fails to evaluate because current() isn't included in the implementation. Did you consider implementing current() at all?

Locate and modify

Hi,

Would it be possible for you to provide an example showing how to locate some records and then modify them, please?

For example, please take a look at this .xml file. I need to,

locate all //Request nodes; and for each node found,
change its attribute according to other attributes.

For example, the change is to set

//Request's ./[ReportingName] to the basename of ./[Url] (the real change also prefixes it with ../TransactionTimer[Name], but for simplicity of the demo, please ignore that part).

The final result is posted here.

TIA for your help.
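
A locate-and-modify pass of this kind can be sketched with encoding/xml by copying the token stream and editing attributes on the way through. This is an illustration rather than a goxpath answer: the sample document is invented, Request elements are matched by name instead of by an XPath query, the basename is taken with path.Base (which would keep any query string), and setReportingNames is a hypothetical helper.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"path"
	"strings"
)

// setReportingNames copies doc and, for every Request element that has a
// Url attribute, sets its ReportingName attribute to the basename of Url.
func setReportingNames(doc string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var buf bytes.Buffer
	enc := xml.NewEncoder(&buf)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		if start, ok := tok.(xml.StartElement); ok && start.Name.Local == "Request" {
			base, found := "", false
			for _, a := range start.Attr {
				if a.Name.Local == "Url" {
					base, found = path.Base(a.Value), true
				}
			}
			for i := range start.Attr {
				if found && start.Attr[i].Name.Local == "ReportingName" {
					start.Attr[i].Value = base
				}
			}
			tok = start
		}
		if err := enc.EncodeToken(tok); err != nil {
			return "", err
		}
	}
	if err := enc.Flush(); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Invented sample input; the attribute names follow the issue text.
	in := `<Items><Request Url="http://example.com/a/b/login.aspx" ReportingName="old"></Request></Items>`
	out, err := setReportingNames(in)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Note that xml.Encoder writes empty elements as an open/close pair, so self-closing tags in the input are not preserved by this pass.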