qcam / saxy

Fast SAX parser and encoder for XML in Elixir

Home Page: https://hexdocs.pm/saxy

License: MIT License

elixir elixir-lang xml xml-parser xml-builder xml-builder-library xml-library

saxy's Introduction

Saxy


Saxy (Sá xị) is an XML SAX parser and encoder in Elixir that focuses on speed, usability and standard compliance.

It complies with Extensible Markup Language (XML) 1.0 (Fifth Edition).

Feature highlights

  • An incredibly fast XML 1.0 SAX parser.
  • An extremely fast XML encoder.
  • Native support for streaming parsing of large XML files.
  • Parsing of XML documents into a simple DOM format.
  • Support for quick returns in event handlers.

Installation

Add :saxy to your mix.exs.

def deps() do
  [
    {:saxy, "~> 1.5"}
  ]
end
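Then fetch the new dependency:

```shell
mix deps.get
```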

Overview

Full documentation is available on HexDocs.

If you have never worked with a SAX parser before, please check out this guide.

SAX parser

A SAX event handler implementation is required before parsing can start.

defmodule MyEventHandler do
  @behaviour Saxy.Handler

  def handle_event(:start_document, prolog, state) do
    IO.inspect("Start parsing document")
    {:ok, [{:start_document, prolog} | state]}
  end

  def handle_event(:end_document, _data, state) do
    IO.inspect("Finish parsing document")
    {:ok, [{:end_document} | state]}
  end

  def handle_event(:start_element, {name, attributes}, state) do
    IO.inspect("Start parsing element #{name} with attributes #{inspect(attributes)}")
    {:ok, [{:start_element, name, attributes} | state]}
  end

  def handle_event(:end_element, name, state) do
    IO.inspect("Finish parsing element #{name}")
    {:ok, [{:end_element, name} | state]}
  end

  def handle_event(:characters, chars, state) do
    IO.inspect("Receive characters #{chars}")
    {:ok, [{:characters, chars} | state]}
  end

  def handle_event(:cdata, cdata, state) do
    IO.inspect("Receive CData #{cdata}")
    {:ok, [{:cdata, cdata} | state]}
  end
end

Then start parsing XML documents with:

iex> xml = "<?xml version='1.0' ?><foo bar='value'></foo>"
iex> Saxy.parse_string(xml, MyEventHandler, [])
{:ok,
 [{:end_document},
  {:end_element, "foo"},
  {:start_element, "foo", [{"bar", "value"}]},
  {:start_document, [version: "1.0"]}]}
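The feature list mentions quick returns: besides {:ok, state}, a handler may return other tuples (such as {:stop, state}) to end parsing early — see the Saxy.Handler docs for the exact shapes. A minimal sketch of a handler that stops at the first <item> element; the element name and module name here are just illustrative assumptions:

```elixir
defmodule FirstItemHandler do
  @behaviour Saxy.Handler

  # Stop as soon as the first <item> element opens; the state
  # given to :stop becomes the overall parsing result.
  def handle_event(:start_element, {"item", attributes}, _state) do
    {:stop, {:found, attributes}}
  end

  # Keep going for every other event.
  def handle_event(_event, _data, state), do: {:ok, state}
end

# Saxy.parse_string("<list><item id=\"1\"/></list>", FirstItemHandler, nil)
# returns as soon as <item> is seen instead of scanning the whole document.
```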

Streaming parsing

Saxy also accepts a file stream as input:

stream = File.stream!("/path/to/file")

Saxy.parse_stream(stream, MyEventHandler, initial_state)

It even supports parsing an arbitrary (transformed) stream:

stream = File.stream!("/path/to/file") |> Stream.filter(&(&1 != "\n"))

Saxy.parse_stream(stream, MyEventHandler, initial_state)

Partial parsing

Saxy can parse an XML document partially with Saxy.Partial. This feature is useful when the document cannot be turned into a stream, e.g. when it is received over a socket.

alias Saxy.Partial

{:ok, partial} = Partial.new(MyEventHandler, initial_state)
{:cont, partial} = Partial.parse(partial, "<foo>")
{:cont, partial} = Partial.parse(partial, "<bar></bar>")
{:cont, partial} = Partial.parse(partial, "</foo>")
{:ok, state} = Partial.terminate(partial)
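When chunks arrive over a socket, each one can be fed to the parser as it comes in. A minimal sketch — SocketParser, init/2, handle_chunk/2 and finish/1 are hypothetical names; the return shapes are those shown above, plus the others documented in Saxy.Partial:

```elixir
defmodule SocketParser do
  alias Saxy.Partial

  # Start a parsing session before the first chunk arrives.
  def init(handler, initial_state) do
    {:ok, partial} = Partial.new(handler, initial_state)
    partial
  end

  # Feed each incoming binary chunk into the running parser.
  # {:cont, partial} means Saxy needs more input; other shapes
  # (e.g. {:halt, ...}, {:error, ...}) are described in Saxy.Partial.
  def handle_chunk(partial, chunk) do
    case Partial.parse(partial, chunk) do
      {:cont, partial} -> {:incomplete, partial}
      other -> other
    end
  end

  # Call once the peer closes the connection.
  def finish(partial), do: Partial.terminate(partial)
end
```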

Simple DOM format exporting

Sometimes it is convenient to export the XML document into a simple DOM format: a 3-element tuple containing the tag name, the attributes, and a list of its children.

The Saxy.SimpleForm module supports this nicely:

Saxy.SimpleForm.parse_string(data)

{"menu", [],
 [
   {"movie",
    [{"id", "tt0120338"}, {"url", "https://www.imdb.com/title/tt0120338/"}],
    [{"name", [], ["Titanic"]}, {"characters", [], ["Jack &amp; Rose"]}]},
   {"movie",
    [{"id", "tt0109830"}, {"url", "https://www.imdb.com/title/tt0109830/"}],
    [
      {"name", [], ["Forest Gump"]},
      {"characters", [], ["Forest &amp; Jenny"]}
    ]}
 ]}
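For reference, the output above corresponds to an input document roughly like the one below (reconstructed here for illustration; depending on the Saxy version, whitespace-only text nodes of a pretty-printed document may also show up as children):

```elixir
data = """
<menu>
  <movie id="tt0120338" url="https://www.imdb.com/title/tt0120338/">
    <name>Titanic</name>
    <characters>Jack &amp; Rose</characters>
  </movie>
  <movie id="tt0109830" url="https://www.imdb.com/title/tt0109830/">
    <name>Forest Gump</name>
    <characters>Forest &amp; Jenny</characters>
  </movie>
</menu>
"""

Saxy.SimpleForm.parse_string(data)
```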

XML builder

Saxy offers two APIs for building simple form and encoding XML documents.

Use Saxy.XML to build and compose XML simple form, then Saxy.encode!/2 to encode the built element into an XML binary.

iex> import Saxy.XML
iex> element = element("person", [gender: "female"], "Alice")
{"person", [{"gender", "female"}], [{:characters, "Alice"}]}
iex> Saxy.encode!(element, [])
"<?xml version=\"1.0\"?><person gender=\"female\">Alice</person>"

See Saxy.XML for more XML building APIs.

Saxy also provides the Saxy.Builder protocol to help compose structs into simple form.

defmodule Person do
  @derive {Saxy.Builder, name: "person", attributes: [:gender], children: [:name]}

  defstruct [:gender, :name]
end

iex> jack = %Person{gender: :male, name: "Jack"}
iex> john = %Person{gender: :male, name: "John"}
iex> import Saxy.XML
iex> root = element("people", [], [jack, john])
iex> Saxy.encode!(root, [])
"<?xml version=\"1.0\"?><people><person gender=\"male\">Jack</person><person gender=\"male\">John</person></people>"

FAQs with Saxy/XMLs

Saxy sounds cool! But I just wanted to quickly convert some XMLs into maps/JSON...

Saxy does not offer XML-to-map conversion, because many awesome people have already made that happen 💪.

Alternatively, this pull request could serve as a good reference if you want to implement your own map-based handler.

Does Saxy work with XPath?

Saxy in its core is a SAX parser, therefore Saxy does not, and likely will not, offer any XPath functionality.

SweetXml is a wonderful library to work with XPath. However, :xmerl, the library used by SweetXml, is not always memory efficient and speedy. You can combine the best of both sides with Saxmerl, which is a Saxy extension converting XML documents into SweetXml compatible format. Please check that library out for more information.

Saxy! Where did the name come from?

Sa xi Chuong Duong

Sa Xi, pronounced like sa-see, is an awesome soft drink made by Chuong Duong.

Benchmarking

Note that benchmarking XML parsers is difficult and highly depends on the complexity of the documents being parsed. Even though I try hard to make the benchmarking suite fair, it is hard to avoid bias when choosing the documents to benchmark against.

Therefore the conclusions in this section are for reference purposes only. Please feel free to benchmark against your own target documents. The benchmark suite can be found in bench/.

A rule of thumb is that we should compare apples to apples. Some XML parsers target only specific types of XML, so the test suite includes indicators of how fair each benchmark result is.

Some quick and biased conclusions from the benchmark suite:

  • For SAX parsing, Saxy is usually 1.4 times faster than Erlsom. With deeply nested documents, Saxy is noticeably faster (about 4 times).
  • For XML building and encoding, Saxy is usually 10 to 30 times faster than XML Builder. With deeply nested documents, it can be 180 times faster.
  • Saxy uses significantly less memory than XML Builder (4 to 25 times less).
  • Saxy uses significantly less memory than Xmerl, Erlsom and Exomler (1.4 to 10 times less).

Limitations

  • No XSD support.
  • No DTD support; when Saxy encounters a <!DOCTYPE declaration, it skips it.
  • Only UTF-8 encoding is supported.
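Since only UTF-8 is accepted, documents in other encodings can be transcoded before parsing. A sketch using Erlang's :unicode module to convert a Latin-1 binary to UTF-8 (any encoding attribute in the XML declaration would need adjusting accordingly):

```elixir
# A Latin-1 encoded fragment: "café" with é as the single byte 0xE9.
latin1 = <<"<v>caf", 0xE9, "</v>">>

# Transcode to UTF-8 so Saxy can parse it.
utf8 = :unicode.characters_to_binary(latin1, :latin1, :utf8)

# utf8 is now valid UTF-8 and safe to hand to Saxy.parse_string/3.
IO.inspect(utf8)
```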

Contributing

If you have any issues or ideas, feel free to write to https://github.com/qcam/saxy/issues.

To start developing:

  1. Fork the repository.
  2. Write your code and related tests.
  3. Create a pull request at https://github.com/qcam/saxy/pulls.

Copyright and License

Copyright (c) 2018 Cẩm Huỳnh

This software is licensed under the MIT license.

saxy's People

Contributors

hissssst, joez99, josevalim, kanmaniselvan, kianmeng, lucacorti, manuel-rubio, marcelotto, marpo60, qcam, stevedomin, thbar, tiagonbotelho, varnerac, vikas15bhardwaj


saxy's Issues

Can we safely raise an error from a handler?

I am "overriding" Saxy.SimpleForm.Handler (using it as a basis and overriding one function), and I was wondering if there is a risk of a memory leak or similar trouble if an exception is raised from handle_event.

Is it safe to do?

(for some context, I've read #80).

Thanks!

Allow partial parsing of an XML doc

Firstly, this is a very nice library, thanks for making it!

It would be really useful to be able to partially parse a string, return the state, and then when the user gets more input they can pass more of the string in with the previous state, until eventually getting to the end of the xml doc. I tried to see if this was already do-able but it doesn't appear to be. It seems kind of close to streaming... but not quite.

The use case I'm looking at is in a handle_info/2 function in a GenServer that gets a little bit more of an XML doc each time from a long running process, so I want to update the parsing state each time handle_info/2 is called.

Special characters are not escaped during xml encoding

Hello @qcam, thanks for the effort you're putting on this project, it's been really helpful for us.
I'm currently having an issue creating XML documents.
My expectation was that special characters in the content would be escaped when I create an XML document. Apparently this is not happening; here is an example:

> Saxy.XML.element("event", [], "Event1 & Event2") |> Saxy.encode!()
> "<event>Event1 & Event2</event>"

Thank you!
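Until the encoder handles this, one workaround is to escape reserved characters manually before building the element. A hypothetical helper (not part of Saxy) covering the five characters XML reserves:

```elixir
defmodule XmlEscape do
  # Escape the five XML-reserved characters in character data.
  # "&" must be replaced first so the other replacements
  # are not double-escaped.
  def escape(text) do
    text
    |> String.replace("&", "&amp;")
    |> String.replace("<", "&lt;")
    |> String.replace(">", "&gt;")
    |> String.replace("\"", "&quot;")
    |> String.replace("'", "&apos;")
  end
end

# XmlEscape.escape("Event1 & Event2") => "Event1 &amp; Event2"
```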

Dialyzer

Will you accept a PR that allows Dialyzer and adds type specs?

Concerns on whitespace emitting and parsing

Context

We use Saxy to parse XML files from several APIs.

There have been changes introduced to emit whitespaces correctly, connected to [this issue and its pull request](#51).

Concerns

For our first concern related to the above, we noticed a reduction in performance when using the latest version of Saxy (v1.4.0), as opposed to a fork based on v0.9.1.

We did some digging to check whether it was changes on our end, or if it was a performance regression in this library.

We observed that the SimpleForm data generated from a pretty-printed XML using the latest version of Saxy is double the size of what the original version produced, very likely caused by (or correlated with) the emitted/parsed whitespace.

Running benchmarks between SimpleForm results coming from both versions, we found that parsed / emitted whitespaces cause some performance regressions.

Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 16 GB
Elixir 1.13.4
Erlang 25.0.4

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 5 s
parallel: 5
inputs: none specified
Estimated total run time: 24 s

Benchmarking new_saxy_version ...
Benchmarking old_saxy_version ...

Name                       ips        average  deviation         median         99th %
old_saxy_version          4.33      231.03 ms     ±3.94%      229.68 ms      257.67 ms
new_saxy_version          4.22      237.09 ms     ±4.67%      234.80 ms      265.52 ms

Comparison:
old_saxy_version          4.33
new_saxy_version          4.22 - 1.03x slower +6.05 ms

Reduction count statistics:

Name                     average  deviation         median         99th %
old_saxy_version         20.86 M     ±0.04%        20.86 M        20.87 M
new_saxy_version         24.84 M     ±0.03%        24.84 M        24.85 M

Comparison:
old_saxy_version         20.86 M
new_saxy_version         24.84 M - 1.19x reduction count +3.98 M

The second concern is about whitespace values within tags. Previously, an XML in the shape of:

<Body> <Value> </Value> </Body>

Would result in the contents of Value being parsed as nil. In the current build, it is parsed as a string containing whitespace.

According to the issue mentioned at the beginning, this should only happen when certain attributes are provided, like so:

<Body> <Value xml:space="preserve"> </Value> </Body>

Proposed solutions

  • We should only emit whitespaces when xml:space="preserve" is provided. By default whitespaces should not be emitted.
  • Alternatively, providing an option to ignore whitespaces might be a way to solve it, such that we can do calls similar to these:
Saxy.parse_stream(stream, Saxy.SimpleForm.Handler, [emit_whitespaces?: false])

ParseError for Adobe Illustrator SVG exports

Fails parsing Adobe Illustrator exports

  • reason: {:token, :name_start_char}

Example SVG from Illustrator:

<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 19.2.1, SVG Export Plug-In . SVG Version: 6.00 Build 0)  -->
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" id="icon_x5F_checkmark" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px"
	 y="0px" viewBox="0 0 24 24" enable-background="new 0 0 24 24" xml:space="preserve">
<rect fill="none" width="24" height="24"/>
<g>
	<path d="M12,23.5752C5.6177,23.5752,0.4248,18.3828,0.4248,12S5.6177,0.4248,12,0.4248S23.5752,5.6172,23.5752,12
		S18.3823,23.5752,12,23.5752z M12,2.292c-5.353,0-9.708,4.3545-9.708,9.708S6.647,21.708,12,21.708s9.708-4.3545,9.708-9.708
		S17.353,2.292,12,2.292z"/>
	<path d="M10.5063,17.1338c-0.001,0-0.0024,0-0.0034,0c-0.2852-0.001-0.5542-0.1328-0.7305-0.3564l-3.0801-3.9199
		c-0.3188-0.4062-0.248-0.9922,0.1572-1.3115c0.4048-0.3164,0.9917-0.249,1.3105,0.1572l2.3516,2.9932l5.8926-7.3857
		c0.3208-0.4033,0.9087-0.4707,1.3115-0.1475c0.4033,0.3213,0.4692,0.9082,0.1475,1.3115l-6.6274,8.3076
		C11.0591,17.0039,10.7905,17.1338,10.5063,17.1338z"/>
</g>
</svg>

Removing the <!DOCTYPE line fixes the error.

Support :error tuples as a valid return value in event handlers

Maybe I'm missing something, but is there a way to stop parsing with an :error tuple which is directly passed through as the overall result?

I'm using Saxy for writing an RDF/XML parser which imposes some further restrictions on a valid document. So, I'd like to return an :error tuple in the event handlers when encountering an error, which I'd expect to be passed through directly as the result of Saxy.parse_string/3, but currently I get this:

{:error,
 %Saxy.ParseError{
   __exception__: true,
   binary: nil,
   position: nil,
   reason: {:bad_return, {:start_element, {:error, "my custom error"}}}
 }}

I'm aware of the :halt and :stop return patterns, which could be used for this scenario of course, but neither of them feels quite right:

  • with :stop, the :error tuple is wrapped in an :ok tuple
  • with :halt, the :error tuple is wrapped in a three-element :halt tuple

What do you think of supporting :error tuples as an additional return pattern in the handlers, which will become the result of Saxy.parse_string/3 directly?

XML element name parsing is not spec compliant

To add some more context on #128, we are using saxy to parse XML files generated by invoicing systems and we are seeing examples of this happening in the wild. e.g.:

<a>somevalue</a

>

This causes Saxy to return an error because the end of element a is not detected. According to the XML spec though, whitespace is allowed between the element name and the closing > character, so I think this is a bug in the implementation.

CDATA: Keep when parsing & rebuilding document

Hi!

I see you're handling CDATA when parsing and it's possible to generate CDATA using a {:cdata, content} tuple. What I want to do is parse some XML, edit it and then build XML again. Here's a minimal example using CDATA:

simplexml = "<?xml version=\"1.0\"?><rss><content:encoded><![CDATA[<strong>foobar</strong>]]></content:encoded></rss>"

{:ok, parsed} = Saxy.SimpleForm.parse_string(simplexml)
# parsed == {"rss", [], [{"content:encoded", [], ["<strong>foobar</strong>"]}]}

Saxy.encode!(parsed, [])
# "<?xml version=\"1.0\"?><rss><content:encoded><strong>foobar</strong></content:encoded></rss>"

The final XML is wrong because it does not wrap the content:encoded content in CDATA. I looked at the code and I don't think I can solve this by implementing my own event handler as there is no event or indicator for CDATA being returned.

I believe it's a builder bug as well. The generated string without CDATA should look like this:

<?xml version="1.0"?><rss><content:encoded>&lt;strong&gt;foobar&lt;/strong&gt;</content:encoded></rss>

Support for xml:space="preserve"

It appears that Saxy doesn't support the xml:space="preserve" attribute and fails to emit character data for an element containing only whitespace.

Here's a test which illustrates the issue and the expected behavior:

test "parses elements with xml:space=\"preserve\"" do
  buffer = "<foo xml:space=\"preserve\">   </foo>"

  assert {:ok, state} = parse(buffer)
  events = Enum.reverse(state.user_state)

  assert [{:start_element, {"foo", [{"xml:space", "preserve"}]}} | events] = events
  assert [{:characters, "   "} | events] = events
  assert [{:end_element, "foo"} | events] = events
  assert [{:end_document, {}} | events] = events

  assert events == []
end

Note that this is a trivial case as the behavior dictated by xml:space is supposed to be inherited by child nodes.

I had a look at parser/element.ex and it seems like supporting such stateful behavior might be challenging.
Perhaps we could work around the issue by allowing some level of customization of whitespace handling as an option to the parse functions?

Output Stream

Sá xị already takes a Stream as input. It would be nice to also be able to output events as a Stream. This would let us fit Sá xị parsing into a stream/flow pipeline to process big XML documents in parallel.

Example:

File.stream!("path/to/file.xml")
|> Saxy.parse_stream(MyEventHandler)
|> Stream.filter(fn
  ({:start_element, {"file", _attributes}}) -> true
  (_) -> false
end)
...

"&" in characters throws error

First off. Fantastic library.

I'm getting a {:token, :reference} error with contents containing the ampersand(&) character.

<?xml version='1.0' ?>
<event>
  <sensor>
    <name>Door 1&2</name>
    <location>Upstairs</location>
  </sensor>
</event>

I'm parsing a stream from an application that is emitting events for sensors. I can't change what is being emitted.

:bad_return error returned when parsing a valid XML file

I get an error trying to parse a small sample of discogs data dump. I could parse the same file without any issues in Ruby and Python.

The error is raised right after the prolog, the first element is not processed.

iex(4)> DiscogsParserEx.run('data/masters_sample.xml')
"Start parsing document"
{:error,
 %Saxy.ParseError{
   binary: nil,
   position: nil,
   reason: {:bad_return, {:start_document, [start_document: []]}}
 }}

I'm trying to figure out what's going on, and the meaning of bad_return, which I believe is returned here:

Utils.bad_return_error(other)

Here a link to the files I'm trying to parse
https://github.com/alexquintino/discogs-parser/tree/master/data

These files don't have a prolog, but I've tried adding one and I still get the same error:

iex(5)> DiscogsParserEx.run('data/masters_sample.xml')
"Start parsing document"
{:error,
 %Saxy.ParseError{
   binary: nil,
   position: nil,
   reason: {:bad_return, {:start_document, [start_document: [version: "1.0"]]}}
 }}

Do you have any clue of why the parser raises this error? Any help would be appreciated, thanks!

v1.0 roadmap

  • Binary parsing.
  • Stream parsing.
  • Process instruction (#6).
  • Simple form (#10).
  • Entity Reference (no external entity reference) (#8).
  • Stop/halt control (#7).
  • XPath support (partial?).
  • XML Builder.

Is streaming encoding possible with Saxy?

Maybe it is already supported, but I'm not 100% clear on this, and I will be very happy to document the findings one way or another, so I'm opening the discussion :-)

I am currently manipulating potentially large XML responses provided by third-party servers inside a proxy.

Basically, someone queries our proxy with a small XML query, then we modify it, and send the payload to a third-party server.

What I would like to do is stream the response of the third-party server as a client, modify it on the fly (e.g. redacting sensitive elements) and send it back to the client, also in streaming fashion.

This means I would need to stream the large response, but also generate a large streaming encoded document out of it, all with minimal memory.

I think @JoeZ99 paved the way in #100 and https://joez99.medium.com/stream-output-when-parsing-big-xml-with-elixir-92baff37e607, and maybe everything is more or less already here.

Is "streaming encoding" possible currently with Saxy? If I plug this with the above work, I'll have achieved what I need.

Thanks for your input!

Example of pretty-printing (indenting) XML with Saxy?

For some time (see https://elixirforum.com/t/what-is-your-best-trick-to-pretty-print-a-xml-string-with-elixir-or-erlang/42010), I've wondered what is the best way to indent XML (like xmllint --format file.xml does), in order to make it more readable to humans.

I have come across JavaScript packages (https://www.npmjs.com/package/xml-formatter) which do that.

I wonder if someone has already implemented that with Saxy, ideally in a way that does not require full in-RAM storage (simple_form) but instead works on a stream, if possible.

Opening this issue to create a bit of discussion; I will definitely comment back & create extra documentation if I find out how to achieve that (for now I'll rely on JavaScript-land).


"UTF-8" vs "utf-8" in the encoded prolog?

Thanks for Saxy! A great library 😄

While implementing a round-trip operation (parsing, modifying, re-encoding) of small XML documents, and comparing the before/after as a way to test the operation, I noticed that it is possible to generate a document with the encoding specified via this code:

saxy/lib/saxy/encoder.ex

Lines 29 to 31 in 9f9b372

defp encoding(:utf8) do
  [?\s, 'encoding', ?=, ?", 'utf-8', ?"]
end

The documents I'm usually dealing with use an uppercase encoding here (UTF-8), and it is often quoted on the net that the uppercase variant is preferred (https://blog.codingoutloud.com/2009/04/08/is-utf-8-case-sensitive-in-xml-declaration/).

I am a bit concerned that some legacy / low-quality parsers could choke on the lower-case version.

So I wonder: would you be OK to also support the uppercase version?

Also, it could be nice to have the option to add a newline after the prolog, as commonly seen in indented documents.

If you are interested, I'll be happy to contribute this via a PR. Thanks!

[HOWTO] Target a nested element

Hi,
Is it possible to target a nested element?

I have this data structure:

<?xml version="1.0" encoding="utf-8"?>
<identity>
  <name>
    <nested_name>foo</nested_name>
  </name>
  <no_name>
    <nested_name>bar</nested_name>
  </no_name>
</identity>

I would like to get the value of nested_name in the first name block, but not the nested_name in the no_name block.

Thanks

Saxy does not map \r to \n like SweetXml does - can we count on this continuing?

We're trying to parse XML output from Amazon S3 APIs when the file/object name might end (inappropriately, but it is possible) with \n or \r characters. Right now we are using SweetXml, but the case with "xxx\r" is getting parsed to "xxx\n" which, if we returned that as the name to S3, is not what we got in the first place.

Parsing \r | \n with SweetXml.parse:

iex(6)> SweetXml.parse("<?xml version='1.0' encoding='UTF-8'?><testattr>xxx\r</testattr>") 
{:xmlElement, :testattr, :testattr, [], {:xmlNamespace, [], []}, [], 1, [],
 [{:xmlText, [testattr: 1], 1, [], 'xxx\n', :text}], [],
 '/Users/cmarkle/devel/elixir/xml_example', :undeclared}
iex(7)> SweetXml.parse("<?xml version='1.0' encoding='UTF-8'?><testattr>xxx\n</testattr>")
{:xmlElement, :testattr, :testattr, [], {:xmlNamespace, [], []}, [], 1, [],
 [{:xmlText, [testattr: 1], 1, [], 'xxx\n', :text}], [],
 '/Users/cmarkle/devel/elixir/xml_example', :undeclared}
iex(8)> SweetXml.parse("<?xml version='1.0' encoding='UTF-8'?><testattr>xxx&#13;</testattr>")
{:xmlElement, :testattr, :testattr, [], {:xmlNamespace, [], []}, [], 1, [],
 [{:xmlText, [testattr: 1], 1, [], 'xxx\n', :text}], [],
 '/Users/cmarkle/devel/elixir/xml_example', :undeclared}
iex(9)> SweetXml.parse("<?xml version='1.0' encoding='UTF-8'?><testattr>xxx&#10;</testattr>")
{:xmlElement, :testattr, :testattr, [], {:xmlNamespace, [], []}, [], 1, [],
 [{:xmlText, [testattr: 1], 1, [], 'xxx\n', :text}], [],
 '/Users/cmarkle/devel/elixir/xml_example', :undeclared}

...basically \n and \r are each mapped to \n, probably as intended by the XML spec.

I see that Saxy distinguishes between these two characters and \r is NOT mapped to \n.

Parsing \r | \n with Saxy.SimpleForm:

iex(2)> Saxy.SimpleForm.parse_string("<?xml version='1.0' encoding='UTF-8'?><testattr>xxx\n</testattr>")
{:ok, {"testattr", [], ["xxx\n"]}}
iex(3)> Saxy.SimpleForm.parse_string("<?xml version='1.0' encoding='UTF-8'?><testattr>xxx\r</testattr>")
{:ok, {"testattr", [], ["xxx\r"]}}
iex(4)> Saxy.SimpleForm.parse_string("<?xml version='1.0' encoding='UTF-8'?><testattr>xxx&#13;</testattr>")
{:ok, {"testattr", [], ["xxx\r"]}}
iex(5)> Saxy.SimpleForm.parse_string("<?xml version='1.0' encoding='UTF-8'?><testattr>xxx&#10;</testattr>")
{:ok, {"testattr", [], ["xxx\n"]}}

...basically \n and \r are treated distinctly, each mapping to itself in the result.

This would be helpful to me in this case, but I guess my question is: is this something we can count on staying this way in Saxy?

Multiple characters events being received for the same element when the cdata_as_characters option is enabled

Hi,

When I receive "character" events with CDATA content, more than one event is emitted for the same element.

Using this example:

     <Neighborhood>
       <![CDATA[CIBRATEL I]]>
     </Neighborhood>

I inspected the content and found that it passed through my handle_event 3 times.
This was my test inspect output:

      "\n"
      "CIBRATEL I"
      "\n"

Since cdata_as_characters is true, shouldn't only one event come?

I believe that only one event should arrive with the content "\nCIBRATEL I\n".

Boolean attribute issue

Hi,

How can I fix this issue with boolean attr?

Saxy.SimpleForm.parse_string(~s|<span itemscope itemtype="https://schema.org/Review"></span>|)

{:error,
 %Saxy.ParseError{
   binary: "<span itemscope itemtype=\"https://schema.org/Review\"></span>",
   position: 16,
   reason: {:token, :eq}
 }}

List support in Saxy.Builder

It looks like Saxy.Builder doesn't handle Lists by default. I was able to get it to work with this (hacky) implementation:

defimpl Saxy.Builder, for: List do
  def build(nil), do: ""
  def build([]), do: ""

  def build(list) do
    list
    |> Enum.map(fn elem -> elem |> Saxy.Builder.build() |> Saxy.encode!() end)
    |> Enum.join()
  end
end

I was wondering if there's a better way to do this, and if it should be included in Saxy by default.

Saxy.encode! function is not escaping characters in some cases

Run following code:

{:ok, x} = Saxy.SimpleForm.parse_string("<events><event>test&amp;test</event></events>") 
Saxy.encode!(x)

It should return exactly the same XML as the original, but it changes &amp; to &. This also produces invalid XML that we can't parse again.

Documenting how to filter out newlines from the parsed model

Just sharing a bit of code I'm using. For some context, I'm starting to use Saxy to build XML queries during testing. I must compare them to a baseline template, to ensure I used the Saxy.XML API correctly. These templates include newlines for human readability (indented/pretty-printed).

To make the comparison automated, I'm removing the non-significant newlines with the following code so far:

defmodule Stripper do
  def filter_newlines_from_model({element, tags, children}) do
    {
      element,
      tags,
      children
      |> Enum.map(&filter_newlines_from_model/1)
      |> Enum.reject(&is_nil/1)
    }
  end

  def filter_newlines_from_model(content) when is_binary(content) do
    if content |> String.trim() == "", do: nil, else: content
  end
end

I can then use it like this:

    {:ok, parsed_request} = Saxy.SimpleForm.parse_string(xml, cdata_as_characters: false)
    IO.inspect(Stripper.filter_newlines_from_model(parsed_request))

This makes it easier to compare this using ExUnit assert x == y, for instance.

I wonder if:

  • this is worth adding to the documentation as an example
  • or if it could be an interesting option for Saxy.SimpleForm.parse_string (less sure at this point; we can mull over this!)

Let me know what you think!

Have you considered implementing a declarative syntax for parsing?

Hi again! Since we last spoke I've been using Saxy to create an RSS feed parser. After successfully implementing the RSS 2.0 spec, I'm now trying to make the handler code a little more generic to accept other types of configuration, and was wondering if you have ever considered implementing something similar to https://github.com/pauldix/sax-machine with Saxy (and if not, how hard do you think it would be to get it done?).

I've been toying with the idea, but since I'm not used to parsing documents in SAX style, I'm having a bit of trouble figuring out a nice way of abstracting the handler logic into something more reusable that does not compromise performance too much. I figure this type of configuration could be passed to a generic handler, with elements being parsed accordingly:

[
  [element: :title],
  [element: :cloud],
  [element: :image, from: image_mappings],	
  [element: "textInput", from: text_input_mappings],
  [elements: "skipHours", from: skip_hours_mappings],
  [elements: :item, as: :entry, from: entry_mappings]
]

Update¹: I see this section in the docs about Saxy.Builder https://hexdocs.pm/saxy/Saxy.html#module-encoder, which presents a declarative API for encoding data into XML, but I couldn't find whether the other way around is also supported.

Update²: I'm also going to link the repo I'm working on; perhaps you could give me some pointers on how to better optimize/reuse the parsing code: https://github.com/thiagomajesk/gluttony.

Saxy hangs the dialyzer

Adding saxy as a dependency will hang the dialyzer during the deps adding stage. Running the dialyzer on saxy itself hangs during the project-checking stage.

By removing lib/saxy/parser/{element,prolog}.ex, the dialyzer finishes very quickly.

I would guess that this is related to the large guard clauses defined in lib/saxy/guards.ex.

Here's a similar issue from the Elixir kernel. See also this Erlang bug report. It could be that there's still an issue with large `or` guards.

I have tried Elixir and Erlang versions:

  • 1.6.5 OTP 20.3
  • 1.7.4 OTP 21.0
  • 1.7.4 OTP 21.1
  • 1.7.4 OTP 21.2
  • 1.8.1 OTP 21.1

My workaround is to ignore :saxy as a dependency in the dialyzer config. I'm using dialyxir to run the dialyzer.

`Saxy.XML.element/3` and a non-list third argument results in a Dialyzer error

On a project which uses Saxy.XML, I get Dialyzer errors with this code:

element("siri:StopVisitTypes", [], "all")

The code itself works fine.

I did a bit of digging to understand why Dialyzer reports an error, here are my findings!

The Saxy.XML code does support the call I wrote above indeed, here:

saxy/lib/saxy/xml.ex

Lines 71 to 81 in 98e2c9e

def element(name, attributes, children) when not is_nil(name) and is_list(children) do
  {
    to_string(name),
    attributes(attributes),
    children(children)
  }
end

def element(name, attributes, child) when not is_nil(name) do
  element(name, attributes, [child])
end

The typespecs, on the other hand, appear to require a list:

saxy/lib/saxy/xml.ex

Lines 24 to 30 in 98e2c9e

@type element() :: {
        name :: String.t(),
        attributes :: [{key :: String.t(), value :: String.t()}],
        children :: [content]
      }

@type content() :: element() | characters() | cdata() | ref() | comment() | String.t()

For now I'm turning my "all" into ["all"] to fix the error, but I wonder if we could modify the typespecs in Saxy to match the allowed behaviour here?
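For illustration, one way the spec could be widened to match the second `element/3` clause (a sketch, not Saxy's actual typespec; `content()` is the type quoted above):

```elixir
@spec element(
        name :: term(),
        attributes :: [{String.t(), String.t()}],
        children :: content() | [content()]
      ) :: element()
```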

Thanks!

Closing the event stream - detect end_document

I'm trying to use Saxy to convert a streaming XML resource into a SAX event stream, but I'm having a hard time closing the document. I wonder if Saxy could help out here.

stream = ["<root>", "<elem>", "text", "</elem>", "</root>"]

{:ok, partial} = Saxy.Partial.new(MyEventHandler, [])

Stream.transform(stream, partial, fn xml_chunk, partial ->
  case Saxy.Partial.parse(partial, xml_chunk) do
    {:cont, partial} ->
      # emit events
      {Enum.reverse(partial.state.user_data), reset_user_data(partial)}
  end
end)
|> Enum.to_list()

# returns something like
[:start_document, :start_element, :start_element, :characters, :end_element,
 :end_element]

As you can see, the last emitted event is :end_element (and not :end_document). The stream stops after "</root>", so parsing stops.

My questions:

  • Should Saxy detect the end of the document here?
  • In which situations would Saxy.Partial.parse return a {:halt, _} response?

For now, my alternative solution would be to detect if the stack is empty, and terminate the stream based on that.

Stream.transform(stream, partial, fn
      _, nil ->
        # parsing has been stopped, halt the stream
        {:halt, []}

      xml_chunk, partial ->
        case Saxy.Partial.parse(partial, xml_chunk) do
          {:cont, %{state: %{stack: []}} = partial} ->
            # the stack is empty, so we're done parsing the document: terminate the partial
            {:ok, end_document} = Saxy.Partial.terminate(partial)
            {Enum.reverse(end_document), nil}

          {:cont, partial} ->
{Enum.reverse(partial.state.user_data), reset_user_data(partial)}

          {:error, error} ->
            raise error
        end
    end)
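If eager processing is acceptable, another option (a sketch that assumes every chunk yields `{:cont, _}`) is to fold the chunks into the partial and call `Saxy.Partial.terminate/1` once the source is exhausted, which flushes the `:end_document` event:

```elixir
final_partial =
  Enum.reduce(stream, partial, fn chunk, p ->
    {:cont, p} = Saxy.Partial.parse(p, chunk)
    p
  end)

{:ok, user_state} = Saxy.Partial.terminate(final_partial)
```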

Encoding Error ISO-8859-1 SimpleForm

XML String

data = "<?xml version='1.0' encoding='ISO-8859-1'?><OTA_HotelAvailRS xmlns=\"http://parsec.es/hotelapi/OTA2014Compact\" TimeStamp=\"2021-03-17T08:56:29Z\" PrimaryLangID=\"en-GB\" Id=\"11,33667649,72545\"><Hotels HotelCount=\"0\"><DateRange Start=\"2021-04-20\" End=\"2021-04-21\" /><RoomCandidates><RoomCandidate RPH=\"0\"><Guests><Guest AgeCode=\"A\" Count=\"2\" /></Guests></RoomCandidate></RoomCandidates></Hotels></OTA_HotelAvailRS>"

Here, the encoding value is ISO-8859-1.

Now, if I try to run SimpleForm parsing, it gives me the following error:

Error

iex(12)> Saxy.SimpleForm.parse_string(data)
{:error,
 %Saxy.ParseError{
   binary: "<?xml version='1.0' encoding='ISO-8859-1'?><OTA_HotelAvailRS xmlns=\"http://parsec.es/hotelapi/OTA2014Compact\" TimeStamp=\"2021-03-17T08:56:29Z\" PrimaryLangID=\"en-GB\" Id=\"11,33667649,72545\"><Hotels HotelCount=\"0\"><DateRange Start=\"2021-04-20\" End=\"2021-04-21\" /><RoomCandidates><RoomCandidate RPH=\"0\"><Guests><Guest AgeCode=\"A\" Count=\"2\" /></Guests></RoomCandidate></RoomCandidates></Hotels></OTA_HotelAvailRS>",
   position: 30,
   reason: {:invalid_encoding, "ISO-8859-1"}
 }}

If I replace ISO-8859-1 with UTF-8, it works fine.

Is there a way to parse ISO-8859-1 encoded xml?
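Not directly, as far as I know; Saxy only accepts UTF-8. A workaround is to transcode the binary before parsing (a sketch; the `String.replace/3` on the declaration is naive and assumes the exact quoting used above):

```elixir
defmodule Latin1 do
  # Transcode an ISO-8859-1 (Latin-1) binary to UTF-8 and rewrite the
  # encoding declaration so Saxy accepts the document.
  def to_utf8(xml) when is_binary(xml) do
    xml
    |> :unicode.characters_to_binary(:latin1, :utf8)
    |> String.replace("encoding='ISO-8859-1'", "encoding='UTF-8'")
  end
end
```

After `Latin1.to_utf8(data)`, `Saxy.SimpleForm.parse_string/1` sees a plain UTF-8 document.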

Stream fails to parse if XML file contains a leading BOM character

If an XML file contains a leading BOM Saxy fails to parse the file.

iex(1)> xml = "\uFEFF<?xml version=\"1.0\" encoding=\"utf-8\"?><foo bar='value'></foo>"
iex(2)> Saxy.parse_string(xml, MyEventHandler, [])
"Start parsing document"
{:error,
 %Saxy.ParseError{
   reason: {:token, :lt},
   binary: "\uFEFF<?xml version=\"1.0\" encoding=\"utf-8\"?><foo bar='value'></foo>",
   position: 0
 }}

I'm seeing this when using ExCmd to stream a gzipped file into Saxy, and I can't see any obvious way of stripping the BOM out before passing the stream to Saxy.
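One workaround (a sketch; it assumes the whole BOM arrives within the first chunk) is to wrap the stream and drop a leading BOM before handing it to `Saxy.parse_stream/3`:

```elixir
defmodule BOM do
  # Drop a leading UTF-8 BOM from the first chunk of a binary stream.
  def strip(stream) do
    Stream.transform(stream, true, fn
      <<"\uFEFF", rest::binary>>, true -> {[rest], false}
      chunk, _first? -> {[chunk], false}
    end)
  end
end
```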

Streaming received chunks of characters event

Hi,

For a start, I really like your parser; I also read the blog post about building and benchmarking it. Amazing work, mate!

I'm struggling with one thing at the moment. I'm using partial parsing because I'm receiving big XML documents over a socket, so I'm chunking them, and it works, but…

My use case is this:

The XML a client sends to my app contains big attachments as Base64-encoded data (images, PDFs, etc.) which I need to send, while parsing/processing, as a stream of chunks to S3-like storage, e.g.:

<Object Encoding="Base64">
_big_base64encoded_data_binary_
</Object>

Here is the catch:
Until Saxy receives the last part of that Base64-encoded binary, the element is not fully parsed/valid, so Saxy defaults to buffering the element's contents until the end of the element, thereby loading the whole encoded data into memory (this is not acceptable for my use case, and I'm pretty sure I'm not alone).

What I want to achieve is to start a streaming upload of the received chunks of this Base64/element data as soon as Saxy receives them, but I don't know how to do that without changing this library's internal CharData handling (this concept of having control over what happens with the actual data while parsing is implemented, for example, in StAX for Java: https://en.wikipedia.org/wiki/StAX).

Is something like that possible without changing the internal logic of how the Saxy parser handles character data? Having this kind of control over received chunks of data would be a real benefit and the ultimate use case for me: using this library as a streaming XML parser with control over emitted events.

Thanks in advance

Erik

Saxy parses but does not encode properly

When loading a document using Saxy.SimpleForm.parse_string, such as:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<text>&gt;</text>

The result is correctly parsed as:

{"text", [], [">"]}

But when encoding it back using Saxy.encode!(simple), the result is:

<text>></text>

So the > character was not properly encoded to &gt;.


Reproduce:

    {:ok, simple} = Saxy.SimpleForm.parse_string("""
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <text>&gt;</text>
    """)

    IO.inspect(simple)

    encoded = Saxy.encode!(simple)

    IO.inspect(encoded)

Output:

{"text", [], [">"]}
"<text>></text>"

Halt and Partial

Hi @qcam, I don't think I'm getting the change made in #66; I was trying to use it in my project (https://github.com/altenwald/exampple), but without success.

What I'm after is to have the module Exampple.Xml.Parser.Sender process the XML chunk and send the elements one by one to another process (I'm sending the PID in the start-document event).

Could you point me in the right direction? What should I change to start using your modification?

Get tag content at :end_element

Hey,

I'm having memory issues with SweetXML

Your library looks great and performant, but I'm struggling to get what I'd like. I have a huge list of <FICHE /> elements, and I need to process them independently, so having a map of each one's content would definitely do the job.

E.g.

  def handle_event(:end_element, {"FICHE", _no_attributes, text_content}, state) do
    do_transform(Saxy.SimpleForm.parse_string(text_content))
  end

But it doesn't seem to be possible, or is it?

Thanks
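Saxy doesn't hand you an element's content at `:end_element`, but the usual SAX workaround is to accumulate it in your own handler state. A minimal sketch that collects the character data of each `<FICHE>` (the module name and state shape are mine, and child elements inside `<FICHE>` are ignored for brevity):

```elixir
defmodule FicheHandler do
  @behaviour Saxy.Handler

  # State: {buffer, fiches}. Start with Saxy.parse_string(xml, FicheHandler, {nil, []}).
  # buffer is nil outside a <FICHE>, and a reversed iolist inside one.

  def handle_event(:start_element, {"FICHE", _attrs}, {_buffer, fiches}) do
    {:ok, {[], fiches}}
  end

  def handle_event(:characters, chars, {buffer, fiches}) when is_list(buffer) do
    {:ok, {[chars | buffer], fiches}}
  end

  def handle_event(:end_element, "FICHE", {buffer, fiches}) when is_list(buffer) do
    text = buffer |> Enum.reverse() |> IO.iodata_to_binary()
    # do_transform(text) could happen here instead of collecting
    {:ok, {nil, [text | fiches]}}
  end

  def handle_event(_event, _data, state), do: {:ok, state}
end
```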

Patch release

Hi! This is a great library, and I converted my library to use it instead of xmerl, and it's much better now. Just wondering if you could do a patch release to hex, as I depend on e8f9347 (removal of the stream match in parse_stream/3). I would use github master, but hex deps can't depend on github deps (my lib is exkml).

Thanks!

Cutting a new release?

Based on:
v1.4.0...c15c2c6

In particular, I'm interested in getting a new release because of the Dialyzer troubles fixed in be744ff, which is currently unreleased (I'm not in a hurry because we use an "override" and link straight to GitHub, so no pressure).

I can help with the changelog if you'd like, @qcam!

CDATA element fails to parse when file is streamed as fixed bytes, not lines

We have an XML file which fails to parse since we switched from calling File.stream!(path, [:compressed, :trim_bom]) to File.stream!(path, [:compressed, :trim_bom], 32_768). It throws the following error:

{:error, %Saxy.ParseError{reason: {:token, :"]]"}, binary: <<10, 60, 100, 101, 115, 99, 114, 105, 112, 116, 105, 111, 110, 62, 60, 33, 91, 67, 68, 65, 84, 65, 91, 60, 112, 62, 60, 115, 116, 114, 111, 110, 103, 62, 83, 65, 80, 32, 124, 32, 46, 78, 69, 84, 32, 124, ...>>, position: 92}}

The file is littered with empty CDATA elements. I wonder if one of them aligns with the start/end of a buffer? I can provide the full XML file if useful; it's 114MB and I'd prefer not to share it publicly.

Namespace support

We are using Saxy to parse XML files provided by our users. We are only interested in some elements, so Saxy suits our needs perfectly. However, sometimes our users introduce namespaces in the XML, and we receive something like:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<fruitset xmlns:ns2="https://www.w3schools.com/fruit">
  <fruit>
    <ns2:name>Banana</ns2:name>
    <colour>Yellow</colour>
  </fruit>
</fruitset>

Where we are only interested in the name value. We can modify the handle_event of :start_element to ignore the namespace, but http://www.saxproject.org/namespaces.html mentions a localName variable. Are there any plans to add such a parameter to the :start_element event?
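I don't know of localName support in Saxy, but as a workaround you can split the prefix off yourself in the handler (a sketch; it treats everything before the first colon as the prefix and does not resolve xmlns declarations):

```elixir
defmodule NS do
  # "ns2:name" -> "name"; "colour" -> "colour"
  def local_name(name) do
    case String.split(name, ":", parts: 2) do
      [_prefix, local] -> local
      [local] -> local
    end
  end
end
```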

Tolerate ASCII encoding instead of UTF-8

Today, if a file with encoding="ASCII" is provided, an error is raised.

But simply replacing the attribute value with "UTF-8" is enough for Saxy to parse it, because, as one can read on the internet:

UTF-8 is a superset of ASCII, as it is backward compatible with ASCII.

Would you welcome a PR to ensure this case is covered?

CDATA event not being handled when there's a line break?

Hi! I'm creating an RSS2 feed parser, and it seems that CDATA is not correctly handled when there's a line break between the tag and the CDATA content. For instance, take the example handler:

defmodule MyEventHandler do
  @behaviour Saxy.Handler

  def handle_event(:start_document, prolog, state) do
    IO.inspect("Start parsing document")
    {:ok, [{:start_document, prolog} | state]}
  end

  def handle_event(:end_document, _data, state) do
    IO.inspect("Finish parsing document")
    {:ok, [{:end_document} | state]}
  end

  def handle_event(:start_element, {name, attributes}, state) do
    IO.inspect("Start parsing element #{name} with attributes #{inspect(attributes)}")
    {:ok, [{:start_element, name, attributes} | state]}
  end

  def handle_event(:end_element, name, state) do
    IO.inspect("Finish parsing element #{name}")
    {:ok, [{:end_element, name} | state]}
  end

  def handle_event(:characters, chars, state) do
    IO.inspect("Receive characters #{chars}")
    {:ok, [{:characters, chars} | state]}
  end

  def handle_event(:cdata, cdata, state) do
    IO.inspect("Receive CData #{cdata}")
    {:ok, [{:cdata, cdata} | state]}
  end
end

Copy this feed content and paste it into an XML file. After formatting the file (I'm currently using this XML formatter) you'll see that a :characters event is emitted with the line break \n. I'm not sure whether these two should be semantically different from each other:

<description><![CDATA[<p>Description</p>]]></description>
<description>
  <![CDATA[<p>Description</p>]]>
</description>

Update: Actually, it seems that the :cdata event is not being emitted at all; I always get the data as :characters events.
