Comments (8)

pabigot commented on May 25, 2024

Based on some quick research, I don't think this is a bug in PyXB. PyXB is intended to operate on XML documents that are validated against XML schemas. XAML is an XML-based language that uses different validation semantics; in particular, it allows individual documents to change which namespaces are validated. PyXB is not an XAML processor and won't ignore those namespaces, so you will get validation errors if they are referenced but not validatable.

You might first convert the document to DOM format, then run a preprocessing step that removes elements and attributes that have a prefix that appears in an {http://schemas.openxmlformats.org/markup-compatibility/2006}Ignorable attribute. PyXB should be able to process what's left.
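The preprocessing step described above could look roughly like the following. This is a minimal sketch using only the standard library's `xml.dom.minidom`; the function names `strip_ignorable`, `_strip`, and `_lookup` are hypothetical and not part of PyXB. It assumes `Ignorable` holds a whitespace-separated list of prefixes, that attributes in the markup-compatibility namespace itself should also be dropped, and it does not handle other markup-compatibility attributes such as `ProcessContent`.

```python
from xml.dom import minidom, Node

MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"


def _lookup(elem, prefix):
    """Resolve a prefix to a namespace URI by walking up the element tree."""
    node = elem
    while node is not None and node.nodeType == Node.ELEMENT_NODE:
        if node.hasAttribute("xmlns:" + prefix):
            return node.getAttribute("xmlns:" + prefix)
        node = node.parentNode
    return None


def _strip(elem, inherited):
    # Namespaces marked ignorable apply to this element and its descendants.
    ignorable = set(inherited)
    for prefix in elem.getAttributeNS(MC_NS, "Ignorable").split():
        uri = _lookup(elem, prefix)
        if uri:
            ignorable.add(uri)
    # Drop attributes in ignorable namespaces (and mc:* attributes, by assumption).
    for attr in list(elem.attributes.values()):
        if attr.namespaceURI in ignorable or attr.namespaceURI == MC_NS:
            elem.removeAttributeNode(attr)
    # Drop child elements in ignorable namespaces; recurse into the rest.
    for child in list(elem.childNodes):
        if child.nodeType != Node.ELEMENT_NODE:
            continue
        if child.namespaceURI in ignorable:
            elem.removeChild(child)
        else:
            _strip(child, ignorable)


def strip_ignorable(xml_text):
    """Return the document as a string with ignorable content removed."""
    doc = minidom.parseString(xml_text)
    _strip(doc.documentElement, frozenset())
    return doc.toxml()
```

The returned string can then be handed to the generated bindings in place of the original document. The namespace declarations themselves are left in place, since unused declarations are harmless.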

That the exception this produces doesn't have a nice text representation is a reasonable complaint, though. I've added that as issue #31.

from pyxb.

kylegibson commented on May 25, 2024

Hi Peter,

I appreciate the thorough response. It looks like I can just extend the default SAX handler used by PyXB to filter out these elements and attributes. I don't quite have a working implementation yet, but I'll post it when I'm done.

Thanks,
-Kyle

kylegibson commented on May 25, 2024

Hi Peter,

I have a SAX handler that overrides the default PyXB SAX handler to strip out the ignorable attributes. This appears to cause PyXB to raise ContentNondeterminismExceededError: Nondeterminism exceeded validating. The code in Configuration.candidateTransitions and AutomatonConfiguration.step is fairly complex, so I am struggling to resolve the issue. If there are any pointers, advice, or references you could share, I would appreciate it.

I prefer to avoid having to pre-process the XML before passing it to PyXB.

Thanks,
-Kyle

pabigot commented on May 25, 2024

"Override" or "extend"? You probably shouldn't discard PyXB's SAX handler in favor of your own, but you could subclass it and overload some of the methods to strip out the attributes (and elements) that are in ignorable namespaces.

It may also simply be that the documents you're using are nondeterministic and exceed the configured threshold. You could try increasing PermittedNondeterminism slowly to see if there's a reasonable threshold that makes it pass. Be aware that the larger the value you use, the more memory PyXB may require to validate the document, and the longer it will take.

kylegibson commented on May 25, 2024

Sorry, I mean extend. I'm subclassing pyxb.binding.saxer.PyXBSAXHandler and overloading the startElementNS method. That part seems to be doing exactly what I want.
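For readers who want to try the SAX-level filtering discussed here, the following is a rough sketch of the idea using only the standard library. It is not the handler described in this thread and does not touch `PyXBSAXHandler`; instead it places an `xml.sax.saxutils.XMLFilterBase` subclass in front of an arbitrary content handler. The names `IgnorableFilter`, `Recorder`, and `filtered_events` are hypothetical. As simplifications, prefix mappings and the ignorable set are accumulated for the whole document rather than scoped per element, and all attributes in the markup-compatibility namespace are dropped.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces
from xml.sax.saxutils import XMLFilterBase
from xml.sax.xmlreader import AttributesNSImpl

MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"


class IgnorableFilter(XMLFilterBase):
    """Drop elements and attributes whose namespace is listed in mc:Ignorable."""

    def __init__(self, parent=None):
        super().__init__(parent)
        self._prefixes = {}      # prefix -> namespace URI (not scoped; simplification)
        self._ignorable = set()  # namespace URIs to drop
        self._skip_depth = 0     # > 0 while inside a dropped element

    def startPrefixMapping(self, prefix, uri):
        self._prefixes[prefix] = uri
        super().startPrefixMapping(prefix, uri)

    def startElementNS(self, name, qname, attrs):
        for prefix in attrs.get((MC_NS, "Ignorable"), "").split():
            if prefix in self._prefixes:
                self._ignorable.add(self._prefixes[prefix])
        if self._skip_depth or name[0] in self._ignorable:
            self._skip_depth += 1
            return
        kept, qnames = {}, {}
        for key, value in attrs.items():
            if key[0] in self._ignorable or key[0] == MC_NS:
                continue
            kept[key] = value
            qnames[key] = attrs.getQNameByName(key)
        super().startElementNS(name, qname, AttributesNSImpl(kept, qnames))

    def endElementNS(self, name, qname):
        if self._skip_depth:
            self._skip_depth -= 1
            return
        super().endElementNS(name, qname)

    def characters(self, content):
        if not self._skip_depth:
            super().characters(content)


class Recorder(ContentHandler):
    """Stand-in downstream handler that records what gets through the filter."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElementNS(self, name, qname, attrs):
        self.events.append((name[1], sorted(key[1] for key in attrs.keys())))


def filtered_events(xml_text):
    recorder = Recorder()
    filt = IgnorableFilter(xml.sax.make_parser())
    filt.setFeature(feature_namespaces, True)
    filt.setContentHandler(recorder)
    filt.parse(io.BytesIO(xml_text.encode("utf-8")))
    return recorder.events
```

Keeping the filter separate from the binding handler means the downstream handler sees a stream that never contained the ignorable content, which is equivalent to preprocessing without a second pass over the document.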

My hypothesis was that the ContentNondeterminismExceededError exception was being caused by my SAX handler. To test that, I manually removed all of the ignorable attributes from the XML document, and attempted to load it using PyXB. I got the same ContentNondeterminismExceededError exception. I increased the PermittedNondeterminism to 1024. Exception still occurs. I've encountered this issue on almost all of my sample documents thus far except one. That particular sample is a very simple document, only containing a single word. It's not yet clear to me what's special about my other samples that is causing this problem.

I also lack understanding of the purpose of the determinism check. It's not clear to me what non-determinism means in this context, or why/whether it's a problem. If there are any references or advice you could share, I would appreciate it.

Thanks,
-Kyle

pabigot commented on May 25, 2024

Try this stackoverflow question, this PyXB test case, and possibly the technical references in the PyXB FAC documentation. More generally, a google for "nondeterminism in xml" might be fruitful, or the more common nondeterministic finite automata.

PyXB "resolves" nondeterminism by executing multiple candidate parses in parallel until only one succeeds or the number of potential candidates exceeds the limit. In grossly nondeterministic languages this can happen with pretty small documents.
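To make the idea of parallel candidate parses concrete, here is a toy illustration. It is not PyXB's implementation (the real logic in `Configuration.candidateTransitions` and `AutomatonConfiguration.step` is considerably more involved); the automaton, the `parse` function, and the `NondeterminismExceeded` exception are all invented for this sketch. In the toy automaton every `a` can be consumed in two ways, so the number of live candidates doubles with each `a` until it crosses the limit.

```python
class NondeterminismExceeded(Exception):
    """Stand-in for the error raised when too many candidates remain."""


# Toy automaton: from every state, "a" may lead to state 1 or state 2,
# so each "a" doubles the number of candidate parses. Only state 1 accepts "b".
TRANSITIONS = {
    (0, "a"): (1, 2),
    (1, "a"): (1, 2),
    (2, "a"): (1, 2),
    (1, "b"): (3,),
}
ACCEPTING = {3}


def parse(symbols, permitted_nondeterminism):
    """Advance all candidate parses in lockstep; return the accepting paths."""
    candidates = [(0,)]  # each candidate is the path of states visited so far
    for symbol in symbols:
        step = []
        for path in candidates:
            for target in TRANSITIONS.get((path[-1], symbol), ()):
                step.append(path + (target,))
        if not step:
            raise ValueError("no candidate accepts %r" % symbol)
        if len(step) > permitted_nondeterminism:
            raise NondeterminismExceeded(
                "%d candidates exceed limit %d"
                % (len(step), permitted_nondeterminism)
            )
        candidates = step
    return [path for path in candidates if path[-1] in ACCEPTING]
```

With a limit of 20, `parse("aaab", 20)` succeeds because at most 8 candidates are alive at once, while `parse("aaaaa", 20)` raises on the fifth symbol, when 32 candidates would be needed. Raising the limit lets longer inputs through at the cost of tracking more candidates, which matches the memory and time trade-off mentioned earlier in the thread.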

kylegibson commented on May 25, 2024

Thanks so much for your help Peter, it is greatly appreciated.

After some reading and testing, it appears that I will not be able to use PyXB-generated bindings to open and interact with ECMA-376 (v2008 transitional) documents due to this issue with nondeterminism.

For example, the following document requires a PermittedNondeterminism of 12288 and takes about 5 seconds to process on my system (quad core, 16 GB RAM):

<?xml version="1.0"?>
<w:document xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">
  <w:body>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Normal"/>
        <w:rPr/>
      </w:pPr>
      <w:ins w:id="1" w:author="Foo" w:date="2013-01-29T14:31:00Z">
        <w:r>
          <w:rPr>
            <w:b/>
          </w:rPr>
          <w:t>This is an insertion</w:t>
        </w:r>
      </w:ins>
      <w:ins w:id="2" w:author="Foo" w:date="2013-01-29T14:31:00Z">
        <w:r>
          <w:rPr/>
          <w:t xml:space="preserve">. </w:t>
        </w:r>
      </w:ins>
      <w:r>
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t xml:space="preserve">This is </w:t>
      </w:r>
      <w:del w:id="3" w:author="Foo" w:date="2013-02-05T18:50:00Z">
        <w:r>
          <w:rPr>
            <w:b/>
          </w:rPr>
          <w:delText>the</w:delText>
        </w:r>
      </w:del>
      <w:r>
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t xml:space="preserve"> end</w:t>
      </w:r>
      <w:r>
        <w:rPr/>
        <w:t xml:space="preserve"> of the</w:t>
      </w:r>
      <w:ins w:id="4" w:author="Foo" w:date="2013-01-29T14:31:00Z">
        <w:r>
          <w:rPr/>
          <w:t xml:space="preserve"> inserted</w:t>
        </w:r>
      </w:ins>
      <w:r>
        <w:rPr/>
        <w:t xml:space="preserve"> </w:t>
      </w:r>
      <w:commentRangeStart w:id="0"/>
      <w:r>
        <w:rPr/>
        <w:t>paragraph</w:t>
      </w:r>
      <w:commentRangeEnd w:id="0"/>
      <w:r>
        <w:rPr/>
      </w:r>
      <w:r>
        <w:rPr/>
        <w:commentReference w:id="0"/>
      </w:r>
      <w:r>
        <w:rPr/>
        <w:t>.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>

PyXB includes the ECMA-376 generating script, and while it can generate the bindings without issue, actually using them in practice doesn't appear reliable. Is that your experience with ECMA-376?

pabigot commented on May 25, 2024

I have no personal experience using the ECMA-376 bindings; they were added primarily as an example after another user had problems generating them. To my knowledge that user was able to accomplish hir task with them, but may have been using a namespace that wasn't as generic.
