aalto-xml's People

Contributors

adamretter, cjmamo, cowtowncoder, jakeri, khituras, prb, simonetripodi, stebulus

aalto-xml's Issues

Would be nice to have a dedicated UTF-16 parser for byte streams

When creating a parser for UTF-16 byte streams, the XMLInputFactory returns a character stream based parser using java.io.InputStreamReader to bridge the streams.

The Javadoc for com.fasterxml.aalto.in.ReaderScanner states: "In general using this scanner is quite a bit less optimal than that of java.io.InputStream based scanner".

Additionally, a side effect of not having a dedicated byte-based parser is that javax.xml.stream.Location#getCharacterOffset() returns the character offset and not the expected byte offset.
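
To make the offset difference concrete, here is a small JDK-only sketch (it does not use Aalto). The class name OffsetDemo and the helper byteOffset are made up for this illustration. It converts a character offset into the byte offset the same position has in an encoded stream, assuming the stream carries no byte order mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class OffsetDemo {
    /**
     * Byte offset, in the encoded stream, of the character at charOffset.
     * charOffset counts UTF-16 code units and must not point between the
     * two halves of a surrogate pair.
     */
    static int byteOffset(String doc, int charOffset, Charset cs) {
        return doc.substring(0, charOffset).getBytes(cs).length;
    }

    public static void main(String[] args) {
        String doc = "<a>\u00e9</a>";          // e-acute as element content
        int charOffset = doc.indexOf("</a>");  // character offset 4
        // UTF-16 without BOM: two bytes per code unit
        System.out.println(byteOffset(doc, charOffset, StandardCharsets.UTF_16BE)); // 8
        // UTF-8: three ASCII bytes plus two bytes for U+00E9
        System.out.println(byteOffset(doc, charOffset, StandardCharsets.UTF_8));    // 5
    }
}
```

A reader-based parser only sees the decoded characters, so it can report 4 here but has no way to know about 8 or 5.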

3-byte Unicode character causes an extra character to be appended by `XMLStreamWriter2.writeRaw()`

Consider the following XML fragment: <problem>Left ≥ Right</problem>.

When the encoding is set to UTF-8 and the XMLStreamWriter2.writeRaw method is used, the XML output contains an e after the "greater than or equal to" sign: <problem>Left ≥e Right</problem>.

Here is a unit test to demonstrate the problem. Note that I am using jdk1.8.0_74 on Windows 10 and version 1.0.0 of aalto ('com.fasterxml:aalto-xml:1.0.0').

@Test
public void testSerialization_failsWithUtf8() throws Exception {

    final String input = "<problem>Left ≥ Right</problem>";

    final XMLOutputFactory2 outputFactory = new OutputFactoryImpl();
    final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    final XMLStreamWriter2 xmlStreamWriter =
            (XMLStreamWriter2) outputFactory.createXMLStreamWriter(byteArrayOutputStream, "utf-8");

    xmlStreamWriter.writeStartElement("example");
    xmlStreamWriter.writeRaw(input);
    xmlStreamWriter.writeEndElement();
    xmlStreamWriter.flush();

    final String result = byteArrayOutputStream.toString("utf-8");
    System.out.print(result);

    assertThat(result).isEqualToIgnoringCase("<example><problem>Left ≥e Right</problem></example>");
    // Why is there an e added after the ≥ character, a valid UTF-8 character (see
    // http://www.fileformat.info/info/unicode/char/2265/index.htm)?

    byteArrayOutputStream.close();
    xmlStreamWriter.closeCompletely();
}

I wrote 2 additional unit tests to show workarounds I found. The first workaround is to change the encoding to UTF-16.

@Test
public void testSerialization_worksWithUtf16() throws Exception {

    final String input = "<problem>Left ≥ Right</problem>";

    final XMLOutputFactory2 outputFactory = new OutputFactoryImpl();
    final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    final XMLStreamWriter2 xmlStreamWriter =
            (XMLStreamWriter2) outputFactory.createXMLStreamWriter(byteArrayOutputStream, "utf-16");

    xmlStreamWriter.writeStartElement("example");
    xmlStreamWriter.writeRaw(input);
    xmlStreamWriter.writeEndElement();
    xmlStreamWriter.flush();

    final String result = byteArrayOutputStream.toString("utf-16");
    System.out.print(result);

    assertThat(result).isEqualToIgnoringCase("<example><problem>Left ≥ Right</problem></example>");

    byteArrayOutputStream.close();
    xmlStreamWriter.closeCompletely();
}

My 3rd and final unit test shows that to keep the encoding as UTF-8, I can use a reader and copy the events instead of using the writeRaw string. It's safer, but it's slower too.

@Test
public void testSerialization_worksWithUtf8WithReader() throws Exception {

    final String input = "<example><problem>Left ≥ Right</problem></example>";
    final XMLInputFactory2 inputFactory = new InputFactoryImpl();
    final XMLStreamReader2 xmlStreamReader = (XMLStreamReader2) inputFactory.createXMLStreamReader(
            IOUtils.toInputStream(input, "utf-8"));

    final XMLOutputFactory2 outputFactory = new OutputFactoryImpl();
    final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    final XMLStreamWriter2 xmlStreamWriter =
            (XMLStreamWriter2) outputFactory.createXMLStreamWriter(byteArrayOutputStream, "utf-8");

    while (xmlStreamReader.hasNext()) {
        xmlStreamReader.next();
        xmlStreamWriter.copyEventFromReader(xmlStreamReader, true);
    }
    xmlStreamWriter.flush();
    xmlStreamReader.closeCompletely();

    final String result = byteArrayOutputStream.toString("utf-8");
    System.out.print(result);

    assertThat(result).isEqualToIgnoringCase("<example><problem>Left ≥ Right</problem></example>");

    byteArrayOutputStream.close();
    xmlStreamWriter.closeCompletely();
}
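
As background for the stray e reported above, the following JDK-only sketch prints the UTF-8 bytes of U+2265. It also shows that the low byte of the UTF-16 code unit 0x2265 is 0x65, which is the ASCII code of e. Whether that is how the extra character arises inside Aalto is only a guess; nothing in this report confirms it. The class and method names are made up for the illustration.

```java
import java.nio.charset.StandardCharsets;

class GreaterOrEqualBytes {
    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    /** Keeps only the low 8 bits of a UTF-16 code unit. */
    static char lowByteAsChar(char c) {
        return (char) (c & 0xFF);
    }

    public static void main(String[] args) {
        String geq = "\u2265"; // GREATER-THAN OR EQUAL TO
        System.out.println(hex(geq.getBytes(StandardCharsets.UTF_8))); // E2 89 A5
        System.out.println(lowByteAsChar('\u2265'));                   // e
    }
}
```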

Unexpected characters - even in the simplest scenario

Hello,
I am new to the "world" of aalto. Unfortunately, I am not able to make it run as expected. It keeps throwing com.fasterxml.aalto.WFCException, even in the simplest scenario (e.g. the one in this old blog entry: http://www.cowtowncoder.com/blog/archives/2011/03/entry_451.html).
Here is my code:

byte[] XML = "<tag>Tove</tag>".getBytes();
AsyncXMLInputFactory xmlInputFactory = new InputFactoryImpl();
AsyncXMLStreamReader<AsyncByteArrayFeeder> asyncReader = xmlInputFactory.createAsyncFor(XML);
int inputPtr = 0;
int type;
do {
    while ((type = asyncReader.next()) == AsyncXMLStreamReader.EVENT_INCOMPLETE) {
        asyncReader.getInputFeeder().feedInput(XML, inputPtr++, 1);
        if (inputPtr >= XML.length) {
            asyncReader.getInputFeeder().endOfInput();
        }
    }
    System.out.println("Got event of type: "+type);
} while (type != XMLEvent.END_DOCUMENT);
asyncReader.close();

It keeps returning:
com.fasterxml.aalto.WFCException: Unexpected character 't' (code 116) in epilog (unbalanced start/end tags?)

Do you know how to feed input to the feeder so that no exception is thrown? What am I doing wrong?

I am using com.fasterxml:aalto-xml version 1.0.0.

Thank you for any help

com.fasterxml.aalto.stax.StreamReaderImpl.getLocation() doesn't work

I use it to parse a ByteArrayStream. Whenever I call that API, it always returns 0 as the offset.

I took a brief look at the code. This may be the problem:
//------
protected int com.fasterxml.aalto.in.ByteBasedScanner._inputPtr


com.fasterxml.aalto.in.ByteBasedScanner.getCurrentLocation() {
    return LocationImpl.fromZeroBased(_config.getPublicId(), _config.getSystemId(),
            _pastBytes + _inputPtr, _currRow, _inputPtr - _rowStartOffset);
}
// ------
protected int com.fasterxml.aalto.in.StreamScanner._inputPtr // This field is not necessary???

XML Escape is not working

Hi all,

We are not able to escape some characters when using aalto-xml. It looks like the characters < and ' are not being escaped.

Method called: writer.writeCharacters("<>'&");
Input: <>'&
File result: &lt;>'&amp;

Our current implementation uses XMLStreamWriter and ByteXmlWriter. When debugging the code, it looks like the issue occurs in the writeCharacters method.

Can you please assist on this issue?

Regards,
Alexandre
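
For reference on the escaping question above: XML 1.0 (section 2.4) requires only < and & to be escaped in element content. > has to be escaped only where it would complete the sequence ]]>, and an apostrophe needs escaping only inside an attribute value delimited by apostrophes. The sketch below is a hand-written escaper that follows exactly those minimum rules. It is not Aalto's code, and the class name is made up. Its output for the input above happens to match the reported file result.

```java
class XmlTextEscaper {
    /** Escapes text for use as element content, applying only the mandatory XML 1.0 rules. */
    static String escapeContent(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':
                    sb.append("&lt;");
                    break;
                case '&':
                    sb.append("&amp;");
                    break;
                case '>':
                    // Mandatory only when it would complete the sequence "]]>"
                    if (i >= 2 && text.charAt(i - 1) == ']' && text.charAt(i - 2) == ']') {
                        sb.append("&gt;");
                    } else {
                        sb.append('>');
                    }
                    break;
                default:
                    sb.append(c); // includes the apostrophe, which is legal in content
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeContent("<>'&")); // &lt;>'&amp;
    }
}
```

Output that leaves > and ' unescaped in text content is therefore still well-formed XML, although some writers choose to escape more than the minimum.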

Unicode: surrogate pairs encoded invalidly after x number of chars

The problem here is that the Kappa character starts being encoded as XML character entities, and unfortunately the result is not a valid character encoding. I don't understand why this happens, or why it happens only after X characters rather than at any point.

@Test
public void tooManyKappas()
    throws XMLStreamException
{
    XMLOutputFactory factory = OutputFactoryImpl.newInstance();
    if (factory instanceof OutputFactoryImpl) {
        ((OutputFactoryImpl) factory).configureForSpeed();
    }
    // Loop to find exactly at which point entity encoding kicks in.
    for (int j = 0; j < 1000; j++) {
        final ByteArrayOutputStream baos = new ByteArrayOutputStream();
        XMLStreamWriter writer = factory.createXMLStreamWriter(baos, StandardCharsets.UTF_8.name());

        final String namespace = "http://example.org";

        StringBuilder kappas = new StringBuilder();

        for (int i = 0; i < (2000 + j); i++) {
            kappas.append("𝜅");
        }
        writer.writeStartElement("", "ex", namespace);
        writer.writeCharacters(kappas.toString());
        writer.writeEndElement();
        writer.close();

        assertEquals("fails at " + (2000 + j),
                "<ex>" + kappas + "</ex>",
                new String(baos.toByteArray(), StandardCharsets.UTF_8));
    }
}

I hope this minimized test case is of help. It's definitely due to something internal to aalto. WSTX does not have this issue (or only at a much higher loop number...).

The visible symptom is that the aalto-xml reader cannot deal with its own output in this case. The reader is correct to reject it, as it's the writer that is wrong.
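
The report does not identify the cause. One class of bug that produces length-dependent failures like this is a fixed-size buffer boundary falling between the two halves of a surrogate pair. Whether that is what happens inside Aalto is an assumption here. The JDK-only sketch below assumes the kappa in the test is U+1D705 and shows how a chunking routine can avoid ending a chunk on a high surrogate. The class and method names are made up for the illustration.

```java
import java.nio.charset.StandardCharsets;

class SurrogateChunks {
    /**
     * Length of the next chunk starting at 'start', at most 'max' chars,
     * shortened by one if it would otherwise end on a high surrogate.
     * Callers should pass max >= 2, otherwise the result can be 0.
     */
    static int safeChunkLength(CharSequence text, int start, int max) {
        int end = Math.min(text.length(), start + max);
        if (end < text.length() && end > start
                && Character.isHighSurrogate(text.charAt(end - 1))) {
            end--; // keep the surrogate pair together in the next chunk
        }
        return end - start;
    }

    public static void main(String[] args) {
        // Assumption: the kappa is U+1D705 (outside the Basic Multilingual Plane)
        String kappa = new String(Character.toChars(0x1D705));
        System.out.println(kappa.length());                                // 2 UTF-16 code units
        System.out.println(kappa.getBytes(StandardCharsets.UTF_8).length); // 4 bytes
        String text = kappa + kappa + kappa;
        System.out.println(safeChunkLength(text, 0, 3)); // 2, not 3
    }
}
```

A surrogate half written on its own, for example as a numeric character reference, is not a legal XML character, which matches the symptom of a reader rejecting the writer's output.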

Source bundles have no Bundle-SymbolicName in Manifest

See e.g.:

http://jcenter.bintray.com/com/fasterxml/aalto-xml/1.0.0/
http://jcenter.bintray.com/com/fasterxml/aalto-xml/1.1.0/

E.g. from aalto-xml-1.0.0-sources.jar:

Manifest-Version: 1.0
Archiver-Version: Plexus Archiver
Created-By: Apache Maven
Built-By: tatu
Build-Jdk: 1.7.0_79
Specification-Title: aalto-xml
Specification-Version: 1.0.0
Specification-Vendor: FasterXML
Implementation-Title: aalto-xml
Implementation-Version: 1.0.0
Implementation-Vendor-Id: com.fasterxml
Implementation-Vendor: FasterXML
Implementation-Build-Date: 2015-11-23 19:35:01-0800
X-Compile-Source-JDK: 1.6
X-Compile-Target-JDK: 1.6

Same seems to apply for the stax2-api bundles, e.g. at:

http://jcenter.bintray.com/org/codehaus/woodstox/stax2-api/

Including the source bundles in an Eclipse product becomes difficult due to the missing symbolic names. It would be great if the names could be added, as is already done for the library bundles.

Long base64 content causes readElementAsBinary always returns maxLength

Hello,

First of all, I'd like to thank you for this great library. I'm using it to implement a FrameDecoder (JBoss Netty) in order to communicate with a hardware component that uses an XML-based protocol over TCP/IP. The library works fine, but I'm experiencing odd behaviour when the XML being parsed contains long base64 content. My code looks similar to the code included in TestBase64Reader.

AsyncXMLStreamReader parser = new InputFactoryImpl( ).createAsyncXMLStreamReader( );

// the following code is started after a START_ELEMENT event

private byte[] collectBinaryData ( TypedXMLStreamReader reader ) throws XMLStreamException, IOException
{
    ByteArrayOutputStream baos = new ByteArrayOutputStream( );
    byte[] buffer = new byte[ 4096 ];
    int size;
    // read decoded bytes in chunks until the reader signals the end with -1
    while ( ( size = reader.readElementAsBinary( buffer, 0, buffer.length, Base64Variants.MIME ) ) != -1 )
    {
        baos.write( buffer, 0, size );
    }
    baos.close( );
    return baos.size( ) == 0 ? null : baos.toByteArray( );
}

The code above works fine if the base64 content is short, for example:

U3RyZWFtaW5nIEFQSSBmb3IgWE1MIChTdEFYKSBpcyBhbiBhcHBsaWNhdGlvbiBwcm9ncmFtbWluZyBpbnRlcmZhY2UgKEFQSSkgdG8gcmVhZCBhbmQgd3JpdGUgWE1MIGRvY3VtZW50cywgb3JpZ2luYXRpbmcgZnJvbSB0aGUgSmF2YSBwcm9ncmFtbWluZyBsYW5ndWFnZSBjb21tdW5pdHku

However, if the base64 content is long, the code loops forever because readElementAsBinary always returns buffer.length. I'm not sure whether this behaviour is a bug or is caused by incorrect use of the library. The following XML chunk causes this behaviour:

U3RyZWFtaW5nIEFQSSBmb3IgWE1MIChTdEFYKSBpcyBhbiBhcHBsaWNhdGlvbiBwcm9ncmFtbWluZyBpbnRlcmZhY2UgKEFQSSkgdG8gcmVhZCBhbmQgd3JpdGUgWE1MIGRvY3VtZW50cywgb3JpZ2luYXRpbmcgZnJvbSB0aGUgSmF2YSBwcm9ncmFtbWluZyBsYW5ndWFnZSBjb21tdW5pdHkuDQoNClRyYWRpdGlvbmFsbHksIFhNTCBBUElzIGFyZSBlaXRoZXI6DQoNCiAgICBET00gYmFzZWQgLSB0aGUgZW50aXJlIGRvY3VtZW50IGlzIHJlYWQgaW50byBtZW1vcnkgYXMgYSB0cmVlIHN0cnVjdHVyZSBmb3IgcmFuZG9tIGFjY2VzcyBieSB0aGUgY2FsbGluZyBhcHBsaWNhdGlvbg0KICAgIGV2ZW50IGJhc2VkIC0gdGhlIGFwcGxpY2F0aW9uIHJlZ2lzdGVycyB0byByZWNlaXZlIGV2ZW50cyBhcyBlbnRpdGllcyBhcmUgZW5jb3VudGVyZWQgd2l0aGluIHRoZSBzb3VyY2UgZG9jdW1lbnQuDQoNCkJvdGggaGF2ZSBhZHZhbnRhZ2VzOyB0aGUgZm9ybWVyIChmb3IgZXhhbXBsZSwgRE9NKSBhbGxvd3MgZm9yIHJhbmRvbSBhY2Nlc3MgdG8gdGhlIGRvY3VtZW50LCB0aGUgbGF0dGVyIChlLmcuIFNBWCkgcmVxdWlyZXMgYSBzbWFsbCBtZW1vcnkgZm9vdHByaW50IGFuZCBpcyB0eXBpY2FsbHkgbXVjaCBmYXN0ZXIuDQoNClRoZXNlIHR3byBhY2Nlc3MgbWV0YXBob3JzIGNhbiBiZSB0aG91Z2h0IG9mIGFzIHBvbGFyIG9wcG9zaXRlcy4gQSB0cmVlIGJhc2VkIEFQSSBhbGxvd3MgdW5saW1pdGVkLCByYW5kb20gYWNjZXNzIGFuZCBtYW5pcHVsYXRpb24sIHdoaWxlIGFuIGV2ZW50IGJhc2VkIEFQSSBpcyBhICdvbmUgc2hvdCcgcGFzcyB0aHJvdWdoIHRoZSBzb3VyY2UgZG9jdW1lbnQuDQoNClN0QVggd2FzIGRlc2lnbmVkIGFzIGEgbWVkaWFuIGJldHdlZW4gdGhlc2UgdHdvIG9wcG9zaXRlcy4gSW4gdGhlIFN0QVggbWV0YXBob3IsIHRoZSBwcm9ncmFtbWF0aWMgZW50cnkgcG9pbnQgaXMgYSBjdXJzb3IgdGhhdCByZXByZXNlbnRzIGEgcG9pbnQgd2l0aGluIHRoZSBkb2N1bWVudC4gVGhlIGFwcGxpY2F0aW9uIG1vdmVzIHRoZSBjdXJzb3IgZm9yd2FyZCAtICdwdWxsaW5nJyB0aGUgaW5mb3JtYXRpb24gZnJvbSB0aGUgcGFyc2VyIGFzIGl0IG5lZWRzLiBUaGlzIGlzIGRpZmZlcmVudCBmcm9tIGFuIGV2ZW50IGJhc2VkIEFQSSAtIHN1Y2ggYXMgU0FYIC0gd2hpY2ggJ3B1c2hlcycgZGF0YSB0byB0aGUgYXBwbGljYXRpb24gLSByZXF1aXJpbmcgdGhlIGFwcGxpY2F0aW9uIHRvIG1haW50YWluIHN0YXRlIGJldHdlZW4gZXZlbnRzIGFzIG5lY2Vzc2FyeSB0byBrZWVwIHRyYWNrIG9mIGxvY2F0aW9uIHdpdGhpbiB0aGUgZG9jdW1lbnQu

Both XML chunks above use UTF-8 encoding.

I'm using aalto-xml v0.9.8 running on Oracle J2SE 1.6.

Best regards,
Sandy Pérez González
IT Consultant,
Indaba Consultores S.L.
http://www.indaba.es/
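
For comparison, the loop contract the code above relies on is a positive count per call, then -1 once the content is exhausted. The sketch below exercises that contract with the JDK's own java.util.Base64 streams. It does not use Aalto or readElementAsBinary and only illustrates the expected termination behaviour with a payload larger than the 4096-byte buffer. The class and method names are made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class ChunkedBase64 {
    /** Decodes MIME base64 text through a fixed-size buffer, stopping when read() returns -1. */
    static byte[] decodeInChunks(String base64Text, int chunkSize) {
        byte[] encoded = base64Text.getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[chunkSize];
        try (InputStream in = Base64.getMimeDecoder().wrap(new ByteArrayInputStream(encoded))) {
            int size;
            while ((size = in.read(buffer, 0, buffer.length)) != -1) {
                out.write(buffer, 0, size);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Encodes a generated payload and checks that chunked decoding restores it. */
    static boolean roundTrips(int payloadSize, int chunkSize) {
        byte[] payload = new byte[payloadSize];
        for (int i = 0; i < payload.length; i++) {
            payload[i] = (byte) (i * 31);
        }
        String text = Base64.getMimeEncoder().encodeToString(payload);
        return Arrays.equals(payload, decodeInChunks(text, chunkSize));
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(10_000, 4096)); // true
    }
}
```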

Attribute values with UTF-8 characters crash

The non-blocking XML parser crashes when parsing the XML below. I fed it one byte at a time.

<?xml version="1.0" encoding="utf-8"?>
<feed>
    <entry>
        <updated label="Gräs">2012-04-03T14:05:27-07:00
        </updated>
    </entry>
</feed>

com.fasterxml.aalto.WFCException: Illegal XML character ((CTRL-CHAR, code 128))

aalto-xml version 0.9.10 (0.9.9 too)

https://raw.githubusercontent.com/intracer/aalto-xml/57dcf8f75c3879ede3cbd9214aaf2c986165a153/src/test/resources/latin.xml

SJSXP and Woodstox parse it without errors.

Exception in thread "main" com.fasterxml.aalto.WFCException: Illegal XML character ((CTRL-CHAR, code 128))
at [row,col {unknown-source}]: [24,7]
at com.fasterxml.aalto.in.XmlScanner.reportInputProblem(XmlScanner.java:1339)
at com.fasterxml.aalto.in.XmlScanner.handleInvalidXmlChar(XmlScanner.java:1536)
at com.fasterxml.aalto.in.Utf8Scanner.skipCharacters(Utf8Scanner.java:806)
at com.fasterxml.aalto.in.XmlScanner.skipToken(XmlScanner.java:425)
at com.fasterxml.aalto.in.StreamScanner.nextFromTree(StreamScanner.java:191)
at com.fasterxml.aalto.stax.StreamReaderImpl.next(StreamReaderImpl.java:757)
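
Feeding one byte at a time means that every multi-byte UTF-8 sequence (such as the two bytes of ä) arrives split across calls, so a decoder has to carry the partial sequence over to the next call. The JDK-only sketch below shows that bookkeeping with java.nio.charset.CharsetDecoder. It illustrates the general technique and is not Aalto's scanner code. The class name is made up.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

class IncrementalUtf8 {
    /** Decodes UTF-8 input that arrives one byte per step. */
    static String decodeByteAtATime(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer pending = ByteBuffer.allocate(8);           // holds an unfinished sequence
        CharBuffer out = CharBuffer.allocate(input.length + 1);
        for (byte b : input) {
            pending.put(b);
            pending.flip();
            // With endOfInput=false an incomplete sequence is left unread in 'pending'
            decoder.decode(pending, out, false);
            pending.compact();                                 // keep leftover bytes for the next step
        }
        pending.flip();
        decoder.decode(pending, out, true);                    // signal end of input
        decoder.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "label=\"Gr\u00e4s\"".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeByteAtATime(bytes).equals("label=\"Gr\u00e4s\"")); // true
    }
}
```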

Multi-threading bug in XmlScanner/FixedNsContext

I believe that there is a multi-threading bug in XmlScanner/FixedNsContext.

I suspect that this is because XmlScanner uses the static FixedNsContext.EMPTY_CONTEXT initially, so threads are sharing the _tmpDecl array in FixedNsContext.

I am not familiar enough with Aalto to produce a fix, but I did manage to create a reproducible example (you may need to run it a few times to get the error):

package performance;

import com.fasterxml.aalto.AsyncByteBufferFeeder;
import com.fasterxml.aalto.AsyncXMLInputFactory;
import com.fasterxml.aalto.AsyncXMLStreamReader;
import com.fasterxml.aalto.evt.EventAllocatorImpl;
import com.fasterxml.aalto.stax.InputFactoryImpl;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import javax.xml.stream.util.XMLEventAllocator;

public class AaltoBug implements Runnable {

    private static String xml = "<?xml version='1.0'?><stream:stream xmlns='jabber:client' xmlns:stream='http://etherx.jabber.org/streams' id='4095288169' from='localhost' version='1.0' xml:lang='en'>";
    private static int NUM_THREADS = 5;
    private static XMLEventAllocator allocator = EventAllocatorImpl.getDefaultInstance();
    private static AsyncXMLInputFactory inputFactory = new InputFactoryImpl();

    public static void main(String[] args) throws InterruptedException {
        ExecutorService ex = Executors.newFixedThreadPool(NUM_THREADS);

        for (int i = 0; i < 100000; i++) {
            ex.submit(new AaltoBug(i));
        }

        ex.shutdown();
        ex.awaitTermination(Integer.MAX_VALUE, TimeUnit.SECONDS);
    }

    private final int count;

    public AaltoBug(int count) {
        this.count = count;
    }

    @Override
    public void run() {
        try {
            ByteBuffer bb = StandardCharsets.UTF_8.encode(xml);
            AsyncXMLStreamReader<AsyncByteBufferFeeder> parser = inputFactory.createAsyncForByteBuffer();
            parser.getInputFeeder().feedInput(bb);
            while (parser.hasNext()) {
                int eventType = parser.next();
                if (eventType == AsyncXMLStreamReader.EVENT_INCOMPLETE) {
                    break;
                }

                allocator.allocate(parser);
            }
        } catch (Exception e) {
            System.out.println("Error in " + count);
            e.printStackTrace();
        }
    }
}

Which produces this exception:

java.lang.ArrayIndexOutOfBoundsException: 10
    at java.util.ArrayList.add(ArrayList.java:459)
    at com.fasterxml.aalto.in.FixedNsContext.reuseOrCreate(FixedNsContext.java:76)
    at com.fasterxml.aalto.in.XmlScanner.getNonTransientNamespaceContext(XmlScanner.java:916)
    at com.fasterxml.aalto.stax.StreamReaderImpl.getNonTransientNamespaceContext(StreamReaderImpl.java:1528)
    at org.codehaus.stax2.ri.evt.Stax2EventAllocatorImpl.createStartElement(Stax2EventAllocatorImpl.java:160)
    at org.codehaus.stax2.ri.evt.Stax2EventAllocatorImpl.allocate(Stax2EventAllocatorImpl.java:69)
    at com.fasterxml.aalto.evt.EventAllocatorImpl.allocate(EventAllocatorImpl.java:103)
    at performance.AaltoBug.run(AaltoBug.java:61)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Remove auto-registration of SAX parser

Aalto registers itself with the JAXP SPI as a SAX parser factory, which causes some other external libraries (liquibase, and a few others) to blow up due to Aalto not implementing validation.
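
Until the registration is removed, code that needs a validating parser can avoid the service lookup altogether. The sketch below assumes Java 9 or later, where SAXParserFactory.newDefaultInstance() returns the JDK's built-in implementation regardless of which providers are on the classpath. The class name PinnedSaxFactory is made up.

```java
import javax.xml.parsers.SAXParserFactory;

class PinnedSaxFactory {
    /** Returns the JDK's built-in SAX factory, bypassing the JAXP provider lookup (Java 9+). */
    static SAXParserFactory jdkFactory() {
        SAXParserFactory factory = SAXParserFactory.newDefaultInstance();
        factory.setNamespaceAware(true);
        return factory;
    }

    public static void main(String[] args) {
        System.out.println(jdkFactory().getClass().getName());
    }
}
```

This only helps for code you control. Third-party libraries that call SAXParserFactory.newInstance() themselves still go through the lookup, which is why the issue asks for the registration to be removed.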

Support Latin-1 (ISO-8859-1) via Async parser

(note: sort of related to #50)

Up to 1.1.0, only the UTF-8 and US-ASCII (7-bit ASCII) encodings are supported by the async parser.
But since Latin-1 byte values map directly to the first 256 Unicode code points (even though the byte encoding differs from UTF-8 above 0x7F), it should be possible and relatively easy to allow that too.
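
The reason this is expected to be easy can be sketched in a few lines of JDK-only code: decoding Latin-1 needs no state, because each byte is its own code point. The class and method names below are made up for the illustration.

```java
import java.nio.charset.StandardCharsets;

class Latin1Decode {
    /** ISO-8859-1 decoding: the unsigned byte value is the Unicode code point. */
    static char decode(byte b) {
        return (char) (b & 0xFF);
    }

    /** Checks the one-line decoder against the JDK's ISO-8859-1 charset for all 256 byte values. */
    static boolean matchesJdkForAllBytes() {
        for (int i = 0; i < 256; i++) {
            byte b = (byte) i;
            String viaJdk = new String(new byte[] { b }, StandardCharsets.ISO_8859_1);
            if (viaJdk.charAt(0) != decode(b)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matchesJdkForAllBytes()); // true
    }
}
```

Unlike UTF-8, no character can be split across two feed calls, so an async scanner would not need any pending-byte handling for character data.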

Combining async-http-client and aalto-xml

I have almost managed to combine aalto-xml and https://github.com/sonatype/async-http-client.

It almost works but I end up with an error when I try to push it hard. I am using latest 0.9.8-SNAPSHOT.

Any idea what this might be?
Concurrency issue in my code?

2012-04-03 22:49:54,598 ERROR [New I/O client worker #1-3] XmlParseAsyncHandler: Got throwable
java.lang.IllegalStateException: Internal error: should never execute this code path
at com.fasterxml.aalto.async.AsyncByteScanner.throwInternal(AsyncByteScanner.java:2936) ~[classes/:na]
at com.fasterxml.aalto.async.AsyncUtfScanner.handleAttrValuePending(AsyncUtfScanner.java:1350) ~[classes/:na]
at com.fasterxml.aalto.async.AsyncUtfScanner.handleAttrValue(AsyncUtfScanner.java:1136) ~[classes/:na]
at com.fasterxml.aalto.async.AsyncByteScanner.handleStartElement(AsyncByteScanner.java:2149) ~[classes/:na]
at com.fasterxml.aalto.async.AsyncByteScanner.nextFromTree(AsyncByteScanner.java:656) ~[classes/:na]
at com.fasterxml.aalto.stax.StreamReaderImpl.next(StreamReaderImpl.java:720) ~[classes/:na]
at net.jakeri.appstore.toplist.XmlParseAsyncHandler.onBodyPartReceived(XmlParseAsyncHandler.java:41) ~[classes/:na]
at com.ning.http.client.providers.netty.NettyAsyncHttpProvider.updateBodyAndInterrupt(NettyAsyncHttpProvider.java:1435) [async-http-client-1.7.1.jar:na]
at com.ning.http.client.providers.netty.NettyAsyncHttpProvider.access$2300(NettyAsyncHttpProvider.java:136) [async-http-client-1.7.1.jar:na]
at com.ning.http.client.providers.netty.NettyAsyncHttpProvider$HttpProtocol.handle(NettyAsyncHttpProvider.java:2183) ~[async-http-client-1.7.1.jar:na]
at com.ning.http.client.providers.netty.NettyAsyncHttpProvider.messageReceived(NettyAsyncHttpProvider.java:1091) [async-http-client-1.7.1.jar:na]
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:75) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:558) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:777) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:143) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:558) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:777) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.unfoldAndFireMessageReceived(ReplayingDecoder.java:522) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:501) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:438) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:75) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.handler.codec.http.HttpClientCodec.handleUpstream(HttpClientCodec.java:72) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:558) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:553) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:343) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:274) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:194) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:102) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [netty-3.3.1.Final.jar:na]
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [na:1.6.0_29]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [na:1.6.0_29]
at java.lang.Thread.run(Thread.java:680) [na:1.6.0_29]

Issue with SAXParserFactory

(reported by Charles Foster)

When I don't set "javax.xml.parsers.SAXParserFactory"

calling:

new com.fasterxml.aalto.sax.SAXParserFactoryImpl().newSAXParser()

gets me a:

com.fasterxml.aalto.sax.SAXParserImpl

but calling:

com.fasterxml.aalto.sax.SAXParserFactoryImpl.newInstance().newSAXParser()

gets me a:

com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl
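
The behaviour is consistent with how Java resolves static methods: they are inherited but never overridden. If SAXParserFactoryImpl does not declare its own newInstance() (an assumption; the report does not show the class), then SAXParserFactoryImpl.newInstance() is just another way of writing javax.xml.parsers.SAXParserFactory.newInstance(), which performs the usual JAXP lookup. The sketch below reproduces the effect with two hypothetical classes.

```java
class StaticLookupDemo {
    /** Stand-in for an abstract JAXP-style factory with a static lookup method. */
    static class BaseFactory {
        static BaseFactory newInstance() {
            // Stand-in for the provider lookup: returns the default, not the subclass
            return new BaseFactory();
        }
        String name() {
            return "lookup-default";
        }
    }

    /** Stand-in for a concrete implementation that does not redeclare newInstance(). */
    static class ImplFactory extends BaseFactory {
        @Override
        String name() {
            return "impl";
        }
    }

    public static void main(String[] args) {
        // Compiles because static methods are inherited, but it runs BaseFactory's code
        System.out.println(ImplFactory.newInstance().name()); // lookup-default
        // Direct construction is the way to get the specific implementation
        System.out.println(new ImplFactory().name());         // impl
    }
}
```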

Carriage return (\r) dropped by `XMLStreamWriter` implementation

I'm using StAX to do some XML processing, reading events and then writing them, sometimes with some changes, etc., as you do, and I've noticed something odd: aalto, given XML containing &#13; in a text node will, at the end of this process, produce text with neither the CR nor a newline. This differs from other StAX implementations and appears to be a bug (the handling of this particular case is different across all implementations I know of, but aalto's seems to be the most-wrong).
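
Some background on why the CR matters: XML 1.0 (section 2.11) has parsers normalize a literal carriage return in the input to a line feed, so the only way for a CR to survive a write/read round trip is to write it as the character reference &#13;. The JDK-only sketch below reads the reference with whatever StAX implementation XMLInputFactory.newFactory() finds (the JDK's own if nothing else is on the classpath). It also shows a hypothetical escapeCr helper of the kind a writer would need. It is not taken from any of the implementations discussed.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class CarriageReturnRoundTrip {
    /** Returns the concatenated character data of a small XML document. */
    static String readText(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newFactory();
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            StringBuilder sb = new StringBuilder();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    sb.append(reader.getText());
                }
            }
            reader.close();
            return sb.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical writer-side helper: keeps a CR intact by writing it as a character reference. */
    static String escapeCr(String text) {
        return text.replace("\r", "&#13;");
    }

    public static void main(String[] args) {
        String text = readText("<x>a&#13;a</x>");
        System.out.println((int) text.charAt(1));                               // 13
        System.out.println(readText("<x>" + escapeCr(text) + "</x>").equals(text)); // true
    }
}
```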

The following program (class Minimal) illustrates the problem:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;


public final class Minimal {
    /** Enumeration of supported StAX implementations. */
    public enum StAXImplementation {
        AALTO("com.fasterxml.aalto.stax.InputFactoryImpl",
                "com.fasterxml.aalto.stax.OutputFactoryImpl",
                "com.fasterxml.aalto.stax.EventFactoryImpl"),
        /** JDK built in implementation, based on Xerces. */
        JDK("com.sun.xml.internal.stream.XMLInputFactoryImpl",
                "com.sun.xml.internal.stream.XMLOutputFactoryImpl",
                "com.sun.xml.internal.stream.events.XMLEventFactoryImpl"),
        WOODSTOX("com.ctc.wstx.stax.WstxInputFactory",
                "com.ctc.wstx.stax.WstxOutputFactory",
                "com.ctc.wstx.stax.WstxEventFactory"),
        XERCES(JDK.inputFactory,
                JDK.outputFactory,
                "org.apache.xerces.stax.XMLEventFactoryImpl");

        final String inputFactory;
        final String outputFactory;
        final String eventFactory;

        private StAXImplementation(final String inputFactory,
                final String outputFactory, final String eventFactory) {
            this.inputFactory = inputFactory;
            this.outputFactory = outputFactory;
            this.eventFactory = eventFactory;
        }

        /**
         * Tell the JDK to use this StAXImplementation.
         */
        public void use() {
            System.setProperty("javax.xml.stream.XMLInputFactory",
                    inputFactory);
            System.setProperty("javax.xml.stream.XMLOutputFactory",
                    outputFactory);
            System.setProperty("javax.xml.stream.XMLEventFactory",
                    eventFactory);
        }
    }


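    // Note: code point 32 is the space character, not CR (code point 13), so as written
    // EXPANDED_TEXT is "a a" and the "equals expanded text" checks print false.
    static final String CR = new String(Character.toChars(32));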
    static final String TEXT="a&#13;a";
    static final String EXPANDED_TEXT = "a" + CR + "a";
    static final String NEWLINE_TEXT = "a\na";
    static final String XML = "<x>" + TEXT + "</x>";
    static final byte[] XML_BYTES = XML.getBytes(
            StandardCharsets.UTF_8);

    /** Run some tests for each StAX implementation. */
    public static void main(final String[] args) throws Exception {
        for (final StAXImplementation impl : StAXImplementation.values()) {
            runTest(impl);
            System.out.println();
        }
    }

    private static void runTest(final StAXImplementation impl)
            throws Exception {
        impl.use();
        System.out.println("************** Trying " + impl + " **************");
        final XMLInputFactory inputFactory = XMLInputFactory.newFactory();
        final XMLOutputFactory outputFactory =
                XMLOutputFactory.newFactory();
        final ByteArrayOutputStream baos = new ByteArrayOutputStream();

        System.out.println("-------------- Factory classes:");
        printName("XMLInputFactory", inputFactory);
        printName("XMLOutputFactory", outputFactory);
        printName("XMLEventFactory", XMLEventFactory.newFactory());

        System.out.println("-------------- StAX results:");
        final XMLEventReader r = inputFactory.createXMLEventReader(
                new ByteArrayInputStream(XML_BYTES));
        final XMLEventWriter w = outputFactory.createXMLEventWriter(baos);
        final StringBuilder buffer = new StringBuilder();
        while (r.hasNext()) {
            final XMLEvent e = r.nextEvent();
            if (e.isStartDocument()) {
                // Avoid the XML declaration. Not present in the input.
                continue;
            }
            if (e.isCharacters()) {
                buffer.append(e.asCharacters().getData());
            } else {
                if (buffer.length() > 0) {
                    testText(buffer.toString());
                    buffer.setLength(0);
                }
            }
            w.add(e);
        }
        r.close();
        w.flush();
        w.close();
        final byte[] resultBytes = baos.toByteArray();
        System.out.println("StAX XML: [" + new String(resultBytes,
                StandardCharsets.UTF_8) + "]");
        testDOM(resultBytes);
    }

    private static void printName(final String name, final Object obj) {
        System.out.println(name + "=" + obj.getClass().getName());
    }

    private static void testText(final String text) {
        System.out.println("Buffered text: [" + text + "]");
        System.out.println("Code point at index 1: " +
                Character.codePointAt(text, 1));
        System.out.println("Buffered text equals input text? " +
                TEXT.equals(text));
        System.out.println(
                "Buffered text equals expanded text? " +
                        EXPANDED_TEXT.equals(text));
        System.out.println("Buffered text has \\n for \\r? " +
                NEWLINE_TEXT.equals(text));
    }

    private static void testDOM(final byte[] resultBytes) throws Exception {
        final DocumentBuilderFactory dbf =
                DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setExpandEntityReferences(false);
        final Document doc = dbf.newDocumentBuilder().parse(
                new ByteArrayInputStream(resultBytes));
        final XPath xpath = XPathFactory.newInstance().newXPath();
        final String domText = (String)xpath.evaluate("/x/text()", doc,
                XPathConstants.STRING);
        System.out.println("============== DOM results:");
        testText(domText);
    }
}

It's pretty straightforward to run if you have all the implementations in your classpath. I use the following pom:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>foobar</groupId>
  <artifactId>stax-stuff</artifactId>
  <version>0.0.1-SNAPSHOT</version>

  <dependencies>
    <dependency>
      <groupId>com.fasterxml.woodstox</groupId>
      <artifactId>woodstox-core</artifactId>
      <version>5.0.1</version>
    </dependency>
    <dependency>
      <groupId>xerces</groupId>
      <artifactId>xercesImpl</artifactId>
      <version>2.11.0</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml</groupId>
      <artifactId>aalto-xml</artifactId>
      <version>1.0.0</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.3</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

Output:

************** Trying AALTO **************
-------------- Factory classes:
XMLInputFactory=com.fasterxml.aalto.stax.InputFactoryImpl
XMLOutputFactory=com.fasterxml.aalto.stax.OutputFactoryImpl
XMLEventFactory=com.fasterxml.aalto.stax.EventFactoryImpl
-------------- StAX results:
Buffered text: [a
a]
Code point at index 1: 13
Buffered text equals input text? false
Buffered text equals expanded text? false
Buffered text has \n for \r? false
StAX XML: [<x>aa</x>]
============== DOM results:
Buffered text: [aa]
Code point at index 1: 97
Buffered text equals input text? false
Buffered text equals expanded text? false
Buffered text has \n for \r? false

************** Trying JDK **************
-------------- Factory classes:
XMLInputFactory=com.sun.xml.internal.stream.XMLInputFactoryImpl
XMLOutputFactory=com.sun.xml.internal.stream.XMLOutputFactoryImpl
XMLEventFactory=com.sun.xml.internal.stream.events.XMLEventFactoryImpl
-------------- StAX results:
Buffered text: [a
a]
Code point at index 1: 13
Buffered text equals input text? false
Buffered text equals expanded text? false
Buffered text has \n for \r? false
StAX XML: [<x>a
a</x>]
============== DOM results:
Buffered text: [a
a]
Code point at index 1: 10
Buffered text equals input text? false
Buffered text equals expanded text? false
Buffered text has \n for \r? true

************** Trying WOODSTOX **************
-------------- Factory classes:
XMLInputFactory=com.ctc.wstx.stax.WstxInputFactory
XMLOutputFactory=com.ctc.wstx.stax.WstxOutputFactory
XMLEventFactory=com.ctc.wstx.stax.WstxEventFactory
-------------- StAX results:
Buffered text: [a
a]
Code point at index 1: 13
Buffered text equals input text? false
Buffered text equals expanded text? false
Buffered text has \n for \r? false
StAX XML: [<x>a&#xd;a</x>]
============== DOM results:
Buffered text: [a
a]
Code point at index 1: 13
Buffered text equals input text? false
Buffered text equals expanded text? false
Buffered text has \n for \r? false

************** Trying XERCES **************
-------------- Factory classes:
XMLInputFactory=com.sun.xml.internal.stream.XMLInputFactoryImpl
XMLOutputFactory=com.sun.xml.internal.stream.XMLOutputFactoryImpl
XMLEventFactory=org.apache.xerces.stax.XMLEventFactoryImpl
-------------- StAX results:
Buffered text: [a
a]
Code point at index 1: 13
Buffered text equals input text? false
Buffered text equals expanded text? false
Buffered text has \n for \r? false
StAX XML: [<x>a
a</x>]
============== DOM results:
Buffered text: [a
a]
Code point at index 1: 10
Buffered text equals input text? false
Buffered text equals expanded text? false
Buffered text has \n for \r? true
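
The differences in the output above come down to how a carriage return in character data is written and read back. XML 1.0 (section 2.11) has parsers normalize a literal CR or CRLF to LF, so a CR survives a round trip only when the writer emits it as a character reference, as the Woodstox output does. A minimal sketch of both rules follows; the class and method names are hypothetical and not part of any of the libraries tested here.

```java
/** Illustrative helpers; the names are assumptions, not library API. */
final class LineEnds {
    private LineEnds() {}

    /** What a conforming parser does to literal line ends: CRLF and lone CR become LF. */
    static String normalize(String raw) {
        return raw.replace("\r\n", "\n").replace('\r', '\n');
    }

    /** What a writer can do so that a CR survives re-parsing: emit a character reference. */
    static String escapeCr(String text) {
        return text.replace("\r", "&#xd;");
    }
}
```

A writer that emits the raw CR produces a document whose parsed text contains LF instead, which matches the "has \n for \r? true" lines in the DOM results.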

Support windows-1251 encoding for async (non-blocking) parsing

Hello.

I tried to parse xml with windows-1251, but caught error:

com.fasterxml.aalto.WFCException: Unsupported encoding 'windows-1251': only UTF-8 and US-ASCII support by async parser

Are there any plans to support encodings other than UTF-8 and US-ASCII?

Or did I miss something?
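
One possible workaround, sketched here with JDK classes only, is to transcode the document to UTF-8 before handing the bytes to the async parser. The helper below is hypothetical (it is not part of Aalto), buffers the whole document, and does not handle byte order marks; it also rewrites the encoding named in the XML declaration so that the declaration stays truthful after transcoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical pre-processing helper; not part of Aalto. */
final class Utf8Transcoder {
    // Matches the encoding pseudo-attribute inside a leading XML declaration.
    private static final Pattern DECLARED_ENCODING =
            Pattern.compile("^(<\\?xml[^>]*?encoding\\s*=\\s*)([\"'])[^\"']*\\2");

    private Utf8Transcoder() {}

    /** Decodes the whole document with sourceCharset and re-encodes it as UTF-8. */
    static byte[] toUtf8(byte[] document, String sourceCharset) {
        String text = new String(document, Charset.forName(sourceCharset));
        Matcher m = DECLARED_ENCODING.matcher(text);
        if (m.find()) {
            // Replace only the declared encoding name, keeping the original quotes.
            text = m.replaceFirst("$1$2UTF-8$2");
        }
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

Because the whole input is decoded up front, this gives up the incremental feeding that makes the async parser attractive; a chunk-wise variant would need a stateful CharsetDecoder.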

Parsing error on hexadecimal character reference in attribute value

When parsing the following xml:
<root att="&#xA;"></root>

The parser fails with the following error:

com.fasterxml.aalto.WFCException: Unexpected character 'A' (code 65) expected a hex digit (0-9a-fA-F) for character entity
at [row,col {unknown-source}]: [1,16]
at com.fasterxml.aalto.in.XmlScanner.reportInputProblem(XmlScanner.java:1333)
at com.fasterxml.aalto.in.XmlScanner.throwUnexpectedChar(XmlScanner.java:1498)
at com.fasterxml.aalto.async.AsyncByteArrayScanner.handleHexEntityInAttribute(AsyncByteArrayScanner.java:4198)

The parser does not seem to support hexadecimal character references that do not start with [0-9], although these are allowed by the XML specification.

I will issue a pull request shortly containing a fix proposal with a JUnit test case.

PS: as it is my first comment here, thanks a lot for this great library!
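
For reference, the XML 1.0 production is CharRef ::= '&#x' [0-9a-fA-F]+ ';', so every digit, including the first, may be a letter. The following is a minimal sketch of a decoder that follows that rule; the class name is hypothetical and the code is not taken from Aalto.

```java
/** Illustrative decoder for the digits of a hexadecimal character reference. */
final class HexCharRef {
    private HexCharRef() {}

    /** Returns the value of an ASCII hex digit, or -1 if c is not one. */
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Decodes the digits between "&#x" and ";" into a code point. */
    static int decode(String hexDigits) {
        if (hexDigits.isEmpty()) {
            throw new IllegalArgumentException("no hex digits");
        }
        int value = 0;
        for (int i = 0; i < hexDigits.length(); i++) {
            int digit = hexValue(hexDigits.charAt(i));
            if (digit < 0) {
                throw new IllegalArgumentException("not a hex digit: " + hexDigits.charAt(i));
            }
            value = (value << 4) | digit;
            if (value > Character.MAX_CODE_POINT) {
                throw new IllegalArgumentException("value exceeds U+10FFFF");
            }
        }
        return value;
    }
}
```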

AsyncXMLStreamReader#getLocationInfo().getEndingCharOffset() always -1

The following is always -1:

AsyncXMLStreamReader#getLocationInfo().getEndingCharOffset() == -1
AsyncXMLStreamReader#getLocationInfo().getStartingCharOffset() == -1

although there definitely are characters. The byte offset is correct:

AsyncXMLStreamReader#getLocationInfo().getEndingByteOffset() == 686

Please note that the character offset may differ from the byte offset, e.g. when the XML contains emojis (surrogate pairs).

(version 1.0.0)
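
To illustrate why the two offsets diverge, here is a small JDK-only sketch that counts the same text in UTF-8 bytes, UTF-16 code units (Java chars), and Unicode code points. The class name is hypothetical; which of the two character counts a parser should report is left open here.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative counters for the different notions of "offset". */
final class Offsets {
    private Offsets() {}

    /** Length in bytes when encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Length in UTF-16 code units, i.e. Java chars. */
    static int utf16Units(String s) {
        return s.length();
    }

    /** Length in Unicode code points; a surrogate pair counts once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }
}
```

For the document <a>😀</a>, the emoji takes four bytes in UTF-8, two chars in UTF-16, and one code point, so the three counts are 11, 9, and 8.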

SAXParserImpl fails to resolve locations

In SAXParserImpl.parse (line 328) an instance of ReaderConfig is created with the arguments publicId and systemId passed in the wrong order. This causes getPublicId() and getSystemId() to produce bad results.

XMLStreamWriter#writeNamespace writes XMLConstants.XML_NS_URI

I noticed that

com.fasterxml.aalto.out.NonRepairingStreamWriter#writeNamespace(XMLConstants.XML_NS_PREFIX, XMLConstants.XML_NS_URI)

writes the full namespace xmlns declaration, e.g.:

<element xmlns:xml="http://www.w3.org/XML/1998/namespace" xml:lang="en">

although it's not necessary.

The reference implementation that ships with the JDK behaves differently: it doesn't write this namespace declaration.

It's not a bug, but the XML looks bloated.
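
Until the writer itself skips the declaration, callers can filter it out. The sketch below relies on the rule from Namespaces in XML that the xml prefix is bound to http://www.w3.org/XML/1998/namespace by definition and need not be declared. The class and method names are hypothetical.

```java
import javax.xml.XMLConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical caller-side filter; not part of Aalto or the StAX API. */
final class XmlPrefixFilter {
    private XmlPrefixFilter() {}

    /** True for the xml prefix bound to its predefined URI, which never needs declaring. */
    static boolean isImplicitXmlBinding(String prefix, String uri) {
        return XMLConstants.XML_NS_PREFIX.equals(prefix)
                && XMLConstants.XML_NS_URI.equals(uri);
    }

    /** Writes the namespace declaration unless it is the implicit xml binding. */
    static void writeNamespaceIfNeeded(XMLStreamWriter writer, String prefix, String uri)
            throws XMLStreamException {
        if (!isImplicitXmlBinding(prefix, uri)) {
            writer.writeNamespace(prefix, uri);
        }
    }
}
```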

Implement SAX handler for non-blocking (async) parser

With the non-blocking StAX implementation, it should be relatively easy to add an interface for non-blocking SAX parsing: feed a block of content, which triggers all SAX callbacks, then return. The main trick is that of keeping track

Unclear license for Stax2

The Stax2 Java files say that they are subject to the license found in the "LICENSE" file shipped with the source code. However, there is NO file named "LICENSE" anywhere in the source. So, what license applies to Stax2? (Looking at stax2-api-3.1.1-sources.jar and the current GitHub source, there is no LICENSE file to be found.)

Async parse of simple document fails with WFCException: Unexpected character 'd' (code 100) in epilog

This is a test case with the simplest document I could find. The code reproducing the problem is a copy-paste of an official example with a small fix.

  @Test
  public void asynchronousParseSmallestDocument() throws Exception {
    final String xml = 
//        "<?xml version=\"1.0\" encoding=\"US-ASCII\"?>" +
        "<d/>" ;

    final AsyncXMLInputFactory asyncXmlInputFactory = new InputFactoryImpl() ;
    final AsyncXMLStreamReader< AsyncByteArrayFeeder > xmlStreamReader =
        asyncXmlInputFactory.createAsyncFor( xml.getBytes( Charsets.US_ASCII ) ) ;
    final AsyncByteArrayFeeder inputFeeder = xmlStreamReader.getInputFeeder() ;

    byte[] xmlBytes = xml.getBytes() ;
    int bufferFeedLength = 1 ; 
    int currentByteOffset = 0 ;
    int type ;
    do{
      while( ( type = xmlStreamReader.next() ) == AsyncXMLStreamReader.EVENT_INCOMPLETE ) {
        byte[] buffer = new byte[]{ xmlBytes[ currentByteOffset ] } ;
        currentByteOffset ++ ;
        inputFeeder.feedInput( buffer, 0, bufferFeedLength ) ;
        if( currentByteOffset >= xmlBytes.length ) {
          inputFeeder.endOfInput() ;
        }
      }
      switch( type ) {
        case XMLEvent.START_DOCUMENT :
          LOGGER.debug( "start document" ) ;
          break ;
        case XMLEvent.START_ELEMENT :
          LOGGER.debug( "start element: " + xmlStreamReader.getName() ) ;
          break ;
        case XMLEvent.CHARACTERS :
          LOGGER.debug( "characters: " + xmlStreamReader.getText() ) ;
          break ;
        case XMLEvent.END_ELEMENT :
          LOGGER.debug( "end element: " + xmlStreamReader.getName() ) ;
          break ;
        case XMLEvent.END_DOCUMENT :
          LOGGER.debug( "end document" ) ;
          break ;
        default :
          break ;
      }
    } while( type != XMLEvent.END_DOCUMENT ) ;

    xmlStreamReader.close() ;
  }

This is what I get:

15:23:41.388 DEBUG [main] c.o.s.m.c.x.SingleContactXmlParserTest - start document
15:23:41.394 DEBUG [main] c.o.s.m.c.x.SingleContactXmlParserTest - start element: d
15:23:41.394 DEBUG [main] c.o.s.m.c.x.SingleContactXmlParserTest - end element: d

com.fasterxml.aalto.WFCException: Unexpected character 'd' (code 100) in epilog (unbalanced start/end tags?)
 at [row,col {unknown-source}]: [1,7]

	at com.fasterxml.aalto.in.XmlScanner.reportInputProblem(XmlScanner.java:1333)
	at com.fasterxml.aalto.in.XmlScanner.throwUnexpectedChar(XmlScanner.java:1498)
	at com.fasterxml.aalto.in.XmlScanner.reportPrologUnexpChar(XmlScanner.java:1358)
	at com.fasterxml.aalto.async.AsyncByteArrayScanner.nextFromProlog(AsyncByteArrayScanner.java:1068)
	at com.fasterxml.aalto.stax.StreamReaderImpl.next(StreamReaderImpl.java:802)
	at com.otcdlink.server.model.contact.xml.SingleContactXmlParserTest.asynchronousParseSmallestDocument(SingleContactXmlParserTest.java:102)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)

I'm using aalto-xml 1.0.

support for capturing EVENT_INCOMPLETE with XMLEventReader

I have just been trying to use Aalto's parser with Jena's RDF parsers.
I asked the Jena group what was the best way to do that and we thought that
adapting the StaX2SAX would probably do it.

http://svn.apache.org/viewvc/incubator/jena/Jena2/jena/trunk/src/main/java/com/hp/hpl/jena/rdf/arp/StAX2SAX.java?revision=1198759&view=markup

So I worked today on this and ended up with this gist

https://gist.github.com/1713430

It seems to be close to working.

but we get an exception on line 82

XMLEvent e = eventReader.nextEvent();

javax.xml.stream.XMLStreamException: Unrecognized event type 257.
at org.codehaus.stax2.ri.evt.Stax2EventAllocatorImpl.allocate(Stax2EventAllocatorImpl.java:85)
at org.codehaus.stax2.ri.Stax2EventReaderImpl.createStartDocumentEvent(Stax2EventReaderImpl.java:441)
at org.codehaus.stax2.ri.Stax2EventReaderImpl.nextEvent(Stax2EventReaderImpl.java:245)
at patch.AsyncJenaParser.parse(AsyncJenaParser.java:82)

This is, I suppose, because I am transforming the AsyncXMLStreamReader into an
XMLEventReader, as that is closest to the way the Jena code was written. You can
see this in the constructor:

public AsyncJenaParser(ContentHandler handler, AsyncXMLStreamReader streamReader) throws XMLStreamException {
    this.handler = handler;
    this.lhandler = (handler instanceof LexicalHandler) ?
            (LexicalHandler) handler :
            NO_LEXICAL_HANDLER ;
    handler.setDocumentLocator(new LocatorConv(streamReader));
    final XMLInputFactory xf = InputFactoryImpl.newInstance();
    this.streamReader = streamReader;
    this.eventReader = xf.createXMLEventReader(streamReader);
}

So this is not SAX parsing. It is just using an event reader instead of the stream reader. The class is used currently by the following Scala code which extends the com.ning.http.AsyncHandler

https://gist.github.com/1713663

Complete non-blocking (async) parser implementation

Now that the API has been defined and most of the parsing functionality is done (character data, elements, PI, comments, XML declaration), the rest of the functionality needs to be completed. What remains is:

  • DOCTYPE declaration (only internal subset, same as blocking impl)
  • Coalescing mode (to degree feasible)
  • Parsing (but not expansion) of general entities

Continue parsing even with illegal characters

I know this is not correct, but I am trying to parse some sources that have illegal characters in the XML feed.
Would it be possible to let the parser continue parsing without throwing an exception and emit a warning instead?

com.fasterxml.aalto.WFCException: Illegal XML character ((CTRL-CHAR, code 11))
 at [row,col {unknown-source}]: [29273,485]
    at com.fasterxml.aalto.in.XmlScanner.reportInputProblem(XmlScanner.java:1335) ~[aalto-xml-0.9.8.jar:na]
    at com.fasterxml.aalto.in.XmlScanner.throwInvalidXmlChar(XmlScanner.java:1525) ~[aalto-xml-0.9.8.jar:na]
    at com.fasterxml.aalto.async.AsyncUtfScanner.skipCharacters(AsyncUtfScanner.java:755) ~[aalto-xml-0.9.8.jar:na]
    at com.fasterxml.aalto.async.AsyncByteScanner.nextFromTree(AsyncByteScanner.java:572) ~[aalto-xml-0.9.8.jar:na]
    at com.fasterxml.aalto.stax.StreamReaderImpl.next(StreamReaderImpl.java:720) ~[aalto-xml-0.9.8.jar:na]
    ...

and

com.fasterxml.aalto.WFCException: Illegal XML character ((CTRL-CHAR, code 22))
 at [row,col {unknown-source}]: [8982,3]
    at com.fasterxml.aalto.in.XmlScanner.reportInputProblem(XmlScanner.java:1335)
    at com.fasterxml.aalto.in.XmlScanner.throwInvalidXmlChar(XmlScanner.java:1525)
    at com.fasterxml.aalto.async.AsyncUtfScanner.skipCharacters(AsyncUtfScanner.java:755)
    at com.fasterxml.aalto.async.AsyncByteScanner.nextFromTree(AsyncByteScanner.java:572)
    at com.fasterxml.aalto.stax.StreamReaderImpl.next(StreamReaderImpl.java:720)
    ...
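
Independent of any parser option, the input can be cleaned before parsing. The sketch below applies the XML 1.0 Char production (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]) to already decoded text and substitutes a replacement for anything else, such as the control characters 11 and 22 reported above. The class name is hypothetical and the code is not part of Aalto.

```java
/** Hypothetical pre-filter for text that is about to be parsed as XML 1.0. */
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every disallowed code point (including unpaired surrogates). */
    static String replaceIllegal(String text, char replacement) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isXml10Char(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(replacement);
            }
        }
        return sb.toString();
    }
}
```

This changes the content that reaches the parser, so it is only appropriate when losing the offending characters is acceptable.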

getElementText will throw com.fasterxml.aalto.WFCException when it encounters an EVENT_INCOMPLETE token

Hi,

Great job on the async xml parser. I'm using it in conjunction with another Ning library, the async-http-client, and so far it's been working great.

I have a question, though: how should I deal with methods that span multiple events, e.g. reader.getElementText()? For example, the fragment I want to process is "hello worl". If the cursor is currently on the start element () when I call getElementText, it throws a WFCException because the next token is the EVENT_INCOMPLETE token, which it cannot handle. Thanks for your help.

Initialize System Properties and Unexpected end in prolog

Hello,

I experimented with the example to set up the speed configuration for the XML parser and ran into the following problems:

First, it did not find the InputFactoryImpl, so I had to add it explicitly to the system properties:
System.setProperty("com.fasterxml.aalto.stax.InputFactoryImpl","com.fasterxml.aalto.stax.InputFactoryImpl");

Secondly, if I try to run it on an ordinary XML file, it gives me:

Exception in thread "main" com.fasterxml.aalto.WFCException: Unexpected End-of-input in prolog
 at [row,col {unknown-source}]: [1,1]
    at com.fasterxml.aalto.stax.StreamReaderImpl.throwWfe(StreamReaderImpl.java:1775)
    at com.fasterxml.aalto.stax.StreamReaderImpl.throwUnexpectedEOI(StreamReaderImpl.java:1794)
    at com.fasterxml.aalto.stax.StreamReaderImpl.handlePrologEoi(StreamReaderImpl.java:1754)
    at com.fasterxml.aalto.stax.StreamReaderImpl.next(StreamReaderImpl.java:810)
    at com.fasterxml.aalto.speedtest.SpeedTest.execute(SpeedTest.java:27)
    at com.fasterxml.aalto.speedtest.SpeedTest.main(SpeedTest.java:47)

I followed the basic STAX setup e.g. as described here:
http://www.studytrails.com/java/xml/aalto/java-xml-aalto-stax-parsing/

Any suggestions to resolve this?

Kind regards
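
A note on the system property used above: the standard StAX lookup reads the key javax.xml.stream.XMLInputFactory, not a key named after the implementation class. The sketch below pins an implementation through that key for a single lookup and reports which class was resolved. The helper class is hypothetical; with Aalto on the classpath the argument would presumably be com.fasterxml.aalto.stax.InputFactoryImpl.

```java
import javax.xml.stream.XMLInputFactory;

/** Hypothetical helper that pins the StAX input factory via the standard key. */
final class StaxFactoryPin {
    static final String INPUT_FACTORY_KEY = "javax.xml.stream.XMLInputFactory";

    private StaxFactoryPin() {}

    /** Sets the property, resolves a factory, restores the old value, returns the class name. */
    static String resolveWith(String implementationClassName) {
        String previous = System.getProperty(INPUT_FACTORY_KEY);
        System.setProperty(INPUT_FACTORY_KEY, implementationClassName);
        try {
            return XMLInputFactory.newFactory().getClass().getName();
        } finally {
            if (previous == null) {
                System.clearProperty(INPUT_FACTORY_KEY);
            } else {
                System.setProperty(INPUT_FACTORY_KEY, previous);
            }
        }
    }
}
```

Instantiating the implementation directly (new InputFactoryImpl()) avoids the lookup altogether and is an alternative when the implementation is known at compile time.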

Aalto writes too many xmlns attributes if IS_REPAIRING_NAMESPACES=true

Consider the following setup.

  • Instantiate an XMLEventReader reading from an XML file whose root element defines an "xmlns" attribute and contains unprefixed tags. (As an example, you can fetch the W3C's xmldsig schema.)
  • Instantiate an XMLEventWriter writing on another file, with IS_REPAIRING_NAMESPACES=true.
  • For each event read from the Reader, write it through directly to the Writer.

If aalto-xml 1.0.0 is used as the StAX backend, the xmlns attribute of the resulting document's root schema tag is written 3 times, as follows:
<schema xmlns="http://www.w3.org/2001/XMLSchema" xmlns="http://www.w3.org/2001/XMLSchema" xmlns="http://www.w3.org/2001/XMLSchema" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" targetNamespace="http://www.w3.org/2000/09/xmldsig#" version="0.1" elementFormDefault="qualified">

Note that if I simply switch the implementation to woodstox-core 5.0.3, everything works fine, so this is definitely an issue with aalto-xml.

Also note that definitions of prefixed namespaces (like xmlns:xx) are unaffected by the issue : only the "xmlns" attribute is affected.
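
The three setup steps above can be sketched with the standard StAX API alone, so that the same code runs against whichever implementation the factory lookup resolves. The class and method names are hypothetical; the counting helper simply looks for default-namespace declarations in a start tag.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;

/** Hypothetical reproduction harness for the pass-through copy described above. */
final class RepairingCopy {
    private static final Pattern DEFAULT_NS = Pattern.compile("\\sxmlns\\s*=");

    private RepairingCopy() {}

    /** Reads every event from the input and writes it unchanged to a repairing writer. */
    static String copy(String xml) throws XMLStreamException {
        XMLOutputFactory outputFactory = XMLOutputFactory.newFactory();
        outputFactory.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, Boolean.TRUE);
        XMLEventReader reader =
                XMLInputFactory.newFactory().createXMLEventReader(new StringReader(xml));
        StringWriter sink = new StringWriter();
        XMLEventWriter writer = outputFactory.createXMLEventWriter(sink);
        while (reader.hasNext()) {
            writer.add(reader.nextEvent());
        }
        writer.close();
        reader.close();
        return sink.toString();
    }

    /** Counts xmlns="..." declarations (prefixed xmlns:foo declarations are not counted). */
    static int countDefaultNsDeclarations(String startTag) {
        Matcher m = DEFAULT_NS.matcher(startTag);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

A well-formed result has exactly one default-namespace declaration on the root element; the output quoted above has three.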

Issue parsing XML when com.fasterxml.aalto-xml is in the Maven dependency

I'm having trouble with the library com.fasterxml.aalto-xml.

I have a project A (DSpace) that does not depend on the library com.fasterxml.aalto-xml.

I developed a library B that uses a library C, which depends on com.fasterxml.aalto-xml.

I wonder why, when com.fasterxml.aalto-xml is added as a dependency of A by way of B -> C, A picks up com.fasterxml.aalto-xml for its usual XML parsing.
The problem is that this breaks the application: com.fasterxml.aalto-xml has trouble properly parsing the XML that originally ships with A. Something is going wrong.

I would like to understand why this happens. Is there a way to tell A not to use com.fasterxml.aalto-xml while still having B -> C -> com.fasterxml.aalto-xml at work within project A?

What is particular about these XML parsers? I just don't understand why A would pick up com.fasterxml.aalto-xml when it has no dependency on it.

I can see that the library is also an OSGi bundle jar; does that have any implication here?

Originally, A (DSpace) worked without Aalto and used whatever parser it had. Why does it pick Aalto now that it is among the dependencies? What mechanism allows that?
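
The mechanism in question is most likely the standard factory lookup: calls such as XMLInputFactory.newInstance() consult a system property, then a configuration file, then service registrations under META-INF/services on the classpath, and only then fall back to the JDK default. A jar that registers itself as a provider is therefore picked up by any code that uses the generic lookup, whether or not that code depends on it directly. The hypothetical helper below lists the providers visible through the service mechanism, which may help confirm this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;
import javax.xml.stream.XMLInputFactory;

/** Hypothetical diagnostic; lists StAX input factories registered as services. */
final class StaxProviders {
    private StaxProviders() {}

    /** Class names of XMLInputFactory providers found via the service mechanism. */
    static List<String> registeredInputFactories() {
        List<String> names = new ArrayList<>();
        for (XMLInputFactory factory : ServiceLoader.load(XMLInputFactory.class)) {
            names.add(factory.getClass().getName());
        }
        return names;
    }
}
```

If an unwanted implementation shows up in that list, setting the javax.xml.stream.XMLInputFactory system property (and its output and event counterparts) to the desired class is one way to override the discovery while leaving the jar on the classpath for the code that instantiates it directly.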

ByteXmlWriter outputs extra characters on buffer boundaries

Specifically, the method writeSplitCharacters outputs an additional byte after every output2ByteChar(ch) call. There is a one-line fix for this. The code fragment

case CT_MULTIBYTE_2: // 3, 4 and N can never occur
    // To off-line or not?
    output2ByteChar(ch);
    break;

in writeSplitCharacters should be changed to

case CT_MULTIBYTE_2: // 3, 4 and N can never occur
    // To off-line or not?
    output2ByteChar(ch);
    continue main_loop;

Tell me if a pull request is needed for this.
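
To show why the two statements behave differently, here is a simplified model of the loop shape: inside a switch, break only leaves the switch, so any code after it in the loop body still runs, whereas a labeled continue skips to the next character. This is an assumption about the structure of the surrounding code, not the actual ByteXmlWriter source, and it handles only characters below U+0800.

```java
import java.io.ByteArrayOutputStream;

/** Simplified model of the loop described above; not the actual ByteXmlWriter code. */
final class SplitWriterModel {
    private SplitWriterModel() {}

    /** Encodes chars below U+0800 as UTF-8; applyFix selects "continue main_loop" over "break". */
    static byte[] write(String text, boolean applyFix) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        main_loop:
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch >= 0x800) {
                throw new IllegalArgumentException("model handles 1- and 2-byte chars only");
            }
            switch (ch < 0x80 ? 1 : 2) {
            case 2:
                out.write(0xC0 | (ch >> 6));
                out.write(0x80 | (ch & 0x3F));
                if (applyFix) {
                    continue main_loop; // character fully written, skip the code below
                }
                break; // only leaves the switch, so the write below still runs
            default:
                break;
            }
            out.write(ch); // single-byte path; an extra byte when reached after case 2
        }
        return out.toByteArray();
    }
}
```

With the fix, "é" is written as the two bytes C3 A9; without it, a third byte E9 follows.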

ByteXmlWriter bug

Whilst running a test in my environment I encountered a bug in aalto-xml-0.9.11.jar.
I debugged the code and think that I discovered what the problem was.
It occurred in ByteXmlWriter.writeCharacters() called from longWriteCharacters().
I was attempting to add 4K of XML onto an existing XML document (in my case it was a soapenv body that was being built up). _outputPtr was 407 at the start of the 4K output. The first part of the 4K text was 325 characters before a newline character was encountered. I think the fault is that at this point in the code (_config.willEscapeCR()) _outputPtr is not set to ptr. So in my case ptr was 692 but was reset to 407, which left me with garbled output.

I tried a patched aalto-xml jar with a modified ByteXmlWriter.writeCharacters():

if (_config.willEscapeCR()) {
    _outputPtr = ptr;
    break;
}

This appeared to fix my issue.
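
The pattern behind this kind of bug can be shown with a simplified model: a method copies the output pointer field into a local variable for speed, and every early exit has to write the local value back, otherwise the bytes written so far are overwritten by the next call. The class below is a hypothetical illustration with a fixed 256-byte buffer and ASCII input only, not the ByteXmlWriter code.

```java
import java.nio.charset.StandardCharsets;

/** Simplified model of a writer that works on a local copy of its output pointer. */
final class PointerSyncModel {
    private final byte[] buffer = new byte[256];
    private int outputPtr;

    /** Copies ASCII text up to the first CR; returns the number of chars consumed. */
    int writeUntilCr(String text, boolean syncBeforeExit) {
        int ptr = outputPtr; // local copy of the field
        int i = 0;
        while (i < text.length() && ptr < buffer.length) {
            char c = text.charAt(i);
            if (c == '\r') {
                if (syncBeforeExit) {
                    outputPtr = ptr; // publish progress before leaving early
                }
                return i; // the CR itself would be escaped elsewhere
            }
            buffer[ptr++] = (byte) c;
            i++;
        }
        outputPtr = ptr;
        return i;
    }

    /** The bytes that count as written according to the field. */
    String written() {
        return new String(buffer, 0, outputPtr, StandardCharsets.US_ASCII);
    }
}
```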

Clarify licensing

The Aalto wiki states that Aalto is distributed with the Apache 2.0 license, but the only license attached to the actual source is GPL 3.0.

We'd love to be able to use Aalto, but sadly if it's GPL, we can't. Can we shed some light on what the actual license is?

Validation implementation of `XMLStreamWriter` incomplete

Validation is not working with an XMLStreamWriter. A quick comparison with Woodstox makes it obvious that several calls to XMLValidator are missing in StreamWriterBase (at least in writeAttribute(...), writeStartElement(...) and _closeStartElement(...)) and its subclasses (several locations).

Furthermore, the invocation of XMLValidator.validateElementEnd(...) in StreamWriterBase._validator.validateElementEnd() mixes up the prefix and nsUri arguments.

AsyncByteScanner.validPublicIdChar() incorrectly rejects digits

The javadoc for AsyncByteScanner.validPublicIdChar() references PubidLiteral in the XML 1.0 specification:

PubidLiteral ::= '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"
PubidChar    ::= #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]

Note that this includes [0-9]. However, the implementation of the method does not:

    protected boolean validPublicIdChar(int c) {
        return
            c == 0xA ||                     //<LF>
            c == 0xD ||                     //<CR>
            c == 0x20 ||                    //<SPACE>
            (c >= '@' && c <= 'Z') ||       //@[A-Z]
            (c >= 'a' && c <= 'z') ||
            c == '!' ||
            (c >= 0x23 && c <= 0x25) ||     //#$%
            (c >= 0x27 && c <= 0x2F) ||     //'()*+,-./
            (c >= ':' && c <= ';') ||
            c == '=' ||
            c == '?' ||
            c == '_';
    }

Note also that com.fasterxml.aalto.util.XmlCharTypes.PUBID_CHARS correctly includes these digits.
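
For comparison, a predicate that follows the PubidChar production literally, including [0-9], could look like the sketch below. The class is hypothetical and standalone; it is not a patch against AsyncByteScanner.

```java
/** Illustrative check against the XML 1.0 PubidChar production. */
final class PubidChars {
    // The punctuation listed in the production: [-'()+,./:=?;!*#@$_%]
    private static final String PUNCTUATION = "-'()+,./:=?;!*#@$_%";

    private PubidChars() {}

    /** True if c is a PubidChar: space, CR, LF, ASCII letter, ASCII digit, or listed punctuation. */
    static boolean isPubidChar(int c) {
        return c == 0x20 || c == 0xD || c == 0xA
                || (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || (c < 0x80 && PUNCTUATION.indexOf(c) >= 0);
    }

    /** True if every character of the public identifier is a PubidChar. */
    static boolean isValidPublicId(String publicId) {
        return publicId.chars().allMatch(PubidChars::isPubidChar);
    }
}
```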

Steps to reproduce:

  1. Find or create a document matching the Encoded Archival Description version 3 schema and containing the following <!DOCTYPE> declaration (e.g., add it to this file):
<!DOCTYPE ead PUBLIC "+// http://ead3.archivists.org/schema/ //DTD ead3 (Encoded Archival Description (EAD) Version 3)//EN" "ead3.dtd">
  2. Attempt to parse it with an AsyncXMLStreamReader.

Expected:

  • File parses, or at any rate gets past the <!DOCTYPE> declaration.

Actual:

  • parsing fails with a WFCException:
Error parsing XML stream
com.fasterxml.aalto.WFCException: Unexpected character '3' (code 51) in prolog (not valid in PUBLIC ID)
 at [row,col {unknown-source}]: [1,77]
	at com.fasterxml.aalto.in.XmlScanner.reportInputProblem(XmlScanner.java:1333)
	at com.fasterxml.aalto.in.XmlScanner.throwUnexpectedChar(XmlScanner.java:1498)
	at com.fasterxml.aalto.in.XmlScanner.reportPrologUnexpChar(XmlScanner.java:1358)
	at com.fasterxml.aalto.async.AsyncByteBufferScanner.parseDtdId(AsyncByteBufferScanner.java:1946)
	at com.fasterxml.aalto.async.AsyncByteBufferScanner.handleDTD(AsyncByteBufferScanner.java:1833)
	at com.fasterxml.aalto.async.AsyncByteBufferScanner.handlePrologDeclStart(AsyncByteBufferScanner.java:1264)
	at com.fasterxml.aalto.async.AsyncByteBufferScanner.nextFromProlog(AsyncByteBufferScanner.java:1067)
	at com.fasterxml.aalto.stax.StreamReaderImpl.next(StreamReaderImpl.java:790)
	at org.codehaus.stax2.ri.Stax2EventReaderImpl.nextEvent(Stax2EventReaderImpl.java:255)

Source files without license headers

Hi,

The following source files are without license headers:

./src/main/java/com/fasterxml/aalto/AsyncByteArrayFeeder.java
./src/main/java/com/fasterxml/aalto/AsyncByteBufferFeeder.java
./src/main/java/com/fasterxml/aalto/AsyncInputFeeder.java
./src/main/java/com/fasterxml/aalto/AsyncXMLInputFactory.java
./src/main/java/com/fasterxml/aalto/AsyncXMLStreamReader.java
./src/main/java/com/fasterxml/aalto/package-info.java
./src/main/java/com/fasterxml/aalto/ValidationException.java

./src/main/java/com/fasterxml/aalto/async/AsyncByteScanner.java
./src/main/java/com/fasterxml/aalto/async/AsyncStreamReaderImpl.java
./src/main/java/com/fasterxml/aalto/async/package-info.java

./src/main/java/com/fasterxml/aalto/dom/BijectiveNsMap.java
./src/main/java/com/fasterxml/aalto/dom/DOMOutputElement.java
./src/main/java/com/fasterxml/aalto/dom/DOMReaderImpl.java
./src/main/java/com/fasterxml/aalto/dom/DOMWriterImpl.java
./src/main/java/com/fasterxml/aalto/dom/OutputElementBase.java

./src/main/java/com/fasterxml/aalto/evt/IncompleteEvent.java

./src/main/java/com/fasterxml/aalto/impl/CommonConfig.java
./src/main/java/com/fasterxml/aalto/impl/ErrorConsts.java
./src/main/java/com/fasterxml/aalto/impl/IoStreamException.java
./src/main/java/com/fasterxml/aalto/impl/LocationImpl.java

./src/main/java/com/fasterxml/aalto/in/ElementScope.java
./src/main/java/com/fasterxml/aalto/in/FixedNsContext.java
./src/main/java/com/fasterxml/aalto/in/MergedStream.java
./src/main/java/com/fasterxml/aalto/in/PName3.java
./src/main/java/com/fasterxml/aalto/in/ReaderConfig.java
./src/main/java/com/fasterxml/aalto/in/ReaderScanner.java
./src/main/java/com/fasterxml/aalto/in/XmlScanner.java

./src/main/java/com/fasterxml/aalto/io/UTF8Writer.java

./src/main/java/com/fasterxml/aalto/sax/SAXFeature.java
./src/main/java/com/fasterxml/aalto/sax/SAXProperty.java

./src/main/java/com/fasterxml/aalto/out/WriterConfig.java

./src/main/java/com/fasterxml/aalto/util/CharsetNames.java
./src/main/java/com/fasterxml/aalto/util/DataUtil.java
./src/main/java/com/fasterxml/aalto/util/EmptyIterator.java
./src/main/java/com/fasterxml/aalto/util/IllegalCharHandler.java
./src/main/java/com/fasterxml/aalto/util/SingletonIterator.java
./src/main/java/com/fasterxml/aalto/util/TextAccumulator.java
./src/main/java/com/fasterxml/aalto/util/TextBuilder.java
./src/main/java/com/fasterxml/aalto/util/TextUtil.java
./src/main/java/com/fasterxml/aalto/util/UriCanonicalizer.java
./src/main/java/com/fasterxml/aalto/util/XmlConsts.java

./src/main/java/test/BasePerfTest.java
./src/main/java/test/RunStreamWriter.java
./src/main/java/test/TestAsyncReader.java
./src/main/java/test/TestBase64Reader.java
./src/main/java/test/TestEventReader.java
./src/main/java/test/TestLineReader.java
./src/main/java/test/TestNameHashing.java
./src/main/java/test/TestNameTable.java
./src/main/java/test/TestPNamePerf.java
./src/main/java/test/TestRawStream.java
./src/main/java/test/TestSaxReader.java
./src/main/java/test/TestScannerPerf.java
./src/main/java/test/TestStreamCopier.java
./src/main/java/test/TestStreamReader.java
./src/main/java/test/TestTypedSpeed.java
./src/main/java/test/TestUTF8.java

./src/test/java/async/AsyncReaderWrapperForByteArray.java
./src/test/java/async/AsyncReaderWrapperForByteBuffer.java
./src/test/java/async/AsyncTestBase.java
./src/test/java/async/TestAsyncViaEventReader.java
./src/test/java/async/TestCDataParsing.java
./src/test/java/async/TestCharactersParsing.java
./src/test/java/async/TestCommentParsing.java
./src/test/java/async/TestDoctypeParsing.java
./src/test/java/async/TestElementParsing.java
./src/test/java/async/TestEntityParsing.java
./src/test/java/async/TestPIParsing.java
./src/test/java/async/TestSurrogates.java
./src/test/java/async/TestXmlDeclaration.java

./src/test/java/base/BaseTestCase.java

./src/test/java/evt/TestLaziness.java

./src/test/java/sax/TestEntityResolver.java
./src/test/java/sax/TestSaxReader.java

./src/test/java/stream/TestDTDSkimming.java
./src/test/java/stream/TestNameDecoding.java
./src/test/java/stream/TestSimple.java
./src/test/java/stream/TestSurrogates.java

./src/test/java/util/TestPNameTable.java
./src/test/java/util/TestTextAccumulator.java

./src/test/java/wstream/TestIndentation.java
./src/test/java/wstream/TestLongerContent.java
./src/test/java/wstream/TestNameEncoding.java

Please confirm the licensing of the code and/or content, and add license headers.

https://fedoraproject.org/wiki/Packaging:LicensingGuidelines?rd=Packaging/LicensingGuidelines#License_Clarification

Thanks in advance
Regards

Allow for ignoring encoding or setting default encoding

I have a file that reports in its prolog definition that it's UTF-16, when it is in fact UTF-8. I've had to copy the parser class and do some ugly hacks to override the default behaviour (which is to fail when a weird encoding is encountered):

override def verifyAndSetXmlEncoding(): Unit = {
  val enc = CharsetNames.normalize(_textBuilder.contentsAsString)
  _config.setXmlEncoding(enc)
  /* 09-Feb-2011, tatu: For now, we will only accept UTF-8 and ASCII; could
   * expand in future (Latin-1 should be doable) */
  if ((CharsetNames.CS_UTF8 != enc) && (CharsetNames.CS_US_ASCII != enc)) {
    _config.setXmlEncoding("UTF-8")
  }
}

It should be possible to override this behaviour and just set the encoding manually. In this case I have no control over the file, so changing the file is not an option.
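
For the blocking API there is a generic way to force the encoding that may serve as a stopgap: decode the bytes yourself and hand the parser a Reader, since a parser reading characters has no bytes left to decode with the declared encoding. The sketch below uses only the standard StAX API and a hypothetical helper class; it does not apply to the async parser, which is fed bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper that parses bytes with a caller-chosen charset. */
final class ForcedEncoding {
    private ForcedEncoding() {}

    /** Returns the text of the first element, decoding the bytes with actualCharset. */
    static String firstElementText(byte[] document, Charset actualCharset) {
        Reader chars = new InputStreamReader(new ByteArrayInputStream(document), actualCharset);
        try {
            XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(chars);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getElementText();
                    }
                }
                return null; // no element found
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```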

Otherwise, great library :)

~Karl

'com.fasterxml.aalto.impl.StreamExceptionBase: Can not output XML declaration, after other output has already been done.' when trying to transform XML using XSLT

Hello,

I wanted to test the performance of Aalto, so I wrote this simple test program, in which I indent some XML using an XSLT stylesheet and use a StAXSource and a StAXResult as the input and output of the Transformer, respectively.

Here is my test code:

public static void main(String[] args) throws Exception {

    SAXTransformerFactory xformerFactory = (SAXTransformerFactory) TransformerFactory
            .newInstance();

    InputStream is = null;

    is = XMLTest.class.getResourceAsStream("/xml/indenter.xsl");
    Templates indenter = xformerFactory.newTemplates(new StreamSource(is));
    is.close();

    System.out.println(indenter);

    File xml = new File(args[0]);
    File outXml = new File(args[0] + ".out");

    // XMLInputFactory xif = XMLInputFactory.newFactory(
    // "com.sun.xml.internal.stream.XMLInputFactoryImpl", XMLTest.class.getClassLoader());
    XMLInputFactory xif = XMLInputFactory.newFactory(
            "com.fasterxml.aalto.stax.InputFactoryImpl", XMLTest.class.getClassLoader());

    xif.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
    xif.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
    System.out.println(xif);
    // XMLOutputFactory xof = XMLOutputFactory.newFactory(
    // "com.sun.xml.internal.stream.XMLOutputFactoryImpl", XMLTest.class.getClassLoader());
    XMLOutputFactory xof = XMLOutputFactory.newFactory(
            "com.fasterxml.aalto.stax.OutputFactoryImpl", XMLTest.class.getClassLoader());
    System.out.println(xof);

    Source in = new StAXSource(xif.createXMLStreamReader(new FileInputStream(xml)));
    Result out = new StAXResult(
            xof.createXMLStreamWriter(new FileOutputStream(outXml), "UTF-8"));

    indenter.newTransformer().transform(in, out);
    // xformerFactory.newTransformer().transform(in, out);

}

And I get the exception visible in the title. Any help would be appreciated.

Nicolas
