fasterxml / aalto-xml
Ultra-high performance non-blocking XML processor (Stax API + extensions)
License: Apache License 2.0
When creating a parser for UTF-16 byte streams, the XMLInputFactory returns a character stream based parser using java.io.InputStreamReader to bridge the streams.
The Javadoc for com.fasterxml.aalto.in.ReaderScanner states: "In general using this scanner is quite a bit less optimal than that of java.io.InputStream based scanner".
A further side effect of not having a dedicated byte-based parser is that javax.xml.stream.Location#getCharacterOffset() returns the character offset and not the expected byte offset.
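The gap between the two offsets can be seen without any parser. The following is a minimal sketch using only the JDK; the class and method names are hypothetical, and it does not reflect how aalto computes locations. It measures how many bytes the same character prefix occupies in UTF-16 and in UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class OffsetDemo {
    /**
     * Number of bytes needed to encode the first charOffset chars of text.
     * Assumes charOffset does not split a surrogate pair.
     */
    static int byteOffset(String text, int charOffset, Charset charset) {
        return text.substring(0, charOffset).getBytes(charset).length;
    }

    public static void main(String[] args) {
        String xml = "<a>\u00e9</a>";            // <a>é</a>
        int charOffset = xml.indexOf("</a>");     // character offset of the end tag: 4
        // UTF_16BE is used so that no byte order mark is counted.
        System.out.println(byteOffset(xml, charOffset, StandardCharsets.UTF_16BE)); // 8
        System.out.println(byteOffset(xml, charOffset, StandardCharsets.UTF_8));    // 5
    }
}
```

For UTF-16 input without surrogate pairs the byte offset is simply twice the character offset, which is why a reader-based parser cannot report it without extra bookkeeping.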
Consider the following XML fragment: <problem>Left ≥ Right</problem>.
When the encoding is set to UTF-8 and the fragment is written with the XMLStreamWriter2.writeRaw method, the XML output contains an extra e after the "greater than or equal to" sign: <problem>Left ≥e Right</problem>.
Here is a unit test that demonstrates the problem. Note that I am using JDK 1.8.0_74 on Windows 10 and version 1.0.0 of aalto ('com.fasterxml:aalto-xml:1.0.0').
@Test
public void testSerialization_failsWithUtf8() throws Exception {
final String input = "<problem>Left ≥ Right</problem>";
final XMLOutputFactory2 outputFactory = new OutputFactoryImpl();
final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
final XMLStreamWriter2 xmlStreamWriter =
(XMLStreamWriter2) outputFactory.createXMLStreamWriter(byteArrayOutputStream, "utf-8");
xmlStreamWriter.writeStartElement("example");
xmlStreamWriter.writeRaw(input);
xmlStreamWriter.writeEndElement();
xmlStreamWriter.flush();
final String result = byteArrayOutputStream.toString("utf-8");
System.out.print(result);
assertThat(result).isEqualToIgnoringCase("<example><problem>Left ≥e Right</problem></example>");
// Why is there an e added after the ≥ character, a valid UTF-8 character (see
// http://www.fileformat.info/info/unicode/char/2265/index.htm)?
byteArrayOutputStream.close();
xmlStreamWriter.closeCompletely();
}
I wrote 2 additional unit tests to show workarounds I found. The first workaround is to change the encoding to UTF-16.
@Test
public void testSerialization_worksWithUtf16() throws Exception {
final String input = "<problem>Left ≥ Right</problem>";
final XMLOutputFactory2 outputFactory = new OutputFactoryImpl();
final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
final XMLStreamWriter2 xmlStreamWriter =
(XMLStreamWriter2) outputFactory.createXMLStreamWriter(byteArrayOutputStream, "utf-16");
xmlStreamWriter.writeStartElement("example");
xmlStreamWriter.writeRaw(input);
xmlStreamWriter.writeEndElement();
xmlStreamWriter.flush();
final String result = byteArrayOutputStream.toString("utf-16");
System.out.print(result);
assertThat(result).isEqualToIgnoringCase("<example><problem>Left ≥ Right</problem></example>");
byteArrayOutputStream.close();
xmlStreamWriter.closeCompletely();
}
My 3rd and final unit test shows that to keep the encoding as UTF-8, I can use a reader and copy the events instead of using the writeRaw string. It's safer, but it's slower too.
@Test
public void testSerialization_worksWithUtf8WithReader() throws Exception {
final String input = "<example><problem>Left ≥ Right</problem></example>";
final XMLInputFactory2 inputFactory = new InputFactoryImpl();
final XMLStreamReader2 xmlStreamReader = (XMLStreamReader2) inputFactory.createXMLStreamReader(
IOUtils.toInputStream(input, "utf-8"));
final XMLOutputFactory2 outputFactory = new OutputFactoryImpl();
final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
final XMLStreamWriter2 xmlStreamWriter =
(XMLStreamWriter2) outputFactory.createXMLStreamWriter(byteArrayOutputStream, "utf-8");
while (xmlStreamReader.hasNext()) {
xmlStreamReader.next();
xmlStreamWriter.copyEventFromReader(xmlStreamReader, true);
}
xmlStreamWriter.flush();
xmlStreamReader.closeCompletely();
final String result = byteArrayOutputStream.toString("utf-8");
System.out.print(result);
assertThat(result).isEqualToIgnoringCase("<example><problem>Left ≥ Right</problem></example>");
byteArrayOutputStream.close();
xmlStreamWriter.closeCompletely();
}
Hello,
I am new to the "world" of aalto. Unfortunately, I am not able to make it run as expected: it keeps throwing com.fasterxml.aalto.WFCException, even in the simplest scenario (e.g. the one in this old blog entry: http://www.cowtowncoder.com/blog/archives/2011/03/entry_451.html).
Here is my code:
byte[] XML = "<tag>Tove</tag>".getBytes();
AsyncXMLInputFactory xmlInputFactory = new InputFactoryImpl();
AsyncXMLStreamReader<AsyncByteArrayFeeder> asyncReader = xmlInputFactory.createAsyncFor(XML);
int inputPtr = 0;
int type;
do {
while ((type = asyncReader.next()) == AsyncXMLStreamReader.EVENT_INCOMPLETE) {
asyncReader.getInputFeeder().feedInput(XML, inputPtr++, 1);
if (inputPtr >= XML.length) {
asyncReader.getInputFeeder().endOfInput();
}
}
System.out.println("Got event of type: "+type);
} while (type != XMLEvent.END_DOCUMENT);
asyncReader.close();
It keeps throwing:
com.fasterxml.aalto.WFCException: Unexpected character 't' (code 116) in epilog (unbalanced start/end tags?)
Do you know how to feed input to the feeder so that no exception is thrown? What am I doing wrong?
I am using aalto-xml (com.fasterxml) version 1.0.0.
Thank you for any help.
Feature is good to have
I use it to parse a ByteArrayStream. Whenever I call that API, it always returns 0 as the offset.
I took a brief look at the code. This may be the problem.
//------
protected int com.fasterxml.aalto.in.ByteBasedScanner._inputPtr
com.fasterxml.aalto.in.ByteBasedScanner.getCurrentLocation() {
return LocationImpl.fromZeroBased(_config.getPublicId(), _config.getSystemId(),
_pastBytes + _inputPtr, _currRow, _inputPtr - _rowStartOffset);
}
// ------
protected int com.fasterxml.aalto.in.StreamScanner._inputPtr // This field is not necessary???
Hi all,
We are not able to escape some characters when using aalto-xml. It looks like the characters < and ' are not being escaped.
Method called: writer.writeCharacters("<>'&");
Input: <>'&
File result: <>'&
Our current implementation uses XMLStreamWriter and ByteXmlWriter. When debugging the code, it looks like the issue happens in the writeCharacters method.
Can you please assist with this issue?
Regards,
Alexandre
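For reference, XML 1.0 only requires < and & to be escaped in element text; > must be escaped when it follows ]], and ' needs escaping only inside an attribute value delimited by apostrophes. The following is a minimal sketch of that rule in plain Java. The class and method names are hypothetical, and this is not aalto's implementation.

```java
final class XmlTextEscaper {
    /** Escapes the characters that must not appear literally in element text. */
    static String escapeText(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<': sb.append("&lt;"); break;
                case '&': sb.append("&amp;"); break;
                // Strictly required only in "]]>", escaped everywhere here for simplicity.
                case '>': sb.append("&gt;"); break;
                // The apostrophe is legal in element text and is left untouched.
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeText("<>'&")); // &lt;&gt;'&amp;
    }
}
```

Under this rule an unescaped ' in element content is still well-formed, whereas an unescaped < is not.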
The problem here is that the character kappa starts being encoded as XML character entities, and unfortunately the resulting entity encoding is not valid. I don't understand why this is happening, or why it only starts after a certain number of characters rather than at any point.
@Test
public void tooManyKappas()
throws XMLStreamException
{
XMLOutputFactory factory = OutputFactoryImpl.newInstance();
if (factory instanceof OutputFactoryImpl) {
((OutputFactoryImpl) factory).configureForSpeed();
}
//loop to find exactly at which point entity encoding kicks in.
for (int j = 0; j < 1000; j++) {
final ByteArrayOutputStream baos = new ByteArrayOutputStream();
XMLStreamWriter writer = factory.createXMLStreamWriter(baos, StandardCharsets.UTF_8.name());
final String namespace = "http://example.org";
StringBuilder kappas = new StringBuilder();
for (int i = 0; i < (2000 + j); i++) {
kappas.append("𝜅");
}
writer.writeStartElement("", "ex", namespace);
writer.writeCharacters(kappas.toString());
writer.writeEndElement();
writer.close();
assertEquals("fails at " + (2000 + j),
"<ex>" + kappas + "</ex>",
new String(baos.toByteArray(), StandardCharsets.UTF_8));
}
}
I hope this minimized test case is of help. It's definitely due to something internal to aalto. WSTX does not have this issue (or only at a much higher loop number...).
The real problem is that the aalto-xml reader cannot read aalto's own output in this case. The reader is correct to reject it, as it is the writer that is wrong.
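One plausible explanation, not confirmed anywhere in this report, is that the kappa is stored as a surrogate pair and the pair ends up straddling an internal buffer boundary after a fixed number of characters. The sketch below uses only the JDK and assumes the character is U+1D705 (MATHEMATICAL ITALIC SMALL KAPPA); the class and method names are hypothetical. It shows why the two halves of the pair must be encoded together.

```java
import java.nio.charset.StandardCharsets;

final class KappaDemo {
    // Assumption: the character used in the test is U+1D705.
    static final String KAPPA = new String(Character.toChars(0x1D705));

    /** Upper-case hex dump of the UTF-8 encoding of s. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(KAPPA.length());                           // 2 chars (a surrogate pair)
        System.out.println(KAPPA.codePointCount(0, KAPPA.length()));  // 1 code point
        System.out.println(utf8Hex(KAPPA));                           // F09D9C85
        // A lone high surrogate has no UTF-8 form; String.getBytes substitutes '?'.
        System.out.println(utf8Hex(KAPPA.substring(0, 1)));           // 3F
    }
}
```

A writer that processes text in fixed-size char chunks therefore has to carry a trailing high surrogate over to the next chunk instead of encoding it on its own.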
See e.g.:
http://jcenter.bintray.com/com/fasterxml/aalto-xml/1.0.0/
http://jcenter.bintray.com/com/fasterxml/aalto-xml/1.1.0/
E.g. from aalto-xml-1.0.0-sources.jar:
Manifest-Version: 1.0
Archiver-Version: Plexus Archiver
Created-By: Apache Maven
Built-By: tatu
Build-Jdk: 1.7.0_79
Specification-Title: aalto-xml
Specification-Version: 1.0.0
Specification-Vendor: FasterXML
Implementation-Title: aalto-xml
Implementation-Version: 1.0.0
Implementation-Vendor-Id: com.fasterxml
Implementation-Vendor: FasterXML
Implementation-Build-Date: 2015-11-23 19:35:01-0800
X-Compile-Source-JDK: 1.6
X-Compile-Target-JDK: 1.6
Same seems to apply for the stax2-api bundles, e.g. at:
http://jcenter.bintray.com/org/codehaus/woodstox/stax2-api/
Including the source bundles in an Eclipse product becomes difficult due to the missing symbolic names. It would be great if the names could be added, as is done for the library bundles.
Hello,
First of all, I'd like to thank you for this great library. I'm using it to implement a FrameDecoder (JBoss Netty) in order to communicate with a hardware component that uses an XML-based protocol over TCP/IP. The library works fine but I'm experiencing an odd behaviour when XML being parsed contains long base64 content. My code looks similar to code included in TestBase64Reader.
AsyncXMLStreamReader parser = new InputFactoryImpl( ).createAsyncXMLStreamReader( );
// the following code is started after a START_ELEMENT event
private byte[] collectBinaryData ( TypedXMLStreamReader reader ) throws XMLStreamException, IOException
{
ByteArrayOutputStream baos = new ByteArrayOutputStream( );
byte[] buffer = new byte[ 4096 ];
int size;
while ( ( size = reader.readElementAsBinary( buffer, 0, buffer.length, Base64Variants.MIME ) ) != -1 )
{
baos.write( buffer, 0, size );
}
baos.close( );
return baos.size( ) == 0 ? null : baos.toByteArray( );
}
The code above works fine if the base64 content is short, for example:
U3RyZWFtaW5nIEFQSSBmb3IgWE1MIChTdEFYKSBpcyBhbiBhcHBsaWNhdGlvbiBwcm9ncmFtbWluZyBpbnRlcmZhY2UgKEFQSSkgdG8gcmVhZCBhbmQgd3JpdGUgWE1MIGRvY3VtZW50cywgb3JpZ2luYXRpbmcgZnJvbSB0aGUgSmF2YSBwcm9ncmFtbWluZyBsYW5ndWFnZSBjb21tdW5pdHku
However, if the base64 content is long, the code loops forever because readElementAsBinary always returns buffer.length. I'm not sure whether this behaviour is a bug or is caused by incorrect use of the library. The following XML chunk causes this behaviour:
U3RyZWFtaW5nIEFQSSBmb3IgWE1MIChTdEFYKSBpcyBhbiBhcHBsaWNhdGlvbiBwcm9ncmFtbWluZyBpbnRlcmZhY2UgKEFQSSkgdG8gcmVhZCBhbmQgd3JpdGUgWE1MIGRvY3VtZW50cywgb3JpZ2luYXRpbmcgZnJvbSB0aGUgSmF2YSBwcm9ncmFtbWluZyBsYW5ndWFnZSBjb21tdW5pdHkuDQoNClRyYWRpdGlvbmFsbHksIFhNTCBBUElzIGFyZSBlaXRoZXI6DQoNCiAgICBET00gYmFzZWQgLSB0aGUgZW50aXJlIGRvY3VtZW50IGlzIHJlYWQgaW50byBtZW1vcnkgYXMgYSB0cmVlIHN0cnVjdHVyZSBmb3IgcmFuZG9tIGFjY2VzcyBieSB0aGUgY2FsbGluZyBhcHBsaWNhdGlvbg0KICAgIGV2ZW50IGJhc2VkIC0gdGhlIGFwcGxpY2F0aW9uIHJlZ2lzdGVycyB0byByZWNlaXZlIGV2ZW50cyBhcyBlbnRpdGllcyBhcmUgZW5jb3VudGVyZWQgd2l0aGluIHRoZSBzb3VyY2UgZG9jdW1lbnQuDQoNCkJvdGggaGF2ZSBhZHZhbnRhZ2VzOyB0aGUgZm9ybWVyIChmb3IgZXhhbXBsZSwgRE9NKSBhbGxvd3MgZm9yIHJhbmRvbSBhY2Nlc3MgdG8gdGhlIGRvY3VtZW50LCB0aGUgbGF0dGVyIChlLmcuIFNBWCkgcmVxdWlyZXMgYSBzbWFsbCBtZW1vcnkgZm9vdHByaW50IGFuZCBpcyB0eXBpY2FsbHkgbXVjaCBmYXN0ZXIuDQoNClRoZXNlIHR3byBhY2Nlc3MgbWV0YXBob3JzIGNhbiBiZSB0aG91Z2h0IG9mIGFzIHBvbGFyIG9wcG9zaXRlcy4gQSB0cmVlIGJhc2VkIEFQSSBhbGxvd3MgdW5saW1pdGVkLCByYW5kb20gYWNjZXNzIGFuZCBtYW5pcHVsYXRpb24sIHdoaWxlIGFuIGV2ZW50IGJhc2VkIEFQSSBpcyBhICdvbmUgc2hvdCcgcGFzcyB0aHJvdWdoIHRoZSBzb3VyY2UgZG9jdW1lbnQuDQoNClN0QVggd2FzIGRlc2lnbmVkIGFzIGEgbWVkaWFuIGJldHdlZW4gdGhlc2UgdHdvIG9wcG9zaXRlcy4gSW4gdGhlIFN0QVggbWV0YXBob3IsIHRoZSBwcm9ncmFtbWF0aWMgZW50cnkgcG9pbnQgaXMgYSBjdXJzb3IgdGhhdCByZXByZXNlbnRzIGEgcG9pbnQgd2l0aGluIHRoZSBkb2N1bWVudC4gVGhlIGFwcGxpY2F0aW9uIG1vdmVzIHRoZSBjdXJzb3IgZm9yd2FyZCAtICdwdWxsaW5nJyB0aGUgaW5mb3JtYXRpb24gZnJvbSB0aGUgcGFyc2VyIGFzIGl0IG5lZWRzLiBUaGlzIGlzIGRpZmZlcmVudCBmcm9tIGFuIGV2ZW50IGJhc2VkIEFQSSAtIHN1Y2ggYXMgU0FYIC0gd2hpY2ggJ3B1c2hlcycgZGF0YSB0byB0aGUgYXBwbGljYXRpb24gLSByZXF1aXJpbmcgdGhlIGFwcGxpY2F0aW9uIHRvIG1haW50YWluIHN0YXRlIGJldHdlZW4gZXZlbnRzIGFzIG5lY2Vzc2FyeSB0byBrZWVwIHRyYWNrIG9mIGxvY2F0aW9uIHdpdGhpbiB0aGUgZG9jdW1lbnQu
Both XML chunks above use UTF-8 encoding.
I'm using aalto-xml v0.9.8 running on Oracle J2SE 1.6.
Best regards,
Sandy Pérez González
IT Consultant,
Indaba Consultores S.L.
http://www.indaba.es/
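As a point of comparison, the same read-until-minus-one loop can be run against the JDK's own MIME base64 decoder. This is only a sketch under that substitution: it uses java.util.Base64 rather than aalto's readElementAsBinary, the class and method names are hypothetical, and it neither reproduces nor fixes the reported behaviour. It shows the contract the loop relies on: every call returns the number of bytes produced, and -1 once the content is exhausted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

final class ChunkedBase64Demo {
    /** Decodes MIME base64 text through a fixed-size buffer, stopping when read returns -1. */
    static byte[] decodeInChunks(String base64, int bufferSize) {
        InputStream in = Base64.getMimeDecoder().wrap(
                new ByteArrayInputStream(base64.getBytes(StandardCharsets.US_ASCII)));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[bufferSize];
        try {
            int size;
            while ((size = in.read(buffer, 0, buffer.length)) != -1) {
                out.write(buffer, 0, size);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = new byte[9999];            // longer than two 4096-byte buffers
        for (int i = 0; i < original.length; i++) {
            original[i] = (byte) i;
        }
        String encoded = Base64.getMimeEncoder().encodeToString(original);
        System.out.println(Arrays.equals(original, decodeInChunks(encoded, 4096)));
    }
}
```

Decoding the long payload above this way gives a reference result to compare against whatever the parser returns.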
The non-blocking XML parser crashes when parsing the XML below. I fed it one byte at a time.
<?xml version="1.0" encoding="utf-8"?>
<feed>
<entry>
<updated label="Gräs">2012-04-03T14:05:27-07:00
</updated>
</entry>
</feed>
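The attribute value in this document contains ä, which takes two bytes in UTF-8 (C3 A4), so feeding one byte at a time hands the parser half of a character. Whether that is the cause of the crash is not stated here. The following sketch uses the JDK's CharsetDecoder, not aalto, with hypothetical class and method names, to show how an incremental decoder keeps the unfinished bytes until the rest arrives.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class OneByteAtATimeDemo {
    /** Decodes well-formed UTF-8 input that arrives one byte at a time. */
    static String decodeByteByByte(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer pending = ByteBuffer.allocate(8);   // holds an unfinished multi-byte sequence
        CharBuffer out = CharBuffer.allocate(input.length + 1);
        for (byte b : input) {
            pending.put(b);
            pending.flip();
            // false: more input may follow, so an incomplete sequence is not an error yet
            CoderResult result = decoder.decode(pending, out, false);
            if (result.isError()) {
                throw new IllegalArgumentException("malformed UTF-8 input");
            }
            pending.compact();                         // keep undecoded bytes for the next round
        }
        pending.flip();
        if (decoder.decode(pending, out, true).isError()) {
            throw new IllegalArgumentException("truncated UTF-8 input");
        }
        decoder.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "Gr\u00e4s".getBytes(StandardCharsets.UTF_8); // 5 bytes for 4 chars
        System.out.println(decodeByteByByte(bytes).equals("Gr\u00e4s")); // true
    }
}
```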
Hi,
is it planned to support some more encodings like ISO-8859-1 etc.?
Regards,
aalto-xml version 0.9.10 (0.9.9 too)
SJSXP, Woodstox parse ok
Exception in thread "main" com.fasterxml.aalto.WFCException: Illegal XML character ((CTRL-CHAR, code 128))
at [row,col {unknown-source}]: [24,7]
at com.fasterxml.aalto.in.XmlScanner.reportInputProblem(XmlScanner.java:1339)
at com.fasterxml.aalto.in.XmlScanner.handleInvalidXmlChar(XmlScanner.java:1536)
at com.fasterxml.aalto.in.Utf8Scanner.skipCharacters(Utf8Scanner.java:806)
at com.fasterxml.aalto.in.XmlScanner.skipToken(XmlScanner.java:425)
at com.fasterxml.aalto.in.StreamScanner.nextFromTree(StreamScanner.java:191)
at com.fasterxml.aalto.stax.StreamReaderImpl.next(StreamReaderImpl.java:757)
I believe that there is a multi-threading bug in XmlScanner/FixedNsContext.
I suspect that this is because XmlScanner uses the static FixedNsContext.EMPTY_CONTEXT initially, so threads are sharing the _tmpDecl array in FixedNsContext.
I am not familiar enough with Aalto to produce a fix, but I did manage to create a reproducible example (you may need to run it a few times to get the error):
package performance;
import com.fasterxml.aalto.AsyncByteBufferFeeder;
import com.fasterxml.aalto.AsyncXMLInputFactory;
import com.fasterxml.aalto.AsyncXMLStreamReader;
import com.fasterxml.aalto.evt.EventAllocatorImpl;
import com.fasterxml.aalto.stax.InputFactoryImpl;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import javax.xml.stream.util.XMLEventAllocator;
public class AaltoBug implements Runnable {
private static String xml = "<?xml version='1.0'?><stream:stream xmlns='jabber:client' xmlns:stream='http://etherx.jabber.org/streams' id='4095288169' from='localhost' version='1.0' xml:lang='en'>";
private static int NUM_THREADS = 5;
private static XMLEventAllocator allocator = EventAllocatorImpl.getDefaultInstance();
private static AsyncXMLInputFactory inputFactory = new InputFactoryImpl();
public static void main(String[] args) throws InterruptedException {
ExecutorService ex = Executors.newFixedThreadPool(NUM_THREADS);
for (int i = 0; i < 100000; i++) {
ex.submit(new AaltoBug(i));
}
ex.shutdown();
ex.awaitTermination(Integer.MAX_VALUE, TimeUnit.SECONDS);
}
private final int count;
public AaltoBug(int count) {
this.count = count;
}
@Override
public void run() {
try {
ByteBuffer bb = StandardCharsets.UTF_8.encode(xml);
AsyncXMLStreamReader<AsyncByteBufferFeeder> parser = inputFactory.createAsyncForByteBuffer();
parser.getInputFeeder().feedInput(bb);
while (parser.hasNext()) {
int eventType = parser.next();
if (eventType == AsyncXMLStreamReader.EVENT_INCOMPLETE) {
break;
}
allocator.allocate(parser);
}
} catch (Exception e) {
System.out.println("Error in " + count);
e.printStackTrace();
}
}
}
Which produces this exception:
java.lang.ArrayIndexOutOfBoundsException: 10
at java.util.ArrayList.add(ArrayList.java:459)
at com.fasterxml.aalto.in.FixedNsContext.reuseOrCreate(FixedNsContext.java:76)
at com.fasterxml.aalto.in.XmlScanner.getNonTransientNamespaceContext(XmlScanner.java:916)
at com.fasterxml.aalto.stax.StreamReaderImpl.getNonTransientNamespaceContext(StreamReaderImpl.java:1528)
at org.codehaus.stax2.ri.evt.Stax2EventAllocatorImpl.createStartElement(Stax2EventAllocatorImpl.java:160)
at org.codehaus.stax2.ri.evt.Stax2EventAllocatorImpl.allocate(Stax2EventAllocatorImpl.java:69)
at com.fasterxml.aalto.evt.EventAllocatorImpl.allocate(EventAllocatorImpl.java:103)
at performance.AaltoBug.run(AaltoBug.java:61)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Aalto registers itself with the JAXP SPI as a SAX parser factory, which causes some other external libraries (liquibase, and a few others) to blow up due to Aalto not implementing validation.
(note: sort of related to #50)
Up to 1.1.0, only UTF-8 and US-ASCII (7-bit Ascii) encodings are supported by async parser.
But since every Latin-1 byte value equals the Unicode code point of the character it encodes (even though the byte-level encoding differs from UTF-8), it should be possible and relatively easy to allow that too.
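The compatibility mentioned here means that decoding Latin-1 needs no lookup table: each byte widens directly to a char. A minimal sketch in plain Java follows, with hypothetical class and method names and no relation to aalto's scanner code. It checks the claim against the JDK's ISO-8859-1 charset for all 256 byte values.

```java
import java.nio.charset.StandardCharsets;

final class Latin1Demo {
    /** Decodes ISO-8859-1 by widening each byte to the code point with the same value. */
    static String decodeLatin1(byte[] bytes) {
        char[] chars = new char[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            chars[i] = (char) (bytes[i] & 0xFF);   // mask to undo sign extension
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String expected = new String(all, StandardCharsets.ISO_8859_1);
        System.out.println(decodeLatin1(all).equals(expected)); // true
    }
}
```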
About the last thing not yet implemented for non-blocking Stax parser is support for coalescing mode. It seems doable, although non-trivial.
I have almost managed to combine aalto-xml and https://github.com/sonatype/async-http-client.
It almost works but I end up with an error when I try to push it hard. I am using latest 0.9.8-SNAPSHOT.
Any idea what this might be?
Concurrency issue in my code?
2012-04-03 22:49:54,598 ERROR [New I/O client worker #1-3] XmlParseAsyncHandler: Got throwable
java.lang.IllegalStateException: Internal error: should never execute this code path
at com.fasterxml.aalto.async.AsyncByteScanner.throwInternal(AsyncByteScanner.java:2936) ~[classes/:na]
at com.fasterxml.aalto.async.AsyncUtfScanner.handleAttrValuePending(AsyncUtfScanner.java:1350) ~[classes/:na]
at com.fasterxml.aalto.async.AsyncUtfScanner.handleAttrValue(AsyncUtfScanner.java:1136) ~[classes/:na]
at com.fasterxml.aalto.async.AsyncByteScanner.handleStartElement(AsyncByteScanner.java:2149) ~[classes/:na]
at com.fasterxml.aalto.async.AsyncByteScanner.nextFromTree(AsyncByteScanner.java:656) ~[classes/:na]
at com.fasterxml.aalto.stax.StreamReaderImpl.next(StreamReaderImpl.java:720) ~[classes/:na]
at net.jakeri.appstore.toplist.XmlParseAsyncHandler.onBodyPartReceived(XmlParseAsyncHandler.java:41) ~[classes/:na]
at com.ning.http.client.providers.netty.NettyAsyncHttpProvider.updateBodyAndInterrupt(NettyAsyncHttpProvider.java:1435) [async-http-client-1.7.1.jar:na]
at com.ning.http.client.providers.netty.NettyAsyncHttpProvider.access$2300(NettyAsyncHttpProvider.java:136) [async-http-client-1.7.1.jar:na]
at com.ning.http.client.providers.netty.NettyAsyncHttpProvider$HttpProtocol.handle(NettyAsyncHttpProvider.java:2183) ~[async-http-client-1.7.1.jar:na]
at com.ning.http.client.providers.netty.NettyAsyncHttpProvider.messageReceived(NettyAsyncHttpProvider.java:1091) [async-http-client-1.7.1.jar:na]
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:75) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:558) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:777) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:143) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:558) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:777) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.unfoldAndFireMessageReceived(ReplayingDecoder.java:522) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:501) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:438) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:75) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.handler.codec.http.HttpClientCodec.handleUpstream(HttpClientCodec.java:72) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:558) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:553) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:343) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:274) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:194) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:102) [netty-3.3.1.Final.jar:na]
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [netty-3.3.1.Final.jar:na]
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [na:1.6.0_29]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [na:1.6.0_29]
at java.lang.Thread.run(Thread.java:680) [na:1.6.0_29]
When I don't set "javax.xml.parsers.SAXParserFactory"
calling:
new com.fasterxml.aalto.sax.SAXParserFactoryImpl().newSAXParser()
gets me a:
com.fasterxml.aalto.sax.SAXParserImpl
but calling:
com.fasterxml.aalto.sax.SAXParserFactoryImpl.newInstance().newSAXParser()
gets me a:
com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl
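A likely explanation, assuming the aalto version in use does not declare its own static newInstance(), is that the call resolves to javax.xml.parsers.SAXParserFactory.newInstance(). Static methods are not overridden, so invoking one through a subclass name runs the superclass method, which here performs the regular JAXP lookup. The sketch below reproduces the effect with two hypothetical classes and no XML libraries.

```java
final class StaticDispatchDemo {
    static class BaseFactory {
        /** Stand-in for a lookup method that does not know which subclass name was used. */
        static BaseFactory newInstance() {
            return new BaseFactory();
        }

        String describe() {
            return "BaseFactory";
        }
    }

    static class CustomFactory extends BaseFactory {
        // No static newInstance() declared here, so CustomFactory.newInstance()
        // compiles to a call to BaseFactory.newInstance().
        @Override
        String describe() {
            return "CustomFactory";
        }
    }

    public static void main(String[] args) {
        System.out.println(CustomFactory.newInstance().describe()); // BaseFactory
        System.out.println(new CustomFactory().describe());         // CustomFactory
    }
}
```

Calling the constructor directly, as in the first snippet of this report, sidesteps the lookup.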
I'm using StAX to do some XML processing, reading events and then writing them, sometimes with some changes, and I've noticed something odd: aalto, given XML containing a carriage-return character reference (such as &#13;) in a text node, will at the end of this process produce text with neither the CR nor a newline. This differs from other StAX implementations and appears to be a bug (the handling of this particular case differs across all implementations I know of, but aalto's seems to be the most wrong).
This program illustrates:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
public final class Minimal {
/** Enumeration of supported StAX implementations. */
public enum StAXImplementation {
AALTO("com.fasterxml.aalto.stax.InputFactoryImpl",
"com.fasterxml.aalto.stax.OutputFactoryImpl",
"com.fasterxml.aalto.stax.EventFactoryImpl"),
/** JDK built in implementation, based on Xerces. */
JDK("com.sun.xml.internal.stream.XMLInputFactoryImpl",
"com.sun.xml.internal.stream.XMLOutputFactoryImpl",
"com.sun.xml.internal.stream.events.XMLEventFactoryImpl"),
WOODSTOX("com.ctc.wstx.stax.WstxInputFactory",
"com.ctc.wstx.stax.WstxOutputFactory",
"com.ctc.wstx.stax.WstxEventFactory"),
XERCES(JDK.inputFactory,
JDK.outputFactory,
"org.apache.xerces.stax.XMLEventFactoryImpl");
final String inputFactory;
final String outputFactory;
final String eventFactory;
private StAXImplementation(final String inputFactory,
final String outputFactory, final String eventFactory) {
this.inputFactory = inputFactory;
this.outputFactory = outputFactory;
this.eventFactory = eventFactory;
}
/**
* Tell the JDK to use this StAXImplementation.
*/
public void use() {
System.setProperty("javax.xml.stream.XMLInputFactory",
inputFactory);
System.setProperty("javax.xml.stream.XMLOutputFactory",
outputFactory);
System.setProperty("javax.xml.stream.XMLEventFactory",
eventFactory);
}
}
static final String CR = new String(Character.toChars(32));
static final String TEXT="a&#13;a";
static final String EXPANDED_TEXT = "a" + CR + "a";
static final String NEWLINE_TEXT = "a\na";
static final String XML = "<x>" + TEXT + "</x>";
static final byte[] XML_BYTES = XML.getBytes(
StandardCharsets.UTF_8);
/** Run some tests for each StAX implementation. */
public static void main(final String[] args) throws Exception {
for (final StAXImplementation impl : StAXImplementation.values()) {
runTest(impl);
System.out.println();
}
}
private static void runTest(final StAXImplementation impl)
throws Exception {
impl.use();
System.out.println("************** Trying " + impl + " **************");
final XMLInputFactory inputFactory = XMLInputFactory.newFactory();
final XMLOutputFactory outputFactory =
XMLOutputFactory.newFactory();
final ByteArrayOutputStream baos = new ByteArrayOutputStream();
System.out.println("-------------- Factory classes:");
printName("XMLInputFactory", inputFactory);
printName("XMLOutputFactory", outputFactory);
printName("XMLEventFactory", XMLEventFactory.newFactory());
System.out.println("-------------- StAX results:");
final XMLEventReader r = inputFactory.createXMLEventReader(
new ByteArrayInputStream(XML_BYTES));
final XMLEventWriter w = outputFactory.createXMLEventWriter(baos);
final StringBuilder buffer = new StringBuilder();
while (r.hasNext()) {
final XMLEvent e = r.nextEvent();
if (e.isStartDocument()) {
// Avoid the XML declaration. Not present in the input.
continue;
}
if (e.isCharacters()) {
buffer.append(e.asCharacters().getData());
} else {
if (buffer.length() > 0) {
testText(buffer.toString());
buffer.setLength(0);
}
}
w.add(e);
}
r.close();
w.flush();
w.close();
final byte[] resultBytes = baos.toByteArray();
System.out.println("StAX XML: [" + new String(resultBytes,
StandardCharsets.UTF_8) + "]");
testDOM(resultBytes);
}
private static void printName(final String name, final Object obj) {
System.out.println(name + "=" + obj.getClass().getName());
}
private static void testText(final String text) {
System.out.println("Buffered text: [" + text + "]");
System.out.println("Code point at index 1: " +
Character.codePointAt(text, 1));
System.out.println("Buffered text equals input text? " +
TEXT.equals(text));
System.out.println(
"Buffered text equals expanded text? " +
EXPANDED_TEXT.equals(text));
System.out.println("Buffered text has \\n for \\r? " +
NEWLINE_TEXT.equals(text));
}
private static void testDOM(final byte[] resultBytes) throws Exception {
final DocumentBuilderFactory dbf =
DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setExpandEntityReferences(false);
final Document doc = dbf.newDocumentBuilder().parse(
new ByteArrayInputStream(resultBytes));
final XPath xpath = XPathFactory.newInstance().newXPath();
final String domText = (String)xpath.evaluate("/x/text()", doc,
XPathConstants.STRING);
System.out.println("============== DOM results:");
testText(domText);
}
}
It's pretty straightforward to run if you have all the implementations in your classpath. I use the following pom:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>foobar</groupId>
<artifactId>stax-stuff</artifactId>
<version>0.0.1-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>com.fasterxml.woodstox</groupId>
<artifactId>woodstox-core</artifactId>
<version>5.0.1</version>
</dependency>
<dependency>
<groupId>xerces</groupId>
<artifactId>xercesImpl</artifactId>
<version>2.11.0</version>
</dependency>
<dependency>
<groupId>com.fasterxml</groupId>
<artifactId>aalto-xml</artifactId>
<version>1.0.0</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.3</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
Output:
************** Trying AALTO **************
-------------- Factory classes:
XMLInputFactory=com.fasterxml.aalto.stax.InputFactoryImpl
XMLOutputFactory=com.fasterxml.aalto.stax.OutputFactoryImpl
XMLEventFactory=com.fasterxml.aalto.stax.EventFactoryImpl
-------------- StAX results:
Buffered text: [a
a]
Code point at index 1: 13
Buffered text equals input text? false
Buffered text equals expanded text? false
Buffered text has \n for \r? false
StAX XML: [<x>aa</x>]
============== DOM results:
Buffered text: [aa]
Code point at index 1: 97
Buffered text equals input text? false
Buffered text equals expanded text? false
Buffered text has \n for \r? false
************** Trying JDK **************
-------------- Factory classes:
XMLInputFactory=com.sun.xml.internal.stream.XMLInputFactoryImpl
XMLOutputFactory=com.sun.xml.internal.stream.XMLOutputFactoryImpl
XMLEventFactory=com.sun.xml.internal.stream.events.XMLEventFactoryImpl
-------------- StAX results:
Buffered text: [a
a]
Code point at index 1: 13
Buffered text equals input text? false
Buffered text equals expanded text? false
Buffered text has \n for \r? false
StAX XML: [<x>a
a</x>]
============== DOM results:
Buffered text: [a
a]
Code point at index 1: 10
Buffered text equals input text? false
Buffered text equals expanded text? false
Buffered text has \n for \r? true
************** Trying WOODSTOX **************
-------------- Factory classes:
XMLInputFactory=com.ctc.wstx.stax.WstxInputFactory
XMLOutputFactory=com.ctc.wstx.stax.WstxOutputFactory
XMLEventFactory=com.ctc.wstx.stax.WstxEventFactory
-------------- StAX results:
Buffered text: [a
a]
Code point at index 1: 13
Buffered text equals input text? false
Buffered text equals expanded text? false
Buffered text has \n for \r? false
StAX XML: [<x>a
a</x>]
============== DOM results:
Buffered text: [a
a]
Code point at index 1: 13
Buffered text equals input text? false
Buffered text equals expanded text? false
Buffered text has \n for \r? false
************** Trying XERCES **************
-------------- Factory classes:
XMLInputFactory=com.sun.xml.internal.stream.XMLInputFactoryImpl
XMLOutputFactory=com.sun.xml.internal.stream.XMLOutputFactoryImpl
XMLEventFactory=org.apache.xerces.stax.XMLEventFactoryImpl
-------------- StAX results:
Buffered text: [a
a]
Code point at index 1: 13
Buffered text equals input text? false
Buffered text equals expanded text? false
Buffered text has \n for \r? false
StAX XML: [<x>a
a</x>]
============== DOM results:
Buffered text: [a
a]
Code point at index 1: 10
Buffered text equals input text? false
Buffered text equals expanded text? false
Buffered text has \n for \r? true
Hello.
I tried to parse XML encoded in windows-1251, but got this error:
com.fasterxml.aalto.WFCException: Unsupported encoding 'windows-1251': only UTF-8 and US-ASCII support by async parser
Are there any plans to support encodings other than UTF-8 and US-ASCII?
Or did I miss something?
Migrate the build over to Maven.
When parsing the following XML:
<root att="&#xA;"></root>
The parser fails with the following error:
com.fasterxml.aalto.WFCException: Unexpected character 'A' (code 65) expected a hex digit (0-9a-fA-F) for character entity
at [row,col {unknown-source}]: [1,16]
at com.fasterxml.aalto.in.XmlScanner.reportInputProblem(XmlScanner.java:1333)
at com.fasterxml.aalto.in.XmlScanner.throwUnexpectedChar(XmlScanner.java:1498)
at com.fasterxml.aalto.async.AsyncByteArrayScanner.handleHexEntityInAttribute(AsyncByteArrayScanner.java:4198)
The parser does not seem to support hexadecimal character references not starting with [0-9] although it is allowed by the XML specification.
I will issue a pull request shortly containing a fix proposal with a JUnit test case.
PS: as it is my first comment here, thanks a lot for this great library!
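As a minimal sketch of the rule at stake (this is not Aalto's scanner code; the class and method names are hypothetical), a hexadecimal character reference may begin with any of 0-9, a-f, or A-F, so a decoder has to accept letters in the first position as well:

```java
class HexCharRef {
    /** Returns the numeric value of one hex digit, or -1 if c is not a hex digit. */
    static int hexDigitValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return 10 + (c - 'a');
        if (c >= 'A' && c <= 'F') return 10 + (c - 'A');
        return -1;
    }

    /** Decodes a reference of the form "&#x...;" into a code point (hypothetical helper). */
    static int decodeHexCharRef(String ref) {
        if (!ref.startsWith("&#x") || !ref.endsWith(";") || ref.length() < 5) {
            throw new IllegalArgumentException("Not a hex character reference: " + ref);
        }
        int value = 0;
        for (int i = 3; i < ref.length() - 1; i++) {
            int digit = hexDigitValue(ref.charAt(i));
            if (digit < 0) {
                throw new IllegalArgumentException("Expected a hex digit at index " + i);
            }
            value = (value << 4) | digit;
            if (value > 0x10FFFF) {
                throw new IllegalArgumentException("Value exceeds the Unicode range");
            }
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(decodeHexCharRef("&#xA;"));   // prints 10
        System.out.println(decodeHexCharRef("&#x3b1;")); // prints 945
    }
}
```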
The following is always -1:
AsyncXMLStreamReader#getLocationInfo().getEndingCharOffset() == -1
AsyncXMLStreamReader#getLocationInfo().getStartingCharOffset() == -1
even though the document definitely contains characters. The byte offset is reported correctly:
AsyncXMLStreamReader#getLocationInfo().getEndingByteOffset() == 686
Please note that the character offset may differ from the byte offset, e.g. when the XML contains emoji (surrogate pairs).
(version 1.0.0)
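To illustrate why the two offsets differ, here is a small standalone sketch using only the JDK (the class name is made up for this example). It counts the same tiny document in UTF-8 bytes, UTF-16 code units, and Unicode code points:

```java
import java.nio.charset.StandardCharsets;

class OffsetDemo {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Units(String s) {
        return s.length();
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 written as a surrogate pair inside a minimal element
        String xml = "<a>\uD83D\uDE00</a>";
        System.out.println(utf16Units(xml)); // prints 9
        System.out.println(codePoints(xml)); // prints 8
        System.out.println(utf8Bytes(xml));  // prints 11
    }
}
```

Which of the two character counts a parser should report as its "character offset" is a design choice; the sketch only shows that neither equals the byte offset once non-ASCII content appears.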
In SAXParserImpl.parse (line 328), an instance of ReaderConfig is created with the publicId and systemId arguments passed in the wrong order. This causes getPublicId() and getSystemId() to return the wrong values.
I noticed that
com.fasterxml.aalto.out.NonRepairingStreamWriter#writeNamespace(XMLConstants.XML_NS_PREFIX, XMLConstants.XML_NS_URI)
writes the full xmlns declaration, e.g.:
<element xmlns:xml="http://www.w3.org/XML/1998/namespace" xml:lang="en">
although it is not necessary.
The reference implementation that ships with the JDK behaves differently: it does not write this declaration.
It's not a bug, but the XML looks bloated.
With the non-blocking StAX implementation, it should be relatively easy to add an interface for non-blocking SAX parsing: feed a block of content, which triggers all SAX callbacks, then return. The main trick is that of keeping track
The Stax2 Java files say that they are subject to the license found in the "LICENSE" file shipped with the source code. However, there is no file named "LICENSE" anywhere in the source. So, what license applies to Stax2? (I looked at stax2-api-3.1.1-sources.jar and the current GitHub source; neither contains a LICENSE file.)
This is a test case with the simplest document I could find. The code reproducing the problem is a copy-paste of an official example with a small fix.
@Test
public void asynchronousParseSmallestDocument() throws Exception {
  final String xml =
      // "<?xml version=\"1.0\" encoding=\"US-ASCII\"?>" +
      "<d/>" ;
  final AsyncXMLInputFactory asyncXmlInputFactory = new InputFactoryImpl() ;
  final AsyncXMLStreamReader< AsyncByteArrayFeeder > xmlStreamReader =
      asyncXmlInputFactory.createAsyncFor( xml.getBytes( Charsets.US_ASCII ) ) ;
  final AsyncByteArrayFeeder inputFeeder = xmlStreamReader.getInputFeeder() ;
  byte[] xmlBytes = xml.getBytes() ;
  int bufferFeedLength = 1 ;
  int currentByteOffset = 0 ;
  int type ;
  do{
    while( ( type = xmlStreamReader.next() ) == AsyncXMLStreamReader.EVENT_INCOMPLETE ) {
      byte[] buffer = new byte[]{ xmlBytes[ currentByteOffset ] } ;
      currentByteOffset ++ ;
      inputFeeder.feedInput( buffer, 0, bufferFeedLength ) ;
      if( currentByteOffset >= xmlBytes.length ) {
        inputFeeder.endOfInput() ;
      }
    }
    switch( type ) {
      case XMLEvent.START_DOCUMENT :
        LOGGER.debug( "start document" ) ;
        break ;
      case XMLEvent.START_ELEMENT :
        LOGGER.debug( "start element: " + xmlStreamReader.getName() ) ;
        break ;
      case XMLEvent.CHARACTERS :
        LOGGER.debug( "characters: " + xmlStreamReader.getText() ) ;
        break ;
      case XMLEvent.END_ELEMENT :
        LOGGER.debug( "end element: " + xmlStreamReader.getName() ) ;
        break ;
      case XMLEvent.END_DOCUMENT :
        LOGGER.debug( "end document" ) ;
        break ;
      default :
        break ;
    }
  } while( type != XMLEvent.END_DOCUMENT ) ;
  xmlStreamReader.close() ;
}
This is what I get:
15:23:41.388 DEBUG [main] c.o.s.m.c.x.SingleContactXmlParserTest - start document
15:23:41.394 DEBUG [main] c.o.s.m.c.x.SingleContactXmlParserTest - start element: d
15:23:41.394 DEBUG [main] c.o.s.m.c.x.SingleContactXmlParserTest - end element: d
com.fasterxml.aalto.WFCException: Unexpected character 'd' (code 100) in epilog (unbalanced start/end tags?)
at [row,col {unknown-source}]: [1,7]
at com.fasterxml.aalto.in.XmlScanner.reportInputProblem(XmlScanner.java:1333)
at com.fasterxml.aalto.in.XmlScanner.throwUnexpectedChar(XmlScanner.java:1498)
at com.fasterxml.aalto.in.XmlScanner.reportPrologUnexpChar(XmlScanner.java:1358)
at com.fasterxml.aalto.async.AsyncByteArrayScanner.nextFromProlog(AsyncByteArrayScanner.java:1068)
at com.fasterxml.aalto.stax.StreamReaderImpl.next(StreamReaderImpl.java:802)
at com.otcdlink.server.model.contact.xml.SingleContactXmlParserTest.asynchronousParseSmallestDocument(SingleContactXmlParserTest.java:102)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
I'm using aalto-xml 1.0.
Currently I get an NPE when I try to do this. I don't need validation, so I can ignore the DTD declarations. Is there a setting for this in aalto-xml?
I have just been trying to use Aalto's parser with Jena's RDF parsers.
I asked the Jena group what was the best way to do that and we thought that
adapting the StaX2SAX would probably do it.
So I worked today on this and ended up with this gist
https://gist.github.com/1713430
It seems to be close to working,
but we get an exception on line 82:
XMLEvent e = eventReader.nextEvent();
javax.xml.stream.XMLStreamException: Unrecognized event type 257.
at org.codehaus.stax2.ri.evt.Stax2EventAllocatorImpl.allocate(Stax2EventAllocatorImpl.java:85)
at org.codehaus.stax2.ri.Stax2EventReaderImpl.createStartDocumentEvent(Stax2EventReaderImpl.java:441)
at org.codehaus.stax2.ri.Stax2EventReaderImpl.nextEvent(Stax2EventReaderImpl.java:245)
at patch.AsyncJenaParser.parse(AsyncJenaParser.java:82)
I suppose this is because I am transforming the AsyncXMLStreamReader into an
XMLEventReader, as that is the closest to the way the Jena code was written. You can
see this in the constructor:
public AsyncJenaParser(ContentHandler handler, AsyncXMLStreamReader streamReader) throws XMLStreamException {
    this.handler = handler;
    this.lhandler = (handler instanceof LexicalHandler) ?
            (LexicalHandler) handler :
            NO_LEXICAL_HANDLER ;
    handler.setDocumentLocator(new LocatorConv(streamReader));
    final XMLInputFactory xf = InputFactoryImpl.newInstance();
    this.streamReader = streamReader;
    this.eventReader = xf.createXMLEventReader(streamReader);
}
So this is not SAX parsing. It is just using an event reader instead of the stream reader. The class is used currently by the following Scala code which extends the com.ning.http.AsyncHandler
Now that the API has been defined and most of the parsing functionality is done (character data, elements, PIs, comments, the XML declaration), the rest of the functionality needs to be completed. What remains is:
I know this is not correct XML, but I am trying to parse some sources that have illegal characters in the XML feed.
Would it be possible to let the parser continue parsing without throwing an exception, and emit a warning instead?
com.fasterxml.aalto.WFCException: Illegal XML character ((CTRL-CHAR, code 11))
at [row,col {unknown-source}]: [29273,485]
at com.fasterxml.aalto.in.XmlScanner.reportInputProblem(XmlScanner.java:1335) ~[aalto-xml-0.9.8.jar:na]
at com.fasterxml.aalto.in.XmlScanner.throwInvalidXmlChar(XmlScanner.java:1525) ~[aalto-xml-0.9.8.jar:na]
at com.fasterxml.aalto.async.AsyncUtfScanner.skipCharacters(AsyncUtfScanner.java:755) ~[aalto-xml-0.9.8.jar:na]
at com.fasterxml.aalto.async.AsyncByteScanner.nextFromTree(AsyncByteScanner.java:572) ~[aalto-xml-0.9.8.jar:na]
at com.fasterxml.aalto.stax.StreamReaderImpl.next(StreamReaderImpl.java:720) ~[aalto-xml-0.9.8.jar:na]
...
and
com.fasterxml.aalto.WFCException: Illegal XML character ((CTRL-CHAR, code 22))
at [row,col {unknown-source}]: [8982,3]
at com.fasterxml.aalto.in.XmlScanner.reportInputProblem(XmlScanner.java:1335)
at com.fasterxml.aalto.in.XmlScanner.throwInvalidXmlChar(XmlScanner.java:1525)
at com.fasterxml.aalto.async.AsyncUtfScanner.skipCharacters(AsyncUtfScanner.java:755)
at com.fasterxml.aalto.async.AsyncByteScanner.nextFromTree(AsyncByteScanner.java:572)
at com.fasterxml.aalto.stax.StreamReaderImpl.next(StreamReaderImpl.java:720)
...
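One possible workaround, independent of the parser, is to replace the offending characters before the text reaches it. The sketch below is only an illustration under that assumption (the class and method names are hypothetical, and it operates on already-decoded text rather than on the async byte feed). It keeps exactly the characters allowed by the XML 1.0 Char production and substitutes everything else:

```java
class XmlCharFilter {
    /** True if cp is allowed by the XML 1.0 "Char" production. */
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every character that XML 1.0 forbids with the given replacement. */
    static String replaceInvalidChars(String in, char replacement) {
        StringBuilder sb = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);      // unpaired surrogates come back as-is
            i += Character.charCount(cp);
            if (isValidXml10Char(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(replacement);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Codes 11 and 22 are the two control characters from the traces above
        String dirty = "<x>a\u000Bb\u0016c</x>";
        System.out.println(replaceInvalidChars(dirty, ' ')); // prints <x>a b c</x>
    }
}
```

Replacing rather than deleting keeps the length in UTF-16 units unchanged, so row and column positions reported later still line up with the original input.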
Consider:
SAXParser parser = /** retrieve Aalto SAX Parser **/;
parser.setProperty("http://xml.org/sax/properties/document-xml-version","1.0");
this throws:
org.xml.sax.SAXNotRecognizedException: Feature 'http://xml.org/sax/properties/document-xml-version' not recognized
at com.fasterxml.aalto.sax.SAXUtil.reportUnknownFeature(SAXUtil.java:121)
at com.fasterxml.aalto.sax.SAXParserImpl.setProperty(SAXParserImpl.java:145)
Hi,
Great job on the async xml parser. I'm using it in conjunction with another Ning library, the async-http-client, and so far it's been working great.
I have a question though, how should I deal with methods that span multiple events, e.g. reader.getElementText()? For example, the fragment I want to process is "hello worl". If the cursor is currently on the start element (), when I call getElementText it will throw a WFCException because the next token is the Event Incomplete token, which it couldn't handle. Thanks for your help.
Hello,
I experimented with the example that sets up the speed test for the XML parser and ran into the following problems:
First it did not find the InputFactoryImpl so I had to add it specifically to the system properties:
System.setProperty("com.fasterxml.aalto.stax.InputFactoryImpl","com.fasterxml.aalto.stax.InputFactoryImpl");
And secondly if I try to run it on an ordinary XML-File it gives me:
Exception in thread "main" com.fasterxml.aalto.WFCException: Unexpected End-of-input in prolog
at [row,col {unknown-source}]: [1,1]
at com.fasterxml.aalto.stax.StreamReaderImpl.throwWfe(StreamReaderImpl.java:1775)
at com.fasterxml.aalto.stax.StreamReaderImpl.throwUnexpectedEOI(StreamReaderImpl.java:1794)
at com.fasterxml.aalto.stax.StreamReaderImpl.handlePrologEoi(StreamReaderImpl.java:1754)
at com.fasterxml.aalto.stax.StreamReaderImpl.next(StreamReaderImpl.java:810)
at com.fasterxml.aalto.speedtest.SpeedTest.execute(SpeedTest.java:27)
at com.fasterxml.aalto.speedtest.SpeedTest.main(SpeedTest.java:47)
I followed the basic StAX setup, e.g. as described here:
http://www.studytrails.com/java/xml/aalto/java-xml-aalto-stax-parsing/
Any suggestions to resolve this?
Kind regards
Consider the following setup.
If aalto-xml 1.0.0 is used as the StAX backend, the xmlns attribute of the resulting document's root schema tag is written 3 times, as follows:
<schema xmlns="http://www.w3.org/2001/XMLSchema" xmlns="http://www.w3.org/2001/XMLSchema" xmlns="http://www.w3.org/2001/XMLSchema" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" targetNamespace="http://www.w3.org/2000/09/xmldsig#" version="0.1" elementFormDefault="qualified">
Note that if I simply switch the implementation to woodstox-core 5.0.3, everything works fine, so this is definitely an issue with aalto-xml.
Also note that definitions of prefixed namespaces (like xmlns:xx) are unaffected by the issue : only the "xmlns" attribute is affected.
I'm having trouble with the library com.fasterxml.aalto-xml.
I have a project A (DSpace) that does not depend on com.fasterxml.aalto-xml.
I developed a library B that uses a library C, which depends on com.fasterxml.aalto-xml.
I wonder why, when com.fasterxml.aalto-xml is added as a dependency of A by way of B -> C, A picks up com.fasterxml.aalto-xml for its usual XML parsing.
The problem is that this breaks the application: com.fasterxml.aalto-xml has trouble parsing the XML that originally ships with A.
I would like to understand why this happens. Is there a way to tell A not to use com.fasterxml.aalto-xml while still keeping B -> C -> com.fasterxml.aalto-xml working within project A?
What is particular about these XML parsers? I don't understand why A would pick up com.fasterxml.aalto-xml when it has no direct dependency on it.
I can see that the library is also an OSGi bundle jar; does that have any implication here?
Originally A (DSpace) works without Aalto and uses whatever parser it has, so why does it pick Aalto now that it is among the dependencies? What mechanism allows that?
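For background on the mechanism being asked about: the standard StAX entry points (XMLInputFactory.newFactory() and newInstance()) do not hard-code an implementation. They consult the system property javax.xml.stream.XMLInputFactory, then JAXP configuration files, then service-provider entries found on the classpath, and only then fall back to the JDK default. The sketch below uses only the JDK and simply reports what that lookup resolves to; it assumes, without having verified it here, that the Aalto jar registers itself through such a provider entry, which would explain why merely adding it changes the result.

```java
import javax.xml.stream.XMLInputFactory;

class StaxLookupDemo {
    /** Name of the system property consulted first by the StAX factory lookup. */
    static final String KEY = "javax.xml.stream.XMLInputFactory";

    /** Returns the class name of whichever implementation the lookup selects. */
    static String resolvedFactoryClass() {
        return XMLInputFactory.newFactory().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("System property: " + System.getProperty(KEY));
        System.out.println("Resolved factory: " + resolvedFactoryClass());
        // To pin a specific implementation for the whole JVM, the property can be set
        // at startup, for example (class name taken from the JDK's bundled parser):
        //   -Djavax.xml.stream.XMLInputFactory=com.sun.xml.internal.stream.XMLInputFactoryImpl
    }
}
```

Code that specifically wants Aalto (library C in the question) can avoid the lookup altogether by instantiating its factory class directly, which leaves the application's own choice untouched.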
Specifically, the method writeSplitCharacters
outputs an additional byte after every output2ByteChar(ch)
call. There is a one-line fix for this. The code fragment
    case CT_MULTIBYTE_2: // 3, 4 and N can never occur
        // To off-line or not?
        output2ByteChar(ch);
        break;
in writeSplitCharacters
should be changed to
    case CT_MULTIBYTE_2: // 3, 4 and N can never occur
        // To off-line or not?
        output2ByteChar(ch);
        continue main_loop;
Tell me if a pull request is needed for this.
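The difference between the two fragments can be reproduced with a toy encoder. This is not Aalto's writeSplitCharacters; it is a simplified model that assumes the loop body writes the character as a single byte after the switch, and the `fixed` flag is added purely for comparison. With `break`, control only leaves the switch and the single-byte write still runs; with `continue main_loop`, it is skipped.

```java
import java.io.ByteArrayOutputStream;

class LabeledContinueDemo {
    static final int CT_PLAIN = 0;
    static final int CT_MULTIBYTE_2 = 1;

    /** Toy classifier: assumes the input only contains characters below U+0800. */
    static int charType(char ch) {
        return ch < 0x80 ? CT_PLAIN : CT_MULTIBYTE_2;
    }

    /** Encodes text as UTF-8; fixed=false reproduces the stray extra byte. */
    static byte[] encode(String text, boolean fixed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        main_loop:
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            switch (charType(ch)) {
            case CT_MULTIBYTE_2:
                out.write(0xC0 | (ch >> 6));    // lead byte of the 2-byte sequence
                out.write(0x80 | (ch & 0x3F));  // continuation byte
                if (fixed) {
                    continue main_loop;         // skip the single-byte write below
                }
                break;                          // leaves only the switch
            default:
                break;
            }
            out.write(ch);                      // meant for single-byte characters only
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "a\u00E9b";
        System.out.println(encode(text, false).length); // prints 5
        System.out.println(encode(text, true).length);  // prints 4
    }
}
```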
Whilst running a test in my environment I encountered a bug in aalto-xml-0-9.11.jar.
I debugged the code and think that I discovered what the problem was.
It occurred in ByteXmlWriter.writeCharacters() called from longWriteCharacters().
I was attempting to append 4K of XML to an existing XML document (in my case a soapenv body that was being built up). _outputPtr was 407 at the start of the 4K output. The first part of the 4K text ran for 325 characters before a newline character was encountered. I think the fault is that at this point in the code (_config.willEscapeCR()) _outputPtr is not set to ptr. So in my case ptr was 692, but the position was reset to 407. This left me with garbled output.
I tried a patched aalto-xml jar with a modified ByteXmlWriter.writeCharacters():
    if (_config.willEscapeCR()) {
        _outputPtr = ptr;
        break;
    }
This appeared to fix my issue.
The Aalto wiki states that Aalto is distributed with the Apache 2.0 license, but the only license attached to the actual source is GPL 3.0.
We'd love to be able to use Aalto, but sadly if it's GPL, we can't. Can we shed some light on what the actual license is?
Validation is not working with an XMLStreamWriter. A quick comparison with Woodstox makes it obvious that several calls to XMLValidator are missing in StreamWriterBase (at least in writeAttribute(...), writeStartElement(...) and _closeStartElement(...)) and its subclasses (several locations).
Furthermore, the call to XMLValidator.validateElementEnd(...) made through _validator in StreamWriterBase mixes up the prefix and nsUri arguments.
The javadoc for AsyncByteScanner.validPublicIdChar() references PubidLiteral in the XML 1.0 specification:
PubidLiteral ::= '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"
PubidChar ::= #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]
Note that this includes [0-9]. However, the implementation of the method does not:
protected boolean validPublicIdChar(int c) {
    return
        c == 0xA ||                     //<LF>
        c == 0xD ||                     //<CR>
        c == 0x20 ||                    //<SPACE>
        (c >= '@' && c <= 'Z') ||       //@[A-Z]
        (c >= 'a' && c <= 'z') ||
        c == '!' ||
        (c >= 0x23 && c <= 0x25) ||     //#$%
        (c >= 0x27 && c <= 0x2F) ||     //'()*+,-./
        (c >= ':' && c <= ';') ||
        c == '=' ||
        c == '?' ||
        c == '_';
}
Note also that com.fasterxml.aalto.util.XmlCharTypes.PUBID_CHARS
correctly includes these digits.
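One way the check could be brought in line with the PubidChar production quoted above is sketched below. It is a standalone illustration rather than the project's actual fix, and the enclosing class name is made up. Because the digits 0x30-0x39 sit directly between '/' (0x2F) and ':' (0x3A), widening the existing range to end at ';' (0x3B) covers them:

```java
class PubidChars {
    /** True if c is allowed by the XML 1.0 PubidChar production. */
    static boolean validPublicIdChar(int c) {
        return c == 0xA                      // <LF>
            || c == 0xD                      // <CR>
            || c == 0x20                     // <SPACE>
            || (c >= '@' && c <= 'Z')        // @ and A-Z
            || (c >= 'a' && c <= 'z')
            || c == '!'
            || (c >= 0x23 && c <= 0x25)      // # $ %
            || (c >= 0x27 && c <= 0x3B)      // ' ( ) * + , - . / 0-9 : ;
            || c == '='
            || c == '?'
            || c == '_';
    }

    public static void main(String[] args) {
        System.out.println(validPublicIdChar('3')); // prints true
        System.out.println(validPublicIdChar('<')); // prints false
    }
}
```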
<!DOCTYPE> declaration (e.g., add it to this file):
<!DOCTYPE ead PUBLIC "+// http://ead3.archivists.org/schema/ //DTD ead3 (Encoded Archival Description (EAD) Version 3)//EN" "ead3.dtd">
AsyncXMLStreamReader.
<!DOCTYPE> declaration.
WFCException:
Error parsing XML stream
com.fasterxml.aalto.WFCException: Unexpected character '3' (code 51) in prolog (not valid in PUBLIC ID)
at [row,col {unknown-source}]: [1,77]
at com.fasterxml.aalto.in.XmlScanner.reportInputProblem(XmlScanner.java:1333)
at com.fasterxml.aalto.in.XmlScanner.throwUnexpectedChar(XmlScanner.java:1498)
at com.fasterxml.aalto.in.XmlScanner.reportPrologUnexpChar(XmlScanner.java:1358)
at com.fasterxml.aalto.async.AsyncByteBufferScanner.parseDtdId(AsyncByteBufferScanner.java:1946)
at com.fasterxml.aalto.async.AsyncByteBufferScanner.handleDTD(AsyncByteBufferScanner.java:1833)
at com.fasterxml.aalto.async.AsyncByteBufferScanner.handlePrologDeclStart(AsyncByteBufferScanner.java:1264)
at com.fasterxml.aalto.async.AsyncByteBufferScanner.nextFromProlog(AsyncByteBufferScanner.java:1067)
at com.fasterxml.aalto.stax.StreamReaderImpl.next(StreamReaderImpl.java:790)
at org.codehaus.stax2.ri.Stax2EventReaderImpl.nextEvent(Stax2EventReaderImpl.java:255)
In the README.MD and in the wiki, none of the links to the fasterxml.com domain work.
The build.xml refers to GPL. Given the project was relicensed in 2010, that's presumably old cruft.
Hi,
The following source files are without license headers:
./src/main/java/com/fasterxml/aalto/AsyncByteArrayFeeder.java
./src/main/java/com/fasterxml/aalto/AsyncByteBufferFeeder.java
./src/main/java/com/fasterxml/aalto/AsyncInputFeeder.java
./src/main/java/com/fasterxml/aalto/AsyncXMLInputFactory.java
./src/main/java/com/fasterxml/aalto/AsyncXMLStreamReader.java
./src/main/java/com/fasterxml/aalto/package-info.java
./src/main/java/com/fasterxml/aalto/ValidationException.java
./src/main/java/com/fasterxml/aalto/async/AsyncByteScanner.java
./src/main/java/com/fasterxml/aalto/async/AsyncStreamReaderImpl.java
./src/main/java/com/fasterxml/aalto/async/package-info.java
./src/main/java/com/fasterxml/aalto/dom/BijectiveNsMap.java
./src/main/java/com/fasterxml/aalto/dom/DOMOutputElement.java
./src/main/java/com/fasterxml/aalto/dom/DOMReaderImpl.java
./src/main/java/com/fasterxml/aalto/dom/DOMWriterImpl.java
./src/main/java/com/fasterxml/aalto/dom/OutputElementBase.java
./src/main/java/com/fasterxml/aalto/evt/IncompleteEvent.java
./src/main/java/com/fasterxml/aalto/impl/CommonConfig.java
./src/main/java/com/fasterxml/aalto/impl/ErrorConsts.java
./src/main/java/com/fasterxml/aalto/impl/IoStreamException.java
./src/main/java/com/fasterxml/aalto/impl/LocationImpl.java
./src/main/java/com/fasterxml/aalto/in/ElementScope.java
./src/main/java/com/fasterxml/aalto/in/FixedNsContext.java
./src/main/java/com/fasterxml/aalto/in/MergedStream.java
./src/main/java/com/fasterxml/aalto/in/PName3.java
./src/main/java/com/fasterxml/aalto/in/ReaderConfig.java
./src/main/java/com/fasterxml/aalto/in/ReaderScanner.java
./src/main/java/com/fasterxml/aalto/in/XmlScanner.java
./src/main/java/com/fasterxml/aalto/io/UTF8Writer.java
./src/main/java/com/fasterxml/aalto/sax/SAXFeature.java
./src/main/java/com/fasterxml/aalto/sax/SAXProperty.java
./src/main/java/com/fasterxml/aalto/out/WriterConfig.java
./src/main/java/com/fasterxml/aalto/util/CharsetNames.java
./src/main/java/com/fasterxml/aalto/util/DataUtil.java
./src/main/java/com/fasterxml/aalto/util/EmptyIterator.java
./src/main/java/com/fasterxml/aalto/util/IllegalCharHandler.java
./src/main/java/com/fasterxml/aalto/util/SingletonIterator.java
./src/main/java/com/fasterxml/aalto/util/TextAccumulator.java
./src/main/java/com/fasterxml/aalto/util/TextBuilder.java
./src/main/java/com/fasterxml/aalto/util/TextUtil.java
./src/main/java/com/fasterxml/aalto/util/UriCanonicalizer.java
./src/main/java/com/fasterxml/aalto/util/XmlConsts.java
./src/main/java/test/BasePerfTest.java
./src/main/java/test/RunStreamWriter.java
./src/main/java/test/TestAsyncReader.java
./src/main/java/test/TestBase64Reader.java
./src/main/java/test/TestEventReader.java
./src/main/java/test/TestLineReader.java
./src/main/java/test/TestNameHashing.java
./src/main/java/test/TestNameTable.java
./src/main/java/test/TestPNamePerf.java
./src/main/java/test/TestRawStream.java
./src/main/java/test/TestSaxReader.java
./src/main/java/test/TestScannerPerf.java
./src/main/java/test/TestStreamCopier.java
./src/main/java/test/TestStreamReader.java
./src/main/java/test/TestTypedSpeed.java
./src/main/java/test/TestUTF8.java
./src/test/java/async/AsyncReaderWrapperForByteArray.java
./src/test/java/async/AsyncReaderWrapperForByteBuffer.java
./src/test/java/async/AsyncTestBase.java
./src/test/java/async/TestAsyncViaEventReader.java
./src/test/java/async/TestCDataParsing.java
./src/test/java/async/TestCharactersParsing.java
./src/test/java/async/TestCommentParsing.java
./src/test/java/async/TestDoctypeParsing.java
./src/test/java/async/TestElementParsing.java
./src/test/java/async/TestEntityParsing.java
./src/test/java/async/TestPIParsing.java
./src/test/java/async/TestSurrogates.java
./src/test/java/async/TestXmlDeclaration.java
./src/test/java/base/BaseTestCase.java
./src/test/java/evt/TestLaziness.java
./src/test/java/sax/TestEntityResolver.java
./src/test/java/sax/TestSaxReader.java
./src/test/java/stream/TestDTDSkimming.java
./src/test/java/stream/TestNameDecoding.java
./src/test/java/stream/TestSimple.java
./src/test/java/stream/TestSurrogates.java
./src/test/java/util/TestPNameTable.java
./src/test/java/util/TestTextAccumulator.java
./src/test/java/wstream/TestIndentation.java
./src/test/java/wstream/TestLongerContent.java
./src/test/java/wstream/TestNameEncoding.java
Please confirm the licensing of the code and/or content, and add license headers.
Thanks in advance
Regards
I have a file whose XML declaration says it is UTF-16 when it is in fact UTF-8. I've had to copy the parser class and do some ugly hacks to override the default behaviour (which is to fail when an unexpected encoding is encountered):
override def verifyAndSetXmlEncoding(): Unit = {
  val enc = CharsetNames.normalize(_textBuilder.contentsAsString)
  _config.setXmlEncoding(enc)
  /* 09-Feb-2011, tatu: For now, we will only accept UTF-8 and ASCII; could
   * expand in future (Latin-1 should be doable) */
  if ((CharsetNames.CS_UTF8 != enc) && (CharsetNames.CS_US_ASCII != enc)) {
    _config.setXmlEncoding("UTF-8")
  }
}
It should be possible to override this behaviour and just set the encoding manually. In this case I have no control over the file, so changing the file is not an option.
Otherwise, great library :)
~Karl
Hello,
I wanted to test the performance of Aalto, so I wrote this simple test program in which I indent some XML using an XSLT stylesheet, with a StAXSource as the input and a StAXResult as the output of the Transformer.
Here is my test code:
public static void main(String[] args) throws Exception {
    SAXTransformerFactory xformerFactory = (SAXTransformerFactory) TransformerFactory
            .newInstance();
    InputStream is = null;
    is = XMLTest.class.getResourceAsStream("/xml/indenter.xsl");
    Templates indenter = xformerFactory.newTemplates(new StreamSource(is));
    is.close();
    System.out.println(indenter);
    File xml = new File(args[0]);
    File outXml = new File(args[0] + ".out");
    // XMLInputFactory xif = XMLInputFactory.newFactory(
    //         "com.sun.xml.internal.stream.XMLInputFactoryImpl", XMLTest.class.getClassLoader());
    XMLInputFactory xif = XMLInputFactory.newFactory(
            "com.fasterxml.aalto.stax.InputFactoryImpl", XMLTest.class.getClassLoader());
    xif.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
    xif.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
    System.out.println(xif);
    // XMLOutputFactory xof = XMLOutputFactory.newFactory(
    //         "com.sun.xml.internal.stream.XMLOutputFactoryImpl", XMLTest.class.getClassLoader());
    XMLOutputFactory xof = XMLOutputFactory.newFactory(
            "com.fasterxml.aalto.stax.OutputFactoryImpl", XMLTest.class.getClassLoader());
    System.out.println(xof);
    Source in = new StAXSource(xif.createXMLStreamReader(new FileInputStream(xml)));
    Result out = new StAXResult(
            xof.createXMLStreamWriter(new FileOutputStream(outXml), "UTF-8"));
    indenter.newTransformer().transform(in, out);
    // xformerFactory.newTransformer().transform(in, out);
}
And I get the exception visible in the title. Any help would be appreciated.
Nicolas
Hi,
The flow is pretty basic.
Using aalto-xml version 1.0.0
Attached reproduction zip file (Eclipse Neon).
Thank you!
reproduction_env.zip