real-logic / artio
Artio - Resilient High-Performance FIX and FIXP Gateway
License: Apache License 2.0
So far, testing has primarily focused on FIX 4.4. We should evaluate behavioural differences with FIX 4.2 and enable the acceptance tests in the fix integration project accordingly.
We need to implement support for clustering a series of gateways in a reliable fashion. This includes:
Currently the parser doesn't completely validate group messages, even with validation mode enabled.
Validate checksum in the framer?
Do we validate fields via a dictionary?
How do users want to know whether validation has passed or failed? Callbacks, a log?
FIX Library instances should be configured to either be an acceptor or an initiator. The gateway should verify that only one acceptor has connected at any point in time.
Currently we have a performance-testing-oriented latency histogram setup that can be enabled/disabled and prints the latency histogram to the command line.
We should dump out latency histograms periodically to a binary log that can be independently read/printed/monitored.
Maintain some kind of mapping from library id to sessions associated with that library.
If you want to shut down the gateway, log off and disconnect every session.
When a library or application has a restart or failover, whatever picks up the sessions in question needs to know what has been sent to the client and what hasn't.
So we should be able to expose high water marks of what messages have been sent out via TCP. Probably best for this to be a periodic tick stream of update information.
The interpretation of scale in DecimalFloat.toString is unexpected and inconsistent with the way scale is used in classes such as BigDecimal.
I would expect new DecimalFloat(12345, 2) to have a string representation of "123.45". Instead it has "12.345", i.e. scale moves the decimal point from the left instead of from the right.
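For comparison, a minimal sketch of how BigDecimal interprets scale, using only the JDK. ScaleExample and render are hypothetical names for illustration, not part of the gateway.

```java
import java.math.BigDecimal;

final class ScaleExample
{
    // BigDecimal treats scale as the number of digits to the right of the
    // decimal point, so the point is counted from the right of the value.
    static String render(final long unscaledValue, final int scale)
    {
        return BigDecimal.valueOf(unscaledValue, scale).toPlainString();
    }

    public static void main(final String[] args)
    {
        System.out.println(render(12345, 2)); // prints 123.45
    }
}
```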
For FIX message types that have two characters, the header.msgType("..."); code in the generated encoder constructor is incorrect. EncoderGenerator.generateConstructor directly converts the "packed" message type integer to a string. It needs to unpack first.
In the generated decoder the MESSAGE_TYPE_BYTES value is incorrect for the same reason.
The Message.toString method should probably also return the "proper" message type string.
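To illustrate the difference between converting and unpacking, here is a rough sketch. It assumes the packed form stores one character per byte with the first character in the lowest byte; the real packing used by the generator may differ, and MessageTypePacking, pack and unpack are hypothetical names.

```java
final class MessageTypePacking
{
    // Assumed layout: character i occupies byte i of the int (up to 4 chars).
    static int pack(final String messageType)
    {
        int packed = 0;
        for (int i = 0; i < messageType.length(); i++)
        {
            packed |= (messageType.charAt(i) & 0xFF) << (8 * i);
        }
        return packed;
    }

    // Reverses pack(): peel off one byte at a time, lowest byte first.
    static String unpack(final int packed)
    {
        final StringBuilder builder = new StringBuilder(4);
        int remaining = packed;
        while (remaining != 0)
        {
            builder.append((char)(remaining & 0xFF));
            remaining >>>= 8;
        }
        return builder.toString();
    }

    public static void main(final String[] args)
    {
        final int packed = pack("AE");
        // Converting the integer directly gives digits, not the message type.
        System.out.println(String.valueOf(packed));
        System.out.println(unpack(packed));
    }
}
```

Under this assumed layout a single-character type survives a plain cast to char, which would explain why only two-character types are affected.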
At the moment we don't checksum components of the message log. This means that a disk failure or a hard machine power-off can corrupt the message log without us being able to detect the corruption or ignore corrupted messages.
We should incrementally checksum the message log and be able to skip and report on corrupted log messages.
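One possible shape for this, as a sketch only: each record carries a CRC32 of its payload, and the reader reports a mismatch and moves on to the next record. The record layout (length, checksum, payload) and the names ChecksummedLog, append and readNext are assumptions for illustration, not the existing log format.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

final class ChecksummedLog
{
    // Assumed record layout: [length:int][crc32:int][payload bytes]
    static void append(final ByteBuffer log, final byte[] payload)
    {
        final CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        log.putInt(payload.length);
        log.putInt((int)crc.getValue());
        log.put(payload);
    }

    // Reads the next record. Returns null if the payload fails its checksum;
    // the buffer position has still advanced past the record, so the caller
    // can report the corruption and carry on with the following record.
    static byte[] readNext(final ByteBuffer log)
    {
        final int length = log.getInt();
        final int expectedChecksum = log.getInt();
        final byte[] payload = new byte[length];
        log.get(payload);

        final CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return (int)crc.getValue() == expectedChecksum ? payload : null;
    }
}
```

This only protects the payload. A corrupted length field would still break skipping, so a real design would probably also need to checksum or otherwise validate the record header.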
When I launch a FixEngine and FixLibrary instance without setting monitoringFile explicitly to different values, the library will not initialise because it's trying to map the same file and, at least on Windows, I get this:
Exception in thread "main" java.nio.file.FileSystemException: C:\Users\43854743\AppData\Local\Temp\fix\monitoring: The process cannot access the file because it is being used by another process.
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
at java.nio.file.Files.delete(Files.java:1126)
at uk.co.real_logic.agrona.IoUtil.deleteIfExists(IoUtil.java:167)
at uk.co.real_logic.fix_gateway.MonitoringFile.<init>(MonitoringFile.java:53)
at uk.co.real_logic.fix_gateway.GatewayProcess.initMonitoring(GatewayProcess.java:49)
at uk.co.real_logic.fix_gateway.GatewayProcess.init(GatewayProcess.java:42)
at uk.co.real_logic.fix_gateway.library.FixLibrary.<init>(FixLibrary.java:75)
at com.hsbc.efx.erisk.lfix.server.Server.boot(Server.java:37)
at com.hsbc.efx.erisk.lfix.server.Server.main(Server.java:27)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
The other process in this case is the current process. Should engine and library be able to share the same monitor file? If not, should the default monitor file path be different for EngineConfiguration and LibraryConfiguration?
QuickFIX rejects connections if their time is outside of a certain latency band (both past and future). Should we also validate the sending time of messages?
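A check of this kind could look roughly like the following sketch, which accepts a sending time only if it is within a configured window either side of the local clock. SendingTimeCheck and isWithinWindow are hypothetical names, and the window size is left to the caller.

```java
import java.time.Duration;
import java.time.Instant;

final class SendingTimeCheck
{
    // True if sendingTime is no more than `window` before or after `now`.
    static boolean isWithinWindow(final Instant sendingTime, final Instant now, final Duration window)
    {
        final Duration skew = Duration.between(sendingTime, now).abs();
        return skew.compareTo(window) <= 0;
    }

    public static void main(final String[] args)
    {
        final Instant now = Instant.now();
        // The 120 second window here is just an example value.
        System.out.println(isWithinWindow(now.minusSeconds(10), now, Duration.ofSeconds(120)));
    }
}
```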
Some framer operations (for example acquiring a session and resetting the framer id) wait for the archival mechanism to update some state or catch up. They have idle strategy wait loops with no bounds. This could cause unbounded latency on critical path code or nasty gateway pauses.
Perhaps think about timeouts and failures or having a kind of duty cycle where these operations are checked and replied to?
Fixing this issue isn't critical as these aren't common operations.
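As a sketch of the timeout option, a wait loop can be bounded with a deadline and report failure instead of spinning forever. This uses plain JDK calls rather than the gateway's idle strategies; BoundedWait and awaitCondition are hypothetical names.

```java
import java.util.function.BooleanSupplier;

final class BoundedWait
{
    // Polls the condition until it holds or the timeout elapses.
    // Returns true if the condition was met, false on timeout.
    static boolean awaitCondition(final BooleanSupplier condition, final long timeoutNanos)
    {
        final long deadline = System.nanoTime() + timeoutNanos;
        while (!condition.getAsBoolean())
        {
            // Subtraction form stays correct even if nanoTime wraps around.
            if (System.nanoTime() - deadline >= 0)
            {
                return false;
            }
            Thread.yield();
        }
        return true;
    }
}
```

The caller then has to decide what a timeout means, for example replying with an error to whoever requested the operation.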
Simplifies implementation and removes fragmentation issues.
We need to establish that we're meeting our performance SLAs when benchmarking on non-resource-constrained hardware. This includes adding any further monitoring needed to identify the sources of bottlenecks against those SLAs.
So you get some warning if your engine disappears.
This would store a mapping from session ids to a pair of sender comp id and target comp id. See the SessionIds and SessionIdStrategy classes.
When libraries claim a session (see #39) they need to be able to easily get a replay of the messages that have passed through that stream.
At the moment this is possible, but only by back scanning the journal of messages.
See Fix-Integration project for details.
At the moment our use of transferTo in the archiver doesn't correspond to just a single system call; it also involves allocating memory mapped files and copying using them. We need to identify why the JDK is doing this and convince it to use the sendfile system call.
If a single FIX Connection is failing to read off its channel fast enough and is blocking sending to other clients then we should cut off and disconnect the client in question. We should also log why we cut off this client.
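For reference, a minimal sketch of how FileChannel.transferTo is typically driven. Which system calls the JDK uses underneath depends on the platform and the type of the target channel; the loop below only handles the documented case where fewer bytes are transferred than requested. TransferExample and transferFully are hypothetical names.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

final class TransferExample
{
    // Transfers up to `count` bytes starting at `position` in the source file.
    // Returns the number of bytes actually transferred.
    static long transferFully(
        final FileChannel source,
        final long position,
        final long count,
        final WritableByteChannel target) throws IOException
    {
        long transferred = 0;
        while (transferred < count)
        {
            final long n = source.transferTo(position + transferred, count - transferred, target);
            if (n <= 0)
            {
                break; // nothing more could be transferred, e.g. end of file
            }
            transferred += n;
        }
        return transferred;
    }
}
```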
Change TCP position to be "the engine/cluster owns the message" rather than "has sent on a TCP stream".
I start an acceptor and connect to it with a FIX test client. The FIX session is successfully established. I then kill the test client.
The server does not handle the resulting IOException in the socket channel read gracefully. I see an infinite loop of the following:
2015-09-09T17:44:08.539: java.io.IOException(An existing connection was forcibly closed by the remote host)
sun.nio.ch.SocketDispatcher.read0(SocketDispatcher.java:-2)
sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
sun.nio.ch.IOUtil.read(IOUtil.java:192)
sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
uk.co.real_logic.fix_gateway.engine.framer.ReceiverEndPoint.readData(ReceiverEndPoint.java:143)
uk.co.real_logic.fix_gateway.engine.framer.ReceiverEndPoint.pollForData(ReceiverEndPoint.java:124)
uk.co.real_logic.fix_gateway.engine.framer.ReceiverEndPointPoller.pollEndPoints(ReceiverEndPointPoller.java:118)
uk.co.real_logic.fix_gateway.engine.framer.Framer.pollEndPoints(Framer.java:197)
uk.co.real_logic.fix_gateway.engine.framer.Framer.doWork(Framer.java:137)
uk.co.real_logic.agrona.concurrent.AgentRunner.run(AgentRunner.java:105)
java.lang.Thread.run(Thread.java:745)
Specifically things that can have error replies.
This also requires usage documentation and samples. Key features include:
add the remote address as a description field in the counter.
If we throw an exception or error whilst writing anything involving a buffer claim we should abort the claim.
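The pattern could be sketched as below. The Claim interface here is a hypothetical stand-in with commit and abort operations (Aeron's BufferClaim offers methods with those names); ClaimGuard and writeThenCommit are illustrative names only.

```java
final class ClaimGuard
{
    // Hypothetical stand-in for a claimed region of a buffer.
    interface Claim
    {
        void commit();

        void abort();
    }

    // Runs the writer against an already claimed region. If the writer throws,
    // the claim is aborted and the failure is rethrown; otherwise it is committed.
    static void writeThenCommit(final Claim claim, final Runnable writer)
    {
        try
        {
            writer.run();
        }
        catch (final RuntimeException | Error e)
        {
            claim.abort();
            throw e;
        }
        claim.commit();
    }
}
```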
uk.co.real_logic.fix_gateway.replication.ClusterReplicationTest > shouldEstablishCluster FAILED
java.lang.AssertionError: 1 and 3 disagree on leader expected:<-1125472483> but was:<0>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:834)
at org.junit.Assert.assertEquals(Assert.java:645)
at uk.co.real_logic.fix_gateway.replication.ClusterReplicationTest.assertAllNodesSeeSameLeader(ClusterReplicationTest.java:263)
at uk.co.real_logic.fix_gateway.replication.ClusterReplicationTest.checkClusterStable(ClusterReplicationTest.java:237)
at uk.co.real_logic.fix_gateway.replication.ClusterReplicationTest.shouldEstablishCluster(ClusterReplicationTest.java:64)
See #8 for details of persistence over gateway restarts.
Write this using something like Aeron's broadcast buffers so you can just keep writing and read off them if you want a debug log.
It would be nice to annotate the generated codec classes with @javax.annotation.Generated. This will help stop IDEs from running code inspections on them.
Compute the delta between the slow consumer's position and the normal position, and offer this as application-level flow control to the Session.send method.
This lets us continue to keep sessions alive when a library goes away.
Just needs a single wiki page, or perhaps an extension of the benchmarking page, explaining the common configuration/tuning options and what their effect is.
We need to agree upon the Parser API style.
Context
The parser is extracting information from a FIX message and needs to transfer that information to an application using the gateway. We'll be calling handlers which the application implements in order to transfer the information.
The question: what API style should the handlers use? There are two proposed options.
1. Generic Callback API
In this proposal there would be a single callback interface. You get notified when the message starts and receive a callback for each field that is parsed. Since FIX can contain repeating groups, there also need to be callbacks at the beginning of a group.
We've put together a simple sample of how an application might use the API. There are comments inline on the file to explain each step. There is also a sample acceptor which shows how you might use the generic callback based API.
Implementations could be registered against specific message types. The onField callback passes in a buffer, offset and length. We would provide a series of Flyweights over these buffers which would offer specific functionality, for example, parsing a date.
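A rough illustration of what this style could look like, not the proposed interface itself: the handler receives the tag plus a buffer, offset and length, and nothing is copied unless the handler chooses to. GenericCallbackSketch, FieldHandler and parse are hypothetical names, a byte array stands in for the buffer type, and the loop assumes a well-formed message of SOH-delimited tag=value pairs.

```java
final class GenericCallbackSketch
{
    // Hypothetical handler: one callback per parsed field.
    interface FieldHandler
    {
        void onField(int tag, byte[] buffer, int offset, int length);
    }

    private static final byte SOH = 1;

    // Walks tag=value pairs and reports each value as a slice of the buffer.
    static void parse(final byte[] buffer, final FieldHandler handler)
    {
        int i = 0;
        while (i < buffer.length)
        {
            int tag = 0;
            while (buffer[i] != '=')
            {
                tag = tag * 10 + (buffer[i] - '0');
                i++;
            }

            final int valueOffset = ++i; // step over '='
            while (i < buffer.length && buffer[i] != SOH)
            {
                i++;
            }

            handler.onField(tag, buffer, valueOffset, i - valueOffset);
            i++; // step over the SOH delimiter
        }
    }
}
```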
Pros
Cons
2. Dictionary Generated API
In this proposal there would be a callback interface with a method for each message type. A Decoder class would be generated for each message. Applications would implement these callback interfaces and consume the decoder objects.
There's also a simple sample of how an application might use the API, which is mostly similar to the previous example. The sample acceptor that shows how you might use this callback-based API is quite a bit different.
In order to allow users to configure the API for only the message types they are interested in, we would generate these interfaces from a dictionary of message types and fields. Users could customise the dictionary in order to ignore specific fields and elide the cost of parsing things like formatted dates. This follows the standard FIX XML-based Dictionary format.
Pros
Cons
Stress testing identifies that the gateway will not accept more than 8190 connections.
We currently have the SBE schemas for the gateway messaging protocol, archive system and raft protocol all in the same message-schema.xml file. They should be decoupled into appropriate single-purpose schemas.
Session acquisition should be about sessions, and should be able to acquire sessions rather than connections. Users of the gateway should know enough information to be able to acquire sessions based upon sender and target comp id.
Connection ids might be removable from some Gateway protocol messages.
Currently it is possible, albeit unlikely, to fail to resend a message if the resending process is far enough behind in indexing. The resender should have the ability to catch up with its indexing or process "in flight" messages for resending.
Remaining: add expected flow of callbacks and messages
It could be: