adacore / recordflux

Formal specification and generation of verifiable binary parsers, message generators and protocol state machines

License: GNU Affero General Public License v3.0
Develop test cases and a parser for the RecordFlux DSL. We start with a Python prototype using pyparsing.
Currently we only support protocol fields whose size is a multiple of one byte. This is not always the case; one example is the IPv4 header.
In the current implementation, the preconditions for accessing a field value (especially the required Buffer'Length) are not explicitly known, as for each variant of a field only its path condition and path facts (the first and last attributes of the field and all previous fields) are captured. With this set of information we do not know which path was taken to reach this variant, only whether all conditions on the path are true. The problem is currently solved by generating preconditions for Buffer'Length based on the Last of all previous fields (whereby subsumed conditions are filtered out).
A better solution could be to add, for each variant (condition-fact tuple), a reference to the previous field and variant. The condition of a specific variant can then be replaced by a predicate, which can be used in the functions for successive fields. Thus each validation or accessor function gets a precondition which also refers to the validation function of the previous field in the specific variant, so all related functions of a path are chained together. This should also make the generated code more readable.
Extend the DSL parser with a SPARK code generator that generates dissector code automatically.
Right now, we have implemented expressions on our own. This has some drawbacks, including that we must build the evaluation logic needed for #22 by hand. We should replace our own expressions with an existing library that provides this functionality.
One library that supports expressions, evaluation and simplification is the official Z3 Python binding.
Most performance-optimized protocol implementations do not use plain buffers, but some form of scatter/gather data structure. Often this is a record with an address and an offset; sometimes these records form a linked data structure using previous/next pointers.
To avoid dramatic performance penalties, we must support scatter/gather data structures in RecordFlux. This implies that we need to change our model from one central buffer that contains the message to a list of pointer/length pairs.
This should be integrated with the work done in #32.
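The pointer/length-pair model could look like the following minimal Python sketch (ScatterBuffer and its methods are hypothetical names for illustration, not RecordFlux API):

```python
class ScatterBuffer:
    """A message viewed as a list of segments instead of one contiguous buffer."""

    def __init__(self, segments):
        # segments: bytes-like objects, the Python analogue of pointer/length pairs
        self.segments = [memoryview(s) for s in segments]

    def __len__(self):
        return sum(len(s) for s in self.segments)

    def read(self, offset, length):
        """Read `length` bytes starting at the logical `offset` across segments."""
        out = bytearray()
        for seg in self.segments:
            if offset >= len(seg):
                offset -= len(seg)  # skip whole segments before the offset
                continue
            take = min(length - len(out), len(seg) - offset)
            out += seg[offset:offset + take]
            offset = 0
            if len(out) == length:
                break
        if len(out) != length:
            raise IndexError("read past end of scattered message")
        return bytes(out)
```

A field access that crosses a segment boundary, e.g. `ScatterBuffer([b"abc", b"defg"]).read(2, 3)`, transparently stitches the segments together, which is the behavior the generated code would have to provide.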
It should be possible to specify an enum type where all possible values of the base type are considered valid. This could be realized by adding an aspect Always_Valid.
type Foo is (Bar => 0, Baz => 1) with Size => 8, Always_Valid;
This is motivated by the extension type field in TLS 1.3 (#34). Extensions with unknown type are valid, but will be ignored.
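The intended semantics can be sketched in Python (illustrative only; names and behavior are assumptions based on the description above): every value of the 8-bit base type is valid, known values map to literals, and unknown values are preserved instead of being rejected.

```python
# Always_Valid sketch for: type Foo is (Bar => 0, Baz => 1) with Size => 8, Always_Valid;
KNOWN = {0: "Bar", 1: "Baz"}

def parse_foo(raw):
    if not 0 <= raw <= 2**8 - 1:   # Size => 8: base type range check only
        raise ValueError("outside base type range")
    return KNOWN.get(raw, raw)     # unknown values stay as raw integers
```

This matches the TLS 1.3 use case: an extension with an unknown type still parses, and the caller can choose to ignore it.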
Create a full specification of our current understanding of the DSL as collected in the various .rflx files in the tests/ subdirectory.
The document should cover
For each construct, define the syntax and static semantics, and give an example.
It should be possible to define the length of the first component of a message. This could be realized by adding a keyword Start, which represents the initial node in our internal model and allows specifying a then-clause for the first component of the message.
type Foo is
   message
      Start
         then Payload
            with Length = Message'Length;
      Payload : Payload_Type;
   end message;
To be consistent, we should also add a keyword End, which represents the final node in our internal model (currently null is used in this case).
type Bar is
   message
      Content : Content_Type
         then End
            if Content = 42;
   end message;
Support enumeration types, including representation clauses.
Manually create Ada/SPARK code that dissects Ethernet frames. The code should validate a sequence and provide preconditions, postconditions and accessor functions/procedures for field elements.
The length of a sequence is often defined by a length field, but some protocols like USB and DNS only specify the number of elements of a sequence. In such a case the length of the sequence field can only be determined statically if the length of the elements is known and homogeneous. Otherwise the length has to be determined dynamically at runtime (requires #32).
An aspect has to be added to be able to represent such protocols.
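The static case reduces to simple arithmetic; a sketch (function name is illustrative) of when a count-based length can be resolved without runtime scanning:

```python
# A sequence length can be computed statically from an element count only
# if the element size is fixed and homogeneous.
def sequence_length_bytes(element_count: int, element_size_bits: int) -> int:
    total_bits = element_count * element_size_bits
    if total_bits % 8 != 0:
        raise ValueError("sequence does not end on a byte boundary")
    return total_bits // 8
```

For heterogeneous or variable-size elements this computation is impossible up front, which is why the dynamic determination (#32) is a prerequisite.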
@yannickmoy brought up the issue that all conditions on edges with the same source node need to be mutually exclusive / non-overlapping to be decidable in the parser. We need to make sure this is the case, either (1) in the RecordFlux tool or (2) by adding a respective assertion to the code. The first option may yield better error messages while the second option may be very easy to implement.
I still cannot wrap my head around how the mutually exclusive / non-overlapping property would be expressed in an assertion. Assume the following 2 conditions:
C1: x < 100
C2: x < 99
We'd need an expression that is false in that case, but true for
C1: X < 100
C2: X >= 100
@treiher, ideas?
Logical expressions should be interpreted in the same way as in Ada. Logical operators (and, or, xor) are on the same precedence level in Ada. "For a sequence of operators of the same precedence level, the operators are associated with their operands in textual order from left to right. Parentheses can be used to impose specific associations." (Ada RM 2012)
In the AST a logical expression should be represented by a binary tree. A logical operator is a node (AND, OR, XOR) with two children, each of which can be another logical operator or a relation. Parsing the logical expressions a OR b AND c and a OR (b AND c) AND d, where a, b, c and d are relations, should result in the ASTs AND(OR(a, b), c) and AND(OR(a, AND(b, c)), d), respectively.
The straightforward solution would be to use a left-recursive grammar. As we are currently using pyparsing, which is a (vaguely) recursive-descent parser, this is not possible, as it would lead to infinite recursion. Although there are general algorithms for removing left recursion from a grammar, they usually change the resulting parse tree and do not seem to respect the required left-to-right precedence.
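One workaround worth considering: pyparsing's infixNotation helper handles operator levels without left recursion and returns a flat, left-to-right group for each level, which can then be folded into the binary tree described above. A sketch (the tuple AST shape is illustrative):

```python
from pyparsing import Word, alphas, infixNotation, opAssoc, oneOf

relation = Word(alphas)
# one precedence level containing all three operators, left-associative
expr = infixNotation(relation, [(oneOf("AND OR XOR"), 2, opAssoc.LEFT)])

def to_tree(parsed):
    """Fold a flat left-associative group [a, op, b, op, c, ...] into a binary tree."""
    if isinstance(parsed, str):
        return parsed
    items = list(parsed)
    node = to_tree(items[0])
    for i in range(1, len(items), 2):
        node = (items[i], node, to_tree(items[i + 1]))
    return node

print(to_tree(expr.parseString("a OR b AND c")[0]))
print(to_tree(expr.parseString("a OR (b AND c) AND d")[0]))
```

The two prints yield ('AND', ('OR', 'a', 'b'), 'c') and ('AND', ('OR', 'a', ('AND', 'b', 'c')), 'd'), matching the required left-to-right association; parentheses are handled by infixNotation automatically.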
In some cases it would be nice to give the same name to a component and its type in a message. Currently there is a name conflict, as the generator creates access functions named after the components. This issue should be solved by adding a prefix to these functions.
For the new graph-based model, implement the code generator (without field refinement for now).
Currently, we need to use a different function to access the first element of an array than the remaining elements. This makes it difficult to use the code sensibly in a loop.
Specify the USB descriptors as defined here
When parsing conditions and location expressions in a message type, we should differentiate between fields used as an index (indicated by a First attribute), as a length (indicated by a Length attribute), or as another value. This information is needed to ensure that lengths can be added to each other and to an index in the generated code. Ada does not allow adding values of different types, so an appropriate conversion is needed.
In some protocols, the length field is encoded using a variable-length encoding scheme. In MQTT, such an encoding is used for the Remaining Length field. There are multiple variants of variable-length encoding of integers.
VLQ is used by:
LEB128 is used by:
Variable Byte Integer (equivalent to ULEB128)
Variable-Length Integer Encoding
Protocol/Service Multiplexer - PSM (2 octets (minimum))
The PSM field is at least two octets in length. All PSM values shall have the
least significant bit of the most significant octet equal to 0 and the least
significant bit of all other octets equal to 1.
Note: This means that all PSMs are odd numbers and that the end of a PSM
can be easily found by searching for the first even octet.
(Bluetooth Core Specification v5.2 - Vol 3, Part A, 4.2)
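The common little-endian base-128 scheme underlying several of these formats (ULEB128, to which MQTT's Variable Byte Integer is equivalent) can be decoded as in this sketch: seven value bits per byte, with the high bit set on every byte except the last.

```python
def decode_uleb128(data: bytes):
    """Decode an unsigned LEB128 integer; return (value, bytes consumed)."""
    result = 0
    for i, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * i)   # low 7 bits, little-endian groups
        if byte & 0x80 == 0:                 # continuation bit clear: last byte
            return result, i + 1
    raise ValueError("truncated variable-length integer")

print(decode_uleb128(bytes([0x7F])))        # (127, 1)
print(decode_uleb128(bytes([0x80, 0x01])))  # (128, 2)
```

Note the contrast with the PSM scheme quoted above: LEB128 puts its terminator flag in the high bit of each byte, while PSM encodes it in the low bit, so a RecordFlux concept for variable-length fields would need to parameterize where the continuation marker lives.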
The wiki is rather inconsistent at the moment. Consolidate the information into a README file in the root folder of the project. Add the specification to the repository and link it from the README. Clean up the remaining wiki pages (or disable the wiki).
We start with a first version of the DSL describing only the PDU layout. For the syntax of the DSL we follow Ada as closely as feasible. As the main data type for PDUs, a modification of variant records may be used. To capture dependencies between packet fields, we may use something like type invariants.
When creating the model for a message specification it should be ensured that conditions refer only to previous message fields or known enumeration literals.
Add support for non-modular integers. The bit-length of a new integer type has to be defined.
There are multiple scenarios in which the generator creates invalid SPARK code and which can be fixed by adapting the specification. It would be good to detect such cases before generating the code. This would need a thorough verification of the location and condition expressions. A prerequisite for this is a more powerful system to evaluate expressions (#29).
Properties to validate by checks: Message'Last and constants... By proof: Message'First.
Restructure the repository and add the necessary files to create a Python package.
Create an example that requires field refinement (e.g. Ethernet with VLAN support) and prototype the SPARK code required to parse such Ethernet frames.
Define a concept to access fields according to the valid refinement alternative and generate code to handle packets.
Support "with"-ing other RecordFlux packages. For the time being, we do not implement a search for files, but all potential input files must be present on the command line.
Document the "with" construct in the specification.
The Language Reference is very lean, which is very favorable for getting into the DSL quickly.
Some parts, though, contain too little information to really grasp all the necessary details:
The components section is not further specified; only the second sentence of the introduction mentions conditions and dependencies. As this is a crucial (and unusual/non-trivial) concept of the language, we should go into more detail here.
Do we want to replace any kind of component in a type refinement? Or is this possible only for byte arrays (of a specific type)? We should clarify that.
Also, the section on type refinement does not mention the condition that can be applied to a refinement. This is crucial for real-world protocols.
From that paragraph it's unclear where the with-clauses belong. Clarify where those are valid (probably only at the beginning of a package?)
We should specify how the package name corresponds to the file name. Also, we should specify somewhere how package_name, name and refined_type_name are structured.
At the beginning of the spec we should (very briefly!) describe how we specify our syntax. E.g. terminals in bold.
We should decide how to handle the non-keyword/terminal cases. For a real grammar, every non-terminal should map to some production. Maybe we strive for completeness here, but put the obvious details (e.g. that first is some numeric type) into an appendix?
I find the use of "usually" a bit too casual for a specification document. Where applicable (e.g. in the static semantics of the packages) use "by convention". Replace other occurrences by something more definite.
The current syntax of the location expressions is quite confusing, since it seems that arbitrary expressions are allowed while this is not the case. It is particularly confusing when a length expression refers to a length field, as in this example:
type Tag_Type is (Msg_Data => 1, Msg_Error => 3) with Size => 2;

type Message is
   message
      Length : Length_Type
         then Value
            with Length = Length * 8;
      Value : Payload_Type;
   end message;
I propose to adapt the syntax from with Length = Length * 8 to with Length => Length * 8. This would also match the syntax we use to specify the type size (with Size => 2). To separate a location and a length expression, we should also use a comma instead of "and": with First => Value'First, Length => Length * 8.
The marketing department proposes the extension .rflx for RecordFlux files ;-)
So far, we have focused on packet parsing. The other direction, the generation of packets, is equally important. Design the packet generation logic of RecordFlux.
Ensure that no reserved keywords of Ada/SPARK and our DSL are used in the specification file, which would corrupt the generated code.
Currently, conditions for protocol fields are checked multiple times. The conditions of all variants of a field are checked in the validation function to decide whether any variant is valid. If a valid variant was found and the corresponding accessor function is used, the conditions of the variants are checked again to find the currently valid variant and determine the location of the field. We should try to prevent this double evaluation.
Once we support refinement (#12), we need to test it using a more complex protocol. In this ticket we handle the parsing of UDP/IP/Ethernet (generation is a future topic).
Enumerations are great to specify countable types:
type Packet_Kind is (Msg_Error, Msg_Data);
To specify the value of each enumeration element, Ada supports an enumeration representation clause:
for Packet_Kind use (Msg_Error => 1, Msg_Data => 2);
The RecordFlux grammar does not support this right now.
The focus of the current implementation of the code generator is the TLV scheme: The offset and length of all fields are either defined statically or by the value of another field. Although many protocols follow this scheme, this assumption is not always true. Imagine two consecutive arrays for which no length is specified and which contain elements of heterogeneous size or an unknown number of elements. The end of such an array is usually represented by an end-postfix. In such a case the offset of the second array can only be determined by reading all elements of the first one.
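Finding the offset of the second array in such a case amounts to walking the first one. A sketch, with a hypothetical element layout (tag byte, length byte, value bytes, and tag 0 as the end postfix; none of this is a real RecordFlux format):

```python
def skip_array(data: bytes, start: int) -> int:
    """Return the offset just past the array's end postfix."""
    i = start
    while i < len(data):
        if data[i] == 0:          # end postfix reached
            return i + 1
        length = data[i + 1]      # per-element length byte
        i += 2 + length           # tag + length byte + value bytes
    raise ValueError("missing end postfix")
```

The point of the sketch is that the loop's trip count depends on the data itself, so no static offset expression of the current TLV-style model can describe the second array's position.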
Right now, we use a custom buffer type in the generated code. While this is no problem in applications that use our type exclusively, the more common case is that an application defines its own buffer type. Then the user must convert types using unchecked conversions. This is problematic for out parameters, which either need to be stored somewhere or require address overlays, which in turn are problematic for flow analysis.
To solve this issue, the buffer type of the generated code needs to become generic.
Add support for arrays with an integer or enumeration as element type.
It should be possible to use expressions in type declarations, e.g.:
type UINT16 is range 0 .. 2**16 - 1 with Size => 16;
type UINT48 is mod 2**48;
Special handling is needed for integers with an arbitrary range (e.g., type Foo is range 32 .. 64 with Size => 8;). We could realize that as proposed in #16:
Maybe we should realize that (in the code) using a subtype of some base type that covers the whole possible range of the bits the type is represented by. That will help with the proofs, as we can represent the (potentially out-of-range) value from the packet as a variable of the base type, perform range checks and, if successful, represent it as the subtype.
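The proposed scheme can be illustrated in Python (names are illustrative; the real implementation would use an Ada base type and subtype): read the raw value as the full 8-bit base type, range-check it, and only then treat it as the constrained type.

```python
# For: type Foo is range 32 .. 64 with Size => 8;
BASE_FIRST, BASE_LAST = 0, 2**8 - 1   # base type covering all 8 bits
FOO_FIRST, FOO_LAST = 32, 64          # the declared subtype range

def valid_foo(raw: int) -> bool:
    # raw is always representable in the base type, even if out of Foo's range
    assert BASE_FIRST <= raw <= BASE_LAST, "not an 8-bit value"
    return FOO_FIRST <= raw <= FOO_LAST

def get_foo(raw: int) -> int:
    if not valid_foo(raw):
        raise ValueError("Foo out of range")
    return raw                         # safe to treat as the subtype now
```

Separating the always-valid base-type read from the explicit range check mirrors the proof structure: the prover never sees an out-of-range value flowing into the subtype.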
Extend the RecordFlux model by a concept to express automata, including states, transitions, conditions and actions. Dynamic properties should be expressible by some temporal logic.
As an intermediate step, while no session syntax is available, we use the YAML-based representation introduced for GreenTLS. To develop and validate those models, the CLI is extended to parse the YAML session files.
To make the generated code easier to use, we should generate a function that converts an enumeration type to a corresponding integer type.
We have to consider:
Constrained array types
type Index_Type is range 1 .. 1000;
type Natural_Array is array (Index_Type) of Natural;
Unconstrained array types
type String is array (Positive range <>) of Character;
Constrained arrays are the easier case, as their size is fixed. For unconstrained arrays, however, we have to ensure that the length of a field with such a type can be inferred or is defined by a condition.
To convert the corresponding slice of the input byte array to the targeted array type, we may have to use an Unchecked_Conversion.
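The Python analogue of that slice-to-array conversion is a struct unpack; this sketch converts a slice of the input bytes into a fixed-count array of 16-bit values (big-endian is assumed here purely for illustration):

```python
import struct

def to_u16_array(data: bytes, offset: int, count: int) -> tuple:
    """Reinterpret `count` big-endian 16-bit values starting at `offset`."""
    return struct.unpack_from(f">{count}H", data, offset)

print(to_u16_array(b"\x00\x01\x00\x02", 0, 2))  # (1, 2)
```

As with Unchecked_Conversion, the caller must have already established that the slice is long enough; struct.unpack_from raises an error otherwise, whereas in SPARK that obligation would be a precondition.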
Design and implement algorithm to generate preconditions for field validator and accessor functions from record type invariant.