adacore / recordflux

Formal specification and generation of verifiable binary parsers, message generators and protocol state machines

License: GNU Affero General Public License v3.0
Develop test cases and a parser for the RecordFlux DSL. We start with a Python prototype using pyparsing.
Currently we only support protocol fields whose size is a multiple of one byte. This is not always the case; one example is the IPv4 header.
In the current implementation, the preconditions for accessing a field value (especially the required Buffer'Length) are not explicitly known, as for each variant of a field only its path condition and path facts (the first and last attributes of the field and all previous fields) are captured. With this set of information we do not know which path was taken to reach this variant, only whether all conditions on the path are true. The problem is currently solved by generating preconditions for Buffer'Length based on the Last of all previous fields (whereby subsumed conditions are filtered out).
A better solution could be to add, for each variant (condition-fact tuple), a reference to the previous field and variant. The condition of a specific variant can then be replaced by a predicate, which can be used in the functions for successive fields. Thus each validation or accessor function gets a precondition which also refers to the validation function of the previous field in the specific variant, so all related functions of a path are chained together. This should also make the generated code more readable.
Extend the DSL parser with a SPARK code generator that generates dissector code automatically.
Right now, we have implemented expressions on our own. This has some drawbacks, including that we must build the evaluation logic needed for #22 by hand. We should replace our own expressions with an existing library that provides this functionality.
One library that supports expressions, evaluation and simplification is the official Z3 Python binding.
Most performance-optimized protocol implementations do not use plain buffers, but some form of scatter/gather data structure. Often this is a record with an address and an offset; sometimes these records form a linked data structure using previous/next pointers.
To avoid dramatic performance penalties, we must support scatter/gather data structures in RecordFlux. This implies that we need to change our model from one central buffer that contains the message to a list of pointer/length pairs.
This should be integrated with the work done in #32.
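The pointer/length-pair model could look like the following minimal Python sketch (ScatterBuffer and its methods are hypothetical names for illustration, not RecordFlux API):

```python
class ScatterBuffer:
    """A message viewed as a list of segments instead of one contiguous buffer."""

    def __init__(self, segments):
        # segments: bytes-like objects, the Python analogue of pointer/length pairs
        self.segments = [memoryview(s) for s in segments]

    def __len__(self):
        return sum(len(s) for s in self.segments)

    def read(self, offset, length):
        """Read `length` bytes starting at the logical `offset` across segments."""
        out = bytearray()
        for seg in self.segments:
            if offset >= len(seg):
                offset -= len(seg)  # skip whole segments before the offset
                continue
            take = min(length - len(out), len(seg) - offset)
            out += seg[offset:offset + take]
            offset = 0
            if len(out) == length:
                break
        if len(out) != length:
            raise IndexError("read past end of scattered message")
        return bytes(out)
```

A field access that crosses a segment boundary, e.g. `ScatterBuffer([b"abc", b"defg"]).read(2, 3)`, transparently stitches the segments together, which is the behavior the generated code would have to provide.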
It should be possible to specify an enum type where all possible values of the base type are considered valid. This could be realized by adding an aspect Always_Valid.
type Foo is (Bar => 0, Baz => 1) with Size => 8, Always_Valid;
This is motivated by the extension type field in TLS 1.3 (#34). Extensions with unknown type are valid, but will be ignored.
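The intended semantics can be sketched in Python (illustrative only; names and behavior are assumptions based on the description above): every value of the 8-bit base type is valid, known values map to literals, and unknown values are preserved instead of being rejected.

```python
# Always_Valid sketch for: type Foo is (Bar => 0, Baz => 1) with Size => 8, Always_Valid;
KNOWN = {0: "Bar", 1: "Baz"}

def parse_foo(raw):
    if not 0 <= raw <= 2**8 - 1:   # Size => 8: base type range check only
        raise ValueError("outside base type range")
    return KNOWN.get(raw, raw)     # unknown values stay as raw integers
```

This matches the TLS 1.3 use case: an extension with an unknown type still parses, and the caller can choose to ignore it.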
Create a full specification of our current understanding of the DSL as collected in the various .rflx files in the tests/ subdirectory.
The document should cover
For each construct, define the syntax and static semantics, and give an example.
It should be possible to define the length of the first component of a message. This could be realized by adding a keyword Start, which represents the initial node in our internal model and allows specifying a then-clause for the first component of the message.
type Foo is
   message
      Start
         then Payload
            with Length = Message'Length;
      Payload : Payload_Type;
   end message;
To be consistent, we should also add a keyword End, which represents the final node in our internal model (currently null is used in this case).
type Bar is
   message
      Content : Content_Type
         then End
            if Content = 42;
   end message;
Support enumeration types, including representation clauses.
Manually create Ada/SPARK code that dissects Ethernet frames. The code should validate a sequence and provide preconditions, postconditions and accessor functions/procedures for field elements.
The length of a sequence is often defined by a length field, but some protocols like USB and DNS only specify the number of elements of a sequence. In such a case the length of the sequence field can only be determined statically if the length of the elements is known and homogeneous. Otherwise the length has to be determined dynamically at runtime (requires #32).
An aspect has to be added to be able to represent such protocols.
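The static case reduces to simple arithmetic; a sketch (function name is illustrative) of when a count-based length can be resolved without runtime scanning:

```python
# A sequence length can be computed statically from an element count only
# if the element size is fixed and homogeneous.
def sequence_length_bytes(element_count: int, element_size_bits: int) -> int:
    total_bits = element_count * element_size_bits
    if total_bits % 8 != 0:
        raise ValueError("sequence does not end on a byte boundary")
    return total_bits // 8
```

For heterogeneous or variable-size elements this computation is impossible up front, which is why the dynamic determination (#32) is a prerequisite.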
@yannickmoy brought up the issue that all conditions on edges with the same source node need to be mutually exclusive / non-overlapping to be decidable in the parser. We need to make sure this is the case, either (1) in the RecordFlux tool or (2) by adding a respective assertion to the code. The first option may yield better error messages while the second option may be very easy to implement.
I still cannot wrap my head around how the mutually exclusive / non-overlapping property would be expressed in an assertion. Assume the following 2 conditions:
C1: x < 100
C2: x < 99
We'd need an expression that is false in that case, but true for
C1: X < 100
C2: X >= 100
@treiher, ideas?
Logical expressions should be interpreted in the same way as in Ada. Logical operators (and, or, xor) are on the same precedence level in Ada. "For a sequence of operators of the same precedence level, the operators are associated with their operands in textual order from left to right. Parentheses can be used to impose specific associations." (Ada RM 2012)
In the AST a logical expression should be represented by a binary tree. A logical operator is a node (AND, OR, XOR) with two children, each of which can be another logical operator or a relation. Parsing the logical expressions a OR b AND c and a OR (b AND c) AND d, where a, b, c and d are relations, should result in the ASTs AND(OR(a, b), c) and AND(OR(a, AND(b, c)), d), respectively.
The straightforward solution would be to use a left-recursive grammar. As we are currently using pyparsing, which is a (vaguely) recursive-descent parser, this is not possible, as it would lead to infinite recursion. Although there are general algorithms for removing left recursion from a grammar, they usually change the resulting parse tree and do not seem to respect the required left-to-right precedence.
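One workaround worth considering: pyparsing's infixNotation helper handles operator levels without left recursion and returns a flat, left-to-right group for each level, which can then be folded into the binary tree described above. A sketch (the tuple AST shape is illustrative):

```python
from pyparsing import Word, alphas, infixNotation, opAssoc, oneOf

relation = Word(alphas)
# one precedence level containing all three operators, left-associative
expr = infixNotation(relation, [(oneOf("AND OR XOR"), 2, opAssoc.LEFT)])

def to_tree(parsed):
    """Fold a flat left-associative group [a, op, b, op, c, ...] into a binary tree."""
    if isinstance(parsed, str):
        return parsed
    items = list(parsed)
    node = to_tree(items[0])
    for i in range(1, len(items), 2):
        node = (items[i], node, to_tree(items[i + 1]))
    return node

print(to_tree(expr.parseString("a OR b AND c")[0]))
print(to_tree(expr.parseString("a OR (b AND c) AND d")[0]))
```

The two prints yield ('AND', ('OR', 'a', 'b'), 'c') and ('AND', ('OR', 'a', ('AND', 'b', 'c')), 'd'), matching the required left-to-right association; parentheses are handled by infixNotation automatically.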
In some cases it would be nice to give the same name to a component and its type in a message. Currently there is a name conflict, as the generator creates access functions named after the components. This issue should be solved by adding a prefix to these functions.
For the new graph-based model, implement the code generator (without field refinement for now).
Currently, we need to use a different function to access the first element of an array than the remaining elements. This makes it difficult to use the code sensibly in a loop.
Specify the USB descriptors as defined here
When parsing conditions and location expressions in a message type, we should differentiate between fields used as an index (indicated by a First attribute), as a length (indicated by a Length attribute), or as another value. This information is needed to ensure that lengths can be added to each other and to an index in the generated code. Ada does not allow adding values of different types, so an appropriate conversion is needed.
In some protocols, the length field is encoded using a variable-length encoding scheme. In MQTT, such an encoding is used for the Remaining Length field. There are multiple variants of variable-length encoding of integers.
VLQ is used by:
LEB128 is used by:
Variable Byte Integer (equivalent to ULEB128)
Variable-Length Integer Encoding
Protocol/Service Multiplexer - PSM (2 octets (minimum))
The PSM field is at least two octets in length. All PSM values shall have the
least significant bit of the most significant octet equal to 0 and the least
significant bit of all other octets equal to 1.
Note: This means that all PSMs are odd numbers and that the end of a PSM
can be easily found by searching for the first even octet.
(Bluetooth Core Specification v5.2 - Vol 3, Part A, 4.2)
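The common little-endian base-128 scheme underlying several of these formats (ULEB128, to which MQTT's Variable Byte Integer is equivalent) can be decoded as in this sketch: seven value bits per byte, with the high bit set on every byte except the last.

```python
def decode_uleb128(data: bytes):
    """Decode an unsigned LEB128 integer; return (value, bytes consumed)."""
    result = 0
    for i, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * i)   # low 7 bits, little-endian groups
        if byte & 0x80 == 0:                 # continuation bit clear: last byte
            return result, i + 1
    raise ValueError("truncated variable-length integer")

print(decode_uleb128(bytes([0x7F])))        # (127, 1)
print(decode_uleb128(bytes([0x80, 0x01])))  # (128, 2)
```

Note the contrast with the PSM scheme quoted above: LEB128 puts its terminator flag in the high bit of each byte, while PSM encodes it in the low bit, so a RecordFlux concept for variable-length fields would need to parameterize where the continuation marker lives.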
The wiki is rather inconsistent at the moment. Consolidate the information into a README file in the root folder of the project. Add the specification to the repository and link it from the README. Clean up the remaining wiki pages (or disable the wiki).
We start with a first version of the DSL describing only the PDU layout. For the syntax of the DSL we follow Ada as closely as feasible. As the main data type for PDUs, a modification of variant records may be used. To capture dependencies between packet fields, we may use something like type invariants.
When creating the model for a message specification it should be ensured that conditions refer only to previous message fields or known enumeration literals.
Add support for non-modular integers. The bit-length of a new integer type has to be defined.
There are multiple scenarios in which the generator creates invalid SPARK code and which can be fixed by adapting the specification. It would be good to detect such cases before generating the code. This would need a thorough verification of the location and condition expressions. A prerequisite for this is a more powerful system to evaluate expressions (#29).
Properties to validate by checks: Message'Last and constants... By proof: Message'First.
Restructure the repository and add the necessary files to create a Python package.
Create an example that requires field refinement (e.g. Ethernet with VLAN support) and prototype the SPARK code required to parse such Ethernet frames.
Define a concept to access fields according to the valid refinement alternative and generate code to handle packets.
Support "with"-ing other RecordFlux packages. For the time being, we do not implement a search for files, but all potential input files must be present on the command line.
Document the "with" construct in the specification.
The Language Reference is very lean, which is very favorable for getting into the DSL quickly.
Some parts, though, contain too little information to really grasp all the necessary details:
The components section is not further specified; only the second sentence of the introduction mentions conditions and dependencies. As this is a crucial (and unusual/non-trivial) concept of the language, we should go into more detail here.
Do we want to replace any kind of component in a type refinement? Or is this possible only for byte arrays (of a specific type)? We should clarify that.
Also, the section on type refinement does not mention the condition that can be applied to a refinement. This is crucial for real-world protocols.
From that paragraph it's unclear where the with-clauses belong. Clarify where those are valid (probably only at the beginning of a package?)
We should specify how the package name corresponds to the file name. Also, we should specify somewhere how package_name, name and refined_type_name are structured.
At the beginning of the spec we should (very briefly!) describe how we specify our syntax. E.g. terminals in bold.
We should decide how to handle the non-keyword/terminal cases. For a real grammar, every non-terminal should map to some production. Maybe we strive for completeness here, but put the obvious details (e.g. that first is some numeric type) into an appendix?
I find the use of "usually" a bit too casual for a specification document. Where applicable (e.g. in the static semantics of the packages) use "by convention". Replace other occurrences by something more definite.
The current syntax of the location expressions is quite confusing, since it seems that arbitrary expressions are allowed while this is not the case. It is particularly confusing when a length expression refers to a length field, as in this example:
type Tag_Type is (Msg_Data => 1, Msg_Error => 3) with Size => 2;

type Message is
   message
      Length : Length_Type
         then Value
            with Length = Length * 8;
      Value : Payload_Type;
   end message;
I propose to adapt the syntax from with Length = Length * 8 to with Length => Length * 8. This would also match the syntax we use to specify the type size (with Size => 2). To separate a location and a length expression, we should also use a comma instead of "and": with First => Value'First, Length => Length * 8.
The marketing department proposes the extension .rflx for RecordFlux files ;-)
So far, we have focused on packet parsing. The other direction, the generation of packets, is equally important. Design the packet generation logic of RecordFlux.
Ensure that no reserved keywords of Ada/SPARK and our DSL are used in the specification file, which would corrupt the generated code.
Currently, conditions for protocol fields are checked multiple times. The conditions of all variants of a field are checked in the validation function to decide whether any variant is valid. If a valid variant was found and the corresponding accessor function is used, the conditions of the variants are checked again to find the currently valid variant and determine the location of the field. We should try to prevent this double evaluation.
Once we support refinement (#12), we need to test it using a more complex protocol. In this ticket we handle the parsing of UDP/IP/Ethernet (generation is a future topic).
Enumerations are great to specify countable types:
type Packet_Kind is (Msg_Error, Msg_Data);
To specify the value of each enumeration element, Ada supports an enumeration representation clause:
for Packet_Kind use (Msg_Error => 1, Msg_Data => 2);
The RecordFlux grammar does not support this right now.
The focus of the current implementation of the code generator is the TLV scheme: The offset and length of all fields are either defined statically or by the value of another field. Although many protocols follow this scheme, this assumption is not always true. Imagine two consecutive arrays for which no length is specified and which contain elements of heterogeneous size or an unknown number of elements. The end of such an array is usually represented by an end-postfix. In such a case the offset of the second array can only be determined by reading all elements of the first one.
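Finding the offset of the second array in such a case amounts to walking the first one. A sketch, with a hypothetical element layout (tag byte, length byte, value bytes, and tag 0 as the end postfix; none of this is a real RecordFlux format):

```python
def skip_array(data: bytes, start: int) -> int:
    """Return the offset just past the array's end postfix."""
    i = start
    while i < len(data):
        if data[i] == 0:          # end postfix reached
            return i + 1
        length = data[i + 1]      # per-element length byte
        i += 2 + length           # tag + length byte + value bytes
    raise ValueError("missing end postfix")
```

The point of the sketch is that the loop's trip count depends on the data itself, so no static offset expression of the current TLV-style model can describe the second array's position.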
Right now, we use a custom buffer type in the generated code. While this is no problem in applications that use our type exclusively, the more common case is that an application defines its own buffer type. Then the user must convert types using unchecked conversions. This is problematic for out parameters, which either need to be stored somewhere or require address overlays, which in turn are problematic for flow analysis.
To solve this issue, the buffer type of the generated code needs to become generic.
Add support for arrays with an integer or enumeration as element type.
It should be possible to use expressions in type declarations, e.g.:
type UINT16 is range 0 .. 2**16 - 1 with Size => 16;
type UINT48 is mod 2**48;
Special handling is needed for integers with an arbitrary range (e.g., type Foo is range 32 .. 64 with Size => 8;). We could realize that as proposed in #16:
Maybe we should realize that (in the code) using a subtype of some base type that covers the whole possible range of the bits the type is represented by. That will help with the proofs, as we can represent the (potentially out-of-range) value from the packet as a variable of the base type, perform range checks and, if successful, represent it as the subtype.
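The proposed scheme can be illustrated in Python (names are illustrative; the real implementation would use an Ada base type and subtype): read the raw value as the full 8-bit base type, range-check it, and only then treat it as the constrained type.

```python
# For: type Foo is range 32 .. 64 with Size => 8;
BASE_FIRST, BASE_LAST = 0, 2**8 - 1   # base type covering all 8 bits
FOO_FIRST, FOO_LAST = 32, 64          # the declared subtype range

def valid_foo(raw: int) -> bool:
    # raw is always representable in the base type, even if out of Foo's range
    assert BASE_FIRST <= raw <= BASE_LAST, "not an 8-bit value"
    return FOO_FIRST <= raw <= FOO_LAST

def get_foo(raw: int) -> int:
    if not valid_foo(raw):
        raise ValueError("Foo out of range")
    return raw                         # safe to treat as the subtype now
```

Separating the always-valid base-type read from the explicit range check mirrors the proof structure: the prover never sees an out-of-range value flowing into the subtype.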
Extend the RecordFlux model by a concept to express automata, including states, transitions, conditions and actions. Dynamic properties should be expressible by some temporal logic.
As an intermediate step, while no session syntax is available, we use the YAML-based representation introduced for GreenTLS. To develop and validate those models, the CLI is extended to parse the YAML session files.
To make the generated code easier to use, we should generate a function that converts an enumeration type to a corresponding integer type.
We have to consider:
Constrained array types
type Index_Type is range 1 .. 1000;
type Natural_Array is array (Index_Type) of Natural;
Unconstrained array types
type String is array (Positive range <>) of Character;
Constrained arrays are the easier case, as their size is fixed. For unconstrained arrays, however, we have to ensure that the length of a field with such a type can be inferred or is defined by a condition.
To convert the corresponding slice of the input byte array to the targeted array type, we may have to use an Unchecked_Conversion.
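The Python analogue of that slice-to-array conversion is a struct unpack; this sketch converts a slice of the input bytes into a fixed-count array of 16-bit values (big-endian is assumed here purely for illustration):

```python
import struct

def to_u16_array(data: bytes, offset: int, count: int) -> tuple:
    """Reinterpret `count` big-endian 16-bit values starting at `offset`."""
    return struct.unpack_from(f">{count}H", data, offset)

print(to_u16_array(b"\x00\x01\x00\x02", 0, 2))  # (1, 2)
```

As with Unchecked_Conversion, the caller must have already established that the slice is long enough; struct.unpack_from raises an error otherwise, whereas in SPARK that obligation would be a precondition.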
Design and implement algorithm to generate preconditions for field validator and accessor functions from record type invariant.