adacore / recordflux

Formal specification and generation of verifiable binary parsers, message generators and protocol state machines

License: GNU Affero General Public License v3.0

Python 47.08% Ada 51.90% Makefile 0.34% Shell 0.01% JavaScript 0.01% TypeScript 0.67%
parser binary-parser ada spark formal-verification formal-specification formal-methods communication-protocol protocol-parser protocol-specification


recordflux's Issues

Generate graph from message specification

  • Generate networkx graph from Message
  • Extend CLI to generate all possible message graphs (specify output directory and format)
  • Style graph (boxes, orthogonal layout, colors, special layout for INITIAL/FINAL)
  • Add method to output graph (DOT, PNG, SVG, depending on output file extension)
  • Retain hexadecimal constants in expression str() method
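A minimal sketch of how this could look using networkx and pydot, assuming a hypothetical Message model with nodes and edges attributes; the styling values are illustrative only:

   import networkx as nx

   def message_graph(message) -> nx.DiGraph:
       graph = nx.DiGraph()
       for node in message.nodes:
           # special layout for the INITIAL/FINAL nodes, boxes otherwise
           shape = "circle" if node in ("INITIAL", "FINAL") else "box"
           graph.add_node(node, shape=shape)
       for source, target, condition in message.edges:
           graph.add_edge(source, target, label=condition)
       return graph

   def write_graph(graph: nx.DiGraph, filename: str) -> None:
       # output format depending on the file extension, as proposed above
       if filename.endswith(".dot"):
           nx.nx_pydot.write_dot(graph, filename)
       elif filename.endswith(".png"):
           nx.nx_pydot.to_pydot(graph).write_png(filename)
       else:
           nx.nx_pydot.to_pydot(graph).write_svg(filename)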

Revise generation of verification conditions

In the current implementation, the explicit preconditions for accessing a field value (especially the necessary Buffer'Length) are not explicitly known, as for each variant of a field only its path condition and path facts (first and last attributes for the field and all previous fields) are captured. With this set of information we do not know which path was taken to get to this variant, but only whether all conditions on the path are true. The problem is currently solved by generating preconditions for Buffer'Length based on Last of all previous fields (whereby subsumed conditions are filtered out).

A better solution to this problem could be to add, for each variant (condition-fact tuple), a reference to the previous field and variant. The condition of a specific variant can then be replaced by a predicate, which can be used in the functions for successive fields. Thus each validation or accessor function gets a precondition which also refers to the validation function of the previous field in the specific variant, and so all related functions of a path are chained together. This should also make the generated code more readable.
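A rough sketch of the proposed model change in Python; the Variant class and its attribute names are made up for illustration:

   from dataclasses import dataclass
   from typing import Optional

   @dataclass
   class Variant:
       field: str
       condition: str                        # path condition of this variant
       facts: dict                           # first/last attributes on this path
       previous: Optional["Variant"] = None  # reference to the previous field's variant

   def path_predicate(variant: Variant) -> str:
       # conjunction of all conditions along the path, obtained by following
       # the back references instead of re-enumerating all paths up front
       if variant.previous is None:
           return variant.condition
       return f"({path_predicate(variant.previous)}) and ({variant.condition})"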

Replace expressions by library

Right now, expressions are implemented by hand. This has some drawbacks, including that we must build the evaluation logic needed for #22 ourselves. We should replace our own expression implementation with an existing library that provides that functionality.

One library that supports expressions, evaluation and simplification is the official z3 Python binding.
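For example, a minimal sketch of what the binding provides (the symbol names are chosen for illustration):

   from z3 import And, Int, simplify

   first = Int("First")
   length = Int("Length")
   last = first + length - 1                     # expressions via operator overloading

   print(simplify(last - first + 1))             # Length
   print(simplify(And(length > 0, length > 0)))  # Length > 0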

Support scatter/gather-style buffers

Most performance-optimized protocol implementations do not use plain buffers, but some form of scatter/gather data structure. Often this is a record with an address and an offset. Sometimes those records form a linked data structure using previous/next pointers.

To avoid dramatic performance penalties, we must support scatter/gather data structures in RecordFlux. This implies that we need to change our model from one central buffer that contains the message to a list of pointer/length pairs.

This should be integrated with the work done in #32.
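A minimal sketch of the pointer/length-pair model, assuming flat (buffer, offset, length) segments rather than a real linked structure:

   from dataclasses import dataclass

   @dataclass
   class Segment:
       data: bytes
       offset: int
       length: int

   def read(segments: list, position: int, count: int) -> bytes:
       """Read count bytes at the logical position of a scattered message."""
       out = bytearray()
       for seg in segments:
           if position >= seg.length:
               position -= seg.length  # requested range starts in a later segment
               continue
           chunk = seg.data[seg.offset + position:seg.offset + seg.length]
           out += chunk[:count - len(out)]
           position = 0
           if len(out) == count:
               break
       return bytes(out)

   # logical message "abc" + "hij", scattered over two buffers
   segments = [Segment(b"abcdef", 0, 3), Segment(b"ghij", 1, 3)]
   assert read(segments, 2, 3) == b"chi"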

Allow enum types to always be considered valid

It should be possible to specify an enum type where all possible values of the base type are considered valid. This could be realized by adding an aspect Always_Valid.

type Foo is (Bar => 0, Baz => 1) with Size => 8, Always_Valid;

This is motivated by the extension type field in TLS 1.3 (#34). Extensions with unknown type are valid, but will be ignored.
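A sketch of the intended validation semantics (how this maps to the generated code is left open here):

   KNOWN_FOO = {0: "Bar", 1: "Baz"}

   def valid_foo(raw: int, always_valid: bool = True) -> bool:
       if not 0 <= raw <= 255:          # outside the 8-bit base type
           return False
       # with Always_Valid, every value of the base type is accepted,
       # even if it does not correspond to a declared literal
       return always_valid or raw in KNOWN_FOO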

Write specification for current RecordFlux DSL

Create a full specification of our current understanding of the DSL as collected in the various .rflx files in the tests/ subdirectory.

The document should cover

  • File structure and file inclusion
  • Declaration of primitive types (numeric, enumeration)
  • Complex (record) type
  • Type invariants
  • Type refinement

For each construct, define the syntax and static semantics, and give an example.

Allow to specify the length of the first component of a message

It should be possible to define the length of the first component of a message. This could be realized by adding a keyword Start which represents the initial node in our internal model and allows specifying a then-clause for the first component of the message.

   type Foo is
      message
         Start
            then Payload
               with Length = Message'Length;
         Payload : Payload_Type;
      end message;

To be consistent, we should also add a keyword End which represents the final node in our internal model (currently null is used in this case).

   type Bar is
      message
         Content : Content_Type
            then End
               if Content = 42;
      end message;

Create dissector code for Ethernet

Manually create Ada/SPARK code that dissects Ethernet frames. The code should validate a sequence and provide preconditions, postconditions and accessor functions/procedures for field elements.
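As a rough Python model of the checks such a dissector has to perform (field offsets per IEEE 802.3; FCS handling omitted, the frame size bounds assume no FCS in the buffer):

   def valid_ethernet_frame(frame: bytes) -> bool:
       if not 60 <= len(frame) <= 1514:           # frame size without the 4-byte FCS
           return False
       type_or_length = int.from_bytes(frame[12:14], "big")
       if type_or_length <= 1500:                 # IEEE 802.3: length field
           return len(frame) - 14 >= type_or_length
       return type_or_length >= 1536              # EtherType; values in between are undefined

   def destination(frame: bytes) -> bytes:
       # accessor with the implicit precondition valid_ethernet_frame (frame)
       return frame[0:6]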

Specification of sequence length in number of elements

The length of a sequence is often defined by a length field. But some protocols like USB and DNS only specify the number of elements of a sequence. In such a case the length of the sequence field can only be determined statically if the length of the elements is known and homogeneous. Otherwise the length has to be determined dynamically at runtime (requires #32).

An aspect has to be added to be able to represent these protocols.
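For the statically determinable case, a sketch assuming a known, homogeneous element size (here 16-bit big-endian values, as in the DNS header counts):

   def parse_counted_sequence(data: bytes, offset: int, count: int,
                              element_size: int = 2) -> list:
       # length of the sequence field = count * element_size
       end = offset + count * element_size
       if end > len(data):
           raise ValueError("buffer too short for declared element count")
       return [int.from_bytes(data[i:i + element_size], "big")
               for i in range(offset, end, element_size)]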

Make conditions mutually exclusive

@yannickmoy brought up the issue that all conditions on edges with the same source node need to be mutually exclusive / non-overlapping to be decidable in the parser. We need to make sure this is the case, either (1) in the RecordFlux tool or (2) by adding a respective assertion to the code. The first option may yield better error messages while the second option may be very easy to implement.

I still cannot wrap my head around how the mutually exclusive / non-overlapping property would be expressed in an assertion. Assume the following 2 conditions:

C1: x < 100
C2: x < 99

We'd need an expression that is false in that case, but true for

C1: X < 100
C2: X >= 100

@treiher, ideas?
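One way to decide this statically rather than via a runtime assertion, assuming the expression library proposed in #29: two conditions overlap iff their conjunction is satisfiable, so mutual exclusion becomes an unsatisfiability check.

   from z3 import And, Int, Solver, sat

   x = Int("X")

   def overlapping(c1, c2) -> bool:
       solver = Solver()
       solver.add(And(c1, c2))      # is there a value satisfying both at once?
       return solver.check() == sat

   print(overlapping(x < 100, x < 99))    # True: not mutually exclusive
   print(overlapping(x < 100, x >= 100))  # False: mutually exclusive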

Parsing of Logical Expressions

Logical expressions should be interpreted in the same way as in Ada. Logical operators (and, or, xor) are on the same precedence level in Ada. "For a sequence of operators of the same precedence level, the operators are associated with their operands in textual order from left to right. Parentheses can be used to impose specific associations." (Ada RM 2012)

In the AST a logical expression should be represented by a binary tree. A logical operator is a node (AND, OR, XOR) with two children, which can either be another logical operator or a relation. Parsing the logical expressions a OR b AND c and a OR (b AND c) AND d, where a, b, c, and d are relations, should result in the AST AND(OR(a, b), c) and AND(OR(a, AND(b, c)), d), respectively.

The straightforward solution would be to use a left-recursive grammar. As we are currently using pyparsing, which is a recursive descent parser, this is not possible, because it would lead to infinite recursion. Although there are general algorithms to remove left recursion from a grammar, these usually change the resulting parse tree and do not seem to preserve the required left-to-right association.
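One possible workaround is pyparsing's infixNotation, which parses all three operators at a single left-associative precedence level and returns a flat token list that can then be folded from left to right into the binary AST (a sketch; the tuple-based AST is for illustration only):

   from pyparsing import Word, alphas, infixNotation, oneOf, opAssoc

   operand = Word(alphas)
   expression = infixNotation(operand, [(oneOf("and or xor"), 2, opAssoc.LEFT)])

   def to_ast(tokens):
       # fold the flat [a, op, b, op, c, ...] list from left to right
       if isinstance(tokens, str):
           return tokens
       node = to_ast(tokens[0])
       for i in range(1, len(tokens), 2):
           node = (tokens[i].upper(), node, to_ast(tokens[i + 1]))
       return node

   print(to_ast(expression.parseString("a or b and c")[0]))
   # ('AND', ('OR', 'a', 'b'), 'c')
   print(to_ast(expression.parseString("a or (b and c) and d")[0]))
   # ('AND', ('OR', 'a', ('AND', 'b', 'c')), 'd')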

Allow same name for component and type in message type

In some cases it would be nice to give the same name to a component and its type in a message. Currently there is a name conflict, as the generator creates access functions with the names of the components. This issue should be solved by adding a prefix to these functions.

Implement code generator

For the new graph-based model, implement the code generator (without field refinement for now).

Differentiate between indices, lengths and other values in model

When parsing conditions and location expressions in a message type, we should differentiate between fields which are used as an index (indicated by a First attribute), as a length (indicated by a Length attribute) or as another kind of value. This information is needed to ensure that lengths can be added to each other and to an index in the generated code. Ada does not allow addition between different types, so an appropriate conversion is needed.
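A toy model of the distinction (hypothetical types; in the generated Ada code these would be distinct integer types with explicit conversions):

   class Length(int):
       def __add__(self, other):
           if isinstance(other, Length):        # Length + Length is fine
               return Length(int(self) + int(other))
           return NotImplemented

   class Index(int):
       def __add__(self, other):
           if isinstance(other, Length):        # Index + Length yields an Index
               return Index(int(self) + int(other))
           return NotImplemented                # Index + Index is rejected

   assert Index(8) + Length(16) == Index(24)
   try:
       Index(8) + Index(16)
   except TypeError:
       pass                                     # rejected, as intended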

Variable-length integer encoding

Context and Problem Statement

In some protocols, the length field is encoded using a variable-length encoding scheme. In MQTT, such an encoding is used for the Remaining Length field. There are multiple variants of variable-length encoding of integers.

Examples

  • Variable-Length Quantity (VLQ)
  • Little Endian Base 128 (LEB128)
  • MQTT: Variable Byte Integer (equivalent to ULEB128)
  • QUIC: Variable-Length Integer Encoding
  • Bluetooth L2CAP: Protocol/Service Multiplexer (PSM), 2 octets minimum

    "The PSM field is at least two octets in length. All PSM values shall have the least significant bit of the most significant octet equal to 0 and the least significant bit of all other octets equal to 1. Note: This means that all PSMs are odd numbers and that the end of a PSM can be easily found by searching for the first even octet." (Bluetooth Core Specification v5.2, Vol 3, Part A, 4.2)
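For illustration, a simplified decoder for MQTT's Variable Byte Integer (ULEB128 limited to at most four bytes, as used for the Remaining Length field):

   def decode_variable_byte_integer(data: bytes):
       """Return (value, number of bytes consumed)."""
       value = 0
       for i, byte in enumerate(data[:4]):
           value |= (byte & 0x7F) << (7 * i)  # the lower 7 bits carry data
           if byte & 0x80 == 0:               # continuation bit clear: last byte
               return value, i + 1
       raise ValueError("malformed Variable Byte Integer")

   assert decode_variable_byte_integer(bytes([0x80, 0x01])) == (128, 2)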

Cleanup wiki and add README

The wiki is rather inconsistent at the moment. Consolidate the information into a README file in the root folder of the project. Add the specification to the repository and link it from the README. Clean up the remaining wiki pages (or disable the wiki).

Define DSL

We start with a first version of the DSL that only describes the PDU layout. For the syntax of the DSL we follow Ada as closely as feasible. As the main data type for PDUs, a modified form of variant records may be used. To capture dependencies between packet fields, we may use something like type invariants.

Ensure generation of valid SPARK code for expressions

There are multiple scenarios in which the generator creates invalid SPARK code that can be fixed by adapting the specification. It would be good to detect such cases before generating the code. This would require a thorough verification of the location and condition expressions. A prerequisite for this is a more powerful system to evaluate expressions (#29).

Properties to validate by checks:

  • No length attribute for fixed size fields
  • Field first bit is calculated from Other_Field'First or default (UNDEFINED)
  • Only preceding fields referenced
  • Field length calculated from previous fields' values, previous fields' first/last/length, Message'Last and constants
  • Unconstrained fields have a Length attribute

... by proof:

  • Positive field length
  • Field start greater than Message'First
  • Field end smaller than message end (This is a precondition)
  • Fields are continuous on all paths
  • Fields where 'First is set have the same 'Last as the overlaid field
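For instance, the "only preceding fields referenced" check is a simple traversal in declaration order (a sketch; extracting the referenced field names from the expressions is not shown):

   def check_only_preceding_referenced(fields: list, refs: dict) -> None:
       """fields: names in declaration order; refs: field name -> set of
       field names used in its location and condition expressions."""
       seen = set()
       for field in fields:
           undeclared = refs.get(field, set()) - seen
           if undeclared:
               raise ValueError(f"{field} references later field(s): {undeclared}")
           seen.add(field)

   check_only_preceding_referenced(["Length", "Value"], {"Value": {"Length"}})  # OK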

Create Python package

Restructure the repository and add the necessary files to create a Python package.

Implement code generator for field refinement

Create an example that requires field refinement (e.g. Ethernet with VLAN support) and prototype the SPARK code required to parse such Ethernet frames.

Define a concept to access fields according to the valid refinement alternative and generate code to handle packets.

Support file inclusion

Support "with"-ing other RecordFlux packages. For the time being, we do not implement a search for files, but all potential input files must be present on the command line.

Document the with construct in the specification.

Add missing details to language reference

The Language Reference is very lean, which is very favorable for getting into the DSL quickly.

Some parts, though, contain too little information to really grasp all the necessary details:

Message Type

The components are not further specified; only the second sentence of the introduction mentions conditions and dependencies. As this is a crucial (and unusual/non-trivial) concept of the language, we should go into more detail here.

Type Refinement

Do we want to replace any kind of component in a type refinement? Or is this possible only for byte arrays (of a specific type)? We should clarify that.

Also, the section on type refinement does not mention the condition that can be applied to a refinement. This is crucial for real-world protocols.

Context Clause

From that paragraph it's unclear where the with-clauses belong. Clarify where those are valid (probably only at the beginning of a package?).

File

We should specify how the package name corresponds to the file name. Also, we should specify somewhere how package_name, name and refined_type_name are structured.

Style and wording

At the beginning of the spec we should (very briefly!) describe how we specify our syntax. E.g. terminals in bold.

We should decide how to handle the non-keyword/terminal cases. For a real grammar, every non-terminal should map to some production. Maybe we strive for completeness here, but put the obvious details (e.g. that first is some numeric type) into an appendix?

I find the use of "usually" a bit too casual for a specification document. Where applicable (e.g. in the static semantics of the packages) use "by convention". Replace other occurrences by something more definite.

Change syntax of location expressions

The current syntax of the location expressions is quite confusing, since it seems that arbitrary expressions are allowed while this is not the case. It is particularly confusing when a length expression refers to a length field, as in this example:

type Tag_Type is (Msg_Data => 1, Msg_Error => 3) with Size => 2;

   type Message is
      message
         Length : Length_Type
            then Value
               with Length = Length * 8;
         Value  : Payload_Type;
      end message;

I propose to adapt the syntax from with Length = Length * 8 to with Length => Length * 8. This would also match the syntax we use to specify the type size (with Size => 2). To separate a location and a length expression we should also use , instead of and: with First => Value'First, Length => Length * 8.

Design packet generator

So far, we have focused on packet parsing. The other direction, the generation of packets, is equally important. Design the packet generation logic of RecordFlux.

Prohibit reserved keywords

Ensure that no reserved keywords of Ada/SPARK and our DSL are used in the specification file, which would corrupt the generated code.
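A minimal sketch of such a check against an excerpt of the Ada reserved words (the full Ada list and the DSL's own keywords would be handled the same way):

   ADA_RESERVED = frozenset({
       "abort", "abs", "access", "and", "array", "begin", "body", "case",
       "constant", "declare", "else", "end", "for", "function", "if", "in",
       "is", "loop", "mod", "new", "not", "null", "of", "or", "out",
       "package", "procedure", "range", "record", "rem", "return", "subtype",
       "then", "type", "use", "when", "while", "with", "xor",
   })

   def check_identifier(name: str) -> None:
       if name.lower() in ADA_RESERVED:
           raise ValueError(f'"{name}" is a reserved word')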

Optimize generated dissector code

Currently, conditions for protocol fields are checked multiple times. The conditions of all variants of a field are checked in the validation function to decide if any variant is valid. When a valid variant has been found and the corresponding accessor function is used, the conditions of the variants are checked again to find the currently valid variant and determine the location of the field. We should try to prevent this duplicated evaluation.

Create Ethernet parser with ASN.1

We want to try an alternative approach to a custom DSL. The plan is to define an ASN.1 specification for Ethernet and a custom encoding rule using ACN. The Ada code generated using asn1scc should correctly parse all of our existing test cases.

Support enumeration representation clauses

Enumerations are great to specify countable types:

   type Packet_Kind is (Msg_Error, Msg_Data);

To specify the value for each enumeration element, Ada supports an enumeration representation clause:

   for Packet_Kind use (Msg_Error => 1, Msg_Data => 2);

The RecordFlux grammar does not support this right now.

Support non-TLV message schemes in generated dissector code

The focus of the current implementation of the code generator is the TLV scheme: The offset and length of all fields are either defined statically or by the value of another field. Although many protocols follow this scheme, this assumption is not always true. Imagine two consecutive arrays for which no length is specified and which contain elements of heterogeneous size or an unknown number of elements. The end of such an array is usually represented by an end-postfix. In such a case the offset of the second array can only be determined by reading all elements of the first one.
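The dynamic case can be sketched with a hypothetical wire format of length-prefixed elements terminated by a 0x00 end marker; the offset of anything following the sequence is only known after walking all elements:

   def end_of_sequence(data: bytes, start: int) -> int:
       """Return the offset just past the end marker of the sequence."""
       offset = start
       while data[offset] != 0x00:
           offset += 1 + data[offset]  # skip the length byte plus the element content
       return offset + 1               # skip the end marker

   # two consecutive sequences; the second one starts at offset 6,
   # which is only known after reading all elements of the first
   data = bytes([2, 0xAA, 0xBB, 1, 0xCC, 0x00, 3, 0x01, 0x02, 0x03, 0x00])
   assert end_of_sequence(data, 0) == 6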

Subtasks

  • Design parsing code for messages
  • Design parsing code for refinements
  • Design parsing code for sequences
  • Adapt generic code for parsing scalar sequences
  • Adapt generic code for parsing message sequences
  • Design generation of parsing code
  • Adapt code generator
  • Adapt integration tests
  • Adapt documentation

Generic buffer type

Right now, we use a custom buffer type in the generated code. While this is no problem in applications that use our type exclusively, the more common case is that an application defines its own buffer type. Then the user must convert types using unchecked conversions. This is problematic for out parameters, which either need to be stored somewhere or require address overlays, which are problematic for flow analysis.

To solve this issue, the buffer type of the generated code needs to become generic.

Add support for integers with arbitrary range to code generator

Special handling is needed for integers with an arbitrary range (e.g., type Foo is range 32 .. 64 with Size => 8;). We could realize that as proposed in #16:

Maybe we should realize that (in the code) using a subtype of some base type that covers the whole possible range of the bits that type is represented by. That will help with the proofs, as we can represent the (potentially out-of-bounds) value of the packet as a variable of the base type, perform range checks and if successful, represent it as the subtype.
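In Python terms, the proposal corresponds to reading the raw value as the base type first and range-checking it before treating it as Foo (a sketch, using the range from the example above):

   def parse_foo(byte: int) -> int:
       base = byte & 0xFF                     # base type: full 8-bit range
       if not 32 <= base <= 64:               # explicit range check
           raise ValueError(f"{base} outside Foo'Range (32 .. 64)")
       return base                            # now safe to treat as the subtype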

Extend model by protocol sessions

Extend the RecordFlux model by a concept to express automata, including states, transitions, conditions and actions. Dynamic properties should be expressible by some temporal logic.

As an intermediate step, while no session syntax is available, we use the YAML-based representation introduced for GreenTLS. To develop and validate those models, the CLI is extended to parse the YAML session files.

Add support for arrays

We have to consider:

Constrained array types

type Index_Type is range 1 .. 1000;
type Natural_Array is array (Index_Type) of Natural;

Unconstrained array types

type String is array (Positive range <>) of Character;

Constrained arrays are the easier case, as their size is fixed. For unconstrained arrays, however, we have to ensure that the length of a field with such a type can be inferred or is defined by a condition.

To convert the corresponding slice of the input byte array to the targeted array type, we may have to use an Unchecked_Conversion.
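A Python analogue of that conversion, assuming 32-bit big-endian elements (element size and endianness are placeholders):

   import struct

   def parse_natural_array(data: bytes, offset: int, count: int) -> list:
       # convert the slice of the input byte array to the target array type
       end = offset + 4 * count
       if end > len(data):
           raise ValueError("buffer too short for constrained array")
       return list(struct.unpack(f">{count}I", data[offset:end]))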

Precondition Generation

Design and implement an algorithm to generate preconditions for field validator and accessor functions from the record type invariant.
