foxglove / mcap

MCAP is a modular, performant, and serialization-agnostic container file format, useful for pub/sub and robotics applications.

Home Page: https://mcap.dev

License: MIT License


mcap's Introduction


MCAP

MCAP is a modular container format and logging library for pub/sub messages with arbitrary message serialization. It is primarily intended for use in robotics applications, and works well under various workloads, resource constraints, and durability requirements.

Documentation

Developer quick start

MCAP libraries are provided in the following languages. For guidance on each language, see its corresponding README:

Language                  Package name
C++                       mcap
Go                        see releases
Python                    mcap
JavaScript/TypeScript     @mcap/core
Swift                     see releases
Rust                      mcap
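
As a minimal sketch of the common write/read flow, here is an example using the Python mcap package. File names and topic names below are placeholders, and exact signatures may differ between package versions:

  from mcap.writer import Writer
  from mcap.reader import make_reader

  # Write a small file containing one JSON-encoded channel with a single message.
  with open("demo.mcap", "wb") as f:
      writer = Writer(f)
      writer.start(profile="", library="quickstart-example")
      schema_id = writer.register_schema(name="Sample", encoding="jsonschema", data=b"{}")
      channel_id = writer.register_channel(topic="/demo", message_encoding="json", schema_id=schema_id)
      writer.add_message(channel_id=channel_id, log_time=0, publish_time=0, data=b'{"value": 1}')
      writer.finish()

  # Read the file back, iterating over (schema, channel, message) tuples.
  with open("demo.mcap", "rb") as f:
      reader = make_reader(f)
      for schema, channel, message in reader.iter_messages():
          print(channel.topic, message.log_time, message.data)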

To run the conformance tests, you will need to use Git LFS, which is used to store the test logs under tests/conformance/data.

CLI tool

Interact with MCAP files from the command line using the MCAP CLI tool.

Download the latest mcap-cli version from the releases page.
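
For example, the info and doctor subcommands summarize and validate a file (recording.mcap is a placeholder name):

  mcap info recording.mcap     # channels, message counts, duration, compression
  mcap doctor recording.mcap   # check the file for structural problems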

License

MIT License. Contributors are required to accept the Contributor License Agreement.

Release process

Release numbering follows a major.minor.patch format, abbreviated as "X.Y.Z" below.

CI will build the appropriate packages once tags are pushed, as described below.

Go library

  1. Update the Version in go/mcap/version.go
  2. Tag a release matching the version number go/mcap/vX.Y.Z.
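
For example, a hypothetical 1.2.3 release of the Go library would be tagged and pushed roughly as follows:

  git tag go/mcap/v1.2.3
  git push origin go/mcap/v1.2.3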

CLI

Tag a release matching releases/mcap-cli/vX.Y.Z.

The version number is set at build time based on the tag.

C++

  1. Update the version in all relevant files
    • cpp/bench/conanfile.py
    • cpp/build-docs.sh
    • cpp/build.sh
    • cpp/docs/conanfile.py
    • cpp/examples/conanfile.py
    • cpp/mcap/include/mcap/types.hpp (MCAP_LIBRARY_VERSION)
    • cpp/mcap/include/conanfile.py
    • cpp/test/conanfile.py
  2. Tag a release matching the version number releases/cpp/vX.Y.Z

Python

There are several Python packages; updating any of them follows a similar process.

  1. Update the version in the appropriate __init__.py file
  2. Tag a release
    • For the core mcap library, match the pattern releases/python/vX.Y.Z
    • For other packages, use releases/python/PACKAGE/vX.Y.Z
      • For example, releases/python/mcap/v1.2.3

TypeScript

There are several TS packages; updating any follows a similar process.

  1. Update the version in the appropriate package.json
  2. Tag a release matching releases/typescript/PACKAGE/vX.Y.Z
    • For example, releases/typescript/core/v1.2.3

Swift

Tag a release matching the version number releases/swift/vX.Y.Z

Rust

  1. Update the version in rust/Cargo.toml
  2. Tag a release matching the version number releases/rust/vX.Y.Z

mcap's People

Contributors

achim-k, amacneil, bradsquicciarini-coco, bryfox, defunctzombie, dependabot[bot], emersonknapp, esthersweon, foxymiles, idrilirdi, james-rms, jameskuszmaul-brt, jhurliman, jiangengdong, jon-chuang, jtbandes, ktong821, michaelorlov, mrkline, narasaka, ocin-rye, olavsr, pezy, saching13, snosenzo, starcsu, wimagguc, wirthual, wkalt, yizhang24


mcap's Issues

Add schema_count to Statistics

This was overlooked in #102. schema_count can be uint16 because schema id 0 is reserved (contingent on #126).

This could be implemented without binary breakage by appending the field to the end of the Statistics record. Or for aesthetic (and fixed-offset) reasons we could put it earlier in the record with a binary breakage.
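
A rough sketch of why appending is non-breaking: an older reader that unpacks only the fields it knows about never touches the appended bytes. The field layout below is illustrative, not the exact Statistics layout:

  import struct

  def read_statistics_prefix(record_body: bytes):
      # Hypothetical older reader: parse the known leading fields and ignore
      # anything appended after them (such as a new schema_count).
      message_count, channel_count, attachment_count = struct.unpack_from("<QII", record_body, 0)
      return message_count, channel_count, attachment_count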

Add created_at time to attachments

From #16

Why is there no publishing/acquisition time for an attachment? Not all attachment formats will support this metadata natively, so it would be useful to record it here.

Make Chunk compression field a char[4]

Right now, the compression field in Chunk is a variable-length string. This places the actual chunk payload at a variable offset, and requires parsing the compression string to determine the chunk payload length. If compression was instead a fixed-length char[4], we would know the chunk payload size immediately after parsing the record length and it would avoid an additional allocation for the std::string compression.

uncompressed would be [0x00, 0x00, 0x00, 0x00] or little-endian uint32_t 0
lz4 would be [0x6C, 0x7A, 0x34, 0x00] or little-endian uint32_t 3439212
zstd would be [0x7A, 0x73, 0x74, 0x64] or little-endian uint32_t 1685353338
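
Those byte sequences and little-endian integer values line up; a quick check with Python's struct module:

  import struct

  for name, code in [("uncompressed", b"\x00\x00\x00\x00"),
                     ("lz4", b"lz4\x00"),
                     ("zstd", b"zstd")]:
      (value,) = struct.unpack("<I", code)
      print(name, list(code), value)
  # uncompressed -> [0, 0, 0, 0]          -> 0
  # lz4          -> [108, 122, 52, 0]     -> 3439212
  # zstd         -> [122, 115, 116, 100]  -> 1685353338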

indicate to readers whether record timestamps are relative to custom offset

We specify that the record_time may be relative to an arbitrary epoch. The Unix epoch will be common, but other options may also be used. It would be useful for readers to know in some way what epoch the timestamps are relative to - this could inform whether stamps can be displayed as date strings rather than raw integers.

Rework high-level format variants

The specification currently makes a division between "chunked" and "unchunked" files, with each having a mandatory set of fields. Discussions have leaned in the direction of this being too restrictive on at least a couple fronts:

  • Users may want the compression benefits of chunking, but not want the cost of retaining channel info records in RAM for the statistics or chunk index records.
  • Users of the unchunked format may not want the cost of retaining channel info records in RAM for the statistics record. That's part of what they are trying to avoid by using the unchunked variant to begin with.

In consideration of these, we are considering making the following changes:

  • Chunked and unchunked files are eliminated as terms. There will be just one "mcap file".
  • Chunks and messages may both appear at the top level of the file.
  • Chunk indexes, attachment indexes, statistics, and channel infos in the index data section are optional, but subject to some mutual constraints:
  • if chunk indexes are included, any channels referenced by those chunk indexes must have channel infos in the index data section
  • if the channel_stats field of the statistics record is included, any channels it references must be reflected in the index data section as channel infos
  • if there are no records in the index data section, the index_offset of the footer record will be set to zero. Otherwise it will point to the first record in the section, regardless of what kind of record that is.
  • the channel_stats field of the statistics record may be zero-length/empty. This is to allow tracking of cheap global file stats without the expense of retaining the channel infos.

Messages written outside chunks will be readable by a sequential reader, but invisible to a random access reader using the chunk index.

Writers that do not include data in the index section will progressively lose utility from the "fast summarization support". The algorithm for "summary" is roughly,

  • seek to the index_offset
  • read to the end of the file
  • report aggregated statistics

If the index data section is empty, no statistics will be aggregated. Falling back to a full file read is inadvisable if we want to maintain good support for remote files. Update the explanatory notes section to discuss this a bit.
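
A rough sketch of that summary pass, assuming index_offset has already been read from the footer; read_records and aggregate are hypothetical helpers:

  def summarize(f, index_offset):
      # An index_offset of zero means there is no index data section, so there
      # is nothing to aggregate without a full file read.
      if index_offset == 0:
          return None
      f.seek(index_offset)
      # Read every record from the index data section through the end of the file,
      # aggregating whatever statistics / chunk index / channel info records appear.
      return aggregate(read_records(f))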

Remove message_count from Statistics record

The total message count is already available once you've parsed the Statistics record by summing up the message counts for each channel. I don't think the read-time speedup of avoiding a reduce() function on a map is worth having another potential source of disagreement in files.
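
That is, roughly (channel_message_counts stands in for the per-channel counts parsed from the Statistics record):

  total_message_count = sum(channel_message_counts.values())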

profiles are underspecified

The profile field in the Header says it specifies "interpretation of channel info user data". Does this mean that a file with protobuf encoding would still be a valid ros1 profile?

It would be useful to expand the scope of what a profile can describe to include any of the open-ended fields (i.e. encoding, schema, schema name, etc).

This would allow creating a ros1 profile that would indicate ros1 as the required encoding, .msg text as the required schema format, and the schema naming convention. By specifying all of the requirements within the profile, a library author can definitively say they implement support for a ros1 profile which will interoperate with other tooling producing ros1 profile files.

ChannelInfo encoding is underspecified

The encoding field within ChannelInfo is a string type with a few examples: ros1, protobuf, cbor. If I select protobuf as the encoding, what should I put for schema or schema_name? Does selecting an encoding impose any other restrictions?

I'd like to write MCAP reader/writer libraries that interoperate with other tools, but without additional specification for how to handle encodings I can't be sure what to do.

Some useful encodings to specify:

  • ros1
  • protobuf
  • json
  • cbor

Writing an attachment forces me to end my chunk

My writer is trying to produce a chunk for every second of data. If I've written a few messages and now want to write an attachment record, it seems I have to end my chunk, write out all the message indexes, write the attachment, and then start a new chunk.

go: reusable readers/writers

The readers and writers should support some kind of Reset functionality, to allow reuse of underlying buffers for compression and chunking across files.

Chunk records should have a start and end stamp

With a start and end stamp on chunks, I could cheaply identify if I need to read a chunk or decompress a chunk based on the time I want to read messages at. Without the start and end stamp I have to process a chunk before knowing it contains messages for my timestamp.
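
With those two fields, a reader could skip a chunk without decompressing it, along these lines (field names here are assumptions, not spec names):

  def chunk_overlaps(chunk, start_ns, end_ns):
      # Only read/decompress the chunk if its time range overlaps the query range.
      return chunk.start_time <= end_ns and chunk.end_time >= start_ns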

Add a way to indicate a schema name but no schema.

Let's say I have CBOR messages on channels. I'd like to give each "type" of message a "name" so that Studio can render my data in various panels. I don't want to (or need to) provide a schema, since CBOR is self-describing. What do I do?

Comments on the file format spec

Hi,

I've been reviewing the MCAP spec, and have some feedback which may be of interest.

Attachments

  • Why is there no publishing/acquisition time for an attachment? Not all attachment formats will support this metadata natively, so it would be useful to record it here.

Channel Info

  • It may be useful to have a schema version, or store a hash of the schema. This allows consumers to determine whether they are able to process the data described by the schema, without having to reason about the contents of the schema itself.

Encryption / Signing

  • No support.
  • It would be useful to support encryption and signing on a per chunk basis.

Robustness and Chunks

  • Major downsides of unchunked files are the lack of message indexes and integrity checking.
  • Would it be possible to use an approach whereby uncompressed chunks are written, along with message indexes? This would have negligible impact on write performance and, depending on the implementation (a special case of an uncompressed chunk, where messages are written straight to the file), would maintain robustness in the case of a crash.

swap publish time and log time in message record?

We currently have publish time physically ordered before log time in the message record. This isn't a big deal but introduces a couple inconveniences:

  • The description of the publish time references the log time, which a spec reader will not have read about yet
  • The log time is the time on which the file is most closely ordered (under most conditions) and the one on which the indexes are built, so it may be more natural somehow for it to come first. Since all the fields are fixed-width, the difference is just cosmetic.

The corrective action would be to swap the ordering in the record.

include schema_version in channel info

A schema version field should be included in channel info. This could be a fixed-width 16 byte field (md5). This field can be used as a cache key by readers, to either avoid parsing (local) or downloading (remote) schemas they have observed in previous requests.
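
For example, a reader could key a schema cache on that digest and skip fetching or parsing schemas it has already seen. In this sketch, fetch_schema and parse_schema are placeholders and schema_version is the proposed field:

  import hashlib

  schema_cache = {}  # 16-byte md5 digest -> parsed schema

  def schema_version_of(schema_bytes: bytes) -> bytes:
      # Writers would compute the proposed fixed-width field as an md5 digest of the schema.
      return hashlib.md5(schema_bytes).digest()

  def schema_for_channel(channel_info, fetch_schema, parse_schema):
      # Readers can use the digest carried in channel info as a cache key, and only
      # fetch/parse the schema the first time a given version is seen.
      key = channel_info.schema_version
      if key not in schema_cache:
          schema_cache[key] = parse_schema(fetch_schema(channel_info))
      return schema_cache[key]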

How are readers supposed to determine that unchunked files contain messages?

Here is a simple chunked file layout:

Chunked
-------
Magic
Header
Chunk
  ChannelInfo
  Message
MessageIndex
[index_offset]
ChannelInfo
ChunkIndex
Statistics
Footer
Magic

And the same file unchunked:

Unchunked
---------
Magic
Header
ChannelInfo
Message
[index_offset]
ChannelInfo
Statistics
Footer
Magic

And here's an empty file:

Empty (Chunked or Unchunked)
-------
Magic
Header
[index_offset]
Statistics
Footer
Magic

How are readers supposed to distinguish between unchunked files and empty files?

Make McapWriter file interface agnostic

Rather than requiring the Node.js fs module, allow McapWriter to write to any FileLike-conforming instance. This is a pattern we've used before so readers and writers can function in different I/O environments.

String datatype is unclear

Description
The spec says the following about string types:

String: a uint32-prefixed UTF8 string

Is the prefix the length of the string (characters) or the number of bytes for the entire UTF8 encoded portion?
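
The distinction only matters for non-ASCII text, but there it diverges quickly:

  s = "héllo"
  print(len(s))                  # 5 characters
  print(len(s.encode("utf-8")))  # 6 bytes, because é encodes as two bytes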

Spec is not clear up front about serialization details

When reading the spec one encounters usage of Array<Tuple<...>> before these terms are defined. We should move the definitions up, or add links to the serialization info, or at least mention earlier in the doc that serialization terms will be specified later.

Byte prefixed individual records (like arrays) feels weird

Issue for discussion

Since we already have byte-length prefixing for the entire record, having a byte-length prefix for array fields feels weird. On the writer side I need to write out the array first, and only then do I know its length in bytes.
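
Concretely, the writer ends up buffering the serialized array before it can emit the prefix, roughly (serialize_item is a placeholder for per-element serialization):

  import struct

  def write_length_prefixed_array(out, items, serialize_item):
      # Serialize every element into a temporary buffer first...
      body = b"".join(serialize_item(item) for item in items)
      # ...because only then is the byte length known for the uint32 prefix.
      out.write(struct.pack("<I", len(body)))
      out.write(body)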

v1: make profile available in summary section?

This would prevent the need to read from the front of the file when doing indexed reads, which would reduce latency for remote reading by some amount (should be tested).

This doesn't have to break backward compatibility, but would require either a new record type, or else overloading the Header. Strictly speaking, we only need the profile, not the library, but perhaps the library would be useful too.

Clarify length semantics for String and KeyValue

The spec says

String: a uint32-prefixed UTF8 string
KeyValues<T1, T2>: A uint32 length-prefixed association of key-value pairs, serialized as

For string is this the length of the string or the number of bytes?
For KeyValues is this the number of pairs or the number of bytes for the remaining serialized portion?
