foxglove / mcap

MCAP is a modular, performant, and serialization-agnostic container file format, useful for pub/sub and robotics applications.

Home Page: https://mcap.dev

License: MIT License


mcap's Introduction


MCAP

MCAP is a modular container format and logging library for pub/sub messages with arbitrary message serialization. It is primarily intended for use in robotics applications, and works well under various workloads, resource constraints, and durability requirements.

Documentation

Developer quick start

MCAP libraries are provided in the following languages. For guidance on each language, see its corresponding README:

Language                  Package name
C++                       mcap
Go                        see releases
Python                    mcap
JavaScript/TypeScript     @mcap/core
Swift                     see releases
Rust                      mcap
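
As a minimal sketch of the common write/read flow, here is an example using the Python mcap package. File names and topic names below are placeholders, and exact signatures may differ between package versions:

  from mcap.writer import Writer
  from mcap.reader import make_reader

  # Write a small file containing one JSON-encoded channel with a single message.
  with open("demo.mcap", "wb") as f:
      writer = Writer(f)
      writer.start(profile="", library="quickstart-example")
      schema_id = writer.register_schema(name="Sample", encoding="jsonschema", data=b"{}")
      channel_id = writer.register_channel(topic="/demo", message_encoding="json", schema_id=schema_id)
      writer.add_message(channel_id=channel_id, log_time=0, publish_time=0, data=b'{"value": 1}')
      writer.finish()

  # Read the file back, iterating over (schema, channel, message) tuples.
  with open("demo.mcap", "rb") as f:
      reader = make_reader(f)
      for schema, channel, message in reader.iter_messages():
          print(channel.topic, message.log_time, message.data)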

To run the conformance tests, you will need to use Git LFS, which is used to store the test logs under tests/conformance/data.

CLI tool

Interact with MCAP files from the command line using the MCAP CLI tool.

Download the latest mcap-cli version from the releases page.
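
For example, the info and doctor subcommands summarize and validate a file (recording.mcap is a placeholder name):

  mcap info recording.mcap     # channels, message counts, duration, compression
  mcap doctor recording.mcap   # check the file for structural problems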

License

MIT License. Contributors are required to accept the Contributor License Agreement.

Release process

Release numbering follows a major.minor.patch format, abbreviated as "X.Y.Z" below.

CI will build the appropriate packages once tags are pushed, as described below.

Go library

  1. Update the Version in go/mcap/version.go
  2. Tag a release matching the version number go/mcap/vX.Y.Z.
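
For example, a hypothetical 1.2.3 release of the Go library would be tagged and pushed roughly as follows:

  git tag go/mcap/v1.2.3
  git push origin go/mcap/v1.2.3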

CLI

Tag a release matching releases/mcap-cli/vX.Y.Z.

The version number is set at build time based on the tag.

C++

  1. Update the version in all relevant files
    • cpp/bench/conanfile.py
    • cpp/build-docs.sh
    • cpp/build.sh
    • cpp/docs/conanfile.py
    • cpp/examples/conanfile.py
    • cpp/mcap/include/mcap/types.hpp (MCAP_LIBRARY_VERSION)
    • cpp/mcap/include/conanfile.py
    • cpp/test/conanfile.py
  2. Tag a release matching the version number releases/cpp/vX.Y.Z

Python

There are several Python packages; updating any of them follows a similar process.

  1. Update the version in the appropriate __init__.py file
  2. Tag a release
    • For the core mcap library, match the pattern releases/python/vX.Y.Z
    • For other packages, use releases/python/PACKAGE/vX.Y.Z
      • For example, releases/python/mcap/v1.2.3

TypeScript

There are several TS packages; updating any follows a similar process.

  1. Update the version in the appropriate package.json
  2. Tag a release matching releases/typescript/PACKAGE/vX.Y.Z
    • For example, releases/typescript/core/v1.2.3

Swift

Tag a release matching the version number releases/swift/vX.Y.Z

Rust

  1. Update the version in rust/Cargo.toml
  2. Tag a release matching the version number releases/rust/vX.Y.Z

mcap's People

Contributors

achim-k, amacneil, bradsquicciarini-coco, bryfox, defunctzombie, dependabot[bot], emersonknapp, esthersweon, foxymiles, idrilirdi, james-rms, jameskuszmaul-brt, jhurliman, jiangengdong, jon-chuang, jtbandes, ktong821, michaelorlov, mrkline, narasaka, ocin-rye, olavsr, pezy, saching13, snosenzo, starcsu, wimagguc, wirthual, wkalt, yizhang24


mcap's Issues

Add schema_count to Statistics

This was overlooked in #102. schema_count can be uint16 because schema id 0 is reserved (contingent on #126).

This could be implemented without binary breakage by appending the field to the end of the Statistics record. Or for aesthetic (and fixed-offset) reasons we could put it earlier in the record with a binary breakage.
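
A rough sketch of why appending is non-breaking: an older reader that unpacks only the fields it knows about never touches the appended bytes. The field layout below is illustrative, not the exact Statistics layout:

  import struct

  def read_statistics_prefix(record_body: bytes):
      # Hypothetical older reader: parse the known leading fields and ignore
      # anything appended after them (such as a new schema_count).
      message_count, channel_count, attachment_count = struct.unpack_from("<QII", record_body, 0)
      return message_count, channel_count, attachment_count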

Add created_at time to attachments

From #16

Why is there no publishing/acquisition time for an attachment? Not all attachment formats will support this metadata natively, so it would be useful to record it here.

Make Chunk compression field a char[4]

Right now, the compression field in Chunk is a variable-length string. This places the actual chunk payload at a variable offset, and requires parsing the compression string to determine the chunk payload length. If compression was instead a fixed-length char[4], we would know the chunk payload size immediately after parsing the record length and it would avoid an additional allocation for the std::string compression.

uncompressed would be [0x00, 0x00, 0x00, 0x00] or little-endian uint32_t 0
lz4 would be [0x6C, 0x7A, 0x34, 0x00] or little-endian uint32_t 3439212
zstd would be [0x7A, 0x73, 0x74, 0x64] or little-endian uint32_t 1685353338
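
Those byte sequences and little-endian integer values line up; a quick check with Python's struct module:

  import struct

  for name, code in [("uncompressed", b"\x00\x00\x00\x00"),
                     ("lz4", b"lz4\x00"),
                     ("zstd", b"zstd")]:
      (value,) = struct.unpack("<I", code)
      print(name, list(code), value)
  # uncompressed -> [0, 0, 0, 0]          -> 0
  # lz4          -> [108, 122, 52, 0]     -> 3439212
  # zstd         -> [122, 115, 116, 100]  -> 1685353338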

indicate to readers whether record timestamps are relative to custom offset

We specify that the record_time may be relative to an arbitrary epoch. The Unix epoch will be common, but other options may also be used. It would be useful for readers to know in some way what epoch the timestamps are relative to - this could inform whether stamps can be displayed as date strings rather than raw integers.

Rework high-level format variants

The specification currently makes a division between "chunked" and "unchunked" files, with each having a mandatory set of fields. Discussions have leaned in the direction of this being too restrictive on at least a couple fronts:

  • Users may want the compression benefits of chunking, but not want the cost of retaining channel info records in RAM for the statistics or chunk index records.
  • Users of the unchunked format may not want the cost of retaining channel info records in RAM for the statistics record. That's part of what they are trying to avoid by using the unchunked variant to begin with.

In consideration of these, we are considering making the following changes:

  • Chunked and unchunked files are eliminated as terms. There will be just one "mcap file".
  • Chunks and messages may both appear at the top level of the file.
  • Chunk indexes, attachment indexes, statistics, and channel infos in the index data section are optional, but subject to some mutual constraints:
  • if chunk indexes are included, any channels referenced by those chunk indexes must have channel infos in the index data section
  • if the channel_stats field of the statistics record is included, any channels it references must be reflected in the index data section as channel infos
  • if there are no records in the index data section, the index_offset of the footer record will be set to zero. Otherwise it will point to the first record in the section, regardless of what kind of record that is.
  • the channel_stats field of the statistics record may be zero-length/empty. This is to allow tracking of cheap global file stats without the expense of retaining the channel infos.

Messages written outside chunks will be readable by a sequential reader, but invisible to a random access reader using the chunk index.

Writers that do not include data in the index section will progressively lose utility from the "fast summarization support". The algorithm for "summary" is roughly,

  • seek to the index_offset
  • read to the end of the file
  • report aggregated statistics

If the index data section is empty, no statistics will be aggregated. Falling back to a full file read is inadvisable if we want to maintain good support for remote files. Update the explanatory notes section to discuss this a bit.
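
A rough sketch of that summary pass, assuming index_offset has already been read from the footer; read_records and aggregate are hypothetical helpers:

  def summarize(f, index_offset):
      # An index_offset of zero means there is no index data section, so there
      # is nothing to aggregate without a full file read.
      if index_offset == 0:
          return None
      f.seek(index_offset)
      # Read every record from the index data section through the end of the file,
      # aggregating whatever statistics / chunk index / channel info records appear.
      return aggregate(read_records(f))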

Remove message_count from Statistics record

The total message count is already available once you've parsed the Statistics record by summing up the message counts for each channel. I don't think the read-time speedup of avoiding a reduce() function on a map is worth having another potential source of disagreement in files.
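
That is, roughly (channel_message_counts stands in for the per-channel counts parsed from the Statistics record):

  total_message_count = sum(channel_message_counts.values())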

profiles are underspecified

The profile field in the Header says it specifies "interpretation of channel info user data". Does this mean that a file with protobuf encoding would still be a valid ros1 profile?

It would be useful to expand the scope of what a profile can describe to include any of the open-ended fields (i.e. encoding, schema, schema name, etc).

This would allow creating a ros1 profile that would indicate ros1 as the required encoding, .msg text as the required schema format, and the schema naming convention. By specifying all of the requirements within the profile, a library author can definitively say they implement support for a ros1 profile which will interoperate with other tooling producing ros1 profile files.

ChannelInfo encoding is underspecified

The encoding field within ChannelInfo is a string type with a few examples: ros1, protobuf, cbor. If I select protobuf as the encoding, what should I put for schema or schema_name? Does selecting an encoding impose any other restrictions?

I'd like to write MCAP reader/writer libraries that interoperate with other tools, but without additional specification for how to handle encodings I can't be sure what to do.

Some useful encodings to specify:

  • ros1
  • protobuf
  • json
  • cbor

Writing an attachment forces me to end my chunk

My writer is trying to produce a chunk for every second of data. If I've written a few messages and now want to write an attachment record, it seems I have to end my chunk, write out all the message indexes, write the attachment, and then start a new chunk.

go: reusable readers/writers

The readers and writers should support some kind of Reset functionality, to allow reuse of underlying buffers for compression and chunking across files.

Chunk records should have a start and end stamp

With a start and end stamp on chunks, I could cheaply identify if I need to read a chunk or decompress a chunk based on the time I want to read messages at. Without the start and end stamp I have to process a chunk before knowing it contains messages for my timestamp.
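
With those two fields, a reader could skip a chunk without decompressing it, along these lines (field names here are assumptions, not spec names):

  def chunk_overlaps(chunk, start_ns, end_ns):
      # Only read/decompress the chunk if its time range overlaps the query range.
      return chunk.start_time <= end_ns and chunk.end_time >= start_ns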

Add a way to indicate a schema name but no schema.

Let's say I have CBOR messages on channels. I'd like to give each "type" of message a "name" so that Studio can render my data in various panels. I don't want to (or need to) provide a schema, since CBOR is self-describing. What do I do?

Comments on the file format spec

Hi,

I've been reviewing the MCAP spec, and have some feedback which may be of interest.

Attachments

  • Why is there no publishing/acquisition time for an attachment? Not all attachment formats will support this metadata natively, so it would be useful to record it here.

Channel Info

  • It may be useful to have a schema version, or store a hash of the schema. This allows consumers to determine whether they are able to process the data described by the schema, without having to reason about the contents of the schema itself.

Encryption / Signing

  • No support.
  • It would be useful to support encryption and signing on a per chunk basis.

Robustness and Chunks

  • Major downsides of unchunked files are the lack of message indexes and integrity checking.
  • Would it be possible to use an approach whereby uncompressed chunks are written, along with message indexes? This would have negligible impact on write performance and, depending on the implementation (a special case of an uncompressed chunk, where messages are written straight to the file), would maintain robustness in the case of a crash.

swap publish time and log time in message record?

We currently have publish time physically ordered before log time in the message record. This isn't a big deal but introduces a couple inconveniences:

  • The description of the publish time references the log time, which a spec reader will not have read about yet
  • The log time is the time on which the file is most closely ordered (under most conditions) and the one on which the indexes are built, so it may be more natural somehow for it to come first. Since all the fields are fixed-width, the difference is just cosmetic.

The corrective action would be to swap the ordering in the record.

include schema_version in channel info

A schema version field should be included in channel info. This could be a fixed-width 16 byte field (md5). This field can be used as a cache key by readers, to either avoid parsing (local) or downloading (remote) schemas they have observed in previous requests.
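
For example, a reader could key a schema cache on that digest and skip fetching or parsing schemas it has already seen. In this sketch, fetch_schema and parse_schema are placeholders and schema_version is the proposed field:

  import hashlib

  schema_cache = {}  # 16-byte md5 digest -> parsed schema

  def schema_version_of(schema_bytes: bytes) -> bytes:
      # Writers would compute the proposed fixed-width field as an md5 digest of the schema.
      return hashlib.md5(schema_bytes).digest()

  def schema_for_channel(channel_info, fetch_schema, parse_schema):
      # Readers can use the digest carried in channel info as a cache key, and only
      # fetch/parse the schema the first time a given version is seen.
      key = channel_info.schema_version
      if key not in schema_cache:
          schema_cache[key] = parse_schema(fetch_schema(channel_info))
      return schema_cache[key]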

How are readers supposed to determine that unchunked files contain messages?

Here is a simple chunked file layout:

Chunked
-------
Magic
Header
Chunk
  ChannelInfo
  Message
MessageIndex
[index_offset]
ChannelInfo
ChunkIndex
Statistics
Footer
Magic

And the same file unchunked:

Unchunked
---------
Magic
Header
ChannelInfo
Message
[index_offset]
ChannelInfo
Statistics
Footer
Magic

And here's an empty file:

Empty (Chunked or Unchunked)
-------
Magic
Header
[index_offset]
Statistics
Footer
Magic

How are readers supposed to distinguish between unchunked files and empty files?

Make McapWriter file interface agnostic

Rather than requiring the Node.js fs module, allow McapWriter to write to any FileLike-conforming instance. This is a pattern we've used before so readers and writers can function in different I/O environments.

String datatype is unclear

Description
The spec says the following about string types:

String: a uint32-prefixed UTF8 string

Is the prefix the length of the string (characters) or the number of bytes for the entire UTF8 encoded portion?
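
The distinction only matters for non-ASCII text, but there it diverges quickly:

  s = "héllo"
  print(len(s))                  # 5 characters
  print(len(s.encode("utf-8")))  # 6 bytes, because é encodes as two bytes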

Spec is not clear up front about serialization details

When reading the spec one encounters usage of Array<Tuple<...>> before these terms are defined. We should move the definitions up, or add links to the serialization info, or at least mention earlier in the doc that serialization terms will be specified later.

Byte prefixed individual records (like arrays) feels weird

Issue for discussion

Since we already have byte-length prefixing for the entire record, having a byte-length prefix for array fields feels weird. On the writer side I need to write out the array first, and only then do I know its length in bytes.
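
Concretely, the writer ends up buffering the serialized array before it can emit the prefix, roughly (serialize_item is a placeholder for per-element serialization):

  import struct

  def write_length_prefixed_array(out, items, serialize_item):
      # Serialize every element into a temporary buffer first...
      body = b"".join(serialize_item(item) for item in items)
      # ...because only then is the byte length known for the uint32 prefix.
      out.write(struct.pack("<I", len(body)))
      out.write(body)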

v1: make profile available in summary section?

This would prevent the need to read from the front of the file when doing indexed reads, which would reduce latency for remote reading by some amount (should be tested).

This doesn't have to break backward compatibility, but would require either a new record type, or else overloading the Header. Strictly speaking, we only need the profile, not the library, but perhaps the library would be useful too.

Clarify length semantics for String and KeyValue

The spec says

String: a uint32-prefixed UTF8 string
KeyValues<T1, T2>: A uint32 length-prefixed association of key-value pairs, serialized as

For string is this the length of the string or the number of bytes?
For KeyValues is this the number of pairs or the number of bytes for the remaining serialized portion?
