cadubentzen / mkvdump

MKV and WebM parser CLI tool

License: Apache License 2.0

mkvdump

A command-line tool for debugging Matroska/WebM files. It displays all internal elements of a Matroska file as JSON or YAML.

Sample YAML output

- id: EBML
  header_size: 5
  size: 36
  children:
  - id: EBMLVersion
    header_size: 3
    size: 4
    value: 1
  - id: EBMLReadVersion
    header_size: 3
    size: 4
    value: 1
  - id: EBMLMaxIDLength
    header_size: 3
    size: 4
    value: 4
  - id: EBMLMaxSizeLength
    header_size: 3
    size: 4
    value: 8
  - id: DocType
    header_size: 3
    size: 7
    value: webm
  - id: DocTypeVersion
    header_size: 3
    size: 4
    value: 2
  - id: DocTypeReadVersion
    header_size: 3
    size: 4
    value: 2
- id: Segment
  header_size: 12
  size: Unknown
  children:
  - id: Void
    header_size: 9
    size: 229
    value: null
  - id: Info
    header_size: 5
    size: 44
    children:
    - id: TimestampScale
      header_size: 4
      size: 7
      value: 1000000
    - id: MuxingApp
      header_size: 3
      size: 16
      value: Lavf58.29.100
    - id: WritingApp
      header_size: 3
      size: 16
      value: Lavf58.29.100
  - id: Tracks
    header_size: 5
    size: 101
    children:
    - id: TrackEntry
      header_size: 9
      size: 96
      children:
      - id: TrackNumber
        header_size: 2
        size: 3
        value: 1
      - id: TrackUID
        header_size: 3
        size: 4
        value: 1
      - id: FlagLacing
        header_size: 2
        size: 3
        value: 0
      - id: Language
        header_size: 4
        size: 7
        value: und
      - id: CodecID
        header_size: 2
        size: 7
        value: V_AV1
      - id: TrackType
        header_size: 2
        size: 3
        value: video
      - id: DefaultDuration
        header_size: 4
        size: 8
        value: 41708333
      - id: Video
        header_size: 9
        size: 32
        children:
        - id: PixelWidth
          header_size: 2
          size: 4
          value: 1280
        - id: PixelHeight
          header_size: 2
          size: 4
          value: 720
        - id: Colour
          header_size: 3
          size: 15
          children:
          - id: Range
            header_size: 3
            size: 4
            value: broadcast range
          - id: ChromaSitingHorz
            header_size: 3
            size: 4
            value: left collocated
          - id: ChromaSitingVert
            header_size: 3
            size: 4
            value: half
      - id: CodecPrivate
        header_size: 3
        size: 20
        value: '[81 05 0c 00 0a 0b 00 00 00 2d 4c ff b3 df ff 98 04]'
  - id: Tags
    header_size: 5
    size: 61
    children:
    - id: Tag
      header_size: 10
      size: 56
      children:
      - id: Targets
        header_size: 10
        size: 10
        children: []
      - id: SimpleTag
        header_size: 10
        size: 36
        children:
        - id: TagName
          header_size: 3
          size: 10
          value: ENCODER
        - id: TagString
          header_size: 3
          size: 16
          value: Lavf58.29.100
  - id: Cluster
    header_size: 6
    size: 2679
    children:
    - id: Timestamp
      header_size: 2
      size: 3
      value: 0
    - id: SimpleBlock
      header_size: 2
      size: 45
      value:
        track_number: 1
        timestamp: 0
        keyframe: true
    - id: SimpleBlock
      header_size: 2
      size: 59
      value:
        track_number: 1
        timestamp: 42
    - id: SimpleBlock
      header_size: 2
      size: 32
      value:
        track_number: 1
        timestamp: 83
    # ...

What's it useful for?

This tool is similar to mp4dump, but for Matroska files. It may be useful for:

- snapshot testing: you can save mkvdump's output for a produced Matroska asset and use that in a human-readable snapshot test.
- learning about EBML/Matroska/WebM: with this tool you can see how a Matroska file is structured. I also learned by writing the tool 😊

Getting mkvdump

Debian package

Ubuntu users (>= 20.04) can install mkvdump via the DEB package available on the releases page.

Homebrew

Linux and macOS users on x86_64 devices can install mkvdump via the Homebrew tap:

$ brew install cadubentzen/mkvdump/mkvdump

macOS users on M1 or M2 devices need to build from source:

$ brew install --build-from-source cadubentzen/mkvdump/mkvdump

Cargo

If you have cargo-binstall installed, you can install mkvdump with

$ cargo binstall mkvdump

Otherwise, you can build and install it from source with:

$ cargo install mkvdump

Docker

To pull the latest mkvdump image from Docker Hub:

$ docker pull cadubentzen/mkvdump

A GitHub package is also available via

$ docker pull ghcr.io/cadubentzen/mkvdump

Images are multi-arch with support for linux/amd64, linux/386, linux/arm64, linux/arm/v7 and linux/arm/v6.

Running the container

Assuming a Matroska file on the host located at /host-path/sample.mkv, you can run mkvdump on it by mounting a volume:

$ docker run -v /host-path:/media cadubentzen/mkvdump /media/sample.mkv

Prebuilt binaries

Download prebuilt binaries from the release page. There are binaries for the following targets:

Linux
- statically linked with musl: x86_64, x86, aarch64, armv7l and armv6l
- with GNU libc: x86_64 and x86 (built on Ubuntu 20.04)
macOS
- x86_64 and aarch64 (>= macOS 11 Big Sur)
Windows
- x86_64 and x86 with MSVC and MinGW

License

This project is licensed under either of the Apache License, Version 2.0 or the MIT license, at your option.

The SPDX license identifier for this project is MIT OR Apache-2.0.

mkvdump's Issues

Children is private

Thanks for working on this! I was having issues trying to get mkv tags with symphonia and vlc, maybe because the mkv tags are a matrix rather than flat? Not sure.

In the MasterElement struct children is private so library users can't walk the tree. I didn't see any other way to get it other than serialize to json then deserialize.

Add values to serialization of enumerations

This would help with debugging, as we could see both the enumeration name and its raw value without having to look at the spec.

Parse enumeration values

Fields like TrackType are integers with enumeration values, e.g. audio, video, etc.

https://www.matroska.org/technical/elements.html#TrackType

This could be parsed, ideally with the XML definitions.
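
As a rough illustration of the idea, the sketch below maps raw TrackType integers to named variants by hand. The type and method names (`TrackType`, `from_raw`, `label`) are hypothetical and not the crate's API; the numeric values follow the Matroska element specification as I understand it, and a real implementation would presumably generate them from the XML definitions instead.

```rust
/// Hypothetical hand-written mapping for the Matroska TrackType enumeration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TrackType {
    Video,
    Audio,
    Complex,
    Logo,
    Subtitle,
    Buttons,
    Control,
    Metadata,
    /// A raw value that is not in the table below.
    Unknown(u64),
}

impl TrackType {
    /// Converts the unsigned integer stored in the element into a variant.
    pub fn from_raw(raw: u64) -> Self {
        match raw {
            1 => Self::Video,
            2 => Self::Audio,
            3 => Self::Complex,
            16 => Self::Logo,
            17 => Self::Subtitle,
            18 => Self::Buttons,
            32 => Self::Control,
            33 => Self::Metadata,
            other => Self::Unknown(other),
        }
    }

    /// Human-readable label, e.g. for the `value` field of the dump.
    pub fn label(self) -> String {
        match self {
            Self::Video => "video".to_string(),
            Self::Audio => "audio".to_string(),
            Self::Complex => "complex".to_string(),
            Self::Logo => "logo".to_string(),
            Self::Subtitle => "subtitle".to_string(),
            Self::Buttons => "buttons".to_string(),
            Self::Control => "control".to_string(),
            Self::Metadata => "metadata".to_string(),
            Self::Unknown(raw) => format!("unknown ({raw})"),
        }
    }
}
```

Keeping an `Unknown` variant means unexpected values are still displayed rather than turning into parse failures.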

Support concat of initialization segment after cluster

For WebM byte streams

Use Serde-XML to parse XML files

The current XML parsing in build.rs is very primitive. It could be improved by using Serde-XML.

Implement CRC-32 validation

Master elements might have it, so we need to check that the CRC-32 matches and discard the content if not.
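
A minimal sketch of such a check, assuming the CRC-32 element uses the standard IEEE polynomial and stores its 4-byte value in little-endian order (my reading of the EBML specification, not something stated in this issue). The function names are hypothetical, and the bitwise loop favours clarity over speed.

```rust
/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
pub fn crc32_ieee(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All ones if the lowest bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Compares the stored CRC (assumed little-endian) with the one computed
/// over the rest of the master element's data.
pub fn crc_matches(stored_le: [u8; 4], payload: &[u8]) -> bool {
    u32::from_le_bytes(stored_le) == crc32_ieee(payload)
}
```

A table-driven variant would be faster if the check ever shows up in profiles.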

Write README page

Last thing before the first release, once all things are set in place.

Parse Date

https://github.com/ietf-wg-cellar/ebml-specification/blob/master/specification.markdown#date-element
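
For reference, the linked section describes a Date as a signed 8-octet integer counting nanoseconds from 2001-01-01T00:00:00 UTC. The sketch below converts such a payload to Unix seconds plus a sub-second part; the function name and the return shape are assumptions made for illustration.

```rust
/// Unix timestamp (seconds) of the EBML epoch, 2001-01-01T00:00:00 UTC.
const EBML_EPOCH_UNIX_SECS: i64 = 978_307_200;

/// Converts an 8-byte EBML Date payload into (Unix seconds, nanoseconds).
pub fn ebml_date_to_unix(payload: [u8; 8]) -> (i64, u32) {
    let nanos = i64::from_be_bytes(payload);
    // Euclidean division keeps the sub-second part non-negative for
    // dates before the EBML epoch.
    let secs = nanos.div_euclid(1_000_000_000);
    let subsec = nanos.rem_euclid(1_000_000_000) as u32;
    (EBML_EPOCH_UNIX_SECS + secs, subsec)
}
```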

Add CLI tests

Currently we test the crate as a library, but not so much what the CLI interface looks like.

https://rust-cli.github.io/book/tutorial/testing.html

remove nom

While nom is a great parser library, this project doesn't really use it fully, just take() and peek() functions.

Those functions should be easy to implement by hand and should reduce compilation time a lot.
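
To give a sense of how small the replacements could be, here is a sketch of slice-based helpers. They are hypothetical stand-ins, not nom's API: they return `None` on short input instead of nom's error types, and only mirror nom's convention of returning the remaining input first.

```rust
/// Splits `count` bytes off the front of `input`.
/// Returns `(remaining, taken)`, or `None` if the input is too short.
pub fn take(input: &[u8], count: usize) -> Option<(&[u8], &[u8])> {
    if input.len() < count {
        return None;
    }
    let (taken, rest) = input.split_at(count);
    Some((rest, taken))
}

/// Looks at the first `count` bytes without consuming them.
pub fn peek(input: &[u8], count: usize) -> Option<&[u8]> {
    input.get(..count)
}
```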

Parse XML files for getting Element IDs

The XML specification files could be parsed to obtain the element IDs and types. This way, they don't need to be hardcoded.

It could be done in build.rs, and then generate an elements.rs file.

Fix issues with Matroska test suite files

https://www.matroska.org/downloads/test_suite.html
Files 4 and 7 yield problems in the parser.

Currently we parse the input by requiring each element (including its body) to be loaded into memory. That's the whole reason why we have a buffer-size option, so that it can be increased e.g. if we are parsing an MKV file with huge video frames.

However, since this crate is about displaying headers for those elements, it shouldn't be required to load the whole body into memory.

It's a bit tricky though, since we need to stay in sync while skipping bytes from the input source, and sometimes we need to parse part of the body (e.g. in SimpleBlock) for useful info.
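
One possible way to skip a body without holding it in memory is sketched below. It works on any `Read` source, including non-seekable ones such as stdin; `skip_bytes` is a hypothetical helper, not an existing function of the crate.

```rust
use std::io::{self, Read};

/// Discards exactly `count` bytes from `reader`, streaming them into a sink
/// so that the element body is never buffered as a whole.
pub fn skip_bytes<R: Read>(reader: &mut R, count: u64) -> io::Result<()> {
    let skipped = io::copy(&mut reader.by_ref().take(count), &mut io::sink())?;
    if skipped == count {
        Ok(())
    } else {
        Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "input ended inside an element body",
        ))
    }
}
```

For elements like SimpleBlock, the parser could read the few leading bytes it needs and then skip only the remainder.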

Package the tool with Docker

Use alpine as base image

Add position to elements

Keep track of position so that we could integrate the parser with a binary viewer UI.

turn repo into workspace

Currently the mkvdump bin and the library are in the same crate, thus we mix all the dependencies, although some of them are not used by the library.

Those who would like to use the library thus pay the price of adding those dependencies as well.

The soon-to-be wasm crate to use in the website also only needs to use the library.

Support Unknown sizes

https://github.com/ietf-wg-cellar/ebml-specification/blob/master/specification.markdown#unknown-data-size
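
As a sketch of what the linked section describes: an Element Data Size is a variable-size integer whose length is given by the position of the first set bit, and a value whose data bits are all ones means "unknown". The names `DataSize` and `parse_data_size` are hypothetical.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum DataSize {
    Known(u64),
    Unknown,
}

/// Parses an Element Data Size from the start of `input`.
/// Returns the size and the number of bytes it occupied.
pub fn parse_data_size(input: &[u8]) -> Option<(DataSize, usize)> {
    let first = *input.first()?;
    // The number of leading zero bits, plus one, is the encoded length.
    let len = first.leading_zeros() as usize + 1;
    if len > 8 || input.len() < len {
        return None;
    }
    // Drop the length marker bit from the first octet.
    let mut value = u64::from(first) & (0xFF >> len);
    for &byte in &input[1..len] {
        value = (value << 8) | u64::from(byte);
    }
    // All 7 * len data bits set to one is reserved for "unknown".
    let all_ones = (1u64 << (7 * len)) - 1;
    let size = if value == all_ones {
        DataSize::Unknown
    } else {
        DataSize::Known(value)
    };
    Some((size, len))
}
```

This is consistent with the sample output above, where the Segment has `size: Unknown` and a `header_size` of 12: a 4-byte ID followed by an 8-byte size field.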

Parse SimpleBlock

https://www.matroska.org/technical/basics.html#simpleblock-structure
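
A minimal sketch of reading the SimpleBlock header, assuming the layout from the linked page: a variable-size track number, a signed 16-bit big-endian relative timestamp, and a flags byte whose top bit marks a keyframe. The struct and function names are hypothetical, and lacing and the other flags are ignored here.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct SimpleBlockHeader {
    pub track_number: u64,
    pub timestamp: i16,
    pub keyframe: bool,
}

/// Parses only the leading bytes of a SimpleBlock body.
pub fn parse_simple_block(body: &[u8]) -> Option<SimpleBlockHeader> {
    let first = *body.first()?;
    // Track number is a variable-size integer, like an element size.
    let len = first.leading_zeros() as usize + 1;
    if len > 8 {
        return None;
    }
    // Two timestamp bytes and one flags byte follow the track number.
    let rest = body.get(len..len + 3)?;
    let mut track_number = u64::from(first) & (0xFF >> len);
    for &byte in &body[1..len] {
        track_number = (track_number << 8) | u64::from(byte);
    }
    Some(SimpleBlockHeader {
        track_number,
        timestamp: i16::from_be_bytes([rest[0], rest[1]]),
        keyframe: rest[2] & 0x80 != 0,
    })
}
```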

Compile to WASM

It would be pretty neat to have this crate compiled to WASM, and possibly in the future released as a Web app and/or extension.

Improve cli interface

We are not using clap's full power to deliver a nice cli interface yet.

Inline binary data with up to N bytes

Could display binary elements in the format
value: [00 0a 0b 0c]

if the value of the payload is smaller than N=64 (maybe).

Need to figure out how to do that with Serde YAML
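
The formatting part could look roughly like the sketch below, which produces a plain string that a serializer can emit as a scalar. `format_binary` is a hypothetical helper, and the "N bytes" fallback for larger payloads is an assumption, since the issue does not say what to display in that case.

```rust
/// Formats a binary payload inline as `[00 0a 0b 0c]` when it is smaller
/// than `max_inline` bytes; otherwise only reports its length.
pub fn format_binary(payload: &[u8], max_inline: usize) -> String {
    if payload.len() >= max_inline {
        // Assumed fallback representation, not specified in the issue.
        return format!("{} bytes", payload.len());
    }
    let hex: Vec<String> = payload.iter().map(|b| format!("{b:02x}")).collect();
    format!("[{}]", hex.join(" "))
}
```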

Implement streamed reading

Reading a GB file should not result in GB of memory utilized. The implementation could support that but the file reading is done with a single read-to-buffer call currently.

Could use VecDeque or some smarter library for buffered reading.
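
As one possible shape for this, the sketch below reads from any `Read` source through a fixed-size buffer, so memory use is bounded by the chunk size rather than the file size. The function name and the callback-based interface are assumptions; a real parser would also need to carry over elements that straddle chunk boundaries.

```rust
use std::io::{self, Read};

/// Feeds `reader` to `on_chunk` in pieces of at most `chunk_size` bytes.
/// Returns the total number of bytes read.
pub fn for_each_chunk<R: Read>(
    mut reader: R,
    chunk_size: usize,
    mut on_chunk: impl FnMut(&[u8]),
) -> io::Result<u64> {
    let mut buffer = vec![0u8; chunk_size];
    let mut total = 0u64;
    loop {
        let read = match reader.read(&mut buffer) {
            Ok(0) => break, // end of input (or a zero-sized buffer)
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        on_chunk(&buffer[..read]);
        total += read as u64;
    }
    Ok(total)
}
```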

Snapshot tests

Use insta.rs

Parse CodecPrivate for some codecs

Would be nice to see some parsing of CodecPrivate. Maybe implement this in a separate crate and use it here.

List of codecs:

automate homebrew tap update

A homebrew tap is available at https://github.com/cadubentzen/homebrew-mkvdump, but the releases of mkvdump do not update the tap automatically yet

Improve error handling

Right now it's quite fixed how the parsing will happen, but we could improve the error handling by defining better error return types, that can be asserted in tests.
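
A sketch of what a dedicated error type might look like, so that tests can assert on specific variants. The variant names and the `element_id_length` helper are hypothetical; the 4-byte limit matches the `EBMLMaxIDLength` value shown in the sample output.

```rust
use std::fmt;

#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    /// The input ended before the element could be read.
    NeedMoreData { missing: usize },
    /// The variable-size integer has no valid length marker.
    InvalidVarInt,
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::NeedMoreData { missing } => write!(f, "need {missing} more byte(s)"),
            Self::InvalidVarInt => write!(f, "invalid variable-size integer"),
        }
    }
}

impl std::error::Error for ParseError {}

/// Returns the encoded length of the Element ID at the start of `input`.
pub fn element_id_length(input: &[u8]) -> Result<usize, ParseError> {
    let first = *input.first().ok_or(ParseError::NeedMoreData { missing: 1 })?;
    let len = first.leading_zeros() as usize + 1;
    if len > 4 {
        return Err(ParseError::InvalidVarInt);
    }
    if input.len() < len {
        return Err(ParseError::NeedMoreData { missing: len - input.len() });
    }
    Ok(len)
}
```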

Generate first release

Pre-built webmdump binary in the GitHub releases.

Try to have it built with musl to maximize portability. At first, Linux only.

Add smoke tests to CI

Parse some full WebM files and enforce that it doesn't crash.

Add "no tree" mode

As we already parse the elements in a linear way, providing also the linear output, rather than tree mode, makes it easy to find elements.
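
If the tree is already built, a linear view can also be derived from it with a depth-first walk, as in this sketch. The `Element` struct here is a simplified, hypothetical stand-in for the crate's own types.

```rust
/// Simplified stand-in for a parsed element.
pub struct Element {
    pub id: String,
    pub children: Vec<Element>,
}

/// Flattens a forest of elements into `(depth, id)` pairs in file order.
pub fn flatten(roots: &[Element]) -> Vec<(usize, &str)> {
    fn walk<'a>(element: &'a Element, depth: usize, out: &mut Vec<(usize, &'a str)>) {
        out.push((depth, element.id.as_str()));
        for child in &element.children {
            walk(child, depth + 1, out);
        }
    }

    let mut out = Vec::new();
    for root in roots {
        walk(root, 0, &mut out);
    }
    out
}
```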

add rustdoc to enumerations

add deb package to releases

use cargo-deb

Use Element and Enumeration names as in the original XML files

Currently, we rely on the Enum variant names for the serializations, but we could have a #[serde(rename = ...)] to make them be displayed exactly as in the spec.

Would be nice for longer names with spaces and other characters.

Package tool for MacOS and Windows

Mostly CI work.
Is there some code signing needed?

fuzz testing

Fuzz testing this crate would be really nice. It also seems relatively simple to do, because many random byte sequences actually yield valid EBML content, so the parser can go into different code paths.

Use Element Paths for parsing and building the tree

Follow-up to #12. As Unknown sizes are a thing, building up the element tree could be done more elegantly by using paths specified in the XML file.

Currently, if we have a Master element of an unknown size, all elements following will be children of that Master element.

That does not work when concatenating multiple header files, or with clusters of unknown size.

Edit:
The element paths are also important for parsing, in order to recover from damaged elements.
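
The core of the idea can be sketched as follows, assuming each element carries a backslash-separated path such as `\Segment\Cluster\Timestamp`, as in the XML definitions. The parser keeps a stack of open master elements and, before attaching a new element, closes masters until the top of the stack is that element's parent. Function names are hypothetical.

```rust
/// Parent portion of an element path, e.g. `\Segment` for `\Segment\Cluster`.
fn parent_path(path: &str) -> &str {
    match path.rfind('\\') {
        Some(index) => &path[..index],
        None => "",
    }
}

/// Closes open master elements (stored as full paths) until the top of the
/// stack is the parent of `path`. Returns how many were closed.
pub fn close_until_parent(open: &mut Vec<String>, path: &str) -> usize {
    let parent = parent_path(path);
    let mut closed = 0;
    while let Some(top) = open.last() {
        if top.as_str() == parent {
            break;
        }
        open.pop();
        closed += 1;
    }
    closed
}
```

With this, a new `\Segment\Cluster` closes a previous Cluster of unknown size, and a new `\EBML` header closes everything, which covers the concatenation case described above.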

Check if musl would provide smaller binaries
Quick check of performance

improve mkvparser documentation

Now that there's a separate library crate for mkvparser, I should improve its documentation.

Mainly adding an example to the landing docs.rs page

Add more samples to snapshot tests

Add a few more samples to the test suite:

- H.264, VP9 and AV1 files
- Audio files and mixed files
- Encrypted and unencrypted files
- Different muxers: ffmpeg, shaka-packager (any others?)