
webcodecs's Introduction

WebCodecs

The WebCodecs API allows web applications to encode and decode audio and video.

Many Web APIs use media codecs internally to support APIs for particular uses:

  • HTMLMediaElement and Media Source Extensions
  • WebAudio (decodeAudioData)
  • MediaRecorder
  • WebRTC

But there’s no general way to flexibly configure and use these media codecs. Because of this, many web applications have resorted to implementing media codecs in JavaScript or WebAssembly, despite the disadvantages:

  • Increased bandwidth to download codecs already in the browser.
  • Reduced performance
  • Reduced power efficiency

WebCodecs is great for:

  • Live streaming
  • Cloud gaming
  • Media file editing and transcoding

See the explainer for more info.

Code samples

Please see https://w3c.github.io/webcodecs/samples/
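
For orientation before diving into those samples, here is a minimal decode sketch. The callback-based constructor, the configure/decode/flush calls, and the chunk shape follow the explainer; the demuxer that produces `chunks` is an assumption.

// Minimal sketch: decode VP8 chunks produced elsewhere (e.g. by an app-side
// demuxer) and paint each decoded frame onto a canvas.
const canvas = document.querySelector('canvas');
const ctx = canvas.getContext('2d');

const decoder = new VideoDecoder({
  output: (frame) => {
    ctx.drawImage(frame, 0, 0);  // a VideoFrame is drawable like an image
    frame.close();               // release the underlying media resource promptly
  },
  error: (e) => console.error('decode error:', e),
});

decoder.configure({ codec: 'vp8' });

for (const { type, timestamp, data } of chunks) {  // `chunks` from an assumed demuxer
  decoder.decode(new EncodedVideoChunk({ type, timestamp, data }));
}
await decoder.flush();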

WebCodecs registries

This repository also contains two registries:

  • The WebCodecs Codec Registry provides the means to identify and avoid collisions among codec strings used in WebCodecs and provides a mechanism to define codec-specific members of WebCodecs codec configuration dictionaries. Codec-specific registrations entered in the registry are also maintained in the repository; please refer to the registry for a comprehensive list.

  • The WebCodecs VideoFrame Metadata Registry enumerates the metadata fields that can be attached to VideoFrame objects via the VideoFrameMetadata dictionary. Metadata registrations entered in the registry may be maintained in this repository or elsewhere. Please refer to the registry for a comprehensive list.

webcodecs's People

Contributors

aboba, bdrtc, chcunningham, chrisguttandin, chrisn, dalecurtis, daxiajunjun, djuffin, dontcallmedom, fippo, foolip, guest271314, helaozhao, huachunbo, jimbankoski, josephrocca, mfoltzgoogle, padenot, pthatcherg, sandersdan, stazhu, steveanton, taste1981, tguilbert-google, tidoust, vilicvane, vinlic, yahweasel, yoavweiss, youennf


webcodecs's Issues

Packetization

The APIs make sense for encoding a bitstream, transporting it reliably, and then decoding and rendering it.

But what if you want to packetize an encoded bitstream within QUIC datagrams and then reassemble and decode it on the other side?

Or what if you want a mixture of reliable and unreliable transport? For example, you might want to provide different transport for keyframes and P-frames. So you might want to separate out portions of the encoded bitstream for reliable transport, and other portions for datagram transport.
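
As a rough illustration of the question, here is an app-level packetization sketch. The 1200-byte payload budget, the 6-byte header layout, and the per-chunk id are all assumptions for the example, not part of any proposal.

// Sketch: split one encoded chunk into datagram-sized packets with a small
// header (chunk id, packet index, packet count) so the receiver can reassemble
// complete chunks, drop incomplete ones, or treat keyframes specially.
const MAX_PAYLOAD = 1200; // assumed per-datagram budget

function packetize(chunkId, data /* Uint8Array of encoded bytes */) {
  const count = Math.ceil(data.byteLength / MAX_PAYLOAD);
  const packets = [];
  for (let i = 0; i < count; i++) {
    const payload = data.subarray(i * MAX_PAYLOAD, (i + 1) * MAX_PAYLOAD);
    const packet = new Uint8Array(6 + payload.byteLength);
    new DataView(packet.buffer).setUint16(0, chunkId); // bytes 0-1: chunk id
    packet[2] = i;                                     // byte 2: packet index
    packet[3] = count;                                 // byte 3: packet count
    // bytes 4-5: reserved (e.g. flags such as "keyframe")
    packet.set(payload, 6);
    packets.push(packet);
  }
  return packets; // each entry is sent as one datagram
}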

Decoding "out of order" video frames (B-frames)

Many video codecs support B-frames: frames which have dependencies on both previous and future (chronological) frames. Some decoders want to be fed packets in decode order (e.g., a future I-frame is fed before the B-frame that depends on it), while others take care of buffering and reordering internally.

This is a tracking issue for sketching how the WebCodecs API supports B-frames.

Memory Reuse and Decode Length

Would it be helpful to offer APIs that use pre-defined ring buffers to reduce garbage collection and maintain low latency? SharedArrayBuffer (SAB) could also be used for cross-realm/thread processing and browser support is returning.

Additionally, would it be helpful to control the decoder by specifying how many samples/frames to decode per call? We could decode quickly at first for low-latency playback and then gradually increase frame sizes after we have enough decoded data for playback continuity.

For example, consider a streaming audio AudioWorklet where GC is reduced using ring buffers and specifying 128 samples to decode synchronously (relates to #19).

audio-worklet-processer.js

// NOTE: the AudioDecoder options and decode() return value below are the
// hypothetical ring-buffer API proposed in this issue, not a shipped interface.

// ring buffer of encoded bytes (filled via "onmessage" or a SAB from the main/worker thread)
const inputBuffer = new ArrayBuffer(/* capacity */);

// ring buffers for 2.5 s of decoded stereo PCM @ 48,000 Hz
const outLeft  = new ArrayBuffer(Float32Array.BYTES_PER_ELEMENT * 48000 * 2.5); // ~469K
const outRight = new ArrayBuffer(Float32Array.BYTES_PER_ELEMENT * 48000 * 2.5); // ~469K

// decoded PCM sample views
const samplesLeft  = new Float32Array(outLeft);  // 120,000 samples
const samplesRight = new Float32Array(outRight); // 120,000 samples

// new stereo decoder (could also live on the Worker/main thread via SAB)
const decoder = new AudioDecoder({
  srcBuffer: inputBuffer,
  outputBuffers: [outLeft, outRight]
});

// buffer read/write index values
let inStart, inEnd, outStart, outEnd;

// return values after each decode() call
let totalSrcBytesUsed, totalSamplesDecoded;

// AudioWorkletProcessor.process - runs once per 128-frame render quantum
process(unusedInputs, outputs) {
  // specify the max samples to decode (could also be called on the Worker/main thread)
  ({ totalSrcBytesUsed, totalSamplesDecoded } = decoder.decode({ maxToDecode: 128 }));

  // update src & output buffer read/write indexes
  // ...

  // copy decoded [samplesLeft, samplesRight] into `outputs`
  // ...
}

Byte stream formats

As currently defined, WebCodecs supports packetized codecs, where we expect one decoded frame per encoded chunk. For some codecs (eg. H.264 in Annex B format), it makes sense to use a byte stream instead.

This changes the interface of an encoder or decoder, so it's not a trivial change. It doesn't seem to be compatible with our flush or configure model unless streams gain support for flush.
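
To make the distinction concrete, here is a sketch of an app-level conversion from an Annex B byte stream to length-prefixed (AVCC-style) NAL units that fit the one-chunk-per-frame model. The start-code scan is standard H.264 framing; the function itself is illustrative and ignores corner cases such as trailing zero padding.

// Sketch: split an Annex B buffer on 3/4-byte start codes and re-emit each
// NAL unit with a 4-byte big-endian length prefix (AVCC-style framing).
function annexBToLengthPrefixed(bytes /* Uint8Array */) {
  const starts = [];
  for (let i = 0; i + 2 < bytes.length; i++) {
    if (bytes[i] === 0 && bytes[i + 1] === 0 && bytes[i + 2] === 1) {
      starts.push(i + 3);  // NAL payload begins right after the start code
      i += 2;
    }
  }
  const out = [];
  for (let n = 0; n < starts.length; n++) {
    let end = n + 1 < starts.length ? starts[n + 1] - 3 : bytes.length;
    // a 4-byte start code looks like 00 00 00 01; trim its extra leading zero
    if (n + 1 < starts.length && bytes[end - 1] === 0) end -= 1;
    const nal = bytes.subarray(starts[n], end);
    const framed = new Uint8Array(4 + nal.length);
    new DataView(framed.buffer).setUint32(0, nal.length);
    framed.set(nal, 4);
    out.push(framed);
  }
  return out; // concatenate per access unit before handing to a decoder
}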

Video frame type

At TPAC we got a lot of questions about what type to use for unencoded video frames. Currently the explainer uses the ImageData type, but that might not be efficient to implement.

How to efficiently process video frames is really a larger problem than what we want to solve in WebCodecs, but we're getting a lot of questions about it, so we should try to figure something out (or find a different group to work on it).

Explainer: "MediaRecorder allows encoding a MediaStream that has audio and video tracks."

Explainer presently states

MediaRecorder allows encoding a MediaStream that has audio and video tracks.

Technically, as currently specified, MediaRecorder does not provide a means to record multiple video tracks within a single MediaStream.

w3c/mediacapture-record#168

Recording multiple tracks was intended to be possible from day one. Many formats handle multiple tracks.

When it was pointed out that a lot of container formats couldn't handle increasing the number of channels mid-recording, we were left with two choices:

  • Make the behavior dependent on container format (unpalatable)
  • Make the behavior consistent, but not very useful (i.e., stop).

The WG chose the latter.

I am not aware of any change in the landscape of container formats that indicates that varying the number of tracks is a generally available option. If you know of such changes, please provide references.

As for the "replace track" option - I don't think anyone thought about that possibility at the time.

The fact that MediaRecorder specification does not provide a means to record multiple video tracks is a problem that this proposal can resolve.

For example, it should be possible to write code similar to

const merged = await decodeVideoData(["video1.webm#t=5,10", "video2.mp4#t=10,15", "video3.ogv#t=0,5"], { codecs: "openh264" });

Capabilities

We need an API for capabilities, like what MediaCapabilities has. Is that enough for us? What input do we need to provide to MediaCapabilities?
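
For comparison, here is a sketch of two possible query shapes: reusing Media Capabilities as-is, and a codec-level check along the lines of an isConfigSupported() static method; the latter shape is an assumption here rather than anything specified.

// (a) Reuse Media Capabilities (existing API) for a high-level answer.
const info = await navigator.mediaCapabilities.decodingInfo({
  type: 'file',
  video: {
    contentType: 'video/mp4; codecs="avc1.42E01E"',
    width: 1280, height: 720, bitrate: 2_000_000, framerate: 30,
  },
});
console.log(info.supported, info.smooth, info.powerEfficient);

// (b) A codec-level query on the decoder itself (assumed method shape).
const { supported } = await VideoDecoder.isConfigSupported({
  codec: 'avc1.42E01E',
  codedWidth: 1280,
  codedHeight: 720,
});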

Latency and bidirectionally predicted frames in VideoEncoder.

VideoEncoder can increase coding efficiency by referencing future frames when coding a frame (e.g., B-frames). However, this also increases latency, since the future frames need to be decoded before the bidirectionally predicted frames. This is not desirable in use cases such as video conferencing.

How about adding maxPredictionFrames -

dictionary VideoEncoderEncodeOptions {
  .. 
  unsigned long maxPredictionFrames;
};

which gives the maximum number of future frames to use for prediction. Setting maxPredictionFrames to 0 would disable B-frames altogether.
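
A usage sketch with the proposed option (maxPredictionFrames is the member suggested above, not an existing one):

// Sketch: `encoder` is a configured VideoEncoder, `frame` a VideoFrame.
// 0 disables B-frames entirely, which suits latency-sensitive paths such as conferencing.
encoder.encode(frame, { maxPredictionFrames: 0 });

// A file-transcoding path could instead allow a short look-ahead:
encoder.encode(frame, { maxPredictionFrames: 2 });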

Related to #55 .

Support for web workers

How are WebCodecs objects supposed to work with web workers? Should a MediaStreamTrack be transferable between the main thread and a web worker?
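
A sketch of one possible split, assuming the codec interfaces are exposed in workers and that VideoFrame and ArrayBuffer are transferable; the VideoFrame part is exactly the open question here.

// main.js - sketch: keep decoding off the main thread.
const ctx = document.querySelector('canvas').getContext('2d');
const worker = new Worker('decode-worker.js');
worker.onmessage = ({ data: frame }) => {
  ctx.drawImage(frame, 0, 0);  // frame: a VideoFrame transferred from the worker
  frame.close();
};
// (type, timestamp, data come from an app-side demuxer)
// the ArrayBuffer is transferred, not copied
worker.postMessage({ type, timestamp, data }, [data]);

// decode-worker.js
const decoder = new VideoDecoder({
  output: (frame) => self.postMessage(frame, [frame]),  // transfer the frame out
  error: console.error,
});
decoder.configure({ codec: 'vp8' });
self.onmessage = ({ data: { type, timestamp, data } }) => {
  decoder.decode(new EncodedVideoChunk({ type, timestamp, data }));
};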

Buffer reuse

It can be much more efficient to decode to and encode from pools of buffers rather than allocating for every frame. This likely requires a way to return buffers quickly (more quickly than GC).

Encoded data is typically small enough that copies into/out of a pool are a negligible cost, but we should verify that.
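
A sketch of app-side reuse, assuming decoded frames can be copied into caller-provided buffers; the allocationSize()/copyTo() methods used here are an assumed shape, roughly where such an API could head.

// Sketch: recycle a small pool of ArrayBuffers instead of allocating per frame.
const pool = [];

async function readInto(frame) {
  const size = frame.allocationSize();                   // bytes needed for the copy
  let buffer = pool.pop();
  if (!buffer || buffer.byteLength < size) buffer = new ArrayBuffer(size);
  await frame.copyTo(buffer);                            // write pixel data into our buffer
  frame.close();                                         // release the decoder-owned memory early
  return buffer;                                         // caller pushes it back into `pool` when done
}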

How to express encoded size vs. visible size vs. natural size

When we express sizes, how do we distinguish between coded size, visible size, and natural size? From what I understand:

  • coded size: the resolution of the encoded picture, always a multiple of the macroblock size (eg. 1920x1088).
  • visible region: the region of the coded picture that is valid image data (eg. 1920x1080@0,0).
  • natural size: the intended display size assuming square display pixels (that is, after applying the pixel aspect ratio).

Negotiating colorspaces

Ideally decoded content would be in the same colorspace as the encoded content, and colorspace negotiation is "just" a metadata management problem. Android MediaCodec works differently:

  • MediaCodec requires us to supply HDR metadata up front, and provides output Textures. To support a decoder like this, we would need to specify colorspace information up-front, and that means it has to be a per-codec-profile description.
  • Thus colorspace is a property of a codec as much as it is a property of a frame.
  • A decoded frame could be in a different colorspace from the encoded media.
  • There is no obvious way to handle transcoding generically.

Open questions:

  • Is there a workaround for MediaCodec? If so, can we assume that all future decoders would also allow unprocessed access to the decoded frames?
  • Should we be implementing colorspace conversions for encoding, or is this always something the app should do? (Related: should encoders handle scaling content or is this also an app concern?)

Is duration needed ?

readonly attribute unsigned long long? duration;  // microseconds

The duration is an optional attribute on the VideoFrame object that represents, in microseconds, the time interval for which the video composition should render the composed VideoFrame.

VideoFrames to VideoDecoderOutputCallback in decoding order, not presentation order.

Modern standards allow a frame to predict its content temporally from both past and future frames (B-frames). Encoded chunks are usually stored in encoding/decoding order, which may differ from presentation order. VideoFrames are given to VideoEncoder in presentation order; the encoder may reorder them before coding and outputs encoded video chunks in decoding order. Similarly, encoded chunks are given to VideoDecoder in decoding order.

What if VideoDecoder calls VideoDecoderOutputCallback as soon as a video frame has finished decoding, i.e. it gives VideoFrames to VideoDecoderOutputCallback in decoding order, not presentation order? In use cases like video playback, the Web App then needs to reorder the VideoFrames into presentation order. Although this adds a bit of Web App complexity, it has several advantages:

  • The UA does not need to maintain a queue of decoded VideoFrames that will be given to the Web App sometime later, after other frames.
  • VideoFrames are given to the Web App as soon as possible, which gives the Web App more time to process frames before displaying them.
  • Useful if the Web App is in full control of synchronization and of displaying the frames.

How about adding decodingSequence and presentationSequence like -

interface EncodedVideoChunk {
  ... 
  readonly attribute unsigned long decodingSequence;
};

[Exposed=(Window)]
interface VideoFrame {
  ...
   readonly attribute unsigned long presentationSequence;
};

In EncodedVideoChunk, decodingSequence would denote the coded order and would be filled in by the Web App before giving the chunk to VideoDecoder. VideoDecoder would fill in presentationSequence on the VideoFrame before calling the OutputCallback.
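
To illustrate the work this pushes to the Web App, here is a reorder-buffer sketch using the proposed presentationSequence field (hypothetical, per the IDL above); present() stands in for the app's display/scheduling step.

// Sketch: hold decoded frames until the next expected presentation index arrives.
const pending = new Map();   // presentationSequence -> VideoFrame
let nextToPresent = 0;

function onDecodedFrame(frame) {            // VideoDecoderOutputCallback, decode order
  pending.set(frame.presentationSequence, frame);
  while (pending.has(nextToPresent)) {
    present(pending.get(nextToPresent));    // app-defined display/scheduling step
    pending.delete(nextToPresent);
    nextToPresent++;
  }
}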

There have been similar discussions in #7.

Support for images

Is there a path for using this proposal to encode/decode images?

I suspect it's possible by implementing the image container and somehow getting the binary keyframe out, but how to plumb this is somewhat unclear.

Given that a lot of developers have been asking for functionality to encode/decode images without using an HTMLImageElement as an intermediary, it would be nice to have this use case covered.
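
For reference, a sketch of what dedicated image decoding could look like; the ImageDecoder shape below (data plus MIME type in, a frame out) is an assumption for this issue, not settled API.

// Sketch: decode the first frame of an image without an HTMLImageElement.
const ctx = document.querySelector('canvas').getContext('2d');
const response = await fetch('photo.avif');
const decoder = new ImageDecoder({
  data: response.body,   // a ReadableStream (a BufferSource would also work)
  type: 'image/avif',
});
const { image } = await decoder.decode({ frameIndex: 0 }); // image: a VideoFrame
ctx.drawImage(image, 0, 0);
image.close();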

Timestamps and time domains

This is a tracking issue for describing how timestamps work with WebCodecs and integrate with the video and audio playout time domains.

What should the shape of the API be?

A number of considerations all combine in somewhat complex ways, such as (re)initialization, buffering, failure recovery, and flushing. Obviously we'd like a good API for all of these, but there are tradeoffs between different options. This issue is for tracking the discussion of what we want the API to look like.

Here are some options. Note that it may be possible to have different options for encode and decode. For example, we could do a combination of B for encode and D for decode.

Option A: new Encoder/Decoder for each change

Every time you want to change something that requires (re)initialization, such as changing the codec or resolution, create a new Encoder/Decoder. Also reinit every time a flush is desired.

Pros:

  • Simple API
  • It's clear when (re)init fails, and recovering is straightforward
  • If we have buffered frames, it's clear which initialization applies to which frames.
  • Flushing is just closing the .writable

Cons:

  • Dealing with downstream changes to the pipeline may be difficult (because you have a new .readable that you need to pipe somewhere).
  • Dealing with upstream changes to the pipeline may be difficult (because you have a new .writable that you need to pipe into).
  • It may be much more efficient to have only one encoder/decoder around at a time which is difficult to manage when the JS is creating new ones for every resolution change and/or flush.

Option B: An Initialize() method for each change

If a change requires a reinitialization, call Initialize(), as many times as you want. The .writable and .readable are stable.

Pros:

  • It's clear when (re)init fails, and recovering is straightforward
  • .readable is the same across multiple initializations, which makes the downstream consumption easier (nothing to re-pipe).
  • .writable is the same across multiple initializations, which makes the upstream production easier (nothing to re-pipe).
  • Reinitialization can be efficient because the implementation can keep only one encoder/decoder around at a time.

Cons:

  • Flushing must be a separate method (since you can only call .close() on the .writable once)
  • If buffering is used, which initialization applies to which frames becomes unclear. If we don't buffer on the .writable, we can avoid this, but that means that we must require the JS to respect .ready and we must deal with what happens when it does not.

Option C: An Initialize() method that produces a new WritableStream

If a change requires a reinitialization, call Initialize(), as many times as you want. The .readable is stable, but not the .writable (if there is one).

Pros:

  • It's clear when (re)init fails, and recovering is straightforward
  • .readable is the same across multiple initializations, which makes the downstream consumption easier (nothing to re-pipe).
  • Reinitialization can be efficient because the implementation can keep only one encoder/decoder around at a time.
  • Flushing is just closing the WritableStream
  • If we have buffered frames, it's clear which initialization applies to which frames.

Cons:

  • Dealing with upstream changes to the pipeline may be difficult (because you have a new Writable that you need to pipe into).
  • It's not exactly a TransformStream any more
  • Transferring streams may be tricky

Option D: In-band parameters

To reinitialize, put new parameters on the chunk passed into the .writable. Init failure is conveyed via a write failure.

Pros:

  • For decode, the source of frames (such as a media container) is likely related to the decoding parameters desired, making this a convenient/natural fit.
  • Cleaner encoder/decoder API (no extra Initialize/Flush methods)
  • Transferring streams is likely less tricky
  • If we have buffered frames, it's clear which initialization applies to which frames.
  • Upstream and downstream piping is easy (since both .readable and .writable are stable)
  • Flushing is just closing the .writable

Cons:

  • (Re)init failure recovery isn't straightforward. You'll likely need to catch an exception on .pipeTo and probably want to use preventCancel.
  • For encode, the source of frames (such as a MediaStreamTrack tied to a VideoTrackReader) likely isn't related to the encoding parameters desired, making this an inconvenient fit.
  • More complex chunk types (not just EncodedVideoFrame and VideoFrame; more things need to be on there)

Option E: Internal reinitialization

Instead of asking for an init, just give it what you want and have it (re)init when it needs to. There is a fine line between this and Option D. But consider resolution changes. Instead of specifying that the codec reinit with a new size, you just give it whatever frame comes from a MediaStreamTrack and it reinits based on that size. Similarly, an EncodedVideoFrame could simply express what codec it is and the decoder deals with whatever it is.

Pros:

  • API is easier to use, and simple
  • Transferring streams is likely less tricky
  • No problems with buffering and which settings apply to which frames
  • Upstream and downstream piping is easy (since both .readable and .writable are stable)
  • Flushing is just closing the .writable

Cons:

  • (Re)init failure recovery isn't straightforward. You'll likely need to catch an exception on .pipeTo and probably want to use preventCancel.
  • It's easy for performance issues to creep in and for the API to become too automatic/implicit. For example, if a codec switch happens and the new codec is only available via software, not hardware, should we reinit internally from hardware to software? This leads to a higher-level, more automatic API with constraints that are difficult to specify and keep consistent across browsers, somewhat like getUserMedia, whereas this API was initially intended to be low-level and explicit about performance.

Consider Blob for access to raw buffers

Due to the lack of read-only ArrayBuffers, extra copying may be necessary to make some WebCodecs APIs safe.

Blobs however can be read-only. We should investigate whether supporting Blobs in some APIs would be beneficial.

Refine encoder configuration of bitrate and/or quantization

How about we leave rate control to the Web App, because it depends heavily on the use case but is not a computationally heavy algorithm? If we go that route, it implies the bitrate attribute can be removed from VideoEncoderTuneOptions:

dictionary VideoEncoderEncodeOptions {
  .. 
  unsigned long long quantization; 
};

It would affect the rate-distortion balance so that 0 would be maximum quality and highest bitrate while 255 would be the worst quality with the smallest encoded chunk size. By controlling the quantization value, the Web App could achieve the desired bit rate, or a variable bit rate at constant quality, whichever is preferred. Internally, quantization would be mapped to the encoder quantization parameter depending on the platform encoder. The highest quality setting (0) could then imply lossless coding if the selected codec supports it.

@sandersdan knows better if there's support for this in all platform decoders.

Add Audio/VideoTrackWriter to Explainer

It should be possible to mux encoded content using VideoTrackWriter with MediaRecorder (encoded frames → VideoTrackWriter → VideoTrack → MediaRecorder → muxed content). We would need to verify that this combination is capable of correctly representing reordered frames (or reject them at runtime).

The same API could be used to handle encoding as well (raw frames → VideoTrackWriter → VideoTrack → MediaRecorder → muxed content). It's unclear if this is useful to WebCodecs users.

Synchronous encoding/decoding

@padenot mentioned there are some use cases for synchronously encoding/decoding media in contrast to the current API proposal which encourages/mandates asynchronous execution.

Add an ErrorEvent.

It would be nice to add an ErrorEvent so errors can be handled in an error EventListener. I understand that with promises we can always reject and throw errors that we cannot escape, but for other "minor" errors it's good to handle them in the onerror EventListener. It would help us distinguish different kinds of errors, temporary vs. permanent, etc.

const videoDecoder = new VideoDecoder({
  output: someCanvas
});

videoDecoder.configure({codec: 'vp8'}).then(() => {
  streamEncodedChunks(videoDecoder.decode.bind(videoDecoder));
}).catch(() => {
  // App provides fallback logic when config not supported.
  ...
});
...

videoDecoder.onerror = (event) => {
  console.log(`Error: ${event.error.name}`);  // event.error per the strawman below
};

OR

videoDecoder.addEventListener('error', (event) => console.log(`Error: ${event.error.name}`));

Strawman proposal :

[SecureContext, Exposed=(DedicatedWorker, Window)]
interface WebcodecsErrorEvent : Event {
  constructor(DOMString type, WebcodecsErrorEventInit errorEventInitDict);
  readonly attribute DOMException error;
};

dictionary WebcodecsErrorEventInit : EventInit {
  required DOMException error;
};


partial interface VideoDecoder {
  ...
  attribute EventHandler onerror;
};

Somewhat related to #49

Threading

Chromium's WebCodecs implementation will offload decoding to a separate thread. We probably want to specify this behavior.

An alternative would be to specify a Worklet type for codec implementations, but this doesn't seem compatible with having a codec that uses many threads.

How would Web Codecs support extracting PCM data for a specific time range?

Hey,

As you are likely aware, there is a huge and painful limitation in the Web Audio API: accessing only a specific time range of audio sample data is not possible in any remotely feasible fashion without loading the entire audio file into memory.

We are looking forward to finally putting this limitation behind us using WebCodecs. Are there any plans for supporting extraction of raw PCM audio data for a specific time range, say from 5 seconds to 10 seconds (obviously given an audio file not shorter than 10 seconds for this specific example)?
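
A sketch of how this could look on the app side, assuming an app-provided demuxer that can seek and yield EncodedAudioChunks, and an AudioData output with a timestamp and a copyTo() method (an assumed shape for the purposes of this issue).

// Sketch: collect decoded PCM (channel 0) between 5 s and 10 s.
const RANGE_START = 5_000_000;   // microseconds
const RANGE_END   = 10_000_000;
const pcm = [];                  // Float32Array pieces within the range

const decoder = new AudioDecoder({
  output: (audioData) => {
    if (audioData.timestamp >= RANGE_START && audioData.timestamp < RANGE_END) {
      const dest = new Float32Array(audioData.numberOfFrames);
      audioData.copyTo(dest, { planeIndex: 0, format: 'f32-planar' });
      pcm.push(dest);
    }
    audioData.close();
  },
  error: console.error,
});
decoder.configure({ codec: 'opus', sampleRate: 48000, numberOfChannels: 2 });

// `demuxChunks()` is an assumed app-side demuxer that can seek near 5 s.
for (const chunk of demuxChunks('song.webm', RANGE_START)) {
  decoder.decode(chunk);         // chunk: EncodedAudioChunk
}
await decoder.flush();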

Streaming image decoding

const decoded = await imageDecoder.decode(input);
const canvas = ...;
canvas.getContext('2d').putImageData(decoded, 0, 0);

The above (from the explainer) suggests that decoding doesn't stream, which feels like a missed opportunity.

In its current state, it feels like it'd be better to change createImageBitmap so it could accept a stream.

Maybe that should happen, while a whole new API could expose streamed image decoding.

Images can stream in a few ways:

  • Yielding pixel data from top to bottom.
  • Yielding pixel data with increasing detail (progressive/interlaced).
  • Yielding frames.

This would allow partially-loaded images to be used in things like <canvas>.

Interleaved / Non-Interleaved Decoded Audio

After audio is decoded, would it be best for the decoder to return 1 buffer of interleaved audio or multiple buffers of decoded audio per channel (e.g. 2 buffers for stereo audio)?

Would de-interleave functionality be natively provided by the platform/UA or would it be the developer's responsibility? I think the former could be an easier API for devs to use.
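
If the platform returned interleaved samples, the app-side fallback would be a small loop like the sketch below (plain JS, no WebCodecs API assumed):

// Sketch: split interleaved stereo PCM [L0, R0, L1, R1, ...] into planar buffers.
function deinterleave(interleaved /* Float32Array */, channelCount) {
  const frames = interleaved.length / channelCount;
  const planes = Array.from({ length: channelCount },
                            () => new Float32Array(frames));
  for (let frame = 0; frame < frames; frame++) {
    for (let ch = 0; ch < channelCount; ch++) {
      planes[ch][frame] = interleaved[frame * channelCount + ch];
    }
  }
  return planes;   // e.g. [left, right] for stereo
}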

Predicted frames and error resiliency.

During encoding, it is often desirable to limit the number of predicted frames between keyframes. For example, if a frame is lost during transmission, all frames predicted from it are also lost until the next keyframe. However, reducing the number of predicted frames also worsens the rate-distortion behavior. Therefore a Web App should be allowed to choose the number of predicted frames (often called the GOP size).

How about adding max_prediction -

dictionary VideoEncoderEncodeOptions {
  .. 
  unsigned long max_prediction;
};

There could also be -

dictionary VideoEncoderEncodeOptions {
  .. 
  octet error_resiliency;
};

which would enable codec-specific error resiliency features, such as extra resynchronization headers, if greater than 0.

Related to #57

Error Recovery

After an error in processing (configure, decode, or otherwise), there are multiple potential ways we could allow recovery:

  • Automatically recover at the next keyframe.
  • Recover after the next configure().
  • Recover after the next reset().
  • The above, but discard (abort) all queued requests first.
  • Do not allow recovery at all.

The latter two are easier to reason about, but some apps would be simpler with a different choice. Perhaps it should be configurable.

(Example that is difficult to reason about: if a configure() fails or is aborted, what configuration would we use to decode the next keyframe? This may be a reason to require configure() after reset()--it is always unambiguous.)

Support for content protection

I would like to understand how WebCodecs supports content protection. In WebRTC NV Use Cases, we initially had a use case where JavaScript could be trusted with keys used to encrypt or decrypt protected content. That use case was removed after the IESG objected. So the question is how WebCodecs can address the only remaining use case (untrusted JavaScript).

Specify options or a method to encode pixel dimensions (width and height) of individual frames

Background

MediaRecorder supports the VP8 video codec in Chromium, Chrome, and Firefox. However, the specification provides no means to programmatically set encoder options, even where available in the implementation source code, for the width and height of individual frames (images) of input. The result in Chromium and Chrome is that "video/webm;codecs=vp8" and "video/webm;codecs=vp9" produce a WebM file that does not match the input width and height when the input frames have variable width and height (https://bugs.chromium.org/p/chromium/issues/detail?id=972470; https://bugs.chromium.org/p/chromium/issues/detail?id=983777).

The only code shipped with Chromium that has been able to record variable width and height input, outputting frames identical to the input frames, uses "video/x-matroska;codecs=h264" or "video/x-matroska;codecs=avc1" (https://plnkr.co/edit/Axkb8s?p=info). Although WebM is technically specified to support only VP8 or VP9, Chromium and Chrome in fact support other video codecs in the WebM container (https://bugzilla.mozilla.org/show_bug.cgi?id=1562862; https://bugs.chromium.org/p/chromium/issues/detail?id=980822; https://bugs.chromium.org/p/chromium/issues/detail?id=997687; https://bugs.chromium.org/p/webm/issues/detail?id=1642). When the codecs are changed to VP8 or VP9, the resulting WebM file does not output the correct pixel dimensions corresponding to the input MediaStreamTrack.

Mozilla Firefox and Nightly do record and encode the correct variable-size input video frames, both when using MediaRecorder to create media files and when using MediaSource to play media files back.

Proposed solution

Specify options or a method to encode the pixel dimensions (width and height) of individual frames, and make sure the options and/or method outputs the expected result; if not, write code from scratch that achieves that requirement, to be included in the Web Codecs specification.

For example, using code at the Explainer

const videoEncoder = new VideoEncoder({
  codec: "vp9", 
  // code
});

include options or a method to explicitly set the encoder to encode each input frame's width and height, to avoid the Chromium/Chrome behavior of outputting pixel dimensions that do not match the input dimensions.

WebGL access to decoded planes

Current proposals provide for accessing decoded frames as WebGL textures, but these would be RGB, implying a conversion for most content. We should investigate whether we can provide access to individual planes.

Progressive loading for textures

In WebGL applications we face the issue that downloading textures from the server and uploading them to the GPU is a pretty slow operation. On the web we have progressive JPEG/WebP, but we can't access the individual levels of detail, so we need to create 2-4 versions of a file, each of which is already in a progressive format.

In the related WebGPU issue gpuweb/gpuweb#766 they suggested Basis and this repo to me. But it looks like Basis would need to be smaller in size, or accessible via the JS fetch API?

Is it possible to do our codec or media processing based on WASM

Hello.
We have a requirement to do our own encoding/decoding/packetization in WASM, and then transport the data via WebTransport or RTCQuicTransport. At the beginning, we might also need to do some media processing besides the encoding/decoding.

What we need is:
1) Read the raw data of captured media, such as YUV video data or PCM audio data, so that the raw data can be passed to a wasm/js module.
2) It would be better if the capture could run on a worker.
3) The data passing is efficient, without memory copies.
4) Hardware-accelerated encoders/decoders usable from the wasm/js module would be a plus.

Could this spec satisfy our requirements?

Specify options to get metadata and order of existing tracks in input and set track order of muxer and, or writer

Background

Merging Matroska and WebM files requires at least

  1. Metadata about the input (media stream or file), specifically, a) if both A (audio) and V (video) tracks exist in the input media;
  2. The order of the A and V tracks in the file; could be AV or VA
  3. If either the A or V track does not exist in the input file, the missing A or V track must be created (A can output silence; e.g., V can output #FFFFFF or #000000 frames) for the purpose of merging N Matroska or WebM files into a single file.

See

The WebM file output by the MediaRecorder implementations in Chromium and Chrome, Mozilla Firefox and Nightly can have arbitrary AV track order, in general, per the Media Capture and Streams specification (https://www.w3.org/TR/mediacapture-streams/):

The tracks of a MediaStream are stored in a track set. The track set MUST contain the MediaStreamTrack objects that correspond to the tracks of the stream. The relative order of the tracks in the set is User Agent defined and the API will never put any requirements on the order.

Proposed solution

The Web Codecs specification should define a means to get input media track order and set output file track order.

Support for SVC scalability modes

The VideoEncodeLayer dictionary uses the approach to SVC taken in the ORTC API. Unfortunately, this approach cannot support non-hierarchical scalability modes such as K-SVC. That is why we took a different approach in WebRTC-SVC. Can we instead use an approach based on scalabilityModes?
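
For illustration, a configuration sketch using a WebRTC-SVC style scalabilityMode string; the member name and its acceptance in VideoEncoderConfig are assumptions here.

// Sketch: request three temporal layers on a single spatial layer ("L1T3").
const onChunk = (chunk, metadata) => { /* forward per-layer chunks to transport */ };
const encoder = new VideoEncoder({ output: onChunk, error: console.error });
encoder.configure({
  codec: 'vp09.00.10.08',
  width: 1280,
  height: 720,
  scalabilityMode: 'L1T3',  // WebRTC-SVC mode identifier
});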

Add an example demonstrating media editing scenarios

One of the key use cases mentioned in the explainer is
Non-realtime encoding/decoding/transcoding, such as for local file editing

Can we add an example demonstrating basic media editing operations like trim and concatenation?

We are trying to understand how to use WebCodecs to achieve the same functionality as MediaBlob, to ensure that it meets developer needs.
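
As a starting point, here is a rough sketch of a trim operation, assuming an app-side demux()/mux() pair and re-encoding the kept range (timestamps in microseconds):

// Sketch: keep [2 s, 7 s) of a clip by decoding, filtering by timestamp,
// and re-encoding with rebased timestamps. demux()/mux() are assumed helpers.
const KEEP_START = 2_000_000, KEEP_END = 7_000_000;

const encoder = new VideoEncoder({
  output: (chunk, metadata) => mux(chunk, metadata),   // app-side muxer
  error: console.error,
});
encoder.configure({ codec: 'vp8', width: 1280, height: 720 });

const decoder = new VideoDecoder({
  output: (frame) => {
    if (frame.timestamp >= KEEP_START && frame.timestamp < KEEP_END) {
      const rebased = new VideoFrame(frame, { timestamp: frame.timestamp - KEEP_START });
      encoder.encode(rebased);
      rebased.close();
    }
    frame.close();
  },
  error: console.error,
});
decoder.configure({ codec: 'vp8' });

for (const chunk of demux('clip.webm')) decoder.decode(chunk);
await decoder.flush();
await encoder.flush();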

API for containers?

It comes up as a common question: can we have an API for media containers? It's something that can be done in JS/wasm and is arguably orthogonal to WebCodecs. But for some formats that you might consider video (GIF, (M)JPEG), the line between container and codec is blurry.

This is a tracking issue for a conversation around this topic. My current opinion is to leave it out of WebCodecs until it's more mature and then perhaps readdress it later.

Importing / exporting out of band side data

Many codecs require additional information that is not stored in the bitstream. This information is usually stored in the container format. For example, H264 requires SPS and PPS to be specified out of band when using the AVCC format.

WebCodecs must have a way to get the exported side data when encoding and provide a way to specify the side data when decoding.
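
A sketch of one possible shape, where out-of-band data rides on the codec configuration and on encoder output metadata; the description member, metadata.decoderConfig, and the muxer/avcCBytes objects are all assumptions here.

// Sketch: carry avcC (SPS/PPS) out of band on the decoder config.
// `avcCBytes` and `muxer` are assumed app-side objects.
const decoder = new VideoDecoder({ output: (frame) => frame.close(), error: console.error });
decoder.configure({
  codec: 'avc1.42E01E',
  description: avcCBytes,   // BufferSource holding the container's avcC payload
});

// On the encode side, exported side data could surface in output metadata.
const encoder = new VideoEncoder({
  output: (chunk, metadata) => {
    if (metadata && metadata.decoderConfig) {
      // hand the avcC-style description to the muxer once it appears
      muxer.setCodecPrivate(metadata.decoderConfig.description);
    }
    muxer.addChunk(chunk);
  },
  error: console.error,
});
encoder.configure({ codec: 'avc1.42E01E', width: 1280, height: 720 });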

Support VideoFrame creation from yuv data

Currently there is no way of creating a VideoFrame from existing image data, so the only way to implement a custom media stream track is to paint it onto a canvas and capture the stream from that.

This would allow implementing custom decoders in WASM and would halve the work required for the funny hat use case.
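
A sketch of what raw-data construction could look like; the BufferSource constructor and the init members below (format, codedWidth, codedHeight, timestamp) are an assumed shape for this issue.

// Sketch: wrap a tightly-packed I420 buffer (e.g. produced by a WASM decoder)
// in a VideoFrame.
const width = 640, height = 480;
const yuv = new Uint8Array(width * height * 3 / 2);  // Y plane + quarter-size U and V planes

const frame = new VideoFrame(yuv, {
  format: 'I420',
  codedWidth: width,
  codedHeight: height,
  timestamp: 0,            // microseconds
});
// ... draw it, encode it, or feed it to a custom track, then release it:
frame.close();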

Encoder and decoder stats

Many applications may wish to get statistics from the encoder/decoder in real time. Examples:

  • Per-frame encode/decode durations. Real-time applications may use this to decrease the quality of the media stream if the encoder or decoder is taking too long.
  • Per-frame quality information. For example, getting the QP value of a video frame.

Not sure if there's as much need for cumulative statistics across frames.

More ideas?

Add privacy considerations section

At TPAC it was brought up that there are some privacy concerns with low-level codec access. We should try to address some of these concerns in the explainer.

A couple concerns that were brought up:

  • Precise timing information of encode/decode and codec limitations could be a fingerprinting surface.
  • Codec capabilities are a fingerprinting surface.

Audio rendering and AV sync

From the TrackWriter section of the explainer:

Decoded audio could be passed to WebAudio or the audio device client API. We may need dynamic feedback from the audio api to say how much delay the platform's internal buffering adds to the playout of the most recently provided buffer.

I'm not an expert on those Audio APIs so yell if I'm missing something obvious. @padenot @hoch

Backpressure/buffering for hardware decoders?

The Streams spec assumes you can get some backpressure signal from the implementation of a TransformStream.

But a hardware codec may not give you that much control over what happens after frame data is passed into the codec API. It will typically decode the frame data immediately and render it into a GPU frame buffer.

So the high level feedback is: the spec (at a future level of maturity) should map the behavior of an abstract codec onto the behavior of the TransformStream, and allow codec implementations that run in immediate mode (without buffering). (Or require implementations to do this buffering internally.)

Decoding errors

How is the application supposed to be informed about decoding errors?

Put the change of parameters inline

A reminder to:

Rather than call transform_stream.setParameters(...), attach new parameters to the stream that go into the WritableStream.

Simulcast support

What is the intention regarding simulcast support?

We have the following options:

  • Not support it, and make the app create N encoder instances and handle simulcast itself. Not very likely: currently there is no support for accessing the raw image data, so the app would have to draw the media stream into N different canvases, downscale it, create a capture stream for each one, and pass each one to an instance of the encoder.

  • Support it as a configuration parameter of an encoder. This would allow a single input for the simulcast encoder, but would make it much more difficult to provide a good API compatible with non-simulcast encoders.

  • Provide a helper/adapter that can be created with a WebRTC encodings-like parameter and will internally create N encoders, exposing each one so that each simulcast layer can be controlled individually. This would allow a single input, and the simulcast adapter would internally downscale and forward it to each of the encoders.

I think the last one is the current approach in the libwebrtc internal code and is also my preferred option.
