
Comments (17)

guest271314 commented on July 28, 2024

I don't understand. getUserMedia cannot currently be used with SpeechRecognition, and MediaRecorder is not even remotely related.

Actually, MediaRecorder can currently be used with SpeechRecognition. Either a live user voice or SpeechSynthesis output can be recorded and then played back as input to SpeechRecognition (https://stackoverflow.com/a/46383699; https://stackoverflow.com/a/47113924).
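A minimal sketch of that playback workaround (assuming the page has microphone permission and that the playback is audible to the recognizer's own microphone capture):

(async () => {
  // Record a short utterance with MediaRecorder...
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.onstop = () => {
    // ...then play it back through the speakers while SpeechRecognition,
    // which captures the microphone, is listening.
    const playback = new Audio(URL.createObjectURL(new Blob(chunks, { type: recorder.mimeType })));
    const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
    const recognition = new Recognition();
    recognition.onresult = (e) => console.log(e.results[0][0].transcript);
    recognition.start(); // starts microphone capture
    playback.play();     // the recording plays and is picked up by the microphone
  };
  recorder.start();
  setTimeout(() => recorder.stop(), 3000); // record about 3 seconds of speech
})();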

guest271314 commented on July 28, 2024

@Pehrsons Not entirely certain what is being proposed here that is relevant to SpeechRecognition. There is no special handling needed for speech to be recorded using getUserMedia(). MediaRecorder start(), pause(), resume() and stop() should suffice for audio input. Processing the input locally, instead of sending it to an undisclosed remote web service, is what is needed.

Currently Chrome/Chromium records the user's voice (without notice or permission) and then sends that recording to a remote web service (#56; https://webwewant.fyi/wants/55/). The response is a transcript (text) of the input, heavily censored depending on the input words. It is unclear what happens to the user's input (potentially biometric data; their voice).

Setting the output of speechSynthesis.speak() to a MediaStreamTrack would suggest rewriting the algorithm from scratch so that the API communicates directly with speechd.

Related WebAudio/web-audio-api#1764 (comment).

guest271314 commented on July 28, 2024

Is this issue about specifying MediaStreamTrack as input to SpeechRecognition?

Pehrsons commented on July 28, 2024

I don't understand. getUserMedia cannot currently be used with SpeechRecognition, and MediaRecorder is not even remotely related.

This proposal is about adding a MediaStreamTrack argument to SpeechRecognition's start method.
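A hypothetical sketch of that shape (note: no browser implements a track argument to start(); it is the change under discussion here):

(async () => {
  // Capture the microphone explicitly, then hand one audio track to the recognizer.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const [track] = stream.getAudioTracks();

  const recognition = new SpeechRecognition();
  recognition.onresult = (e) => console.log(e.results[0][0].transcript);
  recognition.start(track); // proposed signature: recognize from this track instead of an implicit capture
})();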

Avoiding the use of an online service for recognition is completely unrelated to this; please use a separate issue for that.

guest271314 commented on July 28, 2024

Is the proposal that, when the argument to SpeechRecognition is a MediaStreamTrack, the current implementation, which captures the microphone, would be overridden by the MediaStreamTrack argument?

Or is the idea that capturing the microphone input would be replaced entirely by the MediaStreamTrack input?

MediaRecorder is related to the degree that, AFAIK, there is no implementation which performs the STT procedure in the browser, meaning that even if a MediaStreamTrack is added as an argument, that MediaStreamTrack would still currently need to be recorded internally to be processed: there is no browser implementation that processes the audio input in "real-time" and produces output using code shipped in the browser (perhaps besides certain handhelds). Therefore, if a MediaStreamTrack argument is added to SpeechRecognition, the entire MediaRecorder implementation might as well be added too, to control all aspects of the input to be recorded.

AFAICT Mozilla does not implement SpeechRecognition, even when setting media.webspeech.recognition.enable to true.

There is no benefit in adding MediaStreamTrack to SpeechRecognition without corresponding control over the entire recording, and ideally without recording anything at all, instead processing STT in "real-time". It would be much more advantageous to focus on actually implementing SpeechRecognition in the browser without any external service than to spend time incorporating MediaStreamTrack as an argument to an API that does not work at Mozilla and relies on an external service for results. What is there to gain in adding MediaStreamTrack as an argument to SpeechRecognition when the underlying specification and implementation of SpeechRecognition need work?

guest271314 commented on July 28, 2024

Since this is obviously something you have thought about, would only suggest, when proceeding, not omitting the opportunity to concurrently or consecutively add AudioBuffer, Float32Array, and static file (.mp3, .ogg, .opus, .webm, .wav) input as well.

guest271314 commented on July 28, 2024

You might also be interested in

Pehrsons commented on July 28, 2024

Is the proposal that, when the argument to SpeechRecognition is a MediaStreamTrack, the current implementation, which captures the microphone, would be overridden by the MediaStreamTrack argument?

Or is the idea that capturing the microphone input would be replaced entirely by the MediaStreamTrack input?

I read your question as "Should the MediaStreamTrack argument to start() be required or optional?"

Preferably required, since if it's optional we cannot get rid of any of the language that I claimed we can in the proposal.

MediaRecorder is related to the degree that, AFAIK, there is no implementation which performs the STT procedure in the browser, meaning that even if a MediaStreamTrack is added as an argument, that MediaStreamTrack would still currently need to be recorded internally to be processed: there is no browser implementation that processes the audio input in "real-time" and produces output using code shipped in the browser (perhaps besides certain handhelds). Therefore, if a MediaStreamTrack argument is added to SpeechRecognition, the entire MediaRecorder implementation might as well be added too, to control all aspects of the input to be recorded.

Are you suggesting allowing to run SpeechRecognition on a buffer of data in non-realtime? That seems like an orthogonal proposal, please file your own issue if you want to argue for that.

Giving SpeechRecognition the controls of MediaRecorder because the implementation happens to encode and send audio data to a server doesn't make sense. The server surely only allows specific container and codec settings. It also locks out any future implementations that do not rely on a server, because there'd be no reason to support MediaRecorder configurations for them, yet they would have to.

AFAICT Mozilla does not implement SpeechRecognition, even when setting media.webspeech.recognition.enable to true.

This issue is about the spec, not Mozilla's implementation.

There is no benefit in adding MediaStreamTrack to SpeechRecognition without corresponding control over the entire recording

See my first post in this issue again, there are lots of benefits.

, and ideally without recording anything at all, instead processing STT in "real-time". It would be much more advantageous to focus on actually implementing SpeechRecognition in the browser without any external service than to spend time incorporating MediaStreamTrack as an argument to an API that does not work at Mozilla and relies on an external service for results. What is there to gain in adding MediaStreamTrack as an argument to SpeechRecognition when the underlying specification and implementation of SpeechRecognition need work?

It's part of improving the spec, so you seem to have answered your own question.

guest271314 commented on July 28, 2024

@Pehrsons

Are you suggesting allowing to run SpeechRecognition on a buffer of data in non-realtime? That seems like an orthogonal proposal, please file your own issue if you want to argue for that.

That is what occurs now in Chromium/Chrome.

Reading the specification, the term "real-time" does not appear at all in the document.

The term would need to be clearly defined anyway, as "real-time" can have different meanings in different domains or be interpreted differently by different individuals.

SpeechRecognition() should be capable of accepting either a live MediaStreamTrack or a static file or JavaScript object.

Downloaded https://github.com/guest271314/mozilla-central/tree/libdeep yesterday. Will try to build and test.

It should not be an issue to set MediaStreamTrack as a possible parameter to start(). It should also not be an issue to allow that parameter to be a Float32Array or ArrayBuffer.

Am particularly interested in how you intend to test what you are proposing to be specified?

guest271314 commented on July 28, 2024

@Pehrsons Re use of static files for STT: for TTS, espeak-ng currently has the functionality to process static files in several ways, including

espeak-ng -x -m -f input.txt

-f Text file to speak
-x Write phoneme mnemonics to stdout

Pehrsons commented on July 28, 2024

@Pehrsons

Are you suggesting allowing to run SpeechRecognition on a buffer of data in non-realtime? That seems like an orthogonal proposal, please file your own issue if you want to argue for that.

That is what occurs now in Chromium/Chrome.

Then they're not implementing the spec. Or are they, but they're using a buffer internally? Well, then you're conflating their implementation with the spec.

Reading the specification, the term "real-time" does not appear at all in the document.

The term would need to be clearly defined anyway, as "real-time" can have different meanings in different domains or be interpreted differently by different individuals.

And it doesn't have to be if we use a MediaStreamTrack, since mediacapture-streams defines what we need.

SpeechRecognition() should be capable of accepting either a live MediaStreamTrack or a static file or JavaScript object.

Downloaded https://github.com/guest271314/mozilla-central/tree/libdeep yesterday. Will try to build and test.

It should not be an issue to set MediaStreamTrack as a possible parameter to start(). It should also not be an issue to allow that parameter to be a Float32Array or ArrayBuffer.

Again, file a separate issue if you think that is the right way to go.

Am particularly interested in how you intend to test what you are proposing to be specified?

Give the API some input that you control, and observe that it gives you the expected output.
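For example, assuming start() accepted a track, a test could route a known audio fixture through Web Audio and compare the transcript (the file name and expected text below are placeholders):

(async () => {
  // Decode a known recording and play it into a MediaStreamAudioDestinationNode,
  // whose track is handed to the (proposed) start(track).
  const ctx = new AudioContext();
  const destination = ctx.createMediaStreamDestination();
  const data = await (await fetch('fixture.wav')).arrayBuffer(); // known test input
  const source = ctx.createBufferSource();
  source.buffer = await ctx.decodeAudioData(data);
  source.connect(destination);

  const recognition = new SpeechRecognition();
  recognition.onresult = (e) => {
    const transcript = e.results[0][0].transcript;
    console.assert(transcript === 'hello world', transcript); // expected transcript of the fixture
  };
  recognition.start(destination.stream.getAudioTracks()[0]); // proposed signature
  source.start();
})();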

guest271314 commented on July 28, 2024

Then they're not implementing the spec.

Are you referring to the following language?

https://w3c.github.io/speech-api/#speechreco-methods

When the speech input is streaming live through the input media stream

which does not necessarily exclusively mean a "real-time" input media stream.

Or are they, but they're using a buffer internally?

The last time checked, the audio was compiled into a single buffer (cannot recollect at the moment whether conversion to a WAV file occurs in the browser or not; do recollect locating some code to convert audio to WAV in the source code) and then sent to an undisclosed remote web service in a single request.

Again, file a separate issue if you think that is the right way to go.

Filed.

Pehrsons commented on July 28, 2024

Then they're not implementing the spec.

Are you referring to the following language?

https://w3c.github.io/speech-api/#speechreco-methods

When the speech input is streaming live through the input media stream

Not necessarily. Yes, the spec is bad and hand-wavy, so it's hard to find definitions for the terms used. But it's fairly easy to understand the spec authors' intent. In this case it is that start() works a lot like getUserMedia({audio: true}), but with special UI bits to tell the user that speech recognition is in progress (note that this spec originates from a time when getUserMedia was not yet available; Chrome now seems to use the same UI as for getUserMedia). Read chapter 3, "Security and privacy considerations", and this becomes abundantly clear.

which does not necessarily exclusively mean a "real-time" input media stream.

I'm afraid it does. Otherwise, where do you pass in this non-realtime media stream?
If you're suggesting the UA would provide the means through a user interface picker or the like; well, the spec repeatedly refers to "speech input", and by reading through all those references "speech input" can only be interpreted as speech input from the user, i.e., through a microphone which the user is speaking into, i.e., real-time.

Or are they, but they're using a buffer internally?

The last time checked, the audio was compiled into a single buffer (cannot recollect at the moment whether conversion to a WAV file occurs in the browser or not; do recollect locating some code to convert audio to WAV in the source code) and then sent to an undisclosed remote web service in a single request.

Implementation detail and irrelevant to the spec.

Again, file a separate issue if you think that is the right way to go.

Filed.

guest271314 commented on July 28, 2024

I'm afraid it does. Otherwise, where do you pass in this non-realtime media stream?
If you're suggesting the UA would provide the means through a user interface picker or the like; well, the spec repeatedly refers to "speech input", and by reading through all those references "speech input" can only be interpreted as speech input from the user, i.e., through a microphone which the user is speaking into, i.e., real-time.

Do not agree with that assessment. Have not noticed any new UI in Chromium. The specification as-is permits gleaning or amassing various plausible interpretations from the language, none of which are necessarily conclusive and binding, or at least not unambiguous; thus the current state of the art is to attempt to "incubate" the specification. A user can schedule an audio file (or the reading of a Float32Array at an AudioWorkletNode) to play in the future. That is not "real-time" "speech input" from the user.

guest271314 commented on July 28, 2024

Since there is current interest in amending and adding to the specification, the question must be asked: why would SpeechRecognition not be specified to accept both non-"real-time" audio input (e.g., from an audio file or buffer) and a MediaStreamTrack?

Pehrsons commented on July 28, 2024

Since there is current interest in amending and adding to the specification, the question must be asked: why would SpeechRecognition not be specified to accept both non-"real-time" audio input (e.g., from an audio file or buffer) and a MediaStreamTrack?

IMO because that would make the spec very complicated. Unnecessarily so, since there are other specs already allowing conversions between the two.
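For example, mediacapture-fromelement already lets a page turn a static file into a live track that could then be fed to a track-accepting start(); a rough sketch (the file name is a placeholder, and mozCaptureStream() is Firefox's prefixed variant):

// Convert a static audio file into a real-time MediaStreamTrack via captureStream().
const audioEl = new Audio('speech.ogg');
audioEl.onloadedmetadata = () => {
  const stream = audioEl.captureStream ? audioEl.captureStream() : audioEl.mozCaptureStream();
  const [track] = stream.getAudioTracks();
  audioEl.play();
  // track could now be passed to the proposed recognition.start(track)
};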

mbayou commented on July 28, 2024

Hi There!

I apologize for resurrecting this discussion almost five years later.

I was wondering if a conclusion has been reached regarding whether the start() method should take an input. I am trying to allow my users to select the microphone they want to use for recognition within our app. Currently, I am forcing them to change their default device, but it would be much easier if we could let them decide in-app.
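If start() did take a track, in-app microphone selection could look roughly like the sketch below (hypothetical: the start(track) argument is only proposed, and device labels require a prior permission grant):

(async () => {
  // Enumerate audio inputs, let the user pick one, capture that device,
  // and hand its track to the (proposed) start(track).
  const devices = await navigator.mediaDevices.enumerateDevices();
  const mics = devices.filter((d) => d.kind === 'audioinput');
  const chosen = mics[0]; // in a real app, selected from an in-app picker

  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { deviceId: { exact: chosen.deviceId } },
  });

  const recognition = new SpeechRecognition();
  recognition.onresult = (e) => console.log(e.results[0][0].transcript);
  recognition.start(stream.getAudioTracks()[0]); // proposed signature
})();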

Thank you in advance for your time!
