
speech-api's Issues

onpause doesn't fire in chrome

speechSynthesis.cancel();
const text = 'This is a story about a man who once lived in the mountains, far in the northern region of our country, in a village which was known for its tasty food. He always had a dream as a kid to travel around the world.';
const utterance = new SpeechSynthesisUtterance(text);
utterance.voice = speechSynthesis.getVoices().filter(x => x.lang == 'en-US' && x.voiceURI == 'Google US English')[0];
utterance.onpause = e => console.log(e);
speechSynthesis.speak(utterance);

setTimeout(function () {
  speechSynthesis.pause(); // pause() takes no arguments
}, 3000);

But if you set it to this voice, it works:

utterance.voice = speechSynthesis.getVoices().filter(x => x.lang == 'en-US' && x.name == 'Alex')[0]

Mac
Chrome Version 86.0.4240.75 (Official Build) (x86_64)

Getter operations return null without proper IDL

For example, the SpeechRecognitionResult's getter operation says:

If index is greater than or equal to length, this returns null

But the IDL does not indicate so:

[Exposed=Window]
interface SpeechRecognitionResult {
    readonly attribute unsigned long length;
    getter SpeechRecognitionAlternative item(unsigned long index);
    readonly attribute boolean isFinal;
};

The return type must be nullable here. I'm not sure how Blink manages to return null here, though.
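
For reference, a quick script-level check of the behavior described above (a sketch; it assumes a Blink-based browser, where the constructor may still be prefixed):

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.onresult = event => {
  const result = event.results[0];
  // Per the prose quoted above, an out-of-range index yields null,
  // even though the IDL return type is a non-nullable SpeechRecognitionAlternative.
  console.log(result.item(result.length));
};
recognition.start();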

API and permissions

This specification defines an API that might be gated by a user prompt.
As per the spec: User agents must only start speech input sessions with explicit, informed user consent.

In that case, it seems some language could be added referring to https://w3c.github.io/permissions/.
That might also help define the behavior for third-party contexts, as is done for other APIs such as getUserMedia.
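
A sketch of how Permissions API integration might look from script, assuming the existing "microphone" permission name is what gets consulted (how speech input sessions map to permissions is exactly what would need to be defined):

navigator.permissions.query({ name: 'microphone' }).then(status => {
  console.log('microphone permission:', status.state); // "granted" | "denied" | "prompt"
  status.onchange = () => console.log('permission changed to', status.state);
});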

Constructors for speech events

Currently new SpeechSynthesisEvent("pending", { utterance: new SpeechSynthesisUtterance() }) works on Edge and Firefox but throws on Chrome.

Edge: only requires the first argument
Firefox: requires both the first and second arguments
Chrome: always throws

Can we spec this to align browser behaviors?
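
For reference, a minimal reproduction of the divergence (the constructor call and dictionary shape are the ones from the example above):

try {
  const event = new SpeechSynthesisEvent('pending', {
    utterance: new SpeechSynthesisUtterance('hello')
  });
  console.log('constructed:', event.type, event.utterance.text);
} catch (err) {
  // Observed in Chrome at the time of this report.
  console.log('constructor threw:', err.name);
}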

The UA should be able to disallow speak() from autoplaying

See here for context from [email protected].

I propose we only allow speak() to succeed without error if the caller frame or any of its parents has received user activation since their last load.

There are two options for error handling in the case that speak() fails this way:

  1. Use SpeechSynthesis' existing error mechanism with either the "not-allowed" error type or a new one.
  2. Make speak() return a promise, to align with HTMLMediaElement.play().

I slightly prefer (1) just because the ergonomics are not much different from (2), and it keeps the API surface consistent, but I'm open to either.
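
A sketch of how option (1) might surface to authors, assuming the existing "not-allowed" error type is reused (none of this is specified yet):

const utterance = new SpeechSynthesisUtterance('hello');
utterance.onerror = event => {
  if (event.error === 'not-allowed') {
    // Assumed behavior under this proposal: speak() was called before any
    // user activation, so the utterance is rejected instead of being queued.
    console.log('speech blocked until the user interacts with the page');
  }
};
speechSynthesis.speak(utterance);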

Fix URL next to description of the project

I guess the “website” field that is shown on the front page of the repo, at the top, should be either
https://w3c.github.io/speech-api/webspeechapi.html
or
https://w3c.github.io/speech-api/speechapi.html
?

Remove SpeechRecognitionEvent's interpretation and emma

Beyond Speech-to-Text

With respect to post-text speech recognition (e.g. speech-to-SSML, speech-to-hypertext, speech-to-X1), we can consider:

from:

// Item in N-best list
[Exposed=Window]
interface SpeechRecognitionAlternative {
    readonly attribute DOMString transcript;
    readonly attribute float confidence;
};

to:

// Item in N-best list
[Exposed=Window]
interface SpeechRecognitionAlternative {
    readonly attribute object transcript;
    readonly attribute float confidence;
};

Then client-side, server-side or third-party components or services could return either text or XML content per recognition result. That is, transcript could be either a DOMString or a DOM Element.

Speech-to-text is too lossy. Information pertaining to prosody, intonation, emphases and pauses is discarded in text-formatted output. Such information can be useful, for instance, in informing machine translation components and services.
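
A sketch of how a consumer might handle the widened transcript type under this proposal (hypothetical; today transcript is always a DOMString):

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.onresult = event => {
  const alternative = event.results[0][0];
  if (typeof alternative.transcript === 'string') {
    console.log('plain text:', alternative.transcript);       // speech-to-text
  } else if (alternative.transcript instanceof Element) {
    console.log('markup:', alternative.transcript.outerHTML); // speech-to-SSML / speech-to-hypertext
  }
};
recognition.start();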

Add profanity/offensive words filter attribute

I have no idea if I am on the right track as to why curse words appear differently in the transcript of SpeechRecognitionResult in different browsers, so I thought it best to open an issue here.

Question
If browsers implement the transcript of SpeechRecognitionResult in such a way that the output differs, maybe a profanity filter attribute could be useful so that the developer using the API has a choice in the matter? For example, an offensiveWordFilter attribute of type boolean?
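
A sketch of what using the proposed attribute could look like (offensiveWordFilter is hypothetical and not part of the current spec):

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.offensiveWordFilter = false; // hypothetical: opt out of censoring such as "s****"
recognition.onresult = event => console.log(event.results[0][0].transcript);
recognition.start();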

Background Story
While experimenting with the SpeechRecognition interface
in the phrase-matcher from https://github.com/mdn/web-speech-api/, the following occurred:

  1. When using Chrome and saying a curse word like shit, the transcript in SpeechRecognitionResult is censored as s****
  2. When using Firefox Nightly and saying a curse word like shit, the transcript in SpeechRecognitionResult is not censored

In neither Chrome nor Nightly is this type of censoring applied to the speechSynthesis interface, as used in the speak-easy-synthesis demo.

In my search into why this happens I found the following:
In https://github.com/chromium/chromium/blob/master/content/browser/speech/speech_recognition_engine.cc, filter_profanities is set to false on line 277, which on line 579 should result in pFilter=0. According to https://stackoverflow.com/questions/15030339/remove-profanity-censor-from-google-speech-recognition/15071054, setting pfilter=0 removes the profanity filter, which could lead to the conclusion that this has been changed in Chrome. I do not feel confident in this conclusion, however.

In Nightly I have found no reference to a profanity filter in the code: https://dxr.mozilla.org/mozilla-central/source/dom/media/webspeech/recognition

Hypertext-to-Speech

I would like to share some ideas with respect to advancing the Web Speech API. The ideas pertain to hypertext-to-speech, the speech synthesis of XHTML.

Hypertext-to-Speech

Hypertext-to-speech involves processing XHTML DOM trees with subtopics including SSML attributes, CSS Speech Module properties and new JavaScript events such as onsynthesis.

<html xmlns:ssml="http://www.w3.org/2001/10/synthesis">
  <head>
    <script type="text/javascript" src="..."></script>
    <style>
      [role="topic"] { voice-stress: strong; }
    </style>
  </head>
  <body ssml:alphabet="ipa">
    <p>
      <span id="sentence-1" onsynthesis="function(event);"><span ssml:ph="/ðɪs/" role="topic">This</span> is a sentence of <span ssml:ph="/tɛkst/">text</span> with markup for hypertext-to-speech.</span>
    </p>
  </body>
</html>

JavaScript API

speechSynthesis.speak(document);
speechSynthesis.speak(document.getElementById('sentence-1'));
var fragment = document.createDocumentFragment();
...
speechSynthesis.speak(fragment);
var doc = document.implementation.createDocument('http://www.w3.org/1999/xhtml', 'html', null);
...
speechSynthesis.speak(doc);

References

  1. https://w3c.github.io/publ-epub-revision/epub32/spec/epub-contentdocs.html#sec-xhtml-ssml-attrib
  2. https://www.w3.org/TR/css3-speech/
  3. https://www.w3.org/community/exercises-and-activities/wiki/Hypertext-to-Speech_and_Media_Overlays

SpeechSynthesisEvent does not support charLength

charLength is used during word and sentence boundary events, and allows API consumers to create features that highlight words and sentences as they are spoken out loud. For those with dyslexia or problems concentrating, this is a powerful accessibility feature. In addition, most major computing platforms already support some form of charLength, so it would make sense to add charLength to the web standards. An older request for the same thing exists.
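
A sketch of how a highlighting feature might use it, assuming a charLength field on boundary events (charIndex exists today; charLength is the requested addition):

const text = 'The quick brown fox jumps over the lazy dog.';
const utterance = new SpeechSynthesisUtterance(text);
utterance.onboundary = event => {
  const start = event.charIndex;                    // available today
  const end = start + (event.charLength ?? 0);      // hypothetical until specified
  console.log('speaking:', text.slice(start, end)); // e.g. highlight this range in the UI
};
speechSynthesis.speak(utterance);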

How is a complete SSML document expected to be parsed when set once at .text property of SpeechSynthesisUtterance instance?

According to the specification

5.2.3 SpeechSynthesisUtterance Attributes, text attribute: "This attribute specifies the text to be synthesized and spoken for this utterance. This may be either plain text or a complete, well-formed SSML document."

It is not clear how an entire SSML document is expected to be parsed when set as the .text property of a single SpeechSynthesisUtterance() instance.
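
For concreteness, setting a complete SSML document as .text looks like the following; whether (and how) the markup is parsed rather than read out literally is exactly what needs clarifying:

const ssml = `<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  Hello <break time="500ms"/> world.
</speak>`;
const utterance = new SpeechSynthesisUtterance(ssml);
speechSynthesis.speak(utterance); // implementations may speak the markup verbatim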

See guest271314/SSMLParser#1

Which technique is used for voice to text transformation?

I am trying to find out which technique is used for transforming voice into text, whether a neural network or some other method, but I was unsuccessful.

Question
Which AI technique is used for voice to text transformation?

Background Story
I am writing an article in which a web application uses the Web Speech API, and the technique used is important information.

SpeechRecognition MUST be granted permission to record and send users voice to a remote web service

Google Chrome and the ostensibly open source browser Chromium implement SpeechRecognition by recording the user's voice and sending the user's biometric data to a remote web service (https://bugs.chromium.org/p/chromium/issues/detail?id=816095) without notifying the user, or being granted permission by the user (one cannot grant permission if not notified), that use of SpeechRecognition in that browser will do the preceding with their voice. Some of the issues relating to that undisclosed practice are described at w3c/webappsec-secure-contexts#66 (comment).

To remedy that horrendous issue at the specification level, it should be a simple matter of including language that states SpeechRecognition MUST do at least the following if the implementation records the user's voice and sends that recording to a remote web service:

  1. If speech recognition is not performed in real time locally in the browser, the implementation MUST notify the user that, if they use that implementation of the Web Speech API in that browser, their voice will be recorded and sent to a remote web service when they use SpeechRecognition;
  2. The SpeechRecognition implementation MUST get permission from the user before recording their voice to send to a remote web service;
  3. The SpeechRecognition implementation MUST let the user know precisely where their biometric data is being sent, how long their recorded voice will be stored, and when the recording of their voice is deleted from the storage devices at the remote web service.

I will file a PR if necessary to fix this long-standing issue at the specification level.

Support SpeechSynthesis *to* a MediaStreamTrack

It would be very helpful to be able to get a stream of the output of SpeechSynthesis.

As explicit use cases, I would like to:

  • position speech synthesis in a virtual world in WebXR (using Web Audio's PannerNode)
  • be able to feed speech synthesis output through a WebRTC connection
  • have speech synthesis output be able to be processed through Web Audio

(This is a similar/inverse/matching/related feature to #66.)
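
A sketch of what such an API could look like; speakToTrack is a hypothetical name used only for illustration:

const utterance = new SpeechSynthesisUtterance('Hello from a virtual world');
const track = speechSynthesis.speakToTrack(utterance); // hypothetical method returning a MediaStreamTrack

// Route the synthesized audio through Web Audio, e.g. a PannerNode for WebXR positioning;
// the same track could instead be added to an RTCPeerConnection for WebRTC.
const audioContext = new AudioContext();
const source = audioContext.createMediaStreamSource(new MediaStream([track]));
const panner = new PannerNode(audioContext, { positionX: 1, positionY: 0, positionZ: -2 });
source.connect(panner).connect(audioContext.destination);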

Client-side, Server-side and Third-party Speech Recognition, Synthesis and Translation

Introduction

We can envision and consider client-side, server-side and third-party speech recognition, synthesis and translation scenarios for a next version of the Web Speech API.

Advancing the State of the Art

Speech Recognition

Beyond speech-to-text, speech recognition includes speech-to-SSML and speech-to-hypertext. With speech-to-SSML and speech-to-hypertext, a higher degree of fidelity is possible when round-tripping speech audio through speech recognition and synthesis components or services.

Speech Synthesis

Beyond text-to-speech, speech synthesis includes SSML-to-speech and hypertext-to-speech.

Translation

Translation scenarios include processing text, SSML, hypertext or audio in a source language into text, SSML, hypertext or audio in a target language.

Desirable features include interoperability between client-side, server-side and third-party translation and WebRTC with translations available as subtitles or audio tracks.

Multimodal Dialogue Systems

Interesting scenarios include Web-based multimodal dialogue systems which efficiently utilize client-side, server-side and third-party speech recognition, synthesis and translation.

Client-side Scenarios

Client-side Speech Recognition

These scenarios are considered in the current version of the Web Speech API.

Client-side Speech Synthesis

These scenarios are considered in the current version of the Web Speech API.

Client-side Translation

These scenarios are new to the Web Speech API and involve the client-side translation of text, SSML, hypertext or audio into text, SSML, hypertext or audio.

Server-side Scenarios

Server-side Speech Recognition

These scenarios are new to the Web Speech API and involve one or more audio streams from a client being streamed to a server which performs speech recognition, optionally providing speech recognition results to the client.

Server-side Speech Synthesis

These scenarios are new to the Web Speech API and involve a client sending text, SSML or hypertext to a server which performs speech synthesis and streams audio to the client.

Server-side Translation

These scenarios are new to the Web Speech API and involve a client sending text, SSML, hypertext or audio to a server for translation into text, SSML, hypertext or audio.

Third-party Scenarios

Third-party Speech Recognition

These scenarios are new to the Web Speech API and involve one or more audio streams from a client or server being streamed to a third-party service which performs speech recognition providing speech recognition results to the client or server.

Third-party Speech Synthesis

These scenarios are new to the Web Speech API and involve a client or server sending text, SSML or hypertext to a third-party service which performs speech synthesis and streams audio to the client or server.

Third-party Translation

These scenarios are new to the Web Speech API and involve a client sending text, SSML, hypertext or audio to a third-party translation service for translation into text, SSML, hypertext or audio.

Hyperlinks

Amazon Web Services

Google Cloud AI

IBM Watson Products and Services

Microsoft Cognitive Services

Real Time Translation in WebRTC

Defining what should happen if an utterance is changed after passing it to speak()

Consider the following case:

const utterance = new SpeechSynthesisUtterance('hello world');
speechSynthesis.speak(utterance);
utterance.text = 'hello world two';

The spec currently says

If changes are made to the SpeechSynthesisUtterance object after calling this method and prior to the corresponding end or error event, it is not defined whether those changes will affect what is spoken, and those changes may cause an error to be returned.

but it's not very satisfactory, as the web platform tries very hard to avoid any possible undefined behavior.

Chrome seems to “snapshot” the actual utterance information at the time of the speak(), which sounds like a good thing to specify.

Also see #29.

Support SpeechRecognition on an audio MediaStreamTrack

There is an old issue in Bugzilla, but it doesn't contain much discussion.

We should revive this, to give the application control over the source of audio.

Not letting the application be in control has several issues:

  • The spec needs to re-define (mediacapture-main does this too) what sources there are; today I don't see this mentioned. It seems implied that a microphone is used.
  • The error with code "audio-capture" groups all kinds of errors that capturing audio from a microphone could have.
  • There'd have to be an additional permission setting for speech, in addition to that of audio capture through getUserMedia. The spec currently doesn't help in clearing up how this relates to getUserMedia's permissions, and doing so could become complicated (if capture is already ongoing, do we ask again? if not, how does a user choose a device? etc.)
  • Depending on the implementation, if we rely on start() requesting audio from getUserMedia() (which seems reasonable), making multiple requests one after another could lead to a new permission prompt for each one, unless the user consents to a permanent permission. This would be an issue in Firefox, as an application cannot control the lifetime of the audio capture through the SpeechRecognition API.
  • Probably more.

Letting the application be in control has several advantages:

  • It can rely on mediacapture-main and its extension specs to define sources of audio and all security and privacy aspects around them. Some language might still be needed around cross-origin tracks. There's already a concept of isolated tracks in webrtc-identity, that will move into the main spec in the future, that one could rely on for the rest.
  • If no backwards-compatible path is kept, the spec can be simplified by removing all text, attributes, errors, etc. related to audio-capture.
  • The application is in full control of the track's lifetime, and thus can avoid any permission prompts the user agent might otherwise throw at the user, when doing multiple speech recognitions.
  • The application can recognize speech from other sources than microphones.
  • Probably more.

To support a MediaStreamTrack argument to start(), we need to:

  • Throw in start() if the track is not of kind "audio".
  • Throw in start() if the track's readyState is not "live".
  • Throw in start() if the track is isolated.
  • If the track becomes isolated while recognizing, discard any pending results and fire an error.
  • If the track ends while recognizing, treat it as the end of speech and handle it gracefully.
  • If the track is muted or disabled, do nothing special as this means the track contains silence. It could become unmuted or enabled at any time.

What to throw and what to fire I leave unsaid for now.
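
A sketch of the resulting author-facing usage, assuming start() grows an optional MediaStreamTrack argument (hypothetical; today start() takes no arguments):

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.onresult = event => console.log(event.results[0][0].transcript);

navigator.mediaDevices.getUserMedia({ audio: true }).then(stream => {
  const [track] = stream.getAudioTracks();
  recognition.start(track); // hypothetical overload: recognize from this track instead of the default microphone
});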

Transcripts are Insufficient for Japanese

Currently, the only implementation of the spec that I'm aware of, in Google Chrome, returns a mixture of surface kanji and kana when using Japanese-language speech to text.

Having kanji is great for semantics, but in Japanese, phonology is important in many cases. Without phonology, ambiguity is left in the intended transcription.

A whole class of use cases becomes impossible. For example:

  • Japanese names are written in kanji, but there can be multiple readings of those kanji, only one of which maps to an individual.
  • If I utter てい, the Google WebSpeech API returns 体. This kanji has four readings, and てい could belong to the reading of many other kanji. (I'm creating a voice-enabled Japanese flashcard app, and this problem only lets me guess whether a user is uttering a kanji/vocab item correctly; I cannot be sure because of the API's shortcomings for Japanese.)

Ideally we would always get the furigana (the kana that represent which sounds were made for the kanji) along with the kanji. Kana does have, in essence, a 1-to-1 phonetic mapping.

Better yet, a more general solution that would work for other languages with similar issues could be returning an IPA pronunciation along with the transcript, like so:

interface SpeechRecognitionAlternative {
    readonly attribute DOMString transcript;
    readonly attribute DOMString pronunciation;
    readonly attribute float confidence;
};
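
A sketch of what consuming the proposed attribute could look like (pronunciation is hypothetical; transcript exists today):

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.lang = 'ja-JP';
recognition.onresult = event => {
  const alternative = event.results[0][0];
  // pronunciation would carry the phonetic reading (IPA or kana) alongside the kanji transcript.
  console.log(alternative.transcript, alternative.pronunciation);
};
recognition.start();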

☂ Bugzilla bugs

Define how grammars work and give examples

You should provide a fully working example of how to use a grammar, because I did not see any use case where I could use one.

I do not understand the purpose of grammars or how to use them.
A Google search did not help me.

Web Speech API not working on a Chinese network

Hi, we have a problem with our Web Speech API solution in Chrome with a Chinese client. Speech recognition starts but does not give us a result. It only works with a VPN (because Google servers are blocked in China). We would like to make our solution available in Chrome without a VPN. Is there a solution for this problem?

Is SpeechSynthesisUtterance reusable?

From https://bugzilla.mozilla.org/show_bug.cgi?id=1372325#c14

The sample is the following:

var utter = new SpeechSynthesisUtterance("Hello World");
utter.onend = () => {
  // Edge, WebKit and Blink allow this, but Firefox doesn't.
  window.speechSynthesis.speak(utter);
};
window.speechSynthesis.speak(utter);

Google's search page relies on a SpeechSynthesisUtterance being reused, but the Speech API doesn't define whether this object is reusable.

Edge, WebKit, and Blink currently allow a SpeechSynthesisUtterance to be reused, but Gecko doesn't. If reusing this object is allowed, we should add spec text documenting that it is reusable.

Support SpeechRecognition input from audio files and Float32Array and ArrayBuffer

Support .wav, .webm, .ogg, .mp3 files (file types supported by the implementation decoders) and Float32Array and ArrayBuffer input to SpeechRecognition.

Use cases for static audio file and ArrayBuffer (non-"real-time") input to SpeechRecognition include, but are not limited to:

  • TTS to audio file, audio file to STT, audio output to TTS (document reader to audio output)
  • Research, development, testing and analysis of speech recognition technologies in general and of the accuracy of the application itself
  • Editing and modifying existing static audio files before SpeechRecognition input to achieve the expected text output

AudioWorkletNode can be used to stream Float32Array input.
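
A sketch of what the requested surface could look like, using decodeAudioData to obtain a Float32Array (the file name and the start() overload shown are hypothetical):

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.onresult = event => console.log(event.results[0][0].transcript);

fetch('speech.wav')                                  // hypothetical static audio file
  .then(response => response.arrayBuffer())
  .then(buffer => new AudioContext().decodeAudioData(buffer))
  .then(audioBuffer => {
    const samples = audioBuffer.getChannelData(0);   // Float32Array of PCM samples
    recognition.start(samples);                      // hypothetical overload accepting Float32Array input
  });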

Related #66

onspeechstart onspeechend onsoundstart onsoundend

I have tried all the methods to detect when sound starts and ends, but they are not triggered correctly.
I assume that onspeechstart should be triggered only when speech input is detected, and onspeechend as soon as no sound is detected.
With onsoundstart and onspeechstart, the event is triggered soon after the browser starts listening, even in silence.
And onspeechend is triggered a few seconds after all noise has stopped.

I'm testing on my laptop with the built-in microphone on Chrome 76.0.3809.100.
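
For reference, a minimal logging sketch of the kind of test setup described above (assumed, not taken from the report):

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
['soundstart', 'soundend', 'speechstart', 'speechend'].forEach(type =>
  recognition.addEventListener(type, () => console.log(type, performance.now()))
);
recognition.start();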

Use FrozenArray<SpeechRecognitionResult> and remove SpeechRecognitionResultList

Related 2013 discussion: https://lists.w3.org/Archives/Public/public-speech-api/2013Feb/0011.html

There was a previous attempt to remove SpeechRecognitionResultList, but it was reverted because Web IDL forbids attributes from returning sequences:

The type of the attribute, after resolving typedefs, must not be a nullable or non-nullable version of any of the following types:

  • a sequence type
  • a dictionary type
  • a record type
  • a union type that has a nullable or non-nullable sequence type, dictionary, or record as one of its flattened member types

Attributes can still use FrozenArray, which is also represented by a native JS array object.
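
From script, the practical difference would look roughly like this (a sketch, assuming results becomes a FrozenArray):

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.onresult = event => {
  // With FrozenArray<SpeechRecognitionResult>, results is a real (frozen) JS array,
  // so array methods work directly; with SpeechRecognitionResultList you need
  // Array.from(event.results) first.
  const transcripts = event.results.map(result => result[0].transcript);
  console.log(transcripts);
};
recognition.start();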
