wicg / speech-api
Web Speech API
Home Page: https://wicg.github.io/speech-api/
speechSynthesis.cancel();
text = 'This is a story about a man who once lived in the mountains, far in the northern region of our country, in a village which was known for its tasty food. He always had a dream as a kid to travel around the world.';
utterance = new SpeechSynthesisUtterance(text);
utterance.voice = speechSynthesis.getVoices().filter(x => x.lang == 'en-US' && x.voiceURI == 'Google US English')[0];
utterance.onpause = e => console.log(e);
speechSynthesis.speak(utterance);
setTimeout(function(){
speechSynthesis.pause();
}, 3000);
but if you set it to this voice, it works:
utterance.voice = speechSynthesis.getVoices().filter(x => x.lang == 'en-US' && x.name == 'Alex')[0];
Mac
Chrome Version 86.0.4240.75 (Official Build) (x86_64)
For example, the SpeechRecognitionResult's getter operation says:
If index is greater than or equal to length, this returns null
But the IDL does not indicate so:
[Exposed=Window]
interface SpeechRecognitionResult {
readonly attribute unsigned long length;
getter SpeechRecognitionAlternative item(unsigned long index);
readonly attribute boolean isFinal;
};
The return type must be nullable here. I'm not sure how Blink allows returning null here, though?
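For illustration, a minimal sketch of the case the prose describes (the handler wiring is hypothetical; the null result follows the prose, not the current IDL):

const recognition = new SpeechRecognition();
recognition.onresult = (event) => {
  const result = event.results[0];
  // index === length is out of range; the spec prose says item() returns null here,
  // which the non-nullable SpeechRecognitionAlternative return type does not allow.
  const alternative = result.item(result.length);
  console.log(alternative);
};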
https://w3c.github.io/speech-api/#dom-speechgrammarlist-addfromuri
There's a naming mismatch between the spec and Chrome. I will add a use counter in Chrome to see if it's feasible to rename it, or if it's better to just change the spec.
This specification is defining an API that might be gated by a user prompt.
As per spec: User agents must only start speech input sessions with explicit, informed user consent.
In that case, it seems some language could be added referring to https://w3c.github.io/permissions/.
That might also help define the behavior for third-party contexts, as is done for other APIs such as getUserMedia.
Currently new SpeechSynthesisEvent("pending", { utterance: new SpeechSynthesisUtterance() })
works on Edge and Firefox but throws on Chrome.
Edge: only requires the first argument
Firefox: requires both the first and second arguments
Chrome: always throws
Can we spec this to align browser behaviors?
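For illustration, a small sketch of how the divergence surfaces today (a feature-probing sketch, not spec-mandated behavior):

let event;
try {
  // Edge accepts the one-argument form; Firefox and Chrome throw here.
  event = new SpeechSynthesisEvent("pending");
} catch (e) {
  try {
    // Firefox accepts the two-argument form; Chrome throws here as well.
    event = new SpeechSynthesisEvent("pending", {
      utterance: new SpeechSynthesisUtterance(),
    });
  } catch (e2) {
    console.log("SpeechSynthesisEvent is not constructible in this browser");
  }
}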
See here for context from [email protected].
I propose we only allow speak() to succeed without error if the caller frame or any of its parents have received user activation since their last load.
There are two options for error handling in the case that speak() fails this way:
I slightly prefer (1) just because the ergonomics are not much different from (2), and it keeps the API surface consistent, but I'm open to either.
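As a rough illustration of the proposal, a sketch of gating speak() on user activation from script (navigator.userActivation support is assumed; in the proposal the gating would live in the user agent, not the page):

const button = document.querySelector("button");
button.addEventListener("click", () => {
  // hasBeenActive is "sticky" activation: the frame has received user activation
  // since its last load, which roughly matches the condition proposed above.
  if (navigator.userActivation && navigator.userActivation.hasBeenActive) {
    speechSynthesis.speak(new SpeechSynthesisUtterance("Activated speech"));
  }
});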
I guess the “website” field that is shown on the front page of the repo, at the top, should be either
https://w3c.github.io/speech-api/webspeechapi.html
or
https://w3c.github.io/speech-api/speechapi.html
?
They're only implemented in Chrome:
http://web-confluence.appspot.com/#!/catalog?releases=%5B%22Edge_18.17763_Windows_10.0%22,%22Safari_12.1_OSX_10.14.4%22,%22Chrome_74.0.3729.108_Windows_10.0%22,%22Firefox_67.0_Windows_10.0%22%5D&q=%22interpretation%22
http://web-confluence.appspot.com/#!/catalog?releases=%5B%22Edge_18.17763_Windows_10.0%22,%22Safari_12.1_OSX_10.14.4%22,%22Chrome_74.0.3729.108_Windows_10.0%22,%22Firefox_67.0_Windows_10.0%22%5D&q=%22emma%22
However, in Chromium they always return null:
https://cs.chromium.org/chromium/src/third_party/blink/renderer/modules/speech/speech_recognition_event.h?l=61&rcl=ebd558bcafa1b17fa0681321118f217a0587d0a2
With respect to post-text speech recognition (e.g. speech-to-SSML, speech-to-hypertext, speech-to-X1), we can consider:
from:
// Item in N-best list
[Exposed=Window]
interface SpeechRecognitionAlternative {
readonly attribute DOMString transcript;
readonly attribute float confidence;
};
to:
// Item in N-best list
[Exposed=Window]
interface SpeechRecognitionAlternative {
readonly attribute object transcript;
readonly attribute float confidence;
};
then client-side, server-side or third-party components or services could return either text or XML content per recognition result. That is, transcript could be either a DOMString or a DOMElement.
Speech-to-text is too lossy. Information pertaining to prosody, intonation, emphases and pauses are discarded in text-formatted output. Such information can be useful, for instance, in informing machine translation components and services.
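To make the idea concrete, a hedged sketch of consuming a transcript that may be either a string or markup (hypothetical behavior under the proposal above, not the current API):

const recognition = new SpeechRecognition();
recognition.onresult = (event) => {
  const { transcript } = event.results[0][0];
  if (typeof transcript === "string") {
    console.log("plain text:", transcript);
  } else {
    // e.g. an SSML <speak> element carrying prosody, emphasis and pause markup
    console.log("markup:", new XMLSerializer().serializeToString(transcript));
  }
};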
Does your API use Google's speech-to-text API? Why have I found that the accuracy is lower than the demo on Google's official site? Thank you for your time.
No idea if I am on the right track as to why curse words appear differently in the transcript of SpeechRecognitionResult in different browsers, so I thought it best to open an issue here.
Question
If browsers implement the transcript of SpeechRecognitionResult in such a way that the output differs, maybe a profanity filter attribute could be useful so that the developer using the API has a choice in the matter? For example, an offensiveWordFilter attribute of type boolean?
Background Story
While experimenting with the SpeechRecognition Interface
in the phrase-matcher from https://github.com/mdn/web-speech-api/ the following occurred:
In neither Chrome nor Nightly is this type of censoring applied for the speechSynthesis interface as used in speak-easy-synthesis.
In my search into why this happens I found the following:
In https://github.com/chromium/chromium/blob/master/content/browser/speech/speech_recognition_engine.cc, filter_profanities is set to false on line 277, which on line 579 should result in pFilter=0. According to https://stackoverflow.com/questions/15030339/remove-profanity-censor-from-google-speech-recognition/15071054, setting pfilter=0 removes the profanity filter, which could lead to the conclusion that this has been changed in Chrome. I do not feel confident in this conclusion, however.
In Nightly I have found no reference in the code to a profanity filter: https://dxr.mozilla.org/mozilla-central/source/dom/media/webspeech/recognition
https://github.com/tabatkins/bikeshed is used for most modern specs.
See "Need to clarify what happens if two windows try to speak". Issue moved here in order to discontinue use of Bugzilla for Speech API.
I would like to share some ideas with respect to advancing the Web Speech API. The ideas pertain to hypertext-to-speech, the speech synthesis of XHTML.
Hypertext-to-speech involves processing XHTML DOM trees, with subtopics including SSML attributes, CSS Speech Module properties and new JavaScript events such as onsynthesis.
<html xmlns:ssml="http://www.w3.org/2001/10/synthesis">
<head>
<script type="text/javascript" src="..."></script>
<style>
[role="topic"] { voice-stress: strong; }
</style>
</head>
<body ssml:alphabet="ipa">
<p>
<span id="sentence-1" onsynthesis="function(event);"><span ssml:ph="/ðɪs/" role="topic">This</span> is a sentence of <span ssml:ph="/tɛkst/">text</span> with markup for hypertext-to-speech.</span>
</p>
</body>
</html>
speechSynthesis.speak(document);
speechSynthesis.speak(document.getElementById('sentence-1'));
var fragment = document.createDocumentFragment();
...
speechSynthesis.speak(fragment);
var doc = document.implementation.createDocument('http://www.w3.org/1999/xhtml', 'html', null);
...
speechSynthesis.speak(doc);
CharLength is used during word and sentence boundary events, and allows API consumers to create features that highlight words/sentences as they are spoken out loud. For those with dyslexia or problems concentrating, this is a powerful accessibility feature. In addition, most major computing platforms already support some form of charLength. Therefore, it would make some sense to add charLength to the web standards. An older request for the same thing
I'm not sure if this is possible, but it would be useful if I could get the duration of runtime for an instantiated utterance.
According to https://wicg.github.io/speech-api/#utterance-attributes, the runtime attribute is not available.
Given text, voice, and rate, is it possible for SpeechSynthesisUtterance to also provide the runtimeDuration?
According to the specification
5.2.3 SpeechSynthesisUtterance Attributes text attribute This attribute specifies the text to be synthesized and spoken for this utterance. This may be either plain text or a complete, well-formed SSML document.
It is not clear how an entire SSML document is expected to be parsed when set as the .text property of a single SpeechSynthesisUtterance() instance.
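For illustration, a sketch of what the quoted spec text appears to allow; whether the markup is parsed, stripped, or read out literally is exactly what is left unclear:

const ssml = `<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  This is <emphasis>emphasized</emphasis> text with a <break time="500ms"/> pause.
</speak>`;
const utterance = new SpeechSynthesisUtterance(ssml);
// Some engines may honor the markup, others may speak the tags aloud or ignore them.
speechSynthesis.speak(utterance);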
Originally reported in https://www.w3.org/Bugs/Public/show_bug.cgi?id=26992
Some discussion also in #32.
I am trying to find out which technique is used for transforming voice into text, whether a neural network or some other method is used, but I was unsuccessful.
Question
Which AI technique is used for voice to text transformation?
Background Story
I am writing an article about a web application that uses the Web Speech API, and an important piece of information is the technique used.
https://w3c.github.io/speech-api/#utterance-events
These descriptions don't state precisely when the events fire relative to when speak(), pause(), resume() or cancel() are called. This makes it impossible to write a detailed test for those methods based on the spec.
I needed this to write a test for https://bugs.chromium.org/p/chromium/issues/detail?id=679043.
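For illustration, a sketch of the kind of test this would require, logging event timing relative to the method calls (the exact ordering is what the spec currently leaves undefined):

const utterance = new SpeechSynthesisUtterance("testing event timing");
for (const type of ["start", "boundary", "pause", "resume", "end", "error"]) {
  utterance.addEventListener(type, () => console.log(type, performance.now()));
}
console.log("speak()", performance.now());
speechSynthesis.speak(utterance);
console.log("pause()", performance.now());
speechSynthesis.pause();
console.log("resume()", performance.now());
speechSynthesis.resume();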
Currently, Chrome's autoplay policy allows an element to play if the caller frame or any of its parents have received user activation since their last load.
See #27 for a more general discussion of expanding this spec to disallow speak().
Google Chrome and the ostensibly open source browser Chromium's implementation of SpeechRecognition records the user's voice and sends the user's biometric data to a remote web service (https://bugs.chromium.org/p/chromium/issues/detail?id=816095) without notifying the user or being granted permission by the user (permission cannot be granted if the user is not notified) that using SpeechRecognition in that browser will do the preceding with their voice. Some of the issues relating to that undisclosed practice are described at w3c/webappsec-secure-contexts#66 (comment).
To remedy that horrendous issue at the specification level, it should be a simple matter of including language that states implementations of SpeechRecognition MUST do at least the following if the implementation records the user's voice and sends that recording to a remote web service:
- the user MUST be notified that SpeechRecognition is not performed locally before SpeechRecognition is used;
- implementations of SpeechRecognition MUST get permission from the user before recording their voice to send to a remote web service;
- implementations of SpeechRecognition MUST let the user know precisely where their biometric data is being sent, for how long their recorded voice will be stored, and when the recording of their voice is deleted from the storage devices at the remote web service.
Will file the PR if necessary to fix this long-standing issue at the specification level.
It would be very helpful to be able to get a stream of the output of SpeechSynthesis.
For explicit use cases, I would like to:
(This is similar/inverse/matching/related feature to #66.)
We can envision and consider client-side, server-side and third-party speech recognition, synthesis and translation scenarios for a next version of the Web Speech API.
Beyond speech-to-text, speech recognition includes speech-to-SSML and speech-to-hypertext. With speech-to-SSML and speech-to-hypertext, there can be a higher degree of fidelity possible for round-tripping speech audio through speech recognition and synthesis components or services.
Beyond text-to-speech, speech synthesis includes SSML-to-speech and hypertext-to-speech.
Translation scenarios include processing text, SSML, hypertext or audio in a source language into text, SSML, hypertext or audio in a target language.
Desirable features include interoperability between client-side, server-side and third-party translation and WebRTC with translations available as subtitles or audio tracks.
Interesting scenarios include Web-based multimodal dialogue systems which efficiently utilize client-side, server-side and third-party speech recognition, synthesis and translation.
These scenarios are considered in the current version of the Web Speech API.
These scenarios are new to the Web Speech API and involve the client-side translation of text, SSML, hypertext or audio into text, SSML, hypertext or audio.
These scenarios are new to the Web Speech API and involve one or more audio streams from a client being streamed to a server which performs speech recognition, optionally providing speech recognition results to the client.
These scenarios are new to the Web Speech API and involve a client sending text, SSML or hypertext to a server which performs speech synthesis and streams audio to the client.
These scenarios are new to the Web Speech API and involve a client sending text, SSML, hypertext or audio to a server for translation into text, SSML, hypertext or audio.
These scenarios are new to the Web Speech API and involve one or more audio streams from a client or server being streamed to a third-party service which performs speech recognition providing speech recognition results to the client or server.
These scenarios are new to the Web Speech API and involve a client or server sending text, SSML or hypertext to a third-party service which performs speech synthesis and streams audio to the client or server.
These scenarios are new to the Web Speech API and involve a client sending text, SSML, hypertext or audio to a third-party translation service for translation into text, SSML, hypertext or audio.
https://w3c.github.io/speech-api/#utterance-attributes
None of these attributes define what should happen when set, and except for voice it's not clear what the setter might do with invalid values:
Consider the following case:
utterance = new SpeechSynthesisUtterance('hello world');
speechSynthesis.speak(utterance);
utterance.text = 'hello world two';
The spec currently says
If changes are made to the SpeechSynthesisUtterance object after calling this method and prior to the corresponding end or error event, it is not defined whether those changes will affect what is spoken, and those changes may cause an error to be returned.
but it's not very satisfactory, as the web platform tries very hard to avoid any possible undefined behavior.
Chrome seems to “snapshot” the actual utterance information at the time of the speak() call, which sounds like a good thing to specify.
Also see #29.
There is an old issue in Bugzilla, but it doesn't discuss much.
We should revive this, to give the application control over the source of audio.
Not letting the application be in control has several issues:
Letting the application be in control has several advantages:
To support a MediaStreamTrack argument to start(), we need to:
What to throw and what to fire I leave unsaid for now.
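A hypothetical sketch of what the proposed shape could look like from a page (the start(track) overload does not exist in the current spec; today start() takes no arguments):

async function recognizeFromTrack() {
  // Let the application pick the audio source instead of the UA-chosen microphone.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const [track] = stream.getAudioTracks();

  const recognition = new SpeechRecognition();
  recognition.onresult = (event) => console.log(event.results[0][0].transcript);
  recognition.start(track); // hypothetical overload accepting a MediaStreamTrack
}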
@siusin could you please transfer this repository to the WICG for us, based on:
https://lists.w3.org/Archives/Public/public-speech-api/2019Oct/0001.html
I'll send a PR to the w3c.github.io repo so links don't break.
See whatwg/webidl#778.
Hello,
I'm using speech-to-text; consider this simple example. Is there a way to get the voices of other people? For example, in Zoom meetings, how can I get the voices of other attendees?
I have checked the documentation https://wicg.github.io/speech-api/ but can't find anything.
Thanks.
Currently, the only implementation of the spec that I'm aware of, in Google Chrome, returns a mixture of surface kanji and kana when using Japanese-language speech-to-text.
Having kanji is great for semantics, but in Japanese, phonology is important in many cases. Without it, ambiguity remains in the intended transcription.
A whole class of use cases becomes impossible. For example:
Ideally we would always get the furigana (the kana that represent which sounds were made for the kanji) along with the kanji. Kana does have, in essence, a 1-to-1 phonetic mapping.
Better yet, a more general solution that would work for other languages with similar issues could be returning an IPA pronunciation along with the transcript, like so:
interface SpeechRecognitionAlternative {
readonly attribute DOMString transcript;
readonly attribute DOMString pronunciation;
readonly attribute float confidence;
};
These are the open bugs in https://www.w3.org/Bugs/Public/buglist.cgi?product=Speech%20API:
Moved from https://www.w3.org/Bugs/Public/show_bug.cgi?id=30009:
It looks like some changes were made in bug 23737.
A few instances of "SpeechSynthesisVoiceList" remain. In the IDL it should be "sequence<SpeechSynthesisVoice> getVoices()", and in the prose it needs to say things about the sequence of SpeechSynthesisVoice objects instead.
You should provide a fully working example of how to use grammar, because I did not see any use case where I could use it.
I do not understand what the purpose of grammar is or how to use it.
A Google search did not help me.
In web-platform-tests/wpt#12568 web-platform-tests/wpt#12689 I found that apparently SSML isn't supported anywhere.
If there is no immediate implementer interest, I suggest removing the SSML bits from the spec.
Update: it does work on Windows, see #37 (comment)
Hi, we have a problem with our Web Speech API solution in Chrome with a Chinese client. Speech recognition starts but does not give us a result. It only works with a VPN (because Google servers are blocked in China). We would like to make our solution available in Chrome without a VPN. Is there a solution for this problem?
See "WebSpeech API mustn't allow fingerprinting". Issue moved here in order to discontinue use of Bugzilla for Speech API.
From https://bugzilla.mozilla.org/show_bug.cgi?id=1372325#c14
A sample is the following:
var utter = new SpeechSynthesisUtterance("Hello World");
utter.onend = () => {
// Edge, WebKit and Blink allow this, but Firefox doesn't.
window.speechSynthesis.speak(utter);
};
window.speechSynthesis.speak(utter);
Google's search page relies on SpeechSynthesisUtterance being reused, but the Speech API doesn't define whether this object is reusable.
Edge, WebKit, and Blink currently allow a SpeechSynthesisUtterance to be reused, but Gecko doesn't. If reusing this object is allowed, we should add spec text documenting its reusability.
https://w3c.github.io/speech-api/#speechsynthesis
The spec doesn't define when precisely the pending, speaking and paused states change. This makes it impossible to write a detailed test for SpeechSynthesisUtterance pause() and resume() based on the spec.
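For illustration, a sketch of the kind of observation such a test would need to make; exactly when each flag flips relative to these calls is what the spec leaves open:

const utterance = new SpeechSynthesisUtterance("state transitions");
const logState = (label) =>
  console.log(label, {
    pending: speechSynthesis.pending,
    speaking: speechSynthesis.speaking,
    paused: speechSynthesis.paused,
  });

logState("before speak()");
speechSynthesis.speak(utterance);
logState("after speak()");
speechSynthesis.pause();
logState("after pause()");
speechSynthesis.resume();
logState("after resume()");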
https://w3c.github.io/speech-api/webspeechapi.html#dfn-ttsspeak says "The SpeechSynthesis object takes exclusive ownership of the SpeechSynthesisUtterance object. Passing it as a speak() argument to another SpeechSynthesis object should throw an exception."
It doesn't say what exception to throw. It should.
Spun off from #7 and a TODO in web-platform-tests/wpt#7510
Support .wav, .webm, .ogg, .mp3 files (file types supported by the implementation decoders) and Float32Array and ArrayBuffer input to SpeechRecognition.
Use cases for static audio file and ArrayBuffer (non-"real-time") input to SpeechRecognition include, but are not limited to:
- SpeechRecognition input to achieve expected text output
AudioWorkletNode can be used to stream Float32Array input; a rough sketch follows below.
Related #66
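As a rough sketch under those assumptions, here is how a page might produce the Float32Array samples from a static audio file today; the final hand-off to SpeechRecognition is the hypothetical part:

async function samplesFromFile(url) {
  const ctx = new AudioContext();
  const encoded = await (await fetch(url)).arrayBuffer();
  const decoded = await ctx.decodeAudioData(encoded); // .wav, .webm, .ogg, .mp3, ...
  const samples = decoded.getChannelData(0); // Float32Array of PCM samples
  // A future overload could accept this directly, e.g.
  // recognition.start(samples); // hypothetical, not in the current spec
  return samples;
}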
I have tried all the methods to detect when sound starts and ends, but they are not triggered correctly.
I assume that onspeechstart should be triggered only when speech input is detected, and onspeechend as soon as no sound is detected.
With onsoundstart and onspeechstart, the event is triggered soon after the browser starts listening, even in silence.
And onspeechend is triggered a few seconds after all noise has stopped.
I'm testing on my laptop with the built-in microphone on Chrome 76.0.3809.100.
Related 2013 discussion: https://lists.w3.org/Archives/Public/public-speech-api/2013Feb/0011.html
There was a previous attempt to remove SpeechRecognitionResultList, but it was reverted because Web IDL forbids attributes from returning sequences.
The type of the attribute, after resolving typedefs, must not be a nullable or non-nullable version of any of the following types:
- a sequence type
- a dictionary type
- a record type
- a union type that has a nullable or non-nullable sequence type, dictionary, or record as one of its flattened member types
Attributes can still use FrozenArray, which is also represented by a native JS array object.
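For illustration, a sketch of what that would mean from script, assuming results became a FrozenArray<SpeechRecognitionResult> (hypothetical; the attribute currently returns a SpeechRecognitionResultList):

const recognition = new SpeechRecognition();
recognition.onresult = (event) => {
  const results = event.results;
  // A FrozenArray is reflected as a real, frozen JS array, so array methods work:
  console.log(Array.isArray(results));   // true
  console.log(Object.isFrozen(results)); // true
  const transcripts = results.map((result) => result[0].transcript);
  console.log(transcripts.join(" "));
};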