
speech-api's Issues

onpause doesn't fire in chrome

speechSynthesis.cancel();
const text = 'This is a story about a man who once lived in the mountains, far in the northern region of our country, in a village which was known for its tasty food. He always had a dream as a kid to travel around the world.';
const utterance = new SpeechSynthesisUtterance(text);
utterance.voice = speechSynthesis.getVoices().filter(x => x.lang == 'en-US' && x.voiceURI == 'Google US English')[0];
utterance.onpause = e => console.log(e);
speechSynthesis.speak(utterance);

setTimeout(function () {
  speechSynthesis.pause(); // pause() takes no arguments
}, 3000);

But if you set it to this voice, it works:

utterance.voice = speechSynthesis.getVoices().filter(x => x.lang == 'en-US' && x.name == 'Alex')[0]

Mac
Chrome Version 86.0.4240.75 (Official Build) (x86_64)

Getter operations return null without proper IDL

For example, the SpeechRecognitionResult's getter operation says:

If index is greater than or equal to length, this returns null

But the IDL does not indicate so:

[Exposed=Window]
interface SpeechRecognitionResult {
    readonly attribute unsigned long length;
    getter SpeechRecognitionAlternative item(unsigned long index);
    readonly attribute boolean isFinal;
};

The return type must be nullable here. I'm not sure how Blink manages to return null here, though.
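
For reference, a quick script-level check of the behavior described above (a sketch; it assumes a Blink-based browser, where the constructor may still be prefixed):

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.onresult = event => {
  const result = event.results[0];
  // Per the prose quoted above, an out-of-range index yields null,
  // even though the IDL return type is a non-nullable SpeechRecognitionAlternative.
  console.log(result.item(result.length));
};
recognition.start();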

API and permissions

This specification defines an API that might be gated by a user prompt.
As per the spec: User agents must only start speech input sessions with explicit, informed user consent.

In that case, it seems some language could be added referring to https://w3c.github.io/permissions/.
That might also help define the behavior for third-party contexts, as is done for other APIs such as getUserMedia.
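
A sketch of how Permissions API integration might look from script, assuming the existing "microphone" permission name is what gets consulted (how speech input sessions map to permissions is exactly what would need to be defined):

navigator.permissions.query({ name: 'microphone' }).then(status => {
  console.log('microphone permission:', status.state); // "granted" | "denied" | "prompt"
  status.onchange = () => console.log('permission changed to', status.state);
});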

Constructors for speech events

Currently new SpeechSynthesisEvent("pending", { utterance: new SpeechSynthesisUtterance() }) works on Edge and Firefox but throws on Chrome.

Edge: only requires the first argument
Firefox: requires both the first and second arguments
Chrome: always throws

Can we spec this to align browser behaviors?
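
For reference, a minimal reproduction of the divergence (the constructor call and dictionary shape are the ones from the example above):

try {
  const event = new SpeechSynthesisEvent('pending', {
    utterance: new SpeechSynthesisUtterance('hello')
  });
  console.log('constructed:', event.type, event.utterance.text);
} catch (err) {
  // Observed in Chrome at the time of this report.
  console.log('constructor threw:', err.name);
}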

The UA should be able to disallow speak() from autoplaying

See here for context from [email protected].

I propose we only allow speak() to succeed without error if the caller frame or any of its parents has received user activation since their last load.

There are two options for error handling in the case that speak() fails this way:

  1. Use SpeechSynthesis' existing error mechanism with either the "not-allowed" error type or a new one.
  2. Make speak() return a promise, to align with HTMLMediaElement.play().

I slightly prefer (1) just because the ergonomics are not much different from (2), and it keeps the API surface consistent, but I'm open to either.
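
A sketch of how option (1) might surface to authors, assuming the existing "not-allowed" error type is reused (none of this is specified yet):

const utterance = new SpeechSynthesisUtterance('hello');
utterance.onerror = event => {
  if (event.error === 'not-allowed') {
    // Assumed behavior under this proposal: speak() was called before any
    // user activation, so the utterance is rejected instead of being queued.
    console.log('speech blocked until the user interacts with the page');
  }
};
speechSynthesis.speak(utterance);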

Fix URL next to description of the project

I guess the “website” field that is shown on the front page of the repo, at the top, should be either
https://w3c.github.io/speech-api/webspeechapi.html
or
https://w3c.github.io/speech-api/speechapi.html
?

Remove SpeechRecognitionEvent's interpretation and emma

Beyond Speech-to-Text

With respect to post-text speech recognition (e.g. speech-to-SSML, speech-to-hypertext, speech-to-X1), we can consider:

from:

// Item in N-best list
[Exposed=Window]
interface SpeechRecognitionAlternative {
    readonly attribute DOMString transcript;
    readonly attribute float confidence;
};

to:

// Item in N-best list
[Exposed=Window]
interface SpeechRecognitionAlternative {
    readonly attribute object transcript;
    readonly attribute float confidence;
};

Then client-side, server-side or third-party components or services could return either text or XML content per recognition result. That is, transcript could be either a DOMString or a DOM Element.

Speech-to-text is too lossy. Information pertaining to prosody, intonation, emphases and pauses is discarded in text-formatted output. Such information can be useful, for instance, in informing machine translation components and services.
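
A sketch of how a consumer might handle the widened transcript type under this proposal (hypothetical; today transcript is always a DOMString):

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.onresult = event => {
  const alternative = event.results[0][0];
  if (typeof alternative.transcript === 'string') {
    console.log('plain text:', alternative.transcript);       // speech-to-text
  } else if (alternative.transcript instanceof Element) {
    console.log('markup:', alternative.transcript.outerHTML); // speech-to-SSML / speech-to-hypertext
  }
};
recognition.start();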

Add profanity/offensive words filter attribute

I have no idea if I am on the right track as to why curse words appear differently in the transcript of SpeechRecognitionResult in different browsers, so I thought it best to open an issue here.

Question
If browsers implement the transcript of SpeechRecognitionResult in such a way that the output differs, maybe a profanity filter attribute could be useful so that the developer using the API has a choice in the matter? For example, an offensiveWordFilter attribute of type boolean?
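
A sketch of what using the proposed attribute could look like (offensiveWordFilter is hypothetical and not part of the current spec):

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.offensiveWordFilter = false; // hypothetical: opt out of censoring such as "s****"
recognition.onresult = event => console.log(event.results[0][0].transcript);
recognition.start();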

Background Story
While experimenting with the SpeechRecognition interface
in the phrase-matcher from https://github.com/mdn/web-speech-api/, the following occurred:

  1. When using Chrome and saying a curse word like shit, the transcript in SpeechRecognitionResult is censored as s****
  2. When using Firefox Nightly and saying a curse word like shit, the transcript in SpeechRecognitionResult is not censored

In neither Chrome nor Nightly is this type of censoring applied to the speechSynthesis interface, as used in the speak-easy-synthesis demo.

In my search into why this happens I found the following:
In https://github.com/chromium/chromium/blob/master/content/browser/speech/speech_recognition_engine.cc, filter_profanities is set to false on line 277, which on line 579 should result in pFilter=0. According to https://stackoverflow.com/questions/15030339/remove-profanity-censor-from-google-speech-recognition/15071054, setting pfilter=0 removes the profanity filter, which could lead to the conclusion that this has been changed in Chrome. I do not feel confident in this conclusion, however.

In Nightly I have found no reference to a profanity filter in the code: https://dxr.mozilla.org/mozilla-central/source/dom/media/webspeech/recognition

Hypertext-to-Speech

I would like to share some ideas with respect to advancing the Web Speech API. The ideas pertain to hypertext-to-speech, the speech synthesis of XHTML.

Hypertext-to-Speech

Hypertext-to-speech involves processing XHTML DOM trees with subtopics including SSML attributes, CSS Speech Module properties and new JavaScript events such as onsynthesis.

<html xmlns:ssml="http://www.w3.org/2001/10/synthesis">
  <head>
    <script type="text/javascript" src="..."></script>
    <style>
      [role="topic"] { voice-stress: strong; }
    </style>
  </head>
  <body ssml:alphabet="ipa">
    <p>
      <span id="sentence-1" onsynthesis="function(event);"><span ssml:ph="/ðɪs/" role="topic">This</span> is a sentence of <span ssml:ph="/tɛkst/">text</span> with markup for hypertext-to-speech.</span>
    </p>
  </body>
</html>

JavaScript API

speechSynthesis.speak(document);
speechSynthesis.speak(document.getElementById('sentence-1'));
var fragment = document.createDocumentFragment();
...
speechSynthesis.speak(fragment);
var doc = document.implementation.createDocument('http://www.w3.org/1999/xhtml', 'html', null);
...
speechSynthesis.speak(doc);

References

  1. https://w3c.github.io/publ-epub-revision/epub32/spec/epub-contentdocs.html#sec-xhtml-ssml-attrib
  2. https://www.w3.org/TR/css3-speech/
  3. https://www.w3.org/community/exercises-and-activities/wiki/Hypertext-to-Speech_and_Media_Overlays

SpeechSynthesisEvent does not support charLength

charLength is used during word and sentence boundary events, and allows API consumers to create features that highlight words and sentences as they are spoken out loud. For those with dyslexia or problems concentrating, this is a powerful accessibility feature. In addition, most major computing platforms already support some form of charLength, so it would make sense to add charLength to the web standards. An older request for the same thing exists.
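
A sketch of how a highlighting feature might use it, assuming a charLength field on boundary events (charIndex exists today; charLength is the requested addition):

const text = 'The quick brown fox jumps over the lazy dog.';
const utterance = new SpeechSynthesisUtterance(text);
utterance.onboundary = event => {
  const start = event.charIndex;                    // available today
  const end = start + (event.charLength ?? 0);      // hypothetical until specified
  console.log('speaking:', text.slice(start, end)); // e.g. highlight this range in the UI
};
speechSynthesis.speak(utterance);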

How is a complete SSML document expected to be parsed when set once at .text property of SpeechSynthesisUtterance instance?

According to the specification

5.2.3 SpeechSynthesisUtterance Attributes, text attribute: "This attribute specifies the text to be synthesized and spoken for this utterance. This may be either plain text or a complete, well-formed SSML document."

It is not clear how an entire SSML document is expected to be parsed when set as the .text property of a single SpeechSynthesisUtterance() instance.
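
For concreteness, setting a complete SSML document as .text looks like the following; whether (and how) the markup is parsed rather than read out literally is exactly what needs clarifying:

const ssml = `<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  Hello <break time="500ms"/> world.
</speak>`;
const utterance = new SpeechSynthesisUtterance(ssml);
speechSynthesis.speak(utterance); // implementations may speak the markup verbatim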

See guest271314/SSMLParser#1

Which technique is used for voice to text transformation?

I am trying to find out which technique is used for transforming voice into text, whether a neural network or some other method, but I was unsuccessful.

Question
Which AI technique is used for voice to text transformation?

Background Story
I am writing an article in which a web application uses the Web Speech API, and the technique used is important information.

SpeechRecognition MUST be granted permission to record and send users voice to a remote web service

Google Chrome and the ostensibly open source browser Chromium implement SpeechRecognition by recording the user's voice and sending the user's biometric data to a remote web service (https://bugs.chromium.org/p/chromium/issues/detail?id=816095) without notifying the user, or being granted permission by the user (one cannot grant permission if not notified), that use of SpeechRecognition in that browser will do the preceding with their voice. Some of the issues relating to that undisclosed practice are described at w3c/webappsec-secure-contexts#66 (comment).

To remedy that horrendous issue at the specification level, it should be a simple matter of including language that states SpeechRecognition MUST do at least the following if the implementation records the user's voice and sends that recording to a remote web service:

  1. If speech recognition is not performed in real time locally in the browser, the implementation MUST notify the user that, if they use that implementation of the Web Speech API in that browser, their voice will be recorded and sent to a remote web service when they use SpeechRecognition;
  2. The SpeechRecognition implementation MUST get permission from the user before recording their voice to send to a remote web service;
  3. The SpeechRecognition implementation MUST let the user know precisely where their biometric data is being sent, how long their recorded voice will be stored, and when the recording of their voice is deleted from the storage devices at the remote web service.

I will file a PR if necessary to fix this long-standing issue at the specification level.

Support SpeechSynthesis *to* a MediaStreamTrack

It would be very helpful to be able to get a stream of the output of SpeechSynthesis.

As explicit use cases, I would like to:

  • position speech synthesis in a virtual world in WebXR (using Web Audio's PannerNode)
  • be able to feed speech synthesis output through a WebRTC connection
  • have speech synthesis output be able to be processed through Web Audio

(This is a similar/inverse/matching/related feature to #66.)
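
A sketch of what such an API could look like; speakToTrack is a hypothetical name used only for illustration:

const utterance = new SpeechSynthesisUtterance('Hello from a virtual world');
const track = speechSynthesis.speakToTrack(utterance); // hypothetical method returning a MediaStreamTrack

// Route the synthesized audio through Web Audio, e.g. a PannerNode for WebXR positioning;
// the same track could instead be added to an RTCPeerConnection for WebRTC.
const audioContext = new AudioContext();
const source = audioContext.createMediaStreamSource(new MediaStream([track]));
const panner = new PannerNode(audioContext, { positionX: 1, positionY: 0, positionZ: -2 });
source.connect(panner).connect(audioContext.destination);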

Client-side, Server-side and Third-party Speech Recognition, Synthesis and Translation

Introduction

We can envision and consider client-side, server-side and third-party speech recognition, synthesis and translation scenarios for a next version of the Web Speech API.

Advancing the State of the Art

Speech Recognition

Beyond speech-to-text, speech recognition includes speech-to-SSML and speech-to-hypertext. With speech-to-SSML and speech-to-hypertext, a higher degree of fidelity is possible when round-tripping speech audio through speech recognition and synthesis components or services.

Speech Synthesis

Beyond text-to-speech, speech synthesis includes SSML-to-speech and hypertext-to-speech.

Translation

Translation scenarios include processing text, SSML, hypertext or audio in a source language into text, SSML, hypertext or audio in a target language.

Desirable features include interoperability between client-side, server-side and third-party translation and WebRTC with translations available as subtitles or audio tracks.

Multimodal Dialogue Systems

Interesting scenarios include Web-based multimodal dialogue systems which efficiently utilize client-side, server-side and third-party speech recognition, synthesis and translation.

Client-side Scenarios

Client-side Speech Recognition

These scenarios are considered in the current version of the Web Speech API.

Client-side Speech Synthesis

These scenarios are considered in the current version of the Web Speech API.

Client-side Translation

These scenarios are new to the Web Speech API and involve the client-side translation of text, SSML, hypertext or audio into text, SSML, hypertext or audio.

Server-side Scenarios

Server-side Speech Recognition

These scenarios are new to the Web Speech API and involve one or more audio streams from a client being streamed to a server which performs speech recognition, optionally providing speech recognition results to the client.

Server-side Speech Synthesis

These scenarios are new to the Web Speech API and involve a client sending text, SSML or hypertext to a server which performs speech synthesis and streams audio to the client.

Server-side Translation

These scenarios are new to the Web Speech API and involve a client sending text, SSML, hypertext or audio to a server for translation into text, SSML, hypertext or audio.

Third-party Scenarios

Third-party Speech Recognition

These scenarios are new to the Web Speech API and involve one or more audio streams from a client or server being streamed to a third-party service which performs speech recognition providing speech recognition results to the client or server.

Third-party Speech Synthesis

These scenarios are new to the Web Speech API and involve a client or server sending text, SSML or hypertext to a third-party service which performs speech synthesis and streams audio to the client or server.

Third-party Translation

These scenarios are new to the Web Speech API and involve a client sending text, SSML, hypertext or audio to a third-party translation service for translation into text, SSML, hypertext or audio.

Hyperlinks

Amazon Web Services

Google Cloud AI

IBM Watson Products and Services

Microsoft Cognitive Services

Real Time Translation in WebRTC

Defining what should happen if an utterance is changed after passing it to speak()

Consider the following case:

const utterance = new SpeechSynthesisUtterance('hello world');
speechSynthesis.speak(utterance);
utterance.text = 'hello world two';

The spec currently says

If changes are made to the SpeechSynthesisUtterance object after calling this method and prior to the corresponding end or error event, it is not defined whether those changes will affect what is spoken, and those changes may cause an error to be returned.

but it's not very satisfactory, as the web platform tries very hard to avoid any possible undefined behavior.

Chrome seems to “snapshot” the actual utterance information at the time of the speak(), which sounds like a good thing to specify.

Also see #29.

Support SpeechRecognition on an audio MediaStreamTrack

There is an old issue in Bugzilla, but it doesn't contain much discussion.

We should revive this, to give the application control over the source of audio.

Not letting the application be in control has several issues:

  • The spec needs to re-define (mediacapture-main does this too) what sources there are; today I don't see this mentioned. It seems implied that a microphone is used.
  • The error with code "audio-capture" groups all kinds of errors that capturing audio from a microphone could have.
  • There'd have to be an additional permission setting for speech, in addition to that of audio capture through getUserMedia. The spec currently doesn't help in clearing up how this relates to getUserMedia's permissions, and doing so could become complicated (if capture is already ongoing, do we ask again? if not, how does a user choose a device? etc.)
  • Depending on the implementation, if we rely on start() requesting audio from getUserMedia() (which seems reasonable), making multiple requests one after another could lead to a new permission prompt for each one, unless the user consents to a permanent permission. This would be an issue in Firefox, as an application cannot control the lifetime of the audio capture through the SpeechRecognition API.
  • Probably more.

Letting the application be in control has several advantages:

  • It can rely on mediacapture-main and its extension specs to define sources of audio and all security and privacy aspects around them. Some language might still be needed around cross-origin tracks. There's already a concept of isolated tracks in webrtc-identity, that will move into the main spec in the future, that one could rely on for the rest.
  • If no backwards-compatible path is kept, the spec can be simplified by removing all text, attributes, errors, etc. related to audio-capture.
  • The application is in full control of the track's lifetime, and thus can avoid any permission prompts the user agent might otherwise throw at the user, when doing multiple speech recognitions.
  • The application can recognize speech from other sources than microphones.
  • Probably more.

To support a MediaStreamTrack argument to start(), we need to:

  • Throw in start() if the track is not of kind "audio".
  • Throw in start() if the track's readyState is not "live".
  • Throw in start() if the track is isolated.
  • If the track becomes isolated while recognizing, discard any pending results and fire an error.
  • If the track ends while recognizing, treat it as the end of speech and handle it gracefully.
  • If the track is muted or disabled, do nothing special as this means the track contains silence. It could become unmuted or enabled at any time.

What to throw and what to fire I leave unsaid for now.
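
A sketch of the resulting author-facing usage, assuming start() grows an optional MediaStreamTrack argument (hypothetical; today start() takes no arguments):

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.onresult = event => console.log(event.results[0][0].transcript);

navigator.mediaDevices.getUserMedia({ audio: true }).then(stream => {
  const [track] = stream.getAudioTracks();
  recognition.start(track); // hypothetical overload: recognize from this track instead of the default microphone
});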

Transcripts are Insufficient for Japanese

Currently, the only implementation of the spec that I'm aware of, in Google Chrome, returns a mixture of surface kanji and kana when using Japanese-language speech to text.

Having kanji is great for semantics, but in Japanese, phonology is important in many cases. Without phonology, ambiguity is left in the intended transcription.

A whole class of use cases becomes impossible. For example:

  • Japanese names are written in kanji, but there can be multiple readings of those kanji, only one of which maps to an individual.
  • If I utter てい, the Google WebSpeech API returns 体. This kanji has four readings, and てい could belong to the reading of many other kanji. (I'm creating a voice-enabled Japanese flashcard app, and this problem only lets me guess whether a user is uttering a kanji/vocab item correctly; I cannot be sure because of the API's shortcomings for Japanese.)

Ideally we would always get the furigana (the kana that represent which sounds were made for the kanji) along with the kanji. Kana does have, in essence, a 1-to-1 phonetic mapping.

Better yet, a more general solution that would work for other languages with similar issues could be returning an IPA pronunciation along with the transcript, like so:

interface SpeechRecognitionAlternative {
    readonly attribute DOMString transcript;
    readonly attribute DOMString pronunciation;
    readonly attribute float confidence;
};
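
A sketch of what consuming the proposed attribute could look like (pronunciation is hypothetical; transcript exists today):

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.lang = 'ja-JP';
recognition.onresult = event => {
  const alternative = event.results[0][0];
  // pronunciation would carry the phonetic reading (IPA or kana) alongside the kanji transcript.
  console.log(alternative.transcript, alternative.pronunciation);
};
recognition.start();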

☂ Bugzilla bugs

Define how grammars work and give examples

You should provide a fully working example of how to use a grammar, because I did not see any use case where I could use one.

I do not understand the purpose of grammars or how to use them.
A Google search did not help me.

Web Speech API not working on a Chinese network

Hi, we have a problem with our Web Speech API solution in Chrome with a Chinese client. Speech recognition starts but does not give us a result. It only works with a VPN (because Google servers are blocked in China). We would like to make our solution available in Chrome without a VPN. Is there a solution for this problem?

Is SpeechSynthesisUtterance reusable?

From https://bugzilla.mozilla.org/show_bug.cgi?id=1372325#c14

The sample is the following:

var utter = new SpeechSynthesisUtterance("Hello World");
utter.onend = () => {
  // Edge, WebKit and Blink allow this, but Firefox doesn't.
  window.speechSynthesis.speak(utter);
};
window.speechSynthesis.speak(utter);

Google's search page relies on a SpeechSynthesisUtterance being reused, but the Speech API doesn't define whether this object is reusable.

Edge, WebKit, and Blink currently allow a SpeechSynthesisUtterance to be reused, but Gecko doesn't. If reusing this object is allowed, we should add spec text documenting that it is reusable.

Support SpeechRecognition input from audio files and Float32Array and ArrayBuffer

Support .wav, .webm, .ogg, .mp3 files (file types supported by the implementation decoders) and Float32Array and ArrayBuffer input to SpeechRecognition.

Use cases for static audio file and ArrayBuffer (non-"real-time") input to SpeechRecognition include, but are not limited to:

  • TTS to audio file, audio file to STT, audio output to TTS (document reader to audio output)
  • Research, development, testing and analysis of speech recognition technologies in general and of the accuracy of the application itself
  • Editing and modifying existing static audio files before SpeechRecognition input to achieve the expected text output

AudioWorkletNode can be used to stream Float32Array input.
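
A sketch of what the requested surface could look like, using decodeAudioData to obtain a Float32Array (the file name and the start() overload shown are hypothetical):

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.onresult = event => console.log(event.results[0][0].transcript);

fetch('speech.wav')                                  // hypothetical static audio file
  .then(response => response.arrayBuffer())
  .then(buffer => new AudioContext().decodeAudioData(buffer))
  .then(audioBuffer => {
    const samples = audioBuffer.getChannelData(0);   // Float32Array of PCM samples
    recognition.start(samples);                      // hypothetical overload accepting Float32Array input
  });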

Related #66

onspeechstart onspeechend onsoundstart onsoundend

I have tried all the methods to detect when sound starts and ends, but they are not triggered correctly.
I assume that onspeechstart should be triggered only when speech input is detected, and onspeechend as soon as no sound is detected.
With onsoundstart and onspeechstart, the event is triggered soon after the browser starts listening, even in silence.
And onspeechend is triggered a few seconds after all noise has stopped.

I'm testing on my laptop with the built-in microphone on Chrome 76.0.3809.100.
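
For reference, a minimal logging sketch of the kind of test setup described above (assumed, not taken from the report):

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
['soundstart', 'soundend', 'speechstart', 'speechend'].forEach(type =>
  recognition.addEventListener(type, () => console.log(type, performance.now()))
);
recognition.start();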

Use FrozenArray<SpeechRecognitionResult> and remove SpeechRecognitionResultList

Related 2013 discussion: https://lists.w3.org/Archives/Public/public-speech-api/2013Feb/0011.html

There was a previous attempt to remove SpeechRecognitionResultList, but it was reverted because Web IDL forbids attributes from returning sequences:

The type of the attribute, after resolving typedefs, must not be a nullable or non-nullable version of any of the following types:

  • a sequence type
  • a dictionary type
  • a record type
  • a union type that has a nullable or non-nullable sequence type, dictionary, or record as one of its flattened member types

Attributes can still use FrozenArray, which is also represented by a native JS array object.
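
From script, the practical difference would look roughly like this (a sketch, assuming results becomes a FrozenArray):

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.onresult = event => {
  // With FrozenArray<SpeechRecognitionResult>, results is a real (frozen) JS array,
  // so array methods work directly; with SpeechRecognitionResultList you need
  // Array.from(event.results) first.
  const transcripts = event.results.map(result => result[0].transcript);
  console.log(transcripts);
};
recognition.start();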
