Comments (30)

foolip commented on July 28, 2024 3

With help from @gshires I've been able to locate what happens with the grammars in Chromium: the weights and URLs are passed along to the speech recognition engine. Since neither is interpreted by Chromium, this is effectively just a way to pass engine-specific configuration or options along.

As with text layout engines or WebRTC, having some controls is reasonable, and standardizing some options across speech engines might be possible.

So, at least as implemented in Chromium, grammars aren't quite what you'd expect them to be, and I would not recommend trying to use them since the behavior isn't documented and could change.

I've sent https://chromium-review.googlesource.com/c/chromium/src/+/1732790 to measure the usage of these APIs in more detail.

from speech-api.

foolip commented on July 28, 2024 2

@kdavis-mozilla Chromium only exposes the interfaces and passes along the grammar URL and weight as given without interpretation. Any actual effect would be in the speech engine service and that is neither open source nor documented, AFAICT. I wasn't able to find any use of grammars in httparchive that seems to have any effect in Chrome.

@compulim I'm pretty sure at this point that addFromString doesn't do anything at all in Chromium, and the source code doesn't mention JSGF anywhere.

What I'd like to do at this point:

  • Understand what configuration the engine is capable of via the grammar-related APIs
  • Wait for stats from new use counters to be available

I think the outcome will likely be adding other ways to configure the speech recognition, maybe more attributes. But it depends a lot on the feasibility of changing/removing the existing APIs.

marcoscaceres commented on July 28, 2024 1

A new issue for removing SpeechGrammarList and associated uses would be great.

foolip commented on July 28, 2024 1

https://bugs.chromium.org/p/chromium/issues/detail?id=680944 has some clues about what grammars are for, although it's a bug about something that doesn't work:

var recognizer = new SpeechRecognition();  // standard (unprefixed) constructor
recognizer.continuous = false; // stop listening at a pause
recognizer.lang = "en-GB";     // e.g. "en-GB" or "el-GR"
recognizer.interimResults = true;
recognizer.maxAlternatives = 5;
var commands = ['transfer', 'inquiry', 'statement', 'balance'];
// JSGF alternatives are separated by "|"
var grammar = '#JSGF V1.0; grammar commands; public <command> = '
  + commands.join(' | ') + ' ;';
var speechRecognitionList = new SpeechGrammarList();
speechRecognitionList.addFromString(grammar, 1);
recognizer.grammars = speechRecognitionList;

This example is about limiting recognition to a small set of words. A banking use case can be inferred from the set of strings in the example: 'transfer' , 'inquiry', 'statement', 'balance'.

But again, this doesn't work in Chrome.
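
For reference, the grammar string that snippet produces can be built and inspected without a browser. This is only a sketch: buildCommandGrammar is a hypothetical helper name, not part of the Web Speech API, and nothing here implies that any engine interprets the result.

```javascript
// Hypothetical helper (not part of the Web Speech API): builds the same
// JSGF string as the snippet from the Chromium bug.
function buildCommandGrammar(grammarName, ruleName, words) {
  if (!Array.isArray(words) || words.length === 0) {
    throw new TypeError("words must be a non-empty array");
  }
  // JSGF alternatives are separated by "|".
  return `#JSGF V1.0; grammar ${grammarName}; public <${ruleName}> = ${words.join(" | ")} ;`;
}

const grammar = buildCommandGrammar("commands", "command", [
  "transfer",
  "inquiry",
  "statement",
  "balance",
]);
console.log(grammar);
```

Logging the string makes it easy to confirm that the alternatives are joined with a plain "|" before it is handed to addFromString().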

@jlguenego I'll go ahead and rename this issue to go beyond just asking for examples, hope you don't mind.

kdavis-mozilla commented on July 28, 2024 1

...maybe more attributes...

As a designer of an STT engine I'd say this is not the way to go.

The only attributes I can think of that might make sense across engines are grammars/language models with weights. Even this is very engine specific.

Generally, there are many other such "attributes" that are STT engine specific and even vary from one version of an STT engine to the next. So exposing them in this API is not going to be a good design decision.

foolip commented on July 28, 2024

Do you mean how to use recognition.grammars.addFromURI(...) and recognition.grammars.addFromString(...)? A while ago I tried to work this out myself by looking at the Chromium source code, but I couldn't get to the bottom of what they actually do.

@gshires do you have any context on what these methods do?

@marcoscaceres when you've looked at the API, could you make sense of this bit?

I'm adding use counters to Chromium to figure out how this is used, and if the grammar stuff isn't really used in the wild it's possible it could be removed.

jlguenego commented on July 28, 2024

Yes, I mean understanding a use case where recognition.grammars.addFromURI(...) would be useful.

kdavis-mozilla commented on July 28, 2024

I do not understand what is the purpose of having grammar...

The basic idea is to decrease word error rate for STT.

...and how to use it

You'd set grammars on a SpeechRecognition instance before calling start().

For example, say you are doing English STT for a restaurant that serves Indian dishes. You'd want a grammar that includes the names of those dishes, e.g. palak paneer, to decrease the word error rate of an STT engine unfamiliar with such terms.
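
A rough sketch of that ordering follows, assuming a JSGF string is what the engine wants (the spec does not say). startWithDishGrammar is a hypothetical helper, "dal makhani" is an added illustrative phrase, and the recognizer and grammar list are stand-in objects that only record calls so the snippet runs outside a browser.

```javascript
// Hypothetical helper: `recognition` and `grammarList` are passed in so the
// function can be exercised with stand-ins outside a browser.
function startWithDishGrammar(recognition, grammarList, dishes, weight = 1) {
  const jsgf = `#JSGF V1.0; grammar dishes; public <dish> = ${dishes.join(" | ")} ;`;
  grammarList.addFromString(jsgf, weight);
  recognition.grammars = grammarList; // set before start(), as described above
  recognition.start();
  return jsgf;
}

// Stand-ins that only record calls. In a browser these would be instances of
// SpeechRecognition and SpeechGrammarList (webkit-prefixed in Chrome).
const calls = [];
const fakeList = {
  addFromString: (source, weight) => calls.push(["addFromString", source, weight]),
};
const fakeRecognition = { grammars: null, start: () => calls.push(["start"]) };

startWithDishGrammar(fakeRecognition, fakeList, ["palak paneer", "dal makhani"]);
console.log(calls.map((call) => call[0]));
```

Whether a real engine uses the grammar at all is a separate question; the sketch only shows the order of operations.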

As to if it's used in Chromium, I do not know.

marcoscaceres commented on July 28, 2024

The spec fails to specify the format that the grammar is in (see "ISSUE 3" in the spec). This really should not be in the spec at all, given how poorly specified it all is.

We should get rid of all of SpeechGrammarList entirely (looks like more fingerprinting surface).

In Chrome, the src just returns the URL from the web page - so seems completely useless.

kdavis-mozilla commented on July 28, 2024

@marcoscaceres, here @jlguenego was asking: What is a use case for a grammar.

The issues you mention, while valid, are orthogonal to the "what is a use case for a grammar" question. So maybe they should be in a different GitHub issue?

marcoscaceres commented on July 28, 2024

Sent PR #58

foolip commented on July 28, 2024

Searching for "webkitSpeechRecognition JSGF" there are some examples:
https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition/grammars#Examples

StackOverflow questions:

kdavis-mozilla commented on July 28, 2024

A fuller example from MDN is the tutorial we wrote several years ago.

However, I doubt if any browser supports the tutorial as written as it uses JSGF which as far as I know is not supported in any browser.

foolip commented on July 28, 2024

looks like more fingerprinting surface

Unless the interface works and actually does something useful by changing the outcome of speech recognition, the API surface itself is write-only and doesn't reveal anything. There isn't even a way to feature detect what's supported :)
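
To illustrate how little a page can learn, here is a sketch of the only check available: whether the constructors exist. detectSpeechApi is a hypothetical helper name. It says nothing about which grammar formats, if any, the engine behind the interface honors.

```javascript
// Reports which Web Speech constructors exist on a given global object.
// Presence is all that can be detected; grammar support cannot.
function detectSpeechApi(globalObject) {
  const pick = (name) => globalObject[name] || globalObject[`webkit${name}`] || null;
  return {
    recognition: pick("SpeechRecognition") !== null,
    grammarList: pick("SpeechGrammarList") !== null,
  };
}

console.log(detectSpeechApi(globalThis));
```

Passing the global object as a parameter keeps the function usable both in a page and in a script runner that has no speech interfaces.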

foolip commented on July 28, 2024

I've done some digging in HTTP Archive for pages containing "SpeechRecognition" and ".grammars" and found 381 results. Most are variations of the same script, and all the bits producing grammar strings that I could interpret are using JSGF. So, probably that has worked to some extent.

foolip commented on July 28, 2024

I found 32 references to "jspeech" which is probably https://github.com/tur-nr/node-jspeech maintained by @tur-nr. @tur-nr, do you know which browsers JSGF works in?

kdavis-mozilla commented on July 28, 2024

@foolip Andre Natal and I implemented JSGF in Firefox and Firefox OS many years ago, both "preffed off" by default, but both implementations have since been removed.

I'd guess most of the HTTP Archive hits you found are variations of the tutorial we made for the now dead Firefox OS.

foolip commented on July 28, 2024

@kdavis-mozilla numerically most actually seem to be https://cdn.botframework.com/botframework-webchat/latest/botchat.js or variations on this. botframework.com is a Microsoft framework, so perhaps we could find someone to help shed light on how grammar is being used in this project. @thejohnjansen, are you able to help?

tur-nr commented on July 28, 2024

@foolip I wrote that library a few years back when doing a hackathon project with the speech API. I don't know the browser support unfortunately, but I used Chromium (WebKit) Speech Recognition. Wasn't sure it was doing anything as Chrome just captured anything I said regardless of the grammar I gave it 🤷‍♀️

Microsoft have used it in their bot framework yes. You can review their usage on GitHub.

Hope that helps 😬

foolip commented on July 28, 2024

Thanks @tur-nr! I tried to find use of grammars in https://github.com/microsoft/BotFramework-WebChat but couldn't find the same code as I see in https://cdn.botframework.com/botframework-webchat/latest/botchat.js.

@billba @danmarshall @corinagum, I see you are among the top contributors to that repo at Microsoft. Can any of you shed light on how https://cdn.botframework.com/botframework-webchat/latest/botchat.js uses the web-exposed APIs in https://w3c.github.io/speech-api/#speechreco-speechgrammar, which we're discussing in this issue? Specifically, what kind of values are you passing to addFromString and have you found it to have an effect on any browser?

saschanaz commented on July 28, 2024

Thanks @tur-nr! I tried to find use of grammars in https://github.com/microsoft/BotFramework-WebChat but couldn't find the same code as I see in https://cdn.botframework.com/botframework-webchat/latest/botchat.js.

That code resembles https://github.com/microsoft/BotFramework-WebChat/blob/4849ce2928125475ee801a7bc90973cfa8db9c6e/src/SpeechModule.ts

foolip commented on July 28, 2024

Yes, that's it, added by @compulim in microsoft/BotFramework-WebChat#937. @compulim, you mentioned "Web Speech API + JSRF" there, did you get that working in any browser at the time?

kdavis-mozilla commented on July 28, 2024

So, at least as implemented in Chromium

What does "implemented" mean here?

Is the grammar only retained or is it retained and used to affect STT results?

compulim commented on July 28, 2024

@foolip Agree with your observations.

Although Chromium says it supports JSGF, when I send JSGF with weighted phrases, I don't feel the difference. It feels like the JSGF is simply ignored.

But my observation is very subjective because I am using my voice to test the engine. Without looking at the source code, it's very hard to say whether JSGF is working or ignored.

foolip commented on July 28, 2024

It's probably easier to discuss a concrete example than the general idea of adding attributes, but I don't have a concrete example at this time.

kdavis-mozilla commented on July 28, 2024

I can come up with many concrete examples, all of them bad 😊 Off the top of my head here are two...

Beam Width
For example, when using a language model one usually introduces a beam search in which the beam elements are ordered by language model and acoustic model scoring. The width of this beam, the "beam width", could be one of these "attributes".

However, the beam width is usually tuned such that the beam is as large as possible for the given resource budget. Allowing the users to set the beam width above this value will increase the quality but invalidate the financial calculations that went into deployment of the STT engine. Allowing users to decrease this beam width will decrease the quality of the STT results and frustrate users.
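
To make the trade-off concrete, here is a minimal sketch of the pruning step only, not of any real engine. pruneBeam is a hypothetical name, and the hypotheses and scores are invented for illustration.

```javascript
// Keeps the `beamWidth` best hypotheses by score (higher is better).
// A wider beam keeps more candidates alive at each step, which costs more
// compute; a narrower beam may discard the correct transcript early.
function pruneBeam(hypotheses, beamWidth) {
  return [...hypotheses].sort((a, b) => b.score - a.score).slice(0, beamWidth);
}

// Invented log-probability-style scores, for illustration only.
const hypotheses = [
  { text: "pull up a chair", score: -7.9 },
  { text: "palak paneer", score: -4.1 },
  { text: "pal lack pioneer", score: -9.2 },
  { text: "pollock paneer", score: -4.3 },
];

console.log(pruneBeam(hypotheses, 2).map((h) => h.text));
```

Everything outside the beam is dropped at each step, which is why the width ties quality directly to the resource budget.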

Language Model vs Acoustic Model Weighting
When one creates an STT engine it generally has two components, a language model and an acoustic model. Both models work together to assign probabilities to proposed transcripts. The language model suggests its probability for a transcript and the acoustic model suggests its probability for the same transcript. The final probability is given by assigning a weight to the language model's probabilities and a weight to the acoustic model's probabilities. These weights are laboriously tuned to optimize performance for particular use cases.

These weights could be some of these "attributes". However, giving users access to these weights will allow them to tune the engine away from its optimal configuration for the in-browser use case, decreasing quality and frustrating users.
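
As a sketch of why those weights are sensitive, assume a weighted sum of log-probabilities. That is one common way to combine the two models, but an assumption here, not a description of any particular engine. The function names, transcripts and numbers are all invented for illustration.

```javascript
// Assumed combination rule: weighted sum of the two models' log-probabilities.
function combinedScore(hypothesis, weights) {
  return (
    weights.acoustic * hypothesis.acousticLogProb +
    weights.language * hypothesis.languageLogProb
  );
}

// Returns the hypothesis with the highest combined score.
function bestHypothesis(hypotheses, weights) {
  return hypotheses.reduce((best, candidate) =>
    combinedScore(candidate, weights) > combinedScore(best, weights) ? candidate : best
  );
}

// Invented numbers: the first transcript is favored by the language model,
// the second sounds slightly closer to the audio.
const candidates = [
  { text: "palak paneer", acousticLogProb: -10, languageLogProb: -8 },
  { text: "pollock pioneer", acousticLogProb: -9, languageLogProb: -12 },
];

const tuned = bestHypothesis(candidates, { acoustic: 1, language: 1 });
const detuned = bestHypothesis(candidates, { acoustic: 1, language: 0.1 });
console.log(tuned.text, "vs", detuned.text);
```

With equal weights the language model's preference wins. Once the language weight is lowered to 0.1, the acoustically closer but less plausible transcript comes out on top.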

If you want more examples I can provide more.

foolip commented on July 28, 2024

No, no, I wasn't looking for examples, of course there are more bad ideas than good ideas, and bad ones are easy to list. I'll open new issues for any actual proposals if they show up.

guest271314 commented on July 28, 2024

@compulim

@foolip Agree with your observations.

Although Chromium says it supports JSGF, when I send JSGF with weighted phrases, I don't feel the difference. It feels like the JSGF is simply ignored.

But my observation is very subjective because I am using my voice to test the engine. Without looking at the source code, it's very hard to say whether JSGF is working or ignored.

The Chrome/Chromium implementation of SpeechRecognition is essentially a black box.

What is known is that in Chrome/Chromium no permission is requested and no notification is provided that the user's PII biometric data (the user's voice) is being recorded and sent to an undisclosed third-party web service. It is unclear if the user's voice is stored forever, and further used for research and development of proprietary technologies #56. It is not documented exactly how the third-party web service performs STT.

Additionally, so-called "curse" words should not be censored in the result.

Until the glaring issue in Chrome/Chromium of users not being notified and not being asked permission for their voice to be recorded and sent to an undisclosed web service is resolved, there is no way to practically test or implement grammars.

What can be done now is to 1) review how https://github.com/cmusphinx/pocketsphinx handles grammars https://github.com/cmusphinx/pocketsphinx/search?q=grammar&unscoped_q=grammar; 2) start from scratch converting voice to IPA (https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) to words in a given language, essentially the reverse of
https://github.com/itinerarium/phoneme-synthesis; see also

szewai commented on July 28, 2024

Hello @foolip, is there any update on the usage report for the grammars API?

foolip commented on July 28, 2024

Hi @szewai!

Here's the data from the use counters we have in Chrome, including the ones added in #57 (comment) and some more:

The new webkitSpeechRecognition() and new webkitSpeechGrammarList() usage is much higher than I would have guessed, but SpeechRecognition.prototype.start() gives a much better idea of the real usage. The addFromString() and addFromURI() usage in particular should be understood in relation to that, and a reasonable interpretation is that addFromURI() is often used when there's real usage of the API happening. (It could be that usage of start() and addFromURI() is mutually exclusive, but I see no reason to suspect it.)

However, as stated in #57 (comment), these "grammars" are effectively engine-specific options. If we find that usage in the wild depends on this in some important way, I think we should first try to define what the effect of certain invocations of addFromURI() should be, or if that turns out impractical to implement for other engines, define an alternative way to communicate those settings, and try to migrate the usage in the wild to that standardized mechanism.
