
nerd-dictation's People

Contributors

dfordivam, goatchurchprime, ideasman42, johngebbie, kj7lnw, nathanlovato, nicolas-graves, rpavlik, ryanrhall, tpoindex


nerd-dictation's Issues

[TODO] Wayland support

First, very cool application!
I use Wayland on KDE and found that ydotool (https://github.com/ReimuNotMoe/ydotool) works pretty well as a replacement for xdotool. To use ydotool I changed the command from xdotool to ydotool, removed --clearmodifiers, and changed the backspace command from ["backspace"] to ['14', '14:0'].
ydotool is a bit harder to use because of Wayland, but it works with both X and Wayland, which I consider a plus.
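
A rough sketch of the kind of substitution described above, assuming a small wrapper around subprocess (the key arguments are the ones quoted in this report, and exact ydotool key syntax varies between versions):

    import subprocess

    def simulate_typing_ydotool(text: str) -> None:
        # Unlike xdotool, ydotool's "type" takes no --clearmodifiers flag.
        subprocess.run(["ydotool", "type", text])

    def simulate_backspace_ydotool(count: int) -> None:
        # Key arguments as quoted in this report; 14 is the Linux input
        # event code for KEY_BACKSPACE.
        subprocess.run(["ydotool", "key"] + ["14", "14:0"] * count)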

Suggestion for docs: give "hint" pointing to daanzu large model as better for n-d than the "main" VOSK large model

The documentation says

Once this is working properly you may wish to download one of the larger language models for more accurate dictation. They are available here.

When one goes to that page, the intuitive thing to do is to download the top large model, currently "vosk-model-en-us-0.22"

However, that model on my i7 processor with nerd-dictation takes 5-10 seconds to start responding to speech input, plus it has the bug described in #31 issuing "the" intermittently if you leave it running. Together these really impair the functionality. My presumption is that the default use case for VOSK is to stay running and transcribe something longer than just a snippet of spoken text, where that loading time is not really an issue; but nerd-dictation users I think are going to tend to be turning it on and off, loading and unloading the model from memory.

So I tried the next model down, under "English Other", which is "vosk-model-en-us-daanzu-20200905". That model starts up within a second and seems to have extremely good accuracy. As a user experience with nerd-dictation, the difference from the "main" big VOSK model is night and day: unusable versus usable.

So, to spare new users that experimentation and frustration with the "main" large model, please consider having that hint point to the daanzu model as one that likely works better in the context of nerd-dictation, at least for now.

'huh' outputted after exiting

Hello, thanks for creating this project! Very cool.

I've noticed that "huh" is output after I stop nerd-dictation, possibly twice. I'm using the small English model from the install instructions.


Mostly thanks!

I am aware this is not the right place for that, but wanted to reach out just to say thanks for creating this project!
As far as I know, only Windows 11 ships a voice dictation tool that is universally available to write anywhere.

As a hobby, I do some writing and this is exactly what I was looking for.

Just as an anecdote, I am using the following changes to support some basic keyboard commands:

    text = text.replace(" new line", "\n")
    text = text.replace(" dash", "-")
    text = text.replace(" slash", "/")
    text = text.replace(" period", ".")
    text = text.replace(" comma", ",")
    text = text.replace(" coma", ",") # This is probably because of my accent, serves as an example too
    text = text.replace(" calmer", ",") # This is probably because of my accent, serves as an example too
    text = text.replace(" colon", ":")
    text = text.replace(" question mark", "?")
    text = text.replace(" exclamation mark", "!")

I had to take some care with my microphone's input quality, but Pipewire + EasyEffects handled noise suppression and the like; after that, the standard English model works more than fine with this.
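
For anyone who wants to reuse these, a minimal sketch of how such replacements usually sit in the user configuration file, assuming the nerd_dictation_process hook described in the README:

    # ~/.config/nerd-dictation/nerd-dictation.py
    REPLACEMENTS = (
        (" new line", "\n"),
        (" dash", "-"),
        (" period", "."),
        (" comma", ","),
        (" question mark", "?"),
    )

    def nerd_dictation_process(text):
        # Called by nerd-dictation with the recognized text; whatever is
        # returned here is what gets typed.
        for spoken, replacement in REPLACEMENTS:
            text = text.replace(spoken, replacement)
        return text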

Best regards and take care!

pa_context_connect() failed: Connection refused

Hi, I'm trying to run nerd-dictation on Kubuntu 20.04.
I created a virtualenv, activated it, and installed vosk with pip3.

I'm running nerd-dictation as the root user and I get

./nerd-dictation begin --vosk-model-dir=./model &
pa_context_connect() failed: Connection refused

(the process still runs on background).
What is causing this error?
Am I missing something?

If I try to run it as a normal user, I get a permission error:

  File "./nerd-dictation", line 1188, in <module>
    main()
  File "./nerd-dictation", line 1184, in main
    args.func(args)
  File "./nerd-dictation", line 1107, in <lambda>
    func=lambda args: main_begin(
  File "./nerd-dictation", line 747, in main_begin
    touch(path_to_cookie)
  File "./nerd-dictation", line 65, in touch
    os.utime(filepath, None)
PermissionError: [Errno 13] Permission denied

I tried changing the ownership of the main folder and the model/ folder so they belong to my current user, but I still get the error.
I notice the error mentions a "path_to_cookie", but I have no idea which path that could be.

Influence of keyboard layout (xdotool limitation)

Hello,
Thanks for this tool which improves the usage of vosk.
I have done some tests in French. What surprised me is that the dictation before the first output seems not to take the keyboard layout into account.
For example I got:
Ceci est un essqi
instead of
Ceci est un essai
which is what I said.
Subsequent sentences are transcribed correctly.
I haven't explored the code yet.

xdotool: freezes the OS

When I run the program by assigning the command "nerd-dictation begin --timeout 1 --numbers-as-digits --numbers-use-separator"
to a custom keyboard shortcut on Ubuntu 20.04, it seems to freeze every single time. Any fixes for this?
It behaves like a memory leak and completely locks up the OS.

special characters and xdotool

xdotool makes the program crash when special characters are introduced. A workaround for now is to use unidecode to strip the special characters. I use Spanish, so I will need to proofread the output more thoroughly, but this is the workaround; a better solution would be a tool that supports UTF-8 characters and other languages. I already tried the workaround with accented words ('á') and the special character 'ñ': the accent is removed and the 'ñ' is replaced with 'n'.

Thanks for the tool. I was having trouble finding something like this and was thinking of writing exactly this; I need to produce a big document in a short time, and without this tool it would have been very difficult. If I can, I will afterwards look for a complete solution instead of a workaround.

pip install unidecode

from unidecode import unidecode

def run_xdotool(subcommand: str, payload: List[str]) -> None:
    cmd_base = "xdotool"
    # Transliterate accented/special characters so xdotool does not crash.
    payload = [unidecode(word) for word in payload]

New possible output method and tts doubts

So I want to respeak my live recorded speech.
That means: mic -> text -> sound. Or in other words: Speech to Text and then Text to Speech.

I achieve the microphone-to-text part thanks to nerd-dictation.
I want to implement the text-to-speech part with festival.

1 - I have sort of added a new output method to nerd-dictation. I call it file because it's meant to go into a file.
My current work can be found at https://github.com/ruckard/nerd-dictation/tree/speech_to_file_v2 . As you can see I have not added a new option for this mode because I'm not sure if it's worth it.

The current way that I run nerd-dictation is like this:
./nerd-dictation begin --vosk-model-dir=/home/playg/vosk-models/vosk-model-small-es-0.22 --full-sentence --punctuate-from-previous-timeout 1 --idle-time 0.5 --continuous --timeout 0.5 --output=STDOUT > /tmp/output_test_file.txt

Then I just tail -f /tmp/output_test_file.txt.

2 - The current changes ( ruckard@5acbd54 ) abuse the timeout option so that instead of exiting, the program processes the audio again and gives me another sentence. It also makes sure not to output new text if nothing else was said.

The idea is to read every line (after a \n is emitted) and play it back with festival.
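
To illustrate the read-and-respeak step, here is a sketch that follows the output file like tail -f and feeds each completed line to festival over stdin (the path matches the example above; festival --tts is assumed to be available):

    import subprocess
    import time

    OUTPUT_PATH = "/tmp/output_test_file.txt"  # the path used in the example above

    def respeak_lines():
        # Follow the file like `tail -f` and speak each new line with festival.
        with open(OUTPUT_PATH, "r") as handle:
            handle.seek(0, 2)  # jump to the end of the file
            while True:
                line = handle.readline()
                if not line:
                    time.sleep(0.2)
                    continue
                if line.strip():
                    # festival --tts reads the text to speak from stdin.
                    subprocess.run(["festival", "--tts"], input=line, text=True)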

3 - Anyways in the end I have three questions for you:

  • Do you want me to send a pull request with a new file output mode that works as described, so that it gets added to the upstream project?
  • Would you accept a pull request for a new feature that converts the text back to speech with festival (or espeak-ng or a similar tool)?
  • Do you know any other project that already does what I'm trying to do?

Thank you very much for your feedback.

Use pypy3 to speed up processing and reduce desktop lockup

This should use pypy3 to launch nerd-dictation instead of python3. I got pypy3 to work, but I did not set it up as an executable the way python3 is. It seemingly gives a performance increase, although I still get some lock-up while waiting for the text to process. Another alternative is numba, but I haven't looked into whether that is possible, nor have I profiled the code to see where it would be most appropriate.

Thank you. I wrote this issue using nerd-dictation, with minor editing.

Delay typing speed and accept partial word matches?

I would like to reduce the typing speed and limit the text-prediction window; it should look like 75-90 words per minute. A major issue is that I dictate a sentence and then it backspaces over a big portion of that sentence, replacing it with something else that isn't as accurate as the first pass. Please advise which file I should change. Thank you.

UPDATE:
Modifying xdotool parameters can slow down the backspaces (set to 50ms)...

def simulate_backspace_presses(count: int, simulate_input_tool: str) -> None:
    cmd = [simulate_input_tool.lower()]
    if simulate_input_tool == "XDOTOOL":
        cmd += ["key", "--delay", "50"] + ["BackSpace"] * count

... and typing speed (set to 100ms)

def simulate_typing(text: str, simulate_input_tool: str) -> None:
    cmd = [simulate_input_tool.lower()]
    if simulate_input_tool == "XDOTOOL":
        cmd += ["type", "--clearmodifiers", "--delay", "100", text]

I would still like to know how to prevent the program from backspacing over most of a completed sentence only to reprint a slightly less accurate replacement.

Add support for speech to commands

PR #17 is a Proof of Concept of how speech to commands could be supported. The idea is the following:

  1. Match the 1st word to a command name stored in a command dictionary (WORD_CMD_MAP) or to the command name reserved for dictation ("type"); retry if no match (resetting everything).
  2. Process depending on the command name:
    • if dictation command: process text as before
    • else: match the command arguments against the command tree dictionary (WORD_CMD_MAP) until a full command is identified; then, launch it and reset nerd-dictation

For this workflow, nerd-dictation should provide a reset function that can be called from within the configuration script (nerd-dictation.py). It would also be very useful if this reset function accepted a command name that could then be passed on to nerd_dictation_process as an optional argument. When this optional argument is given, the first step above could be skipped, e.g. one could enter dictation mode directly, just like before (and avoid having the first word be the dictation command name, which likely biases the statistical language prediction). Furthermore, with little modification this would also make it easy to pass freely dictated arguments to certain commands. Finally, the whole workflow enables continuous listening for commands without reloading the VOSK model, and commands are only emitted if reliably identified as the first word (which would make the following unnecessary: Add '--commands' command line argument to restrict input to a limited set of commands (#3)).
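
For concreteness, a sketch of the first-word dispatch described above, written as it might look in the user configuration (WORD_CMD_MAP and the commands in it are made-up placeholders; the reset/argument-passing behaviour proposed above does not exist yet):

    import subprocess

    # Placeholder command tree: command name -> program to launch.
    WORD_CMD_MAP = {
        "browser": ["firefox"],
        "terminal": ["xterm"],
    }

    def nerd_dictation_process(text):
        words = text.split()
        if not words:
            return ""
        first, rest = words[0], words[1:]
        if first == "type":
            # Dictation mode: type everything after the reserved command word.
            return " ".join(rest)
        if first in WORD_CMD_MAP:
            subprocess.Popen(WORD_CMD_MAP[first])
        # Commands (and unrecognized first words) produce no typed output.
        return ""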

Now, while this all seems very straightforward, there is one crucial issue to solve in order to enable efficient speech to commands: commands seem to be recognized very badly by the normal VOSK natural-language model(s), at least in my few tests. The model expects, e.g., "hi" and "hello" as the first word rather than any random word that we might want to use as a command name. As a result, a command name like "right" (right mouse click) is most of the time recognized as "hi" by the VOSK model. Consequently, I believe it will be necessary to use a different VOSK model for command recognition than for natural-language dictation. I don't know if the "Speaker identification model" (see: https://alphacephei.com/vosk/models) might be of any use; otherwise, one could create a very simple VOSK model based on the command tree dictionary (WORD_CMD_MAP). For technical details, it would certainly help to learn more about how the VOSK model for command recognition was built for this Android app:
https://realize.be/blog/offline-speech-text-trigger-custom-commands-android-kaldi-and-vosk
alphacep/vosk-api#41

While the creation of a simple VOSK model for command recognition is probably a bit of work, I believe that it would lead to an exceptional model (as it would contain only exactly what should be recognized).

High usage of CPU

When executing nerd-dictation with or without the --continuous option, the CPU load stays high, at about one full core.
When stopping the execution with a keyboard interrupt, the interruption always lands in the exit_fn function.
If I try to add:

    time.sleep(PROGRESSIVE_CONTINUOUS_SLEEP_WHEN_IDLE)

just before the return 0, the CPU usage is low, with no big impact on accuracy.
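
For context, this is the standard idle-sleep pattern for a polling loop; a standalone illustration (not nerd-dictation's actual code, and the constant's value is a guess):

    import time

    PROGRESSIVE_CONTINUOUS_SLEEP_WHEN_IDLE = 0.1  # seconds (guessed value)

    def poll_step(have_new_audio):
        if not have_new_audio:
            # Yield the CPU instead of busy-looping when there is nothing to do.
            time.sleep(PROGRESSIVE_CONTINUOUS_SLEEP_WHEN_IDLE)
        return 0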

Question: delete space in front of "." or ","

Hi. I have a question that I cannot find the answer to, since I am a beginner in Python. I use a German model for nerd-dictation. Full sentences do not work here, so I have to dictate "Komma" or "Punkt" for comma and full stop. I have configured the user configuration file so that those two words are replaced with "," and ".". Now I need to delete the space in front of those characters. How can I achieve this via the configuration file? Your help is much appreciated.
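
One common way to handle this in the configuration file is to include the leading space in the spoken phrase being replaced; a sketch, assuming the nerd_dictation_process hook and lowercase model output (the casing may differ with your model):

    def nerd_dictation_process(text):
        # Replacing " komma" (with its leading space) yields "wort," instead of "wort ,".
        text = text.replace(" komma", ",").replace(" punkt", ".")
        return text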

By the way, I solved the "proper noun" thing for myself by integrating spaCy, letting it check for nouns, and then capitalizing the word. It is a bit slow, but it works.

Thanks for this really great tool!

How to capitalize every sentence

--full-sentence --punctuate-from-previous-timeout 2

This only capitalizes the first sentence, but not subsequent ones; please advise.
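
One workaround, independent of the built-in options, is to capitalize after sentence-ending punctuation in the user configuration; a sketch assuming the nerd_dictation_process hook (it only sees the current chunk of text, so it cannot know whether the previous chunk ended a sentence):

    import re

    def nerd_dictation_process(text):
        # Capitalize the first letter, and any letter following ".", "!" or "?".
        text = text[:1].upper() + text[1:]
        return re.sub(r"([.!?]\s+)(\w)", lambda m: m.group(1) + m.group(2).upper(), text)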

Packaging

Hello,
I have the idea of packaging nerd-dictation for PyPI.org. I tested adding a setup.py and a setup.cfg file.
For that I tried treating the nerd-dictation file as a module, adding a console-script entry.
At this step I hit the problem that the name nerd-dictation is not allowed because of the dash: the name produces a syntax error with `import nerd-dictation`.
Could the name be changed to nerd_dictation instead of nerd-dictation?
I haven't yet explored the alternative of installing the nerd-dictation script directly instead of using a module/console script.
What do you think about that?
The background idea is to distribute it for easy installation with pip install, and also so that elograf can require it as a dependency.

How to capitalize the proper names ?

Hi,

First, I'd like to congratulate Campbell Barton. Thank you very much for this wonderful script!

Melbourne, Berlin, John, etc. are recognized with a lowercase first letter. If possible, could someone write a script to add to nerd-dictation.py?
Unfortunately, I can't do it myself!
Thank you.
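
Until something smarter exists (the spaCy approach mentioned in another issue on this page, for instance), a very simple sketch for the user configuration is to keep a list of names you use often (assuming the nerd_dictation_process hook):

    PROPER_NOUNS = {"melbourne": "Melbourne", "berlin": "Berlin", "john": "John"}

    def nerd_dictation_process(text):
        # Capitalize known proper nouns word by word.
        return " ".join(PROPER_NOUNS.get(word, word) for word in text.split(" "))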

looong processing time

Hello,

After starting recognition, any short sentence takes ages to produce output
(by ages, I mean 30 to 90 seconds; in the meantime, the computer is frozen).
After getting the first output the terminal starts responding again (although not fluidly) and I'm able to stop it.

I am running Ubuntu 20.04.1 on an i7 .3GHz 8-core CPU with 15.4 GB RAM.

Any suggestion is highly appreciated.

Improve false positives: "huh", "the"

Hi, I love ND!

I'm finally trying to get it integrated with the CLI with Bash aliases, but I'm getting a lot of false positives like "huh".

What is the best way to decrease the sensitivity to these?

punctuation help

First of all, thank you for this very useful project.
I need to incorporate punctuation into dictation (in French).
Replacing single and multiple words works very well, but I have run into a difficulty: I need to be able to output a period and then go to a new line. I am a beginner in Python...

The best I've managed to do looks like this:

    if text == "point à la ligne":
        text = "."
        run_xdotool("key", ["Return"])

And, by the way, I want to make sure that there is no space before the period.
I searched for a while but I am stuck here; thanks in advance.
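
A sketch of one way to do both at once in the user configuration, assuming the nerd_dictation_process hook: include the leading space in the spoken phrase so no space precedes the period, and append "\n", which the typing tool sends as Return (the same trick as the " new line" replacement shown in another issue on this page):

    def nerd_dictation_process(text):
        # "point à la ligne" -> "." followed by a newline, with no space before the period.
        text = text.replace(" point à la ligne", ".\n")
        text = text.replace("point à la ligne", ".\n")
        return text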

No keystrokes appear in LibreOffice Writer

With some recent upgrade of either Ubuntu or LibreOffice, I have noticed that I cannot use nerd-dictation in LibreOffice Writer: no text appears. nerd-dictation works fine with Chrome or Thunderbird windows. It did not used to be this way. I upgraded from Ubuntu 18 to 21.10 recently, so perhaps something changed in that period; maybe there is some security policy that prevents simulated keystrokes? Just a guess. LibreOffice is 7.2.3.2.

Consider editing repository settings to remove "Packages" section

"Packages No packages published" is displayed right now, fortunately this pointless section can be removed.

Edit repo page config to remove it (cog next to the description).

I am not making a PR as it is defined in proprietary github settings, not in a git repository - and I have no rights to modify repo settings.

Maybe also remove releases section



BTW, if there is an influx of new people, it may be a result of https://news.ycombinator.com/item?id=29972579

`--numbers-as-digits` not working with the French language

First of all thank you for this project, I haven't tried FOSS Speech-to-Text for a while, and I'm pleasantly surprised by the quality of the result of VOSK, and nerd-dictation makes it easily hackable, great!

I have a little issue though: --numbers-as-digits doesn't seem to work with the French model (the biggest one). When I try, I get this result:

% nerd-dictation begin --numbers-as-digits
deux mille vingt-deux un deux trois quatre cinq 6 sept huit neuf dix zéro

(curiously, 6 is output correctly as a number, but it's the only one). Is this feature supported for English only, or is it supposed to work with French too?

I've installed nerd-dictation on Arch Linux from AUR package nerd-dictation-git.

Thanks!

Add '--commands' command line argument to restrict input to a limited set of commands

In situations where only a limited set of commands is needed, it would be useful to pass this list in as an argument.

This has the advantage that dictation could end immediately once a unique command was matched.

It would also allow for fuzzy matching if exact matches could not be found.


Example:

COMMAND="$(nerd-dictation begin --commands=valid_commands.txt --timeout=1.0)"

ydotool alternative: dotool

Hello, I wrote a ydotool alternative called dotool and I think it could help.
It is designed to run without root permissions, and the daemon is optional.

I wrote it because ydotool and all the patching it required were stopping my program (https://numen.johngebbie.com), which depends on it, from getting packaged on distros.

All the best,

John

EDIT: I have got dotool packaged on Void Linux; hopefully it will be packaged elsewhere soon.

English text is out of order and includes extra characters

The characters are strangely out of order. Using the vosk-model-en-us-0.22-lgraph.zip model. Saying "This is a test of the emergency broadcast system" multiple times:

$ ./nerd-dictation begin --vosk-model-dir=./model --timeout=1.0
this i tstfesa  o the mycnegeer broactdas ysstem
$ ./nerd-dictation begin --vosk-model-dir=./model --timeout=1.0
tihs is  atesoft  theem ergencbortsy adca systme
$ ./nerd-dictation begin --vosk-model-dir=./model --timeout=1.0
this is a se oftt the mereg aorbnecscdyta system

The Vosk API test_microphone.py works correctly:

$ python3 test_microphone.py
LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=13 max-active=7000 lattice-beam=6
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:11:12:13:14:15
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:CompileLooped():nnet-compile-looped.cc:345) Spent 0.089 seconds in looped compilation.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from model/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:281) Loading HCL and G from model/graph/HCLr.fst model/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:302) Loading winfo model/graph/phones/word_boundary.int
################################################################################
Press Ctrl+C to stop the recording
################################################################################
{
  "partial" : ""
}
<SNIP DUPLICATES>
{
  "partial" : "this"
}
{
  "partial" : "this"
}
{
  "partial" : "this is"
}
{
  "partial" : "this is a"
}
{
  "partial" : "this is a test of"
}
{
  "partial" : "this is a test of"
}
{
  "partial" : "this is a test of the"
}
{
  "partial" : "this is a test of the emergency"
}
{
  "partial" : "this is a test of the emergency broadcast"
}
<SNIP DUPLICATES>
{
  "partial" : "this is a test of the emergency broadcast system"
}
<SNIP DUPLICATES>
{
  "text" : "this is a test of the emergency broadcast system"
}
{
  "partial" : ""
}
<SNIP DUPLICATES>
^C
Done

Is it possible to output ctrl, shift... key strokes?

Hi there,
First of all, congrats on this tool: it's lightweight, simple, customizable, and can be executed from Emacs; just perfect. I was just wondering whether it is possible to convert a spoken command into a command with modifier keys (typically C-c C-c...)?
Cheers,
Vian
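
nerd-dictation itself only simulates plain typing, but the user configuration can shell out to xdotool for key chords when it sees a trigger phrase; a sketch, assuming the nerd_dictation_process hook (the trigger phrase is made up, and "ctrl+c" is standard xdotool key syntax):

    import subprocess

    TRIGGER = "control see control see"  # whatever VOSK actually produces for "control c control c"

    def nerd_dictation_process(text):
        if text.strip() == TRIGGER:
            # Send C-c C-c via xdotool instead of typing anything.
            subprocess.run(["xdotool", "key", "ctrl+c", "ctrl+c"])
            return ""
        return text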

Error during installation on Arch Linux

I am on Arch Linux. After opening a terminal, I type pamac search vosk.

I execute pamac install python-vosk and then I get the error Failed to compile vosk-api

The same error happens if I execute pamac install nerd-dictation-git

Not a super big issue because I have been able to install the code successfully by following the tutorial. But I wonder why pamac is not working for installing these packages.

Initial caps

Is there an easy way to fix missing initial caps? Maybe a keystroke?

German speech input ends input with superfluous words

I have been using nerd-dictation for a while and it's fantastic - open-source, adaptable, hackable, Python, Linux-friendly :) Thanks!

But there's a strange thing happening: English input works without any problems, German input however nearly always prints nein (no) or einen (one, pronoun) at the end of a spoken information chunk. I have no idea why.

System settings:
Linux Mint Cinnamon 20.2 (Ubuntu 20.04.1)

I invoke both via keyboard shortcuts that call a bash script. Commands in the bash script (I skipped the venv-related paths, etc.):

# German
nerd-dictation begin --vosk-model-dir ~/opt/nerd-dictation/model-de --numbers-as-digits --timeout 5 --punctuate-from-previous-timeout 3

# English
nerd-dictation begin --vosk-model-dir ~/opt/nerd-dictation/model-en-us --numbers-as-digits --timeout 5 --punctuate-from-previous-timeout 3

Models are the full versions: German is vosk-model-de-0.21.zip, English is vosk-model-en-us-0.22.zip.

I suppose it might be related to:

  • xdotool or something like that
  • the applications in which I use nerd-dictation
  • ... something else?

I am at a loss how to debug it or discover the origin of the superfluous words. Any ideas, explanations, possibilities?

Feature Request: Phoneme Output

I'm curious to look at the phoneme data from the input speech. Would this be part of the speech-to-text pipeline, and if so, is there a part of the program I could look at to modify and provide this output?

punctuate-from-previous-timeout not punctuating

The current documentation makes it seem like using this command should result in fully punctuated sentences:

nerd-dictation begin --full-sentence  --continuous --punctuate-from-previous-timeout=2 --timeout=4 

But instead I'm getting something like this:
"Sentence oneSentence two"

Is using nerd-dictation to control software a solved problem?

I want to use nerd-dictation for processing my photos, basically:

  • show photo
  • wait for command (next previous delete promote)
  • if command is detected: show what was detected (or produce sound feedback?), execute action

I am not entirely sure what would be the best way to implement this. Has anyone done something like that already? It seems a relatively obvious use of actually-working voice-to-text.

(maybe using nerd-dictation is a mistake and I should be using vosk API directly?)
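
One way to stay on top of nerd-dictation rather than the raw VOSK API is to invoke it once per command with --output=STDOUT and a short timeout, mirroring the shell example in the '--commands' feature request elsewhere on this page; a sketch with placeholder photo actions:

    import subprocess

    COMMANDS = {"next", "previous", "delete", "promote"}

    def wait_for_command():
        # Blocks until the user stops speaking (timeout), then returns the text.
        result = subprocess.run(
            ["nerd-dictation", "begin", "--timeout=1.0", "--output=STDOUT"],
            capture_output=True, text=True,
        )
        word = result.stdout.strip().lower()
        return word if word in COMMANDS else None

    while True:
        command = wait_for_command()
        if command:
            print("detected:", command)  # placeholder: show the photo / run the action here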

Adding simple built-in support for a config and model per language

I would like to have support for easily having a separate script and model for languages other than English. I'd gladly contribute this if it's a feature you would like in nerd-dictation.

Currently, you have a config folder where you can place a model, and a Python file nerd-dictation will use by default.

The idea would be to add a new command-line option to choose a config subfolder, e.g. --config-subfolder=fr. In that case, nerd-dictation would try to find the model and the Python file in .config/nerd-dictation/fr/ instead of .config/nerd-dictation/.

This would allow users to have a configuration per language (or, I dunno, for different use cases) without adding complexity to the program.

Is this something you'd like to have in nerd-dictation? If so, is there anything you would like me to do to contribute it? Documentation, a specific CLI option name, an example configuration in another language...

Capitalize words arbitrarily with preceding command / escape word

This is a feature request. In Dragon, one can capitalize any word by saying "cap" before it, so saying "cap nerd cap dictation" outputs "Nerd Dictation". It would be nice to have this functionality in nerd-dictation.

It would also be nice if the command word was configurable. Although "cap" was short and intuitive, it really wasn't the best choice since it is a word one actually uses fairly often.
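
A sketch of how this could be done today in the user configuration with a regular expression, assuming the nerd_dictation_process hook (the escape word is hard-coded here, but making it configurable as requested would be straightforward):

    import re

    CAP_WORD = "cap"  # the escape word

    def nerd_dictation_process(text):
        # "cap nerd cap dictation" -> "Nerd Dictation"
        return re.sub(r"\b%s (\w)" % CAP_WORD, lambda m: m.group(1).upper(), text)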

Russian input lags entire interface

Russian input lags the entire interface, but some programs (Blender, for example) don't lag at all (Blender is also usually launched in fullscreen). English input works fine. Model: "vosk-model-small-ru-0.22".

Add '--config' command line argument for a custom configuration file

It would be good to support a --config=filepath argument so each command can specify a different configuration to use.

This would allow different use cases based on whoever launches the command, where 1 call could be used for dictation another call might be used for home-assistant actions (just as an example).

Lots of numbers being spit out

Thank you for writing this interesting project. It's running, but it's spitting out a lot of garbage along with the text.

❯ ./nerd-dictation begin
0.09997663497924805
0.09870014190673829
0.09955344200134278
0.09974346160888672
0.09971175193786622
0.0929502010345459
0.09946784973144532
0.09947595596313477
0.0925527572631836
0.09944138526916504
0.09245476722717286
0.09949836730957032
0.09236202239990235
0.09945592880249024
0.09939346313476563
0.0923090934753418
0.09901008605957032
THIS0.09907612800598145
IS0.039521551132202154
0.09932670593261719
0.09929046630859376
ANOTHER0.07741460800170899
0.09929213523864747
0.09936389923095704
TERRORIST0.015120840072631841
0.09926352500915528
ST0.09925565719604493
0.0896986484527588
0.09934697151184083
0.09947404861450196
0.09257588386535645
0.09938035011291504
0.09136066436767579
0.09934458732604981
0.06850967407226563
0.09943637847900391
0.09936747550964356
0.09154710769653321
0.09944114685058594
0.09195122718811036
0.09947142601013184

How do I suppress all these logits?

Launcher in systray

I have created for my own use a launcher that displays an icon in the systray; clicking it launches or stops nerd-dictation.
The working version uses PyQt, and the icon acts as a toggle button. The memory impact seems to be about 17 MB.
I have also tried pystray. It works, but with some caveats: on my LXQt the icon is not displayed, just a gear instead. It works by displaying a contextual menu, so it needs two actions; I did not find how to catch a single click on the icon. And on a recent LXQt it doesn't work at all because of a DBus error. The memory impact seems to be about 13 MB.
I didn't check the size of what is pulled in as requirements.
This is not yet ready for wide usage, because I have no installation system and paths are hardcoded.
But do you think this is something to add to nerd-dictation, or should it live somewhere else?
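
For reference, a condensed sketch of the PyQt variant described above (PyQt5; the icon file and the lack of model arguments are placeholders, and no error handling is shown):

    import subprocess
    import sys

    from PyQt5.QtGui import QIcon
    from PyQt5.QtWidgets import QApplication, QSystemTrayIcon

    class DictationTray:
        def __init__(self):
            self.active = False
            self.tray = QSystemTrayIcon(QIcon("microphone.png"))  # placeholder icon file
            self.tray.activated.connect(self.toggle)
            self.tray.show()

        def toggle(self, reason):
            # Any click toggles dictation on or off.
            if self.active:
                subprocess.run(["nerd-dictation", "end"])
            else:
                subprocess.Popen(["nerd-dictation", "begin"])
            self.active = not self.active

    if __name__ == "__main__":
        app = QApplication(sys.argv)
        tray = DictationTray()
        sys.exit(app.exec_())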

What is the correct format for --pulse-device-name?

First off - thank you. This is precisely what I have been looking for. Great work here!

I want to ensure that the program is using the right microphone - I want to make sure it uses the external one, not the one on my laptop. Running pactl list gives me a WHOLE slew of stuff, but I think this is the chunk I'm most interested in, since it lists my external microphone:

Card #2
	Name: alsa_card.usb-BLUE_MICROPHONE_Blue_Snowball_201603-00
	Driver: module-alsa-card.c
	Owner Module: 28
	Properties:
		alsa.card = "1"
		alsa.card_name = "Blue Snowball"
		alsa.long_card_name = "BLUE MICROPHONE Blue Snowball at usb-0000:00:14.0-3, full speed"
		alsa.driver_name = "snd_usb_audio"
		device.bus_path = "pci-0000:00:14.0-usb-0:3:1.0"
		sysfs.path = "/devices/pci0000:00/0000:00:14.0/usb1/1-3/1-3:1.0/sound/card1"
		udev.id = "usb-BLUE_MICROPHONE_Blue_Snowball_201603-00"
		device.bus = "usb"
		device.vendor.id = "0d8c"
		device.vendor.name = "C-Media Electronics, Inc."
		device.product.id = "0005"
		device.product.name = "Blue Snowball"
		device.serial = "BLUE_MICROPHONE_Blue_Snowball_201603"
		device.string = "1"
		device.description = "Blue Snowball"
		module-udev-detect.discovered = "1"
		device.icon_name = "audio-card-usb"
	Profiles:
		input:mono-fallback: Mono Input (sinks: 0, sources: 1, priority: 1, available: yes)
		input:multichannel-input: Multichannel Input (sinks: 0, sources: 1, priority: 1, available: yes)
		off: Off (sinks: 0, sources: 0, priority: 0, available: yes)
	Active Profile: input:mono-fallback
	Ports:
		analog-input-mic: Microphone (priority: 8700, latency offset: 0 usec)
			Properties:
				device.icon_name = "audio-input-microphone"
			Part of profile(s): input:mono-fallback
		multichannel-input: Multichannel Input (priority: 0, latency offset: 0 usec)
			Part of profile(s): input:multichannel-input

I have tried feeding the "Name" value (alsa_card.usb-BLUE_MICROPHONE_Blue_Snowball_201603-00), the udev.id, and the device.icon_name (longshot) into the CLI, each time getting the error Stream error: No such entity. If I don't include the --pulse-device-name, dictation works fine, but I want to ensure it's getting the best input possible.

Which of the values from the pactl list output should we use for that flag? Or is there another value further up in the stream, i.e. not under "Card #2", that I should be looking at?

Thanks!

Using listen on / off to begin / end dictation

Instead of using a keyboard shortcut, I would like to use two keywords to begin and end dictation. This means nerd-dictation would always listen to the microphone: when it recognizes "listen on" it starts typing, and when it recognizes "listen off" it stops typing (via xdotool).

How to implement that?
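
nerd-dictation has no such mode built in. One rough approach is a wrapper around --output=STDOUT that gates the typing on the two trigger phrases; the sketch below assumes each utterance arrives as its own line on stdout, which may need the kind of patch discussed in the "New possible output method" issue above:

    import subprocess

    def gate_dictation():
        typing_enabled = False
        proc = subprocess.Popen(
            ["nerd-dictation", "begin", "--continuous", "--output=STDOUT"],
            stdout=subprocess.PIPE, text=True,
        )
        for line in proc.stdout:
            text = line.strip()
            if text == "listen on":
                typing_enabled = True
            elif text == "listen off":
                typing_enabled = False
            elif typing_enabled and text:
                subprocess.run(["xdotool", "type", "--clearmodifiers", text + " "])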

Stream error: No such entity

Hi, when starting nerd-dictation, I get the following message:
Stream error: No such entity
When speaking, no text shows.

Request: add an icon to know when it is active

I know this might be beyond the simplicity of this program, but it would be awesome to have at least an icon in the taskbar. That way I could put the timeout in the shortcut and know when it has timed out. It would be a quality-of-life improvement, although it is not essential.

pause-option would be nice

So far there are "begin", "end" and "cancel", and it is wonderful. But I sometimes struggle to find words and start mumbling, and I do not want that to be transcribed. I just mute the mic now, but that results in fragments, which are highly annoying (see #26). Since this is due to vosk, a nice workaround would be a "pause" mode for the input that I could bind a key to.

Maybe it is even possible to change the vosk model during "pause" mode? Then one could switch languages.
