
Trained models for automatic speech recognition (ASR). A library to quickly build applications that require speech to text conversion.

Home Page: https://at16k.com

License: MIT License

asr asr-model automatic-speech-recognition pretrained-models speech-analysis speech-api speech-recognition speech-recognizer speech-to-text voice-commands voice-recognition


at16k

Pronounced as at sixteen k.

What is at16k?

at16k is a Python library for automatic speech recognition (ASR), i.e. speech-to-text conversion. The goal of this project is to provide the community with a production-quality speech-to-text library.

Installation

It is recommended that you install at16k in a virtual environment.

Prerequisites

  • Python >= 3.6
  • TensorFlow == 1.14
  • Scipy (for reading wav files)

Install via pip

$ pip install at16k

Install from source

Requires: poetry

$ git clone https://github.com/at16k/at16k.git
$ poetry env use python3.6
$ poetry install

Download models

Currently, three models are available for speech to text conversion.

  • en_8k (Trained on English audio recorded at 8 KHz, supports offline ASR)
  • en_16k (Trained on English audio recorded at 16 KHz, supports offline ASR)
  • en_16k_rnnt (Trained on English audio recorded at 16 KHz, supports real-time ASR)

To download all the models:

$ python -m at16k.download all

Alternatively, you can download only the model you need. For example:

$ python -m at16k.download en_8k
$ python -m at16k.download en_16k
$ python -m at16k.download en_16k_rnnt

By default, the models will be downloaded and stored at <HOME_DIR>/.at16k. To override the default, set the environment variable AT16K_RESOURCES_DIR. For example:

$ export AT16K_RESOURCES_DIR=/path/to/my/directory

You will need to set this environment variable whenever you use at16k, whether through the command line, the library, or the REST API.

Preprocessing audio files

at16k accepts wav files with the following specs:

  • Channels: 1
  • Bits per sample: 16
  • Sample rate: 8000 (en_8k) or 16000 (en_16k)
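Before converting, you can verify a file against these specs with Python's standard-library `wave` module. A minimal sketch (not part of at16k):

```python
import wave

def check_wav_specs(path, expected_rate):
    """Return True if the file matches at16k's expected wav specs."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()      # must be 1 (mono)
        sample_width = wav.getsampwidth()  # bytes per sample; 2 == 16-bit
        rate = wav.getframerate()          # 8000 (en_8k) or 16000 (en_16k)
    return channels == 1 and sample_width == 2 and rate == expected_rate
```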

Use ffmpeg to convert your audio/video files to an acceptable format. For example,

# For 8 KHz
$ ffmpeg -i <input_file> -ar 8000 -ac 1 -acodec pcm_s16le <output_file>

# For 16 KHz
$ ffmpeg -i <input_file> -ar 16000 -ac 1 -acodec pcm_s16le <output_file>

Usage

at16k supports two modes of ASR, offline and real-time, and comes with a handy command-line utility to quickly try out the different models and use cases.

Here are a few examples:

# Offline ASR, 8 KHz sampling rate
$ at16k-convert -i <path_to_wav_file> -m en_8k

# Offline ASR, 16 KHz sampling rate
$ at16k-convert -i <path_to_wav_file> -m en_16k

# Real-time ASR, 16 KHz sampling rate, from a file, beam decoding
$ at16k-convert -i <path_to_wav_file> -m en_16k_rnnt -d beam

# Real-time ASR, 16 KHz sampling rate, from mic input, greedy decoding (requires pyaudio)
$ at16k-convert -m en_16k_rnnt -d greedy

If the at16k-convert binary is not available for some reason, replace it with:

python -m at16k.bin.speech_to_text ...

Library API

Check this file for examples on how to use at16k as a library.

Limitations

The duration of your audio file should be less than 30 seconds when using en_8k, and less than 15 seconds when using en_16k. No error is thrown if the duration exceeds these limits; however, your transcript may contain errors and missing text.
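A common workaround for longer recordings (not an at16k feature) is to split the audio into chunks under the limit and transcribe each chunk separately. A minimal sketch using only the standard-library `wave` module:

```python
import wave

def split_wav(path, chunk_seconds, prefix):
    """Split a wav file into consecutive chunks of at most chunk_seconds
    each; returns the list of chunk file paths."""
    paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(params.framerate * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # copy channels/width/rate; nframes is patched on close
                dst.setparams(params)
                dst.writeframes(frames)
            paths.append(out_path)
            index += 1
    return paths
```

Each resulting chunk can then be passed to at16k-convert (or the library API) on its own.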

License

This software is distributed under the MIT license.

Acknowledgements

We would like to thank the Google TensorFlow Research Cloud (TFRC) program for providing access to Cloud TPUs.

at16k's People

Contributors

imbdb · mohitsshah


at16k's Issues

Max duration limitation

Hi Mohit,
Thanks for sharing your work.

I just wanted to check if there is any workaround or recommended way to avoid the duration limit.
I was thinking of breaking the audio into segments and then combining the results using overlapping transcripts.
Would you suggest a better approach than this?
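The segment-and-overlap idea above can be sketched generically. This word-level merge drops the duplicated words at the seam between two overlapping transcripts (a heuristic, not part of at16k):

```python
def merge_overlapping(left, right, max_overlap=10):
    """Join two transcripts whose audio segments overlapped, dropping
    the duplicated words at the seam (longest word-level match wins)."""
    lw, rw = left.split(), right.split()
    # try the longest possible overlap first, shrinking toward 1 word
    for size in range(min(max_overlap, len(lw), len(rw)), 0, -1):
        if lw[-size:] == rw[:size]:
            return " ".join(lw + rw[size:])
    # no overlap detected: plain concatenation
    return " ".join(lw + rw)
```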

GPU for Model Inference + Real time transcription

Hi Mohit,

Thanks for providing this pre-trained model.
I have a few questions on this:

  1. Is there a way to run this model on GPU for inference?
  2. Can we utilize these model weights to do real-time transcription using audio from the mic?

Thanks!

Time-to-audio mapping

Hi, I would like to know a way (or workaround) to get the following results:

If I pass an audio file to transcribe, is there any way of getting the results keyed by time, like:

{
  "0:00 to 0:01": "sample transcript",
  "0:01 to 0:02": "second second audio",
  "0:02 to 0:03": "third second text"
}

i.e., the mapping between the transcribed text and time?
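One generic way to build such a mapping (at16k itself does not expose timestamps) is to transcribe fixed-length chunks and key each transcript by its time window. In this sketch, `transcribe` is a hypothetical callable standing in for whatever at16k call you use on each chunk:

```python
def timed_transcripts(chunk_paths, chunk_seconds, transcribe):
    """Map "<start>s-<end>s" ranges to each chunk's transcript.

    chunk_paths   -- consecutive audio chunks of chunk_seconds each
    transcribe    -- hypothetical callable: chunk path -> transcript text
    """
    result = {}
    for i, path in enumerate(chunk_paths):
        start = i * chunk_seconds
        result[f"{start}s-{start + chunk_seconds}s"] = transcribe(path)
    return result
```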

Tried to convert wav audio file to transcribe

Hi,
I got this error while trying to convert a wav file to text.

File "c:\users\muhammad abdullah\anaconda3\lib\site-packages\at16k\api\live_speech_to_text.py", line 35, in from_file
samples = media.waveform.ravel().tolist()
File "c:\users\muhammad abdullah\anaconda3\lib\site-packages\at16k\core\media.py", line 38, in waveform
waveform = waveform.astype(self._dtype) / np.iinfo(waveform.dtype).max
File "c:\users\muhammad abdullah\anaconda3\lib\site-packages\numpy\core\getlimits.py", line 506, in init
raise ValueError("Invalid integer data type %r." % (self.kind,))
ValueError: Invalid integer data type 'f'.
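The ValueError above indicates the wav file holds float samples (type code 'f') rather than the 16-bit integer PCM that at16k expects. One fix is to rescale the float samples, which lie in [-1.0, 1.0], to the int16 range before writing the file back out (e.g. with scipy.io.wavfile.write). A small numpy sketch:

```python
import numpy as np

def float_to_int16(samples):
    """Convert float waveform samples in [-1.0, 1.0] to 16-bit PCM values."""
    clipped = np.clip(np.asarray(samples, dtype=np.float64), -1.0, 1.0)
    return (clipped * np.iinfo(np.int16).max).astype(np.int16)
```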

Access issue while downloading model files.

I am trying to download both model files, but GCP storage seems to be giving an access error. Can someone please look into this?

(at16k) D:\Research>python -m at16k.download en_16k
Downloading from https://storage.googleapis.com/at16k-ce/models\en_16k.tar.gz
Traceback (most recent call last):
File "C:\Anaconda3\envs\at16k\lib\runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "C:\Anaconda3\envs\at16k\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Anaconda3\envs\at16k\lib\site-packages\at16k\download.py", line 83, in
main()
File "C:\Anaconda3\envs\at16k\lib\site-packages\at16k\download.py", line 76, in main
download_model(remote_path, local_path)
File "C:\Anaconda3\envs\at16k\lib\site-packages\at16k\download.py", line 48, in download_model
urllib.request.urlretrieve(remote_path, local_path, show_progress)
File "C:\Anaconda3\envs\at16k\lib\urllib\request.py", line 248, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "C:\Anaconda3\envs\at16k\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "C:\Anaconda3\envs\at16k\lib\urllib\request.py", line 532, in open
response = meth(req, response)
File "C:\Anaconda3\envs\at16k\lib\urllib\request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Anaconda3\envs\at16k\lib\urllib\request.py", line 570, in error
return self._call_chain(*args)
File "C:\Anaconda3\envs\at16k\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "C:\Anaconda3\envs\at16k\lib\urllib\request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Original checkpoint and hparams

@mohitsshah, can you provide the training checkpoint and hparams of the tensor2tensor model?

I am working on accessing the beam search results (to see if adding language model scoring improves them). Although I might be able to access the beam results by finding the node before decoding, it would be much easier if the original checkpoint and hparams were available.

Custom Vocabulary

Hi Mohit,

I want to add custom vocabulary to transcribe some specific words. What would you suggest?
Do I need to fine-tune this model on new training data that includes these words, or does tensor2tensor offer some flexibility where we don't need to retrain the acoustic model (as in Kaldi) and updating only the language model would suffice?
Apologies if this question sounds silly; I am very new to the tensor2tensor library.

Thank you :)

training code

Any plans to release the training code or the details of the architecture?

Update the vocab of the at16k

Hi,
I find this repo and its speech-to-text really usable. I am trying to apply it to the medical domain, but it has issues with some words.
Could you tell me how to update the model's vocabulary?
