
Trained models for automatic speech recognition (ASR). A library to quickly build applications that require speech to text conversion.

Home Page: https://at16k.com

License: MIT License

asr asr-model automatic-speech-recognition pretrained-models speech-analysis speech-api speech-recognition speech-recognizer speech-to-text voice-commands voice-recognition


at16k

Pronounced as at sixteen k.

What is at16k?

at16k is a Python library for automatic speech recognition (ASR), i.e. speech-to-text conversion. The goal of this project is to provide the community with a production-quality speech-to-text library.

Installation

It is recommended that you install at16k in a virtual environment.

Prerequisites

  • Python >= 3.6
  • TensorFlow == 1.14
  • Scipy (for reading wav files)

Install via pip

$ pip install at16k

Install from source

Requires: poetry

$ git clone https://github.com/at16k/at16k.git
$ poetry env use python3.6
$ poetry install

Download models

Currently, three models are available for speech to text conversion.

  • en_8k (Trained on English audio recorded at 8 KHz, supports offline ASR)
  • en_16k (Trained on English audio recorded at 16 KHz, supports offline ASR)
  • en_16k_rnnt (Trained on English audio recorded at 16 KHz, supports real-time ASR)

To download all the models:

$ python -m at16k.download all

Alternatively, you can download only the model you need. For example:

$ python -m at16k.download en_8k
$ python -m at16k.download en_16k
$ python -m at16k.download en_16k_rnnt

By default, the models will be downloaded and stored at <HOME_DIR>/.at16k. To override the default, set the environment variable AT16K_RESOURCES_DIR. For example:

$ export AT16K_RESOURCES_DIR=/path/to/my/directory

You will need to set this environment variable whenever you use at16k, whether through the command line, the library, or the REST API.

Preprocessing audio files

at16k accepts wav files with the following specs:

  • Channels: 1
  • Bits per sample: 16
  • Sample rate: 8000 (en_8k) or 16000 (en_16k)
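Before converting, you can verify a file against these specs with Python's standard-library `wave` module. A minimal sketch (not part of at16k):

```python
import wave

def check_wav_specs(path, expected_rate):
    """Return True if the file matches at16k's expected wav specs."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()      # must be 1 (mono)
        sample_width = wav.getsampwidth()  # bytes per sample; 2 == 16-bit
        rate = wav.getframerate()          # 8000 (en_8k) or 16000 (en_16k)
    return channels == 1 and sample_width == 2 and rate == expected_rate
```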

Use ffmpeg to convert your audio/video files to an acceptable format. For example,

# For 8 KHz
$ ffmpeg -i <input_file> -ar 8000 -ac 1 -acodec pcm_s16le <output_file>

# For 16 KHz
$ ffmpeg -i <input_file> -ar 16000 -ac 1 -acodec pcm_s16le <output_file>

Usage

at16k supports two modes of ASR, offline and real-time, and comes with a handy command-line utility to quickly try out the different models and use cases.

Here are a few examples:

# Offline ASR, 8 KHz sampling rate
$ at16k-convert -i <path_to_wav_file> -m en_8k

# Offline ASR, 16 KHz sampling rate
$ at16k-convert -i <path_to_wav_file> -m en_16k

# Real-time ASR, 16 KHz sampling rate, from a file, beam decoding
$ at16k-convert -i <path_to_wav_file> -m en_16k_rnnt -d beam

# Real-time ASR, 16 KHz sampling rate, from mic input, greedy decoding (requires pyaudio)
$ at16k-convert -m en_16k_rnnt -d greedy

If the at16k-convert binary is not available for some reason, replace it with:

python -m at16k.bin.speech_to_text ...

Library API

Check this file for examples on how to use at16k as a library.

Limitations

The duration of your audio file should be less than 30 seconds when using en_8k, and less than 15 seconds when using en_16k. No error is thrown if the duration exceeds these limits; however, your transcript may contain errors and missing text.
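A common workaround for longer recordings (not an at16k feature) is to split the audio into chunks under the limit and transcribe each chunk separately. A minimal sketch using only the standard-library `wave` module:

```python
import wave

def split_wav(path, chunk_seconds, prefix):
    """Split a wav file into consecutive chunks of at most chunk_seconds
    each; returns the list of chunk file paths."""
    paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(params.framerate * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # copy channels/width/rate; nframes is patched on close
                dst.setparams(params)
                dst.writeframes(frames)
            paths.append(out_path)
            index += 1
    return paths
```

Each resulting chunk can then be passed to at16k-convert (or the library API) on its own.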

License

This software is distributed under the MIT license.

Acknowledgements

We would like to thank the Google TensorFlow Research Cloud (TFRC) program for providing access to Cloud TPUs.

at16k's People

Contributors

imbdb · mohitsshah


at16k's Issues

Max duration limitation

Hi Mohit,
Thanks for sharing your work.

I just wanted to check if there is any workaround or recommended way to avoid the duration limit.
I was thinking of breaking the audio into segments and then combining the results using overlapping transcripts.
Would you suggest a better approach than this?
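The segment-and-overlap idea above can be sketched generically. This word-level merge drops the duplicated words at the seam between two overlapping transcripts (a heuristic, not part of at16k):

```python
def merge_overlapping(left, right, max_overlap=10):
    """Join two transcripts whose audio segments overlapped, dropping
    the duplicated words at the seam (longest word-level match wins)."""
    lw, rw = left.split(), right.split()
    # try the longest possible overlap first, shrinking toward 1 word
    for size in range(min(max_overlap, len(lw), len(rw)), 0, -1):
        if lw[-size:] == rw[:size]:
            return " ".join(lw + rw[size:])
    # no overlap detected: plain concatenation
    return " ".join(lw + rw)
```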

GPU for Model Inference + Real time transcription

Hi Mohit,

Thanks for providing this pre-trained model.
I have a few questions on this:

  1. Is there a way to run this model on GPU for inference?
  2. Can we utilize these model weights to do real-time transcription using audio from the mic?

Thanks!

Time-to-audio mapping

Hi, I would like to know a way (or workaround) to get the following results:

If I pass an audio file to transcribe, is there any way of getting the results keyed by time, like:

{
  "0:00 to 0:01": "sample transcript",
  "0:01 to 0:02": "second second audio",
  "0:02 to 0:03": "third second text"
}

i.e., the mapping between the transcribed text and time?
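One generic way to build such a mapping (at16k itself does not expose timestamps) is to transcribe fixed-length chunks and key each transcript by its time window. In this sketch, `transcribe` is a hypothetical callable standing in for whatever at16k call you use on each chunk:

```python
def timed_transcripts(chunk_paths, chunk_seconds, transcribe):
    """Map "<start>s-<end>s" ranges to each chunk's transcript.

    chunk_paths   -- consecutive audio chunks of chunk_seconds each
    transcribe    -- hypothetical callable: chunk path -> transcript text
    """
    result = {}
    for i, path in enumerate(chunk_paths):
        start = i * chunk_seconds
        result[f"{start}s-{start + chunk_seconds}s"] = transcribe(path)
    return result
```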

Tried to convert wav audio file to transcribe

Hi,
I got this error while trying to convert a wav file to text.

File "c:\users\muhammad abdullah\anaconda3\lib\site-packages\at16k\api\live_speech_to_text.py", line 35, in from_file
samples = media.waveform.ravel().tolist()
File "c:\users\muhammad abdullah\anaconda3\lib\site-packages\at16k\core\media.py", line 38, in waveform
waveform = waveform.astype(self._dtype) / np.iinfo(waveform.dtype).max
File "c:\users\muhammad abdullah\anaconda3\lib\site-packages\numpy\core\getlimits.py", line 506, in init
raise ValueError("Invalid integer data type %r." % (self.kind,))
ValueError: Invalid integer data type 'f'.
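The ValueError above indicates the wav file holds float samples (type code 'f') rather than the 16-bit integer PCM that at16k expects. One fix is to rescale the float samples, which lie in [-1.0, 1.0], to the int16 range before writing the file back out (e.g. with scipy.io.wavfile.write). A small numpy sketch:

```python
import numpy as np

def float_to_int16(samples):
    """Convert float waveform samples in [-1.0, 1.0] to 16-bit PCM values."""
    clipped = np.clip(np.asarray(samples, dtype=np.float64), -1.0, 1.0)
    return (clipped * np.iinfo(np.int16).max).astype(np.int16)
```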

Access issue while downloading model files.

I am trying to download both model files, but GCP storage seems to be giving an access error. Can someone please look into this?

(at16k) D:\Research>python -m at16k.download en_16k
Downloading from https://storage.googleapis.com/at16k-ce/models\en_16k.tar.gz
Traceback (most recent call last):
File "C:\Anaconda3\envs\at16k\lib\runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "C:\Anaconda3\envs\at16k\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Anaconda3\envs\at16k\lib\site-packages\at16k\download.py", line 83, in
main()
File "C:\Anaconda3\envs\at16k\lib\site-packages\at16k\download.py", line 76, in main
download_model(remote_path, local_path)
File "C:\Anaconda3\envs\at16k\lib\site-packages\at16k\download.py", line 48, in download_model
urllib.request.urlretrieve(remote_path, local_path, show_progress)
File "C:\Anaconda3\envs\at16k\lib\urllib\request.py", line 248, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "C:\Anaconda3\envs\at16k\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "C:\Anaconda3\envs\at16k\lib\urllib\request.py", line 532, in open
response = meth(req, response)
File "C:\Anaconda3\envs\at16k\lib\urllib\request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Anaconda3\envs\at16k\lib\urllib\request.py", line 570, in error
return self._call_chain(*args)
File "C:\Anaconda3\envs\at16k\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "C:\Anaconda3\envs\at16k\lib\urllib\request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Original checkpoint and hparams

@mohitsshah, can you provide the training checkpoint and hparams of the tensor2tensor model?

I am working on accessing the beam search results (to see if adding language model scoring improves them). Although I might be able to access the beam results by finding the node before decoding, it would be much easier if the original checkpoint and hparams were available.

Custom Vocabulary

Hi Mohit,

I want to add custom vocabulary to transcribe some specific words. What would you suggest?
Do I need to fine-tune this model on new training data that includes these words, or does tensor2tensor offer some flexibility where we don't need to retrain the acoustic model (as in Kaldi) and updating only the language model would suffice?
Apologies if this question sounds silly; I am very new to the tensor2tensor library.

Thank you :)

training code

Any plans to release the training code or the details of the architecture?

Update the vocab of the at16k

Hi,
I find this repo and its speech-to-text really usable. I am trying to apply it to the medical domain, but it has issues with some words.
Could you tell me how to update the model's vocabulary?
