kathyreid / opensource-voice-tools

A repo listing known open source voice tools, ordered by where they sit in the voice stack

License: GNU General Public License v3.0

stt asr tts voice speech speech-recognition corpus conversational-ui chatbot

A listing of open source voice tools

Introduction

Voice technology is taking off in a big way. For organisations, businesses and individuals trying to make sense of voice and where it sits in their technical architectures, it can be really confusing to understand the open source offerings that are out there.

This repo is a listing of known open source voice tools, structured by where those tools sit in the voice stack.

Transcription

Wake words

Speech to text

| Website | Tool name | License | Description |
|---|---|---|---|
| openslr.org | Open Speech Language Resources | N/A | Run by @danpovey, who is also a key maintainer of the Kaldi-ASR speech to text tool. |
| kaldi-asr.org | Kaldi Automatic Speech Recognition toolkit | Apache 2 | One of the first open source speech recognition toolkits. Academic reference: Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., ... & Silovsky, J. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society. |

Intent parsing

Intent resolution

Text to speech

| Website | Tool name | License | Description |
|---|---|---|---|
| Flowtron by Nvidia | Flowtron | Apache2 | A Tacotron-based speech synthesis tool which can be tweaked for pitch and prosody, setting it apart from other Tacotron-based TTS implementations. First released at the GTC 2020 Conference in May 2020. Academic reference: Valle, R., Shih, K., Prenger, R., & Catanzaro, B. (2020). Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis. arXiv preprint arXiv:2005.05957. |

^ This is a great article that explains the differences in the evolutions or generations of text to speech - from concatenative to statistical parametric to generative. More modern TTS approaches like Tacotron and WaveNet are generative approaches.

Chatbots and Conversational UI tools

| Website | Tool name | License | Description |
|---|---|---|---|
| Mindmeld by Cisco | MindMeld | Apache2 | The MindMeld Conversational AI platform is among the most advanced AI platforms for building production-quality conversational applications. It is a Python-based machine learning framework which encompasses all of the algorithms and utilities required for this purpose. Evolved over several years of building and deploying dozens of the most advanced conversational experiences achievable, MindMeld is optimized for building advanced conversational assistants which demonstrate deep understanding of a particular use case or domain while providing highly useful and versatile conversational experiences. Academic reference: Raghuvanshi, A., Carroll, L. and Raghunathan, K., 2018, November. Developing Production-Level Conversational Interfaces with Shallow Semantic Parsing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 157-162). |

Voice assistant wrappers

  • Mycroft.AI - an open source, layered voice assistant that runs on a range of Linux-compatible hardware, including x86 machines and ARM devices such as the Raspberry Pi. Supported by a strong community of open source developers.

  • OVAL / Genie project at Stanford - Funded by the Alfred P Sloan Foundation and by a NIST grant, Stanford's OVAL project aims to provide an open source alternative to commercial voice assistants. The project is currently in its infancy and is attempting to build an open source community.

Natural language processing (NLP)

  • Python Natural Language Toolkit NLTK - NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

  • ECCO - ECCO is a Python library that provides explainability for NLP using interactive visualisations.

  • DeText (source code) - DeText is a Deep Text understanding framework for NLP-related ranking, classification, and language generation tasks. It leverages semantic matching using deep neural networks to understand member intents in search and recommender systems. As a general NLP framework, DeText can currently be applied to many tasks, including search and recommendation ranking, multi-class classification and query understanding. Published by the AI team at LinkedIn.

  • pglex - First presented at the ICLDC 7 conference in 2021, pglex is a 'pretty good' lexical service designed to facilitate the construction of dictionary websites and other applications that incorporate lexical data. With pglex, researchers can provide lexical entries in JSON format to an instance of the pglex API and get 'pretty good' search results without requiring language-specific configurations. Built on ElasticSearch.

Bias in voice assistants and NLP

  • Artie Bias Corpus - A corpus and set of tools for detecting demographic bias in ASR systems.

  • Blodgett, S. L., Barocas, S., Daumé III, H., & Wallach, H. (2020). Language (Technology) is Power: A Critical Survey of "Bias" in NLP. arXiv preprint arXiv:2005.14050. https://arxiv.org/pdf/2005.14050.pdf

Speaker recognition

Forced aligners

Forced aligners align audio recordings with their orthographic transcriptions, producing timestamps for each word or text fragment.

  • aeneas (Docs) - a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment).
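To give a feel for what alignment involves, here is a minimal sketch of dynamic time warping (DTW), the family of algorithm that aeneas's documentation describes using to match real audio against synthesized speech. It is not aeneas's API: the `dtw` function and the toy one-dimensional "feature" sequences are assumptions made for illustration, whereas a real aligner works on acoustic features such as MFCCs.

```python
def dtw(a, b):
    """Align two numeric sequences; return (total_cost, path).

    path is a list of (index_in_a, index_in_b) pairs.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    # Walk back from the end to recover the alignment path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Toy example: the "recording" is a slower version of the "reference".
reference = [0, 1, 2, 3]
recording = [0, 0, 1, 2, 2, 3]
total_cost, path = dtw(reference, recording)
```

Each pair in `path` says which frame of the reference corresponds to which frame of the recording. Once reference frames are tied to words, those pairs can be turned into word timestamps.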

Voice and language corpora

  • Berlin Database of Emotional Speech - A German-language (Deutsch) speech corpus tagged with emotions.
  • The Pile - The Pile is an 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.

Data cleaning and repair tools

  • ActiveClean - ActiveClean is an iterative cleaning framework that can correctly retrain the machine learning model when data is cleaned, and provides a set of optimizations to select the best data to be cleaned. In this way, you only need to clean a small subset of the data in order to produce a model similar to one trained on the fully cleaned dataset. Written in Python.

  • DataLinter - The Data Linter identifies potential issues (lints) in your ML training data.

  • Holoclean - A machine learning system for data enrichment.

There's also BoostClean from Columbia University, but I can't find a code reference anywhere on the web.

Machine translation

  • No Language Left Behind - Released by Meta, the NLLB project aims to make low-resource languages more accessible by providing a machine translation model which can translate between 200 languages. The model is evaluated using a human-translated benchmark, FLORES-200, and performs 44% better than previous state-of-the-art scores as measured by BLEU.
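For readers unfamiliar with BLEU, the sketch below shows roughly what the metric measures: n-gram overlap between a machine translation and a human reference, with a penalty for translations that are too short. It is a simplified single-sentence, single-reference version written for illustration, not the scorer used by the NLLB project. The function names and the example scores passed to `relative_improvement` are made up to show how a "44% better" figure is derived.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU with one reference and no smoothing."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clip each n-gram's count at its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


def relative_improvement(new_score, old_score):
    """Relative gain of new_score over old_score, e.g. 0.44 for 44%."""
    return (new_score - old_score) / old_score


exact = bleu("the cat sat on the mat", "the cat sat on the mat")
partial = bleu("the cat sat on the mat", "the cat sat on a mat")
# Hypothetical scores, not NLLB results: 25 -> 36 BLEU is a 44% relative gain.
gain = relative_improvement(36.0, 25.0)
```

Published BLEU figures are normally computed over a whole test corpus with a standard tool, so scores from a toy function like this one are not comparable with them.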

Papers listings

Glossary

There are a lot of terms and acronyms in open source voice technology. This section provides explanations for each of them.

  • Cognitive arbitration: The process a voice assistant uses to understand what services and skills are available to it, depending on its context - such as being online or offline.

  • CRF: Conditional random field. A statistical modelling method which can take into account context. Used in some neural-network based intent-parsing and semantic extraction software.

  • LSTM: Long short-term memory. A type of unit used within recurrent neural networks to help process sequences of data, such as audio or speech. In order to predict what is likely to come next, an LSTM retains information about what came previously.

  • LVCSR: Large vocabulary continuous speech recognition. Used in speech recognition tools to denote that a) the vocabulary on which the recognizer works has not been restricted or constrained, as it might be when deployed on embedded or low-powered hardware which cannot handle the memory or compute requirements of a large vocabulary, and b) the recognizer works continuously, in contrast to a wake word or keyword spotter, which cedes control to the STT once a wake word is detected.
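As an illustration of the LSTM entry above, the following sketch implements a single scalar LSTM cell in plain Python using the standard gate equations. The weights in `WEIGHTS` are hand-picked assumptions chosen so the cell visibly "remembers" an earlier input. A real network learns its weights and uses vectors and matrices rather than single numbers.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a scalar LSTM cell.

    w maps a gate name to (input_weight, hidden_weight, bias).
    """
    def gate(name, activation):
        w_x, w_h, b = w[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much old memory to keep
    i = gate("input", sigmoid)         # how much new information to write
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much memory to expose
    c = f * c_prev + i * g             # updated cell state (the memory)
    h = o * math.tanh(c)               # updated hidden state (the output)
    return h, c


def run_lstm(xs, w):
    """Feed a sequence through the cell, starting from empty state."""
    h, c = 0.0, 0.0
    for x in xs:
        h, c = lstm_step(x, h, c, w)
    return h, c


# Hand-picked (hypothetical) weights: keep most of the memory at each step
# and write strongly only when the input is non-zero.
WEIGHTS = {
    "forget": (0.0, 0.0, 4.0),
    "input": (4.0, 0.0, -2.0),
    "candidate": (2.0, 0.0, 0.0),
    "output": (0.0, 0.0, 4.0),
}

h_seen, c_seen = run_lstm([1.0, 0.0, 0.0, 0.0], WEIGHTS)
h_blank, c_blank = run_lstm([0.0, 0.0, 0.0, 0.0], WEIGHTS)
```

After three silent steps, `c_seen` is still well above zero while `c_blank` stays at zero: the cell state carries a trace of the earlier input forward, which is the "records what came previously" behaviour described in the glossary entry.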
