This repo contains a fully working web-based real-time transcription application, powered by Azure Speech to Text. You can deploy it to your Azure subscription and run it on your local PC in less than 20 minutes, then modify it for your specific needs.
It is part 1 of a series of repos on how to build real-time-transcription applications using Azure Speech to Text.
Once you deploy this application, you will have something like this:
[Video: webapprtt.mov]
This is the first in a series of posts demonstrating real-time Azure Speech to Text in increasingly complex scenarios. This scenario is the simplest of them all: real-time transcription of spoken words in English, using the standard Azure model.
You will be using:
- Azure Speech-to-Text
- React.js with JavaScript (with Bootstrap, to make things look good)
- Devcontainers - You can use this repo on your PC/Mac or in Codespaces on GitHub. This makes it easy for you to develop without having to install Node, NPM, or the React modules you will need
By following the steps under Installation you will be able to get started quickly.
The architecture is shown below:
This repository is used to build the application on a personal computer or on GitHub Codespaces. It uses Devcontainers.
You need to have the following on your personal computer (PC, Mac or Linux):
- Visual Studio Code - you can install it from here: https://code.visualstudio.com/Download
- Docker Desktop - you can install it from here: https://docs.docker.com/desktop/
- Git - you can install it from here: https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
NB: Alternatively, you can run this application entirely in a GitHub Codespace.
The first step is to enable the required Azure Cognitive Services. To do that:
- Click the button below:
Once you have the prerequisites met:
- Clone the repository
- Open it with Visual Studio Code.
You will get a screen like this:
- Click Reopen in Container to open the Devcontainer. Node, yarn, etc. are installed on it, and that is all you need to build and run your webapp.
This only needs to be done once.
- Create the .env file. In your Visual Studio Code terminal, enter:
cd webapp
cp env-template .env
- Update the file, entering the credentials for your recently created Azure Cognitive Services resource:
REACT_APP_COG_SERVICE_KEY=<your cognitive service key>
REACT_APP_COG_SERVICE_LOCATION=<your cognitive service region>
- On the terminal enter:
cd webapp
yarn start
You will have a response like this on Visual Studio Code:
- Click Open in Browser
And, after some time, you will see the app in the browser:
- Click start and begin speaking in English.
It will start transcribing what you say:
- Click stop to stop transcribing.
- Click export to download the transcription.
This solution demonstrates the use of Azure Speech to Text's Continuous Recognition, using the JavaScript SDK.
It follows the instructions provided in the Azure Speech to Text documentation, under "Use continuous recognition".
The "magic" happens in this part of the code: /webapp/src/Components/Transcription.js
First, we define a speech configuration with code like this:
const speechConfig = sdk.SpeechConfig.fromSubscription("YourSpeechKey", "YourSpeechRegion");
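The key and region placeholders above correspond to the two values in the .env file created during installation. Assuming the webapp is built with Create React App (which the REACT_APP_ prefix suggests), those variables are available on process.env at build time. The sketch below shows one way to read them and fail early when either is missing. It is not the code in Transcription.js, and readSpeechSettings is a hypothetical helper name.

```javascript
// Hypothetical helper: reads the two values defined in webapp/.env.
// The env object is passed in so the function can also be exercised outside the browser.
function readSpeechSettings(env) {
  const key = (env.REACT_APP_COG_SERVICE_KEY || "").trim();
  const region = (env.REACT_APP_COG_SERVICE_LOCATION || "").trim();

  // Collect every missing variable so the error names all of them at once.
  const missing = [];
  if (!key) missing.push("REACT_APP_COG_SERVICE_KEY");
  if (!region) missing.push("REACT_APP_COG_SERVICE_LOCATION");
  if (missing.length > 0) {
    throw new Error(`Missing .env value(s): ${missing.join(", ")}`);
  }
  return { key, region };
}

// Usage inside the React app (assumes `sdk` is the imported Speech SDK module):
// const { key, region } = readSpeechSettings(process.env);
// const speechConfig = sdk.SpeechConfig.fromSubscription(key, region);
```

Failing at this point gives a clearer message than waiting for the recognizer to report a cancellation error after the first attempt to connect.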
To capture the computer's microphone, we use the browser's MediaDevices API. This is done when the website is first rendered, via this code:
const getMedia = async (constraints) => {
  let stream = null
  try {
    stream = await navigator.mediaDevices.getUserMedia(constraints)
    // stream is then passed to the recognizer
  } catch (err) {
    /* handle the error */
    alert(err)
    console.log(err)
  }
}
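When the microphone cannot be opened, getUserMedia rejects with a named error, and the name indicates why. As a sketch of how the generic alert(err) above could be made more specific, the hypothetical helpers below request an audio-only stream and translate the most common error names into a message. The mediaDevices object is passed in as a parameter (in the browser it would be navigator.mediaDevices). The helper names and the message wording are assumptions for illustration.

```javascript
// Map the error names getUserMedia can reject with to a readable message.
function describeMediaError(err) {
  switch (err && err.name) {
    case "NotAllowedError":
      return "Microphone access was denied. Allow it in the browser and reload.";
    case "NotFoundError":
      return "No microphone was found on this device.";
    case "NotReadableError":
      return "The microphone could not be read (it may be in use by another application).";
    default:
      return `Could not open the microphone: ${err && err.message ? err.message : err}`;
  }
}

// Request an audio-only stream. Resolves to { stream, error } instead of throwing,
// so the caller can render the message rather than rely on alert().
async function requestMicrophone(mediaDevices, constraints = { audio: true, video: false }) {
  if (!mediaDevices || typeof mediaDevices.getUserMedia !== "function") {
    // navigator.mediaDevices is undefined when the page is not in a secure context.
    return { stream: null, error: "Media devices are unavailable (is the page served over HTTPS or localhost?)." };
  }
  try {
    const stream = await mediaDevices.getUserMedia(constraints);
    return { stream, error: null };
  } catch (err) {
    return { stream: null, error: describeMediaError(err) };
  }
}

// Usage in the browser:
// const { stream, error } = await requestMicrophone(navigator.mediaDevices);
```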
We then define the audio configuration so that the recognizer reads from the stream obtained above, with code like this:
// configure Azure STT to listen to an audio Stream
const audioConfig = AudioConfig.fromStreamInput(stream)
Then we put it all together in one recognizer:
const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);
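The three steps above can be combined into a single function. The sketch below takes the SDK module as a parameter (in the app it would be the object imported from the microsoft-cognitiveservices-speech-sdk package). It also sets speechRecognitionLanguage to en-US, which is an assumption based on this scenario being English only. The function name createRecognizer is hypothetical.

```javascript
// Hypothetical factory: builds a recognizer from a key, a region and a microphone stream.
// `sdk` is the Speech SDK module, injected so the function has no hidden dependencies.
function createRecognizer(sdk, { key, region, stream, language = "en-US" }) {
  // Who we are and which regional endpoint to talk to.
  const speechConfig = sdk.SpeechConfig.fromSubscription(key, region);
  // Assumption: pin the language explicitly instead of relying on the service default.
  speechConfig.speechRecognitionLanguage = language;

  // Where the audio comes from: the MediaStream returned by getUserMedia.
  const audioConfig = sdk.AudioConfig.fromStreamInput(stream);

  return new sdk.SpeechRecognizer(speechConfig, audioConfig);
}

// Usage (names as in the snippets above):
// const recognizer = createRecognizer(sdk, { key: "YourSpeechKey", region: "YourSpeechRegion", stream });
```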
And create callbacks for when continuous recognition is running:
recognizer.recognizing = (s, e) => {
  // Interim hypothesis: replaced on every event while the speaker is mid-phrase
  // uncomment to debug
  // console.log(`RECOGNIZING: Text=${e.result.text}`)
  setRecognisingText(e.result.text)
  textRef.current.scrollTop = textRef.current.scrollHeight
}
recognizer.recognized = (s, e) => {
  // Final result for a phrase: clear the interim text, then append to the transcript
  setRecognisingText("")
  if (e.result.reason === sdk.ResultReason.RecognizedSpeech) {
    // uncomment to debug
    // console.log(`RECOGNIZED: Text=${e.result.text}`)
    // Appending to an empty transcript gives the same result, so no special case is needed
    setRecognisedText((recognisedText) => `${recognisedText}${e.result.text} `)
    textRef.current.scrollTop = textRef.current.scrollHeight
  }
  else if (e.result.reason === sdk.ResultReason.NoMatch) {
    console.log("NOMATCH: Speech could not be recognized.")
  }
}
recognizer.canceled = (s, e) => {
  console.log(`CANCELED: Reason=${e.reason}`)
  if (e.reason === sdk.CancellationReason.Error) {
    console.log(`CANCELED: ErrorCode=${e.errorCode}`)
    console.log(`CANCELED: ErrorDetails=${e.errorDetails}`)
    console.log("CANCELED: Did you set the speech resource key and region values?")
  }
  recognizer.stopContinuousRecognitionAsync()
}
recognizer.sessionStopped = (s, e) => {
  console.log("\n Session stopped event.")
  recognizer.stopContinuousRecognitionAsync()
}
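The state handling inside these callbacks comes down to two pieces of text: the finalised transcript, which only grows, and the interim hypothesis, which is replaced on every recognizing event and cleared on recognized. The framework-free sketch below replays that logic for two spoken phrases. The function names are hypothetical, and it assumes the text area shows the finalised text followed by the interim text.

```javascript
// Pure version of the update made in recognizer.recognized:
// append the final phrase plus a trailing space to the transcript so far.
function appendRecognised(recognisedText, phrase) {
  return `${recognisedText}${phrase} `;
}

// What the text area would show at any moment: final text followed by the interim guess.
function displayText(recognisedText, recognisingText) {
  return `${recognisedText}${recognisingText}`;
}

// Replay the event sequence for two spoken phrases.
let recognised = "";
let recognising = "";

recognising = "hello";                                      // recognizing event
recognising = "hello world";                                // recognizing event (refined guess)
recognised = appendRecognised(recognised, "Hello world.");  // recognized event
recognising = "";                                           // interim text cleared

recognising = "how are";                                    // recognizing event, second phrase
const midSentence = displayText(recognised, recognising);   // "Hello world. how are"

recognised = appendRecognised(recognised, "How are you?");  // recognized event
recognising = "";
const finalText = displayText(recognised, recognising);     // "Hello world. How are you? "
```

Keeping the interim text separate is what lets the display correct itself: the lower-cased, unpunctuated guess is discarded as soon as the final phrase arrives.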
Then, whenever we want the recognizer to run, we run:
recognizer.startContinuousRecognitionAsync()
And, to stop it we run:
recognizer.stopContinuousRecognitionAsync()
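Both calls are asynchronous: the SDK methods take optional success and error callbacks rather than returning a promise. If await is more convenient in the button handlers, the calls can be wrapped. The following is a sketch with hypothetical helper names (startRecognition, stopRecognition, and the setIsListening state setter in the usage comment), not code from Transcription.js.

```javascript
// Wrap the callback-style SDK methods in promises so they can be awaited.
function startRecognition(recognizer) {
  return new Promise((resolve, reject) => {
    recognizer.startContinuousRecognitionAsync(
      () => resolve(),
      (err) => reject(new Error(err))
    );
  });
}

function stopRecognition(recognizer) {
  return new Promise((resolve, reject) => {
    recognizer.stopContinuousRecognitionAsync(
      () => resolve(),
      (err) => reject(new Error(err))
    );
  });
}

// Usage in a hypothetical pair of button handlers:
// const onStart = async () => { await startRecognition(recognizer); setIsListening(true); };
// const onStop  = async () => { await stopRecognition(recognizer); setIsListening(false); };
```

Updating the UI only after the promise resolves keeps the button state in step with what the recognizer is actually doing.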
The entire code of this solution is here: /webapp/src/Components/Transcription.js
This is a quick and simple way to build a working real-time transcription webapp using React and JavaScript. This webapp recognizes English only, and simply transcribes what was said. It uses the standard model, so some industry-specific or subject-specific words may be missed.
Next, you will learn how to recognize other languages, and how to improve accuracy by providing extra vocabulary.