
chengsokdara / use-whisper

649 stars · 12 watchers · 124 forks · 319 KB

React hook for OpenAI Whisper with speech recorder, real-time transcription, and silence removal built-in

License: MIT

TypeScript 100.00%
api openai whisper hook react real-time

use-whisper's People

Contributors: chengsokdara


use-whisper's Issues

difference in the pitch and speed

When I replay the recorded speech, it is lower in pitch (or, when I use headphones, higher, resulting in a Donald Duck-like sound).
I have no idea why this happens. I compared it with using the web API to record and replay the same sound (spoken by myself with the same headphones), and the web API playback is normal.
Has anyone experienced the same issue?

Severe bug: uploading lots of audio produces HEAVY openai costs

Issue

In 5 minutes, useWhisper sent hundreds of requests, uploaded 74,684 seconds of audio (over 20 hours), and cost over $17!

It looks like you're uploading the entire cumulative recording every second?

Luckily I had billing limits set.

Config

const {
  recording,
  speaking,
  transcribing,
  transcript,
  pauseRecording,
  startRecording,
  stopRecording,
} = useWhisper({
  apiKey: getApiKey(),
  streaming: true,
  timeSlice: 1_000, // 1 second
  whisperConfig: {
    language: 'he',
  },
});

Proof

[screenshot]
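
A hedged workaround while this is open: since the cost comes from the streaming mode re-uploading the cumulative recording every timeSlice, disabling streaming should mean the audio is only transcribed once, when recording stops. A minimal sketch, reusing the reporter's own getApiKey() helper:

const {
  recording,
  transcribing,
  transcript,
  startRecording,
  stopRecording,
} = useWhisper({
  apiKey: getApiKey(),
  streaming: false, // no per-timeSlice uploads; transcribe only after stopRecording()
  whisperConfig: {
    language: 'he',
  },
});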

You may need an additional loader to handle the result of these loaders.

Failed to compile.

./node_modules/@chengsokdara/use-whisper/dist/chunk-3CCW4YJS.js 185:29
Module parse failed: Unexpected token (185:29)
File was processed with these loaders:

  • ./node_modules/babel-loader/lib/index.js

You may need an additional loader to handle the result of these loaders.

|   }, [p, C, f, s, h]),
|   ee = useCallbackAsync(async e => {
|     if (f && t.current && (H?.(e), l.current.push(e), (await t.current.getState()) === "recording")) {
|       let n = new Blob(l.current, {
|         type: "audio/webm;codecs=opus"
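
The unexpected token is most likely the optional-chaining operator (H?.(e)) in the published bundle, which older webpack/Babel toolchains cannot parse. A hedged sketch of one possible workaround, assuming you control the webpack config (the rule below is illustrative, not part of this library):

// webpack.config.js (sketch): run the published use-whisper bundle through Babel
// so that optional chaining is transpiled for older parsers.
module.exports = {
  module: {
    rules: [
      {
        test: /\.js$/,
        include: /node_modules[\\/]@chengsokdara[\\/]use-whisper/,
        use: {
          loader: 'babel-loader',
          options: {
            plugins: ['@babel/plugin-proposal-optional-chaining'],
          },
        },
      },
    ],
  },
};

Alternatively, upgrading to a webpack 5 based toolchain that understands this syntax should avoid the extra rule.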

Cannot read properties of undefined (reading 'transcript')

Please help me resolve this issue. I am trying to save 10 seconds of transcript, 5 times, into an array and display it.

import React, { useState } from 'react';
import { useWhisper } from '@chengsokdara/use-whisper';
import './App.css';

const App = () => {
  const [transcriptions, setTranscriptions] = useState([]);
  const { startRecording, stopRecording } = useWhisper({
    apiKey: 'API_KEY', // Replace with your actual OpenAI API token
    streaming: true,
    removeSilence: true,
    timeSlice: 1000, // 1 second
    whisperConfig: {
      language: 'en',
    },
    onTranscribe: (blob) => {
      return new Promise((resolve) => {
        const reader = new FileReader();
        reader.onloadend = () => {
          const text = reader.result;
          resolve({ text });
        };
        reader.readAsText(blob);
      });
    },
  });

  const recordAndSave = async () => {
    const recordingResult = await startRecording();
    setTimeout(async () => {
      const transcription = await stopRecording();
      setTranscriptions((prevTranscriptions) => [...prevTranscriptions, recordingResult.transcript.text]);
    }, 10000); // Record for 10 seconds
  };

  const repeatRecording = async () => {
    for (let i = 0; i < 5; i++) {
      await recordAndSave();
    }
  };

  return (
    <div className="App">
      <header className="App-header">
        <h1>Real-Time Audio Transcription</h1>
      </header>
      <main>
        <div>
          <p>Transcribed Texts:</p>
          <ul>
            {transcriptions.map((text, index) => (
              <li key={index}>{text}</li>
            ))}
          </ul>
        </div>
        <div>
          <button onClick={repeatRecording}>Start Recording 5 Times</button>
        </div>
      </main>
    </div>
  );
};

export default App;

[screenshot from 2023-08-15 14-21-52]

Exposing api key

Even if it is stored in an env variable, using your api key in the client can still expose it. Do you have any suggestions to fix this issue? Maybe moving part of the architecture to the server?
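
One hedged option, building on the hook's onTranscribe callback that appears in other issues here: keep the OpenAI key on your own backend and have the browser post the audio blob to an endpoint you control. The /api/whisper route below is hypothetical; you would implement it yourself with the key stored server-side.

import { useWhisper } from '@chengsokdara/use-whisper'

// Sketch only: the browser never sees the OpenAI key, it only calls your server.
const onTranscribe = async (blob) => {
  const body = new FormData()
  body.append('file', blob, 'speech.webm')
  const response = await fetch('/api/whisper', { method: 'POST', body })
  const { text } = await response.json()
  // per the README examples, the callback resolves with the blob and the text
  return { blob, text }
}

const { transcript, startRecording, stopRecording } = useWhisper({
  // no apiKey in the client; transcription is delegated to the callback above
  onTranscribe,
})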

Calling onTranscribe as soon as a speaker stops talking

Hi @chengsokdara, I'm loving this project, and I especially like the custom server functionality.

I am working on a project where I am recording multiple speakers talking in turns. My hope is to use useWhisper to transcribe the entire conversation, but I'd like to do so one piece at a time.

I was wondering if there is a way to configure useWhisper to trigger an onTranscribe event when a speaker stops talking for a brief period (basically when the speaking variable goes from true to false) and then reset the audio file to set up for recording the next speaking block.

I saw your examples using the customizable callback functions but I wasn't sure how to properly configure them for this case.

Thanks,
Kyle
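
A hedged sketch of one way this might be approximated with options that appear elsewhere in these issues (nonStop and stopTimeout), which stop recording automatically after a short silence so each speaking turn is transcribed on its own; restarting for the next speaker would still need an explicit startRecording() call. openAiToken stands in for however you supply the key:

// Sketch: auto-stop shortly after speech ends, producing one transcription per turn.
const { speaking, transcript, startRecording, stopRecording } = useWhisper({
  apiKey: openAiToken,
  nonStop: true,     // keep recording while the user is speaking
  stopTimeout: 2000, // stop ~2 seconds after speech ends, then transcribe
})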

Docker Image Webservice API for Whisper AI instead of OpenAI API Token

I don't know much about AI, but I learned that OpenAI charges money for using their API's GPU computation. There is a Docker web-service API for Whisper that doesn't cost money, so I would prefer that route. My question is: can I use that Docker service in my React project with this hook?

Storing transcript text in a variable problem

I'm currently using your library in my project and everything works fine. However, I can't store the transcript text as soon as the response comes back from the API. I can use setTimeout and store the textContent of the output DOM element in a variable, but that is neither flexible nor efficient. It would be better if I could store the transcript text right after getting the response.
This is how I'm currently using it:

const data = useWhisper({ apiKey: "<MY API KEY>" })

const handleInput = async () => {
  await data.stopRecording();
  setTimeout(() => {
    let text = document.getElementById('myText').textContent
    console.log(text);
    handleAnswer(text)
  }, 500);
}
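
A hedged alternative, based on the transcript state the hook already exposes: react to transcript updates with an effect instead of reading the DOM after a delay. handleAnswer is the reporter's own function.

// Sketch: consume transcript.text from the hook's state as soon as it changes.
import { useEffect } from 'react';

const { transcript, stopRecording } = useWhisper({ apiKey: '<MY API KEY>' });

useEffect(() => {
  if (transcript.text) {
    handleAnswer(transcript.text);
  }
}, [transcript]);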

error while streaming or after stop recording POST https://api.openai.com/v1/audio/transcriptions 401

Hi, I am getting a 401 error while I am speaking.
I am using React with Next.js.
xhr.js:251 POST https://api.openai.com/v1/audio/transcriptions 401
dispatchXhrRequest @ xhr.js:251
xhr @ xhr.js:49
dispatchRequest @ dispatchRequest.js:51
request @ Axios.js:146
httpMethod @ Axios.js:185
wrap @ bind.js:5
eval @ chunk-32KRFHOA.js:5
await in eval (async)
eval @ chunk-YORICPLC.js:1
Z @ chunk-32KRFHOA.js:5
await in Z (async)
eval @ RecordRTC.js:3201
webWorker.onmessage @ RecordRTC.js:2810
client.js:1 useMemo AxiosError {message: 'Request failed with status code 401', name: 'AxiosError', code: 'ERR_BAD_REQUEST', config: {…}, request: XMLHttpRequest, …}

The transcribing status was true while I was recording, but it kept producing this error.

The code I am using is the same as you provided, except for my API token.
Any suggestions or help, please?

import { useWhisper } from "@chengsokdara/use-whisper";

const LiveWhisper = () => {
  const {
    recording,
    speaking,
    transcribing,
    transcript,
    pauseRecording,
    startRecording,
    stopRecording,
  } = useWhisper({
    apiKey: process.env.NEXT_PUBLIC_OPENAI_API_TOKEN, // YOUR_OPEN_AI_TOKEN
    streaming: true,
    timeSlice: 1_000, // 1 second
    whisperConfig: {
      language: "en",
    },
  });

  return (
    <div>
      <p>Recording: {recording}</p>
      <p>Speaking: {speaking}</p>
      <p>Transcribing: {transcribing}</p>
      <p>Transcribed Text: {transcript.text}</p>
      <button onClick={() => startRecording()}>Start</button>
      <button onClick={() => pauseRecording()}>Pause</button>
      <button onClick={() => stopRecording()}>Stop</button>
    </div>
  );
};

export default LiveWhisper;

uncaught TypeError: Cannot read properties of null (reading 'useRef') at useRef

When I tried the custom server example code and

const { transcript } = useWhisper({
  // callback to handle transcription with custom server
  onTranscribe,
})

it breaks before the page loads, and I get these errors:

Uncaught TypeError: Cannot read properties of null (reading 'useRef')
at useRef (react.development.js:1630:1)
at ue (chunk-32KRFHOA.js:5:1)
at App (App.tsx:38:1)
at renderWithHooks (react-dom.development.js:16305:1)

Warning: Invalid hook call. Hooks can only be called inside of the body of a function component. This could happen for one of the following reasons:

  1. You might have mismatching versions of React and the renderer (such as React DOM)
  2. You might be breaking the Rules of Hooks
  3. You might have more than one copy of React in the same app
    See https://reactjs.org/link/invalid-hook-call for tips about how to debug and fix this problem.
    at App (http://localhost:3000/main.7485a4e66b96db027f0c.hot-update.js:67:76)
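
Reason 3 above (more than one copy of React in the app) is a common culprit when a hook crashes on useRef like this. A quick, hedged way to check with npm is to inspect the dependency tree:

# list every installed copy of react and react-dom; more than one version indicates a duplicate
npm ls react react-dom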

Issue chaining actions

First off, thanks for your work. I find this very useful. The base application is working fine. I can use all my buttons and I can get the transcript ok.

But where I am facing issues is when I want to chain the actions. For example, I want to:

  1. Press Start to record and then press Stop.
  2. Then I want the voiced prompt to be transcribed into text and fed into my google speech library to get a response.
  3. I want that response Audio to be played

My functions:

  1. startRecording -> stopRecording will generate the transcript.
  2. sendMessage(): Takes the transcript and prompts chatGPT, then returns the text response.
  3. listenAudio: Takes the text response and uses google TTS to voice the response.

My issue is that transcript.text is undefined after stopRecording ends, so I can't feed it into sendMessage. I've tried a few different approaches and got close, but not quite there yet.

So, for now it's all a manual task: Start, Stop, Send Message, Listen to response.

Any clues on how to make better use of the API and get the transcript on demand?

Invalid file format

When using your standard configuration:

import { useWhisper } from '@chengsokdara/use-whisper'

const App = () => {
  const {
    recording,
    speaking,
    transcribing,
    transcript,
    pauseRecording,
    startRecording,
    stopRecording,
  } = useWhisper({
    apiKey: import.meta.env.VITE_OPENAI_API_KEY, // YOUR_OPEN_AI_TOKEN
  })

  return (
    <div>
      <p>Recording: {recording}</p>
      <p>Speaking: {speaking}</p>
      <p>Transcribing: {transcribing}</p>
      <p>Transcribed Text: {transcript.text}</p>
      <button onClick={() => startRecording()}>Start</button>
      <button onClick={() => pauseRecording()}>Pause</button>
      <button onClick={() => stopRecording()}>Stop</button>
    </div>
  )
}

Error: Invalid file format. Supported formats: ['m4a', 'mp3', 'webm', 'mp4', 'mpga', 'wav', 'mpeg']

Can you please add a prop to disable autotranslate to English?

First of all, thank you for creating this awesome hook. The hook automatically translates the transcribed text to English. Would it be possible to add a prop so that we can turn this translation on or off? Additionally, instead of passing a server link, could we also just pass a function? The function would then connect to the API directly and return the response object.

Thank you once again.

Issues with vite dev server

I tried to get this hook to work for a couple of hours and finally realized that the Vite dev server (in a fresh project initialized with pnpm create vite) does in fact not play well with it.

I have not yet figured out exactly what is stopping it from working, but I want to leave this issue here in the meantime so others are aware.
If you're using Vite, an awkward but workable workaround is to run vite build --watch and vite preview; then the hook works as expected.

Suggested mode

Is there interest in a mode for the following, or is it already possible?

Start listening automatically to voice until a break (this is already possible using the config below), but then allow the user to restart the same flow again at any point by talking again.

    nonStop: true, // keep recording as long as the user is speaking
    stopTimeout: 2000, // auto stop 2 seconds after speech ends

At the minute, the only way to achieve this seems to be the streaming option, but the problem I found is that the transcript becomes one continuous message, whereas I would like it broken up into separate chunks as they are spoken.

Getting undefined in output for transcript

It is not giving any error, though the output for transcript is
{blob: undefined, text: undefined}
[screenshot 2024-04-02 153201]

I am using the very first example in the git repo.

import React, { useState, useEffect } from 'react'
import { useWhisper } from '@chengsokdara/use-whisper'

export default function OpenAIDialog() {
    const {
        recording,
        speaking,
        transcribing,
        transcript,
        pauseRecording,
        startRecording,
        stopRecording,
    } = useWhisper({
        apiKey: 'Key', // YOUR_OPEN_AI_TOKEN
    })
    useEffect(() => {
        console.log('transcribing', transcribing)
        console.log('transcript', transcript)
        console.log('recording', recording)
        console.log('speaking', speaking)
    }, [recording, speaking, transcribing, transcript])
    return (
        <div>
            <p>Recording: {recording}</p>
            <p>Speaking: {speaking}</p>
            <p>Transcribing: {transcribing}</p>
            <p>Transcribed Text: {transcript.text}</p>
            <button onClick={() => startRecording()}>Start</button>
            <button onClick={() => pauseRecording()}>Pause</button>
            <button onClick={() => stopRecording()}>Stop</button>
        </div>
    )
}

Transcript object always undefined

When running the first code snippet provided in the README, I always get the transcript text and blob as undefined.
Can you help or advise, please?

Integrate Monsterapi Whisper ASR

Hi @chengsokdara, this is a very good project for exploring the real-time streaming use case.

We released a highly optimised Whisper large-v2 API on MonsterAPI, which reduces the cost of accessing the Whisper model by up to 6x compared to the OpenAI API. Our API also scales on demand.

I am raising a request to integrate our Whisper API into your project, so that developers using it can get cost-effective access to powerful Whisper ASR through MonsterAPI.

Please find below links to our API docs and free playground:

All that a developer needs is an API token to get started with accessing the APIs.

Let me know your thoughts.

Reducing streaming costs over extended periods of time

Currently, the streaming feature works perfectly fine, but every timeSlice it resends the entire audio stream from the beginning. This makes costs grow quadratically with session length: recording around 15 minutes can cost up to $10 with a timeSlice of 1 second.

To avoid such high costs, I suggest implementing a new feature that would resend only the last n seconds of the audio stream. This would still provide some context while reducing the number of seconds being sent and thus lowering the costs.

I believe that this improvement would not only make the streaming feature more cost-effective but also enhance its overall performance.

In the attached screenshot you can see the API usage from a 15 minutes streaming transcription:

[screenshot 2023-03-26 at 19 49 47]
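
For illustration only, a hedged sketch of the proposed idea (not the library's current internals; every name here is hypothetical). It assumes the recorder keeps one blob per timeSlice, and a real implementation would also have to deal with the WebM container header, which lives in the very first chunk:

// Sketch of the proposal: resend only the last `windowSeconds` of audio.
function lastWindow(chunks, timeSliceMs, windowSeconds) {
  const keep = Math.ceil((windowSeconds * 1000) / timeSliceMs)
  // keep the header chunk plus the most recent `keep` chunks
  const recent = [chunks[0], ...chunks.slice(-keep)]
  return new Blob(recent, { type: 'audio/webm;codecs=opus' })
}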

It does not seem to work for any reason

Hi @chengsokdara

The package does not seem to work for me for no apparent reason. I added a bunch of extra packages like @ffmpeg/ffmpeg, hark, openai, and recordrtc just to be absolutely sure.

I have a pretty simple setup just for demo purposes, and when I start recording I get the following logs in the console:
[screenshot 2023-04-11 at 11 40 48 AM]

These suggest that everything is working fine, but when I log the transcript I always get {blob: undefined, text: undefined} as the output.

My recording status is also true when I am speaking but for some reason the output blob and text is always undefined.

My env:

  1. Mac M1
  2. CRA
  3. React 17.0.2
  4. use-whisper 0.2.0

Following is a snippet of what I have done:

import React from "react";
import {useWhisper} from "@chengsokdara/use-whisper";

const App: React.FC = () => {

  const {startRecording, stopRecording, transcript, recording} = useWhisper({
    apiKey: key,
  });
  console.log("transcript", transcript);
  console.log("recording", recording);
  console.log("...........");

  return (
    <>
      <button onClick={() => startRecording()}>start</button>
    </>
  );
};

export default App;

Thanks in advance
Cheers

Reset the transcript object to its default value

Thank you so much for this great project!

I am using your library to develop a chat application with voice input. However, I have encountered an issue where the transcript variable retains the previous value after sending a message.

In this case, it would be great if you provide a method that resets the transcript variable to its default state.
Here is my PR #29 where I provide my solution.
Alternatively, if you have any other suggestions for resolving this issue, please let me know.

Add error handling

First, thank you for making this hook!

It seems like there's no way to capture errors? We'd like to get insight into the failures, as we've been getting complaints from users about transcription failing but we have no visibility into errors, for example:

[screenshot]

Is there a workaround to get access to internal errors?
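
One hedged workaround until the hook exposes an error callback: supply your own onTranscribe and wrap the request yourself, so failures surface in your code. transcribeOnServer and reportError below are hypothetical helpers you would provide.

// Sketch: own the transcription request so errors become visible to the app.
const onTranscribe = async (blob) => {
  try {
    const text = await transcribeOnServer(blob)
    return { blob, text }
  } catch (error) {
    reportError(error)        // forward to your logging / monitoring
    return { blob, text: '' } // keep the hook's state consistent
  }
}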

Issues using it on replit

If I use Replit secrets for my OpenAI key, I get this:

Unhandled Runtime Error
Error: apiKey is required if onTranscribe is not provided

Here is where I call the key:

  const {
    recording,
    speaking,
    transcribing,
    transcript,
    pauseRecording,
    startRecording,
    stopRecording,
  } = useWhisper({
    apiKey: process.env['OPEN_API_KEY'],
  })

If I just pass the key as a string it works, but obviously this is not something you want to do :)

Error on deploying

npm ERR! Error while executing:
npm ERR! /usr/bin/git ls-remote -h -t ssh://git@github.com/zhuker/lamejs.git
npm ERR!
npm ERR! Host key verification failed.
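
This usually means the build machine is trying to fetch the lamejs dependency over SSH without GitHub credentials. A hedged workaround, assuming the deploy environment lets you run git config, is to rewrite SSH GitHub URLs to HTTPS before npm install:

# rewrite ssh://git@github.com/ URLs to HTTPS so no SSH host key or credentials are needed
git config --global url."https://github.com/".insteadOf ssh://git@github.com/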

Using streaming + onTranscribe (custom server) together?

Very impressed by this project, thank you so much for it!

Is there some way to be able to stream the audio to a server endpoint (as in the examples) but also have it iteratively return results? Right now it seems like if streaming: true is set, it will only hit the whisper api directly from the frontend (e.g. https://api.openai.com/v1/audio/transcriptions).

That means there's quite a long pause between the end of recording and getting the result (since ffmpeg has to run at the end, and then a fairly large file has to be uploaded before the transcription comes back). I'm curious whether there's a way to avoid that with the current design?

Feature Request: Retry transcription using previous recording on internet connectivity error


Current behavior:

  • If internet connectivity is lost during recording, the transcription process fails.
  • The user needs to manually restart the recording and wait for the entire audio to be captured again.

Desired behavior:

  • When internet connectivity is lost during transcription, the library should automatically attempt to retry using the previously recorded audio.
  • This will save users time and prevent them from having to re-record the entire audio.

Benefits:

  • Improved user experience by preventing unnecessary re-recordings.
  • Increased reliability and robustness of the transcription process.

Thank you for considering this feature request.

Module not found @ffmpeg/core

I've had an issue running this and had to add @ffmpeg/core as a dependency to fix. Should this be added as a library dependency?
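
For anyone hitting the same error, the workaround described above is simply to install the package alongside the hook (assuming npm):

npm install @ffmpeg/core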

Safari support?

I think Safari may not support the current codec used, but I'm not sure if there's a way to detect what codecs a browser supports at runtime.
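
There is a standard way to probe codec support at runtime: MediaRecorder.isTypeSupported. A hedged sketch, assuming the hook records audio/webm;codecs=opus (the type that appears in the bundled code quoted in another issue above):

// Check at runtime whether this browser's MediaRecorder can produce the codec.
if (!MediaRecorder.isTypeSupported('audio/webm;codecs=opus')) {
  // e.g. Safari: fall back to another flow or a Safari-friendly type such as 'audio/mp4'
  console.warn('audio/webm;codecs=opus is not supported in this browser')
}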

Input a rtmp stream

Is it possible for the stream used for real-time transcription to come from an RTMP stream, for example?

Is the repo actively maintained

Hey @chengsokdara! Awesome job writing this hook. I would absolutely love to try it out, but I am not sure whether this library is still maintained, as there are a couple of critical requests pending in the issues with no response to them.

Are you still planning to maintain it, or to give maintainer rights to anybody?

Thanks a lot for the great work!
