
chengsokdara / use-whisper

649 stars · 12 watchers · 124 forks · 319 KB

React hook for OpenAI Whisper with speech recorder, real-time transcription, and silence removal built-in

License: MIT

TypeScript 100.00%
api openai whisper hook react real-time

use-whisper's People

Contributors: chengsokdara


use-whisper's Issues

difference in the pitch and speed

When I replay the recorded speech, it is lower in pitch (or, when I use headphones, higher, resulting in a Donald Duck-like sound).
I have no idea why this happens. I compared it with using the web API to record and replay the same sound (spoken by myself with the same headphones), and the web API playback is normal.
Has anyone experienced the same issue?

Severe bug: uploading lots of audio produces HEAVY openai costs

Issue

In 5 minutes, useWhisper sent hundreds of requests, uploaded 74,684 seconds of audio (over 20 hours), and cost over $17!

It looks like you're uploading the entire cumulative recording every second?

Luckily I had billing limits set.

Config

const {
  recording,
  speaking,
  transcribing,
  transcript,
  pauseRecording,
  startRecording,
  stopRecording,
} = useWhisper({
  apiKey: getApiKey(),
  streaming: true,
  timeSlice: 1_000, // 1 second
  whisperConfig: {
    language: 'he',
  },
});

Proof

[screenshot]
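
A hedged workaround while this is open: since the cost comes from the streaming mode re-uploading the cumulative recording every timeSlice, disabling streaming should mean the audio is only transcribed once, when recording stops. A minimal sketch, reusing the reporter's own getApiKey() helper:

const {
  recording,
  transcribing,
  transcript,
  startRecording,
  stopRecording,
} = useWhisper({
  apiKey: getApiKey(),
  streaming: false, // no per-timeSlice uploads; transcribe only after stopRecording()
  whisperConfig: {
    language: 'he',
  },
});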

You may need an additional loader to handle the result of these loaders.

Failed to compile.

./node_modules/@chengsokdara/use-whisper/dist/chunk-3CCW4YJS.js 185:29
Module parse failed: Unexpected token (185:29)
File was processed with these loaders:

  • ./node_modules/babel-loader/lib/index.js

You may need an additional loader to handle the result of these loaders.

|   }, [p, C, f, s, h]),
|   ee = useCallbackAsync(async e => {
|     if (f && t.current && (H?.(e), l.current.push(e), (await t.current.getState()) === "recording")) {
|       let n = new Blob(l.current, {
|         type: "audio/webm;codecs=opus"
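
The unexpected token is most likely the optional-chaining operator (H?.(e)) in the published bundle, which older webpack/Babel toolchains cannot parse. A hedged sketch of one possible workaround, assuming you control the webpack config (the rule below is illustrative, not part of this library):

// webpack.config.js (sketch): run the published use-whisper bundle through Babel
// so that optional chaining is transpiled for older parsers.
module.exports = {
  module: {
    rules: [
      {
        test: /\.js$/,
        include: /node_modules[\\/]@chengsokdara[\\/]use-whisper/,
        use: {
          loader: 'babel-loader',
          options: {
            plugins: ['@babel/plugin-proposal-optional-chaining'],
          },
        },
      },
    ],
  },
};

Alternatively, upgrading to a webpack 5 based toolchain that understands this syntax should avoid the extra rule.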

Cannot read properties of undefined (reading 'transcript')

Please help me resolve this issue. I am trying to save 10 seconds of transcript, 5 times, into an array and display it.

import React, { useState } from 'react';
import { useWhisper } from '@chengsokdara/use-whisper';
import './App.css';

const App = () => {
  const [transcriptions, setTranscriptions] = useState([]);
  const { startRecording, stopRecording } = useWhisper({
    apiKey: 'API_KEY', // Replace with your actual OpenAI API token
    streaming: true,
    removeSilence: true,
    timeSlice: 1000, // 1 second
    whisperConfig: {
      language: 'en',
    },
    onTranscribe: (blob) => {
      return new Promise((resolve) => {
        const reader = new FileReader();
        reader.onloadend = () => {
          const text = reader.result;
          resolve({ text });
        };
        reader.readAsText(blob);
      });
    },
  });

  const recordAndSave = async () => {
    const recordingResult = await startRecording();
    setTimeout(async () => {
      const transcription = await stopRecording();
      setTranscriptions((prevTranscriptions) => [...prevTranscriptions, recordingResult.transcript.text]);
    }, 10000); // Record for 10 seconds
  };

  const repeatRecording = async () => {
    for (let i = 0; i < 5; i++) {
      await recordAndSave();
    }
  };

  return (
    <div className="App">
      <header className="App-header">
        <h1>Real-Time Audio Transcription</h1>
      </header>
      <main>
        <div>
          <p>Transcribed Texts:</p>
          <ul>
            {transcriptions.map((text, index) => (
              <li key={index}>{text}</li>
            ))}
          </ul>
        </div>
        <div>
          <button onClick={repeatRecording}>Start Recording 5 Times</button>
        </div>
      </main>
    </div>
  );
};

export default App;

[screenshot from 2023-08-15 14-21-52]

Exposing api key

Even if it is stored in an env variable, using your api key in the client can still expose it. Do you have any suggestions to fix this issue? Maybe moving part of the architecture to the server?
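
One hedged option, building on the hook's onTranscribe callback that appears in other issues here: keep the OpenAI key on your own backend and have the browser post the audio blob to an endpoint you control. The /api/whisper route below is hypothetical; you would implement it yourself with the key stored server-side.

import { useWhisper } from '@chengsokdara/use-whisper'

// Sketch only: the browser never sees the OpenAI key, it only calls your server.
const onTranscribe = async (blob) => {
  const body = new FormData()
  body.append('file', blob, 'speech.webm')
  const response = await fetch('/api/whisper', { method: 'POST', body })
  const { text } = await response.json()
  // per the README examples, the callback resolves with the blob and the text
  return { blob, text }
}

const { transcript, startRecording, stopRecording } = useWhisper({
  // no apiKey in the client; transcription is delegated to the callback above
  onTranscribe,
})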

Calling onTranscribe as soon as a speaker stops talking

Hi @chengsokdara, I'm loving this project, and I especially like the custom server functionality.

I am working on a project where I am recording multiple speakers talking in turns. My hope is to use useWhisper to transcribe the entire conversation, but I'd like to do so one piece at a time.

I was wondering if there is a way to configure useWhisper to trigger an onTranscribe event when a speaker stops talking for a brief period (basically when the speaking variable goes from true to false) and then reset the audio file to set up for recording the next speaking block.

I saw your examples using the customizable callback functions but I wasn't sure how to properly configure them for this case.

Thanks,
Kyle
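
A hedged sketch of one way this might be approximated with options that appear elsewhere in these issues (nonStop and stopTimeout), which stop recording automatically after a short silence so each speaking turn is transcribed on its own; restarting for the next speaker would still need an explicit startRecording() call. openAiToken stands in for however you supply the key:

// Sketch: auto-stop shortly after speech ends, producing one transcription per turn.
const { speaking, transcript, startRecording, stopRecording } = useWhisper({
  apiKey: openAiToken,
  nonStop: true,     // keep recording while the user is speaking
  stopTimeout: 2000, // stop ~2 seconds after speech ends, then transcribe
})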

Docker Image Webservice API for Whisper AI instead of OpenAI API Token

I don't know much about AI, but I learned that OpenAI charges money for using their API's GPU computation. There is a Docker web-service API for Whisper that doesn't cost money, so I would prefer that route. My question is: can I use that Docker service in my React project with this hook?

Storing transcript text in a variable problem

I'm currently using your library in my project and everything works fine. However, I can't store the transcript text as soon as the response comes back from the API. I can use setTimeout and store the textContent of the output DOM element in a variable, but that is neither flexible nor efficient. It would be better if I could store the transcript text right after getting the response.
This is how I'm currently using it:

const data = useWhisper({ apiKey: "<MY API KEY>" })

const handleInput = async () => {
  await data.stopRecording();
  setTimeout(() => {
    let text = document.getElementById('myText').textContent
    console.log(text);
    handleAnswer(text)
  }, 500);
}
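
A hedged alternative, based on the transcript state the hook already exposes: react to transcript updates with an effect instead of reading the DOM after a delay. handleAnswer is the reporter's own function.

// Sketch: consume transcript.text from the hook's state as soon as it changes.
import { useEffect } from 'react';

const { transcript, stopRecording } = useWhisper({ apiKey: '<MY API KEY>' });

useEffect(() => {
  if (transcript.text) {
    handleAnswer(transcript.text);
  }
}, [transcript]);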

error while streaming or after stop recording POST https://api.openai.com/v1/audio/transcriptions 401

Hi, I am getting a 401 error while I am speaking.
I am using React with Next.js.
xhr.js:251 POST https://api.openai.com/v1/audio/transcriptions 401
dispatchXhrRequest @ xhr.js:251
xhr @ xhr.js:49
dispatchRequest @ dispatchRequest.js:51
request @ Axios.js:146
httpMethod @ Axios.js:185
wrap @ bind.js:5
eval @ chunk-32KRFHOA.js:5
await in eval (async)
eval @ chunk-YORICPLC.js:1
Z @ chunk-32KRFHOA.js:5
await in Z (async)
eval @ RecordRTC.js:3201
webWorker.onmessage @ RecordRTC.js:2810
client.js:1 useMemo AxiosError {message: 'Request failed with status code 401', name: 'AxiosError', code: 'ERR_BAD_REQUEST', config: {…}, request: XMLHttpRequest, …}

The transcribing status was true while I was recording, but it kept producing this error.

The code I am using is the same as you provided, except for my API token.
Any suggestions or help, please?

import { useWhisper } from "@chengsokdara/use-whisper";

const LiveWhisper = () => {
  const {
    recording,
    speaking,
    transcribing,
    transcript,
    pauseRecording,
    startRecording,
    stopRecording,
  } = useWhisper({
    apiKey: process.env.NEXT_PUBLIC_OPENAI_API_TOKEN, // YOUR_OPEN_AI_TOKEN
    streaming: true,
    timeSlice: 1_000, // 1 second
    whisperConfig: {
      language: "en",
    },
  });

  return (
    <div>
      <p>Recording: {recording}</p>
      <p>Speaking: {speaking}</p>
      <p>Transcribing: {transcribing}</p>
      <p>Transcribed Text: {transcript.text}</p>
      <button onClick={() => startRecording()}>Start</button>
      <button onClick={() => pauseRecording()}>Pause</button>
      <button onClick={() => stopRecording()}>Stop</button>
    </div>
  );
};

export default LiveWhisper;

uncaught TypeError: Cannot read properties of null (reading 'useRef') at useRef

When I tried the custom server example code and

const { transcript } = useWhisper({
  // callback to handle transcription with custom server
  onTranscribe,
})

it breaks before the page loads, and I get these errors:

Uncaught TypeError: Cannot read properties of null (reading 'useRef')
at useRef (react.development.js:1630:1)
at ue (chunk-32KRFHOA.js:5:1)
at App (App.tsx:38:1)
at renderWithHooks (react-dom.development.js:16305:1)

Warning: Invalid hook call. Hooks can only be called inside of the body of a function component. This could happen for one of the following reasons:

  1. You might have mismatching versions of React and the renderer (such as React DOM)
  2. You might be breaking the Rules of Hooks
  3. You might have more than one copy of React in the same app
    See https://reactjs.org/link/invalid-hook-call for tips about how to debug and fix this problem.
    at App (http://localhost:3000/main.7485a4e66b96db027f0c.hot-update.js:67:76)
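
Reason 3 above (more than one copy of React in the app) is a common culprit when a hook crashes on useRef like this. A quick, hedged way to check with npm is to inspect the dependency tree:

# list every installed copy of react and react-dom; more than one version indicates a duplicate
npm ls react react-dom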

Issue chaining actions

First off, thanks for your work. I find this very useful. The base application is working fine. I can use all my buttons and I can get the transcript ok.

But where I am facing issues is when I want to chain the actions. For example, I want to:

  1. Press Start to record and then press Stop.
  2. Then I want the voiced prompt to be transcribed into text and fed into my google speech library to get a response.
  3. I want that response Audio to be played

My functions:

  1. startRecording -> stopRecording will generate the transcript.
  2. sendMessage(): Takes the transcript and prompts chatGPT, then returns the text response.
  3. listenAudio: Takes the text response and uses google TTS to voice the response.

My issue is that transcript.text is undefined after stopRecording ends, so I can't feed it into sendMessage. I've tried a few different approaches and got close, but not quite there yet.

So, for now it's all a manual task: Start, Stop, Send Message, Listen to response.

Any clues on how to make better use of the API and get the transcript on demand?

Invalid file format

When using your standard configuration:

import { useWhisper } from '@chengsokdara/use-whisper'

const App = () => {
  const {
    recording,
    speaking,
    transcribing,
    transcript,
    pauseRecording,
    startRecording,
    stopRecording,
  } = useWhisper({
    apiKey: import.meta.env.VITE_OPENAI_API_KEY, // YOUR_OPEN_AI_TOKEN
  })

  return (
    <div>
      <p>Recording: {recording}</p>
      <p>Speaking: {speaking}</p>
      <p>Transcribing: {transcribing}</p>
      <p>Transcribed Text: {transcript.text}</p>
      <button onClick={() => startRecording()}>Start</button>
      <button onClick={() => pauseRecording()}>Pause</button>
      <button onClick={() => stopRecording()}>Stop</button>
    </div>
  )
}

Error: Invalid file format. Supported formats: ['m4a', 'mp3', 'webm', 'mp4', 'mpga', 'wav', 'mpeg']

Can you please add a prop to disable autotranslate to English?

First of all, thank you for creating this awesome hook. The hook automatically translates the transcribed text to English. Would it be possible to add a prop so that we can turn this translation on or off? Additionally, instead of passing a server link, could we also just pass a function? The function would then connect to the API directly and return the response object.

Thank you once again.

Issues with vite dev server

I tried to get this hook to work for a couple of hours and finally realized that the Vite dev server (in a fresh project initialized with pnpm create vite) does in fact not play well with it.

I have not yet figured out exactly what is stopping it from working, but I want to leave this issue here in the meantime so others are aware.
If you're using Vite, an awkward but workable workaround is to run vite build --watch and vite preview; then the hook works as expected.

Suggested mode

Is there interest in a mode for the following, or is it already possible?

Start listening automatically to voice until a break (this is already possible using the config below), but then allow the user to restart the same flow again at any point by talking again.

    nonStop: true, // keep recording as long as the user is speaking
    stopTimeout: 2000, // auto stop 2 seconds after speech ends

At the minute, the only way to achieve this seems to be the streaming option, but the problem I found is that the transcript becomes one continuous message, whereas I would like it broken up into separate chunks as they are spoken.

Getting undefined in output for transcript

It is not giving any error, though the output for transcript is
{blob: undefined, text: undefined}
[screenshot 2024-04-02 153201]

I am using the very first example in the git repo.

import React, { useState, useEffect } from 'react'
import { useWhisper } from '@chengsokdara/use-whisper'

export default function OpenAIDialog() {
    const {
        recording,
        speaking,
        transcribing,
        transcript,
        pauseRecording,
        startRecording,
        stopRecording,
    } = useWhisper({
        apiKey: 'Key', // YOUR_OPEN_AI_TOKEN
    })
    useEffect(() => {
        console.log('transcribing', transcribing)
        console.log('transcript', transcript)
        console.log('recording', recording)
        console.log('speaking', speaking)
    }, [recording, speaking, transcribing, transcript])
    return (
        <div>
            <p>Recording: {recording}</p>
            <p>Speaking: {speaking}</p>
            <p>Transcribing: {transcribing}</p>
            <p>Transcribed Text: {transcript.text}</p>
            <button onClick={() => startRecording()}>Start</button>
            <button onClick={() => pauseRecording()}>Pause</button>
            <button onClick={() => stopRecording()}>Stop</button>
        </div>
    )
}

Transcript object always undefined

When running the first code snippet provided in the README, I always get the transcript text and blob as undefined.
Can you help or advise, please?

Integrate Monsterapi Whisper ASR

Hi @chengsokdara, this is a very good project for exploring the real-time streaming use case.

We released a highly optimised Whisper large-v2 API on MonsterAPI, which reduces the cost of accessing the Whisper model by up to 6x compared to the OpenAI API. Our API also scales on demand.

I am raising a request to integrate our Whisper API into your project, so that developers using it can get cost-effective access to powerful Whisper ASR through MonsterAPI.

Please find below links to our API docs and free playground:

All that a developer needs is an API token to get started with accessing the APIs.

Let me know your thoughts.

Reducing streaming costs over extended periods of time

Currently, the streaming feature works perfectly fine, but every timeSlice it resends the entire audio stream from the beginning. This makes costs grow quadratically with session length: recording around 15 minutes can cost up to $10 with a timeSlice of 1 second.

To avoid such high costs, I suggest implementing a new feature that would resend only the last n seconds of the audio stream. This would still provide some context while reducing the number of seconds being sent and thus lowering the costs.

I believe that this improvement would not only make the streaming feature more cost-effective but also enhance its overall performance.

In the attached screenshot you can see the API usage from a 15 minutes streaming transcription:

[screenshot 2023-03-26 at 19 49 47]
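
For illustration only, a hedged sketch of the proposed idea (not the library's current internals; every name here is hypothetical). It assumes the recorder keeps one blob per timeSlice, and a real implementation would also have to deal with the WebM container header, which lives in the very first chunk:

// Sketch of the proposal: resend only the last `windowSeconds` of audio.
function lastWindow(chunks, timeSliceMs, windowSeconds) {
  const keep = Math.ceil((windowSeconds * 1000) / timeSliceMs)
  // keep the header chunk plus the most recent `keep` chunks
  const recent = [chunks[0], ...chunks.slice(-keep)]
  return new Blob(recent, { type: 'audio/webm;codecs=opus' })
}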

It does not seem to work for any reason

Hi @chengsokdara

The package does not seem to work for me for no apparent reason. I added a bunch of extra packages like @ffmpeg/ffmpeg, hark, openai, and recordrtc just to be absolutely sure.

I have a pretty simple setup just for demo purposes, and when I start recording I get the following logs in the console:
[screenshot 2023-04-11 at 11 40 48 AM]

These suggest that everything is working fine, but when I log the transcript I always get {blob: undefined, text: undefined} as the output.

My recording status is also true when I am speaking but for some reason the output blob and text is always undefined.

My env:

  1. Mac M1
  2. CRA
  3. React 17.0.2
  4. use-whisper 0.2.0

Following is a snippet of what I have done:

import React from "react";
import {useWhisper} from "@chengsokdara/use-whisper";

const App: React.FC = () => {

  const {startRecording, stopRecording, transcript, recording} = useWhisper({
    apiKey: key,
  });
  console.log("transcript", transcript);
  console.log("recording", recording);
  console.log("...........");

  return (
    <>
      <button onClick={() => startRecording()}>start</button>
    </>
  );
};

export default App;

Thanks in advance
Cheers

Reset the transcript object to its default value

Thank you so much for this great project!

I am using your library to develop a chat application with voice input. However, I have encountered an issue where the transcript variable retains the previous value after sending a message.

In this case, it would be great if you provide a method that resets the transcript variable to its default state.
Here is my PR #29 where I provide my solution.
Alternatively, if you have any other suggestions for resolving this issue, please let me know.

Add error handling

First, thank you for making this hook!

It seems like there's no way to capture errors? We'd like to get insight into the failures, as we've been getting complaints from users about transcription failing but we have no visibility into errors, for example:

[screenshot]

Is there a workaround to get access to internal errors?
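
One hedged workaround until the hook exposes an error callback: supply your own onTranscribe and wrap the request yourself, so failures surface in your code. transcribeOnServer and reportError below are hypothetical helpers you would provide.

// Sketch: own the transcription request so errors become visible to the app.
const onTranscribe = async (blob) => {
  try {
    const text = await transcribeOnServer(blob)
    return { blob, text }
  } catch (error) {
    reportError(error)        // forward to your logging / monitoring
    return { blob, text: '' } // keep the hook's state consistent
  }
}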

Issues using it on replit

If I use Replit secrets for my OpenAI key, I get this:

Unhandled Runtime Error
Error: apiKey is required if onTranscribe is not provided

Here is where I call the key:

  const {
    recording,
    speaking,
    transcribing,
    transcript,
    pauseRecording,
    startRecording,
    stopRecording,
  } = useWhisper({
    apiKey: process.env['OPEN_API_KEY'],
  })

If I just pass the key as a string it works, but obviously this is not something you want to do :)

Error on deploying

npm ERR! Error while executing:
npm ERR! /usr/bin/git ls-remote -h -t ssh://git@github.com/zhuker/lamejs.git
npm ERR!
npm ERR! Host key verification failed.
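
This usually means the build machine is trying to fetch the lamejs dependency over SSH without GitHub credentials. A hedged workaround, assuming the deploy environment lets you run git config, is to rewrite SSH GitHub URLs to HTTPS before npm install:

# rewrite ssh://git@github.com/ URLs to HTTPS so no SSH host key or credentials are needed
git config --global url."https://github.com/".insteadOf ssh://git@github.com/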

Using streaming + onTranscribe (custom server) together?

Very impressed by this project, thank you so much for it!

Is there some way to be able to stream the audio to a server endpoint (as in the examples) but also have it iteratively return results? Right now it seems like if streaming: true is set, it will only hit the whisper api directly from the frontend (e.g. https://api.openai.com/v1/audio/transcriptions).

That means there's quite a long pause between the end of recording and getting the result (since ffmpeg has to run at the end, and then a fairly large file has to be uploaded before the transcription comes back). I'm curious whether there's a way to avoid that with the current design?

Feature Request: Retry transcription using previous recording on internet connectivity error


Current behavior:

  • If internet connectivity is lost during recording, the transcription process fails.
  • The user needs to manually restart the recording and wait for the entire audio to be captured again.

Desired behavior:

  • When internet connectivity is lost during transcription, the library should automatically attempt to retry using the previously recorded audio.
  • This will save users time and prevent them from having to re-record the entire audio.

Benefits:

  • Improved user experience by preventing unnecessary re-recordings.
  • Increased reliability and robustness of the transcription process.

Thank you for considering this feature request.

Module not found @ffmpeg/core

I've had an issue running this and had to add @ffmpeg/core as a dependency to fix. Should this be added as a library dependency?
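
For anyone hitting the same error, the workaround described above is simply to install the package alongside the hook (assuming npm):

npm install @ffmpeg/core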

Safari support?

I think Safari may not support the current codec used, but I'm not sure if there's a way to detect what codecs a browser supports at runtime.
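
There is a standard way to probe codec support at runtime: MediaRecorder.isTypeSupported. A hedged sketch, assuming the hook records audio/webm;codecs=opus (the type that appears in the bundled code quoted in another issue above):

// Check at runtime whether this browser's MediaRecorder can produce the codec.
if (!MediaRecorder.isTypeSupported('audio/webm;codecs=opus')) {
  // e.g. Safari: fall back to another flow or a Safari-friendly type such as 'audio/mp4'
  console.warn('audio/webm;codecs=opus is not supported in this browser')
}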

Input a rtmp stream

Is it possible for the stream used for real-time transcription to come from an RTMP stream, for example?

Is the repo actively maintained

Hey @chengsokdara! Awesome job writing this hook. I would absolutely love to try it out, but I am not sure whether this library is still maintained, as there are a couple of critical requests pending in the issues with no response to them.

Are you still planning to maintain it, or to give maintainer rights to anybody?

Thanks a lot for the great work!
