collabora / whisperlive Goto Github PK

View Code? Open in Web Editor NEW

1.2K 1.2K 150.0 6.02 MB

A nearly-live implementation of OpenAI's Whisper.

License: MIT License

Python 71.14% Shell 1.54% HTML 5.69% JavaScript 20.50% CSS 1.13%

dictation obs openai tensorrt tensorrt-llm text-to-speech translation voice-recognition whisper whisper-tensorrt

whisperlive's People

Contributors

Stargazers

Watchers

Forkers

makaveli10 zoq niodiehard jpc yaospacetim gsn516 eliascm17 koljab ricson-hoo programingtw justinlevi studiobox-dev huangyingting vechtomov futurecoder unimatrix099 vqvu tristargod ianblenke rpavlik shubh587 koolaji theo77186 chayim1 issebisse heng-xiu furyhawk essingen123 sbrnaderi orrest widyhu2020 zhzb110 yuxiang822 ngeniedeveloper alvarocardenasc calebjacobs ethanzrd himmelroman matteokjh esatboucaud he-22 octonawish-akcodes sujianwei1 by123 wittech assassindesign asadal restrekai k0hacuu xiepengli ekp1k80 sarayel chronoz rpurinton gaojianj96 raivisdejus tarunkalra jonathanlehner henriklied codehardth abyssdawn librinostri positivewon rkelln bllxk yonlux ankit1057 turbo-agi israelws wingjoezhou vicki20july lightwastak3n vincentsider viningr gbouslov tuanbc mrcodechef gchust henri-edh hustshawn stinosko zhiyu-01 regression-io jimmy1984xu wangchengqun stkr22 fenardh luvwinnie rochemedia dave-parsons wikiup xjspace jeanvetorello ahmed-mh-ibrahim sensi426 zihan15 larrymai mtscnsbg aonoa aoeu256

whisperlive's Issues

problem when transcribe audio

Hello,

I have a problem when transcribe audio file.

File "C:\Users\atrabels\AppData\Local\Programs\Python\Python311\Lib\subprocess.py", line 1538, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [WinError 2] Le fichier spécifié est introuvable
i have solved this problem by changing shell=True in the subprocess. py file by i have faced new problem

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 84: invalid start byte

Any solution ?

Chrome extension doesn't work

When I installed and ran it, nothing happened and got the below error message. Could you help resolve this issue? Thanks!

ValueError: too many values to unpack

When the server tries to process the first 30 seconds of input from the browser I get this error:

INFO:faster_whisper:Processing audio with duration 00:30.000
Exception in thread Thread-11:
Traceback (most recent call last):
  File "/media/UltraStorageBTRFS/Programs/Linux/anaconda3/envs/whisper-live/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/media/UltraStorageBTRFS/Programs/Linux/anaconda3/envs/whisper-live/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/marc/Documents/GitClones/whisper-live/server.py", line 152, in speech_to_text
    self.language, lang_prob = self.transcriber.transcribe(
ValueError: too many values to unpack (expected 2)

The issue is with line 152, even though it has two variables to unpack to it's throwing an error. I fixed it by doing this but there's probably a better way. Line 152 onwards:

transcriber_output = self.transcriber.transcribe(
    input_bytes, 
    initial_prompt=None,
    language=self.language,
    task=self.task
)
self.language = transcriber_output[0]
lang_prob = transcriber_output[1]

Simple Client Recording Attempt

I start up the server via $ python ./run_server.py

(whisper_live)  whisperlive git:(main)✗  🚀 python ./run_server.py
Downloading: "https://github.com/snakers4/silero-vad/archive/master.zip" to /Users/justinwinter/.cache/torch/hub/master.zip
2023-08-21 12:14:34.119619 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '628'. It is not used by any node and should be removed from the model.
2023-08-21 12:14:34.119647 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '629'. It is not used by any node and should be removed from the model.
2023-08-21 12:14:34.119652 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '623'. It is not used by any node and should be removed from the model.
2023-08-21 12:14:34.119655 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '625'. It is not used by any node and should be removed from the model.
2023-08-21 12:14:34.119659 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '620'. It is not used by any node and should be removed from the model.
2023-08-21 12:14:34.119696 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '139'. It is not used by any node and should be removed from the model.
2023-08-21 12:14:34.119701 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '131'. It is not used by any node and should be removed from the model.
2023-08-21 12:14:34.119704 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '140'. It is not used by any node and should be removed from the model.
2023-08-21 12:14:34.119708 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '134'. It is not used by any node and should be removed from the model.
2023-08-21 12:14:34.119711 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '136'. It is not used by any node and should be removed from the model.
ERROR:root:no close frame received or sent

Then start up the client via:

(whisper_live)  whisperlive git:(main)✗  🚀 python ./run_client.py
[INFO]: * recording
[INFO]: Waiting for server ready ...
False en transcribe
[INFO]: Opened connection
[INFO]: Server Ready!
Traceback (most recent call last):
  File "/Users/justinwinter/projects/whisperlive/./run_client.py", line 3, in <module>
    client()
  File "/Users/justinwinter/projects/whisperlive/whisper_live/client.py", line 298, in __call__
    self.client.record()
  File "/Users/justinwinter/projects/whisperlive/whisper_live/client.py", line 234, in record
    data = self.stream.read(self.CHUNK)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/whisper_live/lib/python3.9/site-packages/pyaudio/__init__.py", line 570, in read
    return pa.read_stream(self._stream, num_frames,
OSError: [Errno -9981] Input overflowed

// run_client.py

from whisper_live.client import TranscriptionClient
client = TranscriptionClient("0.0.0.0", "8080", is_multilingual=False, lang="en", translate=False)
client()

requirements need onnxruntime==1.15.1

the new onnxruntime==1.16.0 breaks the whisper server

Docker installation on Mac doesn't work

I installed it by "docker build . -t whisper-live -f docker/Dockerfile.cpu" and then ran it by "docker run -it -p 9090:9090 whisper-live:latest". But it seems to just hang on terminal (after a while, it asked for microphone access, but that was it). Could you help troubleshoot this issue? Thank you!

Stuck at whisper-live sil

Integration with Twilio

Hello!
I am currently using Twilio as my method to make phone calls + streaming the voice data to my server.
However, when I tried to convert Twilio's x-mulaw audio format to PCM Linear (as expected from WhisperLive), I don't get any response from WhisperLive. In other words, I know that my GPU is working, I know that the audio data works (I converted the audio to a wav file and listened to it clearly) but I'm not getting any transcription.

I also thought it was worth noting that the audio was a bit quiet--not sure if that could be a source of suspicion?

Here's my code conversion (from x-mulaw to PCM)

audio = base64.b64decode(packet['media']['payload'])
audio = audioop.ulaw2lin(audio, 2)
audio = audioop.ratecv(audio, 2, 1, 8000, 16000, None)[0]
await websocket.send(audio)

Let me know if you'd like me to provide any additional context :)
Thanks for the help in advance!

how to consume it with webapplication? (forward the client() to another websocket)

Hi,

I need a bit technical aid,

I want to forward the transciption, but the client is just forwarding it to an html client.

How can i do it by tweaking the client calling?

String Representation and Precision Control for Start and End Outputs

I suggest setting the output of start and end as strings, and controlling them to three decimal places, which is accurate to millimeters.

Current outputs:

{
  "start": 29.304000000000002,
  "end": 30.624000000000002,
  "text": "OK"
}

Suggested:

{
  "start": "29.304"
  "end": "30.624",
  "text": "OK"
}

Modell size and modell change

Hi!
Thank you for this great project!

How can we switch the whisper model, to the latest whisperV3, and how can we use the large-v3 model?

Also, If I have a fine-tuned large-v3 model, how can I use the custom model?

The `setup.sh` file is using `CRLF`, which should be `LF` since it's a shell script.

The problem occurs when build the docker image.

Unexpected keyword argument 'model_size'

I ran the server.py on my Mac and was successful.
I then ran the client.py and got the following error:

# Run the client
from whisper_live.client import TranscriptionClient
client = TranscriptionClient(
  "localhost",
  9090,
  is_multilingual=True,
  lang="ko",
  translate=False,
  model_size="base"
)
client()

Traceback (most recent call last):
  File "/Users/asadal/Documents/Dev/Hani/WhisperLive_streamlit.py", line 5, in <module>
    client = TranscriptionClient(
TypeError: TranscriptionClient.__init__() got an unexpected keyword argument 'model_size'

How do I fix this?

==============

Plus:
Is there a way to save the text output by the chrome extension to a file?

Thanks,

The given NumPy array is not writable, and PyTorch does not support non-writable tensors

Looks like this package not testing on Windows

(whisper-live) E:\Workspace\github\whisper-live>python server.py
INFO:websockets.server:connection open
Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to C:\Users\ufo/.cache\torch\hub\master.zip
2023-07-02 10:32:14.0518293 [W:onnxruntime:, graph.cc:3543 onnxruntime::Graph::CleanUnusedInitializersAndNodeArgs] Removing initializer '620'. It is not used by any node and should be removed from the model.
2023-07-02 10:32:14.0551451 [W:onnxruntime:, graph.cc:3543 onnxruntime::Graph::CleanUnusedInitializersAndNodeArgs] Removing initializer '623'. It is not used by any node and should be removed from the model.
2023-07-02 10:32:14.0586288 [W:onnxruntime:, graph.cc:3543 onnxruntime::Graph::CleanUnusedInitializersAndNodeArgs] Removing initializer '625'. It is not used by any node and should be removed from the model.
2023-07-02 10:32:14.0626213 [W:onnxruntime:, graph.cc:3543 onnxruntime::Graph::CleanUnusedInitializersAndNodeArgs] Removing initializer '629'. It is not used by any node and should be removed from the model.
2023-07-02 10:32:14.0667860 [W:onnxruntime:, graph.cc:3543 onnxruntime::Graph::CleanUnusedInitializersAndNodeArgs] Removing initializer '628'. It is not used by any node and should be removed from the model.
2023-07-02 10:32:14.0703673 [W:onnxruntime:, graph.cc:3543 onnxruntime::Graph::CleanUnusedInitializersAndNodeArgs] Removing initializer '131'. It is not used by any node and should be removed from the model.
2023-07-02 10:32:14.0745995 [W:onnxruntime:, graph.cc:3543 onnxruntime::Graph::CleanUnusedInitializersAndNodeArgs] Removing initializer '134'. It is not used by any node and should be removed from the model.
2023-07-02 10:32:14.0776970 [W:onnxruntime:, graph.cc:3543 onnxruntime::Graph::CleanUnusedInitializersAndNodeArgs] Removing initializer '136'. It is not used by any node and should be removed from the model.
2023-07-02 10:32:14.0812995 [W:onnxruntime:, graph.cc:3543 onnxruntime::Graph::CleanUnusedInitializersAndNodeArgs] Removing initializer '140'. It is not used by any node and should be removed from the model.
2023-07-02 10:32:14.0841789 [W:onnxruntime:, graph.cc:3543 onnxruntime::Graph::CleanUnusedInitializersAndNodeArgs] Removing initializer '139'. It is not used by any node and should be removed from the model.
E:\Workspace\github_me\whisper-live\server.py:112: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at C:\cb\pytorch_1000000000000\work\torch\csrc\utils\tensor_numpy.cpp:205.)
  speech_prob = self.vad_model(torch.from_numpy(frame_np), self.RATE).item()

timestamp_offset vs. frame_offset

what's the difference between timestamp_offset and frame_offset? i get what the timestamp_offset is doing but not sure why the frame_offset is also necessary. thanks!

Add option to transcribe system audio

With option to record and transcribe system audio it would be possible to cover all sorts of useful use cases like:

Transcribe videos in any random video player or format
Transcribe any online meeting in any meeting tool

By transcribing system audio I mean to capture and transcribe anything that is coming out of the system speakers or headphones, similar to this example https://github.com/tez3998/loopback-capture-sample

Send from javascript to websocket server

I'm trying to do realtime transcription from the browser, I have the server running on a container. I connect to it and do the handshake. After the handshake I start recording, when I stop the audio I convert the audio to a Float32Array, so that it can be understood by the server. Right now I'm just trying to send one full length audio Float32Array, however, when I send it I don't get a response. I'm wondering if my logic is correct also if the format I'm sending the data is correct. Removed some stuff from my code for brevity.

let socket = new WebSocket('wss://');

socket.onopen = function(e) {
  socket.send(
    JSON.stringify({
      uid: v4(),
      multilingual: true,
      language: "en",
      task: "transcribe"
    })
  );
};

socket.addEventListener('message', (event) => {
  console.log('Message from server: ', event);
});

const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
mediaRecorder = new MediaRecorder(stream);

mediaRecorder.ondataavailable = (e) => {
  mediaChunks.push(e.data);
};

mediaRecorder.onstop = function() {
  const audioBlob = new Blob(mediaChunks, { 'type': 'audio/ogg; codecs=opus' });
  mediaChunks = [];

  audioBlob.arrayBuffer().then(buffer => {
    const audioContext = new (window.AudioContext || window.webkitAudioContext)();
    audioContext.decodeAudioData((buffer, decodedData) => {
      const monoChannelData = decodedData.getChannelData(0); // Get the mono channel data
      connection.send(monoChannelData.buffer);
    });
  });

help with custom use case

Hi I have a specific use case in mind for this and i'm looking for any assistance I can get modifying this software for my use case. it might be a beneficial addition to the project to have a mode which operates this way.

i have Zello (a walkie talkie app) running and playing the output through a virtual audio cable and the other end is the default input.

When i run the whisper-live client in microphone mode it gets the audio stream from zello just fine and transcription proceeds. Great!

Now my goal is to get it so that when i speak a message the transcribed text gets url encoded and passed to

mpg123 http://my.server/my.php?text=<text>

This relates to an AI text-to-speech generator which replies with streaming mp3 audio (and Zello is set to VOX transmit so the voice gets played back to me on Zello).

The idea is that both server and client would stay running 24/7, transcribing messages only when recieved, waiting for the message to be completely transcribed, then running mpg123.

Once a message has been played we can dispose of any temporary audio chunks since we don't want to repeat any old text.

A few challenges I'm facing currently...

it seems that the longer the client is running it is writing chunks of audio to the disk. the longer i have the client running the longer and longer transcription takes.

also, in the on_message function, the message keeps getting longer and longer and i dont want the full message, only new text.

hope this makes sense, if anyone can help please reply here or discord laozi101. thanks!

Twilio Voice to Text Implementation?

Hey guys. Really appreciate the project. I'm really new to Whisper and Python but have a fair amount of coding background in other languages. Wondering if you could provide any strategy ideas or an outline on the best way to approach the below.

I've got an existing websocket server implementation that accepts a websocket connection from Twilio

The websocket media messages look like this:

{ 
 "event": "media",
 "sequenceNumber": "4",
 "media": { 
   "track": "inbound", 
   "chunk": "2", 
   "timestamp": "5",
   "payload": "no+JhoaJjpzSHxAKBgYJDhtEopGKh4aIjZm7JhILBwYIDRg1qZSLh4aIjJevLBUMBwYHDBUsr5eMiIaHi5SpNRgNCAYHCxImu5mNiIaHipGiRBsOCQYGChAf0pyOiYaGiY+e/x4PCQYGCQ4cUp+QioaGiY6bxCIRCgcGCA0ZO6aSi4eGiI2YtSkUCwcGCAwXL6yVjIeGh4yVrC8XDAgGBwsUKbWYjYiGh4uSpjsZDQgGBwoRIsSbjomGhoqQn1IcDgkGBgkPHv+ej4mGhomOnNIfEAoGBgkOG0SikYqHhoiNmbsmEgsHBggNGDWplIuHhoiMl68sFQwHBgcMFSyvl4yIhoeLlKk1GA0IBgcLEia7mY2IhoeKkaJEGw4JBgYKEB/SnI6JhoaJj57/Hg8JBgYJDhxSn5CKhoaJjpvEIhEKBwYIDRk7ppKLh4aIjZi1KRQLBwYIDBcvrJWMh4aHjJWsLxcMCAYHCxQptZiNiIaHi5KmOxkNCAYHChEixJuOiYaGipCfUhwOCQYGCQ8e/56PiYaGiY6c0h8QCgYGCQ4bRKKRioeGiI2ZuyYSCwcGCA0YNamUi4eGiIyXrywVDAcGBwwVLK+XjIiGh4uUqTUYDQgGBwsSJruZjYiGh4qRokQbDgkGBgoQH9KcjomGhomPnv8eDwkGBgkOHFKfkIqGhomOm8QiEQoHBggNGTumkouHhoiNmLUpFAsHBggMFy+slYyHhoeMlawvFwwIBgcLFCm1mI2IhoeLkqY7GQ0IBgcKESLEm46JhoaKkJ9SHA4JBgYJDx7/no+JhoaJjpzSHxAKBgYJDhtEopGKh4aIjZm7JhILBwYIDRg1qZSLh4aIjJevLBUMBwYHDBUsr5eMiIaHi5SpNRgNCAYHCxImu5mNiIaHipGiRBsOCQYGChAf0pyOiYaGiY+e/x4PCQYGCQ4cUp+QioaGiY6bxCIRCgcGCA0ZO6aSi4eGiI2YtSkUCwcGCAwXL6yVjIeGh4yVrC8XDAgGBwsUKbWYjYiGh4uSpjsZDQgGBwoRIsSbjomGhoqQn1IcDgkGBgkPHv+ej4mGhomOnNIfEAoGBgkOG0SikYqHhoiNmbsmEgsHBggNGDWplIuHhoiMl68sFQwHBgcMFSyvl4yIhoeLlKk1GA0IBgcLEia7mY2IhoeKkaJEGw4JBgYKEB/SnI6JhoaJj57/Hg8JBgYJDhxSn5CKhoaJjpvEIhEKBwYIDRk7ppKLh4aIjZi1KRQLBwYIDBcvrJWMh4aHjJWsLxcMCAYHCxQptZiNiIaHi5KmOxkNCAYHChEixJuOiYaGipCfUhwOCQYGCQ8e/56PiYaGiY6c0h8QCgYGCQ4bRKKRioeGiA=="                        
 },
"streamSid": "MZ18ad3ab5a668481ce02b83e7395059f0" 
}

source: https://www.twilio.com/docs/voice/twiml/stream#websocket-messages-from-twilio

Here is my existing websocket proof of concept that accepts an incoming stream fine and I can transcribe using whisper_cpp after the stream has completed. I'm looking to get realtime transcription working though if possible.


@app.websocket("/stream")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    audio_bytes_buffer = bytearray()
    try:
        while True:
            message = await websocket.receive_text()
            packet = json.loads(message)
            if packet["event"] == "start":
                print("Streaming is starting")
            elif packet["event"] == "stop":
                print("\nStreaming has stopped")
                # global accumulated_audio, accumulated_frames
                # accumulated_audio = bytearray()  # Reset accumulated_audio
                # accumulated_frames = []  # Reset accumulated_frames
                break
            elif packet["event"] == "media":
                audio = bytes.fromhex(packet["media"]["payload"])
                audio = audioop.ulaw2lin(audio, 2)
                audio = audioop.ratecv(audio, 2, 1, 8000, 16000, None)[0]
                audio_bytes_buffer.extend(audio)

                # Append the processed audio to the audio buffer for asynchronous processing
                audio_buffer.append(audio)

        # length of audio_bytes_buffer in seconds
        length_in_seconds = len(audio_bytes_buffer) / BYTES_IN_1_MS / 1000
        logger.info(f"audio_bytes_buffer seconds: {length_in_seconds}")

        # Schedule background task for transcription
        asyncio.create_task(execute_transcription(model, audio_bytes_buffer))

        # SAVE COMPLETE AUDIO FILE
        filename = f"99_complete_audio.wav"
        length_in_seconds = len(audio_bytes_buffer) / BYTES_IN_1_MS / 1000
        print(f"Saving {filename} seconds: {length_in_seconds}")
        asyncio.create_task(execute_save_segment(audio_bytes_buffer, filename))

    except Exception as e:
        print(f"WebSocket closed unexpectedly: {e}")

What I'm wondering is what would be the best way to send the live streaming audio data to the server? Would it make sense to create a new websocket server to listen for incoming Twilio stream data and then send that to the TwilioClient somehow. Thinking of modifying the record method to handle incoming audio data instead of recording from the mic. Any feedback would be greatly appreciated.

cheers!

[W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '131'. It is not used by any node and should be removed from the model.

Hi Dear i have a issue in runing run_server how can i solve?this issue?
Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to /root/.cache/torch/hub/master.zip
2023-09-16 19:29:25.999206343 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '131'. It is not used by any node and should be removed from the model.
2023-09-16 19:29:25.999219739 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '136'. It is not used by any node and should be removed from the model.
2023-09-16 19:29:25.999221888 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '139'. It is not used by any node and should be removed from the model.
2023-09-16 19:29:25.999223751 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '140'. It is not used by any node and should be removed from the model.
2023-09-16 19:29:25.999225479 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '134'. It is not used by any node and should be removed from the model.
2023-09-16 19:29:25.999253937 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '628'. It is not used by any node and should be removed from the model.
2023-09-16 19:29:25.999255898 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '623'. It is not used by any node and should be removed from the model.
2023-09-16 19:29:25.999258033 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '629'. It is not used by any node and should be removed from the model.
2023-09-16 19:29:25.999259496 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '620'. It is not used by any node and should be removed from the model.
2023-09-16 19:29:25.999261155 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '625'. It is not used by any node and should be removed from the model.

cant change the language

i'm not sure if what i'm doing is right but adjusting the language string to "ar" didn't switch to Arabic and was still showing English.

how to make speaker diarization

Hello,

It is interesting ti have speaker diarization for off transcription. Have you already this fonctionality ?

best regards

can`t start server in windows

here is the log after this code
from whisper_live.server import TranscriptionServer
server = TranscriptionServer()
server.run("0.0.0.0", 9090)

Traceback (most recent call last):
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 110, in _get_module_details
import(pkg_name)
File "D:\Code\whisper\WhisperLive-0.0.7\main.py", line 2, in
server = TranscriptionServer()
File "D:\Code\whisper\WhisperLive-0.0.7\whisper_live\server.py", line 38, in init
self.vad_model = VoiceActivityDetection()
File "D:\Code\whisper\WhisperLive-0.0.7\whisper_live\vad.py", line 21, in init
self.session = onnxruntime.InferenceSession(path, providers=['CPUExecutionProvider'], sess_options=opts)
File "C:\Users\25813\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in init
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "C:\Users\25813\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 452, in _create_inference_session
sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from C:\Users\25813/.cache/whisper-live/silero_vad.onnx failed:C:\a_work\1\s\onnxruntime\core\graph\model.cc:134 onnxruntime::Model::Model ModelProto does not have a graph.

i have installed requirement from client.txt and server.txt,and i can not installed them from setup.sh,cause i am running server and client both in the same windwos system,as well as setup.py ,it error Perhaps my account does not have write access to this directory

and i don`t installed cuda,is this error linked with this or something else?
Lokk forward for your early reply

demo?

is there a demo of this somewhere to get a sense of the transcription latency?

reopening https://github.com/collabora/WhisperLive/issues/23

#23

can't see to reopen the issue so just bumping here

Is it possible to add transcribed text to a file in real time?

Set `initial_prompt` and `vad_parameters` from the client.

I have gone through server.py and noticed that the initial_prompt and vad_parameters are hardcoded. Is there a way to make them configurable by the client? Perhaps set them in the first message after establishing a WebSocket connection? Alternatively, could you provide some guidance so that I could submit a PR?

Output to file

I feel a little foolish here, as i am probably missing something obvious.

I get that the client has the onMessage function - but i am struggling to see where i can access this to either save to a file or to pass to another function?

Also, while i am here - do any flags need to be passed to use GPU? output feels a little sluggish, for having a 3090.

where i can change and specifie the whisper model

where i can change and specifie the whisper model like switching from base to medium or other

How to set the specified local model?

How to set the specified local model?I want to use the large model

Is `bufferSize` required to be `4096` sample frames?

I want to know if bufferSize is required to be a fixed value. Thank you. ❤️

https://github.com/collabora/WhisperLive/blob/e30286c046ca6d8c7a9eac42da59f7fd97514fc0/Audio-Transcription-Chrome/options.js#L159C39-L159C39

len(text) always 1?why?

client.py
if len(text) > 3: text = text[-3:]

Initial WebSocket Error on First WhisperLive Deployment and Model Download

The first time WhisperLive is deployed and a client request is made, the model is downloaded. However, at this time, there is always a WebSocket error. But subsequent client requests all proceed normally.

python3 run_server.py
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2.39k/2.39k [00:00<00:00, 16.6MB/s]
preprocessor_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████| 340/340 [00:00<00:00, 2.30MB/s]
vocabulary.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1.07M/1.07M [00:00<00:00, 1.65MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 2.48M/2.48M [00:00<00:00, 3.08MB/s]
model.bin: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 3.09G/3.09G [04:18<00:00, 11.9MB/s]
ERROR:websockets.server:connection handler failed
Traceback (most recent call last):████████████████████████████████████████████████████████████████████████████████| 3.09G/3.09G [04:18<00:00, 11.5MB/s]
  File "/home/ubuntu/.local/lib/python3.10/site-packages/websockets/sync/server.py", line 499, in conn_handler
    handler(connection)
  File "/home/ubuntu/WhisperLive/whisper_live/server.py", line 99, in recv_audio
    client = ServeClient(
  File "/home/ubuntu/WhisperLive/whisper_live/server.py", line 255, in __init__
    self.websocket.send(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/websockets/sync/connection.py", line 284, in send
    with self.send_context():
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/websockets/sync/connection.py", line 724, in send_context
    raise self.protocol.close_exc from original_exc
websockets.exceptions.ConnectionClosedError: no close frame received or sent

Scaling server and adding load balancers or adding clusters

Hi,

I tested for single client on Tesla T4 and the transcriptions are real-time. But what is the best way to scale server code to 50-100 concurrent users at least (more preferred but I mentioned 50-100 concurrent users because it will need 3 k8 pods with tesla t4 each)?

How to use it?

I've successfully build the docker image, and run the docker run ... command come from the documentation, then I got this log:

Downloading VAD ONNX model...
--2023-11-18 04:58:37--  https://github.com/snakers4/silero-vad/raw/master/files/silero_vad.onnx
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/snakers4/silero-vad/master/files/silero_vad.onnx [following]
--2023-11-18 04:58:38--  https://raw.githubusercontent.com/snakers4/silero-vad/master/files/silero_vad.onnx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 182.43.124.6, ::
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|182.43.124.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1807522 (1.7M) [application/octet-stream]
Saving to: ‘/root/.cache/whisper-live/silero_vad.onnx’

/root/.cache/whisper-live/sil 100%[=================================================>]   1.72M   318KB/s    in 5.8s

2023-11-18 04:58:46 (302 KB/s) - ‘/root/.cache/whisper-live/silero_vad.onnx’ saved [1807522/1807522]

After this, I tried to load the extension to the browser, after I click Start Capture, I see nothing happens from the browser.

I tried to go to localhost:9090 from the browser, got this:

Failed to open a WebSocket connection: invalid Connection header: keep-alive.

You cannot access a WebSocket server directly with a browser. You need a WebSocket client.

So, am I got the image built correctly? And how to use the client for the docker server?

Thanks.

Speaker Diarization

I want to add speaker diarization. Wondering how much I would have to change (existing code), also if you can point me to how you would approach this since I'm still getting used to the code base.

Problem trying to run server

Hello! I'm trying to run the server but I keep getting this error:

ImportError: cannot import name '_LANGUAGE_CODES' from 'faster_whisper.tokenizer' (C:\Users\Víctor Masip\AppData\Local\Programs\Python\Python310\lib\site-packages\faster_whisper\tokenizer.py)

Thanks for your work!

Transcribing a HLS stream?

I've tried transcribing an HLS stream by inputing a link to a .m3u8 file, but it does not seem to work. Is that feature supported somehow?

Text is Repeated (On Server Side)

@author I am using this repo for server client based Speech to Text for live transcription. I want to generate text on only server side, client will ony send the audio to the server, and text recognized by server will be shown in the server side.

But, when I say something one time, it prints more than one time. Can you please help me in this?

Bug? File Transcription never completes.

Using the JFK wave file here: https://github.com/ggerganov/whisper.cpp/blob/master/samples/jfk.wav

Here is my run_client.py

from whisper_live.client import TranscriptionClient
client = TranscriptionClient("0.0.0.0", "8080", is_multilingual=False, lang="en", translate=False)
client("./jfk.wav")

running the client only transcribes the following:

And so, my fellow America, ask not what your country can do
for you.

Windows support?

Hi,

I would like to know if Windows will be supported, I tried with the actual files but I am not able to make it work…

Any hint on how to make it work would be welcome.

Initial_prompt and vad_parameters not sent in first message

This issue refers to changes introduced in the recent commit 72ead71, related to the addition of vad_parameters and initial_prompt parameters. I didn't see any branch related to it so I'm raising an issue instead of a comment

Client requests seem to fail because of the lack of initial_prompt and vad_parameter keys in the first message. I reproduced this error with servers ran through either run_server.py or a Docker container.

Here are small snippets to reproduce the error:

Server

from whisper_live.server import TranscriptionServer

if __name__ == "__main__":
    server = TranscriptionServer()
    server.run("0.0.0.0", 9090)

Client

from whisper_live.client import TranscriptionClient

client = TranscriptionClient(
    "localhost",
    9090,
    is_multilingual=False,
    lang="en",
    translate=False,
    model_size="small",
)

client("tests/jfk.flac")

Error obtained

 ERROR:websockets.server:connection handler failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/websockets/sync/server.py", line 499, in conn_handler
    handler(connection)
  File "/app/whisper_live/server.py", line 104, in recv_audio
    initial_prompt=options["initial_prompt"],
KeyError: 'initial_prompt'

I guess this comes from the lack of parameter forwarding in the client.py file during the opening of the websocket, which could be solved by this:

    def on_open(self, ws):
        """
        Callback function called when the WebSocket connection is successfully opened.

        Sends an initial configuration message to the server, including client UID, multilingual mode,
        language selection, and task type.

        Args:
            ws (websocket.WebSocketApp): The WebSocket client instance.

        """
        print(self.multilingual, self.language, self.task)

        print("[INFO]: Opened connection")
        ws.send(
            json.dumps(
                {
                    "uid": self.uid,
                    "multilingual": self.multilingual,
                    "language": self.language,
                    "task": self.task,
                    "model_size": self.model_size,
                    "initial_prompt": self.initial_prompt, # added line
                    "vad_parameters": self.vad_parameters, # added line
                }
            )
        )

Client Gets Stuck

Hi everyone,

I am working on the WhisperLive to get the transcriptions on client side. It is working now, but there is an issue. The issue is that after a few minutes (1 or 2 minutes), it does not print the updated (new) transcription for the audio on the terminal. It feels like that it gets stuck.

Can anyone help me in the issue mentioned above? Thanks in advance.

Chrome Extension error?

Hello, I got this error after Start Capture with Use Multilingual Model

Chrome v118.0.5993.70

Do you have any idea on this?

Issue with Error Messages and 'WhisperModel' Attribute in Code

Could you please help explain what these error messages mean and how to resolve them? Thank you.

ERROR:root:[ERROR]: sent 1000 (OK); then received 1000 (OK)
ERROR:root:received 1001 (going away); then sent 1001 (going away)
ERROR:root:[ERROR]: 'WhisperModel' object has no attribute 'model'
ERROR:root:received 1001 (going away); then sent 1001 (going away)
ERROR:root:[ERROR]: 'WhisperModel' object has no attribute 'model'
ERROR:root:no close frame received or sent
ERROR:root:[ERROR]: 'WhisperModel' object has no attribute 'model'

but if the duration is decreased then accuracy is also degraded. any suggestions for fixing this?

Add Speaker Diarization

Can we incorporate speaker identification into the transcription results?

I found a project called whisper-diarization from Faster Whisper's Community integrations section.

Is it possible for us to integrate it?

Mac M1 Docker Issue

Seeing the following error trying to run the docker image on my Mac M1

whisperlive git:(main)✗  🚀  docker run -it --gpus all -p 9090:9090 whisper-live:latest
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container:                 
 whisperlive git:(main)✗  🚀  docker run -it -p 9090:9090 whisper-live:latest 

==========
== CUDA ==
==========

CUDA Version 11.2.2

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

It wasn't clear to me that this docker implementation required NVDIA gpus