Comments (11)
Yes, I'll clean up the code and share it.
from whisperlive.
Gave it a shot using pyannote, but it slows transcription down until it becomes unusable, because I'm passing the whole audio stream every time. I think we need to store speaker embeddings somehow and use some sort of sliding window. Thought I'd provide an update.
diarization.py

```python
from pyannote.audio import Pipeline
import torch
from intervaltree import IntervalTree
from dotenv import load_dotenv, find_dotenv
import os

load_dotenv(find_dotenv())
HUGGINGFACE_ACCESS_TOKEN = os.environ["HUGGINGFACE_ACCESS_TOKEN"]


class Diarization:
    def __init__(self):
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.0",
            use_auth_token=HUGGINGFACE_ACCESS_TOKEN).to(device)

    def transform_diarization_output(self, diarization):
        # Flatten pyannote's Annotation into a list of dicts.
        # itertracks yields (segment, track) by default; we need
        # yield_label=True to also get the speaker label.
        segments = []
        for segment, _, speaker in diarization.itertracks(yield_label=True):
            segments.append({"start": segment.start,
                             "end": segment.end,
                             "speaker": speaker})
        return segments

    def process(self, waveform, sample_rate):
        # convert samples to a (channel, time) tensor
        audio_tensor = torch.tensor(waveform, dtype=torch.float32).unsqueeze(0)
        # run the diarization pipeline on the tensor
        diarization = self.pipeline(
            {"waveform": audio_tensor, "sample_rate": sample_rate})
        # convert output to a list of dicts
        return self.transform_diarization_output(diarization)

    def join_transcript_with_diarization(self, transcript, diarization):
        # Index diarization segments in an interval tree for fast overlap lookup
        diarization_tree = IntervalTree()
        for dia in diarization:
            diarization_tree.addi(dia['start'], dia['end'], dia['speaker'])

        joined = []
        for seg in transcript:
            # Collect every speaker whose segment overlaps this transcript segment
            overlaps = diarization_tree[seg['start']:seg['end']]
            speakers = {overlap.data for overlap in overlaps}
            joined.append({
                'start': seg['start'],
                'end': seg['end'],
                'speakers': list(speakers),
                'text': seg['text']
            })
        return joined
```
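To see what `join_transcript_with_diarization` produces on toy data without pulling in pyannote or intervaltree, here is a dependency-free sketch with the same overlap semantics (standalone function and sample data are illustrative, not part of the original code):

```python
def join_transcript_with_diarization(transcript, diarization):
    # For each transcript segment, collect every speaker whose diarization
    # segment overlaps it (strict-inequality overlap, like intervaltree slices).
    joined = []
    for seg in transcript:
        speakers = {d["speaker"] for d in diarization
                    if d["start"] < seg["end"] and seg["start"] < d["end"]}
        joined.append({"start": seg["start"], "end": seg["end"],
                       "speakers": sorted(speakers), "text": seg["text"]})
    return joined


transcript = [{"start": 0.0, "end": 2.0, "text": "hello"},
              {"start": 2.0, "end": 4.0, "text": "hi there"}]
diarization = [{"start": 0.0, "end": 2.1, "speaker": "SPEAKER_00"},
               {"start": 2.1, "end": 4.0, "speaker": "SPEAKER_01"}]

for row in join_transcript_with_diarization(transcript, diarization):
    print(row)
```

Note that the second transcript segment picks up both speakers, because SPEAKER_00's diarization segment runs 0.1 s past the transcript boundary; that is exactly the case the `speakers` set (rather than a single speaker field) is there for.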
@yehiaabdelm I can take a look, but I'm not sure when. A quick workaround would be to remove VAD.
True, but I need the VAD.
@yehiaabdelm To start with, I think it could be added similarly to how Voice Activity Detection is done, i.e. before transcription, to figure out the speaker id for each segment and send it back to the client.
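The pipeline order being suggested can be sketched as a single per-segment handler; `handle_audio_segment` and the callable parameters are hypothetical names, standing in for the real VAD, diarization, and whisper components:

```python
def handle_audio_segment(segment, vad, diarizer, transcriber):
    # Mirror the suggestion: run VAD first, then look up the speaker id,
    # then transcribe, and send both back to the client together.
    if not vad(segment):
        return None                      # drop non-speech frames entirely
    speaker = diarizer(segment)          # speaker id for this segment
    text = transcriber(segment)
    return {"speaker": speaker, "text": text}


# Stub components stand in for the real models.
result = handle_audio_segment(
    b"...",
    vad=lambda s: True,
    diarizer=lambda s: "SPEAKER_00",
    transcriber=lambda s: "hello world")
print(result)   # {'speaker': 'SPEAKER_00', 'text': 'hello world'}
```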
Adding on to that, we do have some code that performs offline diarization. I guess you need live diarization? I'm trying to think about how we could take the offline code we have and apply it in a live setting.
Can you please share the offline code? I want to try to add it. @makaveli10 Yes, I think it can be done like the VAD model, but wouldn't we be processing the whole audio stream every time? It needs to re-segment/re-cluster the recording on each pass.
@yehiaabdelm Looks good for a start; we can process only the new segment we receive from the whisper-live transcription pipeline. But in that case, we would need correct timestamps from whisper, because right now with VAD we don't process audio frames below the speech threshold, so the timestamps are off. I think I can fix that if it would help.
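Processing only the new segment implies keeping some per-speaker state around, which is the "store speaker embeddings" idea from earlier in the thread. One way this could look, as a rough sketch: keep a running centroid embedding per speaker and match each new segment by cosine similarity. `OnlineSpeakerRegistry`, the 0.7 threshold, and the plain-list embeddings are all illustrative assumptions; a real version would use embeddings from a speaker model such as pyannote's.

```python
import math


class OnlineSpeakerRegistry:
    """Assign each new segment embedding to a speaker without re-clustering
    the whole stream: match against running per-speaker centroids, and
    enroll a new speaker when nothing matches closely enough."""

    def __init__(self, threshold=0.7):
        self.threshold = threshold
        self.centroids = []   # one centroid vector per speaker
        self.counts = []      # segments seen per speaker

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    def assign(self, embedding):
        # Find the best-matching existing speaker, if any.
        best_id, best_sim = None, -1.0
        for i, centroid in enumerate(self.centroids):
            sim = self._cosine(embedding, centroid)
            if sim > best_sim:
                best_id, best_sim = i, sim
        if best_id is None or best_sim < self.threshold:
            # No close match: enroll a new speaker.
            self.centroids.append(list(embedding))
            self.counts.append(1)
            return f"SPEAKER_{len(self.centroids) - 1:02d}"
        # Close match: fold the new embedding into the running centroid.
        n = self.counts[best_id]
        self.centroids[best_id] = [
            (c * n + e) / (n + 1)
            for c, e in zip(self.centroids[best_id], embedding)]
        self.counts[best_id] = n + 1
        return f"SPEAKER_{best_id:02d}"


registry = OnlineSpeakerRegistry(threshold=0.7)
print(registry.assign([1.0, 0.0]))   # first voice enrolls as SPEAKER_00
print(registry.assign([0.9, 0.1]))   # similar embedding -> SPEAKER_00
print(registry.assign([0.0, 1.0]))   # dissimilar -> new SPEAKER_01
```

This trades the accuracy of full re-clustering for constant per-segment cost, which is the point of the sliding-window discussion above.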
Yup, that would be super helpful.
@yehiaabdelm Sorry for the late response. #64 should give you correct timestamps with VAD.
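For intuition on why the timestamps were off and what a fix has to do: when VAD drops non-speech frames, whisper's timestamps refer to the concatenated speech-only audio, so they must be mapped back to the original stream's clock using the kept speech spans. The sketch below is illustrative only and is not the actual change in #64:

```python
def restore_timestamps(segments, speech_chunks):
    """Map timestamps measured on VAD-filtered (speech-only) audio back
    to the original stream's clock.

    segments: [{"start": s, "end": e, ...}] in filtered-audio seconds
    speech_chunks: [(orig_start, orig_end)] kept speech spans in original
                   stream seconds, in order
    """
    def to_original(t):
        consumed = 0.0  # filtered-audio seconds accounted for so far
        for orig_start, orig_end in speech_chunks:
            dur = orig_end - orig_start
            if t <= consumed + dur:
                # t falls inside this kept span: offset into it.
                return orig_start + (t - consumed)
            consumed += dur
        # Past the last kept span: clamp to its end.
        return speech_chunks[-1][1]

    return [{**seg,
             "start": to_original(seg["start"]),
             "end": to_original(seg["end"])}
            for seg in segments]


# Speech kept at 0-2 s and 5-7 s; the 3 s gap was dropped by VAD.
fixed = restore_timestamps([{"start": 1.0, "end": 3.0, "text": "hi"}],
                           [(0.0, 2.0), (5.0, 7.0)])
print(fixed)   # [{'start': 1.0, 'end': 6.0, 'text': 'hi'}]
```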
Closing due to inactivity. Feel free to re-open the issue.
Related Issues (20)
- How real-time transcription work?
- Parallel request to the server HOT 1
- Code error in client.py HOT 1
- Live Browser Plugin Transcription HOT 4
- replace the model on Hugging Face HOT 4
- Server complains about float16 compute type. HOT 2
- Issue with live transcription HOT 2
- Integrating with Hugging Face pretrained models HOT 1
- What is the scipy version? HOT 1
- TranscriptionTeeClient __call__ ---> class methods HOT 1
- GPU mode consumes half of available CPUs HOT 1
- Running the server gives no response; after starting, the client only outputs: [INFO]: * recording HOT 1
- Incomplete Transcriptions HOT 5
- Whisper-live taking same time on CPU and GPU to transcribe an audio HOT 5
- HLS Transcription Stops After A Couple Minutes
- Tensorrt backend with base.en not working HOT 1
- Tensor backend core dumped HOT 2
- 'Backend' error in command prompt
- Delay in transcription
- Word-level timestamps HOT 1