Speaker Diarization · whisperlive (11 comments, closed)

yehiaabdelm commented on May 13, 2024
Speaker Diarization

from whisperlive.

Comments (11)

zoq commented on May 13, 2024

Yes, I'll clean up the code and share it.

yehiaabdelm commented on May 13, 2024

Gave it a shot using pyannote, but it slows down transcription until it becomes unusable because I'm passing the whole stream of audio every time. I think we need to store speaker embeddings somehow and have some sort of sliding window. Thought I'd provide an update.

diarization.py

import os

import torch
from dotenv import load_dotenv, find_dotenv
from intervaltree import IntervalTree
from pyannote.audio import Pipeline

load_dotenv(find_dotenv())
HUGGINGFACE_ACCESS_TOKEN = os.environ["HUGGINGFACE_ACCESS_TOKEN"]


class Diarization:
    def __init__(self):
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.0",
            use_auth_token=HUGGINGFACE_ACCESS_TOKEN).to(device)

    def transform_diarization_output(self, diarization):
        # itertracks() yields (segment, track) pairs by default;
        # yield_label=True is needed to also get the speaker label.
        segments = []
        for segment, _, speaker in diarization.itertracks(yield_label=True):
            segments.append({"start": segment.start,
                             "end": segment.end, "speaker": speaker})
        return segments

    def process(self, waveform, sample_rate):
        # convert samples to tensor
        audio_tensor = torch.tensor(waveform, dtype=torch.float32).unsqueeze(0)
        # run diarization model on tensor
        diarization = self.pipeline(
            {"waveform": audio_tensor, "sample_rate": sample_rate})
        # convert output to list of dicts
        diarization = self.transform_diarization_output(diarization)
        return diarization

    def join_transcript_with_diarization(self, transcript, diarization):

        diarization_tree = IntervalTree()
        # Add diarization to interval tree
        for dia in diarization:
            diarization_tree.addi(dia['start'], dia['end'], dia['speaker'])

        joined = []
        for seg in transcript:
            interval_start = seg['start']
            interval_end = seg['end']
            # Get overlapping diarization
            overlaps = diarization_tree[interval_start:interval_end]
            speakers = {overlap.data for overlap in overlaps}
            # Add to result
            joined.append({
                'start': interval_start,
                'end': interval_end,
                'speakers': list(speakers),
                'text': seg['text']
            })

        return joined
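The speaker-embedding / sliding-window idea mentioned above could look roughly like this. A minimal sketch, assuming a per-window speaker embedding is already available (e.g. from an embedding model run on each new window); `SpeakerRegistry` and `cosine_similarity` are hypothetical names, not part of WhisperLive or pyannote:

```python
import math


def cosine_similarity(a, b):
    # Plain cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


class SpeakerRegistry:
    """Assigns stable speaker ids to embeddings from successive windows."""

    def __init__(self, threshold=0.7):
        self.threshold = threshold
        self.centroids = []  # one running-mean embedding per known speaker
        self.counts = []     # number of windows merged into each centroid

    def assign(self, embedding):
        # Match against the closest known speaker above the threshold.
        best_id, best_sim = None, self.threshold
        for idx, centroid in enumerate(self.centroids):
            sim = cosine_similarity(embedding, centroid)
            if sim >= best_sim:
                best_id, best_sim = idx, sim
        if best_id is None:
            # No match: register a new speaker.
            self.centroids.append(list(embedding))
            self.counts.append(1)
            return len(self.centroids) - 1
        # Match: fold the new embedding into the running mean centroid.
        n = self.counts[best_id]
        self.centroids[best_id] = [
            (c * n + e) / (n + 1)
            for c, e in zip(self.centroids[best_id], embedding)
        ]
        self.counts[best_id] = n + 1
        return best_id
```

Each new window is compared against one centroid per known speaker rather than the whole recording, so the per-window cost stays roughly constant instead of growing with the length of the stream.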

makaveli10 commented on May 13, 2024

@yehiaabdelm I can take a look, but I'm not sure when. That said, a quick workaround would be to remove VAD.

yehiaabdelm commented on May 13, 2024

True, but I need the VAD.

makaveli10 commented on May 13, 2024

@yehiaabdelm To start with, I think it could be added similarly to how Voice Activity Detection is done, i.e. run before transcription to figure out the speaker id for each segment and send it back to the client.
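One way to picture that ordering is a per-chunk pipeline that runs VAD first, then a speaker-id step, then transcription. A minimal sketch with all models injected as callables; `process_chunk` and its parameters are hypothetical names, not the WhisperLive API:

```python
def process_chunk(samples, sample_rate, vad, speaker_id, transcribe):
    """Per-chunk pipeline sketch: VAD -> speaker id -> transcription.

    vad(samples, sample_rate) -> list of (start_s, end_s) speech spans
    speaker_id(segment_samples) -> speaker label for that span
    transcribe(segment_samples)  -> transcribed text for that span
    """
    results = []
    for start, end in vad(samples, sample_rate):
        # Slice the raw samples for this speech span.
        lo, hi = int(start * sample_rate), int(end * sample_rate)
        segment = samples[lo:hi]
        results.append({
            "start": start,
            "end": end,
            "speaker": speaker_id(segment),
            "text": transcribe(segment),
        })
    return results
```

The speaker id is attached per VAD segment before transcription, so each result sent back to the client already carries its speaker label.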

zoq commented on May 13, 2024

Adding on that, we do have some code that performs offline diarization, but I guess you need live diarization? I'm trying to think about how we could take the offline code we have and apply it in a live setting.

yehiaabdelm commented on May 13, 2024

Can you please share the offline code? I want to try to add it. @makaveli10 Yes, I think it can be done like the VAD model, but wouldn't we be processing the whole audio stream every time? It needs to segment/cluster the entire recording on every update.

makaveli10 commented on May 13, 2024

@yehiaabdelm Looks good for a start. We could process only the new segment we receive from the whisper-live transcription pipeline. But in that case we would need correct timestamps from Whisper, because right now with VAD we don't process audio frames below the speech threshold, so the timestamps are off. I think I can fix that if it would help.
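With correct timestamps, the incremental bookkeeping reduces to diarizing only the newly received chunk and shifting its chunk-relative output into absolute stream time. A minimal sketch; `shift_segments` is a hypothetical helper, not WhisperLive code:

```python
def shift_segments(segments, stream_offset):
    """Shift chunk-relative diarization segments to absolute stream time.

    segments: list of {"start", "end", "speaker"} dicts, with times
        in seconds relative to the start of the current chunk
    stream_offset: seconds of audio already consumed before this chunk
    """
    return [
        {"start": seg["start"] + stream_offset,
         "end": seg["end"] + stream_offset,
         "speaker": seg["speaker"]}
        for seg in segments
    ]
```

Shifted segments can then be joined with the transcript exactly as in the interval-tree code above, since both now share the same absolute time axis.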

yehiaabdelm commented on May 13, 2024

Yup, that would be super helpful.

makaveli10 commented on May 13, 2024

@yehiaabdelm Sorry for the late response. #64 should give you correct timestamps with VAD.

makaveli10 commented on May 13, 2024

Closing due to inactivity. Feel free to re-open the issue.
