Speaker Diarization · whisperlive (11 comments, closed)

yehiaabdelm commented on May 13, 2024
Speaker Diarization

from whisperlive.

Comments (11)

zoq commented on May 13, 2024

Yes, I'll clean up the code and share it.

yehiaabdelm commented on May 13, 2024

Gave it a shot using pyannote, but it slows down transcription until it becomes unusable because I'm passing the whole stream of audio every time. I think we need to store speaker embeddings somehow and have some sort of sliding window. Thought I'd provide an update.

diarization.py

import os

import torch
from dotenv import load_dotenv, find_dotenv
from intervaltree import IntervalTree
from pyannote.audio import Pipeline

load_dotenv(find_dotenv())
HUGGINGFACE_ACCESS_TOKEN = os.environ["HUGGINGFACE_ACCESS_TOKEN"]


class Diarization:
    def __init__(self):
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.0",
            use_auth_token=HUGGINGFACE_ACCESS_TOKEN).to(device)

    def transform_diarization_output(self, diarization):
        # itertracks() yields (segment, track) pairs by default;
        # yield_label=True is needed to also get the speaker label.
        segments = []
        for segment, _, speaker in diarization.itertracks(yield_label=True):
            segments.append({"start": segment.start,
                             "end": segment.end, "speaker": speaker})
        return segments

    def process(self, waveform, sample_rate):
        # convert samples to tensor
        audio_tensor = torch.tensor(waveform, dtype=torch.float32).unsqueeze(0)
        # run diarization model on tensor
        diarization = self.pipeline(
            {"waveform": audio_tensor, "sample_rate": sample_rate})
        # convert output to list of dicts
        diarization = self.transform_diarization_output(diarization)
        return diarization

    def join_transcript_with_diarization(self, transcript, diarization):

        diarization_tree = IntervalTree()
        # Add diarization to interval tree
        for dia in diarization:
            diarization_tree.addi(dia['start'], dia['end'], dia['speaker'])

        joined = []
        for seg in transcript:
            interval_start = seg['start']
            interval_end = seg['end']
            # Get overlapping diarization
            overlaps = diarization_tree[interval_start:interval_end]
            speakers = {overlap.data for overlap in overlaps}
            # Add to result
            joined.append({
                'start': interval_start,
                'end': interval_end,
                'speakers': list(speakers),
                'text': seg['text']
            })

        return joined
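The speaker-embedding / sliding-window idea mentioned above could look roughly like this. A minimal sketch, assuming a per-window speaker embedding is already available (e.g. from an embedding model run on each new window); `SpeakerRegistry` and `cosine_similarity` are hypothetical names, not part of WhisperLive or pyannote:

```python
import math


def cosine_similarity(a, b):
    # Plain cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


class SpeakerRegistry:
    """Assigns stable speaker ids to embeddings from successive windows."""

    def __init__(self, threshold=0.7):
        self.threshold = threshold
        self.centroids = []  # one running-mean embedding per known speaker
        self.counts = []     # number of windows merged into each centroid

    def assign(self, embedding):
        # Match against the closest known speaker above the threshold.
        best_id, best_sim = None, self.threshold
        for idx, centroid in enumerate(self.centroids):
            sim = cosine_similarity(embedding, centroid)
            if sim >= best_sim:
                best_id, best_sim = idx, sim
        if best_id is None:
            # No match: register a new speaker.
            self.centroids.append(list(embedding))
            self.counts.append(1)
            return len(self.centroids) - 1
        # Match: fold the new embedding into the running mean centroid.
        n = self.counts[best_id]
        self.centroids[best_id] = [
            (c * n + e) / (n + 1)
            for c, e in zip(self.centroids[best_id], embedding)
        ]
        self.counts[best_id] = n + 1
        return best_id
```

Each new window is compared against one centroid per known speaker rather than the whole recording, so the per-window cost stays roughly constant instead of growing with the length of the stream.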

makaveli10 commented on May 13, 2024

@yehiaabdelm I can take a look, but I'm not sure when. That said, a quick workaround would be to remove VAD.

yehiaabdelm commented on May 13, 2024

True, but I need the VAD.

makaveli10 commented on May 13, 2024

@yehiaabdelm To start with, I think it could be added similarly to how Voice Activity Detection is done, i.e. run before transcription to figure out the speaker id for each segment and send it back to the client.
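One way to picture that ordering is a per-chunk pipeline that runs VAD first, then a speaker-id step, then transcription. A minimal sketch with all models injected as callables; `process_chunk` and its parameters are hypothetical names, not the WhisperLive API:

```python
def process_chunk(samples, sample_rate, vad, speaker_id, transcribe):
    """Per-chunk pipeline sketch: VAD -> speaker id -> transcription.

    vad(samples, sample_rate) -> list of (start_s, end_s) speech spans
    speaker_id(segment_samples) -> speaker label for that span
    transcribe(segment_samples)  -> transcribed text for that span
    """
    results = []
    for start, end in vad(samples, sample_rate):
        # Slice the raw samples for this speech span.
        lo, hi = int(start * sample_rate), int(end * sample_rate)
        segment = samples[lo:hi]
        results.append({
            "start": start,
            "end": end,
            "speaker": speaker_id(segment),
            "text": transcribe(segment),
        })
    return results
```

The speaker id is attached per VAD segment before transcription, so each result sent back to the client already carries its speaker label.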

zoq commented on May 13, 2024

Adding on that, we do have some code that performs offline diarization, but I guess you need live diarization? I'm trying to think about how we could take the offline code we have and apply it in a live setting.

yehiaabdelm commented on May 13, 2024

Can you please share the offline code? I want to try to add it. @makaveli10 Yes, I think it can be done like the VAD model, but wouldn't we be processing the whole audio stream every time? It needs to segment/cluster the entire recording on every update.

makaveli10 commented on May 13, 2024

@yehiaabdelm Looks good for a start. We could process only the new segment we receive from the whisper-live transcription pipeline. But in that case we would need correct timestamps from Whisper, because right now with VAD we don't process audio frames below the speech threshold, so the timestamps are off. I think I can fix that if it would help.
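With correct timestamps, the incremental bookkeeping reduces to diarizing only the newly received chunk and shifting its chunk-relative output into absolute stream time. A minimal sketch; `shift_segments` is a hypothetical helper, not WhisperLive code:

```python
def shift_segments(segments, stream_offset):
    """Shift chunk-relative diarization segments to absolute stream time.

    segments: list of {"start", "end", "speaker"} dicts, with times
        in seconds relative to the start of the current chunk
    stream_offset: seconds of audio already consumed before this chunk
    """
    return [
        {"start": seg["start"] + stream_offset,
         "end": seg["end"] + stream_offset,
         "speaker": seg["speaker"]}
        for seg in segments
    ]
```

Shifted segments can then be joined with the transcript exactly as in the interval-tree code above, since both now share the same absolute time axis.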

yehiaabdelm commented on May 13, 2024

Yup, that would be super helpful.

makaveli10 commented on May 13, 2024

@yehiaabdelm Sorry for the late response. #64 should give you correct timestamps with VAD.

makaveli10 commented on May 13, 2024

Closing due to inactivity. Feel free to re-open the issue.
