
Comments (15)

thewh1teagle avatar thewh1teagle commented on September 19, 2024

Thanks a lot for the help!
I created pyannote-rs and it works well. Now it can be used in wasm more easily.


pengzhendong avatar pengzhendong commented on September 19, 2024

I don't know what form of speaker labels you want. You could start here:

def __call__(self, x, step=5.0, return_chunk=False):


thewh1teagle avatar thewh1teagle commented on September 19, 2024

I don't know what form of speaker labels you want. You could start here:

Thanks. Eventually I reimplemented it with ONNX Runtime in https://github.com/thewh1teagle/sherpa-rs using Silero VAD.
However, with both Silero VAD and your implementation in this repo, the segmentation results aren't as good as pyannote.audio's.
I compared using the following script:

main.py
from pyannote.audio import Pipeline # pip install pyannote.audio
from pathlib import Path 
import os

# on macOS: 
# pip install numpy==1.26.4

CUR_DIR = Path(__file__).parent
DST = CUR_DIR / "../samples/sam_altman.wav"

# 1. Accept conditions in:
# https://hf.co/pyannote/segmentation-3.0
# https://hf.co/pyannote/speaker-diarization-3.1

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    # 2. Get token from https://huggingface.co/settings/tokens
    # export HF_TOKEN="<token>"
    use_auth_token=os.getenv('HF_TOKEN')
)

# send pipeline to GPU (when available)
import torch
device = "mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"
print("Using device", device)
pipeline.to(torch.device(device))

# apply pretrained pipeline
diarization = pipeline(str(DST))

# print the result
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")

I'm wondering what needs to be changed to make that better.
I compared with the following wav file:

elon.mp4

Result from pyannote-onnx:

Pyannote processing: 100%|██████████| 6/6 [00:00<00:00, 20.88frames/s] | 100.00%
[{'start': 0.032, 'end': 21.624}, {'start': 21.919, 'end': 31.631}, {'start': 32.145, 'end': 34.853}]
Pyannote processing: 100%|██████████| 6/6 [00:00<00:00, 21.45frames/s] | 100.00%
2

pyannote.audio:

python3 scripts/diarize.py
torchvision is not available - cannot save figures
Using device mps
start=0.0s stop=5.2s speaker_SPEAKER_01
start=5.4s stop=17.9s speaker_SPEAKER_01
start=17.9s stop=21.4s speaker_SPEAKER_00
start=21.9s stop=24.3s speaker_SPEAKER_00
start=24.5s stop=31.5s speaker_SPEAKER_00
start=32.2s stop=34.8s speaker_SPEAKER_00

You can see that pyannote.audio's segment start and stop times are much more accurate.


pengzhendong avatar pengzhendong commented on September 19, 2024

Please try it again: c6a2460

{'speaker': 1, 'start': 0.062, 'stop': 5.074}
{'speaker': 1, 'start': 5.513, 'stop': 17.882}
{'speaker': 2, 'start': 17.966, 'stop': 21.409}
{'speaker': 2, 'start': 21.949, 'stop': 24.278}
{'speaker': 2, 'start': 24.514, 'stop': 31.534}
{'speaker': 2, 'start': 32.259, 'stop': 34.841}


thewh1teagle avatar thewh1teagle commented on September 19, 2024

Please try it again: c6a2460

Works almost perfectly!

I would like to create a program that diarizes wav files, just like the pyannote pipeline I posted here. Do you know what needs to be added to do diarization for long wav files? I saw in the segmentation model card that it doesn't work well beyond 10 seconds. Is VAD detection all we need to fully diarize a wav file, including speaker IDs, or are an additional embedding model or similar steps required? Thanks

The code I used to test the latest update:

main.py
# wget https://github.com/thewh1teagle/sherpa-rs/releases/download/v0.1.0/motivation.wav
# wget https://github.com/thewh1teagle/sherpa-rs/releases/download/v0.1.0/sam_altman.wav
# pip install git+https://github.com/pengzhendong/pyannote-onnx click librosa matplotlib numpy tqdm
# python main.py sam_altman.wav
# python main.py motivation.wav
import click
import librosa
import matplotlib.pyplot as plt
import numpy as np

from pyannote_onnx import PyannoteONNX


@click.command()
@click.argument("wav_path", type=click.Path(exists=True, file_okay=True))
@click.option("--plot/--no-plot", default=False, help="Plot the vad probabilities")
def main(wav_path: str, plot: bool):
    pyannote = PyannoteONNX()
    for turn in pyannote.itertracks(wav_path):
        print(turn)

    if plot:
        pyannote = PyannoteONNX(show_progress=True)
        wav, sr = librosa.load(wav_path, sr=pyannote.sample_rate)
        outputs = list(pyannote(wav))
        x1 = np.arange(0, len(wav)) / sr
        x2 = [(i * 270 + 721) / sr for i in range(0, len(outputs))]
        plt.plot(x1, wav)
        plt.plot(x2, outputs)
        plt.show()


if __name__ == "__main__":
    main()


pengzhendong avatar pengzhendong commented on September 19, 2024

  1. Use a sliding window to deal with the long wav file.
  2. Then use the permutation to find the best order (however, it's not right all the time).
  3. Use an additional embedding model and k-means to get the right speaker IDs.
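
For example, steps 1 and 3 could look roughly like the sketch below. Note that segment_window() and embed() are hypothetical placeholders for the segmentation model and a speaker-embedding model, so treat this as a sketch rather than working code:

import numpy as np
from sklearn.cluster import KMeans  # pip install scikit-learn

def diarize_long_wav(samples, sample_rate, num_speakers=2, window_sec=10):
    window_size = sample_rate * window_sec
    segments, embeddings = [], []
    # 1. Slide a fixed-size window over the long wav file.
    for start in range(0, len(samples), window_size):
        window = samples[start:start + window_size]
        # Hypothetical: yields (seg_start, seg_end) sample offsets inside this window.
        for seg_start, seg_end in segment_window(window):
            segments.append((start + seg_start, start + seg_end))
            # Hypothetical: returns a fixed-size speaker embedding for the segment.
            embeddings.append(embed(samples[start + seg_start:start + seg_end]))
    # 3. Cluster the embeddings so speaker IDs stay consistent across windows.
    labels = KMeans(n_clusters=num_speakers, n_init=10).fit_predict(np.stack(embeddings))
    return [
        {"speaker": int(label), "start": s / sample_rate, "stop": e / sample_rate}
        for (s, e), label in zip(segments, labels)
    ]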


thewh1teagle avatar thewh1teagle commented on September 19, 2024

2. Then use the permutation to find the best order (however, it's not right all the time).

Is the order necessary if I only need VAD (speech detection)?

I'm trying to implement a simple VAD without speaker labels based on your code, but I'm not sure how to read the ONNX output correctly.
By the way, I didn't understand why the speaker labels are useful if the model only supports 10s and I would need an additional model with k-means anyway.

main.py
# python3 -m venv venv 
# source venv/bin/activate
# pip3 install onnxruntime numpy librosa
# wget https://github.com/pengzhendong/pyannote-onnx/blob/master/pyannote_onnx/segmentation-3.0.onnx
# wget https://github.com/thewh1teagle/sherpa-rs/releases/download/v0.1.0/motivation.wav -Otest.wav
# python3 main.py

import onnxruntime as ort
import librosa
import numpy as np

def init_session(model_path):
    opts = ort.SessionOptions()
    opts.inter_op_num_threads = 1
    opts.intra_op_num_threads = 1
    opts.log_severity_level = 3
    sess = ort.InferenceSession(model_path, sess_options=opts)
    return sess

def read_wav(path: str):
    return librosa.load(path, sr=16000)

if __name__ == '__main__':
    samples, sample_rate = read_wav('test.wav')
    window_size = sample_rate * 10 # 10s
    session = init_session('segmentation-3.0.onnx')
    speech_classes = {1, 2, 3, 4, 5, 6}
    for start in range(0, len(samples), window_size):
        end = start + window_size
        window = samples[start:end]
        ort_outs = session.run(None, {'input': window[None, None, :]})[0][0]


pengzhendong avatar pengzhendong commented on September 19, 2024

I will give you an example tomorrow.


thewh1teagle avatar thewh1teagle commented on September 19, 2024

I will give you an example tomorrow.

Thanks. I think it works now, and it's pretty accurate.

main.py
# python3 -m venv venv 
# source venv/bin/activate
# pip3 install onnxruntime numpy librosa
# wget https://github.com/pengzhendong/pyannote-onnx/blob/master/pyannote_onnx/segmentation-3.0.onnx
# wget https://github.com/thewh1teagle/sherpa-rs/releases/download/v0.1.0/motivation.wav -Otest.wav
# python3 main.py

import onnxruntime as ort
import librosa
import numpy as np

def init_session(model_path):
    opts = ort.SessionOptions()
    opts.inter_op_num_threads = 1
    opts.intra_op_num_threads = 1
    opts.log_severity_level = 3
    sess = ort.InferenceSession(model_path, sess_options=opts)
    return sess

def read_wav(path: str):
    return librosa.load(path, sr=16000)

if __name__ == '__main__':
    session = init_session('segmentation-3.0.onnx')
    samples, sample_rate = read_wav('test.wav')
    
    # Conv1d & MaxPool1d & SincNet:
    # https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html
    # https://pytorch.org/docs/stable/generated/torch.nn.MaxPool1d.html
    # https://github.com/pyannote/pyannote-audio/blob/develop/pyannote/audio/models/blocks/sincnet.py#L50-L71
    frame_size = 270
    frame_start = 721
    window_size = sample_rate * 10 # 10s

    # State and offset
    is_speeching = False
    offset = frame_start
    start_offset = 0

    # Pad end with silence for full last segment
    samples = np.pad(samples, (0, window_size), 'constant') 

    for start in range(0, len(samples), window_size):
        window = samples[start:start + window_size]
        ort_outs: np.ndarray = session.run(None, {'input': window[None, None, :]})[0][0]
        for probs in ort_outs:
            predicted_id = np.argmax(probs)
            if predicted_id != 0:
                if not is_speeching:
                    start_offset = offset
                    is_speeching = True
            elif is_speeching:
                seg_start = round(start_offset / sample_rate, 3)
                seg_end = round(offset / sample_rate, 3)
                print(f'{seg_start}s - {seg_end}s')
                is_speeching = False
            offset += frame_size

I'm looking for a simple ONNX speaker embedding model (maybe WeSpeaker) so I'll have accurate speaker diarization for each segment.
After that I'll try to implement it in Rust.


pengzhendong avatar pengzhendong commented on September 19, 2024

The classes (without softmax) of the models are:
no speech, spk1, spk2, spk3, spk1 & spk2, spk1 & spk3, spk2 & spk3
So, why did you use np.argmax? Just set the speech_prob to 1 - ort_outs[0], and then compare it with your threshold.
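
Concretely, something like this (a sketch; the 0.5 default threshold is only a starting point to tune):

import numpy as np

def frame_is_speech(probs: np.ndarray, threshold: float = 0.5) -> bool:
    # probs are the 7 raw per-frame outputs:
    # [no speech, spk1, spk2, spk3, spk1 & spk2, spk1 & spk3, spk2 & spk3]
    speech_prob = 1.0 - probs[0]  # everything that is not "no speech"
    return speech_prob > threshold

In the inner loop of your script, that means replacing the check predicted_id != 0 with frame_is_speech(probs).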


thewh1teagle avatar thewh1teagle commented on September 19, 2024

So, why did you use np.argmax?

I understood that the model output is a series of 7 probabilities per frame, and the index corresponds to the class type (e.g. index 0 is no speech).
So I used np.argmax to take the index with the highest probability.
Maybe I'm completely wrong, but it works somehow.

Just set the speech_prob to 1 - ort_outs[0]

I didn't understand: where would I use speech_prob?


Why do I need an additional embedding model if the pyannote model already returns speaker labels?

Also, I implemented it in Rust! The segmentation works pretty well.
However, the speaker embedding doesn't work well. Maybe you can take a look at how the embedding is computed?

https://github.com/thewh1teagle/pyannote-rs/blob/main/src/embedding.rs#L19-L37


pengzhendong avatar pengzhendong commented on September 19, 2024

Maybe replacing np.argmax(probs) == 0 with probs[0] > 0.5 is better, because the probs are not normalized.

Why do I need an additional embedding model if the pyannote model already returns speaker labels?

Because the speaker index may change between windows (the model has no context about the speakers):

  1. There are two speakers Alice and Bob.
  2. Alice speaks in the first window. The model labeled Alice as spk1.
  3. Bob speaks in the second window. The model may label Bob as spk1, too.
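
One simple way to keep the IDs consistent is to extract an embedding for each segment and match it against the speakers seen so far, e.g. by cosine similarity. This is only a sketch (not the pyannote.audio pipeline): the embeddings come from whatever model you choose, and the 0.5 threshold is an arbitrary starting point:

import numpy as np

def assign_speaker(embedding, known_speakers, threshold=0.5):
    # Compare the segment embedding against one reference embedding per known
    # speaker; reuse the best match above the threshold, otherwise register a
    # new speaker. Returns the speaker index.
    best_idx, best_sim = None, threshold
    for idx, ref in enumerate(known_speakers):
        sim = np.dot(embedding, ref) / (np.linalg.norm(embedding) * np.linalg.norm(ref))
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    if best_idx is not None:
        return best_idx
    known_speakers.append(embedding)
    return len(known_speakers) - 1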

Sorry, I'm not familiar with Rust.


thewh1teagle avatar thewh1teagle commented on September 19, 2024

Because the speaker index may change between windows (the model has no context about the speakers):

Oh, now I understand; it's a bit confusing.
Do you think embedding models work better / are more accurate? From what I've tested they are not, at least not as accurate as pyannote's speaker labels; even your repo with reordering is pretty accurate.
Do you have an idea how to get good speaker labels? I tried the nemo / speaker3d models but they're not accurate at all.


pengzhendong avatar pengzhendong commented on September 19, 2024

Did you try any clustering algorithms? How did you try nemo or speaker3d models?


thewh1teagle avatar thewh1teagle commented on September 19, 2024

Did you try any clustering algorithms? How did you try nemo or speaker3d models?

I didn't try clustering algorithms.
I tried nemo with onnxruntime like so:
https://github.com/thewh1teagle/ort-diarize/blob/main/extract.py#L41

And compared distances with:
https://github.com/thewh1teagle/ort-diarize/blob/main/identify.py#L9

Actually it's not that bad in Python. Maybe I made a mistake in the fbank computation in Rust.

