Comments (15)
Thanks a lot for the help!
I created pyannote-rs
And it works well. Now it can be used in wasm more easily.
from pyannote-onnx.
I don't know what form of speaker labels you want. You could start here:
from pyannote-onnx.
I don't know what form of speaker labels you want. You could start here:
Thanks. Eventually I reimplemented it with ONNX Runtime and silero vad in https://github.com/thewh1teagle/sherpa-rs.
However, with both silero vad and your implementation in this repo, the segmentation result isn't as good as pyannote.audio.
I compared against the following script:
main.py
from pyannote.audio import Pipeline  # pip install pyannote.audio
from pathlib import Path
import os
# on macOS:
# pip install numpy==1.26.4
CUR_DIR = Path(__file__).parent
DST = CUR_DIR / "../samples/sam_altman.wav"
# 1. Accept conditions in:
# https://hf.co/pyannote/segmentation-3.0
# https://hf.co/pyannote/speaker-diarization-3.1
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    # 2. Get token from https://huggingface.co/settings/tokens
    # export HF_TOKEN="<token>"
    use_auth_token=os.getenv('HF_TOKEN')
)
# send pipeline to GPU (when available)
import torch
device = "mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"
print("Using device", device)
pipeline.to(torch.device(device))
# apply pretrained pipeline
diarization = pipeline(str(DST))
# print the result
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
I'm wondering what needs to change to make it better.
I compared using the following file:
elon.mp4
Result from pyannote-onnx:
Pyannote processing: 100%|██████████| 6/6 [00:00<00:00, 20.88frames/s] | 100.00%
[{'start': 0.032, 'end': 21.624}, {'start': 21.919, 'end': 31.631}, {'start': 32.145, 'end': 34.853}]
Pyannote processing: 100%|██████████| 6/6 [00:00<00:00, 21.45frames/s] | 100.00%
2
Result from pyannote.audio:
python3 scripts/diarize.py
torchvision is not available - cannot save figures
Using device mps
start=0.0s stop=5.2s speaker_SPEAKER_01
start=5.4s stop=17.9s speaker_SPEAKER_01
start=17.9s stop=21.4s speaker_SPEAKER_00
start=21.9s stop=24.3s speaker_SPEAKER_00
start=24.5s stop=31.5s speaker_SPEAKER_00
start=32.2s stop=34.8s speaker_SPEAKER_00
You can see that the pyannote.audio segment start and stop times are much more accurate.
from pyannote-onnx.
Please try it again: c6a2460
{'speaker': 1, 'start': 0.062, 'stop': 5.074}
{'speaker': 1, 'start': 5.513, 'stop': 17.882}
{'speaker': 2, 'start': 17.966, 'stop': 21.409}
{'speaker': 2, 'start': 21.949, 'stop': 24.278}
{'speaker': 2, 'start': 24.514, 'stop': 31.534}
{'speaker': 2, 'start': 32.259, 'stop': 34.841}
from pyannote-onnx.
Please try it again: c6a2460
Works almost perfectly!
I would like to create a program that diarizes wav files just like the pyannote pipeline I posted here. Do you know what needs to be added to diarize long wav files? I saw in the segmentation model card that it doesn't work well for more than 10 seconds. Is VAD all we need to fully diarize a wav file, including speaker IDs, or are an additional embedding model or similar steps required? Thanks.
The code I used to test the latest update:
main.py
# wget https://github.com/thewh1teagle/sherpa-rs/releases/download/v0.1.0/motivation.wav
# wget https://github.com/thewh1teagle/sherpa-rs/releases/download/v0.1.0/sam_altman.wav
# pip install git+https://github.com/pengzhendong/pyannote-onnx click librosa matplotlib numpy tqdm
# python main.py sam_altman.wav
# python main.py motivation.wav
import click
import librosa
import matplotlib.pyplot as plt
import numpy as np
from pyannote_onnx import PyannoteONNX

@click.command()
@click.argument("wav_path", type=click.Path(exists=True, file_okay=True))
@click.option("--plot/--no-plot", default=False, help="Plot the vad probabilities")
def main(wav_path: str, plot: bool):
    pyannote = PyannoteONNX()
    for turn in pyannote.itertracks(wav_path):
        print(turn)
    if plot:
        pyannote = PyannoteONNX(show_progress=True)
        wav, sr = librosa.load(wav_path, sr=pyannote.sample_rate)
        outputs = list(pyannote(wav))
        x1 = np.arange(0, len(wav)) / sr
        x2 = [(i * 270 + 721) / sr for i in range(0, len(outputs))]
        plt.plot(x1, wav)
        plt.plot(x2, outputs)
        plt.show()

if __name__ == "__main__":
    main()
from pyannote-onnx.
1. Use a sliding window to deal with the long wav file.
2. Then use a permutation to find the best speaker order between windows (however, it's not right all the time).
3. Use an additional embedding model and k-means to get the right speaker IDs.
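A minimal sketch of the permutation step described above, assuming consecutive windows overlap and that each window yields a (frames, speakers) activity matrix. The function name `best_permutation` and the absolute-difference cost are illustrative choices, not taken from pyannote-onnx:

```python
from itertools import permutations

import numpy as np


def best_permutation(prev_overlap: np.ndarray, curr_overlap: np.ndarray) -> tuple:
    """Return the column order of curr_overlap that best matches prev_overlap.

    Both arrays have shape (frames, speakers) and hold per-frame speaker
    activity for the region where two consecutive windows overlap.
    """
    num_speakers = prev_overlap.shape[1]
    best_order, best_cost = None, float("inf")
    # With 3 local speakers there are only 6 orders, so brute force is fine.
    for order in permutations(range(num_speakers)):
        cost = np.abs(prev_overlap - curr_overlap[:, list(order)]).sum()
        if cost < best_cost:
            best_order, best_cost = order, cost
    return best_order


# Toy example: the current window sees the same activity with shuffled columns.
prev = np.array([[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
curr = prev[:, [2, 0, 1]]
order = best_permutation(prev, curr)
aligned = curr[:, list(order)]  # columns reordered to match the previous window
```

This only keeps labels consistent between neighbouring windows. If a speaker is silent throughout the overlap, the match can be arbitrary, which is one reason the result is "not right all the time".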
from pyannote-onnx.
2. then use the permutation to find the best order(however, it's not right all the time).
Is the order necessary if I only need VAD (detecting speech)?
I'm trying to implement a simple VAD without speaker labels based on your code, but I'm not sure how to read the ONNX output correctly.
By the way, I don't understand why the speaker labels are useful if the model only supports 10 s and I will need an additional model with k-means anyway.
main.py
# python3 -m venv venv
# source venv/bin/activate
# pip3 install onnxruntime numpy librosa
# wget https://github.com/pengzhendong/pyannote-onnx/raw/master/pyannote_onnx/segmentation-3.0.onnx
# wget https://github.com/thewh1teagle/sherpa-rs/releases/download/v0.1.0/motivation.wav -O test.wav
# python3 main.py
import onnxruntime as ort
import librosa
import numpy as np

def init_session(model_path):
    opts = ort.SessionOptions()
    opts.inter_op_num_threads = 1
    opts.intra_op_num_threads = 1
    opts.log_severity_level = 3
    sess = ort.InferenceSession(model_path, sess_options=opts)
    return sess

def read_wav(path: str):
    return librosa.load(path, sr=16000)

if __name__ == '__main__':
    samples, sample_rate = read_wav('test.wav')
    window_size = sample_rate * 10 # 10s
    session = init_session('segmentation-3.0.onnx')
    speech_classes = {1, 2, 3, 4, 5, 6}
    for start in range(0, len(samples), window_size):
        end = start + window_size
        window = samples[start:end]
        ort_outs = session.run(None, {'input': window[None, None, :]})[0][0]
from pyannote-onnx.
I will give you an example tomorrow.
from pyannote-onnx.
I will give you an example tomorrow.
Thanks. I think that it works now, pretty accurate.
main.py
# python3 -m venv venv
# source venv/bin/activate
# pip3 install onnxruntime numpy librosa
# wget https://github.com/pengzhendong/pyannote-onnx/raw/master/pyannote_onnx/segmentation-3.0.onnx
# wget https://github.com/thewh1teagle/sherpa-rs/releases/download/v0.1.0/motivation.wav -O test.wav
# python3 main.py
import onnxruntime as ort
import librosa
import numpy as np

def init_session(model_path):
    opts = ort.SessionOptions()
    opts.inter_op_num_threads = 1
    opts.intra_op_num_threads = 1
    opts.log_severity_level = 3
    sess = ort.InferenceSession(model_path, sess_options=opts)
    return sess

def read_wav(path: str):
    return librosa.load(path, sr=16000)

if __name__ == '__main__':
    session = init_session('segmentation-3.0.onnx')
    samples, sample_rate = read_wav('test.wav')
    # Conv1d & MaxPool1d & SincNet https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html https://pytorch.org/docs/stable/generated/torch.nn.MaxPool1d.html https://github.com/pyannote/pyannote-audio/blob/develop/pyannote/audio/models/blocks/sincnet.py#L50-L71
    frame_size = 270
    frame_start = 721
    window_size = sample_rate * 10 # 10s
    # State and offset
    is_speeching = False
    offset = frame_start
    start_offset = 0
    # Pad end with silence for full last segment
    samples = np.pad(samples, (0, window_size), 'constant')
    for start in range(0, len(samples), window_size):
        window = samples[start:start + window_size]
        # Restart the frame offset at each window so timestamps do not drift
        offset = start + frame_start
        ort_outs: np.ndarray = session.run(None, {'input': window[None, None, :]})[0][0]
        for probs in ort_outs:
            # Class 0 is "no speech"; any other class means someone is speaking
            predicted_id = np.argmax(probs)
            if predicted_id != 0:
                if not is_speeching:
                    start_offset = offset
                    is_speeching = True
            elif is_speeching:
                # Renamed from start/end to avoid shadowing the loop variable
                seg_start = round(start_offset / sample_rate, 3)
                seg_end = round(offset / sample_rate, 3)
                print(f'{seg_start}s - {seg_end}s')
                is_speeching = False
            offset += frame_size
I'm looking for a simple ONNX speaker embedding model (maybe wespeak) so that I'll have accurate diarization on each segment.
After that I'll try to implement it in Rust.
from pyannote-onnx.
The classes (without softmax) of the model are: no speech, spk1, spk2, spk3, spk1 & spk2, spk1 & spk3, spk2 & spk3.
So, why did you use np.argmax? Just set the speech_prob to 1 - ort_outs[0], and then compare it with your threshold.
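A rough sketch of that suggestion, assuming the model returns one row of 7 unnormalized class scores per frame with class 0 meaning "no speech". The helper name `speech_probability` and the 0.5 threshold are illustrative. Applying softmax first is harmless if the scores are already log-probabilities, since it then recovers the same probabilities:

```python
import numpy as np


def speech_probability(frame_scores: np.ndarray) -> np.ndarray:
    """Map (frames, 7) class scores to a per-frame speech probability."""
    # Numerically stable softmax over the class axis.
    shifted = frame_scores - frame_scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    # Everything that is not class 0 ("no speech") counts as speech.
    return 1.0 - probs[..., 0]


# Two toy frames: the first is dominated by "no speech", the second by spk1.
scores = np.array([
    [5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0],
])
is_speech = speech_probability(scores) > 0.5  # hypothetical threshold
```

Unlike np.argmax, a threshold on the speech probability also flags frames where speech is spread across several speaker classes and no single class beats "no speech".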
from pyannote-onnx.
So, why did you use np.argmax?
I understood that the model output is a series of 7 floats (probs) per frame, and that each index corresponds to a class (e.g. index 0 is no speech).
So I used np.argmax to take the index with the highest prob.
Maybe I'm completely wrong, but it works somehow.
Just set the speech_prob to 1 - ort_outs[0]
I didn't understand; where did I use speech_prob?
Why do I need an additional embedding model if the pyannote model already returns speaker labels?
Also, I implemented it in Rust! The segmentation works pretty well.
However, the speaker embedding doesn't work well. Maybe you can take a look at how the embedding is computed?
https://github.com/thewh1teagle/pyannote-rs/blob/main/src/embedding.rs#L19-L37
from pyannote-onnx.
Maybe np.argmax(probs) == 0 => probs[0] > 0.5 is better, because the probs are not normalized.
Why do I need additional embedding model if pyannote model already return speaker label?
Because the speaker index may change in different windows (the model does not have context of the speakers):
- There are two speakers Alice and Bob.
- Alice speaks in the first window. The model labeled Alice as spk1.
- Bob speaks in the second window. The model may label Bob as spk1, too.
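To illustrate how an embedding model can resolve the Alice/Bob ambiguity above, here is a minimal sketch that assigns global speaker IDs by comparing one embedding per segment. It uses a simple greedy cosine-similarity rule rather than k-means, and the function name `assign_speakers`, the 0.5 threshold, and the 3-dimensional toy vectors are all assumptions for illustration; real embeddings would come from a separate speaker embedding model:

```python
import numpy as np


def assign_speakers(embeddings, threshold=0.5):
    """Greedily give each segment embedding a global speaker index."""
    centroids = []  # running sum of unit-length embeddings per speaker
    labels = []
    for emb in embeddings:
        emb = emb / np.linalg.norm(emb)
        # Cosine similarity between this segment and every known speaker.
        sims = [float(c @ emb / np.linalg.norm(c)) for c in centroids]
        if sims and max(sims) >= threshold:
            idx = int(np.argmax(sims))
            centroids[idx] = centroids[idx] + emb
        else:
            # No known speaker is close enough: start a new one.
            idx = len(centroids)
            centroids.append(emb)
        labels.append(idx)
    return labels


# Toy embeddings: both speakers might be "spk1" locally, but differ globally.
alice = np.array([1.0, 0.1, 0.0])
bob = np.array([0.0, 0.2, 1.0])
labels = assign_speakers([alice, bob, alice * 0.9, bob * 1.1])
```

Because the comparison uses the voice embedding instead of the per-window speaker index, two segments labeled spk1 in different windows can still end up with different global IDs.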
Sorry, I'm not familiar with Rust.
from pyannote-onnx.
Because the speaker index may change in different windows (the model does not have context of the speakers):
Oh, now I understand. It's a bit confusing.
Do you think that embedding models work better or more accurately? From what I've tested, they are not as accurate as the pyannote speaker labels. Even your repo with reordering is pretty accurate.
Do you have any idea how to get good speaker labels? I tried the nemo / speaker3d models, but they are not accurate at all.
from pyannote-onnx.
Did you try any clustering algorithms? How did you try nemo or speaker3d models?
from pyannote-onnx.
Did you try any clustering algorithms? How did you try nemo or speaker3d models?
I didn't try clustering algorithms.
I tried nemo with onnxruntime like so:
https://github.com/thewh1teagle/ort-diarize/blob/main/extract.py#L41
And compared the distance with:
https://github.com/thewh1teagle/ort-diarize/blob/main/identify.py#L9
Actually, it's not that bad in Python. Maybe I made a mistake in the fbank computation in Rust.
from pyannote-onnx.