
speaker-transcription's Introduction

speaker-transcription

This repository contains the Cog definition files for the associated speaker transcription model deployed on Replicate.

The pipeline transcribes the speech segments of an audio file, identifies the individual speakers, and annotates the transcript with timestamps and speaker labels. An optional prompt string can guide the transcription by providing additional context. The pipeline also outputs global information: the number of detected speakers and an embedding vector that characterizes each speaker's voice.

Model description

There are two main components involved in this process:

  • a pre-trained speaker diarization pipeline from the pyannote.audio package (also available as a stand-alone diarization model without transcription):

    • pyannote/segmentation for permutation-invariant speaker segmentation on temporal slices
    • speechbrain/spkrec-ecapa-voxceleb for generating speaker embeddings
    • AgglomerativeClustering for matching embeddings across temporal slices
  • OpenAI's whisper model for general-purpose English speech transcription (the medium.en model size is used for a good balance between accuracy and performance).

The audio data is first passed into the speaker diarization pipeline, which computes a list of timestamped segments and associates each segment with a speaker. The segments are then transcribed with whisper.
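
A minimal sketch of this two-stage flow (not the repository's actual predict.py; the model identifiers, token placeholder and per-segment slicing below are assumptions based on the component list above):

import whisper
from pyannote.audio import Pipeline

# 1) diarization: who speaks when
diarization = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="YOUR_HF_TOKEN",  # hypothetical token placeholder
)
annotation = diarization("audio.wav")

# 2) transcription: run Whisper on each diarized segment
asr = whisper.load_model("medium.en")
audio = whisper.load_audio("audio.wav")  # 16 kHz mono float32 array
SR = 16_000

for turn, _, speaker in annotation.itertracks(yield_label=True):
    clip = audio[int(turn.start * SR):int(turn.end * SR)]
    result = asr.transcribe(clip, language="en")
    print(f"{speaker} [{turn.start:.1f}s-{turn.end:.1f}s]: {result['text'].strip()}")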

Input format

The pipeline uses ffmpeg to decode the input audio, so it supports a wide variety of input formats - including, but not limited to, mp3, aac, flac, ogg, opus, and wav.

The prompt string is injected as additional (unseen) context at the beginning of the first Whisper transcription window for each segment. It is not part of the final output, but it can be used to guide/condition the transcription towards a specific domain.
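
This corresponds to Whisper's initial_prompt option. A small illustrative example of the underlying mechanism (the deployed pipeline applies it per diarized segment; the file name and prompt text here are just placeholders):

import whisper

asr = whisper.load_model("medium.en")
result = asr.transcribe(
    "interview.mp3",
    language="en",
    # initial_prompt primes the first decoding window with domain terms and
    # proper nouns; it is not included in the returned transcript text.
    initial_prompt="Pyannote, ECAPA-TDNN, VoxCeleb, Replicate",
)
print(result["text"])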

Output format

The pipeline outputs a single output.json file with the following structure:

{
  "segments": [
    {
      "speaker": "A",
      "start": "0:00:00.497812",
      "stop": "0:00:09.762188",
      "transcript": [
        {
          "start": "0:00:00.497812",
          "text": " What are some cool synthetic organisms that you think about, you dream about?"
        },
        {
          "start": "0:00:04.357812",
          "text": " When you think about embodied mind, what do you imagine?"
        },
        {
          "start": "0:00:08.017812",
          "text": " What do you hope to build?"
        }
      ]
    },
    {
      "speaker": "B",
      "start": "0:00:09.863438",
      "stop": "0:03:34.962188",
      "transcript": [
        {
          "start": "0:00:09.863438",
          "text": " Yeah, on a practical level, what I really hope to do is to gain enough of an understanding of the embodied intelligence of the organs and tissues, such that we can achieve a radically different regenerative medicine, so that we can say, basically, and I think about it as, you know, in terms of like, okay, can you what's the what's the what's the goal, kind of end game for this whole thing? To me, the end game is something that you would call an"
        },
        {
          "start": "0:00:39.463438",
          "text": " anatomical compiler. So the idea is you would sit down in front of the computer and you would draw the body or the organ that you wanted. Not molecular details, but like, here, this is what I want. I want a six legged, you know, frog with a propeller on top, or I want I want a heart that looks like this, or I want a leg that looks like this. And what it would do if we knew what we were doing is put out, convert that anatomical description into a set of stimuli that would have to be given to cells to convince them to build exactly that thing."
        },
        {
          "start": "0:01:08.503438",
          "text": " Right? I probably won't live to see it. But I think it's achievable. And I think what that if, if we can have that, then that is basically the solution to all of medicine, except for infectious disease. So birth defects, right, traumatic injury, cancer, aging, degenerative disease, if we knew how to tell cells what to build, all of those things go away. So those things go away, and the positive feedback spiral of economic costs, where all of the advances are increasingly more"
        }
      ]
    }
  ],
  "speakers": {
    "count": 2,
    "labels": [
      "A",
      "B"
    ],
    "embeddings": {
      "A": [<array of 192 floats>],
      "B": [<array of 192 floats>]
    }
  }
}
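
A small sketch for consuming this file, flattening the per-speaker segments into a chronological, speaker-labelled transcript (field names follow the structure above):

import json

with open("output.json") as f:
    out = json.load(f)

for seg in out["segments"]:
    for line in seg["transcript"]:
        print(f"[{line['start']}] {seg['speaker']}: {line['text'].strip()}")

print(f"\nDetected speakers: {out['speakers']['count']}")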

Performance

The current T4 deployment processes audio at roughly 4x real time - i.e. approximately 1 minute of computation for every 4 minutes of input audio.

Intended use

Data augmentation and segmentation for a variety of transcription and captioning tasks (e.g. interviews, podcasts, meeting recordings, etc.). Speaker recognition can be implemented by matching the speaker embeddings against a database of known speakers.
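
As a rough illustration of that last point, a new speaker's 192-dimensional embedding could be matched against a local dictionary of known-speaker embeddings with cosine similarity (the known_speakers dictionary and the 0.7 threshold are assumptions, not part of the model output):

import numpy as np

def cosine(a, b):
    # cosine similarity between two embedding vectors
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(embedding, known_speakers, threshold=0.7):
    # known_speakers: {"alice": [...192 floats...], "bob": [...]}
    best_name, best_score = "unknown", -1.0
    for name, ref in known_speakers.items():
        score = cosine(embedding, ref)
        if score > best_score:
            best_name, best_score = name, score
    return (best_name if best_score >= threshold else "unknown"), best_score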

Ethical considerations

This model may have biases based on the data it has been trained on. It is important to use the model in a responsible manner and adhere to ethical and legal standards.

Citations

For pyannote.audio:

@inproceedings{Bredin2020,
  Title = {{pyannote.audio: neural building blocks for speaker diarization}},
  Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
  Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
  Year = {2020},
}
@inproceedings{Bredin2021,
  Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
  Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
  Booktitle = {Proc. Interspeech 2021},
  Year = {2021},
}

For OpenAI whisper:

@misc{https://doi.org/10.48550/arxiv.2212.04356,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  keywords = {Audio and Speech Processing (eess.AS), Computation and Language (cs.CL), Machine Learning (cs.LG), Sound (cs.SD), FOS: Electrical engineering, electronic engineering, information engineering, FOS: Computer and information sciences},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

speaker-transcription's People

Contributors

johnislarry, meronym


speaker-transcription's Issues

Allow specifying initial_prompt for transcription

The initial_prompt passed to whisper for each audio segment is hardcoded to None:

https://github.com/meronym/speaker-transcription/blob/master/predict.py#L105

It would be really great if we could provide a prompt via the Replicate API input to use for all transcriptions.

I've seen that transcribing audio with proper nouns (in particular names) doesn't work well unless a prompt is provided that spells them out.

What do you think? If you're busy I could put up a PR; however, I don't have a good setup to test an actual inference.

Question: Support for Spanish

What would it take to do this for the Spanish language? If you outline the steps I need to follow, I will give it a shot.

Not compatible with arm64 (Apple Silicon)

I run an M2 MacBook and attempted to run this on my computer. The Docker container couldn't run: it failed to find a GPU (it was looking for NVIDIA), and the CPU-only fallback then failed as well.

View build details: docker-desktop://dashboard/build/desktop-linux/desktop-linux/0hdgzb5ya0erjztlhe3w8szry

Starting Docker image cog-speaker-transcription-mast-base and running setup()...
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
Missing device driver, re-trying without GPU
Error response from daemon: page not found
Error while loading predictor:

Traceback (most recent call last):
  File "/root/.pyenv/versions/3.8.18/lib/python3.8/site-packages/cog/server/http.py", line 131, in create_app
    predictor = load_predictor_from_ref(predictor_ref)
  File "/root/.pyenv/versions/3.8.18/lib/python3.8/site-packages/cog/predictor.py", line 184, in load_predictor_from_ref
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "predict.py", line 12, in <module>
    import torch
  File "/root/.pyenv/versions/3.8.18/lib/python3.8/site-packages/torch/__init__.py", line 199, in <module>
    from torch._C import *  # noqa: F403
ImportError: libtorch_cpu.so: cannot enable executable stack as shared object requires: Invalid argument

ⅹ Model setup failed

Questions: guide to understanding `embeddings`?

I have not been able to find much information on what speakers.embeddings.<speakerLabel> signifies. For example, some example output from this model:

{
  "segments": [...],
  "speakers": {
    "count": 2,
    "labels": [
      "A",
      "B"
    ],
    "embeddings": {
      "A": [
        10.238464819208188,
        -25.505741863932762,
        19.784254991155674,
        ...
        -10.456077188253403
      ],
      "B": [
        10.044449877522242,
        -6.685148896164118,
        20.095289962208614,
        ...
      ]
    }
  }
}

Can you share some details (or guides/links) to what the embeddings represent?

Error 413 when calling Meronym docker

I get this error when I call the Meronym API (hosted on a remote server) with a file of more than 100 MB. Is there an env var I could set to a greater value?

Question about embedding model choice

Hey you, thank you for the package :)

I'm researching how to improve diarization errors related to overlapping speech, and I'd like to ask you about your choice of embedding model.

Is there any particular reason for you to pick the speechbrain's model instead of the default pyannote's model for speaker embedding in your pipeline?

From my research, the speechbrain ecapa-tdnn model achieves 0.8% EER for speaker verification on the VoxCeleb benchmark, while the wespeaker resnet34-LM model provided by pyannote achieves 0.74% EER on the same benchmark. Is there a big difference between their diarization performance that I'm not aware of? Or is there any other reason for choosing one over the other?

Again, thank you for the code and for the info!

EDIT: I just found that pyannote's pipeline v2.0 requires speechbrain as a dependency. Was it the default back then? Sorry for the stupid question if it was 😅.

ModelError: Prediction interrupted; please retry (code: PA)

Unsure if this is a Replicate issue or not.

I ran the model on an 8-minute mp3 via the API, which ran fine in the expected amount of time.

However, I'm trying a 32-minute mp3 and the prediction stays in the "starting" stage and doesn't move.

Comes back with an error after a while: ModelError: Prediction interrupted; please retry (code: PA)

The same 32-minute file runs fine on the web interface, but I can't get it going via the API.

Thoughts?

What can we do to customize larger GPUs?

I am trying to use this with a large audio input (3.5 hours or so).

Since the GPU it uses is fixed, replicate.com fails with:

Prediction failed for an unknown reason. It might have run out of memory (exitcode -9)?

I'm assuming that this means we need larger GPU machines. Is there anything I can do to customize this so it works with large inputs?

Allow specifying whisper model

Similar to #2 - it would be great if we could specify the whisper model to use (large, base, etc.).

Using smaller models would probably be fine for my workflow right now, and would save run-time (and cost) on Replicate.com.

Empty output

Using the stock input:

Running predict()...
pre-processing audio file...
Input #0, image2, from '/tmp/tmpvplq5ejv1678544615656.jpg':
Duration: 00:00:00.04, start: 0.000000, bitrate: 148832 kb/s
Stream #0:0: Video: mjpeg (Baseline), yuvj420p(pc, bt470bg/unknown/unknown), 2718x3624 [SAR 1:1 DAR 3:4], 25 tbr, 25 tbn, 25 tbc
Output #0, wav, to '/tmp/tmpxy7_d_va/audio.wav':
Output file #0 does not contain any stream
transcribing segments...
