Comments (9)

MahmoudAshraf97 commented on July 16, 2024

Interesting results indeed, thanks for sharing. But afaik WhisperS2T is just an interface for multiple backends, so which one are you using here?

from whisperx.

BBC-Esq commented on July 16, 2024

Oh yeah, sorry, using the ctranslate2 backend. It's important to note that it's ctranslate2 and not just faster-whisper. As far as I know, whisperX and whisperS2T are the only repositories that have batch processing using ctranslate2. faster-whisper should hopefully be getting it soon, however. See Here.

At any rate, out of respect for the hard work of all the repositories I'm benching, it's important to note that different libraries have different benefits/drawbacks...my benchmarks are only for speed purposes.

Infinitay commented on July 16, 2024

Recently https://github.com/ictnlp/StreamSpeech was released and I'm curious how it compares, although currently it doesn't support many languages unless you train it yourself, and it's more real-time focused. Any chance you could benchmark it alongside whisperx if possible? Thanks

BBC-Esq commented on July 16, 2024

> Recently https://github.com/ictnlp/StreamSpeech was released and I'm curious how it compares, although currently it doesn't support many languages unless you train it yourself, and it's more real-time focused. Any chance you could benchmark it alongside whisperx if possible? Thanks

Interesting...Thanks for the link. I briefly checked it out and the model names imply that they only handle translation. I didn't see a model that handles straight transcription from one language to the same language. With that being said, if you find out otherwise and provide me with a basic script that can perform inference, I'll fine-tune it to get VRAM measurements and timing, and process the same audio file that my other benchmarks used.

stri8ed commented on July 16, 2024

It looks like WhisperS2T does not use the previous segment's transcription as context. This is the same with WhisperX. It would be interesting to see WER benchmarks alongside the speed numbers, especially for long audio, which may be more sensitive to the context, or lack thereof.
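
For anyone who wants to run that comparison themselves, here is a minimal sketch of how WER is usually computed: word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the lowercase-and-split normalization are my own assumptions for illustration, not what WhisperS2T or WhisperX use in their published numbers; real benchmarks usually apply a heavier text normalizer first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Illustrative helper (hypothetical name). Normalization here is only
    lowercasing and whitespace splitting, which is an assumption.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, which is exactly what repeated hallucinated text tends to produce.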

MahmoudAshraf97 commented on July 16, 2024

I guess the WhisperX paper showed that using the previous segment's transcription in the prompt isn't useful

stri8ed commented on July 16, 2024

> I guess the WhisperX paper showed that using the previous segment's transcription in the prompt isn't useful

Indeed. I recall reading that. Anecdotally, that does not seem to be the case for me, but I'd be interested to hear if anyone else has more data on that. Intuitively, I would expect additional context to be useful, given the model was trained to condition the result on the prompt/context.

BBC-Esq commented on July 16, 2024

> It looks like WhisperS2T does not use the previous segment's transcription as context. This is the same with WhisperX. It would be interesting to see WER benchmarks alongside the speed numbers, especially for long audio, which may be more sensitive to the context, or lack thereof.

If you go here you can see that the WER is actually better...lol. Still trying to figure that out, but the guy seems solid in his testing so far:

https://github.com/shashikg/WhisperS2T/releases

Jiltseb commented on July 16, 2024

Generally, very long context (>30 sec) is not needed for ASR (unlike paralinguistic tasks). By not passing in the previous context, we can prevent some repetitions/hallucinations from carrying over to the next segment, as we see in batched faster_whisper, and in turn get a better WER.
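
To make the propagation argument concrete, here is a toy sketch of the two decoding loops. Everything in it is hypothetical: `decode` stands in for a single-segment model call, and `toy_decode` is a made-up failure mode (not real Whisper behavior) where a noisy segment triggers a repetition and a prompt containing that repetition keeps it going.

```python
from typing import Callable, List

# Hypothetical signature: (audio_segment, prompt_text) -> transcribed text.
Decoder = Callable[[str, str], str]


def transcribe_sequential(segments: List[str], decode: Decoder) -> List[str]:
    """Decode in order, feeding each segment's output as the next prompt."""
    texts: List[str] = []
    prompt = ""
    for segment in segments:
        text = decode(segment, prompt)
        texts.append(text)
        prompt = text  # previous transcription becomes the context
    return texts


def transcribe_independent(segments: List[str], decode: Decoder) -> List[str]:
    """Decode every segment with an empty prompt, so order does not matter
    and the segments could be batched."""
    return [decode(segment, "") for segment in segments]


def toy_decode(segment: str, prompt: str) -> str:
    """Made-up decoder: hallucinates on noise, and keeps hallucinating
    whenever the hallucination is present in the prompt."""
    if segment == "<noise>" or "la la la" in prompt:
        return "la la la"
    return segment


if __name__ == "__main__":
    audio = ["hello", "<noise>", "world", "again"]
    print(transcribe_sequential(audio, toy_decode))
    print(transcribe_independent(audio, toy_decode))
```

With this toy decoder, the sequential loop loses every segment after the noisy one, while the independent loop only loses the noisy segment itself. Whether a real model behaves this way on a given file is an empirical question, which is why the WER numbers matter.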
