Comments (6)
@SeeknnDestroy,
I am using WhisperX from the command line, as I included in the issue description and example command.
The text format that I would like to see when diarization is enabled via the --diarize flag is something like this:
[SPEAKER_08]: Can you show me Jill's last email on that subject.
[SPEAKER_05]: Sure, let me share my screen. I don't know if there's a response to my questions, but we can review it together.
The Python that I included above is the workaround I implemented for the time being; it takes the "srt" format and simplifies it to a "txt" format that includes SPEAKER labels, like the example above.
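For illustration, here is a minimal sketch of how that layout could be produced directly from diarized segments instead of going through the SRT file. It assumes each segment is a dict with a "text" key and an optional "speaker" key (the shape used elsewhere in this thread). The function name `format_speaker_lines` and the merging of consecutive segments from the same speaker are my own assumptions, not something WhisperX does.

```python
def format_speaker_lines(segments, unknown="UNKNOWN"):
    """Render segments as '[SPEAKER_XX]: text' lines.

    Consecutive segments from the same speaker are merged into one line.
    `segments` is assumed to be a list of dicts with "text" and an
    optional "speaker" key; this helper is hypothetical, not WhisperX API.
    """
    lines = []  # list of [speaker, text] pairs, in original order
    for segment in segments:
        speaker = segment.get("speaker", unknown)
        text = segment["text"].strip()
        if not text:
            continue  # skip empty segments
        if lines and lines[-1][0] == speaker:
            lines[-1][1] += " " + text  # same speaker keeps talking
        else:
            lines.append([speaker, text])
    return "\n".join(f"[{speaker}]: {text}" for speaker, text in lines)


if __name__ == "__main__":
    sample = [
        {"speaker": "SPEAKER_08", "text": " Can you show me Jill's last email on that subject."},
        {"speaker": "SPEAKER_05", "text": " Sure, let me share my screen."},
        {"speaker": "SPEAKER_05", "text": " We can review it together."},
    ]
    print(format_speaker_lines(sample))
```

Merging adjacent segments keeps each speaker turn on one line, which is closer to the example above than one line per segment would be.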
from whisperx.
In case anyone is interested...
I put together some Python that converts the "srt" output into something closer to what I am looking for from the "txt" format.
import re
import sys
import os

def reformat_srt(file_path):
    with open(file_path, 'r') as file:
        content = file.read()

    pattern = re.compile(r'\d+\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\n(\[SPEAKER_\d+\]: .+?(?=\n\n|\Z))', re.DOTALL)
    matches = pattern.findall(content)

    # Remove underline markup
    matches = [(start, end, re.sub(r'<u>|</u>', '', text.strip())) for start, end, text in matches]

    # Drop repeated messages while keeping the original order
    unique_lines = []
    seen = set()
    for _, _, text in matches:
        if text not in seen:
            seen.add(text)
            unique_lines.append(text)

    reformatted = '\n'.join(unique_lines)
    return reformatted

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python clean_srt.py <srt_input_filename> [output_directory]")
        sys.exit(1)

    input_file_path = sys.argv[1]
    output_dir = sys.argv[2] if len(sys.argv) > 2 else os.getcwd()

    # Generate the output filename
    base_name = os.path.basename(input_file_path)
    name, _ = os.path.splitext(base_name)
    output_file_path = os.path.join(output_dir, f"{name}-clean.txt")

    reformatted_content = reformat_srt(input_file_path)
    with open(output_file_path, 'w') as file:
        file.write(reformatted_content)
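As a quick way to try the approach without writing files, the sketch below applies the same pattern and de-duplication to an in-memory string. The sample SRT text is invented for illustration: it mimics repeated cues with `<u>` word highlighting, which is what the underline stripping and de-duplication in the script appear to target. `clean_srt_text` is a hypothetical name.

```python
import re

# Same cue pattern as the script above: index line, timestamp line,
# then a "[SPEAKER_NN]: ..." body up to the next blank line or end of input.
CUE = re.compile(
    r'\d+\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\n'
    r'(\[SPEAKER_\d+\]: .+?(?=\n\n|\Z))',
    re.DOTALL,
)


def clean_srt_text(content):
    """Return de-duplicated '[SPEAKER_NN]: text' lines from SRT content."""
    seen = set()
    lines = []
    for _start, _end, text in CUE.findall(content):
        text = re.sub(r'</?u>', '', text.strip())  # drop underline markup
        if text not in seen:  # keep only the first occurrence, in order
            seen.add(text)
            lines.append(text)
    return '\n'.join(lines)


# Invented sample: two cues differ only in which word is underlined.
SAMPLE = """1
00:00:00,000 --> 00:00:00,400
[SPEAKER_08]: <u>Hello</u> there.

2
00:00:00,400 --> 00:00:00,900
[SPEAKER_08]: Hello <u>there.</u>

3
00:00:01,200 --> 00:00:02,000
[SPEAKER_05]: Sure.
"""

if __name__ == "__main__":
    print(clean_srt_text(SAMPLE))
```

Note that de-duplicating on the full text also drops a line that a speaker genuinely repeats word for word later in the recording, which may or may not be acceptable for a given transcript.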
I have been using WhisperX for transcribing multi-speaker audio files and I enabled diarization to distinguish between different speakers. However, I noticed that the TXT format output does not include speaker labels, which are crucial for my use case to identify who is speaking at any given time.
Could you provide some insights on why the speaker labels are missing in the TXT output when diarization is enabled? Is this an intended behavior or a potential oversight? Additionally, if this feature is not currently supported, are there any plans to include speaker labels in future updates of the TXT format output?
Thank you for your assistance and for the great work on this tool!
PS: Here's a sample command if it helps...
whisperx --hf_token <your_hf_token> --print_progress True --language en --diarize --compute_type int8 voice_chat.mp4 -o ~/transcriptions -f txt --min_speakers 4 --max_speakers 12
Can you please elaborate on the text format you have in mind? Are you using whisperx on the command line or as a Python library? Could you share an example snippet and your conclusions about this?
@veenified Currently, the WriteTXT class writes only the transcript text to the file.
We can modify it as follows:
class WriteTXT(ResultWriter):
    extension: str = "txt"

    def write_result(self, result: dict, file: TextIO, options: dict):
        for segment in result["segments"]:
            start = format_timestamp(segment["start"])
            end = format_timestamp(segment["end"])
            # Segments without a diarization label fall back to "Unknown"
            speaker = segment.get("speaker", "Unknown")
            text = segment["text"].strip()
            print(f"{start}\t{end}\t{speaker}\t{text}", file=file, flush=True)
The output .txt will then contain one line per segment, with the start and end timestamps, the speaker label, and the text separated by tabs.
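To make the resulting layout concrete, here is a small standalone sketch that writes segments the same way. The `fmt_ts` helper is a simplified stand-in written for this example, not WhisperX's `format_timestamp`, so the exact timestamp rendering in the real output may differ. The segment values are invented.

```python
import io


def fmt_ts(seconds):
    """Stand-in timestamp formatter (HH:MM:SS.mmm); not the WhisperX helper."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def write_segments(segments, file):
    """Write 'start<TAB>end<TAB>speaker<TAB>text', one segment per line."""
    for segment in segments:
        start = fmt_ts(segment["start"])
        end = fmt_ts(segment["end"])
        speaker = segment.get("speaker", "Unknown")
        text = segment["text"].strip()
        print(f"{start}\t{end}\t{speaker}\t{text}", file=file)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_segments(
        [{"start": 1.5, "end": 4.25, "speaker": "SPEAKER_05",
          "text": " Sure, let me share my screen."}],
        buffer,
    )
    print(buffer.getvalue(), end="")
```

Tab-separated columns are easy to load into a spreadsheet or split in a script, at the cost of being less readable than the `[SPEAKER_XX]: text` layout requested in the issue.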
@nkilm This works great!
I've switched over to using this for my TXT format, because I am always using Diarization.
I would suggest keying off the --diarize flag to decide whether the conventional TXT format is used or the revised format with SPEAKERs identified.
I tried to do this and submit a pull request, but I am failing to pass the diarization flag/parameter through to utils.py as an option.
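One possible shape for that switch is sketched below as a standalone function rather than a patch to utils.py. The `diarize` parameter is hypothetical (the thread does not show how, or whether, the CLI flag reaches the writer). When it is left as `None`, the sketch falls back to checking whether any segment carries a "speaker" key, which would sidestep passing the flag through at all.

```python
import io


def write_txt(result, file, diarize=None):
    """Write plain text, or '[SPEAKER]: text' lines when diarization is on.

    `diarize` is a hypothetical switch: True/False forces a format, while
    None auto-detects by looking for a "speaker" key on any segment.
    """
    segments = result["segments"]
    if diarize is None:
        diarize = any("speaker" in segment for segment in segments)
    for segment in segments:
        text = segment["text"].strip()
        if diarize:
            speaker = segment.get("speaker", "Unknown")
            print(f"[{speaker}]: {text}", file=file)
        else:
            print(text, file=file)  # conventional TXT output: text only


if __name__ == "__main__":
    result = {"segments": [
        {"text": " Can you show me Jill's last email?", "speaker": "SPEAKER_08"},
        {"text": " Sure, let me share my screen."},
    ]}
    out = io.StringIO()
    write_txt(result, out)
    print(out.getvalue(), end="")
```

Auto-detection keeps the conventional output unchanged for runs without diarization, so existing users of the TXT format would not see a difference.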
I would suggest keying off the --diarize flag to decide whether the conventional TXT format is used or the revised format with SPEAKERs identified.
Could you explain more about what you are trying to achieve?
I tried to do this and submit a pull request, but I am failing to pass the diarization flag/parameter through to utils.py as an option.
Is the PR still open? I'll see if I can help.