Hi.
I have been developing a language-learning tool using smart-whisper, but I have run into a problem. I need to transcribe videos/audios word by word, retrieving each word's offsets and/or timestamps, so that I can replay a given word by seeking to its start time.

Using the example `main` from the official Whisper.cpp repo, we get the following output:
```json
{
  ...
  "transcription": [
    {
      ...
      "tokens": [
        {
          "text": " And",
          "timestamps": {
            "from": "00:00:00,220",
            "to": "00:00:00,220"
          },
          "offsets": {
            "from": 220,
            "to": 220
          },
          "id": 843,
          "p": 0.651429
        },
        {
          "text": " so",
          "timestamps": {
            "from": "00:00:00,330",
            "to": "00:00:00,450"
          },
          "offsets": {
            "from": 330,
            "to": 450
          },
          "id": 523,
          "p": 0.992918
        },
        ...
        {
          "text": " country",
          "timestamps": {
            "from": "00:00:10,260",
            "to": "00:00:10,990"
          },
          "offsets": {
            "from": 10260,
            "to": 10990
          },
          "id": 1499,
          "p": 0.995465
        }
      ]
    }
  ]
}
```
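For context, this is the replay logic I have in mind — a minimal sketch assuming token objects with the shape shown in the Whisper.cpp JSON above (offsets in milliseconds from the start of the audio). The names `findToken` and `replayRange` are illustrative helpers, not part of any library:

```javascript
// Look up a token by its text. Token texts in Whisper.cpp output carry a
// leading space, so trim before comparing.
function findToken(tokens, text) {
  return tokens.find((t) => t.text.trim() === text);
}

// Convert a token's millisecond offsets into seconds, suitable for setting
// an HTML <audio>/<video> element's currentTime to replay the word.
function replayRange(token) {
  return { start: token.offsets.from / 1000, end: token.offsets.to / 1000 };
}

// Sample tokens taken from the Whisper.cpp output above.
const tokens = [
  { text: " so", offsets: { from: 330, to: 450 } },
  { text: " country", offsets: { from: 10260, to: 10990 } },
];

const range = replayRange(findToken(tokens, "country"));
// range.start === 10.26, range.end === 10.99
```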
So, is it possible to get the token-level timestamps and offsets shown above when the format is set to `detail` in smart-whisper?
TIA for any help!