Whisper with Urro: Native speech recognition, speaker diarization, and word-level timestamps, all in one model #2581
urroxyz started this conversation in Show and tell
Hey, this looks interesting to me. Speaker segmentation is an important part of generating subtitles, as it can give cues for automated video editing. However, does your project support an API if I run it locally? Also, to test your project, should I just install your package? Can you help?
Whisper with Urro
WHISPER + URRO
Multilingual automatic speech recognition (ASR) with speaker segmentation (SS) / speaker diarization (SD) and word-level timestamps (WLT)
Installation
Latest
Development
Introduction
After extensive experimentation, I have discovered that, yes, Whisper can segment speakers and timestamp words! I created WHISPER + URRO to offer an easy way to do both.
By modifying the thinking process of the OpenAI model, we can force it to delimit new speakers with symbols like hyphens (-) or greater-than signs (>), or even with complete labels such as [SPEAKER 1] and [SPEAKER 2], to keep track of who is speaking and when.[1] By extracting cross-attentions and processing them with dynamic time warping, we can reconstruct timestamps on the word level rather than relying on occasional generated time tokens.[2]
Supported models
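Official
tiny [3], tiny.en [4]
base [5], base.en [6]
small [7], small.en [8]
medium [9], medium.en [10]
large-v3 [11], large-v3-turbo [12]
Third-party
d-v1a [13]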
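To make the word-timestamp idea from the introduction more concrete, here is a minimal sketch of dynamic time warping over a token-by-frame cost matrix. It is not WHISPER + URRO's actual code: the toy attention matrix, the function names `dtw_path` and `token_frame_spans`, and the 20 ms frame duration are all assumptions chosen for illustration. In a real pipeline the matrix would come from the model's cross-attention weights.

```python
def dtw_path(cost):
    """Return the minimum-cost monotonic path through cost[token][frame]."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] = cheapest cumulative cost of aligning the first i tokens
    # with the first j frames (row 0 and column 0 are padding).
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1],  # advance token and frame together
                acc[i - 1][j],      # advance token only
                acc[i][j - 1],      # advance frame only
            )
    # Walk back from the final cell to recover the alignment.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def token_frame_spans(path, n_tokens):
    """Collapse a DTW path into (first_frame, last_frame) per token."""
    spans = [None] * n_tokens
    for tok, frame in path:
        first = frame if spans[tok] is None else spans[tok][0]
        spans[tok] = (first, frame)
    return spans


# Toy stand-in for attention weights: 3 tokens x 6 audio frames (assumed data).
attn = [
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.8, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.1, 0.1, 0.9, 0.9],
]
cost = [[1.0 - a for a in row] for row in attn]  # high attention = low cost
path = dtw_path(cost)
spans = token_frame_spans(path, len(attn))

FRAME_SEC = 0.02  # assumed duration of one frame, in seconds
for word, (first, last) in zip(["Hey,", "sit", "down,"], spans):
    print(word, round(first * FRAME_SEC, 2), round((last + 1) * FRAME_SEC, 2))
```

Each token ends up with the contiguous run of frames where its attention is strongest, and multiplying frame indices by the frame duration gives start and end times.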
Comparison
[SPEAKER 2] Hey, sit down, that’s wrong of you.
[SPEAKER 1] The little lady who is to become Mrs. Harvey Yates over my dead body.
[SPEAKER 3] I know I have the sincere wishes of all my friends…
and can only tell you how much I appreciate it.
I think I can honestly say this is the happiest moment of my life.
Look what I have here.
It’s a little engagement present just given me by Mr. Yates.
medium
No speaker labels
Hey, sit down,
that’s fine.
The little lady who is to become Mrs. Harvey Yates over my dead body.
[APPLAUSE]
I know I have the sincere wishes of all my friends,
and can only tell you how much I appreciate it.
I think I can honestly say this is the happiest moment of my life.
Look what I have here…
It’s a little engagement present just given me by Mr. Yates.
medium with WHISPER + URRO
delimiter=SPEAKER()
prompt=SPEAKERS(3, "en")
Correct speaker labels
[SPEAKER 2] Hey, sit down,
that’s fine.
[SPEAKER 1] The little lady who is to become Mrs. Harvey Yates over my dead body.
[APPLAUSE]
[SPEAKER 3] I know I have the sincere wishes of all my friends,
and can only tell you how much I appreciate it.
I think I can honestly say this is the happiest moment of my life.
Look what I have here…
It’s a little engagement present just given me by Mr. Yates.
d-v1a
Incorrect speaker labels
[S2] Hey, sit down,
it’s warm.
[S1] The little lady who is to become Mrs. Harvey Yates, over my dead body.
[S2] I know I have the sincere wishes of all my friends,
and can only tell you how much I appreciate it.
I think I can honestly say this is the happiest moment of my life.
Look what I have here.
d-v1a with WHISPER + URRO
delimiter=SPEAKER(short=True)
prompt=SPEAKERS(3, "en", short=True)
Correct speaker labels
[S2] Hey, sit down,
it’s warm.
[S1] The little lady who is to become Mrs. Harvey Yates, over my dead body.
[S3] I know I have the sincere wishes of all my friends,
and can only tell you how much I appreciate it.
I think I can honestly say this is the happiest moment of my life.
Look what I have here.
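For anyone who wants to use labelled output like the above for automation (subtitle styling or video editing, for instance), here is a small sketch that splits a labelled transcript into per-speaker turns. It only assumes the two label shapes shown in the comparison, [SPEAKER n] and [Sn]; the function name `split_turns` is made up for this example and is not part of WHISPER + URRO.

```python
import re

# Matches the two label styles seen in the comparison: [SPEAKER 2] and [S2].
LABEL = re.compile(r"\[(SPEAKER \d+|S\d+)\]\s*")


def split_turns(transcript):
    """Split a labelled transcript into a list of (speaker, text) turns."""
    # With one capturing group, re.split returns:
    # [text before first label, label, text, label, text, ...]
    pieces = LABEL.split(transcript)
    turns = []
    leading = " ".join(pieces[0].split())
    if leading:
        turns.append((None, leading))  # text before any label has no speaker
    for speaker, text in zip(pieces[1::2], pieces[2::2]):
        # Collapse subtitle line breaks into single spaces.
        turns.append((speaker, " ".join(text.split())))
    return turns


sample = (
    "[S2] Hey, sit down,\n"
    "it's warm.\n"
    "[S1] The little lady who is to become Mrs. Harvey Yates, over my dead body.\n"
    "[S3] I know I have the sincere wishes of all my friends,\n"
    "and can only tell you how much I appreciate it.\n"
)
for speaker, text in split_turns(sample):
    print(f"{speaker}: {text}")
```

Bracketed sound events such as [APPLAUSE] do not match the label pattern, so they stay inside the text of whichever turn they occur in.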
Footnotes
1. Unique to WHISPER + URRO.
2. As explicitly implemented in whisper-timestamped, alongside other libraries, such as openai-whisper.
3. https://huggingface.co/onnx-community/whisper-tiny_timestamped
4. https://huggingface.co/onnx-community/whisper-tiny.en_timestamped
5. https://huggingface.co/onnx-community/whisper-base_timestamped
6. https://huggingface.co/onnx-community/whisper-base.en_timestamped
7. https://huggingface.co/onnx-community/whisper-small_timestamped
8. https://huggingface.co/onnx-community/whisper-small.en_timestamped
9. https://huggingface.co/onnx-community/whisper-medium-ONNX
10. https://huggingface.co/onnx-community/whisper-medium.en-ONNX
11. https://huggingface.co/onnx-community/whisper-large-v3-ONNX
12. https://huggingface.co/onnx-community/whisper-large-v3-turbo_timestamped
13. onnx-community/whisper-d-v1a-ONNX