Closed
Description
I'm not sure if this is a variant of #412, but check out this partial output:
[00:25:16.880 --> 00:25:20.240] And you're like, this character needs some like thigh highs and like, it should have
[00:25:20.240 --> 00:25:21.240] been a bit of a dresser.
[00:25:21.240 --> 00:25:22.240] It should have been a dresser.
[00:25:22.240 --> 00:25:23.240] It should have been a dresser.
[00:25:23.240 --> 00:25:24.240] It should have been a dresser.
[00:25:24.240 --> 00:25:25.240] It should have been a dresser.
[3333 additional repetitions elided]
[01:21:40.240 --> 01:21:41.240] It should have been a dresser.
[01:21:41.240 --> 01:21:42.240] It should have been a dresser.
[01:21:42.240 --> 01:21:43.240] It should have been a dresser.
[01:21:43.240 --> 01:21:44.240] It should have been a dresser.
[01:21:44.240 --> 01:21:45.240] It should have been a dresser.
[01:21:45.240 --> 01:21:51.240] Whether it's true or not is first and foremost a bluff to stop you from doing the right thing.
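For anyone triaging similar output, a rough way to measure a loop like this is to strip the timestamps and count consecutive duplicate segments. This is only a sketch: `collapse_repeats` is a hypothetical helper name, not part of whisper.cpp, and it assumes the `[start --> end]` prefix format shown above.

```shell
# Hypothetical helper (not part of whisper.cpp): reads timestamped transcript
# lines on stdin, strips the leading "[start --> end]" prefix, and prints each
# run of consecutive identical segments as "<count> <text>".
collapse_repeats() {
    sed -E 's/^\[[0-9:.]+ --> [0-9:.]+\][[:space:]]*//' | uniq -c
}

# Example usage, assuming the transcript was saved to transcript.txt:
#   collapse_repeats < transcript.txt | sort -rn | head
```

Sorting the counts in descending order puts the longest repeated run first, which makes it easy to see where the decoder got stuck.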
Reproduction:
./models/download-ggml-model.sh base.en
make
curl -o episode.mp3 -L https://mcdn.podbean.com/mf/web/5ein65/07-31-Clear-Present-free.mp3
ffmpeg -i episode.mp3 -ar 16000 episode.wav
./main -f episode.wav
Standard error:
whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem required = 215.00 MB (+ 6.00 MB per decoder)
whisper_model_load: kv self size = 5.25 MB
whisper_model_load: kv cross size = 17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 140.60 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: processing 'episode.wav' (94221793 samples, 5888.9 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
whisper_print_timings: fallbacks = 3 p / 9 h
whisper_print_timings: load time = 120.07 ms
whisper_print_timings: mel time = 8174.57 ms
whisper_print_timings: sample time = 21253.98 ms / 46180 runs ( 0.46 ms per run)
whisper_print_timings: encode time = 84284.79 ms / 246 runs ( 342.62 ms per run)
whisper_print_timings: decode time = 139710.86 ms / 46321 runs ( 3.02 ms per run)
whisper_print_timings: total time = 253756.25 ms
I'm on the main branch at v1.2.0.