--max_line_width is not behaving as "max", it's a --line-width #1543

Francoyy · 2023-07-22T13:17:08Z

I realize that using --max_line_width doesn't seem to behave as expected (?)
When I add --max_line_width 56, all the sentences are exactly 56 characters.
It should be the maximum value allowed, but if the context and the sentence happen to be shorter, lines shouldn't be forced to fill the max (same concept as width and max-width in CSS).
As a result, when I add --max_line_width, it often appends the start of the next sentence to the line as well, which really doesn't make sense for subtitles.
In the end, I cannot use --max_line_width at all. I leave that setting off and instead manually truncate the ~10% of sentences that are really too long.

ryanheise · 2023-07-22T13:45:01Z

Have you tested the latest Git version?

all the sentences are exactly 56 characters.

I'm not sure if this is the root of your problem, although I could imagine it giving "exactly" 56 characters for languages like Chinese or Japanese which don't use spaces to separate words, and therefore provide Whisper with no simple method to avoid wrapping a line in the middle of a word. In most languages which use spaces, lines will rarely be exactly 56 characters because if a word doesn't entirely fit on the end of the current line, it is moved to the next line, with spaces dictating the break points.
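To illustrate the point about break points, here is a minimal sketch using Python's standard `textwrap` module. It is not Whisper's own wrapping code, only a stand-in for greedy wrapping at spaces; the sample sentence and the 120-character run of `字` are made up for the demonstration.

```python
import textwrap

MAX_LINE_WIDTH = 56  # the value used in the question

spaced_text = (
    "This glorious Easter Sunday, I declare that this is true "
    "in his sacred name, even the name of Jesus Christ, our Savior."
)
# A stand-in for text written without spaces between words.
unspaced_text = "字" * 120

# Greedy wrapping: each line takes as many whole words as fit.
spaced_lines = textwrap.wrap(spaced_text, width=MAX_LINE_WIDTH)
# With no spaces to break at, the only option is to cut at the limit.
unspaced_lines = textwrap.wrap(unspaced_text, width=MAX_LINE_WIDTH)

for line in spaced_lines:
    print(f"{len(line):2d} | {line}")
print([len(line) for line in unspaced_lines])
```

With spaces available, every line is at most 56 characters and usually a little shorter; without them, every line except the last is cut at exactly the limit.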

But I suspect your issue isn't to do with it being "exactly" 56 characters but rather that it doesn't have any idea of the natural points to split a line in sentences that are either too short or too long. E.g., you want it to at least always split after each sentence. The problem here is again that Whisper doesn't have a reliable way to detect sentence endings because it doesn't give you POS tags or anything like that. It can't just split the line after a "." because then a line like "I saw Mr. Jones the other day." would be split after "Mr." So instead it uses a heuristic where it splits after a significant pause (a few seconds). Since there is more likely to be a pause after "day." than after "Mr.", this gives a rough prediction of where to split without knowing anything about the POS tags.
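The pause heuristic can be sketched roughly as follows. This is an illustration, not Whisper's implementation: the `Word` class, the `split_on_pauses` function, the timings, and the 3.0 second default (standing in for "a few seconds") are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def split_on_pauses(words, pause_threshold=3.0):
    """Group words into lines, starting a new line after a long silence."""
    lines, current, prev_end = [], [], None
    for word in words:
        # A gap longer than the threshold is treated as a line boundary.
        if prev_end is not None and word.start - prev_end > pause_threshold:
            lines.append(" ".join(current))
            current = []
        current.append(word.text)
        prev_end = word.end
    if current:
        lines.append(" ".join(current))
    return lines


# Hypothetical word timings: a short gap after "Mr.", a long one after "day."
words = [
    Word("I", 0.0, 0.2), Word("saw", 0.2, 0.5), Word("Mr.", 0.5, 0.8),
    Word("Jones", 0.9, 1.3), Word("the", 1.3, 1.5), Word("other", 1.5, 1.8),
    Word("day.", 1.8, 2.2),
    Word("It", 6.0, 6.2), Word("was", 6.2, 6.4), Word("raining.", 6.4, 7.0),
]

print(split_on_pauses(words))
print(split_on_pauses(words, pause_threshold=0.05))
```

The first call splits only at the long silence after "day."; the second, with a very small threshold, also splits after "Mr.", which shows why a threshold that is too short breaks lines too often.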

If you need something more sophisticated than that, you are actually doing the right thing by taking it into your own hands and implementing your own splitting logic. In your own application you would also be able to include other dependencies such as spaCy, which would allow you to do POS and deprel tagging to develop a better heuristic on when to split the lines. spaCy is probably too heavyweight a dependency to include in core Whisper, but the API is extensible, so you could implement your own subtitle writer using tags from spaCy to choose smarter split points.

One other thing you can do is try to edit the current pause threshold and make it shorter. Although note that if you make it too short, it will split too often.

2 replies

Francoyy Jul 22, 2023
Author

Thanks for your reply!
If I don't specify the --max_line_width, then whisper really understands all my sentences. They all end at the dot, and they start the new sentence at the next line, which is great.
I would hope that the --max_line_width would keep it exactly the same for most of the sentences, and hopefully it would only split the long sentences in a logical way, with a maximum of 56 characters per part (the long sentences are maybe only 20% of my video).
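As a rough sketch of that behaviour, done as post-processing outside Whisper: keep each segment as its own cue when it already fits, and wrap only the ones that are too long. The `wrap_long_segments` function and the sample segments are hypothetical, the wrapping uses Python's `textwrap` rather than anything from Whisper, and timestamps are left out (splitting a cue's timing would need word-level timestamps).

```python
import textwrap


def wrap_long_segments(segments, max_line_width=56):
    """Keep short segments intact; wrap only those longer than the limit."""
    cues = []
    for text in segments:
        text = text.strip()
        if len(text) <= max_line_width:
            # Short sentence: leave it exactly as transcribed.
            cues.append([text])
        else:
            # Long sentence: break at spaces into parts of at most the limit.
            cues.append(textwrap.wrap(text, width=max_line_width))
    return cues


# Hypothetical segment texts, one sentence per segment.
segments = [
    "Man is free.",
    "Christ has won the victory.",
    "This glorious Easter Sunday, I declare that this is true in his "
    "sacred name, even the name of Jesus Christ, our Savior.",
]

for cue in wrap_long_segments(segments):
    print(cue)
```

This still breaks long sentences at whatever space happens to fall before the limit, so it does not choose "logical" split points; it only guarantees that short sentences are never merged or padded.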

I haven't updated from Git this week, but I'm keeping it relatively up to date. I did lots of testing with the version I have, and I'm a little afraid to update too often and get unexpected results, because Whisper is only one part of my pipeline: I process the output a little further, and then translate it to another language with a Google Translate API.
But I will try to update, and see if it gets any better! Cheers!

ryanheise Jul 22, 2023

In theory, rewriting these options to exploit Whisper's default segmentation boundaries would work well. However, those default segmentation boundaries are currently not reliable, in particular when a sentence spans a window boundary. Once that is fixed, I think it would be possible to exploit them in the line wrapping/splitting code.

wenchaoliu-93 · 2025-06-09T09:15:32Z

I have a problem where the option doesn't work at all. It's just an unstable feature.

6 replies

wenchaoliu-93 Jun 9, 2025

#1548

ryanheise Jun 9, 2025

I scrolled down on that page to find your comment, but did not see the SRT or VTT file. What you pasted was the visual progress indicator displayed while Whisper is doing its work, not the actual transcript file. Note that SRT and VTT are the only two formats you can choose that inherently have a concept of line width, so if you choose the JSON format, the option will be meaningless.

wenchaoliu-93 Jun 9, 2025

Sorry. It looks like the whisper output on the terminal can be different from the actual output file.

That said, it does look like the output can be better formatted. See below.

342
00:21:52,400 --> 00:21:58,840
Man is free. Christ has won the victory.

343
00:21:59,820 --> 00:22:03,060
Has one of his special witnesses on earth

344
00:22:03,060 --> 00:22:07,220
today. This glorious Easter Sunday, I

345
00:22:07,220 --> 00:22:12,860
declare that this is true in his sacred

346
00:22:12,860 --> 00:22:19,840
name, even the name of Jesus Christ, our

347
00:22:19,840 --> 00:22:22,000
Savior, Amen.

The best practice is probably using whisper for just text transcription, and another tool for what's called forced alignment. aeneas looks like a good tool.

ryanheise Jun 9, 2025

Forced alignment is actually what Whisper does internally (when the --word_timestamps option is invoked).

wenchaoliu-93 Jun 10, 2025

Is there a way to feed Whisper an existing script and have it add timestamps to it?

wenchaoliu-93 · 2025-06-09T10:41:57Z

#1548

On Mon, Jun 9, 2025, 17:31 ryanheise wrote: @wenchaoliu-93, can you share an example output SRT or VTT file along with the command line options used to produce it?

0 replies

wenchaoliu-93 · 2025-06-09T13:23:33Z

Well, YouTube does it way better from personal experience.

0 replies

wenchaoliu-93 · 2025-06-09T13:27:08Z

It's called "auto-sync": Add subtitles & captions - YouTube Help <https://support.google.com/youtube/answer/2734796?hl=en#zippy=%2Cauto-sync>

0 replies

wenchaoliu-93 · 2025-06-10T09:28:59Z

I realize that using --max_line_width doesn't seem to behave as expected (?) When I add --max_line_width 56, all the sentences are exactly 56 characters.

I do indeed have a similar problem, now that I have checked the actual output file. The line width varies, but the vast majority of lines are close to the max setting. I now wonder what YouTube uses for forced alignment, as it is so much better in my experience.
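One way to check this on an actual output file is to measure the text lines directly. The sketch below is a small standalone helper, not part of Whisper; `srt_line_widths` is a hypothetical name, and the sample reuses two cues from the SRT excerpt posted earlier in the thread.

```python
import re


def srt_line_widths(srt_text):
    """Return the character width of every subtitle text line in an SRT string."""
    widths = []
    # Cues are separated by one or more blank lines.
    for cue in re.split(r"\n\s*\n", srt_text.strip()):
        lines = cue.splitlines()
        # lines[0] is the cue index, lines[1] the timing line, the rest is text.
        for text_line in lines[2:]:
            widths.append(len(text_line.strip()))
    return widths


SAMPLE = """\
342
00:21:52,400 --> 00:21:58,840
Man is free. Christ has won the victory.

343
00:21:59,820 --> 00:22:03,060
Has one of his special witnesses on earth
"""

widths = srt_line_widths(SAMPLE)
print("widths:", widths)
print("max:", max(widths), "mean:", sum(widths) / len(widths))
```

Running the same function over the whole file and comparing the widths with the --max_line_width value shows how many lines sit right at the limit and how many are noticeably shorter.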

0 replies
