a parable about timing

This is loosely based on a conversation I had over Discord with someone just learning to subtitle. It's probably not entirely historically accurate, but I believe it exposes certain fundamental concepts in subtitling, and should offer some insight on why we do things the way we do.

CH: Do the lead-in and lead-out add a little bit of time before and after your actual selection?

CN: Are you asking why we add lead-in and lead-out? Or are you asking about a specific feature of Aegisub?

CH: The former.

CN: In ye-olde-days, people would time precisely to the audio waveform. That is, as soon as the speech started to play, the subtitle would appear on screen. Intuitively, this makes sense: since the subtitle is supposed to correspond in meaning to the speech, it ought to correspond to the timing of the speech as well.

CH: A subtitle that appears at the same time as the audio is obviously associated with that audio.

CN: Right. But this does present some problems. First, people are often slower at reading text than they are at understanding speech, so if a character is talking fast enough to be hard to follow by ear, subtitles timed strictly to the speech will never stay on screen long enough to read.

CH: I imagine that's an even bigger problem with anime, since Japanese tends to be more concise than English.

CN: That's precisely correct. The second problem is that timing subtitles like that results in subtitle flash, which is what happens when subtitles are close together but not connected: there are subs on screen, a handful of frames with no subtitles, and then subs again. It's distracting and unpleasant to look at.

There's a second kind of flash that happens around keyframes or cuts in the video, where a subtitle starts or ends within a few frames of the cut. Again, distracting and unpleasant. The other side of that is scene bleed, where a subtitle goes only a few frames over a cut, and ends up being the only static element in what is otherwise a massive change in the visuals.

CH: Oh, I've definitely seen that last one in a lot of professional subtitles.

CN: Yeah, of course. The professionals are getting paid to do it, so there's no reason for them to really think about this stuff for the most part.

CH: That's kind of ironic.

CN: That's capitalism in action. Anyways, people decided they ought to come up with a solution to these problems, and so they did: they started adding an amount of timing buffer to the end of their subtitles. Today, we call this lead-out. Since we're all using good subtitle tooling like Aegisub, we can have a standard buffer length that's added with a keypress or a post-processing utility. Back then, though, it was all done by feel.

Regardless, end buffer solved a lot of these problems by adding some flexibility to the process: lines could be on-screen longer, you could connect the ending of one line to the beginning of the next to avoid flash, and you could even set the ends of lines to cuts in the video in order to avoid flash there as well. Some real visionaries even used negative buffer to prevent scene bleeds! This is how many subtitlers were doing things into the modern era.
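If it helps to make that concrete, here's a minimal sketch in Python of what an end-buffer pass might look like. The Line type, the field names, and the 500ms default are all made up for illustration; real tooling like Aegisub's Timing Post-Processor is configurable and handles more cases than this:

```python
from dataclasses import dataclass

@dataclass
class Line:
    start_ms: int  # time the subtitle appears
    end_ms: int    # time the subtitle disappears

def apply_lead_out(lines: list[Line], lead_out_ms: int = 500) -> None:
    """Add end buffer to every line without overlapping the next one."""
    for i, line in enumerate(lines):
        new_end = line.end_ms + lead_out_ms
        if i + 1 < len(lines):
            # Connect to the next line at most; never run past its start.
            new_end = min(new_end, lines[i + 1].start_ms)
        line.end_ms = new_end
```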

CH: Okay, that makes a lot of sense. But what about lead-in, then?

CN: Surprisingly, lead-in has a somewhat different origin. I mean, it can be lengthened to avoid flash immediately after a cut, of course, but it actually mostly has to do with how your brain processes visual and auditory information in parallel: visual information is generally processed more slowly than sound. The result is that lines starting precisely at the beginning of the speech waveform actually appear to start late. A short lead-in is added to compensate.

Lead-in is less flexible than lead-out. You can stretch it a little to catch a nearby cut, but beyond that you want to avoid presenting new subtitle information out of sync with the audio. So lead-in stays more or less constant, and lead-out does the work of closing gaps to mitigate flash.
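Continuing the sketch from before, the asymmetry might look something like this. The 120ms figure is a placeholder, not a recommendation:

```python
def apply_lead_in(lines: list[Line], lead_in_ms: int = 120) -> None:
    """Start each line slightly before its audio, without backing into the previous line."""
    for i, line in enumerate(lines):
        new_start = max(0, line.start_ms - lead_in_ms)
        if i > 0:
            # Keep the shift small and constant; never overlap the previous line.
            new_start = max(new_start, lines[i - 1].end_ms)
        line.start_ms = new_start
```

(You'd run a pass like this before the lead-out pass, so lead-in gets its small constant shift first and lead-out then closes whatever gaps remain.)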

CH: I see, thank you!

That explains why, even though the sounds seemed to match pretty well when I replayed the segment with space, the subtitles felt just slightly behind when I played the video.

You mentioned closing gaps to avoid flash, but exactly how close together is too close for subtitles?

CN: Okay, so this is gonna be different between timers, but there are some best practices that've made their way around the scene: you want at least 500ms between lines; any less than that and you should close the gap. 500ms is also the magic number for snapping the end of a line to a keyframe. The window for snapping the start of a line to a keyframe is tighter, since you don't want the text showing up too far ahead of the speech; people tend to aim for around 300ms.
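If you want to see those numbers wired into a pass like the earlier sketches, it might look roughly like this (same made-up Line type; I'm assuming the keyframe list is in milliseconds):

```python
GAP_CLOSE_MS = 500    # gaps shorter than this get closed
END_SNAP_MS = 500     # snap a line's end to a keyframe within this window
START_SNAP_MS = 300   # snap a line's start to a keyframe within this window

def close_gaps(lines: list[Line]) -> None:
    """Extend a line to meet the next one when the gap would read as flash."""
    for cur, nxt in zip(lines, lines[1:]):
        if 0 < nxt.start_ms - cur.end_ms < GAP_CLOSE_MS:
            cur.end_ms = nxt.start_ms

def snap_to_keyframes(lines: list[Line], keyframes_ms: list[int]) -> None:
    """Pull line boundaries onto nearby cuts to avoid flash around them."""
    for line in lines:
        for kf in keyframes_ms:
            if 0 < kf - line.end_ms <= END_SNAP_MS:
                line.end_ms = kf    # ends just before a cut: extend to the cut
            if 0 < line.start_ms - kf <= START_SNAP_MS:
                line.start_ms = kf  # starts just after a cut: pull back to the cut
```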

Unfortunately, there are no magic numbers for scene bleeds. Those have more to do with the relationship between the audio and the video. As a general rule, if only one syllable of speech goes over the cut, you're safe to snap to the keyframe. This is the same for both the beginning and end of lines. But again, there are no hard-and-fast rules. Scene bleeds can be one of those situations where you really just have to go by feel.

CH: Alright, thank you! I'll keep practicing and see if I can get a feel for things.

CN: Cool! Let me know if you have any more questions.