Designing Captions: Disruptive Experiments with Typography, Color, Icons, and Effects

Sean Zdenek

Typography and color
How simple choices can aid speaker identification

$A frame from 10.0 Earthquake featuring a kitchen scene and the caption: [LOW RUMBLING]. The caption has been treated with fractal noise and a shake effect to give it meaning through its rumbling form.$ Form follows function in this scene from 10.0 Earthquake (2014). As the room and camera start to rumble, the captions literally rumble too, increasing in intensity until the woman in the scene runs screaming out of the room. I created custom captions using presets in Adobe After Effects, including fractal mattes and random text transformations for speed and rotation.

Overlapping speech and multiple speakers in a scene present ongoing challenges for traditional captioning. This section explores typography and color as promising methods to distinguish speakers and signal identity. Color is already standard in United Kingdom subtitling (BBC, 2017). When paired with color, typography can distinguish characters (which is the goal of color-based subtitling) while also embodying them with personality. Changes in type can also signal significant shifts in characters' identities that may be easy for caption readers to miss. In this section, I offer the concept of character profiles, which are composites of type, size, and color specifications that are specific to each character but also flexible enough to adjust to changing identities (e.g., when a human is revealed to be a humanoid robot).

In the United Kingdom, color is the standard and preferred method for distinguishing speakers in a scene. The BBC's 2017 "Subtitle Guidelines," for example, lists a number of techniques for speaker identification, starting with "Colour," which is the "preferred method that should be used in most cases." Each speaker is assigned a different color in the subtitles. When two speakers are represented in the same subtitle, the subtitle will have two different colors, each color corresponding to a different speaker's utterance. In the BBC context, color isn't a supplement for other identification practices (such as labeling/naming) but can be sufficient on its own. The BBC identifies a limited number of colors for speakers: white (#FFFFFF), yellow (#FFFF00), cyan (#00FFFF), and green or lime (#00FF00). Furthermore, the BBC requires that subtitles follow the European Broadcasting Union's XML-based Timed Text format (EBU-TT), which provides extensive support for metadata (including metadata about the color of individual captions).
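As a rough illustration of per-speaker color assignment, the sketch below maps speakers to the four colors listed above in order of first appearance. The function name and the wrap-around behavior for a fifth speaker are assumptions made for this example; they are not taken from the BBC guidelines.

```python
# The four speaker colours named in the passage above (BBC, 2017).
BBC_SPEAKER_COLOURS = {
    "white": "#FFFFFF",
    "yellow": "#FFFF00",
    "cyan": "#00FFFF",
    "green": "#00FF00",
}


def assign_colours(speakers):
    """Give each distinct speaker a colour in order of first appearance.

    Assumption for illustration only: when there are more speakers than
    colours, the palette wraps around and colours are reused.
    """
    palette = list(BBC_SPEAKER_COLOURS.values())
    assignment = {}
    for speaker in speakers:
        if speaker not in assignment:
            assignment[speaker] = palette[len(assignment) % len(palette)]
    return assignment


# Speakers are listed in the order their utterances occur in a scene.
scene = ["Speaker A", "Speaker B", "Speaker A", "Speaker C"]
print(assign_colours(scene))
```

A subtitle containing utterances from two speakers would then be rendered as two runs of text, each in its speaker's assigned color.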

In the United States, speaker identification is detached from considerations of color. According to The Captioning Key, a major style guide produced by the Described and Captioned Media Program (DCMP) (2017) and funded by the U.S. Department of Education, speakers should be identified with screen placement and labels if it isn't already clear who is speaking. "When a speaker cannot be identified by placement and his/her name is known, the speaker's name should be in parentheses" (Described and Captioned Media Program, "Speaker Identification," 2017). Color options are built into caption decoder technology at the device and interface levels. Following the EIA-708 standard, the FCC has mandated a number of requirements for digital television producers, including eight foreground and eight background color options (Federal Communications Commission, 2000). Users can choose one (or no) background color and one foreground color when customizing the appearance of captions on their home television screens. Users can also control type size (small, standard, large) and choose from among eight typeface styles. These choices are applied universally to programs, which is to say that users can't choose to style all of Speaker A with one color/type style and all of Speaker B with another color/type style. Every caption is affected equally across every program and channel (until the user changes the settings). If you choose yellow captions on a black background in the preferences menu of your digital television in the United States, every caption will be styled in yellow on black.

U.S. viewers are not accustomed to reading multi-colored captions or associating captions with speakers based on color alone. The same goes for typography. Style guides discuss typography as a single treatment to be applied uniformly across the caption layer or as an option for users to select and apply universally to all captions. The BBC (2017) explains how "Subtitle fonts are determined by the platform, the delivery mechanism and the client." Font is controlled by the "FontFamily" attribute in the EBU-TT format, which applies the same font treatment to every subtitle character. In the United States, The Captioning Key specifies that "Font characteristics must be consistent throughout the media" (Described and Captioned Media Program, "Text," 2017).

Consistency is a good reason to limit caption files to a single typeface. But I want to push back against the unstated assumption that only color (in the United Kingdom at least) can be used to distinguish characters in a scene. After all, typefaces are not merely decorative. They have personalities and evoke emotions in ways that, granted, are still not well understood. Eva R. Brumberger's (2003) groundbreaking studies of typographic rhetoric "establish[ed] persona profiles for a series of typefaces" (p. 207). In one of her studies, participants were asked to rate 15 different typefaces against 20 different attributes, including cheap, dignified, feminine, inviting, loud, masculine, playful, scholarly, sloppy, and warm (p. 210). Notably, these connotative attributes were "not specific to typefaces but instead have been used as rating scales for a wide range of concepts" (p. 210). Brumberger's analysis revealed that, as a result of participants' ratings, typefaces could be "sorted very cleanly into three categories" (p. 214): elegance, directness, and friendliness. In other words, this study "provide[d] strong evidence that people do consistently ascribe particular personality attributes to a particular typeface" (p. 217).

Experiments with type casting

Subtitling artists of Night Watch (2004), John Wick (2014), and other films are leading the way in the design of rich and expressive open subtitles. Caption studies should be exploring the power of typefaces to evoke feelings, support meanings, distinguish speakers, and even shape characters' personas. Color is used in U.K. subtitling to distinguish speakers only, but typefaces can potentially do much more, in concert with color, to embody them. Leveraging the power of typefaces, we can redefine consistency as an attribute of characters rather than an attribute of the caption file. Each television or movie character could be associated with a different typeface, type size, and color profile, for example. Importantly, designed captions don't have to replace traditional captions but could be an additional option alongside current customization options.

For example, Season 19 of South Park (2015) parodied the presidential campaigns of Hillary Clinton and Donald Trump. Mr. Garrison (voiced by Trey Parker) ran for president as an obvious send-up of Donald Trump. I had originally been drawn to one scene from an episode entitled "Sponsored Content" that used the film technique of rack focus to shift visual emphasis from the speaker in the background to the speaker in the foreground. I wanted to suggest how captions can mirror and support film shots through blurring, fading, and partially hiding captions in the scene (to give captions the appearance of being placed as objects into a three-dimensional environment). As I explored rack focus as a captioning technique in a scene from this episode, I became more interested in using typography and color to signal Mr. Garrison's function as a parody of Trump. I tried setting Mr. Garrison's speech in one of the typefaces used by the Trump campaign (Akzidenz-Grotesk BQ Bold Extended) and then created a container for his speech that was modeled after a Trump campaign poster (blue background, red border interrupted by five stars at the top and bottom, white text in all caps). In other words, the visual form of Mr. Garrison's captions reinforced their function. (Warning: This clip includes some strong language.)

Source: South Park, "Sponsored Content," 2015. Hulu. Original captions.

Source: South Park, "Sponsored Content," 2015. Hulu. Custom captions were created by the author in Adobe After Effects.

Character profiles could be redefined as composites of typeface, size, and color settings for each character. They could be designed by captioners, ideally in collaboration with producers, and then consistently applied. But they do not necessarily need to be permanent. Characters that change fundamentally or abruptly over the course of a narrative could conceivably be redesigned on the caption layer to reflect their changing identities. According to current captioning guidelines, speaker identifiers are not supposed to change, because multiple names for the same character can presumably be confusing to readers. But characters change over the course of a narrative, sometimes radically, taking on new identities and new names. Using typographic resources, captioners might be able to follow characters' narrative arcs more faithfully.
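One way to picture a character profile that can change over a narrative is as a small record with a timeline of redesigns. The sketch below is only illustrative: the class names, the typeface names, and the 42-second change point are hypothetical, and a real captioning workflow would store this information in whatever format its tools support.

```python
from bisect import bisect_right
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CharacterProfile:
    """A composite of typeface, size, and colour for one character."""
    typeface: str
    size_pt: int
    colour: str


class ProfileTimeline:
    """Profiles for one character, keyed by the time (in seconds) each takes effect."""

    def __init__(self, initial):
        self._times = [0.0]
        self._profiles = [initial]

    def change_at(self, time_s, **changes):
        """Record a redesign from time_s onward; unspecified fields carry over."""
        if time_s <= self._times[-1]:
            raise ValueError("changes must be added in chronological order")
        self._times.append(time_s)
        self._profiles.append(replace(self._profiles[-1], **changes))

    def at(self, time_s):
        """Return the profile in effect at time_s."""
        index = bisect_right(self._times, time_s) - 1
        return self._profiles[max(index, 0)]


# Hypothetical character who is revealed to be a robot 42 seconds into a scene.
timeline = ProfileTimeline(CharacterProfile("Sans Default", 24, "#FFFF00"))
timeline.change_at(42.0, typeface="Robotic Display")

print(timeline.at(10.0).typeface)  # Sans Default
print(timeline.at(60.0).typeface)  # Robotic Display
```

Keeping the profile per character, rather than per caption file, is what lets consistency follow the character even when the design changes.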

Here's an experiment that uses multiple typefaces to track a radical change in a character's identity. The 100 (2015) is a post-apocalyptic television drama that takes place ninety-seven years after nuclear holocaust has wiped out human civilization. Four thousand survivors are living on an orbiting space station (the "Ark"). When the series opens, one hundred of those survivors are sent down to Earth to begin the process of re-inhabiting the planet. At the end of season two ("Blood Must Have Blood"), Thelonious Jaha (Isaiah Washington), formerly the Chancellor of the Ark, encounters a woman named Alie (Erica Cerra) who turns out to be a lifelike android. Thelonious realizes something is wrong when her body flickers and glitches. It is at this moment, for both Thelonious and the viewing audience, that her identity switches from human to machine.

Source: The 100, "Blood Must Have Blood, Part 2," 2015. Hulu. Original captions.

The first clue that Alie is an AI is in her name. In Hulu's default sans serif typeface, the lowercase l and the capital I are visually identical. Caption readers can see AI in Alie's name, whereas listeners may only hear a sonic resemblance between Alie and AI. Nevertheless, it is easy to miss Alie's flickering body while reading captions at the same time. The flicker effect is quick and subtle. Initially, I wondered whether the captions could do a better job of signaling this shift visually. Might the captions and the AI body flicker together? Captioning advocates haven't explored how typefaces can be linked to meaning and identity (including how different typefaces can aid speaker identification). For this experiment, I used typefaces to embody the AI character's transformation, starting with a standard yellow typeface for both Thelonious and Alie (when Alie is assumed to be human). As Alie's body flickers and glitches, the typeface flickers too, before switching to a futuristic, robotic-looking typeface for the rest of the scene (when it is clear that Alie is really an android). Thelonious' speech continues to be represented in the standard Hulu yellow, sans serif typeface.

Source: The 100, "Blood Must Have Blood, Part 2," 2015. Hulu. Custom captions were created by the author in Adobe After Effects. Mainframe BB typeface was used for Alie's speech.

Characters who speak over each other present another challenge for traditional captioning. Using screen positioning, captioners may be able to distinguish two people who are talking at the same time. The situation becomes more complex when speakers are swapping places in the frame (or the camera is circling them), when overlapping speech is the purpose of the scene/interaction, or when the number of speakers is greater than two. While overlapping speech from multiple characters may not be heard by listeners as distinct utterances, the captions may still present it that way (i.e., every spoken word is printed on the screen to be read). The required reading speed essentially doubles when two speakers are speaking at the same time.

An edge case from an episode of Rick and Morty suggests how difficult it can be to translate overlapping speech into a readable and quickly understood form for caption readers who are (always) operating under time and space constraints. In "A Rickle in Time" (2015), Rick unfreezes time after six months, leading to serious instability in the universe. When siblings Summer and Morty start arguing, they create a spatial rift. A parallel reality forms. The two realities are represented in a split screen, and each sibling yells something different in both universes at the same time. The original captions don't attempt to resolve the overlapping speech with screen placement. Instead, the bottom-middle of the screen is filled up with the speech of both Mortys and both Summers. No attempt is made to distinguish the utterances. A line from Rick ("COME ON, SPIT IT OUT!") mixes with Morty's speech in one caption and adds to the confusion because there are no speaker identifiers to distinguish them.

Source: Rick and Morty, "A Rickle in Time," 2015. Hulu. Original captions. In the poster image, three caption lines are spread among two speakers (or three speakers if you count each Morty separately). The top line belongs to Rick ("COME ON, SPIT IT OUT!"), and lines two and three belong to the Mortys ("WELL, YOU DON'T EXACTLY / MAKE IT EASY, RICK"). No speaker identifiers or punctuation (i.e., preceding hyphens) are used to connect caption lines to speakers.

Color, placement, and subtle animations can clarify the different contributions of the three speakers. Instead of covering the faces of the speakers in the lower reality with captions that are three and four lines high, we can split the difference between the two realities by moving the captions to the middle of the screen, thus placing them equally close to the lips of all of the speakers. The bluish rift itself, which resembles a vibrating energy wave, can serve as the contrasting background for our new captions after being sufficiently darkened with a translucent black band. Each character's captions can take on a color associated with him or her in this scene (blue to match Rick's iconic hair, yellow to match Morty's shirt, and purple to match Summer's shirt). Importantly, the text can be animated to reinforce the similarities and differences between the realities. For example, when both Mortys repeat the same line at virtually the same time, only one yellow caption is needed. When the Mortys' speech diverges, the caption literally splits in two (like reality itself), each new caption aligning underneath its Morty.
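The merge-or-split logic described above can be sketched in a few lines. This is a simplified illustration, not the After Effects workflow used for the custom captions: the hex values, the anchor labels, and the function names are all assumed for the example.

```python
# Colours follow the scene description above; the exact hex values are assumed.
SPEAKER_COLOURS = {"Rick": "#3B82F6", "Morty": "#FFD700", "Summer": "#8B5CF6"}


def normalise(text):
    """Collapse whitespace and ignore case so near-identical lines compare equal."""
    return " ".join(text.split()).casefold()


def layout_captions(speaker, line_a, line_b):
    """Return one shared caption if both realities say the same thing, else two.

    line_a and line_b are the speaker's lines in each of the two realities.
    Anchor labels ("shared", "reality_a", "reality_b") are hypothetical.
    """
    colour = SPEAKER_COLOURS[speaker]
    if normalise(line_a) == normalise(line_b):
        return [{"text": line_a, "colour": colour, "anchor": "shared"}]
    return [
        {"text": line_a, "colour": colour, "anchor": "reality_a"},
        {"text": line_b, "colour": colour, "anchor": "reality_b"},
    ]


same = layout_captions(
    "Morty",
    "WELL, YOU DON'T EXACTLY MAKE IT EASY, RICK",
    "WELL, YOU DON'T EXACTLY MAKE IT EASY, RICK",
)
split = layout_captions("Summer", "PLACEHOLDER LINE A", "PLACEHOLDER LINE B")
print(len(same), len(split))  # 1 2
```

In a designed caption, the transition from the shared anchor to the two per-reality anchors is where the splitting animation would occur.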

Source: Rick and Morty, "A Rickle in Time," 2015. Hulu. Custom captions were created by the author in Adobe After Effects.

Even with the most carefully designed captions, full access to this scene's speech remains elusive. If we try to read both Summers at once at the end of the clip, for example, the speech runs as fast as 452 words per minute, which is impossibly fast (see Jensema, 1998; Jensema et al., 2000). (Sixteen words on the screen for 2.12 seconds is equivalent to reading 452 words in 60 seconds.) But providing full access isn't the point of the scene anyway. Even so, we can increase the clarity of this scene by employing some simple techniques of color, placement, and animation that are highly rhetorical—that is, specific to the scene itself.
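The reading-rate figure above follows from a simple conversion, sketched here for readers who want to check other captions. The function name is an assumption made for this example.

```python
def words_per_minute(word_count, seconds_on_screen):
    """Convert a caption's word count and display time into a reading rate."""
    if seconds_on_screen <= 0:
        raise ValueError("display time must be positive")
    return word_count / seconds_on_screen * 60


# Sixteen words displayed for 2.12 seconds, as in the clip discussed above.
rate = words_per_minute(16, 2.12)
print(round(rate, 1))  # 452.8
```

The result, about 452.8 words per minute, is consistent with the figure of 452 given above.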

Avatars may have some value in contexts of rapid or overlapping speech. Quoc Vy and Deborah Fels (2009) argued that "Text descriptions are ineffective as they require prior knowledge (e.g., names of characters, especially when off-screen) and additional cognitive effort to associate a name and other visual indicators (e.g., lips moving) to the speaker" (p. 916). They designed a prototype for displaying small faces of characters on the screen next to the captioned speech of each character. Faces were framed by colored borders to match speakers' clothing. Feedback from users was mostly positive. Users called the idea of avatars "innovative and an improvement over existing text-based methods" (p. 918), even if some users felt "overwhelmed" by the added information on the screen (p. 919).

Vy and Fels (2009) designed avatars for Transformers, but it's not clear why Transformers was chosen or which scene they chose (p. 917). It appears that a scene was selected because it contained characters on screen who didn't have speaking parts (thus requiring special handling to indicate who was speaking). But no explicit reasons are given in the article. I would suggest a different approach, one that starts not with avatars in search of a movie scene but the other way around. Avatars need to be tied to contexts. They need to solve specific captioning problems. Movie scenes should be chosen by captioning researchers because they present challenges that avatars might be well suited to solve. The context needs to justify the solution, which is why avatars may not be viable in every context.

Consider a scene from Star Wars: The Force Awakens (2015) when Rey and Finn (Daisy Ridley and John Boyega) talk excitedly inside the Millennium Falcon after shooting down multiple TIE Fighter aircraft that had been pursuing them. Their speech overlaps. They circle each other. The camera cuts to a low shot of the BB-8 droid at their feet. The original bitmap captions use one speaker identifier for REY but rely on preceding hyphens to distinguish each speaker's contribution in each caption.

Source: Star Wars: The Force Awakens, 2015. DVD. Original bitmap captions with preceding hyphens to distinguish speakers in the same caption.

Because Rey occupies the top line in the first caption that identifies her by name, I believe we are supposed to assume that every top line that follows (when two-line captions use preceding hyphens) belongs to Rey and every bottom line belongs to Finn:

- It was perfect.
- That was pretty good.

Using preceding hyphens is imperfect at best and confusing at worst. Screen placement offers a better solution. Even though Rey and Finn switch places on the screen, all of Finn's captions could be placed on the left and all of Rey's captions could be placed on the right. This particular scene may also benefit from the addition of color and/or avatars. I explored both together (blue for Finn, red for Rey) in my reworking of this scene's challenging captions.

Source: Star Wars: The Force Awakens, 2015. DVD. Custom captions were created by the author in Adobe After Effects.
