The Challenges of PDF Text Layout
Text layout is the process of determining both the position of words and letters on a page and their appearance, based on a set of fonts and additional attributes such as underlining, alignment, and typeface properties. This post provides a high-level overview of how this process works in general, along with some of the challenges it presents.
What Is Text Layout?
Computers were originally designed to deal with numeric data, not text. It was only after computers became affordable and began replacing typewriters that their text processing capabilities became important.
However, the initial idea to map every character in a computer to a number fell short as soon as people needed support for languages other than English. Computers then needed to handle thousands of possible characters in documents and render those abstract characters using fonts in a way that accurately resembles written text. Not only that, but the apparently simple rules about where to break a paragraph of text into multiple lines when it’s too long got complex, because some languages don’t split words using spaces.
Laying out text turns a text string and some attributes like font and font style (bold, italic, etc.) into pixels to draw on a computer screen.
A high-level overview of how a modern text system deals with laying out text on a computer screen is as follows:
- First, the text is itemized, that is, it’s separated into text items, or blocks of smaller text units with the same properties (for example, the same font or the same font color).
- Next, every text item is shaped, which means that its characters are converted into glyphs, that is, the visual representations of characters.
- Given a sequence of glyphs and some physical space (for example, the size of a computer screen), the process determines where to insert line breaks to split the text into multiple lines (word wrapping).
- Finally, the process takes the positioned glyphs and converts them into pixels that can be drawn onscreen.
Let’s look at each stage in more detail by working with an example. Imagine that a computer needs to draw the following paragraph onscreen:
Text layout is the process of arranging text on a page in a way that’s aesthetically pleasing and easy to read. It involves choosing the right font, size, color, and spacing for the text, as well as deciding how to split it into multiple lines.
Note that in this paragraph, some words are in bold, others are italicized, and the text is long enough that the system needs to split it into multiple lines (unless your computer screen is very wide).
Itemization
In the first step of the text layout process, the text is split into text items, each of which is a run of text with the same properties. The sample paragraph is split into:
- “Text layout is the process of ” > Text without any special attribute (default style).
- “arranging text on a page” > Text in bold.
- “in a way that’s aesthetically pleasing and easy to read. It involves choosing the” > Text without any special attribute (default style).
- “right” > Text in italic.
- “font, size, color, and spacing for the text, as well as deciding how to split it into multiple lines.” > Text without any special attribute (default style).
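As a rough illustration of itemization, the sketch below takes text in which every character carries its own style and groups consecutive characters that share a style into text items. The `Style` class, the `itemize` function, and the way the sample paragraph is encoded are assumptions made for this example, not the data model of any particular text engine.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Style:
    """Hypothetical style record; real engines track many more attributes."""
    bold: bool = False
    italic: bool = False


DEFAULT = Style()
BOLD = Style(bold=True)
ITALIC = Style(italic=True)


def itemize(styled_chars):
    """Group consecutive (character, style) pairs that share the same style."""
    items = []
    for style, group in groupby(styled_chars, key=lambda pair: pair[1]):
        items.append(("".join(ch for ch, _ in group), style))
    return items


# The sample paragraph, written as runs and then flattened to one style per character.
runs = [
    ("Text layout is the process of ", DEFAULT),
    ("arranging text on a page", BOLD),
    (" in a way that's aesthetically pleasing and easy to read. "
     "It involves choosing the ", DEFAULT),
    ("right", ITALIC),
    (" font, size, color, and spacing for the text, as well as "
     "deciding how to split it into multiple lines.", DEFAULT),
]
styled_chars = [(ch, style) for text, style in runs for ch in text]

items = itemize(styled_chars)
for text, style in items:
    print(repr(text), style)
```

Grouping by style alone is a simplification. A production itemizer typically also splits on script, language, and text direction, because each of those can change how a run is shaped.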
Shaping
Each of the text items from the previous step is shaped, or converted into glyphs using a font. For example, the character “T” is mapped to the glyph for the letter “T” in the Verdana font at a size of 1 point. This process seems simple in our sample paragraph, but it presents some challenges in general.
To improve the aesthetics of text, fonts usually contain combinations of characters that are rendered as a single unit. For instance, it’s common that the “f, i” letter sequence appears in print as the single ligature glyph “fi.”
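The substitution of a ligature for a letter sequence can be sketched as a longest-match replacement. The table below is an assumption for illustration: it uses the Unicode compatibility characters for the “fi,” “fl,” and “ffi” ligatures as stand-ins for glyphs, whereas a real shaping engine works with glyph IDs and the substitution rules stored in the font itself.

```python
# Hypothetical ligature table: letter sequence -> single "glyph".
# U+FB01, U+FB02, and U+FB03 are the Unicode compatibility ligatures fi, fl, ffi.
LIGATURES = {
    "ffi": "\uFB03",
    "fi": "\uFB01",
    "fl": "\uFB02",
}


def apply_ligatures(text, table=LIGATURES):
    """Return a list of glyphs, preferring the longest matching ligature."""
    longest = max(len(sequence) for sequence in table)
    glyphs = []
    i = 0
    while i < len(text):
        for length in range(longest, 1, -1):
            sequence = text[i:i + length]
            if sequence in table:
                glyphs.append(table[sequence])
                i += length
                break
        else:
            # No ligature starts here, so the character maps to its own glyph.
            glyphs.append(text[i])
            i += 1
    return glyphs


print(apply_ligatures("office"))
print(apply_ligatures("find"))
```

Note that the number of glyphs can differ from the number of characters, which is one reason later stages of layout work with glyphs rather than with the original string.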
Ligatures are optional, but for some languages, there are crucial glyph variants depending on where characters appear in a word. For example, an Arabic letter can have up to four variants depending on whether it appears in isolation or in the initial, medial, or final position in a sequence of connected letters. There’s more information about this in our post on text shaping and fonts.
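One way to see these positional variants without a font is through the Unicode character database, which records compatibility “presentation forms” for Arabic letters along with a tag describing the position each one represents. The sketch below uses Python’s standard `unicodedata` module to collect them; the `positional_forms` helper is a name invented for this example. Modern fonts normally select these variants through their own shaping rules rather than through these code points, so this is only a convenient way to inspect the variants.

```python
import unicodedata


def positional_forms(base):
    """Map position tags (isolated, final, initial, medial) to the
    presentation-form characters that decompose to the single letter `base`."""
    target = f"{ord(base):04X}"
    forms = {}
    # Scan the Arabic Presentation Forms-B block (U+FE70 to U+FEFF).
    for code_point in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(code_point)).split()
        # Entries look like "<initial> 0628"; ligatures list several code points.
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = chr(code_point)
    return forms


beh = "\u0628"   # ARABIC LETTER BEH
alef = "\u0627"  # ARABIC LETTER ALEF

for position, char in positional_forms(beh).items():
    print(position, f"U+{ord(char):04X}", unicodedata.name(char))

print(sorted(positional_forms(alef)))
```

The letter beh has all four forms, while alef only has isolated and final forms because it never connects to the letter that follows it.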
Font substitution in PDF files refers to the process of replacing a font that isn’t embedded in the PDF file with a similar font that is available on the user’s computer, so that the text can still be displayed. Font substitution is an important problem to solve, because the fonts installed on a computer generally don’t support every possible language. It’s also a difficult problem, because font substitution requires an answer to the question “Is this font similar to that font?”, which is partly subjective. In some cases, a run of text needs to be covered by multiple fonts. This is typical if the text contains characters from several scripts (for example, some English text in a Hebrew paragraph).
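The case where one run has to be covered by several fonts can be sketched as follows. The font names and their coverage functions are hypothetical; a real system would read the character coverage from each font file. Each character is assigned to the first font in a priority list that supports it, and consecutive characters with the same font are grouped into one run.

```python
from itertools import groupby


def covers_latin(ch):
    # Assumed coverage: Basic Latin through Latin Extended-B.
    return ord(ch) < 0x0250


def covers_hebrew(ch):
    # Assumed coverage: the Hebrew block only.
    return 0x0590 <= ord(ch) <= 0x05FF


# Hypothetical fallback chain, in priority order.
FALLBACK_CHAIN = [("LatinSans", covers_latin), ("HebrewSans", covers_hebrew)]


def pick_font(ch, chain=FALLBACK_CHAIN):
    """Return the first font that covers `ch`, or None if no font does."""
    for name, covers in chain:
        if covers(ch):
            return name
    return None


def split_by_font(text, chain=FALLBACK_CHAIN):
    """Split `text` into (font name, substring) runs."""
    return [
        (font, "".join(chars))
        for font, chars in groupby(text, key=lambda ch: pick_font(ch, chain))
    ]


# "PDF " followed by the Hebrew word shalom.
print(split_by_font("PDF \u05E9\u05DC\u05D5\u05DD"))
```

Runs whose font is `None` are the ones a real engine would render with a placeholder glyph or hand to a broader system-level fallback. Choosing a fallback that also looks similar to the original font is the subjective part that this sketch leaves out.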
Line Breaking
Once the shaping step has produced a glyph representation for each text item, the text system needs to check whether the sequence of glyphs fits in the allowed space and, if not, decide where to split it into multiple lines. In our example, it’s possible that “Text layout is the process of” fits in the allowed space, but the next text item, “arranging text on a page,” doesn’t fit completely. The text layout engine would then need to split that item across two or more lines.
Breaking text into multiple lines can be done at character boundaries or at word boundaries (what we typically understand as “word wrap”). Both problems are tricky when we consider languages other than English, because there are language-specific rules about where a line break is allowed in a sentence. Unicode Standard Annex #14, the Unicode Line Breaking Algorithm, specifies the rules a text system should follow when performing line breaking. Another challenge is that the basic approach of adding words to a line until there’s no more space can leave very uneven amounts of whitespace at the ends of lines, which makes the text harder to read. Many algorithms try to overcome that problem. A classic one is described in Knuth and Plass’s paper “Breaking Paragraphs into Lines.”
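To make the difference concrete, the sketch below compares the basic greedy approach with a small dynamic program that minimizes the sum of squared leftover space over all lines except the last. This is a simplified “minimum raggedness” formulation in the spirit of Knuth and Plass, not their full algorithm. It also assumes that every character has the same width and that breaks are only allowed at spaces, so it doesn’t implement the Unicode line breaking rules.

```python
def greedy_wrap(words, width):
    """Add words to the current line until the next one doesn't fit."""
    lines, current = [], []
    for word in words:
        candidate = current + [word]
        if current and len(" ".join(candidate)) > width:
            lines.append(" ".join(current))
            current = [word]
        else:
            current = candidate
    if current:
        lines.append(" ".join(current))
    return lines


def balanced_wrap(words, width):
    """Choose breaks that minimize the squared slack of every line but the last."""
    n = len(words)
    best = [0.0] * (n + 1)      # best[i]: minimum cost of laying out words[i:]
    next_break = [n] * (n + 1)  # next_break[i]: where the line starting at i ends
    for i in range(n - 1, -1, -1):
        best[i] = float("inf")
        length = -1  # cancels the space counted before the first word
        for j in range(i, n):
            length += len(words[j]) + 1
            if length > width and j > i:
                break  # the line is full; a single overlong word is still allowed
            slack = max(width - length, 0)
            cost = 0 if j == n - 1 else slack ** 2  # the last line is free
            if cost + best[j + 1] < best[i]:
                best[i] = cost + best[j + 1]
                next_break[i] = j + 1
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:next_break[i]]))
        i = next_break[i]
    return lines


words = "aaa bb cc ddddd".split()
print(greedy_wrap(words, 6))    # ['aaa bb', 'cc', 'ddddd']
print(balanced_wrap(words, 6))  # ['aaa', 'bb cc', 'ddddd']
```

The greedy result fills the first line completely but leaves “cc” alone on a line with four columns of slack. The balanced result accepts a slightly shorter first line in exchange for a more even paragraph.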
Glyph Rendering
The last step involves positioning the text glyphs in the layout (taking into account things like text alignment or justification) and then rendering the glyphs to produce a visual image of the text on the computer screen. In our example, the glyphs for “T,” “e,” “x,” “t,” and every other glyph in the paragraph will be rendered to pixels using specialized software that can draw pixels from font glyphs. The way glyphs are rendered depends on the font technology: Some fonts provide glyphs as bitmaps of pixels, while other font technologies provide instructions for drawing glyph outlines in a way that’s independent of the screen resolution.
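The positioning part of this step can be sketched with simple arithmetic. Given the advance width of each glyph on a line (the values below are made up for illustration) and the width of the line, the alignment determines where the first glyph starts, and every following glyph is placed after the previous one. The `position_glyphs` function is an assumption for this example and ignores justification, kerning, and right-to-left text.

```python
def position_glyphs(advances, line_width, align="left"):
    """Return the x coordinate of each glyph on a single line."""
    total = sum(advances)
    if align == "left":
        x = 0.0
    elif align == "center":
        x = (line_width - total) / 2
    elif align == "right":
        x = float(line_width - total)
    else:
        raise ValueError(f"unsupported alignment: {align}")
    positions = []
    for advance in advances:
        positions.append(x)
        x += advance  # the next glyph starts where this one ends
    return positions


# Hypothetical advance widths for the glyphs "T", "e", "x", "t".
advances = [6, 5, 5, 4]
print(position_glyphs(advances, 100, "left"))    # [0.0, 6.0, 11.0, 16.0]
print(position_glyphs(advances, 100, "center"))  # [40.0, 46.0, 51.0, 56.0]
print(position_glyphs(advances, 100, "right"))   # [80.0, 86.0, 91.0, 96.0]
```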
Conclusion
Typesetting text for computer screens isn’t an easy task. Text can be written using multiple languages and scripts whose rules to render characters visually or to split text into multiple lines aren’t trivial. The result of the text layout process, that is, a list of font glyphs positioned in one or multiple lines, can be easily adapted to be rendered onscreen or stored in a PDF by converting the glyphs to a list of PDF operators with additional resources, like fonts.
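As a minimal sketch of that last conversion, the function below turns a list of already-broken lines into the text operators of a PDF content stream: `BT` and `ET` delimit a text object, `Tf` selects a font and size, `TL` sets the leading, `Td` moves to the start position, `Tj` shows a string, and `T*` moves to the next line. The function name, the `F1` resource name, and the default coordinates are assumptions for this example. It also assumes a simple font whose encoding matches the characters in the strings; real output would write glyph codes according to the font’s encoding and list the font in the page’s resources.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def lines_to_content_stream(lines, font_resource="F1", font_size=12,
                            x=72, y=720, leading=14):
    """Build the text-showing operators for a list of laid-out lines."""
    operators = [
        "BT",                                # begin text object
        f"/{font_resource} {font_size} Tf",  # select font and size
        f"{leading} TL",                     # distance between baselines
        f"{x} {y} Td",                       # move to the first baseline
    ]
    for index, line in enumerate(lines):
        if index > 0:
            operators.append("T*")           # move down one line
        operators.append(f"({escape_pdf_string(line)}) Tj")
    operators.append("ET")                   # end text object
    return "\n".join(operators)


stream = lines_to_content_stream(["Text (layout)", "is hard"])
print(stream)
```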