Ligatures, Clusters, Combining Marks and Variation Sequences
On the surface, Unicode appears to be just a large collection of characters. But before Unicode text is displayed, substantial “shaping” can occur. Shaping is the process of mapping Unicode characters to glyphs and placing them correctly on the display. The mapping is, in general, n characters to m glyphs. For most characters n = m = 1, but there are many exceptions. For example, in Arabic a lam ل (U+0644) followed by an alef ا (U+0627) maps to a lam-alef ligature لا. In English print, you often see the character sequences fi, ff, ffi, and fl displayed as single-glyph ligatures. Sometimes the distinction isn’t obvious unless you look carefully, but it may well be there. This post discusses the user interfaces involved in editing text with ligatures and other n ≠ m mappings.
[La]TeX uses the standard English ligatures, so my interest in them was piqued a long time ago. Later on (2006), I decided to implement a feature in RichEdit called default Latin ligatures, which is enabled by sending an EM_SETEDITSTYLE message with wparam = lparam = SES_DEFAULTLATINLIGA. When a font contains the fi ligature, the feature glyphs all text runs with that font. Glyphing a text run automatically uses the default ligatures, kerning, and some kinds of contextual shaping. The feature was active by default for roughly two years during the development of Office 2007, until a tester discovered that the f and i were somehow connected! Big bug! So, reluctantly, we disabled the feature unless the message above is received.
Living with the feature enabled in my stand-alone builds, I realized that when you have active default ligatures, the arrow keys and the selection need to be handled carefully to avoid user confusion and ire. If you do nothing, typing the → key appears to bypass an fi ligature, but the program thinks the insertion point is between the f and the i. So if you type the delete key, the i is deleted instead of the character that follows the i. This can be disconcerting and the editor appears to be buggy.
The solution is to move the caret 1/n of the way through the fi ligature, where n is the number of characters the ligature represents. In this case, that means halfway through the ligature. In fact, if you don’t look carefully, the result seems to be exactly what you’d have if the two characters were displayed instead of the ligature.
Typing shift+→ selects a character. If the editing program does nothing special with ligatures, selecting the first character of a ligature will probably appear to select the whole ligature. But hitting the Delete key only deletes the first character of the ligature, once again confusing the user. The solution is similar to partial caret motion. Specifically, the selection highlighting goes 1/n of the way through the ligature, where n is the number of characters the ligature represents. For the fi ligature, this is halfway. It looks as if the f is selected and this is, in fact, what is actually selected. Most users won’t even realize that a single glyph is used. The user is happy and no confusion arises. This technique is called partial ligature selection.
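The caret and highlight placement described above can be sketched in a few lines of C++. This is only an illustration, not RichEdit's implementation: the GlyphBox type and both function names are hypothetical, and the sketch assumes the ligature glyph is divided into equal-width slices, one per character.

```cpp
#include <cassert>

// Horizontal extent of one ligature glyph on the display, in pixels.
// (Hypothetical type for this sketch.)
struct GlyphBox {
    double left;
    double width;
};

// x-coordinate of the caret after the first `charsBefore` characters of a
// ligature that represents `charCount` characters. Assumes equal-width
// slices, one per character.
double caretXInLigature(const GlyphBox& glyph, int charCount, int charsBefore) {
    assert(charCount > 0);
    assert(charsBefore >= 0 && charsBefore <= charCount);
    return glyph.left + glyph.width * charsBefore / charCount;
}

// Highlight extent for selecting characters [first, last) of the ligature.
GlyphBox selectionInLigature(const GlyphBox& glyph, int charCount,
                             int first, int last) {
    assert(first <= last);
    double x0 = caretXInLigature(glyph, charCount, first);
    double x1 = caretXInLigature(glyph, charCount, last);
    return GlyphBox{x0, x1 - x0};
}
```

For an fi ligature that is 12 pixels wide, caretXInLigature(glyph, 2, 1) lands 6 pixels in, and selecting only the f highlights the left half of the glyph.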
Generally, English ligatures resemble the layout of the individual characters, so partial ligature selection is unambiguous. But occasionally there are English ligatures that display the component characters more over one another than side by side. For example, note the oo ligature in the logo for a nifty Australian Shiraz.
Partial ligature selection of the first o would go halfway through the oo ligature and is no longer unambiguous, as it is for an fi ligature. The technique may nevertheless be good enough, or conceivably it would be better to treat the ligature as a cluster, that is, as a single unit for selection purposes. If so, trying to select the first o would select both.
This leads one to scripts for which clusters are the norm, such as Thai and Indic scripts like Devanagari. Clusters are combinations of characters that are not displayed side by side. Typically they are displayed above one another or with completely different glyphs. Accordingly, they are treated as multicharacter units by the arrow keys. If the insertion point is at the start of a cluster, the Delete key deletes the whole cluster and shift+→ selects the whole cluster. For both ligatures and clusters (as well as combining-mark sequences), the Backspace key removes one character at a time.
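The difference between the Delete and Backspace keys for clusters can be modeled with a toy text buffer. The sketch below is hypothetical: the Char and Text types and the function names are invented for illustration, the cluster flags are supplied by hand (a real editor would get them from the shaping engine), and the caret is assumed to always sit on a cluster boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One character plus a flag saying whether it begins a new cluster.
// (Hypothetical model; the flag is assumed to come from shaping.)
struct Char {
    char32_t code;
    bool startsCluster;
};

using Text = std::vector<Char>;

// Index just past the cluster that begins at (or contains) position pos.
std::size_t clusterEnd(const Text& text, std::size_t pos) {
    assert(pos < text.size());
    std::size_t end = pos + 1;
    while (end < text.size() && !text[end].startsCluster) ++end;
    return end;
}

// Delete key with the caret at a cluster start: removes the whole cluster.
void deleteForward(Text& text, std::size_t pos) {
    if (pos >= text.size()) return;
    auto first = text.begin() + static_cast<std::ptrdiff_t>(pos);
    auto last = text.begin() + static_cast<std::ptrdiff_t>(clusterEnd(text, pos));
    text.erase(first, last);
}

// Backspace key: removes only the single character before the caret and
// returns the new caret position.
std::size_t backspace(Text& text, std::size_t pos) {
    if (pos == 0) return 0;
    text.erase(text.begin() + static_cast<std::ptrdiff_t>(pos - 1));
    return pos - 1;
}
```

With the Devanagari text U+0915 U+093F U+092E, where the first two characters form one cluster, deleteForward at position 0 removes both U+0915 and U+093F, whereas backspace at position 2 removes only U+093F.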
Multiple character codes are also used in a common Unicode encoding called UTF-16. Many characters are represented by a single 16-bit code. Many more are represented by two 16-bit codes, the first in the range U+D800..U+DBFF and the second in the range U+DC00..U+DFFF. Such a combination is called a surrogate pair. It must be treated as a single unit by the arrow, delete, and backspace keys. The same is true for variation sequences, which consist of a base character followed by a variation selector character. The base character may be represented by a surrogate pair and so may the variation selector. These sequences must also be treated as single units by the arrow, delete, and backspace keys.
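As a rough illustration of the rule above, here is a sketch of how the → key might step through UTF-16 text. The function names are hypothetical. The variation selector test covers only the ranges U+FE00..U+FE0F and U+E0100..U+E01EF (the second range requires surrogate pairs), and the sketch ignores other selectors, combining marks, and clusters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
bool isLowSurrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Number of 16-bit codes (1 or 2) used by the character starting at index i.
std::size_t codePointLength(const std::u16string& s, std::size_t i) {
    assert(i < s.size());
    if (isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1]))
        return 2;
    return 1;
}

// Decodes the character starting at index i into a single code point.
char32_t codePointAt(const std::u16string& s, std::size_t i) {
    if (codePointLength(s, i) == 2) {
        return 0x10000 + ((static_cast<char32_t>(s[i]) - 0xD800) << 10)
                       + (static_cast<char32_t>(s[i + 1]) - 0xDC00);
    }
    return s[i];
}

// Only the two main variation selector ranges are checked in this sketch.
bool isVariationSelector(char32_t cp) {
    return (cp >= 0xFE00 && cp <= 0xFE0F) || (cp >= 0xE0100 && cp <= 0xE01EF);
}

// Caret position after pressing the right-arrow key at index i: skips a whole
// surrogate pair and any variation selector that follows the base character.
std::size_t nextCaretPosition(const std::u16string& s, std::size_t i) {
    if (i >= s.size()) return s.size();
    std::size_t next = i + codePointLength(s, i);
    if (next < s.size() && isVariationSelector(codePointAt(s, next)))
        next += codePointLength(s, next);
    return next;
}
```

Delete and Backspace could use the same boundaries, so that the caret never ends up inside a surrogate pair or between a base character and its variation selector.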
Back in the late 1980s, people dreamed that Unicode would be able to represent all text characters by simple 16-bit units. Well, it turned out to be a lot more complicated than that. Some folks say one should use UTF-32 (32-bit character codes), which at least gets rid of surrogate pairs. But the underlying characters of complex scripts can still consist of multiple codes or can be transformed into glyphs of various shapes. And that’s where much of the real complexity in editing and displaying Unicode occurs.
RichEdit 8.0 Image Support
Up until RichEdit 8.0, RichEdit’s native image support was limited to metafiles, enhanced metafiles, and simple images like bitmaps (bmp’s). If OLE (Object Linking and Embedding) had supported other types, such as jpg’s, png’s and gif’s, RichEdit would have supported them automatically. But OLE’s functionality was frozen years ago. RichEdit 5.0 added “blobs”, which are light-weight OLE-like objects that the RichEdit client, like OneNote, renders. These blobs allow OneNote to insert and render many kinds of images. More recently, Microsoft created the Windows Imaging Component, which supports the most popular image formats and allows extensions. RichEdit 8.0 uses this component to provide image support for jpg’s, png’s and gif’s. This blog post summarizes the capabilities and the APIs that take advantage of this facility.
A number of facilities that ship in the Office versions of RichEdit are not available in the Windows 8 RichEdit, mostly because there wasn’t time to test them thoroughly outside the Office environment and to document them properly. One such facility is the RichEdit blob. RichEdit 8 uses its own built-in implementation of the blob to store and render images using the Windows Imaging Component.
There are two ways to insert images into a classic RichEdit 8 instance (use the Windows 8 msftedit.dll): the EM_INSERTIMAGE message and the TOM2 ITextRange2::InsertImage() method. In addition, you can insert images into a WinRT RichEditBox using the WinRT TOM Windows.UI.Text.ITextRange.InsertImage() method. The first two specify the image dimensions in HIMETRIC units (0.01 mm, or 2540 per inch) and the WinRT method uses Device Independent Pixels (96 per inch). For example, to have a 4”×3” image in a classic RichEdit instance, use a width of 4×2540 = 10160 and a height of 3×2540 = 7620.
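The unit arithmetic is easy to get wrong, so here is a small sketch of the conversions. The helper names are hypothetical and are not part of any RichEdit API; only the constants 2540 and 96 come from the unit definitions above.

```cpp
#include <cassert>
#include <cmath>

constexpr long kHimetricPerInch = 2540;  // HIMETRIC: 0.01 mm units
constexpr long kDipsPerInch = 96;        // Device Independent Pixels

// Inches to HIMETRIC, rounded to the nearest unit.
long inchesToHimetric(double inches) {
    assert(inches >= 0.0);
    return std::lround(inches * kHimetricPerInch);
}

// Inches to Device Independent Pixels, rounded to the nearest pixel.
long inchesToDips(double inches) {
    assert(inches >= 0.0);
    return std::lround(inches * kDipsPerInch);
}

// Device Independent Pixels to HIMETRIC, rounded to the nearest unit.
long dipsToHimetric(long dips) {
    assert(dips >= 0);
    return std::lround(static_cast<double>(dips) * kHimetricPerInch / kDipsPerInch);
}
```

For the 4”×3” image above, inchesToHimetric gives 10160 and 7620 for the classic APIs, and inchesToDips gives 384 and 288 for the WinRT method.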
In addition to the height and width, the APIs have an ascent parameter, which is usually zero. It’s included in case the image contains text that should be aligned with the text baseline. In the original blob implementation for OneNote, this was used (and still is) for aligning handwritten images with the text baseline. It’s also useful for aligning images of mathematics with the text baseline. This second use is supported by the ITextServices2::TxGetNaturalSize2() method, which returns the baseline of text images created using RichEdit. This facility is used in the Office Equation ribbon.
You can save files that include images in the RTF file format. The images are saved via RichEdit’s blob extension to RTF. This uses the OLE \object destination with a RichEdit-specific type of \objblob1. Unfortunately this type is known only to RichEdit. It would be better to save it as a native RTF shape ({\*\shppict{\pict{…}}}) so that Word and other programs could understand it. RichEdit could then also support such images in Word-generated RTF files. Hopefully next time…
RichEdit 8.0 TOM Table Interfaces
An earlier post describes the RichEdit nested table facility and how the EM_INSERTTABLE and EM_GETTABLEPARMS messages could be used to insert and examine tables. Now those messages are documented in MSDN along with a new message, EM_SETTABLEPARMS, which allows one to modify tables. For additional convenience, RichEdit 8.0 adds table support to the TOM text object model. The APIs are all documented in MSDN, but it’s worthwhile to give an overview here to help motivate the approach.
If you’d just like to insert a table with any number of identical rows and default properties, call ITextRange2::InsertTable(). To insert more complicated kinds of tables, RichEdit 8.0 has the ITextRow table interface. In addition to inserting tables, this interface allows you to examine tables and to perform table manipulations, such as inserting, deleting and resizing table columns. The interface is associated with an ITextRange2 and hence doesn’t depend on (or change) the selection, which is used by the table messages EM_INSERTTABLE, EM_GETTABLEPARMS, and EM_SETTABLEPARMS.
To obtain an ITextRow interface, call ITextRange2::GetRow(). To insert one or more identical table rows, call ITextRow::Insert(). To insert nonidentical rows, call ITextRow::Insert() for each different row configuration. This allows you to have tables consisting of rows with mixtures of cell counts, row indents and other properties. In particular, tables don’t have to be rectangular.
To select a table, row, or cell, use ITextRange::Expand() with the Unit tomTable, tomRow, or tomCell, respectively. These Units can also be used with the ITextRange::Move() methods to navigate and select multiple rows or cells. Although having only two interfaces for handling tables may seem sparse compared to extensive table models such as Microsoft Word’s, there’s a nice simplicity to the approach. Instead of dealing with interfaces for collections of cells, rows, columns and tables, along with individual cell, row, column and table interfaces, you only have to learn and use two interfaces. Nevertheless, ITextRow and ITextRange2 harness the complete power of RichEdit’s nested table facility.
Some ITextRow properties apply to a whole row, such as the row alignment. In addition, there are cell properties, such as cell alignment. Cell properties are applied to the active cell, which is the one selected via the ITextRow::SetCellIndex() method. To set cell properties on different cells, change the active cell in between property set calls. The use of an active cell along with the ITextRange navigation methods obviates the need for a cells collection interface.
ITextRow works similarly to ITextPara2, but does not modify the text in the associated range until either the ITextRow::Apply or the ITextRow::Insert method is called. In addition, the row and cell parameters are always active, that is, they cannot have the value tomDefault. On initialization, the ITextRow object acquires the table row properties (if any) at the active end of the associated ITextRange2. The ITextRow::Reset method can be used to update these properties to the current values for ITextRange2.
When the ITextRow::Apply method is given the tomCellStructureChangeOnly flag, only the number of cells and/or the cell widths are changed. Other properties remain the same. This is handy for deleting, inserting or resizing table columns. The EM_SETTABLEPARMS message also has this capability.
RichEdit tables are stored quite efficiently. An empty cell consists of a single cell mark (U+0007) plus the cell property information. The latter can be compressed a bit by allowing trailing cells to share the same set of properties. This is done via the ITextRow::SetCellCountCache() method. In contrast to the ITextRow::SetCellCount() method, which sets the number of cells in a row, the SetCellCountCache() method sets the number of cells with cached parameters. Cells that follow the last cached cell use the same properties as the last cached cell. For example, the call ITextRow::SetCellCountCache(1) causes all cells in the row to use the same set of properties.
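The sharing rule for trailing cells can be pictured with a small stand-alone model. This is not RichEdit's internal data structure: CellProps, RowModel and propsForCell are hypothetical names, and the two properties shown are just an illustrative subset.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative subset of per-cell properties (hypothetical).
struct CellProps {
    long width;
    int alignment;
};

// Toy model of a table row: cellCount plays the role of the value passed to
// SetCellCount(), and cached.size() the value passed to SetCellCountCache().
struct RowModel {
    std::size_t cellCount;
    std::vector<CellProps> cached;
};

// Properties in effect for the cell at `index`: cells past the last cached
// entry reuse the properties of the last cached cell.
const CellProps& propsForCell(const RowModel& row, std::size_t index) {
    assert(index < row.cellCount);
    assert(!row.cached.empty());
    std::size_t i = index < row.cached.size() ? index : row.cached.size() - 1;
    return row.cached[i];
}
```

In this model a five-cell row with two cached entries stores only two property sets, and cells 2 through 4 all resolve to the second one. A cache count of 1 makes every cell in the row resolve to the same entry.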
The architecture is quite flexible in that each table row can have any valid table-row parameters regardless of the parameters for other rows (except for vertical merge flags). For example, the number of cells and the start indents of table rows can differ, unlike in HTML, which has an n×m rectangular format with all rows starting at the same indent.
On the other hand, no formal table description is stored anywhere. Information such as the table row count has to be figured out by navigating through the table. One way to obtain this count is to call ITextRange::StartOf(tomTable, tomFalse, NULL) to move to the start of the current table and then to call ITextRange::Move(tomRow, tomForward, &dcRow). On return, the table row count is given by dcRow + 1, since moving by tomRow units doesn’t move beyond the last table row.

