4.2. Unicode in a Nutshell
When talking about Unicode, many people have preconceived ideas of what it is and what it means for software development. We're going to try to dispel these myths, so to be safe we'll start with a clean slate, from basic principles. It's hard to get much more basic than Figure 4-1.
Figure 4-1. The letter "a"
So what is this? It's a lowercase Latin character "a." Well, really it's a pattern of ink on paper (or pixels on the screen, depending on your medium) representing an agreed-upon letter shape. We'll refer to this shape as a glyph. This is only one of many glyphs representing the lowercase Latin character "a." Figure 4-2 is also a glyph.
Figure 4-2. A different glyph
It's a different shape with different curves but still represents the same character. Okay, simple so far. Each character has multiple glyph representations. At a computer level, we call these glyph sets fonts. When we need to store a sequence of these characters digitally, we usually store only the characters, not the glyphs themselves. We can also store information telling us which font to use to render the characters into glyphs, but the core information we're storing is still a sequence of characters.
So how do we make the leap from a lowercase Latin character "a" to the binary sequence 01100001? We need two sets of mappings (although they're often grouped together into one set). The first, a character set, tells us how to take abstract characters and turn them into numbers. The second, an encoding, tells us how to take these numbers (or code points) and represent them using bits and bytes. So let's revisit: what is Figure 4-3?
Figure 4-3. The question of the letter "a" again
It's a glyph representing the lowercase Latin character "a." The ASCII character set tells us that the lowercase Latin character "a" has a code point of 0x61 (97 in decimal). The ASCII encoding then tells us that we can represent code point 0x61 by using the single byte 0x61.
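We can check both halves of this mapping directly. The short Python sketch below (Python isn't part of the original text, just a convenient way to poke at bytes) shows the character-to-code-point step and the code-point-to-bytes step separately:

```python
# Character set step: abstract character -> code point.
code_point = ord("a")
print(hex(code_point))  # 0x61 (97 in decimal)

# Encoding step: code point -> bytes. For ASCII, the code point
# value and the single encoded byte are the same.
encoded = "a".encode("ascii")
print(encoded.hex())  # 61
```

Keeping the two steps distinct matters: the character set is about numbers, and the encoding is about bytes. For ASCII they happen to coincide, which is exactly what makes the distinction easy to miss.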
Unicode was designed to be backward compatible with ASCII, so all ASCII code points are the same in Unicode. That is to say, the Latin lowercase "a" in Unicode also has the code point 0x61. In the UTF-8 encoding, code point 0x61 is represented by the single byte 0x61, just as with ASCII. In the UTF-16 encoding, code point 0x61 is represented as the pair of bytes 0x00 and 0x61. We'll take a look at some of the different Unicode encodings shortly.
So why do we want to use Unicode, since it looks so similar to ASCII? This is most easily answered with an example of something ASCII can't do, which is represent the character shown in Figure 4-4.
Figure 4-4. A character well outside of ASCII
This is the Bengali Vocalic RR character, Unicode code point U+09E0. In the UTF-8 encoding scheme, this code point maps to the bytes 0xE0 0xA7 0xA0. In the UTF-16 encoding, the same code point would be encoded using the pair of bytes 0x09 0xE0.
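The byte sequences above can be verified the same way, and the contrast with ASCII made concrete: attempting to encode this character as ASCII fails outright, because ASCII simply has no code point for it. A quick Python check (again, just an illustration, not part of the original text):

```python
# Bengali Vocalic RR, code point U+09E0.
ch = "\u09e0"
print(hex(ord(ch)))                  # 0x9e0
print(ch.encode("utf-8").hex())      # e0a7a0 -- three bytes
print(ch.encode("utf-16-be").hex())  # 09e0   -- two bytes

# ASCII can't represent this character at all.
try:
    ch.encode("ascii")
except UnicodeEncodeError:
    print("not representable in ASCII")
```

Note that the UTF-16 encoding is shorter here (two bytes versus UTF-8's three), while for the Latin "a" the opposite was true; which encoding is more compact depends on which scripts dominate your text.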