4.4. The UTF-8 Encoding

UTF-8, which stands for Unicode Transformation Format 8-bit, is the encoding favored by most web application developers. It is a variable-length encoding, optimized for compact storage of Latin-based characters. For those characters, it saves space over encodings with wider code units (such as UTF-16), while still supporting a huge range of code points. UTF-8 is completely compatible with ASCII (also known as ISO standard 646). Since ASCII only defines encodings for the code points 0 through 127 (using the low 7 bits of the byte), UTF-8 keeps all those encodings as is, and sets the high bit only in the bytes of the multibyte sequences used for higher code points.
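The ASCII compatibility described above can be checked with a short Python sketch. The sample string and variable names are illustrative only; the calls used are Python's built-in `str.encode`.

```python
# Pure ASCII text produces identical bytes under ASCII and UTF-8.
text = "Hello, world!"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same = ascii_bytes == utf8_bytes

# A non-ASCII character (U+00E9) becomes a multibyte sequence in which
# every byte has the high bit set, so it can never be mistaken for ASCII.
e_acute = "\u00e9".encode("utf-8")
high_bits = [b >= 0x80 for b in e_acute]

# UTF-16 spends two bytes on every one of these characters.
utf16_len = len(text.encode("utf-16-le"))

print(same, e_acute.hex(), high_bits, len(utf8_bytes), utf16_len)
```

For this 13-character string, UTF-8 uses 13 bytes where UTF-16 uses 26.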

UTF-8 works by encoding the length of the code point's representation, in bytes, into the first byte, and then using subsequent bytes to carry the remaining bits. The first byte of a multibyte sequence starts with as many 1 bits as there are bytes in the sequence, followed by a 0; every subsequent byte starts with 10. Each byte in a UTF-8 character encoding sequence contributes between 0 and 7 bits to the final code point, and the contributed bits read together, from left to right, as one long binary number. The bits that make up the binary representation of each code point are placed in the positions marked b in the bit masks shown in Table 4-3.

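Table 4-3. UTF-8 byte layout

Bytes  Bits  Representation
1      7     0bbbbbbb
2      11    110bbbbb 10bbbbbb
3      16    1110bbbb 10bbbbbb 10bbbbbb
4      21    11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
5      26    111110bb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb
6      31    1111110b 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb
7      36    11111110 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb
8      42    11111111 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb

Only the 1- to 4-byte forms are valid in UTF-8 as currently standardized (RFC 3629), which stops at code point U+10FFFF. The 5- and 6-byte forms come from the original definition of the encoding, and the 7- and 8-byte rows simply continue the pattern; the bytes 0xFE and 0xFF never appear in standard UTF-8.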
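As a minimal sketch of how a decoder might read the length out of the first byte, the following Python function tests the leading bits against the masks in the table. The function name `sequence_length` is hypothetical, and the sketch assumes only the 1- to 4-byte forms are of interest; it does not check for other invalid lead bytes such as overlong encodings.

```python
def sequence_length(lead_byte: int) -> int:
    """Return how many bytes the sequence starting with lead_byte occupies."""
    if lead_byte & 0b10000000 == 0b00000000:   # 0bbbbbbb
        return 1
    if lead_byte & 0b11100000 == 0b11000000:   # 110bbbbb
        return 2
    if lead_byte & 0b11110000 == 0b11100000:   # 1110bbbb
        return 3
    if lead_byte & 0b11111000 == 0b11110000:   # 11110bbb
        return 4
    # 10bbbbbb is a continuation byte; anything else is outside this sketch.
    raise ValueError(f"not a lead byte handled here: {lead_byte:#04x}")


print(sequence_length(0x41), sequence_length(0xE0))
```

Because continuation bytes always start with 10, a reader that lands in the middle of a stream can skip forward to the next byte that is not of that form and resynchronize.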

This means that for the code point U+09E0 (my favorite, the Bengali vocalic RR) we need to use 3 bytes, since we need to represent 12 bits of data (09E0 in hexadecimal is 100111100000 in binary), one more than the 11 bits a 2-byte sequence can hold. We pad the value to 16 bits (0000 100111 100000), combine those bits with the 3-byte bit mask, and get 11100000 10100111 10100000, or 0xE0 0xA7 0xA0 (which you might recognize from the previous example).
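The bit shuffling in this example can be written out as a small Python sketch. The function name `encode_code_point` is hypothetical, and the sketch covers only the 1- to 4-byte forms; it does not reject surrogate code points, which a production encoder would. The result is compared against Python's built-in encoder as a sanity check.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode a single code point as UTF-8 (1- to 4-byte forms only)."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:        # up to 7 bits: 0bbbbbbb
        return bytes([cp])
    if cp < 0x800:       # up to 11 bits: 110bbbbb 10bbbbbb
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # up to 16 bits: 1110bbbb 10bbbbbb 10bbbbbb
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:    # up to 21 bits: 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


encoded = encode_code_point(0x09E0)
print(encoded.hex())                              # e0a7a0
print(encoded == chr(0x09E0).encode("utf-8"))     # True
```

Each branch shifts the code point right to pick out a 6-bit group (or the leftover high bits for the first byte) and ORs it into the fixed prefix from the table.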

One of the nice aspects of the UTF-8 design is that since it encodes as a stream of bytes rather than a set of code points as WORDs or DWORDs, it ignores the endianness of the underlying machine. This means that you can swap a UTF-8 stream between a little-endian and a big-endian machine without having to do any byte reordering or add a BOM. You can completely ignore the underlying architecture.
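To illustrate the contrast, the following Python sketch encodes the same character as UTF-16 in both byte orders and as UTF-8. The variable names are illustrative; the codec names are the ones Python ships with.

```python
text = "\u09e0"  # the Bengali vocalic RR from the earlier example

# UTF-16 has two possible byte orders for the same 16-bit unit.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# UTF-8 has exactly one byte sequence, whatever the machine.
utf8 = text.encode("utf-8")

print(utf16_le.hex())  # e009
print(utf16_be.hex())  # 09e0
print(utf8.hex())      # e0a7a0
```

The two UTF-16 results are byte-reversed copies of each other, which is why a receiver needs a BOM or an out-of-band agreement to tell them apart; the UTF-8 bytes need neither.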

Another handy feature of the UTF-8 encoding is that, because it stores the bits of the actual code point from left to right, performing a binary sort of the raw bytes lists strings in code point order. While this isn't as good as using locale-based sorting rules, it's a great way of doing very cheap ordering: the underlying system doesn't need to understand UTF-8, just how to sort raw bytes.
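A quick way to see this property is to sort the same strings twice in Python, once as strings (which Python compares code point by code point) and once as raw UTF-8 bytes. The word list below is made up for illustration.

```python
words = ["zebra", "\u09e0", "apple", "\u00e9clair", "\U0001F600", "Zulu"]

# Python orders str values by code point.
by_code_point = sorted(words)

# Sort the encoded bytes with no knowledge of UTF-8, then decode for display.
by_raw_bytes = [b.decode("utf-8")
                for b in sorted(w.encode("utf-8") for w in words)]

print(by_code_point == by_raw_bytes)  # True
```

Note that "Zulu" sorts ahead of "apple" in both lists, since uppercase letters have lower code points than lowercase ones; this is the kind of result that locale-based sorting rules would handle differently.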