4.3. Unicode Encodings
There are a number of encodings defined for storing Unicode data, both fixed and variable width. A fixed-width encoding is one in which every code point is represented by the same number of bytes, while a variable-width encoding is one in which different code points can be represented by different numbers of bytes. UTF-32 and UCS2 are fixed width; UTF-7 and UTF-8 are variable width; and UTF-16 is a variable-width encoding that usually looks like a fixed-width one.
UTF-32 (and UCS4, which is almost the same thing) encodes each code point using 4 bytes, so it can encode any code point from U+0000 to U+FFFFFFFF. This is usually overkill, given that there aren't nearly that many code points defined. UCS2 encodes each code point using 2 bytes, so it can encode any code point from U+0000 to U+FFFF. UTF-16 also uses 2 bytes for most characters, but the code points from U+D800 to U+DFFF are reserved for what are called surrogate pairs, which allow UTF-16 to encode the code points U+0000 to U+10FFFF.
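The surrogate pair mechanism is simple arithmetic: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves, one added to U+D800 and one to U+DC00. A minimal sketch (the helper name is ours, not part of any standard API):

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low

# U+1D11E (MUSICAL SYMBOL G CLEF) becomes the pair D834 DD1E
high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))
```

Because the surrogate range U+D800 to U+DFFF is never assigned to characters, a decoder can always tell a surrogate unit from an ordinary 2-byte character, which is why UTF-16 can get away with "usually" looking fixed width.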
UTF-8 uses between 1 and 4 (or 1 and 6 for the ISO 10646 version, which we'll discuss below) bytes for each code point and can encode code points U+0000 to U+10FFFF (or U+0000 to U+7FFFFFFF for the ISO 10646 version). We'll discuss UTF-8 in more detail in a moment. UTF-7 is a 7-bit-safe encoding, which allows it to appear in email without the need for base64 or quoted-printable encoding. UTF-7 never really caught on and isn't widely used: it lacks UTF-8's ASCII transparency, and quoted-printable is more than adequate for sending UTF-8 by email.
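We can see UTF-8's variable width directly by encoding sample characters from different ranges; this short illustrative snippet shows the 1-, 2-, 3-, and 4-byte cases:

```python
# Each sample character falls in a different UTF-8 length bucket.
samples = ["A",          # U+0041  -> 1 byte (plain ASCII)
           "\u00e9",     # U+00E9  -> 2 bytes
           "\u0939",     # U+0939  -> 3 bytes
           "\U00010348"] # U+10348 -> 4 bytes

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s) -> {encoded.hex()}")
```

The ASCII range (U+0000 to U+007F) encodes to single identical bytes, which is the "ASCII transparency" property mentioned above.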
So what's this ISO 10646 thing we've been talking about? The concept of Unicode was obviously such a good idea that two groups started working on it at the same time: the Unicode Consortium and the International Organization for Standardization (ISO). Before release, the standards were combined but still retained separate names. They are kept mostly in sync as time goes on, but have different documentation and diverge a little when it comes to encodings. For the sake of clarity, we'll treat them as the same standard.
What's important to notice here is that while we have multiple encodings (which map code points to bytes), we only have a single character set (which maps characters to code points). This is central to the idea of Unicode: a single set of code points that all applications can use, with a set of multiple encodings to allow applications to store data in whatever way they see fit. All Unicode encodings are lossless, so we can always convert from one to another without losing any information (ignoring the fact that raw UTF-32 can store values above U+10FFFF that UTF-16 can't represent). With Unicode, code point U+09E0 always means the Bengali Vocalic RR character, regardless of the encoding used to store it.
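The lossless property is easy to demonstrate: round-tripping the same text through several encodings always recovers the identical sequence of code points. A quick sketch:

```python
# Every Unicode encoding maps back to the same code points, so a
# round trip through any of them is lossless.
text = "Bengali Vocalic RR: \u09e0"

for encoding in ("utf-8", "utf-16", "utf-32"):
    round_tripped = text.encode(encoding).decode(encoding)
    assert round_tripped == text
    print(f"{encoding}: round trip OK, {len(text.encode(encoding))} bytes")
```

The byte counts differ per encoding, but the decoded text is identical in every case.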
4.3.1. Code Points and Characters, Glyphs and Graphemes
So far we've painted a fairly complex picture: characters are symbols that have an agreed meaning and are represented by a code point. A code point can be represented by one or more bytes using an encoding. If only it were so simple.
A character doesn't necessarily represent what a human thinks of as a character. For instance, the Latin letter "a" with a tilde can be represented by either the single code point U+00E3 (Latin small letter "a" with tilde) or by composing it from two code points, U+0061 (Latin small letter "a") and U+0303 (combining tilde). What the human reader sees in either case is referred to as a grapheme. A grapheme can be composed of one or more characters: a base character and zero or more combining characters.
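The Unicode normalization rules mentioned below let us convert between the composed and decomposed forms. A minimal sketch using Python's standard `unicodedata` module:

```python
import unicodedata

composed = "\u00e3"            # U+00E3 LATIN SMALL LETTER A WITH TILDE
decomposed = "\u0061\u0303"    # U+0061 "a" followed by U+0303 COMBINING TILDE

# Different code point sequences, so a naive comparison fails...
print(composed == decomposed)

# ...but normalizing both to the same form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)   # compose
print(unicodedata.normalize("NFD", composed) == decomposed)   # decompose
```

NFC (canonical composition) and NFD (canonical decomposition) are the two canonical normalization forms defined by the standard.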
The situation is further complicated by ligatures, in which a single glyph is constructed from two or more characters. Such a sequence can be represented either by its regular code points or by a single code point for the ligature. For instance, the ligature fi ("f" followed by "i") can be represented by U+0066 (Latin small letter "f") and U+0069 (Latin small letter "i"), or by U+FB01 (Latin small ligature "fi").
So what does this mean at a practical level? It means that given a stream of code points, you can't arbitrarily cut them (such as with a substring function) and get the expected sequence of graphemes. It also means that there is more than one way to represent a single grapheme, using different sequences of ligatures and combining characters to create identical graphemes (although the Unicode normalization rules allow graphemes to be decomposed and compared reliably). To find the number of characters (the length) in a string encoded using UTF-8, we can't count the bytes. We can't even count the code points, since some code points may be combining characters that don't add an extra grapheme. You need to understand both where the code points lie in a stream of bytes and what the character class of each code point is. The character classes defined in Unicode are shown in Table 4-1.
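The three different "lengths" of a string can be seen side by side. This sketch counts graphemes by skipping combining marks (general category "M"), which is a simplification of the full grapheme-boundary rules but enough to show the gap:

```python
import unicodedata

s = "a\u0303"   # the grapheme "a with tilde", built from base + combining tilde

byte_length = len(s.encode("utf-8"))   # bytes in the UTF-8 encoding
code_points = len(s)                    # code points in the string
# Approximate grapheme count: ignore combining marks (categories Mn, Mc, Me).
graphemes = sum(1 for ch in s
                if not unicodedata.category(ch).startswith("M"))

print(byte_length, code_points, graphemes)   # 3 bytes, 2 code points, 1 grapheme
```

Three different answers for one visible character, which is exactly why naive byte- or code-point-based substring operations can split a grapheme in half.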
In fact, Unicode defines more than just a general category for each character: the standard also defines a name, general characteristics (alphabetic, ideographic, etc.), shaping information (bidi, mirroring, etc.), casing (upper, lower, etc.), numeric values, normalization properties, boundaries, and a whole slew of other useful information. This will mostly not concern us, and we won't realize when we're using this information since it happens magically in the background, but it's worth noting that a core part of the Unicode standard, in addition to the code points themselves, is the set of properties attached to them.
These properties and characteristics, together with the normalization rules, are all available from the Unicode web site (http://www.unicode.org/) in both human- and computer-readable formats.
4.3.2. Byte Order Mark
A byte order mark (BOM) is a sequence of bytes at the beginning of a Unicode stream used to designate the encoding type. Because systems can be big endian or little endian, multibyte Unicode encodings such as UTF-16 can store the bytes that constitute a single code point in either order (highest or lowest byte first). BOMs work by putting the code point U+FEFF (reserved for this purpose) at the start of the file. The actual output in bytes depends on the encoding used, so after reading the first four bytes of a Unicode stream, you can figure out the encoding used (Table 4-2).
Most other Unicode encodings have their own BOMs (including SCSU, UTF-7, and UTF-EBCDIC) that all represent the code point U+FEFF. BOMs should be avoided at the start of served HTML and XML documents because they'll mess up some browsers. You also want to avoid putting a BOM at the start of your PHP templates or source code files, even though they might be UTF-8 encoded, because PHP won't accept it.
For more specific information about the Unicode standard, you should visit the Unicode Consortium's web site at http://www.unicode.org/ or buy The Unicode Standard 4.0 (Addison-Wesley), which is a lot of fun, as it contains all 98,000 of the current Unicode code points; you can order it from http://www.unicode.org/book/bookform.html.