4.6. Using UTF-8 with PHP
One of the side effects of UTF-8 being a byte-oriented encoding is that so long as you don't want to mess with the contents of a string, you can pass it around blindly using any system that is binary safe (by binary safe, we mean that we can store any byte values in a "string" and will always get exactly the same bytes back).
This means that PHP 4 and 5 can easily support a Unicode application without any character set or encoding support built into the language. If all we do is receive data encoded using UTF-8, store it, and then blindly output it, we never have to do anything more than copy a block of bytes around.
But there are some operations that you might need to perform that are impossible without some kind of Unicode support. For instance, you can't perform a regular substr( ) (substring) operation. substr( ) is a byte-wise operation, and you can't safely cut a UTF-8-encoded string at arbitrary byte boundaries. If you, for instance, cut off the first 3 bytes of a UTF-8-encoded string, that cut might come down in the middle of a character sequence, and you'll be left with an incomplete character.
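To make the problem concrete, here is a minimal sketch of a byte-wise cut landing inside a character. The sample string is our own choice, and the check uses preg_match( ) with the /u modifier, which refuses to process a subject that isn't valid UTF-8; it assumes your PCRE build has UTF-8 support, which is normally the case.

```php
<?php
// "a€b": the euro sign occupies three bytes (E2 82 AC) in UTF-8.
$text = "a\xE2\x82\xACb";

// substr() counts bytes, so this keeps "a" plus two-thirds of the euro sign.
$cut = substr($text, 0, 3);

echo strlen($text), "\n";  // byte length, not character count
echo bin2hex($cut), "\n";  // hex dump of the truncated string

// preg_match() with /u returns false when the subject is not valid UTF-8,
// and 1 here when it is (the empty pattern always matches).
$original_ok = (preg_match('//u', $text) === 1);
$cut_ok      = (preg_match('//u', $cut) === 1);
var_dump($original_ok, $cut_ok);
```

The original string passes the check, while the truncated one ends in a lead byte and one continuation byte with nothing to complete them.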
If you're tempted at this point to move to a fixed-width encoding such as UCS-2, it's worth noting that you still can't blindly cut Unicode strings, even at character boundaries (which can be easily found in a fixed-width encoding). Because Unicode allows combining characters for diacritical and other marks, a chop between two code points could result in a character at the end of the string missing its accents, or stray accents at the beginning of the string (or strange side effects from double-width combining marks, which are too confusing to contemplate here).
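The same effect can be seen without leaving UTF-8. The sketch below assumes the mbstring extension is available and uses a sample word of our own, spelled with a combining accent rather than a precomposed character. Both cuts fall on perfectly valid code point boundaries, yet the result is still wrong.

```php
<?php
// "café" spelled with a combining accent: c, a, f, e, then U+0301 (bytes CC 81).
$word = "cafe\xCC\x81";

// mb_substr() counts code points, so both cuts land on valid boundaries.
$head = mb_substr($word, 0, 4, 'UTF-8');  // the base letters only
$tail = mb_substr($word, 4, 1, 'UTF-8');  // the combining accent on its own

echo mb_strlen($word, 'UTF-8'), "\n";  // number of code points
echo bin2hex($head), "\n";
echo bin2hex($tail), "\n";
```

The first piece reads "cafe" with the accent missing, and the second piece is a stray accent with nothing to attach to.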
Any function that in turn relies on substring operations cannot be safely used either. For PHP, this includes things such as wordwrap( ) and chunk_split( ).
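As an illustration, the sketch below compares chunk_split( ) with a code point aware stand-in. The helper utf8_chunk_split( ) is hypothetical (it is not part of PHP), it assumes the mbstring extension is available, and it only respects code point boundaries, so it can still separate a combining mark from its base character.

```php
<?php
// Hypothetical helper: like chunk_split(), but counts UTF-8 code points
// instead of bytes. Assumes $length is a positive integer.
function utf8_chunk_split($text, $length = 76, $end = "\r\n") {
    $out = '';
    $total = mb_strlen($text, 'UTF-8');
    for ($i = 0; $i < $total; $i += $length) {
        $out .= mb_substr($text, $i, $length, 'UTF-8') . $end;
    }
    return $out;
}

$euros = "\xE2\x82\xAC\xE2\x82\xAC\xE2\x82\xAC";  // three euro signs, nine bytes

$bytewise      = chunk_split($euros, 2, "|");       // splits inside every character
$codepointwise = utf8_chunk_split($euros, 2, "|");  // splits between characters

echo bin2hex($bytewise), "\n";
echo bin2hex($codepointwise), "\n";
```

The byte-wise version scatters separators through the middle of each three-byte sequence, while the helper produces two euro signs, a separator, one euro sign, and a final separator.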
In PHP, Unicode substring support comes from the mbstring (multibyte string) extension, which is not enabled in default PHP builds and has to be switched on when PHP is compiled or installed. Once this extension is available, it presents you with alternative string manipulation functions: mb_substr( ) replaces substr( ) and so on. In fact, the mbstring extension contains support for overloading the existing string manipulation functions (the mbstring.func_overload setting), so simply calling the regular function will actually call the mb_...( ) function automatically. It's worth noting though that overloading can also cause issues. If you're using any of the string manipulation functions anywhere to handle binary data (and here we mean real binary data, not textual data treated as binary), then overloading them will break your binary handling code. Overloading was also deprecated in PHP 7.2 and removed in PHP 8.0. Because of this, it's safest to call the multibyte functions explicitly wherever you mean to use them.
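A short sketch of the explicit style follows, assuming the mbstring extension is loaded and that all text is UTF-8. The sample strings are our own. Text goes through the mb_...( ) functions with the encoding named on every call, while binary data stays on byte-oriented calls.

```php
<?php
$text = "na\xC3\xAFve";         // "naïve": 5 characters, 6 bytes
$blob = "\x89PNG\r\n\x1A\n";    // the 8-byte PNG file signature, real binary data

// Text: ask for characters and name the encoding explicitly.
$chars  = mb_strlen($text, 'UTF-8');
$prefix = mb_substr($text, 0, 3, 'UTF-8');  // "naï", intact

// The byte-wise call cuts the two-byte "ï" in half.
$broken = substr($text, 0, 3);

// Binary: count bytes. The '8bit' encoding makes mb_strlen() count bytes too,
// which keeps working even where overloading has been switched on.
$bytes = mb_strlen($blob, '8bit');

echo $chars, " ", strlen($text), " ", $bytes, "\n";
echo bin2hex($prefix), " ", bin2hex($broken), "\n";
```

Naming the encoding on each call avoids depending on the extension's configured internal encoding, which may not be UTF-8 on every server.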
In addition to worrying about the manipulation of UTF-8-encoded strings, the other function you'll need at the language level is the ability to verify the validity of the data. Not every stream of bytes is valid UTF-8. We'll explore this in depth in Chapter 5.
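As a brief preview, one way to perform this check is with mb_check_encoding( ) from the mbstring extension. The wrapper name is_valid_utf8( ) is hypothetical, and this is only a sketch of the idea rather than the full treatment.

```php
<?php
// Hypothetical helper: true only if $bytes is well-formed UTF-8.
// Assumes the mbstring extension is loaded.
function is_valid_utf8($bytes) {
    return mb_check_encoding($bytes, 'UTF-8');
}

var_dump(is_valid_utf8("na\xC3\xAFve"));  // well-formed text
var_dump(is_valid_utf8("na\xC3"));        // truncated two-byte sequence
var_dump(is_valid_utf8("\x80abc"));       // continuation byte with no lead byte
```

Only the first string is accepted; the other two contain byte sequences that can never occur in valid UTF-8.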