Section 5.3. Filtering UTF-8

So you've configured your outgoing HTTP headers to specify UTF-8. All of your input will be valid UTF-8 sequences, right? If only it were that easy. The combination of people manually posting to your forms, old browsers with no Unicode support, and posting from pages with other encodings means that you can never be sure about the data you are sent.

Unlike ISO-8859-1, not every sequence of bytes is a valid UTF-8 string. Table 5-1 shows a portion of the byte layout table for UTF-8-encoded characters.

Table 5-1. UTF-8 byte layout

Bytes  Bits  Representation
1      7     0bbbbbbb
2      11    110bbbbb 10bbbbbb
3      16    1110bbbb 10bbbbbb 10bbbbbb
4      21    11110bbb 10bbbbbb 10bbbbbb 10bbbbbb

If we encounter a byte in the range 0xE0 to 0xEF (0b11100000 to 0b11101111), then we know that it should be followed by two bytes in the range 0x80 to 0xBF (0b10000000 to 0b10111111).
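The layout in Table 5-1 can be read straight off the lead byte with a few bit masks. The following is a minimal sketch; the helper name utf8_sequence_length is an assumption made for illustration and only covers the four rows shown in the table.

```php
<?php
// Hypothetical helper (not from the original text): given the first byte of
// a sequence, return how many bytes the whole sequence should occupy
// according to Table 5-1, or 0 if the byte cannot start such a sequence.
function utf8_sequence_length(int $byte): int {
    if (($byte & 0x80) === 0x00) { return 1; } // 0bbbbbbb
    if (($byte & 0xE0) === 0xC0) { return 2; } // 110bbbbb
    if (($byte & 0xF0) === 0xE0) { return 3; } // 1110bbbb
    if (($byte & 0xF8) === 0xF0) { return 4; } // 11110bbb
    return 0; // a trailing byte (10bbbbbb) or a pattern outside the table
}

// The euro sign, U+20AC, is encoded as the three bytes E2 82 AC.
$euro = "\xE2\x82\xAC";
echo utf8_sequence_length(ord($euro[0])), "\n"; // prints 3
```

The lead byte 0xE2 falls in the 0xE0 to 0xEF range, so two trailing bytes are expected after it.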

To filter out invalid sequences, we need to first describe the set of conditions in which a sequence of bytes is invalid. We can do this with three rules of thumb. First, we know how many trailing bytes should follow a lead byte. If any of the required trailing bytes don't exist or are lead bytes, then we have an invalid sequence. This can be expressed using these rules:

For a byte matching 110xxxxx, the following 1 byte should exist and match 10xxxxxx.
For a byte matching 1110xxxx, the following 2 bytes should exist and match 10xxxxxx.
For a byte matching 11110xxx, the following 3 bytes should exist and match 10xxxxxx.
And so on.

Using Perl-style regular expressions, these rules translate to these regular expressions (in which a byte stream matching any of these productions is invalid):

[\xC0-\xDF]([^\x80-\xBF]|$)
[\xE0-\xEF].{0,1}([^\x80-\xBF]|$)
[\xF0-\xF7].{0,2}([^\x80-\xBF]|$)
[\xF8-\xFB].{0,3}([^\x80-\xBF]|$)
[\xFC-\xFD].{0,4}([^\x80-\xBF]|$)
[\xFE-\xFE].{0,5}([^\x80-\xBF]|$)
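To see these productions in action, here is a small sketch that joins the first two of them and runs them over a few byte strings. The wrapper name has_missing_trailing_byte is hypothetical, and the samples are chosen only for illustration.

```php
<?php
// Hypothetical wrapper around the first two productions from the text.
// Returns true when a lead byte is missing one of its trailing bytes.
function has_missing_trailing_byte(string $bytes): bool {
    $rx  = '[\xC0-\xDF]([^\x80-\xBF]|$)';
    $rx .= '|[\xE0-\xEF].{0,1}([^\x80-\xBF]|$)';
    return preg_match("!$rx!", $bytes) === 1;
}

$samples = array(
    "\xC3\xA9",       // complete 2-byte sequence
    "\xC3",           // 2-byte lead at the end of the string
    "\xE2\x82\xAC",   // complete 3-byte sequence
    "\xE2\x82" . "A", // 3-byte lead with only one trailing byte
);
foreach ($samples as $bytes) {
    $verdict = has_missing_trailing_byte($bytes) ? 'invalid' : 'no match';
    echo bin2hex($bytes), ': ', $verdict, "\n";
}
```

The two complete sequences produce no match, while the truncated ones are flagged as invalid.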

The next set of rules is needed to ensure that the sequence of trailing bytes isn't too long. When we have a lead byte that expects three trailing bytes, we don't want the fourth byte after the leading byte to be a trailing byte (since all code points must start with a leading byte). In English, these rules can be expressed like this:

For a byte matching 0xxxxxxx, the byte 1 byte after must not match 10xxxxxx.
For a byte matching 110xxxxx, the byte 2 bytes after must not match 10xxxxxx.
For a byte matching 1110xxxx, the byte 3 bytes after must not match 10xxxxxx.
And so on.

Using regular expressions, these rules can be expressed like so (again, matches indicate invalid sequences):

[\x00-\x7F][\x80-\xBF]
[\xC0-\xDF].[\x80-\xBF]
[\xE0-\xEF]..[\x80-\xBF]
[\xF0-\xF7]...[\x80-\xBF]
[\xF8-\xFB]....[\x80-\xBF]
[\xFC-\xFD].....[\x80-\xBF]
[\xFE-\xFE]......[\x80-\xBF]
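The same kind of quick check works for the "too many trailing bytes" productions. This sketch uses only the first two of them, and the wrapper name has_extra_trailing_byte is an assumption made for illustration.

```php
<?php
// Hypothetical wrapper around the first two productions from the text.
// Returns true when a trailing byte appears where a new sequence must start.
function has_extra_trailing_byte(string $bytes): bool {
    $rx  = '[\x00-\x7F][\x80-\xBF]';
    $rx .= '|[\xC0-\xDF].[\x80-\xBF]';
    return preg_match("!$rx!", $bytes) === 1;
}

$samples = array(
    "A\x80",            // ASCII byte followed by a trailing byte
    "\xC3\xA9\xA9",     // 2-byte sequence with one trailing byte too many
    "\xC3\xA9" . "A",   // 2-byte sequence followed by ASCII
    "\xC3\xA9\xC3\xA9", // two complete 2-byte sequences
);
foreach ($samples as $bytes) {
    $verdict = has_extra_trailing_byte($bytes) ? 'invalid' : 'no match';
    echo bin2hex($bytes), ': ', $verdict, "\n";
}
```

Only the first two samples match, because in each of them a byte in the range 0x80 to 0xBF sits in a position where a lead byte or ASCII byte is required.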

We need to add one final rule in order to check that the sequence did not start with a trailing byte. We can express that using this regular expression:

^[\x80-\xBF]

So we can now build our full validity checker using a single regular expression, by combining all of our rules into a single production:

function is_valid_utf8(&$input){
        $rx = '[\xC0-\xDF]([^\x80-\xBF]|$)';
        $rx .= '|[\xE0-\xEF].{0,1}([^\x80-\xBF]|$)';
        $rx .= '|[\xF0-\xF7].{0,2}([^\x80-\xBF]|$)';
        $rx .= '|[\xF8-\xFB].{0,3}([^\x80-\xBF]|$)';
        $rx .= '|[\xFC-\xFD].{0,4}([^\x80-\xBF]|$)';
        $rx .= '|[\xFE-\xFE].{0,5}([^\x80-\xBF]|$)';
        $rx .= '|[\x00-\x7F][\x80-\xBF]';
        $rx .= '|[\xC0-\xDF].[\x80-\xBF]';
        $rx .= '|[\xE0-\xEF]..[\x80-\xBF]';
        $rx .= '|[\xF0-\xF7]...[\x80-\xBF]';
        $rx .= '|[\xF8-\xFB]....[\x80-\xBF]';
        $rx .= '|[\xFC-\xFD].....[\x80-\xBF]';
        $rx .= '|[\xFE-\xFE]......[\x80-\xBF]';
        $rx .= '|^[\x80-\xBF]';
        return preg_match("!$rx!", $input) ? 0 : 1;
}
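Note that these rules check only the shape of each sequence, not the value it decodes to. The combined expression therefore accepts an overlong pair such as 0xC0 0x80 and a stray 0xFF byte, both of which the current UTF-8 definition (RFC 3629) forbids. If that matters for an application, one option is to lean on PCRE's own UTF-8 checking. The sketch below assumes PCRE was built with UTF-8 support; the function name is_strict_utf8 is hypothetical.

```php
<?php
// Hypothetical stricter check, not the author's function. With the /u
// modifier, preg_match() returns false when the subject is not valid UTF-8;
// the empty pattern matches any valid string, so a valid subject yields 1.
function is_strict_utf8(string $input): bool {
    return preg_match('//u', $input) === 1;
}

var_dump(is_strict_utf8("caf\xC3\xA9")); // well-formed two-byte sequence
var_dump(is_strict_utf8("\xC0\x80"));    // overlong encoding of U+0000
var_dump(is_strict_utf8("\xFF"));        // byte that never occurs in UTF-8
```

The first call should report true and the other two false.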

With PHP as our reference language, cleaning up input that doesn't conform is easy. If we receive data that is invalid as UTF-8, we can assume it used a different encoding. Since we don't know what that encoding was, we'll assume it was Latin-1 (ISO-8859-1) on the grounds that it's the most common character set that's still ASCII compatible.

Every sequence of bytes is valid Latin-1, so we can convert any bytes we receive. The formula for this conversion is fairly trivial (given that the resulting sequences for each input byte can be only 1 or 2 bytes long), but we don't even need to go that far. PHP provides a built-in function called utf8_encode( ) that does exactly what we want.
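For reference, the conversion formula mentioned above can be sketched by hand. Bytes below 0x80 are copied as is, and every other byte is split across a two-byte sequence. The function name latin1_to_utf8 is an assumption made for illustration. Such a helper may also be useful on newer installations, since utf8_encode() is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper illustrating the Latin-1 to UTF-8 formula by hand.
function latin1_to_utf8(string $input): string {
    $out = '';
    $len = strlen($input);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($input[$i]);
        if ($c < 0x80) {
            // ASCII range: one byte, unchanged
            $out .= $input[$i];
        } else {
            // 0x80-0xFF: top 2 bits go in the lead byte (110000bb),
            // low 6 bits go in the trailing byte (10bbbbbb)
            $out .= chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F));
        }
    }
    return $out;
}

// 0xE9 is "é" in Latin-1 and becomes C3 A9 in UTF-8.
echo bin2hex(latin1_to_utf8("caf\xE9")), "\n"; // prints 636166c3a9
```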

Bundling this knowledge together with our detection function, we can now create a function to ensure all of our input is safe:

function ensure_input_is_valid_utf8(&$input){
        if (!is_valid_utf8($input)){
                $input = utf8_encode($input);
        }
}

It's important to note that we're purposefully using pass-by-reference here to avoid copying strings around when we're not necessarily going to edit them. Usually, we will receive valid UTF-8 data, and the string will pass to our function by reference. We pass it on to the validation function, again by reference, where we perform a regular expression match on it. If the data is valid, we do nothing: the referenced string stays as is, and we've avoided any extra copying.

Performing a complex regular expression can be a little slow, especially when we're going to be doing it a lot. If we're unsure of how long something is going to take, there's nothing that gives a better idea of performance than running a quick benchmark. We can write a very quick benchmark harness to loop creating some random data and verifying it. With 1,000 loops of 1 KB of random data, we find that it takes about 0.66 seconds:

[calh@admin1 ~] $ php -q utf8_bench.php
Trying 1000 loops of 1024 bytes ... done
RegEx: 0.655966 secs (1524.490 per/sec)

With the function on our test hardware, we can verify about 1.5 MB of textual data per second. This is probably fine for any small to medium-scale application; 1.5 MB of text is a huge amount. If we're doing things on a huge scale, where we might be receiving that much data on a single box, then it might be important to optimize further.
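The harness itself is not shown here, but a rough sketch of one might look like the following. Everything in it is an assumption: the function name bench, the choice to generate the data before starting the timer, and the stand-in validator used so that the sketch runs on its own. Random bytes are almost never valid UTF-8, so a harness like this mostly exercises the rejection path.

```php
<?php
// Hypothetical harness; names and output format are modeled on the
// benchmark output shown in the text, not taken from the original script.
function bench(string $label, callable $check, int $loops, int $size): float {
    // Build the test data up front so generation time is not measured.
    $data = array();
    for ($i = 0; $i < $loops; $i++) {
        $data[] = random_bytes($size);
    }

    $start = microtime(true);
    foreach ($data as $bytes) {
        $check($bytes);
    }
    $secs = microtime(true) - $start;

    // Guard against a zero interval on very small runs.
    $rate = ($secs > 0) ? $loops / $secs : 0.0;
    printf("%s: %f secs (%.3f per/sec)\n", $label, $secs, $rate);
    return $secs;
}

echo "Trying 1000 loops of 1024 bytes ...\n";
// Stand-in validator; swap in the function under test.
bench('PCRE/u', function ($s) { return preg_match('//u', $s) === 1; }, 1000, 1024);
```

Timings from a sketch like this depend entirely on the hardware and PHP build, so they are only useful for comparing implementations against each other on the same machine.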

Our basic rule of laziness suggests that somebody has already done this before us and we can use their work. We saw a few Unicode extensions for PHP in the previous chapter that we could probably use for this function. The iconv extension allows us to convert from one character set to another. If we ask it to transform some data from UTF-8 into UTF-8, it should filter out any invalid sequences. A quick check verifies that this is true, so we can write a short function to check for valid strings:

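function is_valid_utf8_iconv(&$str){
  // iconv returns false (with a notice) or a shortened string when it
  // meets an invalid sequence, so only an identical result counts as valid
  $out = @iconv("UTF-8", "UTF-8", $str);
  return ($out === $str) ? 1 : 0;
}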

We can add this to our benchmark to see how it shapes up against the regular expression version. The difference is interesting:

[calh@admin1 ~] $ php -q utf8_bench_2.php
Trying 1000 loops of 1024 bytes ... done
RegEx: 0.655966 secs ( 1524.490 per/sec)
iconv: 0.028700 secs (34843.068 per/sec)

So iconv can process roughly 34 megabytes of string data per second. It shouldn't be much of a surprise that iconv is faster, since it's written in C, but the size of the difference is a little surprising. Going the iconv route, we can process text much faster (almost 23 times faster) and have to write less code. There's always a downside, though: we have to compile and install the iconv extension and have it loaded into PHP. In addition to the up-front effort and ongoing maintenance this adds to your environment, it also makes every Apache process a little bit fatter. Adding modules shouldn't be taken lightly, and unless we really need the raw speed, the PHP version will probably serve us fine.

Because iconv is an extension, it might seem that PHP itself isn't to blame and that regular expressions are what make the PHP code slow. To test that, it's fairly simple to write a small state machine verifier that loops over the bytes one by one to verify the UTF-8 validity:

function is_valid_utf8_statemachine(&$input){
        // $more counts the trailing bytes still owed to the current lead byte
        $more = 0;
        $len = strlen($input);
        for($i=0; $i<$len; $i++){
                $c = ord($input[$i]);
                if ($c <= 0x7F){
                        // single-byte (ASCII) character
                        if ($more > 0){ return 0; }
                }elseif ($c <= 0xBF){
                        // trailing byte
                        if ($more == 0){ return 0; }
                        $more--;
                }elseif ($c <= 0xDF){
                        if ($more > 0){ return 0; }
                        $more = 1;
                }elseif ($c <= 0xEF){
                        if ($more > 0){ return 0; }
                        $more = 2;
                }elseif ($c <= 0xF7){
                        if ($more > 0){ return 0; }
                        $more = 3;
                }elseif ($c <= 0xFB){
                        if ($more > 0){ return 0; }
                        $more = 4;
                }elseif ($c <= 0xFD){
                        if ($more > 0){ return 0; }
                        $more = 5;
                }elseif ($c <= 0xFE){
                        if ($more > 0){ return 0; }
                        $more = 6;
                }else{
                        // 0xFF matches no lead or trailing byte pattern
                        return 0;
                }
        }
        return ($more == 0) ? 1 : 0;
}

Here we loop over each byte and keep a counter with the remaining number of expected trailing bytes. When we run into a trailing byte, we decrement the counter after checking that it was greater than zero. When we run into a leading byte, we check the counter is zero and set it up as appropriate. When we reach the end of the string, we check the counter is at zero.

So now we have three versions of the same function: one using regular expressions, one using the iconv library, and one in pure PHP. By plugging them all into the benchmarking script, we can see which comes out on top:

[calh@admin1 ~] $ php -q utf8_bench_3.php
Trying 1000 loops of 1024 bytes ... done
RegEx: 0.655966 secs (1524.490 per/sec)
iconv: 0.028700 secs (34843.068 per/sec)
State: 2.253097 secs ( 443.834 per/sec)

The state machine is much slower than either of our other efforts. That's not unexpected, since the regular expression extension, PCRE, is written in C and highly optimized. We'll either stick with our regular expression version or use iconv if we already have it loaded or don't mind taking the process-size hit.

Even though our implementations above don't allow any invalid byte sequences through, we need to remember that this only makes our data valid, not good. If our character sequence starts with a combining mark, such as a diacritical accent that binds to the character preceding it, then the string makes no sense. Dealing with these issues gets even more complicated.
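As one small illustration of a "valid but not good" check, the sketch below tests whether a string opens with a combining mark. It assumes the input has already passed a validity check and that PCRE was built with Unicode property support; the function name starts_with_combining_mark is hypothetical, and this covers only one of the many issues in this area.

```php
<?php
// Hypothetical helper: true when the first code point of a valid UTF-8
// string belongs to the Unicode "Mark" category (\p{M}).
function starts_with_combining_mark(string $input): bool {
    return preg_match('/^\p{M}/u', $input) === 1;
}

// U+0301 (combining acute accent) is encoded as the bytes CC 81.
var_dump(starts_with_combining_mark("\xCC\x81a")); // mark comes first
var_dump(starts_with_combining_mark("a\xCC\x81")); // mark follows "a"
```

The first string should be reported as starting with a combining mark and the second should not.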