
6.8. Character Sets and Encodings

We've already looked at the character set and encoding issues presented by web applications, but email presents some new problems. When we create a web form, we specify the encoding the form is using, which implies the encoding which should be used for posting the form data. With email, there is no equivalent: email addresses don't contain information about what character set or encoding they want to receive, nor does the SMTP protocol provide a challenge/response mechanism to ask your mail client to send data in a specific format.

Hopefully, it isn't necessary to say this, but incoming email isn't always encoded in the Latin-1 character set. It's also worth pointing out that not all email is sent using UTF-8, although many emails are. Users often have control over the character set they send email with (though not always), so you could conceivably ask your users to always send email encoded using UTF-8. If you were creating a service for five of your tech-savvy friends, then that might be acceptable, but for the general public, it's not going to happen.

Luckily, the email standard contains a subheader just for this purpose: to describe the encoding of the mail body. Email headers must be encoded using ASCII, but the ASCII headers can describe a non-ASCII body. The header should look quite familiar to you:

Content-Type: text/plain;charset="utf-8"

So we can find out the character set that the body segment was encoded in. This puts us in quite a different situation than that of specifying the encoding when sending out web pages: we know the encoding of our data, but we need to convert it to another encoding (assuming the mail wasn't sent as UTF-8). Converting between character sets is a tricky process and requires knowledge of all the mappings between them. Because of the way Unicode was designed, nearly every other encoding maps to it losslessly. That is to say, data specified in any common character set can be transformed to Unicode with no loss of information, which isn't true of conversions directly between other character sets, as we saw earlier.
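
Before we can convert anything, though, we need to pull that charset value out of the Content-Type header. The snippet below is a minimal sketch of one way to do it in PHP; the function name extract_charset and the regular expression are our own inventions, and real-world headers (folded lines, comments, oddly quoted parameters) can be messier than this handles:

function extract_charset($content_type){

    // match an optionally quoted charset parameter, e.g. charset="utf-8"
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._-]+)"?/i', $content_type, $m)){
        return strtoupper($m[1]);
    }

    // no charset parameter found
    return null;
}

$charset = extract_charset('text/plain;charset="utf-8"'); // gives "UTF-8"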

We could go and find the mappings between the encodings and attempt to convert this data ourselves (the mappings are freely available from the International Components for Unicode/ICU project), but if there's one thing we value, it's laziness. There are already programs and services to do this for us, so we'll just leverage the existing code. After all, it's not a problem unique to our situation, so very little customization will be needed.

In PHP we can use the iconv library (http://www.gnu.org/software/libiconv/) to convert between character sets without having to shell out to an external program; you should recognize iconv from the previous chapter. The PHP iconv extension isn't compiled by default, but is fairly easy to install. Once installed, the following PHP is all that's needed to convert your data:

// convert $latin1_text from Latin-1 (ISO-8859-1) into UTF-8
$utf8_text = iconv("ISO-8859-1", "UTF-8", $latin1_text);

If we're using Perl, then the Encode module does much the same thing. The equivalent code would be:

use Encode;

# decode the Latin-1 bytes into Perl's internal form, then encode that as UTF-8
$utf8_text = encode("utf8", decode("iso-8859-1", $latin1_text));

What should we do if the character set and encoding are not specified in the message headers? The first step should be to recurse through the parent chunks to look for encodings there. Some mailers will set a Content-Type header with a character set in the main headers, but omit the character sets in the individual multipart chunk headers. If after recursing to the very top we haven't found a charset specifier, then we can treat the data as Latin-1. RFC 1521 actually specifies the default encoding:

Default RFC 822 messages are typed by this protocol as plain text in the 
US-ASCII character set, which can be explicitly specified as 
"Content-type: text/plain; charset=us-ascii".
If no Content-Type is specified, this default is assumed.

The spec goes on to say that in the absence of a MIME-Version header, you can't be sure that the input is ASCII, but you have no alternative but to assume that it is. The problem with treating unknown input as ASCII is that any bytes with values above 127 fall outside of ASCII and might confuse our parser. For this reason, we treat all unknown input as Latin-1 instead, so we can take any stream and convert it to valid UTF-8. The worst that can happen is that we get garbage data, but garbage that's still valid UTF-8.
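
To tie this together, the conversion step might look something like the following sketch. The function name convert_to_utf8 is our own, and the //IGNORE suffix (a libiconv extension that PHP's iconv supports, which silently drops sequences it can't translate) is a choice we've made here rather than anything required by the mail specs:

function convert_to_utf8($text, $charset){

    // nothing specified in this chunk or any of its parents - assume Latin-1
    if (!$charset){
        $charset = 'ISO-8859-1';
    }

    $utf8 = @iconv($charset, 'UTF-8//IGNORE', $text);

    // the charset label was bogus or unsupported; Latin-1 maps every
    // possible byte, so this fallback always produces valid UTF-8
    if ($utf8 === false){
        $utf8 = iconv('ISO-8859-1', 'UTF-8', $text);
    }

    return $utf8;
}

Whatever the sender claimed, every chunk that comes out of a function like this is valid UTF-8.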

The concept of degrading gracefully is covered extensively in the RFC documents and is best summed up by this quote from Jon Postel in RFC 760:

Be liberal in what you accept, and conservative in what you send.

This is a useful guiding principle for any data input and output from any publicly accessible web application.

