4.5. UTF-8 Web Applications
When we talk about making an application use UTF-8, what do we mean? It means a few things, all of which are fairly simple but need to be borne in mind throughout your development.
4.5.1. Handling Output
We want all of our pages to be served using UTF-8. To do this, we need to create our markup templates using an editor that is Unicode aware, and ask for the files to be saved in UTF-8. For the most part, if you were previously using Latin-1 (more officially called ISO-8859-1), then nothing much will change. In fact, nothing at all will change unless you were using characters outside the ASCII range, such as accented letters: those occupy a single byte in Latin-1 but two bytes in UTF-8. With your templates encoded into UTF-8, all that's left is to tell the browser how the pages that you're serving are encoded. You can do this using the Content-Type header's charset property:
Content-Type: text/html; charset=utf-8
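To make the earlier point about Latin-1 templates concrete, here is a minimal sketch showing which bytes change when a Latin-1 string is re-encoded as UTF-8. The function name latin1_to_utf8 is hypothetical and exists only for illustration; in practice you would normally let your editor or a conversion library do this work.

```php
<?php
// Hypothetical helper: re-encode a Latin-1 (ISO-8859-1) byte string as UTF-8.
function latin1_to_utf8(string $latin1): string
{
    $utf8 = '';
    $length = strlen($latin1);
    for ($i = 0; $i < $length; $i++) {
        $byte = ord($latin1[$i]);
        if ($byte < 0x80) {
            // ASCII range: the byte is identical in both encodings.
            $utf8 .= chr($byte);
        } else {
            // 0x80-0xFF: the character becomes a two-byte UTF-8 sequence.
            $utf8 .= chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        }
    }
    return $utf8;
}

$plain    = "cafe";     // ASCII only
$accented = "caf\xE9";  // Latin-1 bytes; 0xE9 is e-acute

echo bin2hex(latin1_to_utf8($plain)), "\n";    // 63616665 (unchanged)
echo bin2hex(latin1_to_utf8($accented)), "\n"; // 636166c3a9 (0xE9 became 0xC3 0xA9)
```

A template containing only ASCII is therefore byte-for-byte the same file in either encoding; only the accented characters need converting.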
If you haven't yet noticed, charset is a bizarre name to choose for this property: it represents both character set and encoding, although mostly encoding. So how do we output this header with our pages? There are a few ways, and a combination of some or all will work well for most applications.
Sending out a regular HTTP header can be done via your application's code or through your web server configuration. If you're using Apache, then you can add the AddCharset directive to either your main httpd.conf file or a specific .htaccess file to set the charset header for all documents with the given extension:
AddCharset UTF-8 .php
In PHP, you can output HTTP headers using the simple header( ) function. To output the specific UTF-8 header, use the following code:
header("Content-Type: text/html; charset=utf-8");
The small downside to this approach is that you also need to explicitly output the main content-type (text/html in our example), rather than letting the web server automatically determine the type to send based on the browser's user agent. This can matter when choosing whether to send a content-type of text/html or application/xhtml+xml, since the latter is technically correct but causes Netscape 4 and some versions of Internet Explorer 6 to prompt you to download the page.
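If you do take over the content-type yourself, one way to make that choice is sketched below. It is an illustration under stated assumptions, not a definitive implementation: the function name choose_content_type is hypothetical, and it inspects the request's Accept header (rather than the user agent string) on the assumption that a browser which explicitly lists application/xhtml+xml can render it. It ignores quality values in the Accept header.

```php
<?php
// Hypothetical helper: pick a Content-Type value from the request's Accept header.
function choose_content_type(string $accept): string
{
    // Assumption: a client that names application/xhtml+xml can render it.
    // Anything else gets the safe default of text/html.
    $type = stripos($accept, 'application/xhtml+xml') !== false
        ? 'application/xhtml+xml'
        : 'text/html';

    // The charset is always stated explicitly, whichever type is chosen.
    return $type . '; charset=utf-8';
}

// A missing Accept header falls back to text/html.
$accept = $_SERVER['HTTP_ACCEPT'] ?? '';

if (!headers_sent()) {
    header('Content-Type: ' . choose_content_type($accept));
}
```

Keeping the decision in a small function means the charset is attached in exactly one place.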
In addition to sending out the header as part of the regular HTTP response, you can include a copy of the header in the HTML body by using the meta tag. This can be easily added to your pages by placing the following HTML inside the head tag in your templates:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
The advantage of using a meta tag over the normal header is that should anybody save the page, which would save only the response body and not the headers, then the encoding would still be present. It's still important to send a header and not just use the meta tag, for a couple of reasons. First, your web server might already be sending an incorrect encoding, which would override the http-equiv version; you'd need to either suppress this or replace it with the correct header. Second, most browsers will have to start re-parsing the document after reaching the meta tag, since they may have already parsed text assuming the wrong encoding. This can create a delay in page rendering or, depending on the user's browser, the meta tag may be ignored altogether. It hopefully goes without saying that the encoding in your HTTP header should match that in your meta tag; otherwise, the final rendering of the page will be unpredictable.
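One simple way to keep the header and the meta tag in agreement is to generate both from a single value. The following is a minimal sketch; the constant PAGE_CHARSET and both function names are hypothetical and not part of PHP.

```php
<?php
// Hypothetical single source of truth for the page encoding.
const PAGE_CHARSET = 'utf-8';

// Value shared by the HTTP header and the meta tag.
function content_type_value(): string
{
    return 'text/html; charset=' . PAGE_CHARSET;
}

// Markup to place inside the head tag of a template.
function content_type_meta(): string
{
    return '<meta http-equiv="Content-Type" content="' . content_type_value() . '">';
}

// Send the real header first, before any output...
if (!headers_sent()) {
    header('Content-Type: ' . content_type_value());
}

// ...then emit the matching copy as part of the page.
echo content_type_meta(), "\n";
```

Because both strings come from content_type_value(), changing the encoding in one place changes it in both.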
The same header( ) approach applies when you serve XML documents; only the content-type changes:
header("Content-Type: text/xml; charset=utf-8");
Unlike HTML, XML has no way to include arbitrary HTTP headers in documents. Luckily, XML has direct support for encodings (appropriately named this time) as part of the XML preamble, formally known as the XML declaration. To specify your XML document as UTF-8, you simply need to indicate it as such in the preamble:
<?xml version="1.0" encoding="utf-8"?>
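Putting the header and the preamble together, an XML response might be produced as in the sketch below. The function name render_xml_message and the message element are hypothetical, chosen only for illustration, and the input is assumed to be valid UTF-8 already.

```php
<?php
// Hypothetical helper: wrap some UTF-8 text in a one-element XML document.
function render_xml_message(string $text): string
{
    // Escape &, <, > and quotes for XML; the input is assumed to be valid UTF-8.
    $escaped = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');

    // The preamble declares the same encoding that the HTTP header announces.
    return '<?xml version="1.0" encoding="utf-8"?>' . "\n"
         . '<message>' . $escaped . '</message>' . "\n";
}

if (!headers_sent()) {
    header('Content-Type: text/xml; charset=utf-8');
}

// "caf\xC3\xA9" is the UTF-8 byte sequence for "café".
echo render_xml_message("caf\xC3\xA9 & tea");
```

As with HTML, the encoding named in the preamble should match the one in the HTTP header.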
4.5.2. Handling Input
Input sent back to your application via form fields will automatically be sent using the same character set and encoding as the referring page was sent out in. That is to say, if all of your pages are UTF-8 encoded, then all of your input will also be UTF-8 encoded. Great!
Of course, there are some caveats to this wonderful utopia of uniform input. If somebody creates a form on another site that submits data to a URL belonging to your application, then the input will be encoded using the character set of the form from which the data originated. Very old browsers may always send data in a particular encoding, regardless of the one you asked for. Users might build applications that post data to your application accidentally using the wrong encoding. Some users might create applications that purposefully post data in an unexpected encoding.
All of these input vectors result in the same outcome: all incoming data has to be filtered before you can safely use it. We'll talk about that in a lot more detail in the next chapter.
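As a small preview of what such filtering can look like, the sketch below rejects form input that is not well-formed UTF-8. It is only one possible check, not a complete filter. The function name is_valid_utf8 and the form field name comment are hypothetical; the check relies on the fact that PCRE validates the subject string as UTF-8 when the u modifier is used.

```php
<?php
// Hypothetical helper: report whether a byte string is well-formed UTF-8.
function is_valid_utf8(string $value): bool
{
    // With the /u modifier, preg_match() returns false for a malformed
    // subject; the empty pattern matches any well-formed string, giving 1.
    return preg_match('//u', $value) === 1;
}

// 'comment' is a made-up field name used only for this example.
$comment = $_POST['comment'] ?? '';

// Form fields can also arrive as arrays, so check the type first.
if (!is_string($comment) || !is_valid_utf8($comment)) {
    // Simplest policy for a sketch: discard input we cannot trust.
    $comment = '';
}
```

Whether to discard, repair, or reject bad input outright is a policy decision for your application; this sketch simply discards it.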