4.1. Internationalization and Localization

Internationalization and localization are buzzwords in the web applications field, partly because they're nice long words you can dazzle people with, and partly because they're becoming more important in today's world. Internationalization and localization are often talked about as a pair, but they mean distinct things, and it's important to understand the difference:

Internationalization is adding to an application the ability to input, process, and output international text.
Localization is the process of making a customized application available to a specific locale.

Internationalization is often shortened to i18n (the "18" representing the 18 removed letters) and localization to L10n (for the same reason, although an uppercase "L" is used for visual clarity), and we'll refer to them as such from this point on, if only to save ink. As with most hot issues, there are a number of other terms people have associated with i18n and L10n, which are worth pointing out if only to save possible confusion later on: globalization (g11n) refers to both i18n and L10n collectively, while personalization (p13n) and reach (r3h) refer solely to L10n.

4.1.1. Internationalization in Web Applications

Way back in the distant past (sometimes referred to as the 90s), having internationalization support meant that your application could input, store, and output data in a number of different character sets and encodings. Your English-speaking audience would converse with you in Latin-1, your Russian speakers in KOI8-R, your Japanese users in Shift_JIS, and so on. And all was well, unless you wanted to present data from two different user sets on the same page. Each of these character sets and encodings allowed the representation and encoding of a defined set of characters, usually somewhere between 100 and 250. Sometimes some of these characters would overlap, so you could store and display the character Ю (Cyrillic capital letter Yu) in both KOI8-Ukrainian as the byte 0xE0 and Apple Cyrillic as the byte 0x9E. But more often than not, characters from one character set weren't displayable in another. You can represent the character ね (Hiragana letter Ne) in IBM-971-Korean, and the character Ų (Latin capital letter U with Ogonek) in IBM-914-Baltic, but not vice versa.
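To make the overlap concrete, here is a minimal sketch, assuming PHP's mbstring extension is available. It encodes Ю in three legacy Cyrillic character sets. Windows-1251 and ISO-8859-5 stand in for Apple Cyrillic here only because mbstring is known to support them, and the helper name byte_in() is hypothetical.

```php
<?php
// Sketch: one character, several single-byte encodings.
// byte_in() is a hypothetical helper, not part of any library.
function byte_in(string $utf8Char, string $charset): string
{
    // Convert from UTF-8 into the legacy charset and show the raw byte(s) as hex.
    return strtoupper(bin2hex(mb_convert_encoding($utf8Char, $charset, 'UTF-8')));
}

$yu = "\u{042E}"; // Ю, Cyrillic capital letter Yu

foreach (['KOI8-U', 'Windows-1251', 'ISO-8859-5'] as $charset) {
    printf("%-12s 0x%s\n", $charset, byte_in($yu, $charset));
}
```

The same character comes out as a different byte in each set, which is why a stored byte means nothing until you know which character set it belongs to.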

And as lovely as these different character sets were, there were additional problems beyond representing each other's characters. Every piece of stored data needed to be tagged with the character set it was stored as. Any operations on a string had to respect the character set; you can't perform a byte-based substring operation on a Shift_JIS string. When you came to output a page, the HTML and all the content had to be output using the correct character set.
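The substring problem is easy to reproduce. The following is a rough sketch, again assuming the mbstring extension; it compares PHP's byte-based substr() with the character-aware mb_substr() on a Shift_JIS string.

```php
<?php
// Sketch: byte-based vs. character-aware substring on a Shift_JIS string.
$utf8 = "\u{65E5}\u{672C}\u{8A9E}";                  // 日本語, three characters
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8'); // two bytes per character here

$byteCut = substr($sjis, 0, 3);            // counts bytes: splits the second character
$charCut = mb_substr($sjis, 0, 2, 'SJIS'); // counts characters: keeps 日本 intact

printf("bytes: %d, characters: %d\n", strlen($sjis), mb_strlen($sjis, 'SJIS'));
var_dump(mb_check_encoding($byteCut, 'SJIS')); // is the byte-based cut still valid Shift_JIS?
echo mb_convert_encoding($charCut, 'UTF-8', 'SJIS'), "\n";
```

The byte-based cut ends halfway through a two-byte character, so whatever receives that string gets a dangling lead byte rather than text.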

At some point, somebody said enough was enough and he wouldn't stand for this sort of thing any more. The solution seemed obvious: a single character set and encoding to represent all the characters you could ever wish to store and display. You wouldn't have to tag strings with their character sets, you wouldn't need many different string functions, and you could output pages in just one format. That sounded like a neat idea.

And so in 1991, Unicode was born. With a character set that spanned the characters of every known written language (including all the diacritics and symbols from all the existing character sets) and a set of fancy new encodings, it was sure to revolutionize the world. And after wallowing in relative obscurity for about 10 years, it finally did.

In this chapter, we're going to deal solely with the Unicode character set and the UTF-8 encoding for internationalization. It's true that you could go about it a different way, using either another Unicode encoding or going down the multiple character set path, but for applications storing primarily English data, UTF-8 usually makes the most sense. For applications storing a large amount of CJKV data (or any data with many high-numbered code points), UTF-16 can be a sensible choice. Aside from the variable-length code point representations, the rest of this chapter applies equally well to UTF-16 as it does to UTF-8.
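The size trade-off between the two encodings can be checked directly. This is a small sketch, assuming the mbstring extension; encoded_sizes() is a hypothetical helper written for this illustration.

```php
<?php
// Sketch: compare storage size of the same text in UTF-8 and UTF-16.
// encoded_sizes() is a hypothetical helper for illustration.
function encoded_sizes(string $utf8): array
{
    return [
        'utf8'  => strlen($utf8),
        // UTF-16BE is used so that no byte-order mark is added to the count.
        'utf16' => strlen(mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8')),
    ];
}

$english  = 'Hello';
$japanese = "\u{65E5}\u{672C}\u{8A9E}"; // 日本語

print_r(encoded_sizes($english));  // 5 bytes in UTF-8, 10 in UTF-16
print_r(encoded_sizes($japanese)); // 9 bytes in UTF-8, 6 in UTF-16
```

English text doubles in size under UTF-16, while these Japanese characters shrink from three bytes each to two.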

4.1.2. Localization in Web Applications

The localization of web applications is quite different from internationalization, though the latter is a prerequisite for the former. When we talk about localizing a web application, we mean presenting the user with a different interface (usually just textually) based on their preferred locale.

What the Heck Is . . . a Locale?

The term locale describes a set of localization preferences. This usually includes a language and region and can often also contain preferences about time zone, time and date formats, number display, and currency. A single locale is usually stored against a user's account, so that various parts of the application can be tailored to that user: times displayed in her own time zone, text displayed in her language, her own thousand separators, and so on.
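As a rough illustration of what such a stored locale might look like, here is a sketch in PHP. The array keys and the format_for_user() function are hypothetical, and the German-style preferences are just sample values.

```php
<?php
// Sketch: a locale as a bundle of preferences stored against a user account.
// The array keys and format_for_user() are hypothetical.
$userLocale = [
    'language'      => 'de',
    'region'        => 'DE',
    'timezone'      => 'Europe/Berlin',
    'date_format'   => 'd.m.Y H:i',
    'decimal_point' => ',',
    'thousands_sep' => '.',
];

function format_for_user(array $locale, float $amount, DateTimeImmutable $utcTime): array
{
    return [
        // Number display uses the user's own separators.
        'amount' => number_format($amount, 2, $locale['decimal_point'], $locale['thousands_sep']),
        // Times are stored in UTC and shifted to the user's time zone for display.
        'time'   => $utcTime
            ->setTimezone(new DateTimeZone($locale['timezone']))
            ->format($locale['date_format']),
    ];
}

$when = new DateTimeImmutable('2024-01-15 12:00:00', new DateTimeZone('UTC'));
print_r(format_for_user($userLocale, 1234567.891, $when));
```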

There are a few methods of localizing your site, none of which are very easy. This chapter deals primarily with internationalization, so we won't go into a lot of localization detail. We'll look briefly at three approaches you can take toward localization before we get back to internationalization basics.

4.1.2.1. String substitution

At the lowest level, you can use a library like GNU's gettext (http://www.gnu.org/software/gettext/), which allows you to substitute languages at the string level. For instance, take this simple piece of PHP code, which should output a greeting:

    printf("Hello %s!", $username);

Using a gettext wrapper function, we can substitute any language in there:

    printf(_("Hello %s!"), $username);

The only change is to call the gettext function, which is called _( ), passing along the English string. The gettext configuration files then contain a mapping of phrases into different languages, as shown in Examples 4-1 and 4-2.

Example 4-1. my_app.fr.po

msgid "Hello %s!"
msgstr "Bonjour %s!"

Example 4-2. my_app.ja.po

msgid "Hello %s!"
msgstr "こんにちは %s!"

At runtime, gettext returns the correct string, depending on the user's desired locale, which your application then outputs.
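The lookup itself is simple enough to imitate in a few lines. The sketch below is not gettext's API: $catalogs and translate() are hypothetical stand-ins using plain arrays, whereas real gettext reads compiled catalog files. It does mirror one real behavior, which is that a missing translation falls back to the original string.

```php
<?php
// Sketch: the lookup gettext performs, imitated with plain arrays.
// $catalogs and translate() are hypothetical; real gettext reads compiled catalog files.
$catalogs = [
    'fr' => ['Hello %s!' => 'Bonjour %s!'],
];

function translate(array $catalogs, string $locale, string $msgid): string
{
    // Fall back to the original (English) string when no translation exists.
    return $catalogs[$locale][$msgid] ?? $msgid;
}

$username = 'Alice';
printf(translate($catalogs, 'fr', 'Hello %s!') . "\n", $username);
printf(translate($catalogs, 'ja', 'Hello %s!') . "\n", $username); // no 'ja' catalog loaded here
```

Because the English string doubles as the lookup key, changing the English copy silently orphans every translation keyed on the old wording.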

The problem with string substitution is that any time you change any of your application's visual structure (changing a flow, adding an explanation, etc.), you need to immediately update every translation. This is all very well if you have a team of full-time translators on staff, but needing full translations before deploying any changes doesn't sit well with rapid web application development.

4.1.2.2. Multiple template sets

In an application where markup is entirely separated from any page logic, the templates act as processing-free documents (aside from conditionals and iterations). If you create multiple sets of templates, one in each locale you wish to support, then development in one set doesn't have to be synchronous with development in another. You can make multiple ongoing changes to your master locale, and then periodically batch those changes over to your other locales.
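One way to picture this is a resolver that looks in the set for the user's locale. The sketch below is only an illustration: the directory layout and resolve_template() are hypothetical, and falling back to the master locale for a template that hasn't been ported yet is an assumption of this example, not something the approach requires.

```php
<?php
// Sketch: pick a template from per-locale sets, falling back to the master locale.
// The directory layout and resolve_template() are hypothetical.
function resolve_template(string $name, string $locale, array $available, string $master = 'en'): string
{
    $candidate = "templates/$locale/$name";
    // A locale that has not yet received the batched changes falls back to the master set.
    return in_array($candidate, $available, true) ? $candidate : "templates/$master/$name";
}

$available = [
    'templates/en/home.tpl',
    'templates/en/news.tpl',
    'templates/fr/home.tpl',
];

echo resolve_template('home.tpl', 'fr', $available), "\n";
echo resolve_template('news.tpl', 'fr', $available), "\n";
```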

This approach does have its problems, though. Although the changes in markup can happen independently of any changes in page logic, any functional changes in page logic need to be reflected in the markup and copy. For instance, if a page in your application showed the latest five news items but was being changed to show a random selection of stories instead, then the copy saying so would have to be updated for all languages at once. The alternative is to not change the functionality for the different locales by supporting multiple functionalities simultaneously in the page logic and allowing it to be selected by the templates. This starts to get very complicated, putting multiple competing flows into the page logic layer.

4.1.2.3. Multiple frontends

Instead of smushing multiple logical flows into the page logic layer, you can create multiple page logic layers (including the markup and presentation layers above them). This effectively creates multiple sites on top of a single common storage and business logic layer.

By building your application's architecture around the layered model and exposing the business logic functions via an API (skip ahead to Chapter 11 for more information), you can initially support a single locale and then build other locale frontends later at your own pace. An internationalized business logic and storage layer then allows the sharing of data between locales: the data added via the Japanese locale application frontend can be seen in the Spanish locale application frontend.
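In miniature, the arrangement might look something like the following sketch. NewsStore and the two render functions are hypothetical names invented for this example; the point is only that both frontends call the same business logic and read the same stored data.

```php
<?php
// Sketch: two locale frontends sharing one business logic and storage layer.
// NewsStore and both render functions are hypothetical.
final class NewsStore
{
    private array $items = [];

    public function add(string $title): void
    {
        $this->items[] = $title; // stored once, as UTF-8, for every frontend
    }

    public function latest(int $count): array
    {
        // Newest first, limited to $count items.
        return array_slice(array_reverse($this->items), 0, $count);
    }
}

function render_spanish(NewsStore $store): string
{
    return "Últimas noticias: " . implode(', ', $store->latest(5));
}

function render_japanese(NewsStore $store): string
{
    // 最新ニュース ("latest news")
    return "\u{6700}\u{65B0}\u{30CB}\u{30E5}\u{30FC}\u{30B9}: " . implode(', ', $store->latest(5));
}

$store = new NewsStore();
$store->add("\u{65E5}\u{672C}\u{8A9E}"); // added through the Japanese frontend
echo render_spanish($store), "\n";       // visible through the Spanish frontend
```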

For more general information about i18n and L10n for the Web, you should visit the W3C's i18n portal at http://www.w3.org/International/.