10.12. Internationalization

Most programs present some information to users as text. Such text should be understandable and acceptable to the user. For example, in some countries and cultures, the date "March 7" can be concisely expressed as "3/7." Elsewhere, "3/7" indicates "July 3," and the string that means "March 7" is "7/3." In Python, such cultural conventions are handled with the help of standard module locale.

Similarly, a greeting can be expressed in one natural language by the string "Benvenuti," while in another language the string to use is "Welcome." In Python, such translations are handled with the help of standard module gettext.

Both kinds of issues are commonly called internationalization (often abbreviated i18n, as there are 18 letters between i and n in the full spelling). This is a misnomer, as the same issues also apply to users of different languages or cultures within a single nation.

10.12.1. The locale Module

Python's support for cultural conventions imitates that of C, though it is slightly simplified. In this architecture, a program operates in an environment of cultural conventions known as a locale. The locale setting permeates the program and is typically set at program startup. The locale is not thread-specific, and module locale is not thread-safe. In a multithreaded program, set the program's locale before starting secondary threads.

If a program does not call locale.setlocale, the program operates in a neutral locale known as the C locale. The C locale is named from this architecture's origins in the C language and is similar, but not identical, to the U.S. English locale. Alternatively, a program can find out and accept the user's default locale. In this case, module locale interacts with the operating system (via the environment or in other system-dependent ways) to establish the user's preferred locale. Finally, a program can set a specific locale, presumably determining which locale to set on the basis of user interaction or via persistent configuration settings such as a program initialization file.
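The three choices just described (stay in the C locale, accept the user's default, or set a specific locale) can be sketched as follows. The helper name choose_locale and its fallback policy are hypothetical, not part of module locale. The sketch relies only on locale.setlocale and the exception locale.Error, which setlocale raises for a locale it cannot set.

```python
import locale

def choose_locale(name=None):
    """Set LC_ALL for the whole program and return the resulting setting.

    name=None   -> accept the user's default locale ('' asks the system)
    name='...'  -> try that specific locale (names are platform-dependent)
    """
    if name is None:
        name = ''
    try:
        return locale.setlocale(locale.LC_ALL, name)
    except locale.Error:
        # hypothetical policy: if the locale is unavailable, stay neutral
        return locale.setlocale(locale.LC_ALL, 'C')
```

Call such a helper once, at program startup and before starting any secondary threads.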

Locale setting is normally performed across the board for all relevant categories of cultural conventions. This wide-spectrum setting is denoted by the constant attribute LC_ALL of module locale. However, the cultural conventions handled by module locale are grouped into categories, and, in some cases, a program can choose to mix and match categories to build up a synthetic composite locale. The categories are identified by the following constant attributes of module locale:

LC_COLLATE: String sorting; affects functions strcoll and strxfrm in locale
LC_CTYPE: Character types; affects aspects of module string (and string methods) that have to do with lowercase and uppercase letters
LC_MESSAGES: Messages; may affect messages displayed by the operating system (for example, function os.strerror) and module gettext
LC_MONETARY: Formatting of currency values; affects function locale.localeconv
LC_NUMERIC: Formatting of numbers; affects functions atoi, atof, format, localeconv, and str in locale
LC_TIME: Formatting of times and dates; affects function time.strftime

The settings of some categories (denoted by the constants LC_CTYPE, LC_TIME, and LC_MESSAGES) affect behavior in other modules (string, time, os, and gettext, as indicated). The settings of other categories (denoted by the constants LC_COLLATE, LC_MONETARY, and LC_NUMERIC) affect only some functions of locale itself.
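Mixing and matching categories might look like the following minimal sketch. The function name set_composite_locale is hypothetical. Locale names other than 'C' and '' depend on the platform, so the names a real program passes are an assumption the program must check.

```python
import locale

def set_composite_locale(base, overrides):
    """Set every category to base, then override selected categories.

    overrides maps category constants (such as locale.LC_TIME) to locale
    names. Returns a dict mapping each overridden category to its new setting.
    """
    locale.setlocale(locale.LC_ALL, base)
    result = {}
    for category, name in overrides.items():
        result[category] = locale.setlocale(category, name)
    return result

# Example: the user's default conventions everywhere, but neutral numbers:
# set_composite_locale('', {locale.LC_NUMERIC: 'C'})
```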

Module locale supplies functions to query, change, and manipulate locales, as well as functions that implement the cultural conventions of locale categories LC_COLLATE, LC_MONETARY, and LC_NUMERIC.

atof
atof(s)

Converts string s to a floating-point number using the current LC_NUMERIC setting.

atoi
atoi(s)

Converts string s to an integer number using the current LC_NUMERIC setting.
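A short illustration of atof and atoi, pinned to the neutral C locale so that the decimal point is '.' whatever the user's settings are. In a locale whose decimal point is ',', atof would instead expect strings such as '1234,5'.

```python
import locale

# Neutral locale: '.' is the decimal point, no thousands separator.
locale.setlocale(locale.LC_NUMERIC, 'C')

x = locale.atof('1234.5')   # a float
n = locale.atoi('42')       # an integer
```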

format
format(fmt, num, grouping=False)

Returns the string obtained by formatting number num according to the format string fmt and the LC_NUMERIC setting. Except for cultural convention issues, the result is like fmt%num. If grouping is true, format also groups digits in the result string according to the LC_NUMERIC setting. For example:

>>> locale.setlocale(locale.LC_NUMERIC, 'en')
'English_United States.1252'
>>> locale.format('%s', 1000*1000)
'1000000'
>>> locale.format('%s', 1000*1000, True)
'1,000,000'

When the numeric locale is U.S. English and argument grouping is true, format supports the convention of grouping digits by threes with commas.

getdefaultlocale
getdefaultlocale(envvars=['LANGUAGE', 'LC_ALL', 'LC_CTYPE', 'LANG'])

Checks the environment variables whose names are specified by envvars, in order. The first one found in the environment determines the default locale. getdefaultlocale returns a pair of strings (lang, encoding) compliant with RFC 1766 (except for the 'C' locale), such as ('en_US', 'ISO8859-1'). Each item of the pair may be None if getdefaultlocale is unable to discover what value the item should have.

getlocale
getlocale(category=LC_CTYPE)

Returns a pair of strings (lang, encoding) with the current setting for the given category. The category cannot be LC_ALL.

localeconv
localeconv( )

Returns a dict d with the cultural conventions specified by categories LC_NUMERIC and LC_MONETARY of the current locale. While LC_NUMERIC is best used indirectly, via other functions of module locale, the details of LC_MONETARY are accessible only through d. Currency formatting is different for local and international use. The U.S. currency symbol, for example, is '$' for local use only. '$' is ambiguous in international use, since the same symbol is also used for other currencies called "dollars" (Canadian, Australian, Hong Kong, etc.). In international use, therefore, the U.S. currency symbol is the unambiguous string 'USD'. The keys into d to use for currency formatting are the following strings:

'currency_symbol'

Currency symbol to use locally.

'frac_digits'

Number of fractional digits to use locally.

'int_curr_symbol'

Currency symbol to use internationally.

'int_frac_digits'

Number of fractional digits to use internationally.

'mon_decimal_point'

String to use as the "decimal point" for monetary values.

'mon_grouping'

List of digit-grouping numbers for monetary values.

'mon_thousands_sep'

String to use as digit-groups separator for monetary values.

'negative_sign' 'positive_sign'

Strings to use as the sign symbol for negative (positive) monetary values.

'n_cs_precedes' 'p_cs_precedes'

True if the currency symbol comes before negative (positive) monetary values.

'n_sep_by_space' 'p_sep_by_space'

True if a space separates the currency symbol from negative (positive) monetary values.

'n_sign_posn' 'p_sign_posn'

Numeric codes to use to format negative (positive) monetary values:

0

The value and the currency symbol are placed inside parentheses.

1

The sign is placed before the value and the currency symbol.

2

The sign is placed after the value and the currency symbol.

3

The sign is placed immediately before the value.

4

The sign is placed immediately after the value.

CHAR_MAX

The current locale does not specify any convention for this formatting.

d['mon_grouping'] is a list of numbers of digits to group when formatting a monetary value. When d['mon_grouping'][-1] is 0, grouping continues indefinitely, as if d['mon_grouping'][-2] were endlessly repeated. When d['mon_grouping'][-1] is locale.CHAR_MAX, there is no further grouping beyond the indicated numbers of digits. locale.CHAR_MAX is a constant used as the value for all entries in d for which the current locale does not specify any convention.
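To make the grouping list concrete, here is a minimal sketch of a function that applies one. The name group_digits is hypothetical, and the function is an illustration rather than what module locale does internally. It follows the C convention that module locale inherits: a trailing 0 repeats the previous group size for all remaining digits, while locale.CHAR_MAX stops grouping.

```python
import locale

def group_digits(digits, grouping, sep):
    """Insert sep into the digit string digits, working from the right.

    grouping is a list such as d['mon_grouping']; sep is a string such as
    d['mon_thousands_sep'].
    """
    groups = []
    last = None
    for size in grouping:
        if size == locale.CHAR_MAX:
            break                      # no further grouping
        if size == 0:
            # repeat the previous group size for all remaining digits
            while last and len(digits) > last:
                groups.insert(0, digits[-last:])
                digits = digits[:-last]
            break
        if len(digits) <= size:
            break                      # not enough digits left to split
        groups.insert(0, digits[-size:])
        digits = digits[:-size]
        last = size
    groups.insert(0, digits)           # whatever remains is the leftmost group
    return sep.join(groups)
```

For example, group_digits('1234567', [3, 0], ',') gives '1,234,567', while group_digits('1234567', [3, locale.CHAR_MAX], ',') gives '1234,567'.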

normalize
normalize(localename)

Returns a string, suitable as an argument to setlocale, that is the normalized equivalent to localename. If normalize cannot normalize string localename, then normalize returns localename unchanged.

resetlocale
resetlocale(category=LC_ALL)

Sets the locale for category to the default given by getdefaultlocale.

setlocale
setlocale(category, locale=None)

Sets the locale for category to the given locale, if not None, and returns the setting (the existing one when locale is None; otherwise, the new one). locale can be a string or a pair of strings (lang, encoding). The lang string is normally a language code based on ISO 639 two-letter codes ('en' for English, 'nl' for Dutch, and so on). When locale is the empty string '', setlocale sets the user's default locale.
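Because setlocale returns the existing setting when locale is None, the same function serves to save and restore a setting. The following is a minimal sketch of that pattern, using the C locale only because it is available everywhere.

```python
import locale

saved = locale.setlocale(locale.LC_NUMERIC)        # locale=None: query only
try:
    # set a new locale; setlocale returns the new setting
    current = locale.setlocale(locale.LC_NUMERIC, 'C')
    # ... work that needs neutral number formatting goes here ...
finally:
    locale.setlocale(locale.LC_NUMERIC, saved)     # restore the earlier setting
```

Since the locale is not thread-specific, a temporary switch of this kind is safe only when no other thread depends on the locale at the same time.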

str
str(num)

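Like locale.format('%.12g', num): formats num much as the built-in str does, but uses the decimal point of the LC_NUMERIC setting.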

strcoll
strcoll(str1, str2)

Like cmp(str1, str2), but according to the LC_COLLATE setting.

strxfrm
strxfrm(s)

Returns a string sx such that the built-in comparison (e.g., by cmp) of strings so transformed is equivalent to calling locale.strcoll on the original strings. strxfrm lets you use the decorate-sort-undecorate (DSU) idiom for sorts that involve locale-conformant string comparisons. However, if all you need is to sort a list of strings in a locale-conformant way, strcoll's simplicity can make it faster. The following example shows two ways of performing such a sort; in this case, the simple variant is often faster than the DSU one, by about 10 percent for a list of a thousand words:

import locale

# simpler and often faster
def locale_sort_simple(list_of_strings):
    list_of_strings.sort(locale.strcoll)

# less simple and often slower
def locale_sort_DSU(list_of_strings):
    auxiliary_list = [(locale.strxfrm(s), s) for s in list_of_strings]
    auxiliary_list.sort( )
    list_of_strings[:] = [s for junk, s in auxiliary_list]

In Python 2.4, the key= argument to the sort method offers both simplicity and speed:

# simplest and fastest, but 2.4-only:
def locale_sort_2_4(list_of_strings):
    list_of_strings.sort(key=locale.strxfrm)

10.12.2. The gettext Module

A key issue in internationalization is the ability to use text in different natural languages, a task also known as localization. Python supports localization via module gettext, which was inspired by GNU gettext. Module gettext is optionally able to use the latter's infrastructure and APIs, but is simpler and more general. You do not need to install or study GNU gettext to use Python's gettext effectively.

10.12.2.1. Using gettext for localization

gettext does not deal with automatic translation between natural languages. Rather, gettext helps you extract, organize, and access the text messages that your program uses. Pass each string literal subject to translation, also known as a message, to a function named _ (underscore) rather than using it directly. gettext normally installs a function named _ in the __builtin__ module. To ensure that your program runs with or without gettext, conditionally define a do-nothing function, named _, that just returns its argument unchanged. Then you can safely use _('message') wherever you would normally use a literal 'message' that should be translated. The following example shows how to start a module for conditional use of gettext:

try:
    _
except NameError:
    def _(s): return s

def greet( ):
    print _('Hello world')

If some other module has installed gettext before you run this example code, function greet outputs a properly localized greeting. Otherwise, greet outputs the string 'Hello world' unchanged.

Edit your source, decorating message literals with function _. Then use any of various tools to extract messages into a text file (normally named messages.pot) and distribute the file to the people who translate messages into the various natural languages your application must support. Python supplies a script pygettext.py (in directory Tools/i18n in the Python source distribution) to perform message extraction on your Python sources.

Each translator edits messages.pot to produce a text file of translated messages with extension .po. Compile the .po files into binary files with extension .mo, suitable for fast searching, using any of various tools. Python supplies script Tools/i18n/msgfmt.py for this purpose. Finally, install each .mo file with a suitable name in a suitable directory.

Conventions about which directories and names are suitable differ among platforms and applications. gettext's default is subdirectory share/locale/<lang>/LC_MESSAGES/ of directory sys.prefix, where <lang> is the language's code (two letters). Each file is named <name>.mo, where <name> is the name of your application or package.
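The default layout just described can be spelled out as a path computation. The helper name default_mo_path is hypothetical, and the sketch simply mirrors the directory convention given above. It is not a function of module gettext.

```python
import os
import sys

def default_mo_path(name, lang, prefix=None):
    """Return the default location of <name>.mo for language code lang."""
    if prefix is None:
        prefix = sys.prefix
    return os.path.join(prefix, 'share', 'locale', lang,
                        'LC_MESSAGES', name + '.mo')
```

For instance, default_mo_path('my_app', 'it') names the file where an Italian translation of a hypothetical application my_app would be installed.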

Once you have prepared and installed your .mo files, you normally execute, at the time your application starts up, some code such as the following:

import os, gettext
os.environ.setdefault('LANG', 'en')       # application-default language
gettext.install('your_application_name')

This ensures that calls such as _('message') return the appropriate translated strings. You can choose different ways to access gettext functionality in your program: for example, if you also need to localize C-coded extensions, or to switch back and forth between languages during a run. Another important consideration is whether you're localizing a whole application or just a package that is distributed separately.

10.12.2.2. Essential gettext functions

Module gettext supplies many functions; the most often used ones are the following.

install
install(domain, localedir=None, unicode=False)

Installs in Python's built-in namespace a function named _ to perform translations given in file <lang>/LC_MESSAGES/<domain>.mo in directory localedir, with language code <lang> as per getdefaultlocale. When localedir is None, install uses directory os.path.join(sys.prefix, 'share', 'locale'). When unicode is true, function _ accepts and returns Unicode strings, not plain strings.

translation
translation(domain, localedir=None, languages=None)

Searches for a .mo file similarly to function install. When languages is None, translation looks in the environment for the lang to use, like install. It examines, in order, environment variables LANGUAGE, LC_ALL, LC_MESSAGES, LANG; the first nonempty one is split on ':' to give a list of language names (for example, 'de:en' is split into ['de', 'en']). When not None, languages must be a list of one or more language names (for example, ['de', 'en']). translation uses the first language name in the list for which it finds a .mo file. Function translation returns an instance object that supplies methods gettext (to translate a plain string), ugettext (to translate a Unicode string), and install (to install either gettext or ugettext under name _ into Python's built-in namespace).
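The environment lookup described above can be approximated by the following sketch. The function name languages_from_environ is hypothetical, and the code is a simplified illustration of the documented order of lookup, not the code that module gettext runs.

```python
import os

def languages_from_environ(environ=None):
    """Return the language names from the first nonempty variable, if any."""
    if environ is None:
        environ = os.environ
    for var in ('LANGUAGE', 'LC_ALL', 'LC_MESSAGES', 'LANG'):
        value = environ.get(var)
        if value:
            # 'de:en' becomes ['de', 'en']
            return value.split(':')
    return []
```

Passing a plain dict instead of os.environ makes it easy to try the lookup order without touching the real environment.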

Function translation offers more detailed control than install, which is like translation(domain,localedir).install(unicode). With translation, you can localize a single package without affecting the built-in namespace by binding name _ on a per-module basis, for example with:

_ = translation(domain).ugettext

translation also lets you switch globally between several languages, since you can pass an explicit languages argument, keep the resulting instance, and call the install method of the appropriate language as needed:

import gettext
translators = {}
def switch_to_language(lang, domain='my_app', use_unicode=True):
    if lang not in translators:
        translators[lang] = gettext.translation(domain, languages=[lang])
    translators[lang].install(use_unicode)
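When no .mo file can be found for any requested language, translation raises an exception. A package that must keep working without installed translations can ask for a do-nothing translator instead. The following sketch assumes the optional fallback argument of translation, which is not covered above, and uses a placeholder domain name. It binds _ in the current module only, through method gettext.

```python
import gettext

# 'my_app_example_domain' is a placeholder. With fallback=True, translation
# returns a do-nothing translator rather than raising when it finds no
# .mo file for any of the listed languages.
translator = gettext.translation('my_app_example_domain',
                                 languages=['de', 'en'], fallback=True)
_ = translator.gettext        # bound in this module, not in the built-ins

greeting = _('Hello world')   # translated if a .mo file exists, else unchanged
```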

10.12.3. More Internationalization Resources

Internationalization is a very large topic. For general introductions and useful resources, see http://www.debian.org/doc/manuals/intro-i18n/ and http://www.i18ngurus.com/. One of the best packages of code and information for internationalization is ICU (http://icu.sourceforge.net/), which also includes the Unicode Consortium's excellent Common Locale Data Repository (CLDR) database of locale conventions and code to access the CLDR. Unfortunately, at the time of this writing, ICU supports only Java, C, and C++, not (directly) Python. You can easily use the Java version of ICU with Jython (see "Importing Java Packages in Jython" on page 656 for more information about using Java classes from Jython code); with more effort, you can wrap the C/C++ version of ICU with tools such as SWIG or SIP (covered in Chapter 25) to access ICU functionality from Classic Python.