Plain strings are converted into Unicode strings either explicitly, with the unicode built-in, or implicitly, when you pass a plain string to a function that expects Unicode. In either case, the conversion is done by an auxiliary object known as a codec (for coder-decoder). A codec can also convert Unicode strings to plain strings, either explicitly, with the encode method of Unicode strings, or implicitly.
To identify a codec, pass the codec name to unicode or encode. When you pass no codec name, and for implicit conversion, Python uses a default encoding, normally 'ascii'. You can change the default encoding in the startup phase of a Python program, as covered in "The site and sitecustomize Modules" on page 338; see also setdefaultencoding on page 170. However, such a change is not a good idea for most "serious" Python code: it might too easily interfere with code in the standard Python libraries or third-party modules, written to expect the normal 'ascii'.
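As a minimal sketch of explicit conversion with a named codec, the following round-trips a short string through UTF-8. It assumes Python 2.6 or later (or Python 3.3 or later) and uses the decode method of plain strings so that it runs unchanged on both; in Python 2, unicode(plain, 'utf-8') would be the equivalent call. The variable names are illustrative only.

```python
# Plain (byte) string holding the UTF-8 encoding of "cafe" with an acute e.
plain = b'caf\xc3\xa9'

# Plain -> Unicode, naming the codec explicitly.
# In Python 2, unicode(plain, 'utf-8') does the same thing.
text = plain.decode('utf-8')

# Unicode -> plain, again naming the codec explicitly.
back = text.encode('utf-8')

# The same text under a different codec yields different bytes.
as_latin1 = text.encode('latin-1')
```

Naming the codec on every call keeps the result independent of whatever default encoding happens to be installed.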
Every conversion has a parameter errors, a string specifying how conversion errors are to be handled. The default is 'strict', meaning any error raises an exception. When errors is 'replace', the conversion replaces each character that causes an error with '?' in a plain-string result and with u'\ufffd' in a Unicode result. When errors is 'ignore', the conversion silently skips characters that cause errors. When errors is 'xmlcharrefreplace', which applies only when encoding a Unicode string into a plain string, the conversion replaces each character that causes an error with the XML character reference representation of that character in the result (for example, '&#233;' for u'\xe9'). You may also code your own function to implement a conversion-error-handling strategy and register it under an appropriate name by calling codecs.register_error.
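The error-handling strategies just described can be compared side by side. This is a small sketch, assuming Python 2.6 or later (or Python 3.3 or later); the sample string and variable names are made up for illustration.

```python
# Two characters that the 'ascii' codec cannot encode:
# U+00EF (i with diaeresis) and U+2603 (snowman).
text = u'na\xefve \u2603'

# 'strict' is the default: the first offending character raises.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

replaced = text.encode('ascii', 'replace')            # na?ve ?
ignored = text.encode('ascii', 'ignore')              # nave (plus trailing space)
xml_refs = text.encode('ascii', 'xmlcharrefreplace')  # na&#239;ve &#9731;

# When decoding, 'replace' inserts U+FFFD instead of '?'.
decoded = b'caf\xe9'.decode('ascii', 'replace')
```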
9.6.1. The codecs Module
The mapping of codec names to codec objects is handled by the codecs module. This module also lets you develop your own codec objects and register them so that they can be looked up by name, just like built-in codecs. Module codecs also lets you look up any codec explicitly, obtaining the functions the codec uses for encoding and decoding, as well as factory functions to wrap file-like objects. Such advanced facilities of module codecs are rarely used, and are not covered further in this book.
The codecs module, together with the encodings package of the standard Python library, supplies built-in codecs useful to Python developers dealing with internationalization issues. Python comes with over 100 codecs; a list of these codecs, with a brief explanation of each, is at http://docs.python.org/lib/standard-encodings.html. Any supplied codec can be installed as the site-wide default by module sitecustomize, but the preferred usage is to always specify the codec by name whenever you are converting explicitly between plain and Unicode strings. The codec installed by default is 'ascii', which accepts only characters with codes between 0 and 127, the 7-bit range of the American Standard Code for Information Interchange (ASCII) that is common to almost all encodings. A popular codec is 'latin-1', a fast, built-in implementation of the ISO 8859-1 encoding that offers a one-byte-per-character encoding of most special characters needed for Western European languages (the euro sign is a notable omission, which ISO 8859-15 addresses).
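The difference between 'ascii' and 'latin-1' is easy to see on a plain string that contains bytes above 127. The sketch below assumes Python 2.6 or later (or Python 3.3 or later); the sample data is invented.

```python
# A plain string with two bytes above 127 (inverted '!' and n-tilde in ISO 8859-1).
raw = b'\xa1Hola, se\xf1or!'

# 'ascii' rejects any byte outside the range 0-127.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# 'latin-1' maps each byte to exactly one character, so lengths match
# and the conversion round-trips.
text = raw.decode('latin-1')
round_trips = (text.encode('latin-1') == raw)

# Every possible byte value decodes under 'latin-1', to the code point
# with the same number.
all_bytes = bytes(bytearray(range(256)))
code_points = [ord(c) for c in all_bytes.decode('latin-1')]
```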
The codecs module also supplies codecs implemented in Python for most ISO 8859 encodings, with codec names from 'iso8859-1' to 'iso8859-15'. On Windows systems only, the codec named 'mbcs' wraps the platform's multibyte character set conversion procedures. Many codecs specifically support Asian languages. Module codecs also supplies several standard code pages (codec names from 'cp037' to 'cp1258'), Mac-specific encodings (codec names from 'mac-cyrillic' to 'mac-turkish'), and Unicode standard encodings 'utf-8' and 'utf-16' (the latter also has specific big-endian and little-endian variants: 'utf-16-be' and 'utf-16-le'). For use with UTF-16, module codecs also supplies attributes BOM_BE and BOM_LE, byte-order marks for big-endian and little-endian machines, respectively, and BOM, the byte-order mark for the current platform.
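To make the UTF-16 variants and byte-order marks concrete, here is a brief sketch, assuming Python 2.6 or later (or Python 3.3 or later). It relies on sys.byteorder only to work out which variant is native on the machine running it.

```python
import codecs
import sys

text = u'h\xe9llo'

# The endian-specific codecs emit no byte-order mark.
big = text.encode('utf-16-be')
little = text.encode('utf-16-le')

# Plain 'utf-16' prepends the BOM for the current platform.
auto = text.encode('utf-16')

# Work out which variant is native here.
if sys.byteorder == 'little':
    native_bom, native_body = codecs.BOM_LE, little
else:
    native_bom, native_body = codecs.BOM_BE, big

has_bom = auto.startswith(codecs.BOM)
body = auto[len(codecs.BOM):]

# Decoding with 'utf-16' consumes the BOM and picks the right byte order.
decoded = auto.decode('utf-16')
```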
Module codecs also supplies a function, register_error, that lets you register your own conversion-error-handling functions under a name that you can then pass as the errors argument.
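A custom handler registered with codecs.register_error might look like the following sketch. The handler name 'underscore' and the function underscore_errors are hypothetical, chosen only for this example; the code assumes Python 2.6 or later (or Python 3.3 or later). A handler receives the exception instance and returns a pair: the replacement text and the position at which conversion should resume.

```python
import codecs

def underscore_errors(exc):
    """Hypothetical handler: replace each unencodable character with '_'."""
    if isinstance(exc, UnicodeEncodeError):
        # exc.object is the string being encoded; start/end delimit the bad run.
        bad = exc.object[exc.start:exc.end]
        return (u'_' * len(bad), exc.end)
    # This sketch handles encoding errors only; re-raise anything else.
    raise exc

# 'underscore' is an arbitrary name picked for this example.
codecs.register_error('underscore', underscore_errors)

result = u'na\xefve \u2603'.encode('ascii', 'underscore')
```

Once registered, the name can be used anywhere an errors argument is accepted, just like 'replace' or 'ignore'.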
Module codecs also supplies two functions to ease dealing with files of encoded text.
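The text above does not name the two functions. As an illustration, the sketch below assumes that one of them is codecs.open, which returns a file-like object that encodes on write and decodes on read. It assumes Python 2.6 or later (or Python 3.3 or later; recent Python 3 releases may emit a deprecation warning for codecs.open) and writes to a temporary file that it removes afterward.

```python
import codecs
import os
import tempfile

text = u'Gr\xfc\xdfe aus K\xf6ln\n'

# Create a scratch file for the demonstration.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # Write Unicode; the wrapper encodes it as UTF-8 on the way out.
    with codecs.open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Read it back; the wrapper decodes UTF-8 into Unicode.
    with codecs.open(path, 'r', encoding='utf-8') as f:
        read_back = f.read()

    # Look at the raw bytes that actually landed on disk.
    with open(path, 'rb') as f:
        raw = f.read()
finally:
    os.remove(path)
```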
9.6.2. The unicodedata Module
The unicodedata module supplies easy access to the Unicode Character Database. Given any Unicode character, you can use functions supplied by module unicodedata to obtain the character's Unicode category, official name (if any), and other, more exotic information. You can also look up the Unicode character (if any) that corresponds to a given official name. Such advanced facilities are rarely needed, and are not covered further in this book.
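For orientation only, here is a minimal sketch of the lookups just mentioned, assuming Python 2.6 or later (or Python 3.3 or later). It uses the functions name, category, and lookup of module unicodedata; the sample character is arbitrary.

```python
import unicodedata

ch = u'\xe9'

# Official name and general category of the character.
name = unicodedata.name(ch)          # LATIN SMALL LETTER E WITH ACUTE
category = unicodedata.category(ch)  # Ll (letter, lowercase)

# Reverse direction: from official name back to the character.
same = unicodedata.lookup('LATIN SMALL LETTER E WITH ACUTE')

# Some characters (control characters, for instance) have no official
# name; passing a default avoids the ValueError that name() would raise.
unnamed = unicodedata.name(u'\x00', None)
```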