1.5. Character Sets

C makes a distinction between the environment in which the compiler translates the source files of a program (the translation environment) and the environment in which the compiled program is executed (the execution environment). Accordingly, C defines two character sets: the source character set is the set of characters that may be used in C source code, and the execution character set is the set of characters that can be interpreted by the running program. In many C implementations, the two character sets are identical. If they are not, then the compiler converts the characters in character constants and string literals in the source code into the corresponding elements of the execution character set.

Each of the two character sets includes both a basic character set and extended characters. The C language does not specify the extended characters, which are usually dependent on the local language. The extended characters together with the basic character set make up the extended character set.

The basic source and execution character sets both contain the following types of characters:


The letters of the Latin alphabet

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

a b c d e f g h i j k l m n o p q r s t u v w x y z


The decimal digits

0 1 2 3 4 5 6 7 8 9


The following 29 punctuation marks

! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~


The five whitespace characters

Space, horizontal tab, vertical tab, new line, and form feed

The basic execution character set also includes four nonprintable characters: the null character, which acts as the termination mark in a character string; alert; backspace; and carriage return. To represent these characters in character and string literals, type the corresponding escape sequences beginning with a backslash: \0 for the null character, \a for alert, \b for backspace, and \r for carriage return. See Chapter 3 for more details.

The actual numeric values of characters (the character codes) may vary from one C implementation to another. The language itself imposes only the following conditions:

  • Each character in the basic character set must be representable in one byte.

  • The null character is a byte in which all bits are 0.

  • The value of each decimal digit after 0 is greater by one than that of the preceding digit.

1.5.1. Wide Characters and Multibyte Characters

C was originally developed in an English-speaking environment where the dominant character set was the 7-bit ASCII code. Since then, the 8-bit byte has become the most common unit of character encoding, but software for international use generally has to represent more characters than can be coded in one byte. Internationally, a variety of multibyte character encoding schemes have been in use for decades to represent non-Latin alphabets and the nonalphabetic Chinese, Japanese, and Korean writing systems. In 1994, with the adoption of "Normative Addendum 1," ISO C standardized two ways of representing larger character sets: wide characters, in which the same bit width is used for every character in a character set, and multibyte characters, in which a given character can be represented by one or several bytes, and the character value of a given byte sequence can depend on its context in a string or stream.

Although C now provides abstract mechanisms to manipulate and convert the different kinds of encoding schemes, the language itself doesn't define or specify any encoding scheme, or any character set except the basic source and execution character sets described in the previous section. In other words, it is left up to individual implementations to specify how to encode wide characters, and what multibyte encoding schemes to support.


Since the 1994 addendum, C has provided not only the type char, but also wchar_t, the wide character type. This type, defined in the header file stddef.h, is large enough to represent any element of the given implementation's extended character sets.

Although the C standard does not require support for Unicode character sets, many implementations use the Unicode transformation formats UTF-16 and UTF-32 (see http://www.unicode.org) for wide characters. The Unicode standard is largely identical to the ISO/IEC 10646 standard, and is a superset of many previously existing character sets, including the 7-bit ASCII code. When the Unicode standard is implemented, the type wchar_t is at least 16 or 32 bits wide, and a value of type wchar_t represents one Unicode character. For example, the following definition initializes the variable wc with the Greek letter α.

    wchar_t wc = L'\x3b1';

The escape sequence beginning with \x indicates a character code in hexadecimal notation to be stored in the variable: in this case, the code for a lowercase alpha.

In multibyte character sets, each character is coded as a sequence of one or more bytes. Both the source and execution character sets may contain multibyte characters. If they do, then each character in the basic character set occupies only one byte, and no multibyte character except the null character may contain any byte in which all bits are 0. Multibyte characters can be used in character constants, string literals, identifiers, comments, and header filenames. Many multibyte character sets are designed to support a certain language, such as the Japanese Industrial Standard character set (JIS). The multibyte UTF-8 character set, defined by the Unicode Consortium, is capable of representing all Unicode characters. UTF-8 uses from one to four bytes to represent a character.

The key difference between multibyte characters and wide characters (that is, characters of type wchar_t) is that wide characters are all the same size, and multibyte characters are represented by varying numbers of bytes. This representation makes multibyte strings more complicated to process than strings of wide characters. For example, even though the character 'A' can be represented in a single byte, finding it in a multibyte string requires more than a simple byte-by-byte comparison, because the same byte value in certain locations could be part of a different character. Multibyte characters are well suited for saving text in files, however (see Chapter 13).

C provides standard functions to obtain the wchar_t value of any multibyte character, and to convert any wide character to its multibyte representation. For example, if the implementation uses the Unicode encodings UTF-16 for wide characters and UTF-8 for multibyte characters, and the program's current locale selects UTF-8, then the following call to the function wctomb( ) (read: "wide character to multibyte") obtains the multibyte representation of the character α:

    wchar_t wc = L'\x3B1';     // Greek lower-case alpha, α
    char mbStr[10] = "";
    int nBytes = 0;
    nBytes = wctomb( mbStr, wc );

After the function call, the array mbStr contains the multibyte character, which in this example is the sequence "\xCE\xB1". The wctomb( ) function's return value, assigned here to the variable nBytes, is the number of bytes required to represent the multibyte character, namely 2.

1.5.2. Universal Character Names

C also supports universal character names as a way to use the extended character set regardless of the implementation's encoding. You can specify any extended character by its universal character name, which is its Unicode value in the form:

    \uXXXX

or:

    \UXXXXXXXX

where XXXX or XXXXXXXX is a Unicode code point in hexadecimal notation. Use the lowercase u prefix followed by four hexadecimal digits, or the uppercase U followed by exactly eight hex digits. If the first four hexadecimal digits are zero, then the same universal character name can be written either as \uXXXX or as \U0000XXXX.

Universal character names are permissible in identifiers, character constants, and string literals. However, they must not be used to represent characters in the basic character set.

When you specify a character by its universal character name, the compiler stores it in the character set used by the implementation. For example, if the execution character set in a localized program is ISO 8859-7 (8-bit Greek), then the following definition initializes the variable alpha with the code \xE1:

    char alpha = '\u03B1';

However, if the execution character set is UTF-16, then you need to define the variable as a wide character:

    wchar_t alpha = L'\u03B1';

In this case, the character code value assigned to alpha is hexadecimal 3B1, the same as the universal character name.

Not all compilers support universal character names.


1.5.3. Digraphs and Trigraphs

C provides alternative representations for a number of punctuation marks that are not available on all keyboards. Six of these are the digraphs, or two-character tokens, which represent the characters shown in Table 1-1.

Table 1-1. Digraphs

    Digraph     Equivalent
    <:          [
    :>          ]
    <%          {
    %>          }
    %:          #
    %:%:        ##


These sequences are not interpreted as digraphs if they occur within character constants or string literals. In all other positions, they behave exactly like the single-character tokens they represent. For example, the following code fragments are perfectly equivalent, and produce the same output. With digraphs:

    int arr<::> = <% 10, 20, 30 %>;
    printf( "The second array element is <%d>.\n", arr<:1:> );

Without digraphs:

    int arr[ ] = { 10, 20, 30 };
    printf( "The second array element is <%d>.\n", arr[1] );

Output:

    The second array element is <20>.

C also provides trigraphs, three-character representations, all of them beginning with two question marks. The third character determines which punctuation mark a trigraph represents, as shown in Table 1-2.

Table 1-2. Trigraphs

    Trigraph    Equivalent
    ??(         [
    ??)         ]
    ??<         {
    ??>         }
    ??=         #
    ??/         \
    ??!         |
    ??'         ^
    ??-         ~


Trigraphs allow you to write any C program using only the characters defined in ISO/IEC 646, the 1991 standard corresponding to 7-bit ASCII. The compiler's preprocessor replaces the trigraphs with their single-character equivalents in the first phase of compilation. This means that the trigraphs, unlike digraphs, are translated into their single-character equivalents no matter where they occur, even in character constants, string literals, comments, and preprocessing directives. For example, the preprocessor interprets the statement's second and third question marks below as the beginning of a trigraph:

    printf("Cancel???(y/n) ");

Thus the line produces the following preprocessor output:

    printf("Cancel?[y/n) ");

If you need to use one of these three-character sequences and do not want it to be interpreted as a trigraph, you can write the question marks as escape sequences:

    printf("Cancel\?\?\?(y/n) ");

If the character following any two question marks is not one of those shown in Table 1-2, then the sequence is not a trigraph, and remains unchanged.

As another substitute for punctuation characters in addition to the digraphs and trigraphs, the header file iso646.h contains macros that define alternative representations of C's logical operators and bitwise operators, such as and for && and xor for ^. For details, see Chapter 15.


