
11.3 Dealing with Non-Western Languages

Supporting locales with non-Western languages adds another dimension to the subject of localization—namely, the issue of character encoding. As you probably know, the characters displayed on your screen are really represented by sequences of bits. To know which character to display for a sequence of bits, applications (e.g., a browser) consult a mapping between the bit sequences and the characters they represent. ASCII is an early standard mapping; it maps 7 bits (the numerical values 0 through 127) to the characters in the English alphabet, the numbers 0 through 9, punctuation characters, and some control characters. That was all that was really needed in the early days of computing, because most computers were kept busy crunching numbers.

But as computers were given new tasks, often dealing with human-readable text, 7 bits didn't cut it. Adding one bit made it possible to represent all letters used in the Western European languages, but it was not enough to represent all characters used around the world. This problem was partly solved by defining a number of standards for using 8 bits to represent different character subsets. Each part of the ISO-8859 series defines what is called a charset: a mapping between 8 bits (a byte) and a character. For instance, ISO-8859-1, also known as Latin-1, defines the subset used for Western European languages such as English, French, Italian, Spanish, German, and Swedish. ISO-8859-1 is the default charset for HTTP. Other standards in the same series are ISO-8859-2, covering Central and Eastern European languages such as Hungarian, Polish, and Romanian, and ISO-8859-5, with Cyrillic letters used in Russian, Bulgarian, and Macedonian. Chinese, Japanese, and Korean each use thousands of characters, but 8 bits can represent only 256 distinct values. A number of multibyte charsets have therefore been defined to handle these languages, such as Big5 for Chinese, Shift_JIS for Japanese, and EUC-KR for Korean.
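To make the idea of a charset as a mapping concrete, here is a minimal Java sketch. The class and method names are my own inventions for illustration; only the java.nio.charset classes are standard. It decodes the same byte value with two of the ISO-8859 charsets mentioned above, and then shows what happens when a character falls outside an 8-bit charset's subset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: the class and method names are invented for this sketch.
class CharsetMappingDemo {

    // Interprets a single byte value according to the named 8-bit charset.
    static String decodeByte(int value, String charsetName) {
        byte[] bytes = { (byte) value };
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The same bit pattern maps to different characters in different charsets.
        // In ISO-8859-1 the byte 0xE4 is a Latin small letter a with diaeresis (U+00E4).
        System.out.println(decodeByte(0xE4, "ISO-8859-1"));
        // In ISO-8859-5 the byte 0xE4 is the Cyrillic small letter ef (U+0444).
        System.out.println(decodeByte(0xE4, "ISO-8859-5"));

        // A Cyrillic letter has no slot in ISO-8859-1, so Java substitutes
        // a question mark (byte value 63) when asked to encode it.
        byte[] lossy = "\u0444".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(lossy.length + " byte, value " + lossy[0]);
    }
}
```

What appears on your console for the first two lines depends on the console's own encoding, which is a small demonstration of the problem in itself.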

As you can imagine, all these different standards make it hard to exchange information encoded in different ways. To solve this problem, companies such as Apple, IBM, Microsoft, Novell, Sun, and Xerox founded the Unicode Consortium in 1991, and defined the Unicode standard. Unicode uses 2 bytes (16 bits) to define unique codes for 49,194 characters in Version 3.0, covering most of the world's languages. Java uses Unicode for its internal representation of characters, and Unicode is also supported by many other technologies, like XML and LDAP. Unicode support is included in all modern browsers, e.g., Netscape and Internet Explorer since Version 4. To learn more about Unicode, visit http://www.unicode.org/.

What does this all mean to web application developers? Well, since ISO-8859-1 is the default charset for HTTP, you don't have to worry about character encoding at all when you work with Western languages. But if you provide content in another language, such as Japanese or Russian, you must tell the browser which charset you're using so it can interpret and render the characters correctly. If the files that you serve contain characters encoded with a charset other than ISO-8859-1, you must inform the web container.

We're using JSP pages to build the JSF responses, so let's focus on the JSP features that deal with character encoding. JSP is Java, so the web container uses Unicode internally, but the JSP page is typically stored using another encoding, and the response may need to be sent to the browser with yet another encoding. There are two JSP page directive attributes that specify these charsets. The pageEncoding attribute specifies the charset for the bytes in the JSP page itself, so the container can translate them to Unicode when it reads the file. The contentType attribute can contain a charset in addition to the MIME type, as shown in Figure 11-3. This charset tells the container to convert the Unicode characters used internally to the specified charset encoding when the response is sent to the browser. It also sets the charset attribute in the Content-Type response header that tells the browser how to interpret the response. If a pageEncoding is not specified, the charset specified by the contentType attribute is used to interpret the JSP file bytes as well, and vice versa (if a pageEncoding is specified but not a contentType charset). If a charset is not specified at all, ISO-8859-1 is used for both the file and the response.[1]

[1] For a JSP Document (a JSP page in XML format), UTF-8 or UTF-16 is the default, as determined by the XML parser.
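The two conversions controlled by pageEncoding and the contentType charset can be sketched in plain Java. This is not container code and not part of any JSP API; the class and method names are hypothetical, and I use ISO-8859-5 as the file encoding purely as an example. It shows the principle only: bytes are decoded to Unicode with one charset and encoded again with another.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any JSP API: it mimics the two conversions
// a container performs for a page with a pageEncoding and a contentType charset.
class PageTranscodingSketch {

    static byte[] transcode(byte[] fileBytes, Charset pageEncoding, Charset responseCharset) {
        // Step 1: file bytes -> Unicode characters (what pageEncoding controls).
        String text = new String(fileBytes, pageEncoding);
        // Step 2: Unicode characters -> response bytes (what the contentType charset controls).
        return text.getBytes(responseCharset);
    }

    public static void main(String[] args) {
        Charset pageEncoding = Charset.forName("ISO-8859-5");
        // A six-letter Russian greeting, written with Unicode escapes
        // so that this source file stays pure ASCII.
        String greeting = "\u041F\u0440\u0438\u0432\u0435\u0442";
        byte[] fileBytes = greeting.getBytes(pageEncoding);
        byte[] responseBytes = transcode(fileBytes, pageEncoding, StandardCharsets.UTF_8);
        // Prints: 6 bytes on disk, 12 bytes in the response
        System.out.println(fileBytes.length + " bytes on disk, "
                + responseBytes.length + " bytes in the response");
    }
}
```

If the first step uses the wrong charset, the second step faithfully encodes the wrong characters, which is why the container must be told how the file was saved.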

Enough theory. Figure 11-3 shows a simple JSP page that sends the text "Hello World" in Japanese to the browser. The Japanese characters are copied with permission from Jason Hunter's Java Servlet Programming (O'Reilly).

Figure 11-3. Japanese JSP page (japanese.jsp)

To create a file with Japanese or other non-Western European characters, you obviously need a text editor that can handle multibyte characters. The JSP page in Figure 11-3 was created with WordPad on a Windows system, using a Japanese font called MS Gothic, and saved as a file encoded with the Shift_JIS charset. Shift_JIS is therefore the charset specified by the pageEncoding attribute, so the container knows how to read the file. The contentType attribute specifies a different charset, UTF-8, for the response. UTF-8 is an efficient charset that encodes each Unicode character as one to four bytes, as needed (characters in the 16-bit range need at most three), and is supported by all modern browsers (e.g., Netscape and Internet Explorer, Version 4 or later). It can be used for any language, assuming the browser has access to a font with the language's character symbols.
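The variable-length nature of UTF-8 is easy to verify. The following sketch (the class name is mine, chosen for illustration) counts the bytes UTF-8 needs for characters from different parts of the Unicode range; characters beyond the 16-bit range, which Java stores as a surrogate pair, need a fourth byte.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: counts how many bytes UTF-8 uses for a given string.
class Utf8LengthDemo {

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // 1: plain ASCII letter
        System.out.println(utf8Length("\u00E4"));       // 2: Latin a with diaeresis
        System.out.println(utf8Length("\u3042"));       // 3: Japanese hiragana letter a
        System.out.println(utf8Length("\uD83D\uDE00")); // 4: a character outside the 16-bit range
    }
}
```

This is why UTF-8 costs nothing extra for English text while still covering Japanese, Greek, and Russian in the same response.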

Note that the page directive that defines the charset for the file must appear as early as possible in the JSP page, before any characters that can be interpreted only when the charset is known. I recommend you insert it as the first line in the file to avoid problems.

If you pull strings from a resource bundle file, you must also do a bit of work for non-Western languages. The resource bundle file itself must be ISO-8859-1-encoded, with every character outside that charset written as a Unicode escape of the form \uXXXX (where XXXX is the character's hexadecimal Unicode value). There's a tool bundled with the Java 2 SDK called native2ascii that you can use to convert a file in any encoding to this escaped format. See the Java SDK documentation for details (http://java.sun.com/j2se/1.4.2/docs/tooldocs/windows/native2ascii.html).
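To show what the escaped format looks like, here is a rough Java sketch of the kind of conversion such a tool performs. It is not the native2ascii implementation: the class and method names are hypothetical, and escaping everything above 7-bit ASCII with lowercase hex digits is my assumption. The second method confirms that java.util.Properties, which property resource bundles are based on, turns the escapes back into the original characters.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// A sketch, not the real tool: names and escaping threshold are assumptions.
class UnicodeEscapeSketch {

    // Replaces every character above 7-bit ASCII with a backslash-u escape
    // followed by four hex digits.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    // Parses properties-file text and returns the value stored under the key.
    static String loadValue(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // Three kanji characters, written as escapes to keep this source file ASCII-only.
        String japanese = "\u65E5\u672C\u8A9E";
        String line = "language=" + escapeNonAscii(japanese);
        System.out.println(line);
        // The escapes are decoded again when the properties text is loaded.
        System.out.println(loadValue(line, "language").equals(japanese)); // true
    }
}
```

The escaped line contains only ASCII characters, so it is valid in an ISO-8859-1 file no matter which language the value is in.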

To illustrate how all this works, I developed a simple test page that displays the current date formatted according to the Japanese, Greek, and Russian locales. Figure 11-4 shows what it looks like.

Figure 11-4. Test page for non-Western locales

You can switch between the locales by choosing one in the selection list and clicking the New Language button. The page also includes an input field for a date/time value in a format that corresponds to the currently selected locale and a button to submit it. If the value can be interpreted as a date/time value, it's printed in Java's standard format at the bottom of the page.

Example 11-4 shows the JSP page for the test page.

Example 11-4. Test page for non-Western languages (nw_i18n.jsp)
<%@ page contentType="text/html;charset=UTF-8" %>
<%@ taglib uri="http://java.sun.com/jsf/html" prefix="h" %>
<%@ taglib uri="http://java.sun.com/jsf/core" prefix="f" %>

<jsp:useBean id="now" scope="request" class="java.util.Date" />
<%-- Use the submitted menu value as the locale; default to Japanese --%>
<f:view locale="#{param['i18n:locale'] == null ? 'ja' : param['i18n:locale']}">
  <html>
    <head>
      <title>
        Non-Western Languages Localization
      </title>
    </head>
    <body bgcolor="white">
      <h:form id="i18n">
        <%-- Locale menu; its request parameter is named i18n:locale --%>
        <h:selectOneMenu id="locale" value="#{view.locale.language}">
          <f:selectItem itemValue="ja" itemLabel="Japanese" />
          <f:selectItem itemValue="el" itemLabel="Greek" />
          <f:selectItem itemValue="ru" itemLabel="Russian" />
        </h:selectOneMenu>
        <h:commandButton value="New Language" />
        <p>
        Current localized date/time: 
        <h:outputText value="#{now}">
          <f:convertDateTime dateStyle="full" timeStyle="full" />
        </h:outputText>
        <p>
        Enter a localized value for the current locale, e.g., by copy/pasting
        the current date/time:<br>
        <h:inputText size="50" value="#{input}">
          <f:convertDateTime dateStyle="full" timeStyle="full" />
        </h:inputText>
        <h:commandButton value="Submit Value" />
        <p>
        The current value converted to a java.util.Date is:
        <%-- JSP EL: prints the converted value, if any --%>
        ${input}
      </h:form>
    </body>
  </html>
</f:view>

A JSP page directive at the top of the page declares that this page produces a UTF-8-encoded response. If you don't hardcode a response encoding, the container picks an encoding that can be used for the selected locale, but it may not be the one you want. A J2EE 1.4 (Servlet 2.4 and JSP 2.0) container provides a standardized way to map locales to encodings in the web.xml file (see Appendix F for details) so you have full control over the selection, but if you use a J2EE 1.3 (Servlet 2.3 and JSP 1.2) container, the mapping is implementation-dependent. I recommend that you always hardcode UTF-8, unless you must support browsers without UTF-8 support (such browsers are very rare).

Be aware that the J2EE 1.3 specifications are vague regarding which encoding wins if you declare a hardcoded encoding with the page directive and then set the locale in the page body, so the container may choose a different encoding than the one you declared. If at all possible, use a J2EE 1.4 container, for which this issue has been clarified (the hardcoded encoding always wins).


The locale is set by an <f:view> action as before, but Example 11-4 has a more complex JSF EL expression as the value. It uses the conditional operator to set the locale either to the value of a request parameter named i18n:locale or to "ja" (Japanese) if there is no parameter with that name. If you look further down in Example 11-4, you'll see there's an <h:selectOneMenu> action element with an id attribute set to locale, nested within an <h:form> action element with an id attribute set to i18n. The name of the request parameter that holds the value of an input component is made up from the ID of the form and the input component itself, separated by a colon, because a form is a naming container (I'll get back to what a naming container is in Chapter 12). Hence, the request parameter named i18n:locale holds the code for the selected locale and the <f:view> action element uses it to set the view locale. The first time the page is requested, the parameter isn't included, so Japanese is used as the default locale.
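The logic of that EL expression can be restated as a few lines of plain Java. This is only a sketch of the decision the expression makes, not how JSF evaluates it; the class and method names are hypothetical, and a java.util.Map stands in for the request parameters.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical restatement of the EL expression's logic; not JSF code.
class ViewLocaleSketch {

    // A naming container's ID and the child component's ID, joined by a colon.
    static String clientId(String formId, String componentId) {
        return formId + ":" + componentId;
    }

    // Uses the submitted language code if present, otherwise falls back to Japanese.
    static Locale resolveLocale(Map<String, String> requestParams) {
        String language = requestParams.get(clientId("i18n", "locale"));
        return Locale.forLanguageTag(language == null ? "ja" : language);
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        // First request: no parameter yet, so the default applies.
        System.out.println(resolveLocale(params).getLanguage()); // ja

        // After the user picks Russian and submits the form.
        params.put("i18n:locale", "ru");
        System.out.println(resolveLocale(params).getLanguage()); // ru
    }
}
```

The getLanguage() call at the end is the plain-Java counterpart of reading the language property of a Locale instance in an EL expression.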

The <h:selectOneMenu> action element uses another funny-looking JSF EL expression as its value. It looks up the locale property value of the UIViewRoot component and then gets the value of the Locale instance's language property. The nested <f:selectItem> elements define choices with the language codes for Japanese, Greek, and Russian as their values; as a result, the previously selected locale is shown as the current choice for each new request.

An <h:outputText> action element with its nested <f:convertDateTime> element creates an output component that displays the current date and time, represented by a java.util.Date variable created by a JSP <jsp:useBean> action at the top of the page. Before we move on, you may want to switch between the locales and see how the current date/time value changes. Because the view's locale changes when you pick a new locale from the list, the output component's converter knows how to format the value correctly.
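If you want to see locale-sensitive formatting outside of JSF, the standard java.text.DateFormat class produces comparable output. The sketch below uses the full date and time styles for the three locales in the selection list. The class and method names are mine, and I'm assuming the converter's output is comparable to what DateFormat produces; the exact strings depend on the locale data shipped with your Java runtime, so none are shown here.

```java
import java.text.DateFormat;
import java.util.Date;
import java.util.Locale;

// Illustrative only: formats one Date for several locales with the FULL styles.
class LocalizedDateSketch {

    static String formatFull(Date date, Locale locale) {
        DateFormat format =
                DateFormat.getDateTimeInstance(DateFormat.FULL, DateFormat.FULL, locale);
        return format.format(date);
    }

    public static void main(String[] args) {
        Date now = new Date();
        // The same language codes as the choices in the selection list.
        for (String language : new String[] { "ja", "el", "ru" }) {
            System.out.println(language + ": " + formatFull(now, Locale.forLanguageTag(language)));
        }
    }
}
```

The Date value itself never changes; only the locale handed to the formatter does, which is the same division of labor as between the now variable and the view's locale.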

The character encoding also plays a crucial role when it comes to processing input. A regular HTTP request can only contain parameter values made up from the characters defined by the ISO-8859-1 charset, so the browser must encode all other characters entered in input fields in terms of the allowed characters. It encodes each byte of such a character as a percent sign followed by the byte's two-digit hexadecimal value, e.g., %E4. The problem is that the hexadecimal value only makes sense if you know which charset it comes from. And even though the HTTP specification says that the charset name must be sent in the Content-Type request header, most browsers don't send it. Luckily, all commonly used browsers use the charset of the response containing the form to encode the parameter values when the form is submitted. As long as you keep track of the response encoding, you can tell the container which charset to use to decode the parameter values.

JSF hides this complexity as long as your application doesn't disable session tracking. At the end of the Render Response phase, JSF saves the character encoding used for the response in a session variable, and before reading any request parameters from the next request for the view, it tells the container to use the same encoding. If you run into problems in this area, first confirm that session tracking is working. Make sure all your users have cookies enabled, or that you use JSF components for all links so they include the session ID when cookies are disabled.
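The dependency between the percent-encoded value and the charset can be demonstrated with the standard java.net.URLEncoder and URLDecoder classes. This sketch illustrates the encoding problem only; it is not what a browser or JSF runs. The wrapper class and method names are hypothetical, and the overloads that accept a Charset require Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only; the Charset overloads used here need Java 10 or later.
class FormEncodingSketch {

    static String encode(String value, Charset charset) {
        return URLEncoder.encode(value, charset);
    }

    static String decode(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args) {
        String input = "\u00E4"; // Latin small letter a with diaeresis

        // The same character yields different parameter values per charset.
        System.out.println(encode(input, StandardCharsets.ISO_8859_1)); // %E4
        System.out.println(encode(input, StandardCharsets.UTF_8));      // %C3%A4

        // Decoding with the wrong charset silently produces a different
        // character: in ISO-8859-5, byte 0xE4 is a Cyrillic letter.
        String wrong = decode("%E4", Charset.forName("ISO-8859-5"));
        System.out.println(wrong.equals(input)); // false
    }
}
```

No error is reported in the mismatched case, which is why remembering the response encoding between requests matters so much.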

The JSP page in Example 11-4 contains an <h:inputText> action that creates an input component with a date/time converter so you can try this out. Enter a value that matches the currently selected locale, e.g., by copying the current date/time value, and submit the form. If the value can be interpreted as a valid date/time value, the JSP EL expression at the end of the page picks it up from where the input component saved it and adds it to the response in its native format. If the value is invalid, the invalid value remains in the input field (so you can correct it), but no value is stored for the JSP EL expression to pick up.

Internationalizing an application is a lot of work, as you've seen in this chapter, but if you're reasonably sure that you will have to do it sooner or later, I suggest that you do it up front. Retrofitting an application for internationalization later is boring and involves a lot more work than doing it from the start.
