
2.2. Lexical Structure

This section explains the lexical structure of a Java program. It starts with a discussion of the Unicode character set in which Java programs are written. It then covers the tokens that make up a Java program, explaining comments, identifiers, reserved words, literals, and so on.

2.2.1. The Unicode Character Set

Java programs are written using Unicode. You can use Unicode characters anywhere in a Java program, including comments and identifiers such as variable names. Unlike the 7-bit ASCII character set, which is useful only for English, and the 8-bit ISO Latin-1 character set, which is useful only for major Western European languages, the Unicode character set can represent virtually every written language in common use on the planet. 16-bit Unicode characters are typically written to files using an encoding known as UTF-8, which converts the 16-bit characters into a stream of bytes. The format is designed so that plain ASCII text (that is, the 7-bit characters of Latin-1) is already a valid UTF-8 byte stream. Thus, you can simply write plain ASCII programs, and they will work as valid Unicode.
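
The relationship between ASCII and UTF-8 described above can be checked with a few lines of code. The following is a minimal sketch, not part of the original text: the class name Utf8Demo and its helper methods are made up for illustration, and it assumes a JDK that provides java.nio.charset.StandardCharsets (Java 7 or later).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: compares the UTF-8 and US-ASCII encodings of a string.
class Utf8Demo {
    // Encode a string as a UTF-8 byte stream.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // True if the UTF-8 and US-ASCII encodings of s are byte-for-byte identical.
    static boolean sameAsAscii(String s) {
        return Arrays.equals(utf8(s), s.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        // Plain ASCII text: one byte per character, identical in both encodings.
        System.out.println(utf8("int i = 0;").length);  // 10
        System.out.println(sameAsAscii("int i = 0;"));  // true

        // The Greek letter pi (written as an escape so this file stays plain ASCII)
        // is outside ASCII and needs two bytes in UTF-8.
        System.out.println(utf8("\u03c0").length);      // 2
        System.out.println(sameAsAscii("\u03c0"));      // false
    }
}
```

The ASCII-only string encodes to the same bytes either way, which is why a plain ASCII source file is also a valid UTF-8 source file.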

If you do not use a Unicode-enabled text editor, or if you do not want to force other programmers who view or edit your code to use a Unicode-enabled editor, you can embed Unicode characters into your Java programs using the special Unicode escape sequence \uxxxx: a backslash and a lowercase u, followed by four hexadecimal digits. For example, \u0020 is the space character, and \u03c0 is the character π.
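
As a small illustration of Unicode escapes, the sketch below uses \u03c0 both inside a string literal and as an identifier, while keeping the source file itself plain ASCII. The class and method names are hypothetical and chosen only for this example.

```java
// Hypothetical demo class: Unicode escapes in identifiers, strings, and chars.
class UnicodeEscapeDemo {
    // The identifier below is the single character pi, spelled as an escape.
    static final double \u03c0 = Math.PI;

    // A string literal holding exactly one character, pi.
    static String piSymbol() {
        return "\u03c0";
    }

    // Area of a circle, referring to the same constant by its escaped name.
    static double circleArea(double radius) {
        return \u03c0 * radius * radius;
    }

    public static void main(String[] args) {
        System.out.println(piSymbol().length());         // 1
        System.out.println((int) piSymbol().charAt(0));  // 960, which is 0x03c0
        System.out.println('\u0020' == ' ');             // true
        System.out.println(circleArea(1.0) == Math.PI);  // true
    }
}
```

Because the compiler translates escapes before it breaks the source into tokens, the escaped identifier and a literal π typed in a Unicode-enabled editor name the same variable.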

Unicode 3.1 and above, used in Java 5.0 and later, includes "supplementary characters" that require 21 bits to represent. 16-bit encodings of Unicode characters represent these supplementary characters using a surrogate pair, which is a sequence of two 16-bit characters taken from a special reserved range of the 16-bit encoding space. If you ever need to include one of these (rarely used) supplementary characters in Java source code, use two \u sequences to represent the surrogate pair. (Details of surrogate pair encoding are beyond the scope of this book, however.)
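
To make the idea of a surrogate pair concrete, here is a minimal sketch that writes one supplementary character as two \u sequences. The character chosen, U+1D11E (MUSICAL SYMBOL G CLEF), and the class name are assumptions made for this example; the methods used (codePointAt, codePointCount, and the Character surrogate tests) are standard java.lang APIs available since Java 5.0.

```java
// Hypothetical demo class: one supplementary character, two 16-bit char values.
class SurrogatePairDemo {
    // U+1D11E written as its surrogate pair: high surrogate, then low surrogate.
    static final String G_CLEF = "\uD834\uDD1E";

    public static void main(String[] args) {
        // The string holds two 16-bit char values...
        System.out.println(G_CLEF.length());                             // 2
        // ...but they represent a single Unicode character (code point).
        System.out.println(G_CLEF.codePointCount(0, G_CLEF.length()));   // 1
        System.out.println(Integer.toHexString(G_CLEF.codePointAt(0)));  // 1d11e
        System.out.println(Character.isHighSurrogate(G_CLEF.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(G_CLEF.charAt(1)));  // true
    }
}
```

If you need the pair for some other code point, Character.toChars(codePoint) computes it, so the encoding never has to be worked out by hand.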

2.2.2. Case-Sensitivity and Whitespace

Java is a case-sensitive language. Its keywords are written in lowercase and must always be used that way. That is, While and WHILE are not the same as the while keyword. Similarly, if you declare a variable named i in your program, you may not refer to it as I.

Java ignores spaces, tabs, newlines, and other whitespace, except when it appears within quoted characters and string literals. Programmers typically use whitespace to format and indent their code for easy readability, and you will see common indentation conventions in the code examples of this book.
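
The following sketch illustrates both points: names that differ only in case are unrelated, and whitespace between tokens does not change what a statement means. All class, method, and variable names here are invented for the example, and naming a variable While is shown only to make the point, not as a recommended style.

```java
// Hypothetical demo class: case-sensitivity and insignificant whitespace.
class CaseAndWhitespaceDemo {
    // i and I are two unrelated variables because Java is case-sensitive.
    static int sumOfDistinctNames() {
        int i = 1;
        int I = 2;
        return i + I;
    }

    // While is not the keyword while, so it is a legal (if confusing) name.
    static int countTo(int limit) {
        int While = 0;
        while (While < limit) {
            While++;
        }
        return While;
    }

    // One declaration spread over several lines with extra spacing.
    static int spreadOutSum() {
        int total
            =   1
            +   2
            ;
        return total;
    }

    // The same declaration with no optional whitespace at all.
    static int compactSum() {
        int total=1+2;
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumOfDistinctNames());             // 3
        System.out.println(countTo(3));                       // 3
        System.out.println(spreadOutSum() == compactSum());   // true
    }
}
```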

2.2.3. Comments

Comments are natural-language text intended for human readers of a program. They are ignored by the Java compiler. Java supports three types of comments. The first type is a single-line comment, which begins with the characters // and continues until the end of the current line. For example:

int i = 0;   // Initialize the loop variable

The second kind of comment is a multiline comment. It begins with the characters /* and continues, over any number of lines, until the characters */. Any text between the /* and the */ is ignored by the Java compiler. Although this style of comment is typically used for multiline comments, it can also be used for single-line comments. This type of comment cannot be nested (i.e., one /* */ comment cannot appear within another). When writing multiline comments, programmers often use extra * characters to make the comments stand out. Here is a typical multiline comment:

/*
 * First, establish a connection to the server.
 * If the connection attempt fails, quit right away.
 */

The third type of comment is a special case of the second. If a comment begins with /**, it is regarded as a special doc comment. Like regular multiline comments, doc comments end with */ and cannot be nested. When you write a Java class you expect other programmers to use, use doc comments to embed documentation about the class and each of its methods directly into the source code. A program named javadoc extracts these comments and processes them to create online documentation for your class. A doc comment can contain HTML tags and can use additional syntax understood by javadoc. For example:

/**
 * Upload a file to a web server.
 *
 * @param file The file to upload.
 * @return <tt>true</tt> on success,
 *         <tt>false</tt> on failure.
 * @author David Flanagan
 */

See Chapter 7 for more information on the doc comment syntax and Chapter 8 for more information on the javadoc program.

Comments may appear between any tokens of a Java program, but may not appear within a token. In particular, comments may not appear within double-quoted string literals. A comment within a string literal simply becomes a literal part of that string.
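
A short sketch of these placement rules follows. The class and method names are hypothetical; the point is only to contrast a comment that sits between tokens with comment-like characters that sit inside a string literal.

```java
// Hypothetical demo class: where comment delimiters are and are not comments.
class CommentPlacementDemo {
    // A comment may sit between any two tokens, even in mid-expression.
    static int sum() {
        return 1 + /* between tokens */ 2;
    }

    // Inside a string literal the same characters are ordinary text.
    static String notAComment() {
        return "1 + /* not a comment */ 2";
    }

    // Likewise, two slashes inside a string do not start a single-line comment.
    static String slashes() {
        return "left // right";
    }

    public static void main(String[] args) {
        System.out.println(sum());          // 3
        System.out.println(notAComment());  // 1 + /* not a comment */ 2
        System.out.println(slashes());      // left // right
    }
}
```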

2.2.4. Reserved Words

The following words are reserved in Java: they are part of the syntax of the language and may not be used to name variables, classes, and so forth.

abstract   const      final        int         public        throw
assert     continue   finally      interface   return        throws
boolean    default    float        long        short         transient
break      do         for          native      static        true
byte       double     goto         new         strictfp      try
case       else       if           null        super         void
catch      enum       implements   package     switch        volatile
char       extends    import       private     synchronized  while
class      false      instanceof   protected   this

We'll meet each of these reserved words again later in this book. Some of them are the names of primitive types and others are the names of Java statements, both of which are discussed later in this chapter. Still others are used to define classes and their members (see Chapter 3).

Note that const and goto are reserved but aren't actually used in the language. strictfp was added in Java 1.2, assert was added in Java 1.4, and enum was added in Java 5.0.
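
If you want to check programmatically whether a word is reserved, one option is the isKeyword() method of javax.lang.model.SourceVersion. The sketch below is an illustration under assumptions rather than something the text above prescribes: that class was added in Java 6, it answers for the language version of the JDK you run it on (not specifically Java 5.0), and the wrapper class ReservedWordDemo is hypothetical.

```java
import javax.lang.model.SourceVersion;

// Hypothetical demo class: asks the JDK whether a word is reserved.
class ReservedWordDemo {
    // True if word is a keyword or one of the literals true, false, and null.
    static boolean isReserved(String word) {
        return SourceVersion.isKeyword(word);
    }

    public static void main(String[] args) {
        System.out.println(isReserved("while"));  // true
        System.out.println(isReserved("While"));  // false: case matters
        System.out.println(isReserved("goto"));   // true: reserved although unused
        System.out.println(isReserved("const"));  // true: reserved although unused
        System.out.println(isReserved("null"));   // true
    }
}
```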

2.2.5. Identifiers

An identifier is simply a name given to some part of a Java program, such as a class, a method within a class, or a variable declared within a method. Identifiers may be of any length and may contain letters and digits drawn from the entire Unicode character set. An identifier may not begin with a digit, however, because the compiler would then think it was a numeric literal rather than an identifier.

In general, identifiers may not contain punctuation characters. Exceptions include the ASCII underscore (_) and dollar sign ($) as well as other Unicode currency symbols such as £ and ¥. Currency symbols are intended for use in automatically generated source code, such as code produced by parser generators. By avoiding the use of currency symbols in your own identifiers you don't have to worry about collisions with automatically generated identifiers. Formally, the characters allowed at the beginning of and within an identifier are defined by the methods isJavaIdentifierStart( ) and isJavaIdentifierPart( ) of the class java.lang.Character.

The following are examples of legal identifiers:

i    x1    theCurrentTime    the_current_time
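
The formal rule can be turned into a small check built on the two Character methods named above. This is a minimal sketch: the class IdentifierCheck and the method hasIdentifierSyntax() are hypothetical helpers, and the method tests characters only, so it does not reject reserved words such as while.

```java
// Hypothetical helper class: tests whether a string is spelled like an identifier.
class IdentifierCheck {
    // True if s is non-empty, starts with a legal start character, and
    // continues with legal part characters. Reserved words are NOT rejected.
    static boolean hasIdentifierSyntax(String s) {
        if (s.isEmpty()) {
            return false;
        }
        // Work in code points so that supplementary characters are handled too.
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasIdentifierSyntax("theCurrentTime"));    // true
        System.out.println(hasIdentifierSyntax("the_current_time"));  // true
        System.out.println(hasIdentifierSyntax("x1"));                // true
        System.out.println(hasIdentifierSyntax("1x"));                // false: starts with a digit
        System.out.println(hasIdentifierSyntax("the-current-time"));  // false: punctuation
        System.out.println(hasIdentifierSyntax("\u00a3"));            // true: pound sign, a currency symbol
    }
}
```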

2.2.6. Literals

Literals are values that appear directly in Java source code. They include integer and floating-point numbers, characters within single quotes, strings of characters within double quotes, and the reserved words true, false, and null. For example, the following are all literals:

1    1.0    '1'    "one"    true    false    null

The syntax for expressing numeric, character, and string literals is detailed in Section 2.3 later in this chapter.
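
As a quick preview, the sketch below passes each of the literals listed above to a hypothetical helper, typeOf(), that reports the type of the value it receives. Because the helper takes an Object, primitive literals are autoboxed, so the names printed are those of the wrapper classes (Integer for int, and so on); the class and method names are made up for this example.

```java
// Hypothetical demo class: each kind of literal produces a value of a different type.
class LiteralDemo {
    // Reports the runtime type of a value, or "null" for the null literal.
    static String typeOf(Object value) {
        return value == null ? "null" : value.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(typeOf(1));      // Integer
        System.out.println(typeOf(1.0));    // Double
        System.out.println(typeOf('1'));    // Character
        System.out.println(typeOf("one"));  // String
        System.out.println(typeOf(true));   // Boolean
        System.out.println(typeOf(false));  // Boolean
        System.out.println(typeOf(null));   // null
    }
}
```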

2.2.7. Punctuation

Java also uses a number of punctuation characters as tokens. The Java Language Specification divides these characters (somewhat arbitrarily) into two categories, separators and operators. Separators are:

(   )   {   }   [   ]
<   >   :   ;   ,   .   @

Operators are:

+    -    *    /    %    &    |    ^    <<   >>   >>>
+=   -=   *=   /=   %=   &=   |=   ^=   <<=  >>=  >>>=
=    ==   !=   <    <=   >    >=
!    ~    &&   ||   ++   --   ?    :

We'll see separators throughout the book, and will cover each operator individually in Section 2.4 later in this chapter.
