
1.7. How the C Compiler Works

Once you have written a source file using a text editor, you can invoke a C compiler to translate it into machine code. The compiler operates on a translation unit consisting of a source file and all the header files referenced by #include directives. If the compiler finds no errors in the translation unit, it generates an object file containing the corresponding machine code. Object files are usually identified by the filename suffix .o or .obj. The compiler may also generate an assembler listing (see Part III).

Object files are also called modules. A library, such as the C standard library, contains compiled, rapidly accessible modules of the standard functions.

The compiler translates each translation unit of a C program (that is, each source file together with any header files it includes) into a separate object file. The compiler then invokes the linker, which combines the object files, and any library functions used, into an executable file. Figure 1-1 illustrates the process of compiling and linking a program from several source files and libraries. The executable file also contains any information that the target operating system needs to load and start it.

Figure 1-1. From source code to executable file


1.7.1. The C Compiler's Translation Phases

The compiling process takes place in eight logical steps. A given compiler may combine several of these steps, as long as the results are not affected. The steps are:

  1. Characters are read from the source file and converted, if necessary, into the characters of the source character set. The end-of-line indicators in the source file, if different from the newline character, are replaced by it. Likewise, any trigraph sequences are replaced with the single characters they represent. (Digraphs, however, are left alone; they are not converted into their single-character equivalents.)

  2. Wherever a backslash is followed immediately by a newline character, the preprocessor deletes both. Since a newline character ends a preprocessor directive, this processing step lets you place a backslash at the end of a line in order to continue a directive, such as a macro definition, on the next line.

    Every source file, if not completely empty, must end with a newline character.


  3. The source file is broken down into preprocessor tokens (see the next section, "Tokens") and sequences of whitespace characters. Each comment is treated as one space.

  4. The preprocessor directives are carried out and macro calls are expanded.

    Steps 1 through 4 are also applied to any files inserted by #include directives. Once the compiler has carried out the preprocessor directives, it removes them from its working copy of the source code.


  5. The characters and escape sequences in character constants and string literals are converted into the corresponding characters in the execution character set.

  6. Adjacent string literals are concatenated into a single string.

  7. The actual compiling takes place: the compiler analyzes the sequence of tokens and generates the corresponding machine code.

  8. The linker resolves references to external objects and functions, and generates the executable file. If a module refers to external objects or functions that are not defined in any of the translation units, the linker takes them from the standard library or another specified library. External objects and functions must not be defined more than once in a program.

For most compilers, either the preprocessor is a separate program, or the compiler provides options to perform only the preprocessing (steps 1 through 4 in the preceding list). This setup allows you to verify that your preprocessor directives have the intended effects. For a more practically oriented look at the compiling process, see Chapter 18.

1.7.2. Tokens

A token is either a keyword, an identifier, a constant, a string literal, or a symbol. Symbols in C consist of one or more punctuation characters, and function as operators or digraphs, or have syntactic importance, like the semicolon that terminates a simple statement, or the braces { } that enclose a block statement. For example, the following C statement consists of five tokens:

    printf("Hello, world.\n");

The individual tokens are:

    printf
    (
    "Hello, world.\n"
    )
    ;

The tokens interpreted by the preprocessor are parsed in the third translation phase. These are only slightly different from the tokens that the compiler interprets in the seventh phase of translation:

  • Within an #include directive, the preprocessor recognizes the additional tokens <filename> and "filename".

  • During the preprocessing phase, character constants and string literals have not yet been converted from the source character set to the execution character set.

  • Unlike the compiler proper, the preprocessor makes no distinction between integer constants and floating-point constants.

In parsing the source file into tokens, the compiler (or preprocessor) always applies the following principle: each successive non-whitespace character must be appended to the token being read, unless appending it would make a valid token invalid. This rule resolves any ambiguity in the following expression, for example:

    a+++b

Because the first + cannot be part of an identifier or keyword starting with a, it begins a new token. The second +, appended to the first, forms a valid token (the increment operator), but a third + does not. Hence the expression must be parsed as:

    a ++ + b

See Chapter 18 for more information on compiling C programs.

