
5.9 Handling SIGCHLD Signals

The purpose of the zombie state is to maintain information about the child for the parent to fetch at some later time. This information includes the process ID of the child, its termination status, and information on the resource utilization of the child (CPU time, memory, etc.). If a process terminates, and that process has children in the zombie state, the parent process ID of all the zombie children is set to 1 (the init process), which will inherit the children and clean them up (i.e., init will wait for them, which removes the zombie). Some Unix systems show the COMMAND column for a zombie process as <defunct>.

Handling Zombies

Obviously we do not want to leave zombies around. They take up space in the kernel and eventually we can run out of processes. Whenever we fork children, we must wait for them to prevent them from becoming zombies. To do this, we establish a signal handler to catch SIGCHLD, and within the handler, we call wait. (We will describe the wait and waitpid functions in Section 5.10.) We establish the signal handler by adding the function call

Signal (SIGCHLD, sig_chld);

in Figure 5.2, after the call to listen. (It must be done sometime before we fork the first child and needs to be done only once.) We then define the signal handler, the function sig_chld, which we show in Figure 5.7.

Figure 5.7 Version of SIGCHLD signal handler that calls wait (improved in Figure 5.11).


    #include     "unp.h"

    void
    sig_chld(int signo)
    {
        pid_t   pid;
        int     stat;

        pid = wait(&stat);
        printf("child %d terminated\n", pid);
        return;
    }

Warning: Calling standard I/O functions such as printf in a signal handler is not recommended, for reasons that we will discuss in Section 11.18. We call printf here as a diagnostic tool to see when the child terminates.

Under System V and Unix 98, the child of a process does not become a zombie if the process sets the disposition of SIGCHLD to SIG_IGN. Unfortunately, this works only under System V and Unix 98. POSIX explicitly states that this behavior is unspecified. The portable way to handle zombies is to catch SIGCHLD and call wait or waitpid.
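The portable approach just described can be exercised without a network connection. The following is a minimal sketch, not the code from Figure 5.7: the names reap_one_child, reaped_count, and demo_reap are hypothetical, the handler is installed with sigaction directly, and it increments a counter instead of calling printf, so that no standard I/O happens inside the handler.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Written only by the handler while demo_reap is waiting. */
static volatile sig_atomic_t reaped_count = 0;

/* Hypothetical handler: reaps one child and counts it; no stdio. */
static void
reap_one_child(int signo)
{
    int     saved_errno = errno;    /* wait may overwrite errno */
    int     stat;

    (void) signo;
    if (wait(&stat) > 0)
        reaped_count++;
    errno = saved_errno;
    return;
}

/* Forks one child that exits at once and waits for the handler to
 * reap it. Returns the number of children reaped, or -1 on error. */
static int
demo_reap(void)
{
    struct sigaction act;
    sigset_t    block, old, waitmask;
    pid_t       pid;

    memset(&act, 0, sizeof(act));
    act.sa_handler = reap_one_child;
    sigemptyset(&act.sa_mask);
    if (sigaction(SIGCHLD, &act, NULL) < 0)
        return -1;

    /* Block SIGCHLD so it cannot slip in before sigsuspend. */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block, &old) < 0)
        return -1;
    waitmask = old;
    sigdelset(&waitmask, SIGCHLD);

    reaped_count = 0;
    if ( (pid = fork()) < 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }
    if (pid == 0)
        _exit(0);                   /* child terminates immediately */

    while (reaped_count == 0)
        sigsuspend(&waitmask);      /* atomically unblock and wait */
    sigprocmask(SIG_SETMASK, &old, NULL);

    /* No zombie is left: the child has already been waited for. */
    assert(waitpid(pid, NULL, WNOHANG) == -1 && errno == ECHILD);
    return reaped_count;
}
```

Calling demo_reap() from main should return 1. Blocking SIGCHLD around the fork is a choice of this sketch; it keeps the signal from arriving between the test of reaped_count and the call to sigsuspend.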

If we compile this program (Figure 5.2, with the added call to Signal and our sig_chld handler) under Solaris 9 and use the signal function from the system library (not our version from Figure 5.6), we have the following:

solaris % tcpserv02 &                     start server in background
[2] 16939
solaris % tcpcli01                        then start client in foreground
hi there                                  we type this
hi there                                  and this is echoed
                                          we type our EOF character
child 16942 terminated                    output by printf in signal handler
accept error: Interrupted system call     main function aborts

The sequence of steps is as follows:

  1. We terminate the client by typing our EOF character. The client TCP sends a FIN to the server and the server responds with an ACK.

  2. The receipt of the FIN delivers an EOF to the child's pending readline. The child terminates.

  3. The parent is blocked in its call to accept when the SIGCHLD signal is delivered. The sig_chld function executes (our signal handler), wait fetches the child's PID and termination status, and printf is called from the signal handler. The signal handler returns.

  4. Since the signal was caught by the parent while the parent was blocked in a slow system call (accept), the kernel causes the accept to return an error of EINTR (interrupted system call). The parent does not handle this error (Figure 5.2), so it aborts.

The purpose of this example is to show that when writing network programs that catch signals, we must be cognizant of interrupted system calls, and we must handle them. In this specific example, running under Solaris 9, the signal function provided in the standard C library does not cause an interrupted system call to be automatically restarted by the kernel. That is, the SA_RESTART flag that we set in Figure 5.6 is not set by the signal function in the system library. Some other systems automatically restart the interrupted system call. If we run the same example under 4.4BSD, using its library version of the signal function, the kernel restarts the interrupted system call and accept does not return an error. Handling this difference between operating systems is one reason we define our own version of the signal function, which we use throughout the text (Figure 5.6).
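To make the role of SA_RESTART concrete, here is a simplified sketch of a signal-installing function built on sigaction. It is not a reproduction of Figure 5.6: the names signal_restart, sigfunc_t, restart_is_requested, and noop_handler are hypothetical, and the sketch requests SA_RESTART for every signal, which may differ from what the text's own function does.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <string.h>

typedef void sigfunc_t(int);    /* hypothetical name for a handler type */

/* Installs func for signo and asks the kernel to restart interrupted
 * system calls. Returns the previous handler, or SIG_ERR on error. */
static sigfunc_t *
signal_restart(int signo, sigfunc_t *func)
{
    struct sigaction act, oact;

    memset(&act, 0, sizeof(act));
    act.sa_handler = func;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
#ifdef SA_RESTART
    act.sa_flags |= SA_RESTART;     /* only where the flag is supported */
#endif
    if (sigaction(signo, &act, &oact) < 0)
        return SIG_ERR;
    return oact.sa_handler;
}

/* Returns 1 if the current disposition of signo has SA_RESTART set,
 * 0 if it does not, and -1 on error. */
static int
restart_is_requested(int signo)
{
    struct sigaction cur;

    assert(signo > 0);
    if (sigaction(signo, NULL, &cur) < 0)
        return -1;
#ifdef SA_RESTART
    return (cur.sa_flags & SA_RESTART) != 0;
#else
    return 0;
#endif
}

/* Handler that does nothing; used only to have something to install. */
static void
noop_handler(int signo)
{
    (void) signo;
    return;
}
```

After signal_restart(SIGUSR1, noop_handler), a call to restart_is_requested(SIGUSR1) should return 1 on a system that defines SA_RESTART. Even then, as the next section explains, not every interrupted call is necessarily restarted.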

As part of the coding conventions used in this text, we always code an explicit return in our signal handlers (Figure 5.7), even though falling off the end of the function does the same thing for a function returning void. When reading the code, the unnecessary return statement acts as a reminder that the return may interrupt a system call.

Handling Interrupted System Calls

We used the term "slow system call" to describe accept, and we use this term for any system call that can block forever. That is, the system call need never return. Most networking functions fall into this category. For example, there is no guarantee that a server's call to accept will ever return, if there are no clients that will connect to the server. Similarly, our server's call to read in Figure 5.3 will never return if the client never sends a line for the server to echo. Other examples of slow system calls are reads and writes of pipes and terminal devices. A notable exception is disk I/O, which usually returns to the caller (assuming no catastrophic hardware failure).

The basic rule that applies here is that when a process is blocked in a slow system call and the process catches a signal and the signal handler returns, the system call can return an error of EINTR. Some kernels automatically restart some interrupted system calls. For portability, when we write a program that catches signals (most concurrent servers catch SIGCHLD), we must be prepared for slow system calls to return EINTR. Portability problems are caused by the qualifiers "can" and "some," which were used earlier, and the fact that support for the POSIX SA_RESTART flag is optional. Even if an implementation supports the SA_RESTART flag, not all interrupted system calls may automatically be restarted. Most Berkeley-derived implementations, for example, never automatically restart select, and some of these implementations never restart accept or recvfrom.
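The basic rule can be observed with a pipe instead of a socket. In this sketch (the names on_alarm and read_interrupted_errno are hypothetical), a handler for SIGALRM is installed without SA_RESTART, a one-shot 100 ms timer is armed with setitimer, and the process then blocks in read on an empty pipe. We assume a system where read on a pipe is not restarted when SA_RESTART is absent.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* The handler only needs to return; catching the signal is the point. */
static void
on_alarm(int signo)
{
    (void) signo;
    return;
}

/* Blocks in read on an empty pipe until SIGALRM is caught.
 * Returns the errno left by read (EINTR is expected), 0 if read
 * did not fail, or -1 if the setup itself failed. */
static int
read_interrupted_errno(void)
{
    int     fds[2], err;
    char    buf[1];
    ssize_t n;
    struct sigaction act;
    struct itimerval tv;

    if (pipe(fds) < 0)
        return -1;

    memset(&act, 0, sizeof(act));
    act.sa_handler = on_alarm;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;               /* deliberately no SA_RESTART */
    if (sigaction(SIGALRM, &act, NULL) < 0)
        return -1;

    memset(&tv, 0, sizeof(tv));
    tv.it_value.tv_usec = 100000;   /* fire once, after 100 ms */
    if (setitimer(ITIMER_REAL, &tv, NULL) < 0)
        return -1;

    /* Slow system call: the pipe is empty and its write end is open. */
    n = read(fds[0], buf, sizeof(buf));
    err = (n < 0) ? errno : 0;
    assert(n <= 0);                 /* nothing was ever written */

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

On such a system read_interrupted_errno() returns EINTR. If act.sa_flags were set to SA_RESTART instead, the read would typically be restarted and block indefinitely, since nothing is ever written to the pipe.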

To handle an interrupted accept, we change the call to accept in Figure 5.2, the beginning of the for loop, to the following:

     for ( ; ; ) {
         clilen = sizeof (cliaddr);
         if ( (connfd = accept (listenfd, (SA *) &cliaddr, &clilen)) < 0) {
             if (errno == EINTR)
                 continue;         /* back to for () */
             else
                 err_sys ("accept error");
         }

Notice that we call accept and not our wrapper function Accept, since we must handle the failure of the function ourselves.
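The same restart logic can be packaged and tried in isolation. The sketch below is self-contained and does not use unp.h; the names accept_retry, demo_accept_retry, on_usr1, and sleep_ms are hypothetical. It assumes that the loopback address is usable and that accept returns EINTR when a signal is caught without SA_RESTART. A child process first sends SIGUSR1 to the parent, which is blocked in accept, and then connects.

```c
#define _XOPEN_SOURCE 700
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static void
on_usr1(int signo)
{
    (void) signo;
    return;
}

/* Calls accept again whenever it fails with EINTR, counting the
 * interruptions. Returns the connected descriptor, or -1 on error. */
static int
accept_retry(int listenfd, int *eintr_count)
{
    int     connfd;

    assert(eintr_count != NULL);
    for ( ; ; ) {
        if ( (connfd = accept(listenfd, NULL, NULL)) >= 0)
            return connfd;
        if (errno != EINTR)
            return -1;              /* a real error */
        (*eintr_count)++;           /* interrupted: back to for () */
    }
}

static void
sleep_ms(long ms)
{
    struct timespec ts;

    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/* Returns how many times accept was interrupted before a connection
 * arrived, or -1 on failure. */
static int
demo_accept_retry(void)
{
    struct sockaddr_in addr;
    socklen_t   len = sizeof(addr);
    struct sigaction act;
    int         listenfd, connfd, fd, count = 0;
    pid_t       pid;

    memset(&act, 0, sizeof(act));
    act.sa_handler = on_usr1;       /* sa_flags stays 0: no SA_RESTART */
    sigemptyset(&act.sa_mask);
    if (sigaction(SIGUSR1, &act, NULL) < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* port 0: kernel picks */
    if ( (listenfd = socket(AF_INET, SOCK_STREAM, 0)) < 0 ||
        bind(listenfd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(listenfd, 1) < 0 ||
        getsockname(listenfd, (struct sockaddr *) &addr, &len) < 0)
        return -1;

    if ( (pid = fork()) < 0)
        return -1;
    if (pid == 0) {                 /* child: interrupt, then connect */
        sleep_ms(200);
        kill(getppid(), SIGUSR1);
        sleep_ms(200);
        fd = socket(AF_INET, SOCK_STREAM, 0);
        connect(fd, (struct sockaddr *) &addr, sizeof(addr));
        _exit(0);
    }

    connfd = accept_retry(listenfd, &count);
    if (connfd >= 0)
        close(connfd);
    close(listenfd);
    while (waitpid(pid, NULL, 0) < 0 && errno == EINTR)
        ;                           /* waitpid can be interrupted too */
    return (connfd >= 0) ? count : -1;
}
```

Under these assumptions demo_accept_retry() returns 1: accept is interrupted once by SIGUSR1, the loop calls it again, and the second call returns the child's connection.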

What we are doing in this piece of code is restarting the interrupted system call. This is fine for accept, along with functions such as read, write, select, and open. But there is one function that we cannot restart: connect. If this function returns EINTR, we cannot call it again, as doing so will return an immediate error. When connect is interrupted by a caught signal and is not automatically restarted, we must call select to wait for the connection to complete, as we will describe in Section 16.3.
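The special handling for connect can be outlined as follows. This is only a sketch under stated assumptions, not the code of Section 16.3: the names connect_no_restart and demo_connect_loopback are hypothetical, and we assume that after EINTR the connection attempt continues in the kernel, that the socket becomes writable when the attempt finishes, and that the SO_ERROR socket option then reports its outcome. The demonstration function exercises only the uninterrupted path, over the loopback address.

```c
#define _XOPEN_SOURCE 700
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Like connect, but if the call is interrupted we do not call connect
 * again; we wait with select for the attempt to complete. */
static int
connect_no_restart(int sockfd, const struct sockaddr *addr, socklen_t addrlen)
{
    fd_set      wset;
    int         n, err = 0;
    socklen_t   errlen = sizeof(err);

    assert(sockfd >= 0 && sockfd < FD_SETSIZE);
    if (connect(sockfd, addr, addrlen) == 0)
        return 0;
    if (errno != EINTR)
        return -1;

    for ( ; ; ) {                   /* wait until the socket is writable */
        FD_ZERO(&wset);
        FD_SET(sockfd, &wset);
        n = select(sockfd + 1, NULL, &wset, NULL, NULL);
        if (n > 0)
            break;
        if (n < 0 && errno != EINTR)
            return -1;              /* select itself can be interrupted */
    }

    /* Writable does not mean connected: fetch the pending error. */
    if (getsockopt(sockfd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0)
        return -1;
    if (err != 0) {
        errno = err;
        return -1;
    }
    return 0;
}

/* Connects to a listening socket on the loopback address.
 * Returns 0 on success, -1 on failure. */
static int
demo_connect_loopback(void)
{
    struct sockaddr_in addr;
    socklen_t   len = sizeof(addr);
    int         listenfd, sockfd, rc;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if ( (listenfd = socket(AF_INET, SOCK_STREAM, 0)) < 0 ||
        bind(listenfd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(listenfd, 1) < 0 ||
        getsockname(listenfd, (struct sockaddr *) &addr, &len) < 0)
        return -1;
    if ( (sockfd = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        close(listenfd);
        return -1;
    }
    rc = connect_no_restart(sockfd, (struct sockaddr *) &addr, sizeof(addr));
    close(sockfd);
    close(listenfd);
    return rc;
}
```

The select loop may itself be restarted after EINTR, since select is one of the functions that can safely be called again; only connect must not be reissued on the same socket.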
