3.7 System Monitoring

Both hardware and software are prone to failure, sometimes drastically. Although careful design and runtime testing can reduce the occurrence of failures, they are sometimes unavoidable. It is the task of the embedded system designer to plan for such a possibility and to provide a means of recovery. Often, failure detection and recovery are done by means of system monitoring hardware and software such as watchdogs.

Linux supports two types of system monitoring facilities: watchdog timers and hardware health monitoring. There are both hardware and software implementations of watchdog timers, whereas health monitors always require appropriate hardware. A watchdog timer must be reinitialized periodically; otherwise, it reboots the system. If the system hangs, the reinitialization stops, and the timer eventually expires and causes a reboot. Hardware health monitors provide information regarding the system's physical state. This information can in turn be used to signal or correct actual physical problems such as overheating or voltage irregularities.
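The reinitialize-or-reboot behavior of a watchdog timer can be illustrated with a small simulation. The following is only a toy model: the names SimulatedWatchdog and first_reboot_time are invented for this sketch, time is passed in as plain numbers, and nothing here touches a real watchdog device.

```python
class SimulatedWatchdog:
    """Toy model of a watchdog timer; it never touches real hardware."""

    def __init__(self, timeout_s, now=0.0):
        self.timeout_s = timeout_s
        self.deadline = now + timeout_s

    def kick(self, now):
        """Reinitialize the countdown, as a healthy system does periodically."""
        self.deadline = now + self.timeout_s

    def expired(self, now):
        """Return True once the countdown has run out (a reboot would occur)."""
        return now >= self.deadline


def first_reboot_time(kick_times, timeout_s, horizon_s):
    """Return the simulated time of the first reboot, or None if the
    watchdog never expires before horizon_s."""
    dog = SimulatedWatchdog(timeout_s)
    for t in sorted(kick_times):
        if t > horizon_s:
            break
        if dog.expired(t):
            # The kick came too late: the timer had already fired.
            return dog.deadline
        dog.kick(t)
    return dog.deadline if dog.expired(horizon_s) else None
```

With a 10-second timeout, kicks at 5, 10, and 15 seconds keep the simulated system alive through second 20. If the kicks stop after second 10, as they would when the system hangs, the function reports a reboot at second 20.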

The kernel includes drivers for many watchdog timers. The complete list of supported watchdog devices can be found in the Watchdog Cards submenu of the kernel build configuration menu. The list includes drivers for watchdog timer peripheral cards, a software watchdog, and drivers for watchdog timers found in some CPUs such as the MachZ and the SuperH. Although you may want to use the software watchdog to avoid the cost of a real hardware watchdog, note that the software watchdog may fail to reboot the system in some circumstances. For example, because it depends on the kernel's own timers, it cannot fire if the kernel itself locks up. Watchdog timers appear as /dev/watchdog in Linux and have to be written to periodically to avoid a system reboot. This updating task is traditionally carried out by the watchdog daemon available from ftp://metalab.unc.edu/pub/linux/system/daemons/watchdog/. In an actual embedded system, however, you may want to have the main application carry out the update instead of using the watchdog daemon, since the latter may have no way of knowing whether the main application has stopped functioning properly.
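As a minimal sketch of having the main application update the watchdog itself, the following writes to the watchdog device only after each successful pass of the application's main loop. The names WatchdogFeeder and run_main_loop, and the step callback, are hypothetical and exist only for this illustration. The sketch also assumes a driver that supports the "magic close" convention, in which writing the character V just before closing the device disarms the timer; not every driver does.

```python
import os

WATCHDOG_DEVICE = "/dev/watchdog"


class WatchdogFeeder:
    """Hypothetical helper that keeps a watchdog device from expiring."""

    def __init__(self, path=WATCHDOG_DEVICE):
        # On a real watchdog, opening the device starts the countdown.
        self.fd = os.open(path, os.O_WRONLY)

    def feed(self):
        # Any write reinitializes the timer; the byte value is irrelevant
        # here (only "V" has a special meaning, see close()).
        os.write(self.fd, b"\0")

    def close(self, disarm=True):
        if disarm:
            # "Magic close": drivers that support it stop the timer when
            # "V" is the last byte written before the device is closed.
            os.write(self.fd, b"V")
        os.close(self.fd)


def run_main_loop(step, feeder, iterations):
    """Run step() up to `iterations` times, feeding the watchdog only
    after each pass in which step() reports that the application is healthy."""
    healthy = True
    for _ in range(iterations):
        healthy = step()
        if not healthy:
            break  # stop feeding so that the watchdog can expire
        feeder.feed()
    # Disarm only on a clean exit; otherwise leave the timer running.
    feeder.close(disarm=healthy)
    return healthy
```

Tying the write to the outcome of step() is the point of the design: if the application's own work stops succeeding, the updates stop too, which a separate daemon could not guarantee. For experimentation, path can point at an ordinary file instead of the real device.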

In addition to the software watchdog available in the Linux kernel, RTAI provides an elaborate software watchdog with configurable policies. The main purpose of the RTAI watchdog is to protect the system against programming errors in RTAI applications. Hence, misbehaving tasks cannot hang the system. Upon detecting the offending task, the RTAI watchdog can apply a number of remedies to it including suspending it, killing it, and stretching its period. The RTAI watchdog and appropriate documentation are part of the mainstream RTAI distribution.

Finally, Linux supports quite a few hardware monitoring devices through the "Hardware Monitoring by lm_sensors" project found at http://www2.lm-sensors.nu/~lm78/. The project's web site contains a complete list of supported devices along with extensive documentation on the installation and operation of the software. The lm_sensors package available from the project's web site includes both the device drivers and user-level utilities to interface with the drivers. These utilities include sensord, a daemon that can log sensor values and alert the system through the ALERT syslog level when an alarm condition occurs. The site also provides links to external projects and resources related to lm_sensors.
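An application can also poll sensor values itself rather than rely on sensord. The following sketch assumes a kernel that exports sensor drivers through the sysfs hwmon interface, where each temp*_input file under /sys/class/hwmon holds a temperature in millidegrees Celsius. This may not match the interface of the lm_sensors release installed on your target, so treat the paths as assumptions. The function names read_temperatures and over_limit are invented for this example.

```python
import glob
import os

HWMON_ROOT = "/sys/class/hwmon"  # assumed sysfs location of sensor data


def read_temperatures(root=HWMON_ROOT):
    """Return {file path: degrees Celsius} for each temp*_input file."""
    readings = {}
    pattern = os.path.join(root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                # Values are reported in millidegrees Celsius.
                readings[path] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            # Skip sensors that cannot be read or return unusable data.
            continue
    return readings


def over_limit(readings, limit_c):
    """Return only the readings that are above limit_c degrees Celsius."""
    return {name: value for name, value in readings.items() if value > limit_c}
```

A monitoring loop could call over_limit(read_temperatures(), 80.0) periodically and take the appropriate action, such as logging an alarm or shutting down a subsystem, whenever the result is not empty. The 80-degree limit is only a placeholder; appropriate thresholds depend on the hardware.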