
Chapter 4. Process Scheduling

The previous chapter discussed processes, the operating system abstraction of active program code. This chapter discusses the process scheduler, the chunk of code that puts those processes to work.

The process scheduler is the component of the kernel that selects which process to run next. The process scheduler (or simply the scheduler, as it is often called) can be viewed as the subsystem of the kernel that divides the finite resource of processor time between the runnable processes on a system. The scheduler is the basis of a multitasking operating system such as Linux. By deciding which process runs, the scheduler is responsible for best utilizing the system and giving the impression that multiple processes are executing simultaneously.

The idea behind the scheduler is simple. To best utilize processor time, assuming there are runnable processes, a process should always be running. If there are more runnable processes than processors in a system, some processes will not be running at a given moment. These processes are waiting to run. Deciding which process runs next, given a set of runnable processes, is the fundamental decision the scheduler must make.

A multitasking operating system is one that can simultaneously interleave execution of more than one process. On single-processor machines, this gives the illusion of multiple processes running concurrently. On multiprocessor machines, this also enables processes to actually run concurrently, in parallel, on different processors. On either type of machine, it also enables many processes to run in the background, not actually executing until work is available. These tasks, although in memory, are not runnable. Instead, such processes utilize the kernel to block until some event (keyboard input, network data, some time in the future, and so on) occurs. Consequently, a modern Linux system may have 100 processes in memory but only one in a runnable state.
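
To make this concrete, consider a small userspace sketch (illustrative only, not kernel code) of a process that blocks waiting for input: while it sleeps inside read(), it is not runnable, so the scheduler never selects it and it consumes no processor time until data arrives.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[128];

    /*
     * With no input pending, read() puts this process to sleep inside
     * the kernel. The process is not runnable while it waits; it becomes
     * runnable again only when the terminal has data to deliver.
     */
    ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));
    if (n > 0)
        printf("woke up with %zd bytes of input\n", n);

    return 0;
}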

Multitasking operating systems come in two flavors: cooperative multitasking and preemptive multitasking. Linux, like all Unix variants and most modern operating systems, provides preemptive multitasking. In preemptive multitasking, the scheduler decides when a process is to cease running and a new process is to resume running. The act of involuntarily suspending a running process is called preemption. The time a process runs before it is preempted is predetermined, and it is called the timeslice of the process. The timeslice, in effect, gives each runnable process a slice of the processor's time. Managing the timeslice enables the scheduler to make global scheduling decisions for the system. It also prevents any one process from monopolizing the processor. As we shall see, this timeslice is dynamically calculated in the Linux process scheduler to provide some interesting benefits.
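
As a rough sketch of the mechanism (illustrative C only, not the kernel's implementation; the structure and field names here are hypothetical), a periodic timer interrupt can charge the running process for each tick and flag it for preemption once its timeslice is exhausted:

struct task {
    int timeslice;          /* ticks remaining before preemption */
    int need_resched;       /* nonzero: preempt this task at the next opportunity */
};

/*
 * Invoked from the periodic timer interrupt. Each tick is charged to the
 * currently running task; when its timeslice runs out, the task is marked
 * so that the scheduler will pick another runnable task.
 */
void timer_tick(struct task *current)
{
    if (current->timeslice > 0)
        current->timeslice--;

    if (current->timeslice == 0)
        current->need_resched = 1;
}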

Conversely, in cooperative multitasking, a process does not stop running until it voluntarily decides to do so. The act of a process voluntarily suspending itself is called yielding. Processes are supposed to yield often, but the operating system cannot enforce this. The shortcomings of this approach are numerous: The scheduler cannot make global decisions regarding how long processes run, processes can monopolize the processor for longer than the user desires, and a hung process that never yields can potentially bring down the entire system. Thankfully, most operating systems designed in the last decade have provided preemptive multitasking, with Mac OS 9 and earlier being the most notable (and embarrassing) exceptions. Of course, Unix has been preemptively multitasked since the beginning.
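
Even on a preemptive system such as Linux, a process can still yield the processor voluntarily via the standard sched_yield() system call; on a cooperative system, calls of this sort would be the only way other processes ever ran. A minimal example:

#include <sched.h>
#include <stdio.h>

int main(void)
{
    int i;

    for (i = 0; i < 5; i++) {
        printf("doing a bit of work (%d)\n", i);
        sched_yield();  /* voluntarily give up the processor */
    }

    return 0;
}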

During the 2.5 kernel development series, the Linux kernel received a scheduler overhaul. A new scheduler, commonly called the O(1) scheduler because of its algorithmic behavior[1], solved the shortcomings of the previous Linux scheduler and introduced powerful new features and performance characteristics. This chapter discusses the fundamentals of scheduler design and how they apply to the new O(1) scheduler and its goals, design, implementation, algorithms, and related system calls.

[1] O(1) is an example of big-O notation. In short, it means the scheduler can do its thing in constant time, regardless of the size of the input. A full explanation of big-O notation is in Appendix C, "Algorithmic Complexity," for the curious.
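
To give a flavor of what constant time means here (the real data structures are covered later in this chapter; the names and sizes below are illustrative, not the kernel's), imagine one run list per priority level plus a bitmap recording which levels have runnable processes. Selecting the next process is then a fixed number of operations, independent of how many processes exist:

#include <stddef.h>
#include <strings.h>    /* ffs() */

#define NR_PRIOS 32     /* illustrative; one priority level per bit */

struct task;            /* details are irrelevant to this sketch */

struct runqueue {
    unsigned int bitmap;            /* bit n set: priority n has runnable tasks */
    struct task *queue[NR_PRIOS];   /* head of the run list for each priority */
};

/*
 * Pick the highest-priority runnable task (lower number means higher
 * priority). Finding the first set bit and indexing an array take the
 * same number of steps whether 5 or 5,000 processes are runnable.
 */
struct task *pick_next(struct runqueue *rq)
{
    int bit = ffs(rq->bitmap);      /* 1-based index of first set bit, 0 if none */

    if (bit == 0)
        return NULL;                /* no runnable tasks */
    return rq->queue[bit - 1];
}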
