Team LiB
Previous Section Next Section

The Linux Scheduling Algorithm

In the previous sections, we discussed process scheduling theory in the abstract, with only occasional mention of how Linux applies a given concept to reality. With the foundation of scheduling now built, we can dive into Linux's very own process scheduler.

The Linux scheduler is defined in kernel/sched.c. The scheduler algorithm and supporting code went through a large rewrite early in the 2.5 kernel development series.

Consequently, the scheduler code is entirely new and unlike the scheduler in previous kernels. The new scheduler was designed to accomplish specific goals:

  • Implement fully O(1) scheduling. Every algorithm in the new scheduler completes in constant-time, regardless of the number of running processes.

  • Implement perfect SMP scalability. Each processor has its own locking and individual runqueue.

  • Implement improved SMP affinity. Attempt to group tasks to a specific CPU and continue to run them there. Only migrate tasks from one CPU to another to resolve imbalances in runqueue sizes.

  • Provide good interactive performance. Even during considerable system load, the system should react and schedule interactive tasks immediately.

  • Provide fairness. No process should find itself starved of timeslice for any reasonable amount of time. Likewise, no process should receive an unfairly high amount of timeslice.

  • Optimize for the common case of only one or two runnable processes, yet scale well to multiple processors, each with many processes.

The new scheduler accomplished these goals.


The basic data structure in the scheduler is the runqueue. The runqueue is defined in kernel/sched.c[3] as struct runqueue. The runqueue is the list of runnable processes on a given processor; there is one runqueue per processor. Each runnable process is on exactly one runqueue. The runqueue additionally contains per-processor scheduling information. Consequently, the runqueue is the primary scheduling data structure for each processor.

[3] Why kernel/sched.c and not <linux/sched.h>? Because it is desired to abstract away the scheduler code and provide only certain interfaces to the rest of the kernel. Placing the runqueue code in a header file would allow code outside of the scheduler to get at the runqueues, and this is not desired.

Let's look at the structure, with comments describing each field:

struct runqueue {
        spinlock_t          lock;   /* spin lock that protects this runqueue */
        unsigned long       nr_running;         /* number of runnable tasks */
        unsigned long       nr_switches;        /* context switch count */
        unsigned long       expired_timestamp;    /* time of last array swap */
        unsigned long       nr_uninterruptible;   /* uninterruptible tasks */
        unsigned long long  timestamp_last_tick;  /* last scheduler tick */
        struct task_struct  *curr;                /* currently running task */
        struct task_struct  *idle;           /* this processor's idle task */
        struct mm_struct    *prev_mm;        /* mm_struct of last ran task */
        struct prio_array   *active;         /* active priority array */
        struct prio_array   *expired;        /* the expired priority array */
        struct prio_array   arrays[2];       /* the actual priority arrays */
        struct task_struct  *migration_thread; /* migration thread */
        struct list_head    migration_queue;   /* migration queue*/
        atomic_t            nr_iowait; /* number of tasks waiting on I/O */

Because runqueues are the core data structure in the scheduler, a group of macros are used to obtain the runqueue associated with a given processor or process. The macro cpu_rq(processor) returns a pointer to the runqueue associated with the given processor; the macro this_rq() returns the runqueue of the current processor; and the macro task_rq(task) returns a pointer to the runqueue on which the given task is queued.

Before a runqueue can be manipulated, it must be locked (locking is discussed in depth in Chapter 8, "Kernel Synchronization Introduction"). Because each runqueue is unique to the current processor, it is rare when a processor desires to lock a different processor's runqueue. (It does happen, however, as we will see.) The locking of the runqueue prohibits any changes to it while the lock-holder is reading or writing the runqueue's members. The most common runqueue locking scenario is when you want to lock the runqueue on which a specific task runs. In that case, the task_rq_lock() and task_rq_unlock() functions are used:

struct runqueue *rq;
unsigned long flags;

rq = task_rq_lock(task, &flags);
/* manipulate the task's runqueue, rq */
task_rq_unlock(rq, &flags);

Alternatively, the method this_rq_lock() locks the current runqueue and rq_unlock() unlocks the given runqueue:

struct runqueue *rq;

rq = this_rq_lock();
/* manipulate this process's current runqueue, rq */

To avoid deadlock, code that wants to lock multiple runqueues needs always to obtain the locks in the same order: by ascending runqueue address. (Again, Chapter 8 offers a full explanation.) For example,

/* to lock ... */
if (rq1 == rq2)
else {
        if (rq1 < rq2) {
        } else {

/* manipulate both runqueues ... */

/* to unlock ... */
if (rq1 != rq2)

These steps are made automatic by the double_rq_lock() and double_rq_unlock() functions. The preceding steps would then become

double_rq_lock(rq1, rq2);

/* manipulate both runqueues ... */

double_rq_unlock(rq1, rq2);

A quick example should help you see why the order of obtaining the locks is important. The topic of deadlock is covered in Chapters 8 and 9 because this is not a problem unique to the runqueues; nested locks always need to be obtained in the same order. The spin locks are used to prevent multiple tasks from simultaneously manipulating the runqueues. They work like a key to a door. The first task to reach the door grabs the key and enters the door, locking the door behind it. If another task reaches the door and finds it locked (because another task is already inside), it must sit and wait for the first task to exit the door and return the key. This waiting is called spinning because the task actually sits in a tight loop, repeatedly checking for the return of the key. Now, consider if one task wants to lock the first runqueue and then the second while another task wants to lock the second runqueue and then the first. Assume the first task succeeds in locking the first runqueue while simultaneously the second task succeeds in locking the second runqueue. Now the first task tries to lock the second runqueue and the second task tries to lock the first runqueue. Neither task succeeds because the other task holds the lock. Both tasks sit, waiting forever for each other. Like an impasse creating a traffic deadlock, this out-of-order locking results in the tasks waiting for each other, forever, and thus deadlocking. If both tasks obtained the locks in the same order, this scenario could not happen. See Chapters 8 and 9 for the full scoop on locking.

The Priority Arrays

Each runqueue contains two priority arrays, the active and the expired array. Priority arrays are defined in kernel/sched.c as struct prio_array. Priority arrays are the data structures that provide O(1) scheduling. Each priority array contains one queue of runnable processors per priority level. These queues contain lists of the runnable processes at each priority level. The priority arrays also contain a priority bitmap used to efficiently discover the highest-priority runnable task in the system.

struct prio_array {
        int               nr_active;         /* number of tasks in the queues */
        unsigned long     bitmap[BITMAP_SIZE];  /* priority bitmap */
        struct list_head  queue[MAX_PRIO];      /* priority queues */

MAX_PRIO is the number of priority levels on the system. By default, this is 140. Thus, there is one struct list_head for each priority. BITMAP_SIZE is the size that an array of unsigned long typed variables would have to be to provide one bit for each valid priority level. With 140 priorities and 32-bit words, this is five. Thus, bitmap is an array with five elements and a total of 160 bits.

Each priority array contains a bitmap field that has at least one bit for every priority on the system. Initially, all the bits are zero. When a task of a given priority becomes runnable (that is, its state is set to TASK_RUNNING), the corresponding bit in the bitmap is set to one. For example, if a task with priority seven is runnable, then bit seven is set. Finding the highest priority task on the system is therefore only a matter of finding the first set bit in the bitmap. Because the number of priorities is static, the time to complete this search is constant and unaffected by the number of running processes on the system. Furthermore, each supported architecture in Linux implements a fast find first set algorithm to quickly search the bitmap. This method is called sched_find_first_bit(). Many architectures provide a find-first-set instruction that operates on a given word[4]. On these systems, finding the first set bit is as trivial as executing this instruction at most a couple of times.

[4] On the x86 architecture, this instruction is called bsfl. On PPC, cntlzw is used for this purpose.

Each priority array also contains an array named queue of struct list_head queues, one queue for each priority. Each list corresponds to a given priority and in fact contains all the runnable processes of that priority that are on this processor's runqueue. Finding the next task to run is as simple as selecting the next element in the list. Within a given priority, tasks are scheduled round robin.

The priority array also contains a counter, nr_active. This is the number of runnable tasks in this priority array.

Recalculating Timeslices

Many operating systems (older versions of Linux included) have an explicit method for recalculating each task's timeslice when they have all reached zero. Typically, this is implemented as a loop over each task, such as

for (each task on the system) {
        recalculate priority
        recalculate timeslice

The priority and other attributes of the task are used to determine a new timeslice. This approach has some problems:

  • It potentially can take a long time. Worse, it scales O(n) for n tasks on the system.

  • The recalculation must occur under some sort of lock protecting the task list and the individual process descriptors. This results in high lock contention.

  • The nondeterminism of a randomly occurring recalculation of the timeslices is a problem with deterministic real-time programs.

  • It is just gross (which is a quite legitimate reason for improving something in the Linux kernel).

The new Linux scheduler alleviates the need for a recalculate loop. Instead, it maintains two priority arrays for each processor: both an active array and an expired array. The active array contains all the tasks in the associated runqueue that have timeslice left. The expired array contains all the tasks in the associated runqueue that have exhausted their timeslice. When each task's timeslice reaches zero, its timeslice is recalculated before it is moved to the expired array. Recalculating all the timeslices is then as simple as just switching the active and expired arrays. Because the arrays are accessed only via pointer, switching them is as fast as swapping two pointers. This is performed in schedule():

struct prio_array *array = rq->active;
if (!array->nr_active) {
        rq->active = rq->expired;
        rq->expired = array;

This swap is a key feature of the new O(1) scheduler. Instead of recalculating each processes priority and timeslice all the time, the O(1) scheduler performs a simple two-step array swap. This resolves the previously discussed problems.


The act of picking the next task to run and switching to it is implemented via the schedule() function. This function is called explicitly by kernel code that wants to sleep and it is invoked whenever a task is to be preempted. The schedule() function is run independently by each processor. Consequently, each CPU makes its own decisions on what process to run next.

The schedule() function is relatively simple for all it must accomplish. The following code determines the highest priority task:

struct task_struct *prev, *next;
struct list_head *queue;
struct prio_array *array;
int idx;

prev = current;
array = rq->active;
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, struct task_struct, run_list);

First, the active priority array is searched to find the first set bit. This bit corresponds to the highest priority task that is runnable. Next, the scheduler selects the first task in the list at that priority. This is the highest priority runnable task on the system and is the task the scheduler will run. See Figure 4.2.

Figure 4.2. The Linux O(1) scheduler algorithm.

If prev does not equal next, then a new task has been selected to run. The function context_switch() is called to switch from prev to next. Context switching is discussed in a subsequent section.

Two important points should be noted from the previous code. First, it is very simple and consequently quite fast. Second, the number of processes on the system has no effect on how long this code takes to execute. There is no loop over any list to find the most suitable process. In fact, nothing affects how long the schedule() code takes to find a new task. It is constant in execution time.

Calculating Priority and Timeslice

At the beginning of this chapter, you saw how priority and timeslice are used to influence the decisions that the scheduler makes. Additionally, you learned about I/O-bound and processor-bound tasks and why it is beneficial to boost the priority of interactive tasks. Now it's time to look at the actual code that implements this design.

Processes have an initial priority that is called the nice value. This value ranges from 20 to +19 with a default of zero. Nineteen is the lowest and 20 is the highest priority. This value is stored in the static_prio member of the process's task_struct. The variable is called the static priority because it does not change from what the user specifies. The scheduler, in turn, bases its decisions on the dynamic priority that is stored in prio. The dynamic priority is calculated as a function of the static priority and the task's interactivity.

The method effective_prio() returns a task's dynamic priority. The method begins with the task's nice value and computes a bonus or penalty in the range 5 to +5 based on the interactivity of the task. For example, a highly interactive task with a nice value of ten can have a dynamic priority of five. Conversely, a mild processor hog with a nice value of ten can have a dynamic priority of 12. Tasks that are only mildly interactiveat some theoretical equilibrium of I/O versus processor usagereceive no bonus or penalty and their dynamic priority is equal to their nice value.

Of course, the scheduler does not magically know whether a process is interactive. It must use some heuristic that is capable of accurately reflecting whether a task is I/O bound or processor bound. The most indicative metric is how long the task sleeps. If a task spends most of its time asleep, then it is I/O bound. If a task spends more time runnable than sleeping, it is certainly not interactive. This extends to the extreme: A task that spends nearly all the time sleeping is completely I/O bound, whereas a task that spends nearly all its time runnable is completely processor bound.

To implement this heuristic, Linux keeps a running tab on how much time a process is spent sleeping versus how much time the process spends in a runnable state. This value is stored in the sleep_avg member of the task_struct. It ranges from zero to MAX_SLEEP_AVG, which defaults to 10 milliseconds. When a task becomes runnable after sleeping, sleep_avg is incremented by how long it slept, until the value reaches MAX_SLEEP_AVG. For every timer tick the task runs, sleep_avg is decremented until it reaches zero.

This metric is surprisingly accurate. It is computed based not only on how long the task sleeps but also on how little it runs. Therefore, a task that spends a great deal of time sleeping, but also continually exhausts its timeslice, will not be awarded a huge bonusthe metric works not just to award interactive tasks but also to punish processor-bound tasks. It is also not vulnerable to abuse. A task that receives a boosted priority and timeslice quickly loses the bonus if it turns around and hogs the processor. Finally, the metric provides quick response. A newly created interactive process quickly receives a large sleep_avg. Despite this, because the bonus or penalty is applied against the initial nice value, the user can still influence the system's scheduling decisions by changing the process's nice value.

Timeslice, on the other hand, is a much simpler calculation. It is based on the static priority. When a process is first created, the new child and the parent split the parent's remaining timeslice. This provides fairness and prevents users from forking new children to get unlimited timeslice. After a task's timeslice is exhausted, however, it is recalculated based on the task's static priority. The function task_timeslice() returns a new timeslice for the given task. The calculation is a simple scaling of the static priority into a range of timeslices. The higher a task's priority, the more timeslice it receives per round of execution. The maximum timeslice, which is given to the highest priority tasks (a nice value of -20), is 800 milliseconds. Even the lowest-priority tasks (those with a nice value of +19) receive at least the minimum timeslice, MIN_TIMESLICE, which is either 5 milliseconds or one timer tick (see Chapter 10, Timers and Time Management), whichever is larger. Tasks with the default priority (a nice value of zero) receive a timeslice of 100 milliseconds. See Table 4.1.

The scheduler provides one additional aide to interactive tasks: If a task is sufficiently interactive, when it exhausts its timeslice it will not be inserted into the expired array, but instead reinserted back into the active array. Recall that timeslice recalculation is provided via the switching of the active and the expired arrays. Normally, as processes exhaust their timeslices, they are moved from the active array to the expired array. When there are no more processes in the active array, the two arrays are switched: The active becomes the expired, and the expired becomes the active. This provides O(1) timeslice recalculation. It also provides the possibility that an interactive task can become runnable but fail to run again until the array switch occurs because the task is stuck in the expired array. Reinserting interactive tasks back into the active array alleviates this problem. The task does not run immediately, but is scheduled round robin with the other tasks at its priority. The logic to provide this feature is implemented in scheduler_tick(), which is called via the timer interrupt (discussed in Chapter 10, "Timers and Time Management"):

struct task_struct *task;
struct runqueue *rq;

task = current;
rq = this_rq();

if (!--task->time_slice) {
        if (!TASK_INTERACTIVE(task) || EXPIRED_STARVING(rq))
                enqueue_task(task, rq->expired);
                enqueue_task(task, rq->active);

First, the code decrements the process's timeslice and checks whether it is now zero. If it is, the task is expired and it needs to be inserted into an array, so this code first checks whether the task is interactive via the TASK_INTERACTIVE() macro. This macro computes whether a task is "interactive enough" based on its nice value. The lower the nice value (the higher the priority) the less interactive a task needs to be. A nice +19 task can never be interactive enough to be reinserted. Conversely, a nice 20 task would need to be a heavy processor hog not to be reinserted. A task at the default nice value, zero, needs to be relatively interactive to be reinserted, but it is not too difficult. Next, the EXPIRED_STARVING() macro checks whether there are processes on the expired array that are starvingthat is, if the arrays have not been switched in a relatively long time. If they have not been switched recently, reinserting the current task into the active array further delays the switch, additionally starving the tasks on the expired array. If this is not the case, the process can be inserted into the active array. Otherwise, it is inserted into the expired array, which is the normal practice.

Sleeping and Waking Up

Tasks that are sleeping (blocked) are in a special non-runnable state. This is important because without this special state, the scheduler would select tasks that did not want to run or, worse, sleeping would have to be implemented as busy looping. A task sleeps for a number of reasons, but always while it is waiting for some event. The event can be a specified amount of time, more data from a file I/O, or another hardware event. A task can also involuntarily go to sleep when it tries to obtain a contended semaphore in the kernel (this is covered in Chapter 9, "Kernel Synchronization Methods"). A common reason to sleep is file I/Ofor example, the task issued a read() request on a file, which needs to be read in from disk. As another example, the task could be waiting for keyboard input. Whatever the case, the kernel behavior is the same: The task marks itself as sleeping, puts itself on a wait queue, removes itself from the runqueue, and calls schedule() to select a new process to execute. Waking back up is the inverse: the task is set as runnable, removed from the wait queue, and added back to the runqueue.

As discussed in the previous chapter, two states are associated with sleeping, TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE. They differ only in that tasks in the TASK_UNINTERRUPTIBLE state ignore signals, whereas tasks in the TASK_INTERRUPTIBLE state wake up prematurely and respond to a signal if one is issued. Both types of sleeping tasks sit on a wait queue, waiting for an event to occur, and are not runnable.

Sleeping is handled via wait queues. A wait queue is a simple list of processes waiting for an event to occur. Wait queues are represented in the kernel by wake_queue_head_t. Wait queues are created statically via DECLARE_WAITQUEUE() or dynamically via init_waitqueue_head(). Processes put themselves on a wait queue and mark themselves not runnable. When the event associated with the wait queue occurs, the processes on the queue are awakened. It is important to implement sleeping and waking correctly, to avoid race conditions.

Some simple interfaces for sleeping used to be in wide use. These interfaces, however, have races: It is possible to go to sleep after the condition becomes true. In that case, the task might sleep indefinitely. Therefore, the recommended method for sleeping in the kernel is a bit more complicated:

/* 'q' is the wait queue we wish to sleep on */
DECLARE_WAITQUEUE(wait, current);

add_wait_queue(q, &wait);
while (!condition) {     /* condition is the event that we are waiting for */
        set_current_state(TASK_INTERRUPTIBLE); /* or TASK_UNINTERRUPTIBLE */
        if (signal_pending(current))
                /* handle signal */
remove_wait_queue(q, &wait);

The task performs the following steps to add itself to a wait queue:

  1. Creates a wait queue entry via DECLARE_WAITQUEUE().

  2. Adds itself to a wait queue via add_wait_queue(). This wait queue awakens the process when the condition for which it is waiting occurs. Of course, there needs to be code elsewhere that calls wake_up() on the queue when the event actually does occur.

  3. Changes the process state to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE.

  4. If the state is set to TASK_INTERRUPTIBLE, a signal wakes the process up. This is called a spurious wake up (a wake-up not caused by the occurrence of the event). So check and handle signals.

  5. Tests whether the condition is true. If it is, there is no need to sleep. If it is not true, the task calls schedule().

  6. When the task awakens, it again checks whether the condition is true. If it is, it exits the loop. Otherwise, it again calls schedule() and repeats.

  7. Now that the condition is true, the task can set itself to TASK_RUNNING and remove itself from the wait queue via remove_wait_queue().

If the condition occurs before the task goes to sleep, the loop terminates, and the task does not erroneously go to sleep. Note that kernel code often has to perform various other tasks in the body of the loop. For example, it might need to release locks before calling schedule() and reacquire them after or react to other events.

Waking is handled via wake_up(), which wakes up all the tasks waiting on the given wait queue. It calls try_to_wake_up(), which sets the task's state to TASK_RUNNING, calls activate_task() to add the task to a runqueue, and sets need_resched if the awakened task's priority is higher than the priority of the current task. The code that causes the event to occur typically calls wake_up() afterward. For example, when data arrives from the hard disk, the VFS calls wake_up() on the wait queue that holds the processes waiting for the data.

An important note about sleeping is that there are spurious wake-ups. Just because a task is awakened does not mean that the event for which the task is waiting has occurred; sleeping should always be handled in a loop that ensures that the condition for which the task is waiting has indeed occurred. Figure 4.3 depicts the relationship between each scheduler state.

Figure 4.3. Sleeping and waking up.

The Load Balancer

As discussed, the Linux scheduler implements separate runqueues and locking for each processor on a symmetrical multiprocessing system. That is, each processor maintains its own list of processes and operates the scheduler on only those tasks. The entire scheduling system is, in effect, unique to each processor. How, then, does the scheduler enforce any sort of global scheduling policy on multiprocessing systems? What if the runqueues become unbalanced, say with five processes on one processor's runqueue, but only one on another? The solution is the load balancer, which works to ensure that the runqueues are balanced. The load balancer compares the current processor's runqueue to the other runqueues in the system. If it finds an imbalance, it pulls processes from the busier runqueue to the current runqueue. Ideally, every runqueue will have the same number of processes. That is a lofty goal, but the load balancer comes close.

The load balancer is implemented in kernel/sched.c as load_balance(). It has two methods of invocation. It is called by schedule() whenever the current runqueue is empty. It is also called via timer: every 1 millisecond when the system is idle and every 200 milliseconds otherwise. On uniprocessor systems, load_balance() is never called and in fact is not even compiled into the kernel image because there is only a single runqueue and thus no balancing is needed.

The load balancer is called with the current processor's runqueue locked and with interrupts disabled to protect the runqueues from concurrent access. In the case where schedule() calls load_balance(), its job is pretty clear because the current runqueue is empty and finding any process and pulling it onto this runqueue is advantageous. When the load balancer is called via timer, however, its job might be less apparent: It needs to resolve any imbalance between the runqueues to keep them about even. See Figure 4.4.

Figure 4.4. The load balancer.

The load_balance() function and related methods are fairly large and complicated, although the steps they perform are comprehensible:

First, load_balance() calls find_busiest_queue() to determine the busiest runqueue. In other words, this is the runqueue with the greatest number of processes in it. If there is no runqueue that has at least 25% more processes than the current, find_busiest_queue() returns NULL and load_balance() returns. Otherwise, the busiest runqueue is returned.

Second, load_balance() decides from which priority array on the busiest runqueue it wants to pull. The expired array is preferred because those tasks have not run in a relatively long time and thus are most likely not in the processor's cache (that is, they are not "cache hot"). If the expired priority array is empty, the active one is the only choice.

Next, load_balance() finds the highest priority (smallest value) list that has tasks, because it is more important to fairly distribute high-priority tasks than lower-priority ones.

Each task of the given priority is analyzed to find a task that is not running, not prevented to migrate via processor affinity, and not cache hot. If the task meets this criteria, pull_task() is called to pull the task from the busiest runqueue to the current runqueue.

As long as the runqueues remain imbalanced, the previous two steps are repeated and more tasks are pulled from the busiest runqueue to the current. Finally, when the imbalance is resolved, the current runqueue is unlocked and load_balance()returns.

Here is load_balance(), slightly cleaned up but otherwise in all its glory:

static int load_balance(int this_cpu, runqueue_t *this_rq,
                        struct sched_domain *sd, enum idle_type idle)
        struct sched_group *group;
        runqueue_t *busiest;
        unsigned long imbalance;
        int nr_moved;


        group = find_busiest_group(sd, this_cpu, &imbalance, idle);
        if (!group)
                goto out_balanced;

        busiest = find_busiest_queue(group);
        if (!busiest)
                goto out_balanced;

        nr_moved = 0;
        if (busiest->nr_running > 1) {
                double_lock_balance(this_rq, busiest);
                nr_moved = move_tasks(this_rq, this_cpu, busiest,
                                      imbalance, sd, idle);

        if (!nr_moved) {

                if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2)) {
                        int wake = 0;

                        if (!busiest->active_balance) {
                                busiest->active_balance = 1;
                                busiest->push_cpu = this_cpu;
                                wake = 1;
                        if (wake)
                        sd->nr_balance_failed = sd->cache_nice_tries;
        } else
                sd->nr_balance_failed = 0;

        sd->balance_interval = sd->min_interval;

        return nr_moved;


        if (sd->balance_interval < sd->max_interval)
                sd->balance_interval *= 2;

        return 0; 

    Team LiB
    Previous Section Next Section