
Delaying Execution

Often, kernel code (especially drivers) needs a way to delay execution for some time without using timers or a bottom-half mechanism. This is usually to allow hardware time to complete a given task. The time is typically quite short. For example, the specifications for a network card might list the time to change Ethernet modes as two microseconds. After setting the desired speed, the driver should wait at least the two microseconds before continuing.

The kernel provides a number of solutions, depending on the semantics of the delay. The solutions have different characteristics. Some hog the processor while delaying, effectively preventing the accomplishment of any real work. Other solutions do not hog the processor, but offer no guarantee that your code will resume in exactly the required time[8].

[8] Actually, no approach guarantees that the delay will be for exactly the time requested. Some come extremely close, however, and they all promise to wait at least as long as needed. Some just wait longer.

Busy Looping

The simplest solution to implement (although rarely the optimal solution) is busy waiting or busy looping. This technique works only when the time you want to delay is some integer multiple of the tick rate or when precision is not very important.

The idea is simple: Spin in a loop until the desired number of clock ticks pass. For example,

unsigned long delay = jiffies + 10;        /* ten ticks */

while (time_before(jiffies, delay))
        ;

The loop continues until jiffies is equal to or greater than delay, which will occur only after 10 clock ticks have passed. On x86 with HZ equal to 1000, this results in a wait of 10 milliseconds. Similarly,

unsigned long delay = jiffies + 2*HZ;        /* two seconds */

while (time_before(jiffies, delay))
        ;

This will spin until 2*HZ clock ticks have passed, which is always two seconds regardless of the clock rate.

This approach is not nice to the rest of the system. While your code waits, the processor is tied up spinning in a silly loop; no useful work is accomplished! In fact, you rarely want to take this brain-dead approach, and it is shown here because it is a clear and simple method for delaying execution. You might also encounter it in someone else's not-so-pretty code.

A better solution would be to reschedule your process to allow the processor to accomplish other work while your code waits:

unsigned long delay = jiffies + 5*HZ;

while (time_before(jiffies, delay))
        cond_resched();

The call to cond_resched() schedules a new process, but only if need_resched is set. In other words, this solution conditionally invokes the scheduler only if there is some more important task to run. Note that because this approach invokes the scheduler, you cannot make use of it from an interrupt handler, only from process context. In fact, all these approaches are best used from process context, because interrupt handlers should execute as quickly as possible (and busy looping does not help accomplish that goal!). Furthermore, delaying execution in any manner, if at all possible, should not occur while a lock is held or interrupts are disabled.

C aficionados might wonder what guarantee is given that the previous loops even work. The C compiler is usually free to perform a given load only once. Normally, no assurance is given that the jiffies variable in the loop's conditional statement is even reloaded on each loop iteration. The kernel requires, however, that jiffies be reread on each iteration, as the value is incremented elsewhere: in the timer interrupt. Indeed, this is why the variable is marked volatile in <linux/jiffies.h>. The volatile keyword instructs the compiler to reload the variable on each access from main memory and never alias the variable's value in a register, guaranteeing that the previous loop completes as expected.
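
For reference, the declaration looks roughly like the following; the exact qualifiers in <linux/jiffies.h> vary with kernel version and architecture:

/* roughly how jiffies is declared; volatile forces a fresh load on each access */
extern unsigned long volatile jiffies;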

Small Delays

Sometimes, kernel code (again, usually drivers) requires very short (smaller than a clock tick) and rather precise delays. This is often to synchronize with hardware, which again usually lists some minimum time for an activity to complete, often less than a millisecond. It would be impossible to use jiffies-based delays, as in the previous examples, for such a short wait. With a timer interrupt of 100Hz, the clock tick is a rather large 10 milliseconds! Even with a 1000Hz timer interrupt, the clock tick is still one millisecond. Another solution is clearly necessary for smaller, more precise delays.

Thankfully, the kernel provides two functions for microsecond and millisecond delays, both defined in <linux/delay.h>, which do not use jiffies:

void udelay(unsigned long usecs)
void mdelay(unsigned long msecs)

The former function delays execution by busy looping for the specified number of microseconds. The latter function delays execution for the specified number of milliseconds. Recall one second equals 1000 milliseconds, which equals 1,000,000 microseconds. Usage is trivial:

udelay(150);        /* delay for 150 µs */

The udelay() function is implemented as a loop that knows how many iterations can be executed in a given period of time. The mdelay() function is then implemented in terms of udelay(). Because the kernel knows how many loops the processor can complete in a second (see the sidebar on BogoMips), the udelay() function simply scales that value to the correct number of loop iterations for the given delay.
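
The idea can be sketched as follows. This is only an approximation; the real implementation uses fixed-point arithmetic and architecture-specific code, and the function name here is made up for illustration:

/* a rough sketch of the idea behind udelay(); ignores overflow and rounding */
static void udelay_sketch(unsigned long usecs)
{
        /* loops_per_jiffy * HZ is busy loop iterations per second */
        unsigned long loops = usecs * (loops_per_jiffy * HZ / 1000000UL);

        while (loops--)
                cpu_relax();        /* spin, accomplishing nothing */
}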

My BogoMIPS Are Bigger than Yours!

The BogoMIPS value has always been a source of confusion and humor. In reality, the BogoMIPS calculation has very little to do with the performance of your computer and is primarily used only for the udelay() and mdelay() functions. Its name is a contraction of bogus (that is, fake) and MIPS (millions of instructions per second). Everyone is familiar with a boot message similar to the following (this is on a 1GHz Pentium 3):

Detected 1004.932 MHz processor.
Calibrating delay loop... 1990.65 BogoMIPS

The BogoMIPS value is the number of busy loop iterations the processor can perform in a given period. In effect, BogoMIPS are a measurement of how fast a processor can do nothing! This value is stored in the loops_per_jiffy variable and is readable from /proc/cpuinfo. The delay loop functions use the loops_per_jiffy value to figure out (fairly precisely) how many busy loop iterations they need to execute to provide the requisite delay.

The kernel computes loops_per_jiffy on boot via calibrate_delay() in init/main.c.


The udelay() function should be called only for small delays because larger delays on fast machines might result in overflow. As a rule, do not use udelay() for delays over one millisecond in duration. For longer durations, mdelay() works fine. Like the other busy waiting solutions for delaying execution, neither of these functions (especially mdelay(), because it is used for such long delays) should be used unless absolutely needed. Remember that it is rude to busy loop with locks held or interrupts disabled because system response and performance will be adversely affected. If you require precise delays, however, these calls are your best bet. Typical uses of these busy waiting functions delay for a very small amount of time, usually in the microsecond range.
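
For example, a driver might wrap this rule of thumb in a small helper. The helper name and the cutoff here are illustrative, not kernel API:

/* hypothetical helper: busy-wait for usecs microseconds, falling back to
 * mdelay() above one millisecond to sidestep udelay() overflow concerns */
static void hw_settle(unsigned long usecs)
{
        if (usecs > 1000)
                mdelay((usecs + 999) / 1000);        /* round up: wait at least usecs */
        else
                udelay(usecs);
}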

schedule_timeout()

A more optimal method of delaying execution is to use schedule_timeout(). This call puts your task to sleep until at least the specified time has elapsed. There is no guarantee that the sleep duration will be exactly the specified time, only that the duration is at least as long as specified. When the specified time has elapsed, the kernel wakes the task up and places it back on the runqueue. Usage is easy:

/* set task's state to interruptible sleep */
set_current_state(TASK_INTERRUPTIBLE);

/* take a nap and wake up in "s" seconds */
schedule_timeout(s * HZ);

The lone parameter is the desired relative timeout, in jiffies. This example puts the task in interruptible sleep for s seconds. Because the task is marked TASK_INTERRUPTIBLE, it wakes up prematurely if it receives a signal. If the code does not want to process signals, you can use TASK_UNINTERRUPTIBLE instead. The task must be in one of these two states before schedule_timeout() is called or else the task will not go to sleep.

Note that because schedule_timeout() invokes the scheduler, code that calls it must be capable of sleeping. See Chapters 8 and 9 for discussions on atomicity and sleeping. In short, you must be in process context and must not hold a lock.
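
For example, to sleep for up to two seconds but notice an early wakeup, check the return value, which is the number of jiffies left in the timeout (zero if the task slept for the full duration):

signed long remaining;

set_current_state(TASK_INTERRUPTIBLE);
remaining = schedule_timeout(2 * HZ);        /* sleep for up to two seconds */

if (remaining > 0) {
        /* woken early, most likely by a signal */
} else {
        /* the full two seconds elapsed */
}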

The schedule_timeout() function is fairly straightforward. Indeed, it is a simple application of kernel timers, so let's take a look at it:

signed long schedule_timeout(signed long timeout)
{
        struct timer_list timer;
        unsigned long expire;

        switch (timeout)
        {
        case MAX_SCHEDULE_TIMEOUT:
                schedule();
                goto out;
        default:
                if (timeout < 0)
                {
                        printk(KERN_ERR "schedule_timeout: wrong timeout "
                            "value %lx from %p\n", timeout,
                            __builtin_return_address(0));
                        current->state = TASK_RUNNING;
                        goto out;
                }
        }

        expire = timeout + jiffies;

        init_timer(&timer);
        timer.expires = expire;
        timer.data = (unsigned long) current;
        timer.function = process_timeout;

        add_timer(&timer);
        schedule();
        del_timer_sync(&timer);

        timeout = expire - jiffies;

 out:
        return timeout < 0 ? 0 : timeout;
}

The function creates a timer with the original name timer and sets it to expire in timeout clock ticks in the future. It sets the timer to execute the process_timeout() function when the timer expires. It then enables the timer and calls schedule(). Because the task is supposedly marked TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, the scheduler does not run the task, but instead picks a new one.

When the timer expires, it runs process_timeout():

void process_timeout(unsigned long data)
{
        wake_up_process((task_t *) data);
}

This function puts the task in the TASK_RUNNING state and places it back on the runqueue.

When the task reschedules, it returns to where it left off in schedule_timeout() (right after the call to schedule()). In case the task was awakened prematurely (if a signal was received), the timer is destroyed. The function then returns the amount of the timeout remaining, which is zero if the task slept for the full duration.

The code in the switch() statement is for special cases and is not part of the general usage of the function. The MAX_SCHEDULE_TIMEOUT check enables a task to sleep indefinitely. In that case, no timer is set (because there is no bound on the sleep duration) and the scheduler is immediately invoked. If you do this, you had better have another method of waking your task up!
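
Used that way, the call is simply the following; something else, such as a call to wake_up_process(), must later wake the task:

/* sleep until explicitly awakened; no timer is set */
set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout(MAX_SCHEDULE_TIMEOUT);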

Sleeping on a Wait Queue, with a Timeout

Chapter 4 looked at how process context code in the kernel can place itself on a wait queue to wait for a specific event, and then invoke the scheduler to select a new task. Elsewhere, when the event finally occurs, wake_up() is called and the tasks sleeping on the wait queue are awakened and can continue running.

Sometimes it is desirable to wait for a specific event or wait for a specified time to elapse, whichever comes first. In those cases, code might simply call schedule_timeout() instead of schedule() after placing itself on a wait queue. The task wakes up when the desired event occurs or the specified time elapses. The code needs to check why it woke up (the event occurring, the time elapsing, or a received signal) and continue as appropriate.
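
A sketch of the pattern follows. The wait queue my_queue and the condition event_occurred are hypothetical stand-ins for whatever your code actually waits on:

DECLARE_WAITQUEUE(wait, current);
signed long remaining = 5 * HZ;        /* wait at most five seconds */

add_wait_queue(&my_queue, &wait);
set_current_state(TASK_INTERRUPTIBLE);

if (!event_occurred)
        remaining = schedule_timeout(remaining);

set_current_state(TASK_RUNNING);
remove_wait_queue(&my_queue, &wait);

if (event_occurred) {
        /* the event occurred */
} else if (signal_pending(current)) {
        /* awakened early by a signal */
} else if (!remaining) {
        /* the timeout elapsed */
}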
