
The pdflush Daemon

Write operations are deferred in the page cache. When data in the page cache is newer than the data on the backing store, that data is called dirty. Dirty pages that accumulate in memory eventually need to be written back to disk. Dirty page writeback occurs in two situations:

  • When free memory shrinks below a specified threshold, the kernel must write dirty data back to disk to free memory.

  • When dirty data grows older than a specific threshold, sufficiently old data is written back to disk, to ensure that dirty data does not remain dirty indefinitely.

These two jobs have rather different goals. In fact, in older kernels they were performed by two separate kernel threads (see the following section). In 2.6, however, a gang[4] of kernel threads, the pdflush background writeback daemons (or, simply, the pdflush threads), performs both jobs. Rumor has it that pdflush is short for "dirty page flush." Ignore the confusing name; let's look at each of these goals in more detail.

[4] The term "gang" is commonly used in computer science to denote a group of things that can operate in parallel.

First, the pdflush threads need to flush dirty data to disk when the amount of free memory in the system shrinks below a specified level. The goal of this background writeback is to regain memory from dirty pages when available physical memory is low. The specified memory level is configurable via the dirty_background_ratio sysctl. When free memory drops below this threshold, the kernel invokes the wakeup_bdflush()[5] call to wake up a pdflush thread and have it run the background_writeout() function to begin writeback of dirty pages. This function takes a lone parameter: the number of pages to attempt to write back. The function continues writing out data until both of the following conditions are true:

[5] Yes, it is misnamed. It should be wakeup_pdflush(). See the following section for the heritage of this call.

  • The specified minimum number of pages has been written out.

  • The amount of free memory is above the dirty_background_ratio threshold.

These conditions ensure that pdflush does its part to relieve low-memory conditions. Writeback stops prior to these conditions only if pdflush writes back all the dirty pages and there is nothing left to do.
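
The shape of this loop can be sketched in C. The following is a condensed illustration, not the actual mm/page-writeback.c code; below_background_threshold() and writeback_batch() are hypothetical helpers standing in for the real page cache and writeback machinery, and the 1024-page batch size is illustrative.

#define MAX_WRITEBACK_PAGES     1024    /* pages written per iteration */

/*
 * Condensed sketch of background_writeout().  nr_pages is the minimum
 * number of pages to write back.  below_background_threshold() and
 * writeback_batch() are hypothetical placeholders.
 */
static void background_writeout(unsigned long nr_pages)
{
        long min_pages = nr_pages;

        for (;;) {
                long written;

                /*
                 * Stop only when free memory is back above the
                 * dirty_background_ratio threshold and at least
                 * nr_pages pages have been written out.
                 */
                if (!below_background_threshold() && min_pages <= 0)
                        break;

                /* Write back one batch of dirty pages. */
                written = writeback_batch(MAX_WRITEBACK_PAGES);
                min_pages -= written;

                /* Nothing was written: no dirty pages remain. */
                if (written == 0)
                        break;
        }
}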

For its second goal, pdflush periodically wakes up (unrelated to low-memory conditions) and writes out very old dirty pages. This ensures that no dirty pages remain in memory indefinitely. Because memory is volatile, dirty pages that have not been written back to disk are lost in the event of a system failure; consequently, periodically synchronizing the page cache with the disk is important. On system boot, a timer is initialized to wake up a pdflush thread and have it run the wb_kupdate() function. This function then writes back all data that was modified longer than dirty_expire_centisecs hundredths of a second ago. The timer is then reinitialized to expire again in dirty_writeback_centisecs hundredths of a second. In this manner, the pdflush threads periodically wake up and write to disk all dirty pages older than a specified limit.
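
The periodic path can be sketched similarly. This is a simplified illustration of the timer-driven flow, assuming the interfaces described above; pdflush_operation() is the 2.6 helper for handing work to an idle pdflush thread, writeback_pages_older_than() is a hypothetical placeholder, and the bookkeeping in the real wb_kupdate() is omitted.

static struct timer_list wb_timer;      /* rearmed after each periodic run */

/* Simplified wb_kupdate(): write back sufficiently old dirty data. */
static void wb_kupdate(unsigned long arg)
{
        /* Anything dirtied before this instant is old enough to flush. */
        unsigned long oldest = jiffies -
                        (dirty_expire_centisecs * HZ) / 100;

        /* Hypothetical helper: flush pages dirtied before 'oldest'. */
        writeback_pages_older_than(oldest);

        /* Rearm the timer for the next periodic wakeup. */
        mod_timer(&wb_timer,
                  jiffies + (dirty_writeback_centisecs * HZ) / 100);
}

/* Timer handler: hand the periodic work off to a pdflush thread. */
static void wb_timer_fn(unsigned long unused)
{
        pdflush_operation(wb_kupdate, 0);
}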

The system administrator may set these values either in /proc/sys/vm or via sysctl. Table 15.1 lists the variables.

Table 15.1. pdflush Settings

dirty_background_ratio: As a percentage of total memory, the number of pages at which the pdflush threads begin writeback of dirty data.

dirty_expire_centisecs: In hundredths of a second, how old data must be to be written out the next time a pdflush thread wakes to perform periodic writeback.

dirty_ratio: As a percentage of total memory, the number of pages at which a process generating dirty data will itself begin writeback of dirty data.

dirty_writeback_centisecs: In hundredths of a second, how often the pdflush threads should wake up to write data back out to disk.

laptop_mode: A Boolean value controlling laptop mode. See the following section.


The pdflush code lives in mm/pdflush.c and the writeback mechanism lives in mm/page-writeback.c and fs/fs-writeback.c.
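
For illustration, these tunables are ordinary files under /proc/sys/vm, so they can be inspected and changed from user space (the sysctl utility does the same thing). A minimal user-space sketch, assuming a 2.6 system and root privileges:

#include <stdio.h>

/* Read one pdflush tunable and then slow down periodic writeback.
 * laptop_mode, dirty_expire_centisecs, and so on are set the same way. */
int main(void)
{
        char buf[64];
        FILE *f;

        /* Print the current periodic writeback interval. */
        f = fopen("/proc/sys/vm/dirty_writeback_centisecs", "r");
        if (f) {
                if (fgets(buf, sizeof(buf), f))
                        printf("dirty_writeback_centisecs = %s", buf);
                fclose(f);
        }

        /* Wake the pdflush threads only once a minute (6000 centisecs). */
        f = fopen("/proc/sys/vm/dirty_writeback_centisecs", "w");
        if (f) {
                fputs("6000\n", f);
                fclose(f);
        }
        return 0;
}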

Laptop Mode

Laptop mode is a special page writeback strategy intended to optimize battery life by minimizing hard disk activity and allowing hard drives to remain spun down as long as possible. It is configurable via /proc/sys/vm/laptop_mode. By default, this file contains a zero and laptop mode is disabled. Writing a one to this file enables laptop mode.

Laptop mode makes a single change to page writeback behavior. In addition to performing writeback of dirty pages when they grow too old, pdflush also piggybacks off any other physical disk I/O, flushing all dirty buffers to disk. In this manner, pdflush takes advantage of the fact that the disk was just spun up, ensuring that it will not cause the disk to spin up later.

This behavioral change makes the most sense when dirty_expire_centisecs and dirty_writeback_centisecs are set to large values, say, ten minutes. With writeback so delayed, the disk is spun up infrequently, and when it does spin up, laptop mode ensures that the opportunity is well utilized.

Many Linux distributions automatically enable and disable laptop mode, and adjust other pdflush tunables, when going on and off battery power. This allows a machine to benefit from laptop mode while on battery and to return automatically to normal page writeback behavior when plugged into AC power.

bdflush and kupdated

Prior to the 2.6 kernel, the job of the pdflush threads was handled by two other kernel threads, bdflush and kupdated.

The bdflush kernel thread performed background writeback of dirty pages when available memory was low. A set of thresholds was maintained, similar to pdflush, and bdflush was awakened via wakeup_bdflush() whenever free memory dropped below those thresholds.

Two main differences distinguish bdflush and pdflush. The first, discussed in the next section, is that there was always only one bdflush daemon, whereas the number of pdflush threads is dynamic. The second is that bdflush was buffer-based: It wrote back dirty buffers. Conversely, pdflush is page-based: It writes back whole pages. Of course, the pages may correspond to buffers, but the actual I/O unit is a full page, not a single buffer. This is beneficial because managing pages is easier than managing buffers; pages are a more general and common unit.

Because bdflush flushed buffers only when memory was low or the number of buffers grew too large, the kupdated thread was introduced to periodically write back dirty pages. It served a purpose identical to that of pdflush's wb_kupdate() function.

Both the bdflush and kupdated kernel threads, along with their functionality, were replaced by the pdflush threads.

Congestion Avoidance: Why We Have Multiple Threads

One of the major flaws in the bdflush solution was that bdflush consisted of a single thread. This led to possible congestion during heavy page writeback, where the single bdflush thread would block on a single congested device queue (the list of I/O requests waiting to be submitted to disk) while other device queues sat relatively idle. If the system has multiple disks and the associated processing power, the kernel should be able to keep every disk busy. Unfortunately, even with plenty of data needing writeback, bdflush could become stuck handling a single queue and fail to keep all disks saturated. This occurs because disk throughput is a finite, and unfortunately rather small, number. If only a single thread is performing page writeback, that thread can easily spend a long time waiting on a single disk, because disk throughput is such a limiting quantity. To mitigate this, the kernel multithreads page writeback. In this manner, no single device queue can become a bottleneck.

The 2.6 kernel solves this problem by allowing multiple pdflush threads to exist. Each thread individually flushes dirty pages to disk, allowing different pdflush threads to concentrate on different device queues.

The number of threads changes throughout the uptime of a system, according to a simple algorithm. If all existing pdflush threads have been busy for at least one second, a new pdflush thread is created. The total number of threads cannot exceed MAX_PDFLUSH_THREADS, which by default is eight. Conversely, if a pdflush thread has been asleep for more than a second, it is terminated. The number of threads never falls below MIN_PDFLUSH_THREADS, which by default is two. In this manner, the number of pdflush threads adjusts dynamically depending on the amount of page writeback and congestion. If all existing pdflush threads are busy writing back data, a new thread is created. This ensures that a single device queue is not congested while other, less busy, device queues sit around needing but not receiving data writeback. If the congestion diminishes, however, the number of pdflush threads is scaled back to conserve memory.
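
The sizing rule can be expressed compactly. The following is a condensed sketch of the policy just described, not the actual mm/pdflush.c code; all_threads_busy_for(), this_thread_idle_for(), spawn_pdflush_thread(), and retire_this_thread() are hypothetical helpers.

#define MIN_PDFLUSH_THREADS     2       /* never fewer than this */
#define MAX_PDFLUSH_THREADS     8       /* never more than this */

static int nr_pdflush_threads = MIN_PDFLUSH_THREADS;

/* Condensed sketch of the dynamic pool-sizing policy. */
static void adjust_pdflush_pool(void)
{
        /* Every thread has been busy for a full second: add a worker. */
        if (all_threads_busy_for(HZ) &&
            nr_pdflush_threads < MAX_PDFLUSH_THREADS) {
                nr_pdflush_threads++;
                spawn_pdflush_thread();
        }

        /* This thread has been idle for a full second: let it exit. */
        if (this_thread_idle_for(HZ) &&
            nr_pdflush_threads > MIN_PDFLUSH_THREADS) {
                nr_pdflush_threads--;
                retire_this_thread();
        }
}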

This is all well and good, but what if each pdflush thread were to get hung up writing to the same, congested, queue? In that case, the performance of multiple pdflush threads would not be much improved over a single thread. The memory wasted, however, would be significantly greater. To mitigate this effect, the pdflush threads employ congestion avoidance: They actively try to write back pages whose queues are not congested. As a result, the pdflush threads spread out their work and refrain from merely hammering on the same busy device. When the pdflush threads are "busy" (and thus a new thread is spawned), they are truly busy.
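
The idea behind congestion avoidance can be sketched as follows. This is a hypothetical illustration, not the kernel's implementation; struct device_queue, queue_is_congested(), writeback_queue(), and wait_for_congestion_to_clear() are placeholder names.

struct device_queue {
        struct list_head list;          /* link in the global queue list */
        /* per-device request queue state */
};

/* Sketch: write back only through queues that are not congested. */
static void writeback_avoiding_congestion(struct list_head *queues)
{
        struct device_queue *q;
        int made_progress = 0;

        list_for_each_entry(q, queues, list) {
                /* Leave congested queues to the threads already on them. */
                if (queue_is_congested(q))
                        continue;
                writeback_queue(q);
                made_progress = 1;
        }

        /* Every queue was congested: wait briefly rather than spin. */
        if (!made_progress)
                wait_for_congestion_to_clear();
}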

Because of the improvements in page writeback, including the introduction of pdflush, the 2.6 kernel is capable of keeping many more disks saturated than any earlier kernel. In the face of heavy activity, the pdflush threads can maintain high throughput across multiple disks.
