Previous Section  < Day Day Up >  Next Section

6.3 Operating System

From an operating system point of view, there are several things to consider when building a fast MySQL server. We'll discuss numerous filesystem issues, swap configuration, and threading performance.

6.3.1 Filesystems

The proliferation of freely available filesystems in the Linux world has lead to a regular stream of questions about the best filesystem choice for MySQL on Linux. In a way, it's not all that different from choosing the right storage engine for your tables. You need to consider the benefits and drawbacks of each, as well as your needs. Unlike switching table types, however, you can't change filesystems on the fly. And without creating a bit of a maintenance nightmare, you can't easily use one filesystem for some tables and another for the remaining ones.

It's worth pointing out that filesystem performance is a relatively minor issue most of the time. If switching filesystems gives you your largest performance gains, you've done so many other things right, you deserve a reward.

This section is admittedly Linux-centric. That's primarily because Linux is the operating system with the widest variety of choices, and it's also because Linux happens to be what the authors are most experienced with. Journaling

The biggest difference among the filesystems is journaling. Journaling filesystems maintain a log (or journal) that is never cached. The journal is similar in concept to a write-ahead transaction log. Whenever the filesystem is updated, a record describing the transaction is appended to the log. Another idle thread actually processes these transactions by writing the new data to the filesystem and flagging each processed transaction as it is completed.

If the machine crashes, the filesystem performs a roll-forward recovery, much as InnoDB would. Upon reboot, it simply finishes processing updates from the journal. Incomplete transactions in the journal are discarded, so the filesystem's internal consistency is guaranteed. This significantly decreases the complexity of running a filesystem check, meaning much shorter reboot times in the event of a crash. Even though InnoDB provides its own journaling (in the form of a transaction log), using a journaling filesystem with InnoDB is still worthwhile because of the time saved during an unexpected reboot.

Older filesystems such as Linux's ext2 and Windows FAT16/FAT32 provide no journaling. In the event of an unclean shutdown, they need to perform consistency checks upon reboot. On Linux, you must wait for fsck to do the job. On Windows, scandisk is what you end up waiting for. Luckily Microsoft's NTFS does provide journaling and it's the standard filesystem on Microsoft's server operating systems, Windows NT, 2000, and XP. In the Macintosh world, OS X provides a journaling option for its HFS filesystem. Tru64 and AIX also provide their own journaling filesystem implementations.

FreeBSD currently has no journaling filesystems available, but it does offer an alternative to journaling, known as soft updates. Developed by BSD hacker Kirk McKusick, soft updates ensure that metadata changes are written to disk in such an order that the data is always consistent. Doing this eliminates the need for a separate log and most synchronous disk operations while boosting performance through aggregated disk operations. More information is available on Kirk's web site ( and in the FreeBSD manual pages for newfs and tunefs.

Solaris users who need journaling have traditionally purchased a filesystem product from Veritas, but newer versions of Solaris provide a journaling filesystem that eliminates the need for third-party software. Other features and tweaks

Many of the newer filesystems (those designed in the past 10 years or so) have other important features that affect performance. Their designers realized that disk sizes were steadily increasing, and intensive new applications (high-volume databases, streaming video, etc.) could benefit from rethinking filesystem design. As a result, we have a good selection of high performance filesystems to choose from today. See Section for more details.

The two most notable enhancements in these newer filesystems are support for large directories and better management of fragmentation and free space. Large directory support means that operations on directories that contain thousands of files aren't appreciably slower than operations on smaller directories. This becomes an issue for MySQL only when you have a MySQL database that contains a large number of MyISAM tables. Since each table is composed of three files, the number of files can grow quickly.

Free-space management and fragmentation affect systems on which there are lots of MyISAM tables that change frequently (lots of deletes, inserts, and updates). Some filesystems are smarter than others about allocating contiguous blocks of disk space for files. This helps to reduce fragmentation, which means fewer disk seek operations when operating on the tables. Choosing a filesystem

Choosing a filesystem for MySQL is a matter of considering your needs, the available filesystems, and your comfort level with them. Here we present a brief description of the options on modern Linux systems:


The ext2 filesystem has been around Linux since the early days. It doesn't offer many advanced features, but it is time-tested and known for being very lightweight and reliable.


ext3 evolved out of a desire to add journaling support to the ext2 filesystem. You can think of ext3 as simply that—ext2 with journaling added on. Most of ext2's limitations (such as poor performance with large directories) still exist in ext3.

One interesting byproduct of the ext3 implementation is that you can actually switch the ext3 journal on and off using tunefs. With the journal disabled, an ext3 filesystem effectively becomes an ext2 filesystem again.


ReiserFS, originally created by Hans Reiser, has proven to be quite popular in the Linux world. It was built from the ground up as a modern filesystem. It handles large directories exceptionally well and has a very reliable journaling implementation. As of this writing, ReiserFS Version 3 is in widespread use, and ReiserFS Version 4 is being tested among kernel developers and other adventurous souls.


Ported by SGI from their IRIX operating system, XFS was designed to handle large filesystems with an emphasis on consistent performance. SGI was interested in creating a filesystem that held up under the type of heavy loads that are generated by high-end streaming media applications.


Like SGI, JFS came from another large technology company. IBM has been shipping JFS on their AIX platform for many years. Like SGI, IBM focused on performance and reliability when building JFS.

Table 6-2 summarizes the features implemented by various Linux filesystems.

Table 6-2. Linux filesystem features



Large directories





Yes (optional)

No (patch available)









No FreeBSD

On FreeBSD, there are really only two filesystem types to choose from: UFS and UFS2. The main difference between them is that UFS2 can handle over 1 TB of data, and it has built-in access control list (ACL) and extended attribute support. Aside from the size differences, none of the differences really affect database users. If you have large directories, the UFS_DIRHASH kernel option may help. It creates in-memory hash tables for large directories, and it doesn't affect the on-disk layout. Do you need a filesystem at all?

Traditional high-end database servers often don't use a filesystem at all. Instead, the database server bypasses the filesystem interface entirely and communicates directly with the disks. This raw access method puts the burden of managing space, fragmentation, and read/write requests on the database server itself.

The historical rationale for bypassing the filesystem is that early operating systems didn't place much emphasis on filesystem performance. As long as they stored and retrieved data reliably, most people were happy. Another reason is that volume managers didn't really exist, so the operating systems of the day had no good way to combine the server's whopping 10-MB disks into a single, larger disk. When databases routinely exceeded the size of a single disk, vendors had little choice but to implement their own low-level storage.

Nowadays, modern disks are orders of magnitude larger, modern servers provide RAID, and modern operating systems often have volume managers that make adding more space a trivial operation. Despite these advances, many DBAs still use raw partitions rather than filesystems. Users coming from other database systems often ask about MySQL's ability to use raw disks, expecting it to boost performance even more. Not to be outdone, MySQL's InnoDB storage engine can use raw partitions for its tablespaces.

To take advantage of this capability, you must leave InnoDB's home directory unset and specify that the data-file paths point to raw devices:



However, you must first initialize the partitions. To do so, use newraw instead of raw the first time and start MySQL. InnoDB will the initialize the partitions. Watch the MySQL log file for completion, shut down MySQL, change newraw to raw, and start MySQL again.

From a performance standpoint, tests have shown a very small (2-5%) performance improvement using raw partitions. When you use raw partitions, you can no longer use any of your favorite command-line tools (ls, du, etc.) to investigate the storage. Furthermore, backups are more complicated when using raw disks. Your choice of backup tools is greatly reduced because most deal with filesystems rather than raw disk partitions.

6.3.2 Swap

In an ideal world, your server would never swap. Swapping is usually an indication that you don't have enough memory or that things are configured improperly—maybe MySQL's key buffer is too large, or you're starting too many unused services at boot time. Maybe it's the operating system itself. Some operating systems make a habit of swapping when there's still free memory available.

Some versions of the 2.4 Linux kernel, for example, are known for being a bit too aggressive with swapping. Linux has generally tried to use all available free memory for caching disk access. From the virtual memory subsystem's point of view, free memory is wasted memory. Early versions (2.4.0-2.4.9) were okay, as are later versions (2.4.18 onward). But the middle versions (2.4.10-2.4.17) were known for being a bit too aggressive. On a dedicated MySQL server, with a key buffer of 1 GB and 2 GB of total RAM, it was not uncommon to see Linux swap out parts of the key buffer while performing a table scan, only to swap it back in moments later. Needless to say, this had a very negative affect on performance. The only solution in such a case is to turn off swap entirely or upgrade to a newer kernel. Luckily, most other operating systems haven't suffered from this problem. Even though most systems are well behaved, some MySQL administrators advocate turning swap off as a preventative measure.

6.3.3 Threading

As a multithreaded server, MySQL is most efficient on an operating system that has a well implemented threading system. Windows and Solaris are excellent in this respect. Linux, as usual, is a bit different. Traditionally, Linux has had a slightly unusual threading implementation—using cloned processes as threads. It performs well under most circumstances, but in situations with thousands of active client connections, it imposes a bit of overhead.

More recent work on the Linux scheduler and alternative threading libraries have improved the situation. The Native POSIX Thread Library (NPTL) is shipped by default in RedHat Linux Version 9.0. Other distributions have just begun adopting it as well.

Another popular free operating system, FreeBSD, has threading problems that are much worse. Versions prior to 5.2 provide rather weak native threading. In some circumstances, I/O-intensive threads are able to get an unfair amount of CPU time, thus keeping other threads from executing as quickly as they should. Given the I/O-intensive nature of some database queries, this has a rather devastating affect on MySQL.

If upgrading isn't an option, build MySQL from the FreeBSD ports collection, and be sure to enable support for LinuxThreads. Doing so causes MySQL to use an alternative threading that's more like that available in Linux 2.4. Each thread is actually a process that, thanks to FreeBSD's rfork( ) call, has shared access to MySQL's global buffers. The overhead of this approach may sound like an issue, but it's really quite efficient. Many of Yahoo's hundreds of MySQL servers are using LinuxThreads on FreeBSD quite effectively.

Section 6.4.4 later in this chapter discusses how MySQL's thread cache can help reduce the overhead associated with creating and destroying threads.

    Previous Section  < Day Day Up >  Next Section