Previous Section  < Day Day Up >  Next Section

5.1 Query Processing Basics

How MySQL goes from receiving a query to sending the results back to a client is relatively straightforward. The work happens in several distinct stages. Let's walk through them.

5.1.1 Query Cache

You can enable the query cache (available as of MySQL 4.0.1) by setting query_cache_type to an appropriate value in my.cnf:

query_cache_type = 1

MySQL attempts to locate the results of any SELECT query in the query cache before bothering to analyze or execute it. It does this by hashing the query and using the hashed value to check for the results in the cache. MySQL uses the exact query text it receives, so the cache is sensitive to the most trivial variations.

As far as the cache is concerned, the query:

SELECT * FROM table1

is different from:

select * FROM table1

The same goes for variations in whitespace. MySQL doesn't trim extra space from the beginning or end of queries. This is rarely a problem because most repetitive queries are generated by applications rather than humans sitting at a keyboard.

To save some effort, MySQL cheats a bit. It only bothers to hash SELECT queries, since they're the only ones it makes any sense to cache. Unfortunately, older 4.0 versions of MySQL don't consider every SELECT query. The logic it uses simply checks the first three characters of your query, looking for SEL in a case-insensitive way.

As a result of this three-character "tunnel vision," any time you introduce whitespace or anything else at the beginning of the query, MySQL won't bother with the query cache. This can be a real problem in some applications. We know of a feed-processing system in which the developers uses comments to embed extra information at the beginning of each query:

/* <b>GetLatestStuff</b> */ SELECT * FROM sometable WHERE ...

The comment made is easier to identify the queries in an administrative tool that grabs the output of SHOW PROCESSLIST for display on a web page. Unfortunately, there's no way to tell MySQL to "try harder" when deciding whether a query is a SELECT, so these queries are never cached. Luckily, this problem is cured with a simple upgrade to MySQL 5.0.

It is possible to tell MySQL that it should not cache a given query, however. The way to dodge the query cache is to add the SQL_NO_CACHE hint to your query.

SELECT SQL_NO_CACHE * FROM mytable

This is helpful in controlling cache pollution. If your application has a set of queries that will never benefit from the query cache (perhaps because they run only once a day), there's no sense in caching them. Telling MySQL not to cache such queries leaves more room for storing the results of repetitive queries.

When the query cache is running in ondemand mode (set query_cache_type to 2 in my.cnf), MySQL does the work of trying to find a query in the cache only when it sees a SQL_CACHE hint in the query:

SELECT SQL_CACHE * FROM mytable

If the query's hashed value is found in the cache, MySQL sends the results from the cache to the client, bypassing any additional effort, just as expected.

The format of the results in the query cache is identical to the format used when sending them to a client. So there is very little overhead in retrieving results from the cache and sending them to a client. MySQL simply sends the data over the network. We'll look at query cache performance in Chapter 6.

5.1.2 Parsing, Analysis, and Optimization

Before MySQL can do anything interesting (or useful) with a noncached query, it must parse the query into its component parts. As part of that process, it verifies that the query is syntactically valid and gathers some basic information about the query:

  • What type of query is this? Is it a SELECT, INSERT, UPDATE, or DELETE, or some other administrative command such as SET or GRANT?

  • Which tables are involved? Are there any aliases used?

  • What is the WHERE clause?

  • Are there other hints or modifiers involved?

Once a query is broken down into more basic pieces, MySQL begins the more challenging work of figuring out what to do with it. This is where the query optimizer kicks in. The query optimizer's goal, simply put, is to find the most efficient way to execute a query given all the available information. Most of the time, this means the optimizer works to limit the number of records it must examine. It does this because the time associated with disk I/O is often (but not always) the governing factor that determines how long a query will take. Intuitively, this makes complete sense. It is an extension of the very same logic that explains why indexes are so helpful.

How the optimizer goes about making decisions is often regarded by people unfamiliar with MySQL internals as something like voodoo. Of course, it's not voodoo at all. MySQL has a set of rules and heuristics that have been evolving since its early days. These rules guide its decision-making process. But like any computer program that must deal with the infinite ways humans can assemble data and ask questions about it, the optimizer's not perfect. The rules and heuristics it uses work very well much of the time, but, on occasion, they do not.

The MySQL developers are constantly improving the optimizer—attempting to make it smarter and faster with each new release. Based on feedback from real-world users, they are always looking for ways to refine MySQL's ability to make the right decision. If you find a query that causes MySQL to make bad decisions, be sure to report it. Unreported problems are rarely fixed.

To make good decisions, MySQL tries to answer several important questions.

  • Are there any indexes that are candidates for finding the rows quickly?

  • Which index is best? If multiple tables are involved, which index is best for each table?

  • Which tables depend on which other tables in the join?

  • What's the optimal join order for the tables?

Of course, MySQL needs to make a decision very quickly and without actually testing all the options. Otherwise it might spend more time deciding how to execute the query than actually executing it!

The bulk of MySQL's effort centers around indexes and table join order. These aren't the only factors, but they're certainly the important ones. To get a better understanding of what MySQL is thinking about a SELECT query, it's best to look at the EXPLAIN output for the query.

5.1.3 Using EXPLAIN

So, what sort of knowledge can MySQL gather without expending a lot of effort and time? Let's look at a some queries against a news headline table—the sort of thing you might use to build a customizable news web site. The structure of the table is listed next. Rather than guessing what MySQL will probably do, we'll use its under-appreciated EXPLAIN command to help figure that out. In doing so, we'll see how adding an index or simply rephrasing a query can often better use an existing index and greatly improve performance.

mysql> describe Headline;

+------------+------------------+------+-----+---------+----------------+

| Field      | Type             | Null | Key | Default | Extra          |

+------------+------------------+------+-----+---------+----------------+

| Id         | int(10) unsigned |      | PRI | NULL    | auto_increment |

| Headline   | varchar(255)     |      |     |         |                |

| Url        | varchar(255)     |      | UNI |         |                |

| Time       | int(10) unsigned |      | MUL | 0       |                |

| ExpireTime | int(10) unsigned |      |     | 0       |                |

| Date       | varchar(6)       |      |     |         |                |

| Summary    | text             | YES  |     | NULL    |                |

| ModTime    | timestamp        | YES  |     | NULL    |                |

+------------+------------------+------+-----+---------+----------------+

8 rows in set (0.00 sec)

As you can tell, the Headline table contains information about news stories: title, summary, date, and so on. Headlines can be associated with multiple topics, which are defined in the Topic table. The T2H table maps topics to headlines and vice versa. The relationship is many-to-many because a single headline may be associated with multiple topics.

When you write a query against a primary key or unique index, MySQL should know that there can be only a single match for each value. Indeed, this query is very fast:

mysql> SELECT Headline, Url FROM Headline WHERE Id = 13950120 \G

*************************** 1. row ***************************

Headline: Midwest Cash Grain PM - Soy off, USDA data awaited

     Url: http://biz.yahoo.com/rm/030328/markets_grain_cash_2.html

1 row in set (0.00 sec)

Just as it's obvious to you or me, MySQL knows that only one record can possibly match. Its strategy for finding the row is straightforward: simply check the primary index for a match. If it exists, fetch the row. To verify that, let's EXPLAIN it:

mysql> EXPLAIN SELECT Headline, Url FROM Headline WHERE id = 13950120 \G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: Headline

         type: const

possible_keys: PRIMARY

          key: PRIMARY

      key_len: 4

          ref: const

         rows: 1

        Extra:

1 row in set (0.00 sec)

Just as expected, MySQL knows there's only one matching row. The rows value tells you so. What MySQL says here isn't always to be completely trusted, however, as you'll see in a little bit.

Of course, EXPLAIN is relating much more than how many rows to expect. Let's quickly review the information it provides:


id

The ID of this table in the query. EXPLAIN produces one output record for each table in the query.


select_type

What is this table's role in the larger query? Possible values are SIMPLE, PRIMARY, UNION, DEPENDENT UNION, SUBSELECT, and DERIVED. As we look at the more complicated queries, the meaning will become clearer.


table

The name of the table MySQL will read records from.


type

What type of join will MySQL use? In this example, you see const because there was a constant value in the query. Other possible values are system, eq_ref, ref, range, index, or ALL. We'll revisit this in more detail in the "Joins" section, later in this chapter.


possible_keys

A list of the indexes (or NULL if none) MySQL can use to find rows in the table.


key

The name of the index MySQL decides to use, after checking all possible indexes (listed in possible_keys) and choosing the best.


key_len

The size of the key value (in bytes).


ref

The columns or values that are used to match against the key.


rows

The number of rows MySQL thinks it needs to examine to satisfy the query. If you frequently add and remove records from the table, running ANALYZE TABLE lets MySQL update the index statistics so it can make better estimates.


Extra

Any extra information MySQL wishes to convey about the execution of this query. We'll see some examples of that shortly.

The simple case is just that—simple. Let's ask for a range of values instead.

mysql> SELECT Url FROM Headline WHERE id BETWEEN 13950120 AND 13950125;

+----------------------------------------------------------+

| Url                                                      |

+----------------------------------------------------------+

| http://biz.yahoo.com/rm/030328/markets_grain_cash_2.html |

| http://biz.yahoo.com/prnews/030328/cgf038_1.html         |

| http://biz.yahoo.com/bw/030328/285487_1.html             |

| http://biz.yahoo.com/rc/030328/turkey_hijack_5.html      |

| http://biz.yahoo.com/rm/030328/food_aid_iraq_1.html      |

+----------------------------------------------------------+

5 rows in set (0.00 sec)



mysql> EXPLAIN SELECT Url FROM Headline WHERE id BETWEEN 13950120 AND 13950125 \G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: Headline

         type: range

possible_keys: PRIMARY

          key: PRIMARY

      key_len: 4

          ref: NULL

         rows: 3

        Extra: Using where

1 row in set (0.00 sec)

In this case, the type has switched from const to range to indicate that a search for more than one value. Similarly, ref is now NULL.

Every thing seems reasonable unless you look closely. When executed, the query returns five rows, but the rows says three. That's because the rows value is merely an estimate. It probably should have been called estimated_rows.

The estimate is based on the index MySQL is using. Based on the distribution of records across the possible key values, it simply approximates that there are three valid records between 13950120 and 13950125.

Also notice that the Extra column says Using where. That's MySQL's reassuring way of telling you that it's using the limitations specified in the WHERE clause to select records. It wasn't present in the first example because MySQL treats a single-row lookup using the primary key as special case.

What if we try fetching records based on a nonindexed column:

mysql> SELECT COUNT(*) FROM Headline WHERE ExpireTime >= 1112201600;

+----------+

| COUNT(*) |

+----------+

|     3971 |

+----------+

1 row in set (1.04 sec)



mysql> EXPLAIN SELECT COUNT(*) FROM Headline WHERE ExpireTime >= 1112201600 \G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: Headline

         type: ALL

possible_keys: NULL

          key: NULL

      key_len: NULL

          ref: NULL

         rows: 302116

        Extra: Using where

1 row in set (0.00 sec)

The NULL value in the key column of the EXPLAIN output tell us that MySQL won't be using an index for this query. In fact, the NULL value in the possible_keys column tells us that there were no indexes to pick from at all. If this type of query is likely to be common, we can simply add an index and rerun the query (or the EXPLAIN) to verify that MySQL uses it.

mysql> ALTER TABLE Headline ADD INDEX (ExpireTime);

Query OK, 302116 rows affected (40.02 sec)

Records: 302116  Duplicates: 0  Warnings: 0



mysql> SELECT COUNT(*) FROM Headline WHERE ExpireTime >= 1112201600;

+----------+

| COUNT(*) |

+----------+

|     3971 |

+----------+

1 row in set (0.01 sec)



mysql> EXPLAIN SELECT COUNT(*) FROM Headline WHERE ExpireTime >= 1112201600 \G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: Headline

         type: range

possible_keys: ExpireTime

          key: ExpireTime

      key_len: 4

          ref: NULL

         rows: 12009

        Extra: Using where; Using index

1 row in set (0.00 sec)

The query now runs in 0.01 seconds instead of 1.04. The EXPLAIN output looks much better, with the new ExpireTime index being used for a range search. Note again the discrepancy between rows (12009) and the actual row count (3971). In a case like this, it might be possible to improve the estimate that MySQL makes by running either ANALYZE TABLE or OPTIMIZE TABLE on the Headline table.

Also, notice that MySQL said Using index. That means this is an index-only query. MySQL is able to get all the data it needs from the ExpireTime index, so it doesn't bother fetching any of the rows from disk.

But what if you need to fetch multiple headlines, and you know their IDs? Should you use OR or IN(...)? Let's find out what MySQL can tell us, using the lowest and highest headline IDs as well as one in between:

mysql> SELECT Url FROM Headline WHERE Id IN(1531513, 10231599, 13962322);

+----------------------------------------------+

| Url                                          |

+----------------------------------------------+

| http://biz.yahoo.com/bond/010117/bf.html     |

| http://biz.yahoo.com/e/021101/yhoo10-q.html  |

| http://biz.yahoo.com/bw/030331/315850_1.html |

+----------------------------------------------+

3 rows in set (0.00 sec)



mysql> EXPLAIN SELECT Url FROM Headline WHERE Id IN(1531513, 10231599, 13962322) \G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: Headline

         type: range

possible_keys: PRIMARY

          key: PRIMARY

      key_len: 4

          ref: NULL

         rows: 3

        Extra: Using where

1 row in set (0.00 sec)



mysql>  SELECT Url FROM Headline WHERE Id = 1531513 OR Id = 10231599 OR Id = 13962322;

+----------------------------------------------+

| Url                                          |

+----------------------------------------------+

| http://biz.yahoo.com/bond/010117/bf.html     |

| http://biz.yahoo.com/e/021101/yhoo10-q.html  |

| http://biz.yahoo.com/bw/030331/315850_1.html |

+----------------------------------------------+

3 rows in set (0.03 sec)



mysql>  EXPLAIN SELECT Url FROM Headline WHERE Id = 1531513 OR Id = 10231599 OR Id = 

13962322 \G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: Headline

         type: range

possible_keys: PRIMARY

          key: PRIMARY

      key_len: 4

          ref: NULL

         rows: 3

        Extra: Using where

1 row in set (0.00 sec)

Both queries execute very quickly, and their EXPLAIN output is the same. They're functionally the same. It's clear that either query may return anywhere from zero to three rows. We're querying based on a unique index (the primary key), so there isn't much for MySQL to think about. As it turns out, we happen to know that in this case, MySQL internally changed the multi-OR query to one that uses a single IN(...) list. However, it's clear that as the number of IDs increases, the query string will be smaller if you use the IN(..). A smaller query means less parsing overhead and better performance.

What if we use a subquery to fetch the URL for the highest numbered headline?

mysql> EXPLAIN SELECT Url FROM Headline WHERE Id IN (SELECT MAX(Id) FROM Headline);

After waiting five minutes, we killed the query. Either we did something wrong, or MySQL wasn't using the obvious approach to resolve this query. Hmm.

To find out, let's explain it.

mysql> EXPLAIN SELECT Url FROM Headline WHERE Id IN (SELECT MAX(id) FROM Headline) \G

*************************** 1. row ***************************

           id: 1

  select_type: PRIMARY

        table: Headline

         type: ALL

possible_keys: NULL

          key: NULL

      key_len: NULL

          ref: NULL

         rows: 302116

        Extra: Using where

*************************** 2. row ***************************

           id: 2

  select_type: DEPENDENT SUBSELECT

        table: Headline

         type: index

possible_keys: NULL

          key: PRIMARY

      key_len: 4

          ref: NULL

         rows: 302116

        Extra: Using index

2 rows in set (0.00 sec)

Yikes!

MySQL isn't using any indexes! Notice that both possible_keys fields are NULL. Is this a bug? Perhaps, especially when you consider that the key field in the dependent subselect says that it selected the primary key. But it wasn't in the list of possible keys. And, worse yet, MySQL believes it must examine 302,116 rows to resolve a single-record lookup supposedly based on a primary key.

Of course, this testing was performed with MySQL 4.1.0 alpha, prerelease code in which the query optimizer hadn't been properly tuned to handle subselects well.[1] The point isn't that MySQL didn't do the right thing. No matter how well tuned it is, MySQL will make a bad decision once in a while. When it does, you need to be able to diagnose the problem and, in some cases, come up with a workaround.

[1] Subsequent tests with the 4.1.2 alpha version proved that the query optimizer no longer had this bug.

So let's rewrite the query a bit to simplify things. We're using IN(...) in a query that can only return one row. So let's change that to an equality (=) test.

mysql> SELECT Url FROM Headline WHERE Id = (SELECT MAX(id) FROM Headline);

+----------------------------------------------+

| Url                                          |

+----------------------------------------------+

| http://biz.yahoo.com/bw/030331/315850_1.html |

+----------------------------------------------+

1 row in set (0.00 sec)



mysql> EXPLAIN SELECT Url FROM Headline WHERE Id = (SELECT MAX(id) FROM Headline) \G

*************************** 1. row ***************************

           id: 1

  select_type: PRIMARY

        table: Headline

         type: const

possible_keys: PRIMARY

          key: PRIMARY

      key_len: 4

          ref: const

         rows: 1

        Extra:

*************************** 2. row ***************************

           id: 2

  select_type: SUBSELECT

        table: NULL

         type: NULL

possible_keys: NULL

          key: NULL

      key_len: NULL

          ref: NULL

         rows: NULL

        Extra: Select tables optimized away

2 rows in set (0.00 sec)

A-ha! That did it. The query ran in a split second.

The EXPLAIN output is interesting too. MySQL noticed that we were attempting something very trivial and optimized away the second table. All those NULL values are MySQL's way of saying, "These simply don't matter."

But what if that hadn't worked? Or what if we're using an older MySQL that doesn't have subselects yet? Simple. We can rewrite the query as two SELECT statements and store the intermediate value in a variable on the server side so that no client-side state is required:

mysql> SELECT @max := MAX(Id) FROM Headline;

+-----------------+

| @max := MAX(Id) |

+-----------------+

|        13962322 |

+-----------------+

1 row in set (0.00 sec)



mysql> SELECT Url FROM Headline WHERE Id = @max;

+----------------------------------------------+

| Url                                          |

+----------------------------------------------+

| http://biz.yahoo.com/bw/030331/315850_1.html |

+----------------------------------------------+

1 row in set (0.00 sec)

We don't even need to explain those queries. Based on what we already know, they'll obviously be fast (and they are). Both are queries on primary keys and fetch single values.

And, for completeness, the most MySQL-like way to write that query is to use an ORDER BY and LIMIT:

SELECT Url FROM Headline ORDER BY Id DESC LIMIT 1;

Let's look at one last example. What if you query based on two different indexed fields? MySQL tries to select the index that will result in the fewest rows being examined. So the results will vary depending on your data and the values you choose.

mysql> SELECT COUNT(*) FROM Headline WHERE ExpireTime >= 1112201600 AND Id <= 

5000000;

+----------+

| COUNT(*) |

+----------+

|     1175 |

+----------+

1 row in set (0.04 sec)



mysql> EXPLAIN SELECT COUNT(*) FROM Headline

    -> WHERE ExpireTime >= 1112201600 AND Id <= 5000000 \G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: Headline

         type: range

possible_keys: PRIMARY,ExpireTime

          key: ExpireTime

      key_len: 4

          ref: NULL

         rows: 12009

        Extra: Using where

1 row in set (0.00 sec)

For this query, given the choice between the primary key field (Id) and the ExpireTime, MySQL decided to use ExpireTime. However, if the ExpireTime value is changed so that it matches many more rows, MySQL should favor the primary key:

mysql> EXPLAIN SELECT COUNT(*) FROM Headline WHERE ExpireTime >= 1012201600 AND Id <= 

5000000 \G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: Headline

         type: range

possible_keys: PRIMARY,ExpireTime

          key: PRIMARY

      key_len: 4

          ref: NULL

         rows: 13174

        Extra: Using where

1 row in set (0.00 sec)

As expected, it does.

Again, this decision-making process is all based on MySQL's notion of what the data looks like—how evenly distributed the values are. Different storage engines (InnoDB, MyISAM, BDB) use different methods to gather those statistics. As a result, you may find that some queries are executed differently if you convert your data to a different table type. Of course, running ANALYZE TABLE will also affect MySQL's statistics.

5.1.3.1 Joins

Things become slightly more complex when you're querying multiple tables. MySQL has to decide which order makes the most sense. Again, the goal it to read as few rows as possible, so it will consider each table and estimate how many rows it must read from each. In doing so, it also needs to understand the relationship among the tables. For example, with a query like this, it's clear that MySQL can't read the table order first:

SELECT customer.name, order.date_placed, region.name

FROM customer, order, region

WHERE order.customer_id = customer.id

AND customer.region_id = region.id

AND customer.name = 'John Doe'

The rows MySQL will need to retrieve from the order table depend on the customer table. So it must read customer before order. In fact, the same is true of region. So in this case, MySQL has to read customer records first. From there it will decide to read the remaining tables in whatever order it chooses.

Unfortunately, finding the optimal join order is one of MySQL's weakest skills. Rather than being clever about this problem, the optimizer simply tries to brute-force its way through. It tries every possible combination before choosing one. That can spell disaster in a some cases. We've seen at least one case in which MySQL took 29 seconds to decide how to execute a multitable join and then 1 second to actually execute it. In this particular case, there were over 10 tables involved. Since MySQL is considering all possible combinations, performance begins to degrade quite drastically as you go beyond a handful of tables. The exact number, of course, depends on how powerful CPUs are this year.

5.1.4 Execution

There's not a lot to say about query execution. MySQL simply follows its plan, fetching rows from each table in order and joining based on the relevant columns (hopefully using indexes). Along the way, it may need to create a temporary table (in memory or on disk) to store the results. Once all the rows are available, it sends them to the client.

Along the way, MySQL gathers some information and statistics about each query it executes, including:

  • Who issued the query

  • How long the process took

  • How many rows were returned

That information will appear in the slow query log (discussed later in this chapter) if the query time exceeds the server's threshold, and the log is enabled. If the query is issued interactively, it will also appear after the query results.

    Previous Section  < Day Day Up >  Next Section