
8.2 Configuration Issues

To route an incoming connection, the load balancer must select a target server. To do this, it takes two pieces of information into account. First, it needs to know which servers are available. At any time, one or more of the backend servers may be offline (for maintenance, as the result of a crash, and so on). To keep track of them, the load balancer periodically checks each server's health.

Once the load balancer has a list of candidate servers, it must decide which should get the next connection. This process can take a number of factors into account, including past performance, load, client address, and so on. Let's look at both issues in more detail.

8.2.1 Health Checks

Load balancers need to perform a health check for each real server to ensure that it's still alive, well, and willing to accept new connections. When load-balancing a web server, this is often a trivial matter. The load balancer is configured to connect to TCP port 80 and request a status file such as /health.html. If it gets a 2xx response code back, it assumes the server is fine. If not, it may stop sending new requests to that server until it becomes healthy again.

A nice side benefit of asking for a specific file, rather than simply looking for any response on port 80, is that a server can be removed from the cluster without taking it offline: simply remove or rename the file.
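
To make the mechanics concrete, here is a minimal sketch in Python of the kind of probe the load balancer performs (or that you could script yourself). The host, port, and /health.html path simply mirror the example above; treat the whole thing as an illustration rather than a recipe.

    import http.client

    def web_server_is_healthy(host, port=80, path="/health.html", timeout=2):
        """Return True if the server answers the status file with a 2xx code."""
        try:
            conn = http.client.HTTPConnection(host, port, timeout=timeout)
            conn.request("GET", path)
            status = conn.getresponse().status
            conn.close()
            return 200 <= status < 300
        except (OSError, http.client.HTTPException):
            # A refused connection, a timeout, or a garbled response all
            # count as unhealthy.
            return False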

Most load balancers provide a great deal of control over the parameters used when testing cluster hosts. Options may include the frequency of checks, the duration of check timeouts, and the number of unhealthy responses required to remove a server from the cluster. See your load balancer's documentation for details.

8.2.1.1 Determining health

So what constitutes a good health check for MySQL? Unfortunately, there's no single answer to that question.

It depends on how sophisticated your load balancer is. Some load balancers can verify only that each server is responding on the necessary TCP port. They'll generally connect to TCP port 3306 (or whichever port you're using) and assume the server is unhealthy if the connection is refused or if it has to wait too long for a response.

Some load balancers are more flexible. They might give you the option of scripting a complicated health check or of running the health check against a different port than normal. This provides a lot of flexibility and control. For example, you can run a web server (such as Apache) on the server and configure the load balancer to check a status file, just as you would for standard HTTP load balancing. You can exploit this indirect kind of check by making the status file a script (PHP, Perl, etc.) or Java servlet that performs arbitrarily complex logic to decide whether the server is really healthy.[2] The arbitrarily complex logic can be as simple as running a SELECT 1 query, or as complicated as parsing the output of SHOW SLAVE STATUS to verify that the slave is reasonably up to date.

[2] Provided, of course, that the arbitrarily complex logic doesn't take arbitrarily long to execute. The load balancer won't wait forever.
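
As a concrete example of such a status script, here is a minimal sketch in Python. It assumes the PyMySQL driver, a dedicated healthcheck account, and a web server that runs the script as a CGI program; the credentials and the 200/503 convention are placeholders, not a prescription.

    #!/usr/bin/env python3
    import sys
    import pymysql  # assumed driver; any MySQL client library will do

    def mysql_is_healthy():
        """Run a trivial query; failure to connect or query means unhealthy."""
        try:
            conn = pymysql.connect(host="127.0.0.1", user="healthcheck",
                                   password="secret", connect_timeout=2)
            with conn.cursor() as cursor:
                cursor.execute("SELECT 1")
                cursor.fetchone()
            conn.close()
            return True
        except pymysql.Error:
            return False

    if __name__ == "__main__":
        if mysql_is_healthy():
            sys.stdout.write("Status: 200 OK\r\n"
                             "Content-Type: text/plain\r\n\r\nhealthy\n")
        else:
            sys.stdout.write("Status: 503 Service Unavailable\r\n"
                             "Content-Type: text/plain\r\n\r\nunhealthy\n")
            sys.exit(1)

The load balancer then probes the script's URL just as it would /health.html, and the HTTP status code carries the verdict.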

If your load balancer offers this degree of flexibility, we highly recommend taking advantage of it. By taking control over the decision-making process, you'll have a better idea of how your cluster will respond in various situations. And after testing, if you're not happy with the results, simply adjust the logic and try again.

What types of things might you check for? This goes back to the question we're trying to answer: what makes a healthy MySQL server, from the load balancer's point of view?

A good health check also depends on your application needs and what's most important. For example, on a nearly real-time dynamic web site like Yahoo! News, you might put more emphasis on replication. If a slave gets busy enough handling regular queries that it becomes sluggish and ends up more than 30 seconds behind on replication, your code can return an unhealthy status code. The load balancer then removes the slave from the list of available servers until the health check passes again. Presumably the reduced demand on the server will allow it to quickly catch up and rejoin the cluster. (See the "Monitoring" section in Chapter 7 for ideas about detecting slow slaves.)
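
Here is a rough sketch of that rule, again assuming PyMySQL; the 30-second threshold and the connection details are illustrative and should be tuned to your application.

    import pymysql  # assumed driver

    MAX_LAG_SECONDS = 30   # illustrative threshold from the example above

    def slave_is_healthy(host="127.0.0.1", user="healthcheck", password="secret"):
        """Return False if the slave is stopped or too far behind the master."""
        conn = pymysql.connect(host=host, user=user, password=password,
                               connect_timeout=2)
        try:
            with conn.cursor(pymysql.cursors.DictCursor) as cursor:
                cursor.execute("SHOW SLAVE STATUS")
                status = cursor.fetchone()
            if not status:
                return False          # not configured as a slave
            lag = status.get("Seconds_Behind_Master")
            if lag is None:
                return False          # replication threads aren't running
            return lag <= MAX_LAG_SECONDS
        finally:
            conn.close()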

Of course, the success of this algorithm depends on how smart your scripts are. What if the slow server doesn't get caught up? And what if the additional demand that the remaining servers must bear causes them to fall behind? There's a very real chance that one by one, they'll start deciding they too are unhealthy. Before long, the problem cascades until you're left with a cluster of unhealthy servers sitting behind a load balancer that doesn't know where to send connections anymore.

At Yahoo! Finance, we've seen individual servers that tried to be smart and ended up creating even bigger problems because they didn't have the whole picture. Anticipating the problem mentioned in the previous paragraph, the code that performed health checks introduced yet another level of checking. Each server knew all the other members of the cluster. The health check included code to make sure that there were enough servers left. If a server determined that too many other servers were already down, it would elect to keep handling requests. After all, a slow site is better than no site at all.

But our implementation still wasn't smart enough; the servers still went down in a cascade. The reason turned out to be a simple race condition. The code performed a series of checks, but it did them in the wrong order. The code first checked to see that a sufficient number of other servers were healthy. It then went on to make sure MySQL wasn't too far behind on replication. The problem was that several servers could be doing the health check at exactly the same time. If that happened, it was possible for all servers to believe that all other servers were healthy and proceed to declare themselves unhealthy.

There are numerous solutions to the problem. One is to add a simple sanity check. Each server can, after declaring itself unhealthy, check to make sure that the situation hasn't radically changed. Another option is to appoint a single server in each cluster as the authority for determining who is and isn't healthy. While it introduces a single point of failure (what if this server dies?), it means there are fewer chances for race conditions and similar problems.
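
To make the ordering and the sanity re-check concrete, here is a rough sketch of the first approach. The helper for counting healthy peers and the thresholds are placeholders for whatever your environment provides.

    def should_drop_out(my_lag_seconds, count_healthy_peers,
                        max_lag=30, min_healthy_peers=3):
        """Decide whether this slave should report itself unhealthy.

        count_healthy_peers is a callable that re-polls the other cluster
        members each time it is invoked.
        """
        # Check our own replication lag first.
        if my_lag_seconds <= max_lag:
            return False              # we're keeping up; stay in the cluster
        # We're lagging, but only drop out if enough peers can absorb the load.
        if count_healthy_peers() < min_healthy_peers:
            return False              # a slow site is better than no site
        # Sanity re-check: other slaves may have reached the same conclusion
        # at the same instant, so confirm the cluster still looks healthy
        # before actually declaring ourselves unhealthy.
        return count_healthy_peers() >= min_healthy_peers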

To summarize, some load balancers provide you with a lot of flexibility and power. Be careful how you use it. If you elect to take control of the decision-making process (and add complexity to it), be sure that the code is well tested. Ask a few peers to review it for you. Consider what will happen in unusual situations.

8.2.1.2 Connection limits

In normal operations, the load balancer should distribute connections relatively evenly among your servers. If you have eight backend servers, any one of them will handle roughly one-eighth of the connections at a given time. But what happens when several backend servers go down at the same time? Because the rest of the cluster must bear the load, you need to ensure that the remaining servers are configured to handle it.

The most important setting to check is max_connections. If max_connections is set too low, otherwise healthy MySQL servers will start refusing connections even though they're powerful enough to handle the load. Many installations never set the max_connections option, so MySQL uses its built-in default of 100. Set max_connections high enough that this problem can't happen. For example, if you find that each server typically handles 75 connections, a reasonable value for max_connections might be 150 or more. That way, even if half the backend servers fail, your application won't fail to connect.
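
The relevant fragment of my.cnf might look like this; the value of 150 simply reflects the scenario above, in which each server normally carries about 75 connections, so size it for your own failure scenarios.

    # /etc/my.cnf (illustrative value)
    [mysqld]
    max_connections = 150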

8.2.2 Next-Connection Algorithms

Different load balancers implement different algorithms to decide which server should receive the next connection. Some call these scheduling algorithms. Each vendor has different terminology, but this list should provide an idea of what's available:


Random

Each request is directed to a backend server selected at random from the pool of available servers.


Round-robin

Requests are sent to servers in a repeating sequence: A, B, C, A, B, C, etc.


Least connections

The next connection goes to the server with the fewest active connections.


Fastest response

The server that has been handling requests the fastest receives the next connection. This tends to work well when the backend servers are a mix of fast and slow machines.


Hashed

The source IP address of the connection is hashed, thereby mapping it to one of the backend servers. Each time a connection request comes from the same client IP address, it is sent to the same backend server. The bindings change only when the number of machines in the cluster does.


Weighted

Several of the other algorithms can be weighted. For example, you may have four single-CPU machines and four dual-CPU machines. The dual-CPU machines are roughly twice as powerful as the single-CPU machines, so you tell the load balancer to send them twice as many requests, on average. (A small sketch of weighted selection follows this list.)
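
To make the idea concrete, here is a minimal sketch of weighted random selection in Python. The server names and weights are hypothetical, matching the four-and-four cluster just described.

    import random

    # Hypothetical pool: four single-CPU and four dual-CPU backend servers.
    servers = [
        ("db1", 1), ("db2", 1), ("db3", 1), ("db4", 1),
        ("db5", 2), ("db6", 2), ("db7", 2), ("db8", 2),
    ]

    def pick_server(pool):
        """Return a server name, chosen in proportion to its weight."""
        names, weights = zip(*pool)
        return random.choices(names, weights=weights, k=1)[0]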

Which algorithm is right for MySQL? Again, it depends. There are several factors to consider and some pitfalls to avoid. One of the pitfalls is best illustrated with an example.

8.2.2.1 The consequences of poor algorithm choice

In September 2002, Yahoo! launched a one-week memorial site for those affected by the September 11, 2001 terrorist attacks. This site was described in Chapter 6. The remember.yahoo.com site was heavily promoted on the Yahoo! home page and elsewhere. The entire site was built by a small group of Yahoo! employees in the two weeks before the site's launch on September 9.

Needless to say, the site got a lot of traffic. So much, in fact, that Jeremy spent a couple of sleepless nights working to optimize the SQL queries and bring new MySQL servers online to handle the load. During that time the MySQL servers were running red hot. They weren't handling many queries per second (because the queries were poorly optimized), but they were either disk-bound, CPU-bound, or both. A server was slowest when it first came online, because MySQL's key buffer hadn't yet been populated and the operating system's disk cache didn't yet hold any of the relevant disk blocks. Servers needed several minutes to warm up before they could take their full query load.

The situation was made worse by the fact that the load balancer hadn't been configured with this in mind, and nobody realized it until very late in the process. When a server was reconfigured and brought back online, it was immediately pounded with 30-50 new queries. The machine became completely saturated and needed several minutes to recover. During the recovery time, it was nearly unresponsive, with the CPU at 100%, a load average over 25, and the disk nearly maxed out.

After quite a bit of theorizing and poking around, someone thought to question the load-balancer configuration. It turned out that it was set on a least-connections scheduling algorithm. That clearly explained why a new machine was bombarded with new connections and rendered useless for several minutes. Once the load balancer was switched to a random scheduling algorithm, it became much easier to bring down a slave, adjust the configuration, and put it back online without becoming completely overwhelmed.

The moral of the story is that the connection algorithm you select may come back to bite you when you least expect it (and can least afford it). Consider how your algorithm will work in day-to-day operations as well as when you're under an unusually high load or have a higher than normal number of backend servers offline for some reason.

We can't recommend the right configuration for your needs. You need to think about what will work best for your hardware, network, and applications. Furthermore, your algorithm choices are limited by the load-balancing hardware or software you're using. When in doubt, test.
