only for RuBoard - do not distribute or recompile

3.10 More on SQL and MySQL

In this section we discuss miscellaneous tools and techniques for using SQL and MySQL. We introduce:

Choosing keys and indexes for fast searching
Elementary database-tuning techniques
Adding and deleting users of a DBMS, and changing user permissions
Limitations of MySQL

3.10.1 Keys, Primary Keys, and Indexes

As discussed earlier in our introduction to SQL, each table should have a PRIMARY KEY definition as part of the CREATE TABLE statement. A primary key is an attribute—or set of attributes—that uniquely identifies a row in a table. Storing two rows with the same primary key isn't permitted and, indeed, an attempt to INSERT duplicate primary keys produces an error.

In MySQL, the attribute values of the primary key are stored in an index to allow fast access to a row. The default MySQL index type is fast for queries that find a specific row, a range of rows, for joins between tables, grouping data, ordering data, and finding minimum and maximum values. Indexes don't provide any speed improvement for retrieving all the rows in a table or for other query types.

Indexes are also useful for fast access to rows by values other than those that are associated with attributes in the primary key. For example, in the customer table, you might define an index by adding the clause:

KEY namecity (surname,firstname,city)

to the CREATE TABLE statement. After you define this index, some queries that select a particular customer through a WHERE clause can use it. Consider an example:

SELECT * FROM customer
  WHERE surname = 'Marzalla' AND
  firstname = 'Dimitria' AND
  city = 'St Albans';

This query can use the new index to locate—in at most a few disk accesses—the row that matches the search criteria. Without the index, the DBMS must scan all the rows in the customer table and compare each row to the WHERE clause. This might be quite slow and certainly requires significantly more disk accesses than the index-based approach (assuming the table has more than a few rows).

A particular feature of DBMSs is that they develop a query evaluation strategy and optimize it without any interaction from the user or programmer. If an index is available, and it makes sense to use it in the context of a query, the DBMS does this automatically. All you need to do is identify which queries are common, and make an index available for those common queries by adding the KEY clause to the CREATE TABLE statement or using ALTER TABLE on an existing table.

Careful index design is important. The namecity index we have defined can also speed queries other than those that supply a complete surname, firstname, and city. For example, consider a query:

SELECT * FROM customer
  WHERE surname = 'LaTrobe' AND
  firstname = 'Anthony';

This query can also use the index namecity, because the index permits access to rows in sorted order first by surname, then firstname, and then city. With this sorting, all "LaTrobe, Anthony" index entries are clustered together in the index. Indeed, the index can also be used for the query:

SELECT * FROM customer
  WHERE surname LIKE 'Mar%';

Again, all surnames beginning with "Mar" are clustered together in the index. However, the index can't be used for a query such as:

SELECT * FROM customer
  WHERE firstname = 'Dimitria' AND
  city = 'St Albans';

The index can't be used because the leftmost attribute named in the index, surname, isn't part of the WHERE clause. In this case, all rows in the customer table must be scanned and the query is much slower (again assuming there are more than a few rows in the customer table, and assuming there is no other index).

Careful choice of the order of attributes in a KEY clause is important. For an index to be usable in a query, the leftmost attribute must appear in a WHERE clause.

There are other cases in which an index can't be used, such as when a query contains an OR that isn't on an indexed attribute:

SELECT * FROM customer
  WHERE surname = 'Marzalla' OR
  email = 'dimitria@lucaston.com';

Again, the customer table must be completely scanned, because the second condition, email='dimitria@lucaston.com', requires all rows to be retrieved as there is no index available on the attribute email. Also, the case where the ORed attribute isn't the leftmost attribute in an index requires a complete scan of the customer table. The following example requires a complete scan:

SELECT * FROM customer
WHERE firstname = 'Dimitria' OR
surname = 'Marzalla';

If all the attributes in the index are used in all the queries, to optimize index size, the leftmost attribute in the KEY clause should be the attribute with the highest number of duplicate entries.

Because indexes speed up queries, why not create indexes on all the attributes you can possibly search on? The answer is that while indexes are fast for searching, they consume space and require updates each time rows are added or deleted, or key attributes are changed. So, if a database is largely static, additional indexes have low overheads, but if a database changes frequently, each additional index slows the update process significantly. In either case, indexes consume additional space, and unnecessary indexes should be avoided.

One way to reduce the size of an index and speed updates is to create an index on a prefix of an attribute. Our namecity index uses considerable space: for each row in the customer table, an index entry is up to 120 characters in length because it is created from the combined values of the surname, firstname, and city attributes.^[2] To reduce space, you can define the index as:

^[2] This isn't the space actually required by an index entry, because the data is compressed for storage. However, even with compression, the fewer characters indexed, the more compact the representation, the more space saved, and—depending on the usability of the index—the faster searching and updates are.

KEY namecity (surname(10),firstname(3),city(2));

This uses only the first 10 characters of surname, 3 of firstname, and the first 2 characters of city to distinguish index entries. This is quite reasonable, because 10 characters from a surname distinguishes between most surnames, and the addition of a few characters from a first name and the prefix of their city should be sufficient to uniquely identify almost all customers. Having a smaller index with less information can also mean that queries are actually faster, because more index information can be retrieved from disk per second, and disk retrieval speed is almost always the bottleneck in query performance.

The space saving is significant with a reduced index. A new index entry requires only 15 characters, a saving of up to 105 characters, so index insertions, deletions, and modifications are now likely to be much faster. Note that for TEXT and BLOB attribute types, a prefix must be taken when indexing, because indexing the entire attribute is impractical and isn't permitted by the MySQL DBMS.

3.10.2 Tuning the Database System

Careful index design is one technique that improves the speed of a DBMS and can reduce the resource requirements of a database. However, comprehensive database tuning is a complex topic that fills many books. We include in this section only a few additional practical ideas to begin to improve the performance of a database system.

As discussed previously, accessing a hard disk is slow and is usually the bottleneck in DBMS performance. More specifically, disk seeking—moving the disk head to get information from another location of the disk—is the slowest component of disk access. Therefore, most techniques described in this section are also techniques that improve performance by minimizing disk space requirements.^[3]

^[3] Reducing disk space requirements improves both disk seek and read performance. Disk read performance is improved because less data is required to be transferred, while seek performance is improved because the disk head has to move less on average when randomly accessing a smaller file than when accessing a larger file.

Here are some ways to improve DBMS performance:

Carefully choose attribute types and lengths. Where possible, use small variants such as SMALLINT or MEDIUMINT rather than the regular choice INT. When using fixed-length attributes, such as CHAR, specify a length that is as short as practical.
Use fixed-length attributes; that is, try to avoid types such as VARCHAR or BLOB. While fixed-length text attributes may waste space, scanning fixed-length rows in a query is much faster than scanning variable-length rows.
Design indexes with care. As discussed in the last section, keep the primary key index as small as possible, create only indexes that are needed, and use prefixes of attributes where possible. Ensure that the leftmost attribute in the index is the most frequently used in queries and, if all attributes are used, make sure the leftmost attribute is the one with the highest number of duplicate entries.
Create a statistics table if aggregate functions such as COUNT( ) or SUM( ) are frequently used in queries on large tables. A statistics table stores only one row that is manually updated with the aggregate values of another table. For example, if the statistics table maintains the count of rows in a large customer table, each time a row is inserted or deleted in the customer table, the count is updated in the statistics table. For large tables, this is often faster than calculating aggregate functions with the slow built-in functions that require complete processing of all rows.
If large numbers of rows are deleted from a table, or a table containing variable-length attributes is frequently modified, disk space may be wasted. MySQL doesn't usually remove deleted or modified data; it only marks the location as being no longer in use. Wasted space can affect access speed.

To reorganize a table—by copying data to a temporary location and back again—MySQL provides the OPTIMIZE TABLE command, which should be used periodically. For example:
```
OPTIMIZE TABLE customer;
```
The OPTIMIZE command should be run when the DBMS is offline for scheduled maintenance. The command is nonstandard SQL.
It is possible to create different table types for specific tasks. The default in MySQL is the MyISAM type, and all the tables described so far are this table type. For small, temporary, frequently used lookup tables, a different type, the heap table type, can be used. There are other types, and we briefly discuss alternatives in Chapter 6. More details are provided in Section 9.4 of the MySQL user manual.
Section 10.7 of the MySQL manual includes other excellent ideas for simple performance improvement.

Another aspect of database tuning is optimizing the performance of the DBMS itself. Included with the MySQL installation is the mysqladmin tool for database administration. Details of the system setup can be found by running the following command from a Linux shell:

% mysqladmin -ppassword variables

This shows, in part, the following selected system parameters:

join_buffer             current value: 131072
key_buffer              current value: 8388600
net_buffer_length       current value: 16384
record_buffer           current value: 131072
sort_buffer             current value: 2097144
table_cache             current value: 64

The important parameters are those that impact disk use. MySQL has several main-memory buffer parameters that control how much data is kept in memory for processing. These include:

The record_buffer for scanning all rows in a table
The sort_buffer for ORDER BY and GROUP BY operations
The key_buffer for storing indexes in main memory
The join_buffer for joins that don't use indexes

In general, the larger these buffers, the more data from disk is cached or stored in memory and the fewer disk accesses are required. However, if the sum of these parameters is near to exceeding the size of the memory installed in the server, the underlying operating system will start to swap data between disk and memory, and the DBMS will be slow. In any case, careful experimentation based on the application is likely to improve DBMS performance.

Section 10.2.3 of the MySQL manual suggests parameter settings when starting the MySQL server. First, for machines with at least 64 MB of memory, large tables in the DBMS, and a moderate number of users, use:

safe_mysqld -O key_buffer=16M -O table_cache=128 \
            -O sort_buffer=4M -O record_buffer=1M &

Second, if there is less than 64 MB of memory available, and there are many users, try the following:

safe_mysqld -O key_buffer=512k -O sort_buffer=100k \
            -O record_buffer=100k &

The following setting might be appropriate for the winestore, because many users are expected, the queries are largely index-based, and the database is small:

safe_mysqld -O key_buffer=512k -O sort_buffer=16k \
            -O table_cache=32 -O record_buffer=8k 
            -O net_buffer=1K &

Even more conservative settings might also be acceptable.

There are two other parameters we have not discussed. The table_cache parameter manages the maximum number of open tables per user connection, while the net_buffer parameter sets the minimum size of the network query buffer in which incoming queries are kept before they are executed.

The mysqladmin utility can report the status of the DBMS:

% mysqladmin -ppassword status

The output has the following format:

Uptime: 5721024  Threads: 14  Questions: 7874982
Slow queries: 6  Opens: 115136  Flush tables: 1
Open tables: 62

This gives a brief point-in-time summary of the DBMS status and can help find more about the number of user connections, queries, and table use. Similar output can be generated by running the commands SHOW STATUS and SHOW VARIABLES through the MySQL command interpreter.

Information about query performance can be gained with the benchmark( ) function, which can be used iteratively for tuning when altering table design or DBMS system parameters. The following statement illustrates benchmarking:

SELECT benchmark(10000, COUNT(*)) 
  FROM items;

This statement reports the time taken to evaluate 10,000 calls to COUNT( ) on the items table.

3.10.3 Adding and Deleting Users

We have not yet discussed adding and deleting users from the MySQL DBMS. Our rationale in leaving this topic until this final section is that DBMS users aren't as important in a web database application as in other applications. Because access to the database and DBMS is generally controlled in the application logic of the middle tier, usually only one or two DBMS users are needed.

A user, hugh, who has full control over all aspects of the DBMS and can access the DBMS from the machine that hosts the DBMS, can be created with the statement:

GRANT ALL PRIVILEGES ON *.* TO hugh@localhost
  IDENTIFIED BY 'password' WITH GRANT OPTION;

Allowing access over a network can be added with:

GRANT ALL PRIVILEGES ON *.* TO hugh@"%" 
  IDENTIFIED BY 'password' WITH GRANT OPTION;

There is no need to allow network access for a web database application if the middle-tier components—the web server and scripting engine—are installed on the same machine as the DBMS.

This user can then connect to the database from the shell with the command:

% mysql -ppassword -uhugh

The user information is stored in the mysql database in the user table, which can be explored with:

USE mysql;
SELECT * FROM user;

The mysql database and the user table can be managed in the same way as any other database. For example, you can update the password of the new user with the UPDATE statement:

UPDATE user 
  SET password=password('newpwd') 
  WHERE user='hugh';

Note the use of the password( ) function we described earlier to encrypt the password for storage in the user table.

3.10.3.1 Permissions

Users can be added to the system with an INSERT INTO the user table in the mysql database or, as previously illustrated, you can use the GRANT statement. Moreover, privileges can be adjusted with an UPDATE, added with GRANT, or removed with REVOKE.

Consider the following example:

GRANT SELECT,INSERT,UPDATE,DELETE,CREATE,DROP
  ON winestore.*
  TO dave@localhost
  IDENTIFIED BY 'password';

This adds a new user dave and allows him to use only the SQL statements listed in the winestore database. The parameter winestore.* means all tables within the winestore database.

Privileges can be removed with the REVOKE statement. For example:

REVOKE DROP,CREATE ON winestore.* FROM dave@localhost;

If the privilege or privileges are to be revoked for all databases in the DBMS, not just a single database, winestore.* can be replaced with *.*.

The following privileges can be used in GRANT and REVOKE statements:

ALL PRIVILEGES, FILE, RELOAD, ALTER, INDEX, SELECT, 
CREATE, INSERT, SHUTDOWN, DELETE, PROCESS, UPDATE, 
DROP, REFERENCES, USAGE

3.10.4 Limitations of MySQL

The most significant limitation of MySQL is that it doesn't support nested queries. However, support is planned in MySQL Version 4. Nested queries are those that contain another query. Consider an example nested query to find the wines that have inventory stock:

SELECT DISTINCT wine_id FROM wine
WHERE wine_id IN 
  (SELECT wine_id from inventory);

The query returns the wine_id values from the wine table that are found in the inventory table. Nested queries use the IN, NOT IN, EXISTS, and NOT EXISTS operators.

In many cases, a nested query can be rewritten as a join query. For example, to find the wines that are in stock, you can use the following join query:

SELECT DISTINCT wine.wine_id FROM wine, inventory
WHERE wine.wine_id = inventory.wine_id;

However, some nested queries can't be rewritten as join queries; for difficult queries, temporary tables are often a useful workaround.

A limitation of DELETE and UPDATE is that only one table can be specified in the FROM clause. This problem is particular to MySQL and related to the lack of support for nested queries. This limitation can make modifications of data difficult. For example, it prevents data being deleted or updated using the properties of another table. A solution involves data being copied to a temporary table using a combined INSERT and SELECT statement that joins together data from more than one table. Then, the data can be deleted or updated in the temporary table and then transferred back to the original table. Another approach, using the concat( ) string function, is discussed in Section 1.4.4.1 in the MySQL manual.

To avoid UPDATE and DELETE problems, consider adding additional attributes to tables at design time. For example, in the winestore we added a DATE attribute to the items table so that shopping-cart items can be removed easily if they aren't purchased within one day. Removing rows from the items table based on the DATE in the orders table is difficult without support for nested queries.

MySQL doesn't support stored procedures or triggers. Stored procedures are queries that are compiled and stored in the DBMS. They are then invoked by the middle-tier application logic, with the benefit that the query is parsed only once and there is less communication overhead between the middle and database tiers. Triggers are similar to stored procedures but are invoked by the DBMS when a condition is met. Stored-procedure support is planned for MySQL, but trigger support isn't.

Views aren't supported in MySQL. Views consolidate read-only access to several tables based on a join condition. For example, a view might allow a user to browse the sales made up to April without the need to create a temporary table, as we did in the example in Section 3.8. View support is planned for the future.

Limitations that we don't discuss here include the lack of support for foreign keys and cursors. More detail on the limitations of MySQL can be found in Section 1.4 of the manual distributed with MySQL.

only for RuBoard - do not distribute or recompile