Previous Section  < Day Day Up >  Next Section

The Quest for AES

On the contrary to the preceding story, DES does have a design flaw. Although the algorithm itself is sufficiently secure, recall that the key size and space is only 56 bits. Although it might have been sufficient at the time of DES design, we have seen a sufficient growth in processing power since 1974. Interestingly, the designers of DES might have foreseen it, as the initial proposition for DES key size was 128 bits. However, the National Security Agency (NSA) blocked that proposition for reasons that are unclear (and might become clear) if you are paranoid enough and 56-bit key DES went ahead instead. In July 1998, the Electronic Frontier Foundation (EFF; organized and funded a project to build a DES cracking machine for less than $250,000, and started a massive parallel processing software DES brute-forcing project. On January 19, the EFF DES cracker broke a 56-bit key in 56 hours, testing 88 billion keys per second and completing the third DES challenge contest sponsored by RSA Labs. The need for a new, improved encryption standard has materialized from shadows.

Without waiting for the DES key to be broken, on January 2, 1997, the National Institute of Standards and Technology (NIST; announced the beginning of the AES development effort and made a formal call for AES cipher candidates submission on September 12, 1997. The call stipulated that the AES would specify an unclassified, publicly disclosed encryption algorithm, available royalty-free worldwide. In addition, the algorithm had to implement symmetric key cryptography as a block cipher and support block sizes of 128 bits and key sizes of 128, 192, and 256 bits. The race began. On August 20, 1998, NIST announced a group of 15 AES candidate algorithms at the First AES Candidate Conference (AES1). These candidate ciphers included CAST-256, CRYPTON, DEAL, DFC, E2, FROG, HPC, LOKI 197, MAGENTA, MARS, RC6, RIJNDAEL, SAFER+, SERPENT, and TWOFISH. After the Second AES Candidate Conference (AES2) took place in Rome on March 22 and 23, 1999, only five candidates remained: MARS, RC6, RIJNDAEL, SERPENT, and TWOFISH. The five final candidates were determined to be equally secure, but the issues of efficient, fast and resource-preserving implementation remained. Eventually, in October 2, 2000, NIST announced that it has selected Rijndael as the AES.

Here we briefly evaluate all five finalist AES candidates plus Blowfish, IDEA, and 3DES, giving you the choice to select a cipher you like the most for your host, network, or code. The choice should be based on performance, method of implementation, and licensing issues as well as cipher security. For an outside network security consultancy (e.g., Arhont, any interference with the quality of networking could easily lead to a lost contract and ruined reputation. Managers who outsource security services will not understand the difference between DES, 3DES, and AES. What they will understand is a horde of users chanting, "These guys did something and the network became very slow!" Take a wild guess what might follow.

A bit of a background on which properties of the selected cipher affect the qualities listed is advisable:

  • Data encryption consumes bandwidth, reduces latency, and might contribute to increased packet loss. The questions are what is acceptable loss and how to minimize the side effects of enhanced security by implementing the most appropriate cryptosystem for the given network. Even though WLANs are getting faster, they still have lower bandwidth and throughput compared to their wired counterparts, and they are also shared media. Thus, one has to be especially careful when selecting a VPN solution for a wireless network.

  • Increasing the number of rounds in iteration improves security but demands more CPU resources. Is your wireless gateway CPU capable of dealing with the increased load?

  • Multiplication instruction is not native for pre-Pentium II machines and original (non-ULTRA) SPARCs. It is unlikely to be native for many handheld CPUs as well. Intel Itanium does not have rotation shift instruction, and multiplication is executed in FPU, not IU. Ciphers that use multiplication operations perform badly on such CPUs, and in the case of Itanium, ciphers that use rotation shift are also at a disadvantage. Interestingly, even if the CPU implements rotation shift instruction, some compilers do not use it, adding more to the complexity of this problem.

  • If enciphering involves manipulating very large integers, processors offering high-performance integer calculations outperform processors that focus on floating point operations.

  • All ciphers we describe were tested on 8-, 32-, and 64-bit chips. Their performance on these chips is variable, and if a cipher performs well on a 32-bit CPU, it is not an indicator that it will work great on an 8- or 64-bit one.

  • Encryption and decryption speed do not always match. Frequently, but not always, decryption is slower. You might consider this when choosing platforms for encryption and decryption services.

  • Performance of various ciphers might vary depending on the mode of use. A cipher sufficiently fast in ECB could be inferior in OFB and vice versa.

  • Cipher speed and efficiency are usually higher when implemented in Assembly rather than C and in C rather than higher level languages. Although hardware implementations of encryption are traditionally considered the fastest, in reality it depends on both cipher and hardware used.

  • Generation of large subkeys and storing them in memory negatively affects the amount of RAM used. This is an issue important for restricted-space devices, such as smart cards.

  • S-boxes either are tables, as in the case of DES, or can be derived algebraically. Large table S-boxes also might consume too much RAM on devices with a limited amount of memory. Algebraically derived S-boxes are considered to be less secure.

As far as security goes, all five AES candidates, as well as 3DES and Blowfish, are adequately secure. Thus, availability, implementation, and performance are often the main issues for cipher selection. One of the security criteria you might want to pay attention to is a security margin. The security margin is defined by a number of rounds in iteration above which efficient attacks on the algorithm cannot be mounted and key space exhaustion becomes the only way to break it.

Another rather fascinating point is the resilience of ciphers to novel implementation-based timing and power-consumption attacks. These attacks are physical, not mathematical, by nature. Timing attacks are based on analyzing the amount of time spent on executing instructions when different arguments are supplied to the cipher-implementing device or software. Power-consumption attacks analyze the patterns of device power consumption, which vary with the arguments supplied. A general defense against timing attacks is simultaneous encryption and decryption. General defense against power-consumption attacks is more sophisticated and might involve software balancing; for example, masking the power consumption pattern through processing a complement of intermediate iteration data simultaneously and using the same basic operations. In highly secure environments, you might want to pick ciphers with higher resistance to power and timing attacks due to the very nature of instructions run during the iteration. Table lookups, such as the one used by DES in S-boxes, fixed shifts and rotations, and Boolean NOT, OR, AND, and XOR operations are not vulnerable to timing attacks and can be defended against power attacks by implementing software balancing. Addition and subtraction are more difficult to defend from both timing and power attacks, and multiplication, division, squaring, or variable shifts and rotations are very hard to protect against them.

Now, as we know a bit more about performance and resource consumption issues in applied cryptography, we can proceed further, evaluating well-known symmetric block ciphers for our networking, software, and hardware needs.

AES (Rijndael)

We start with the official AES, or Rijndael, proposed by Belgian mathematicians Vincent Rijmen and Joan Daemen. FIPS 197, which announces the AES and describes it in detail, is available at the NIST encryption site ( The ciphers authors' personal Rijndael site is AES supports 128, 192, and 256 key and plaintext block sizes. One of the unique Rijndael characteristics is dependence of round numbers on a key size: R = K/32 + 6; thus, there are 10, 12, and 14 rounds for 128, 192, and 256 bits, respectively. Rijndael uses four operations:

  • Byte substitution, which is a form of nonlinear permutation and uses a single S-box table

  • Shift row, which is a cyclic shift

  • Mix column operation, which is a linear transformation

  • Round key addition

The size of a round key equals the size of an encryption block used. The encryption block is represented as a rectangular array with four rows. Each byte in the array is XORed with a corresponding subkey byte, which is also represented as a matrix. In a final round of AES, the mix column operation is omitted.

Key schedule function involves key expansion and round key selection. The total number of key bits required equals N(R + 1), where N is a block size and R is the number of rounds. Two different versions of the key expansion function exist for keys less and more than 192 bits.

The images of cipher structures from John Savard's home page are helpful again, this time providing us with both colorful and more convenient schemes underlining both the function and the aesthetics of the Rijndael design (see Figures 11-2 and 11-3).

Figure 11.2. The outline of AES operation.


Figure 11.3. AES: a 3D view.


Rijndael functions well on 8-, 32-, and 64-bit chips. Of the various CPU architectures tested, Itanium was shown to be the most efficient at running AES. Rijndael was shown to be the highest performer on restricted memory space and processing power devices—twice as fast as other finalists—and requiring much less ROM and RAM. It was also the most efficient cipher in all feedback modes and the second highest performer in ECB/CBC. Its safety margin is 7 rounds, with the minimum amount of rounds implemented being 10 (key size of 128 bits). The AES difference in speed between encryption and decryption was not significant. Increasing the key size from 128 to 192 and 256 bytes leads to 20 percent and 40 percent throughput decrease as the number of rounds goes up. When implemented in hardware, Rijndael demonstrated a very high throughput, matched only by Serpent (in ECB mode). Because Rijndael uses only fixed shifts and rotations, Boolean operations, and table lookups, it is reasonably resistant to both timing and power-consumption attacks and can be well protected by software blocking from the latter.


MARS, another AES candidate proposed by IBM, is known for its relative complexity and high number of rounds. The creators of MARS claim that its heterogeneous structure is a deliberate design feature to resist unknown attacks.

MARS's block size is 128 bits, and key sizes can vary between 128 and 448 bits. Plaintext input is taken in the form of four 32-bit blocks. There are three phases of MARS iteration.

During Phase 1, n 32-byte key words (4 < n < 14) are expanded to 40 32-byte subkey words using a key expansion function. Then the data blocks are XORed with the key words, and 8 rounds of unkeyed (not affected by the key) rounds using two fixed S-boxes follow.

Phase 2 is responsible for the major part of MARS security and employs 16 rounds of transformation using an expansion function E (see Figure 11-4).

Figure 11.4. The core of MARS.


The E function takes a key word and adds it to the data word supplied by Phase 1. Then the result is multiplied by a second key word, which must be odd. Afterward, the data is looked up in a fixed S-box (see the previous phase), gets XORed with the multiplied two key word/data word, and undergoes two rotations dependent on the lowest 5 bits of the multiplication result mentioned earlier. You get 32 x 4 = 128 bit output from E. The round on the scheme is one of the 8 forward rounds; 8 backward (in terms of rotation) rounds follow.

Finally, 8 rounds of unkeyed mixing in backward mode constitute Phase 3, basically a reverse of Phase 1. In total, there are 8 x 2 + 32 = 48 rounds, an impressive amount! Despite the superficial complexity, MARS is not very complex from a programmer's view, at least when it comes to the number of implementation lines. Taking into account all rounds, excluding the unkeyed, the safety margin of MARS is 21, which leaves a whole 11 rounds of protective buffer.

However, the heterogenicity of MARS has a downside when it comes to performance and resource consumption. Its software implementation speed and throughput strongly depends on how well the processor/compiler combination can handle multiplication and variable (data-dependent) rotations. On PII/PIII CPUs, these operations are handled fine, but one day you might want to upgrade to Itaniums, UltraSPARCs, or some other CPU architecture that does not support these operations well. Then you might find that MARS encryption and decryption speed (not significantly different in both directions) has become a serious bottleneck. On the other hand, MARS throughput does not appear to depend on key size, so you can safely go up to the maximum 448-bit key length. When implemented in hardware, both throughput and efficiency of MARS are below average. Also, this cipher is not very suitable for restricted-space devices such as smart cards: From the number of subkeys used alone, you can see that both RAM and ROM requirements of MARS would be high.


Another AES finalist that uses multiplication operations and a large number of 32-bit subkeys is RC6 from RSA Labs. RC6 has a licensing issue: It was submitted as an AES candidate under the condition of becoming unlicensed if it won (because all AES candidates were expected to be free to use) and remaining licensed if it didn't win. RC6 can use variable numbers of rounds, block sizes, and keys up to 2,040 bits (which is really a lot for a symmetric block cipher). However, for the AES submission, 32-bit words, 16 rounds, and key sizes of 128, 192, and 256 bits were selected.

RC6 is based on the proprietary RC5 cipher, which is currently attacked by the project.

Just like the majority of post-DES symmetric block ciphers, RC6 is based on Feistel rounds. However, instead of splitting the block in halves (left and right) and operating the rounds between the halves, RC6 splits the halves into two words each (thus, 32-bit words) and runs the rounds between halves of the halves. RC6 generates 44 subkeys of 32 bits each. The block of plaintext input is split into four 32-bit words designated as A, B, C, and D. Encryption of data proceeds in a Little-Endian order: The least significant byte is enciphered first. The initial step is XORing B with the first subkey and D with second. The next round uses the third and fourth subkeys, and so forth. In total there are 20 rounds. After the last round, A is XORed with the 42nd and C with the 43rd key.

What happens inside of the round? The main function is simple: f(x) = x*(2x+1). The result of the function is rotated to the left by 5 bits and XORed with another word. B and D are the only words subjected to the function f. Because there are 16 rounds in the AES submission of RC6, and two words out of four are subjected to multiplication, overall we get 16 x 2 = 32 multiplication operations. The results of f(B) and f(D) followed by left rotation are XORed to A and C, respectively. The least significant 5 bits of the values obtained define the extent to which C and A are circular left-shifted later. Again, there are 2 x 16 rounds = 32 variable rotations per iteration. Finally, the subkeys used for this particular round are XORed with A and C words and the four quarters are rotated: The value of A is placed in D, B in A, C in B, and D in C.

Subkeys for rounds are supplied by a key scheduling function that pads the key with zeros to match its length with the integral number of words. The number of subkeys generated equals 2 x number of rounds + 4, which is 2 x 20 + 4 = 44 in the case of the AES submission. The padded key is loaded into an array L in a Little-Endian format. Two left shifts, one by 3 bits, and one variable are used to create confusion. The size of an output array S is adjusted using two constants P and Q: S [0] = P ; for i = 1 to 2 x rounds_number + 3 do S [i] = S [j-1] + Q where i and j are two subkey numbers in the array. If you are curious, P is e – 2, where e is the base of a natural logarithm function and Q is a Golden Ratio [(5+1)/2]-1. If you aren't curious, P = 0xb7e15162 and Q = 0x9e3779b9, just in case someone asks you about these values in a bar and you don't know what to answer.

RC6 has an adequate security margin of 16 rounds (out of 20). Its decryption speed appears to be slightly higher than its encryption rate. RC6 is a reasonably good performer when implemented in hardware. However, when software-based, performance varies significantly depending on presence of or support for multiplication and variable rotation instructions (see the earlier notes on MARS performance; the same applies to RC6). Also, because of both multiplication and variable shift reliance, RC6 (as well as MARS) is difficult to protect against power-consumption and timing attacks. It should be noted that RC6 is very fast on appropriate architectures, such as PII and PIII, and when implemented in C could outperform all other AES candidates (see the Gladman's AES performance data at for a reference). However, its performance on 8- and 64-bit CPUs was not impressive. While implementing RC6, you might think twice about scalability issues, including possible future use of 64-bit chips or CPU architectures like UltraSPARC or Itanium that do not support multiplication and variable rotation instructions natively. RC6 has a low ROM requirement, because it doesn't use any large tables and table lookups. However, its slow performance on 8-bit chips is a disadvantage on low-end devices. Besides, RC6 subkeys must be precomputed and stored in memory, which makes RAM demand for RC6 higher than RAM demand for other AES candidates. Thus, RC6 is not an ideal cryptographic solution for restricted space and resource device security. RC6 performs better in ECB and CBC, and changes of RC6 key size do not strongly affect its performance.


Whereas RC6 celebrates its simplicity, Bruce Schneier's Twofish is famous for its complexity, even though the authors maintain that this perception is wrong. Nevertheless, Twofish has been with us for a while, it was extensively cryptoanalyzed, and it is used by many software products. A comprehensive list of programs that use Twofish can be viewed at the author's site ( A tool the list forgets to mention (at the time of writing this chapter) is Nessus ( If you are somehow related to network security, you know what it does and have already used it many times. All data between Nessus servers and clients are encrypted with Twofish. The reason for the popularity of Twofish, and that of its predecessor Blowfish, is that the algorithm and source code that implements it is completely license-free for any kind of use.

Twofish uses 16 rounds, 128-bit block size, and 256-bit keys (even though the key size can be decreased) that generate 40 32-bit subkeys. Like RC6, it splits the plaintext block into four subblocks of 32 bits each, using Little-Endian convention. Let us designate these subblocks (or words) as Q0–Q3. Before these words are put through the first Twofish round and after the last round takes place, an operation of the so-called whitening takes place to increase the cipher's confusion level. Whitening is XORing the words with subkeys before and after the rounds, under the condition that the subkeys used for whitening are never used in the cipher again. Thus, the specific input and output from the iteration rounds is concealed.

A Twofish round begins by rotating the last Q3 word 1 bit to the left. Then Q0 and Q1 are rotated left 8 bits. The data is then submitted to four 8-bit key-dependent (fixed box lookups combined with key material XORing) S-boxes. That output is multiplied with matrix material from the so-called MDS matrix. In case you wonder what's in the matrix, here it is:

01 EF 5B 5B

5B EF EF 01

EF 5B 01 EF

EF 01 EF 5B

The output of the matrix on matrix "multiplication" (or should we say "imposition") is put through a mixing Pseudo-Hadamard Transform (PHT) operation. In a nutshell, if we take inputs a and b, 32-bit PHT is defined as follows:

a' = a + b mod 2^32          b = a + 2b mod 2^32

Then the first subkey is added to the value formed from Q0 and the result is XORed with Q2. The second subkey for the round is added to the value formed from Q1 and XORed with Q3. Following that, Q2 is rotated 1 bit right and the block halves are swapped (Q0 with Q2 and Q1 with Q3). The events in the round from its beginning to the PHT mixing are the core of Twofish security and are defined as function g in the literature.

The best way to illustrate the events described is shown in Figure 11-5.

Figure 11.5. Twofish operation structure scheme.


How are the subkeys made? The key schedule function starts by generating three key vectors each one-half key long. The first two are produced via splitting the key into 32-bit parts. The third key is formed by dividing the key into 64-bit blocks and generating one 32-bit part of the key vector by multiplying each 64-bit part by the RS matrix:

01 A4 55 87 5A 58 DB 9E

A4 56 82 F3 1E C6 68 E5

02 A1 FC C1 47 AE 3D 19

A4 55 87 5A 58 DB 9E 03

32-bit words resulting from the multiplication are placed in reverse order into the key vector S. This vector is used to generate key-dependent S-boxes in the function g. For example, if the key is 128 bits long, the S-box structure would be

output = q(0)(S(0,0) xor q(1)(S(1,0) xor q(1)(input))

output = q(1)(S(0,1) xor q(1)(S(1,1) xor q(0)(input))

output = q(0)(S(0,2) xor q(0)(S(1,2) xor q(1)(input))

output = q(1)(S(0,3) xor q(0)(S(1,3) xor q(0)(input))

Another different function, function h, participates in generating the subkeys in parallel with "enriching" the S-boxes with key material. It involves XORing 32-bit words of plaintext with the key vectors and combining obtained results with the MDS matrix. Then the subkeys are generated via addition of the h function-generated data and fixed 8- and 9-bit left shifts.

The facts that the subkeys are generated on the fly and key space is used for two parallel processes of "shuffling" key and plaintext data are the unique characteristics of the Twofish algorithm. This, together with whitening, provides a high level of confusion and is partially responsible for the high safety margin of Twofish: 6 out of 16 rounds. However, the key setup is slow and with the increasing key size, Twofish throughput goes down. Also, because of the addition operation Twofish is somewhat more vulnerable to power consumption and timing attacks, although less vulnerable than MARS and RC6. Encryption and decryption throughputs of Twofish were shown to be practically identical. Because Twofish does not use any atypical instructions and was designed to be implemented on 8- and 64- as well as 32-bit platforms, it performs equally well on all tested architectures with an exemption of ARM chips, on which Twofish is rather slow (too bad for using Twofish implementations on the majority of modern PDAs, e.g., HP iPAQs). The performance of Twofish implemented in hardware was judged "average." Because Twofish does not use large S-boxes and can generate subkeys as the iteration runs without precomputing and storing them in RAM, Twofish scales well on low-resource devices.


The last AES finalist, Serpent, is more massive than it is complex. In fact, Serpent is very similar to DES and perhaps should have been reviewed first for ease of comparison. Despite Serpent's similarity to DES, it is claimed to be more secure than triple-DES (which we cover after dealing with Serpent) while having an operation speed close to DES. Serpent was developed by Ross Anderson (Cambridge University), Eli Biham (Technion, Haifa), and Lars Knudsen (University of Bergen, Norway). Two patent applications for Serpent were filled in the United Kingdom.

Just like the ciphers we went through before, Serpent takes 128-bit blocks of plaintext data and splits them into four 32-bit words. Maximum Serpent key size is 256 bits, and all keys smaller than that are padded to 256 bits by adding 1 to the most significant bit and filling the remaining space with zeros. Serpent employs 32 rounds and uses XOR, table lookup, fixed bit rotation, and bit-shifting instructions. It also uses initial and final permutations similar to the ones used by DES. These permutations are there for increasing computational efficiency and input–output convenience and have no effect on overall cipher security.

Each round starts from XORing the appropriate subkey (128 bits) with a plaintext block (also 128 bits). Then the block is fed into a corresponding S-box. Serpent has eight S-boxes, each used four times to get 32 rounds: S0 is used for Rounds 1, 9, 17, and 25; S1 is used for Rounds 2, 10, 18, 26; and so forth. The output of S-boxes is divided into four 32-bit words Q0, Q1, Q2 and Q3; each word undergoes shifts and rotations in the following order:

  • Q0 is rotated 13 bits left, and Q2 is rotated 3 bits left.

  • Q1 is modified by XORing Q0 and Q2 to it. Q3 gets XORed with Q0 (shifted left 3 bits), and Q2 which is left alone.

  • Then Q1 is rotated 1 bit left, and Q3 7 bits left.

  • Q0 is modified by XORing Q1 and Q3 to it. Q2 gets XORed with Q1 (shifted left 7 bits), and Q3, which is left alone.

  • Q0 is rotated 5 bits left, and Q3 is rotated 22 bits left.

Thus, a bit-slicing effect is achieved and the output of the S-boxes gets well shuffled.

In the final round, the mixing operations are omitted. Instead the final permutation follows. Although it might not be easy to understand the "shuffling," it starts making more sense in the scheme of a single round (see Figure 11-6; the circles denote XOR operations).

Figure 11.6. An outline of Serpent structure.


From this pattern you can imagine that after a chain of 32 rounds very high levels of diffusion and confusion would be reached.

The arrangement of S-boxes in Serpent was inspired by the RC4 structure. The boxes are matrices that contain 16 4-bit entries. Thirty-two copies of each S-box are produced in the iteration process; they are propagated along the rounds in parallel fashion. The internal workings of Serpent S-boxes are completely identical to DES S-box operations. The designers considered preserving DES S-boxes to be an important factor in boosting public confidence by applying a tested and tried technique.

As to the key schedule, after padding (if necessary), a 256-bit key is divided into eight 32-bit words. Then 132 32-bit words are formed in accordance with the following algorithm:

Word(n) = (Word(n-8) XOR Word(n-5) XOR Word(n-3) XOR Word(n-1) XOR '0x9E3779B9' XOR n) <<< 11

where 0x9E3779B9 is something you encountered earlier (yes, the Golden Ratio) and <<< denotes a fixed shift left by 11 bits. The 132 words generated are fed into DES S-boxes to produce 132 subkey words k0-131. These subkeys are merged into groups of four to get what we need: 33 128-bit subkeys for our 32 rounds in iteration.

Serpent has a simple yet powerful structure, making its cryptanalysis easy to perform. It provides a high safety margin of 9 rounds out of 32, the highest safety margin of all AES candidates. Due to its use of XORs, table lookups in DES S-boxes, and fixed rotations and shifts, Serpent is not likely to be vulnerable to timing or power-consumption attacks. However, there is a performance price to pay: Serpent was the slowest AES candidate when implemented in software. Interestingly, though, the speed of Serpent coded in C did not differ from the speed of its Assembly implementation. Also, when you look at the structure of Serpent, you can see four "pipelines" of 32 S-boxes. If Serpent is implemented in hardware that supports four parallel memory pipelines (e.g., Itanium), Serpent might work very fast. In fact, it works fine in hardware anyway. In nonfeedback mode it shows the highest throughput of all five candidates, and in CFB/OFB it is inferior only to Rijndael. The reason for such discrepancy between software and hardware performance of Serpent lies in the simplicity of all instructions used by the algorithm. For exactly the same reason, Serpent is well-suited to restricted memory space devices, despite having a large number of S-boxes. The question of Serpent's performance at different key lengths is simply irrelevant because a key of any size would be padded to 256 bits. Possibly on the basis of both very high security and hardware performance rationales, old-fashioned, DES-like Serpent was second after Rijndael in the AES voting process: Rijndael got 86 votes, Serpent had 59, Twofish garnered 31, RC6 got 23, and MARS received 13.

    Previous Section  < Day Day Up >  Next Section