
18.4. Optimization

"First make it work. Then make it right. Then make it fast." This quotation, often with slight variations, is widely known as "the golden rule of programming." As far as I've been able to ascertain, the quotation is by Kent Beck, who credits his father with it. Being widely known makes the principle no less important, particularly because it's more honored in the breach than in the observance. A negative form, slightly exaggerated for emphasis, is in a quotation by Don Knuth (who credits Hoare with it): "Premature optimization is the root of all evil in programming."

Optimization is premature if your code is not working yet, or if you're not sure about what, exactly, your code should be doing (since then you cannot be sure if it's working). First make it work. Optimization is also premature if your code is working but you are not satisfied with the overall architecture and design. Remedy structural flaws before worrying about optimization: first make it work, then make it right. These first two steps are not optional; working, well-architected code is always a must.

In contrast, you don't always need to make it fast. Benchmarks may show that your code's performance is already acceptable after the first two steps. When performance is not acceptable, profiling often shows that all performance issues are in a small part of the code, perhaps 10 to 20 percent of the code where your program spends 80 or 90 percent of the time. Such performance-crucial regions of your code are known as its bottlenecks, or hot spots. It's a waste of effort to optimize large portions of code that account for, say, 10 percent of your program's running time. Even if you made that part run 10 times as fast (a rare feat), your program's overall runtime would only decrease by 9 percent, a speedup no user would even notice. If optimization is needed, focus your efforts where they'll matter: on bottlenecks. You can optimize bottlenecks while keeping your code 100 percent pure Python, thus not preventing future porting to other Python implementations. In some cases, you can resort to recoding some computational bottlenecks as Python extensions (as covered in Chapter 25), potentially gaining even better performance (possibly at the expense of some potential future portability).

18.4.1. Developing a Fast-Enough Python Application

Start by designing, coding, and testing your application in Python, using available extension modules if they save you work. This takes much less time than it would with a classic compiled language. Then benchmark the application to find out if the resulting code is fast enough. Often it is, and you're done: congratulations! Ship it!

Since much of Python itself is coded in highly optimized C, as are many of its standard and extension modules, your application may even turn out to be already faster than typical C code. However, if the application is too slow, you need to reexamine your algorithms and data structures. Check for bottlenecks due to application architecture, network traffic, database access, and operating system interactions. For typical applications, each of these factors is more likely than language choice to cause slowdowns. Tinkering with large-scale architectural aspects can often speed up an application dramatically, and Python is an excellent medium for such experimentation.

If your program is still too slow, profile it to find out where the time is going. Applications often exhibit computational bottlenecks: small areas of the source code, often between 10 and 20 percent, which account for 80 percent or more of the running time. Then optimize the bottlenecks, applying the techniques suggested in the rest of this chapter.

If normal Python-level optimizations still leave some outstanding computational bottlenecks, you can recode them as Python extension modules, as covered in Chapter 25. In the end, your application will run at roughly the same speed as if you had coded it all in C, C++, or Fortran, or faster, when large-scale experimentation has let you find a better architecture. Your overall programming productivity with this process is not much less than if you coded everything in Python. Future changes and maintenance are easy, since you use Python to express the overall structure of the program and lower-level, harder-to-maintain languages for only a few specific computational bottlenecks.

As you build applications in a given area according to this process, you accumulate a library of reusable Python extension modules. You therefore become more and more productive at developing other fast-running Python applications in the same field.

Even if external constraints eventually force you to recode the whole application in a lower-level language, you're still better off for having started in Python. Rapid prototyping has long been acknowledged as the best way to get software architecture just right. A working prototype lets you check that you have identified the right problems and taken a good path to their solution. A prototype also affords the kind of large-scale architectural experiments that can make a real difference to performance. Starting your prototype with Python allows a gradual migration to other languages by way of extension modules. The application remains fully functional and testable at each stage. This ensures against the risk of compromising a design's architectural integrity in the coding stage. The resulting software is faster and more robust than if all of the coding had been lower-level from the start, and your productivity, while not quite as good as with a pure Python application, is still better than if you had been coding at a lower level throughout.

18.4.2. Benchmarking

Benchmarking (also known as load testing) is similar to system testing: both activities are much like running the program for production purposes. In both cases, you need to have at least some subset of the program's intended functionality working, and you need to use known, reproducible inputs. For benchmarking, you don't need to capture and check your program's output: since you make it work and make it right before you make it fast, you are already fully confident about your program's correctness by the time you load-test it. You do need inputs that are representative of typical system operations, ideally ones that may be most challenging for your program's performance. If your program performs several kinds of operations, make sure you run some benchmarks for each different kind of operation.

Elapsed time as measured by your wristwatch is probably precise enough to benchmark most programs. Programs with hard real-time constraints are obviously another matter, but they have needs very different from those of normal programs in most respects. A 5 or 10 percent difference in performance, except for programs with very peculiar constraints, makes no practical difference to a program's real-life usability.

When you benchmark "toy" programs or snippets in order to help you choose an algorithm or data structure, you may need more precision: the timeit module of Python's standard library (mentioned in "Module timeit" on page 484) is quite suitable for such tasks. The benchmarking discussed in this section is of a different kind: it is an approximation of real-life program operation for the sole purpose of checking whether the program's performance at each task is acceptable, before embarking on profiling and other optimization activities. For such system benchmarking, a situation that approximates the program's normal operating conditions is best, and high accuracy in timing is not particularly important.

18.4.3. Large-Scale Optimization

The aspects of your program that are most important for performance are large-scale ones: choice of algorithms, overall architecture, choice of data structures.

The performance issues that you must often take into account are those connected with the traditional big-O notation of computer science. Informally, if you call N the input size of an algorithm, big-O notation expresses algorithm performance, for large values of N, as proportional to some function of N (in precise computer science lingo, this should be called big-Theta, but in real life, most programmers call this big-O, perhaps because a Greek uppercase Theta looks like an O with a dot in the center!).

An O(1) algorithm (also known as a "constant time" algorithm) is one that takes a time that does not grow with N. An O(N) algorithm (also known as a "linear time" algorithm) is one where, for large enough N, handling twice as much data takes about twice as much time, three times as much data three times as much time, and so on, growing proportionally to N. An O(N²) algorithm (also known as a "quadratic time" algorithm) is one where, for large enough N, handling twice as much data takes about four times as much time, three times as much data nine times as much time, and so on, growing proportionally to N squared. Identical concepts and notation are used to describe a program's consumption of memory ("space") rather than of time.

You will find more information on big-O notation, and about algorithms and their complexity, in any good book about algorithms and data structures. Unfortunately, at the time of this writing, there aren't yet any such books that use Python. However, if you are familiar with C, I recommend Mastering Algorithms with C, by Kyle Loudon (O'Reilly).

To understand the practical importance of big-O considerations in your programs, consider two different ways to accept all items from an input iterable and accumulate them into a list in reverse order:

def slow(it):
    result = []
    for item in it: result.insert(0, item)
    return result

def fast(it):
    result = []
    for item in it: result.append(item)
    result.reverse()
    return result

We could express each of these functions more concisely, but the key difference is best appreciated by presenting the functions in elementary terms. Function slow builds the result list by inserting each input item before all previously received ones. Function fast appends each input item after all previously received ones, then reverses the result list at the end. Intuitively, one might think that the final reversing represents extra work, and therefore slow should be faster than fast. But that's not the way things work out.

Each call to result.append takes roughly the same amount of time, independent of how many items are already in list result, since there is always a free slot for an extra item at the end of the list (in pedantic terms, append is amortized O(1), but I don't cover amortization in this book). The for loop in function fast executes N times to receive N items. Since each iteration of the loop takes a constant time, overall loop time is O(N). result.reverse also takes time O(N), as it is directly proportional to the total number of items. Thus, the total running time of fast is O(N). (If you don't understand why a sum of two quantities, each O(N), is also O(N), consider that the sum of any two linear functions of N is also a linear function of N, and "being O(N)" has exactly the same meaning as "consuming an amount of time that is a linear function of N.")

In contrast, each call to result.insert must make space at slot 0 for the new item to insert by moving all items that are already in list result forward one slot. This takes a time proportional to the number of items that are already in the list. The overall amount of time to receive N items is therefore proportional to 1+2+3+...+(N-1), a sum whose value is O(N²). Therefore, the total running time of slow is O(N²).

It's almost always worth replacing an O(N²) solution with an O(N) one, unless you can somehow assign rigorous small limits to the input size N. If N can grow without very strict bounds, the O(N²) solution will inevitably turn out to be disastrously slower than the O(N) one for large enough values of N, no matter what the proportionality constants in each case may be (and no matter what profiling tells you). Unless you have other O(N²) or even worse bottlenecks elsewhere that you cannot eliminate, a part of the program that is O(N²) will inevitably turn into the program's bottleneck and dominate runtime for large enough values of N. Do yourself a favor and watch out for the big O: all other performance issues, in comparison, are almost always insignificant.

Incidentally, function fast can be made even faster by expressing it in more idiomatic Python. Just replace the first two lines with the single statement:

result = list(it)

This change does not affect fast's big-O character (fast is still O(N) after the change), but does speed things up by a large constant factor. Often, in Python, the simplest, clearest, most idiomatic way to express something is also the fastest.
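
For concreteness, here is a minimal sketch of the whole function after that change (the name fastest is illustrative, not part of the benchmark above); its behavior is identical to fast's:

def fastest(it):
    result = list(it)    # one O(N) step replaces the explicit append loop
    result.reverse()     # reversing is also O(N), so the total stays O(N)
    return result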

Choosing algorithms with good big-O characteristics is roughly the same task in Python as in any other language. You just need a few indications about the big-O performance of Python's elementary building blocks, and I provide them in the following sections.

18.4.3.1. List operations

Python lists are internally implemented as vectors (also known as dynamic arrays), not as "linked lists." This fundamental implementation choice determines just about all performance characteristics of Python lists, in big-O terms.

Chaining two lists L1 and L2, of length N1 and N2 (i.e., L1+L2) is O(N1+N2). Multiplying a list L of length N by the integer M (i.e., L*M) is O(N*M). Accessing or rebinding any list item is O(1). len( ) on a list is also O(1). Accessing any slice of length M is O(M). Rebinding a slice of length M with one of identical length is also O(M). Rebinding a slice of length M1 with one of different length M2 is O(M1+M2+N1), where N1 is the number of items after the slice in the target list (in other words, such length-changing slice rebindings are very cheap when they occur at the end of a list, and costly when they occur at the beginning or around the middle of a long list). If you need first-in, first-out (FIFO) operations, a list is probably not the fastest data structure for the purpose: instead, try type collections.deque, covered in "deque" on page 173.
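
As a hedged illustration of that last point, here is a minimal sketch contrasting a deque-based FIFO with the list operation it replaces (pop(0), which is O(N) per call because every remaining item must shift down one slot):

import collections

fifo = collections.deque()
for i in xrange(100000):
    fifo.append(i)           # O(1): add at the right end
while fifo:
    item = fifo.popleft()    # O(1); alist.pop(0) would instead be O(N) per call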

Most list methods, as shown back in Table 4-3, are equivalent to slice rebindings and have the same big-O performance. Methods count, index, remove, and reverse, and operator in, are O(N). Method sort is generally O(N*log(N)), but is highly optimized to be O(N) in some important special cases, like when the list is already sorted, reverse-sorted, or sorted except for a few items. range(a,b,c) is O((b-a)/c). xrange(a,b,c) is O(1), but looping on xrange's result is O((b-a)/c).

18.4.3.2. String operations

Most methods on a string of length N (be it plain or Unicode) are O(N). len(astring) is O(1). The fastest way to produce a copy of a string with transliterations and/or removal of specified characters is the string's method translate. The single most practically important big-O consideration involving strings is covered in "Building up a string from pieces" on page 484.
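
For example, here is a minimal sketch (the translation table and the characters to remove are arbitrary) of using translate to uppercase a plain string while removing its vowels, all in a single O(N) pass:

import string

table = string.maketrans(string.ascii_lowercase, string.ascii_uppercase)
print 'some sample text'.translate(table, 'aeiou')    # emits: SM SMPL TXT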

18.4.3.3. Dictionary operations

Python dictionaries are internally implemented with hash tables. This fundamental implementation choice determines just about all performance characteristics of Python dictionaries, in big-O terms.

Accessing, rebinding, adding, or removing a dictionary item is generally O(1), as are methods has_key, get, setdefault, and popitem, and operator in. d1.update(d2) is O(len(d2)). len(adict) is O(1). Methods keys, items, and values are O(N). Methods iterkeys, iteritems, and itervalues are O(1), but looping on the iterators that those methods return is O(N) (the methods with names that start with iter do save memory compared to their counterparts that return lists, which in turn may make them faster), and looping directly on a dict has the same big-O performance as iterkeys. Never test with if x in d.keys( ). That would be O(N), while the equivalent test if x in d: is O(1) (if d.has_key(x): is also O(1), but is slower than if x in d: and has no compensating advantage).
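
A minimal sketch of that last point (the size and the key are arbitrary):

d = dict((i, hex(i)) for i in xrange(100000))

found = 99999 in d           # O(1): one hash-table probe
found = 99999 in d.keys()    # O(N): builds a 100,000-item list, then scans it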

When the keys in a dictionary are instances of classes that define __hash__ and equality comparison methods, dictionary performance is of course affected by those methods. The performance indications presented in this paragraph hold only when hashing and equality comparison are O(1).

18.4.3.4. Set operations

Python sets, like dictionaries, are internally implemented with hash tables. All performance characteristics of sets are, in big-O terms, the same as those of dictionaries.

Adding or removing a set item is generally O(1), as is operator in. len(aset) is O(1). Looping on a set is O(N). When the items in a set are instances of classes that define __hash__ and equality comparison methods, set performance is of course affected by those methods. The performance indications presented in this paragraph hold only when hashing and equality comparison are O(1).

18.4.3.5. Summary of big-O times for operations on Python built-in types

Let L be any list, T any string (plain or Unicode), D any dict, S any set, with (say) numbers as items (with O(1) hashing and comparison), and x any number:


O(1)

len(L), len(T), len(D), len(S), L[i], T[i], D[i], del D[i], if x in D, if x in S, S.add(x), S.remove(x), additions or removals to/from the right end of L


O(N)

Loops on L, T, D, S, general additions or removals to/from L (not at the right end), all methods on T, if x in L, if x in T, most methods on L, all shallow copies


O(N log N)

L.sort in general (but O(N) if L is already nearly sorted or reverse-sorted)

18.4.4. Profiling

Most programs have hot spots (i.e., regions of source code that account for most of the time elapsed during a program run). Don't try to guess where your program's hot spots are: a programmer's intuition is notoriously unreliable in this field. Use module profile to collect profile data over one or more runs of your program, with known inputs. Then use module pstats to collate, interpret, and display that profile data. To gain accuracy, you can calibrate the Python profiler for your machine (i.e., determine what overhead profiling incurs on your machine). Module profile can then subtract this overhead from the times it measures so that the profile data you collect is closer to reality. Python 2.5 introduces a new standard library module cProfile with similar functionality to profile; cProfile is preferable, since it's faster and therefore imposes less overhead. Yet another profiling module in Python's standard library is hotshot (covered at http://docs.python.org/lib/module-hotshot.html and present since Python 2.2); unfortunately, hotshot is not compatible with threads.

18.4.4.1. The profile module

The profile module supplies one function you will often use.

run

run(code,filename=None)

code is a string that is usable with statement exec, normally a call to the main function of the program you're profiling. filename is the path of a file that run creates or rewrites with profile data. Usually, you call run a few times, specifying different filenames, and different arguments to your program's main function, in order to exercise various program parts in proportion to what you expect will be their use "in real life." Then you use module pstats to display collated results.

You may call run without a filename to get a summary report, similar to the one module pstats could give you, on standard output. However, this approach gives no control over output format, nor any way to consolidate several runs into one report. In practice, you should rarely use this feature: it's better to collect profile data into files.

Module profile also supplies class Profile (mentioned in the next section). By instantiating Profile directly, you can access advanced functionality, such as the ability to run a command in specified local and global dictionaries. I do not cover such advanced functionality of class profile.Profile further in this book.


18.4.4.2. Calibration

To calibrate profile for your machine, you need to use class Profile, which module profile supplies and internally uses in function run. An instance p of Profile supplies one method you use for calibration.

calibrate

p.calibrate(N)

Loops N times, then returns a number that is the profiling overhead per call on your machine. N must be large if your machine is fast. Call p.calibrate(10000) a few times and check that the various numbers it returns are close to each other, then pick the smallest one of them. If the numbers vary a lot, try again with larger values of N.
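
A minimal sketch of that procedure (using the N=10000 just suggested):

import profile

p = profile.Profile()
overheads = [p.calibrate(10000) for i in range(5)]
print overheads          # these numbers should be close to each other
bias = min(overheads)    # pick the smallest; if they vary a lot, retry with larger N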

The calibration procedure can be time-consuming. However, you need to perform it only once, repeating it only when you make changes that could alter your machine's characteristics, such as applying patches to your operating system, adding memory, or changing Python version. Once you know your machine's overhead, you can tell profile about it each time you import it, right before using profile.run. The simplest way to do this is as follows:

import profile
profile.Profile.bias = ...the overhead you measured...
profile.run('main()', 'somefile')


18.4.4.3. The pstats module

The pstats module supplies a single class, Stats, to analyze, consolidate, and report on the profile data contained in one or more files written by function profile.run.

Stats

class Stats(filename,*filenames)

Instantiates Stats with one or more filenames of files of profile data written by function profile.run.

An instance s of class Stats provides methods to add profile data and sort and output results. Each method returns s, so you can chain several calls in the same expression. s's main methods are as follows.

add

s.add(filename)

Adds another file of profile data to the set that s is holding for analysis.

print_callees, print_callers

s.print_callees(*restrictions)

Outputs the list of functions in s's profile data, sorted according to the latest call to s.sort_stats and subject to given restrictions, if any. You can call each printing method with zero or more restrictions, to be applied one after the other, in order, to reduce the number of output lines. A restriction that is an int n limits the output to the first n lines. A restriction that is a float f between 0.0 and 1.0 limits the output to a fraction f of the lines. A restriction that is a string is compiled as a regular expression (covered in "Regular Expressions and the re Module" on page 201); only lines that satisfy a search method call on the regular expression are output. Restrictions are cumulative. For example, s.print_callees(10, 0.5) outputs the first 5 lines (half of 10). Restrictions apply only after the summary and header lines: the summary and header are output unconditionally.

Each function f that is output is accompanied by the list of f's callers (the functions that called f) or f's callees (the functions that f called) according to the name of the method.

print_stats

s.print_stats(*restrictions)

Outputs statistics about s's profile data, sorted according to the latest call to s.sort_stats and subject to given restrictions, if any, as covered in print_callees, print_callers on page 481. After a few summary lines (date and time on which profile data was collected, number of function calls, and sort criteria used), the output, absent restrictions, is one line per function, with six fields per line, labeled in a header line. For each function f, print_stats outputs six fields:

  • Total number of calls to f

  • Total time spent in f, exclusive of other functions that f called

  • Total time per call to f (i.e., field 2 divided by field 1)

  • Cumulative time spent in f, and all functions directly or indirectly called from f

  • Cumulative time per call to f (i.e., field 4 divided by field 1)

  • The name of function f

sort_stats

s.sort_stats(key, *keys)

Gives one or more keys on which to sort future output, in priority order. Each key is a string. The sort is descending for keys that indicate times or numbers, and alphabetical for key 'nfl'. The most frequently used keys when calling sort_stats are:


'calls'

Number of calls to the function (like field 1 covered in "print_stats")


'cumulative'

Cumulative time spent in the function and all functions it called (like field 4 covered in print_stats on page 482)


'nfl'

Name of the function, its module, and the line number of the function in its file (like field 6 covered in print_stats on page 482)


'time'

Total time spent in the function itself, exclusive of functions it called (like field 2 covered in print_stats on page 482)

strip_dirs

s.strip_dirs()

Alters s by stripping directory names from all module names to make future output more compact. s is unsorted after s.strip_dirs(), and therefore you normally call s.sort_stats right after calling s.strip_dirs.
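
Putting the pieces together, here is a hedged sketch of a typical session with profile and pstats; main, its inputs, and the data filenames are placeholders for your own program:

import profile, pstats

profile.run('main(small_input)', 'prof_small')      # exercise one kind of operation
profile.run('main(large_input)', 'prof_large')      # ...and another
s = pstats.Stats('prof_small')
s.add('prof_large')                                 # consolidate both runs
s.strip_dirs().sort_stats('time').print_stats(10)   # the 10 costliest functions
s.print_callers(10)                                 # who calls them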


18.4.5. Small-Scale Optimization

Fine-tuning of program operations is rarely important. Tuning may make a small but meaningful difference in some particularly hot spot, but hardly ever is it a decisive factor. And yet, fine-tuning, in the pursuit of mostly irrelevant micro-efficiencies, is where a programmer's instincts are likely to lead. It is in good part because of this that most optimization is premature and best avoided. The most that can be said in favor of fine-tuning is that, if one idiom is always speedier than another when the difference is measurable, it's worth getting into the habit of always using the former and not the latter.

Most often, in Python, if you do what comes naturally and choose simplicity and elegance, you end up with code that has good performance as well as clarity and maintainability. In a few cases, an approach that may not be intuitively preferable still does offer performance advantages, as discussed in the rest of this section.

The simplest possible optimization is to run your Python programs using python -O or -OO. -OO makes little direct difference to performance compared to -O, but -OO may save memory, since it removes docstrings from the bytecode, and memory availability is sometimes (indirectly) a performance bottleneck. The optimizer is not very powerful in current releases of Python, but it may still gain you performance advantages on the order of 5 percent, sometimes as large as 10 percent (potentially larger if you make use of assert statements and if __debug__: guards, as suggested in "The assert Statement" on page 138). The best aspect of -O is that it costs nothing, as long as your optimization isn't premature, of course (don't bother using -O on a program you're still developing).
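
As a hedged illustration of that last point, here is a minimal sketch of the kind of checks that python -O strips out of the bytecode entirely, so that in production runs they cost nothing (the function and its checks are made up for the example):

def checked_mean(numbers):
    assert len(numbers) > 0                # assert statements vanish under -O
    if __debug__:                          # this whole block also vanishes under -O
        for n in numbers:
            assert isinstance(n, (int, float))
    return sum(numbers) / float(len(numbers))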

18.4.5.1. Module timeit

Standard library module timeit is very handy for measuring the precise performance of specific snippets of code. You can use module timeit's functionality in your own programs, but the simplest and most normal use is from the command line:

python -mtimeit -s'setup statement(s)' 'statement(s) to be timed'

For example, say you're wondering about the performance of x=x+1 versus x+=1. At some command prompt, you can easily try:

$ python -mtimeit -s'x=0' 'x=x+1'
1000000 loops, best of 3: 0.25 usec per loop
$ python -mtimeit -s'x=0' 'x+=1'
1000000 loops, best of 3: 0.258 usec per loop

and find out that performance is for all intents and purposes identical in both cases.
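
When you do want to call timeit from your own code, a minimal sketch of the equivalent measurement (the statement strings are the same ones used on the command lines above) looks like this:

import timeit

t = timeit.Timer('x=x+1', 'x=0')                  # statement to time, setup statement
best = min(t.repeat(repeat=3, number=1000000))    # best total of 3 runs, in seconds
print '%.3f usec per loop' % (best * 1e6 / 1000000)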

18.4.5.2. Building up a string from pieces

The single Python "anti-idiom" that's likeliest to kill your program's performance, to the point that you should never use it, is to build up a large string from pieces by looping on string concatenation statements such as big_string+=piece. Python strings are immutable, so each such concatenation means that Python must free the M bytes previously allocated for big_string, and allocate and fill M+K bytes for the new version. Doing this repeatedly in a loop, you end up with roughly O(N²) performance, where N is the total number of characters. More often than not, O(N²) performance where O(N) is available is a performance disaster. On some platforms, things may be even bleaker due to memory fragmentation effects caused by freeing many memory areas of progressively larger sizes.

To achieve O(N) performance, accumulate intermediate pieces in a list rather than build up the string piece by piece. Lists, unlike strings, are mutable, so appending to a list has O(1) performance (amortized). Change each occurrence of big_string+=piece into temp_list.append(piece). Then, when you're done accumulating, use the following to build your desired string result in O(N) time:

big_string = ''.join(temp_list)

Using a list comprehension, generator expression, or other direct means (such as a call to map, or use of standard library module itertools) to build temp_list may often offer further optimization over repeated calls to temp_list.append. Other O(N) ways to build up big strings, which some Python programmers find more readable, are to concatenate the pieces to an instance of array.array('c') with the array's extend method, or to write the pieces to an instance of cStringIO.StringIO.
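
Here is a minimal sketch of two such O(N) approaches side by side (the function names are just for illustration):

import cStringIO

def join_version(pieces):
    temp_list = [str(piece) for piece in pieces]
    return ''.join(temp_list)

def cstringio_version(pieces):
    buf = cStringIO.StringIO()
    for piece in pieces:
        buf.write(str(piece))
    return buf.getvalue()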

In the special case where you want to output the resulting string, you may gain a further small slice of performance by using writelines on temp_list (never building big_string in memory). When feasible (i.e., when you have the output file object open and available in the loop), it's just as effective to perform a write call for each piece, without any accumulation.

Although not nearly as crucial as += on a big string in a loop, another case where removing string concatenation may give a slight performance improvement is when you're concatenating several values in an expression:

oneway = str(x)+' eggs and '+str(y)+' slices of '+k+' ham'
another = '%s eggs and %s slices of %s ham' % (x, y, k)

Using operator % for string formatting is often a good performance choice.

18.4.5.3. Searching and sorting

Operator in, the most natural tool for searching, is O(1) when the righthand side operand is a set or dictionary, but O(N) when the righthand side operand is a string, list, or tuple. If you need to perform many searches on a container, you're generally much better off using a set or dictionary, rather than a list or tuple, as the container. Python sets and dictionaries are highly optimized for searching and fetching items by key.
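
For example, here is a hedged sketch of the typical pattern (the data is arbitrary): build the set once, then perform every membership test against it rather than against the original list:

big_list = [str(i) for i in xrange(1000000)]
queries = ['42', '999999', 'not there'] * 1000

as_set = set(big_list)                        # built once: O(N)
hits = [q for q in queries if q in as_set]    # each test is O(1)
# the same test against big_list itself would be O(N) per query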

Method sort of Python lists is also a highly optimized and sophisticated tool. You can rely on sort's performance. Performance dramatically degrades, however, if you pass sort a custom callable to perform comparisons in order to sort a list based on anything but built-in comparisons. To satisfy such needs, consider using the decorate-sort-undecorate (DSU) idiom instead. In spelled-out form, this idiom has the following steps:


Decorate

Build an auxiliary list A where each item is a tuple made up of the sort keys, ending with the item of the original list L or with the item's index.


Sort

Call A.sort( ) without arguments.


Undecorate

Extract the items in order from the now-sorted A.

The decorate and undecorate steps are often handily performed with list comprehensions. If you need the sort to be in-place, assign the final sorted list to L[:]. Otherwise, DSU provides a sorted copy, without disturbing the original list L.

For example, say we have in L a large list of strings, each of at least two words, and we want to sort L in-place by the second word of each string:

A = [ (s.split()[1], s) for s in L ]
A.sort()
L[:] = [ t[1] for t in A ]

This is much faster than passing to L.sort a function that compares two strings by their second words, as in:

def cmp2ndword(a, b): return cmp(a.split()[1], b.split()[1])
L.sort(cmp2ndword)

On a series of benchmarks with Python 2.4 on lists of 10,000 strings, I measured the DSU version as about five times faster than the non-DSU one.

A particularly fast and effective way to get the benefits of DSU is to specify a named argument key= to sort, a possibility that was introduced in Python 2.4: sort then performs the decorate and undecorate steps for you, internally and in C. Module operator supplies functions attrgetter and itemgetter that make handy key arguments when the sort key is an attribute or an item of each element. For the task mentioned above, the key function must extract each string's second word:

L.sort(key=lambda s: s.split()[1])

On the same benchmarks, the key= approach is faster still than the spelled-out DSU version, besides being more concise and readable.

Occasionally, you may avoid the need for sorting by using heaps, covered in "The heapq Module" on page 177.
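
For instance, here is a minimal sketch (the data is arbitrary) of picking the 10 smallest items of a large sequence without sorting the whole thing:

import heapq, random

data = [random.random() for i in xrange(1000000)]
best = heapq.nsmallest(10, data)    # much cheaper than sorted(data)[:10] for large data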

18.4.5.4. Avoiding exec and from...import *

Code in a function runs faster than code at the top level in a module because access to a function's local variables is optimized to be very fast. If a function contains an exec statement without explicit dictionaries, however, the whole function slows down. The presence of such an exec statement forces the Python compiler to avoid the modest but important optimizations it normally performs regarding access to local variables, since the exec might cause any alteration at all to the function's namespace. A from statement of the form:

from MyModule import *

wastes performance, too, since it also can alter a function's namespace unpredictably.

exec itself is also quite slow, and even more so if you apply it to a string of source code rather than to a code object. By far the best approach, for performance, for correctness, and for clarity, is to avoid exec altogether. It's most often possible to find better (faster, more robust, and clearer) solutions. If you must use exec, always use it with explicit dictionaries. If you need to exec a dynamically obtained string more than once, compile the string just once and then repeatedly exec the resulting code object.

eval works on expressions, not on statements; therefore, while still slow, it avoids some of the worst performance impacts of exec. With eval, too, you're best advised to use explicit dictionaries. If you need several evaluations of the same dynamically obtained string, compile the string once and then repeatedly eval the resulting code object.
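
A minimal sketch of that advice (the expression string and the dictionaries are arbitrary placeholders):

expr = 'x*2 + y'                              # stands in for a dynamically obtained string
code_obj = compile(expr, '<expr>', 'eval')    # compile just once...
results = []
for d in [{'x': 1, 'y': 2}, {'x': 3, 'y': 4}]:
    results.append(eval(code_obj, {}, d))     # ...then eval the code object repeatedly,
                                              # with explicit global and local dictionaries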

See "Dynamic Execution and the exec Statement" on page 328 for more details and advice about exec, eval, and compile.

18.4.5.5. Optimizing loops

Most of your program's bottlenecks will be in loops, particularly nested loops, because loop bodies tend to execute repeatedly. Python does not implicitly perform any code hoisting: if you have any code inside a loop that might be executed just once by hoisting it out of the loop, and the loop is a performance bottleneck, hoist the code out yourself. Sometimes the presence of code to hoist may not be immediately obvious:

def slower(anobject, ahugenumber):
    for i in xrange(ahugenumber): anobject.amethod(i)
def faster(anobject, ahugenumber):
    themethod = anobject.amethod
    for i in xrange(ahugenumber): themethod(i)

In this case, the code that faster hoists out of the loop is the attribute lookup anobject.amethod. slower repeats the lookup every time, while faster performs it just once. The two functions are not 100 percent equivalent: it is (barely) conceivable that executing amethod might cause such changes on anobject that the next lookup for the same named attribute fetches a different method object. This is part of why Python doesn't perform such optimizations itself. In practice, such subtle, obscure, and tricky cases happen quite seldom; you're quite safe performing such optimizations yourself, to squeeze the last drop of performance out of some crucial bottleneck.

Python is faster with local variables than with global ones. If a loop repeatedly accesses a global whose value does not change between iterations, cache the value in a local variable and have the loop access the local instead. This also applies to built-ins:

def slightly_slower(asequence, adict):
    for x in asequence: adict[x] = hex(x)
def slightly_faster(asequence, adict):
    myhex = hex
    for x in asequence: adict[x] = myhex(x)

Here, the speedup is very modest, on the order of five percent or so.

Do not cache None. Since Python 2.4, the compiler treats None as a constant, so no further optimization is needed.

List comprehensions can be faster than loops, and so can map and filter. For optimization purposes, try changing loops into list comprehensions or map and filter calls where feasible. The performance advantage of map and filter is nullified if you have to use a lambda or an extra level of function call. Only when you pass to map or filter a built-in function, or a function you'd have to call anyway even from an explicit loop, do you stand to gain.

The loops that you can replace most naturally with list comprehensions, or map and filter calls, are ones that build up a list by repeatedly calling append on the list. The following example shows this optimization in a micro-performance benchmark script:

import time, operator

def slow(asequence):
    result = []
    for x in asequence: result.append(-x)
    return result

def middling(asequence):
    return map(operator.neg, asequence)

def fast(asequence):
    return [-x for x in asequence]

biggie = xrange(500*1000)
tentimes = [None]*10
def timit(afunc):
    lobi = biggie
    start = time.clock()
    for x in tentimes: afunc(lobi)
    stend = time.clock()
    return "%-10s: %.2f" % (afunc.__name__, stend-start)

for afunc in slow, middling, fast, fast, middling, slow:
    print timit(afunc)

Running this example with Python 2.4 on my laptop shows that fast takes 3.62 seconds, middling 4.71 seconds, and slow 6.91 seconds. In other words, on this machine, slow (the loop of append method calls) is about 47 percent slower than middling (the single map call), and middling, in turn, is about 30 percent slower than fast (the list comprehension). The list comprehension is the most direct way to express the task being micro-benchmarked in this example, so, not surprisingly, it's also fastest, almost two times faster than the loop of append method calls.

18.4.5.6. Optimizing I/O

If your program does substantial amounts of I/O, it's likely that performance bottlenecks are due to I/O, not to computation. Such programs are said to be I/O-bound, rather than CPU-bound. Your operating system tries to optimize I/O performance, but you can help it in a couple of ways. One such way is to perform your I/O in chunks of a size that is optimal for performance, rather than simply being convenient for your program's operations. Another way is to use threading.

From the point of view of a program's convenience and simplicity, the ideal amount of data to read or write at a time is often small (one character or one line) or very large (an entire file at a time). That's often okay: Python and your operating system work behind the scenes to let your program use convenient logical chunks for I/O, while arranging physical I/O operations with chunk sizes more attuned to performance. Reading and writing a whole file at a time is quite likely to be okay for performance as long as the file is not very large. Specifically, file-at-a-time I/O is fine as long as the file's data fits comfortably in physical memory, leaving ample memory available for your program and operating system to perform whatever other tasks they're performing at the same time. The hard problems of I/O-bound performance tend to come with huge files.
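
Here is a minimal sketch of performing I/O in chunks of a performance-friendly size (the one-megabyte chunk size is only an arbitrary starting point to tune, not a recommendation):

def copy_file(inname, outname, chunk_size=1024*1024):
    infile = open(inname, 'rb')
    outfile = open(outname, 'wb')
    while True:
        chunk = infile.read(chunk_size)    # one largish physical read at a time
        if not chunk:
            break
        outfile.write(chunk)
    outfile.close()
    infile.close()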

If performance is an issue, don't use a file's readline method, which is limited in the amount of chunking and buffering it can perform. (Using writelines, on the other hand, gives no performance problem when that method is convenient for your program.) When reading a text file, loop directly on the file object to get one line at a time with best performance. If the file isn't too huge, and so can conveniently fit in memory, time two versions of your program, one looping directly on the file object, the other calling readlines to read the whole file into memory. Either may prove faster.

For binary files, particularly large binary files whose contents you need just a part of on each run of your program, module mmap (covered in "The mmap Module" on page 360) can often give you both good performance and program simplicity.

Making an I/O-bound program multithreaded may sometimes afford substantial performance gains if you can arrange your program's architecture accordingly. Start a few worker threads devoted exclusively to I/O, have the computational threads request I/O operations from the I/O threads via Queue instances, and post the request for each input operation as soon as you know you'll eventually need that data. Performance will increase only if there are other tasks your computational threads can perform while I/O threads are blocked waiting for data. Basically, you get better performance this way if you can manage to overlap computation and waiting for data by having different threads do the computing and the waiting. (See "Threads in Python" on page 341 for detailed coverage of Python threading and a suggested architecture.) On the other hand, if a substantial fraction of your I/O is on the Net, an even faster and definitely more scalable approach is to eschew threads in favor of asynchronous (event-driven) architectures, as covered in "Event-Driven Socket Programs" on page 533.
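
Here is a minimal, hedged sketch of that architecture (paths_to_read, process, and the number of worker threads are placeholders, not recommendations): a few threads do nothing but read files, while the main thread posts its requests early and then computes on results as they arrive:

import threading, Queue

request_q = Queue.Queue()
data_q = Queue.Queue()

def io_worker():
    while True:
        path = request_q.get()
        if path is None:                  # sentinel: time to shut down
            break
        f = open(path, 'rb')
        data = f.read()
        f.close()
        data_q.put((path, data))

workers = [threading.Thread(target=io_worker) for i in range(4)]
for w in workers: w.start()

for path in paths_to_read:                # post requests as soon as you know you need them
    request_q.put(path)

for i in range(len(paths_to_read)):
    path, data = data_q.get()             # blocks only when no data is ready yet
    process(path, data)                   # hypothetical computation on the data

for w in workers: request_q.put(None)     # one sentinel per worker
for w in workers: w.join()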

