I l@ve RuBoard

2.12 Directory Tools

One of the more common tasks in the shell utilities domain is applying an operation to a set of files in a directory -- a "folder" in Windows-speak. By running a script on a batch of files, we can automate (that is, script) tasks we might have to otherwise run repeatedly by hand.

For instance, suppose you need to search all of your Python files in a development directory for a global variable name (perhaps you've forgotten where it is used). There are many platform-specific ways to do this (e.g., the grep command in Unix), but Python scripts that accomplish such tasks will work on every platform where Python works -- Windows, Unix, Linux, Macintosh, and just about any other in common use today. Simply copy your script to any machine you wish to use it on, and it will work, regardless of which other tools are available there.

2.12.1 Walking One Directory

The most common way to go about writing such tools is to first grab hold of a list of the names of the files you wish to process, and then step through that list with a Python for loop, processing each file in turn. The trick we need to learn here, then, is how to get such a directory list within our scripts. There are at least three options: running shell listing commands with os.popen, matching filename patterns with glob.glob, and getting directory listings with os.listdir. They vary in interface, result format, and portability.

2.12.1.1 Running shell listing commands with os.popen

Quick: How did you go about getting directory file listings before you heard of Python? If you're new to shell tools programming, the answer may be: "Well, I started a Windows file explorer and clicked on stuff," but I'm thinking in terms of less GUI-oriented command-line mechanisms here (and answers submitted in Perl and Tcl only get partial credit).

On Unix, directory listings are usually obtained by typing ls in a shell; on Windows, they can be generated with a dir command typed in an MS-DOS console box. Because Python scripts may use os.popen to run any command line we can type in a shell, they also are the most general way to grab a directory listing inside a Python program. We met os.popen earlier in this chapter; it runs a shell command string and gives us a file object from which we can read the command's output. To illustrate, let's first assume the following directory structures (yes, I have both dir and ls commands on my Windows laptop; old habits die hard):

C:\temp>dir /B
about-pp.html
python1.5.tar.gz
about-pp2e.html
about-ppr2e.html
newdir

C:\temp>ls
about-pp.html     about-ppr2e.html  python1.5.tar.gz
about-pp2e.html   newdir

C:\temp>ls newdir
more   temp1  temp2  temp3

The newdir name is a nested subdirectory in C:\temp here. Now, scripts can grab a listing of file and directory names at this level by simply spawning the appropriate platform-specific command line, and reading its output (the text normally thrown up on the console window):

C:\temp>python
>>> import os
>>> os.popen('dir /B').readlines(  )
['about-pp.html\012', 'python1.5.tar.gz\012', 'about-pp2e.html\012', 
'about-ppr2e.html\012', 'newdir\012']

Lines read from a shell command come back with a trailing end-line character, but it's easy enough to slice off:

>>> for line in os.popen('dir /B').readlines(  ): 
...     print line[:-1]
...
about-pp.html
python1.5.tar.gz
about-pp2e.html
about-ppr2e.html
newdir

Both dir and ls commands let us be specific about filename patterns to be matched and directory names to be listed; again, we're just running shell commands here, so anything you can type at a shell prompt goes:

>>> os.popen('dir *.html /B').readlines(  )
['about-pp.html\012', 'about-pp2e.html\012', 'about-ppr2e.html\012']

>>> os.popen('ls *.html').readlines(  )
['about-pp.html\012', 'about-pp2e.html\012', 'about-ppr2e.html\012']

>>> os.popen('dir newdir /B').readlines(  )
['temp1\012', 'temp2\012', 'temp3\012', 'more\012']

>>> os.popen('ls newdir').readlines(  )
['more\012', 'temp1\012', 'temp2\012', 'temp3\012']

These calls use general tools and all work as advertised. As we noted earlier, though, the downsides of os.popen are that it is nonportable (it doesn't work well in a Windows GUI application in Python 1.5.2 and earlier, and requires using a platform-specific shell command), and it incurs a performance hit to start up an independent program. The following two alternative techniques do better on both counts.

2.12.1.2 The glob module

The term "globbing" comes from the * wildcard character in filename patterns -- per computing folklore, a * matches a "glob" of characters. In less poetic terms, globbing simply means collecting the names of all entries in a directory -- files and subdirectories -- whose names match a given filename pattern. In Unix shells, globbing expands filename patterns within a command line into all matching file- names before the command is ever run. In Python, we can do something similar by calling the glob.glob built-in with a pattern to expand:

>>> import glob
>>> glob.glob('*')
['about-pp.html', 'python1.5.tar.gz', 'about-pp2e.html', 'about-ppr2e.html',
'newdir']

>>> glob.glob('*.html')
['about-pp.html', 'about-pp2e.html', 'about-ppr2e.html']

>>> glob.glob('newdir/*')
['newdir\\temp1', 'newdir\\temp2', 'newdir\\temp3', 'newdir\\more']

The glob call accepts the usual filename pattern syntax used in shells (e.g., ? means any one character, * means any number of characters, and [] is a character selection set).^[12] The pattern should include a directory path if you wish to glob in something other than the current working directory, and the module accepts either Unix or DOS-style directory separators (/ or \). This call also is implemented without spawning a shell command, and so is likely to be faster and more portable across all Python platforms than the os.popen schemes shown earlier.

^[12] In fact, glob just uses the standard fnmatch module to match name patterns; see the fnmatch description later in this chapter in Section 2.12.3 for more details.

Technically speaking, glob is a bit more powerful than described so far. In fact, using it to list files in one directory is just one use of its pattern-matching skills. For instance, it can also be used to collect matching names across multiple directories, simply because each level in a passed-in directory path can be a pattern too:

C:\temp>python
>>> import glob
>>> for name in glob.glob('*examples/L*.py'): print name
...
cpexamples\Launcher.py
cpexamples\Launch_PyGadgets.py
cpexamples\LaunchBrowser.py
cpexamples\launchmodes.py
examples\Launcher.py
examples\Launch_PyGadgets.py
examples\LaunchBrowser.py
examples\launchmodes.py

>>> for name in glob.glob(r'*\*\visitor_find*.py'): print name
...
cpexamples\PyTools\visitor_find.py
cpexamples\PyTools\visitor_find_quiet2.py
cpexamples\PyTools\visitor_find_quiet1.py
examples\PyTools\visitor_find.py
examples\PyTools\visitor_find_quiet2.py
examples\PyTools\visitor_find_quiet1.py

In the first call here, we get back filenames from two different directories that matched the *examples pattern; in the second, both of the first directory levels are wildcards, so Python collects all possible ways to reach the base filenames. Using os.popen to spawn shell commands only achieves the same effect if the underlying shell or listing command does too.

2.12.1.3 The os.listdir call

The os module's listdir call provides yet another way to collect filenames in a Python list. It takes a simple directory name string, not a filename pattern, and returns a list containing the names of all entries in that directory -- both simple files and nested directories -- for use in the calling script:

>>> os.listdir('.')
['about-pp.html', 'python1.5.tar.gz', 'about-pp2e.html', 'about-ppr2e.html',
'newdir']

>>> os.listdir(os.curdir)
['about-pp.html', 'python1.5.tar.gz', 'about-pp2e.html', 'about-ppr2e.html',
'newdir']

>>> os.listdir('newdir')
['temp1', 'temp2', 'temp3', 'more']

This too is done without resorting to shell commands, and so is portable to all major Python platforms. The result is not in any particular order (but can be sorted with the list sort method), returns base filenames without their directory path prefixes, and includes names of both files and directories at the listed level.

To compare all three listing techniques, let's run them side by side on an explicit directory here. They differ in some ways but are mostly just variations on a theme -- os.popen sorts names and returns end-of-lines, glob.glob accepts a pattern and returns filenames with directory prefixes, and os.listdir takes a simple directory name and returns names without directory prefixes:

>>> os.popen('ls C:\PP2ndEd').readlines(  )
['README.txt\012', 'cdrom\012', 'chapters\012', 'etc\012', 'examples\012',
'examples.tar.gz\012', 'figures\012', 'shots\012']

>>> glob.glob('C:\PP2ndEd\*')
['C:\\PP2ndEd\\examples.tar.gz', 'C:\\PP2ndEd\\README.txt', 
'C:\\PP2ndEd\\shots', 'C:\\PP2ndEd\\figures', 'C:\\PP2ndEd\\examples',
'C:\\PP2ndEd\\etc', 'C:\\PP2ndEd\\chapters', 'C:\\PP2ndEd\\cdrom']

>>> os.listdir('C:\PP2ndEd')
['examples.tar.gz', 'README.txt', 'shots', 'figures', 'examples', 'etc',
'chapters', 'cdrom']

Of these three, glob and listdir are generally better options if you care about script portability, and listdir seems fastest in recent Python releases (but gauge its performance yourself -- implementations may change over time).

2.12.1.4 Splitting and joining listing results

In the last example, I pointed out that glob returns names with directory paths, but listdir gives raw base filenames. For convenient processing, scripts often need to split glob results into base files, or expand listdir results into full paths. Such translations are easy if we let the os.path module do all the work for us. For example, a script that intends to copy all files elsewhere will typically need to first split off the base filenames from glob results so it can add different directory names on the front:

>>> dirname = r'C:\PP2ndEd'
>>> for file in glob.glob(dirname + '/*'):
...     head, tail = os.path.split(file)
...     print head, tail, '=>', ('C:\\Other\\' + tail)
...
C:\PP2ndEd examples.tar.gz => C:\Other\examples.tar.gz
C:\PP2ndEd README.txt => C:\Other\README.txt
C:\PP2ndEd shots => C:\Other\shots
C:\PP2ndEd figures => C:\Other\figures
C:\PP2ndEd examples => C:\Other\examples
C:\PP2ndEd etc => C:\Other\etc
C:\PP2ndEd chapters => C:\Other\chapters
C:\PP2ndEd cdrom => C:\Other\cdrom

Here, the names after the => represent names that files might be moved to. Conversely, a script that means to process all files in a different directory than the one it runs in will probably need to prepend listdir results with the target directory name, before passing filenames on to other tools:

>>> for file in os.listdir(dirname):
...     print os.path.join(dirname, file)
...
C:\PP2ndEd\examples.tar.gz
C:\PP2ndEd\README.txt
C:\PP2ndEd\shots
C:\PP2ndEd\figures
C:\PP2ndEd\examples
C:\PP2ndEd\etc
C:\PP2ndEd\chapters
C:\PP2ndEd\cdrom

2.12.2 Walking Directory Trees

Notice, though, that all of the preceding techniques only return the names of files in a single directory. What if you want to apply an operation to every file in every directory and subdirectory in a directory tree?

For instance, suppose again that we need to find every occurrence of a global name in our Python scripts. This time, though, our scripts are arranged into a module package : a directory with nested subdirectories, which may have subdirectories of their own. We could rerun our hypothetical single-directory searcher in every directory in the tree manually, but that's tedious, error-prone, and just plain no fun.

Luckily, in Python it's almost as easy to process a directory tree as it is to inspect a single directory. We can either collect names ahead of time with the find module, write a recursive routine to traverse the tree, or use a tree-walker utility built-in to the os module. Such tools can be used to search, copy, compare, and otherwise process arbitrary directory trees on any platform that Python runs on (and that's just about everywhere).

2.12.2.1 The find module

The first way to go hierarchical is to collect a list of all names in a directory tree ahead of time, and step through that list in a loop. Like the single-directory tools we just met, a call to the find.find built-in returns a list of both file and directory names. Unlike the tools described earlier, find.find also returns pathnames of matching files nested in subdirectories, all the way to the bottom of a tree:

C:\temp>python
>>> import find
>>> find.find('*')
['.\\about-pp.html', '.\\about-pp2e.html', '.\\about-ppr2e.html', 
'.\\newdir', '.\\newdir\\more', '.\\newdir\\more\\xxx.txt',
'.\\newdir\\more\\yyy.txt', '.\\newdir\\temp1', '.\\newdir\\temp2',
'.\\newdir\\temp3', '.\\python1.5.tar.gz']

>>> for line in find.find('*'): print line
...
.\about-pp.html
.\about-pp2e.html
.\about-ppr2e.html
.\newdir
.\newdir\more
.\newdir\more\xxx.txt
.\newdir\more\yyy.txt
.\newdir\temp1
.\newdir\temp2
.\newdir\temp3
.\python1.5.tar.gz

We get back a list of full pathnames, that each include the top-level directory's path. By default, find collects names matching the passed-in pattern in the tree rooted at the current working directory, known as ".". If we want a more specific list, we can pass in both a filename pattern and a directory tree root to start at; here's how to collect HTML filenames at "." and below:

>>> find.find('*.html', '.')
['.\\about-pp.html', '.\\about-pp2e.html', '.\\about-ppr2e.html']

Incidentally, find.find is also the Python library's equivalent to platform-specific shell commands such as a find -print on Unix and Linux, and dir /B /S on DOS and Windows. Since we can usually run such shell commands in a Python script with os.popen, the following does the same work as find.find, but is inherently nonportable, and must start up a separate program along the way:

>>> import os
>>> for line in os.popen('dir /B /S').readlines(  ): print line,
...
C:\temp\about-pp.html
C:\temp\python1.5.tar.gz
C:\temp\about-pp2e.html
C:\temp\about-ppr2e.html
C:\temp\newdir
C:\temp\newdir\temp1
C:\temp\newdir\temp2
C:\temp\newdir\temp3
C:\temp\newdir\more
C:\temp\newdir\more\xxx.txt
C:\temp\newdir\more\yyy.txt

If the find calls don't seem to work in your Python, try changing the import statement used to load the module from import find to from PP2E.PyTools import find. Alas, the Python standard library's find module has been marked as "deprecated" as of Python 1.6. That means it may be deleted from the standard Python distribution in the future, so pay attention to the next section; we'll use its topic later to write our own find module -- one that is also shipped on this book's CD (see http://examples.oreilly.com/python2).

2.12.2.2 The os.path.walk visitor

To make it easy to apply an operation to all files in a tree, Python also comes with a utility that scans trees for us, and runs a provided function at every directory along the way. The os.path.walk function is called with a directory root, function object, and optional data item, and walks the tree at the directory root and below. At each directory, the function object passed in is called with the optional data item, the name of the current directory, and a list of filenames in that directory (obtained from os.listdir). Typically, the function we provide scans the filenames list to process files at each directory level in the tree.

That description might sound horribly complex the first time you hear it, but os.path.walk is fairly straightforward once you get the hang of it. In the following code, for example, the lister function is called from os.path.walk at each directory in the tree rooted at ".". Along the way, lister simply prints the directory name, and all the files at the current level (after prepending the directory name). It's simpler in Python than in English:

>>> import os
>>> def lister(dummy, dirname, filesindir):
...     print '[' + dirname + ']'
...     for fname in filesindir:
...         print os.path.join(dirname, fname)         # handle one file
...
>>> os.path.walk('.', lister, None)
[.]
.\about-pp.html
.\python1.5.tar.gz
.\about-pp2e.html
.\about-ppr2e.html
.\newdir
[.\newdir]
.\newdir\temp1
.\newdir\temp2
.\newdir\temp3
.\newdir\more
[.\newdir\more]
.\newdir\more\xxx.txt
.\newdir\more\yyy.txt

In other words, we've coded our own custom and easily changed recursive directory listing tool in Python. Because this may be something we would like to tweak and reuse elsewhere, let's make it permanently available in a module file, shown in Example 2-15, now that we've worked out the details interactively.

Example 2-15. PP2E\System\Filetools\lister_walk.py

# list file tree with os.path.walk
import sys, os

def lister(dummy, dirName, filesInDir):              # called at each dir
    print '[' + dirName + ']'
    for fname in filesInDir:                         # includes subdir names
        path = os.path.join(dirName, fname)          # add dir name prefix
        if not os.path.isdir(path):                  # print simple files only
            print path

if __name__ == '__main__':
    os.path.walk(sys.argv[1], lister, None)          # dir name in cmdline

This is the same code, except that directory names are filtered out of the filenames list by consulting the os.path.isdir test, to avoid listing them twice (see -- it's been tweaked already). When packaged this way, the code can also be run from a shell command line. Here it is being launched from a different directory, with the directory to be listed passed in as a command-line argument:

C:\...\PP2E\System\Filetools>python lister_walk.py C:\Temp
[C:\Temp]
C:\Temp\about-pp.html
C:\Temp\python1.5.tar.gz
C:\Temp\about-pp2e.html
C:\Temp\about-ppr2e.html
[C:\Temp\newdir]
C:\Temp\newdir\temp1
C:\Temp\newdir\temp2
C:\Temp\newdir\temp3
[C:\Temp\newdir\more]
C:\Temp\newdir\more\xxx.txt
C:\Temp\newdir\more\yyy.txt

The walk paradigm also allows functions to tailor the set of directories visited by changing the file list argument in place. The library manual documents this further, but it's probably more instructive to simply know what walk truly looks like. Here is its actual Python-coded implementation for Windows platforms, with comments added to help demystify its operation:

def walk(top, func, arg):                  # top is the current dirname
    try:
        names = os.listdir(top)            # get all file/dir names here
    except os.error:                       # they have no path prefix
        return
    func(arg, top, names)                  # run func with names list here
    exceptions = ('.', '..')
    for name in names:                     # step over the very same list
        if name not in exceptions:         # but skip self/parent names
            name = join(top, name)         # add path prefix to name
            if isdir(name):
                walk(name, func, arg)      # descend into subdirs here

Notice that walk generates filename lists at each level with os.listdir, a call that collects both file and directory names in no particular order, and returns them without their directory paths. Also note that walk uses the very same list returned by os.listdir and passed to the function you provide, to later descend into subdirectories (variable names). Because lists are mutable objects that can be changed in place, if your function modifies the passed-in filenames list, it will impact what walk does next. For example, deleting directory names will prune traversal branches, and sorting the list will order the walk.

2.12.2.3 Recursive os.listdir traversals

The os.path.walk tool does tree traversals for us, but it's sometimes more flexible, and hardly any more work, to do it ourself. The following script recodes the directory listing script with a manual recursive traversal function. The mylister function in Example 2-16 is almost the same as lister in the prior script, but calls os.listdir to generate file paths manually, and calls itself recursively to descend into subdirectories.

Example 2-16. PP2E\System\Filetools\lister_recur.py

# list files in dir tree by recursion
import sys, os

def mylister(currdir):
    print '[' + currdir + ']'
    for file in os.listdir(currdir):              # list files here
        path = os.path.join(currdir, file)        # add dir path back
        if not os.path.isdir(path):
            print path
        else:
            mylister(path)                        # recur into subdirs

if __name__ == '__main__': 
    mylister(sys.argv[1])                         # dir name in cmdline

This version is packaged as a script too (this is definitely too much code to type at the interactive prompt); its output is identical when run as a script:

C:\...\PP2E\System\Filetools>python lister_recur.py C:\Temp
[C:\Temp]
C:\Temp\about-pp.html
C:\Temp\python1.5.tar.gz
C:\Temp\about-pp2e.html
C:\Temp\about-ppr2e.html
[C:\Temp\newdir]
C:\Temp\newdir\temp1
C:\Temp\newdir\temp2
C:\Temp\newdir\temp3
[C:\Temp\newdir\more]
C:\Temp\newdir\more\xxx.txt
C:\Temp\newdir\more\yyy.txt

But this file is just as useful when imported and called elsewhere:

C:\temp>python
>>> from PP2E.System.Filetools.lister_recur import mylister
>>> mylister('.')
[.]
.\about-pp.html
.\python1.5.tar.gz
.\about-pp2e.html
.\about-ppr2e.html
[.\newdir]
.\newdir\temp1
.\newdir\temp2
.\newdir\temp3
[.\newdir\more]
.\newdir\more\xxx.txt
.\newdir\more\yyy.txt

We will make better use of most of this section's techniques in later examples in Chapter 5, and this book at large. For example, scripts for copying and comparing directory trees use the tree-walker techniques listed previously. Watch for these tools in action along the way. If you are interested in directory processing, also see the coverage of Python's old grep module in Chapter 5; it searches files, and can be applied to all files in a directory when combined with the glob module, but simply prints results and does not traverse directory trees by itself.

2.12.3 Rolling Your Own find Module

Over the last eight years, I've learned to trust Python's Benevolent Dictator. Guido generally does the right thing, and if you don't think so, it's usually only because you haven't yet realized how your own position is flawed. Trust me on this. On the other hand, it's not completely clear why the standard find module I showed you seems to have fallen into deprecation; it's a useful tool. In fact, I use it a lot -- it is often nice to be able to grab a list of files to process in a single function call, and step through it in a for loop. The alternatives -- os.path.walk, and recursive functions -- are more code-y, and tougher for beginners to digest.

I suppose the find module's followers (if there be any) could have defended it in long, drawn-out debates on the Internet, that would have spanned days or weeks, been joined by a large cast of heroic combatants, and gone just about nowhere. I decided to spend ten minutes whipping up a custom alternative instead. The module in Example 2-17 uses the standard os.path.walk call described earlier to reimplement a find operation for Python.

Example 2-17. PP2E\PyTools\find.py

#!/usr/bin/python
########################################################
# custom version of the now deprecated find module 
# in the standard library--import as "PyTools.find";
# equivalent to the original, but uses os.path.walk,
# has no support for pruning subdirs in the tree, and
# is instrumented to be runnable as a top-level script;
# results list sort differs slightly for some trees;
# exploits tuple unpacking in function argument lists;
########################################################

import fnmatch, os

def find(pattern, startdir=os.curdir):
    matches = []
    os.path.walk(startdir, findvisitor, (matches, pattern))
    matches.sort(  )
    return matches

def findvisitor((matches, pattern), thisdir, nameshere):
    for name in nameshere:
        if fnmatch.fnmatch(name, pattern):
            fullpath = os.path.join(thisdir, name)
            matches.append(fullpath)

if __name__ == '__main__':
    import sys
    namepattern, startdir = sys.argv[1], sys.argv[2]
    for name in find(namepattern, startdir): print name

There's not much to this file; but calling its find function provides the same utility as the deprecated find standard module, and is noticeably easier than rewriting all of this file's code every time you need to perform a find-type search. To process every Python file in a tree, for instance, I simply type:

from PP2E.PyTools import find
for name in find.find('*.py'):
    ...do something with name...

As a more concrete example, I use the following simple script to clean out any old output text files located anywhere in the book examples tree:

C:\...\PP2E>type PyTools\cleanoutput.py
import os                                  # delete old output files in tree
from PP2E.PyTools.find import find         # only need full path if I'm moved
for filename in find('*.out.txt'):         # use cat instead of type in Linux
    print filename
    if raw_input('View?') == 'y':
        os.system('type ' + filename)
    if raw_input('Delete?') == 'y':
        os.remove(filename)

C:\temp\examples>python %X%\PyTools\cleanoutput.py
.\Internet\Cgi-Web\Basics\languages.out.txt
View?
Delete?
.\Internet\Cgi-Web\PyErrata\AdminTools\dbaseindexed.out.txt
View?
Delete?y

To achieve such code economy, the custom find module calls os.path.walk to register a function to be called per directory in the tree, and simply adds matching filenames to the result list along the way.

New here, though, is the fnmatch module -- a standard Python module that performs Unix-like pattern matching against filenames, and was also used by the original find. This module supports common operators in name pattern strings: * (to match any number of characters), ? (to match any single character), and [...] and [!...] (to match any character inside the bracket pairs, or not); other characters match themselves.^[13] To make sure that this alternative's results are similar, I also wrote the test module shown in Example 2-18.

^[13] Unlike the re module, fnmatch supports only common Unix shell matching operators, not full-blown regular expression patterns; to understand why this matters, see Chapter 18 for more details.

Example 2-18. PP2E\PyTools\find-test.py

############################################################
# test custom find; the builtin find module is deprecated:
# if it ever goes away completely, replace all "import find"
# with "from PP2E.PyTools import find" (or add PP2E\PyTools
# to your path setting and just "import find"); this script 
# takes 4 seconds total time on my 650mhz Win98 notebook to
# run 10 finds over a directory tree of roughly 1500 names; 
############################################################

import sys, os, string
for dir in sys.path:
    if string.find(os.path.abspath(dir), 'PyTools') != -1:
        print 'removing', repr(dir)
        sys.path.remove(dir)   # else may import both finds from PyTools, '.'!

import find                    # get deprecated builtin (for now)
import PP2E.PyTools.find       # later use: from PP2E.PyTools import find
print  find
print  PP2E.PyTools.find

assert find.find != PP2E.PyTools.find.find        # really different?
assert string.find(str(find), 'Lib') != -1        # should be after path remove
assert string.find(str(PP2E.PyTools.find), 'PyTools') != -1 

startdir = r'C:\PP2ndEd\examples\PP2E'
for pattern in ('*.py', '*.html', '*.c', '*.cgi', '*'):
    print pattern, '=>'
    list1 = find.find(pattern, startdir)
    list2 = PP2E.PyTools.find.find(pattern, startdir)
    print len(list1), list1[-1]
    print len(list2), list2[-1]
    print list1 == list2,; list1.sort(  ); print list1 == list2

There is some magic at the top of this script that I need to explain. To make sure that it can load both the standard library's find module and the custom one in PP2E\PyTools, it must delete the entry (or entries) on the module search path that point to the PP2E\PyTools directory, and import the custom version with a full package directory -- PP2E.PyTools.find. If not, we'd always get the same find module, the one in PyTools, no matter where this script is run from.

Here's why. Recall that Python always adds the directory containing a script being run to the front of sys.path. If we didn't delete that entry here, the import find statement would always load the custom find in PyTools, because the custom find.py module is in the same directory as the find-test.py script. The script's home directory would effectively hide the standard library's find. If that doesn't make sense, go back and reread Section 2.7 earlier in this chapter.

Below is the output of this tester, along with a few command-line invocations; unlike the original find, the custom version in Example 2-18 can be run as a command-line tool too. If you study the test output closely, you'll notice that the custom find differs only in an occasional sort order that I won't go into further here (the original find module used a recursive function, not os.path.walk); the "0 1" lines mean that results differ in order, but not content. Since find callers don't generally depend on precise filename result ordering, this is trivial:

C:\temp>python %X%\PyTools\find-test.py
removing 'C:\\PP2ndEd\\examples\\PP2E\\PyTools'
<module 'find' from 'C:\Program Files\Python\Lib\find.pyc'>
<module 'PP2E.PyTools.find' from 'C:\PP2ndEd\examples\PP2E\PyTools\find.pyc'>
*.py =>
657 C:\PP2ndEd\examples\PP2E\tounix.py
657 C:\PP2ndEd\examples\PP2E\tounix.py
0 1
*.html =>
37 C:\PP2ndEd\examples\PP2E\System\Filetools\template.html
37 C:\PP2ndEd\examples\PP2E\System\Filetools\template.html
1 1
*.c =>
46 C:\PP2ndEd\examples\PP2E\Other\old-Integ\embed.c
46 C:\PP2ndEd\examples\PP2E\Other\old-Integ\embed.c
0 1
*.cgi =>
24 C:\PP2ndEd\examples\PP2E\Internet\Cgi-Web\PyMailCgi\onViewSubmit.cgi
24 C:\PP2ndEd\examples\PP2E\Internet\Cgi-Web\PyMailCgi\onViewSubmit.cgi
1 1
* =>
1519 C:\PP2ndEd\examples\PP2E\xferall.linux.csh
1519 C:\PP2ndEd\examples\PP2E\xferall.linux.csh
0 1

C:\temp>python %X%\PyTools\find.py *.cxx C:\PP2ndEd\examples\PP2E
C:\PP2ndEd\examples\PP2E\Extend\Swig\Shadow\main.cxx
C:\PP2ndEd\examples\PP2E\Extend\Swig\Shadow\number.cxx

C:\temp>python %X%\PyTools\find.py *.asp C:\PP2ndEd\examples\PP2E
C:\PP2ndEd\examples\PP2E\Internet\Other\asp-py.asp

C:\temp>python %X%\PyTools\find.py *.i C:\PP2ndEd\examples\PP2E
C:\PP2ndEd\examples\PP2E\Extend\Swig\Environ\environ.i
C:\PP2ndEd\examples\PP2E\Extend\Swig\Shadow\number.i
C:\PP2ndEd\examples\PP2E\Extend\Swig\hellolib.i

C:\temp>python %X%\PyTools\find.py setup*.csh C:\PP2ndEd\examples\PP2E
C:\PP2ndEd\examples\PP2E\Config\setup-pp-embed.csh
C:\PP2ndEd\examples\PP2E\Config\setup-pp.csh
C:\PP2ndEd\examples\PP2E\EmbExt\Exports\ClassAndMod\setup-class.csh
C:\PP2ndEd\examples\PP2E\Extend\Swig\setup-swig.csh

[filename sort scheme]
C:\temp> python
>>> l = ['ccc', 'bbb', 'aaa', 'aaa.xxx', 'aaa.yyy', 'aaa.xxx.nnn']
>>> l.sort(  )
>>> l
['aaa', 'aaa.xxx', 'aaa.xxx.nnn', 'aaa.yyy', 'bbb', 'ccc']

Finally, if an example in this book fails in a future Python release because there is no find to be found, simply change find-module imports in the source code to say from PP2E.PyTools import find instead of import find. The former form will find the custom find module in the book's example package directory tree; the old module in the standard Python library is ignored (if it is still there at all). And if you are brave enough to add the PP2E\PyTools directory itself to your PYTHONPATH setting, all original import find statements will continue to work unchanged.

Better still, do nothing at all -- most find-based examples in this book automatically pick the alternative by catching import exceptions, just in case they aren't located in the PyTools directory:

try:
    import find
except ImportError:
    from PP2E.PyTools import find

The find module may be gone, but it need not be forgotten.

Python Versus csh

If you are familiar with other common shell script languages, it might be useful to see how Python compares. Here is a simple script in a Unix shell language called csh that mails all the files in the current working directory having a suffix of .py (i.e., all Python source files) to a hopefully fictitious address:

#!/bin/csh foreach x (*.py) echo $x mail eric@halfabee.com -s $x < $x end

The equivalent Python script looks similar:

#!/usr/bin/python import os, glob for x in glob.glob('*.py'): print x os.system('mail eric@halfabee.com -s %s < %s' % (x, x))

but is slightly more verbose. Since Python, unlike csh, isn't meant just for shell scripts, system interfaces must be imported, and called explicitly. And since Python isn't just a string-processing language, character strings must be enclosed in quotes as in C.

Although this can add a few extra keystrokes in simple scripts like this, being a general-purpose language makes Python a better tool, once we leave the realm of trivial programs. We could, for example, extend the preceding script to do things like transfer files by FTP, pop up a GUI message selector and status bar, fetch messages from an SQL database, and employ COM objects on Windows -- all using standard Python tools.

Python scripts also tend to be more portable to other platforms than csh. For instance, if we used the Python SMTP interface to send mail rather than relying on a Unix command-line mail tool, the script would run on any machine with Python and an Internet link (as we'll see in Chapter 11, SMTP only requires sockets). And like C, we don't need $ to evaluate variables; what else would you expect in a free language?

I l@ve RuBoard