
5.4 Searching Directory Trees

Engineers love to change things. As I was writing this book, I found it almost irresistible to move and rename directories, variables, and shared modules in the book examples tree whenever I thought I'd stumbled onto a more coherent structure. That was fine early on, but as the tree became more intertwined, this became a maintenance nightmare. Things like program directory paths and module names were hardcoded all over the place -- in package import statements, program startup calls, text notes, configuration files, and more.

One way to repair these references, of course, is to edit every file in the directory tree by hand, searching each for information that has changed. That's so tedious as to be utterly impossible in this book's examples tree, though; as I wrote these words, the examples tree contained 118 directories and 1342 files! (To count for yourself, run the command line python PyTools/visitor.py 1 in the PP2E examples root directory.) Clearly, I needed a way to automate updates after changes.

5.4.1 Greps and Globs in Shells and Python

There is a standard way to search files for strings on Unix and Linux systems: the command-line program grep and its relatives list all lines in one or more files containing a string or string pattern.[7] Given that Unix shells expand (i.e., "glob") filename patterns automatically, a command such as grep popen *.py will search a single directory's Python files for string "popen". Here's such a command in action on Windows (I installed a commercial Unix-like fgrep program on my Windows 98 laptop because I missed it too much there):

[7] In fact, the act of searching files often goes by the colloquial name "grepping" among developers who have spent any substantial time in the Unix ghetto.

C:\...\PP2E\System\Filetools>fgrep popen *.py
diffall.py:# - we could also os.popen a diff (unix) or fc (dos)
dirdiff.py:# - use os.popen('ls...') or glob.glob + os.path.split
dirdiff6.py:    files1 = os.popen('ls %s' % dir1).readlines(  )
dirdiff6.py:    files2 = os.popen('ls %s' % dir2).readlines(  )
testdirdiff.py:    expected = expected + os.popen(test % 'dirdiff').read(  )
testdirdiff.py:        output = output + os.popen(test % script).read(  )

DOS has a command for searching files too -- find, not to be confused with the Unix find directory walker command:

C:\...\PP2E\System\Filetools>find /N "popen" testdirdiff.py

---------- testdirdiff.py
[8]    expected = expected + os.popen(test % 'dirdiff').read(  )
[15]        output = output + os.popen(test % script).read(  )

You can do the same within a Python script, by either running the previously mentioned shell command with os.system or os.popen, or combining the grep and glob built-in modules. We met the glob module in Chapter 2; it expands a filename pattern into a list of matching filename strings (much like a Unix shell). The standard library also includes a grep module, which acts like a Unix grep command: grep.grep prints lines containing a pattern string among a set of files. When used with glob, the effect is much like the fgrep command:

>>> from grep import grep
>>> from glob import glob
>>> grep('popen', glob('*.py'))
diffall.py:  16: # - we could also os.popen a diff (unix) or fc (dos)
dirdiff.py:  12: # - use os.popen('ls...') or glob.glob + os.path.split
dirdiff6.py:  19:     files1 = os.popen('ls %s' % dir1).readlines(  )
dirdiff6.py:  20:     files2 = os.popen('ls %s' % dir2).readlines(  )
testdirdiff.py:   8:     expected = expected + os.popen(test % 'dirdiff')...
testdirdiff.py:  15:         output = output + os.popen(test % script).read(  )

>>> import glob, grep
>>> grep.grep('system', glob.glob('*.py'))
dirdiff.py:  16: # - on unix systems we could do something similar by
regtest.py:  18:         os.system('%s < %s > %s.out 2>&1' % (program, ...
regtest.py:  23:         os.system('%s < %s > %s.out 2>&1' % (program, ...
regtest.py:  24:         os.system('diff %s.out %s.out.bkp > %s.diffs' ...

The grep module is written in pure Python code (no shell commands are run), is completely portable, and accepts both simple strings and general regular expression patterns as the search key (regular expressions appear later in this text). Unfortunately, it is also limited in two major ways:

  • It simply prints matching lines instead of returning them in a list for later processing. We could intercept and split its output by redirecting sys.stdout to an object temporarily (Chapter 2 showed how), but that's fairly inconvenient.[8]

    [8] Due to its limitations, the grep module has been tagged as "deprecated" as of Python 1.6, and may disappear completely in future releases. It was never intended to become a widely reusable tool. Use other tree-walking techniques in this book to search for strings in files, directories, and trees. Of the original Unix-like grep, glob, and find modules in Python's library, only glob remains nondeprecated today (but see also the custom find implementation presented in Chapter 4).

  • More crucial here, the grep/glob combination still inspects only a single directory; as we also saw in Chapter 2, we need to do more to search all files in an entire directory tree.
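
The first limitation in the list above is easy to sidestep with a few lines of code. What follows is only a minimal sketch, not the grep module's own code: the function name grep_list is hypothetical, it handles simple substrings rather than regular expressions, and it is written for a current Python 3 rather than the Python release used elsewhere in this chapter.

```python
import glob

def grep_list(pattern, filenames):
    """Return (filename, line number, line) tuples for lines containing pattern."""
    matches = []
    for name in filenames:
        try:
            # errors='replace' keeps odd bytes from aborting the scan
            with open(name, errors='replace') as f:
                for lineno, line in enumerate(f, start=1):
                    if pattern in line:
                        matches.append((name, lineno, line.rstrip('\n')))
        except OSError:
            continue                    # skip missing or unreadable names
    return matches
```

A call such as grep_list('popen', glob.glob('*.py')) returns the matches as a list that later code can filter or sort, instead of printing them. Like the grep/glob pair, it still looks at a single directory only.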

On Unix systems, we can work around the second of these limitations by running a grep shell command from within a find shell command. For instance, the following Unix command line:

find . -name "*.py" -print -exec fgrep popen {} \;

would pinpoint lines and files at and below the current directory that mention "popen". If you happen to have a Unix-like find command on every machine you will ever use, this is one way to process directories.

5.4.1.1 Cleaning up bytecode files

I used to run the script in Example 5-8 on some of my machines to remove all .pyc bytecode files in the examples tree before packaging or upgrading Pythons (old binary bytecode files may not be forward-compatible with newer Python releases).

Example 5-8. PP2E\PyTools\cleanpyc.py
###########################################################
# find and delete all "*.pyc" bytecode files at and below
# the directory where this script is run; this assumes a 
# Unix-like find command, and so is very non-portable; we
# could instead use the Python find module, or just walk 
# the directory trees with portable Python code; the find
# -exec option can apply a Python script to each file too;
###########################################################

import os, sys

if sys.platform[:3] == 'win':
    findcmd = r'c:\stuff\bin.mks\find . -name "*.pyc" -print'
else:
    findcmd = 'find . -name "*.pyc" -print'
print findcmd

count = 0
for file in os.popen(findcmd).readlines(  ):        # for all file names
    count = count + 1                             # have \n at the end
    print str(file[:-1])
    os.remove(file[:-1])

print 'Removed %d .pyc files' % count

This script uses os.popen to collect the output of a commercial package's find program installed on one of my Windows computers, or else the standard find tool on the Linux side. It's also completely nonportable to Windows machines that don't have the commercial find program installed, and that includes other computers in my house, and most of the world at large.

Python scripts can reuse underlying shell tools with os.popen, but by so doing they lose much of the portability advantage of the Python language. The Unix find command is not universally available, and it is a complex tool in its own right (in fact, too complex to cover in this book; see a Unix manpage for more details). As we saw in Chapter 2, spawning a shell command also incurs a performance hit, because it must start a new independent program on your computer.

To avoid some of the portability and performance costs of spawning an underlying find command, I eventually recoded this script to use the find utilities we met and wrote in Chapter 2. The new script is shown in Example 5-9.

Example 5-9. PP2E\PyTools\cleanpyc-py.py
###########################################################
# find and delete all "*.pyc" bytecode files at and below
# the directory where this script is run; this uses a 
# Python find call, and so is portable to most machines;
# run this to delete .pyc's from an old Python release;
# cd to the directory you want to clean before running;
###########################################################

import os, sys, find              # here, gets PyTools find

count = 0
for file in find.find("*.pyc"):   # for all file names
    count = count + 1
    print file
    os.remove(file)

print 'Removed %d .pyc files' % count

This works portably, and avoids external program startup costs. But find is really just a tree-searcher that doesn't let you hook into the tree search -- if you need to do something unique while traversing a directory tree, you may be better off using a more manual approach. Moreover, find must collect all names before it returns; in very large directory trees, this may introduce significant performance and memory penalties. It's not an issue for my trees, but your trees may vary.
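
One way around the collect-everything-first cost is to produce names lazily. The sketch below is an assumption-laden illustration rather than the find module used in Example 5-9: it relies on os.walk, fnmatch, and generators from a current Python 3, the names iter_find and remove_matching are hypothetical, and it matches files only, not directories.

```python
import fnmatch
import os

def iter_find(pattern, startdir=os.curdir):
    """Yield paths of files at and below startdir whose names match pattern."""
    for dirpath, dirnames, filenames in os.walk(startdir):
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(dirpath, name)   # one path at a time

def remove_matching(pattern, startdir=os.curdir):
    """Delete matching files as they are found; return how many were removed."""
    count = 0
    for path in iter_find(pattern, startdir):
        os.remove(path)
        count += 1
    return count
```

Because iter_find hands back each path as soon as it is found, a caller such as remove_matching('*.pyc') can act on the first match before the rest of the tree has been visited, and no full list of names is ever built.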

5.4.2 A Python Tree Searcher

To help ease the task of performing global searches on all platforms I might ever use, I coded a Python script to do most of the work for me. Example 5-10 employs standard Python tools we met in the preceding chapters:

  • os.path.walk to visit files in a directory

  • string.find to search for a string in text read from a file

  • os.path.splitext to skip over files with binary-type extensions

  • os.path.join to portably combine a directory path and filename

  • os.path.isdir to skip paths that refer to directories, not files

Because it's pure Python code, it can be run the same way on both Linux and Windows. In fact, it should work on any computer where Python has been installed. Moreover, because it uses direct system calls, it will likely be faster than using os.popen to spawn a find command that spawns many grep commands.
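
The path tools in the list above can be tried in isolation before reading the full script. This is a small sketch only: the helper names searchable_path and contains are hypothetical, the code targets a current Python 3 (where the str.find method stands in for the string.find function), and the extension list simply mirrors skipexts in Example 5-10.

```python
import os

# Mirrors the skipexts list in Example 5-10.
SKIP_EXTS = ['.gif', '.exe', '.pyc', '.o', '.a']

def searchable_path(directory, name):
    """Join directory and name; return the path, or None if it should be skipped."""
    path = os.path.join(directory, name)          # portable path concatenation
    if os.path.isdir(path):                       # directories are not searched
        return None
    if os.path.splitext(path)[1] in SKIP_EXTS:    # extension implies binary data
        return None
    return path

def contains(path, key):
    """Return True if the file's text contains key (str.find gives -1 on a miss)."""
    with open(path, errors='replace') as f:
        return f.read().find(key) != -1
```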

Example 5-10. PP2E\PyTools\search_all.py
#########################################################
# Use: "python ..\..\PyTools\search_all.py string".
# search all files at and below current directory
# for a string; uses the os.path.walk interface,
# rather than doing a find to collect names first;
#########################################################

import os, sys, string
listonly = 0
skipexts = ['.gif', '.exe', '.pyc', '.o', '.a']        # ignore binary files

def visitfile(fname, searchKey):                       # for each non-dir file
    global fcount, vcount                              # search for string
    print vcount+1, '=>', fname                        # skip protected files
    try:
        if not listonly:
            if os.path.splitext(fname)[1] in skipexts:
                print 'Skipping', fname
            elif string.find(open(fname).read(  ), searchKey) != -1:
                raw_input('%s has %s' % (fname, searchKey))
                fcount = fcount + 1
    except: pass
    vcount = vcount + 1

def visitor(myData, directoryName, filesInDirectory):  # called for each dir 
    for fname in filesInDirectory:                     # do non-dir files here
        fpath = os.path.join(directoryName, fname)     # fnames have no dirpath
        if not os.path.isdir(fpath):                   # myData is searchKey
            visitfile(fpath, myData)
     
def searcher(startdir, searchkey):
    global fcount, vcount
    fcount = vcount = 0
    os.path.walk(startdir, visitor, searchkey)

if __name__ == '__main__':
    searcher('.', sys.argv[1])
    print 'Found in %d files, visited %d' % (fcount, vcount)

This file also uses the sys.argv command-line list and the __name__ trick for running in two modes. When run standalone, the search key is passed on the command line; when imported, clients call this module's searcher function directly. For example, to search (grep) for all appearances of directory name "Part2" in the examples tree (an old directory that really did go away!), run a command line like this in a DOS or Unix shell:

C:\...\PP2E>python PyTools\search_all.py Part2 
1 => .\autoexec.bat
2 => .\cleanall.csh
3 => .\echoEnvironment.pyw
4 => .\Launcher.py
.\Launcher.py has Part2 
5 => .\Launcher.pyc
Skipping .\Launcher.pyc
6 => .\Launch_PyGadgets.py
7 => .\Launch_PyDemos.pyw
8 => .\LaunchBrowser.out.txt
.\LaunchBrowser.out.txt has Part2 
9 => .\LaunchBrowser.py
.\LaunchBrowser.py has Part2 
...
 ...more lines deleted
...
1339 => .\old_Part2\Basics\unpack2b.py
1340 => .\old_Part2\Basics\unpack3.py
1341 => .\old_Part2\Basics\__init__.py
Found in 74 files, visited 1341

The script lists each file it checks as it goes, tells you which files it is skipping (names that end in extensions listed in variable skipexts that imply binary data), and pauses for an Enter key press each time it announces a file containing the search string (the "has Part2" lines in the listing above). A solution based on find could not pause this way; although the difference is trivial in this example, find doesn't return until the entire tree traversal is finished. The search_all script works the same when imported instead of run, but there is no final statistics output line (fcount and vcount live in the module, and so would have to be imported to be inspected here):

>>> from PP2E.PyTools.search_all import searcher 
>>> searcher('.', '-exec')           # find files with string '-exec'
1 => .\autoexec.bat
2 => .\cleanall.csh
3 => .\echoEnvironment.pyw
4 => .\Launcher.py
5 => .\Launcher.pyc
Skipping .\Launcher.pyc
6 => .\Launch_PyGadgets.py
7 => .\Launch_PyDemos.pyw
8 => .\LaunchBrowser.out.txt
9 => .\LaunchBrowser.py
10 => .\Launch_PyGadgets_bar.pyw
11 => .\makeall.csh
12 => .\package.csh
.\package.csh has -exec 
 ...more lines deleted...

However launched, this script tracks down all references to a string in an entire directory tree -- a name of a changed book examples file, object, or directory, for instance.[9]

[9] See the coverage of regular expressions in Chapter 18. The search_all script here searches for a simple string in each file with string.find, but it would be trivial to extend it to search for a regular expression pattern match instead (roughly, just replace string.find with a call to a regular expression object's search method). Of course, such a mutation will be much more trivial after we've learned how to do it.
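
For readers who want a preview of the regular expression variant that the footnote describes, here is one possible shape for it. Treat it as a sketch under stated assumptions, not as the search_all script itself: it is written for a current Python 3, it walks the tree with os.walk instead of os.path.walk, it returns its results rather than printing and pausing, and the name search_tree is hypothetical.

```python
import os
import re

SKIP_EXTS = ('.gif', '.exe', '.pyc', '.o', '.a')    # same binary extensions as skipexts

def search_tree(startdir, pattern, skipexts=SKIP_EXTS):
    """Return (paths whose text matches the regex pattern, number of files visited)."""
    regex = re.compile(pattern)
    found = []
    visited = 0
    for dirpath, dirnames, filenames in os.walk(startdir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            visited += 1
            if os.path.splitext(name)[1] in skipexts:
                continue                            # skip binary-type files
            try:
                with open(path, errors='replace') as f:
                    text = f.read()
            except OSError:
                continue                            # skip protected files
            if regex.search(text):                  # replaces the string.find test
                found.append(path)
    return found, visited
```

A call such as search_tree('.', r'Part[0-9]') would report every file mentioning Part followed by a digit, along with the count of files visited. Returning both values also avoids the module-level counters that importers of search_all would otherwise have to fetch.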
