I l@ve RuBoard

17.3 Implementing Sets

Another commonly used data structure is the set, a collection of objects that support operations such as:

Intersection: Make a new set with all items in common.
Union: Make a new set with all items in either operand.
Membership: Test if an item exists in a set.

And there are others, depending on the intended use. Sets come in handy for dealing with more abstract group combinations. For instance, given a set of engineers and a set of writers, you can pick out individuals who do both activities by intersecting the two sets. A union of such sets would contain either type of individual, but only include any given individual once.

Python lists, tuples, and strings come close to the notion of a set: the in operator tests membership, for iterates, etc. Here, we add operations not directly supported by Python sequences. The idea is that we're extending built-in types for unique requirements.

17.3.1 Set Functions

As before, let's first start out with a function-based set manager. But this time, instead of managing a shared set object in a module, let's define functions to implement set operations on passed-in Python sequences (see Example 17-8).

Example 17-8. PP2E\Dstruct\Basic\inter.py

def intersect(seq1, seq2):
    res = []                          # start with an empty list
    for x in seq1:                    # scan the first sequence
        if x in seq2:
            res.append(x)             # add common items to the end
    return res

def union(seq1, seq2):				
    res = list(seq1)                  # make a copy of seq1
    for x in seq2:                    # add new items in seq2
        if not x in res:
            res.append(x)
    return res

These functions work on any type of sequence -- lists strings, tuples, and other objects that conform to the sequence protocols expected by these functions (for loops, in membership tests). In fact, we can even use them on mixed object types: the last two commands in the following code compute the intersection and union of a list and a tuple. As usual in Python, the object interface is what matters, not the specific types:

C:\...\PP2E\Dstruct\Basic>python
>>> from inter import *
>>> s1 = "SPAM"
>>> s2 = "SCAM" 
>>> intersect(s1, s2), union(s1, s2)
(['S', 'A', 'M'], ['S', 'P', 'A', 'M', 'C'])
>>> intersect([1,2,3], (1,4))
[1]
>>> union([1,2,3], (1,4))
[1, 2, 3, 4]

Notice that the result is always a list here, regardless of the type of sequences passed in. We could work around this by converting types or by using a class to sidestep this issue (and we will in a moment). But type conversions aren't clear- cut if the operands are mixed-type sequences. Which type do we convert to?

17.3.1.1 Supporting multiple operands

If we're going to use the intersect and union functions as general tools, one useful extension is support for multiple arguments (i.e., more than two). The functions in Example 17-9 use Python's variable-length argument lists feature to compute the intersection and union of arbitrarily many operands.

Example 17-9. PP2E\Dstruct\Basic\inter2.py

def intersect(*args):
    res = [] 
    for x in args[0]:                  # scan the first list
        for other in args[1:]:         # for all other arguments
            if x not in other: break   # this item in each one?
        else:
            res.append(x)              # add common items to the end
    return res

def union(*args):				
    res = []
    for seq in args:                   # for all sequence-arguments
        for x in seq:                  # for all nodes in argument
            if not x in res: 
                res.append(x)          # add new items to result
    return res

The multi-operand functions work on sequences in the same way as the original functions, but they support three or more operands. Notice that the last two examples in the following session work on lists with embedded compound objects: the in tests used by the intersect and union functions apply equality testing to sequence nodes recursively, as deep as necessary to determine collection comparison results:

C:\...\PP2E\Dstruct\Basic>python
>>> from inter2 import *
>>> s1, s2, s3 = 'SPAM', 'SLAM', 'SCAM'
>>> intersect(s1, s2)
['S', 'A', 'M']
>>> intersect(s1, s2, s3)
['S', 'A', 'M']
>>> intersect(s1, s2, s3, 'HAM')
['A', 'M']

>>> union(s1, s2), union(s1, s2, s3)
(['S', 'P', 'A', 'M', 'L'], ['S', 'P', 'A', 'M', 'L', 'C'])
>>> intersect([1,2,3], (1,4), range(5))
[1]

>>> s1 = (9, (3.14, 1), "bye", [1,2], "mello")
>>> s2 = [[1,2], "hello", (3.14, 0), 9]
>>> intersect(s1, s2)
[9, [1, 2]]
>>> union(s1, s2)
[9, (3.14, 1), 'bye', [1, 2], 'mello', 'hello', (3.14, 0)]

17.3.2 Set Classes

The set functions can operate on a variety of sequences, but they aren't as friendly as true objects. Among other things, your scripts need to keep track of the sequences passed into these functions manually. Classes can do better: the class in Example 17-10 implements a set object that can hold any type of object. Like the stack classes, it's essentially a wrapper around a Python list with extra set operations.

Example 17-10. PP2E\Dstruct\Basic\set.py

class Set:
    def __init__(self, value = []):     # on object creation
        self.data = []                  # manages a local list
        self.concat(value)
    def intersect(self, other):         # other is any sequence type
        res = []                        # self is the instance subject
        for x in self.data:
            if x in other:
                res.append(x)
        return Set(res)                 # return a new Set
    def union(self, other):				
        res = self.data[:]              # make a copy of my list
        for x in other:                                    
            if not x in res:
                res.append(x)
        return Set(res)				
    def concat(self, value):            # value: a list, string, Set...
        for x in value:                 # filters out duplicates
           if not x in self.data:
                self.data.append(x)

    def __len__(self):          return len(self.data)
    def __getitem__(self, key): return self.data[key] 	
    def __and__(self, other):   return self.intersect(other) 	
    def __or__(self, other):    return self.union(other)
    def __repr__(self):         return '<Set:' + `self.data` + '>'

The Set class is used like the Stack class we saw earlier in this chapter: we make instances and apply sequence operators plus unique set operations to them. Intersection and union can be called as methods, or by using the & and | operators normally used for built-in integer objects. Because we can string operators in expressions now (e.g., x & y & z), there is no obvious need to support multiple operands in intersect/union methods here. As with all objects, we can either use the Set class within a program, or test it interactively as follows:

>>> from set import Set
>>> users1 = Set(['Bob', 'Emily', 'Howard', 'Peeper'])
>>> users2 = Set(['Jerry', 'Howard', 'Carol'])
>>> users3 = Set(['Emily', 'Carol'])
>>> users1 & users2
<Set:['Howard']>
>>> users1 | users2
<Set:['Bob', 'Emily', 'Howard', 'Peeper', 'Jerry', 'Carol']>
>>> users1 | users2 & users3
<Set:['Bob', 'Emily', 'Howard', 'Peeper', 'Carol']>
>>> (users1 | users2) & users3
<Set:['Emily', 'Carol']>
>>> users1.data
['Bob', 'Emily', 'Howard', 'Peeper']

17.3.3 Optimization: Moving Sets to Dictionaries

Once you start using the Set class, the first problem you might encounter is its performance: its nested for loops and in scans become exponentially slow. That slowness may or may not be significant in your applications, but library classes should generally be coded as efficiently as possible.

One way to optimize set performance is by changing the implementation to use dictionaries instead of lists, for storing sets internally -- items may be stored as the keys of a dictionary whose values are all None. Because lookup time is constant and short for dictionaries, the in list scans of the original set may be replaced with direct dictionary fetches in this scheme. In traditional terms, moving sets to dictionaries replaces slow linear searches with fast hash tables.

The module in Example 17-11 implements this idea. Its class is a subclass of the original set, and redefines the methods that deal with the internal representation but inherits others. The inherited & and | methods trigger the new intersect and union methods here, and the inherited len method works on dictionaries as is. As long as Set clients are not dependent on the order of items in a set, they can switch to this version directly by just changing the name of the module where Set is imported from; the class name is the same.

Example 17-11. PP2E\Dstruct\Basic\fastset.py

import set
                                           # fastset.Set extends set.Set 
class Set(set.Set):
    def __init__(self, value = []):
        self.data = {}                     # manages a local dictionary
        self.concat(value)                 # hashing: linear search times
    def intersect(self, other):
        res = {}
        for x in other:                    # other: a sequence or Set
            if self.data.has_key(x):       # use hash-table lookup
                res[x] = None
        return Set(res.keys(  ))             # a new dictionary-based Set
    def union(self, other):				
        res = {}                           # other: a sequence or Set
        for x in other:                    # scan each set just once
            res[x] = None
        for x in self.data.keys(  ):         # '&' and '|' come back here
            res[x] = None                  # so they make new fastset's
        return Set(res.keys(  ))	
    def concat(self, value):
        for x in value: self.data[x] = None

    # inherit and, or, len
    def __getitem__(self, key):  return self.data.keys(  )[key] 	
    def __repr__(self):          return '<Set:' + `self.data.keys(  )` + '>'

This works about the same as the previous version:

>>> from fastset import Set
>>> users1 = Set(['Bob', 'Emily', 'Howard', 'Peeper'])
>>> users2 = Set(['Jerry', 'Howard', 'Carol'])
>>> users3 = Set(['Emily', 'Carol'])
>>> users1 & users2
<Set:['Howard']>
>>> users1 | users2
<Set:['Emily', 'Howard', 'Jerry', 'Carol', 'Peeper', 'Bob']>
>>> users1 | users2 & users3
<Set:['Emily', 'Howard', 'Carol', 'Peeper', 'Bob']>
>>> (users1 | users2) & users3
<Set:['Emily', 'Carol']>
>>> users1.data
{'Emily': None, 'Bob': None, 'Peeper': None, 'Howard': None}

The main functional difference in this version is the order of items in the set: because dictionaries are randomly ordered, this set's order will differ from the original. For instance, you can store compound objects in sets, but the order of items varies in this version:

>>> import set, fastset
>>> a = set.Set([(1,2), (3,4), (5,6)])
>>> b = set.Set([(3,4), (7,8)])
>>> a & b
<Set:[(3, 4)]>
>>> a | b
<Set:[(1, 2), (3, 4), (5, 6), (7, 8)]>

>>> a = fastset.Set([(1,2), (3,4), (5,6)])
>>> b = fastset.Set([(3,4), (7,8)])
>>> a & b
<Set:[(3, 4)]>
>>> a | b
<Set:[(3, 4), (1, 2), (7, 8), (5, 6)]>
>>> b | a
<Set:[(3, 4), (5, 6), (1, 2), (7, 8)]>

Sets aren't supposed to be ordered anyhow, so this isn't a showstopper. A deviation that might matter, though, is that this version cannot be used to store unhashable objects. This stems from the fact that dictionary keys must be immutable. Because values are stored in keys, dictionary sets can contain only things like tuples, strings, numbers, and class objects with immutable signatures. Mutable objects like lists and dictionaries won't work directly. For example, the call:

fastset.Set([[1,2],[3,4]])

raises an exception with this dictionary-based set, but works with the original set class. Tuples work here because they are immutable; Python computes hash values and tests key equality as expected.

17.3.3.1 Timing the results

So how did we do on the optimization front? Example 17-12 contains a script to compare set class performance. It reuses the timer module used earlier to test stacks.

Example 17-12. PP2E\Dstruct\Basic\settime.py

import timer, sys
import set, fastset

def setops(Class):
    a = Class(range(50))                 # a 50-integer set
    b = Class(range(20))                 # a 20-integer set
    c = Class(range(10))
    d = Class(range(5))
    for i in range(5):
        t = a & b & c & d                # 3 intersections
        t = a | b | c | d                # 3 unions

if __name__ == '__main__':
    rept = int(sys.argv[1]) 
    print 'set =>    ', timer.test(rept, setops, set.Set)
    print 'fastset =>', timer.test(rept, setops, fastset.Set)

The setops function makes four sets and combines them with intersection and union operators five times. A command-line argument controls the number of times this whole process is repeated. More accurately, each call to setops makes 34 Set instances (4 + [5 x (3 + 3)] ), and runs the intersect and union methods 15 times each (5 x 3) in the for loop's body. On the same test machine, the performance improvement is equally dramatic this time around:

C:\...\PP2E\Dstruct\Basic>python settime.py 50
set =>     1.5440352671
fastset => 0.446057593993

C:\...\PP2E\Dstruct\Basic>python settime.py 100
set =>     2.77783486146
fastset => 0.888354648921

C:\...\PP2E\Dstruct\Basic>python settime.py 200
set =>     5.7762634305
fastset => 1.77677885985

At least for this test case, the simple set implementation is over three times slower than dictionary-based sets. In fact, this threefold speedup is probably sufficient. Python dictionaries are already optimized hash tables that you might be hard- pressed to improve on. Unless there is evidence that dictionary-based sets are still too slow, our work here is probably done.

Using the Python Profiler

The Python profiler provides another way to gather performance data besides timing sections of code as done in this chapter. Because the profiler tracks all function calls, it provides much more information in a single blow. As such, it's a more powerful way to isolate bottlenecks in slow programs -- after profiling, you should have a good idea where to focus your optimization efforts.

The profiler ships with Python as the standard library module called profile, and provides a variety of interfaces for measuring code performance. It is structured and used much like the pdb command-line debugger: import the profile module and call its functions with a code string to measure performance. The simplest profiling interface is its profile.run(statementstring) function. When invoked, the profiler runs the code string, collects statistics during the run, and issues a report on the screen when the statement completes.

The report's format is straightforward and well-documented in the Python library manual. By default, it lists the number of calls and times spent in each function invoked during the run. When the profiler is enabled, each interpreter event is routed to a Python handler. This gives an accurate picture of performance, but tends to make the program being profiled run much slower than normal.

17.3.4 Optimizing fastset by Coding Techniques (or Not)

As coded, there seems to be a bottleneck in the fastset class: each time we call a dictionary's keys method, Python makes a new list to hold the result, and this can happen repeatedly during intersections and unions. If you are interested in trying to optimize this further, see the following files in the book CD (view CD-ROM content online at http://examples.oreilly.com/python2):

PP2E\Dstruct\Basic\fastset2.py
PP2E\Dstruct\Basic\fastset3.py

I wrote these to try to speed up sets further, but failed miserably. It turns out that adding extra code to try to shave operations usually negates the speedup you obtain. There may be faster codings, but the biggest win here was likely in changing the underlying data structure to dictionaries, not in minor code tweaks.

As a rule of thumb, your intuition about performance is almost always wrong in a dynamic language like Python: the algorithm is usually the real culprit behind performance problems, not the coding style or even the implementation language. By removing the combinatorial list scanning algorithm of the original set class, the Python implementation became dramatically faster.

In fact, moving the original set class to C without fixing the algorithm would not have addressed the real performance problem. Coding tricks don't usually help as much either, and they make your programs difficult to understand. In Python, it's almost always best to code for readability first and optimize later if needed. Despite its simplicity, fastset is fast indeed.

17.3.5 Adding Relational Algebra to Sets (CD)

If you are interested in studying additional set-like operations coded in Python, see the following files on this book's CD (see http://examples.oreilly.com/python2):

PP2E\Dstruct\Basic\rset.py : RSet implementation
PP2E\Dstruct\Basic\reltest.py : Test script for RSet

The RSet subclass defined in rset.py adds basic relational algebra operations for sets of dictionaries. It assumes the items in sets are mappings (rows), with one entry per column (field). RSet inherits all the original Set operations (iteration, intersection, union, & and | operators, uniqueness filtering, etc.), and adds new operations as methods:

Select: Return a set of nodes that have a field equal to a given value.
Bagof: Collect set nodes that satisfy an expression string.
Find: Select tuples according to a comparison, field, and value.
Match: Find nodes in two sets with the same values for common fields.
Product: Compute a Cartesian product: concatenate tuples from two sets.
Join: Combine tuples from two sets that have the same value for a field.
Project: Extract named fields from the tuples in a table.
Difference: Remove one set's tuples from another.

Alternative implementations of set difference operations can also be found in the diff.py file in the same CD directory.

I l@ve RuBoard