24.2. Parsing XML with SAX
In most cases, the best way to extract information from an XML document is to parse the document with an event-driven parser compliant with SAX, the Simple API for XML. SAX defines a standard API that can be implemented on top of many different underlying parsers. The SAX approach to parsing has similarities to most of the HTML parsers covered in Chapter 23. As the parser encounters XML elements, text contents, and other significant events in the input stream, the parser calls back to methods of your classes. Such event-driven parsing, based on callbacks to your methods as relevant events occur, also has similarities to the event-driven approach that is almost universal in GUIs and in some of the best, most scalable networking frameworks, such as Twisted, mentioned in Chapter 19. Event-driven approaches in various programming fields may not appear natural to beginners, but enable high performance and particularly high scalability, making them very suitable for high-workload cases.
To use SAX, you define a content handler class, subclassing a library class and overriding some methods. Then you build a parser object p, install an instance of your class as p's handler, and feed p the input stream to parse. p calls methods on your handler to reflect the document's structure and contents. Your handler's methods perform application-specific processing. The xml.sax package supplies a factory function to build p, and convenience functions for simpler operation in typical cases. xml.sax also supplies exception classes, raised in cases of invalid input and other errors.
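The workflow just described can be sketched as follows. This is a minimal illustration written for Python 3, not a definitive recipe: the TagCounter class and the SAMPLE document are hypothetical names made up for the example, and the input comes from an in-memory stream rather than a real file.

```python
import io
import xml.sax

class TagCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts how often each element name occurs."""
    def startDocument(self):
        # called once, before any other callback
        self.counts = {}
    def startElement(self, name, attrs):
        # called for every opening tag the parser encounters
        self.counts[name] = self.counts.get(name, 0) + 1

# made-up sample input, kept in memory so the sketch is self-contained
SAMPLE = b"<menu><item>tea</item><item>coffee</item></menu>"

handler = TagCounter()
p = xml.sax.make_parser()        # factory function builds the parser
p.setContentHandler(handler)     # install the handler instance
p.parse(io.BytesIO(SAMPLE))      # feed p the input stream to parse
print(sorted(handler.counts.items()))
```

All application-specific work lives in the handler; the parser object only drives the callbacks.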
Optionally, you can also register with parser p other kinds of handlers besides the content handler. You can supply a custom error handler to use an error diagnosis strategy different from normal exception raising, for example in order to diagnose several errors during a parse. You can supply a custom DTD handler to receive information about notation and unparsed entities from the XML document's Document Type Definition (DTD). You can supply a custom entity resolver to handle external entity references in advanced, customized ways. These advanced possibilities are rarely used, and I do not cover them further in this book.
24.2.1. The xml.sax Package
The xml.sax package supplies exception class SAXException and subclasses of it to support fine-grained exception handling. xml.sax also supplies three functions: make_parser, parse, and parseString.
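As a rough sketch of how the convenience functions and the exception classes fit together (Python 3 assumed), the snippet below uses xml.sax.parseString and xml.sax.parse and catches SAXParseException, a subclass of SAXException. The TextCollector class and the helper functions text_of and describe are hypothetical names introduced only for this illustration.

```python
import io
import xml.sax

class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler: gathers all character data in the document."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.pieces = []
    def characters(self, content):
        # text may arrive in several pieces, so accumulate and join later
        self.pieces.append(content)

def text_of(xml_bytes):
    # parseString builds a parser, installs the handler, and parses in one call
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return ''.join(handler.pieces)

def describe(xml_bytes):
    # malformed input surfaces as an exception from the SAXException family
    try:
        return text_of(xml_bytes)
    except xml.sax.SAXParseException as exc:
        return 'parse error at line %d' % exc.getLineNumber()

print(describe(b"<p>Hello, <b>SAX</b>!</p>"))
print(describe(b"<p>never closed"))

# xml.sax.parse does the same job for a filename or file-like object
stream_handler = TextCollector()
xml.sax.parse(io.BytesIO(b"<p>from a stream</p>"), stream_handler)
print(''.join(stream_handler.pieces))
```

Catching the narrower SAXParseException keeps parse errors separate from other SAX failures; catching SAXException instead would cover both.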
The last argument of methods startElement and startElementNS is an attributes object attr, a read-only mapping of attribute names to attribute values. For method startElement, names are identifier strings. For method startElementNS, names are pairs (uri,localname), where uri is the namespace's URI or None, and localname is the attribute's local name (the name without any namespace prefix). In addition to some mapping methods, attr also supports methods that let you work with the qname (qualified name) of each attribute.
For startElement, each qname is the same string as the corresponding name. For startElementNS, a qname is the corresponding local name for attributes not associated with a namespace (i.e., attributes whose uri is None); otherwise, the qname is the string prefix:name used in the document's text for this attribute.
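The difference between names and qnames is easiest to see with namespace processing switched on. The sketch below (Python 3 assumed) enables the feature_namespaces feature so that the parser calls startElementNS, then records the qname of each attribute of an a element. The AttrNames class and the SAMPLE document are hypothetical; getNames and getQNameByName are two of the qname-related methods of the attributes object.

```python
import io
import xml.sax
from xml.sax.handler import feature_namespaces

class AttrNames(xml.sax.ContentHandler):
    """Hypothetical handler: maps each attribute name of <a> to its qname."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.found = {}
    def startElementNS(self, name, qname, attrs):
        uri, localname = name            # element names are pairs too
        if localname != 'a':
            return
        for attr_name in attrs.getNames():   # each is a (uri, localname) pair
            self.found[attr_name] = attrs.getQNameByName(attr_name)

# made-up document: one plain attribute, one in the XLink namespace
SAMPLE = (b'<doc xmlns:xlink="http://www.w3.org/1999/xlink">'
          b'<a id="x1" xlink:href="http://example.com/"/></doc>')

handler = AttrNames()
p = xml.sax.make_parser()
p.setFeature(feature_namespaces, True)   # request startElementNS callbacks
p.setContentHandler(handler)
p.parse(io.BytesIO(SAMPLE))

for attr_name, attr_qname in handler.found.items():
    print(attr_name, attr_qname)
```

The attribute without a namespace has a uri of None and a qname equal to its local name, while the namespaced attribute keeps the prefix:name form used in the document.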
The parser may later reuse the attr object that it passes to methods startElement and startElementNS. If you need to keep the attributes of an element beyond the callback that receives them, call attr.copy() to get a copy.
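A handler that needs the attributes after the callback returns can store copies, as in this small sketch (Python 3 assumed). The ImageAttributes class and the SAMPLE document are hypothetical names used only for illustration.

```python
import xml.sax

class ImageAttributes(xml.sax.ContentHandler):
    """Hypothetical handler: keeps the attributes of every <img> element."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.saved = []
    def startElement(self, name, attrs):
        if name == 'img':
            # store a copy, not attrs itself, since the parser may reuse attrs
            self.saved.append(attrs.copy())

# made-up sample input
SAMPLE = b'<page><img src="a.png" alt="A"/><img src="b.png"/></page>'

handler = ImageAttributes()
xml.sax.parseString(SAMPLE, handler)

# the saved copies still behave as mappings once parsing is over
sources = [a.get('src') for a in handler.saved]
alts = [a.get('alt') for a in handler.saved]
print(sources)
print(alts)
```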
Incremental parsing
All parsers support a method parse, which you call with the source of the XML document: either a string that names the file or URL to read, or a file-like object open for reading. parse does not return until the end of the XML document. Most SAX parsers, though not all, also support incremental parsing, which lets you feed the XML document to the parser a little at a time, as the document arrives from a network connection or other source. Good incremental parsers perform all possible callbacks to your handler class's methods as soon as possible, so you don't have to wait for the whole document to arrive before you start processing it. The processing can instead proceed as incrementally as the parsing itself does, which suits asynchronous networking approaches, covered in "Event-Driven Socket Programs" on page 533. A parser p that is capable of incremental parsing supplies three more methods: close, feed, and reset.
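Incremental use can be sketched as below (Python 3 assumed). The document is split into small chunks to stand in for data arriving from a network connection; the TitleCollector class, the chunks helper, and the DOCUMENT bytes are hypothetical. The sketch relies only on the final result after close, because exactly when callbacks fire between calls to feed depends on the underlying parser.

```python
import xml.sax

class TitleCollector(xml.sax.ContentHandler):
    """Hypothetical handler: collects the text of every <title> element."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.titles = []
        self._buffer = None          # list of text pieces while inside <title>
    def startElement(self, name, attrs):
        if name == 'title':
            self._buffer = []
    def characters(self, content):
        # text can be delivered in several calls, especially across chunks
        if self._buffer is not None:
            self._buffer.append(content)
    def endElement(self, name):
        if name == 'title':
            self.titles.append(''.join(self._buffer))
            self._buffer = None

# made-up document; ASCII only, so any byte boundary is a safe split point
DOCUMENT = (b"<books><book><title>First</title></book>"
            b"<book><title>Second</title></book></books>")

def chunks(data, size):
    # stand-in for data arriving piecemeal from a socket
    for start in range(0, len(data), size):
        yield data[start:start + size]

handler = TitleCollector()
p = xml.sax.make_parser()
p.setContentHandler(handler)
for chunk in chunks(DOCUMENT, 16):
    p.feed(chunk)                    # hand over whatever has arrived so far
p.close()                            # signal the end of the document
print(handler.titles)
```

Calling close matters: it tells the parser that no more data will come, so the parser can report an incomplete document as an error and deliver the final callbacks.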
The xml.sax.saxutils module
The saxutils module of package xml.sax supplies two functions, escape and quoteattr, and a class, XMLGenerator, that provide handy ways to generate XML output based on an input XML document.
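A brief sketch of these helpers (Python 3 assumed): escape prepares text for use as element content, quoteattr prepares and quotes an attribute value, and XMLGenerator is a content handler that writes the events it receives back out as XML. The sample strings and the in-memory output stream are made up for the illustration.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator, escape, quoteattr

# escape replaces &, <, and > so that text is safe as element content
text = escape('fish & chips < 5')
print(text)

# quoteattr escapes the value and wraps it in suitable quote characters
attr = quoteattr('say "hi"')
print(attr)

# XMLGenerator is itself a content handler: parsing into it re-emits XML
out = io.StringIO()
xml.sax.parseString(b'<note lang="en">fish &amp; chips</note>',
                    XMLGenerator(out, 'utf-8'))
result = out.getvalue()
print(result)
```

Subclassing XMLGenerator and overriding a few of its methods is one way to copy a document while altering or dropping selected parts of it.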
24.2.2. Parsing XHTML with xml.sax
The following example uses xml.sax to perform a typical XHTML-related task that is very similar to the tasks performed in the examples of Chapter 23. The example fetches an XHTML page from the Web with urllib, parses it, and outputs all unique links from the page to other sites. The example uses urlparse to split each link's URL into its components, and outputs only the links whose URLs have an explicit scheme of 'http'.
import xml.sax, urllib, urlparse

class LinksHandler(xml.sax.ContentHandler):
    def startDocument(self):
        # hrefs already handled, so each link is reported at most once
        self.seen = set()
    def startElement(self, tag, attributes):
        if tag != 'a': return
        value = attributes.get('href')
        if value is not None and value not in self.seen:
            self.seen.add(value)
            pieces = urlparse.urlparse(value)
            # pieces[0] is the URL's scheme; skip relative and non-HTTP links
            if pieces[0] != 'http': return
            print urlparse.urlunparse(pieces)

p = xml.sax.make_parser()
p.setContentHandler(LinksHandler())
f = urllib.urlopen('http://www.w3.org/MarkUp/')
BUFSIZE = 8192
# feed the page to the parser incrementally, one buffer at a time
while True:
    data = f.read(BUFSIZE)
    if not data: break
    p.feed(data)
p.close()
This example is quite similar to the HTMLParser example in Chapter 23. With the xml.sax module, the parser and the handler are separate objects (while in the examples of Chapter 23 they coincided). Method names differ (startElement in this example versus handle_starttag in the HTMLParser example). The attributes argument is a mapping here, so its method get immediately gives us the attribute value we're interested in, while in the examples of Chapter 23, attributes were given as a sequence of (name,value) pairs, so we had to loop on the sequence until we found the right name. Despite these differences in detail, the overall structure is very close, and typical of simple event-driven parsing tasks.
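For readers on Python 3, where the urlparse functions live in urllib.parse, the same handler might look like the sketch below. To keep it self-contained, it parses a small made-up page held in memory (PAGE is hypothetical) instead of fetching one from the Web, and it collects the links in a list rather than printing them as it goes.

```python
import io
import xml.sax
from urllib.parse import urlparse, urlunparse

class LinksHandler(xml.sax.ContentHandler):
    def startDocument(self):
        self.seen = set()     # every href value met so far
        self.links = []       # unique links with an explicit http scheme
    def startElement(self, tag, attributes):
        if tag != 'a':
            return
        value = attributes.get('href')
        if value is None or value in self.seen:
            return
        self.seen.add(value)
        pieces = urlparse(value)
        if pieces.scheme == 'http':
            self.links.append(urlunparse(pieces))

# made-up XHTML page standing in for the downloaded document
PAGE = b"""<html xmlns="http://www.w3.org/1999/xhtml"><body>
<a href="http://www.example.com/a">one</a>
<a href="/local/page">two</a>
<a href="http://www.example.com/a">again</a>
<a href="http://www.example.org/">three</a>
</body></html>"""

handler = LinksHandler()
p = xml.sax.make_parser()
p.setContentHandler(handler)
stream = io.BytesIO(PAGE)
BUFSIZE = 64
while True:
    data = stream.read(BUFSIZE)
    if not data:
        break
    p.feed(data)
p.close()
print(handler.links)
```

To work on a live page, the in-memory stream could be replaced with the object returned by urllib.request.urlopen, which also supports read.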