23.2. The htmllib Module

The htmllib module supplies a class named HTMLParser that subclasses SGMLParser and defines start_tag, do_tag, and end_tag methods for HTML 2.0 tags. HTMLParser implements and overrides methods so that they call methods of a formatter object, covered in "The formatter Module." You can subclass HTMLParser and override any of these methods to customize the processing. In addition to the start_tag, do_tag, and end_tag methods, an instance h of HTMLParser supplies the following attributes and methods.



h.anchor_bgn(href,name,type)

Called for each <a> tag. href, name, and type are the string values of the tag's attributes with the same names. HTMLParser's implementation of anchor_bgn maintains a list of outgoing hyperlink targets (i.e., the href arguments of method h.anchor_bgn) in an instance attribute named h.anchorlist.


h.anchor_end( )

Called for each </a> end tag. HTMLParser's implementation of anchor_end emits to the formatter a footnote reference that is an index within h.anchorlist. In other words, by default, HTMLParser asks the formatter to format an <a>/</a> tag pair as the text inside the tag, followed by a footnote reference number that points to the URL in the <a> tag. Of course, it's up to the formatter to deal with this formatting request.


The h.anchorlist attribute contains the list of outgoing hyperlink target URLs, as built by method h.anchor_bgn.


The h.formatter attribute is the formatter object f associated with h, which you pass as the only argument when you instantiate HTMLParser(f).



h.handle_image(source,alt,ismap='',align='',width='',height='')

Called for each <img> tag. Each argument is the string value of the tag's attribute of the same name. HTMLParser's implementation of handle_image calls h.handle_data(alt) (in other words, the default implementation ignores the image proper and formats the alternate text instead).



The h.nofill attribute is false when the parser is collapsing whitespace, the normal case. It is true when the parser must preserve whitespace, typically within a <pre> tag.


h.save_bgn( )

Diverts data to an internal buffer instead of passing it to the formatter, until the next call to h.save_end( ). h has only one buffer, so you cannot nest save_bgn calls.


h.save_end( )

Returns a string with all data in the internal buffer and directs data back to the formatter from now on. If save_bgn state was not on, raises TypeError.
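The default anchor handling described in the entries for anchor_bgn, anchor_end, and anchorlist can be pictured with a small plain-Python model. This is only a sketch of the idea, not htmllib's own code: the function name footnote_links is hypothetical, and both the bracketed [n] reference style and the 1-based numbering are assumptions made for illustration, since the final appearance is up to the formatter.

```python
def footnote_links(links):
    """Model the default <a> handling on a list of (text, href) pairs.

    Returns (rendered_text, anchorlist). The '[n]' style and 1-based
    numbering are assumptions for illustration only.
    """
    anchorlist = []
    pieces = []
    for text, href in links:
        # the role of anchor_bgn: record the hyperlink target
        anchorlist.append(href)
        # the role of anchor_end: link text followed by a footnote number
        pieces.append('%s[%d]' % (text, len(anchorlist)))
    return ' '.join(pieces), anchorlist
```

Each footnote number identifies the position of the corresponding URL in the returned list, which is the relationship that h.anchorlist provides for a real parser instance.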

The formatter module defines formatter and writer classes. Instantiate a formatter by passing a writer instance to the class, then pass the formatter instance to class HTMLParser of module htmllib. You can define your own formatters and writers by subclassing formatter's classes and overriding methods appropriately, but I do not cover this advanced and rarely used possibility in this book. An application with special output requirements would typically define an appropriate writer, subclassing AbstractWriter and overriding all methods, and use class AbstractFormatter without needing to subclass it. Module formatter supplies the following classes.


class AbstractFormatter(writer)

The standard formatter implementation, suitable for most tasks.


class AbstractWriter( )

A writer implementation that prints each of its method names when called, suitable for debugging purposes only.


class DumbWriter(file=sys.stdout,maxcol=72)

A writer implementation that emits text to file object file, with word wrapping to ensure that no text line is longer than maxcol characters.


class NullFormatter(writer=None)

A formatter implementation whose methods are do-nothing stubs. When writer is None, instantiates NullWriter. Suitable when you subclass HTMLParser to analyze an HTML document but don't want any output to occur.


class NullWriter( )

A writer implementation whose methods are do-nothing stubs.

23.2.1. The htmlentitydefs Module

The htmlentitydefs module supplies three attributes:


codepoint2name

A mapping from Unicode codepoints to HTML entity names. For example, htmlentitydefs.codepoint2name[228] is 'auml', since Unicode character 228, "lowercase a with umlaut," is encoded in HTML as '&auml;'.


entitydefs

A mapping from HTML entity names to Latin-1 characters or HTML character references. For example, htmlentitydefs.entitydefs['auml'] is '\xe4', and htmlentitydefs.entitydefs['sigma'] is '&#963;'.


name2codepoint

A mapping from HTML entity names to Unicode codepoints. For example, htmlentitydefs.name2codepoint['auml'] is 228.

Module htmllib uses module htmlentitydefs internally.
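As a small illustration of how the two codepoint mappings fit together, the following sketch converts a codepoint to an HTML reference and an entity name back to a codepoint. The helper names entity_for and codepoint_for are hypothetical, and the fallback import of html.entities is an assumption that matters only if you run the code under Python 3, where the same tables are available under that name.

```python
try:
    import htmlentitydefs                    # Python 2 name, as used in this section
except ImportError:
    import html.entities as htmlentitydefs   # assumption: Python 3 name for the same tables

def entity_for(codepoint):
    """Return a named reference if one exists, else a numeric one."""
    name = htmlentitydefs.codepoint2name.get(codepoint)
    if name is None:
        return '&#%d;' % codepoint
    return '&%s;' % name

def codepoint_for(name):
    """Return the codepoint for an entity name, or None if unknown."""
    return htmlentitydefs.name2codepoint.get(name)
```

For instance, entity_for(228) gives '&auml;' and codepoint_for('auml') gives 228, consistent with the name2codepoint value quoted above. A codepoint with no named entity, such as 65, falls back to a numeric reference.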

23.2.2. Parsing HTML with htmllib

The following example uses htmllib to perform the same task as in the previous example for sgmllib, fetching a page from the Web with urllib, parsing it, and outputting the hyperlinks:

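import htmllib, formatter, urllib, urlparse

# NullFormatter discards all output: only the parser's anchorlist matters here
p = htmllib.HTMLParser(formatter.NullFormatter( ))
f = urllib.urlopen('')    # supply the URL of the page to fetch
BUFSIZE = 8192
# feed the page to the parser in fixed-size chunks
while True:
    data = f.read(BUFSIZE)
    if not data: break
    p.feed(data)
p.close( )

# print each distinct http URL once, in document order
seen = set( )
for url in p.anchorlist:
    if url in seen: continue
    seen.add(url)
    pieces = urlparse.urlparse(url)
    if pieces[0] == 'http':
        print urlparse.urlunparse(pieces)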

The example exploits the anchorlist attribute of class htmllib.HTMLParser, and therefore does not need to perform any subclassing. htmllib.HTMLParser builds the anchorlist attribute as it parses the HTML page, so the code need only loop on the list and work with the list's items, each a relevant URL.
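The final loop does not depend on htmllib at all, since it works on any list of URL strings. As a minimal sketch, the same filtering can be packaged as a function that returns the URLs instead of printing them. The name http_links is hypothetical (it is not part of htmllib), and the fallback import of urllib.parse is an assumption that matters only if you run the code under Python 3, where urlparse has that name.

```python
try:
    import urlparse                      # Python 2 name, as in the example above
except ImportError:
    import urllib.parse as urlparse      # assumption: Python 3 name for the same module

def http_links(anchorlist):
    """Return the distinct http URLs in anchorlist, in first-seen order."""
    seen = set( )
    result = []
    for url in anchorlist:
        if url in seen: continue
        seen.add(url)
        pieces = urlparse.urlparse(url)
        # keep only URLs whose scheme is exactly 'http'
        if pieces[0] == 'http':
            result.append(urlparse.urlunparse(pieces))
    return result
```

With the parser p from the example, http_links(p.anchorlist) would return the URLs that the loop is meant to print. Returning a list rather than printing makes the filtering easy to check in isolation.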

