I l@ve RuBoard

12.6 More on HTML and URL Escapes

Perhaps the most subtle change in the last section's rewrite is that, for robustness, this version also calls cgi.escape for the language name, not just for the language's code snippet. It's unlikely but not impossible that someone could pass the script a language name with an embedded HTML character. For example, a URL like:

http://starship.python.net/~lutz/Basics/languages2reply.cgi?language=a<b

embeds a < in the language name parameter (the name is a<b). When submitted, this version uses cgi.escape to properly translate the < for use in the reply HTML, according to the standard HTML escape conventions discussed earlier:

<TITLE>Languages</TITLE>
<H1>Syntax</H1><HR>

<H3>a&lt;b</H3><P><PRE>
Sorry--I don't know that language
</PRE></P><BR>
<HR>

The original version doesn't escape the language name, such that the embedded <b is interpreted as an HTML tag (which may make the rest of the page render in bold font!). As you can probably tell by now, text escapes are pervasive in CGI scripting -- even text that you may think is safe must generally be escaped before being inserted into the HTML code in the reply stream.

12.6.1 URL Escape Code Conventions

Notice, though, that while it's wrong to embed an unescaped < in the HTML code reply, it's perfectly okay to include it literally in the earlier URL string used to trigger the reply. In fact, HTML and URLs define completely different characters as special. For instance, although & must be escaped as &amp inside HTML code, we have to use other escaping schemes to code a literal & within a URL string (where it normally separates parameters). To pass a language name like a&b to our script, we have to type the following URL:

http://starship.python.net/~lutz/Basics/languages2reply.cgi?language=a%26b

Here, %26 represents & -- the & is replaced with a % followed by the hexadecimal value (0x26) of its ASCII code value (38). By URL standard, most nonalphanumeric characters are supposed to be translated to such escape sequences, and spaces are replaced by + signs. Technically, this convention is known as the application/x-www-form-urlencoded query string format, and it's part of the magic behind those bizarre URLs you often see at the top of your browser as you surf the Web.

12.6.2 Python HTML and URL Escape Tools

If you're like me, you probably don't have the hexadecimal value of the ASCII code for & committed to memory. Luckily, Python provides tools that automatically implement URL escapes, just as cgi.escape does for HTML escapes. The main thing to keep in mind is that HTML code and URL strings are written with entirely different syntax, and so they employ distinct escaping conventions. Web users don't generally care, unless they need to type complex URLs explicitly (browsers handle most escape code details internally). But if you write scripts that must generate HTML or URLs, you need to be careful to escape characters that are reserved in either syntax.

Because HTML and URLs have different syntaxes, Python provides two distinct sets of tools for escaping their text. In the standard Python library:

cgi.escape escapes text to be embedded in HTML.
urllib.quote and quote_plus escape text to be embedded in URLs.

The urllib module also has tools for undoing URL escapes (unquote, unquote_plus), but HTML escapes are undone during HTML parsing at large (htmllib). To illustrate the two escape conventions and tools, let's apply each toolset to a few simple examples.

12.6.3 Escaping HTML Code

As we saw earlier, cgi.escape translates code for inclusion within HTML. We normally call this utility from a CGI script, but it's just as easy to explore its behavior interactively:

>>> import cgi
>>> cgi.escape('a < b > c & d "spam"', 1)
'a &lt; b &gt; c &amp; d &quot;spam&quot;'

>>> s = cgi.escape("1<2 <b>hello</b>")
>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'

Python's cgi module automatically converts characters that are special in HTML syntax according to the HTML convention. It translates <, >, &, and with an extra true argument, ", into escape sequences of the form &X;, where the X is a mnemonic that denotes the original character. For instance, < stands for the "less than" operator (<) and & denotes a literal ampersand (&).

There is no un-escaping tool in the CGI module, because HTML escape code sequences are recognized within the context of an HTML parser, like the one used by your web browser when a page is downloaded. Python comes with a full HTML parser, too, in the form of standard module htmllib, which imports and specializes tools in module sgmllib (HTML is a kind of SGML syntax). We won't go into details on the HTML parsing tools here (see the library manual for details), but to illustrate how escape codes are eventually undone, here is the SGML module at work reading back the last output above:

>>> from sgmllib import TestSGMLParser
>>> p = TestSGMLParser(1)
>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'
>>> for c in s:
...    p.feed(c)
...
>>> p.close(  )
data: '1<2 <b>hello</b>'

12.6.4 Escaping URLs

By contrast, URLs reserve other characters as special and must adhere to different escape conventions. Because of that, we use different Python library tools to escape URLs for transmission. Python's urllib module provides two tools that do the translation work for us: quote, which implements the standard %XX hexadecimal URL escape code sequences for most nonalphanumeric characters, and quote_plus, which additionally translates spaces to + plus signs. The urllib module also provides functions for unescaping quoted characters in a URL string: unquote undoes %XX escapes, and unquote_plus also changes plus signs back to spaces. Here is the module at work, at the interactive prompt:

>>> import urllib
>>> urllib.quote("a & b #! c")
'a%20%26%20b%20%23%21%20c'

>>> urllib.quote_plus("C:\stuff\spam.txt")
'C%3a%5cstuff%5cspam.txt'

>>> x = urllib.quote_plus("a & b #! c")
>>> x
'a+%26+b+%23%21+c'

>>> urllib.unquote_plus(x)
'a & b #! c'

URL escape sequences embed the hexadecimal values of non-safe characters following a % sign (usually, their ASCII codes). In urllib, non-safe characters are usually taken to include everything except letters, digits, a handful of safe special characters (any of _,.-), and / by default). You can also specify a string of safe characters as an extra argument to the quote calls to customize the translations; the argument defaults to /, but passing an empty string forces / to be escaped:

>>> urllib.quote_plus("uploads/index.txt")
'uploads/index.txt'

>>> urllib.quote_plus("uploads/index.txt", '')
'uploads%2findex.txt'

Note that Python's cgi module also translates URL escape sequences back to their original characters and changes + signs to spaces during the process of extracting input information. Internally, cgi.FieldStorage automatically calls urllib.unquote if needed to parse and unescape parameters passed at the end of URLs (most of the translation happens in cgi.parse_qs). The upshot is that CGI scripts get back the original, unescaped URL strings, and don't need to unquote values on their own. As we've seen, CGI scripts don't even need to know that inputs came from a URL at all.

12.6.5 Escaping URLs Embedded in HTML Code

But what do we do for URLs inside HTML? That is, how do we escape when we generate and embed text inside a URL, which is itself embedded inside generated HTML code? Some of our earlier examples used hardcoded URLs with appended input parameters inside <A HREF> hyperlink tags; file languages2.cgi, for instance, prints HTML that includes a URL:

<a href="getfile.cgi?filename=languages2.cgi">

Because the URL here is embedded in HTML, it must minimally be escaped according to HTML conventions (e.g., any < characters must become <), and any spaces should be translated to + signs. A cgi.escape(url) call, followed by a string.replace(url, " ", "+") would take us this far, and would probably suffice for most cases.

That approach is not quite enough in general, though, because HTML escaping conventions are not the same as URL conventions. To robustly escape URLs embedded in HTML code, you should instead call urllib.quote_plus on the URL string before adding it to the HTML text. The escaped result also satisfies HTML escape conventions, because urllib translates more characters than cgi.escape, and the % in URL escapes is not special to HTML.

But there is one more wrinkle here: you also have to be careful with & characters in URL strings that are embedded in HTML code (e.g., within <A> hyperlink tags). Even if parts of the URL string are URL-escaped, when more than one parameter is separated by a &, the & separator might also have to be escaped as & according to HTML conventions. To see why, consider the following HTML hyperlink tag:

<A HREF="file.cgi?name=a&job=b&amp=c&sect=d&lt=e">hello</a>

When rendered in most browsers I've tested, this URL link winds up looking incorrectly like this (the "S" character is really a non-ASCII section marker):

file.cgi?name=a&job=b&=c&S=d<=e

The first two parameters are retained as expected (name=a, job=b), because name is not preceded with an &, and &job is not recognized as a valid HTML character escape code. However, the &amp, &sect, and &lt parts are interpreted as special characters, because they do name valid HTML escape codes. To make this work as expected, the & separators should be escaped:

<A HREF="file.cgi?name=a&amp;job=b&amp;amp=c&amp;sect=d&amp;lt=e">hello</a>

Browsers render this fully escaped link as expected:

file.cgi?name=a&job=b&amp=c&sect=d&lt=e

The moral of this story is that unless you can be sure that the names of all but the leftmost URL query parameters embedded in HTML are not the same as the name of any HTML character escape code like amp, you should generally run the entire URL through cgi.escape after escaping its parameter names and values with urllib.quote_plus:

>>> import cgi
>>> cgi.escape('file.cgi?name=a&job=b&amp=c&sect=d&lt=e')
'file.cgi?name=a&amp;job=b&amp;amp=c&amp;sect=d&amp;lt=e'

Having said that, I should add that some examples in this book do not escape & URL separators embedded within HTML simply because their URL parameter names are known to not conflict with HTML escapes. This is not, however, the most general solution; when in doubt, escape much and often.

"Always Look on the Bright Side of Life"

Lest these formatting rules sound too clumsy (and send you screaming into the night!), note that the HTML and URL escaping conventions are imposed by the Internet itself, not by Python. (As we've seen, Python has a different mechanism for escaping special characters in string constants with backslashes.) These rules stem from the fact that the Web is based upon the notion of shipping formatted strings around the planet, and they were surely influenced by the tendency of different interest groups to develop very different notations.

You can take heart, though, in the fact that you often don't need to think in such cryptic terms; when you do, Python automates the translation process with library tools. Just keep in mind that any script that generates HTML or URLs dynamically probably needs to call Python's escaping tools to be robust. We'll see both the HTML and URL escape tool sets employed frequently in later examples in this chapter and the next two. In Chapter 15, we'll also meet systems such as Zope that aim to get rid of some of the low-level complexities that CGI scripters face. And as usual in programming, there is no substitute for brains; amazing technologies like the Internet come at a cost in complexity.

I l@ve RuBoard