19.1. URL Access

A URL identifies a resource on the Internet. A URL is a string composed of several optional parts, called components: scheme, location, path, query, and fragment. A URL with all its parts looks something like:

scheme://lo.ca.ti.on/pa/th?query#fragment

For example, in http://www.python.org:80/faq.cgi?src=fie, the scheme is http, the location is www.python.org:80, the path is /faq.cgi, the query is src=fie, and there is no fragment. Some of the punctuation characters form a part of one of the components they separate, while others are just separators and are part of no component. Omitting punctuation implies missing components. For example, in mailto:me@you.com, the scheme is mailto, the path is me@you.com, and there is no location, query, or fragment. The missing // means the URL has no location part, the missing ? means it has no query part, and the missing # means it has no fragment part.

19.1.1. The urlparse Module

The urlparse module supplies functions to analyze and synthesize URL strings. The most frequently used functions of module urlparse are urljoin, urlsplit, and urlunsplit.

urljoin
urljoin(base_url_string,relative_url_string)

Returns a URL string u, obtained by joining relative_url_string, which may be relative, with base_url_string. The joining procedure that urljoin performs to obtain its result u may be summarized as follows:

When either of the argument strings is empty, u is the other argument.
When relative_url_string explicitly specifies a scheme that is different from that of base_url_string, u is relative_url_string. Otherwise, u's scheme is that of base_url_string.
When the scheme does not allow relative URLs (e.g., mailto), or relative_url_string explicitly specifies a location (even when it is the same as the location of base_url_string), all other components of u are those of relative_url_string. Otherwise, u's location is that of base_url_string.
u's path is obtained by joining the paths of base_url_string and relative_url_string according to standard syntax for absolute and relative URL paths. For example:

import urlparse
urlparse.urljoin('http://somehost.com/some/path/here','../other/path')
# Result is: 'http://somehost.com/some/other/path'
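As a rough illustration of the joining rules summarized above, the following sketch exercises them one at a time. The host other.org and the relative paths are made up for illustration. The fallback import is an assumption added here so that the same snippet also runs on Python 3, where these functions live in urllib.parse.

```python
try:
    from urlparse import urljoin          # Python 2, as in the text
except ImportError:
    from urllib.parse import urljoin      # Python 3 location (assumption)

base = 'http://somehost.com/some/path/here'

# An empty argument: the result is the other argument.
empty_relative = urljoin(base, '')
empty_base = urljoin('', 'other/path')

# A different explicit scheme: the relative URL wins outright.
other_scheme = urljoin(base, 'ftp://other.org/pub/file')

# An explicit location: everything after the scheme comes from the relative URL.
explicit_location = urljoin(base, '//somehost.com/x')

# Same scheme and location: only the paths are joined.
absolute_path = urljoin(base, '/top')      # replaces the whole path
sibling_path = urljoin(base, 'sibling')    # replaces the last path segment

for result in (empty_relative, empty_base, other_scheme,
               explicit_location, absolute_path, sibling_path):
    print(result)
```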

urlsplit
urlsplit(url_string,default_scheme='',allow_fragments=True)

Analyzes url_string and returns a tuple with five string items: scheme, location, path, query, and fragment. default_scheme is the first item when the url_string lacks a scheme. When allow_fragments is False, the tuple's last item is always '', whether or not url_string has a fragment. Items corresponding to missing parts are always ''. For example:

urlparse.urlsplit('http://www.python.org:80/faq.cgi?src=fie') # Result is: ('http','www.python.org:80','/faq.cgi','src=fie','')
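The optional arguments are easier to see in action. The sketch below, which is only illustrative, splits a scheme-less URL with a default_scheme, repeats the call with allow_fragments set to False, and splits the mailto URL from the start of this section. The URL //www.python.org/doc/#intro is a made-up input, and the fallback import is an assumption added so the snippet also runs on Python 3.

```python
try:
    from urlparse import urlsplit         # Python 2, as in the text
except ImportError:
    from urllib.parse import urlsplit     # Python 3 location (assumption)

# No scheme in the string, so default_scheme fills the first item.
with_default = tuple(urlsplit('//www.python.org/doc/#intro', 'http'))

# With allow_fragments=False the '#intro' text stays in the path
# and the last item is ''.
no_fragments = tuple(urlsplit('//www.python.org/doc/#intro', 'http', False))

# No '//' means no location; no '?' or '#' means no query or fragment.
mail = tuple(urlsplit('mailto:me@you.com'))

print(with_default)
print(no_fragments)
print(mail)
```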

urlunsplit
urlunsplit(url_tuple)

url_tuple is any iterable with exactly five items, all strings. For example, any return value from a urlsplit call is an acceptable argument for urlunsplit. urlunsplit returns a URL string with the given components and the needed separators, but with no redundant separators (e.g., there is no # in the result when the fragment, url_tuple's last item, is ''). For example:

urlparse.urlunsplit(('http','www.python.org:80','/faq.cgi','src=fie','')) # Result is: 'http://www.python.org:80/faq.cgi?src=fie'

urlunsplit(urlsplit(x)) returns a normalized form of URL string x, which is not necessarily equal to x because x need not be normalized. For example:

urlparse.urlunsplit(urlparse.urlsplit('http://a.com/path/a?')) # Result is: 'http://a.com/path/a'

In this case, the normalization ensures that redundant separators, such as the trailing ? in the argument to urlsplit, are not present in the result.
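A common use of the urlsplit and urlunsplit pair is to change a single component while leaving the others alone. The helper below is a minimal sketch of that idea; the name with_query is hypothetical and not part of any module, and the fallback import is an assumption added so the snippet also runs on Python 3.

```python
try:
    from urlparse import urlsplit, urlunsplit       # Python 2, as in the text
except ImportError:
    from urllib.parse import urlsplit, urlunsplit   # Python 3 (assumption)

def with_query(url, new_query):
    """Return url with its query component replaced by new_query.

    with_query is a hypothetical helper written for this example.
    """
    scheme, location, path, _old_query, fragment = urlsplit(url)
    return urlunsplit((scheme, location, path, new_query, fragment))

changed = with_query('http://www.python.org:80/faq.cgi?src=fie', 'src=foo')
print(changed)
```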

19.1.2. The urllib Module

The urllib module supplies simple functions to read data from URLs. urllib supports the following protocols (schemes): http, https, ftp, gopher, and file. file indicates a local file. urllib uses file as the default scheme for URLs that lack an explicit scheme. You can find simple, typical examples of urllib use in Chapter 23, where urllib.urlopen is used to fetch HTML and XML pages that all the various examples parse and analyze.

19.1.2.1. Functions

Module urllib supplies a number of functions, with urlopen being the most frequently used.

quote
quote(str,safe='/')

Returns a copy of str where special characters are changed into Internet-standard quoted form %xx. Does not quote alphanumeric characters, any of the characters _.-, nor any of the characters in string safe; a space is quoted as %20. For example:

print urllib.quote('zip&zap') # emits: zip%26zap

quote_plus
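quote_plus(str, safe='')

Like quote, but also changes spaces into plus signs. The default for safe is the empty string here, so quote_plus also quotes any / unless you pass safe='/'.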

unquote
unquote(str)

Returns a copy of str where each quoted form %xx is changed into the corresponding character. For example:

print urllib.unquote('zip%26zap') # emits: zip&zap

unquote_plus
unquote_plus(str)

Like unquote, but also changes plus signs into spaces.
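The four quoting functions are easiest to compare side by side. The sketch below runs one made-up string through each of them. The fallback import is an assumption added so the snippet also runs on Python 3, where these functions live in urllib.parse.

```python
try:
    from urllib import quote, quote_plus, unquote, unquote_plus    # Python 2
except ImportError:
    # Python 3 location (assumption, not from the text)
    from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = 'zip & zap'

quoted = quote(text)             # space becomes %20, & becomes %26
plus_quoted = quote_plus(text)   # space becomes +, & becomes %26

# unquote leaves plus signs alone; unquote_plus turns them back into spaces.
only_unquoted = unquote(plus_quoted)
round_trip = unquote_plus(plus_quoted)

print(quoted)
print(plus_quoted)
print(only_unquoted)
print(round_trip)
```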

urlcleanup
urlcleanup( )

Clears the cache of function urlretrieve, covered in "urlretrieve".

urlencode
urlencode(query,doseq=False)

Returns a string with the URL-encoded form of query. query can be either a sequence of (name, value) pairs, or a mapping, in which case the resulting string encodes the mapping's (key, value) pairs. For example:

urllib.urlencode([('ans',42),('key','val')]) # 'ans=42&key=val'
urllib.urlencode({'ans':42, 'key':'val'}) # 'key=val&ans=42'

The order of items in a dictionary is arbitrary: if you need the URL-encoded form to have key/value pairs in a specific order, use a sequence as the query argument, as in the first call in this snippet.

When doseq is true, any value in query that is a sequence and is not a string is encoded as separate parameters, one per item in value. For example:

urllib.urlencode([('K',('x','y','z'))],True) # 'K=x&K=y&K=z'

When doseq is false (the default), each value is encoded as the quote_plus of its string form given by built-in str, whether the value is a sequence or not:

urllib.urlencode([('K',('x','y','z'))],False) # 'K=%28%27x%27%2C+%27y%27%2C+%27z%27%29'
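In practice, urlencode is most often used to build the query part of a URL. The following sketch shows one way to do so, using a sequence of pairs so that the order of the parameters is predictable. The parameter names q and tag are made up for illustration, and the fallback import is an assumption added so the snippet also runs on Python 3.

```python
try:
    from urllib import urlencode          # Python 2, as in the text
except ImportError:
    from urllib.parse import urlencode    # Python 3 location (assumption)

base_url = 'http://www.python.org/faq.cgi'

# A sequence of pairs keeps the parameters in the order given.
params = [('src', 'fie'), ('q', 'zip & zap')]
query = urlencode(params)
full_url = base_url + '?' + query

# With doseq true, a list value expands into one parameter per item.
repeated = urlencode([('tag', ['a', 'b'])], True)

print(full_url)
print(repeated)
```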

urlopen
urlopen(urlstring,data=None,proxies=None)

Accesses the given URL and returns a read-only file-like object f. f supplies file-like methods read, readline, readlines, and close, as well as two others:

f.geturl( )

Returns the URL of f. This may differ from urlstring by normalization (as mentioned for function urlunsplit earlier) and because of HTTP redirects (i.e., indications that the requested data is located elsewhere). urllib supports redirects transparently, and method geturl lets you check for them if you want.

f.info( )

Returns an instance m of class Message of module mimetools, covered in "The Message Classes of the rfc822 and mimetools Modules" on page 573. m's headers provide metadata about f. For example, m['Content-Type'] is the MIME type of the data in f, and m's methods m.gettype( ), m.getmaintype( ), and m.getsubtype( ) provide the same information.

When data is None and urlstring's scheme is http, urlopen sends a GET request. When data is not None, urlstring's scheme must be http, and urlopen sends a POST request. data must then be in URL-encoded form, and you normally prepare it with function urlencode, covered in "urlencode" on page 496.

urlopen can use proxies that do not require authentication. Set environment variables http_proxy, ftp_proxy, and gopher_proxy to the proxies' URLs to exploit this. You normally perform such settings in your system's environment, in platform-dependent ways, before you start Python. On the Macintosh only, urlopen transparently and implicitly retrieves proxy URLs from your Internet configuration settings. Alternatively, you can pass as argument proxies a mapping whose keys are scheme names, with the corresponding values being proxy URLs. For example:

f=urllib.urlopen('http://python.org', proxies={'http':'http://prox:999'})

urlopen does not support proxies that require authentication; for such advanced needs, use the richer library module urllib2, covered in "The urllib2 Module" on page 498.
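To try urlopen and the file-like object it returns without touching the network, you can point it at a local file through the file scheme. The sketch below does that with a temporary file that it creates and removes itself. The file contents are made up, pathname2url is used here only to turn the local path into URL form, and the fallback import is an assumption added so the snippet also runs on Python 3 (where urlopen lives in urllib.request).

```python
import os
import tempfile

try:
    from urllib import urlopen, pathname2url            # Python 2
except ImportError:
    from urllib.request import urlopen, pathname2url    # Python 3 (assumption)

# Create a small local file to stand in for a remote resource.
fd, path = tempfile.mkstemp(suffix='.txt')
os.write(fd, b'hello, world\n')
os.close(fd)

url = 'file:' + pathname2url(path)
f = urlopen(url)
try:
    data = f.read()                           # the file's bytes
    final_url = f.geturl()                    # URL actually used
    content_type = f.info()['Content-Type']   # metadata from the headers
finally:
    f.close()
    os.remove(path)

print(final_url)
print(content_type)
```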

urlretrieve
urlretrieve(urlstring,filename=None,reporthook=None,data=None)

Similar to urlopen(urlstring,data), but instead returns a pair (f,m). f is a string that specifies the path to a file on the local filesystem. m is an instance of class Message of module mimetools, like the result of method info called on the result value of urlopen, covered in "urlopen".

When filename is None, urlretrieve copies retrieved data to a temporary local file, and f is the path to the temporary local file. When filename is not None, urlretrieve copies retrieved data to the file named filename, and f is filename. When reporthook is not None, it must be a callable with three arguments, as in the function:

def reporthook(block_count, block_size, file_size):
    print block_count

urlretrieve calls reporthook zero or more times while retrieving data. At each call, it passes block_count, the number of blocks of data retrieved so far; block_size, the size in bytes of each block; and file_size, the total size of the file in bytes. urlretrieve passes file_size as -1 when it cannot determine file size, which depends on the protocol involved and on how completely the server implements that protocol. The purpose of reporthook is to allow your program to give graphical or textual feedback to the user about the progress of the file-retrieval operation that urlretrieve performs.
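A reporthook usually turns its three arguments into a progress message. The following is a minimal sketch of one way to do that, with the -1 case handled separately. The names progress_text and show_progress are hypothetical, and the message format is only an example.

```python
import sys

def progress_text(block_count, block_size, file_size):
    """Format a progress message from reporthook's three arguments.

    progress_text is a hypothetical helper written for this example.
    """
    received = block_count * block_size
    if file_size < 0:
        # The total size is unknown, so only the running count can be shown.
        return '%d bytes received' % received
    # The last block may be short, so never report more than the total.
    received = min(received, file_size)
    if file_size:
        percent = 100.0 * received / file_size
    else:
        percent = 100.0
    return '%d of %d bytes (%.0f%%)' % (received, file_size, percent)

def show_progress(block_count, block_size, file_size):
    """A callable suitable for passing as urlretrieve's reporthook."""
    sys.stderr.write(progress_text(block_count, block_size, file_size) + '\n')

# Calling the hook by hand shows what urlretrieve would make it print.
show_progress(2, 8192, 32768)
show_progress(3, 8192, -1)
```

You would then pass show_progress as the reporthook argument in a call to urlretrieve.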

19.1.2.2. The FancyURLopener class

You normally use module urllib through the functions it supplies (most often urlopen). To customize urllib's functionality, however, you can subclass urllib's FancyURLopener class and bind an instance of your subclass to attribute _urlopener of module urllib. The customizable aspects of an instance f of a subclass of FancyURLopener are the following.

prompt_user_passwd
f.prompt_user_passwd(host,realm)

Returns a pair (user,password) to use to authenticate access to host in the security realm. The default implementation in class FancyURLopener prompts the user for this data in interactive text mode. Your subclass can override this method in order to interact with the user via a GUI or to fetch authentication data from persistent storage.

version
f.version

The string that f uses to identify itself to the server, for example, via the User-Agent header in the HTTP protocol. You can override this attribute by subclassing or rebind it directly on an instance of FancyURLopener.

19.1.3. The urllib2 Module

The urllib2 module is a rich, highly customizable superset of module urllib. urllib2 lets you work directly with advanced aspects of protocols such as HTTP. For example, you can send requests with customized headers as well as URL-encoded POST bodies, and handle authentication in various realms, in both Basic and Digest forms, directly or via HTTP proxies.

In the rest of this section, I cover only the ways in which urllib2 lets your program customize these advanced aspects of URL retrieval. I do not try to impart the advanced knowledge of HTTP and other network protocols, independent of Python, that you need to make full use of urllib2's rich functionality. As an HTTP tutorial, I recommend Python Web Programming, by Steve Holden (New Riders): it offers good coverage of HTTP basics with examples coded in Python and a good bibliography if you need further details about network protocols.

19.1.3.1. Functions

urllib2 supplies a function urlopen that is basically identical to urllib's urlopen. To customize urllib2, install, before calling urlopen, any number of handlers grouped into an opener, using the build_opener and install_opener functions.

You can also optionally pass to urlopen an instance of class Request instead of a URL string. Such an instance may include both a URL string and supplementary information on how to access it, as covered in "The Request class" on page 500.

build_opener
build_opener(*handlers)

Creates and returns an instance of class OpenerDirector (covered in "The OpenerDirector class" on page 502) with the given handlers. Each handler can be a subclass of class BaseHandler, instantiable without arguments, or an instance of such a subclass, however instantiated. build_opener adds instances of various handler classes provided by module urllib2 in front of the handlers you specify to handle proxies; unknown schemes; the http, file, and https schemes; HTTP errors; and HTTP redirects. However, if you have instances or subclasses of said classes in handlers, this indicates that you want to override these defaults.

install_opener
install_opener(opener)

Installs opener as the opener for further calls to urlopen. opener can be an instance of class OpenerDirector, such as the result of a call to function build_opener, or any signature-compatible object.
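The following sketch shows the build-then-install sequence without making any network request. It passes an instance of HTTPHandler, one of the handler classes that urllib2 supplies, so that it takes the place of the default HTTP handler. The debuglevel value is an arbitrary choice, inspecting the opener's handlers attribute is done here only to make the effect visible, and the fallback import is an assumption added so the snippet also runs on Python 3 (where the module is urllib.request).

```python
try:
    import urllib2 as request_module           # Python 2, as in the text
except ImportError:
    import urllib.request as request_module    # Python 3 (assumption)

# An instance of a handler class that build_opener would otherwise
# supply by default; passing it overrides that default.
debug_handler = request_module.HTTPHandler(debuglevel=1)

opener = request_module.build_opener(debug_handler)
request_module.install_opener(opener)   # later urlopen calls use this opener

# Only the handler passed in is present for plain HTTP.
http_handlers = [h for h in opener.handlers
                 if isinstance(h, request_module.HTTPHandler)]
print(len(http_handlers))
```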

urlopen
urlopen(url,data=None)

Almost identical to the urlopen function in module urllib. However, you customize behavior via the opener and handler classes of urllib2 (covered in "The OpenerDirector class" on page 502 and "Handler classes" on page 502) rather than via class FancyURLopener as in module urllib. Argument url can be a URL string, like for the urlopen function in module urllib. Alternatively, url can be an instance of class Request, covered in the next section.

19.1.3.2. The Request class

You can optionally pass to function urlopen an instance of class Request instead of a URL string. Such an instance can embody both a URL and, optionally, other information on how to access the target URL.

Request
class Request(urlstring,data=None,headers={})

urlstring is the URL that this instance of class Request embodies. For example, if there are no data and headers, calling:

urllib2.urlopen(urllib2.Request(urlstring))

is just like calling:

urllib2.urlopen(urlstring)

When data is not None, the Request constructor implicitly calls on the new instance r its method r.add_data(data). headers must be a mapping of header names to header values. The Request constructor executes the equivalent of the loop:

for k,v in headers.items( ):
    r.add_header(k,v)

The Request constructor also accepts optional parameters allowing fine-grained control of HTTP Cookie behavior, but such advanced functionality is rarely necessary: the class's default handling of cookies is generally sufficient. For fine-grained, client-side control of cookies, see also http://docs.python.org/lib/module-cookielib.html; I do not cover the cookielib module of the standard library in this book.

An instance r of class Request supplies the following methods.

add_data
r.add_data(data)

Sets data as r's data. Calling urlopen(r) then becomes like calling urlopen(r,data); that is, it requires r's scheme to be http and uses a POST request with a body of data, which must be a URL-encoded string.

Despite its name, method add_data does not necessarily add the data. If r already had data, set in r's constructor or by previous calls to r.add_data, the latest call to r.add_data replaces the previous value of r's data with the new given one. In particular, r.add_data(None) removes r's previous data, if any.

add_header
r.add_header(key,value)

Adds a header with the given key and value to r's headers. If r's scheme is http, r's headers are sent as part of the request. When you add more than one header with the same key, later additions overwrite previous ones, so out of all headers with one given key, only the one given last matters.

add_unredirected_header
r.add_unredirected_header(key,value)

Like add_header, except that the header is added only for the first request, and is not used if the requesting procedure meets and follows any further HTTP redirection.

get_data
r.get_data( )

Returns the data of r, either None or a URL-encoded string.

get_full_url
r.get_full_url( )

Returns the URL of r, as given in the constructor for r.

get_host
r.get_host( )

Returns the host component of r's URL.

get_method
r.get_method( )

Returns the HTTP method of r, either of the strings 'GET' or 'POST'.

get_selector
r.get_selector( )

Returns the selector components of r's URL (path and all following components).

get_type
r.get_type( )

Returns the scheme component of r's URL (i.e., the protocol).

has_data
r.has_data( )

Like r.get_data( ) is not None.

has_header
r.has_header(key)

Returns True if r has a header with the given key; otherwise, returns False.

set_proxy
r.set_proxy(host,scheme)

Sets r to use a proxy at the given host and scheme for accessing r's URL.
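You can build a Request and inspect it without sending anything, which is a convenient way to check what urlopen would do with it. The sketch below uses only methods that exist in both urllib2 and its Python 3 successor (get_method, get_full_url, add_header, has_header, and get_header). The Accept header values are arbitrary, get_header is not covered in this section, and the fallback import and the encode call are assumptions added so the snippet also runs on Python 3, which expects the body as bytes.

```python
try:
    from urllib2 import Request              # Python 2, as in the text
    from urllib import urlencode
except ImportError:
    from urllib.request import Request       # Python 3 (assumption)
    from urllib.parse import urlencode

url = 'http://www.python.org/faq.cgi'

# A URL-encoded body; encode() is harmless on Python 2 and needed on Python 3.
body = urlencode([('ans', 42), ('key', 'val')]).encode('ascii')

r = Request(url, body, {'Accept': 'text/html'})
r.add_header('Accept', 'text/plain')    # same key: overwrites the earlier value

plain = Request(url)                    # no data, so this would be a GET

print(r.get_method())
print(plain.get_method())
print(r.get_header('Accept'))
```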

19.1.3.3. The OpenerDirector class

An instance d of class OpenerDirector collects instances of handler classes and orchestrates their use to open URLs of various schemes and to handle errors. Normally, you create d by calling function build_opener and then install it by calling function install_opener. For advanced uses, you may also access various attributes and methods of d, but this is a rare need and I do not cover it further in this book.

19.1.3.4. Handler classes

Module urllib2 supplies a class BaseHandler to use as the superclass of any custom handler classes you write. urllib2 also supplies many concrete subclasses of BaseHandler that handle schemes gopher, ftp, http, https, and file, as well as authentication, proxies, redirects, and errors. Writing custom handlers is an advanced topic, and I do not cover it further in this book.

19.1.3.5. Handling authentication

urllib2's default opener does no authentication. To get authentication, call build_opener to build an opener with instances of HTTPBasicAuthHandler, ProxyBasicAuthHandler, HTTPDigestAuthHandler, and/or ProxyDigestAuthHandler, depending on whether you want authentication to be directly in HTTP or to a proxy, and on whether you need Basic or Digest authentication.

To instantiate each of these authentication handlers, use an instance x of class HTTPPasswordMgrWithDefaultRealm as the only argument to the authentication handler's constructor. You normally use the same x to instantiate all the authentication handlers you need. To record users and passwords for given authentication realms and URLs, call x.add_password one or more times.

add_password
x.add_password(realm,URLs,user,password)

Records in x the pair (user,password) as the credentials in the given realm for URLs given by URLs. realm is a string that names an authentication realm, or None, to supply default credentials for any realm not specifically recorded. URLs is a URL string or a sequence of URL strings. A URL u is deemed applicable for these credentials if there is an item u1 of URLs such that the location components of u and u1 are equal, and the path component of u1 is a prefix of that of u. Other components (scheme, query, fragment) don't affect applicability for authentication purposes.
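The applicability rule can be checked directly, without any network access, by asking the password manager which credentials it would supply for a given URL. The sketch below uses the manager's find_user_password method for that purpose; the method is not covered in this section, and the realm name and the /private/ path are made up for illustration. The fallback import is an assumption added so the snippet also runs on Python 3.

```python
try:
    from urllib2 import HTTPPasswordMgrWithDefaultRealm           # Python 2
except ImportError:
    # Python 3 location (assumption, not from the text)
    from urllib.request import HTTPPasswordMgrWithDefaultRealm

x = HTTPPasswordMgrWithDefaultRealm()

# realm None: default credentials for any realm under this URL prefix.
x.add_password(None, 'http://myhost.com/private/', 'auser', 'apassword')

# Same location, and the recorded path is a prefix: credentials apply.
inside = x.find_user_password('Some Realm',
                              'http://myhost.com/private/page.html')

# Same location, but the path is outside the recorded prefix.
outside = x.find_user_password('Some Realm', 'http://myhost.com/public/')

print(inside)
print(outside)
```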

The following example shows how to use urllib2 with basic HTTP authentication:

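import urllib2

# Record default credentials (realm None) for everything under myhost.com
x = urllib2.HTTPPasswordMgrWithDefaultRealm( )
x.add_password(None, 'http://myhost.com/', 'auser', 'apassword')

# Build and install an opener that performs Basic authentication
auth = urllib2.HTTPBasicAuthHandler(x)
opener = urllib2.build_opener(auth)
urllib2.install_opener(opener)

# urlopen now authenticates as needed
flob = urllib2.urlopen('http://myhost.com/index.html')
for line in flob.readlines( ):
    print line,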