10.6. Compressed Files

Storage space and transmission bandwidth are increasingly cheap and abundant, but in many cases you can save such resources, at the expense of some computational effort, by using compression. Computational power grows cheaper and more abundant even faster than other resources, such as bandwidth, so compression's popularity keeps growing. Python makes it easy for your programs to support compression, since the Python standard library contains several modules dedicated to compression.

Since Python offers so many ways to deal with compression, some guidance may be helpful. Files containing data compressed with the zlib module are not automatically interchangeable with other programs, except for files built with the zipfile module, which respects the standard format of ZIP file archives. You can write custom programs, in any language able to use InfoZip's free zlib compression library, to read files produced by Python programs using the zlib module. However, if you need to interchange compressed data with programs coded in other languages and have a choice of compression methods, I suggest you use modules bz2 (best), gzip, or zipfile instead. Module zlib may still be useful when you want to compress parts of datafiles that are in a proprietary format of your own and need not be interchanged with any program outside your own application.

10.6.1. The gzip Module

The gzip module lets you read and write files compatible with those handled by the powerful GNU compression programs gzip and gunzip. The GNU programs support many compression formats, but module gzip supports only the highly effective native gzip format, normally denoted by appending the extension .gz to a filename. Module gzip supplies the GzipFile class and an open factory function.

GzipFile
class GzipFile(filename=None, mode=None, compresslevel=9, fileobj=None)

Creates and returns a file-like object f wrapping the file or file-like object fileobj. When fileobj is None, filename must be a string that names a file; GzipFile opens that file with the given mode (by default, 'rb'), and f wraps the resulting file object. mode should be 'ab', 'rb', 'wb', or None. When mode is None, f uses the mode of fileobj if it can find out the mode; otherwise, it uses 'rb'. When filename is None, f uses the filename of fileobj if it can find out the name; otherwise, it uses ''. compresslevel is an integer between 1 and 9: 1 requests modest compression but fast operation; 9 requests the best compression at the cost of more computation.

File-like object f delegates most methods to the underlying file-like object fileobj, transparently accounting for compression as needed. However, f does not allow nonsequential access, so f does not supply methods seek and tell. Calling f.close does not close fileobj if f was created with a not-None fileobj. This matters when fileobj is an instance of StringIO.StringIO: you can call fileobj.getvalue after f.close to get the compressed data string. However, it also means that you always have to call fileobj.close explicitly after f.close.
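The behavior of f.close with a not-None fileobj can be sketched as follows. This is a minimal illustration, not the only way to do it: it assumes io.BytesIO as the in-memory file-like object (standing in for StringIO.StringIO), and it is written so that it should run on both Python 2.6+ and Python 3. The names buffer, wrapper, and reader are made up for the example.

```python
import gzip
import io

# io.BytesIO is an assumption of this sketch: an in-memory binary file object.
buffer = io.BytesIO()
wrapper = gzip.GzipFile(fileobj=buffer, mode='wb')
wrapper.write(b'some text to compress\n' * 100)
wrapper.close()                  # finishes the gzip stream; buffer stays open

compressed = buffer.getvalue()   # legal because wrapper.close() left buffer open
buffer.close()                   # closing the underlying object is our job

# Round trip: decompress again from a second in-memory buffer.
source = io.BytesIO(compressed)
reader = gzip.GzipFile(fileobj=source, mode='rb')
restored = reader.read()
reader.close()
source.close()
```

The two-step close at each end mirrors the rule above: the wrapper never closes an object it did not open itself.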

open
open(filename, mode='rb', compresslevel=9)

Like GzipFile(filename, mode, compresslevel), but filename is mandatory and there is no provision for passing an already opened fileobj.
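When you only have a filename, gzip.open is the shorter route. The sketch below is one possible usage, assuming a throwaway directory from tempfile.mkdtemp and a hypothetical file name example.txt.gz; it should run on both Python 2.6+ and Python 3.

```python
import gzip
import os
import tempfile

# Hypothetical path inside a scratch directory created just for this example.
path = os.path.join(tempfile.mkdtemp(), 'example.txt.gz')

out = gzip.open(path, 'wb', 6)       # compresslevel 6: a middle-of-the-road choice
try:
    out.write(b'first line\nsecond line\n')
finally:
    out.close()                      # gzip.open opened the file, so close() closes it

inp = gzip.open(path, 'rb')
try:
    text = inp.read()
finally:
    inp.close()
```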

A gzip example

Say that you have some function f(x) that writes data to a text file object x passed in as an argument by calling x.write and/or x.writelines. It's easy to make f write data to a gzip-compressed file instead:

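    import gzip
    # the file on disk holds compressed bytes, so open it in binary mode
    underlying_file = open('x.txt.gz', 'wb')
    compressing_wrapper = gzip.GzipFile(fileobj=underlying_file, mode='wt')
    f(compressing_wrapper)
    # close the wrapper first, then the file it wraps
    compressing_wrapper.close()
    underlying_file.close()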

This example opens the underlying binary file x.txt.gz and explicitly wraps it with gzip.GzipFile; thus, at the end, we need to close each object separately. The underlying file must be opened in binary mode, since any translation of line endings would produce an invalid compressed file. The wrapper is given the text-style mode 'wt' because f writes text, but do not rely on the wrapper to translate \n to os.linesep: GzipFile appears to treat the data it compresses as binary and to store it exactly as written. Reading back a compressed text file (for example, to display it on standard output) is similar:

    import gzip
    underlying_file = open('x.txt.gz', 'rb')
    uncompressing_wrapper = gzip.GzipFile(fileobj=underlying_file, mode='rt')
    # the trailing comma keeps print from adding a second newline
    for line in uncompressing_wrapper:
        print line,
    uncompressing_wrapper.close()
    underlying_file.close()

10.6.2. The bz2 Module

The bz2 module lets you read and write files compatible with those handled by the compression programs bzip2 and bunzip2, which often achieve even better compression than gzip and gunzip. Module bz2 supplies the BZ2File class, for transparent file compression and decompression, and functions compress and decompress to compress and decompress data strings in memory. It also provides objects to compress and decompress data incrementally, enabling you to work with data streams that are too large to comfortably fit in memory at once. For such advanced functionality, consult the Python library's online reference.

BZ2File
class BZ2File(filename=None, mode='r', buffering=0, compresslevel=9)

Creates and returns a file-like object f, corresponding to the bzip2-compressed file named by filename, which must be a string denoting a file's path. mode can be 'r', for reading; 'w', for writing; or 'rU', for reading with universal-newlines translation. When buffering is 0, the default, the file is unbuffered. When buffering is greater than 0, the file uses a buffer of buffering bytes, rounded up to a reasonable amount. compresslevel is an integer between 1 and 9: 1 requests modest compression but fast operation; 9 requests the best compression at the cost of more computation.

f supplies all methods of built-in file objects, including seek and tell. Thus, f is seekable; however, the seek operation is emulated, and, while guaranteed to be semantically correct, may in some cases be extremely slow.
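A short sketch of BZ2File in use follows. It assumes a scratch directory from tempfile.mkdtemp and a hypothetical file name example.txt.bz2, passes compresslevel by keyword, and leaves buffering at its default; it should run on both Python 2.6+ and Python 3.

```python
import bz2
import os
import tempfile

# Hypothetical path inside a scratch directory created just for this example.
path = os.path.join(tempfile.mkdtemp(), 'example.txt.bz2')

out = bz2.BZ2File(path, 'w', compresslevel=9)
try:
    out.write(b'spam and eggs\n' * 1000)
finally:
    out.close()

inp = bz2.BZ2File(path, 'r')
try:
    first_line = inp.readline()
    inp.seek(0)                  # emulated seek: correct, but potentially slow
    everything = inp.read()
finally:
    inp.close()
```

Rewinding with seek(0) is cheap to write but forces decompression to start over, which is why sequential access is the better habit with large files.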

compress
compress(s, level=9)

Compresses string s and returns the string of compressed data. level is an integer between 1 and 9: 1 requests modest compression but fast operation; 9 requests the best compression at the cost of more computation.

decompress
decompress(s)

Decompresses the compressed data string s and returns the string of uncompressed data.
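For data that already sits in memory, the two functions form a simple round trip. The sample data here is invented for illustration; the sketch should run on both Python 2.6+ and Python 3.

```python
import bz2

# Highly repetitive sample data, chosen so that compression visibly pays off.
original = b'abracadabra ' * 500

packed = bz2.compress(original, 9)   # second argument is the compression level
unpacked = bz2.decompress(packed)
```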

10.6.3. The tarfile Module

The tarfile module lets you read and write TAR files (archive files compatible with those handled by popular archiving programs such as tar), optionally with either gzip or bzip2 compression. For invalid TAR file errors, functions of module tarfile raise exceptions that are instances of exception class tarfile.TarError. Module tarfile supplies the following classes and functions.

is_tarfile
is_tarfile(filename)

Returns True if the file named by string filename appears to be a valid TAR file (possibly with compression), judging by the first few bytes; otherwise, returns False.

TarInfo
class TarInfo(name='')

Methods getmember and getmembers of TarFile instances return instances of TarInfo, supplying information about members of the archive. You can also build a TarInfo instance with a TarFile instance's method gettarinfo. The most useful attributes supplied by a TarInfo instance t are:

linkname

A string that is the target file's name if t.type is LNKTYPE or SYMTYPE

mode

Permission and other mode bits of the file identified by t

mtime

Time of last modification of the file identified by t

name

Name in the archive of the file identified by t

size

Size in bytes (uncompressed) of the file identified by t

type

File type, one of many constants that are attributes of module tarfile (SYMTYPE for symbolic links, REGTYPE for regular files, DIRTYPE for directories, and so on)

To check the type of t, rather than testing t.type, you can call t's methods. The most frequently used methods of t are:

t.isdir( )

Returns True if the file is a directory

t.isfile( )

Returns True if the file is a regular file

t.issym( )

Returns True if the file is a symbolic link

open
open(filename, mode='r', fileobj=None, bufsize=10240)

Creates and returns a TarFile instance f to read or create a TAR file through file-like object fileobj. When fileobj is None, filename must be a string naming a file; open opens the file with the given mode (by default, 'r'), and f wraps the resulting file object. Calling f.close does not close fileobj if f was opened with a fileobj that is not None. This behavior of f.close is important when fileobj is an instance of StringIO.StringIO: you can call fileobj.getvalue after f.close to get the archived and possibly compressed data as a string. This behavior also means that you have to call fileobj.close explicitly after calling f.close.

mode can be 'r', to read an existing TAR file, with whatever compression it has (if any); 'w', to write a new TAR file, or truncate and rewrite an existing one, without compression; or 'a', to append to an existing TAR file, without compression. Appending to compressed TAR files is not supported. To write a TAR file with compression, mode can be 'w:gz' for gzip compression, or 'w:bz2' for bzip2 compression. Special mode strings 'r|' or 'w|' can be used to read or write uncompressed, nonseekable TAR files (using a buffer of bufsize bytes), and 'r|gz', 'r|bz2', 'w|gz', and 'w|bz2' can be used to read or write such files with compression.
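The mode strings are easiest to grasp from a round trip. The following sketch writes a gzip-compressed archive with 'w:gz' and reads it back with plain 'r'. The scratch directory and the file names hello.txt and hello.tar.gz are assumptions made for the example; the code should run on both Python 2.6+ and Python 3.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()                      # scratch directory for the example
source_path = os.path.join(workdir, 'hello.txt')  # hypothetical file to archive
handle = open(source_path, 'wb')
handle.write(b'hello, tar\n')
handle.close()

# 'w:gz' creates a new archive with gzip compression.
archive_path = os.path.join(workdir, 'hello.tar.gz')
archive = tarfile.open(archive_path, 'w:gz')
try:
    archive.add(source_path, arcname='hello.txt')
finally:
    archive.close()

# Plain 'r' works out by itself which compression, if any, the archive uses.
archive = tarfile.open(archive_path, 'r')
try:
    names = archive.getnames()
finally:
    archive.close()

is_tar = tarfile.is_tarfile(archive_path)
```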

A TarFile instance f supplies the following methods.

add
f.add(filepath, arcname=None, recursive=True)

Adds to archive f the file named by filepath (can be a regular file, a directory, or a symbolic link). When arcname is not None, it's used as the archive member name in lieu of filepath. When filepath is a directory, add recursively adds the whole filesystem subtree rooted in that directory, unless you pass recursive as False.

addfile
f.addfile(tarinfo, fileobj=None)

Adds to archive f a member identified by tarinfo, a TarInfo instance (the data is the first tarinfo.size bytes of file-like object fileobj if fileobj is not None).
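addfile is handy when the member's data never exists as a file on disk. The sketch below builds a TarInfo by hand and archives a byte string held in memory. The member name notes/generated.txt is hypothetical, and io.BytesIO is assumed as the file-like object on both sides; the code should run on both Python 2.6+ and Python 3.

```python
import io
import tarfile
import time

payload = b'generated in memory\n'

info = tarfile.TarInfo(name='notes/generated.txt')  # hypothetical member name
info.size = len(payload)        # addfile reads exactly this many bytes
info.mtime = int(time.time())

buffer = io.BytesIO()
archive = tarfile.open(mode='w', fileobj=buffer)
try:
    archive.addfile(info, io.BytesIO(payload))
finally:
    archive.close()             # writes the end-of-archive blocks; buffer stays open

tar_bytes = buffer.getvalue()
buffer.close()

# Read the archive back from memory to confirm what was stored.
check = tarfile.open(mode='r', fileobj=io.BytesIO(tar_bytes))
try:
    member_names = check.getnames()
    stored = check.extractfile('notes/generated.txt').read()
finally:
    check.close()
```

Setting size is the step most easily forgotten: a TarInfo built by hand starts with size 0, in which case no data would be copied.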

close
f.close( )

Closes archive f. You must call close, or else an incomplete, unusable TAR file might be left on disk. Mandatory finalization is best performed with a try/finally statement, as covered in "try/finally" on page 123.

extract
f.extract(member, path='.')

Extracts the archive member identified by member (a name or a TarInfo instance) into a corresponding file in the directory named by path (the current directory by default).

extractfile
f.extractfile(member)

Extracts the archive member identified by member (a name or a TarInfo instance) and returns a read-only file-like object with methods read, readline, readlines, seek, and tell.

getmember
f.getmember(name)

Returns a TarInfo instance with information about the archive member named by string name.

getmembers
f.getmembers( )

Returns a list of TarInfo instances, one for each member in archive f, in the same order as the entries in the archive itself.

getnames
f.getnames( )

Returns a list of strings, the names of each member in archive f, in the same order as the entries in the archive itself.

gettarinfo
f.gettarinfo(name=None, arcname=None, fileobj=None)

Returns a TarInfo instance with information about the open file object fileobj, when not None, or else the existing file whose path is string name. When arcname is not None, it's used as the name attribute of the resulting TarInfo instance.

list
f.list(verbose=True)

Outputs a textual directory of the archive f to file sys.stdout. If optional argument verbose is False, outputs only the names of the archive's members.
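To tie the TarFile methods and TarInfo attributes together, here is a sketch that builds a tiny archive in memory and then inspects it. The member names docs and docs/README are invented, io.BytesIO is assumed as the file-like object, and the code should run on both Python 2.6+ and Python 3.

```python
import io
import tarfile

payload = b'read me first\n'

# Build a small archive in memory: one directory entry and one regular file.
buffer = io.BytesIO()
archive = tarfile.open(mode='w', fileobj=buffer)
try:
    folder = tarfile.TarInfo('docs')
    folder.type = tarfile.DIRTYPE
    folder.mode = 0o755
    archive.addfile(folder)                      # directories carry no data

    readme = tarfile.TarInfo('docs/README')
    readme.size = len(payload)
    archive.addfile(readme, io.BytesIO(payload))
finally:
    archive.close()

# Reopen the same bytes for reading and classify each member.
buffer.seek(0)
archive = tarfile.open(mode='r', fileobj=buffer)
try:
    summary = []
    for member in archive.getmembers():
        if member.isdir():
            kind = 'dir'
        elif member.isfile():
            kind = 'file'
        else:
            kind = 'other'
        summary.append((member.name, kind, member.size))
    contents = archive.extractfile('docs/README').read()
finally:
    archive.close()
buffer.close()
```

Using isdir and isfile rather than comparing member.type directly keeps the loop readable and avoids hard-coding type constants.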

10.6.4. The zipfile Module

The zipfile module lets you read and write ZIP files (i.e., archive files compatible with those handled by popular compression programs zip and unzip, pkzip and pkunzip, WinZip, and so on). Detailed information on the formats and capabilities of ZIP files can be found at http://www.pkware.com/appnote.html and http://www.info-zip.org/pub/infozip/. You need to study this detailed information in order to perform advanced ZIP file handling with module zipfile. If you do not specifically need to interoperate with other programs using the ZIP file standard, modules gzip and bz2 are most often preferable ways to handle compressed-file needs.

Module zipfile can't handle ZIP files with appended comments, multidisk ZIP files, or .zip archive members using compression types besides the usual ones, known as stored (a file copied to the archive without compression) and deflated (a file compressed using the ZIP format's default algorithm). For errors related to invalid .zip files, functions of module zipfile raise exceptions that are instances of exception class zipfile.error. Module zipfile supplies the following classes and functions.

is_zipfile
is_zipfile(filename)

Returns True if the file named by string filename appears to be a valid ZIP file, judging by the first few and last bytes of the file; otherwise, returns False.

ZipInfo
class ZipInfo(filename='NoName', date_time=(1980, 1, 1, 0, 0, 0))

Methods getinfo and infolist of ZipFile instances return instances of ZipInfo to supply information about members of the archive. The most useful attributes supplied by a ZipInfo instance z are:

comment

A string that is a comment on the archive member

compress_size

Size in bytes of the compressed data for the archive member

compress_type

An integer code recording the type of compression of the archive member

date_time

A tuple with six integers recording the time of last modification to the file: the items are year, month, day (1 and up), hour, minute, second (0 and up)

file_size

Size in bytes of the uncompressed data for the archive member

filename

Name of the file in the archive

ZipFile
class ZipFile(filename, mode='r', compression=zipfile.ZIP_STORED)

Opens a ZIP file named by string filename. mode can be 'r', to read an existing ZIP file; 'w', to write a new ZIP file or truncate and rewrite an existing one; or 'a', to append to an existing file.

When mode is 'a', filename can name either an existing ZIP file (in which case new members are added to the existing archive) or an existing non-ZIP file. In the latter case, a new ZIP file-like archive is created and appended to the existing file. The main purpose of this latter case is to let you build a self-unpacking .exe file (i.e., a Windows executable file that unpacks itself when run). The existing file must then be a pristine copy of an unpacking .exe prefix, as supplied by www.info-zip.org and by other purveyors of ZIP file compression tools.

compression is an integer code that can be either of two attributes of module zipfile. zipfile.ZIP_STORED requests that the archive use no compression; zipfile.ZIP_DEFLATED requests that the archive use the deflation mode of compression (i.e., the most usual and effective compression approach used in .zip files).
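The effect of the compression argument shows up in the ZipInfo attributes file_size and compress_size. The sketch below archives the same file twice, once per compression code, and records both sizes. The scratch directory and the names fox.txt, stored.zip, and deflated.zip are assumptions for the example; the code should run on both Python 2.6+ and Python 3.

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()                  # scratch directory for the example
data_path = os.path.join(workdir, 'fox.txt')  # hypothetical file to archive
handle = open(data_path, 'wb')
handle.write(b'the quick brown fox ' * 200)   # 4000 bytes of repetitive text
handle.close()

sizes = {}
for label, method in (('stored', zipfile.ZIP_STORED),
                      ('deflated', zipfile.ZIP_DEFLATED)):
    archive_path = os.path.join(workdir, label + '.zip')
    z = zipfile.ZipFile(archive_path, 'w', method)
    try:
        z.write(data_path, 'fox.txt')         # compress_type omitted: use z's own
    finally:
        z.close()

    z = zipfile.ZipFile(archive_path, 'r')
    try:
        info = z.getinfo('fox.txt')
        sizes[label] = (info.file_size, info.compress_size)
    finally:
        z.close()
```

With ZIP_STORED the two sizes are equal, since the bytes are copied unchanged; with ZIP_DEFLATED the compressed size is smaller for data as repetitive as this.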

A ZipFile instance z supplies the following methods.

close
z.close( )

Closes archive file z. Make sure the close method gets called, or else an incomplete and unusable ZIP file might be left on disk. Such mandatory finalization is generally best performed with a try/finally statement, as covered in "try/finally" on page 123.

getinfo
z.getinfo(name)

Returns a ZipInfo instance that supplies information about the archive member named by string name.

infolist
z.infolist( )

Returns a list of ZipInfo instances, one for each member in archive z, in the same order as the entries in the archive.

namelist
z.namelist( )

Returns a list of strings, the name of each member in archive z, in the same order as the entries in the archive.

printdir
z.printdir( )

Outputs a textual directory of the archive z to file sys.stdout.

read
z.read(name)

Returns a string containing the uncompressed bytes of the file named by string name in archive z. z must be opened for 'r' or 'a'. When the archive does not contain a file named name, read raises an exception.

testzip
z.testzip( )

Reads and checks the files in archive z. Returns a string with the name of the first archive member that is damaged, or None if the archive is intact.
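A quick integrity check might look like the following sketch, which also shows what happens when read is asked for a member that is not there. The archive name check.zip and member name greeting.txt are invented; catching KeyError reflects the behavior of current Python versions and is an assumption beyond what the text above states. The code should run on both Python 2.6+ and Python 3.

```python
import os
import tempfile
import zipfile

path = os.path.join(tempfile.mkdtemp(), 'check.zip')   # hypothetical archive

z = zipfile.ZipFile(path, 'w', zipfile.ZIP_DEFLATED)
try:
    info = zipfile.ZipInfo('greeting.txt', (2005, 1, 1, 0, 0, 0))
    z.writestr(info, b'hello\n')
finally:
    z.close()

looks_like_zip = zipfile.is_zipfile(path)

z = zipfile.ZipFile(path, 'r')
try:
    first_bad = z.testzip()              # None means no damaged member was found
    greeting = z.read('greeting.txt')
    try:
        z.read('no-such-member.txt')
        missing_raised = False
    except KeyError:                     # assumption: the exception is a KeyError
        missing_raised = True
finally:
    z.close()
```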

write
z.write(filename, arcname=None, compress_type=None)

Writes the file named by string filename to archive z, with archive member name arcname. When arcname is None, write uses filename as the archive member name. When compress_type is None, write uses z's compression type; otherwise, compress_type is zipfile.ZIP_STORED or zipfile.ZIP_DEFLATED, and specifies how to compress the file. z must be opened for 'w' or 'a'.
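Mode 'a' on a non-ZIP file, as described for class ZipFile, can be tried out with write. In this sketch the "unpacker prefix" is just a placeholder byte string, not a real .exe stub, and all file names are hypothetical; the point is only that the original bytes survive at the front of the file while the archive is appended after them. The code should run on both Python 2.6+ and Python 3.

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'bundle.bin')     # hypothetical combined file
prefix = b'PRETEND-UNPACKER-STUB\n'            # placeholder, not a real .exe prefix

handle = open(path, 'wb')
handle.write(prefix)
handle.close()

payload_path = os.path.join(workdir, 'payload.txt')
handle = open(payload_path, 'wb')
handle.write(b'payload bytes\n')
handle.close()

# path exists but is not a ZIP file, so 'a' appends a new archive to it.
z = zipfile.ZipFile(path, 'a', zipfile.ZIP_DEFLATED)
try:
    z.write(payload_path, 'payload.txt')
finally:
    z.close()

handle = open(path, 'rb')
starts_with_prefix = handle.read(len(prefix)) == prefix
handle.close()

z = zipfile.ZipFile(path, 'r')
try:
    members = z.namelist()
    payload = z.read('payload.txt')
finally:
    z.close()
```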

writestr
z.writestr(zinfo, bytes)

zinfo must be a ZipInfo instance specifying at least filename and date_time. bytes is a string of bytes. writestr adds a member to archive z using the metadata specified by zinfo and the data in bytes. z must be opened for 'w' or 'a'.

When you have data in memory and need to write the data to the ZIP file archive z, it's simpler and faster to use z.writestr rather than z.write. The latter requires you to write the data to disk first and later remove the useless disk file. The following example shows both approaches, each encapsulated into a function; the two functions take the same arguments, so either can be used in place of the other:

    import zipfile

    def data_to_zip_direct(z, data, name):
        # build the member's metadata and hand the data over in memory
        import time
        zinfo = zipfile.ZipInfo(name, time.localtime()[:6])
        zinfo.compress_type = zipfile.ZIP_DEFLATED
        z.writestr(zinfo, data)

    def data_to_zip_indirect(z, data, name):
        # write a temporary disk file, archive it, then remove it
        import os
        flob = open(name, 'wb')
        flob.write(data)
        flob.close()
        z.write(name)
        os.unlink(name)

    zz = zipfile.ZipFile('z.zip', 'w', zipfile.ZIP_DEFLATED)
    data = 'four score\nand seven\nyears ago\n'
    data_to_zip_direct(zz, data, 'direct.txt')
    data_to_zip_indirect(zz, data, 'indirect.txt')
    zz.close()

Besides being faster and more concise, data_to_zip_direct is handier, since it works in memory and doesn't require the current working directory to be writable, as data_to_zip_indirect does. Of course, method write also has its uses when you already have the data in a file on disk and just want to add the file to the archive.

Here's how you can print a list of all files contained in the ZIP file archive created by the previous example, followed by each file's name and contents:

    import zipfile
    zz = zipfile.ZipFile('z.zip')
    zz.printdir()
    for name in zz.namelist():
        print '%s: %r' % (name, zz.read(name))
    zz.close()

10.6.5. The zlib Module

The zlib module lets Python programs use the free InfoZip zlib compression library (http://www.info-zip.org/pub/infozip/zlib/), version 1.1.3 or later. Module zlib is used by modules gzip and zipfile, but is also available directly for any special compression needs. The most commonly used functions supplied by module zlib are the following:

compress
compress(s, level=6)

Compresses string s and returns the string of compressed data. level is an integer between 1 and 9; 1 requests modest compression but fast operation, and 9 requests compression as good as feasible, requiring more computation.

decompress
decompress(s)

Decompresses the compressed data string s and returns the string of uncompressed data.
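A round trip through the two functions, at a few different levels, can be sketched as follows. The sample data is invented, and the sketch makes no claim about how the levels rank for other inputs; it should run on both Python 2.6+ and Python 3.

```python
import zlib

# Repetitive sample data, invented for the example.
original = b'to be or not to be, ' * 300

quick = zlib.compress(original, 1)      # fastest, modest compression
default = zlib.compress(original)       # level defaults to 6
thorough = zlib.compress(original, 9)   # slowest, best compression

restored = zlib.decompress(thorough)    # any of the three decompresses the same way
```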

Module zlib also supplies functions to compute Cyclic-Redundancy Check (CRC) checksums to detect damage in compressed data. It also provides objects to compress and decompress data incrementally to work with data streams too large to fit in memory at once. For such advanced functionality, consult the Python library's online reference.
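As a small taste of that functionality, the sketch below feeds data to a compression object chunk by chunk while keeping a running CRC-32, then decompresses the result and compares checksums. The chunk contents are invented, and masking with 0xffffffff is a precaution I add so that the comparison behaves the same on Python 2 and Python 3; treat this as an illustration rather than a complete streaming recipe.

```python
import zlib

# Stand-ins for pieces of a stream arriving one at a time.
chunks = [b'alpha ' * 1000, b'beta ' * 1000, b'gamma ' * 1000]

compressor = zlib.compressobj(6)
pieces = []
checksum = 0
for chunk in chunks:
    pieces.append(compressor.compress(chunk))   # may return an empty string
    checksum = zlib.crc32(chunk, checksum)      # running CRC of the raw data
pieces.append(compressor.flush())               # emit whatever is still buffered
stream = b''.join(pieces)

decompressor = zlib.decompressobj()
restored = decompressor.decompress(stream) + decompressor.flush()

# Mask both values so signed and unsigned results compare equal.
checksum_matches = (zlib.crc32(restored) & 0xffffffff) == (checksum & 0xffffffff)
```

Forgetting the final flush is the classic mistake here: without it the tail of the compressed stream is never emitted.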