9.7. Regular Expressions and the re Module

A regular expression (RE) is a string that represents a pattern. With RE functionality, you can check any string with the pattern and see if any part of the string matches the pattern. The re module supplies Python's RE functionality. The compile function builds a RE object from a pattern string and optional flags. The methods of a RE object look for matches of the RE in a string or perform substitutions. Module re also exposes functions equivalent to a RE's methods, but with the RE's pattern string as the first argument.

REs can be difficult to master, and this book does not purport to teach them; I cover only the ways in which you can use REs in Python. For general coverage of REs, I recommend the book Mastering Regular Expressions, by Jeffrey Friedl (O'Reilly). Friedl's book offers thorough coverage of REs at both tutorial and advanced levels. Many tutorials and references on REs can also be found online.

9.7.1. Pattern-String Syntax

The pattern string representing a regular expression follows a specific syntax:
Since RE patterns often contain backslashes, you often specify them using raw-string syntax (covered in "Strings" on page 40). Pattern elements (e.g., r'\t', equivalent to the non-raw string literal '\\t') do match the corresponding special characters (e.g., the tab character '\t'). Therefore, you can use raw-string syntax even when you do need a literal match for some such special character. Table 9-2 lists the special elements in RE pattern syntax. The exact meanings of some pattern elements change when you use optional flags, together with the pattern string, to build the RE object. The optional flags are covered in "Optional Flags" on page 205.
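The relationship between raw-string literals and RE pattern elements can be checked with a minimal sketch. It assumes Python 3 syntax, and the variable names are illustrative only:

```python
import re

# A raw-string pattern and its doubled-backslash equivalent are the same string.
raw_pattern = r'\t'
escaped_pattern = '\\t'
same_pattern = (raw_pattern == escaped_pattern)

# The RE element \t (two characters: backslash, t) matches a real tab character.
tab_match = re.search(raw_pattern, 'name\tvalue')
tab_index = tab_match.start()   # position of the tab in the target string

# Without raw-string syntax (or doubling), '\b' is a single backspace character,
# not the two-character RE element for a word boundary.
lengths = (len('\b'), len(r'\b'))
```

The last two lines show why raw strings are the safer habit: the non-raw literal silently turns into a different, one-character string.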
9.7.2. Common Regular Expression Idioms

'.*' as a substring of a regular expression's pattern string means "any number of repetitions (zero or more) of any character." In other words, '.*' matches any substring of a target string, including the empty substring. '.+' is similar, but matches only a nonempty substring. For example:

    'pre.*post'

matches a string containing a substring 'pre' followed by a later substring 'post', even if the latter is adjacent to the former (e.g., it matches both 'prepost' and 'pre23post'). On the other hand:

    'pre.+post'

matches only if 'pre' and 'post' are not adjacent (e.g., it matches 'pre23post' but does not match 'prepost'). Both patterns also match strings that continue after the 'post'. To constrain a pattern to match only strings that end with 'post', end the pattern with \Z. For example:

    r'pre.*post\Z'

matches 'prepost', but not 'preposterous'. Note that you need to express the pattern with raw-string syntax (or escape the backslash \ by doubling it into \\), as it contains a backslash. Use raw-string syntax for all RE pattern literals, which ensures you'll never forget to escape a backslash.

Another frequently used element in RE patterns is \b, which matches a word boundary. If you want to match the word 'his' only as a whole word, and not its occurrences as a substring in such words as 'this' and 'history', the RE pattern is:

    r'\bhis\b'

with word boundaries both before and after. To match the beginning of any word starting with 'her', such as 'her' itself but also 'hermetic', but not words that just contain 'her' elsewhere, such as 'ether' or 'there', use:

    r'\bher'

with a word boundary before, but not after, the relevant string. To match the end of any word ending with 'its', such as 'its' itself but also 'fits', but not words that contain 'its' elsewhere, such as 'itsy' or 'jujitsu', use:

    r'its\b'

with a word boundary after, but not before, the relevant string.
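The idioms so far can be tried out with a short sketch. It assumes Python 3 syntax; the sample sentence and variable names are made up for illustration:

```python
import re

# '.*' allows 'pre' and 'post' to be adjacent; '.+' needs something in between.
star = re.compile(r'pre.*post')
plus = re.compile(r'pre.+post')
star_hits = [bool(star.match(s)) for s in ('prepost', 'pre23post')]
plus_hits = [bool(plus.match(s)) for s in ('prepost', 'pre23post')]

# \Z anchors the match at the end of the string.
anchored = re.compile(r'pre.*post\Z')
anchored_hits = [bool(anchored.match(s)) for s in ('prepost', 'preposterous')]

# \b restricts matches to word boundaries.
sample = 'this is his history; her hermetic ether; its fits itsy'
whole_his = re.findall(r'\bhis\b', sample)      # only the standalone word
her_starts = len(re.findall(r'\bher', sample))  # 'her' and 'hermetic'
its_ends = len(re.findall(r'its\b', sample))    # 'its' and 'fits'
```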
To match whole words thus constrained, rather than just their beginning or end, add a pattern element \w* to match zero or more word characters. To match any full word starting with 'her', use:

    r'\bher\w*'

To match any full word ending with 'its', use:

    r'\w*its\b'

9.7.3. Sets of Characters

You denote sets of characters in a pattern by listing the characters within brackets ([]). In addition to listing characters, you can denote a range by giving first and last characters of the range separated by a hyphen (-). The last character of the range is included in the set, unlike other Python ranges. Within a set, special characters stand for themselves, except \, ], and -, which you must escape (by preceding them with a backslash) when their position is such that unescaped, they would form part of the set's syntax. You can denote a class of characters within a set by escaped-letter notation, such as \d or \S. \b in a set means a backspace character, not a word boundary. If the first character in the set's pattern, right after the [, is a caret (^), the set is complemented: such a set matches any character except those that follow ^ in the set pattern notation.

A frequent use of character sets is to match a word using a definition of which characters can make up a word that differs from \w's default (letters, digits, and underscores). To match a word of one or more characters, each of which can be a letter, an apostrophe, or a hyphen, but not a digit (e.g., "Finnegan-O'Hara"), use:

    r"[a-zA-Z'\-]+"

It's not strictly necessary to escape the hyphen with a backslash in this case, since its position makes it syntactically unambiguous. However, the backslash is advisable because it makes the pattern somewhat more readable by visually distinguishing the hyphen that you want to have as a character in the set from those used to denote ranges.

9.7.4. Alternatives

A vertical bar (|) in a regular expression pattern, used to specify alternatives, has low syntactic precedence.
Unless parentheses change the grouping, | applies to the whole pattern on either side, up to the start or end of the string, or to another |. A pattern can be made up of any number of subpatterns joined by |. To match such a RE, the first subpattern is tried first, and if it matches, the others are skipped. If the first subpattern does not match, the second subpattern is tried, and so on. | is neither greedy nor nongreedy: it just doesn't take the length of the match into consideration.

Given a list L of words, a RE pattern that matches any of the words is:

    '|'.join([r'\b%s\b' % word for word in L])

If the items of L can be more general strings, not just words, you need to escape each of them with function re.escape (covered in escape on page 212), and you probably don't want the \b word boundary markers on either side. In this case, use the following RE pattern:

    '|'.join(map(re.escape, L))

9.7.5. Groups

A regular expression can contain any number of groups, from 0 to 99 (any number is allowed, but only the first 99 groups are fully supported). Parentheses in a pattern string indicate a group. Element (?P<id>...) also indicates a group, and gives the group a name, id, that can be any Python identifier. All groups, named and unnamed, are numbered from left to right, 1 to 99; group 0 means the whole RE.

For any match of the RE with a string, each group matches a substring (possibly an empty one). When the RE uses |, some groups may not match any substring, although the RE as a whole does match the string. When a group doesn't match any substring, we say that the group does not participate in the match. An empty string ('') is used as the matching substring for any group that does not participate in a match, except where otherwise indicated later in this chapter. For example:

    r'(.+)\1+\Z'

matches a string made up of two or more repetitions of any nonempty substring.
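A quick way to see this pattern, and group numbering in general, at work is the following sketch. It assumes Python 3 syntax; the sample strings and the group name key are illustrative only:

```python
import re

# Backreference \1 must repeat exactly what group 1 matched.
repeated = re.compile(r'(.+)\1+\Z')
m = repeated.match('abcabcabc')
unit = m.group(1) if m else None           # the repeating unit found
no_repeat = repeated.match('abcabd')       # None: no unit repeats to the end

# Named and unnamed groups share one left-to-right numbering.
named = re.compile(r'(?P<key>\w+)=(\w+)')
nm = named.match('color=red')
key_by_name = nm.group('key')              # access by name
key_by_number = nm.group(1)                # same group, by number
value = nm.group(2)                        # unnamed group, numbered 2
whole = nm.group(0)                        # group 0 is the whole match
```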
The (.+) part of the pattern matches any nonempty substring (any character, one or more times) and defines a group, thanks to the parentheses. The \1+ part of the pattern matches one or more repetitions of the group, and \Z anchors the match to end-of-string.

9.7.6. Optional Flags

A regular expression pattern element with one or more of the letters iLmsux between (? and ) lets you set RE options within the pattern, rather than by the flags argument to function compile of module re. Options apply to the whole RE, no matter where the options element occurs in the pattern. For clarity, always place options at the start of the pattern. Placement at the start is mandatory if x is among the options, since x changes the way Python parses the pattern.

Using the explicit flags argument is more readable than placing an options element within the pattern. The flags argument to function compile is a coded integer built by bitwise ORing (with Python's bitwise OR operator, |) one or more of the following attributes of module re. Each attribute has both a short name (one uppercase letter), for convenience, and a long name (an uppercase multiletter identifier), which is more readable and thus normally preferable:
For example, here are three ways to define equivalent REs with function compile, covered in compile on page 212. Each of these REs matches the word "hello" in any mix of upper- and lowercase letters:

    import re
    r1 = re.compile(r'(?i)hello')
    r2 = re.compile(r'hello', re.I)
    r3 = re.compile(r'hello', re.IGNORECASE)

The third approach is clearly the most readable, and thus the most maintainable, even though it is slightly more verbose. The raw-string form is not necessary here, since the patterns do not include backslashes; however, using raw strings is innocuous, and is the recommended style for clarity.

Option re.VERBOSE (or re.X) lets you make patterns more readable and understandable by appropriate use of whitespace and comments. Complicated and verbose RE patterns are generally best represented by strings that take up more than one line, and therefore you normally want to use the triple-quoted raw-string format for such pattern strings. For example:

    repat_num1 = r'(0[0-7]*|0x[\da-fA-F]+|[1-9]\d*)L?\Z'
    repat_num2 = r'''(?x)            # pattern matching integer numbers
                  (0 [0-7]*        | # octal: leading 0, then 0+ octal digits
                   0x [\da-fA-F]+  | # hex: 0x, then 1+ hex digits
                   [1-9] \d*       ) # decimal: leading non-0, then 0+ digits
                   L?\Z              # optional trailing L, then end of string
                  '''

The two patterns defined in this example are equivalent, but the second one is made somewhat more readable by the comments and the free use of whitespace to group portions of the pattern in logical ways.

9.7.7. Match Versus Search

So far, we've been using regular expressions to match strings. For example, the RE with pattern r'box' matches strings such as 'box' and 'boxes', but not 'inbox'. In other words, a RE match is implicitly anchored at the start of the target string, as if the RE's pattern started with \A.

Often, you're interested in locating possible matches for a RE anywhere in the string, without anchoring (e.g., find the r'box' match inside such strings as 'inbox', as well as in 'box' and 'boxes').
In this case, the Python term for the operation is a search, as opposed to a match. For such searches, use the search method of a RE object: the match method deals with matching only from the start. For example:

    import re
    r1 = re.compile(r'box')
    if r1.match('inbox'):
        print('match succeeds')
    else:
        print('match fails')        # prints: match fails
    if r1.search('inbox'):
        print('search succeeds')    # prints: search succeeds
    else:
        print('search fails')

9.7.8. Anchoring at String Start and End

The pattern elements ensuring that a regular expression search (or match) is anchored at string start and string end are \A and \Z, respectively. More traditionally, elements ^ for start and $ for end are also used in similar roles. For RE objects that are not multiline (i.e., that do not contain pattern element (?m) and are not compiled with the flag re.M or re.MULTILINE), ^ is the same as \A, and $ is almost the same as \Z: $ also matches just before a newline that ends the string. For a multiline RE, however, ^ anchors at the start of any line (i.e., either at the start of the whole string or at any position right after a newline character \n). Similarly, with a multiline RE, $ anchors at the end of any line (i.e., either at the end of the whole string or at any position right before \n). On the other hand, \A and \Z anchor at the start and end of the string whether the RE object is multiline or not. For example, here's how to check if a file has any lines that end with digits:

    import re
    digatend = re.compile(r'\d$', re.MULTILINE)
    if digatend.search(open('afile.txt').read()):
        print("some lines end with digits")
    else:
        print("no lines end with digits")

A pattern of r'\d\n' is almost equivalent, but in that case the search fails if the very last character of the file is a digit not followed by an end-of-line character. With the example above, the search succeeds if a digit is at the very end of the file's contents, as well as in the more usual case where a digit is followed by an end-of-line character.

9.7.9. Regular Expression Objects

A regular expression object r has the following read-only attributes that detail how r was built (by function compile of module re, covered in compile on page 212):
These attributes make it easy to get back from a compiled RE object to its pattern string and flags, so you never have to store those separately. A RE object r also supplies methods to locate matches for r within a string, as well as to perform substitutions on such matches. Matches are generally represented by special objects, covered in "Match Objects" on page 210.
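As a rough illustration of getting back from a compiled RE object to the way it was built, and of calling its methods, consider the sketch below. It assumes Python 3 syntax and uses the standard library's attribute names pattern, flags, and groupindex; the sample pattern and strings are made up:

```python
import re

r = re.compile(r'h(?P<vowel>[aeiou])t', re.IGNORECASE)

pattern_text = r.pattern                      # the pattern string given to compile
ignores_case = bool(r.flags & re.IGNORECASE)  # flags is a bit mask; test it with &
group_names = dict(r.groupindex)              # maps group names to group numbers

# Methods of the RE object locate matches or perform substitutions.
found = r.findall('Hat hit HOT hut')          # with one group, returns that group
replaced = r.sub('_', 'a hat and a hot pot')  # every match replaced by '_'
```

Testing flags with a bitwise AND, rather than comparing for equality, is the safer choice here, since the stored value may include bits that Python sets on its own.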
9.7.10. Match Objects

Match objects are created and returned by methods match and search of a regular expression object, and are the items of the iterator returned by method finditer. They are also implicitly created by methods sub and subn when argument repl is callable, since in that case a suitable match object is passed as the argument on each call to repl. A match object m supplies the following read-only attributes that detail how m was created:
A match object m also supplies several methods.
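The sketch below exercises a few match-object attributes and methods as they exist in the standard library (re, string, group, groups, groupdict, start, end, span). It assumes Python 3 syntax; the pattern and target strings are illustrative only:

```python
import re

r = re.compile(r'(?P<user>\w+)@(?P<host>\w+)')
target = 'mail to alice@example now'
m = r.search(target)

whole = m.group(0)                          # the entire matched substring
user, host = m.group('user'), m.group(2)    # groups by name or by number
all_groups = m.groups()                     # tuple of all groups, 1 onward
by_name = m.groupdict()                     # named groups as a dictionary
start, end = m.span()                       # same as (m.start(), m.end())
same_slice = target[m.start():m.end()]      # slicing recovers the match
built_from = (m.re is r, m.string is target)

# Match objects are also the items yielded by finditer,
# and the argument passed to a callable repl in sub.
spans = [mo.span() for mo in re.finditer(r'\d+', 'a1 b22 c333')]
doubled = re.sub(r'\d+', lambda mo: str(int(mo.group(0)) * 2), 'a1 b22 c333')
```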
9.7.11. Functions of Module re

The re module supplies the attributes listed in "Optional Flags" on page 205. It also provides one function for each method of a regular expression object (findall, finditer, match, search, split, sub, and subn), each with an additional first argument, a pattern string that the function implicitly compiles into a RE object. It's generally preferable to compile pattern strings into RE objects explicitly and call the RE object's methods, but sometimes, for a one-off use of a RE pattern, calling functions of module re can be slightly handier. For example, to count the number of occurrences of substring 'hello' in any mix of cases, one function-based way is:

    import re
    junk, count = re.subn(r'(?i)hello', '', astring)
    print('Found %d occurrences of "hello"' % count)

In such cases, RE options (here, for example, case insensitivity) are encoded as RE pattern elements (here, (?i)), since in the Python versions this book covers, function subn of module re does not accept a flags argument; later Python versions add an optional flags argument to these functions.

Module re internally caches the RE objects it creates from the patterns passed to its functions; to purge that cache and reclaim some memory, call re.purge().

Module re also supplies error, the class of exceptions raised upon errors (generally, errors in the syntax of a pattern string), and two more functions.
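To compare the function-based and object-based styles side by side, and to see re.error and the compile and escape functions mentioned earlier in this section at work, here is a small sketch. It assumes Python 3 syntax; astring and the other names are sample values chosen for illustration:

```python
import re

astring = 'Hello, hello, HELLO there'

# Function-based counting, with the option embedded in the pattern.
_, count_inline = re.subn(r'(?i)hello', '', astring)

# Equivalent counting with an explicitly compiled RE object.
hello = re.compile(r'hello', re.IGNORECASE)
_, count_compiled = hello.subn('', astring)

# re.escape makes arbitrary text safe to use as a literal pattern:
# the dot no longer means "any character".
literal = re.escape('a.b')
escaped_hits = re.findall(literal, 'a.b axb a.b')

# re.error is raised for a malformed pattern string.
try:
    re.compile(r'(unclosed')
    bad_pattern_rejected = False
except re.error:
    bad_pattern_rejected = True
```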