Pattern Matching with Grep

Pattern matching is one of the most commonly used techniques for finding similar text strings in one or more files, and it forms the basis of several utilities on Linux/UNIX platforms. The pattern matching tools on these systems proved so capable that many developers and organizations ported them to other platforms, such as Windows.

The grep utility is very widely used by developers and users of Linux/UNIX-based platforms. The name ‘grep’ stands for global regular expression print, after the ed editor command g/re/p: the input stream is searched globally for a regular expression, and every matching line is printed to the output stream. The utility is used to locate the lines within one or more files that contain text matching a specific pattern. Patterns are written as regular expressions, which are either basic or extended.

In its simplest form, a regular expression is just the string of text that you are looking for. For example, you may want to find all the lines in a source program that contain the variable name accountBalance. If all the programs to be searched are stored in the same directory, then the following command gives the desired result, assuming that you are searching C++ source files.

    $ grep 'accountBalance' *.cpp

The result of this command is a list of all the lines containing the term accountBalance in any of the C++ source files. When more than one file is searched, each line is prefixed with the name of the file in which it was found. The list is written to standard output (the terminal by default). The pattern matching in this example is case-sensitive.

The general form of this command is

    $ grep [options] <pattern> <file name(s)>

Like many other Linux commands, grep accepts options that refine its results. The -i option makes the search case-insensitive. The -c option displays only the number of lines matching the pattern, not the lines themselves. The -m <maxcount> option stops the search after maxcount matching lines have been found. The -n option precedes each matching line with its line number. The -l option (the lowercase letter l) suppresses the normal output and instead displays the names of the files that contain matching lines. The -h option suppresses the file name prefix when multiple files are searched. Multiple patterns can also be saved in a pattern file, one per line, and the pattern file given as input to the command. For example, the following list shows a set of patterns to be searched, which are stored in a pattern file called patfile.

    accountBalance
    creditIssued
    newAccountNumber

The following command searches for all these patterns in the C++ source programs in the current directory and displays every line that matches at least one of them. Here, the -f option indicates that the patterns are stored in a separate file, which should be used in the pattern matching process.

    $ grep -f patfile *.cpp

The -v option tells the command to invert the match, which means to display all the lines that do not match the pattern. There are more options, and readers are encouraged to consult the manual page of the command (man grep).
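The options described above can be tried out on a small scratch directory. The following is a minimal sketch, not a definitive recipe: the file names ledger.cpp and report.cpp and their contents are made up for illustration, and mktemp is assumed to be available.

```shell
# Build a throwaway workspace with two small, hypothetical source files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'double accountBalance = 0;' 'int creditIssued = 0;' \
    'accountBalance += 10;' 'return 0;' > ledger.cpp
printf '%s\n' 'int AccountBalance = 1;' 'int unrelated = 2;' > report.cpp
printf '%s\n' 'accountBalance' 'creditIssued' > patfile

grep -n 'accountBalance' ledger.cpp     # matching lines, each preceded by its line number
grep -c 'accountBalance' ledger.cpp     # only the count of matching lines
grep -i -l 'accountbalance' *.cpp       # names of files with a case-insensitive match
grep -h 'accountBalance' *.cpp          # matching lines without the file name prefix
grep -m 1 'accountBalance' ledger.cpp   # stop after the first matching line
grep -f patfile ledger.cpp              # lines matching any pattern listed in patfile
grep -v -f patfile ledger.cpp           # inverse: lines matching none of those patterns
```

Options can be combined, as in the -i -l example, and the scratch directory can simply be deleted afterwards.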

After reviewing some of the command’s options, it is time to examine the ways in which patterns can be specified. The grep command interprets its pattern argument as a regular expression. In the simplest case, when the expression contains no special characters (metacharacters), it simply matches itself as literal text. If the expression contains basic or extended metacharacters, grep evaluates them to decide which text matches. If the pattern is a simple text string with no spaces and no characters that are special to the shell, it can be specified without quotes. Therefore, the first example can also be written as shown below, with the single quotes surrounding the pattern removed.

    $ grep accountBalance *.cpp

However, if the pattern has embedded spaces or tabs, it should be enclosed in either single quotes or double quotes so that the shell passes it to grep as a single argument. If the pattern includes a single-quote character, enclose it in double quotes; similarly, if the pattern includes a double-quote character, enclose it in single quotes. Quoting only protects the pattern from the shell: grep still interprets any metacharacters inside the quotes, and within double quotes the shell itself still expands characters such as $ and the backquote. Single quotes are therefore the safer default.
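The quoting rules can be sketched as follows. The sample lines are invented for illustration. The -F option is not discussed in the text above; it is a standard grep option that treats the pattern as a fixed string, and it is included here only to contrast with ordinary quoting.

```shell
sample=$(mktemp)   # scratch file with made-up contents
printf '%s\n' 'account balance = 0' "customer's name" 'say "hello"' 'a.b' 'axb' > "$sample"

grep 'account balance' "$sample"   # embedded space: quotes keep the pattern in one argument
grep "customer's" "$sample"        # pattern contains a single quote, so use double quotes
grep 'say "hello"' "$sample"       # pattern contains double quotes, so use single quotes

# Quotes protect the pattern from the shell only. grep still treats the
# dot as a metacharacter, so this prints both a.b and axb.
grep 'a.b' "$sample"

# -F asks grep to take the pattern literally, so this prints only a.b.
grep -F 'a.b' "$sample"
```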

Metacharacters are characters that carry a special meaning for the command instead of standing for themselves; grep uses them when evaluating the regular expression. There are two types of metacharacters: basic and extended. The basic metacharacters are the dot ., the caret ^, the dollar sign $, the asterisk *, a pair of square brackets [], the backslash character \, and the occurrence range specifier \{n,m\}. The extended metacharacters are the question mark ?, the plus sign +, the alternation symbol |, a pair of parentheses (), and the occurrence range specifier {n,m}. The extended metacharacters are used by the egrep command (‘egrep’ stands for extended grep), or by the grep command with the -E command-line option; recent GNU grep releases treat egrep as obsolescent in favor of grep -E. The meaning of each of these metacharacters is discussed in the following paragraphs.

If the regular expression begins with a caret character ^, then the pattern that follows the caret must occur at the beginning of the line. In the following example, the first command displays the directories within the current directory, and the second displays the symbolic links, because the first character of each line of long-format ls output indicates the file type.

    $ ls -l | grep '^d'
    $ ls -l | grep '^l'

Similarly, a regular expression ending with the $ symbol matches the lines that end with the pattern preceding the $ symbol. A dot character ‘.’ embedded in an expression indicates that any single character may be matched at the position where the dot is present. For example, the expression p..r matches words such as ‘peer’, ‘pour’, ‘poor’ and so on, as there are two dots between the characters ‘p’ and ‘r’. An embedded asterisk * indicates zero or more occurrences of the character preceding the asterisk. Because the dot stands for any character, the expression p.*r matches ‘p’ followed by any run of characters and then ‘r’, so it matches words like ‘pioneer’, ‘ponder’, and so on, in addition to the words mentioned earlier.
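The anchors, the dot, and the asterisk can be compared side by side on a short word list. This is an illustrative sketch with a made-up list of words; the expressions are anchored at both ends so that each one must account for the whole line.

```shell
words=$(mktemp)   # scratch file with one hypothetical word per line
printf '%s\n' 'peer' 'pour' 'pioneer' 'ponder' 'pr' 'pilot' 'door' > "$words"

grep '^p' "$words"       # lines that begin with p (everything except door)
grep 'r$' "$words"       # lines that end with r (everything except pilot)
grep '^p..r$' "$words"   # p, exactly two characters, r: peer and pour
grep '^p.*r$' "$words"   # p, any run of characters (possibly empty), r
```

The last command also matches pr, because the asterisk allows zero occurrences of the preceding dot.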

A bracket expression is a list of characters enclosed within a pair of square brackets, [ and ]. Such an expression matches any single one of the enclosed characters at that position. For example, the expression [7bxH9] matches one occurrence of any of the characters 7, b, x, H, or 9, so all the lines that contain at least one of these characters are displayed. A bracket expression may also contain a range expression such as 5-8, which means that any character from 5 through 8 in the natural sorted order is included in the expression. Therefore the expressions [5-8] and [5678] are the same. However, because ranges depend on the sort order, and the sort order depends on the character locale, the same range may cover different characters in different locales. If a caret ^ appears as the first character in a bracket expression, the expression matches any single character other than those in the list at that position; the newline character, however, is never matched. As an example, the expression [^5-8] matches any character except the numerals 5 through 8 at the position where the bracket expression is located.
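The following sketch illustrates bracket expressions on a few invented lines. Prefixing each command with LC_ALL=C is an assumption made here to pin the locale, so that the range 5-8 follows plain ASCII order regardless of the user's settings.

```shell
codes=$(mktemp)   # scratch file with made-up contents
printf '%s\n' 'room 4' 'room 5' 'room 7' 'room 9' 'room b' > "$codes"

LC_ALL=C grep 'room [5-8]' "$codes"     # range: room 5 and room 7
LC_ALL=C grep 'room [5678]' "$codes"    # explicit list, same result as the range
LC_ALL=C grep 'room [^5-8]' "$codes"    # negated list: room 4, room 9, room b
LC_ALL=C grep 'room [7bxH9]' "$codes"   # any one of 7, b, x, H, 9: room 7, room 9, room b
```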

As mentioned earlier, {n,m} is the occurrence range specifier in the extended metacharacter syntax, and the same specifier is written as \{n,m\} in the basic metacharacter syntax. The only difference is that in the basic syntax, each curly brace is preceded by the backslash \ character. Both forms give the same result, as explained here. An embedded {n} matches exactly n occurrences of the item preceding the opening curly brace; {n,} matches at least n occurrences; and {n,m} matches at least n and at most m occurrences.
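The three forms of the occurrence range specifier can be sketched as follows, first in the basic syntax and then in the extended syntax with -E. The test lines are made up, and the expressions are anchored so that the count applies to the whole line.

```shell
runs=$(mktemp)   # scratch file with made-up contents
printf '%s\n' 'ab' 'aab' 'aaab' 'aaaab' > "$runs"

grep '^a\{2\}b$' "$runs"      # basic syntax, exactly two a's: aab
grep -E '^a{2}b$' "$runs"     # extended syntax, same result
grep -E '^a{2,}b$' "$runs"    # at least two a's: aab, aaab, aaaab
grep -E '^a{2,3}b$' "$runs"   # two or three a's: aab, aaab
```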

If the regular expression contains the + symbol, it indicates that one or more occurrences of the preceding expression should be matched, and a ? indicates that zero or one occurrence of the preceding expression should be matched (that is, the preceding expression is optional). Finally, the alternation symbol | indicates that either the pattern preceding or the pattern following the | symbol should be matched (which is useful when searching for text containing synonyms), while a pair of parentheses () groups the enclosed expression into a single unit, so that a following operator applies to the whole group.

The symbols |, ?, +, (, and ) have no special meaning in basic regular expressions, because they are defined as part of the extended regular expressions. There are therefore three ways to use them as operators. They may be preceded by a backslash character \ with the plain grep command: \( and \) are part of the standard basic syntax, while \?, \+, and \| are accepted by GNU grep as extensions. They may be used without the backslash with the grep command and the -E option. Or they may be used without the backslash with the egrep command.
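The extended operators can be sketched with grep -E as shown here; the word list is invented for illustration. The final command shows one portable way to express an optional character in the basic syntax, using the \{0,1\} range specifier in place of ?.

```shell
text=$(mktemp)   # scratch file with made-up contents
printf '%s\n' 'color' 'colour' 'colouur' 'car' 'automobile' 'abab' > "$text"

grep -E '^colou?r$' "$text"        # ? = zero or one u: color, colour
grep -E '^colou+r$' "$text"        # + = one or more u: colour, colouur
grep -E 'car|automobile' "$text"   # alternation: car, automobile
grep -E '^(ab)+$' "$text"          # the group ab repeated as a unit: abab

# Basic syntax without -E: \{0,1\} plays the role of ?
grep '^colou\{0,1\}r$' "$text"     # color, colour
```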

The backslash \ metacharacter is placed before another metacharacter to remove its special meaning, so that the character is treated as an ordinary part of the text being matched. For example, \* makes the asterisk part of the pattern, as in a\*b, which matches the literal text a*b.
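The effect of the backslash can be seen by running the same pattern with and without it. This is a small sketch on made-up lines; note the single quotes, which keep the shell from interpreting the backslash and the asterisk before grep sees them.

```shell
exprs=$(mktemp)   # scratch file with made-up contents
printf '%s\n' 'a*b' 'ab' 'aab' 'b' > "$exprs"

grep 'a\*b' "$exprs"   # escaped asterisk: only the line containing the literal text a*b
grep 'a*b' "$exprs"    # unescaped: zero or more a's followed by b, so all four lines match
```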