Limitations of the Basic Syntax
Even though the basic rules make regular expressions quite powerful, inherent limitations make them impractical to use on their own. For example, the rules seen so far offer no regular expression for the concept of "any character." In addition, if you need to match a parenthesis or a star literally, rather than as a special character, you're pretty much out of luck.
Let's start from the beginning. It's sometimes useful to require that a portion of a regular expression match only at the beginning or at the end of a string. For example, suppose you're trying to determine whether a string represents a valid HTTP URL. The regex http:// would match both http://www.phparch.com, which is a valid URL, and nhttp://www.phparch.com, which is not (and could easily represent a typo on the user's part).
By using the "^" special character, you can indicate that the following regular expression should be matched only at the beginning of the string. Thus, the regex ^http:// will create a match only with the first of the two strings.
The same concept, although in reverse, applies to the end-of-string marker "$", which indicates that the regular expression preceding it must end exactly at the end of the string. For example, com$ will match "sams.com" but not "communication."
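The behavior of the two anchors can be checked with a short script. What follows is a minimal sketch in Python, chosen purely for illustration; the helper name `matches` is hypothetical, and Python's `re` engine is an assumption rather than the engine discussed in the text (its `$`, for instance, also matches just before a trailing newline).

```python
import re

# "^" ties the pattern to the start of the string.
anchored_start = re.compile(r"^http://")

# "$" ties the pattern to the end of the string.
anchored_end = re.compile(r"com$")

# The same URL prefix without any anchor, for comparison.
unanchored = re.compile(r"http://")


def matches(pattern, text):
    """Return True if the pattern is found at any position its anchors allow."""
    return pattern.search(text) is not None


# Without an anchor, the mistyped URL still matches.
print(matches(unanchored, "nhttp://www.phparch.com"))      # True
print(matches(anchored_start, "http://www.phparch.com"))   # True
print(matches(anchored_start, "nhttp://www.phparch.com"))  # False
print(matches(anchored_end, "sams.com"))                   # True
print(matches(anchored_end, "communication"))              # False
```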
The special characters "+" and "?" work similarly to the Kleene Star, with the exception that they represent "at least one instance" and "either zero or one instances" of the regex they are attached to, respectively.
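To see the three modifiers side by side, consider the following minimal sketch in Python. The patterns, the sample strings, and the helper name `accepted` are illustrative assumptions; the anchors are included only so that each sample must fit the pattern from start to finish.

```python
import re

# "b*": zero or more "b" characters (the Kleene Star).
star = re.compile(r"^ab*c$")
# "b+": at least one "b".
plus = re.compile(r"^ab+c$")
# "b?": either zero or one "b".
optional = re.compile(r"^ab?c$")


def accepted(pattern, candidates):
    """Return the candidates that the pattern matches."""
    return [s for s in candidates if pattern.search(s)]


samples = ["ac", "abc", "abbc"]
print(accepted(star, samples))      # ['ac', 'abc', 'abbc']
print(accepted(plus, samples))      # ['abc', 'abbc']
print(accepted(optional, samples))  # ['ac', 'abc']
```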
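As I briefly mentioned earlier, a "wildcard" that matches any character is extremely useful in a wide range of scenarios. The "." character plays this role, and because it is considered a regular expression in its own right, it can be combined with the Kleene Star and any of the other modifiers. For example, the expression
.+@.+\..+
can be used to indicate: at least one instance of any character, followed by the @ character, followed by at least one instance of any character, followed by the "." character, followed by at least one instance of any character.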
As you might have guessed, this expression is a very rough form of email address validation. Note how I have used the backslash character (\) to force the regex compiler to interpret the penultimate "." as a literal character, rather than as another instance of the "any character" regular expression.
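The role of the backslash is easiest to see by comparing the escaped and unescaped forms. The sketch below uses Python for illustration; the pattern `.+@.+\..+` is reconstructed from the description above (wildcards around an @ sign and an escaped period) and should be read as an assumption, as should the helper name `looks_like_email`.

```python
import re

# Assumed pattern: one or more of any character, "@", one or more of any
# character, a literal ".", then one or more of any character.
rough_email = re.compile(r".+@.+\..+")

# Same pattern with the backslash removed: every "." is now a wildcard.
unescaped = re.compile(r".+@.+..+")


def looks_like_email(pattern, text):
    """Return True if the pattern is found somewhere in the text."""
    return pattern.search(text) is not None


print(looks_like_email(rough_email, "user@example.com"))  # True
print(looks_like_email(rough_email, "user@examplecom"))   # False: no literal "."
print(looks_like_email(unescaped, "user@examplecom"))     # True: "." matched a letter
print(looks_like_email(rough_email, "!!@##.$$"))          # True: very permissive
```

The last line shows why this check is only a rough one: any characters at all are accepted around the @ sign and the period.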
However, that is a rather primitive way of checking for the validity of an email address. After all, only letters of the alphabet, the underscore character (_), the minus character (-), and digits are allowed in the name, domain, and extension portions of an email address. This is where the range denominators come into play.
As mentioned previously, anything within nonescaped square brackets represents a set of alternatives for a particular character position. For example, [abc] indicates either an "a", a "b", or a "c". However, representing something like "any character" by including every possible symbol in the square brackets would give birth to some ridiculously long regular expressions, and regexes are complex enough as it is.
Luckily, it's possible to specify a "range" of characters by separating them with a dash. For example, [a-z] means "any lowercase letter." You can also specify more than one range and combine them with individual characters by placing them side by side. For example, our email validation requirements can be satisfied by the expression [A-Za-z0-9_], which turns the overall regex into
[A-Za-z0-9_]+@[A-Za-z0-9_]+\.[A-Za-z0-9_]+
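A character class of this kind can be tried out with the sketch below, written in Python for illustration. The overall pattern is an assumption assembled from the class [A-Za-z0-9_] given in the text; `WORD` and `is_valid` are hypothetical names, and `fullmatch` is used so that the entire string has to fit the pattern.

```python
import re

# Character class exactly as given in the text: letters, digits, underscore.
# It contains no hyphen, so a "-" in an address is rejected by this sketch.
WORD = r"[A-Za-z0-9_]"

# Assumed overall pattern: name, "@", domain, literal ".", extension.
email = re.compile(WORD + r"+@" + WORD + r"+\." + WORD + r"+")


def is_valid(address):
    """Return True only if the whole address fits the pattern."""
    return email.fullmatch(address) is not None


print(is_valid("user_1@example.com"))  # True
print(is_valid("!!@##.$$"))            # False: symbols are not in the class
print(is_valid("user@my-site.com"))    # False: "-" is not in the class
```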
The range specifications that we have seen so far are all inclusive; that is, they tell the regex compiler which characters can be in the string. Sometimes, it's more convenient to use exclusive specifications, dictating that any character except the ones you specify is valid. This can be done by prepending a caret character (^) to the character specifications inside the square brackets. For example, [^A-Z] means "any character except any uppercase letter of the alphabet."
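Because the caret means different things inside and outside the brackets, a small example may help. This is a minimal sketch in Python with made-up sample strings; the names `non_uppercase` and `no_uppercase_at_all` are hypothetical.

```python
import re

# Inside the brackets, "^" negates the class: any character except A-Z.
non_uppercase = re.compile(r"[^A-Z]")

# Outside the brackets, "^" is still the start-of-string anchor, so this
# pattern requires every character in the string to be a non-uppercase one.
no_uppercase_at_all = re.compile(r"^[^A-Z]+$")

print(non_uppercase.findall("PhP5"))                        # ['h', '5']
print(bool(no_uppercase_at_all.search("hello, world 42")))  # True
print(bool(no_uppercase_at_all.search("Hello")))            # False
```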
Going back to the email validation regex, it's still not as good as it could be. For example, we know for sure that a domain extension (for example, .ca or .com) must have a minimum of two characters (as in .ca) and a maximum of four (as in .info). We can therefore use the minimum-maximum length specifier that I introduced earlier to specify this additional requirement:
[A-Za-z0-9_]+@[A-Za-z0-9_]+\.[A-Za-z0-9_]{2,4}
Naturally, you may want to allow only email addresses that have a three-letter domain extension (such as .com). This can be accomplished by omitting the comma and the maximum from the length specifier:
[A-Za-z0-9_]+@[A-Za-z0-9_]+\.[A-Za-z0-9_]{3}
If, on the other hand, you would like to leave the maximum number of characters open in anticipation of the fact that longer domain extensions may be introduced in the future, you could use the following regex:
[A-Za-z0-9_]+@[A-Za-z0-9_]+\.[A-Za-z0-9_]{2,}
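The three forms of the length specifier can be compared against the same set of addresses. The sketch below is written in Python for illustration; the patterns are assumptions built from the character class [A-Za-z0-9_] used earlier, and the sample addresses and the names `PREFIX` and `valid` are hypothetical.

```python
import re

WORD = r"[A-Za-z0-9_]"
# Shared part of the pattern: name, "@", domain, and a literal ".".
PREFIX = WORD + r"+@" + WORD + r"+\."

two_to_four = re.compile(PREFIX + WORD + r"{2,4}")  # between 2 and 4 characters
exactly_three = re.compile(PREFIX + WORD + r"{3}")  # exactly 3 characters
two_or_more = re.compile(PREFIX + WORD + r"{2,}")   # 2 or more, no upper limit


def valid(pattern, address):
    """Return True only if the whole address fits the pattern."""
    return pattern.fullmatch(address) is not None


addresses = [
    "user@example.x",
    "user@example.ca",
    "user@example.com",
    "user@example.info",
    "user@example.museum",
]

for label, pattern in [
    ("{2,4}", two_to_four),
    ("{3}", exactly_three),
    ("{2,}", two_or_more),
]:
    print(label, [a for a in addresses if valid(pattern, a)])
```

With these samples, {2,4} accepts the .ca, .com, and .info addresses, {3} accepts only the .com address, and {2,} accepts everything except the one-character extension.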