
Validating an Email Address with Regular Expressions

Back in Chapter 7, one of the tasks was validating an email address. To do the job, the script needed to be relatively long. Script 8.1, at its heart, does exactly the same thing as Script 7.15; but by using regular expressions, it takes many fewer lines, and you get a more rigorous result. You'll find the simple HTML in Script 8.2, and the CSS is unchanged from Script 7.6.

Script 8.1. These few lines of JavaScript go a long way to validate email addresses.

window.onload = initForms;

function initForms() {
     for (var i=0; i< document.forms.length; i++) {
         document.forms[i].onsubmit = function() {return validForm();}
     }
}

function validForm() {
     var allGood = true;
     var allTags = document.getElementsByTagName ("*");

     for (var i=0; i<allTags.length; i++) {
        if (!validTag(allTags[i])) {
           allGood = false;
        }
     }
     return allGood;

     function validTag(thisTag) {
        var outClass = "";
        var allClasses = thisTag.className.split (" ");

        for (var j=0; j<allClasses.length; j++) {
           outClass += validBasedOnClass(allClasses[j]) + " ";
        }

        thisTag.className = outClass;

        if (outClass.indexOf("invalid") > -1) {
           invalidLabel(thisTag.parentNode);
           thisTag.focus();
           if (thisTag.nodeName == "INPUT") {
              thisTag.select();
           }
           return false;
        }
        return true;

        function validBasedOnClass(thisClass) {
           var classBack = "";

           switch(thisClass) {
              case "":
              case "invalid":
                 break;
              case "email":
                 if (allGood && !validEmail (thisTag.value)) classBack = "invalid ";
                 // no break here: fall through so that "email" itself is kept
              default:
                 classBack += thisClass;
           }
           return classBack;
        }

        function validEmail(email) {
           var re = /^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$/;

           return re.test(email);
        }

        function invalidLabel(parentTag) {
           if (parentTag.nodeName == "LABEL") {
              parentTag.className += " invalid";
           }
        }
     }
}

Script 8.2. The HTML for the email validation example.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
     <title>Email Validation</title>
     <link rel="stylesheet" href="script01.css" />
     <script language="Javascript" type="text/javascript" src="script01.js">
     </script>
</head>
<body>
     <h2 align="center">Email Validation</h2>
     <form action="someAction.cgi">
         <p><label>Email Address:
         <input class="email" type="text" size="50" /></label></p>
         <p><input type="reset" />&nbsp;<input type="submit" value="Submit" /></p>
     </form>
</body>
</html>

To validate an email address using regular expressions:

1.
var re = /^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$/;



Yow! What on earth is this? Don't panic; it's just a regular expression in the validEmail() function. Let's break it apart and take it piece by piece. Like any line of JavaScript, you read a regular expression from left to right.

First, re is just a variable. We've given it the name re so that when we use it later, we'll remember that it's a regular expression. The line sets the value of re to the regular expression on the right side of the equals sign.

A regular expression always begins and ends with a slash, / (of course, there is still a semicolon here, to denote the end of the JavaScript line, but the semicolon is not part of the regular expression). Everything in between the slashes is part of the regular expression.

The caret ^ means that we're going to use this expression to examine a string starting at the string's beginning. If the caret were left off, the email address might show as valid even though there was a bunch of garbage at the beginning of the string.
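To see what the caret buys us, here is a minimal sketch that runs a pattern with and without it. Both patterns and the sample string are simplified ones that we made up for illustration; they are not the pattern from Script 8.1.

```javascript
// Simplified patterns for illustration only (not the one in Script 8.1).
var unanchored = /\w+@\w+\.\w{2,3}$/;   // no caret: the match may start anywhere
var anchored   = /^\w+@\w+\.\w{2,3}$/;  // caret: the match must start at the beginning

var sample = "!!! dori@example.com";    // garbage in front of an otherwise fine address

// true: "dori@example.com" is found inside the string
var unanchoredResult = unanchored.test(sample);

// false: the string starts with "!", which the pattern does not allow
var anchoredResult = anchored.test(sample);
```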

The expression \w means any one character, "a" through "z", "A" through "Z", "0" through "9", or underscore. An email address must start with one of these characters.

The plus sign + means one or more of whatever the previous item was that we're checking on. In this case, an email address must start with one or more of any combination of the characters "a" through "z", "A" through "Z", "0" through "9", or underscore.
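As a small sketch of the difference the plus sign makes, the lines below use exec(), a standard method of regular expressions that returns an array whose first element is the matched text (or null if nothing matched). The sample strings and variable names are our own.

```javascript
var oneChar   = /^\w/.exec("dori_99@example.com");   // without +: a single character
var manyChars = /^\w+/.exec("dori_99@example.com");  // with +: as many \w characters as it can take
var noMatch   = /^\w+/.exec("@example.com");         // "@" is not a \w character

console.log(oneChar[0]);    // prints: d
console.log(manyChars[0]);  // prints: dori_99
console.log(noMatch);       // prints: null
```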

The opening parenthesis ( signifies a group. It means that we're going to want to refer to everything inside the parentheses in some way later, so we put them into a group now.

The brackets [] are used to show that we can have any one of the characters inside. In this example, the characters \.- are inside the brackets. We want to allow the user to enter either a period or a dash, but the period has a special meaning to regular expressions, so we need to preface it with a backslash \ to show that we really want to refer to the period itself, not its special meaning. Using a backslash before a special character is called escaping that character. Because of the brackets, the entered string can have either a period or a dash here, but not both. Note that the dash doesn't stand for any special character, just itself.

The question mark ? means that we can have zero or one of the previous item. So along with it being okay to have either a period or a dash in the first part of the email address (the part before the @), it's also okay to have neither.
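Here is a minimal sketch of the bracket and question mark working together. The pattern is a made-up one with a fixed name on either side of the optional separator, just to keep the example short.

```javascript
// Hypothetical pattern: "dori", then at most one period or dash, then "smith".
var sep = /^dori[\.-]?smith/;

var withPeriod  = sep.test("dori.smith");   // true: one period
var withDash    = sep.test("dori-smith");   // true: one dash
var withNeither = sep.test("dorismith");    // true: the ? allows zero separators
var withBoth    = sep.test("dori.-smith");  // false: only one separator fits here
```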



Are You Freaking Out Yet?

If this is the first time that you've been exposed to regular expressions, chances are you're feeling a bit intimidated right about now. We've included this chapter here because it makes the most sense to use regular expressions to validate form entries. But the rest of the material in this book doesn't build on this chapter, so if you want to skip on to the next chapter until you've got a bit more scripting experience under your belt, we won't mind a bit.


Following the ?, we once again have \w+, which says that the period or dash must be followed by some other characters.

The closing parenthesis ) says that this is the end of the group. That's followed by an asterisk *, which means that we can have zero or more of the previous item (in this case, whatever was inside the parentheses). So while "dori" is a valid email prefix, so is "testing-testing-1-2-3".

The @ character doesn't stand for anything besides itself. It sits between the prefix of the email address and the domain name.
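Everything up to and including the @ can be tried on its own. The sketch below copies just that first part of the expression from Script 8.1; the sample strings and variable names are ours.

```javascript
// The first part of the pattern in Script 8.1, up to and including the @.
var prefix = /^\w+([\.-]?\w+)*@/;

var simple      = prefix.test("dori@");                   // true: the group occurs zero times
var dashed      = prefix.test("testing-testing-1-2-3@");  // true: the group repeats five times
var doubled     = prefix.test("dori..smith@");            // false: two periods in a row
var trailingDot = prefix.test("dori.@");                  // false: a period must be followed by \w
var leadingDash = prefix.test("-dori@");                  // false: must start with \w
```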

The \w+ once again says that a domain name must start with one or more of any character "a" through "z", "A" through "Z", "0" through "9", or underscore. That's again followed by ([\.-]?\w+)* which says that periods and dashes are allowed within the suffix of an email address.

We then have another group within a set of parentheses: \.\w{2,3} which says that we're expecting to find a period followed by characters. In this case, the numbers inside the braces mean either 2 or 3 of the previous item (in this case the \w, meaning a letter, number, or underscore). Following the right parenthesis around this group is a +, which again means that the previous item (the group, in this case) must exist one or more times. This will match ".com" or ".edu", for instance, as well as "ox.ac.uk".
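To watch the braces and the trailing plus sign at work, this sketch applies only that last group, anchored at the start, to a few made-up endings. It uses exec(), a standard method that returns the matched text in element 0 of an array, or null when there is no match.

```javascript
// Only the final group of the pattern, anchored at the start for this test.
var suffix = /^(\.\w{2,3})+/;

var com   = suffix.exec(".com");    // three characters after the period
var acUk  = suffix.exec(".ac.uk");  // the group matches twice
var info  = suffix.exec(".info");   // only three of the four letters are taken
var short = suffix.exec(".c");      // one letter is too few

console.log(com[0]);   // prints: .com
console.log(acUk[0]);  // prints: .ac.uk
console.log(info[0]);  // prints: .inf
console.log(short);    // prints: null
```

Notice that ".info" is only partly matched: the group stops after three letters and leaves the "o" behind, so whatever follows the group in the full expression decides whether the string as a whole passes.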

And finally, the regular expression ends with a dollar sign $, which signifies that the matched string must end here. This keeps the script from validating an email address that starts off properly but contains garbage characters at the end. The slash closes the regular expression. The semicolon ends the JavaScript statement, as usual.
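With every piece in place, we can try the complete expression on a few strings. This is only a sketch: the pattern is the one from Script 8.1, but the sample addresses and variable names are ours.

```javascript
// The complete pattern from Script 8.1.
var re = /^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$/;

var plain     = re.test("dori@example.com");                // true
var longer    = re.test("testing-testing-1-2-3@ox.ac.uk");  // true
var trailing  = re.test("dori@example.com!!!");             // false: the $ rejects the extra "!!!"
var noSuffix  = re.test("dori@example");                    // false: no period and 2 or 3 characters
var longTld   = re.test("dori@example.info");               // false: four characters after the last period
```

The last line is an example of an address that this pattern turns away only because its ending is longer than the {2,3} in the final group allows.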

2.
return re.test(email);



This single line takes the regular expression defined in the previous step and uses the test() method to check the validity of email. If the entered string doesn't fit the pattern stored in re, test() returns false, and the user sees the incorrect field and its label turn red and bold, as shown in Figure 8.1. Otherwise, a valid entry returns true (Figure 8.2), and the form submits the email address to a CGI, someAction.cgi, for additional processing.
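The link between the true or false that test() hands back and the red, bold styling is the class name "invalid". The sketch below is a simplified stand-in for what validBasedOnClass() does in Script 8.1, written so that it runs without a web page. The function name classAfterCheck is hypothetical, and unlike the script it leaves out the allGood check.

```javascript
// The pattern from Script 8.1.
var re = /^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$/;

// Hypothetical helper: given a field's class string and its value,
// return the class string the field should end up with.
function classAfterCheck(className, value) {
   var classes = className.split(" ");
   var out = [];

   for (var i = 0; i < classes.length; i++) {
      if (classes[i] == "" || classes[i] == "invalid") {
         continue;                    // drop empty entries and any old "invalid" marker
      }
      if (classes[i] == "email" && !re.test(value)) {
         out.push("invalid");         // test() returned false: mark the field
      }
      out.push(classes[i]);           // always keep the original class
   }
   return out.join(" ");
}

var bad   = classAfterCheck("email", "not-an-address");            // "invalid email"
var good  = classAfterCheck("email", "dori@example.com");          // "email"
var fixed = classAfterCheck("invalid email", "dori@example.com");  // "email"
```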

Figure 8.1. Here's the result if the user enters an invalid email address; the label and field turn red and bold.


Figure 8.2. But this address is just fine.


Tips

  • This code doesn't match every possible legal variation of email addresses, just the ones that you're likely to want to allow a person to enter.

  • Note that in Script 8.1, after we assigned the value of re, we used re as an object in step 2. A regular expression stored in a variable is an object like any other in JavaScript, which is why it has methods such as test().

  • Compare the validEmail() functions in Scripts 7.15 and 8.1. The former has 27 lines of code; the latter, only four. They do the same thing, so you can see that the power of regular expressions can save you a lot of coding.

  • In the script above, someAction.cgi is just an example name for a CGI. It's literally "some action": any action that you want it to be. If you want to learn to write CGIs, we recommend Elizabeth Castro's book Perl and CGI for the World Wide Web, Second Edition: Visual QuickStart Guide.

  • You'll see in Table 8.1 that the special characters (sometimes called meta characters) in regular expressions are case-sensitive. Keep this in mind when debugging scripts that use regular expressions.

    Table 8.1. Regular Expression Special Characters

    Character   Matches
    ---------   -------
    \           Toggles between literal and special characters; for example, "\w" means the special value of "\w" (see below) instead of the literal "w", but "\$" means to ignore the special value of "$" (see below) and use the "$" character instead
    ^           Beginning of a string
    $           End of a string
    *           Zero or more times
    +           One or more times
    ?           Zero or one time
    .           Any character except newline
    \b          Word boundary
    \B          Non-word boundary
    \d          Any digit 0 through 9 (same as [0-9])
    \D          Any non-digit
    \f          Form feed
    \n          New line
    \r          Carriage return
    \s          Any single white space character (same as [ \f\n\r\t\v])
    \S          Any single non-white space character
    \t          Tab
    \v          Vertical tab
    \w          Any letter, number, or the underscore (same as [a-zA-Z0-9_])
    \W          Any character other than a letter, number, or underscore
    \xnn        The ASCII character defined by the hexadecimal number nn
    \onn        The ASCII character defined by the octal number nn
    \cX         The control character X
    [abcde]     A character set that matches any one of the enclosed characters
    [^abcde]    A complemented or negated character set; one that does not match any of the enclosed characters
    [a-e]       A character set that matches any one in the range of enclosed characters
    [\b]        The literal backspace character (different from \b)
    {n}         Exactly n occurrences of the previous character
    {n,}        At least n occurrences of the previous character
    {n,m}       Between n and m occurrences of the previous character
    ()          A grouping, which is also stored for later use
    x|y         Either x or y


  • There are characters in regular expressions that modify other operators. We've listed them in Table 8.2.

    Table 8.2. Regular Expression Modifiers

    Modifier    Meaning
    --------    -------
    g           Search for all possible matches (globally), not just the first
    i           Search without case-sensitivity
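The modifiers in Table 8.2 go after the closing slash of the regular expression. Here is a short sketch of their effect, using the standard replace() method of strings; the sample text and variable names are ours.

```javascript
var text = "dori, Dori, dori";

var plain      = /dori/.test("DORI");          // false: case matters by default
var ignoreCase = /dori/i.test("DORI");         // true: i ignores case

var firstOnly  = text.replace(/dori/, "X");    // "X, Dori, dori": only the first match
var allLower   = text.replace(/dori/g, "X");   // "X, Dori, X": every lowercase match
var allAnyCase = text.replace(/dori/gi, "X");  // "X, X, X": both modifiers together
```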


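Going back to Table 8.1, a few of the special characters that the email pattern doesn't use are easy to try out on their own. This is only a sampler, with patterns and strings we made up for the purpose.

```javascript
var threeDigits = /^\d{3}$/.test("415");             // true: exactly three digits
var fourDigits  = /^\d{3}$/.test("4155");            // false: one digit too many
var either      = /^(cat|dog)$/.test("dog");         // true: x|y accepts either side
var inside      = /\bcat\b/.test("concatenate");     // false: "cat" is inside a longer word
var alone       = /\bcat\b/.test("my cat naps");     // true: "cat" stands as its own word
var negated     = /[^abcde]/.test("face");           // true: "f" is not in the set
var pieces      = "a \t b".split(/\s+/);             // ["a", "b"]: split on runs of white space

// Parentheses store what they matched; exec() returns it in element 1.
var stored = /^(\w+)@/.exec("dori@example.com")[1];  // "dori"
```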

