
Recipe 26.1 Parsing an HTML Page Using the javax.swing.text Subpackages


You want to use the classes the Java 2 Standard Edition (J2SE) makes available for parsing HTML.


Use the various subpackages of the javax.swing.text package to create a parser for HTML.


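The J2SE 1.3 and 1.4 versions include the necessary classes for sifting through web pages in search of information. The Java programs in these recipes import the following classes: javax.swing.text.html.HTMLEditorKit.ParserCallback, javax.swing.text.MutableAttributeSet, and javax.swing.text.html.parser.ParserDelegator.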


The design pattern that these classes use to read web pages involves three main elements:

  1. A java.net.URL object that opens up a socket or InputStream to the web page using HTTP. The code then uses this object to read the web page.

  2. A ParserDelegator object with which the code sifts through the web page by calling this object's parse( ) method.

  3. A ParserCallback object that the ParserDelegator uses to take certain actions while it is parsing the web page's HTML text. In general, a callback is an object that Java code passes into another object's constructor or method. The receiving object then drives the callback by invoking its methods, which the Java programmer implements according to what they want to accomplish by parsing the HTML. The role of the callback will become clearer as you read through these recipes.
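The three elements above can be sketched together in one small, self-contained class. This is only an illustration, not the recipe's servlet or JavaBean: the class name LinkCounter, the method countLinks( ), and the choice to count a tags are all hypothetical, and reading the stream as UTF-8 is an assumption about the page's encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class; it only shows how the three pieces fit together.
final class LinkCounter {

    // Element 3: a callback whose handleStartTag() counts <a> start tags.
    private static final class CountingCallback extends ParserCallback {
        int links = 0;

        @Override
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (t == HTML.Tag.A) {
                links++;
            }
        }
    }

    static int countLinks(URL page) {
        CountingCallback callback = new CountingCallback();
        // Element 1: the URL supplies the stream to read from.
        // UTF-8 is an assumption; a real page may declare another encoding.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(page.openStream(), StandardCharsets.UTF_8))) {
            // Element 2: the ParserDelegator reads the stream and drives the callback.
            new ParserDelegator().parse(in, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.links;
    }
}
```

Because countLinks( ) accepts any URL, the sketch can be tried offline against a file: URL before it is pointed at a live web page.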

The servlet and JavaBean defined in this chapter use an inner class to implement the callback. Example 26-1 shows the callback that extends javax.swing.text.html.HTMLEditorKit.ParserCallback.

Example 26-1. A callback class for sifting through web pages
class MyParserCallback extends ParserCallback {

      //bread crumbs that lead us to the stock price
      private boolean lastTradeFlag = false; 
      private boolean boldFlag = false;

  public MyParserCallback( ){
    //Reset the enclosing class' stock-price instance variable
        if (stockVal != 0)
        stockVal = 0f;
  }

  //A method that the parser calls each time it confronts a start tag
  public void handleStartTag(javax.swing.text.html.HTML.Tag t,
    MutableAttributeSet a,int pos) {
      if (lastTradeFlag && (t == javax.swing.text.html.HTML.Tag.B )){
          boldFlag = true;
      }
  }

  //A method that the parser calls each time it reaches nested text content
  public void handleText(char[] data,int pos){
      //htmlText is not declared here; like stockVal, it belongs to
      //the enclosing class
      htmlText  = new String(data);
      if (htmlText.indexOf("No such ticker symbol.") != -1){
              throw new IllegalStateException(
                "Invalid ticker symbol in handleText( ) method.");
      }  else if (htmlText.equals("Last Trade:")){
          lastTradeFlag = true;
      } else if (boldFlag){
          try {
              stockVal = Float.parseFloat(htmlText);
          } catch (NumberFormatException ne) {
              try {
                  // tease out any commas in the number using NumberFormat
                  java.text.NumberFormat nf = java.text.NumberFormat.
                    getInstance( );
                  //parse( ) returns a Number that may be a Long or a
                  //Double, so convert it instead of casting to Double
                  stockVal = nf.parse(htmlText).floatValue( );
              } catch (java.text.ParseException pe){
                   throw new IllegalStateException(
                          "The extracted text " + htmlText +
                       " cannot be parsed as a number!");
              }
          }
          //Reset the inner class's instance variables
          lastTradeFlag = false;
          boldFlag = false;
      }
  } //handleText
} //MyParserCallback

A callback's methods correspond to the parser reaching particular parts of a web page during the parsing process. For example, the parser (the object that drives the callback object) calls handleStartTag( ) whenever it runs into an opening tag as it traverses the web page. Examples of opening tags are <html>, <title>, and <body>. Therefore, by implementing the handleStartTag( ) method, you control what your program does when it finds an opening tag, such as "prepare to grab the text that appears between the opening and closing title tags."
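As a minimal sketch of the "grab the title text" idea, the callback below sets a flag in handleStartTag( ), collects text in handleText( ) while the flag is set, and clears the flag in handleEndTag( ). The class name TitleGrabber and the helper titleOf( ) are hypothetical, and the HTML is read from a string rather than a web page so the example stays self-contained.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class; not part of the recipe's servlet or JavaBean.
final class TitleGrabber extends ParserCallback {

    private boolean inTitle = false;                 // true between <title> and </title>
    private final StringBuilder title = new StringBuilder();

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.TITLE) {
            inTitle = true;                          // prepare to grab the nested text
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inTitle) {
            title.append(data);
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.TITLE) {
            inTitle = false;                         // stop collecting text
        }
    }

    // Parses the given HTML string and returns the text found inside its title tag.
    static String titleOf(String html) {
        TitleGrabber grabber = new TitleGrabber();
        try (Reader in = new StringReader(html)) {
            new ParserDelegator().parse(in, grabber, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return grabber.title.toString().trim();
    }
}
```

The same flag-then-collect shape is what the stock-quote callback relies on, only with different tags and text.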

Example 26-1 uses a particular algorithm to search a web page for an updated stock quote, and this is what the two methods (handleStartTag( ) and handleText( )) accomplish in the MyParserCallback class:

  1. The handleText( ) callback method looks for the text "Last Trade:"; if it finds that text, it sets the lastTradeFlag boolean variable to true. This is like "dropping a bread crumb" as the program travels through the vast HTML of the web page.

  2. If handleStartTag( ) finds a b tag after "Last Trade:" has been found (the lastTradeFlag flag is true), it sets boldFlag to true. The next call to handleText( ) then grabs the nested content of that b tag, because this content represents the stock quote, and resets both flags.
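The two steps can be tried in isolation with a standalone variant of the callback. Everything beyond the two flags is an assumption made for illustration: the class name LastTradeFinder, the helper lastTrade( ), the sample markup in the usage note, trimming the text before comparing it, and parsing the number with a US-locale NumberFormat rather than trying Float first.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical standalone variant of the bread-crumb search; not the recipe's inner class.
final class LastTradeFinder extends ParserCallback {

    private boolean lastTradeFlag = false;   // set once "Last Trade:" has been seen
    private boolean boldFlag = false;        // set when a <b> follows that label
    private Float price = null;              // stays null if no quote is found

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (lastTradeFlag && t == HTML.Tag.B) {
            boldFlag = true;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        String text = new String(data).trim();
        if (text.equals("Last Trade:")) {
            lastTradeFlag = true;            // drop the bread crumb
        } else if (boldFlag) {
            try {
                // NumberFormat accepts grouping commas such as "1,234.50"
                price = NumberFormat.getInstance(Locale.US).parse(text).floatValue();
            } catch (ParseException e) {
                throw new IllegalStateException(
                    "The extracted text " + text + " cannot be parsed as a number!");
            }
            lastTradeFlag = false;           // reset both flags after grabbing the quote
            boldFlag = false;
        }
    }

    // Parses the HTML string and returns the quote, or fails if none was found.
    static float lastTrade(String html) {
        LastTradeFinder finder = new LastTradeFinder();
        try (Reader in = new StringReader(html)) {
            new ParserDelegator().parse(in, finder, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (finder.price == null) {
            throw new IllegalStateException("No quote found in the page.");
        }
        return finder.price;
    }
}
```

For instance, given a made-up fragment such as <td>Last Trade:</td><td><b>1,234.50</b></td> inside a table, lastTrade( ) ignores any bold text that appears before the label and parses only the bold text that follows it.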

The big drawback of web harvesting, which web services are partly designed to solve, is that when the web page you are parsing changes, your program may throw exceptions or simply stop finding the information, because its algorithms are based on the old page structure.

Example 26-2 shows a snippet of code that uses the ParserDelegator and MyParserCallback objects, just to give you an idea of how they fit together before we move on to the servlet and JSP.

Example 26-2. A code snippet shows the parser and callback classes at work
//Instance variables
private ParserDelegator htmlParser = null;
private MyParserCallback callback = null;

//Initialize a BufferedReader and a URL inside of a method for connecting 
//to and reading a web page
BufferedReader webPageStream = null;
URL stockSite = new URL(BASE_URL + symbol);

//Connect inside of a method
webPageStream = new BufferedReader(
  new InputStreamReader(stockSite.openStream( )));

//Create the parser and callback           
htmlParser = new ParserDelegator( );
callback = new MyParserCallback( );//ParserCallback
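//Call parse( ), passing in the BufferedReader and callback objects;
//the third argument (ignoreCharSet) tells the parser to ignore any
//character-set declaration in the page
htmlParser.parse(webPageStream,callback,true);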

The parse( ) method of ParserDelegator is what triggers the calling of the callback's methods, with the callback passed in as an argument to parse( ).
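One way to see parse( ) driving the callback is to record each call as it happens. The sketch below is illustrative only: the class name CallbackTrace and the "start:", "text:", and "end:" labels are made up, and the input is a string rather than a web page.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class that logs the callback methods parse() invokes.
final class CallbackTrace {

    static List<String> trace(String html) {
        final List<String> events = new ArrayList<>();
        ParserCallback callback = new ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                events.add("start:" + t);    // HTML.Tag.toString() gives the tag name
            }

            @Override
            public void handleText(char[] data, int pos) {
                events.add("text:" + new String(data).trim());
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                events.add("end:" + t);
            }
        };
        try (Reader in = new StringReader(html)) {
            // Nothing is recorded until parse() runs; it invokes the methods above.
            new ParserDelegator().parse(in, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head>"
            + "<body><p>Hello</p></body></html>";
        for (String event : trace(html)) {
            System.out.println(event);
        }
    }
}
```

Running main( ) prints the recorded calls in the order the parser made them; the parser may also report tags it infers that are not literally in the input.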

Now let's see how these classes work in a servlet, JavaBean, and JSP.

See Also

A Javadoc link for ParserDelegator: http://java.sun.com/j2se/1.4.1/docs/api/javax/swing/text/html/parser/ParserDelegator.html; Chapter 27 on using web services APIs to grab information from web servers.
