Introduction to XML

This chapter isn't meant to provide a comprehensive study of XML (Extensible Markup Language). However, we'll look at the basic components of an XML document, which will aid our study of the usage of XML in the context of WebLogic Server. If you're already familiar with the structure of an XML document, you may skip this section and move along to the next section of this chapter.

The Standard Generalized Markup Language (SGML), defined in the ISO standard 8879:1986, provides a structured and consistent way of marking up data for interchange between different systems. You should be familiar with one type of SGML document: an HTML document. HTML is an application of SGML; that is, a markup language defined using SGML. As you might already know, an HTML document is a structured representation of data that a browser that understands HTML can display by using the formatting specified in the document. The browser uses these formatting requirements in conjunction with the user's preferences to display the data. Thus, an HTML document carries the same formatting instructions irrespective of which browser or application uses it, although how those instructions are rendered may depend on the settings of the browser.

HTML is presentation oriented. It does not concern itself with interpreting the data that's represented; all it knows is how the data looks to the user. Thus, if you have your application reading an HTML document, it'll be very difficult to make your application understand the document's content. This is where XML fits in. Like HTML, XML also derives from SGML and can be considered a subset of SGML. However, unlike HTML, XML is not about data presentation. XML does not address the format of the data; instead, it enables you to describe the structure and meaning of your data.

XML is a foundation for many different standards and protocols, notably:

Web Services technologies
J2EE and WebLogic deployment descriptors
Ant

As its name suggests, XML is extensible. You can use an XML document to describe data using elements that you define for your application. Any other application that has to use this data needs to be aware of the elements that it has to look for in the document. When the application understands the tags and how to parse them, it can easily access the data being sent. For these reasons, XML has become a de facto standard for data exchange over the Internet.

For instance, in our example, Listing 29.1 could be a sample XML file that the airline sends to the bank for billing the credit card.

Listing 29.1 A Payment XML Document

1. <?xml version="1.0"?>
2. <!DOCTYPE paymentInfo SYSTEM "paymentinfo.dtd">
3. <paymentInfo>
4.  <creditCard number="1234123412341234" type="MC" expiration="03/2005"/>
5.  <amount>354.99</amount>
6. </paymentInfo>

It's very evident from this listing that unlike an HTML document, an XML document does not have any predefined tags, although the structure of the XML document is quite similar to an HTML document in that it has elements and attributes. You're free to decide on and use elements that best describe your data. For instance, here we have defined an element called creditCard, which has three attributes: number, type, and expiration. We also have an amount element, which indicates the total amount to be charged. Both these tags are wrapped inside a root element called paymentInfo. As long as the bank is aware of the format, the billing application can easily use the data that is being passed in by the airline.

An XML document consists of two parts: the header and the content.

The XML Header

The XML header describes the XML file. As you can see in line 1 of Listing 29.1, the XML declaration tells the parser that the contents are formed based on the version 1.0 specification of XML. The declaration can also contain two other attributes: encoding, which names the character encoding of the document, and standalone, which indicates whether the document can stand alone or requires other documents (such as an external DTD) to make it complete.
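As a minimal sketch of how a parser sees these declaration attributes, the following uses Python's standard xml.dom.minidom module (chosen here only for brevity; it has nothing to do with WebLogic). The encoding and standalone values are assumptions added to line 1 of Listing 29.1 purely for illustration.

```python
from xml.dom import minidom

# Hypothetical variant of Listing 29.1: the optional encoding and
# standalone attributes are added to the declaration for illustration.
DOCUMENT = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    b'<paymentInfo><amount>354.99</amount></paymentInfo>'
)

doc = minidom.parseString(DOCUMENT)

# minidom exposes the values of the XML declaration on the document node.
declaration = {
    "version": doc.version,
    "encoding": doc.encoding,
    "standalone": doc.standalone,
}
print(declaration)
```

If the declaration is left out altogether, these three values are simply not set, which is consistent with the header being optional.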

In line 2, the header contains the DOCTYPE declaration of the document. The airline and the bank have mutually agreed on a set of tags that they'll use to communicate. But how does the bank ensure that the requests adhere to the agreed structure? The bank system uses a dictionary that's based on the mutually agreed structure. This dictionary is known as a document type definition, or DTD. The DOCTYPE is the mechanism by which the XML document indicates to the parser which DTD it conforms to. Based on this declaration, a validating parser checks whether the XML follows all the rules laid out in the DTD. In line 2 of this example, we use the keyword SYSTEM to indicate to the parser that the XML uses a DTD called paymentinfo.dtd. The value that follows the SYSTEM keyword is a URI: it can be a relative or absolute file path, as here, or a URL, and the parser looks for the DTD at that location.

XML documents may also use DTDs that are published under a well-known public name. To do this, they use the keyword PUBLIC instead of SYSTEM. In XML, the public identifier must always be followed by a system identifier (the location of the DTD), although the SYSTEM keyword itself is omitted. If you look at the ejb-jar.xml file, which describes an EJB deployment, you'll notice that the DOCTYPE is given as follows:

<!DOCTYPE ejb-jar PUBLIC '-//Sun Microsystems, Inc.//DTD Enterprise JavaBeans 2.0//EN'
  'http://java.sun.com/dtd/ejb-jar_2_0.dtd'>

Here we indicate to the parser that it should look for the DTD that is publicly known under the name -//Sun Microsystems, Inc.//DTD Enterprise JavaBeans 2.0//EN. The parser tries to resolve the PUBLIC ID first. If it cannot, it uses the URL that follows to locate the DTD.
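To see which pieces of a DOCTYPE a parser actually picks up, here is a small sketch that assumes Python's standard xml.dom.minidom module. The helper name doctype_ids is made up for this example, and the empty root elements are placeholders. minidom does not validate, so neither DTD is fetched.

```python
from xml.dom import minidom

SYSTEM_DOC = (
    '<!DOCTYPE paymentInfo SYSTEM "paymentinfo.dtd">'
    '<paymentInfo/>'
)
PUBLIC_DOC = (
    "<!DOCTYPE ejb-jar PUBLIC "
    "'-//Sun Microsystems, Inc.//DTD Enterprise JavaBeans 2.0//EN' "
    "'http://java.sun.com/dtd/ejb-jar_2_0.dtd'>"
    "<ejb-jar/>"
)


def doctype_ids(xml_text):
    """Return (root name, public ID, system ID) taken from the DOCTYPE."""
    doctype = minidom.parseString(xml_text).doctype
    return doctype.name, doctype.publicId, doctype.systemId


system_ids = doctype_ids(SYSTEM_DOC)
public_ids = doctype_ids(PUBLIC_DOC)
print(system_ids)
print(public_ids)
```

A SYSTEM declaration yields only a system identifier, whereas a PUBLIC declaration yields both identifiers.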

Apart from these, you may also see other tags in the header that carry processing instructions (or PIs) for the application that reads the XML. A processing instruction is written between <? and ?> and consists of a target followed by the data. The data is normally represented as key-value pairs, although that isn't a requirement.
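The following sketch shows how the target and the data of a processing instruction reach an application. It assumes Python's standard xml.dom.minidom module and uses the widely used xml-stylesheet instruction; the stylesheet name payment.xsl is hypothetical.

```python
from xml.dom import minidom

# payment.xsl is a made-up file name used only for this example.
DOCUMENT = (
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/xsl" href="payment.xsl"?>'
    '<paymentInfo/>'
)

doc = minidom.parseString(DOCUMENT)

# Collect (target, data) for every processing instruction in the header.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
```

The parser hands the data over as one uninterpreted string; splitting it into key-value pairs is left to the application. The XML declaration on the first line is not reported as a processing instruction.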

Remember that although the header provides more meaning to your XML document, it isn't required for the XML to be complete. All these tags are optional. If your XML document includes a DOCTYPE declaration and a validating parser confirms that it follows the DTD, the document is considered to be valid. If the document doesn't contain a DOCTYPE, the parser cannot validate it against a DTD. In such a case, your XML document will be considered well formed if it follows the rules laid out by the W3C about the structure of XML documents. Omitting the DOCTYPE therefore trades validity checking for faster parsing. Needless to say, a valid XML document must also be well formed.
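The difference between well formed and valid can be sketched with a nonvalidating parser. The example below assumes Python's standard xml.etree.ElementTree module, which checks only well-formedness: it reads Listing 29.1 without ever opening paymentinfo.dtd, so the file does not need to exist.

```python
import xml.etree.ElementTree as ET

# Listing 29.1, reproduced without its line numbers.
LISTING_29_1 = """<?xml version="1.0"?>
<!DOCTYPE paymentInfo SYSTEM "paymentinfo.dtd">
<paymentInfo>
 <creditCard number="1234123412341234" type="MC" expiration="03/2005"/>
 <amount>354.99</amount>
</paymentInfo>"""

# A nonvalidating parse: the DOCTYPE is read, but the DTD is not consulted.
root = ET.fromstring(LISTING_29_1)

amount = root.findtext("amount")
card_type = root.find("creditCard").get("type")
print(amount, card_type)
```

Because nothing is checked against the DTD here, an application that relies on this kind of parsing has to verify for itself that the elements it expects are present.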

The XML Content

As mentioned earlier, the XML content is pretty much open for definition by the application in question. It does have to be well formed; that is, it must conform to some basic rules that are laid out by the W3C. The W3C document can be accessed online at http://www.w3.org/xml. This section aims at defining some of the pieces of the puzzle that make up your XML content.

Elements

An element in an XML document describes a piece of data. Consider lines 4 and 5 in Listing 29.1. We define two elements, one being creditCard and the other being amount. At first glance, these two might look different, but they really aren't. Each element describes a particular piece of the data.

Elements are made up using arbitrary element names, which are enclosed in angle brackets (< and >). Names must begin with a letter or an underscore. Names can be of any length, and can contain letters, digits, underscores, hyphens, and periods. Names cannot contain embedded spaces, and names that begin with the letters xml are reserved. Element names are case sensitive. You can typically use the same naming conventions that you follow for naming Java variables to create XML element names. Element names can be as descriptive as you choose, but making them unnecessarily long can cause confusion while reading the XML file.
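The naming rules above can be approximated with a regular expression. This is a simplified, ASCII-only sketch in Python; the function name is_simple_xml_name is made up for the example, and the real XML grammar admits more characters (many non-ASCII letters, and the colon that namespaces use).

```python
import re

# ASCII-only approximation of the naming rules described above: a letter
# or underscore first, then letters, digits, underscores, periods, hyphens.
_SIMPLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_simple_xml_name(name):
    """Check a candidate element name against the simplified rules."""
    return _SIMPLE_NAME.fullmatch(name) is not None


for candidate in ("creditCard", "_amount", "payment-info.v2",
                  "2ndCard", "credit card"):
    print(candidate, is_simple_xml_name(candidate))
```

The first three candidates pass. The last two fail because one starts with a digit and the other contains a space.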

All open elements must be closed. Elements are closed by using an ending tag, which consists of a forward slash (/) followed by the name of the element that's being closed; for example, </amount>. Between the opening and closing tags of an element, you may have any number of sub-elements and raw text.

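XML elements must be properly nested: an element that is opened inside another element must also be closed inside it, so tags cannot overlap. Browsers tolerate overlapping tags in HTML, such as <b><i>text</b></i>, but an XML parser rejects them. HTML does not require the document to be well formed, whereas XML does.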
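A quick way to see these rules enforced is to feed a few fragments to a parser. This sketch assumes Python's standard xml.etree.ElementTree module; the helper parse_error and the sample fragments are made up for the example.

```python
import xml.etree.ElementTree as ET


def parse_error(xml_text):
    """Return the parser's error message, or None if the text parses."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


NESTED = "<p><b><i>bold italic</i></b></p>"       # inner element closed first
OVERLAPPING = "<p><b><i>bold italic</b></i></p>"  # closing tags cross over
UNCLOSED = "<p>first<br>second</p>"               # <br> is never closed

for sample in (NESTED, OVERLAPPING, UNCLOSED):
    print(sample, "->", parse_error(sample))
```

Only the first fragment parses. The other two would be displayed by most browsers as HTML, but they are not well-formed XML.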

Now consider the difference between lines 4 and 5 in Listing 29.1. Line 4 describes a credit card element and looks like the following:

4.    <creditCard number="1234123412341234" type="MC" expiration="03/2005"/>

This is a well-formed element. However, it doesn't have a closing tag of </creditCard>. Or does it? In an HTML document, many tags need an explicit closing tag even when they have no content. In an XML document, you can use a shortcut to close an empty element, thus reducing the clutter in your document. The shortcut is to end the tag with the characters />. This is the same as defining the creditCard tag with its attributes, closing the tag with the angle bracket, and immediately following it with a </creditCard> tag. Thus, in this case, we define an element called creditCard, give it some attributes, and close it, all within the same tag. This concept can be extended to define empty tags that carry no attributes at all and act like Booleans in your XML document. For instance, if the airline were to tell the bank that the bank needs only to authorize the amount and not actually make the charge, it could add a new element, <authorizeOnly/>, to its XML document. Of course, in an XML document, there are several ways in which a particular piece of data can be represented. For instance, the <authorizeOnly/> tag could convey the same information as an <operation type="authorize"/> element.
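The following sketch shows that a parser treats the short and long forms of an empty element alike, and how an application might read such an element as a Boolean. It assumes Python's standard xml.etree.ElementTree module; the function name authorize_only and the three sample documents are made up for the example.

```python
import xml.etree.ElementTree as ET

SHORT_FORM = "<paymentInfo><authorizeOnly/></paymentInfo>"
LONG_FORM = "<paymentInfo><authorizeOnly></authorizeOnly></paymentInfo>"
CHARGE = "<paymentInfo><amount>354.99</amount></paymentInfo>"


def authorize_only(xml_text):
    """Treat the mere presence of the authorizeOnly element as a flag."""
    return ET.fromstring(xml_text).find("authorizeOnly") is not None


for sample in (SHORT_FORM, LONG_FORM, CHARGE):
    print(sample, "->", authorize_only(sample))
```

Both spellings of the empty element produce the same parsed tree, so the application cannot tell which one the sender used, and should not need to.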

The root element is the top-level element of the content; it encloses every other element but does not include the header information. A well-formed XML document has exactly one root element. For instance, in our examples, the <paymentInfo> element forms the root element. For all practical purposes, the root element is like any other element in the XML document. It is special only because it describes the data that's represented by the document as a whole.

Attributes

Consider line 4 in Listing 29.1. From this line, it's clear that an element can contain not only data between its start and end tags, but also attributes. Attributes describe an element. They are defined as key-value pairs within the starting tag of an element, and a given attribute name can appear only once in that tag. Thus, in our credit card example, the attributes of the credit card are its number, type, and expiration date. Attribute names follow the same rules as element names. The value of the attribute is enclosed within either single or double quotes; it is standard practice to use double quotes. Thus, you can define an element called <paymentInfo> as follows, which practically replaces the entire XML document described earlier:

<paymentInfo cardNum="xxx" type="xx" expDate="xx/xxxx">
  200.00
</paymentInfo>

Here you list the credit card data as attributes and the amount as the value of the element. Another form of representing the same data is the following:

<paymentInfo>
  <creditCard>
    <number>xxxx</number>
    <type>xx</type>
    <expDate>xx/xxxx</expDate>
  </creditCard>
  <amount>200.00</amount>
</paymentInfo>

So, which is the correct way of representing this XML? Well, there are no correct or incorrect ways. These are all different representations. What determines whether a piece of data is to be represented as a value or an attribute of an element? Again, there is no hard-and-fast rule. One general rule of thumb is that if a piece of data can have multiple values or is very long, it is generally better off defined as an element rather than an attribute. On the other hand, a DTD can constrain attribute values more tightly than element text: it can list the valid values that may go into an attribute. Thus, if your data requires that kind of validation, you should choose an attribute rather than an element. Finally, the order of the data might be important, or data could be repeated. In such cases, using elements instead of attributes allows repetition of tags and validation of the order of the data. Attributes cannot be repeated within a tag, nor can their order be validated.
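To illustrate that the two representations carry the same information, here is a sketch that reads each of them into the same dictionary. It assumes Python's standard xml.etree.ElementTree module. The function names are made up for the example, and the xxx placeholders of the two documents are filled with sample values.

```python
import xml.etree.ElementTree as ET

ATTRIBUTE_FORM = """
<paymentInfo cardNum="1234123412341234" type="MC" expDate="03/2005">
  200.00
</paymentInfo>
"""

ELEMENT_FORM = """
<paymentInfo>
  <creditCard>
    <number>1234123412341234</number>
    <type>MC</type>
    <expDate>03/2005</expDate>
  </creditCard>
  <amount>200.00</amount>
</paymentInfo>
"""


def from_attribute_form(xml_text):
    """Read the payment data stored as attributes of the root element."""
    root = ET.fromstring(xml_text)
    return {
        "number": root.get("cardNum"),
        "type": root.get("type"),
        "expDate": root.get("expDate"),
        # Element text keeps its surrounding whitespace, so trim it.
        "amount": root.text.strip(),
    }


def from_element_form(xml_text):
    """Read the payment data stored as nested elements."""
    root = ET.fromstring(xml_text)
    card = root.find("creditCard")
    return {
        "number": card.findtext("number"),
        "type": card.findtext("type"),
        "expDate": card.findtext("expDate"),
        "amount": root.findtext("amount"),
    }


print(from_attribute_form(ATTRIBUTE_FORM) == from_element_form(ELEMENT_FORM))
```

The reading code differs, but the data that comes out is identical, which is why the choice between the two forms is mostly a matter of validation needs and readability.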

Entity References

Sometimes it becomes important that you use characters that are usually considered special characters in your XML data. For instance, you already know that an XML file is built using tags that are wrapped in angle brackets. So, how would you use an angle bracket within your data? For instance, if you want to represent the mathematical condition x < y as an XML condition element, how would you do it? The first thing that comes to mind is to represent it as follows:

<condition>x < y</condition>

It doesn't take more than a few seconds to realize that this does not make a well-formed XML document, because the parser treats the < as the start of a new tag. That leaves us with the question of how to represent the less-than symbol in XML. To represent such data, you use entity references. An entity reference is a special symbol that stands for other data within an XML document. Thus, when a parser encounters an entity reference in your document, it knows to replace it with the data that the reference represents. Entity references are of the format &[reference-name];, where the [reference-name] part is replaced with the appropriate entity name. The same symbols are used in HTML. The entity references that are predefined in XML are listed in Table 29.1.

Table 29.1. XML Entity References
Data Represented       Data   Entity Reference Used
Less than bracket      <      &lt;
Greater than bracket   >      &gt;
Ampersand              &      &amp;
Double quote           "      &quot;
Apostrophe             '      &apos;

Thus, you can represent the mathematical condition as

<condition>x &lt; y</condition>

A parser that parses this element will know to replace the &lt; with a < symbol.
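The replacement can be observed in both directions. The sketch below assumes Python's standard library: xml.etree.ElementTree for parsing, and the escape helper from xml.sax.saxutils for producing entity references when writing data out. The longer condition string is made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Reading: the parser replaces the entity reference with the character
# that it stands for.
condition = ET.fromstring("<condition>x &lt; y</condition>").text

# Writing: escape() produces the references for &, < and >.
escaped = escape("x < y && y > 0")

print(condition)
print(escaped)
```

An application normally never sees the entity references themselves; it works with the plain characters and leaves the escaping to the library that writes the XML.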

CDATA Section

Sometimes, certain data that's represented by your XML document may contain so many special characters that it's better for the parser not to attempt to interpret it as markup, and to simply feed it to the application. An example of this would be a snippet of code that is embedded within your XML document. Your code will probably make use of so many special characters that if you use entity references for each of them, the XML document becomes hard to read and easy to get wrong. One way of avoiding this is by wrapping your data within a CDATA block, which starts with <![CDATA[ and ends with ]]>. By doing so, you're instructing the parser not to look for markup in the data, but simply to return it to the application. Entity references aren't recognized inside a CDATA block; everything in it is considered to be raw text. Thus, a condition block that uses CDATA looks like this:

<condition>
  <![CDATA[
    x < y ;
  ]]>
</condition>
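As a sketch of what the application receives from a CDATA block, the following assumes Python's standard xml.etree.ElementTree module. The condition itself is extended slightly for the example so that it contains several special characters.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<condition>
  <![CDATA[
    x < y && y > 0;
  ]]>
</condition>"""

# The CDATA content reaches the application untouched; the CDATA markers
# themselves are not part of the text.
text = ET.fromstring(DOCUMENT).text

# Entity references inside a CDATA block are not expanded.
literal = ET.fromstring("<condition><![CDATA[x &lt; y]]></condition>").text

print(text.strip())
print(literal)
```

Note that the whitespace around the condition is part of the raw text as well, which is why the example strips it before printing.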

Comments

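You can include comments within an XML document by beginning them with the string <!-- and ending them with the string -->. The body of a comment cannot contain two consecutive hyphens (--). The following is a valid comment within an XML document: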

<!-- This document represents a mathematical condition -->
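Whether an application ever sees a comment depends on the parser and the API in use. As a sketch, Python's standard xml.dom.minidom module (an assumption of this example) keeps comments as nodes of their own, separate from the element data:

```python
from xml.dom import minidom

DOCUMENT = (
    "<!-- This document represents a mathematical condition -->"
    "<condition>x &lt; y</condition>"
)

doc = minidom.parseString(DOCUMENT)

# Comments show up as separate nodes and never mix with element text.
comments = [
    node.data.strip()
    for node in doc.childNodes
    if node.nodeType == node.COMMENT_NODE
]
condition = doc.documentElement.firstChild.data

print(comments)
print(condition)
```

Comments are meant for human readers, so data that an application depends on is better placed in elements or attributes.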

Namespaces

A namespace qualifies a name. Conceptually, namespaces in XML are very simple, but they can be hard to understand without working through an example. Consider the simple XML file shown in Listing 29.2, which describes how a book inventory has to be displayed on the screen. It has embedded HTML code to provide formatting.

Listing 29.2 XML with Two Types of Data Embedded

1.  <html>
2.    <head><title>Book Inventory</title></head>
3.    <body>
4.      <bookInventory>
5.        <table>
6.          <tr>
7.           <td>Title</td><td>Published by</td>
8.          </tr>
9.          <tr>
10.           <td>
11.             <title>
12.              WebLogic Server Unleashed
13.            </title>
14.           </td>
15.           <td>
16.             <publisher>SAMS Publications</publisher>
17.           </td>
18.         </tr>
19.       </table>
20.     </bookInventory>
21.   </body>
22. </html>

Here we're creating an HTML table that contains information about some books. All is well when we look at it, but consider an application parsing through this XML document. It has to deal with a whole lot of HTML code, when all it's looking for is the data about the books. Look at lines 2 and 11. In line 2, we display the title of the HTML page, and in line 11, we have the title of the book. Both the tags are defined as title. Although this is correct, it can get very tricky for an application that's parsing through this XML document.

To work around this, XML namespaces were introduced. An XML namespace is essentially a qualifier to a name. Instead of saying title, you would now qualify the title to either the presentation logic or to the data. The Namespaces in XML recommendation, a companion to the XML 1.0 specification, uses URIs for qualifying tag names. A fully qualified name is often written with the namespace URI within curly braces just before the tag or attribute name. This is a notation for describing names rather than markup that a parser accepts. Thus, conceptually, the title tag that specifies the title of the page may be written as

<{http://www.w3.org/html}title>
  Book Inventory
</{http://www.w3.org/html}title>

However, spelling out the URI with every name would make the XML unreadable. To overcome this problem, XML provides a shorthand mechanism to specify namespaces. To specify a shorthand, use the reserved xmlns attribute. For example, here we create a presentation prefix that points to the HTML namespace, and then use it to qualify all HTML tags in our XML file.

<presentation:html xmlns:presentation="http://www.w3.org/html">
  <presentation:head><presentation:title>
    Book Inventory
  </presentation:title></presentation:head>
...

Here we specify a shorthand (a prefix) called presentation by defining the attribute xmlns:presentation on the html tag. We point this shorthand to the URI that identifies HTML. When we specify a tag or an attribute that belongs to the HTML namespace, we simply add the prefix presentation: to its name; for example, presentation:title. Note that children at all levels within the html tag have access to this shorthand.

If you try typing the URI that we specified into a Web browser's address bar, there's a very good chance that your browser will take you nowhere. This is because the namespace specification does not require that the URI point to a resource that actually exists. All it requires is a unique URI that applications can then use to tell one set of names from another.

Given all this, the complete XML document with namespaces defined for html and the data would look as follows:

<presentation:html xmlns:data="http://www.xyzcompany.com/books"
    xmlns:presentation="http://www.w3.org/HTML/1998/html4">
  <presentation:head><presentation:title>
   Book Inventory
  </presentation:title></presentation:head>
  <presentation:body>
   <data:bookInventory>
     <presentation:table>
      <presentation:tr align="center">
        <presentation:td>Title</presentation:td>
        <presentation:td>Published by</presentation:td>
      </presentation:tr>
      <presentation:tr align="left">
        <presentation:td><data:title>WebLogic Server 7.0 Unleashed
           </data:title></presentation:td>
        <presentation:td>
          <data:publisher>SAMS</data:publisher>
        </presentation:td>
      </presentation:tr>
     </presentation:table>
   </data:bookInventory>
  </presentation:body>
</presentation:html>
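The payoff of the namespaces is that an application can now ask for the book title without tripping over the page title. The sketch below assumes Python's standard xml.etree.ElementTree module and uses a shortened copy of the document above. The prefixes d and p are chosen by the application and need not match the prefixes in the document, because only the URIs matter.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """
<presentation:html xmlns:data="http://www.xyzcompany.com/books"
    xmlns:presentation="http://www.w3.org/HTML/1998/html4">
  <presentation:head><presentation:title>
   Book Inventory
  </presentation:title></presentation:head>
  <presentation:body>
   <data:bookInventory>
     <presentation:table>
      <presentation:tr align="left">
        <presentation:td><data:title>WebLogic Server 7.0 Unleashed
           </data:title></presentation:td>
        <presentation:td>
          <data:publisher>SAMS</data:publisher>
        </presentation:td>
      </presentation:tr>
     </presentation:table>
   </data:bookInventory>
  </presentation:body>
</presentation:html>
"""

# Prefixes are local to a document, so the application declares its own.
NS = {
    "d": "http://www.xyzcompany.com/books",
    "p": "http://www.w3.org/HTML/1998/html4",
}

root = ET.fromstring(DOCUMENT)
book_titles = [e.text.strip() for e in root.iterfind(".//d:title", NS)]
page_titles = [e.text.strip() for e in root.iterfind(".//p:title", NS)]

print(root.tag)
print(book_titles, page_titles)
```

ElementTree reports the root element's name with the namespace URI in curly braces in front of the local name, which is the same notation that was used earlier to describe qualified names.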

You can also specify a default namespace by using the xmlns attribute without any prefix. Any tag that isn't prefixed is then associated with the default namespace. Unprefixed attributes are the exception: they don't pick up the default namespace.
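Here is a minimal sketch of a default namespace at work, again assuming Python's standard xml.etree.ElementTree module. The document is a made-up, cut-down variant of the book inventory in which the HTML namespace is the default and only the data elements carry a prefix. Under the Namespaces in XML rules, the default namespace applies to unprefixed element names but not to unprefixed attribute names.

```python
import xml.etree.ElementTree as ET

HTML_NS = "http://www.w3.org/HTML/1998/html4"

DOCUMENT = """
<html xmlns="http://www.w3.org/HTML/1998/html4"
      xmlns:data="http://www.xyzcompany.com/books">
  <body>
    <table border="1">
      <tr><td><data:title>WebLogic Server 7.0 Unleashed</data:title></td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(DOCUMENT)
table = root.find(".//{%s}table" % HTML_NS)

# Unprefixed elements pick up the default namespace ...
table_tag = table.tag
# ... but unprefixed attributes stay outside any namespace.
attribute_names = list(table.attrib)

print(table_tag)
print(attribute_names)
```

Using a default namespace for the most frequent vocabulary keeps the document readable, while the less frequent one still stands out through its prefix.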
