What Is XML?

XML stands for Extensible Markup Language, and its very flexibility makes it notoriously hard to define. It is beyond the scope of this book to provide a complete introduction to XML, but we can cover some of the basics. If you would like to read more about XML, please read Sams Teach Yourself XML in 24 Hours (ISBN 0-672-32213-7). For a formal definition, see http://www.w3.org/XML/.

XML is a markup language that enables you to define your own markup languages. In fact, it is more a set of rules than a language in itself. These rules determine the ways in which you can define tags and elements (similar to HTML elements). As long as you obey the rules, you have complete freedom to create languages that fulfill a whole range of functions. Because the rules are strict, XML interpreters can easily read XML documents and make their contents available to scripts that can then act on the instructions they contain.

An XML document usually starts with an XML declaration, like so:

<?xml version="1.0"?>

It might also include a document type declaration, which refers to a document type definition (DTD). DTDs are beyond the scope of this book, but they define which elements a document can contain, and in what order. Here's an example of a document type declaration:

<!DOCTYPE rootel SYSTEM "http://www.corrosive.co.uk/sample.dtd">
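As a rough illustration of how a parser treats these opening lines, the following Python sketch feeds a tiny document to the standard library's xml.etree.ElementTree module. The empty <rootel /> body is a made-up placeholder added only so that the fragment is a complete document. This parser does not validate, so it reads past the document type declaration without checking the document against the DTD.

```python
import xml.etree.ElementTree as ET

# A tiny document: XML declaration, document type declaration, root element.
# The <rootel /> body is a hypothetical placeholder, not part of the original.
DOC = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE rootel SYSTEM "http://www.corrosive.co.uk/sample.dtd">\n'
    '<rootel />'
)

# ElementTree is a non-validating parser: it accepts the prolog and
# returns the root element without enforcing the DTD.
root = ET.fromstring(DOC)
print(root.tag)
```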

The rest of an XML document is made up primarily of tags that combine to form elements and attributes. XML elements look very similar to HTML elements. An XML element is made up of starting and ending tags that can surround text or other elements.

A starting tag consists of a less-than sign (<) followed by an element name followed by a greater-than sign (>). Starting tags can also contain attributes, each consisting of an attribute name and a quoted attribute value separated by an equals sign. The following fragment illustrates a starting tag containing an attribute:

<newsitem type="world">

Both attribute and element names must begin with a letter or an underscore, followed by any combination of letters, numbers, hyphens, underscores, and periods. Names that begin with the letters xml, in any mix of upper- and lowercase, are reserved.
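The naming rules can be approximated with a short check. This is a minimal Python sketch, not a full implementation of the XML specification: it handles ASCII names only, and the function name is_plain_xml_name is made up for illustration. After the first character it also admits hyphens, periods, and underscores, which XML permits in names such as banana-news.

```python
import re

# Simplified, ASCII-only approximation of the XML naming rules.
# The real specification also allows many non-ASCII characters and colons.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_plain_xml_name(name: str) -> bool:
    """Return True if name fits the simplified rules and avoids the reserved xml prefix."""
    if name.lower().startswith("xml"):
        return False  # names starting with "xml" (any case) are reserved
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["newsitem", "banana-news", "_private", "2fast", "xmlthing"]:
    print(candidate, is_plain_xml_name(candidate))
```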

A closing tag consists of a less-than sign (<) and a forward slash (/), followed by an element name and a greater-than sign (>), as shown here:

</newsitem>
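Put together, the two tags form a complete element. The following Python sketch, which assumes the standard library's xml.etree.ElementTree module and uses made-up text content, shows how a parser exposes the element name, the attribute, and the enclosed text.

```python
import xml.etree.ElementTree as ET

# A starting tag with one attribute, some text, and the matching closing tag.
# The text "Sample text" is invented for illustration.
element = ET.fromstring('<newsitem type="world">Sample text</newsitem>')

print(element.tag)          # the element name
print(element.attrib)       # all attributes, as a dictionary
print(element.get("type"))  # one attribute value, looked up by name
print(element.text)         # the text between the two tags
```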

As you can see, XML elements look pretty familiar. One variation you might not be used to, however, is the empty element. An empty element can be compressed into a single tag, so

<nothinghere></nothinghere>

would become

<nothinghere />
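To a parser the two forms are equivalent. The short Python sketch below (again assuming the standard library's xml.etree.ElementTree, with a helper name, summary, chosen for illustration) parses both and compares what comes out.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<nothinghere></nothinghere>")
short_form = ET.fromstring("<nothinghere />")


def summary(element):
    """Collect the parts of an element that a script would normally inspect."""
    return (element.tag, element.attrib, element.text, len(element))


# Both spellings yield the same name, no attributes, no text, and no children.
same = summary(long_form) == summary(short_form)
print(same)
```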

Listing 22.1 pulls all this together into a sample XML document. This is a shortened version of the XML document that we will be working on throughout the chapter.

Listing 22.1 An XML Document

 1: <?xml version="1.0"?>
 2: <banana-news>
 3:     <newsitem type="world">
 4:         <headline>Banana sales reach all time high</headline>
 5:         <image>/res/high.gif</image>
 6:         <byline>William Curvey</byline>
 7:         <article>Research published today by the World Banana
 8:             Tribunal suggests that we have never had it so
 9:             good banana-wise...</article>
10:     </newsitem>
11:
12:     <newsitem type="home">
13:         <headline>Domestic banana use beggars belief</headline>
14:         <image>/res/use.gif</image>
15:         <byline>Charles Split</byline>
16:         <article>Bananas are for more than eating it seems. Local
17:             Innovation Centers have been showcasing some
18:             exciting banana related technologies...</article>
19:     </newsitem>
20: </banana-news>

Although Listing 22.1 looks a little like an HTML document, you can see that it contains entirely made-up element names. That is the point of XML. It hands the control and the responsibility over to the developer. An XML interpreter validates syntax and lets you easily access the elements, but it is up to you to write code to act on the information received.
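To make that division of labor concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The document is a trimmed copy of Listing 22.1, and the function name summarize is an assumption made for this example. The parser handles the syntax; the decision about what to do with each news item is left to our own code.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of Listing 22.1 (image and article elements omitted).
NEWS_XML = """<?xml version="1.0"?>
<banana-news>
    <newsitem type="world">
        <headline>Banana sales reach all time high</headline>
        <byline>William Curvey</byline>
    </newsitem>
    <newsitem type="home">
        <headline>Domestic banana use beggars belief</headline>
        <byline>Charles Split</byline>
    </newsitem>
</banana-news>"""


def summarize(xml_text):
    """Return (type, headline, byline) for each newsitem in the document."""
    root = ET.fromstring(xml_text)  # the parser checks the syntax for us
    return [
        (item.get("type"), item.findtext("headline"), item.findtext("byline"))
        for item in root.findall("newsitem")
    ]


# What to do with the data is up to the script; here we simply print it.
for kind, headline, byline in summarize(NEWS_XML):
    print(f"[{kind}] {headline} (by {byline})")
```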

In our example we have illustrated a structure for news items. The entire document is enclosed by a single element, <banana-news> (lines 2–20). This is called the root element. A document must have a single root element that encloses all other elements in a document, and every subsequent element must completely enclose any children it might have. Any elements that overlap generate an error in any compliant XML parser, as shown here:

<a><b></a></b>
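You can watch a parser reject overlapping elements with a few lines of Python. This sketch assumes the standard library's xml.etree.ElementTree, which raises ParseError when a document is not well formed; the helper name is_well_formed is made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False if it raises ParseError."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<a><b></b></a>"))  # properly nested
print(is_well_formed("<a><b></a></b>"))  # overlapping
```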

An XML document is often represented as a tree of data. Listing 22.1 is drawn out in this way in Figure 22.1. <banana-news> is at the root, branching out to two sibling <newsitem> elements. The <newsitem> elements further divide, leading to the deepest elements.

Figure 22.1. An XML document represented as a tree.

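The same tree can be walked in code. The following Python sketch assumes the standard library's xml.etree.ElementTree and a trimmed copy of the sample document; the function name outline is chosen for illustration. It returns one indented line per element, with deeper elements indented further.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the sample document, with most child elements omitted.
SAMPLE = """<banana-news>
    <newsitem type="world">
        <headline>Banana sales reach all time high</headline>
        <byline>William Curvey</byline>
    </newsitem>
    <newsitem type="home">
        <headline>Domestic banana use beggars belief</headline>
        <byline>Charles Split</byline>
    </newsitem>
</banana-news>"""


def outline(element, depth=0):
    """Return the element names in document order, indented by tree depth."""
    lines = ["    " * depth + element.tag]
    for child in element:  # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(ET.fromstring(SAMPLE))))
```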

So, what is XML for? The short answer is that it is up to you. In practical terms, though, XML documents tend to fulfill a range of purposes, including the following:

To structure data logically for sharing (as in Listing 22.1)
To format data (as in XHTML)
To send instructions to an interpreter (whether local or remote)

In this chapter we will concentrate on the first use. Our banana news format is designed to give us and our partners a structure that makes news items easy to work with.
