The structure and formatting of XML in an XML document must follow the rules of the XML instance syntax. The term instance distinguishes the use of some particular type of XML from its specification. This usage parallels the difference in object-oriented terminology between an object instance and an object type.
XML documents contain an optional prolog followed by a root element that contains the contents of the document.
A document can be identified as an XML document through the use of a processing instruction. Processing instructions (PIs) are special directives to the application that will process the XML document. They have the following syntax:
<?PITarget ...?>
In general, data-oriented XML applications do not use application-specific processing instructions. Instead, they tend to put all information in elements and attributes. However, you should use one standard processing instruction, the XML declaration, in the XML document prolog to declare two very important pieces of information: the version of XML in the document and the character encoding:
<?xml version="1.0" encoding="UTF-8"?>
The version parameter of the xml PI tells the processing application the version of the XML specification to which the document conforms. Almost all documents use "1.0"; a later version, "1.1", exists but is rarely used. The encoding parameter is optional. It identifies the character encoding of the document. The default value is "UTF-8".
UTF-8 is a variable-length character encoding that stores each ASCII character as a single byte identical to its ASCII code and every other character as a multi-byte sequence. Documents that stay mostly within ASCII are therefore easy to move on the Internet using standard communication protocols such as HTTP, SMTP, and FTP. Keep in mind that XML is internationalized by design: its character set is Unicode (ISO/IEC 10646), and documents can use other character encodings such as UTF-16 and ISO-8859-1. However, for simplicity and readability purposes, this book will use UTF-8 encoding for all samples.
If you omit the XML declaration, the XML version is assumed to be 1.0, and the processing application will try to guess the encoding of the document based on clues such as the raw byte order of the data stream. This approach has problems, and whenever interoperability is of high importance—such as for Web services—applications should always provide an explicit XML declaration and use UTF-8 encoding.
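The effect of the declared encoding can be observed with a short Python sketch. It assumes the standard-library xml.etree.ElementTree parser; the city element and the variable names are made up for illustration and are not part of the purchase order example.

```python
import xml.etree.ElementTree as ET

# The same element serialized under two different declared encodings.
body = "<city>Z\u00fcrich</city>"
utf8_doc = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")
latin1_doc = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + body).encode("iso-8859-1")

# With a correct declaration, both byte streams decode to the same text.
utf8_city = ET.fromstring(utf8_doc).text
latin1_city = ET.fromstring(latin1_doc).text

# Without a declaration this parser assumes UTF-8, so Latin-1 bytes are rejected.
undeclared = body.encode("iso-8859-1")
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
```

The last case is the interoperability problem in miniature: the bytes are a perfectly good document in one encoding, but a receiver that has to guess cannot read them.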
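The prolog and the rest of the document can also contain comments. XML comments have the following syntax:
<!-- Sample comment and more ... -->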
Comments can span multiple lines but cannot be nested (comments cannot enclose other comments). Everything inside the comment markers will be ignored by the processing application. Some of the XML samples in this book will use comments to provide you with useful context about the examples in question.
<?xml version="1.0" encoding="UTF-8"?>
<!-- Created by Bob Dister, approved by Mary Jones -->
<po id="43871" submitted="2001-10-05">
  <!-- The rest of the purchase order will be the same as before -->
  ...
</po>
In this case, po is the root element of the XML document.
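As a rough illustration of how a processing application sees the prolog, comments, and root element, here is a minimal Python sketch using the standard-library xml.etree.ElementTree parser. The empty order element is a placeholder assumed for this sketch, standing in for the elided body of the purchase order.

```python
import xml.etree.ElementTree as ET

# A cut-down purchase order; <order/> is a stand-in for the omitted content.
document = b"""<?xml version="1.0" encoding="UTF-8"?>
<!-- Created by Bob Dister, approved by Mary Jones -->
<po id="43871" submitted="2001-10-05">
  <!-- The rest of the purchase order is omitted here -->
  <order/>
</po>"""

root = ET.fromstring(document)
root_tag = root.tag                          # name of the root element
child_tags = [child.tag for child in root]   # comments do not appear here
```

With its default settings, this parser drops both comments, so the application only ever sees the po element and its order child.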
The term element is a technical name for the pairing of a start and end tag in an XML document. In the previous example, the po element has the start tag <po> and the end tag </po>. Every start tag must have a matching end tag and vice versa. Everything between these two tags is the content of the element. This includes any nested elements, text, comments, and so on.
Element names can include all standard programming language identifier characters ([0-9A-Za-z]) as well as underscore (_), hyphen (-), period (.), and colon (:), but they must start with a letter or an underscore. (A leading colon is technically permitted, but colons are best left to XML namespaces.) customer-name is a valid XML element name. However, because XML is case-sensitive, customer-name is not the same element as Customer-Name.
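A quick way to check naming rules is to hand candidate names to a parser. The sketch below assumes Python's standard-library xml.etree.ElementTree; is_well_formed is a helper invented for this illustration, and the 1st-customer name is a made-up counterexample.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

hyphenated_ok = is_well_formed("<customer-name>Ann</customer-name>")
# A name may contain digits but may not begin with one.
digit_first_ok = is_well_formed("<1st-customer>Ann</1st-customer>")
# Case matters: the end tag below does not match the start tag.
case_mismatch_ok = is_well_formed("<customer-name>Ann</Customer-Name>")
```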
According to the XML Specification, elements can have three different content types. They can have element-only content, mixed content, or empty content. Element-only content consists entirely of nested elements. Any whitespace separating elements is not considered significant in this case. Mixed content refers to any combination of nested elements and text. All elements in the purchase order example, with the exception of description, have element content. Most elements in the skateboard user guide example earlier in the chapter had mixed content.
Note that the XML Specification does not define a text-only content model. Outside the letter of the specification, an element that contains only text is often referred to as having data content; but, technically speaking, it has mixed content. This awkwardness comes as a result of XML's roots in SGML and document-oriented applications. However, in most data-oriented applications, you will never see elements whose contents are both nested elements and text. It will typically be one or the other, because limiting the content to be either elements or text makes processing XML much easier.
The syntax for elements with empty content is a start tag immediately followed by an end tag, as in <emptyElement></emptyElement>. Because this is simply too much text, the XML Specification also allows the shorthand form <emptyElement/>. For example, because the last item in our purchase order does not have a nested description element, it has empty content. Therefore, we could have written it as follows:
<item sku="008-PR" quantity="1000"/>
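The two spellings of an empty element should be indistinguishable to an application. A minimal Python sketch, assuming the standard-library xml.etree.ElementTree parser, suggests that this is the case:

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring('<item sku="008-PR" quantity="1000"></item>')
short_form = ET.fromstring('<item sku="008-PR" quantity="1000"/>')

# Both forms carry the same attributes and have neither text nor children.
same_attributes = long_form.attrib == short_form.attrib
both_empty = (long_form.text is None and short_form.text is None
              and len(long_form) == 0 and len(short_form) == 0)
```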
Elements must also be properly nested: an element that starts inside another element must end inside it as well, so tags can never overlap:
<!-- This is correct nesting -->
<P><B><I>Bold, italicized text in a paragraph</I></B></P>
<!-- Bad syntax: overlapping I and B tags -->
<P><I><B>Bold, italicized text in a paragraph</I></B></P>
<!-- Bad syntax: overlapping P and B tags -->
<B><P><I>Bold, italicized text in a paragraph</I></B></P>
The notion of an XML document root implies that there can be only one element at the very top level of a document. For example, the following would not be a well-formed XML document:
<first>I am the first element</first>
<second>I am the second element</second>
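Both the nesting rule and the single-root rule are enforced by the parser itself rather than by the application. The following Python sketch assumes the standard-library xml.etree.ElementTree parser; parse_error is a helper invented for this illustration.

```python
import xml.etree.ElementTree as ET

def parse_error(xml_text):
    """Return the parser's error message, or None if xml_text parses."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        return str(err)

nested = parse_error("<P><B><I>Bold, italicized text</I></B></P>")
overlapping = parse_error("<P><I><B>Bold, italicized text</I></B></P>")
two_roots = parse_error("<first>I am the first element</first>"
                        "<second>I am the second element</second>")
```

Only the first call returns None; the other two return an error message, because neither input is a well-formed document.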
It is easy to think of nested XML elements as a hierarchy. For example, Figure 2.1 shows a hierarchical tree representation of the XML elements in the purchase order example together with the data (text) associated with them.
Unfortunately, it is often difficult to identify XML elements precisely in the hierarchy. To aid this task, the XML community has taken to using genealogy terms such as parent, child, sibling, ancestor, and descendant. Figure 2.2 illustrates the terminology as it applies to the order element of the purchase order:
The start tags for XML elements can have zero or more attributes. An attribute is a name-value pair. The syntax for an attribute is a name (which uses the same character set as an XML element name) followed by an equal sign (=), followed by a quoted value. The XML Specification requires the quoting of values; both single and double quotes can be used, provided they are correctly matched. For example, the po element of our purchase order has two attributes, id and submitted:
<po id="43871" submitted="2001-10-05"> ... </po>
A family of attributes whose names begin with xml: is reserved for use by the XML Specification. Probably the best example is xml:lang, which is used to identify the language of the text that is the content of the element with that attribute. For example, we could have written the description elements in our purchase order example to identify the description text as English:
<description xml:lang="en">Skateboard backpack; five pockets</description>
Note that applications processing XML are not required to recognize, process, and act based on the values of these attributes. The key reason why the XML Specification identified these attributes is that they address common use-cases; standardizing them would aid interoperability between applications.
Without any meta-information about an XML document, attribute values are considered to be pieces of text. In the previous example, the id might look like a number and the submission date might look like a date, but to an XML processor they will both be just strings. This obviously causes some headaches when processing data-oriented XML, and it is one of the primary reasons most data-oriented XML documents have associated meta-information described in XML Schema (introduced later in this chapter).
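The consequence for application code is an explicit conversion step. A minimal Python sketch, assuming the standard-library xml.etree.ElementTree parser and assuming (as the sample value suggests) that submitted holds an ISO 8601 date:

```python
import xml.etree.ElementTree as ET
from datetime import date

po = ET.fromstring('<po id="43871" submitted="2001-10-05"></po>')

raw_id = po.get("id")                 # always a str
raw_submitted = po.get("submitted")   # always a str

# The application, not the parser, decides what the strings mean.
order_id = int(raw_id)
submitted = date.fromisoformat(raw_submitted)
```

If either value were malformed, the conversion would raise ValueError at this point, long after parsing succeeded. That is the kind of check a schema can move into the validation step.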
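At the same time, XML applications are free to attach any semantics they choose to XML markup. A common use-case is leveraging attributes to create a basic linking mechanism within an XML document. The typical scenario involves a document having duplicate information in multiple locations. The goal is to eliminate information duplication. The process has three steps:
1. Put the information in a single location.
2. Label that location with a unique identifier, typically in an attribute such as id.
3. Refer to the identifier from every other location that needs the information, typically through an attribute such as href.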
The purchase order example offers the opportunity to try this out (see Listing 2.3). As shown in the example, in most cases, the bill-to and ship-to addresses will be the same.
<po id="43871" submitted="2001-10-05">
  <billTo>
    <company>The Skateboard Warehouse</company>
    <street>One Warehouse Park</street>
    <street>Building 17</street>
    <city>Boston</city>
    <state>MA</state>
    <postalCode>01775</postalCode>
  </billTo>
  <shipTo>
    <company>The Skateboard Warehouse</company>
    <street>One Warehouse Park</street>
    <street>Building 17</street>
    <city>Boston</city>
    <state>MA</state>
    <postalCode>01775</postalCode>
  </shipTo>
  ...
</po>
There is no reason to duplicate this information. Instead, we can use the markup shown in Listing 2.4.
<po id="43871" submitted="2001-10-05">
  <billTo id="addr-1">
    <company>The Skateboard Warehouse</company>
    <street>One Warehouse Park</street>
    <street>Building 17</street>
    <city>Boston</city>
    <state>MA</state>
    <postalCode>01775</postalCode>
  </billTo>
  <shipTo href="addr-1"/>
  ...
</po>
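We followed the three steps described previously:
1. We kept the address in a single location, the billTo element.
2. We labeled that element with the identifier addr-1 through its id attribute.
3. We referred to the address from the shipTo element through an href attribute whose value is addr-1.
You might have noticed that now both the po and billTo elements have an attribute called id. This is fine, because an attribute always belongs to the element on which it appears, so the two id attributes do not conflict.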
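Because the id and href semantics are an application convention rather than something the parser knows about, the application has to resolve the links itself. The following Python sketch shows one possible way to do so with the standard-library xml.etree.ElementTree parser; by_id and resolve are names invented here, and the address is shortened for brevity.

```python
import xml.etree.ElementTree as ET

po = ET.fromstring("""
<po id="43871" submitted="2001-10-05">
  <billTo id="addr-1">
    <company>The Skateboard Warehouse</company>
    <city>Boston</city>
  </billTo>
  <shipTo href="addr-1"/>
</po>
""")

# Index every element that carries an id attribute.
by_id = {el.get("id"): el for el in po.iter() if el.get("id") is not None}

def resolve(element):
    """Follow an href attribute to its target; otherwise return the element."""
    target = element.get("href")
    return by_id[target] if target is not None else element

ship_to = resolve(po.find("shipTo"))
ship_city = ship_to.findtext("city")
```

Note that the index also contains the po element under the key 43871. That is harmless here, but it shows that this simple scheme relies on id values being unique across the whole document.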
Attribute values as well as the text and whitespace between tags must follow precisely a small but strict set of rules. Most XML developers tend to think of these as mapping to the string data type in their programming language of choice. Unfortunately, things are not that simple.
First, and most important, all character data in an XML document must comply with the document's encoding. Any characters outside the range of characters that can be included in the document must be escaped and identified as character references. The escape sequence used throughout XML uses the ampersand (&) as its start and the semicolon (;) as its end. The syntax for character references is an ampersand, followed by a pound/hash sign (#), followed by either a decimal character code or lowercase x followed by a hexadecimal character code, followed by the semicolon. Therefore, the 8-bit character code 128 can be written in a UTF-8 XML document as &#x80; (or, in decimal, as &#128;).
Unfortunately, for obscure document-oriented reasons, there is no way to include character codes 0 through 8, 11, 12, or 14 through 31 (typically known as non-whitespace control characters in ASCII) in XML documents. Even a correctly escaped character reference will not do. This situation can cause unexpected problems for programmers whose string data types can sometimes end up with these values.
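Both points can be checked against a parser. The sketch below assumes Python's standard-library xml.etree.ElementTree; the t element and the sample text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal references to the same character (U+00E9).
decimal_text = ET.fromstring("<t>caf&#233;</t>").text
hex_text = ET.fromstring("<t>caf&#xE9;</t>").text

# Line feed (10) is a legal character, so a reference to it parses.
newline_text = ET.fromstring("<t>&#10;</t>").text

# The bell character (7) is not legal, even as a character reference.
try:
    ET.fromstring("<t>&#7;</t>")
    bell_ok = True
except ET.ParseError:
    bell_ok = False
```

In practice this means that strings coming from a database or a binary protocol may need to be filtered or encoded (for example, as Base64) before they are placed in XML.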
Another legacy from the document-centric world that XML came from is the rules for whitespace handling. It is not important to completely define these rules here, but a couple of them are worth mentioning:
Luckily, most data-oriented XML applications care little about whitespace.
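For applications that do care, two whitespace behaviors are easy to observe with a parser. This Python sketch assumes the standard-library xml.etree.ElementTree parser; the note element and its title attribute are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# A literal line break inside an attribute value, and a CRLF line ending in text.
note = ET.fromstring('<note title="line one\nline two">  first\r\n  second  </note>')

title = note.get("title")   # the line break becomes a single space
body = note.text            # CRLF becomes LF; other whitespace is kept
```

Here title comes back as "line one line two", while body keeps its leading and trailing spaces and contains a single line feed where the document had a carriage return and line feed.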
In addition to character references, XML documents can define entities as well as references to them (entity references). Entities are typically not important for data-oriented applications and we will not discuss them in detail here. However, all XML processors must recognize several pre-defined entities that map to characters that can be confused with markup delimiters. These characters are less than (<); greater than (>); ampersand (&); apostrophe, a.k.a. single quote ('); and quote, a.k.a. double quote ("). Table 2.1 shows the syntax for escaping these characters.
Character    Escape sequence
<            &lt;
>            &gt;
&            &amp;
'            &apos;
"            &quot;
Escaping quickly becomes unwieldy when the text of an element is itself XML markup, because every delimiter has to be replaced:
<example-to-show>
  &lt;?xml version=&quot;1.0&quot;?&gt;
  &lt;rootElement&gt;
    &lt;childElement id=&quot;1&quot;&gt;
      The man said: &quot;Hello, there!&quot;.
    &lt;/childElement&gt;
  &lt;/rootElement&gt;
</example-to-show>
The result is not only reduced readability but also a significant increase in the size of the document, because single characters are mapped to character escape sequences whose length is at least four characters.
To address this problem, the XML Specification has a special multi-character escape construct. The name of the construct, CDATA section, refers to the section holding character data. The syntax is <![CDATA[, followed by any sequence of characters allowed by the document encoding that does not include ]]>, followed by ]]>. Therefore, you can write the previous example much more simply as follows:
<example-to-show><![CDATA[
  <?xml version="1.0"?>
  <rootElement>
    <childElement id="1">
      The man said: "Hello, there!".
    </childElement>
  </rootElement>
]]></example-to-show>
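To a processing application, escaped text and a CDATA section are two spellings of the same content. The following Python sketch, which assumes the standard-library xml.etree.ElementTree parser and the escape helper from xml.sax.saxutils, builds a shortened version of the example both ways and compares what the parser reports.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The markup we want to carry as plain text (shortened from the example).
payload = '<childElement id="1">The man said: "Hello, there!".</childElement>'

# Option 1: escape the markup delimiters one by one.
escaped_doc = "<example-to-show>" + escape(payload) + "</example-to-show>"

# Option 2: wrap the text in a CDATA section (payload must not contain "]]>").
cdata_doc = "<example-to-show><![CDATA[" + payload + "]]></example-to-show>"

escaped_text = ET.fromstring(escaped_doc).text
cdata_text = ET.fromstring(cdata_doc).text
```

Both escaped_text and cdata_text equal the original payload. By default, escape replaces only &, <, and >, which is all that element content requires.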
A Simpler Purchase Order
Based on the information in this section, we can re-write the purchase order document as shown in Listing 2.4.
<?xml version="1.0" encoding="UTF-8"?>
<!-- Created by Bob Dister, approved by Mary Jones -->
<po id="43871" submitted="2001-10-05">
  <billTo id="addr-1">
    <company>The Skateboard Warehouse</company>
    <street>One Warehouse Park</street>
    <street>Building 17</street>
    <city>Boston</city>
    <state>MA</state>
    <postalCode>01775</postalCode>
  </billTo>
  <shipTo href="addr-1"/>
  <order>
    <item sku="318-BP" quantity="5">
      <description>Skateboard backpack; five pockets</description>
    </item>
    <item sku="947-TI" quantity="12">
      <description>Street-style titanium skateboard.</description>
    </item>
    <item sku="008-PR" quantity="1000"/>
  </order>
</po>