
Defining an XML Document

XML documents can be defined using other documents. The definition document serves as a dictionary for the XML document. The advantage of using such definition documents is that, when used in conjunction with the XML document, you can be sure that the XML document follows certain basic rules as laid out in the dictionary. This reduces the chance of processing errors in your application. Such documents also serve as a handshake between two different systems that communicate with each other using XML documents.

An XML dictionary can be described using two types of definition documents: document type definitions (DTDs) and XML schemas. In this section, we look at both methods of dictionary definition.

Document Type Definition

As we briefly touched on earlier, a DTD is like a dictionary for the XML document. It describes the valid structure of an XML document. DTDs make XML documents more usable because the parser can validate the document before the application receives the data. DTDs offer a very flexible way of describing what elements must be present in an XML document, what the valid values for their attributes are, and so forth. Although this chapter briefly introduces the concept of DTDs, it isn't intended to be a comprehensive discussion of DTDs. We only introduce some concepts that will help you write basic DTDs for your XML. If you're already familiar with writing DTDs for XML documents, you may skip this section and proceed to the next.

Let's consider the sample XML document in Listing 29.3, which describes an email document.

Listing 29.3 An XML Document That Describes an Email
1. <email>
2.   <from name="John Doe" id="johndoe@xyzcompany.com"/>
3.   <to name="Jane Doe" id="janedoe@xyzcompany.com"/>
4.   <to name="SomeOther Doe" id="someotherdoe@xyzcompany.com"/>
5.   <cc name="YetAnother Doe" id="yetanotherdoe@xyzcompany.com"/>
6.   <subject>Hello!</subject>
7.   <options>
8.     <read_receipt/>
9.     <priority type="Normal"/>
10.   </options>
11.   <body>
12.     Hello, how are you doing.
13.   </body>
14. </email>

Elements

All keywords within your DTD begin with the <! symbol. The elements of your XML document are declared by using the ELEMENT keyword. The DTD has one such line for each element of your XML. The ELEMENT keyword is followed by the name of the element, which is then followed by the content model of the element. The content model describes the data that can be contained within this element. The content model is defined by using opening and closing parentheses that enclose the contents. An element can contain one of two types of content: other elements or textual data. If an element contains other elements, the content model lists the names of the contained elements. For instance, the element declaration that describes the <options> element in lines 7 through 10 will be as follows:


<!ELEMENT options (read_receipt, priority)>

When an element is defined like this, the parser ensures that each listed subelement appears once, and only once, within the parent element, in the order given. Now look at the <email> element in the XML document. There can be several <to> elements within it, each describing a single To address. The email can also have one or more <cc> elements that indicate carbon copy recipients, or none at all. To define such rules, you append a recurrence modifier to the element name in question. Table 29.2 lists the modifiers and what each one implies.

Table 29.2. DTD Recurrence Modifiers

Modifier   Example    Implies
?          options?   Can appear once, but may be absent (0..1)
+          to+        Must appear at least once, and can be repeated any number of times (1..n)
*          cc*        Can appear any number of times, and may be absent (0..n)

Thus, the element <email> will be defined as follows to indicate that it must contain at least one <to> element, but optional <cc> elements:


<!ELEMENT email (from, to+, cc*, subject, options?, body)>
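One way to read a content model is as a regular expression over the names of an element's children. The following Python sketch makes that reading concrete for the email content model. It is only an illustration of how the recurrence modifiers behave, not a DTD validator, and the names EMAIL_XML, CONTENT_MODEL, and children_match are hypothetical, introduced here for the example.

```python
import re
import xml.etree.ElementTree as ET

# A condensed copy of the email document from Listing 29.3.
EMAIL_XML = """
<email>
  <from name="John Doe" id="johndoe@xyzcompany.com"/>
  <to name="Jane Doe" id="janedoe@xyzcompany.com"/>
  <to name="SomeOther Doe" id="someotherdoe@xyzcompany.com"/>
  <cc name="YetAnother Doe" id="yetanotherdoe@xyzcompany.com"/>
  <subject>Hello!</subject>
  <options><read_receipt/><priority type="Normal"/></options>
  <body>Hello, how are you doing.</body>
</email>
"""

# (from, to+, cc*, subject, options?, body) rewritten as a regular
# expression over child names, each name followed by one space.
CONTENT_MODEL = re.compile(r"from (to )+(cc )*subject (options )?body ")


def children_match(xml_text):
    """Return True if the root's children satisfy the content model."""
    root = ET.fromstring(xml_text.strip())
    names = "".join(child.tag + " " for child in root)
    return CONTENT_MODEL.fullmatch(names) is not None
```

With these definitions, children_match(EMAIL_XML) returns True, while a document that omits every <to> element, or that lists <to> before <from>, is rejected.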

If an element contains data, the content model has the word #PCDATA within the parentheses. For instance, the element <subject> is defined as follows within the DTD:


<!ELEMENT subject (#PCDATA)>

Empty elements are defined using the EMPTY keyword instead of the content model. For instance, the <read_receipt/> element indicates that the sender has requested a read receipt from the recipients. This can be defined using the following DTD entry:


<!ELEMENT read_receipt EMPTY>

Attribute List

Consider the following line of the XML document from Listing 29.3:


3.   <to name="Jane Doe" id="janedoe@xyzcompany.com"/>

Here the <to> element carries two attributes: name and id. To define the attributes of an element, use the ATTLIST keyword. An ATTLIST declaration names the element and then describes each attribute that belongs to it. For each attribute, there is one segment that contains the attribute name, the attribute type, and a flag that indicates whether the attribute is required. For instance, the following illustrates a DTD entry for the <to> element:


<!ATTLIST to
  name CDATA #REQUIRED
  id CDATA #REQUIRED
>

Here we indicate that name and id are of type CDATA (text) and that both are required. Now consider the priority element in the XML file. Its type attribute has to be restricted to Normal, High, or Urgent. It cannot take any other values. You can do this by listing the permitted values in place of the CDATA keyword, as follows:


<!ATTLIST priority
  type ( Normal | High | Urgent ) #REQUIRED
>
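The effect of these attribute declarations can be sketched in a few lines of Python. The ATTLIST dictionary and the attribute_errors function below are hypothetical names made up for this illustration. The sketch only checks required and enumerated attributes, whereas a real validating parser would also reject attributes that were never declared.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory form of the two ATTLIST declarations above:
# element name -> {attribute name: set of allowed values, or None for CDATA}.
ATTLIST = {
    "to": {"name": None, "id": None},
    "priority": {"type": {"Normal", "High", "Urgent"}},
}


def attribute_errors(element):
    """List the problems found on one element; every attribute is #REQUIRED."""
    errors = []
    for attr, allowed in ATTLIST.get(element.tag, {}).items():
        value = element.get(attr)
        if value is None:
            errors.append(f"{element.tag}: missing required attribute {attr}")
        elif allowed is not None and value not in allowed:
            errors.append(f"{element.tag}: {attr}={value!r} is not allowed")
    return errors
```

For example, attribute_errors(ET.fromstring('<priority type="Normal"/>')) returns an empty list, whereas a type of "Low" or a <to> element without an id each produce one error.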

Putting all the pieces together, we come up with the full email DTD:


<!ELEMENT email (from,to+,cc*,subject,options?,body)>
<!ELEMENT from EMPTY>
<!ATTLIST from name CDATA #REQUIRED
     id CDATA #REQUIRED>
<!ELEMENT to EMPTY>
<!ATTLIST to name CDATA #REQUIRED
     id CDATA #REQUIRED>
<!ELEMENT cc EMPTY>
<!ATTLIST cc name CDATA #REQUIRED
     id CDATA #REQUIRED>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT options (read_receipt, priority)>
<!ELEMENT read_receipt EMPTY>
<!ELEMENT priority EMPTY>
<!ATTLIST priority type ( Normal | High | Urgent ) #REQUIRED>
<!ELEMENT body (#PCDATA)>

XML Schema

You can also describe an XML document using an XML schema instead of a DTD. XML Schema is a specification recommended by the World Wide Web Consortium (W3C; http://www.w3.org). An XML schema is a much more powerful and flexible mechanism for defining XML documents than a DTD. Although a DTD can validate the structure of an XML document, a schema can additionally validate data types and data. It's possible to define data elements and the actual data (such as a list of values or data patterns) that they can take. Whereas a DTD has its own syntax, an XML schema is itself an XML document. A schema uses (and supports the use of) the concept of namespaces extensively, so it is important that you understand namespaces well before you proceed.

Using a schema, you can define

  • The elements of an XML document

  • Their content or lack thereof

  • Their attributes

  • Ordering, hierarchy, and number of elements

  • Data types of the content or attributes, and default values

Because an XML schema is itself an XML document, the schema has its own XML schema definition, which can be accessed by visiting the URL http://www.w3.org/2001/XMLSchema.xsd. This schema document defines several core data types, which you can then extend to create your own custom data types. Those data types can then be referenced within your schema just like basic data types. In this section, we look briefly at creating XML schemas to define your XML documents. This section is not meant to be an exhaustive discussion of the features of XML schemas.

XML schemas are also extendable. In other words, you could write a common schema definition library and then write other definitions that extend from this common library. Let's now look briefly at the components of an XML schema definition document.

The schema Element

Refer to Listing 29.3 for a simple XML document that represents an email. In the previous section, we wrote a DTD to define this XML. In this section, we'll write a schema to define the same XML. Because a schema is also an XML document, it should be both well formed and valid. The root element of an XML schema is always a schema element in the XSD namespace.


<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.mycompany.com/schemas"
xmlns="http://www.mycompany.com/schemas">

Here we bind the prefix xs to the XML Schema namespace, http://www.w3.org/2001/XMLSchema. Any element that belongs to this namespace must be prefixed with xs:, so the root element is written as xs:schema. We can also declare other namespaces that we'll be using in our schema. Along with this, we declare that the target namespace for the schema that we're creating is http://www.mycompany.com/schemas.

As a reminder, the URL specified need not physically exist. It is only a referring URL that identifies the namespace and makes it unique. By including the attribute xmlns="..." in this line, we indicate that any element that has not been qualified by a prefix belongs to the default namespace, which also happens to be the target namespace for the schema.
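To see what a namespace-aware parser makes of such a prefix, the following Python sketch parses a small schema with the standard library's ElementTree, which reports each element name together with its namespace in {namespace}name form. The snippet assumes the prefix is declared as xs; the one-element schema body and the name qualified_names are hypothetical, added here only so that the example is complete.

```python
import xml.etree.ElementTree as ET

# A complete (if tiny) schema that uses the declarations discussed above.
SCHEMA_SNIPPET = """<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.mycompany.com/schemas"
    xmlns="http://www.mycompany.com/schemas">
  <xs:element name="body" type="xs:string"/>
</xs:schema>
"""


def qualified_names(xml_text):
    """Return the namespace-qualified names of the root and its children."""
    root = ET.fromstring(xml_text)
    return [root.tag] + [child.tag for child in root]
```

Calling qualified_names(SCHEMA_SNIPPET) shows that xs:schema and xs:element are both resolved to names in the http://www.w3.org/2001/XMLSchema namespace. The prefix itself is only shorthand, and nothing is fetched from the URL.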

Simple Elements

The body of the schema is built by defining elements. Each element is associated with a data type. As indicated earlier, the schema definition already defines basic data types. You may then define your own data types that are either built as extensions of these basic data types or as a collection of other elements. This scheme promotes reusability of data types within the schema and even across schemas. A simple element is an element that has text content but contains no other elements and has no attributes. For example, in the email XML, the <body> element is a simple element. It is defined using the schema's xs:element tag. The syntax for defining a simple element is


<xs:element name="body" type="xs:string"/>

In the preceding line, we define the body tag of our XML and indicate that it can contain data of type string. This data type is a built-in data type defined by the XML Schema definition. You can use other data types while defining elements. Some of the most commonly used data types are xs:string, xs:decimal, xs:integer, xs:boolean, xs:date and xs:time.

While defining a simple element, you can also define a default value by including the default attribute to the element tag. This attribute indicates the value that is set to the element being declared if no other value is specified. You can alternatively define fixed elements that take a particular value and nothing else.

Attributes

You can define an attribute for an element in your XML by using the attribute tag. For example, the priority tag in our email XML has an attribute called type. This can be defined by using the attribute tag as follows:


<xs:attribute name="type" type="xs:string"/>

This indicates that the type attribute is of type xs:string. Other than xs:string, attributes may also use any of the data types discussed earlier. Attributes may be defined as fixed by using the fixed attribute of the attribute tag. For example:


<xs:attribute name="..." type="xs:string" fixed="value"/>

You may also provide a default value by using the default attribute. Here's an example:


<xs:attribute name="..." type="xs:string" default="defaultValue"/>

Attributes may be defined as required or optional by using the use attribute of this tag. The values you'll normally use are required and optional (the default); the specification also allows a third value, prohibited. For example, in the case of the type attribute, we expect it to be present in all cases; therefore, we add the use attribute as follows:


<xs:attribute name="type" type="xs:string" use="required"/>

In the earlier email example, we can see that we have at least one user-defined data type that can be reused; namely, the emailAddress data type. The other elements are not reused within the schema (or, we assume, in other schemas), and hence need not be defined as explicit data types.

Restrictions

By defining an attribute of a particular type, the schema ensures that no XML document passes validation with data that does not correspond to the defined type. For example, if you have an attribute defined as type xs:integer and pass in a string value, the XML document will fail validation. This is known as a restriction. You can define several restrictions on the data by using the xs:restriction tag. We look at a few examples in the remainder of this section.

Consider the type attribute of the priority element in the email document. Assume that we would like to place a restriction on this attribute to ensure that the type is always one of Low, Normal, or High. To do this, we create a simple data type that is derived from the basic string data type but has an additional restriction on it. Thus, we modify the definition of the type attribute as follows:


<xs:attribute name="type" use="required">
  <xs:simpleType>
   <xs:restriction base="xs:string">
    <xs:enumeration value="Low"/>
    <xs:enumeration value="Normal"/>
    <xs:enumeration value="High"/>
   </xs:restriction>
  </xs:simpleType>
</xs:attribute>

Here we define a new type and use it within the type attribute. However, this new type cannot be reused by other elements or attributes. If we want to use this type in other elements, we'll have to define it outside the attribute tag and give it a name. We can then specify the type of the type attribute to the name indicated. The following listing describes how this is done:


<xs:simpleType name="priorityDataType">
  <xs:restriction base="xs:string">
   <xs:enumeration value="Low"/>
   <xs:enumeration value="Normal"/>
   <xs:enumeration value="High"/>
  </xs:restriction>
</xs:simpleType>
<xs:attribute name="type" type="priorityDataType" use="required"/>

You may also place restrictions based on patterns by using the xs:pattern tag within the restriction tag. This tag takes a regular expression, written in XML Schema's own regular-expression syntax, that the entire value must match. For example, to specify that an element can take only lowercase letters and should have at least one character, you can define a restriction as


<xs:restriction base="xs:string">
  <xs:pattern value="[a-z]+"/>
</xs:restriction>

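You can also fix the exact number of characters that an element (or attribute) value must have by using the xs:length tag; the related xs:minLength and xs:maxLength tags set lower and upper bounds instead. Here's an example that requires exactly two characters: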


<xs:restriction base="xs:string">
  <xs:length value="2" />
</xs:restriction>
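
The behavior of these two restrictions can be approximated in Python, as a rough sketch rather than a schema validator. Python's re module uses a different regular-expression dialect from XML Schema, but for a simple character class such as [a-z]+ the two agree. The function names matches_pattern and has_length are hypothetical.

```python
import re


def matches_pattern(value, pattern):
    """Approximate xs:pattern: the expression must match the whole value."""
    # XML Schema patterns are implicitly anchored at both ends, so
    # fullmatch (not search or match) is the closest Python analogue.
    return re.fullmatch(pattern, value) is not None


def has_length(value, length):
    """Approximate xs:length: the value must have exactly this many characters."""
    return len(value) == length
```

With these helpers, matches_pattern("abc", "[a-z]+") is true, while "abC" and the empty string are both rejected; has_length("ab", 2) is true, and has_length("abc", 2) is false.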

Complex Elements

Complex elements have other embedded elements and/or attributes. In the email document, it's easy to see that there are several complex elements. A complex element is defined using a syntax that's similar to that of a simple element, but is defined with the type set as a complex data type. As you saw earlier, this can be done either explicitly by using the type attribute (and thus creating reusable complex data types), or by embedding the definition of the complex data type within the element tag.

Let's define an emailAddressType data type that represents the following elements:


<from name="John Doe" id="johndoe@xyzcompany.com"/>
<to name="Jane Doe" id="janedoe@xyzcompany.com"/>
<cc name="YetAnother Doe" id="yetanotherdoe@xyzcompany.com"/>

<xs:complexType name="emailAddressType">
 <xs:attribute name="name" type="xs:string" use="required"/>
 <xs:attribute name="id" type="xs:string" use="required"/>
</xs:complexType>
<xs:element name="from" type="emailAddressType" />
<xs:element name="to" type="emailAddressType" />
<xs:element name="cc" type="emailAddressType" />

As you can see, we defined one complex data type and reused it across three different elements. Now let's define the complex type for the options tag of our XML. This has two child elements: a read_receipt, which is an empty element, and a priority tag:


<options>
 <read_receipt/>
 <priority type="Normal"/>
</options>

<xs:simpleType name="priorityDataType">
  <xs:restriction base="xs:string">
   <xs:enumeration value="Low"/>
   <xs:enumeration value="Normal"/>
   <xs:enumeration value="High"/>
  </xs:restriction>
</xs:simpleType>
<xs:element name="options" >
  <xs:complexType>
   <xs:sequence>
    <xs:element name="read_receipt">
      <xs:complexType/>
    </xs:element>
    <xs:element name="priority">
      <xs:complexType>
       <xs:attribute name="type"
          type="priorityDataType" use="required"/>
      </xs:complexType>
    </xs:element>
   </xs:sequence>
  </xs:complexType>
</xs:element>

As you can see in this example, we've created nested complex types to define the options tag of our XML. The options tag is defined to have a sequence, which contains the read_receipt tag followed by the priority tag. The read_receipt tag is defined as empty, whereas the priority tag is defined to be a complex type that has one attribute. If the options tag were to have an attribute, it would be defined after the xs:sequence block.

You can also define complex data types that extend from other complex data types. To do this, you use the xs:extension tag. After extending from a base data type, you can add more elements and attributes to the new data type. For example, assuming that the content of the options tag had been declared as a named complex type called options, the following code would create a new subtype of that type, which would add the expirationDate attribute:


<xs:complexType name="expiratingOptions">
 <xs:complexContent>
  <xs:extension base="options">
   <xs:attribute name="expirationDate" type="xs:date"/>
  </xs:extension>
 </xs:complexContent>
</xs:complexType>

Indicators

Indicators control how elements are used within an XML document. The xs:all indicator represents that the contained elements may occur in any order, but only once. For example, the following complex type indicates that the person's name should have a first name and last name, but not necessarily in any order:


<xs:complexType name="personsName">
  <xs:all>
   <xs:element name="firstname" type="xs:string"/>
   <xs:element name="lastname" type="xs:string"/>
  </xs:all>
</xs:complexType>

The choice indicator, on the other hand, indicates that either one child element, or the other, can occur. For example, the following block defines the payment information to be from either a checking account or a credit card:


<xs:element name="paymentDetails">
 <xs:complexType>
  <xs:choice>
   <xs:element name="checkingAccountNumber" type="xs:string"/>
   <xs:element name="creditCardNumber" type="xs:string"/>
  </xs:choice>
 </xs:complexType>
</xs:element>
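
The difference between the all and choice indicators can be stated compactly in code. The following Python sketch works on plain lists of child element names; the function names satisfies_all and satisfies_choice are hypothetical, and the sketch assumes the default occurrence rules, in which each declared element appears exactly once.

```python
def satisfies_all(child_names, required):
    """xs:all: every listed element appears exactly once, in any order."""
    return sorted(child_names) == sorted(required)


def satisfies_choice(child_names, alternatives):
    """xs:choice: exactly one of the listed alternatives appears."""
    return len(child_names) == 1 and child_names[0] in alternatives
```

Under these rules, a personsName with lastname followed by firstname is acceptable, whereas a paymentDetails element that carries both a checkingAccountNumber and a creditCardNumber is not.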

The sequence indicator, which should be familiar to you, indicates that the child elements appear in the given sequence. The minOccurs and maxOccurs indicators indicate the minimum and maximum number of times a given element can occur. For example, if your document has to provide for a minimum of one and a maximum of three addresses per person, your schema would look like this:


<xs:element name="address" type="addressType" minOccurs="1" maxOccurs="3"/>

However, in the case of the email, there is no way to tell in advance how many <to> addresses an email will have. In such cases, the maxOccurs indicator should be set to the special value unbounded to indicate that there could be any number of the particular element in your document.
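
As a rough illustration of how these occurrence indicators are interpreted, the following Python sketch reads minOccurs and maxOccurs from a few element declarations and checks a count against them. Both indicators default to 1 when omitted. The DECLARATIONS fragment and the function names are hypothetical, and a real schema processor does considerably more than this.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Three declarations in the style used for the email schema.
DECLARATIONS = """
<xs:sequence xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="from" type="emailAddressType"/>
  <xs:element name="to" type="emailAddressType" maxOccurs="unbounded"/>
  <xs:element name="cc" type="emailAddressType"
      minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
"""


def occurrence_bounds(sequence_xml):
    """Map each declared element name to (min, max); max is None if unbounded."""
    bounds = {}
    for decl in ET.fromstring(sequence_xml.strip()).iter(XS + "element"):
        low = int(decl.get("minOccurs", "1"))  # omitted means 1
        high = decl.get("maxOccurs", "1")      # omitted means 1
        bounds[decl.get("name")] = (low, None if high == "unbounded" else int(high))
    return bounds


def count_allowed(count, bound):
    """Return True if an element may occur `count` times under `bound`."""
    low, high = bound
    return count >= low and (high is None or count <= high)
```

Here occurrence_bounds(DECLARATIONS) yields (1, 1) for from, (1, None) for to, and (0, None) for cc, so five <to> elements are acceptable but two <from> elements are not.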

Using all of these elements, we can now put together the XML schema for the email XML document. The schema is given in the examples in the file email.xsd. Listing 29.4 shows this file.

Listing 29.4 XSD Document for the Email XML
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  elementFormDefault="qualified"
  attributeFormDefault="unqualified">
<xs:element name="email">
 <xs:complexType>
  <xs:sequence>
   <xs:element name="from" type="emailAddressType"
       minOccurs="1" maxOccurs="1"/>
   <xs:element name="to" type="emailAddressType"
       minOccurs="1" maxOccurs="unbounded"/>
   <xs:element name="cc" type="emailAddressType"
       minOccurs="0" maxOccurs="unbounded"/>
   <xs:element name="subject" type="xs:string"
       minOccurs="1" maxOccurs="1"/>
   <xs:element name="options">
     <xs:complexType>
       <xs:sequence>
        <xs:element name="read_receipt">
          <xs:complexType/>
        </xs:element>
        <xs:element name="priority">
          <xs:complexType>
            <xs:attribute name="type"
                 type="priorityDataType"
                 use="required"/>
          </xs:complexType>
        </xs:element>
       </xs:sequence>
     </xs:complexType>
    </xs:element>
    <xs:element name="body" type="xs:string"/>
  </xs:sequence>
 </xs:complexType>
</xs:element>
<xs:complexType name="emailAddressType">
  <xs:attribute name="name" type="xs:string" use="required"/>
  <xs:attribute name="id" type="xs:string" use="required"/>
</xs:complexType>
<xs:simpleType name="priorityDataType">
  <xs:restriction base="xs:string">
   <xs:enumeration value="Low"/>
   <xs:enumeration value="Normal"/>
   <xs:enumeration value="High"/>
  </xs:restriction>
</xs:simpleType>
</xs:schema>
