
Document Type Definitions

Document Type Definitions (DTDs) are an optional feature of XML documents. A document associated with a DTD has a set of rules regarding what elements and attributes can be part of the document and where they can appear. DTDs originate from SGML, although XML's DTDs are greatly simplified. The presence of DTDs in XML documents allows us to distinguish the concepts of well-formedness and validity.

Well-Formedness and Validity

If a document conforms to the rules of XML syntax (as described in the section "XML Instances"), it is considered well-formed. Well-formedness implies that XML processing software can read the document without any basic errors associated with parsing, such as invalid character data, mismatched start and end tags, multiple attributes with the same name, and so on. The XML Specification mandates that if any well-formedness constraint is not met, the XML parser must immediately generate a non-recoverable error. This rigid mandate makes it easy to separate the work of software focused on the logical structure of an XML document (what the markup means) from the mundane details of the physical structure of the document (the markup syntax).
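As a rough illustration of this behavior, the sketch below uses Python's standard xml.etree.ElementTree parser, which rejects a document at the first well-formedness error. The helper name is_well_formed and the sample documents are assumptions made for this example; they are not part of the SkatesTown application.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if text parses as XML, False on any well-formedness error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The parser stops at the first violation; it does not try to recover.
        return False

# Hypothetical sample documents, one per kind of error named in the text.
samples = {
    "well-formed": "<po id='1'><billTo/></po>",
    "mismatched tags": "<po><billTo></po>",
    "duplicate attribute": "<po id='1' id='2'/>",
}

for label, text in samples.items():
    print(label, is_well_formed(text))
```

Note that the check says nothing about whether the document is a purchase order; it only confirms that the markup syntax is sound.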

However, well-formedness is not sufficient for most applications. Consider, for example, the SkatesTown order processing application. When an XML document is submitted to it, what matters is not merely that the document is well-formed XML but that it is a purchase order in the specific XML format the application requires. The notion of format applies to the set of rules describing SkatesTown's purchase orders: "The document must begin with a po element that has two attributes (id and submitted), followed by a billTo element…" and so on. In other words, before a submitted document is processed, it must be identified as a valid purchase order.

This is where the notion of validity comes in. DTDs offer an automated, declarative mechanism for validating the contents of XML documents as they are parsed. Therefore, XML applications can limit the amount of validation they need to perform. If the SkatesTown purchase order processing application could not delegate validation to the XML processor, it would have to express all validation rules directly in code. Code is procedural in nature and much harder to maintain than DTDs, which are declarative and have a reasonably readable syntax.
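To give a sense of what expressing validation rules directly in code looks like, here is a minimal sketch that checks only the three rules quoted earlier (a po root element, its id and submitted attributes, and a leading billTo child). The function name validate_po and the error messages are hypothetical; a real application would need many more checks of this kind.

```python
import xml.etree.ElementTree as ET

def validate_po(text: str) -> list[str]:
    """Return a list of rule violations; an empty list means the checks passed."""
    root = ET.fromstring(text)  # raises ParseError if the text is not well-formed
    errors = []
    if root.tag != "po":
        errors.append("root element must be po")
    for name in ("id", "submitted"):
        if name not in root.attrib:
            errors.append(f"po is missing attribute {name}")
    children = list(root)
    if not children or children[0].tag != "billTo":
        errors.append("first child of po must be billTo")
    return errors

# Hypothetical input that breaks two of the rules.
print(validate_po('<po id="1"><shipTo/></po>'))
```

Every new rule adds another branch to a function like this, which is the maintenance burden that a declarative description of the document structure is meant to avoid.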

To handle validity checks, DTDs must enable the following:

  • Identification of the elements that can be in a document

  • Identification of the order and relation between elements

  • Identification of the attributes of every element and whether they are optional or required

Last but not least, there needs to be a mechanism to associate DTDs with XML documents.

Document Structure

DTDs are a mechanism to express the valid structure of a document. One way to visualize the structure of a document is as a tree of possible element and attribute combinations. For example, Figure 2.3 shows the document structure for purchase orders as expressed by a popular XML processing tool. The image uses some syntax from regular expressions to visualize the multiplicity of elements: question mark (?) stands for optional (zero or one), asterisk (*) stands for any (zero or more), and plus (+) stands for at least some (one or more).

Figure 2.3. Document structure defined by the purchase order DTD.


Every element in the document structure tree has an associated model group. Model groups identify the sequencing and multiplicity of element content. There are two types of model groups: sequence and choice. Sequence defines the exact order in which child elements must appear. In DTDs, the sequence operator in model groups is the comma (,). The model group (A, B, C) defines a content model where the first child element will be A, followed by B, followed by C. Choice defines the possible elements that can appear at any given position in the content model. The choice operator in model groups is the pipe character (|). The model group (A | B | C) defines a content model where there will be only one child element, which can be A or B or C. Sequences and choices can be nested, as in ((A | (X, Y, Z)), B, (C | D)). This content model defines the following possible combinations of child elements:

  • A, B, C

  • A, B, D

  • X, Y, Z, B, C

  • X, Y, Z, B, D
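The four combinations above can be derived mechanically. The sketch below encodes the nested model group as Python tuples and expands it; the tuple encoding ("seq" and "choice" tags) and the function name expand are assumptions made for this illustration, not DTD syntax.

```python
from itertools import product

def expand(model):
    """Return every child-element sequence allowed by a model group
    built only from sequence and choice (no ?, *, or + suffixes)."""
    if isinstance(model, str):
        return [[model]]  # a single element name
    kind, *parts = model
    if kind == "choice":
        # Any one alternative may appear at this position.
        return [seq for part in parts for seq in expand(part)]
    if kind == "seq":
        # Pick one expansion of each part and concatenate them in order.
        return [sum(combo, []) for combo in product(*(expand(p) for p in parts))]
    raise ValueError(f"unknown model group kind: {kind}")

# ((A | (X, Y, Z)), B, (C | D))
model = ("seq",
         ("choice", "A", ("seq", "X", "Y", "Z")),
         "B",
         ("choice", "C", "D"))

for children in expand(model):
    print(", ".join(children))
```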

The multiplicity of elements is defined using the same regular expression syntax used in document structure trees. The absence of a suffix stands for exactly one, question mark (?) stands for optional (zero or one), asterisk (*) stands for any (zero or more), and plus (+) stands for at least some (one or more). For example, the model group (A, B?, C*, D+) allows for the following combinations of child elements (… stands for "potentially many more of the same element"):

  • A, D…

  • A, B, D…

  • A, B, C…, D…

  • A, C…, D…
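Because the multiplicity suffixes are borrowed from regular expressions, a model group such as (A, B?, C*, D+) can be checked with an ordinary regular expression over the sequence of child element names. The following is a sketch under that assumption; the helper name matches and the choice of a trailing space as the name separator are illustrative only.

```python
import re

# (A, B?, C*, D+), with every child name followed by a single space
PATTERN = re.compile(r"A (B )?(C )*(D )+")

def matches(children):
    """Return True if the list of child element names fits (A, B?, C*, D+)."""
    text = "".join(name + " " for name in children)
    return PATTERN.fullmatch(text) is not None

print(matches(["A", "D"]))                      # A, D…
print(matches(["A", "B", "C", "C", "D", "D"]))  # A, B, C…, D…
print(matches(["A", "B"]))                      # D is required at least once
```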

Are DTDs Enough?

Documents associated with DTDs are a huge step forward from basic XML markup. DTDs allow for validating document structure (element content, allowed attributes, and their value types), which significantly reduces the amount of custom validation code that needs to be written in XML applications. However, DTDs have some notable deficiencies:

  • Although they express structured information, they do not use XML markup. DTD syntax is not as easy to process and manipulate as XML.

  • DTDs were designed before namespaces came into existence and don't have good facilities for dealing with them. This is a problem for data-oriented applications that rely heavily on namespaces.

  • DTDs do not offer sufficient reusability and extensibility capabilities. No mechanism exists for associating more than one DTD with an XML document. It is easy to reach the limit of what DTDs allow for even basic applications.

  • DTD model groups are sometimes too restrictive, in particular with respect to the order of child elements. No convenient DTD mechanism exists for declaring, for example, that the content of some element could include two child elements A and five child elements B, regardless of the order in which they appear.

  • DTDs have no notion of data types. This hurts data-oriented applications where XML is eventually bound to some application-level data structure in a programming language. For example, DTDs offer no mechanism to enforce the simple rule that the values of the quantity attribute of the item element should be positive integers.

  • For these reasons and others, one of the main Web service protocols—Simple Object Access Protocol (SOAP), which we'll discuss in Chapter 3—explicitly forbids the use of DTDs for defining document structure.
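To make the data-type limitation in the list above concrete, this is the kind of check an application must write by hand when the quantity attribute of the item element has to be a positive integer. The function name bad_quantities and the sample document are hypothetical, and a missing attribute is reported here as an empty string.

```python
import xml.etree.ElementTree as ET

def bad_quantities(text: str) -> list[str]:
    """Return the quantity values on item elements that are not positive integers."""
    root = ET.fromstring(text)
    bad = []
    for item in root.iter("item"):
        value = item.get("quantity", "")
        # Accept only strings of ASCII digits whose integer value is above zero.
        if not (value.isascii() and value.isdigit() and int(value) > 0):
            bad.append(value)
    return bad

# Hypothetical order with one acceptable and three unacceptable quantities.
doc = ('<po><order><item quantity="1"/><item quantity="0"/>'
       '<item quantity="-3"/><item quantity="two"/></order></po>')
print(bad_quantities(doc))
```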

For these reasons, this chapter will not discuss DTDs in any further detail. We won't even introduce the basic DTD syntax here because data-oriented XML applications have moved away from DTDs; these applications use another mechanism to validate XML documents and to enforce document structure and datatype rules. To address the problems inherent in DTDs, the XML community developed XML Schema, a much richer meta-language for XML documents expressed natively in XML.
