
Origins of XML

The World Wide Web Consortium (W3C) began work on Extensible Markup Language (XML) in the middle of 1996. XML 1.0, released on February 10, 1998, resulted from the computer industry's need for a simple yet extensible mechanism for the textual representation of structured and semi-structured information. The design inspiration for XML came from two main sources: Standard Generalized Markup Language (SGML) and HTML.

The concept of generalized markup (GM) has been around for decades. It involves using tags to identify pieces of information. Simply put, tags are names surrounded by pointy brackets (< and >). For example, <title> is a tag. The innovative thing about GM is that it requires information to be surrounded by both start and end tags. End tags look like start tags with the addition of a forward slash (/) before the tag name, as in </title>. The notion of start and end tags allows for nesting, which, in turn, lets you structure information in a hierarchical manner.

Consider the following example, which uses markup to indicate that a book has a title and several authors:

    <title>Building Web Services with Java</title>
    <author>Steve Graham</author>
    <author>Simeon Simeonov</author>
    <author>Toufic Boubez</author>
    <author>Doug Davis</author>
    <author>Glen Daniels</author>
    <author>Yuichi Nakamura</author>
    <author>Ryo Neyama</author>
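Because every piece of information sits between a start tag and an end tag, software can pull it out with very little code. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. It is only an illustration, not part of the original example: the enclosing <book> element and the describe_book function are hypothetical names, added here because an XML parser expects a single outermost element.

```python
import xml.etree.ElementTree as ET

# The book markup from the text, wrapped in a hypothetical <book> element.
# The wrapper is an assumption: a parser needs exactly one root element.
BOOK_XML = """
<book>
    <title>Building Web Services with Java</title>
    <author>Steve Graham</author>
    <author>Simeon Simeonov</author>
    <author>Toufic Boubez</author>
    <author>Doug Davis</author>
    <author>Glen Daniels</author>
    <author>Yuichi Nakamura</author>
    <author>Ryo Neyama</author>
</book>
"""


def describe_book(xml_text):
    """Return (title, list of author names) found in the markup."""
    root = ET.fromstring(xml_text.strip())
    # findtext returns the text between <title> and </title>.
    title = root.findtext("title")
    # findall returns every <author> element nested directly in the root.
    authors = [author.text for author in root.findall("author")]
    return title, authors


title, authors = describe_book(BOOK_XML)
print(title)
for name in authors:
    print(" -", name)
```

The nesting of <title> and <author> inside the outer element is what gives the parser a hierarchy to walk. Adding another author line to the markup would require no change to the code.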

Using markup to represent information about books has many benefits. The information is readily readable by humans. It is also quite easy to process with software because start and end tags clearly delineate where each piece of information starts and where it ends. Further, this way of representing information is inherently extensible. For example, you can easily imagine how to add more authors or other information (such as the book's ISBN) to the book description. Markup is appealing because of its simplicity combined with the potential for extensibility.

Not all markup is simple, though. In fact, our industry's first attempt to formally define generalized markup yielded a very complex specification. SGML was ratified by ISO in 1986. It defined everything you could ever want to know about markup and more. SGML-enabled software was expensive; typically, only large companies could afford it. The software also tended to be full of defects. Over time, a growing community of SGML experts began to voice the opinion that, perhaps, the core ideas of SGML could be organized in a much simpler fashion. All that was needed was a catalyst to force the change and an organization that could lead the standardization effort. The catalyst was the combination of HTML and the Web. The organization was the W3C.

By its nature, SGML is a meta-language. It does not prescribe any particular markup; instead, it defines how any given markup language can be formally specified. For better or worse, the term for these markup languages is SGML applications. Because the term is confusing (a markup language specification is not a piece of software), it is rarely used nowadays, but you still might encounter it in some of the reference materials pointed out at the end of the chapter.

The most popular SGML application is HTML, the markup language that rules the Web. HTML combines markup from several different categories to provide a rich hypertext experience:

  • Text structuring tags: <H1>, <H2>, <P>, <BR>

  • Formatting tags: <B>, <I>

  • Linking and embedding tags: <IMG>, <A>

  • Data input tags: <FORM>, <INPUT>, <SELECT>

The HTML specification is owned by the W3C. Unfortunately, due to the rapid growth of the Internet and the market pressure caused by the browser wars, the leading browser vendors introduced a number of incompatible tags to HTML completely outside the scope of the HTML specification. These tags created problems for Internet software vendors and HTML document authors: they had to be careful what markup they used, based on the type of browser that would display the HTML document. Yet at the same time, they themselves were not able to extend HTML with markup that could have been useful to them.

The need to simplify SGML coincided with the need to control the evolution of HTML and create a simple generalized markup language for use on the Web. SGML was too heavy for this purpose: it simply took too much effort to support and process. XML became that lightweight language. After about one and a half years of work, the XML working group at the W3C produced a final specification. XML is similar to SGML in that it preserves the notion of GM. However, the specification is much simpler. There are very few optional features, and most SGML features that were deemed difficult to implement were abandoned.

XML is here to stay. The XML industry is experiencing a boom. XML has become the de facto standard for representing structured and semi-structured information in textual form. Many specifications are built on top of XML to extend its capabilities and enable its use in a broader range of scenarios. One of the most exciting areas of use for XML is Web services. The rest of this chapter will introduce the set of XML technologies and standards that are the foundation of Web services:

  • XML instances— The rules for creating syntactically correct XML documents

  • XML Schema— A recent standard that enables detailed validation of XML documents as well as the specification of XML datatypes

  • XML Namespaces— Definitions of the mechanisms for combining XML from multiple sources in a single document

  • XML processing— The core architecture and mechanisms for creating, parsing, and manipulating XML documents from programming languages
