
7.6. Exchanging XML

When exchanging data between two heterogeneous components of a system, we need to define two elements. First, we need a medium and protocol over which the components can communicate. Our medium is likely to be a local or wide area network, unless we're on the same machine (in which case we can use local sockets or pipes) or have some dedicated connection (such as serial or InfiniBand). Of the seven layers of the OSI network protocol model, we usually decide on the top few. At layer 1 we're using 1000BASE-T or 1000BASE-SX (or something slower), at layer 2 we're using Ethernet (or possibly ATM), at layers 3 and 4 we're using IP and TCP, and at the top of the stack we're probably using either HTTP or something of our own devising.

Once we have our full protocol stack, all that's left is to decide what we actually want to send over it. The second element we need for exchanging data, after the protocol, is the data format.

XML lends itself well as a data interchange format. Its design in 1996 was based on a subset of SGML, with a few syntactical additions, as a method of exchanging data in a format that was open, both human and computer readable, self-documenting, and could express complex structures and relationships.

7.6.1. Parsing XML

Like HTTP, XML isn't as easy to deal with as it first appears. Parsing XML by hand is almost always a bad idea, unless you are certain of the exact subset of XML that the data source will be using. Consider the following snippets of data:

<hello> &lt;world&gt; </hello>

This is pretty straightforward; there are just some simple entities we'll need to decode:

<hello> &lt;w&#111;rld&gt; </hello>
<hello> &lt;w&#x6f;rld&gt; </hello>

These two get a little more complicated. They're still entities, but now numeric character references instead of the four named ones we've come to expect (&lt; &gt; &amp; &quot;). We're still within the realms of things that are straightforward for us to parse with a simple state machine parser:

<hello><![CDATA[ <world> ]]></hello>

CDATA segments start to make things a little more interesting. We need to treat everything within the section as character data and ignore any special characters we find:

<hello xmlns:foo="urn:some/other/namespace"><foo:hello> bar</foo:hello></hello>

Namespaces can start to cloud the issue. We need to know the local and full name of each node to understand the document fully. XML namespace aliases can be reused within the same document and declared at any time, so when we see foo:hello we need to know what namespace foo is representing at that point in the document:

<?xml version="1.0" encoding="UTF-8" ?>
<hello> &lt;world&gt; </hello>

The XML documents we receive may or may not have a prolog section or other processing instructions. We need to know which we can skip over and which are important. The prolog's encoding attribute is pretty important.
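To see why the encoding declaration matters, here is a minimal sketch using the xml.etree.ElementTree module from Python's standard library. Python and the sample text are our own choices for illustration. The same document is serialized in two encodings, and the parser relies on the prolog to decode each byte stream:

```python
import xml.etree.ElementTree as ET

TEXT = "caf\u00e9"  # sample text containing one non-ASCII character


def document_bytes(encoding):
    """Serialize the same document in the given encoding, with a matching prolog."""
    doc = f'<?xml version="1.0" encoding="{encoding}"?><hello>{TEXT}</hello>'
    return doc.encode(encoding)


latin1_doc = document_bytes("ISO-8859-1")
utf8_doc = document_bytes("UTF-8")

# The raw bytes differ, but the parser reads the declared encoding
# and recovers the same text from both documents.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text
print(latin1_doc != utf8_doc, latin1_text == utf8_text == TEXT)
```

A hand-rolled parser that skipped the prolog would have to guess how to turn those bytes into characters.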

<!DOCTYPE hello [
  <!ELEMENT hello (#PCDATA)>
]>
<hello>world</hello>

A document type definition (DTD) may be included, describing the format of the document. Hopefully, we already know the format, given that we've fetched it for use, but we can use the DTD to check for document validity.

<!DOCTYPE hello [
  <!ENTITY world "world">
]>
<hello>&world;</hello>

But some parts of the DTD really do matter to us. If there are internal or external entities defined, then we can't expand the document without parsing and substituting them.
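As a rough illustration of how much of this an existing parser already handles, the following sketch feeds the entity, character reference, CDATA, and internal entity variants described above to Python's standard xml.etree.ElementTree module. The choice of Python is ours, and this is one possible way to check the behavior rather than a recommended setup:

```python
import xml.etree.ElementTree as ET

variants = [
    "<hello> &lt;world&gt; </hello>",        # predefined entities
    "<hello> &lt;w&#111;rld&gt; </hello>",   # decimal character reference
    "<hello> &lt;w&#x6f;rld&gt; </hello>",   # hexadecimal character reference
    "<hello><![CDATA[ <world> ]]></hello>",  # CDATA section
]

# Each variant decodes to the same character data.
texts = [ET.fromstring(doc).text for doc in variants]

# An internal entity declared in the DTD is substituted during parsing.
with_internal_entity = (
    "<!DOCTYPE hello [\n"
    '  <!ENTITY world "world">\n'
    "]>\n"
    "<hello>&world;</hello>"
)
expanded = ET.fromstring(with_internal_entity).text

print(texts)
print(expanded)
```

Entity expansion is convenient for documents we trust, but it deserves some care when the data comes from an unknown source, since entity definitions can be abused to consume memory.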

XML is a very rich data representation with a lot of flexibility. Unfortunately, this also means that it tends to have a lot of features we don't need, adding lots of cruft on top of the subset we actually want to use.

If we understand all of these various elements of an XML document, and the others not mentioned here, that still doesn't guarantee that we'll be able to understand the data passed to us. A document can be badly formed or invalid in many different ways, all of which might break our parsing. Element tags need to be balanced correctly, nested rather than overlapped. A literal < character can't appear in PCDATA sections unless it's the start of a tag.
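A parser will reject badly formed documents for us. The sketch below, again assuming Python's standard xml.etree.ElementTree module, wraps that check in a small helper; is_well_formed is a name we've made up for illustration. It checks only that a document is well formed, not that it is valid against a DTD:

```python
import xml.etree.ElementTree as ET


def is_well_formed(document):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<a><b></b></a>"))               # properly nested tags
print(is_well_formed("<a><b></a></b>"))               # overlapping tags
print(is_well_formed("<hello>1 < 2</hello>"))         # stray < in character data
print(is_well_formed("<hello>1 &lt; 2</hello>"))      # the same text, escaped
```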

I'm sure you can see where this is headed. XML is difficult to parse. Lots of people already use XML, so there's a lot of software out there for parsing it. We'll invoke our laziness principle, and build on the work of others once again. There are plenty of good XML parsers and language bindings already built, so we'll leverage those. It would be a dangerous waste of resources to try and build our own from scratch.

XML parsers come in two main flavors, SAX (Simple API for XML) and DOM (Document Object Model). SAX parsers are serial, reading the document from start to end, generating events as they go. SAX parsers are good for long documents, as they don't need to hold the whole parse tree in memory at once. DOM parsers first construct a model of the parse tree in memory (or on disk) and then allow random access to the elements within. DOM parsers are good for small documents that can easily fit in memory where you need to access random parts of the tree.
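The difference between the two styles is easiest to see side by side. This sketch uses the xml.sax and xml.dom.minidom modules from Python's standard library; the photo document and the PhotoIdCollector class are invented for the example:

```python
import xml.sax
from xml.dom import minidom

DOC = "<photos><photo id='1'>Sunset</photo><photo id='2'>Harbor</photo></photos>"


class PhotoIdCollector(xml.sax.ContentHandler):
    """SAX style: react to events as the parser streams through the document."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def startElement(self, name, attrs):
        if name == "photo":
            self.ids.append(attrs["id"])


handler = PhotoIdCollector()
xml.sax.parseString(DOC.encode("utf-8"), handler)

# DOM style: build the whole tree first, then jump to any node we like.
dom = minidom.parseString(DOC)
second = dom.getElementsByTagName("photo")[1]
title = second.firstChild.data

print(handler.ids)
print(second.getAttribute("id"), title)
```

The SAX handler never holds more than the current event, while the DOM version keeps the entire tree in memory in exchange for random access.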

Most XML parsing libraries are built on one of two underlying libraries, Expat and libxml. Expat was the original open source XML-parsing library, built in 1998 by James Clark, one of the creators of XML. Expat is a SAX-like event-based parser, on which SAX and DOM parsers can be fairly easily built. libxml was created in 1999 by Daniel Veillard for the GNOME project (although it also works standalone) and implements SAX and DOM-like parsing semantics.

In PHP 4, we can use the XML Parser extension, built on top of Expat, for event-based parsing. The PEAR modules XML_Parser and XML_Tree provide SAX and DOM parsing services, respectively. You can also use the domxml extension for DOM parsing using libxml. In PHP 5, the SimpleXML and DOM extensions (both based on libxml) provide tree-based access to a parsed document, while the XML Parser extension remains available for SAX-style event-based parsing.

For Perl programmers, the XML-LibXML package contains libxml-based DOM (XML::LibXML::DOM) and SAX (XML::LibXML::SAX) parsers. There are a huge number of other implementations on CPAN, and it's worth searching for the one that suits you best. There are numerous SAX and DOM parsers built on both Expat and libxml.

When we can talk HTTP and parse XML, we can start to bring the two together to send requests and receive responses from remote services. We could decide on everything on top of that for ourselves, but we're fans of building on other people's work. Let's allow other people to make all the mistakes first, and then we can come in and use the good parts. There are three important protocols for communicating with XML over HTTP, and we'll look briefly at each in turn.

7.6.2. REST

The term REST was coined by HTTP coinventor Roy Fielding in his 2000 doctoral thesis entitled Architectural Styles and the Design of Network-Based Software Architectures. REST, or Representational State Transfer, refers to a collection of architectural principles used for transfer of information over the Web, but is now used to describe simple RPC-based protocols using XML over HTTP.

REST is currently the poster child of the open source web application community, as it avoids some of the perceived pitfalls of the more strictly defined protocols such as XML-RPC and SOAP. Namely, it's lightweight and application-specific (since there's no formal envelope or required structure), and it makes better use of HTTP (using the DELETE and PUT verbs). In this sense, REST isn't actually a protocol beyond HTTP, but rather an agreed-upon way of accessing and modifying resources over HTTP using XML.
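As a sketch of what using the HTTP verbs looks like in practice, the following builds (but does not send) a PUT and a DELETE request with Python's standard urllib.request module. The base URL, the photo resource, and the build_request helper are all hypothetical:

```python
import urllib.request

BASE_URL = "http://api.example.com/photos"  # hypothetical service endpoint


def build_request(method, photo_id, body=None):
    """Build an HTTP request for one photo resource, with an optional XML body."""
    data = body.encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        f"{BASE_URL}/{photo_id}", data=data, method=method
    )
    if data is not None:
        request.add_header("Content-Type", "application/xml")
    return request


# The verb says what to do; the URL says which resource to do it to.
update = build_request("PUT", 42, "<photo><title>Sunset</title></photo>")
delete = build_request("DELETE", 42)

print(update.get_method(), update.full_url)
print(delete.get_method(), delete.full_url)
```

Passing either request to urllib.request.urlopen would send it. There is no envelope here; the XML body carries only the application's own data.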

7.6.3. XML-RPC

XML-RPC was designed in 1998 by Dave Winer when he became frustrated with the SOAP design process. XML-RPC is a very simple protocol, which can be summed up in its entirety in a couple of pages. It describes a request and response XML document, and a format for encoding data within these documents using a few basic types: numbers, strings, arrays, and structs/hashes. An XML-RPC request looks like this:

<?xml version="1.0"?>
<methodCall>
        <methodName>{method name}</methodName>
        <params>
                <param>{value}</param>
                <param>{value}</param>
                <param>{value}</param>
        </params>
</methodCall>
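We rarely need to write these documents by hand. As one example, Python's standard xmlrpc.client module can encode a call and decode it again; the method name photos.getInfo and its arguments below are hypothetical:

```python
import xmlrpc.client

# Encode a call with an integer, a string, and a struct containing an array.
request_xml = xmlrpc.client.dumps(
    (42, "large", {"tags": ["sunset", "harbor"]}),
    methodname="photos.getInfo",
)
print(request_xml)

# Decoding the document recovers the method name and the original values.
params, method = xmlrpc.client.loads(request_xml)
print(method, params)
```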

There are then two formats of response. The successful response contains response data:

<?xml version="1.0"?>
<methodResponse>
        <params>
                <param>{value}</param>
        </params>
</methodResponse>

An unsuccessful request elicits a fault response containing an error code and message:

<?xml version="1.0"?>
<methodResponse>
        <fault>
                <value>
                        <struct>
                                <member>
                                        <name>faultCode</name>
                                        <value><int>{code}</int></value>
                                </member>
                                <member>
                                        <name>faultString</name>
                                        <value><string>{error}</string></value>
                                </member>
                        </struct>
                </value>
        </fault>
</methodResponse>

It's quite easy to see why critics of XML-RPC and web services in general complain about heavy syntax and difficult parsing. Here we have 12 XML tag pairs to describe an error with a single code and message. Complex successful responses can become very large very quickly.
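A library hides most of this verbosity from us. In the sketch below, Python's standard xmlrpc.client module turns a fault response into an exception with two attributes; the fault code and message are sample values made up for the example:

```python
import xmlrpc.client

fault_xml = """<?xml version="1.0"?>
<methodResponse>
  <fault>
    <value>
      <struct>
        <member>
          <name>faultCode</name>
          <value><int>4</int></value>
        </member>
        <member>
          <name>faultString</name>
          <value><string>Too many parameters.</string></value>
        </member>
      </struct>
    </value>
  </fault>
</methodResponse>"""

# A fault response is raised as an exception rather than returned as data.
try:
    xmlrpc.client.loads(fault_xml)
except xmlrpc.client.Fault as fault:
    code, message = fault.faultCode, fault.faultString
    print(code, message)
```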

7.6.4. SOAP

The last in our merry bunch, SOAP, originally stood for the Simple Object Access Protocol. More recently, it has been renamed to just SOAP (no longer an acronym), after it ceased being very simple.

As with XML-RPC, SOAP has a request and response envelope that wraps the actual data. Data inside a SOAP envelope is usually expressed using XML Schema notation. The request and response envelopes, at their simplest, are identical:

<?xml version="1.0" encoding="utf-8" ?>
<s:Envelope
        xmlns:s="http://www.w3.org/2003/05/soap-envelope"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:xsd="http://www.w3.org/2001/XMLSchema"
>
        <s:Body>
                 {request/response body}
        </s:Body>
</s:Envelope>

As with XML-RPC, when an error occurs the response takes on a specific format, indicating the nature of the error:

<?xml version="1.0" encoding="utf-8" ?>
<s:Envelope xmlns:s="http://www.w3.org/2003/05/soap-envelope">
        <s:Body>
                <s:Fault>
                        <s:Code><s:Value>{code}</s:Value></s:Code>
                        <s:Reason><s:Text xml:lang="en">{message}</s:Text></s:Reason>
                        <s:Role>{url}</s:Role>
                        <s:Detail>{explanation}</s:Detail>
                </s:Fault>
        </s:Body>
</s:Envelope>

SOAP is slightly harder to parse than XML-RPC, as it typically uses multiple namespaces, and it is just as verbose.
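The namespace handling is manageable with a namespace-aware parser. This sketch uses Python's standard xml.etree.ElementTree module to pull a value out of a SOAP response; the getPhotoResponse payload and its urn:example:photos namespace are invented for illustration:

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://www.w3.org/2003/05/soap-envelope"
PHOTO_NS = "urn:example:photos"  # hypothetical application namespace

response = b"""<?xml version="1.0" encoding="utf-8" ?>
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Body>
    <getPhotoResponse xmlns="urn:example:photos">
      <title>Sunset</title>
    </getPhotoResponse>
  </env:Body>
</env:Envelope>"""

# Our prefixes ("s", "p") need not match the document's ("env", default);
# only the namespace URIs have to agree.
prefixes = {"s": SOAP_NS, "p": PHOTO_NS}

root = ET.fromstring(response)
body = root.find("s:Body", prefixes)
title = body.find("p:getPhotoResponse/p:title", prefixes).text

print(root.tag)
print(title)
```

Matching on the namespace URI rather than the prefix is what keeps this working when the remote service picks different aliases from ours.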

We'll look at how we can provide our own REST, XML-RPC, and SOAP interfaces in Chapter 11. For the moment it's enough to know what choices we have for communicating with XML-based external services over HTTP.

