6.3. The MIME Format
ARPANET was created in 1969 and email soon after in 1971. The first standard for the exchange of public messages came in the form of an RFC document in 1973 (RFC 561). In Internet terms, the email standard is very, very old. So old, in fact, that you'd be forgiven for thinking that everyone must have read and understood it by now. Of course, this isn't true in practice and some mailers still make basic mistakes. When you're parsing email from a variety of different sources, these mistakes can make your life difficult.
The Internet and the protocols that comprise email are based on a set of open standards, encapsulated into RFC (Request for Comments) documents. These documents are now somewhat misnamed, as they represent a final standard and not a solicitation for feedback. RFCs were originally created by the people working on ARPANET and are now published by the Internet Society (ISOC), the Internet Engineering Task Force (IETF), and the Internet Architecture Board (IAB).
There are three important RFCs concerning email (although there are many more that cover email in some capacity), and we'll touch on them here to get a good understanding of how the protocols have evolved.
The first RFC dealing substantially with email was RFC 561 in September 1973, entitled "Standardizing Network Mail Headers." This RFC described a standardized method of specifying headers and bodies, which was later adopted for HTTP.
Email was more widely standardized in RFC 822. Released in August 1982, it was titled "Standard for the Format of ARPA Internet Text Messages" and described the format for the email address system still used today.
Finally, the format we currently use for mail was described in RFC 1341 in June 1992. Entitled "MIME (Multipurpose Internet Mail Extensions): Mechanisms for Specifying and Describing the Format of Internet Message Bodies," the document describes media types and multipart message structure. By reading these three RFCs you can get a good sense of how email has developed and the way in which email documents are structured.
The MIME standard defines various values for the Content-type header, which allow you to specify the contents of a message body. Content types have a type and subtype component, separated with a forward slash (main type first). The various types include text body types (text/plain, text/html), images (image/gif, image/jpeg), and binary attachments (application/octet-stream, application/zip).
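To make the type/subtype split concrete, here is a minimal sketch using the email package from Python's standard library (our choice for illustration; the text above doesn't prescribe any particular parser). The one-header message in raw is a made-up sample, not taken from a real mailer.

```python
from email import message_from_string

# Hypothetical sample message: one header, a blank line, then the body.
raw = "Content-Type: text/html; charset=utf-8\n\n<b>hello world</b>\n"
msg = message_from_string(raw)

# The parser lowercases the content type and splits it at the forward slash.
full_type = msg.get_content_type()      # type and subtype together
main_type = msg.get_content_maintype()  # the part before the slash
sub_type = msg.get_content_subtype()    # the part after the slash

# Parameters that follow the type, such as charset, are read separately.
charset = msg.get_param("charset")

print(full_type, main_type, sub_type, charset)
```

When you're deciding how to handle a body, branching on the main type first (text, image, application, multipart) and only then on the subtype tends to keep the code tolerant of subtypes you haven't seen before.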
There's a special primary media type, multipart/*, which specifies that the body contains multiple subchunks. multipart/alternative specifies that the subchunks are alternative representations of the same content (often used for including both text/plain and text/html bodies in a single message). multipart/mixed is used for attaching files; both attachments and message bodies are included as subchunks. message/rfc822 can be used for attaching an email including all of its headers. A typical multipart header might look like this:
Content-type: multipart/alternative; boundary="8732947.038A7B5C765A86D87EE983"
The boundary property specifies a string that is used to split the body into chunks. The body of the mail with the above header might look something like this:
--8732947.038A7B5C765A86D87EE983
Content-type: text/plain

hello world
--8732947.038A7B5C765A86D87EE983
Content-type: text/html

<b>hello world</b>
--8732947.038A7B5C765A86D87EE983--
Prefixed by two dashes, the boundary indicates the division between subchunks. The final chunk ends with the boundary both prefixed and suffixed with two dashes. A multipart body can contain one or more subchunks, all delimited by the same boundary string.
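As a rough illustration of how a parser applies the boundary, the sketch below feeds the same header and body to Python's standard email package (an assumption on our part; any MIME-aware library should behave similarly). The variable names are ours.

```python
from email import message_from_string

# The multipart/alternative message from the text, one line per string.
raw = (
    'Content-type: multipart/alternative; boundary="8732947.038A7B5C765A86D87EE983"\n'
    "\n"
    "--8732947.038A7B5C765A86D87EE983\n"
    "Content-type: text/plain\n"
    "\n"
    "hello world\n"
    "--8732947.038A7B5C765A86D87EE983\n"
    "Content-type: text/html\n"
    "\n"
    "<b>hello world</b>\n"
    "--8732947.038A7B5C765A86D87EE983--\n"
)

msg = message_from_string(raw)

# The boundary comes back without its surrounding quotes.
boundary = msg.get_boundary()

# For a multipart message, get_payload() returns a list of subparts,
# each of which is a message object with its own headers and body.
# strip() only tidies any surrounding whitespace for display.
parts = [(p.get_content_type(), p.get_payload().strip())
         for p in msg.get_payload()]

for content_type, body in parts:
    print(content_type, "->", body)
```

For a multipart/alternative message like this one, a reader would normally display only one of the parts, typically the richest format it can render.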
Each subchunk contains both headers and a body of its own, just as the email itself does, although many headers (such as the subject and From address) don't need to be repeated here. The subchunks can, however, specify a Content-type header. By specifying a multipart content type, subchunks can themselves contain subchunks, ad infinitum. They must, of course, use a different boundary string; otherwise, you wouldn't be able to tell where to split up the contents.
The following example of multiple levels of chunks is indented to make it a little easier to read, but in a real message it would not be indented at all. Such indentation would actually make the mail invalid, because headers and boundaries cannot have leading whitespace:
Content-Type: multipart/mixed; boundary=outer

--outer
    Content-Type: text/plain
    Content-Disposition: inline

    Some text goes here
--outer
    Content-Type: multipart/mixed; boundary=inner

    --inner
        Content-Type: image/jpeg
        Content-Disposition: attachment

        <jpeg data>
    --inner
        Content-Type: image/jpeg
        Content-Disposition: attachment

        <jpeg data>
    --inner--
--outer--
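Because chunks can nest to any depth, code that processes them is usually recursive. The sketch below, again assuming Python's standard email package, parses the nested message with the indentation removed (as it would appear on the wire) and records each part's depth and content type. The outline function is a hypothetical helper of our own, not part of any library.

```python
from email import message_from_string

# The nested example, unindented so that it is a valid message.
raw = "\n".join([
    "Content-Type: multipart/mixed; boundary=outer",
    "",
    "--outer",
    "Content-Type: text/plain",
    "Content-Disposition: inline",
    "",
    "Some text goes here",
    "--outer",
    "Content-Type: multipart/mixed; boundary=inner",
    "",
    "--inner",
    "Content-Type: image/jpeg",
    "Content-Disposition: attachment",
    "",
    "<jpeg data>",
    "--inner",
    "Content-Type: image/jpeg",
    "Content-Disposition: attachment",
    "",
    "<jpeg data>",
    "--inner--",
    "--outer--",
    "",
])


def outline(part, depth=0):
    """Return (depth, content type) pairs for a part and all of its subparts."""
    rows = [(depth, part.get_content_type())]
    if part.is_multipart():
        # A multipart payload is a list of child parts; recurse into each.
        for child in part.get_payload():
            rows.extend(outline(child, depth + 1))
    return rows


msg = message_from_string(raw)
structure = outline(msg)

# Print the tree, indenting each part by its nesting depth.
for depth, content_type in structure:
    print("  " * depth + content_type)
```

The same library also offers a walk() method on message objects that visits every part in depth-first order, which is often enough when you don't need to track the nesting level yourself.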