6.7. Wireless Carriers Hate You
You may not have realized it yet, but wireless carriers hate you. This can be a little hard to take in at first; surely they're just like regular humans? Unfortunately, after working with email sent from mobile devices, you will begin to see that they hate you. Not only do they hate you, but they also hate everything you've ever done and everyone you've ever worked with.
This might seem like the ramblings of a paranoid programmer who's spent too many years using COBOL, but after the experience of processing email sent from mobile devices, I can't come up with any other explanation.
If you're going to receive mail from mobile devices, then you're going to have to create a lot of special cases. This obviously sucks, but is unavoidable if you want to be able to deal with mail in a consistent manner. At some point somebody will hopefully create a centralized clearing house for the processing of mail from delinquent senders, but until that time you'll need to deal with the issue yourself, patching individual problems as you find them and as carriers evolve new methods of sending almost valid email.
The first batch of special casing comes from email subject lines. Many mobile carriers prefix a particular string to each subject line, while some replace it altogether with their own. This is all very well, until you receive mail with the following subject line:
Subject: [PXT from 5555551234]
If you're blithely taking subject lines and adding them to objects as metadata, then you'll be displaying the sender's phone number to the public. This is clearly a privacy issue. Not all subject lines are so evil, though; many are merely annoying:
Subject: This is an MMS message
Magritte is turning in his grave. But when you end up with a few thousand photos with that title, you start to lose the value in your data. Even worse, you can inadvertently start advertising the mobile carrier in your application:
Subject: This message was sent from a T-Mobile wireless phone
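One way to handle subjects like these is to match them against a list of known carrier boilerplate and fall back to "no title" when they match. The following is a minimal sketch in Python, not a complete solution: the pattern list is hypothetical and built only from the three examples above, and clean_subject is an illustrative name. It only handles subjects that have been replaced outright; carriers that prefix a string to a real subject would need patterns that strip the prefix instead.

```python
import re

# Hypothetical patterns, derived only from the three example subjects above.
# A real list would grow as new carriers are discovered.
CARRIER_SUBJECT_PATTERNS = [
    re.compile(r"^\[PXT from \d+\]$"),
    re.compile(r"^This is an MMS message$", re.IGNORECASE),
    re.compile(r"^This message was sent from an? .+ wireless phone$", re.IGNORECASE),
]


def clean_subject(subject):
    """Return a usable title, or None if the subject is carrier boilerplate."""
    stripped = (subject or "").strip()
    if not stripped:
        return None
    for pattern in CARRIER_SUBJECT_PATTERNS:
        if pattern.match(stripped):
            return None  # never store phone numbers or carrier adverts as titles
    return stripped
```

Returning None rather than an empty string lets the application choose its own default title instead of storing a blank one.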
Many mobile devices do bizarre things with attachments. It's quite normal to see mail from desktop clients that don't mark attachments with the Content-Disposition: attachment header. Some clients will send images with the Content-Disposition: inline header to indicate the image should be displayed inline in the message. When looking for attached images, you would typically traverse the chunks looking for a media type of image/*, or a filename matching /\.(jpg|jpeg|gif|png)$/i.
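As a rough illustration of that traversal, here is a sketch using Python's standard email package. The function name find_image_parts is an assumption made for this example, and the filename pattern is anchored to the end of the name and made case-insensitive so that names like PHOTO.JPG also match.

```python
import re
from email import message_from_bytes

IMAGE_FILENAME = re.compile(r"\.(jpg|jpeg|gif|png)$", re.IGNORECASE)


def find_image_parts(raw_mail):
    """Yield (filename, payload_bytes) for each chunk that looks like an image."""
    message = message_from_bytes(raw_mail)
    for part in message.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        filename = part.get_filename()  # None when no filename was supplied
        is_image_type = part.get_content_maintype() == "image"
        has_image_name = bool(filename and IMAGE_FILENAME.search(filename))
        if is_image_type or has_image_name:
            # decode=True undoes base64 or quoted-printable transfer encoding
            yield filename, part.get_payload(decode=True)
```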
This covers all mail we've seen coming from regular clients, but some mobile carriers instead attach images using the following headers:
Content-type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: base64

dGhhbiAxNTEgcHJpbnRzLCB0aGUgb3JkZXIgbXVzdCBiZSBzaGlwcGVkIHZpYSAiUHJpb3JpdHkg
TWFpbCJBAABWZXJpZnkgdGhhdCAxIHRvIDUwIHByaW50cyB3aWxsIGhhdmUgdGhlIHNoaXBwaW5n
IHByaWNlIG9mICQxNC45OToAAFZlcmlmeSB0aGF0IHRoZSA1MSB0byAxMDAgcHJpbnRzJyBzaGlw
cGluZyBwcmljZSBpcyAkMTUuOTk7AABWZXJpZnkgdGhhdCB0aGUgMTAxIHRvIDE1MCBwcmludHMn
IHNoaXBwaW5nIHByaWNlIGlzICQxNi45OTsAAFZlcmlmeSB0aGF0IHRoZSAxNTEgdG8gMjAwIHBy
It's hard to believe, but not only do they not mark the chunk as an attachment or inline, or specify a filename, but they actually claim it's ASCII-encoded plain text.
To correctly identify images in these cases, we have to use a few rules of thumb. We can check the From address against a known list of offenders, and treat text/plain segments with base64 encoding suspiciously. We can examine the magic bytes at the beginning of the file to look for known types; GIFs start with GIF87a or GIF89a, and JPEGs start with the bytes FF D8, with JFIF files carrying the string JFIF starting at byte 7. Alternatively, we can just be suspicious of all bodies with a lot of data in them (more than a few kilobytes) and treat them as attachments.
The text/plain type is tricky because it's also used by valid bodies. When we come across chunks of unknown types, such as application/octet-stream (the "I-don't-know-what-it-is" type), we should probably try to process them as attachments. When the sending mail agent doesn't know the correct media type, it will often omit the header altogether or send a spurious application/* header.
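These rules of thumb can be combined into a small classifier. The sketch below is one possible arrangement rather than a definitive one: the names sniff_image_type and classify_part, the 4 KB threshold, and the inclusion of a PNG signature are all assumptions made for illustration. It treats a chunk with a missing Content-Type as an attachment, which follows the reasoning above but would be wrong for a bare single-part text message.

```python
import base64
import binascii

# Assumed signatures: both GIF versions, the JPEG start-of-image marker, PNG.
MAGIC_PREFIXES = {
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
}

SUSPICIOUS_SIZE = 4 * 1024  # "more than a few kilobytes"; an arbitrary choice


def sniff_image_type(data):
    """Return a media type guessed from the leading bytes, or None."""
    for prefix, media_type in MAGIC_PREFIXES.items():
        if data.startswith(prefix):
            return media_type
    return None


def classify_part(content_type, transfer_encoding, body):
    """Classify one MIME chunk as 'image', 'attachment', or 'text'.

    body is the raw, still-encoded chunk body as bytes.
    """
    content_type = (content_type or "").lower()
    data = body
    if (transfer_encoding or "").lower() == "base64":
        try:
            data = base64.b64decode(body)
        except (binascii.Error, ValueError):
            data = body  # not valid base64 after all; keep the raw bytes
    if sniff_image_type(data) or content_type.startswith("image/"):
        return "image"  # magic bytes win over whatever the headers claim
    if not content_type or content_type.startswith("application/"):
        return "attachment"  # missing or catch-all media type
    if len(data) > SUSPICIOUS_SIZE:
        return "attachment"  # too big to be a plausible message body
    return "text"
```

Checking the magic bytes first means a chunk labeled text/plain with base64 encoding, like the one shown earlier, is still recognized when its decoded content is really a GIF or JPEG.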
Some mobile carriers attach extra images to email, including spacer GIFs and company logos. When extracting attachments, you need to be careful to exclude these files. There are again two ways you can go about this: discard all images under a certain size, which works fairly well, or keep a list of carriers who attach extra images and discard them; many carriers will attach photos as JPEGs and extra images as GIFs.
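Both approaches fit in a few lines. In this sketch the size cutoff, the carrier domain, and the function name keep_image are all hypothetical; it applies the two filters together, though either could be used on its own.

```python
MIN_IMAGE_BYTES = 2048  # assumed cutoff; tune it against real mail

# Hypothetical sender domains known to add logos or spacers as GIFs.
CARRIERS_WITH_EXTRA_GIFS = {"mms.example-carrier.com"}


def keep_image(sender_domain, media_type, data):
    """Return True if the image looks like user content, not carrier filler."""
    if len(data) < MIN_IMAGE_BYTES:
        return False  # spacer GIFs and small logos
    if (sender_domain.lower() in CARRIERS_WITH_EXTRA_GIFS
            and media_type == "image/gif"):
        return False  # this carrier sends photos as JPEGs and extras as GIFs
    return True
```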
A certain carrier doubles up the attachments on each email, sending two copies of every file: one as a subchunk with the text body and one as a subchunk with the HTML body. It's a good idea to compare the files you find to check they're not identical. By comparing the length and calculating a simple checksum (such as crc32() in PHP), you can easily eliminate doubles.
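The same length-plus-checksum comparison can be sketched in Python, where zlib.crc32() plays the role of PHP's crc32(). The function name drop_duplicates is an assumption for this example.

```python
import zlib


def drop_duplicates(files):
    """Return the payloads with repeats removed, preserving their order."""
    seen = set()
    unique = []
    for data in files:
        key = (len(data), zlib.crc32(data))  # cheap fingerprint of the file
        if key not in seen:
            seen.add(key)
            unique.append(data)
    return unique
```

Two different files can in principle share a length and a CRC32, so if discarding a distinct file would be costly, compare the bytes directly whenever the fingerprints match.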
Some carriers will detach any media attachments from mail and replace them with a link to where the content can be viewed online (alongside their advertisements). This is a particularly troublesome issue, since you can't gather the input by just parsing the mail. Your parser will need to recognize these carriers (usually by keeping a list of offending providers) and then go and remotely fetch the attachments.
As if this were not already enough of an inconvenience, some carriers require you to visit one linked page, receive a cookie, and then click through to the actual media. Your attachment-parsing layer will need to deal with each of these cases individually, although using a good HTTP client library with cookie support helps immensely.
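A minimal sketch of the cookie dance, using only Python's standard library, might look like the following. The function names are assumptions, and so is the idea that the landing and media URLs are already known; in practice each carrier needs its own logic for finding the media link on the landing page.

```python
import re
import urllib.request
from http.cookiejar import CookieJar

URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_links(body_text):
    """Return every http(s) URL found in a text or HTML body."""
    return URL_PATTERN.findall(body_text)


def make_cookie_opener():
    """Build an opener that stores cookies and sends them on later requests."""
    jar = CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def fetch_via_landing_page(opener, landing_url, media_url, timeout=10):
    """Visit the landing page to collect cookies, then fetch the media itself."""
    with opener.open(landing_url, timeout=timeout):
        pass  # only the cookies set by this response matter
    with opener.open(media_url, timeout=timeout) as response:
        return response.read()
```

Reusing one opener for both requests is the point: the cookie jar attached to it carries whatever the first page set into the second request.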
By the time you've dealt with a few of these situations, you'll have an extensive library of code and special casing. The worst part is that it's usually not over at that point; new carriers appear and old ones start sending mail in more creative ways. The source code for the Flickr attachment parser is littered with comments from disgruntled programmers:
# Some carriers have partnered with the
# URL scheme 'http://foo.com//shareImage'
# Careful observers will note the *required*
# double slashes after the hostname.
# Maybe somebody should explain the web to them?
A good attachment parser is an essential component of an email-receiving system; it will sit between the MIME parser and the application logic. Much of the development drive in this layer will come from mobile carriers with odd ideas of how to send mail, but it's a good chance to test our coding prowess. Our mail-processing system now looks like Figure 6-1.
Figure 6-1. Mail-processing system