An attachment that wasn’t there

By Slavo Greminger and Oli Schacher

Every day we collect tons of spam emails, which we analyze for malicious content. Of course, this is not done manually by our thousands of minions; it is automated using some Python-fu. Python is a programming language that comes with many libraries, which makes it easy for us to perform such tasks quickly.

Python’s email library deals with, well, emails. And it does it well. But on October 3rd, we encountered an attachment that wasn’t there – at least according to Python’s email library.

Figure: Mal-formatted email. Left: Outlook Web does not show the attachment. Right: Thunderbird does show the attachment.

Now how could that happen?

Emails have a certain structure, which is described nicely in RFC #822, RFC #2822, RFC #5322, RFC #2045, RFC #2046, RFC #2047, RFC #2049, RFC #2231, RFC #4288 and RFC #4289. Even though these RFCs are clear in their own way, an illustration might help to understand why Python’s email library got fooled (we focus on multipart emails only).

Emails essentially consist of two parts: the headers and the body. A typical email header is

To: me@example.com

with the header name “To” and the header value “me@example.com”. In the case of multipart emails, there is also a (MIME) header

Content-Type: multipart/alternative; boundary=0123456789boundaryabcdef

which tells us, because of the main type “multipart”, that the body of this email consists of several MIME parts separated by the boundary “0123456789boundaryabcdef”. Here is an example of a well-formatted multipart email with a zip attachment:

To: me@example.com
Subject: Well-formatted email
Content-Type: multipart/alternative; boundary=0123456789boundaryabcdef

--0123456789boundaryabcdef
Content-Type: text/plain; charset=UTF-8

This email is well-formatted.
--0123456789boundaryabcdef
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<p>This email is well-formatted.</p>

--0123456789boundaryabcdef
Content-Type: application/zip; name="attachmentthatwasthere.zip"
Content-Transfer-Encoding: base64
Content-ID: <004601d21e34$4b05ce10$3703a8c0@25OW9ZE>

UEsDBBUAAAAJAIZRREm9/p/AY+YBAFeyAgAwAAAASU1HLTIwMTYxMDAzLVdBMDAwMStJTUct
MjAxNjEwMDktV0EwMDAyLmpwZWcuZXhl8o2awMDMwMDAAsT//zMw7GCAAAcGQgCilk9+Fx/D

--0123456789boundaryabcdef
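To make the structure concrete, here is a minimal sketch of how Python’s email library reads such a message. It uses the sample from above, with one assumption on our side: the last line is written as a closing delimiter (boundary plus a trailing "--"), which is the form RFC #2046 defines for the end of a multipart body. The variable names are illustrative only.

```python
import email

BOUNDARY = "0123456789boundaryabcdef"

# The well-formatted sample. Assumption for this sketch: the final line
# carries the trailing "--" of a closing delimiter.
RAW = "\n".join([
    "To: me@example.com",
    "Subject: Well-formatted email",
    "Content-Type: multipart/alternative; boundary=" + BOUNDARY,
    "",
    "--" + BOUNDARY,
    "Content-Type: text/plain; charset=UTF-8",
    "",
    "This email is well-formatted.",
    "--" + BOUNDARY,
    "Content-Type: text/html; charset=UTF-8",
    "Content-Transfer-Encoding: quoted-printable",
    "",
    "<p>This email is well-formatted.</p>",
    "",
    "--" + BOUNDARY,
    'Content-Type: application/zip; name="attachmentthatwasthere.zip"',
    "Content-Transfer-Encoding: base64",
    "",
    "UEsDBBUAAAAJAIZRREm9/p/AY+YBAFeyAgAwAAAASU1HLTIwMTYxMDAzLVdBMDAwMStJTUct",
    "MjAxNjEwMDktV0EwMDAyLmpwZWcuZXhl8o2awMDMwMDAAsT//zMw7GCAAAcGQgCilk9+Fx/D",
    "",
    "--" + BOUNDARY + "--",
    "",
])

msg = email.message_from_string(RAW)

print(msg["To"])                    # header value of the "To" header
print(msg.get_content_maintype())   # main type, here "multipart"
print(msg.get_boundary())           # the boundary parameter

# walk() yields the multipart container first, then each MIME part in order.
part_types = [part.get_content_type() for part in msg.walk()]
print(part_types)

# The last MIME part is the zip attachment; decode its base64 payload.
attachment = msg.get_payload()[-1]
print(attachment.get_filename())
print(attachment.get_payload(decode=True)[:4])  # zip files start with b"PK"
```

In this well-formatted case, the attachment shows up as a regular MIME part, so any code that iterates over walk() will see it.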

With a very simple modification, the attachment called “attachmentthatwasthere.zip” can be hidden:

To: me@example.com
Subject: Mal-formatted email
Content-Type: multipart/alternative; boundary=0123456789boundaryabcdef

--0123456789boundaryabcdef
Content-Type: text/plain; charset=UTF-8

This email is mal-formatted.
--0123456789boundaryabcdef
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<p>This email is mal-formatted.</p>

--0123456789boundaryabcdef

--0123456789boundaryabcdef
Content-Type: application/zip; name="attachmentthatwasntthere.zip"
Content-Transfer-Encoding: base64
Content-ID: <004601d21e34$4b05ce10$3703a8c0@25OW9ZE>

UEsDBBUAAAAJAIZRREm9/p/AY+YBAFeyAgAwAAAASU1HLTIwMTYxMDAzLVdBMDAwMStJTUct
MjAxNjEwMDktV0EwMDAyLmpwZWcuZXhl8o2awMDMwMDAAsT//zMw7GCAAAcGQgCilk9+Fx/D

--0123456789boundaryabcdef

The empty line after the extra --0123456789boundaryabcdef line causes some parsers to think that the end of the email has been reached. Email readers treat this situation differently: Thunderbird identifies this as an error, ignores the extra boundary line and shows the attachment. Apple’s Mail, Microsoft Outlook for Mac 2011 and Microsoft Outlook 2010 (other versions have not been tested) do not show the attachment. What about web-based readers? Open-Xchange and Roundcube show it readily, but Gmail and Outlook Web do not.

And, as the attentive reader might have realized, Python’s email library does not “show” the attachment either. So how did we modify our parsers to recognize these attachments? Python’s email library stores everything after the (supposed) end of the email in an attribute of the message object called epilogue. By parsing this epilogue, hidden attachments can be detected (for a full implementation see https://github.com/gryphius/fuglu/blob/2ffd57ef876dcb553cb7b84b37f4a9d82cabe07a/fuglu/src/fuglu/plugins/attachment.py#L748-L767).

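import email

# The function name is illustrative; the snippet is the body of a generator.
def hidden_parts(message):
    """Yield MIME parts found in the epilogue of a parsed multipart message."""
    boundary = message.get_boundary()
    epilogue = message.epilogue
    if epilogue is None or boundary is None or boundary not in epilogue:
        return

    # Split on the full delimiter ("--" + boundary) so that no stray dashes
    # end up at the end of a recovered part.
    for candidate in epilogue.split('--' + boundary):
        part_content = candidate.strip()
        # Anything that starts with a Content-* header is treated as a MIME part.
        if part_content.lower().startswith('content'):
            yield email.message_from_string(part_content)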
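The following sketch shows the effect end to end on a shortened version of the mal-formatted sample (the HTML part is left out and the base64 payload is cut short). It rests on one assumption that is not visible in the sample as printed above: the stray boundary line is written with the trailing "--" of a closing delimiter, because, as far as we can tell, Python’s parser only fills epilogue once it has seen such a delimiter. The helper name walk_with_epilogue is hypothetical and not part of the email library.

```python
import email

BOUNDARY = "0123456789boundaryabcdef"

# Shortened mal-formatted sample. Assumption for this sketch: the stray
# boundary line is a closing delimiter (boundary followed by "--").
RAW = "\n".join([
    "To: me@example.com",
    "Subject: Mal-formatted email",
    "Content-Type: multipart/alternative; boundary=" + BOUNDARY,
    "",
    "--" + BOUNDARY,
    "Content-Type: text/plain; charset=UTF-8",
    "",
    "This email is mal-formatted.",
    "--" + BOUNDARY + "--",
    "",
    "--" + BOUNDARY,
    'Content-Type: application/zip; name="attachmentthatwasntthere.zip"',
    "Content-Transfer-Encoding: base64",
    "",
    "UEsDBBUAAAAJAIZRREm9",
    "",
    "--" + BOUNDARY,
    "",
])


def walk_with_epilogue(message):
    """Yield the regular MIME parts, then any parts hidden in the epilogue."""
    for part in message.walk():
        yield part
    boundary = message.get_boundary()
    epilogue = message.epilogue
    if epilogue is None or boundary is None or boundary not in epilogue:
        return
    for candidate in epilogue.split("--" + boundary):
        part_content = candidate.strip()
        if part_content.lower().startswith("content"):
            yield email.message_from_string(part_content)


msg = email.message_from_string(RAW)

# What the library reports on its own: the zip part is missing.
visible_types = [p.get_content_type() for p in msg.walk()]
print(visible_types)

# What we get once the epilogue is parsed as well.
all_parts = list(walk_with_epilogue(msg))
all_types = [p.get_content_type() for p in all_parts]
print(all_types)

# The recovered parts are the ones beyond what walk() returned.
hidden = all_parts[len(visible_types):]
print([p.get_filename() for p in hidden])
```

Under these assumptions, walk() alone reports only the container and the text part, while the epilogue yields the zip attachment with its file name intact.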

To conclude: Was this a clever move by the spammers? The answer is – err – no, we think. If you want to reach the inbox of as many recipients as possible, it does not really help you that many of the recipients do not even see the malicious attachment. If you only want to reach the inbox of recipients using a particular brand of email reader, then this might not help you either: Antispam engines have several layers to recognize spam emails – malformation is one of the indicators. So, probably, this is a bug and was not intentional.

Still, it is always amazing how creative people can be, and how differently applications behave if standards are not respected. We are looking forward to the next trick (or bug) and will continue, together with our minions, to analyze spam for malicious content in order to protect Swiss universities.