By Slavo Greminger and Oli Schacher
On a daily basis we collect tons of spam emails, which we analyze for malicious content. Of course, this is not done manually by our thousands of minions, but is automated using some Python-fu. Python is a programming language that comes with many libraries, making it easy for us to quickly perform such tasks.
Python’s email library deals with, well, emails. And it does it well. But on October 3rd, we encountered an attachment that wasn’t there – at least according to Python’s email library.
Now how could that happen?
Emails do have a certain structure, which is described nicely in RFC #822, RFC #2822, RFC #5322, RFC #2045, RFC #2046, RFC #2047, RFC #2049, RFC #2231, RFC #4288 and RFC #4289. Even though these RFCs are clear in their own way, an illustration might help to understand why Python’s email library got fooled (we focus on multipart emails only).
Emails essentially consist of two parts: the headers and the body. A typical email header is
To: email@example.com
with the header name “To” and the header value “email@example.com”. In the case of multipart emails, there also is a (MIME) header
Content-Type: multipart/alternative; boundary=0123456789boundaryabcdef
which tells us, because of the main type “multipart”, that the body of this email consists of several MIME parts separated by the boundary “0123456789boundaryabcdef”. Here is an example of a well-formatted multipart email with a zip attachment:
To: firstname.lastname@example.org
Subject: Well-formatted email
Content-Type: multipart/alternative; boundary=0123456789boundaryabcdef

--0123456789boundaryabcdef
Content-Type: text/plain; charset=UTF-8

This email is well-formatted.
--0123456789boundaryabcdef
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<p>This email is well-formatted.</p>
--0123456789boundaryabcdef
Content-Type: application/zip; name="attachmentthatwasthere.zip"
Content-Transfer-Encoding: base64
Content-ID: <004601d21e34$4b05ce10$3703a8c0@25OW9ZE>

UEsDBBUAAAAJAIZRREm9/p/AY+YBAFeyAgAwAAAASU1HLTIwMTYxMDAzLVdBMDAwMStJTUct
MjAxNjEwMDktV0EwMDAyLmpwZWcuZXhl8o2awMDMwMDAAsT//zMw7GCAAAcGQgCilk9+Fx/D
--0123456789boundaryabcdef
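To see how Python’s email library reads such a message, here is a minimal sketch. The sample string is modeled on the email above, with two assumptions of ours: the base64 payload is shortened, and the last boundary line carries the trailing "--" that marks a closing delimiter. The variable names are ours as well.

```python
import email

# Sample modeled on the well-formatted email above.
# Assumptions: shortened base64 payload, explicit closing delimiter ("--" suffix).
RAW = (
    "To: firstname.lastname@example.org\n"
    "Subject: Well-formatted email\n"
    "Content-Type: multipart/alternative; boundary=0123456789boundaryabcdef\n"
    "\n"
    "--0123456789boundaryabcdef\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "\n"
    "This email is well-formatted.\n"
    "--0123456789boundaryabcdef\n"
    "Content-Type: text/html; charset=UTF-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<p>This email is well-formatted.</p>\n"
    "--0123456789boundaryabcdef\n"
    'Content-Type: application/zip; name="attachmentthatwasthere.zip"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "UEsDBBUAAAAJAIZRREm9\n"
    "--0123456789boundaryabcdef--\n"
)

message = email.message_from_string(RAW)

# The main type and the boundary both come from the Content-Type header.
main_type = message.get_content_maintype()
boundary = message.get_boundary()

# walk() visits the container first and then every MIME part inside it;
# we keep only the leaf parts.
part_types = [
    part.get_content_type()
    for part in message.walk()
    if not part.is_multipart()
]

print(main_type, boundary)
print(part_types)
```

For a well-formed message like this one, walk() reaches all three parts, including the zip attachment.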
With a very simple modification, the attachment called “attachmentthatwasthere.zip” can be hidden:
To: email@example.com
Subject: Mal-formatted email
Content-Type: multipart/alternative; boundary=0123456789boundaryabcdef

--0123456789boundaryabcdef
Content-Type: text/plain; charset=UTF-8

This email is mal-formatted.
--0123456789boundaryabcdef
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<p>This email is mal-formatted.</p>
--0123456789boundaryabcdef

--0123456789boundaryabcdef
Content-Type: application/zip; name="attachmentthatwasntthere.zip"
Content-Transfer-Encoding: base64
Content-ID: <004601d21e34$4b05ce10$3703a8c0@25OW9ZE>

UEsDBBUAAAAJAIZRREm9/p/AY+YBAFeyAgAwAAAASU1HLTIwMTYxMDAzLVdBMDAwMStJTUct
MjAxNjEwMDktV0EwMDAyLmpwZWcuZXhl8o2awMDMwMDAAsT//zMw7GCAAAcGQgCilk9+Fx/D
--0123456789boundaryabcdef
The empty line after the extra --0123456789boundaryabcdef causes parsers to think that the end of the email has been reached. Email readers treat this situation differently: Thunderbird identifies this as an error, ignores the offending boundary line and shows the attachment. Apple’s Mail, Microsoft Outlook for Mac 2011 and Microsoft Outlook 2010 (other versions have not been tested) do not show the attachment. What about web-based readers? Open-Xchange and Roundcube show it readily, but Gmail and Outlook Web do not.
And, as the attentive reader might have realized, Python’s email library does not “show” the attachment either. So how did we modify our parsers to recognize these attachments? Python’s email library stores everything after the (supposed) end of the email in an attribute called epilogue. By parsing this epilogue, hidden attachments can be detected (for a full implementation see https://github.com/gryphius/fuglu/blob/2ffd57ef876dcb553cb7b84b37f4a9d82cabe07a/fuglu/src/fuglu/plugins/attachment.py#L748-L767):
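import email

# Fragment from inside a generator function; `message` is the parsed email.
boundary = message.get_boundary()
epilogue = message.epilogue
# Nothing to recover if there is no epilogue or it holds no boundary.
if epilogue is None or boundary is None or boundary not in epilogue:
    return
for candidate in epilogue.split(boundary):
    part_content = candidate.strip()
    # Only chunks that begin with a Content-* header look like MIME parts.
    if part_content.lower().startswith('content'):
        message = email.message_from_string(part_content)
        yield message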
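To try this out end to end, here is a self-contained sketch built around the snippet above. It rests on a few assumptions of ours: the sample message ends its visible parts with a closing delimiter (the boundary followed by "--"), which is the case in which Python’s parser moves everything that follows into the epilogue; the base64 payload is shortened; the function name parts_from_epilogue is hypothetical; and, as a small variation, the epilogue is split on "--" plus the boundary so that no stray dashes end up in the recovered payload.

```python
import email

# Sample modeled on the mal-formatted email above.
# Assumptions: the premature terminator is a closing delimiter
# ("--" + boundary + "--"), and the base64 payload is shortened.
RAW = (
    "To: email@example.com\n"
    "Subject: Mal-formatted email\n"
    "Content-Type: multipart/alternative; boundary=0123456789boundaryabcdef\n"
    "\n"
    "--0123456789boundaryabcdef\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "\n"
    "This email is mal-formatted.\n"
    "--0123456789boundaryabcdef\n"
    "Content-Type: text/html; charset=UTF-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<p>This email is mal-formatted.</p>\n"
    "--0123456789boundaryabcdef--\n"
    "\n"
    "--0123456789boundaryabcdef\n"
    'Content-Type: application/zip; name="attachmentthatwasntthere.zip"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "UEsDBBUAAAAJAIZRREm9\n"
    "--0123456789boundaryabcdef\n"
)


def parts_from_epilogue(message):
    """Yield messages parsed from boundary-separated chunks of the epilogue.

    Hypothetical helper name; the logic follows the snippet above.
    """
    boundary = message.get_boundary()
    epilogue = message.epilogue
    if epilogue is None or boundary is None or boundary not in epilogue:
        return
    # Variation on the snippet above: split on the full delimiter line.
    for candidate in epilogue.split("--" + boundary):
        part_content = candidate.strip()
        if part_content.lower().startswith("content"):
            yield email.message_from_string(part_content)


message = email.message_from_string(RAW)

# What the regular API reports: the zip part is missing here.
visible_types = [
    part.get_content_type()
    for part in message.walk()
    if not part.is_multipart()
]

# What the epilogue scan recovers.
hidden = list(parts_from_epilogue(message))

print(visible_types)
print([(part.get_content_type(), part.get_filename()) for part in hidden])
```

In this sketch walk() only reports the two text parts, while the epilogue scan yields the zip part together with its file name.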
To conclude: Was this a clever move by the spammers? The answer is – err – no, we think. If you want to reach the inbox of as many recipients as possible, it does not really help you that many of the recipients do not even see the malicious attachment. If you only want to reach the inbox of recipients using a particular brand of email reader, then this might not help you either: Antispam engines have several layers to recognize spam emails – malformation is one of the indicators. So, probably, this is a bug and was not intentional.
Still, it always is amazing how creative people can be, and how differently applications behave if standards are not respected. We are looking forward to the next trick (or bug) and continue with our minions to analyze spam for malicious content in order to protect Swiss universities.