Using Curl to Retrieve Malicious Websites

Published: 2010-01-20. Last Updated: 2010-01-20 22:04:25 UTC
by Lenny Zeltser (Version: 2)

Here's how to use Curl to download content from potentially malicious websites, and why you may want to use this tool instead of the more common Wget.

Curl and Wget are excellent command-line tools for Windows and Unix. They can download remote files and save them locally without attempting to display or render them. As a result, these tools are handy for retrieving files from potentially malicious websites for local analysis--the small feature set of these utilities, compared to traditional Web browsers, minimizes the vulnerability surface.

Both Curl and Wget support the HTTP, HTTPS and FTP protocols, and allow the user to define custom HTTP headers, which malicious websites may examine before attempting to attack the visitor (more on that below). Curl also supports other protocols you might find useful, such as LDAP and SFTP; however, these protocols are rarely used by analysts when examining the content and code of malicious websites.
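
As a quick illustration of the custom header support, both tools let you override headers directly on the command line; the header values and URL below are just placeholders:

$ curl -H "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" -H "Referer: http://www.google.com/" http://www.example.com/page

$ wget --user-agent="Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" --referer="http://www.google.com/" http://www.example.com/page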

Overall, the two tools are similar when it comes to retrieving remote website files. However, the one limitation of Wget that is relevant for analyzing malicious websites is its inability to display the contents of remote error pages. These error pages might be fake and contain attack code. Curl will retrieve their full contents for your review; Wget will simply display the HTTP error code.

Consider this example that uses Wget:

$ wget http://www.example.com/page

Resolving www.example.com...
Connecting to www.example.com:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2010-01-19 05:37:11 ERROR 404: Not Found. 

Many analysts assume that the malicious web page is gone when they see this. However, consider the same connection made with Curl:

$ curl http://www.example.com/page

<HTML>
<HEAD><TITLE>404 Not Found</TITLE></HEAD>
<BODY>
<H2>404 Not Found</H2>
<SCRIPT>
document.write("Hi there, bear!");
</SCRIPT>

<P>The requested URL was not found on this server.</P>
</BODY>
</HTML>

Now you can see that the error page is an HTML document with JavaScript embedded in it. In this example, the script simply prints a friendly greeting; however, it could have been malicious. The victim's browser would render the page and execute the script, which could implement an attack.

Another useful feature of Curl is its ability to save the headers that the remote web server supplied when responding to the HTTP request. This matters because some JavaScript obfuscation techniques make use of information about the page and its context, such as its last-modified time. Saving the headers allows the analyst to use this information if and when it becomes necessary. Use the "-D" parameter to specify the filename where the headers should be saved:

$ curl http://www.example.com/page -D headers.txt

<HTML>
<HEAD><TITLE>404 Not Found</TITLE></HEAD>
...

$ cat headers.txt

HTTP/1.1 404 Not Found
Server: Apache/2.0.55
Content-Type: text/html; charset=iso-8859-1
Date: Wed, 19 Jan 2010 05:51:44 GMT
Last-Modified: Wed, 19 Jan 2010 03:51:44 GMT
Accept-Ranges: bytes
Connection: close
Cache-Control: no-cache,no-store

If you wish Curl to also save the retrieved page to a file, instead of sending it to STDOUT, use the "-o" parameter, or simply redirect STDOUT to a file using ">". This is particularly useful when retrieving binary files, or when the web server responds with an ASCII file that it compressed automatically. If you're not sure about the type of the file you obtained, check it using the Unix "file" command or the TrID utility (available for Windows and Unix).
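
For instance, a minimal sketch (page.bin is an arbitrary filename):

$ curl http://www.example.com/page -o page.bin
$ file page.bin

If "file" reports a compressed archive or an executable rather than HTML text, you know to handle the download accordingly before opening it.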

Update: Didier Stevens mentioned that passing the "-d -o" parameters to Wget allows him to capture full HTTP request and response details in the specified log file. However, this does not seem to address the issue of Wget not displaying the contents of HTTP error pages.
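
In other words, something along these lines (the log filename is arbitrary):

$ wget -d -o wget.log http://www.example.com/page
$ cat wget.log

The resulting log captures the request and response headers, but the body of the fake error page itself is still not saved.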

Whether using Curl or Wget to retrieve files from potentially malicious websites, consider what headers you are supplying to the remote site as part of your HTTP request. Many malicious sites look at the headers to determine how or whether to attack the victim, so if they notice Curl's or Wget's identifier in the User-Agent header, you won't get far. Malicious sites also frequently examine the Referer header to target users who came from specific sites, such as Google. Even if you define these headers, the absence of other, less important headers typically sent by traditional Web browsers could still give you away as an analyst.

I recommend creating a .curlrc or a .wgetrc file that defines the headers you wish these tools to supply. You can define these options on the command line when calling Curl and Wget, but I find it more convenient to use the configuration files. Consider using your own web server, "nc -l -p 80" (see the netcat sketch below), and/or a network sniffer to observe what headers a typical browser such as Internet Explorer sends, and define them in your .curlrc or .wgetrc file. Here's one example of a .curlrc file:

header = "Accept: image/gif, image/jpeg, image/pjpeg, image/pjpeg, application/x-ms-application, application/x-ms-xbap, application/vnd.ms-xpsdocument, application/xaml+xml, */*"
header = "Accept-Language: en-us"
header = "Accept-Encoding: gzip, deflate"
header = "Connection: Keep-Alive"

user-agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 3.0.04506.30)"
referer = "http://www.google.com/search?hl=en&q=web&aq=f&oq=&aqi=g1"

The syntax for .wgetrc is very similar, except you should not use quotation marks when defining each field.
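
Here's the netcat sketch mentioned above (traditional netcat syntax; some variants omit "-p", and listening on port 80 may require root):

$ nc -l -p 80

Point the test browser at the listening machine's address and the raw GET request, including every header the browser sends, prints to the terminal, ready to be copied into your .curlrc or .wgetrc file.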

You may need to tweak "user-agent" and "referer" fields for a specific situation. For more examples of User-Agent strings, see UserAgentString.com.

The "Accept-Encoding" specifies that your browser is willing to accept compressed files from the web server. This will slow you down a bit, because you'll need to decompress the responses (e.g., "gunzip"); however, it will make your request seem more legitimate to the malicious website.

There you have it--a few tips for using Curl (and Wget) for retrieving files from potentially malicious websites. What do you think?

 -- Lenny

Lenny Zeltser - Security Consulting
Lenny teaches malware analysis at SANS Institute. You can find him on Twitter.


Comments

When I use wget to retrieve malware, I also add options -d -o 02.log to record an extended log (debug log) of all interactions between wget and the server into file 02.log
Just looked this up: for curl, use option --trace 02.log
A technique I've used is to download and encrypt the binary on the fly to ensure it can be stored safely in an inert state for later analysis and so that AntiVirus won't automatically detect and delete it. Have used this in an emergency to download a malware sample through the corporate proxy in a safe way for later analysis.

The basic command is:
D:\curl>curl -L –x <proxy>:80 -U <domain>\<username> <URL> | openssl aes-256-cbc -out sample.enc -k password

To decrypt later in your lab:
D:\curl>openssl aes-256-cbc -d -in sample.enc -out sample.exe
The decryption password will be "password"
There's also WebSniffer (http://web-sniffer.net). It appears to do everything that Curl can do (at least in reference to the above diary entry). And, you're not putting yourself at risk when using it, as it is essentially a proxy.
