Using File Entropy to Identify "Ransomwared" Files

Published: 2016-08-08
Last Updated: 2016-08-08 01:17:49 UTC
by Rob VandenBrink (Version: 1)

Any engineer or physicist will tell you that entropy is like gravity: there's no fighting it, it's the law!  However, both can be used to advantage in lots of situations.

In the IT industry, a file's entropy refers to a specific measure of randomness called "Shannon Entropy", named for Claude Shannon.  This value is essentially a measure of how predictable any given byte in the file is, based on how often each byte value occurs in that file (full details and math here: http://rosettacode.org/wiki/Entropy).  In other words, it's a measure of the "randomness" of the data in a file, on a scale of 0 to 8 (8 bits in a byte), where typical text files will have a low value, and encrypted or compressed files will have a high value.
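For reference, the byte-level quantity that the entropy.c program later in this story computes can be written as below, where n_i is the number of times byte value i occurs in a file of N bytes, and byte values that never occur contribute nothing to the sum. The four-byte "AAAB" file is only an illustrative example, not a measurement from a real file.

```latex
H = -\sum_{i=0}^{255} p_i \log_2 p_i, \qquad p_i = \frac{n_i}{N}

% Illustrative example: the four bytes "AAAB" give p_A = 3/4 and p_B = 1/4
H_{\text{AAAB}} = -\left(\tfrac{3}{4}\log_2\tfrac{3}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right) \approx 0.811

% Every one of the 256 byte values equally likely: p_i = 1/256
H_{\text{uniform}} = -\sum_{i=0}^{255} \tfrac{1}{256}\log_2\tfrac{1}{256} = 8
```

So a highly repetitive file sits near the bottom of the scale, and data that looks uniformly random sits at the top.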

How can we use this concept?  On its own, it's not all that useful when you consider that many data file types (MS Office for one) are already compressed, and so already have a high entropy value.  So using entropy alone, there's no telling a good MS Office file from one encrypted by ransomware.

However, most data files have a specific file header, which includes a set of identification bytes (called "magic bytes") that identify what the data file is.  For instance, those bytes for PKZIP files are "PK", and PE32 executable files use "MZ".  You can identify these files by using the "file" program.  While this command is native to Linux, there are nice ports of it for Windows.
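As a rough illustration of what a magic-byte check does, here is a minimal C sketch. The two-entry table and the sniff_magic name are hypothetical and cover only the two signatures mentioned above; the real "file" command consults a much larger magic database and is not implemented this way.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, deliberately tiny signature table for illustration only. */
struct magic_sig {
    const char *name;   /* human-readable description */
    const char *bytes;  /* leading bytes that identify the type */
    size_t len;         /* number of bytes to compare */
};

static const struct magic_sig SIGS[] = {
    { "PKZIP archive", "PK", 2 },
    { "DOS/PE executable", "MZ", 2 },
};

/* Return a description if buf starts with a known signature, or "data"
   (the label "file" reports for unrecognised content) otherwise. */
const char *sniff_magic(const unsigned char *buf, size_t n)
{
    size_t i;

    assert(buf != NULL || n == 0);
    for (i = 0; i < sizeof SIGS / sizeof SIGS[0]; i++) {
        if (n >= SIGS[i].len && memcmp(buf, SIGS[i].bytes, SIGS[i].len) == 0)
            return SIGS[i].name;
    }
    return "data";
}
```

Passing the first few bytes of a file to sniff_magic returns "data" once the header has been overwritten, which is the condition the rest of this approach relies on.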

If "file" can't identify a file type, we can then use the entropy value as a second check.  If a file is encrypted, it will have a higher entropy value than one that isn't.  In this case you are looking for a file where the character distribution is random, or at least much more random than a "normal" data file.

You need both of these checks to identify suspect files.  As mentioned, today's complex data files are increasingly compressed: Office files are actually PKZIP compressed these days, so they will identify as PKZIP when checked with "file".

Using these two checks to identify suspect files depends on a couple of things:

  1. Ransomware encrypts the whole file, so it clobbers the magic bytes of data files when it encrypts them (at least all the variants that I've encountered)
  2. Ransomware generally employs a decent encryption algorithm, so the entropy will be "high".  So far this has been true - for instance, we're not seeing ransomware using simple XOR techniques (which would be terrible encryption, but would preserve a file's average entropy value).

Using these two checks, I started with a simple powershell script that copies a subdirectory from one location to another, but leaves all of the suspected infected files behind.  A log file is created that lists all files copied and all files that are suspect and should be looked at.  I used Sigcheck (from sysinternals) to compute the entropy value.  If the entropy is above 6 (8 is the maximum) and the magic bytes are unknown, I flag the file as "suspect".
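The two-part rule described above can be summarised in a few lines of C. This is only a sketch: is_suspect and ENTROPY_THRESHOLD are hypothetical names, and treating a value of exactly 6 as suspect is an assumption that mirrors the "less than 6 is OK" test in the PowerShell script below.

```c
#include <assert.h>

/* Threshold taken from the text above: 6 bits per byte (8 is the maximum). */
#define ENTROPY_THRESHOLD 6.0

/* Hypothetical helper: returns 1 if a file should be flagged as suspect, 0 otherwise.
   magic_known is nonzero when the file type was recognised from its magic bytes. */
int is_suspect(int magic_known, double entropy_bits)
{
    assert(entropy_bits >= 0.0 && entropy_bits <= 8.0);
    if (magic_known)
        return 0;   /* recognised type: high entropy is expected for compressed formats */
    return entropy_bits >= ENTROPY_THRESHOLD;
}
```

A recognised file type is never flagged, however high its entropy, which is why both checks are needed together.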

And for a test "ransomed" file, I'm using Didier Stevens' sample file: ransomed.bin (see the bottom of this story for links).

I used Powershell because most Ransomware affected shops that I've worked with have been infected in MS-Office files and other Windows data files, so Powershell seemed simpler than adding "install python" into a customer's crash IR situation.  You could certainly take this same logic and code an equivalent python script that would cover Windows, Linux and OSX.

After finishing my replication script and taking a step back, I realized a few things:

  • I was trying to recreate rsync, which does the replication job much better than I could ever hope to.
  • Using Sigcheck was adding some Powershell overhead to extract the "entropy" value from the output text, which was adding up to seconds and minutes for larger volumes.

So I rewrote the script to be more single-minded.  The final script simply lists suspect files rather than copying them.  And I wrote a short C program to compute the entropy value (thanks to Rosettacode for the starting point on this!), which simply spits out the numeric value, rather than lots of other stats I don't need for this job.

The final output list of suspect files can be used in a few ways:

  • It can be used as an exception list, to tell rsync which files to skip in a replication job.  This starts to approach some commercial backup, storage and replication products, which have product features that allow you to create a "sanitized" copy of a data volume (these are generally new features, so your mileage may vary).
  • It can also be used to catch infection incidents early.  Once you are in the midst of IR, you can usually identify infected files by something more obvious - like the file extension is now "locky" for instance.  But if you don't know that you're infected, you won't know which IOC to look for - this can help identify the problem early.
  • For many of us, the prevalence of malware that uses encryption has changed the focus of our jobs.  We're not just protecting the perimeter and the workstations, we're now tasked with protecting the actual data (which philosophically should have been the case all along).  What this means is that we should start getting more familiar with the data in our organization, what the file type and size distribution is, how quickly it changes and what data is actually in the files.  Hopefully the approach I've outlined can be helpful in this effort.
  • Lastly, I'll be using this to extend the "file" command to include more file types.  I've added the ID for 7zip files already, but look for a story on this project sometime in the next few weeks.  So far I've identified that I need to add file definitions for wma, mobi, m4a, epub, pgp, gzip / tgz and mp3/mp4 files.  VMware Workstation "appicon" files (and several other workstation files) also aren't correctly identified.  If you find more filetypes that are not correctly detected please let us know via the comment form.  The more effective the "file" command is, the more effective this scanning approach will be.

identify-ransomed-files.ps1

<#

.SYNOPSIS
This is a simple Powershell script to recursively scan a subdirectory tree, and identify any files that:
a/ do not have "magic bytes" that identify them as a specific file type to the "file" command
b/ have a high value for Shannon Entropy (between 6 and 8), indicating that they are likely either compressed or encrypted

.DESCRIPTION
identify-ransomed-files.ps1

The default path is ".", the current working directory

.EXAMPLE
identify-ransomed-files.ps1 [root of subdirectory path]


.LINK
https://github.com/robvandenbrink/Ransomware-Scan-and-Replicate

#>


function checkfile($FNAME) {
# check file magic char for type - unknown file type has "data" as type
$cmd = "file -m c:\utils\magic `"" + $FNAME + "`""
$ftype = iex $cmd
$copy = 0
if ($ftype -notlike "*; data*") {$copy = 1 ; return $copy}

# check on entropy. If less than 6, it's not encrypted well and is not compressed well, so it's likely data

# this entropy computation method uses sysinternal's sigcheck  
#
#$cmd = "sigcheck -a `"" + $FNAME + "`""
#$a = (iex $cmd | select-string -pattern "Entropy:") -split ":"

# this entropy computation method uses dedicated C code, slightly faster  
#
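$cmd = "entropy `"" + $FNAME + "`""
$a = iex $cmd  
# entropy prints a single number on one line, so convert the first line of output
# (the commented-out sigcheck variant above needs $a[1] after the -split instead)
$entropy = [double]($a | select-object -first 1)
$copy = 0
if ($entropy -lt 6) {$copy = 1 }
return $copy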
}


if ($args[0]) { $src = $args[0] }
else { $src = "." }

 

$files = Get-ChildItem -recurse $src
$files | foreach {
  # directories
  if ($_.psiscontainer) {
       # nothing to do here
       }

  else {
     $go = checkfile($_.fullname)
     if ($go -eq 0) {
        write-host $_.fullname
        }
     }
  }

 

entropy.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/types.h>
#include <sys/stat.h>

 
int makehist(FILE *fh,int *hist,int len)
    {
    int wherechar[256];
    int i,j,histlen,buflen;
    unsigned char c[102400];          /* define a reasonable buffer to read the file - 1 byte at a time is too slow  */
    histlen=0;
    for(i=0;i<256;i++)wherechar[i]=-1;
    for(i=0;i<len;)                   /* i is advanced inside the inner loop, once per byte read */
        {
        buflen = fread(c,sizeof(unsigned char),sizeof(c),fh);
        if(buflen<=0)break;           /* stop on a short file or read error rather than looping forever */
        for(j=0;j<buflen;j++)
            {
            if(wherechar[(unsigned int)c[j]]==-1)
                {
                wherechar[(unsigned int)c[j]]=histlen;
                histlen++;
                }
            ++i;
            hist[wherechar[(unsigned int)c[j]]]++;
            }
        }
    return histlen;
    }
 

double entropy(int *hist,int histlen,int len){
    int i;
    double H;
    H=0;
    for(i=0;i<histlen;i++){
        H-=(double)hist[i]/len*log2((double)hist[i]/len);
    }
    return H;
}

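int main(int argc , char *argv[])
{
    FILE *fh;
    struct stat fileinfo;
    long fsz;
    int *hist,histlen;
    double H;
    if (argc < 2)
    {
        fprintf(stderr,"Usage: entropy <filename>\n");
        return 1;
    }
    if ((fh = fopen(argv[1],"rb")) == NULL )
    {
        //report on stderr and stop, so the caller never parses an error message as a number
        fprintf(stderr,"Error opening file %s\n", argv[1]);
        return 1;
    }
    fstat(fileno(fh),&fileinfo);
    fsz = fileinfo.st_size;
    //only 256 distinct byte values are possible, so 256 counters are enough
    hist=(int*)calloc(256,sizeof(int));
    histlen=makehist(fh,hist,fsz);
    //hist now has no order (known to the program) but that doesn't matter
    H=entropy(hist,histlen,fsz);
    fclose(fh);
    free(hist);
    printf("%lf\n",H);
    return 0;
}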

From this script, you can see that "cleaning" ransomware infected files isn't an insurmountable problem.  A simple script like this feeding rsync can be used to create a "clean" copy of a datastore, and identify suspect files.  Just be SURE to keep up with evaluating that "suspect" file list - as noted, depending on your data store there might be lots of clean files in that list at the moment (give me a few weeks to improve this).  If you run the script as-is to blindly feed rsync, you won't have a complete copy of your datastore.

As always, this was a one-evening coding effort, so I'm sure that there is more "elegant" Powershell syntax for one thing or another.  Also, you'll see that my C code chooses readability over efficiency in a few spots.  For either piece of code, if you find any errors, or if you identify better syntax to get the job done, please do use our comment form and let me know.  Of more interest, if you find this code useful in your environment and you want to see a "version 2" - let me know in the comments also!

As I work on this code, you'll find the most up-to-date version at: https://github.com/robvandenbrink/Ransomware-Scan-and-Replicate

Didier's diaries on Ransomware and Entropy can be found here:
https://isc.sans.edu/forums/diary/Ransomware+Entropy/20271/
https://isc.sans.edu/diary/Ransomware+%26+Entropy%3A+Your+Turn/20321

===============
Rob VandenBrink
Compugen


Comments

I'm trying to get this to work. How do I set it up and run it? Am I just supposed to be running identify-ransomed-files.ps1?

Thank you Rob for such a detailed implementation. I did something similar for scanning network file shares.
Instead of powershell, I used Ent and a Bash script. From http://www.fourmilab.ch/random/
There is a Windows version as well.

Using a combination of numbers such as Entropy, Chi Square, and the Arithmetic Mean... I was able to consistently differentiate all Microsoft Office files and their encrypted versions. So much better than using only Entropy.

The only file types that still look too random, are PNG and GIF files. Very compressed images. JPGs haven't been a problem. I also have had some trouble with false positives with compressed archives like gzip, 7zip, winzip, etc.
For those, I do use the *nix "file" command to get the file signature, and just ignore.
I only rely on the magic bytes for these very few false positives.

I don't want to use magic bytes or any other "expected signature" as a first check. I would rather it be a last check, for a few false positives.
One reason for this is because I don't know when Ransomware authors will eventually start leaving file signatures alone. Many already have stopped changing file extensions. Although it may be quicker to just encrypt the whole file, it may be worth it to skip the header in an attempt to be stealthy.

Also, you should keep in mind that reading the entire file is not necessary. I perform the entropy check only on a couple of select 1 kilobyte blocks inside the file. This means it is scalable and just as fast on large files, as small files.


My continued concern, which will evade this type of detection, is Ransomware that merely uses a common archive program and its built-in crypto to simply "zip" up a group of files with passwords. This encryption will be detected by both our methods, but will be ignored because of the check for common compressed file types. At that point we'll need behavioral analysis to see if this could be a slow human vs. a fast Ransomware program. If the Ransomware is written to behave, and zip/encrypt as a human would... then this cat and mouse may have a dead end.

Perhaps if we had a way to determine whether the encryption is done with a public key (common for Ransomware) or shared key (common in archiving software). But I've heard of fast Ransomware encrypting with a shared key (symmetrical), and then encrypting that shared key with a public key (asymmetrical) to be ransomed.
