Detecting Random - Finding Algorithmically chosen DNS names (DGA)

Published: 2015-07-09. Last Updated: 2015-07-09 12:46:55 UTC
by Mark Baggett (Version: 3)
6 comment(s)

Most normal user traffic communicates via a hostname and not an IP address.   So looking at traffic communicating directly by IP with no associated DNS request is a good thing do to.   Some attackers use DNS names for their communications.   There is also malware such as Skybot and the Styx exploit kit that use algorithmically chosen host name rather than IP addresses for their command and control channels.  This malware uses what has been called DGA or Domain Generation Algorithms to create random looking host names for it's TLS command and control channel or to digitally sign it's SSL certificates.   These do not look like normal host names.   A human being can easily pick them out of our logs and traffic, but it turns out to be a somewhat challenging thing to do in an automated process.  "Natural Language Processing" or measuring the randomness don't seem to work very well.  Here is a video that illustrates the problem and one possible approach to solving it.

One way you might try to solve this is with a tool called ent.   "ent" a great Linux tool for detecting entropy in files.  Consider this:

[~]$ head -c 10000000 /dev/urandom | ent

Entropy = 7.999982 bits per byte.       <--  8 = random

[~]$ python -c "print 'A'*1000000" | ent
Entropy = 0.000021 bits per byte.
             <--  0 = not random

So 8 is highly random and 0 is not random at all.  Now lets look at some host names.

[~]$ echo "google" | ent
Entropy = 2.235926 bits per byte.
[~]$ echo "clearing-house" | ent
Entropy = 3.773557 bits per byte.   
    <-  Valid hosts are in the 2 to 4 range

Google scores 2.23 and clearing-house scores 3.7.  So it appears as though legitimate host names will be in the 2 to 4 range.   Now lets try some host names that we know are associated with malware that uses random host names.

[~]$ echo "e6nbbzucq2zrhzqzf" | ent
Entropy = 3.503258 bits per byte.
[~]$ echo "sdfe3454hhdf" | ent
Entropy = 3.085055 bits per byte.  
    <- Malicious host from Skybot and Styx malware are in the same range as valid hosts

That's no good.  Known malicious host names are also in the 2 to 4 range.  They score just about the same as normal host names.  We need a different approach to this problem. 

Normal readable English has some pairs of characters that appear more frequently than others. "TH", "QU" and "ER" appear very frequently but other pairs like "WZ" appear very rarely.   Specifically, there is approximately a 40% chance that a T will be followed by an H.  There is approximately a 97% change that a Q will be followed by the letter U.  There is approximately a 19% chance that E is followed by R.   With regard to unlikely pairs, there is approximately a 0.004% chance that W will be followed by a Z.   So here is the idea, lets analyze a bunch of text and figure out what normal looks like.  Then measure the host names against the tables.   I'm making this script and a Windows executable version of this tool available to you to try it out.  Let me know how it works.    Here is a look at how to use the tool.

Step 1) You need a frequency table.  I shared two of them in my github if you want to use them you can download them and skip to step 2.

1a)  Create the table:  I'm creating a table called custom.freq.  Create a table with the --create option

C:\freq>freq.exe --create custom.freq

1b)  You can optionally turn ON case sensitivity if you want the frequency table to count uppercase letters and lowercase letters separately.  Without this option the tool will convert everything to lowercase before counting character pairs.  Toggle case sensitivity with -t or --toggle_case_sensitivity

C:\freq>freq.exe -t custom.freq

1c) Next fill the frequency table with normal text.  You might load it with known legitimate host names like the Alexa top 1 million most commonly accessed websites. (http://s3.amazonaws.com/alexa-static/top-1m.csv.zip)  I will just load it up with famous works of literature. The --normalfile argument is used to load the table with a text file containing normal text.

C:\freq>for %i in (txtdocs\*.*) do freq.exe --normalfile %i custom.freq
C:\freq>freq.exe --normalfile txtdocs\center_earth custom.freq
C:\freq>freq.exe --normalfile txtdocs\defoe-robinson-103.txt custom.freq
C:\freq>freq.exe --normalfile txtdocs\dracula.txt custom.freq
C:\freq>freq.exe --normalfile txtdocs\freck10.txt custom.freq
C:\freq>freq.exe --normalfile txtdocs\invisman.txt custom.freq

Step 2) Measure badness!

Once the frequency table is filled with data you can start to measure strings to see how probable they are according to our frequency tables.  The --measure option is used for this purpose.

C:\freq>freq.exe --measure "google" custom.freq
6.59612840648
C:\freq>freq.exe --measure "clearing-house" custom.freq
12.1836883765

So normal host names have a probability above 5 (at least these two and most others do).  We will consider anything above 5 to be good for our tests.  Now lets feed it the host name we know are associated with malware.

C:\freq>freq.exe --measure "asdfl213u1" custom.freq
3.15113061843
C:\freq>freq.exe --measure "po24sf92cxlk" custom.freq
2.44994506765

Our malicious hosts are less than 5.  5 seems to be a pretty good benchmark.  In my testing it seems to work pretty well for picking out these abnormal host names.  But it isn't perfect.  Nothing is.   One problem is that very small host names and acronyms that are not in the source files you use to build your frequency tables will be below 5.  For example, "fbi" and "cia" both come up below 5 when I just use classic literature to build my frequency tables.  But I am not limited to classic literature.  That leads us to step 3.

Step 3) Tune for your organization.

The real power of frequency tables is when you tune it to match normal traffic for your network.  The program has two options for this purpose; --normal and --odd.   --normal can be given a normal string and it will update the frequency table with that string.   Both --normal and --odd can be used with the --weight option to control how much influence the given string has on the probabilities in the frequency table.  It's effectiveness is demonstrated by the accompanying youtube video.  Note that marking "random" host names as --odd is not a good strategy.  It simply injects noise into the frequency table.   Like everything else in security identifying all the bad in the world is a losing proposition.   Instead focus on learning normal and identifying anomalies.  So passing --normal "cia" --weight 10000 adds 10000 counts of the pair "ci" and the pair "ia" to the frequency table and increases the probability of "cia" to some number above 5..

C:\freq>freq.exe --normal "cia" --weight 10000 custom.freq

The source code and a Windows Executable version of this program can be downloaded from here: https://github.com/MarkBaggett/MarkBaggett/tree/master/freq

Tomorrow I in my diary I will show you some other cool things you can do with this approach and how you can incorporate this into your own tools.

Follow me on twitter @MarkBaggett

Want to learn to use this code in your own script or build tools of your own?  Join me for Python SEC573 in Las Vegas this September 14th!  Click here for more information.

What do you think?  Leave a comment.

Keywords:
6 comment(s)

Comments

What about the legitimate DNS entries that are randomly generated, like:

r20---sn-vgqsen7s.googlevideo.com
st14p04sa.guzzoni-apple.com.akadns.net
lb2-271289844.us-east-1.elb.amazonaws.com
d1y6jrbzotnyjg.cloudfront.net
d2o307dm5mqftz.cloudfront.net
0872255533d5cc533fb862f22302bb41.azr.msnetworkanalytics.testanalytics.net
ba8b8fc50762644213f59fa9156db8a6.nrb.footprintdns.com
3a77cc93e75ca0f4058c81c7f48d39d0.azr.msnetworkanalytics.testanalytics.net

I assume you whitelist some. But, I also assume that cloudfront and the Amazon cloud are frequently the source of things we don't want.
Those will be detected as well. Id be happy to add whitelisting capability to the tool if people find that beneficial. Try it out. Let me know how it works for you and what you would like it to differently.

Mark
Some pre-processing might help. If you extract the TLD, you could probably rule out any TLD that's really short like fbi, cia. Most of those are taken, at least for .com.
Another code to detect DGA with machine learning
http://nbviewer.ipython.org/github/ClickSecurity/data_hacking/tree/master/dga_detection/
better to exclude small length size and handle a few things separately like IDN or new gTLD.
[quote=comment#34517]What about the legitimate DNS entries that are randomly generated, like:

r20---sn-vgqsen7s.googlevideo.com
[/quote]
echo vgqsen7s | tr "0-9a-z" "uzpkfa50vqlgb61wrmhc72xsnid83ytoje94"
ord31s03
it's not exactly random :)
How did you come up with that character set to replace the other characters? I saw it was from Maurizio M. Munafo, but, not sure how he developed it either.

Diary Archives