freq.py super powers?

Published: 2015-07-10
Last Updated: 2015-07-10 03:08:30 UTC
by Mark Baggett (Version: 1)
1 comment(s)

Look, up in the URL.  It's a BASE64, it's a URI.  This looks like a job for freq.py.

This diary is a follow-up to yesterday's post on using freq.py to find DGA (Domain Generation Algorithm) host names in your logs using frequency tables.  If you didn't see that post, you should review it before continuing.  You can find that diary here:  https://isc.sans.edu/forums/diary/Detecting+Random+Finding+Algorithmically+chosen+DNS+names+DGA/19893/

Fellow SANS Instructor/GSE Kevin Fiscus suggested another use for the tool last week when I showed him what I had been working on.  Kevin pointed out that it is difficult to find BASE64 encoded strings programmatically.  Because BASE64 uses uppercase letters, lowercase letters, numbers, slashes and plus signs, a program has a hard time telling the difference between a BASE64 encoded string and a URI (the part of the URL after the domain name).  Consider these two strings that you might find embedded in a suspicious document:

String 1: Q09NRSBUTyBNWSBQWVRIT04gQ0xBU1MgSU4gVkVHQVMhISEh
String 2: forums/diary/Detecting+Random+Finding+Algorithmically+chosen+DNS+names+DGA/19893
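
A quick format check shows why this is hard. The sketch below assumes Python 3 (the interactive session later in this diary is Python 2.7), and the helper name is_well_formed_base64 is my own illustration, not part of freq.py. It tests only the character set and the length, and it accepts both strings.

```python
import re

# Assumption: Python 3. The helper below is illustrative and not part of freq.py.
STRING_1 = "Q09NRSBUTyBNWSBQWVRIT04gQ0xBU1MgSU4gVkVHQVMhISEh"
STRING_2 = "forums/diary/Detecting+Random+Finding+Algorithmically+chosen+DNS+names+DGA/19893"

# Standard BASE64 alphabet, with up to two '=' padding characters at the end.
_BASE64_RE = re.compile(r"[A-Za-z0-9+/]*={0,2}")


def is_well_formed_base64(text):
    """Return True if text uses only BASE64 characters and its length is a multiple of 4."""
    return len(text) % 4 == 0 and _BASE64_RE.fullmatch(text) is not None


for name, value in (("String 1", STRING_1), ("String 2", STRING_2)):
    # Prints the label, the string length, and whether the format check passes.
    print(name, len(value), is_well_formed_base64(value))
```

Both lines print True, so a format check alone cannot tell the URI from the encoded string.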

Both of these are properly formatted BASE64 encoded strings.  But only one of them is intentionally BASE64 encoded, and only one of them will decode to something other than random garbage.  Our frequency table works nicely to tell the BASE64 string from the URI.  This time a HIGH probability (above 5) indicates that the string is NOT BASE64 encoded, and a LOW probability (below 5) indicates that it is.

You can do this from the command line with "python freq.py --measure 'suspect base 64 string' frequency_table.freq", or with freq.exe on Windows, using the tools available for download here.  But we used the command line yesterday, and I want to show you another way to use the Python script.  This time let's import freq.py as a Python module.  First I start up a Python interactive session.  Then I import the freq module.  Note: What I type follows a prompt ($ or >>>).  The computer's output is on the lines without a prompt.

student@python-vm:~/freq$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:38)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from freq import *

Next I will assign the variable fc to hold a frequency counter object, then load my prebuilt "english_lowercase.freq" frequency table from the current working directory.

>>> fc = FreqCounter()
>>> fc.load("english_lowercase.freq")

Next I use the .probability() method to measure the probability of the two strings.  First String 2, then String 1.
>>> fc.probability('forums/diary/Detecting+Random+Finding+Algorithmically+chosen+DNS+names+DGA/19893')
9.490788394012279
>>> fc.probability('Q09NRSBUTyBNWSBQWVRIT04gQ0xBU1MgSU4gVkVHQVMhISEh')
2.578357325221811
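
Applying the cutoff of 5 described earlier to a batch of strings can be sketched as follows. This assumes Python 3, and flag_likely_base64 is a hypothetical helper of mine, not something freq.py provides. The score argument is any callable that returns a number for a string. To keep the sketch runnable without freq.py, a stand-in scorer is built from the two results shown in the session above.

```python
# Assumption: Python 3. flag_likely_base64 is a hypothetical helper, not part of freq.py.
THRESHOLD = 5.0  # cutoff used in this diary


def flag_likely_base64(strings, score, threshold=THRESHOLD):
    """Return the strings whose score falls below the threshold."""
    return [s for s in strings if score(s) < threshold]


# Stand-in scorer: the two scores reported in the interactive session above.
KNOWN_SCORES = {
    "forums/diary/Detecting+Random+Finding+Algorithmically+chosen+DNS+names+DGA/19893": 9.490788394012279,
    "Q09NRSBUTyBNWSBQWVRIT04gQ0xBU1MgSU4gVkVHQVMhISEh": 2.578357325221811,
}

# Iterating the dict yields its keys, and KNOWN_SCORES.get looks up each score.
flagged = flag_likely_base64(KNOWN_SCORES, KNOWN_SCORES.get)
print(flagged)
```

With freq.py imported, passing fc.probability as the score argument should behave the same way, assuming it returns a number as it does in the session above.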

The URI scores well above 5, indicating that it is not random text and, in this case, not BASE64.  The second string measured (String 1) scores about 2.6, indicating that it is more likely to be a BASE64 encoded string.  Once again, this isn't perfect.  You could still have URIs with random ASCII strings in them that score low, but it does help to differentiate between common URI strings and BASE64 encoded strings.  This is confirmed when we try to BASE64 decode each of the strings.  The URI decodes to junk and the other string decodes to pure gold!
>>> 'forums/diary/Detecting+Random+Finding+Algorithmically+chosen+DNS+names+DGA/19893'.decode("BASE64")
'~\x8a\xee\x9a\xcf\xdd\x89\xaa\xf2\xfc7\xady\xcbb\x9e\x0f\x91jwh\x9b\xe1b\x9d\xd8\xa7\x83\xe0%\x82\x8a\xe2\xb6\x19\xa2q\xa9e\xcb\xe7!\xa2\xc7\xa7\xf83R\xfav\xa6z\xcf\x83\x18\x0f\xf5\xf7\xcfw'
>>> 'Q09NRSBUTyBNWSBQWVRIT04gQ0xBU1MgSU4gVkVHQVMhISEh'.decode("BASE64")
'COME TO MY PYTHON CLASS IN VEGAS!!!!'
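
The .decode("BASE64") calls above rely on a Python 2 string codec. Python 3 str objects have no .decode() method, so if you are on Python 3 (an assumption, since this diary uses 2.7), a rough equivalent using the standard base64 module looks like this:

```python
import base64

# Assumption: Python 3, where the standard base64 module replaces the
# Python 2 str.decode("BASE64") codec used in the session above.
STRING_1 = "Q09NRSBUTyBNWSBQWVRIT04gQ0xBU1MgSU4gVkVHQVMhISEh"
STRING_2 = "forums/diary/Detecting+Random+Finding+Algorithmically+chosen+DNS+names+DGA/19893"

decoded_1 = base64.b64decode(STRING_1)  # bytes
decoded_2 = base64.b64decode(STRING_2)  # bytes

# String 1 decodes to readable ASCII text.
print(decoded_1.decode("ascii"))

# String 2 decodes without error, but the bytes are not ASCII text.
try:
    decoded_2.decode("ascii")
except UnicodeDecodeError:
    print("String 2 decodes to non-text bytes:", decoded_2[:8])
```

Both strings decode without raising an error, which is the point: a successful decode does not prove the string was meant to be BASE64.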

Follow me on Twitter @MarkBaggett

Want to learn to use this code in your own script or build tools of your own?  Join me for Python SEC573 in Las Vegas this September 14th!  Click here for more information.


Comments

This is good stuff Mark. Would love to see this work "in bulk"...awk out $9 from bro-ids dns.log, then bounce this against freq to see what is found.
