Scripting Web Categorization

Published: 2016-01-29. Last Updated: 2016-01-29 14:26:04 UTC
by Xavier Mertens (Version: 1)

When you are dealing with a huge amount of data, it can be very useful to enrich it with additional context. Examples:

Geolocation of IP addresses
Getting the DShield score of an IP address
Looking up domain names in lists of malicious domains
...

When you are processing many URLs during a security incident investigation, or while extracting IOCs from a malware sample or logs, it can also be very useful to categorize them. Categorization tags a URL with a label such as the classic "Adult Content", "Government", "Forums", etc. Many commercial solutions offer this feature, and it can be very powerful, for example to configure your firewall to deny access to non-business categories. But because the feature is integrated in closed solutions, it's not easy to reuse this information in your own scripts. For years, Bluecoat has offered a product called "K9" that helps to protect kids surfing the web. It's free: you can just get a license key and install the tool or... use the online API! I had to categorize a bunch of URLs, so I decided to take some time to write a few lines of Python to automate this task.

My script webcat.py fetches the defined categories at regular intervals (every two hours) and performs a lookup for each URL passed as an argument:

$ ./webcat.py isc.sans.org
isc.sans.org,Education
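
A periodic refresh like this can be driven by the age of a local copy of the category list. The following is only a minimal sketch of that idea, not an excerpt from webcat.py: the function names and the `fetch` callable are hypothetical, and the only value taken from the text is the two-hour interval.

```python
import os
import time

# Refresh interval mentioned in the text: two hours.
MAX_AGE_SECONDS = 2 * 60 * 60


def cache_is_stale(path, max_age=MAX_AGE_SECONDS, now=None):
    """Return True when the cache file is missing or older than max_age."""
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # No readable cache file yet: a fetch is needed.
        return True
    return (now - mtime) > max_age


def ensure_categories(path, fetch, force=False):
    """Rewrite the cache when forced or stale; return True if it was refreshed.

    `fetch` is a hypothetical callable returning the category list as text
    (in a real script it would query the online service).
    """
    if force or cache_is_stale(path):
        with open(path, "w") as handle:
            handle.write(fetch())
        return True
    return False
```

Relying on the file modification time keeps the script stateless between runs: no separate timestamp has to be stored.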

Multiple URLs can be passed on the same command line, or the script can be fed via STDIN if you use "-" as the parameter:

$ ./webcat.py isc.sans.edu blog.rootshell.be
isc.sans.edu,Education
blog.rootshell.be,Technology/Internet
$ cat suspicious-urls.tmp | ./webcat.py -
getmooresuccess.com,Business/Economy
weddingme.net,Business/Economy
riverbird.usa.cc,Malicious Outbound Data/Botnets
1ntershipping.co,Malicious Outbound Data/Botnets
secureemail.bz,Malicious Sources/Malnets
vsreviewsa.com,Malicious Sources/Malnets
felceconserve.com,Malicious Outbound Data/Botnets
flashsync.cf,Uncategorized
cy-m0ld.com,Malicious Outbound Data/Botnets
berettitdint.ru,Malicious Outbound Data/Botnets
vehanmace.ru,Malicious Outbound Data/Botnets
redderbest.gq,Uncategorized
googlemails.ga,Uncategorized
msportf1.com,Sports/Recreation
www.vai-t.com,Malicious Sources/Malnets
duotthenaning.ru,Malicious Sources/Malnets
duotthenaning.ru,Malicious Sources/Malnets
littrecdintoft.ru,Malicious Sources/Malnets
vsreviewsa.com,Malicious Sources/Malnets
doncglobal.com,Malicious Outbound Data/Botnets
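
Handling both input modes takes only a few lines. The sketch below is an illustration rather than the actual code of webcat.py: `iter_urls` is a hypothetical helper, and it assumes that "-" is the only parameter when STDIN is used and that the input contains one URL per line.

```python
import sys


def iter_urls(args, stdin=None):
    """Yield URLs from the argument list, or from stdin when args is ["-"]."""
    if stdin is None:
        stdin = sys.stdin
    # A single "-" means: read one URL per line from standard input.
    source = stdin if args == ["-"] else args
    for item in source:
        url = item.strip()
        if url:  # skip blank lines
            yield url


if __name__ == "__main__":
    for url in iter_urls(sys.argv[1:]):
        print(url)
```

Using a generator means a long list piped through STDIN is processed line by line instead of being loaded into memory first.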

The API returns a hexadecimal code corresponding to the web category. That's why the script fetches the category list at regular intervals and stores it in a local file:

$ ./webcat.py -h
usage: webcat.py [-h] [-f CACHEFILE] [-F] [URL [URL ...]]

Categorize URL using BlueCoat K9

positional arguments:
  URL                   the URL(s) to check. Format: fqdn[:port]

optional arguments:
  -h, --help            show this help message and exit
  -f CACHEFILE, --file CACHEFILE
                        Categories local cache file (default:
                        /var/tmp/categories.txt)
  -F, --force           force a fetch of categories
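
To show how these options and the hexadecimal codes fit together, here is a rough sketch of an argparse parser accepting the same options, plus a lookup of a code in the cached table. It is not taken from webcat.py: the cache format (one `HEXCODE,Label` pair per line) and the names `parse_categories` and `label_for` are assumptions made for the illustration, and the usage line printed by argparse may differ slightly between Python versions.

```python
import argparse


def build_parser():
    """Build a parser accepting the options listed in the help output."""
    parser = argparse.ArgumentParser(
        prog="webcat.py",
        description="Categorize URL using BlueCoat K9",
    )
    parser.add_argument("urls", metavar="URL", nargs="*",
                        help="the URL(s) to check. Format: fqdn[:port]")
    parser.add_argument("-f", "--file", dest="cachefile",
                        default="/var/tmp/categories.txt",
                        help="Categories local cache file "
                             "(default: %(default)s)")
    parser.add_argument("-F", "--force", action="store_true",
                        help="force a fetch of categories")
    return parser


def parse_categories(text):
    """Parse 'HEXCODE,Label' lines (an assumed cache format) into a dict."""
    table = {}
    for line in text.splitlines():
        code, sep, label = line.strip().partition(",")
        if sep and code:
            # Store codes as integers so "1b" and "1B" match the same entry.
            table[int(code, 16)] = label
    return table


def label_for(code_hex, table):
    """Return the label for a hexadecimal code, or None if it is unknown."""
    return table.get(int(code_hex, 16))
```

Converting the codes to integers avoids mismatches caused by letter case or a leading "0x" in the value returned by the service.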

Before using the script, you have to register to get your K9 license key and add it to the script (line 30).

Note: I'm not aware of any rate limit in place when querying the API. During my investigations, I was never blocked.
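
If you prefer to stay on the safe side with long lists, a small pause between lookups is easy to add. This is only a sketch: `paced` is a hypothetical helper and the default delay is an arbitrary value, not a limit documented by the service.

```python
import time


def paced(items, delay=0.5, sleep=time.sleep):
    """Yield items one by one, pausing `delay` seconds between them."""
    for index, item in enumerate(items):
        # No pause before the first item, only between consecutive ones.
        if index and delay > 0:
            sleep(delay)
        yield item
```

Wrapping the list of URLs in `paced(...)` leaves the lookup code itself unchanged.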

Xavier Mertens
ISC Handler - Freelance Security Consultant

Keywords: categorization url websites
