Scripting Web Categorization

Published: 2016-01-29
Last Updated: 2016-01-29 14:26:04 UTC
by Xavier Mertens (Version: 1)
1 comment(s)

When you are dealing with a huge amount of data, it can be very useful to enhance them by adding more valuable content. Example:

  • Geolocalization for IP addresses
  • Get an IP address DShield score
  • Lookup domain names in list of malicious domains
  • ...

When you are processing many URLs during a security incident investigation or while extracting IOC's from a malware sample or logs, it can also be very interesting to categorize them. The process of categorization helps to tag an URL with a label like the classic "Adult Content", "Government", "Forums", etc. Many commercial solutions offer this feature. It can be very powerful to configure your firewall to deny access to non-business categories. But, integrated in closed solutions, it's not easy to re-use them to benefit of this information in your own scripts. For years, Bluecoat has a product called "K9" that helps to protect kids surfing the web. It's free, you just can get a license key and install the tool or... use the online API!  I had to categorize a bunch of URLs , so I decided to take some time to write a few lines of Python to automate this task.

My script webcat.py fetches the defined categories at regular interval (every two hours) and perform a lookup for each URL passed as argument:

$ ./webcat.py isc.sans.org
isc.sans.org,Education

Multiple URLs can be passed on the same command line or the script can be fed via STDIN if you use "-" as parameter:

$ ./webcat.py isc.sans.org blog.rootshell.be
isc.sans.edu,Education
blog.rootshell.be,Technology/Internet
$ cat suspicious-urls.tmp | ./webcat.py -
getmooresuccess.com,Business/Economy
weddingme.net,Business/Economy
riverbird.usa.cc,Malicious Outbound Data/Botnets
1ntershipping.co,Malicious Outbound Data/Botnets
secureemail.bz,Malicious Sources/Malnets
vsreviewsa.com,Malicious Sources/Malnets
felceconserve.com,Malicious Outbound Data/Botnets
flashsync.cf,Uncategorized
cy-m0ld.com,Malicious Outbound Data/Botnets
berettitdint.ru,Malicious Outbound Data/Botnets
vehanmace.ru,Malicious Outbound Data/Botnets
redderbest.gq,Uncategorized
googlemails.ga,Uncategorized
msportf1.com,Sports/Recreation
www.vai-t.com,Malicious Sources/Malnets
duotthenaning.ru,Malicious Sources/Malnets
duotthenaning.ru,Malicious Sources/Malnets
littrecdintoft.ru,Malicious Sources/Malnets
vsreviewsa.com,Malicious Sources/Malnets
doncglobal.com,Malicious Outbound Data/Botnets

The API returns an hexadecimal code corresponding to the web category. That's why the script fetches them at regular interval and store them in a local file:

$ ./webcat.py -h
usage: webcat.py [-h] [-f CACHEFILE] [-F] [URL [URL ...]]

Categorize URL using BlueCoat K9

positional arguments:
  URL                   the URL(s) to check. Format: fqdn[:port]

optional arguments:
  -h, --help            show this help message and exit
  -f CACHEFILE, --file CACHEFILE
                        Categories local cache file (default:
                        /var/tmp/categories.txt)
  -F, --force           force a fetch of categories

Before using the script, you have to register to get your K9 license, add it to the script (line 30).

Note: I'm not aware of any rate-limit in place while querying the API. During my investigations, I was never blocked.

Xavier Mertens
ISC Handler - Freelance Security Consultant
PGP Key

1 comment(s)
ISC Stormcast For Friday, January 29th 2016 http://isc.sans.edu/podcastdetail.html?id=4845

Comments

What's this all about ..?
password reveal .
<a hreaf="https://technolytical.com/">the social network</a> is described as follows because they respect your privacy and keep your data secure:

<a hreaf="https://technolytical.com/">the social network</a> is described as follows because they respect your privacy and keep your data secure. The social networks are not interested in collecting data about you. They don't care about what you're doing, or what you like. They don't want to know who you talk to, or where you go.

<a hreaf="https://technolytical.com/">the social network</a> is not interested in collecting data about you. They don't care about what you're doing, or what you like. They don't want to know who you talk to, or where you go. The social networks only collect the minimum amount of information required for the service that they provide. Your personal information is kept private, and is never shared with other companies without your permission
https://thehomestore.com.pk/
<a hreaf="https://defineprogramming.com/the-public-bathroom-near-me-find-nearest-public-toilet/"> public bathroom near me</a>
<a hreaf="https://defineprogramming.com/the-public-bathroom-near-me-find-nearest-public-toilet/"> nearest public toilet to me</a>
<a hreaf="https://defineprogramming.com/the-public-bathroom-near-me-find-nearest-public-toilet/"> public bathroom near me</a>
<a hreaf="https://defineprogramming.com/the-public-bathroom-near-me-find-nearest-public-toilet/"> public bathroom near me</a>
<a hreaf="https://defineprogramming.com/the-public-bathroom-near-me-find-nearest-public-toilet/"> nearest public toilet to me</a>
<a hreaf="https://defineprogramming.com/the-public-bathroom-near-me-find-nearest-public-toilet/"> public bathroom near me</a>
https://defineprogramming.com/
https://defineprogramming.com/
Enter comment here... a fake TeamViewer page, and that page led to a different type of malware. This week's infection involved a downloaded JavaScript (.js) file that led to Microsoft Installer packages (.msi files) containing other script that used free or open source programs.
distribute malware. Even if the URL listed on the ad shows a legitimate website, subsequent ad traffic can easily lead to a fake page. Different types of malware are distributed in this manner. I've seen IcedID (Bokbot), Gozi/ISFB, and various information stealers distributed through fake software websites that were provided through Google ad traffic. I submitted malicious files from this example to VirusTotal and found a low rate of detection, with some files not showing as malware at all. Additionally, domains associated with this infection frequently change. That might make it hard to detect.
https://clickercounter.org/
Enter corthrthmment here...

Diary Archives