Scripting Web Categorization
When you are dealing with a huge amount of data, it can be very useful to enrich it by adding more valuable content. Examples:
- Geolocation of IP addresses
- The DShield score of an IP address
- Looking up domain names in lists of malicious domains
- ...
When you are processing many URLs during a security incident investigation, or while extracting IOCs from a malware sample or logs, it can also be very interesting to categorize them. Categorization tags a URL with a label like the classic "Adult Content", "Government", "Forums", etc. Many commercial solutions offer this feature, and it can be very powerful to configure your firewall to deny access to non-business categories. But because the categories are integrated in closed solutions, it is not easy to reuse this information in your own scripts. For years, Bluecoat has offered a product called "K9" that helps to protect kids surfing the web. It's free: you just have to get a license key and install the tool or... use the online API! I had to categorize a bunch of URLs, so I decided to take some time to write a few lines of Python to automate this task.
My script webcat.py fetches the defined categories at regular intervals (every two hours) and performs a lookup for each URL passed as an argument:
$ ./webcat.py isc.sans.org
isc.sans.org,Education
Multiple URLs can be passed on the same command line, or the script can be fed via STDIN if you use "-" as the parameter:
$ ./webcat.py isc.sans.org blog.rootshell.be
isc.sans.edu,Education
blog.rootshell.be,Technology/Internet
$ cat suspicious-urls.tmp | ./webcat.py -
getmooresuccess.com,Business/Economy
weddingme.net,Business/Economy
riverbird.usa.cc,Malicious Outbound Data/Botnets
1ntershipping.co,Malicious Outbound Data/Botnets
secureemail.bz,Malicious Sources/Malnets
vsreviewsa.com,Malicious Sources/Malnets
felceconserve.com,Malicious Outbound Data/Botnets
flashsync.cf,Uncategorized
cy-m0ld.com,Malicious Outbound Data/Botnets
berettitdint.ru,Malicious Outbound Data/Botnets
vehanmace.ru,Malicious Outbound Data/Botnets
redderbest.gq,Uncategorized
googlemails.ga,Uncategorized
msportf1.com,Sports/Recreation
www.vai-t.com,Malicious Sources/Malnets
duotthenaning.ru,Malicious Sources/Malnets
duotthenaning.ru,Malicious Sources/Malnets
littrecdintoft.ru,Malicious Sources/Malnets
vsreviewsa.com,Malicious Sources/Malnets
doncglobal.com,Malicious Outbound Data/Botnets
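The two input modes shown above (URLs as arguments, or "-" to read from STDIN) could be handled along these lines. This is only a minimal sketch in Python 3, not the code from webcat.py: the helper name collect_urls and its stream parameter are hypothetical, and the one-URL-per-line input format is an assumption.

```python
import io
import sys


def collect_urls(args, stream=None):
    """Return the list of URLs to check.

    Hypothetical helper (not taken from webcat.py): each argument is used
    as a URL, except "-", which means "read one URL per line from stream".
    """
    if stream is None:
        stream = sys.stdin
    urls = []
    for arg in args:
        if arg == "-":
            # Assumption: one URL per line, blank lines are ignored.
            urls.extend(line.strip() for line in stream if line.strip())
        else:
            urls.append(arg)
    return urls


# Demo with an in-memory stream standing in for STDIN.
demo = io.StringIO("getmooresuccess.com\n\nweddingme.net\n")
print(collect_urls(["-"], stream=demo))
```

Passing the stream as a parameter keeps the function easy to test without touching the real STDIN.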
The API returns a hexadecimal code corresponding to the web category. That's why the script fetches the categories at regular intervals and stores them in a local file:
$ ./webcat.py -h
usage: webcat.py [-h] [-f CACHEFILE] [-F] [URL [URL ...]]

Categorize URL using BlueCoat K9

positional arguments:
  URL                   the URL(s) to check. Format: fqdn[:port]

optional arguments:
  -h, --help            show this help message and exit
  -f CACHEFILE, --file CACHEFILE
                        Categories local cache file (default:
                        /var/tmp/categories.txt)
  -F, --force           force a fetch of categories
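The idea of a local category cache, refreshed every two hours or on demand with -F, can be sketched as follows. This is an illustration under stated assumptions rather than the logic of webcat.py itself: the function names are hypothetical, the cache is stored as JSON (the real format of categories.txt is not described here), the download step is left to a caller-supplied fetch function, and falling back to "Uncategorized" for an unknown code is a guess.

```python
import json
import os
import time

CACHE_MAX_AGE = 2 * 60 * 60  # two hours, the refresh interval mentioned above


def cache_is_stale(path, max_age=CACHE_MAX_AGE, now=None):
    """Return True if the cache file is missing or older than max_age seconds."""
    if now is None:
        now = time.time()
    try:
        return now - os.path.getmtime(path) > max_age
    except OSError:
        # No readable cache file yet: treat it as stale.
        return True


def load_categories(path, fetch, force=False):
    """Return a {hex_code: label} dict, refreshing the cache file if needed.

    `fetch` is a caller-supplied function returning that dict; how the
    category list is actually downloaded is deliberately left out.
    """
    if force or cache_is_stale(path):
        categories = fetch()
        with open(path, "w") as fh:
            json.dump(categories, fh)  # assumption: JSON cache format
        return categories
    with open(path) as fh:
        return json.load(fh)


def category_name(categories, hex_code):
    """Map a hexadecimal code returned by the API to its label."""
    # Assumption: codes are stored lowercase; unknown codes get a default label.
    return categories.get(hex_code.lower(), "Uncategorized")
```

A caller would pass the value of -f as path and the -F flag as force, so that a forced fetch bypasses the age check.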
Before using the script, you have to register to get your K9 license and add it to the script (line 30).
Note: I'm not aware of any rate limit in place when querying the API. During my investigations, I was never blocked.
Xavier Mertens
ISC Handler - Freelance Security Consultant
Comments
C:\Python27>python webcat.py -F www.dshield.org
Traceback (most recent call last):
  File "webcat.py", line 133, in <module>
    main()
  File "webcat.py", line 107, in main
    webCats = fetchCategories(args.cacheFile)
  File "webcat.py", line 43, in fetchCategories
    data = json.load(r)
  File "C:\Python27\lib\json\__init__.py", line 290, in load
    **kw)
  File "C:\Python27\lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\lib\json\decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python27\lib\json\decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Anonymous
Feb 29th 2016