OpenAI Scans for Honeypots. Artificially Malicious? Action Abuse?

Published: 2024-08-22. Last Updated: 2024-08-22 17:01:37 UTC
by Johannes Ullrich (Version: 1)

For a while now, I have seen scans that contain the pattern "%%target%%" in the URL. For example, today this particular URL is popular:

/%%target%%/wp-content/themes/twentytwentyone/style.css

I have been ignoring these scans so far. The "wp-content" in the URL suggests that this is yet another stupid WordPress scan for maybe the plugin vulnerability of the day. "twentytwentyone" points to a popular WordPress theme that apparently can, HOLD YOUR BREATH, be used for version disclosure [1]. In short, this is the normal stupid stuff that I usually do not waste time on. Running WordPress with random themes and plugins? Good luck. I hope you at least add a "!" at the end of your password (which must be "password") to make it so much more secure.

The scan itself looked broken. The "%%target%%" pattern looked like it was supposed to be replaced with something.
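
A quick way to spot requests like this in your own logs is to look for template tokens that were never filled in. The following is a minimal sketch, assuming the token always has the form %%name%% as in the URL above; the function name and the regular expression are illustrative and not part of any existing tool.

```python
import re

# Matches an unreplaced template token such as "%%target%%".
# Assumption: token names look like identifiers (letters, digits, underscore).
PLACEHOLDER = re.compile(r"%%([A-Za-z_][A-Za-z0-9_]*)%%")


def find_placeholders(path: str) -> list:
    """Return the names of any %%name%% tokens left in a request path."""
    return PLACEHOLDER.findall(path)


print(find_placeholders("/%%target%%/wp-content/themes/twentytwentyone/style.css"))
# prints ['target']
print(find_placeholders("/wp-content/themes/twentytwentyone/style.css"))
# prints []
```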

So stupid hackers scanning stupid WordPress installs. I ignored it.

Leave it up to Xavier to educate me that this isn't stupid but artificially intelligent!

[Image: Slack message from Xavier Mertens about a user-agent containing GPTBot]

So, as it turns out, these scans come predominantly from systems that identify themselves as part of OpenAI's content-stealing machine. In their battle to keep up with Google's indexing prowess, OpenAI has decided that more is better and is now scanning random IPs, including honeypots, for content. Another option may be that ChatGPT "actions" can be used to trigger these scans.

The easiest way to identify OpenAI's bots is the user-agent:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)

I looked at all scans containing "%%target%%" since July, and almost all of them can be attributed to OpenAI.

[Figure: scans for "%%target%%" per day, showing that they almost exclusively originate from OpenAI]

The graph shows the total number of scans for "%%target%%" each day in orange, and the scans originating from OpenAI in blue. For almost all days, the scans from OpenAI explain almost all the scans for "%%target%%" URLs.
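
The two series in the graph can be reproduced from any request log with a simple tally. This is a rough sketch, not the query behind the graph: it assumes log records already parsed into (date, path, user-agent) tuples, it attributes a request to OpenAI purely by the "GPTBot/" token in the user-agent, and the sample records (including the curl entries) are made up for illustration.

```python
from collections import Counter

GPTBOT_UA = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
             "compatible; GPTBot/1.2; +https://openai.com/gptbot)")

# Hypothetical sample records: (date, request path, user-agent).
records = [
    ("2024-08-20", "/%%target%%/wp-content/themes/twentytwentyone/style.css", GPTBOT_UA),
    ("2024-08-20", "/%%target%%/wp-content/themes/twentytwentyone/style.css", "curl/8.0"),
    ("2024-08-21", "/%%target%%/wp-content/themes/twentytwentyone/style.css", GPTBOT_UA),
    ("2024-08-21", "/robots.txt", "curl/8.0"),
]


def daily_counts(rows):
    """Count "%%target%%" scans per day: all sources, and GPTBot only."""
    total, from_gptbot = Counter(), Counter()
    for day, path, user_agent in rows:
        if "%%target%%" not in path:
            continue  # not one of the scans we are interested in
        total[day] += 1
        if "GPTBot/" in user_agent:
            from_gptbot[day] += 1
    return total, from_gptbot


total, from_gptbot = daily_counts(records)
for day in sorted(total):
    print(day, total[day], from_gptbot[day])
# prints:
# 2024-08-20 2 1
# 2024-08-21 1 1
```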

With OpenAI finding value in data like this, its little cousin Claude can't be far behind. And indeed, we do see some scans in our honeypot that may be originating from Anthropic's Claude, but they are few and far between compared to OpenAI. For example, we had this URL being hit on August 20th:

/legal-content/CS/AUTO/?uri=celex%3A32010R1099

The user-agent used by Claude is:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; [email protected])
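
Matching on the product token is usually enough to sort these requests into buckets. Here is a minimal sketch that uses only the "GPTBot/" and "ClaudeBot/" tokens from the two user-agent strings quoted above; the function name and the mapping are illustrative. Keep in mind that a user-agent is trivially spoofed, so treat the result as a hint and not as proof of origin.

```python
from typing import Optional

# Product tokens taken from the two user-agent strings quoted above.
CRAWLER_TOKENS = {
    "GPTBot/": "OpenAI",
    "ClaudeBot/": "Anthropic",
}


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return the operator a user-agent claims to belong to, or None."""
    for token, operator in CRAWLER_TOKENS.items():
        if token in user_agent:
            return operator
    return None


ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
      "compatible; GPTBot/1.2; +https://openai.com/gptbot)")
print(classify_user_agent(ua))          # prints OpenAI
print(classify_user_agent("curl/8.0"))  # prints None
```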

To help manage this traffic, I added a category to our threatlist API. For IP addresses linked to OpenAI use:

https://isc.sans.edu/api/threatlist/openai?json
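
If you want to pull this list from a script, something like the following may serve as a starting point. It is only a sketch: the shape of the returned JSON is not described here, so the function simply returns whatever the API sends back, decoded, and leaves interpretation to the caller. The User-Agent value is a placeholder, and the "opener" parameter exists only so the function can be exercised without network access.

```python
import json
import urllib.request

THREATLIST_URL = "https://isc.sans.edu/api/threatlist/openai?json"


def fetch_threatlist(url=THREATLIST_URL, opener=urllib.request.urlopen):
    """Download a threatlist and return the decoded JSON document.

    No assumption is made about the structure of the JSON; inspect the
    result before relying on specific fields.
    """
    # Placeholder User-Agent; replace it with something that identifies you.
    request = urllib.request.Request(
        url, headers={"User-Agent": "threatlist-example/0.1"}
    )
    with opener(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (performs a real HTTP request, so it is left commented out):
# data = fetch_threatlist()
# print(type(data).__name__)
```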

OpenAI published a list of ranges here: https://platform.openai.com/docs/actions/production . Based on our data, a few additional IPs are also scanning and located in adjacent network ranges, indicating that they are also owned by OpenAI (they also exhibit the same behavior).
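
The "adjacent range" check can be approximated with Python's ipaddress module. The sketch below widens each published range by a couple of prefix bits and reports whether an address falls inside the published range, just outside it, or nowhere near it. The ranges shown are placeholders from the RFC 5737 documentation space, not OpenAI's actual ranges, and the two-bit widening is an arbitrary choice for illustration. Adjacency alone does not establish ownership; it only flags addresses worth a closer look.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation space), NOT real OpenAI ranges.
PUBLISHED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/28", "198.51.100.64/28")
]


def classify_ip(address, ranges=PUBLISHED_RANGES, widen_bits=2):
    """Return 'published', 'adjacent', or 'unrelated' for an IP address."""
    ip = ipaddress.ip_address(address)
    if any(ip in net for net in ranges):
        return "published"
    # Widen each range by `widen_bits` prefix bits to catch near neighbors.
    if any(ip in net.supernet(prefixlen_diff=widen_bits) for net in ranges):
        return "adjacent"
    return "unrelated"


print(classify_ip("192.0.2.5"))    # prints published
print(classify_ip("192.0.2.20"))   # prints adjacent
print(classify_ip("203.0.113.5"))  # prints unrelated
```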

or for Anthropic: