Podcast Detail

SANS Stormcast Monday Mar 3rd: AI Training Data Leaks; MITRE Caldera Vuln; modsecurity bypass

If you are not able to play the podcast using the player below: Use this direct link to the audio file: https://traffic.libsyn.com/securitypodcast/9346.mp3

Common Crawl includes Common Leaks
The "Common Crawl" dataset, a large dataset created by spidering websites, contains, as expected, many API keys and other secrets. This data is often used to train large language models.
https://trufflesecurity.com/blog/research-finds-12-000-live-api-keys-and-passwords-in-deepseek-s-training-data
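
A minimal sketch of the kind of pattern matching a secret scanner starts with. This is not TruffleHog's implementation: the two patterns below are commonly cited key formats, and the function name and sample string are made up for illustration. A regex hit is only a candidate; the "live" count in the research came from additionally testing each key against its service, which this sketch does not do.

```python
import re

# Illustrative patterns only (assumption: these two widely documented key
# shapes; a real scanner ships hundreds of detectors).
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(AKIA[0-9A-Z]{16})\b"),
    "mailchimp_api_key": re.compile(r"\b([0-9a-f]{32}-us[0-9]{1,2})\b"),
}


def find_candidate_secrets(text: str) -> list[tuple[str, str]]:
    """Return (detector_name, matched_string) pairs found in text."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(1)))
    return hits


# Hypothetical crawled page content; the AWS value is the well-known
# documentation placeholder, not a real credential.
sample_page = 'var cfg = {key: "AKIAIOSFODNN7EXAMPLE", region: "us-east-1"};'
candidates = find_candidate_secrets(sample_page)
```

Because the same JavaScript snippet is often embedded on many sites, deduplicating the matched strings before counting gives a more honest number than counting raw hits.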

GitHub Repositories Exposed by Copilot
As is well known, GitHub's Copilot uses data from public GitHub repositories to train its model. However, it appears that repositories that were briefly left open and later made private have been included as well, allowing Copilot users to retrieve files from these repositories.
https://www.lasso.security/blog/lasso-major-vulnerability-in-microsoft-copilot

MITRE Caldera Framework Allows Unauthenticated Code Execution
The MITRE Caldera adversary emulation framework allows for unauthenticated code execution by allowing attackers to specify compiler options.
https://medium.com/@mitrecaldera/mitre-caldera-security-advisory-remote-code-execution-cve-2025-27364-5f679e2e2a0e
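
A hedged sketch of the general defense against this class of bug: build parameters from a request are checked against an allowlist, and the compiler is described by a fixed argument list instead of a string that callers can extend. This is not MITRE's actual fix. The function name, the allowlists, the file names, and the assumption of a Go toolchain are all hypothetical.

```python
# Hypothetical allowlists; a real build service would define its own.
ALLOWED_GOOS = {"linux", "windows", "darwin"}
ALLOWED_GOARCH = {"amd64", "arm64"}


def build_agent_command(goos: str, goarch: str) -> tuple[list[str], dict[str, str]]:
    """Return (argv, env) for a build, rejecting anything off the allowlist."""
    if goos not in ALLOWED_GOOS:
        raise ValueError(f"unsupported target OS: {goos!r}")
    if goarch not in ALLOWED_GOARCH:
        raise ValueError(f"unsupported target architecture: {goarch!r}")
    # argv is a fixed list: no caller-supplied flags and no shell string,
    # so extra compiler or linker options cannot be smuggled in.
    argv = ["go", "build", "-o", f"agent-{goos}-{goarch}", "agent.go"]
    env = {"GOOS": goos, "GOARCH": goarch}
    return argv, env


argv, env = build_agent_command("linux", "amd64")
```

The function only constructs the command; running it (for example with `subprocess.run(argv, env=...)` and without `shell=True`) is left to the caller.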

ModSecurity Rule Bypass
Attackers may bypass the ModSecurity web application firewall by padding HTML-encoded characters with leading zeros.
https://github.com/owasp-modsecurity/ModSecurity/security/advisories/GHSA-42w7-rmv5-4x2j
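
A small illustration of why zero padding can defeat a rule. The "naive" decoder below is a deliberately flawed, hypothetical stand-in and is not ModSecurity's code; it only shows the general failure mode where a decoder does not recognize an entity with leading zeros, so the rule inspects the still-encoded value. Python's standard `html.unescape` is used as the reference decoder.

```python
import html
import re

# Hypothetical flawed decoder: only numeric entities with at most three
# decimal digits are recognized, so zero-padded forms are left untouched.
_NAIVE_ENTITY = re.compile(r"&#(\d{1,3});")


def naive_decode(value: str) -> str:
    return _NAIVE_ENTITY.sub(lambda m: chr(int(m.group(1))), value)


def rule_matches(decoded: str) -> bool:
    # Stand-in for a WAF rule that looks for a script tag after decoding.
    return "<script" in decoded.lower()


payload_plain = "&#60;script&#62;"
payload_padded = "&#0000060;script&#0000062;"

blocked_plain = rule_matches(naive_decode(payload_plain))
blocked_padded = rule_matches(naive_decode(payload_padded))
blocked_padded_reference = rule_matches(html.unescape(payload_padded))
```

Here the plain payload is caught, the padded payload passes the flawed decoder unchanged, and the reference decoder still reveals the tag.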

Podcast Transcript

 Hello and welcome to the Monday, March 3rd, 2025
 edition of the SANS Internet Storm Center's Stormcast. My
 name is Johannes Ullrich and today I'm recording from
 Baltimore, Maryland. Well, let's start today with some
 stories about AI training data. The first one here comes
 from Truffle Security. Truffle Security, of course, is the
 company behind TruffleHog, the very frequently used and
 well-respected tool that allows you to identify API
 keys and other secrets that you may leak in Git or other
 repositories and such. So Truffle Security took a big
 database of AI training data that's being offered by Common
 Crawl. Common Crawl has been going out and spidering the web for
 many years now. They have something like 400 terabytes
 of data that they are offering. And well, it
 shouldn't really be a surprise because it's the same thing
 that we had with Google and other web spiders that offer
 their data publicly, that it now becomes, well, probably
 straightforward to find things like API keys that people
 leaked on their websites. A little bit tricky here that
 this data is also historic data. I believe they've been doing
 this for the last 10 years or so. So it is not just
 current data. Now, sites like Google, they offer some
 historic data, but usually focus more on current data.
 They found about 12,000 of what Truffle Security considers
 live keys, which means that they work according to
 TruffleHog. TruffleHog has a little sort of test feature
 that allows you to make sure that these are not just simple
 sample or expired credentials that are being used here. They
 point out in their paper that this number of roughly 12,000
 secrets is, of course, just an estimate. There are some that
 they missed just because they were not formatted correctly.
 And then, of course, always a little bit tricky to figure
 out if they're actually being used, if they're just demo
 credentials and such. They also point out that many of
 the credentials can be found across a large number of
 websites in this data repository. Initially, when I
 read this, I first thought that, hey, maybe these are
 just demo credentials and such. Maybe you often have,
 like, the snake oil secret key that comes with Apache that,
 of course, is all over the place. Well, according to
 Truffle Security, they believe that this is more a case of multiple
 websites using the same piece of JavaScript. It could
 identify, like, suppliers and supply chains and such. So I
 have to really see what this all means. Overall, yes, if
 data is exposed, it probably got captured by someone. From
 my own experience, particularly for a smaller
 website, the vast majority of sort of hits you get is
 crawlers like this. So no real big surprise if these
 credentials end up pretty quickly in repositories like
 Common Crawl and can then easily be abused.

 The second
 story that's also related to training data comes from
 researchers at Lasso Security. And what they noticed is that
 the training data being used by GitHub's Copilot, which,
 well, is Microsoft, contains data from what's now private
 GitHub repositories. So Copilot uses GitHub as
 training data. And that's publicly known. That's well-established.
 But they only use public repositories. What
 Lasso Security here found is that, well, if your repository
 was public even for a relatively short amount of
 time or when you initially set it up, well, it's going to be
 added. And GitHub Copilot doesn't necessarily remove
 data after it's marked as private by the author of that
 data. And not only that, now you may say, hey, you know, if
 it's part of that training data, it may not be such a big
 deal. It may just help people code a little bit or such. You
 can actually ask Copilot for, hey, list the files in that
 particular repository. And with that, you basically get a
 very direct interface into these files that were public
 at the time. Again, this is only if these files were
 public at a particular point in time. But they found
 literally thousands of these repositories were exposed,
 some sort of big name brand companies. Just like I said
 earlier, if at any point in time your data was exposed,
 assume it got grabbed by someone and, well, has to be
 considered leaked at this point.

 Well, then we got some
 vulnerabilities to talk about. MITRE Caldera is a
 framework to make it easy to simulate adversaries so your
 red teamers may use it. It implements a REST API and
 allows for plugins to be controlled to automate various
 parts of the attack scenario. Sadly, MITRE announced last
 week that Caldera itself is vulnerable to some interesting
 command injection. The vulnerability derives from the
 Manx and Sandcat agents. These agents are intended to be used
 to implement a reverse shell, but they require
 authentication. However, these agents have the ability to be
 compiled just in time for a particular platform. And the
 attacker can actually then supply some compile parameters
 and with that they can execute arbitrary code. Interesting in
 part because this is part of an attack framework that's
 supposed to execute arbitrary code, but not for everybody,
 only for authorized users here. So you
 definitely want to update it. There's a great sort of
 post by MITRE. Actually, I like that they go into detail on
 what really went wrong here. But with that, they also did
 publish a proof of concept exploit. I don't see this as
 sort of a very likely to be exploited vulnerability, but
 could certainly be exploited in a more targeted attack.

 And we have an interesting vulnerability in ModSecurity.
 This vulnerability is not super severe, but well, it
 does allow bypassing of ModSecurity rules. And since
 blocking is the point of ModSecurity, it sort of
 invalidates the tool somewhat. All you have to do is
 pad HTML-encoded values with leading zeros. Luckily, it only
 affects one particular version of ModSecurity. So double
 check and if necessary, update.