Why Red Teams Can't Answer Defenders' Most Important Questions

Red-team assessments aren't very good at validating that defenses are working, so defenders don't have a realistic sense of how strong their defenses are.

Jared Atkinson, Chief Strategist, SpecterOps

January 5, 2024


In 1931, scientist and philosopher Alfred Korzybski wrote, "The map is not the territory." He meant that all models, like maps, leave out some information compared to reality. The models used to detect threats in cybersecurity are similarly limited, so defenders should always be asking themselves, "Does my threat detection detect everything it's supposed to detect?" Or, to put it another way, how closely does their map of a threat match the reality of the threat? Penetration testing and red- and blue-team exercises are attempts to answer this question.

Unfortunately, red-team assessments don't answer this question very well. Red teaming is useful for plenty of other things, but it's the wrong protocol for answering this specific question about defense efficacy. As a result, defenders don't have a realistic sense of how strong their defenses are.

Red-Team Assessments Are Limited by Nature

Red-team assessments aren't that good at validating that defenses are working. By their nature, they test only a few specific variants of a few of the attack techniques an adversary could use. This is because they're trying to mimic a real-world attack: first recon, then intrusion, then lateral movement, and so on. But all that defenders learn from this is whether those specific techniques and variants work against their defenses. They get no information about other techniques or other variants of the same technique.

In other words, if defenders don't detect the red team, is that because their defenses are lacking? Or is it because the red team chose the one option they weren't prepared for? And if they did detect the red team, is their threat detection comprehensive? Or did the "attackers" just choose a technique they were prepared for? There's no way to know for sure.

The root of this issue is that red teams don't test enough of the possible attack variants to judge the overall strength of defenses (although they add value in other ways). And attackers probably have more options than you realize. One technique I've examined had 39,000 variations. Another had 2.4 million! Testing all or most of these is impossible, and testing too few gives a false sense of security.
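To see why a handful of tested variants says so little, consider a rough back-of-the-envelope sketch in Python. The per-step option counts and the number of variants a red team exercises are hypothetical values chosen for illustration. The counts are picked only so that their product matches the 39,000 figure above; the actual breakdown of that technique is not given here.

```python
from math import prod

# Hypothetical: interchangeable options at each step of a single technique.
# Chosen only so the product equals the 39,000 figure quoted in the text.
options_per_step = [4, 5, 10, 13, 15]

# Variants multiply: each option at one step combines with every option
# at the other steps.
total_variants = prod(options_per_step)

# Hypothetical: distinct variants of this technique a red team exercises.
variants_tested = 5

# Fraction of the variant space the assessment actually touched.
coverage = variants_tested / total_variants
print(f"{total_variants:,} variants, {coverage:.4%} covered")
```

Under these assumptions, the assessment touches roughly one hundredth of one percent of the variant space, which is why a pass or a fail on those few variants says little about the rest.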

For Vendors: Trust but Verify

Why is testing threat detection so important? In short, security professionals want to verify that vendors actually have comprehensive detection for the behaviors they claim to stop. An organization's security posture depends largely on its vendors: The security team chooses and deploys an intrusion prevention system (IPS), endpoint detection and response (EDR), user and entity behavior analytics (UEBA), or similar tools and trusts that the selected vendor's software will detect the behaviors it says it will. Increasingly, security pros are no longer willing to take those claims on faith. I've lost count of the number of conversations I've heard where the red team reports what they did to break into the network, the blue team says that shouldn't be possible, and the red team shrugs and says, "Well, we did it so ..." Defenders want to dig into this discrepancy.

Testing Against Tens of Thousands of Variants

Although testing each variant of an attack technique isn't practical, I believe testing a representative sample of them is. To do this, organizations can use approaches like atomic testing, as popularized by Red Canary's open source Atomic Red Team project, where techniques are tested individually (not as part of an overarching attack chain) using multiple test cases for each. If a red-team exercise is like a football scrimmage, atomic testing is like practicing individual plays. Not all those plays will happen in a full scrimmage, but it's still important to practice for when they do. Both should be part of a well-rounded training program, or in this case, a well-rounded security program.

Next, they need a set of test cases that is representative of the full range of variants for the technique in question. Building these test cases is a crucial task for defenders; their quality directly determines how well the testing assesses security controls. To return to the map analogy, these test cases make up the "map" of the threat. Like a good map, they leave out unimportant details and highlight the important ones to create a lower-resolution, but overall accurate, representation of the threat. How to build these test cases is a problem I'm still wrestling with (I've written about some of my work so far).
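As a rough illustration of what a lower-resolution but still representative set of test cases could look like, the Python sketch below applies each-choice coverage, a simple heuristic from combinatorial testing: every option of every dimension appears in at least one test case. This is not the author's method, and the dimension and option names are hypothetical placeholders rather than a real breakdown of any technique.

```python
from math import prod

def each_choice_sample(dimensions: dict[str, list[str]]) -> list[dict[str, str]]:
    """Build test cases so every option of every dimension is used at least once."""
    # The longest dimension sets the number of test cases needed.
    rows = max(len(options) for options in dimensions.values())
    # Row i takes option i from each dimension, wrapping around shorter ones.
    return [
        {name: options[i % len(options)] for name, options in dimensions.items()}
        for i in range(rows)
    ]

# Hypothetical dimensions along which one technique can vary.
dimensions = {
    "how_access_is_obtained": ["option_a1", "option_a2"],
    "how_data_is_read": ["option_b1", "option_b2", "option_b3"],
    "where_output_goes": ["option_c1", "option_c2"],
}

sample = each_choice_sample(dimensions)
full_space = prod(len(options) for options in dimensions.values())
print(f"{len(sample)} test cases sampled from {full_space} possible variants")
for case in sample:
    print(case)
```

Each-choice coverage is the weakest useful criterion: it guarantees no option goes untested, but it does not exercise every combination of options. Stronger criteria, such as covering every pair of options, cost more test cases in exchange for a more faithful map.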

Another proposed answer to the shortcomings of current threat detection is purple teaming, in which red and blue teams work together instead of seeing each other as opponents. More cooperation between red and blue teams is a good thing, hence the rise of purple-team services. But most of these services don't fix the fundamental problem. Even with more cooperation, assessments that look at only a few attack techniques and variants are still too limited. Purple-team services need to evolve.

Building Better Test Cases

Part of the challenge of building good test cases (and the reason why red–blue team cooperation isn't enough on its own) is that the way we categorize attacks obscures a lot of detail. Cybersecurity looks at attacks through a three-layered lens: tactics, techniques, and procedures (TTPs). A technique like credential dumping can be accomplished by many different procedures, like Mimikatz or Dumpert, and each procedure can have many different sequences of function calls. Defining what a "procedure" is gets difficult very quickly but is possible with the right approach. The industry hasn't yet developed a good system for naming and categorizing all this detail.
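The layering described above can be sketched as a small data structure. The Python below is a minimal illustration, not an established taxonomy: the class names are assumptions, and the call-sequence entries are placeholder strings, not the actual function calls made by Mimikatz or Dumpert.

```python
from dataclasses import dataclass, field

@dataclass
class Procedure:
    """One concrete way to carry out a technique, such as a specific tool."""
    name: str
    # Each tuple is one distinct sequence of function calls (placeholders here).
    call_sequences: list[tuple[str, ...]] = field(default_factory=list)

@dataclass
class Technique:
    """A technique serves a tactic and can be realized by many procedures."""
    name: str
    tactic: str
    procedures: list[Procedure] = field(default_factory=list)

    def variant_count(self) -> int:
        # Count variants at the finest level of detail: distinct call sequences.
        return sum(len(p.call_sequences) for p in self.procedures)

credential_dumping = Technique(
    name="credential dumping",
    tactic="credential access",
    procedures=[
        # Placeholder sequences only; real tools differ.
        Procedure("Mimikatz", [("call_a", "call_b"), ("call_a", "call_c")]),
        Procedure("Dumpert", [("call_d", "call_e")]),
    ],
)

print(len(credential_dumping.procedures), "procedures,",
      credential_dumping.variant_count(), "call-sequence variants")
```

Counting at the procedure level gives two entries here, while counting at the call-sequence level gives three. The gap between those numbers grows quickly with real tools, which is the detail that a technique-level label hides.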

If you're looking to put your threat detection to the test, look for ways to build representative samples that test against a wider swath of possibilities. This strategy will produce more meaningful improvements than testing a handful of variants, and it will help defenders finally answer the questions that red teams struggle with.

About the Author(s)

Jared Atkinson

Chief Strategist, SpecterOps

Jared is a security researcher who specializes in Digital Forensics and Incident Response. Recently, he has been building and leading private sector Hunt Operations capabilities. In his previous life, Jared led incident response missions for the US Air Force Hunt Team, detecting and removing advanced persistent threats on Air Force and DoD networks. Passionate about PowerShell and the open source community, Jared is the lead developer of PowerForensics and Uproot and maintains a DFIR-focused blog.
