
Simbian cyber defence benchmark finds all 11 AI models fail

Wed, 29th Apr 2026

Simbian has launched the Simbian Research Lab and released a cyber defence benchmark for large language models. The tests found that none of the 11 models assessed achieved a passing score.

The benchmark measures whether models can detect MITRE ATT&CK chains in complex scenarios using real attack telemetry rather than question-and-answer prompts. According to Simbian, it is the first benchmark to use an agentic investigation format built around realistic attack data.

Frontier models have shown strength in identifying and exploiting software vulnerabilities, but in Simbian's tests they performed poorly when asked to investigate attacks from a defender's perspective. The top result came from Anthropic's Claude Opus 4.6, which detected an average of 46% of attack evidence per MITRE tactic.

All models missed entire categories of attacks. The group tested included models from Anthropic, OpenAI and Google, as well as open-weight models from Alibaba, MiniMax, DeepSeek and Moonshot AI.

How it worked

Simbian Research Lab designed the benchmark around a task familiar to security teams: investigation. Rather than asking a model to answer isolated questions about an incident, the system presented attack telemetry and asked it to identify the attacker and the tactics used.

The models operated with a simple ReAct loop, which allows an AI system to reason through a task step by step while taking actions along the way. Simbian said this setup more closely reflects how language models would perform in a live security operations environment, where evidence is incomplete, noisy and spread across systems.
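In outline, a ReAct loop alternates between a model turn and a tool observation until the model commits to a finding. The sketch below shows the general pattern only; the tool, message format and telemetry fields are illustrative assumptions, not Simbian's actual harness, which has not been published.

```python
import json

# Minimal ReAct-style investigation loop. Everything here is a generic
# illustration: the tool, its arguments and the reply format are assumed.
SYSTEM_PROMPT = (
    "You are a SOC analyst. Investigate the attack telemetry using the "
    "tools provided. Reason step by step; when finished, call 'report' "
    "with the MITRE ATT&CK tactics you found."
)

def query_logs(filter_expr: str) -> str:
    """Stand-in tool: would return matching telemetry rows as JSON."""
    return json.dumps([{"host": "ws-01", "event": "lsass memory read"}])

TOOLS = {"query_logs": query_logs}

def investigate(llm, max_steps: int = 20) -> list[str]:
    """Run the reason/act loop until the model reports or steps run out."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for _ in range(max_steps):
        reply = llm(messages)             # one reasoning + action turn
        messages.append({"role": "assistant", "content": reply})
        step = json.loads(reply)          # {"thought": ..., "action": ...}
        action = step["action"]
        if action["name"] == "report":    # model has finished investigating
            return action["tactics"]      # e.g. ["TA0001", "TA0006"]
        observation = TOOLS[action["name"]](**action["args"])
        messages.append({"role": "user", "content": observation})
    return []  # step budget exhausted without a final report
```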

The benchmark was also designed to reduce the chance that models could perform well through memorisation or familiarity with standard test formats. The framework mutates context, uses deterministic scoring against ground truth and tracks cost alongside accuracy.
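Deterministic scoring means credit is a set comparison against labelled evidence rather than a judgment call by another model, so the same run always produces the same score. The sketch below shows how per-tactic scoring of that kind could work; the flag and tactic identifiers are hypothetical examples, not Simbian's scheme.

```python
# Deterministic scoring sketch: average fraction of ground-truth evidence
# flags the model found per tactic. No LLM judge is involved, so a given
# run always scores identically.
def score_run(reported: set[str], ground_truth: dict[str, set[str]]) -> float:
    per_tactic = [
        len(reported & flags) / len(flags)   # fraction found for this tactic
        for flags in ground_truth.values()
        if flags
    ]
    return sum(per_tactic) / len(per_tactic)

truth = {
    "TA0001-initial-access": {"flag-phish-01", "flag-macro-02"},
    "TA0006-credential-access": {"flag-lsass-01"},
}
print(score_run({"flag-phish-01", "flag-lsass-01"}, truth))  # 0.75
```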

Performance gap

The results showed a wide spread in both detection rate and cost between the models. Anthropic's Claude Opus 4.6 found three times more flags than Google Gemini 3 Flash, but at roughly 100 times the cost.
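Taken at face value, those ratios put the cheaper model far ahead on a per-detection basis. The arithmetic below uses illustrative placeholder numbers, since Simbian did not publish absolute flag counts or dollar figures:

```python
# Back-of-envelope cost-per-flag comparison implied by the reported
# ratios (3x the flags at ~100x the cost). Absolute values are made up.
opus_flags, opus_cost = 30, 100.0   # placeholder: 3x flags, 100x cost
flash_flags, flash_cost = 10, 1.0

print(opus_cost / opus_flags)    # ~3.33 cost units per flag found
print(flash_cost / flash_flags)  # 0.10 cost units per flag, ~33x cheaper
```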

That comparison highlights a broader issue for companies exploring AI tools for security operations. A model with better detection rates may still be hard to justify if the cost of running investigations at scale rises sharply, particularly in environments handling large volumes of alerts and machine logs.

The findings also point to a persistent distinction between offensive and defensive uses of AI in cyber security. Research over the past two years has often shown that advanced models can support vulnerability discovery and code-related tasks. Simbian's benchmark suggests the leap from those skills to practical attack discovery in enterprise telemetry remains incomplete.

Ambuj Kumar, Founder and Chief Executive Officer of Simbian, said the issue was one of context and system design rather than raw model quality alone. "Our research shows you can't throw an LLM dart in the dark and expect to hit the cyber defense bullseye," Kumar said. "The same frontier models that perform strongly during cyberattacks struggle on the defense side. Defense is fundamentally harder than offense as it requires reasoning across noisy, partial evidence rather than executing against a known target. The LLMs must be accompanied by outside intelligence in the form of a sophisticated harness. Simbian has been able to get 95% accuracy in production enterprise environments on cyber defense SecOps following some of these techniques."

Wider debate

The release of a benchmark focused on defensive analysis comes as cyber security suppliers and corporate buyers try to separate marketing claims from measured performance. Many AI security tools have been introduced with promises of autonomous triage, threat hunting or incident analysis, but consistent standards for evaluating those claims are still emerging.

Existing cyber benchmarks often test a model's knowledge of attack methods rather than its ability to reconstruct an intrusion from raw evidence. Simbian's approach reflects the view that practical usefulness depends less on what a model knows in abstract terms and more on whether it can reason accurately over machine-generated logs and traces.

Richard Stiennon, Chief Research Analyst at IT-Harvest, said the benchmark addressed a central question for the sector. "We know the large models can do amazing things, but can we measure their efficacy in analyzing machine logs for security events?" he said. "This benchmark answers that question. In contrast to existing AI security benchmarks, this benchmark was designed to be difficult to game. It uses real telemetry rather than curated questions, mutates context to prevent memorization, enforces deterministic scoring against ground truth, and tracks detection cost alongside accuracy."

The creation of the Simbian Research Lab suggests the company wants to place greater emphasis on formal testing and evaluation as competition grows around AI-led security operations. The benchmark results, however, underline a less flattering message for the industry: current frontier models still fall well short of acting as stand-alone cyber defenders.