Security methods, safety goals: Rethinking AI red teaming
AI systems are everywhere now. They're screening resumes, managing customer service, helping write code, even influencing what we see and believe. Yet the risks remain largely unmapped. We're deploying conversational assistants, autonomous agents, and enterprise copilots that can be tricked into leaking confidential data, manipulated into executing unauthorized actions, steered outside their intended scope, or prompted to generate harmful content. Meanwhile, the field is locked in a definitional battle: is this an AI safety problem or an AI security problem?
This isn't just semantics. Earlier this year, the UK renamed its AI Safety Institute to the AI Security Institute, explicitly narrowing its focus to threats like bioweapons, cyberattacks, and fraud, while removing bias and other social harms from scope. The move sparked immediate debate: does "security" framing ignore the full spectrum of AI risks?
Is This a Safety Problem or a Security Problem?
Having spent the past couple of years running large-scale automated adversarial evaluations against dozens of LLM-powered assistants and agents, I've seen this tension play out in practice. In one benchmark, we tested 24 frontier and near-frontier models configured as enterprise chatbots. Every single one was exploitable, with attack success rates ranging from low single digits to well over 60%, despite identical guardrails. These weren't exotic lab attacks. They were direct prompt injections that any creative user, playing attacker, could craft.
So where does this work sit: AI safety or AI security? My short answer: if you treat it as only one of these, you're already behind.
What the Definitions Actually Say
Standards bodies like NIST have drawn a clear line: safe systems should not endanger life, health, property, or the environment; secure systems should maintain their functions and structure in the face of attack. IBM frames AI safety as minimizing harmful outcomes for people, while AI security is about defending models and infrastructure from adversaries. Recent discussions from AI security researchers sharpen this further: security is best seen as a subset of safety, the part that assumes an intelligent attacker.
Adversarial automated red teaming fits squarely in the security tradition. You start from a threat model: who is trying to break your system and why? You operationalize known attack patterns and measure exploitability at scale. You feed results into remediation cycles that mirror traditional vulnerability management. This is automated penetration testing for LLM-backed systems.
Yet the impact of those exploits lands in safety territory. When a chatbot generates targeted harassment, gives step-by-step instructions for self-harm, or uses tools to move real money, we're no longer just protecting "the model." We're protecting people, organizations, and sometimes critical infrastructure from harm. That's why Anthropic's Responsible Scaling Policy, Google DeepMind's Frontier Safety Framework, the EU AI Act, and NIST AI RMF all explicitly bake adversarial testing into safety governance. Regulators and frontier labs are already treating adversarial evaluations as safety infrastructure, not a security afterthought.
The Risk of Single-Lens Thinking
When you call something "security-only," it tends to get scoped, staffed, and measured like security: focused on CIA triad metrics, owned by security organizations, optimized for breach prevention. You risk sidelining equally serious harms that don't look like classic cyber incidents: biased treatment, deceptive persuasion, agentic misalignment, and subtle cognitive harms affecting millions of users.
Conversely, if you label everything "safety" and ignore the attacker's mindset, you end up with governance decks and model cards that look great on paper while your production chatbot falls over at the first halfway-competent jailbreak.
The most useful framing treats adversarial red teaming as a security method in service of safety outcomes. It should leverage threat modeling, attack libraries, and integration with security machinery. But its success criteria must be safety-driven: not merely reducing exploit counts, but reducing expected harm, measured in fraud enabled, sensitive data leaked, harmful content generated, and unsafe actions executed.
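One concrete way to operationalize "expected harm" is to weight exploit counts by the severity of their downstream impact. The categories and weights below are illustrative assumptions, not a standard; the point is that two remediation cycles with the same raw exploit reduction can differ sharply once harm is priced in.

```python
# Hedged sketch: weight successful exploits by estimated downstream harm
# so success criteria track expected harm, not exploit volume alone.
# Categories and weights are illustrative assumptions, not a standard.

HARM_WEIGHTS = {
    "fraud_enabled": 10.0,
    "sensitive_data_leaked": 8.0,
    "harmful_content_generated": 6.0,
    "unsafe_action_executed": 9.0,
}

def expected_harm(exploits: dict[str, int]) -> float:
    """Sum of (successful exploit count x harm weight) per category."""
    return sum(HARM_WEIGHTS[cat] * n for cat, n in exploits.items())

before = {"fraud_enabled": 2, "sensitive_data_leaked": 5,
          "harmful_content_generated": 12, "unsafe_action_executed": 1}
after = {"fraud_enabled": 0, "sensitive_data_leaked": 1,
         "harmful_content_generated": 10, "unsafe_action_executed": 0}

# Raw exploit count drops from 20 to 11; the harm-weighted view shows
# the remediation eliminated the most dangerous classes first.
harm_before = expected_harm(before)  # 141.0
harm_after = expected_harm(after)    # 68.0
```

A security-only scorecard would report a 45% drop in exploits; the harm-weighted view reports a 52% drop in expected harm, and, more importantly, shows which categories still carry the residual risk.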
What Matters for Your AI Deployment
If you're deploying assistants or agents today, the practical question isn't "Is my adversarial testing safety or security?" It's whether whoever owns it has both security expertise and a mandate to care about downstream human harms. Whether you're measuring only exploit rates or quantifying the safety impact of those exploits. And most concretely: what's the riskiest prompt your bot could face today, and are you letting an automated attacker hit it before your users do? That's where the real line is drawn.