AI security needs a shift from models to systems, researchers argue

Enterprises cannot secure AI agents by making the underlying models more robust and must instead enforce security controls at the system level around them, researchers behind a paper published this month argued. They warned that traditional AI-security approaches are increasingly misaligned with how autonomous agents actually operate inside enterprise environments.

The paper argues that enterprises should stop treating AI agents as trusted software components and instead secure them as fundamentally untrusted systems operating inside enterprise infrastructure.

“The AI model powering the agent must be treated as an untrusted component,” the researchers wrote in the paper, warning that “semantic guardrails” and prompt-level defenses alone cannot reliably secure systems once agents gain access to enterprise tools, memory, APIs, browsers, and execution environments.

The authors drew a comparison to operating systems. “Similar to how an operating system treats a process as untrusted, we take the stance that the model powering the agent should be treated as untrusted and security properties should be expressed and enforced outside, at the level of the encompassing system,” they wrote.

The paper was written by researchers at Google, the University of California, San Diego, the University of Wisconsin-Madison, and other institutions, including Mihai Christodorescu, Earlence Fernandes, and Somesh Jha.

Five principles from systems security

The authors distilled five principles from decades of systems security research that they said agentic systems should follow: least privilege, tamper resistance of the trusted computing base, complete mediation, secure information flow, and accounting for the human as a weak link.

As evidence, the authors analyzed eleven real-world attacks on AI agents and mapped each to the principles it violated.
The attacks included data exfiltration from the ChatGPT macOS app, a Claude Code exfiltration flaw, a Microsoft Copilot exfiltration vulnerability, and the AgentFlayer attack on Cursor through a malicious Jira ticket. Every one of the eleven violated the secure information flow principle, the paper said, while most violated the least privilege principle.

The authors rejected the idea that stacking machine-learning guardrails amounts to a defense. “Merely stacking ML models does not constitute true defense-in-depth,” they wrote, because the guard models “often share the same statistical failure modes as the primary agents they monitor.”

To put the principles into practice, the authors proposed three security mechanisms, each tied to an open research problem the community has yet to solve. The first is separating instructions from data, because language models mix the two in a single stream of tokens with no source-level distinction between them. The second is verifiable least-privilege policy generation, made difficult because security policies for agents are written in natural language and shift as a task evolves, which makes them hard to translate into rules a system can enforce. The third is information flow control, since tracking how sensitive data moves through a model remains unsolved.

Beyond the model

The paper challenges one of the dominant assumptions shaping enterprise AI-security efforts over the past two years: that increasingly capable models, alignment techniques, and prompt defenses would eventually make AI systems sufficiently secure for enterprise deployment.

Instead, the researchers argue, AI agents should be treated more like operating environments or distributed systems than conventional enterprise applications, because they combine reasoning, autonomy, memory persistence, and external tool execution inside a single operational layer.
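To make the "untrusted model, enforcement outside" stance concrete, the following is a minimal Python sketch of two of the five principles, least privilege and complete mediation. It is not taken from the paper: the `ToolGateway` class, the `Policy` type, and the `read_ticket` and `send_http` tools are hypothetical names invented here for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List


class PolicyViolation(Exception):
    """Raised when the untrusted model requests an action the policy forbids."""


@dataclass(frozen=True)
class Policy:
    # Least privilege: list only the tools the current task actually needs.
    allowed_tools: FrozenSet[str]


@dataclass
class ToolGateway:
    """Hypothetical reference monitor sitting between the model and its tools."""

    policy: Policy
    tools: Dict[str, Callable[[str], str]]
    audit_log: List[str] = field(default_factory=list)

    def call(self, tool_name: str, argument: str) -> str:
        # Complete mediation: every request passes through this check,
        # regardless of what the model's output claims about itself.
        if tool_name not in self.policy.allowed_tools:
            self.audit_log.append(f"DENY {tool_name}")
            raise PolicyViolation(f"tool {tool_name!r} is not permitted for this task")
        self.audit_log.append(f"ALLOW {tool_name}")
        return self.tools[tool_name](argument)


# Stand-in tools; a real deployment would wrap actual integrations.
def read_ticket(ticket_id: str) -> str:
    return f"contents of ticket {ticket_id}"


def send_http(url: str) -> str:
    return f"sent request to {url}"


# The task only requires reading tickets, so outbound HTTP is not granted.
gateway = ToolGateway(
    policy=Policy(allowed_tools=frozenset({"read_ticket"})),
    tools={"read_ticket": read_ticket, "send_http": send_http},
)
```

In this sketch, a model that has been manipulated by a malicious ticket can still ask for `send_http`, but the request fails in the gateway rather than depending on the model to refuse. The sketch does not address the open problems the authors list, such as generating the policy or tracking information flow through the model.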
“Security guarantees cannot emerge solely from better prompts, alignment tuning, or model-side mitigations,” the paper said, arguing enterprises instead need stronger runtime isolation, containment boundaries, least-privilege execution, and workflow observability controls around AI agents.

Once agents can act on enterprise systems, prompt injection is no longer simply a content-manipulation issue but potentially a workflow-execution and systems-integrity problem capable of influencing downstream actions across interconnected enterprise environments.

The visibility problem

The researchers also argue that current enterprise security tooling lacks sufficient runtime visibility into how AI agents actually reason, invoke tools, retain memory, and execute actions across enterprise systems.

A separate paper published last week points to a similar problem from a different angle, arguing that traditional endpoint detection and response platforms cannot adequately inspect AI-agent reasoning flows, prompt chains, memory interactions, or dynamic tool execution. That paper proposed what its authors described as an “agentic detection and response or ADR” framework designed specifically for AI-agent environments.

“Current security tools are not designed to observe agent cognition or reasoning traces,” the researchers wrote, arguing that existing enterprise security stacks were built to monitor deterministic applications and endpoint activity, not systems capable of autonomous planning, probabilistic reasoning, and dynamic workflow orchestration.

The paper described a production deployment monitoring more than 10,000 AI-agent sessions daily across roughly 7,200 hosts, where researchers said the framework identified hundreds of credential-exposure incidents and other agent-related risks spanning 26 attack categories.
On a benchmark the team introduced, called ADR-Bench, the system detected 67% of attacks with zero false positives, outperforming three baselines, including Meta’s LlamaFirewall, by a factor of two to four in F1 score, the paper said. On AgentDojo, a public prompt-injection benchmark, it detected all attacks with three false alarms out of 93 tasks.
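For readers unfamiliar with the metric, the short Python sketch below shows how an F1 score follows from the ADR-Bench figures quoted above. It assumes that "detected 67% of attacks" is recall and that zero false positives means a precision of 1.0; the paper's own reported F1 may differ because of rounding or how results were averaged.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Assumption: zero false positives implies precision of 1.0
# (given at least one attack was flagged).
precision = 1.0
# Assumption: "detected 67% of attacks" is read as recall.
recall = 0.67

adr_bench_f1 = f1_score(precision, recall)
print(round(adr_bench_f1, 3))  # 0.802
```

The AgentDojo result cannot be converted the same way from the figures given here, because the article reports the number of tasks but not the number of attacks.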
