Field Notes

What Happens When AI Agents Go Off the Rails

The vast majority of AI agents are over-permissioned. When an agent exposes sensitive files, traditional permission models can't help you.
March 23, 2026

Researchers at Invariant Labs found that GitHub's own MCP server (the official integration that connects AI coding agents to GitHub) could be hijacked through a poisoned issue on a public repository. An attacker creates an issue with hidden prompt injection in the body. A developer's AI agent, connected via MCP, processes that issue. The agent then accesses the developer's private repositories, extracts source code and sensitive data, and creates a pull request on the public repo containing the stolen information. Visible to the attacker. Visible to anyone. Private repo contents, exfiltrated through the platform's own AI tooling, by someone who never had access to begin with.

The MCP server was working as designed. The agent had the developer's permissions because that's how the integration works: it authenticates as the user. The agent accessed private repos because it was allowed to. The problem was that an AI agent, operating with a human's credentials at machine speed, doesn't distinguish between a legitimate task and an instruction injected into an issue description. It reads and follows both. And when it follows the wrong one, it does so with the full authority of the developer who connected it.

This is what over-permissioned agents look like at scale. And 90% of AI agents deployed today are over-permissioned, according to Obsidian Security's research.

Gartner projects that 40% of enterprise applications will feature embedded agents by 2026, up from under 5% in early 2025. Cisco found that 83% of organizations are planning agentic AI deployments, but only 29% feel ready to secure them. Those two numbers don't go together. We are building the deployment curve for a technology whose security model we haven't figured out yet.

Why permission models built for humans break with agents

Role-based access control works because humans are slow. A person with read access to the CRM, the wiki, the pricing database, and the email system uses those systems one at a time, in separate contexts, with natural friction between them. Opening a browser tab, authenticating, navigating to the right page, reading, deciding what to do next. That friction is accidental security. It means a human with broad permissions rarely exercises them all in a single sitting.

Agents don't have that friction. Three structural problems make human permission models fail for autonomous agents.

Scope expands dynamically

An agent told to "research our competitor's pricing" might chain web search, internal wiki lookup, CRM query, pricing database access, and email draft into a single execution flow. Each individual tool call looks reasonable. A web search is fine. A wiki lookup is fine. Reading the CRM is fine.

But the chain grants a scope of access the user never intended and probably never imagined. The agent discovered those tools dynamically, decided to use them based on its own reasoning, and executed a cross-system research operation that no human would have performed as a single action.

RBAC can't express "you can read the CRM but only if you got here from a web search and not from the pricing database." Permissions are binary. The agent's access is valid at every individual step. The aggregate access is the problem, and nothing in the permission model captures aggregates.
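A minimal sketch of the point above, using hypothetical role and tool names. Each per-call check passes; the model simply has no vocabulary for the chain as a whole:

```python
# Toy RBAC: each tool call is evaluated in isolation against a role's grant set.
ROLE_PERMISSIONS = {
    "analyst": {"web_search", "wiki_read", "crm_read", "pricing_read", "email_draft"},
}

def rbac_allows(role: str, tool: str) -> bool:
    """Binary check: is this single tool in the role's grant set?"""
    return tool in ROLE_PERMISSIONS.get(role, set())

# The chain the agent improvised for "research our competitor's pricing":
chain = ["web_search", "wiki_read", "crm_read", "pricing_read", "email_draft"]

# Every individual step passes.
assert all(rbac_allows("analyst", tool) for tool in chain)

# But there is nowhere to even write a rule like "pricing_read must not
# feed into email_draft." The aggregate flow is invisible to a check
# that only ever sees one call at a time.
```

The fix isn't a smarter `rbac_allows`; it's a model that sees the request context, which is what the intent-based approach later in this piece provides.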

Tools chain across system boundaries

Agents cross system boundaries in ways humans rarely do within a single workflow. A human accesses one system at a time. An agent might read from CRM, query a database, draft a customer email with specific pricing, and send it — all in one execution. Each system has its own permission model. None of them know what the agent did in the other systems. None of them can evaluate whether the combination of actions across systems was appropriate.

This is different from API integrations, which have predefined data flows between systems. An agent's data flow is emergent. It decides at runtime which systems to connect and what data to pass between them. The integration wasn't designed. It was improvised by a language model.

Irreversible actions happen before anyone reviews them

Agents can take actions that can't be undone. Send an email to a customer. Modify a production database. Post to a public API. Make a pricing commitment. A human usually pauses before sending an email with pricing to a customer. An agent doesn't pause. It doesn't have the sensation of "wait, should I really send this?" That sensation is a security control, and agents don't have it.

The combination is what makes this dangerous. An agent with broad permissions dynamically discovers tools, chains them across system boundaries, and takes irreversible actions — all in the time it takes you to read this sentence. The permission model that made sense when a human was the actor becomes a liability when the actor operates at machine speed with machine thoroughness.

Here's the enterprise sales angle that nobody talks about. When a buyer's security team asks "what permissions do your agents have?" and you can't answer with specifics, the deal stalls the same way it stalls when you can't answer tenant isolation questions. Agent permission governance is becoming a security questionnaire section. Companies that can demonstrate structured controls are closing deals that their competitors lose in security review.

Why detection-based defenses are built on sand

The security industry's first instinct was predictable: detect bad inputs, filter bad outputs. The same playbook we've run for decades. Prompt injection has sat at number one on the OWASP LLM Top 10 for several years running, and the defense ecosystem has organized around detecting it.

The problem is that detection-based defenses for prompt injection are fundamentally fragile. Reliably distinguishing injected instructions from legitimate content, in a channel that carries both, is hard in principle, not just in engineering practice.

Input filters scan prompts for known injection patterns. They catch the attacks in their training data. They miss novel phrasings, encoded payloads, multi-turn attacks that build up context across messages. Every new filter creates a cat-and-mouse game where attackers adjust phrasing until the filter misses. This is the same dynamic as signature-based antivirus, and it ends the same way.

LLM-as-judge approaches use a second language model to evaluate whether the first model's output looks suspicious. This adds 200ms or more per evaluation, not to mention additional costs, which compounds quickly in agentic workflows where a single user request might generate dozens of tool calls. Worse, the judge model is vulnerable to the same class of attacks it's supposed to detect. You're using a system with known vulnerabilities to protect against those same vulnerabilities. The judge can be injected. The judge can be confused. The judge is a language model, with all the unpredictability that implies.

Output classifiers filter responses for sensitive data, harmful content, off-topic responses. They can't catch what they weren't trained on. A novel exfiltration technique that encodes data in formatting rather than content sails past output classifiers. So does a response that's subtly wrong — not obviously malicious, just enough to cause a bad business decision.

OpenAI has publicly acknowledged that prompt injection is "unsolved." Not "difficult." Not "an area of active research." Unsolved. Building your agent security on detection is building on a foundation that its own creators admit will fail. The question isn't whether your detection will miss an attack. It's when, and what the agent does with the permissions it has when detection fails.

NVIDIA's NeMo Guardrails, one of the more sophisticated detection-based solutions, adds 3.5 to 11 seconds of latency per evaluation. For a conversational chatbot, maybe tolerable. For an agentic workflow making dozens of tool calls per second, unusable. And after that latency cost, you still don't have a guarantee. You have a probabilistic filter that might catch the attack.

There is a better question than "how do we detect malicious intent?" It's: what if we made intent irrelevant?

Intent-based authorization: making injections irrelevant

Intent Based Access Control (IBAC), developed by Jordan Potti [disclosure: an advisor to Adversis] and documented at ibac.dev, inverts the entire strategy. Instead of trying to detect malicious intent in model inputs and outputs, IBAC restricts what agents can do regardless of intent. If an injected instruction tells the agent to exfiltrate data, and the agent's permission scope doesn't include data exfiltration capabilities, the instruction is irrelevant. Not detected and blocked — the agent literally cannot comply.

The architectural insight is what matters here. When a user makes a request, IBAC derives granular permission scopes from the user's legitimate, natural language input. "Book a flight to NYC for next Tuesday" generates specific, narrow capabilities: flight search, booking creation, calendar access for the relevant date. Nothing else. The permissions are scoped to exactly what the user asked for, not everything the agent's role might ever need.

Every tool invocation is then checked against these derived permissions via a deterministic authorization engine: a conventional access control system, not a language model. The check takes about 9 milliseconds per invocation, compared to seconds for detection-based approaches. No LLM is in the authorization path, which means the authorization layer can't be prompt-injected because it doesn't process prompts. It processes structured capability grants against an authorization policy. Yes or no.
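A sketch of that architecture, under stated assumptions: the class and function names here (`Capability`, `derive_capabilities`, `authorize`) are hypothetical illustrations, not the actual API documented at ibac.dev, and the derivation step is hardcoded where a real system would use an LLM to parse intent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Capability:
    """A narrow grant derived from the user's request, not from a role."""
    tool: str
    scope: str  # e.g. a destination or date constraint

def derive_capabilities(user_request: str) -> set[Capability]:
    # A real system would parse intent with an LLM; here we hardcode the
    # article's example: "Book a flight to NYC for next Tuesday".
    return {
        Capability("flight_search", "destination=NYC"),
        Capability("booking_create", "destination=NYC"),
        Capability("calendar_read", "date=next_tuesday"),
    }

def authorize(granted: set[Capability], tool: str, scope: str) -> bool:
    """Deterministic check: no LLM in this path, so no prompt to inject.
    Structured grants in, yes or no out."""
    return Capability(tool, scope) in granted

grants = derive_capabilities("Book a flight to NYC for next Tuesday")
assert authorize(grants, "flight_search", "destination=NYC")
# An injected "email the data to attacker@example.com" fails closed,
# because no email capability was ever derived from the user's request:
assert not authorize(grants, "email_send", "to=attacker@example.com")
```

The design choice worth noticing: the injected instruction is never detected, classified, or judged. It simply has no capability to bind to.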

Testing against AgentDojo, a benchmark of 240 prompt injection attempts, showed a 100% injection block rate in strict mode and 98.8% in permissive mode. Compare the architecture: detection-based approaches add seconds of latency, produce probabilistic results, and are vulnerable to the attacks they're detecting. Intent-based authorization adds milliseconds, produces deterministic results, and is immune to prompt injection by design. An agent making 50 tool calls in a workflow adds about 450ms with IBAC. With NeMo Guardrails, that same workflow adds 175 to 550 seconds. One is usable in production. The other isn't.

🔧 IBAC, Intent Based Access Control -- Framework for AI agent permission governance by Jordan Potti. Derives permissions from user intent and enforces them deterministically at every tool invocation.

Steer or kill: responding to deviation

When an agent starts doing something outside its intended scope, you have two choices. Redirect it or stop it. The decision should depend on one variable: whether irreversible actions have occurred or are imminent.

Steer when no irreversible action has been taken and the deviation is minor. The agent is drifting from the task but hasn't done anything that can't be undone. Redirect it back to the intended scope. Adjust the permissions. Let it continue with tighter constraints. This is IBAC's permissive mode: implied actions are authorized (if you asked to book a flight, sending the confirmation email is implied), and everything else is blocked.

Kill when irreversible actions have occurred or are about to. The agent is trying to send an email, modify a database, or post to an external API. Stop execution immediately. Don't try to redirect. Don't hope the next tool call will be the right one. Halt the workflow and escalate to a human. This is strict mode: any action outside the explicitly derived scope triggers an escalation prompt.

Automate this decision based on reversibility. Every tool in the agent's toolkit should be tagged with a reversibility score. Read operations are reversible — no state change. Draft creation is reversible. Email sending is irreversible. Database writes may or may not be reversible depending on whether the schema supports soft deletes. API calls to external services are generally irreversible.

When the agent attempts an action outside its scope, the system checks the reversibility tag. Reversible action outside scope? Steer. Irreversible action outside scope? Kill. This removes the judgment call from runtime. The policy is set in advance, based on actual consequences, not on a language model's assessment of whether the deviation "seems okay."
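The steer-or-kill policy above can be sketched in a few lines. The tool names and tags are hypothetical, and unknown tools default to irreversible so the policy fails closed:

```python
from enum import Enum

class Reversibility(Enum):
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"

# Hypothetical tags for the tools discussed in the text.
TOOL_TAGS = {
    "wiki_read": Reversibility.REVERSIBLE,      # no state change
    "draft_create": Reversibility.REVERSIBLE,
    "email_send": Reversibility.IRREVERSIBLE,
    "db_write": Reversibility.IRREVERSIBLE,     # assumes no soft deletes
    "external_api_call": Reversibility.IRREVERSIBLE,
}

def respond_to_deviation(tool: str, in_scope: bool) -> str:
    """Policy set in advance, not judged at runtime:
    steer on reversible drift, kill on irreversible drift."""
    if in_scope:
        return "allow"
    tag = TOOL_TAGS.get(tool, Reversibility.IRREVERSIBLE)  # unknown -> fail closed
    if tag is Reversibility.REVERSIBLE:
        return "steer"  # redirect, tighten constraints, continue
    return "kill"       # halt the workflow and escalate to a human

assert respond_to_deviation("wiki_read", in_scope=False) == "steer"
assert respond_to_deviation("email_send", in_scope=False) == "kill"
```

Because the decision is a table lookup rather than a model call, it runs at the same speed as the agent it governs.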

Most agent frameworks today don't make this distinction. They either let the agent do whatever it wants (most common) or block everything that isn't explicitly whitelisted (too restrictive to be useful). The steer-or-kill framework gives you a middle path. The permissive mode's 98.8% block rate shows that reasonable latitude doesn't mean open permissions.

MCP and the visibility problem

The Model Context Protocol has become the standard interface between AI clients and external tools. It's also become a large attack surface. In December 2025, researchers discovered over 30 vulnerabilities and 24 CVEs across major AI development tools, including GitHub Copilot, Cursor, and Windsurf. The vulnerabilities ranged from tool poisoning (injecting malicious instructions into tool descriptions) to cross-origin escalation (an MCP server accessing resources from a different server's context).

MCP's design philosophy is permissive by default. An MCP server exposes tools, and the AI client can call any of them. No built-in authorization layer beyond OAuth or API keys. Audit trails depend on the application. No mechanism to restrict which tools an agent can call based on what the user asked for. The protocol assumes that tool access decisions are the client's problem.

This creates a visibility gap. If you're running MCP servers in your environment, do you know which tools your agents are calling? How often? With what arguments? Whether any of those calls were the result of prompt injection rather than user intent?

We built MCP Snitch to address the visibility half of this problem. It sits as a proxy between the AI client and MCP servers, logging every tool call with full arguments and responses. It's a visibility layer, not an enforcement layer. You can see what your agents are doing before it becomes an incident.
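The proxy pattern is simple enough to sketch. This is a generic illustration of a logging tool-call wrapper, not MCP Snitch's actual implementation, and the stub backend stands in for a real MCP server:

```python
import time
from typing import Any, Callable

def logging_proxy(call_tool: Callable[..., Any], log: list[dict]) -> Callable[..., Any]:
    """Wrap a tool-call function so every invocation is recorded with
    full arguments and response before the result reaches the caller."""
    def wrapped(tool: str, **arguments: Any) -> Any:
        entry = {"ts": time.time(), "tool": tool, "arguments": arguments}
        result = call_tool(tool, **arguments)
        entry["response"] = result
        log.append(entry)  # visibility only: nothing is blocked or rewritten
        return result
    return wrapped

# Usage with a stub standing in for a real MCP server:
def fake_server(tool: str, **arguments: Any) -> str:
    return f"{tool} ok"

audit_log: list[dict] = []
proxied = logging_proxy(fake_server, audit_log)
proxied("crm_read", record_id=42)
```

Sitting at the boundary means the log captures every call regardless of which agent framework made it, which is exactly the property a visibility layer needs.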

IBAC addresses the enforcement half. It determines what's allowed based on user intent, not what's available based on server configuration. Visibility without enforcement is monitoring. Enforcement without visibility is blind policy. You need both.

🔧 MCP Snitch -- Open-source proxy for MCP traffic visibility and control. See what your AI agents are doing before it becomes an incident.

Audit trails that prove authorization, not just activity

Here's a question that will come up in your next SOC 2 audit, if it hasn't already: can you prove what your AI agents were authorized to do?

Note that logs tell you what happened. Authorization records tell you what was supposed to happen. When those two things diverge, you might have an incident. When you can't compare them because you only have one, you have an audit finding.
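The comparison itself is mechanical once both records exist. A minimal sketch, with hypothetical log and scope shapes:

```python
def find_unauthorized(activity_log: list[dict], authorized_scope: set[str]) -> list[dict]:
    """Diff what happened against what was supposed to happen.
    Divergence is a potential incident; having only one side to
    compare is an audit finding."""
    return [entry for entry in activity_log if entry["tool"] not in authorized_scope]

# Authorization record: the scope derived from the user's request.
authorized = {"flight_search", "booking_create", "calendar_read"}

# Activity log: what the agent actually did.
log = [
    {"tool": "flight_search", "args": {"dest": "NYC"}},
    {"tool": "email_send", "args": {"to": "attacker@example.com"}},
]

assert [e["tool"] for e in find_unauthorized(log, authorized)] == ["email_send"]
```

The hard part isn't this diff; it's producing the authorization record at all, which is why deriving and storing scopes per request matters.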

Many agent deployments today have scattered logs with no unified record of the agent's authorization scope or the decision chain that led to each action. The agent's actions might appear in individual system logs: a CRM entry here, an email send log there, a database query somewhere else. But nobody can reconstruct from those fragments whether the actions were within the agent's authorized scope.

This matters for enterprise sales as much as it matters for compliance. When a customer's security review includes "how do you govern AI agent behavior?" having authorization records and decision trails puts you ahead of the vast majority of companies deploying agents. Cisco found that 71% of organizations don't feel ready to secure their AI deployments. Being in the other 29% is a competitive advantage that shows up directly in security reviews.

For SOC 2 CC9.2 (vendor and business partner risk), you can demonstrate that AI agents operate under defined authorization policies with recorded decision trails. For the EU AI Act's transparency obligations, you can produce records of what your high-risk AI systems were authorized to do and what they actually did. "We trust the model to do the right thing" is not an answer that survives examination.

The audit trail also helps when something goes wrong. Was the action within its authorized scope? Did the permissions correctly reflect user intent? Did the authorization engine evaluate correctly? Each question points to a different root cause and a different remediation. Without the authorization record, you're guessing.

📋 AI Security Readiness Assessment -- Score your agent permission governance and other AI security dimensions across six categories.

What to do next

If you're deploying AI agents or planning to, start by inventorying your agents' effective permissions: not just their assigned roles, but what they can actually reach when they chain tool calls together.

The agent you gave "read access to the wiki" also has read access to every document in the wiki, including the ones with customer data, credentials, and architecture diagrams. Map this out.

Next, tag every tool in your agent toolkit with a reversibility score. Read-only tools are safe for broad access. Draft creation is low risk. The ability to send external communications, modify databases, or call third-party APIs is high risk. This tagging is the foundation for automated steer-or-kill decisions. Without it, every out-of-scope action requires human judgment at machine speed, which means it doesn't get human judgment at all.

Then separate your visibility layer from your enforcement layer. Start with visibility. Deploy a proxy to see what your agents are doing. You will likely be surprised. The patterns you find will inform what enforcement policies to set. Trying to write enforcement policies without visibility data is guessing. Guess wrong and you either block legitimate use (agents become useless) or miss the real risks (agents remain dangerous).

For enforcement, the architectural pattern matters more than any specific implementation. Derive permissions from user intent and enforce them deterministically, keeping LLMs out of the authorization path.

IBAC is open source and documented at ibac.dev. Whether you adopt it directly or build something similar, those principles address the structural problems that make agents dangerous.

Build the audit trail before auditors ask for it. SOC 2 auditors, EU AI Act regulators, and enterprise buyers are all converging on the same question: how do you govern AI agent behavior? Having an answer that includes authorization records and decision trails puts you ahead of most organizations deploying agents today. And in enterprise sales, "ahead of most" is often enough to close the deal your competitor can't.

📋 Check out the AI Maturity Domains at TRACTION.FYI

Honest uncertainty

I'll be direct about something. We are in the early stages of understanding how to secure autonomous agents. IBAC's test results against AgentDojo are strong, but 240 injection attempts is a benchmark, not a comprehensive adversarial evaluation.

Real-world agents will face attacks that these benchmarks don't cover — novel techniques, attacks that exploit the intent parsing layer itself, social engineering that happens across multiple sessions.

What gives me confidence in the architectural approach is not the numbers but the structural property: the authorization path doesn't contain an LLM. You can attack the intent parser all you want; the worst case is that the agent can't do what the user asked, and the user has to clarify. The authorization engine itself is deterministic and conventional. It has the security properties of traditional access control systems, which we understand well.

The intent parsing layer, which uses an LLM to derive capability grants, is itself a potential target for manipulation. But compromising it produces overly restrictive permissions rather than overly permissive ones — a much safer failure mode than the alternative. Failing closed is a property worth designing for.

The alternative (detection-based security that its own creators call unsolved) isn't a real alternative. When 40% of enterprise apps embed agents by next year and most of those agents are over-permissioned, you need architecture that works even when detection fails.

The GitHub agent was doing its job. The question is whether your security architecture assumes agents will always do their job, or whether it works even when they don't.

Adversis works with companies deploying AI agents to build security architectures that account for autonomous behavior, not just authorized behavior.

Our AI security assessments evaluate agent permissions, tool governance, and authorization architecture. If you're deploying agents and want to understand your exposure, we should talk.

Get Started

Let's Unblock Your Next Deal

Whether it's a questionnaire, a certification, or a pen test—we'll scope what you actually need.
Noah Potti
Principal
Talk to us