Field Notes

A demonstration of indirect prompt injection

Indirect prompt injection, reproduced on demand. Four poisoned documents, a seeded inbox, and an attacker console streaming what the LLM leaked.

April 22, 2026

The sequence, when reproduced in a laboratory, takes less than a minute. An AI assistant is asked to summarise a single resume from a seeded inbox. The summary that returns to the operator looks clean. What the operator does not see is a parallel HTTP POST to a webhook under the control of the attacker, carrying compensation bands and budget codes drawn from emails the language model had no legitimate reason to touch.

The resume contained no visible instruction to the model. The instruction was there. It sat inside an HTML comment, styled to resemble a note from the Applicant Tracking System.

This is indirect prompt injection. It is not a novel class of attack. Greshake and colleagues described it in a February 2023 paper, arXiv 2302.12173, and Microsoft, Anthropic, Google, and others have documented variants in the years since. It remains, as of this writing, one of the most reliable methods available for hijacking a chat-style assistant that reads documents authored by people other than the operator.

For the purposes of demonstrating the technique in front of a live audience, Adversis has published a training tool on GitHub. The repository is Adversis/cutout. It's a self-contained build: Docker Compose, a React front end, a FastAPI back end, and either a local Ollama runtime or the Anthropic API for a frontier-model comparison. It ships with three poisoned documents, a fourth scenario that uses a tracking pixel in lieu of any model-side network call, a seeded inbox of clean mail for context, an attacker console on a separate browser tab, and a test suite of sixty-seven pytest cases. It runs offline, against local inference, which matters in a venue with a hostile or absent network.

The tool itself

The architecture is spare.

The presenter opens two tabs. The first is the Chat view. It resembles any contemporary assistant. The second is the Attacker Console. Three subordinate tabs: Plant, for dropping a poisoned document into the inbox; Scenarios, a dossier view of each pre-built attack across three renderings; and Monitor, which streams, in real time, every artefact the attack has succeeded in extracting.

A Settings drawer permits a live switch between the local Llama model and Anthropic's Claude without restart. The purpose of the switch is comparative. The smaller the model, the more readily it follows an instruction embedded inside the document it has been asked to summarize. Claude, at the time of writing, resists the same payloads more often than it does not. This is not presented as a defence. It is presented as a data point.

Four scenarios

The first is the poisoned resume. An applicant called "Sarah Chen". The injection is an HTML comment framed as a routine ATS processing note. The model, tuned to be helpful, honours the request. Compensation bands drawn from adjacent emails leave the host and arrive at /webhooks/ingest, a path deliberately named to resemble any undistinguished SaaS integration. Nothing in the chat window discloses the event.

The second is a Q4 SOC 2 audit report, attributed to Meridian Partners. The injection presents itself as an auditor's processing note, requesting a cross-reference appendix. The model obliges. Stripe and SendGrid API keys, lifted from neighbouring mail, travel out over the same webhook.

The third is a set of third-floor building specifications with an injection dressed up as an internal project-management note inside a blockquote. The payload, in this case, is a million dollars' worth of CAPEX codes and vendor identifiers, exfiltrated via webhook and via a forwarded email the assistant composes on its own.

The fourth is the one a defender should find least comfortable. The payload is inside a Q2 brand-refresh brief. It does not ask the model to POST anything. It does not ask the model to write anything sensitive into its reply. It asks the model to include, in its rendered markdown, a tracking pixel: ![](/webhooks/pixel?payload=...). The chat client renders the markdown. The browser fetches the image. The payload, carried in the query string, is captured at the attacker's endpoint. The model never invoked a tool. The model never made an HTTP request. The rendering environment did both.

What the documents look like

Each scenario is shown in three views under the Scenarios tab. The Victim View renders the document as the recipient would see it in a mail client: the injection is invisible. The LLM View renders the raw markdown the model received, with the injected span highlighted. The Injection Only view isolates the payload with its technique labelled.

The tactic across all four is mimicry. No banner text reading IGNORE PREVIOUS INSTRUCTIONS. No SYSTEM OVERRIDE. Instead: routine enterprise automation. An ATS calibration note. An audit cross-reference request. A project-management cost-alignment note. A marketing-automation pixel hint. Each reads, in isolation, like ordinary business correspondence. Safety-tuned models do not flag ordinary business correspondence.

Countermeasures

The demonstration has a deliberate gap. It shows only attacks that succeed. The roadmap proposes a Defence Toggle Harness, not yet built, which would permit the presenter to re-run each scenario with controls enabled and to record the deltas. Until it ships, the post-demo discussion is carried by talking points rather than by live evidence.

The controls worth discussing, in the order a defender should consider them:

An allowlist at the agent's network boundary. Deny outbound HTTP from the model and its tools to anywhere the operator has not approved in advance. Breaks scenarios one through three. See MCP Snitch.
Markdown image stripping in the rendering surface. Kills the pixel beacon. Costs the operator a rendered logo.
Per-document context isolation. The model is permitted to read only the document it has been asked to process, not the adjacent inbox. This shrinks the set of secrets available to exfiltrate to whatever the poisoned document itself contains.
Output filtering against known sensitive-data patterns, applied before the chat client renders. The Monitor tab already demonstrates the detection regexes. They work as a pre-render filter too.
Human confirmation for any action with network or send-mail side effects. Tedious. Reliable.

None of these is novel but all are structural. None depends on the model refusing the instruction, which is a thing the model cannot be relied upon to do when the instruction reads like a memo from the auditor.

Close

The repository is github.com/Adversis/cutout. A clone, a docker compose up, and an Ollama pull is sufficient. The presenter's guide is in the README. The demonstration runs offline against local inference, which matters in a venue with a hostile or absent network.

The class of attack has been in the literature since early 2023. What the tool provides is reproducibility, in a conference room, in under two minutes, against a live audience. That is the form in which a defender tends to learn.

‍