Detecting prompt-injected agents with canary tools and cloud audit joins

Contribution

This post adds a detection design for browser and research agents that the cited sources do not spell out: treat canary tools, planted credentials and allowlisted tool parameters as first-class audit events, then join those events to cloud data reads and outbound delivery. The contribution is defensive detection content, not a claim that one model vendor has solved prompt injection.

The pattern

Browser and research agents are moving from chat windows into work surfaces that hold mail, documents, browser sessions, source repositories and cloud consoles. That shift changes prompt injection from a content-filter problem into an audit problem. The agent reads data that the operator did not author, then uses a tool with permissions the operator already granted. If the untrusted content overrides the operator's instructions, the resulting action often looks like normal automation unless the defender records the run boundary, the tool call and the data plane event.

Anthropic's browser-use write-up is useful because it is unusually frank about the remaining gap. It describes a routine mail task where an email contains hidden instructions that tell the agent to forward messages containing the word "confidential" to an external address. Anthropic also says every webpage, embedded document, advertisement and dynamically loaded script can become an instruction source for a browser agent. The vendor reports progress against an internal adaptive attacker, but its own conclusion is the part defenders should keep: a 1% attack success rate still matters when the agent has access to high-value data.

The AgentShield paper takes the next step from prevention to detection. It argues that many defences try to stop indirect prompt injection before it works, but that defenders also need a signal when a compromised agent slips through. Its design places fake tools, planted credentials and parameter allowlists inside the agent tool interface. When a hijacked agent follows a hidden instruction, it often touches one of those traps. The paper reports 90.7% to 100% catch rates for successful attacks on commercial models in its tested setting, with zero false alarms on 485 normal-use tests.

Google's Gemini Deep Research documentation adds the operational clue. The agent can take documents directly as input, can connect to external tools through MCP servers and runs long tasks in the background. That is not a vendor warning, but it tells a SOC where records should exist: interaction start, plan approval where used, document ingestion, MCP tool connection, background status, final output and any external tool call. In other words, an agent run has enough structure to become a detection object if the platform exports it.

The pattern across those sources is plain. Prevention metrics belong in vendor evaluations. Enterprise detection should watch the boundary between untrusted content and privileged action.

Why it matters to cloud defenders

Cloud teams are exposed because agents increasingly sit beside identities that already reach SaaS and cloud data. A browser assistant with mailbox access, a research agent with uploaded design documents, or a coding agent wired to a repository can bridge content from the public web into a privileged cloud session. The attacker does not need to steal an IAM key first if the agent will use the operator's existing session, OAuth grant or MCP connector on command.

This is awkward for classic cloud detection. CloudTrail, Entra audit logs, Google Cloud audit logs and SaaS audit logs normally record the final action, not the text that made the agent choose it. A suspicious GetObject, files.export, mail.send or external webhook call may show the identity, the IP address and the target resource, but not the hidden instruction in the page or document that caused it. Agent platforms fill that missing middle only if they emit run IDs, tool-call names, input provenance and approval events.

The cloud relevance is strongest in three deployment shapes.

First, the agent runs inside a developer or analyst workstation but uses cloud identities through a browser. The cloud control plane sees a normal user or device. The agent log, if retained, shows a tool call that followed untrusted web content. Correlation has to join those two records within the same short window.

Second, the agent runs as a service account in a workflow engine. A research worker, ticket triage bot or internal assistant may have a fixed identity with access to document stores, issue trackers and object storage. Here the cloud log does not look like a human session at all. The useful signal is a change in tool path: a run that should read a ticket suddenly calls an export, webhook or shell tool after parsing external content.

Third, the agent connects to MCP servers or other tool gateways. That gateway becomes the choke point. If it records tool metadata, argument hashes, caller identity, approval state and data classification labels, the SOC can alert before the incident becomes a general SaaS hunt.

This is why canary-style detection is attractive. It does not ask the model to classify every prompt. It plants actions that benign workflows should not call. A fake send_external_email tool, a credential named do_not_use_prod_backup_key, or an allowlist that rejects a new recipient domain turns an invisible instruction-following failure into a discrete event.

ATT&CK mapping

The entry point is indirect prompt injection through untrusted content, which is not a clean ATT&CK technique by itself. For the cloud detection chain, the first ATT&CK-mappable behaviour is the agent using a command, script or tool interface in a way the operator did not request. That maps most honestly to T1059 Command and Scripting Interpreter when the agent reaches a shell, code runner, browser automation command or MCP tool that executes an action. The injected instruction is the trigger, but the observable event is the tool execution.

Credential use sits beside that, not inside it. If the agent acts through the operator's existing cloud session, the cloud side often looks like T1078.004 Valid Accounts: Cloud Accounts. That should be named in the analysis, but it should not be the only listed technique unless the logs prove credentialed cloud access. A prompt injection that only rewrites a draft email is not a cloud-account compromise. A prompt injection that causes a signed-in browser to export Google Drive files, read S3 objects or call an Entra admin API is.

The payoff depends on what the agent reaches. For document repositories and issue trackers, T1213 Data from Information Repositories is the better fit. For object stores, bucket reads and exported blobs, T1530 Data from Cloud Storage is more precise. If the run then sends the output to an attacker-controlled destination, the exfiltration behaviour should still be tracked in prose, even when the formal technique list stays limited to the techniques named above.

That end-to-end mapping matters because a single prompt-injection label hides the defender's real choices. The SOC can alert on the agent tool call, on the valid cloud identity action, on the repository or storage read, or on the outbound delivery. The best detection joins at least two of those stages. A lone GetObject by a known automation role may be normal. The same GetObject five seconds after a canary tool trigger in the agent gateway is not.

Detection guidance

Start with the agent runtime, not the model. The minimum useful event has run_id, actor_identity, agent_identity, tool_name, tool_arguments, input_source, approval_state, destination, timestamp and a flag for whether the tool call followed untrusted content. If the platform cannot emit that shape, put a proxy in front of the tool gateway and log it there. MCP servers are a practical place to do this because all tool calls already cross the same boundary.
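As a rough illustration, the event shape described above could be modelled as follows. This is a minimal sketch, not any platform's export format: the class name AgentToolEvent, the helper to_log_line, the approval_state values and every sample value are assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AgentToolEvent:
    """One tool call as seen at the agent runtime or tool gateway (hypothetical shape)."""
    run_id: str
    actor_identity: str      # human or service account the agent acts for
    agent_identity: str      # the agent itself
    tool_name: str
    tool_arguments: dict     # redact or hash secrets before logging
    input_source: str        # origin of the content that preceded the call
    approval_state: str      # assumed values: "approved", "not_required", "denied"
    destination: str
    timestamp: str           # ISO 8601, UTC
    followed_untrusted_content: bool


def to_log_line(event: AgentToolEvent) -> str:
    """Serialise the event as one JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)


# Sample values are invented for illustration only.
example_event = AgentToolEvent(
    run_id="run-0001",
    actor_identity="analyst@example.com",
    agent_identity="research-agent-prod",
    tool_name="export_document",
    tool_arguments={"document_id": "doc-123", "format": "pdf"},
    input_source="https://example.org/untrusted-page",
    approval_state="not_required",
    destination="drive.example.com",
    timestamp="2025-01-01T00:00:00+00:00",
    followed_untrusted_content=True,
)

print(to_log_line(example_event))
```

One JSON line per tool call keeps the record easy to ship to a SIEM and to join on run_id later.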

The first alert is a canary tool trigger. Register one or more tools that no benign task should call, such as export_all_documents, send_external_secret, rotate_backup_key or open_confidential_mailbox. Do not make them functional. The detector fires when any agent run attempts to call the tool, and the event should include the prior content source and the next real tool call. The alert condition is narrow: tool_name in canary_tools from any production agent identity. The likely false positive is a test run or a red-team exercise, so the tuning surface is a list of sanctioned test run IDs and lab tenants, not a broad user exclusion.
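The canary condition is simple enough to sketch directly. The tool names come from the paragraph above. The event fields, the tenant label and the sanctioned run ID are hypothetical and would come from your own inventory.

```python
# Tools that no benign task should call (names taken from the text above).
CANARY_TOOLS = frozenset({
    "export_all_documents",
    "send_external_secret",
    "rotate_backup_key",
    "open_confidential_mailbox",
})

# Tuning surface: sanctioned test runs and lab tenants (hypothetical values).
SANCTIONED_TEST_RUNS = frozenset({"redteam-run-001"})
LAB_TENANTS = frozenset({"lab"})


def is_canary_alert(event: dict) -> bool:
    """Return True when a canary tool is called outside sanctioned testing."""
    if event["tool_name"] not in CANARY_TOOLS:
        return False
    if event["run_id"] in SANCTIONED_TEST_RUNS:
        return False
    if event.get("tenant") in LAB_TENANTS:
        return False
    return True


print(is_canary_alert(
    {"run_id": "run-42", "tenant": "prod", "tool_name": "export_all_documents"}
))
```

The exclusions are keyed on run IDs and tenants rather than users, so a compromised production run by a red-team member's account still alerts.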

The second alert is use of a planted credential or honeytoken. Put a fake token in a controlled document, email or repository that the agent may read but should never use. The string should be unique, inert and labelled as a trap in the monitoring backend, not in the document body. Alert when the token appears in a tool argument, outbound request, generated file or cloud secret lookup. This aligns with the AgentShield idea, but the cloud version should also watch for the same token in egress proxy logs and SaaS audit exports.
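A honeytoken check can be a plain substring scan over whatever structured payload the gateway or proxy logs. The sketch below assumes payloads arrive as nested dicts and lists. The token value and its label are invented, and the registry stands in for the monitoring backend.

```python
# Registry kept in the monitoring backend, not in the planted document.
# Token and label are hypothetical.
HONEYTOKENS = {
    "hk_trap_7f3c9a1e5b2d": "planted in a controlled runbook document",
}


def iter_strings(value):
    """Yield every string found in a nested dict/list/tuple structure."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from iter_strings(key)
            yield from iter_strings(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_strings(item)


def find_honeytokens(payload) -> list:
    """Return the sorted list of registered tokens that appear in the payload."""
    hits = set()
    for text in iter_strings(payload):
        for token in HONEYTOKENS:
            if token in text:
                hits.add(token)
    return sorted(hits)


outbound = {"url": "https://collector.example.net/?k=hk_trap_7f3c9a1e5b2d"}
print(find_honeytokens(outbound))
```

A literal match will miss a token that has been encoded or split before use, so this is a first-pass signal rather than complete coverage.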

The third alert is a parameter-allowlist break. For tools that send mail, post webhooks, open URLs, query object storage or export documents, keep explicit allowed values per workflow. A finance reconciliation agent may call a storage-read tool only for one bucket prefix. A support triage agent may post only to approved ticketing endpoints. A research agent may browse the web but may not submit forms or upload local files. Alert on a new recipient domain, bucket, repository, webhook host or file export type when the same run has processed untrusted content.
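One way to express per-workflow allowlists is a table keyed by workflow and tool, with a small extractor that pulls out the value to check. Everything named here (workflow names, tool names, argument keys, hosts) is hypothetical. Treating a tool with no policy as a break is a design choice of this sketch, not something the sources prescribe.

```python
from typing import Optional
from urllib.parse import urlsplit


def webhook_host(args: dict) -> str:
    """Host part of the target URL, lower-cased."""
    return (urlsplit(args["url"]).hostname or "").lower()


def recipient_domain(args: dict) -> str:
    """Domain part of the recipient address, lower-cased."""
    return args["to"].rsplit("@", 1)[-1].lower()


# (workflow, tool) -> (extractor, allowed values). All names are illustrative.
POLICY = {
    ("support_triage", "post_webhook"): (webhook_host, {"tickets.example.com"}),
    ("mail_assistant", "send_mail"): (recipient_domain, {"example.com"}),
}


def allowlist_break(workflow: str, tool_name: str, args: dict,
                    saw_untrusted_content: bool) -> Optional[str]:
    """Return a reason string when the call breaks the allowlist, else None."""
    if not saw_untrusted_content:
        return None  # the alert is scoped to runs that processed untrusted content
    rule = POLICY.get((workflow, tool_name))
    if rule is None:
        return f"no allowlist for {workflow}/{tool_name}"
    extract, allowed = rule
    value = extract(args)
    if value in allowed:
        return None
    return f"{tool_name}: '{value}' not in allowlist for {workflow}"


print(allowlist_break("support_triage", "post_webhook",
                      {"url": "https://collector.example.net/in"}, True))
```

A gateway could also reject disallowed calls outright. The function above only decides whether to raise the detection event.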

The cloud correlation should be short and mechanical. Join the agent run to CloudTrail, Google Cloud audit logs, Entra audit logs, Google Workspace audit logs, Microsoft 365 audit logs and egress proxy records on actor identity, source IP, service account, browser profile or run ID where available. A high-signal rule is: canary trigger or parameter break in an agent run, followed within ten minutes by a sensitive data read or external send from the same actor. Sensitive reads include S3 GetObject on labelled buckets, Google Drive export, SharePoint file download, mailbox search, secret-manager access and repository clone.
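The ten-minute rule can be sketched as a windowed join over two already-normalised event lists. The field names (actor, action, resource, time) and the action labels are assumptions for the example. Real CloudTrail, Workspace or Microsoft 365 records would need mapping into this shape first.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=10)

# Normalised labels for sensitive reads and sends (hypothetical naming).
SENSITIVE_ACTIONS = frozenset({
    "s3_get_object", "drive_export", "sharepoint_download",
    "mailbox_search", "secret_access", "repo_clone", "external_send",
})


def correlate(agent_alerts: list, cloud_events: list) -> list:
    """Pair each agent alert with sensitive cloud events by the same actor
    that occur at or after the alert and within WINDOW."""
    matches = []
    for alert in agent_alerts:
        start = alert["time"]
        for event in cloud_events:
            if event["actor"] != alert["actor"]:
                continue
            if event["action"] not in SENSITIVE_ACTIONS:
                continue
            if start <= event["time"] <= start + WINDOW:
                matches.append((alert["run_id"], event["action"], event["resource"]))
    return matches


t0 = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
alerts = [{"run_id": "run-7", "actor": "svc-research", "time": t0}]
cloud = [{"actor": "svc-research", "action": "s3_get_object",
          "resource": "s3://labelled-bucket/report.csv",
          "time": t0 + timedelta(seconds=5)}]
print(correlate(alerts, cloud))
```

The nested loop is fine for a sketch. At SIEM scale the same logic would normally be a time-bounded join in the query language of the log platform.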

The tuning work is mostly inventory. Name each production agent, its expected tool list, its approved domains and the data classes it may touch. Separate analyst sandbox agents from production agents. Store red-team run IDs. Record whether user approval was required for each sensitive action. If a platform only logs final output and not tool calls, treat it as unsuitable for high-value cloud data until the logging gap is closed.

What to do now

Defenders do not need to wait for a perfect prompt-injection scorecard. The useful work is smaller and less glamorous.

Inventory every browser, coding and research agent that can reach mail, repositories, ticketing systems, object storage or cloud consoles. Record the identity it uses and the tool gateway it calls.
Turn on tool-call logging before expanding permissions. The log should preserve run IDs, tool names, argument metadata, input provenance and approval state. Do not store raw secrets in logs, but do keep enough hashed or labelled metadata to join the agent event to cloud audit records.
Add one canary tool and one planted honeytoken to each production agent class. A single clean trap beats a long list of generic prompt filters. Test that a triggered trap reaches the SIEM with the agent run ID intact.
Build allowlists for high-risk tools: external send, URL open, file upload, object-store read, document export, shell command and webhook post. Alert on new destinations after untrusted content ingestion.
Join agent alerts to cloud data reads. For AWS, start with CloudTrail management and S3 data events for labelled buckets. For Google Cloud, use Admin Activity, Data Access where enabled and Workspace audit logs for Drive exports. For Microsoft, join Entra sign-in and audit logs with Microsoft 365 file and mail events.
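For the logging step above, one way to keep joinable metadata without storing raw argument values is a keyed fingerprint of the canonicalised arguments. This is a sketch under assumptions: the function name, the key handling and the sample arguments are illustrative, and a real deployment would load the key from a secret store.

```python
import hashlib
import hmac
import json


def argument_fingerprint(args: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON form of the tool arguments.

    The same arguments always give the same fingerprint for a given key,
    so events can be joined without logging the raw values.
    """
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, canonical, hashlib.sha256).hexdigest()


# Placeholder key for illustration only.
LOG_KEY = b"replace-with-a-managed-secret"

fingerprint = argument_fingerprint(
    {"bucket": "finance-reports", "object_key": "q3/summary.csv"}, LOG_KEY
)
print(fingerprint)
```

Using a keyed hash rather than a bare SHA-256 makes it harder for someone with log access to confirm guessed argument values.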

The sources converge on one uncomfortable fact: browser and tool-using agents will keep seeing hostile content. The practical question is whether the defender can tell when hostile content crossed into privileged action. Canary tools, honeytokens and parameter allowlists make that visible without pretending the model will reject every attack.