Securing AI Agents with Information-Flow Control
An overview of Microsoft's latest paper
When securing agents, you don’t need to stop an agent from having intrusive thoughts. But you do have to stop it from acting on them. In this case, intrusive thoughts can stem from prompt injection. This is where Microsoft’s new research paper, Securing AI Agents with Information-Flow Control, comes in.
It offers a way to enforce deterministic security policies based on the integrity and confidentiality of data that reaches the LLM.
The idea is simple, but the design needed to defend agentic systems against evil data is complex. The paper introduces the concept of a Flow Integrity Deterministic Enforcement System (FIDES).
The core of FIDES is a specialized planner agent. It uses an LLM to create a plan of action and orchestrate tool calls. It loops through those tool calls, reviewing each tool’s output to decide the next step, until it determines the task is complete. In each invocation, the planner appends the latest message to the conversation history. This is important because if a tool, like a browser, accesses a web page containing malicious instructions (aka prompt injection), an attacker can hijack that agent.
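To make that loop concrete, here’s a minimal Python sketch of the basic planner pattern. The llm_complete and call_tool helpers are hypothetical placeholders I’m passing in, not FIDES’s actual API; the point is simply that every tool result lands straight in the conversation history the LLM reads next.

```python
from typing import Any, Callable

def basic_planner(
    task: str,
    llm_complete: Callable[[list[dict]], dict],  # hypothetical: LLM proposes the next step
    call_tool: Callable[[str, dict], str],       # hypothetical: executes a named tool
) -> str:
    """Basic planner loop: every tool result is appended verbatim to the
    conversation history, so a prompt injection hidden in any tool output
    lands directly in the planner's context."""
    history: list[dict[str, Any]] = [{"role": "user", "content": task}]
    while True:
        step = llm_complete(history)
        if step["type"] == "final_answer":
            return step["content"]               # task complete
        result = call_tool(step["tool"], step["args"])
        history.append({"role": "tool", "content": result})  # untrusted text enters context here
```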
To address those issues, Microsoft created a more advanced planner that stores risky tool output in variables to silo it off from the rest of the agentic system. To safely use this data, it employs quarantined LLMs (like a Dual LLM pattern) with constrained outputs. This extraction process effectively endorses the specific slice of data, scrubbing the “untrusted” label so it can be used in subsequent tool calls.
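Here’s a rough sketch of that variable-and-quarantine idea. The names (Variable, endorse_slice, quarantined_llm) are my own illustrations of the concept, not the paper’s actual interfaces.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Variable:
    """A handle to risky tool output. The raw value is never placed in the
    planner's context; the planner only sees the variable name and its label."""
    name: str
    value: str
    integrity: str = "untrusted"

def endorse_slice(
    var: Variable,
    question: str,
    output_schema: dict,
    quarantined_llm: Callable[[str, str, dict], Any],  # hypothetical quarantined-LLM call
) -> Any:
    """Ask a quarantined LLM a question about the variable's contents and force
    the answer into a constrained schema (e.g., a boolean or a small object).
    Only this narrow, structured result gets endorsed for later tool calls."""
    return quarantined_llm(question, var.value, output_schema)
```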
We’ll walk through an example of this later, but first, we need to understand how FIDES constrains an agent. It starts with understanding and labeling the data that enters the agent’s context.
Remember that with agentic systems, data is the new perimeter. FIDES uses Information-Flow Labels, which are defined in two ways:
Confidentiality: Who is authorized to see the information? It goes from public (everyone) to secret (restricted to certain users). Information can flow from public to secret, but not from secret to public. The goal of this label is to prevent data leakage.
Integrity: Defines the trustworthiness of the data's origin (trusted vs. untrusted). Information can flow from trusted to untrusted, but not the other way. Think of it this way: any trusted data that touches untrusted data is forever tainted. That once-trusted data now has a stain that spreads to everything else it touches. The goal of this label is to mitigate malicious instructions that could hijack the agent.
Labels are attached to data as metadata and follow the data along its entire journey. The label can be hard-coded (e.g., in the tool call) or can also originate from existing security solutions such as Microsoft Purview, Google Workspace, or other DLP solutions that label the data. It could even be inferred without explicit labeling, like email clients that flag messages from external domains or unknown senders. If it can’t be inferred, the developer can resort to safe defaults. The key here is that it’s up to the agent developer.
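A toy version of the two labels, their flow rules, and one way a developer might infer an integrity label from email metadata could look like this. The enums, the combine rule, and the domain check are my own illustration of the concepts, not code from the paper.

```python
from enum import IntEnum

class Confidentiality(IntEnum):
    PUBLIC = 0
    SECRET = 1

class Integrity(IntEnum):
    UNTRUSTED = 0
    TRUSTED = 1

def confidentiality_can_flow(src: Confidentiality, dst: Confidentiality) -> bool:
    return src <= dst   # public -> secret is fine; secret -> public is not

def integrity_can_flow(src: Integrity, dst: Integrity) -> bool:
    return src >= dst   # trusted -> untrusted is fine; untrusted -> trusted is not

def combine_integrity(a: Integrity, b: Integrity) -> Integrity:
    # Any mix that includes untrusted data is untrusted ("the stain spreads").
    return min(a, b)

def infer_email_integrity(sender: str) -> Integrity:
    # Hypothetical safe default: anything outside the org's domain is untrusted.
    return Integrity.TRUSTED if sender.endswith("@internal.example.com") else Integrity.UNTRUSTED
```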
With the data labeled, we can now move on to the policies that dictate what an agent is allowed to do. This is accomplished through a policy engine that deterministically enforces those security policies.
FIDES uses data labels to enforce defined security policies. The paper outlines two fundamental security policies:
Trusted Action: Only allow tool calls based on input from trusted sources (Integrity). The plural “sources” is intentional. When there are multiple tool calls, FIDES reduces the likelihood that output from one tool call negatively affects the action of another (i.e., it avoids the stain mentioned earlier). This is where those variables come in: the stain is confined to the specific variable.
Permitted Flow: Only allow a tool call that can send data to proceed if the recipients are permitted to read it (Confidentiality). The goal here is to avoid an agent leaking sensitive data.
Note the emphasis on tools. This is where the real risk of agents comes into play. FIDES broadly classifies tools into three categories:
Tools that have consequential actions
Tools that egress data
Tools that do neither
Before a tool is called, the policy engine verifies the integrity and confidentiality of the data involved and determines whether that tool call is allowed.
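Building on the toy labels from the earlier sketch, that pre-call check could look something like the following. The ToolSpec fields (consequential, egresses_data, reader_clearance) are assumptions I’m using to illustrate the two policies, not FIDES’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class Label:
    integrity: Integrity              # from the label sketch above
    confidentiality: Confidentiality

@dataclass
class ToolSpec:
    name: str
    consequential: bool               # e.g., set_event, send_email
    egresses_data: bool               # e.g., send_email, post_to_web
    reader_clearance: Confidentiality = Confidentiality.SECRET

def check_tool_call(tool: ToolSpec, arg_labels: list[Label]) -> bool:
    # Trusted Action: consequential tools may only act on trusted inputs.
    if tool.consequential and any(l.integrity is Integrity.UNTRUSTED for l in arg_labels):
        return False
    # Permitted Flow: data may only leave through a tool whose recipients are
    # cleared to read everything being sent.
    if tool.egresses_data and any(l.confidentiality > tool.reader_clearance for l in arg_labels):
        return False
    # Tools that neither act consequentially nor egress data are always allowed.
    return True
```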
Microsoft’s tests found excellent results. With policy checks enabled, FIDES stopped all prompt injection attacks during their tests. And when paired with reasoning models, it completed 16% more tasks than the basic planner, rising to 24% with further prompt tuning.
Said differently, they made the agent more secure AND performant. This is what you want to see!
* With policy checks enabled, FIDES stops all prompt injection attacks in AgentDojo. Without policy checks, all planners, including FIDES, succumb to practical prompt injection attacks (PIAs).
* With policy checks enabled and using OpenAI’s reasoning models (o1, o3, o4-mini), FIDES completes on average about 16 % more tasks than a basic planner. With further prompt tuning, this rises to 24 %, approaching the performance of a human oracle.
* With policy checks disabled, the extra complexity of selectively hiding and revealing information does not decrease the overall task completion rate of FIDES w.r.t. a basic planner when using reasoning models.

Let’s put this all together with an example from the paper. You ask your email agent to read the top three emails in your inbox and check whether there was a request to set up a meeting. If there was, the agent should create a calendar event.
The planner agent devises a plan and begins to execute. At each tool call, the agent checks the security policy to confirm whether it is allowed. It executes the tools in the following manner:
read_emails: The agent recognizes that the emails are from an untrusted sender and labels them as untrusted. To accomplish the task, it understands it must route this data through a quarantined LLM, since the tool-calling restrictions prevent it from acting on untrusted sources directly.
query_llm: The agent calls a quarantined LLM to review the emails and return a Boolean (yes/no) response for each email indicating whether it contains a calendar invite. The answer to the tool call is stored as a variable.
inspect: The agent reviews the constrained Boolean response to determine whether any emails included a meeting request. After seeing an email with a meeting request, it readies the next tool call.
query_llm: The agent uses the quarantined LLM again to extract meeting request details, using constrained outputs (e.g., only returning the date/time, users, etc.). This shifts the integrity label from untrusted to trusted, allowing the next tool call.
set_event: The agent creates a calendar event using the constrained details received from the previous call.
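Putting that walkthrough together, a rough trace of the plan might look like the sketch below, reusing the Variable/endorse_slice idea from earlier. The tool names follow the paper’s example, but the schemas and helper functions are illustrative stand-ins, not a real API.

```python
# Constrained output schemas for the quarantined LLM (illustrative only).
BOOL_SCHEMA = {"type": "boolean"}
MEETING_SCHEMA = {
    "type": "object",
    "properties": {
        "start": {"type": "string"},  # date/time only
        "attendees": {"type": "array", "items": {"type": "string"}},
    },
}

def handle_meeting_requests(read_emails, set_event, quarantined_llm) -> None:
    # read_emails: raw bodies come back as untrusted Variables, never as plain text.
    emails = read_emails(top=3)

    for email in emails:
        # query_llm: constrained to a boolean, so injected instructions can't leak out.
        has_request = endorse_slice(
            email, "Does this email ask to set up a meeting?", BOOL_SCHEMA, quarantined_llm
        )
        # inspect: the planner only ever sees the boolean, not the raw email.
        if not has_request:
            continue
        # query_llm again: extract only the narrow fields needed for the event.
        details = endorse_slice(
            email, "Extract the meeting date/time and attendees.", MEETING_SCHEMA, quarantined_llm
        )
        # set_event: allowed, because the constrained extraction is now endorsed as trusted.
        set_event(start=details["start"], attendees=details["attendees"])
```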
While this design helps prevent mishaps for that email agent, I still see some inherent issues. In the example above, if an event’s description in the email contains a prompt injection, it could still be inserted into the calendar invite. While the query_llm call attempts to enforce structure, it doesn’t validate the intent behind what is being added to the description field.
Previous research shows that calendar invites with prompt injection can be used to trick agents. In this context, the calendar invite would be treated as trusted because it comes from a “trusted” source: the user’s own calendar.
That same email agent, when asked to summarize the day’s meetings, could then be hijacked because it starts a fresh plan, one that doesn’t recognize that the calendar invite originally came from an untrusted source.
A trend I see with these secure designs is that they don’t account for “trusted” accounts being compromised. They assume bad data only comes from bad sources, which is often accurate. But I’ve seen this story before.
In traditional network attacks, I often saw security teams focus efforts on attackers creating new admin accounts. It was based on the mindset that an attacker would gain access to an environment, create a new administrator account, and use that account to rummage through the environment.
And yes, that happened. But more often than not, the attacker would simply gain access to legitimate accounts and use those to rummage around the environment. They would just blend in with the existing environment.
This is why you need run-time protection in place. You must start with secure designs, but security doesn’t end there. Run-time security, where you actively monitor what an agent is doing and prevent malicious actions, is the insurance policy for secure design.
This is where Evoke Security can help.
If you have questions about securing AI, let’s chat.
