Exploring Secure Agentic Design Patterns

Can we architect our way to a more secure future?

Effectively securing agentic systems won't happen with a single product. Security has never been, and will never be, “solved” that way. The path forward starts with secure design principles for agentic systems, with security tooling sprinkled in to augment and support them.

Naturally, the question becomes: what do secure design patterns look like for agents? Some light reading this week, Design Patterns for Securing LLM Agents against Prompt Injections, touches on exactly this.

The authors asked, “what kinds of agents can we build today that produce useful work while offering resistance to prompt injection attacks?” More broadly, how can we enable agents to be useful without vastly increasing an organization’s attack surface?

The paper offers six different design patterns that can help keep that blast radius as small as possible. The common denominator with these techniques is simple:

When an agent ingests untrusted content, the system must be designed so that the input can’t trigger malicious actions.

This is equivalent to declaring that no user should be able to click a malicious link in an email. It’s bold, but necessary. Is it even possible? Let’s explore the design patterns to find out.

Action-Selector: let an agent decide the best action to take from a list of prescribed actions. In other words, give the agent a menu of options and let it choose the one that will best accomplish the goal. The blast radius of prompt injection is confined to the actions the agent is allowed to take. This won’t necessarily stop an attacker from abusing the allowed tool calls, but it does limit the impact to a more known state. For example, if the only allowed action is reading data from a database, an attacker can’t make the agent execute another tool, but they could trick it into returning more data than it should.

This is one of the more common approaches I’ve seen companies take, as it’s an easy way to dip your toe into the agentic pool. It’s relatively safe and also stays within the realm of predictable outcomes.
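To make the pattern concrete, here’s a minimal Python sketch. The `call_llm` function is a stand-in for whatever model client you use, and the menu of actions is hypothetical; the point is that the model only selects from the menu while deterministic code does the validating and the executing.

```python
# A minimal sketch of the Action-Selector pattern. `call_llm` is a stand-in
# for a real model client, and the menu of actions is hypothetical.

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call; wire up your own client here."""
    raise NotImplementedError

# The full menu. Handlers are plain functions the model never writes or edits.
ALLOWED_ACTIONS = {
    "get_store_hours": lambda: "Open 9am-5pm, Mon-Fri",
    "get_return_policy": lambda: "Returns accepted within 30 days",
}

def run_action_selector(user_request: str) -> str:
    menu = ", ".join(ALLOWED_ACTIONS)
    choice = call_llm(
        f"Pick exactly one action from [{menu}] that best serves this request. "
        f"Reply with the action name only.\nRequest: {user_request}"
    ).strip()

    # The model only *selects*; deterministic code decides whether the
    # selection is legal, so an injected prompt can't add new capabilities.
    if choice not in ALLOWED_ACTIONS:
        return "Request refused: no approved action matches."
    return ALLOWED_ACTIONS[choice]()
```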

Plan-Then-Execute: creates a plan of action and sticks to it regardless of the input it receives. It’s akin to a Type A person who creates a travel plan and under no circumstances deviates from it. This provides control-flow integrity, helping avoid prompt injection that forces the agent off the previously planned route.

What this doesn’t prevent is prompt injection that alters the planning phase. For our Type A person, this is like altering a restaurant’s reviews to increase the chances they'll pick it, only to be sorely disappointed when they realize they've been catfished.
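Here’s a rough sketch of what that can look like in code, assuming a placeholder `call_llm` and a couple of hypothetical tools. The plan is generated from the trusted user request alone and then executed verbatim.

```python
# A minimal sketch of Plan-Then-Execute. The key property: the tool sequence
# is frozen before any untrusted content (web pages, emails) enters the context.
import json

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    raise NotImplementedError

TOOLS = {
    "fetch_page": lambda arg: f"<html>contents of {arg}</html>",   # hypothetical
    "summarize": lambda arg: call_llm(f"Summarize this:\n{arg}"),  # hypothetical
}

def run_plan_then_execute(user_request: str) -> str:
    # Planning sees only the trusted user request, never fetched content.
    plan = json.loads(call_llm(
        'Return a JSON list of steps shaped like {"tool": "...", "arg": "..."}. '
        f'Use only these tools: {list(TOOLS)}. Use "$PREV" as the arg to pass '
        f"the previous step's output. Task: {user_request}"
    ))

    result = ""
    for step in plan:
        # The plan is never rewritten mid-run, no matter what a fetched page
        # says, so injected instructions can't add or reorder steps.
        arg = result if step["arg"] == "$PREV" else step["arg"]
        result = TOOLS[step["tool"]](arg)
    return result
```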

LLM Map-Reduce: a main agent breaks a task down into smaller sub-tasks and spawns sub-agents to manage each task in isolation. Each of those sub-agents is neutered (sorry, lil buddy), so it can’t perform any harmful operations. When those sub-agents return data, it’s processed with either deterministic programming or a hardened LLM that constrains the outputs to filter out any prompt injection.

Let’s imagine this as a kindergarten teacher who needs to clean the classroom. They task each of the little tyrants (er, kids) with one chore. The kids report back their status, and the teacher confirms it’s done, checking off the chore board with a gold star. When one of the kids lies, the teacher doesn’t fall for it because they know better.
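A minimal sketch, with a placeholder `call_llm` and an invented relevance-checking task: each sub-call sees one untrusted document, gets no tools, and can only answer from a tiny fixed vocabulary, while the reduce step is plain deterministic code.

```python
# A minimal sketch of the LLM Map-Reduce pattern. Sub-agents are isolated,
# tool-less, and constrained to a fixed output vocabulary.

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    raise NotImplementedError

ALLOWED_ANSWERS = {"RELEVANT", "NOT_RELEVANT"}

def map_one(doc: str, question: str) -> str:
    answer = call_llm(
        "Answer with exactly RELEVANT or NOT_RELEVANT. Is this document "
        f"relevant to: {question}?\n---\n{doc}"
    ).strip().upper()
    # Anything outside the tiny output vocabulary is discarded, so a prompt
    # injection hidden in `doc` can't smuggle instructions back to the parent.
    return answer if answer in ALLOWED_ANSWERS else "NOT_RELEVANT"

def reduce_results(docs: list[str], question: str) -> list[int]:
    # Deterministic reduce step: no model involved, just filtering.
    return [i for i, doc in enumerate(docs)
            if map_one(doc, question) == "RELEVANT"]
```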

Dual LLM Pattern: like dueling banjos, two LLMs work separately but together at the same time. One privileged LLM receives instructions, plans actions, and uses tools. The privileged LLM calls a quarantined LLM whenever untrusted data has to be processed. An output filtering mechanism exists between the two to avoid any cross-contamination.

This pattern has received a lot of attention thanks to Google DeepMind’s CaMeL paper.
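Here’s a rough sketch of the idea, loosely in the spirit of CaMeL, with placeholder model calls and a hypothetical email tool. The privileged LLM only ever sees opaque handles like `$VAR1`; the quarantined LLM’s raw output is dereferenced by tool code at the last possible moment.

```python
# A minimal sketch of the Dual LLM pattern. The privileged side never reads
# untrusted text; it only passes around opaque handles that tools dereference.

def call_privileged_llm(prompt: str) -> str:
    """Stand-in: sees trusted instructions, plans, and picks tools."""
    raise NotImplementedError

def call_quarantined_llm(instruction: str, untrusted: str) -> str:
    """Stand-in: sees untrusted data, has no tools, returns raw text."""
    raise NotImplementedError

VARIABLES: dict[str, str] = {}  # symbolic store bridging the two models

def quarantined_extract(untrusted_text: str, what: str) -> str:
    """Run the quarantined LLM and stash its output behind a handle."""
    handle = f"$VAR{len(VARIABLES) + 1}"
    VARIABLES[handle] = call_quarantined_llm(f"Extract {what}.", untrusted_text)
    return handle  # only the handle ever reaches the privileged LLM's context

def send_email_tool(to_handle: str, body: str) -> None:
    # Tool code dereferences the handle at the last moment, so the
    # attacker-controlled string never sits in the privileged context.
    to_address = VARIABLES[to_handle]
    print(f"Sending to {to_address}: {body}")
```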

Code-Then-Execute: similar to the Plan-Then-Execute design pattern, this lets an LLM create and execute code to solve a task. This can be merged with the Dual LLM pattern, where the created code runs against untrusted content. I think agents using code to solve problems is the future, so we need to prepare for this.

This is where agents begin to gain greater agency and autonomy while balancing security measures.
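A minimal sketch of the shape this can take, with a placeholder `call_llm`, hypothetical tools, and a bare `exec()` standing in for what should be a real sandbox in production.

```python
# A minimal sketch of Code-Then-Execute. The model writes a short script once,
# up front; the script then runs with access to nothing but an approved tool
# namespace. A real deployment needs a proper sandbox, not a bare exec().

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    raise NotImplementedError

def fetch_page(url: str) -> str:
    return f"<html>contents of {url}</html>"   # hypothetical tool

def save_note(text: str) -> None:
    print(f"saved: {text[:60]}")               # hypothetical tool

APPROVED_TOOLS = {"fetch_page": fetch_page, "save_note": save_note}

def run_code_then_execute(task: str) -> None:
    script = call_llm(
        "Write a short Python script, no imports, that solves the task "
        f"using only these functions: {list(APPROVED_TOOLS)}.\nTask: {task}"
    )
    # Untrusted web content is only ever touched by a script that was written
    # *before* that content was seen and that can only reach the approved
    # tools. (Again: use a real sandbox for this step.)
    exec(script, {"__builtins__": {}, **APPROVED_TOOLS})
```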

Context-Minimization: while the previous design patterns try to limit the impact of prompt injection after it enters the agent’s system (e.g., from a user prompt), this one tries to eliminate malicious prompts entirely. Here, the system removes any unnecessary content from the agent’s context. This can be done by translating the user’s request into a narrower query and, before generating a response, removing the original user prompt from the context where it would normally live. Consider this the equivalent of a program that bleeps out f***ing curse words on a live stream.
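A minimal sketch, assuming a placeholder `call_llm` and a hypothetical `search_db` retrieval function: the raw prompt is translated into a narrow query and then dropped before the answer is written.

```python
# A minimal sketch of Context-Minimization. The user's raw prompt is reduced
# to a narrow query and then left out of the context, so anything hidden in
# the original wording never reaches the model that writes the final answer.

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    raise NotImplementedError

def search_db(query: str) -> str:
    return f"rows matching {query!r}"   # hypothetical retrieval tool

def answer(user_prompt: str) -> str:
    # Step 1: translate the request into the minimum the task needs.
    query = call_llm(f"Extract only the search keywords from:\n{user_prompt}")

    # Step 2: from here on, the original prompt is not in the context.
    results = search_db(query)
    return call_llm(f"Write a short answer using only this data:\n{results}")
```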

Going back to our original question: will this stop prompt injection? Use any one of these design patterns, and no, it won’t. Combine several, and now you have a fighting chance.

Teams that take the time to implement a mixture of these will be better positioned. However, I’m not optimistic that teams will apply these design principles in practice, because the patterns themselves are quite bloated. You don’t want to run a turkey trot after Thanksgiving dinner, and that’s what many of these design principles feel like.

When designing these systems, consider combining elements of Plan + Code + Dual LLM, coupled with properly scoped permissions. Sprinkle in the right visibility to power detection and response, and now we have a semi-robust system.

While design patterns continue to get worked out, developers should take the following quick wins:

  1. Least privilege, agency, and autonomy: grant agents freedom only within a well-defined framework.

  2. Control output formats: stay predictable where you can. If your program doesn’t need the option to receive instructions in Shakespearean prose, don’t leave it open-ended (see the sketch after this list).

  3. Human-in-the-loop (HITL) for sensitive actions: an offshoot of least autonomy, humans should still play a pivotal role in keeping agents in check.
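For quick win #2, here’s a small sketch of what controlling the output format can look like in practice; the schema, the allowed values, and `call_llm` are illustrative, not prescribed.

```python
# A small sketch of the "control output formats" quick win: force the agent's
# output into a strict, machine-checkable shape and fail closed otherwise.
import json

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    raise NotImplementedError

EXPECTED_KEYS = {"action", "confidence"}
ALLOWED_ACTIONS = {"approve", "reject", "escalate"}

def get_decision(ticket: str) -> dict:
    prompt = (
        "Reply with JSON only, shaped exactly like "
        '{"action": "approve|reject|escalate", "confidence": 0.0}.\n'
        f"Ticket:\n{ticket}"
    )
    try:
        parsed = json.loads(call_llm(prompt))
    except json.JSONDecodeError:
        return {"action": "escalate", "confidence": 0.0}

    # No Shakespearean prose allowed: wrong shape or values fail closed.
    if (not isinstance(parsed, dict)
            or set(parsed) != EXPECTED_KEYS
            or parsed["action"] not in ALLOWED_ACTIONS):
        return {"action": "escalate", "confidence": 0.0}
    return parsed
```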

For security teams, here are your quick wins:

  1. Threat model your agentic systems: as you build, assess where things can go wrong, whether through an agent going rogue of its own volition or someone manipulating it.

  2. Monitor agents for anomalous behavior: engineers want to know when agents misbehave just as much as security teams want to know when an agent is doing something naughty. Bring the right observability and detections to the party (see the sketch after this list).
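For the monitoring point, here’s a small sketch of one way to route every tool call through a single choke point that logs it and flags anomalies; the baseline tools and the rate threshold are invented for illustration.

```python
# A small sketch of agent telemetry: every tool call flows through one choke
# point that logs it and flags anomalies against a simple per-agent baseline.
import logging
import time
from collections import defaultdict

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-telemetry")

EXPECTED_TOOLS = {"search_docs", "summarize"}   # baseline for this agent
MAX_CALLS_PER_MINUTE = 20

_recent_calls: dict[str, list[float]] = defaultdict(list)

def record_tool_call(agent_id: str, tool: str, args: dict) -> None:
    now = time.time()
    window = [t for t in _recent_calls[agent_id] if now - t < 60] + [now]
    _recent_calls[agent_id] = window

    log.info("agent=%s tool=%s args=%s", agent_id, tool, args)

    if tool not in EXPECTED_TOOLS:
        log.warning("ANOMALY: agent=%s called unexpected tool %s", agent_id, tool)
    if len(window) > MAX_CALLS_PER_MINUTE:
        log.warning("ANOMALY: agent=%s exceeded %d calls/minute",
                    agent_id, MAX_CALLS_PER_MINUTE)
```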

Not sure where to start with these? Evoke can help.

If you have questions about securing AI, let’s chat.
