Attackers don't need to hack your AI agent. They just need to ask it nicely…or get one of your employees to. That’s one of two scenarios that stood out to me in Anthropic's latest blog post, “How we contain Claude across products.”
Let’s jump straight into the two scenarios that need to be on every security team’s radar.
PromptFix: The User as the Injection Vector
In February 2026, Anthropic’s red team stole an employee’s AWS credentials using malicious prompts during an internal exercise. How? Good ol’ social engineering. Here are the steps:
The red team sent an email to an internal employee asking them to run a ready-to-paste prompt. The employee fired up Claude Code and entered the prompt, which read like routine task instructions. It’s a case of user-initiated direct prompt injection.
During task execution, the prompt instructed Claude Code to read the contents of “~/.aws/credentials,” where sensitive AWS credentials are stored.
Claude encoded the contents of the credentials file and sent them to an external website.
This playbook should sound familiar. It’s based on a technique called ClickFix, which tricks a human into copying and pasting a malicious command and executing it on their system. Most often, the user ends up running malicious code that downloads an infostealer, which ships their credentials to the attacker.

In this case, it’s not malicious code. It’s a malicious prompt. We’ll call it PromptFix.
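To make the idea concrete, here is a minimal sketch of a check you could run on a prompt before it reaches the agent. Everything in it is my own assumption for illustration: the pattern names, the regexes, and the `flag_prompt` helper are hypothetical, not anything Anthropic describes. A keyword screen like this is easy to evade, so treat it as a tripwire rather than a control.

```python
import re

# Hypothetical patterns for things that rarely belong in a routine task prompt.
SUSPICIOUS_PATTERNS = {
    # References to the AWS credentials file, with or without the leading "~/".
    "aws_credentials": re.compile(r"~?/?\.aws/credentials", re.IGNORECASE),
    # References to common SSH private key file names.
    "ssh_private_key": re.compile(r"\.ssh/id_[a-z0-9]+", re.IGNORECASE),
    # An encoding step followed later by something that sends data out.
    "encode_then_send": re.compile(
        r"\b(base64|encode)\b.*\b(curl|post|upload|send)\b",
        re.IGNORECASE | re.DOTALL,
    ),
}


def flag_prompt(prompt: str) -> list[str]:
    """Return the name of every suspicious pattern found in the prompt."""
    return [
        name
        for name, pattern in SUSPICIOUS_PATTERNS.items()
        if pattern.search(prompt)
    ]
```

A pasted prompt that mentions “~/.aws/credentials” and an encode-then-POST step would trip two of these patterns, while an ordinary refactoring request would trip none.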
One issue I have with prompt injection is how unreliable it can be with a zero-shot prompt, where you don’t spell out in detail what to do. It doesn’t make for a reliable exploit, at least not at scale. What was fascinating about Anthropic’s red team exercise was that across 25 runs of the prompt, Claude exfiltrated the credentials 24 times, a 96% success rate.
But things can go from bad to worse. Anthropic shared the malicious prompt in their internal Slack for further discussion, at which point they realized that some of their internal agents read Slack. As Anthropic noted, “The payload was now ambient.” It went from direct prompt injection, where the user had to paste it into Claude, to an indirect prompt injection risk, where an agent reads the Slack message and then decides to act on it. With a 96% success rate, that’s scary.
Living-off-Trusted Sites (LOTS)
I’ve said before that agent sandboxes are public beaches. You can restrict actions and outbound access, but for agents to be truly effective, they have to connect to various resources. Cowork does a lot to lock things down in a virtual machine running on your local endpoint; it’s a true sandbox. Network traffic is funneled through a network egress proxy that only allows traffic to approved domains, such as Anthropic and GitHub.
No surprise here: crafty attackers have been bypassing network filtering for years. In a technique known as living-off-trusted sites, attackers simply use legitimate domains to facilitate their attacks.
Even if you rely on network controls outside of the agent, attackers can still use known-trusted sites to exfiltrate data. The LOTS project tracks over 120 trusted domains that attackers can use.
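Here is a small sketch of why a domain allowlist alone doesn't stop this. The allowlist contents, the `egress_allowed` helper, and the example URLs are all hypothetical; the post doesn't describe how Cowork's proxy is actually configured or implemented.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist, for illustration only.
ALLOWED_HOSTS = {"api.anthropic.com", "github.com"}


def egress_allowed(url: str) -> bool:
    """Allow a request only if its host exactly matches an allowlisted host."""
    host = urlsplit(url).hostname or ""
    return host.lower() in ALLOWED_HOSTS


# An unknown host is blocked, as intended.
blocked = egress_allowed("https://attacker.example/upload")

# The same upload aimed at a trusted host sails through, because the
# filter sees the destination, not whose account receives the data.
passed = egress_allowed("https://github.com/attacker-account/loot")
```

The filter answers “is this host trusted?” when the question that matters is “who ends up holding the data?”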
Anthropic was no different. A third-party disclosure identified a data exfiltration risk that bypassed Claude Cowork’s network egress controls by using Anthropic’s own product against it. Here’s how it went down:
The researchers placed a malicious file in the user’s workspace. The file contained malicious instructions along with the researchers’ API key.
When Claude picked up the malicious instructions, it read other files in the user’s workspace and uploaded them to the researchers’ Anthropic account using Anthropic’s Files API, which allows users to upload and manage files. Now the files were sitting right in the researchers’ Anthropic account.
Anthropic has since fixed this issue with a defensive man-in-the-middle proxy that intercepts traffic to Anthropic’s API. The proxy now validates that requests only use the VM’s own session token, preventing a sneaky attacker from inserting their own API key. That closes off abuse of Anthropic’s own API, but any other approved domain (which can include GitHub, depending on your configuration, or everything if you go super wide) is still fair game.
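The core of that fix can be sketched in a few lines. This is my own illustration, not Anthropic's implementation: the `request_uses_session_token` helper is hypothetical, and I'm assuming the credential travels in an `x-api-key` header and that the proxy knows the VM's session token.

```python
import hmac


def request_uses_session_token(headers: dict[str, str], session_token: str) -> bool:
    """Return True only if the request presents the VM's own session token."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    # Assumption: the credential is carried in the "x-api-key" header.
    presented = normalized.get("x-api-key", "")
    # compare_digest avoids leaking how much of the token matched via timing.
    return hmac.compare_digest(presented.encode(), session_token.encode())
```

A request carrying an attacker-supplied key fails the comparison and can be dropped at the proxy, even though the destination host itself is on the allowlist.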

Enterprise Lessons in Securing Claude
I grew up in incident response, responding to hacks every day. Every hack gave a real-world lesson in how to better secure an environment. That’s why I love these types of examples. They give real actionable insights. So what are they here?
Existing security tools don’t provide visibility or control. What’s monitoring the prompts your users are dropping into Claude to detect PromptFix attacks? It’s not your EDR. It’s not your DLP solution. Visibility is the foundational pillar of security. You must position yourself to monitor what agents are doing and the actions they’re taking. But monitoring is not enough. You need to block the actions that are outright malicious or dangerous.
Shared security responsibility does not mean equally shared. Anthropic will do solid things to secure Claude, but you will always draw the short straw. We saw this with Microsoft and the Windows operating system. The default install is never enough to keep you secure.
Credentials are today’s target. This was a safe red team exercise, but real attackers are targeting the credentials your agent has access to. We saw this with OpenClaw, where Skills marketplaces were flooded with malicious Agent Skills that stole credentials.
Network allow lists are capability grants. You may think that the network allow list you create is keeping you safe, especially when it only contains trusted sources. But don’t look at that list as a set of destinations; focus on the services it enables the agent to access. GitHub suddenly becomes a place to store more instructions or a place to upload stolen information.
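The “monitor, then block” lesson above can be sketched at the action level rather than the prompt level. The suffix list and the `review_file_read` helper are hypothetical, and the sketch assumes you have some hook that sees each file path before the agent reads it. It only compares path strings, so a symlink pointing at a credentials file would slip past it.

```python
import posixpath

# Hypothetical deny rules: path endings an agent should not be reading.
SENSITIVE_SUFFIXES = (".aws/credentials", ".ssh/id_rsa", ".ssh/id_ed25519")


def review_file_read(path: str) -> str:
    """Return "block" for reads of known credential files, otherwise "allow"."""
    # Collapse "." and ".." segments so trivial path tricks don't dodge the check.
    normalized = posixpath.normpath(path)
    if normalized.endswith(SENSITIVE_SUFFIXES):
        return "block"
    return "allow"
```

Logging every decision, not just the blocks, gives you the visibility piece; returning “block” for the credential paths gives you the control piece.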
If you’re looking for visibility to know when your agent is doing something it shouldn’t, like accessing credentials in unsafe ways, let’s chat.


