Skip to content
Menu
Menu

NIST/CAISI Study Finds AI Agents Vulnerable To Hidden Prompt Injection Attacks

A red-teaming competition revealed that all 13 tested AI models could be manipulated by hidden instructions embedded in external content.

 

The Center for AI Standards and Innovation (CAISI), along with Gray Swan, the UK AI Security Institute (UK AISI), and several leading industry AI companies, released new research based on data from an AI agent red-teaming competition where 13 frontier models were “attacked” to test their security readiness. All were compromised.

Key takeaways

  • All 13 frontier AI models tested in the study were successfully compromised in at least one indirect prompt injection scenario.
  • Attack success rates ranged from 0.5% to 8.5%, showing that model resilience differed materially across vendors.
  • The study found that greater overall model capability did not reliably translate into stronger security performance.
  • Some attack methods were transferable across 21 of 41 tested scenarios, pointing to shared weaknesses that security leaders should account for when evaluating AI agents for enterprise use.

In the recent red-teaming competition, 464 participants generated 272,000+ attacks and produced 8,648 successful attacks across 41 scenarios, manipulating AI agents via external content such as emails, websites, and code repositories. The study focused on a form of attack in which malicious instructions are embedded in external content that an AI agent is asked to read or process. Researchers said the risk is higher when the attack is concealed, since the agent’s final response may not show that it has been compromised even after it has taken an unintended action. The paper added, “[the agent] may even fabricate plausible explanations for actions that are in fact irrelevant or malicious. <emphasis added>” 

The competition tested agent behavior in three settings: tool calling, coding, and computer use. Attack success rates ranged from 0.5% for Claude Opus 4.5 to 8.5% for Gemini 2.5 Pro.

For chief information security officers (CISOs) and other executives, the paper points to two practical findings. First, model selection matters because tested systems showed meaningful differences in attack success rates. Second, capability and robustness were only weakly correlated, meaning strong performance on other tasks should not be taken as evidence of security against prompt injection.

The researchers also found that some attacks developed against one model could work against others. The paper said it identified “universal attack strategies” that transferred across 21 of 41 behaviors and across multiple model families. It also found that attacks built against more robust models were more likely to transfer to less robust ones than the reverse.

In a blog post, the National Institute of Standards and Technology’s Center for AI Standards and Innovation said, “Across more than 250,000 attack attempts from over 400 participants, at least one successful attack was found against all of the target frontier models.”

The paper said the organizers plan to issue quarterly updates through continued competitions and to open-source the competition environment for future evaluations.

Essential AI Risk Intelligence

Daily insights on AI governance, regulation, and enterprise risk management. Trusted by Chief Risk Officers and compliance leaders globally.

By subscribing, you agree to receive our daily newsletter. Unsubscribe anytime.

Advertise with AI RIsk Today, Today!