Mobile Study: Part 1/4: Prompt Injection Defense for AI Agents: Design Principles Revealed by OpenAI

2026年3月14日土曜日

Part 1/4: Prompt Injection Defense for AI Agents: Design Principles Revealed by OpenAI

Introduction: A New Threat Facing Autonomous Agents

Now that AI agents built on large language models (LLMs) can autonomously perform real-world actions such as executing code, searching the web, sending emails, and manipulating databases, security risks are growing rapidly as well. Among the most serious of these threats is prompt injection.

Prompt injection is an attack technique in which an attacker embeds malicious instructions into the model's input, overriding the intentions of developers and users. The OWASP (Open Worldwide Application Security Project) lists prompt injection as the number one risk (LLM01) in its "OWASP Top 10 for LLM Applications," and its dangers are widely recognized across the industry [Source: https://owasp.org/www-project-top-10-for-large-language-model-applications/].

In this first installment of the series "Building Production-Grade AI Agents: Security, Architecture, and Runtime," we explain the foundational defense principles for building robust AI agents in production environments, centered on OpenAI's published design guidelines, "Designing AI agents to resist prompt injection" [Source: https://openai.com/index/designing-agents-to-resist-prompt-injection].

Direct Injection and Indirect Injection

Prompt injection has two major attack vectors. Direct injection is the pattern in which an attacker directly accesses the prompt and inserts instructions (commonly known as jailbreaking). Indirect injection, on the other hand, is the pattern in which an agent's behavior is hijacked by instructions embedded in content it retrieves from external sources (web pages, PDFs, emails, tool execution results, etc.), making it a more serious threat for autonomous agents.

The 2022 paper by Perez & Ribeiro, "Ignore Previous Prompt: Attack Techniques For Language Models," was a pioneering study that systematically examined this problem and demonstrated the mechanism by which LLMs execute instructions found in external content as if they were legitimate system instructions [Source: https://arxiv.org/abs/2211.09527]. The more untrusted data sources an agent processes, the larger the attack surface becomes, and the danger grows even greater in multi-step task execution.

Five Design Principles Presented by OpenAI

OpenAI's design guidelines provide a practical framework for protecting agents from prompt injection.

1. Explicit Implementation of a Trust Hierarchy

The inputs an agent receives come from multiple sources with varying levels of trust. A fundamental principle is to explicitly design a priority order of system prompt (operator) > user messages > content retrieved from the environment, and to ensure that lower-tier sources cannot override instructions from higher-tier sources. This instruction hierarchy forms the core of OpenAI's agent design.

2. The Principle of Minimal Footprint

The scope and permissions of the tools an agent possesses should be kept to the minimum necessary for the current task. An agent performing a web search task should not be granted permission to send emails or delete files. Even if an attacker manipulates the agent through malicious content, limiting its permissions minimizes the "blast radius" of any potential damage.

3. Clear Separation of Data and Instructions

At every stage of the pipeline, a design that distinguishes whether the input to the model is an "instruction" or "data to be processed" is required. The "Spotlighting" technique proposed by Microsoft's research team is a promising approach that uses XML tags and special delimiters to syntactically differentiate trusted instructions from retrieved data, making it harder for the model to execute instructions found within environmental content [Source: https://arxiv.org/abs/2403.14720].

4. Human Approval Checkpoints for Irreversible Actions

Before executing actions that are difficult to undo — such as deleting files, sending emails, or writing to external APIs — it is recommended to establish checkpoints where a human can review and approve the action. This is a design decision that is mindful of the tradeoff between autonomy and safety, and it suppresses the risk of an agent unintentionally performing critical operations.

5. Ensuring Logging and Auditability

All tool calls and decision-making processes executed by the agent should be recorded in detailed logs. Even if an injection attack succeeds, an audit trail is indispensable for retrospectively identifying the root cause and responding quickly.

The Practical Limits of Defense

Even when these principles are implemented, complete defense is difficult. Unresolved challenges continue to accumulate: the model itself has not fully learned to respect the instruction hierarchy, the ambiguity of natural language creates edge cases, and in multi-agent configurations (where one agent calls another), trust boundaries become even more complex. OWASP recommends Defense in Depth, making it important not to rely on any single countermeasure, but to combine defenses at the model level, architecture level, and infrastructure level.

Preview of the Next Installment

In Part 2, we will take a deep dive into architecture patterns for production-grade AI agents — the orchestrator-executor model, memory management design, and tool boundary definition. The design principles for prompt injection defense explained in this installment only become truly effective when paired with appropriate architectural design. We will elaborate on those specific implementation approaches in the next part.

Category: LLM | Tags: プロンプトインジェクション, AIエージェント, LLMセキュリティ, OpenAI, セキュリティ設計

Mobile Study