2026年3月14日土曜日

Part 1/4: Prompt Injection Defense for AI Agents: Design Principles Revealed by OpenAI

Introduction: A New Threat Facing Autonomous Agents

Now that AI agents built on large language models (LLMs) can autonomously perform real-world actions such as executing code, searching the web, sending emails, and manipulating databases, security risks are growing rapidly as well. Among the most serious of these threats is prompt injection.

Prompt injection is an attack technique in which an attacker embeds malicious instructions into the model's input, overriding the intentions of developers and users. The OWASP (Open Worldwide Application Security Project) lists prompt injection as the number one risk (LLM01) in its "OWASP Top 10 for LLM Applications," and its dangers are widely recognized across the industry [Source: https://owasp.org/www-project-top-10-for-large-language-model-applications/].

In this first installment of the series "Building Production-Grade AI Agents: Security, Architecture, and Runtime," we explain the foundational defense principles for building robust AI agents in production environments, centered on OpenAI's published design guidelines, "Designing AI agents to resist prompt injection" [Source: https://openai.com/index/designing-agents-to-resist-prompt-injection].

Direct Injection and Indirect Injection

Prompt injection has two major attack vectors. Direct injection is the pattern in which an attacker directly accesses the prompt and inserts instructions (commonly known as jailbreaking). Indirect injection, on the other hand, is the pattern in which an agent's behavior is hijacked by instructions embedded in content it retrieves from external sources (web pages, PDFs, emails, tool execution results, etc.), making it a more serious threat for autonomous agents.

The 2022 paper by Perez & Ribeiro, "Ignore Previous Prompt: Attack Techniques For Language Models," was a pioneering study that systematically examined this problem and demonstrated the mechanism by which LLMs execute instructions found in external content as if they were legitimate system instructions [Source: https://arxiv.org/abs/2211.09527]. The more untrusted data sources an agent processes, the larger the attack surface becomes, and the danger grows even greater in multi-step task execution.

Five Design Principles Presented by OpenAI

OpenAI's design guidelines provide a practical framework for protecting agents from prompt injection.

1. Explicit Implementation of a Trust Hierarchy

The inputs an agent receives come from multiple sources with varying levels of trust. A fundamental principle is to explicitly design a priority order of system prompt (operator) > user messages > content retrieved from the environment, and to ensure that lower-tier sources cannot override instructions from higher-tier sources. This instruction hierarchy forms the core of OpenAI's agent design.

2. The Principle of Minimal Footprint

The scope and permissions of the tools an agent possesses should be kept to the minimum necessary for the current task. An agent performing a web search task should not be granted permission to send emails or delete files. Even if an attacker manipulates the agent through malicious content, limiting its permissions minimizes the "blast radius" of any potential damage.

3. Clear Separation of Data and Instructions

At every stage of the pipeline, a design that distinguishes whether the input to the model is an "instruction" or "data to be processed" is required. The "Spotlighting" technique proposed by Microsoft's research team is a promising approach that uses XML tags and special delimiters to syntactically differentiate trusted instructions from retrieved data, making it harder for the model to execute instructions found within environmental content [Source: https://arxiv.org/abs/2403.14720].

4. Human Approval Checkpoints for Irreversible Actions

Before executing actions that are difficult to undo — such as deleting files, sending emails, or writing to external APIs — it is recommended to establish checkpoints where a human can review and approve the action. This is a design decision that is mindful of the tradeoff between autonomy and safety, and it suppresses the risk of an agent unintentionally performing critical operations.

5. Ensuring Logging and Auditability

All tool calls and decision-making processes executed by the agent should be recorded in detailed logs. Even if an injection attack succeeds, an audit trail is indispensable for retrospectively identifying the root cause and responding quickly.

The Practical Limits of Defense

Even when these principles are implemented, complete defense is difficult. Unresolved challenges continue to accumulate: the model itself has not fully learned to respect the instruction hierarchy, the ambiguity of natural language creates edge cases, and in multi-agent configurations (where one agent calls another), trust boundaries become even more complex. OWASP recommends Defense in Depth, making it important not to rely on any single countermeasure, but to combine defenses at the model level, architecture level, and infrastructure level.

Preview of the Next Installment

In Part 2, we will take a deep dive into architecture patterns for production-grade AI agents — the orchestrator-executor model, memory management design, and tool boundary definition. The design principles for prompt injection defense explained in this installment only become truly effective when paired with appropriate architectural design. We will elaborate on those specific implementation approaches in the next part.


Category: LLM | Tags: プロンプトインジェクション, AIエージェント, LLMセキュリティ, OpenAI, セキュリティ設計

A Fine-Tuned 3B Model Surpasses Claude Haiku: The Potential and Limits of Small Models in Constrained Generation

"Small" Does Not Mean "Weak": Surprising Experimental Results

Experimental results that challenge the conventional wisdom of AI engineering have been published. It has been reported that by fine-tuning an open-source model with a mere 3 billion (3B) parameters for a specific task, it is possible to surpass Anthropic's commercial model "Claude Haiku" on a Constrained Generation benchmark [Source: https://serendip-ml.github.io/fine-tuned-3b-beats-haiku/].

Although Claude Haiku's parameter count is undisclosed, it is Anthropic's lightweight model designed with a focus on inference speed and cost efficiency, and it is widely adopted across many production use cases. The fact that a compact 3B model outperformed it on the specific task of constrained generation once again demonstrates the potential of domain-specific fine-tuning.

What Is Constrained Generation?

Constrained Generation refers to a technique that controls inference at runtime so that the output token sequence strictly conforms to a pre-defined schema, regular expression, or grammar. Strict JSON Schema output, classification tasks that return only specific enumerated values, or the generation of responses with complex nested structures — these are all critically important requirements when integrating LLMs into production environments.

General-purpose large LLMs can follow formats with high probability through prompt engineering, but "high probability" is not "certainty." Even an error rate of 0.1% can cause non-negligible incidents in a production system processing thousands of requests per second. The core of this experiment lies in an approach that addresses this problem by internalizing "format compliance" into the model itself through fine-tuning [Source: https://serendip-ml.github.io/fine-tuned-3b-beats-haiku/].

Experiment Overview: What Was Evaluated and How

In this experiment, supervised fine-tuning (SFT) specialized for constrained generation tasks was applied to a base model of 3B parameters, presumed to be from the Llama or Mistral family. The evaluation focused primarily on the following two axes:

  1. Format compliance rate: The proportion of outputs that fully satisfy the specified schema
  2. Semantic accuracy: Whether the content matches the expected values, not just whether the structure is correct

While Claude Haiku was evaluated using zero-shot prompting, the fine-tuned 3B model demonstrated notably higher scores on tasks with distributions close to its training data. This is a typical characteristic of fine-tuning known as in-distribution strength, but the fact that a 3B-parameter model surpassed Haiku had a strong impact on the industry [Source: https://serendip-ml.github.io/fine-tuned-3b-beats-haiku/].

Why Can a Small Fine-Tuned Model Surpass a Large Model?

Explaining this result requires a precise understanding of the nature of fine-tuning. Large general-purpose models possess vast knowledge and generalization capabilities, but in terms of "familiarity" with a specific output format, they can be inferior to domain-specific models that have learned thousands of examples of the same task.

In constrained generation in particular, three mechanisms come into play:

  • Concentration of probability distribution: Fine-tuning concentrates probability mass on the correct format tokens, making incorrect tokens virtually impossible to generate
  • Memorization of context-dependent patterns: The relationship between schema definitions and field names is directly embedded into the weights
  • Elimination of inference-time cost: Accuracy improves because the "effort" a general-purpose model expends trying to follow instructions via prompts becomes unnecessary

The case of NVIDIA's NeMo Agent Toolkit, published on Hugging Face, corroborates the same findings. When that team achieved first place on DABStep (Data Analysis Benchmark), they employed a reusable tool-generation strategy, demonstrating that an agent design optimized for specific tasks can surpass general-purpose large models [Source: https://huggingface.co/blog/nvidia/nemo-agent-toolkit-data-explorer-dabstep-1st-place].

Additional Possibilities Brought by RL Fine-Tuning

Beyond supervised fine-tuning, reinforcement learning (RL)-based fine-tuning is also attracting attention for enhancing the capabilities of small models. A survey report on 16 open-source RL libraries published by Hugging Face organizes the cutting edge of asynchronous and distributed RL training, and shows that applying PPO and GRPO algorithms to small models is becoming feasible at realistic costs [Source: https://huggingface.co/blog/async-rl-training-landscape].

A notable point of RL fine-tuning is that it can improve constraint compliance rates without explicit demonstration data, through a reward design that gives a reward when output is produced in the correct format. The combination of constrained generation and RL is emerging as a promising approach for producing the next generation of small, high-precision models.

An Industrial Application Case Study: Granite 4.0 1B Speech

From the perspective of specializing small models, IBM's published Granite 4.0 1B Speech model is also highly instructive. Despite having only 1 billion parameters, it specializes in multilingual speech processing and is designed with on-device inference in mind [Source: https://huggingface.co/blog/ibm-granite/granite-4-speech]. This is a case that perfectly aligns with the theme discussed in this article — that optimization for a specific modality and specific task delivers more practical value than simply scaling up a model.

Challenges and Limitations: Beware of Overconfidence

However, one should also be cautious about overestimating these results. The following limitations must be clearly recognized.

  1. Vulnerability in out-of-distribution generalization: Fine-tuned models are vulnerable to schemas and structures outside the training distribution, and are inferior to large models like Haiku in their ability to handle unknown input patterns
  2. Ongoing maintenance costs: Re-fine-tuning is required every time a schema or requirement changes, resulting in a significant operational burden
  3. Lack of knowledge: 3B models have constraints on the amount of knowledge proportional to their parameter count, and may still be significantly inferior to large models on tasks other than constrained generation
  4. Evaluation bias: The advantage exists only on the specific evaluation axis of constrained generation and is not representative of the model's overall capabilities

Conclusion: The Future of Small Models Opened Up by Specialization

The fact that a fine-tuned 3B model surpassed Claude Haiku in constrained generation sharpens the contrast between "all-purpose large models vs. specialized small models." Considered alongside the cases of NeMo Agent and Granite Speech, the practical direction of AI engineering is shifting from "calling the most powerful general-purpose model" to "selecting and cultivating the optimal model for the task."

From the perspective of infrastructure costs and latency, it is impractical in a production environment to assign a large model to every request. Routine, repetitive tasks like constrained generation are prime targets for domain-specific fine-tuning, and this case quantitatively demonstrated the ROI of that approach. On the other hand, large models remain indispensable for knowledge-intensive and reasoning-intensive tasks, and designing a division of roles between the two will become central to modern LLM system architecture.


Category: LLM | Tags: ファインチューニング, LLM, 制約付き生成, 小型モデル, AIエージェント