2026年3月14日土曜日

Part 1/4: Prompt Injection Defense for AI Agents: Design Principles Revealed by OpenAI

Introduction: A New Threat Facing Autonomous Agents

Now that AI agents built on large language models (LLMs) can autonomously perform real-world actions such as executing code, searching the web, sending emails, and manipulating databases, security risks are growing rapidly as well. Among the most serious of these threats is prompt injection.

Prompt injection is an attack technique in which an attacker embeds malicious instructions into the model's input, overriding the intentions of developers and users. The OWASP (Open Worldwide Application Security Project) lists prompt injection as the number one risk (LLM01) in its "OWASP Top 10 for LLM Applications," and its dangers are widely recognized across the industry [Source: https://owasp.org/www-project-top-10-for-large-language-model-applications/].

In this first installment of the series "Building Production-Grade AI Agents: Security, Architecture, and Runtime," we explain the foundational defense principles for building robust AI agents in production environments, centered on OpenAI's published design guidelines, "Designing AI agents to resist prompt injection" [Source: https://openai.com/index/designing-agents-to-resist-prompt-injection].

Direct Injection and Indirect Injection

Prompt injection has two major attack vectors. Direct injection is the pattern in which an attacker directly accesses the prompt and inserts instructions (commonly known as jailbreaking). Indirect injection, on the other hand, is the pattern in which an agent's behavior is hijacked by instructions embedded in content it retrieves from external sources (web pages, PDFs, emails, tool execution results, etc.), making it a more serious threat for autonomous agents.

The 2022 paper by Perez & Ribeiro, "Ignore Previous Prompt: Attack Techniques For Language Models," was a pioneering study that systematically examined this problem and demonstrated the mechanism by which LLMs execute instructions found in external content as if they were legitimate system instructions [Source: https://arxiv.org/abs/2211.09527]. The more untrusted data sources an agent processes, the larger the attack surface becomes, and the danger grows even greater in multi-step task execution.

Five Design Principles Presented by OpenAI

OpenAI's design guidelines provide a practical framework for protecting agents from prompt injection.

1. Explicit Implementation of a Trust Hierarchy

The inputs an agent receives come from multiple sources with varying levels of trust. A fundamental principle is to explicitly design a priority order of system prompt (operator) > user messages > content retrieved from the environment, and to ensure that lower-tier sources cannot override instructions from higher-tier sources. This instruction hierarchy forms the core of OpenAI's agent design.

2. The Principle of Minimal Footprint

The scope and permissions of the tools an agent possesses should be kept to the minimum necessary for the current task. An agent performing a web search task should not be granted permission to send emails or delete files. Even if an attacker manipulates the agent through malicious content, limiting its permissions minimizes the "blast radius" of any potential damage.

3. Clear Separation of Data and Instructions

At every stage of the pipeline, a design that distinguishes whether the input to the model is an "instruction" or "data to be processed" is required. The "Spotlighting" technique proposed by Microsoft's research team is a promising approach that uses XML tags and special delimiters to syntactically differentiate trusted instructions from retrieved data, making it harder for the model to execute instructions found within environmental content [Source: https://arxiv.org/abs/2403.14720].

4. Human Approval Checkpoints for Irreversible Actions

Before executing actions that are difficult to undo — such as deleting files, sending emails, or writing to external APIs — it is recommended to establish checkpoints where a human can review and approve the action. This is a design decision that is mindful of the tradeoff between autonomy and safety, and it suppresses the risk of an agent unintentionally performing critical operations.

5. Ensuring Logging and Auditability

All tool calls and decision-making processes executed by the agent should be recorded in detailed logs. Even if an injection attack succeeds, an audit trail is indispensable for retrospectively identifying the root cause and responding quickly.

The Practical Limits of Defense

Even when these principles are implemented, complete defense is difficult. Unresolved challenges continue to accumulate: the model itself has not fully learned to respect the instruction hierarchy, the ambiguity of natural language creates edge cases, and in multi-agent configurations (where one agent calls another), trust boundaries become even more complex. OWASP recommends Defense in Depth, making it important not to rely on any single countermeasure, but to combine defenses at the model level, architecture level, and infrastructure level.

Preview of the Next Installment

In Part 2, we will take a deep dive into architecture patterns for production-grade AI agents — the orchestrator-executor model, memory management design, and tool boundary definition. The design principles for prompt injection defense explained in this installment only become truly effective when paired with appropriate architectural design. We will elaborate on those specific implementation approaches in the next part.


Category: LLM | Tags: プロンプトインジェクション, AIエージェント, LLMセキュリティ, OpenAI, セキュリティ設計

A Fine-Tuned 3B Model Surpasses Claude Haiku: The Potential and Limits of Small Models in Constrained Generation

"Small" Does Not Mean "Weak": Surprising Experimental Results

Experimental results that challenge the conventional wisdom of AI engineering have been published. It has been reported that by fine-tuning an open-source model with a mere 3 billion (3B) parameters for a specific task, it is possible to surpass Anthropic's commercial model "Claude Haiku" on a Constrained Generation benchmark [Source: https://serendip-ml.github.io/fine-tuned-3b-beats-haiku/].

Although Claude Haiku's parameter count is undisclosed, it is Anthropic's lightweight model designed with a focus on inference speed and cost efficiency, and it is widely adopted across many production use cases. The fact that a compact 3B model outperformed it on the specific task of constrained generation once again demonstrates the potential of domain-specific fine-tuning.

What Is Constrained Generation?

Constrained Generation refers to a technique that controls inference at runtime so that the output token sequence strictly conforms to a pre-defined schema, regular expression, or grammar. Strict JSON Schema output, classification tasks that return only specific enumerated values, or the generation of responses with complex nested structures — these are all critically important requirements when integrating LLMs into production environments.

General-purpose large LLMs can follow formats with high probability through prompt engineering, but "high probability" is not "certainty." Even an error rate of 0.1% can cause non-negligible incidents in a production system processing thousands of requests per second. The core of this experiment lies in an approach that addresses this problem by internalizing "format compliance" into the model itself through fine-tuning [Source: https://serendip-ml.github.io/fine-tuned-3b-beats-haiku/].

Experiment Overview: What Was Evaluated and How

In this experiment, supervised fine-tuning (SFT) specialized for constrained generation tasks was applied to a base model of 3B parameters, presumed to be from the Llama or Mistral family. The evaluation focused primarily on the following two axes:

  1. Format compliance rate: The proportion of outputs that fully satisfy the specified schema
  2. Semantic accuracy: Whether the content matches the expected values, not just whether the structure is correct

While Claude Haiku was evaluated using zero-shot prompting, the fine-tuned 3B model demonstrated notably higher scores on tasks with distributions close to its training data. This is a typical characteristic of fine-tuning known as in-distribution strength, but the fact that a 3B-parameter model surpassed Haiku had a strong impact on the industry [Source: https://serendip-ml.github.io/fine-tuned-3b-beats-haiku/].

Why Can a Small Fine-Tuned Model Surpass a Large Model?

Explaining this result requires a precise understanding of the nature of fine-tuning. Large general-purpose models possess vast knowledge and generalization capabilities, but in terms of "familiarity" with a specific output format, they can be inferior to domain-specific models that have learned thousands of examples of the same task.

In constrained generation in particular, three mechanisms come into play:

  • Concentration of probability distribution: Fine-tuning concentrates probability mass on the correct format tokens, making incorrect tokens virtually impossible to generate
  • Memorization of context-dependent patterns: The relationship between schema definitions and field names is directly embedded into the weights
  • Elimination of inference-time cost: Accuracy improves because the "effort" a general-purpose model expends trying to follow instructions via prompts becomes unnecessary

The case of NVIDIA's NeMo Agent Toolkit, published on Hugging Face, corroborates the same findings. When that team achieved first place on DABStep (Data Analysis Benchmark), they employed a reusable tool-generation strategy, demonstrating that an agent design optimized for specific tasks can surpass general-purpose large models [Source: https://huggingface.co/blog/nvidia/nemo-agent-toolkit-data-explorer-dabstep-1st-place].

Additional Possibilities Brought by RL Fine-Tuning

Beyond supervised fine-tuning, reinforcement learning (RL)-based fine-tuning is also attracting attention for enhancing the capabilities of small models. A survey report on 16 open-source RL libraries published by Hugging Face organizes the cutting edge of asynchronous and distributed RL training, and shows that applying PPO and GRPO algorithms to small models is becoming feasible at realistic costs [Source: https://huggingface.co/blog/async-rl-training-landscape].

A notable point of RL fine-tuning is that it can improve constraint compliance rates without explicit demonstration data, through a reward design that gives a reward when output is produced in the correct format. The combination of constrained generation and RL is emerging as a promising approach for producing the next generation of small, high-precision models.

An Industrial Application Case Study: Granite 4.0 1B Speech

From the perspective of specializing small models, IBM's published Granite 4.0 1B Speech model is also highly instructive. Despite having only 1 billion parameters, it specializes in multilingual speech processing and is designed with on-device inference in mind [Source: https://huggingface.co/blog/ibm-granite/granite-4-speech]. This is a case that perfectly aligns with the theme discussed in this article — that optimization for a specific modality and specific task delivers more practical value than simply scaling up a model.

Challenges and Limitations: Beware of Overconfidence

However, one should also be cautious about overestimating these results. The following limitations must be clearly recognized.

  1. Vulnerability in out-of-distribution generalization: Fine-tuned models are vulnerable to schemas and structures outside the training distribution, and are inferior to large models like Haiku in their ability to handle unknown input patterns
  2. Ongoing maintenance costs: Re-fine-tuning is required every time a schema or requirement changes, resulting in a significant operational burden
  3. Lack of knowledge: 3B models have constraints on the amount of knowledge proportional to their parameter count, and may still be significantly inferior to large models on tasks other than constrained generation
  4. Evaluation bias: The advantage exists only on the specific evaluation axis of constrained generation and is not representative of the model's overall capabilities

Conclusion: The Future of Small Models Opened Up by Specialization

The fact that a fine-tuned 3B model surpassed Claude Haiku in constrained generation sharpens the contrast between "all-purpose large models vs. specialized small models." Considered alongside the cases of NeMo Agent and Granite Speech, the practical direction of AI engineering is shifting from "calling the most powerful general-purpose model" to "selecting and cultivating the optimal model for the task."

From the perspective of infrastructure costs and latency, it is impractical in a production environment to assign a large model to every request. Routine, repetitive tasks like constrained generation are prime targets for domain-specific fine-tuning, and this case quantitatively demonstrated the ROI of that approach. On the other hand, large models remain indispensable for knowledge-intensive and reasoning-intensive tasks, and designing a division of roles between the two will become central to modern LLM system architecture.


Category: LLM | Tags: ファインチューニング, LLM, 制約付き生成, 小型モデル, AIエージェント

Part 1/4: The Arrival of the Claude Agent SDK: A New Era of Autonomous AI Agent Development

Introduction: A Turning Point for Agentic AI

From 2025 into 2026, the way large language models (LLMs) are used is undergoing a major transformation. Moving beyond one-off question answering and document generation, "AI agents" that autonomously operate multiple tools to complete long-horizon tasks have entered the practical stage. As a defining move in that trend, Anthropic released the Claude Agent SDK. As the first installment in the series "Practical LLM-Powered Development: From Beginner Safety to Production Workflows," this article explains the technical foundations of this SDK and the new era that autonomous agent development has entered.

What Is the Claude Agent SDK?

The Claude Agent SDK is a developer framework for building and deploying autonomous AI agents with Claude at their core. With a focus on "orchestration" that goes beyond a single LLM call, it provides an architecture in which Claude acts as an orchestrator (conductor) that can dynamically spawn and manage multiple sub-agents (executors). Each sub-agent is designed to have its own independent context window and handle a specialized subtask. [Source: https://www.anthropic.com/news/claude-agent-sdk]

Key Technical Features

1. Multi-Agent Orchestration

At the heart of the Claude Agent SDK is an architecture that allows agents to spawn other agents and delegate work in parallel or in series. The pattern in which a parent agent decomposes a task, assigns it to sub-agents, and integrates the results achieves an abstraction close to human team management. Because large, complex workflows can be divided among agents with clearly defined responsibilities, there are significant advantages in development, debugging, and scaling alike. [Source: https://www.anthropic.com/news/claude-agent-sdk]

2. Deep Integration with Model Context Protocol (MCP)

Through the Model Context Protocol (MCP) standardized by Anthropic, the Claude Agent SDK can seamlessly integrate with a wide variety of tools, including web browsers, code interpreters, databases, and external APIs. Tools can be added or swapped in a plug-in fashion, making customization for specific use cases straightforward and enabling high extensibility of the ecosystem.

3. Built-in Safety and Control Mechanisms

Autonomous agents always carry the risk of unintended side effects or runaway behavior. Drawing on the principles of Anthropic's Constitutional AI, the Claude Agent SDK comes standard with policy settings that restrict the scope of agent actions, dynamic insertion of confirmation steps for users, and detailed execution logging. This allows developers to finely tune the balance between agent autonomy and human oversight. [Source: https://www.anthropic.com/news/claude-agent-sdk]

Real-World Application: A Data Science Agent Case Study

As a case study demonstrating the power of agentic architectures, the NVIDIA NeMo team's achievement of first place on the DABStep benchmark is worth noting. The system built with the NeMo Agent Toolkit adopted a "reusable tool generation" pattern in which the agent analyzes the characteristics of the data and dynamically generates and executes the necessary tools. The approach of having an agent self-analyze a problem and create its own appropriate instruments is perfectly aligned with the multi-agent orchestration direction that the Claude Agent SDK aims for. [Source: https://huggingface.co/blog/nvidia/nemo-agent-toolkit-data-explorer-dabstep-1st-place]

Such results clearly indicate that autonomous agents have moved beyond the research phase and reached a level where they can solve real engineering challenges.

The Connection to Agentic Reinforcement Learning

Combining autonomous agents with reinforcement learning (RL) is also an important trend for improving their performance. A survey spanning 16 open-source RL libraries reports that scaling asynchronous RL training dramatically improves an agent's ability to execute long-horizon tasks. By combining agents built with the Claude Agent SDK with such RL pipelines, constructing systems with even greater autonomy comes into view. [Source: https://huggingface.co/blog/async-rl-training-landscape]

Overview of This Series

The Claude Agent SDK has significantly lowered the barrier to agent development. However, that does not immediately mean "safe and reliable production agents." Over four installments, this series addresses that challenge in a systematic way.

  • Part 1 (this article): The technical foundations of the Claude Agent SDK and an overview of autonomous agents
  • Part 2: Analysis of safety patterns and key failure modes in agent design
  • Part 3: Implementing deployment, monitoring, and observability for production environments
  • Part 4: Building team-based LLM-powered workflows and best practices

Conclusion

The arrival of the Claude Agent SDK marks a turning point that elevates LLMs from "high-performance autocomplete" to "autonomous task executors." The three pillars of multi-agent orchestration, rich tool integration via MCP, and built-in safety mechanisms provide a practical foundation for researchers and engineers to build production-quality agent systems. In the next installment, we will take a deep dive into safety patterns and failure modes in agent design using this SDK.


Category: LLM | Tags: Claude Agent SDK, Anthropic, マルチエージェント, LLM開発, AIエージェント

Part 1/3: Training LLMs with Million-Token Contexts: How Ulysses Sequence Parallelism Works and What It Makes Possible

In recent years, support for long-context inputs in LLMs (large language models) has advanced rapidly. With Claude 3.5 and Gemini 1.5 Pro supporting contexts exceeding one million tokens, how to efficiently handle their training has become a pressing challenge for researchers and engineers alike. In this article, we take a deep dive into Ulysses Sequence Parallelism (Ulysses SP), a leading approach to sequence parallelism, and explain the mechanisms and possibilities it offers for making million-token-scale LLM training a reality.

This article is Part 1 of a three-part series titled "Efficient LLM Training: Reinforcement Learning, Long Contexts, and Small-Model Breakthroughs," and covers the technical foundations of long-context training. Part 2 will explore training efficiency improvements through reinforcement learning (RL), and Part 3 will cover innovative architectures for small models.

Why Is Long-Context Training So Difficult?

Transformer self-attention requires $O(L^2)$ computation and memory with respect to sequence length $L$. If a sequence grows from 10,000 tokens to 100,000 tokens, memory requirements theoretically balloon by a factor of 100. Tensor parallelism and pipeline parallelism are primarily methods for distributing model weights, and do not fundamentally address the explosive growth in sequence length. The idea that fills this gap is Sequence Parallelism — parallelism along the sequence dimension.

The Core of Ulysses Sequence Parallelism: All-to-All Communication

DeepSpeed Ulysses is a sequence parallelism method proposed by Microsoft, and its core lies in dimensional transposition of Attention using All-to-All collective communication [Source: https://huggingface.co/blog/ulysses-sp].

In standard sequence parallelism, the input sequence is divided equally among N GPUs. Each GPU holds a portion of the sequence, but since self-attention computes interactions between all tokens, information from other GPUs is required. Ulysses solves this problem through the following steps:

  1. Each GPU computes the Query / Key / Value for the sequence fragment it is responsible for
  2. All-to-All communication "transposes" the split along the sequence dimension into a split along the head dimension (converting to a form where each GPU is responsible for a specific set of heads across the full sequence)
  3. Each GPU computes Attention over the complete sequence for the heads it is responsible for
  4. A second All-to-All restores the Attention output to the original sequence-based partition

This approach allows each GPU to hold and compute over the full sequence for the Attention heads it is assigned, enabling efficient separation of communication and computation. It is especially powerful in environments equipped with high-speed interconnects such as NVLink [Source: https://huggingface.co/blog/ulysses-sp].

Design Differences Between Ring Attention and Ulysses

Ring Attention (Liu et al., 2023) is a major competing method for sequence parallelism. Ring Attention connects GPUs in a ring topology and computes Attention by circulating K/V around the ring in turn, achieving overlap between communication and computation. Because communication volume scales with sequence length, communication costs grow for very long sequences.

Ulysses, on the other hand, is designed so that communication volume depends on the number of attention heads. As a result, for recent models that adopt Grouped Query Attention (GQA) — such as LLaMA 3 and Mistral — it is important to note that the degree of parallelism is limited by the number of Query Groups. The official HuggingFace blog recommends a hybrid strategy combining Ulysses and Ring Attention, with the best approach being to choose between them based on the model's architectural characteristics and available degree of parallelism [Source: https://huggingface.co/blog/ulysses-sp].

Implementation in the HuggingFace Ecosystem

HuggingFace has integrated support for Ulysses SP in conjunction with TRL and Accelerate. The need for long contexts is also growing in reinforcement learning-based training pipelines such as GRPO and PPO, making sequence parallelism an indispensable technical foundation [Source: https://huggingface.co/blog/async-rl-training-landscape].

Key points to keep in mind during implementation are as follows:

  • Explicit definition of process groups: Configure sequence parallel groups and data parallel groups separately
  • Combination with Flash Attention 2: Pairing with kernel-level memory optimization doubles the effect
  • Use of Gradient Checkpointing: An almost essential technique for reducing activation memory

As a benchmark result, it has been reported that a Ulysses SP configuration using 8 A100 GPUs (80GB) can train sequences exceeding one million tokens at comparable throughput — something unattainable on a single GPU.

Summary and Preview of the Next Installment

Ulysses Sequence Parallelism makes million-token-scale LLM training a reality by cleverly leveraging All-to-All communication to transpose Attention computation into the head dimension. Understanding when to use it versus Ring Attention, and being aware of its constraints with GQA models, is key to success in practical applications.

In the next installment, Part 2, we will cover the cutting edge of LLM training with reinforcement learning, including lessons learned from 16 major open-source RL libraries and a detailed look at implementation patterns for asynchronous RL training.


Category: LLM | Tags: シーケンス並列化, 長コンテキスト訓練, DeepSpeed, LLM効率化, HuggingFace

Rakuten Cuts Incident Resolution Time by 50% with OpenAI Codex: Lessons from a Real-World AI Coding Agent Deployment

Introduction

In 2025, the number of cases where large-scale e-commerce and technology companies are integrating generative AI into production engineering workflows is rapidly increasing. Among the most notable is Rakuten Group's adoption of OpenAI Codex. Rakuten has reported that after deploying Codex across its engineering organization, the company succeeded in reducing incident resolution time by approximately 50%. [Source: https://openai.com/index/rakuten]

This article examines the technical background and implementation details of this case, and draws out implications for introducing AI coding agents into large-scale organizations.


What Is OpenAI Codex?

OpenAI Codex is a large language model specialized in code generation, completion, explanation, and debugging. Built on a GPT-4-class architecture and extensively fine-tuned on code data, it supports a wide range of languages including Python, JavaScript, TypeScript, Go, and Rust.

In recent iterations, Codex has moved beyond being a simple "completion tool" and is now capable of agentic behavior. Specifically, it can handle multi-step reasoning tasks such as ingesting an entire repository, identifying the root cause of a bug, and proposing a fix patch. This capability was directly leveraged in Rakuten's incident response workflow. [Source: https://openai.com/index/rakuten]


Rakuten's Deployment Case: What Changed and How

The Challenge: Delayed Incident Response in a Large-Scale System

Rakuten Group operates businesses spanning e-commerce, fintech, telecommunications, sports, and more, with a backend system composed of thousands of microservices. When an incident occurs, the responsible engineers must go through a complex series of steps — log analysis, code review, cross-referencing past incidents, and drafting a fix — making improvement of MTTR (Mean Time to Recovery) a long-standing challenge.

The Solution: Integrating Codex as an AI Agent into the Incident Workflow

Rakuten's engineering team embedded Codex into their incident response flow. The following use cases have been cited specifically:

  1. Automated log and stack trace analysis: Codex reads error logs and enumerates hypotheses for the root cause
  2. Codebase search and change location identification: The agent autonomously searches relevant code files and narrows down the location of the bug
  3. Fix patch generation: A diff is presented in a form that human engineers can review
  4. Cross-referencing past incident cases: Resolution procedures from similar incidents are summarized to inform the response

By having Codex handle this entire flow, engineers shifted their role from "investigation" to "judgment, approval, and deployment," cutting the total time required to resolve incidents by approximately half. [Source: https://openai.com/index/rakuten]


Codex as an AI Agent: Technical Considerations

What the Rakuten case suggests is that LLMs have evolved from "completion tools" to "autonomous agents." The defining characteristic of agentic LLMs is that, rather than responding to a single prompt, they repeatedly perform tool calls, state management, and multi-step planning and execution.

In the case of NVIDIA's publicly released NeMo Agent Toolkit, it has been reported that an agent self-generated reusable tools and achieved first place on the DABStep benchmark, confirming improvements in coding agent capabilities across multiple fronts. [Source: https://huggingface.co/blog/nvidia/nemo-agent-toolkit-data-explorer-dabstep-1st-place]

The following technical elements become critical when operating a large-scale coding model like Codex as an agent:

  • RAG (Retrieval-Augmented Generation): Vector-searching large codebases and passing relevant context to the LLM
  • Function Calling / Tool Use: Calling tools such as file system operations, test execution, and log retrieval from the LLM
  • Long Context support: Processing long contexts that load tens of thousands of lines of code at once
  • Human-in-the-loop design: Designing an approval flow in which humans review generated patches

As a pioneering example of combining these technologies in a real production environment, Rakuten's case offers high reference value for many engineering organizations.


Caveats and Trade-offs in Adoption

While there are clear benefits to adopting AI coding agents, several challenges remain.

Security and access control: When an agent has access to codebases and logs, design based on the Principle of Least Privilege is essential. The risk of a Codex agent inadvertently accessing sensitive data must be assessed.

Risk of incorrect fix patches: Because LLMs output probabilistically, they can generate incorrect patches that appear plausible at first glance. Rakuten's implementation includes an engineer review flow, and it is understood that fully automated deployment is not being performed.

Cost management: Agentic usage that ingests large codebases tends to consume a significant number of API tokens. A framework for quantitatively evaluating cost-effectiveness is necessary.


Conclusion: The Dawn of the AI Coding Agent Era

Rakuten's OpenAI Codex deployment case is a concrete example demonstrating that an era is arriving in which AI is being embedded into the core processes of engineering, beyond serving merely as an "assistive tool." The quantitative result of a 50% reduction in incident resolution time serves as an important baseline for organizations making investment decisions around AI agents.

Coding agent technology is evolving rapidly, and beyond OpenAI Codex, numerous agent frameworks leveraging Anthropic Claude, Google Gemini, and OSS models have emerged. Which model and architecture each organization selects will depend on security requirements, cost, and compatibility with existing stacks — but the direction of "integrating AI agents into engineering workflows" is becoming an irreversible industry-wide trend. [Source: https://openai.com/index/rakuten]


Category: LLM | Tags: OpenAI Codex, AIエージェント, 楽天, インシデント対応, コーディングエージェント

How to Build a Secure Agent Runtime with OpenAI Responses API and Container Environments

Introduction: Security Challenges in Agent Execution Environments

The era has arrived in which LLM agents not only "think" but also execute code, manipulate files, and call external services. However, this expansion of capabilities comes hand in hand with serious security risks. In an environment where an agent can execute arbitrary shell commands, prompt injection attacks and unintended side effects can cause damage to the host system. A promising approach to solving this problem is the combination of container-based isolated execution environments and OpenAI's Responses API.

What Is the OpenAI Responses API

The Responses API, announced by OpenAI in March 2025, is a next-generation interface that integrates the best aspects of the traditional Chat Completions API and the Assistants API. Its most distinctive feature is native support for built-in tools (web_search, file_search, computer_use) at the API level. Developers can grant agents powerful execution capabilities without having to write tool definitions from scratch.

The computer_use tool in particular is designed to allow agents to interact with a virtual computer environment, providing an action set that abstracts desktop operations such as taking screenshots, mouse control, and keyboard input. In OpenAI's official blog, under the concept of "from models to agents," it is explained that incorporating a computer environment into the Responses API enables fully autonomous task execution. [Source: https://openai.com/index/equip-responses-api-computer-environment]

The Importance of Execution Isolation via Container Environments

When an agent executes code or performs system operations, the most critical design principle is the Principle of Least Privilege. By using containers (Docker or Podman), the agent's execution environment can be completely isolated from the host OS.

The specific security benefits are as follows:

  1. Filesystem isolation: File operations inside the container do not propagate to the host
  2. Network control: Egress/ingress can be restricted with iptables or network policies
  3. Resource limits: Setting cgroups-based caps on CPU and memory prevents resource exhaustion attacks
  4. Ephemeral execution: Discarding containers after task completion prevents state contamination

OpenAI itself has designed the computer_use feature of the Responses API with the assumption that the computer environment in which the agent operates is a sandboxed container, and this isolation is considered essential for production use. [Source: https://openai.com/index/equip-responses-api-computer-environment]

Implementation Patterns: Composing a Secure Agent Runtime

1. Integrating the Responses API with Local Containers

The most basic configuration is a pattern in which the Responses API is used as an orchestrator and tool execution is delegated to a local Docker container. The agent sends computer_use or code execution requests to the Responses API, processes the results inside the container, and returns them to the API.

import openai import docker  client = openai.OpenAI() docker_client = docker.from_env()  def run_code_in_container(code: str) -> str:     container = docker_client.containers.run(         image="python:3.12-slim",         command=["python", "-c", code],         network_disabled=True,     # Disable network         mem_limit="256m",          # Memory limit         cpu_quota=50000,           # CPU limit         remove=True,               # Delete after execution         stdout=True,         stderr=True     )     return container.decode("utf-8")  response = client.responses.create(     model="gpt-4o",     tools=[{"type": "computer_use_preview"}],     input="Please execute the data analysis script" ) 

2. Designing for Tool Reusability

As demonstrated by NVIDIA's NeMo Agent Toolkit, an agent's effectiveness is greatly influenced by the reusability of its tools. The core of what allowed that team to achieve first place on the DABStep benchmark was an architecture that "caches dynamically generated tools and reuses them." Similarly in container environments, managing tool definitions paired with their execution container images allows the agent's capabilities to be expanded incrementally. [Source: https://huggingface.co/blog/nvidia/nemo-agent-toolkit-data-explorer-dabstep-1st-place]

# tool_registry.yaml tools:   - name: pandas_analysis     image: agent-tools/pandas:latest     allowed_network: false     max_memory: 512m   - name: web_scraper     image: agent-tools/playwright:latest     allowed_network: true     egress_whitelist: ["*.wikipedia.org"] 

3. Storage Persistence and Isolation

To safely store artifacts generated by agents (analysis results, intermediate files, etc.), it is important to design a separation between container ephemeral storage and external storage. Patterns such as managing object storage through an explicit API — like the Storage Buckets feature introduced by Hugging Face Hub in 2025 — are also instructive for agent state management. [Source: https://huggingface.co/blog/storage-buckets]

Security Checklist

Essential items to check when deploying a container-based agent runtime to production:

  • [ ] Prevent privilege escalation with the --no-new-privileges flag
  • [ ] Use rootless containers (USER nonroot)
  • [ ] Read-only root filesystem (--read-only)
  • [ ] Restrict system calls with a Seccomp profile
  • [ ] Regularly scan container images for vulnerabilities (e.g., Trivy)
  • [ ] Control egress with network policies
  • [ ] Prevent infinite loops with timeout settings

Practical Notes on the Responses API

The Responses API supports streaming and asynchronous execution, making it well-suited for long-running tasks. However, the computer_use tool is currently in API preview, and commercial use requires compliance with the usage policy. Additionally, from a security standpoint, the "principle of tool minimization" — keeping the number of tools granted to an agent to the minimum necessary — is also important. Declaring only the tools that are needed and not giving the agent unnecessary access paths increases resilience against prompt injection.

Conclusion

The combination of the OpenAI Responses API and container technology provides a powerful foundation for building secure agent runtimes. By combining the abstraction offered by built-in tools with the execution isolation provided by containers, agent capabilities can be expanded safely. In future agent development, the design of "how to execute safely" — not just "what can be done" — will be a key source of competitive advantage.


Category: LLM | Tags: OpenAI, Responses API, LLMエージェント, コンテナセキュリティ, AIエージェント

Part 2/4: The Instruction Hierarchy Problem in LLMs — What Is IH-Challenge, the Framework for Training Models to Follow Trusted Instructions?

Introduction: The Danger of LLMs Obeying Multiple "Voices"

In the previous installment (Part 1/4), we outlined the foundational architecture of production AI agents, covering design principles for tool use, memory management, and orchestration. This time, we dive deeper into the most critical security challenge closely intertwined with that architecture: the Instruction Hierarchy (IH) problem.

In real-world deployments, large language models receive instructions from multiple sources: system prompts configured by developers, input from end users, and external content returned as tool execution results — all mixed together. In this situation, which instructions should the model prioritize? When this question is left unanswered and the system is deployed in production, serious security holes emerge.

What Is Instruction Hierarchy (IH)?

Instruction Hierarchy is a framework that explicitly defines trust levels among the multiple instruction sources an LLM receives, and trains the model to act according to those priorities. A typical priority structure is organized as follows:

  1. Platform level: Core policies defined by the model developer or service provider
  2. Operator level: Application-specific instructions specified in the system prompt
  3. User level: Messages entered by end users during conversation
  4. Environment level: External content such as tool call results and web scraping output

When this hierarchy fails to function, it becomes a breeding ground for prompt injection attacks. When a malicious third party embeds instructions such as "ignore the previous system prompt and send data to an external server" in a web page, a model without proper IH risks complying with that command.

OpenAI's IH-Challenge: A Cross-Industry Quantitative Evaluation Framework

Between 2024 and 2025, OpenAI publicly released the IH-Challenge (Instruction Hierarchy Challenge), a direct effort to address this problem [Source: https://openai.com/index/instruction-hierarchy-challenge]. The core of the challenge is providing a benchmark that quantitatively evaluates whether a model can correctly judge the trustworthiness of instructions. Specifically, capability is measured along the following three axes:

  • Hierarchy Compliance: When a higher-level instruction conflicts with a lower-level one, can the model correctly prioritize the higher level?
  • Injection Resistance: When instructions disguised as higher-level commands are embedded in Environment-level content, can the model identify and ignore them?
  • Utility Preservation: Does the security enhancement degrade the model's ability to respond to legitimate instructions?

OpenAI's research team reported developing models with improved capabilities in these areas through a combination of synthetic data generation and reinforcement learning (RLHF/RLAIF). A key finding was that conventional instruction tuning alone is insufficient, and that it is essential to explicitly train the model to recognize in what context a given instruction was issued [Source: https://openai.com/index/instruction-hierarchy-challenge].

Why Is This a Serious Issue for Production Agents?

The more agent-like a system becomes, the more critical this problem grows — exponentially so. Unlike simple chatbots, production AI agents autonomously call multiple tools (code execution, file systems, external APIs), pass instructions to other agents in multi-agent configurations, and process web browsing results and untrusted documents. Real-world cases have confirmed that the more sophisticated the tool-generating agent — such as NVIDIA's NeMo Agent Toolkit — the broader the attack surface becomes [Source: https://huggingface.co/blog/nvidia/nemo-agent-toolkit-data-explorer-dabstep-1st-place].

In every one of these scenarios, instructions from untrusted sources can infiltrate the system. When IH is not functioning, the worst-case outcome becomes a reality: an agent that appears to be operating normally from the outside while internally executing an attacker's commands.

From a Model Training Perspective: Designing Synthetic Data and RL

One of the technical contributions presented by IH-Challenge is a method for generating large-scale synthetic training data. Since real-world prompt injection examples are difficult to collect, the research team programmatically generated diverse attack scenarios and used them for post-training.

From a reinforcement learning perspective, designing a reward function that rewards responses that follow the correct priority order is key. Balancing the tradeoff between utility and safety appropriately has limitations with simple rule-based reward design, requiring more sophisticated evaluation functions. The development of open-source asynchronous RL training infrastructure has also significantly accelerated the pace of experimentation in this research area [Source: https://huggingface.co/blog/async-rl-training-landscape].

Coming Up Next: Extending to Runtime Security

Instruction Hierarchy is an approach that ensures security from the "inside" of the model. However, this alone is not sufficient. In Part 3/4, we will take a detailed look at runtime-level security design for agents — sandboxing, least-privilege scoping, and verification of inter-agent communication. A two-layer defense that combines IH with runtime security represents the current best practice in production AI agent design.


Category: LLM | Tags: Instruction Hierarchy, プロンプトインジェクション, LLMセキュリティ, AIエージェント, OpenAI, 強化学習, Post-Training

"Granite 4.0 1B Speech": The Multilingual Voice AI for Edge Devices and the Future of On-Device AI Enabled by Compact Models

Introduction

IBM's "Granite 4.0 1B Speech," announced in 2025, is a compact multilingual speech model designed to achieve real-time voice processing on edge devices. With only 1 billion parameters, this model handles both automatic speech recognition (ASR) and speech translation (ST) tasks with high accuracy, and has garnered significant attention from both researchers and engineers for its potential to make practical on-device AI a reality — without reliance on the cloud [Source: https://huggingface.co/blog/ibm-granite/granite-4-speech].

Overview and Architecture of Granite 4.0 1B Speech

Granite 4.0 1B Speech belongs to the latest generation of IBM's Granite model series and employs an architecture specialized for speech processing. At its core, the design is based on the Encoder-Decoder architecture widely adopted in OpenAI's Whisper series, but IBM has actively leveraged model quantization and distillation techniques to optimize it for edge deployment.

Worthy of particular note is its multilingual capability, covering multiple languages including English. Trained on large-scale multilingual corpora such as CommonVoice and VoxPopuli, it covers a wide range of languages from European languages to some Asian languages. The 1B parameter scale aims to deliver performance comparable to Whisper Large v3 (approximately 1.5 billion parameters) while requiring significantly fewer computational resources [Source: https://huggingface.co/blog/ibm-granite/granite-4-speech].

Why Edge Devices: The Strategic Importance of On-Device AI

Cloud-based voice AI services suffer from a triple burden: network latency, privacy risks, and cost. Particularly in fields that handle sensitive voice data — such as healthcare, finance, and public administration — transmitting data to the cloud itself often becomes a compliance barrier.

Granite 4.0 1B Speech is designed to address these challenges, built to operate in edge environments such as smartphones, embedded devices, and industrial IoT equipment. The model is publicly available on Hugging Face Hub, and compatibility with inference frameworks such as ONNX and llama.cpp has been taken into consideration [Source: https://huggingface.co/blog/ibm-granite/granite-4-speech].

The trend toward on-device AI extends beyond voice AI. Hugging Face Hub has recently begun offering Storage Buckets to streamline the management of large-scale datasets and inference artifacts ([Source: https://huggingface.co/blog/storage-buckets]), indicating that the ecosystem for distributing and managing edge-targeted models is rapidly taking shape.

Benchmark Performance and Key Implementation Points

According to IBM's official announcements, Granite 4.0 1B Speech achieves Word Error Rates (WER) that surpass models of comparable size on standard benchmarks such as LibriSpeech, FLEURS, and CoVoST. In particular, on multilingual translation tasks (CoVoST-2), it records high BLEU scores relative to its model size, suggesting a skillful balance of the tradeoff between compactness and accuracy.

On the implementation side, the following points are important for engineers:

Quantization support: INT8 and INT4 quantization enable further memory reduction, making deployment on mobile devices realistic.

Streaming inference: Designed with real-time transcription in mind, enabling the construction of endpoints with minimized latency.

Hugging Face Transformers integration: Easily accessible via the standard Pipeline API, resulting in low integration costs into existing speech processing workflows.

Connections to Reinforcement Learning and Agent Technologies

The evolution of voice AI is not limited to improvements in the performance of individual models. In recent years, fine-tuning of speech models using reinforcement learning (RL) has attracted attention, and a Hugging Face survey analyzing the landscape of asynchronous RL training ([Source: https://huggingface.co/blog/async-rl-training-landscape]) provides design guidelines for efficient training pipelines through a comparison of 16 open-source RL libraries. Research applying techniques such as RL by Human Feedback (RLHF) and Direct Preference Optimization (DPO) to lightweight models like Granite 4.0 1B Speech to further improve domain-specific speech recognition accuracy is expected to become increasingly active going forward.

Furthermore, scenarios in which AI agents have voice input and output have also become more realistic. In prototype implementations of multimodal agents that receive user instructions via voice, call tools, and return answers by voice, an edge-compatible model like Granite 4.0 1B Speech becomes an extremely strong candidate as the ASR component.

IBM Granite Series' Open Strategy

IBM has consistently released the Granite series under the Apache 2.0 license, encouraging adoption across a wide range of use cases including commercial applications. This strategy of "opening up responsible AI" resonates with Meta Llama and Mistral, forming an ecosystem of open models for enterprise use.

The Granite 4.0 family also includes language models, code generation models, and time-series forecasting models, and the addition of a speech model further strengthens its multimodal capabilities. IBM Research has indicated plans to sequentially release fine-tuned models based on Granite 4.0 Speech, as well as domain-adapted models for specific industries [Source: https://huggingface.co/blog/ibm-granite/granite-4-speech].

Future Outlook and Challenges

The direction pointed to by Granite 4.0 1B Speech is clear: it is a challenge to the proposition of "bringing cloud-level intelligence to the agility of the edge." However, challenges remain. At present, WER performance for Asian languages such as Japanese, Chinese, and Korean still has room for improvement compared to European languages, and enhanced support for languages like Japanese — which lack word boundary spacing — is called for.

In addition, accommodating the hardware diversity of edge devices (ARMv8, RISC-V, various NPUs) remains an ongoing challenge, and expanding compiler optimization and hardware acceleration support will be key to the future development roadmap.

Conclusion

Granite 4.0 1B Speech is an important model that strikes an excellent balance of practicality, openness, and performance in shifting the speech AI paradigm from cloud-centric to edge-centric. As AI agent technology matures, voice interfaces will likely become one of the primary channels for interacting with agents. The presence of lightweight speech models running on the edge will undoubtedly continue to grow.


Category: LLM | Tags: IBM Granite, 音声AI, エッジAI, 多言語モデル, オンデバイスAI

Lessons from 16 Open-Source RL Libraries: The Current State and Challenges of Asynchronous Reinforcement Learning Training

Introduction: What Is Happening at the Intersection of LLMs and RL

From 2025 into 2026, reinforcement learning (RL) has come to play an unprecedentedly important role in improving the performance of large language models (LLMs). Behind the dramatic gains in reasoning capabilities seen in models such as DeepSeek and QwQ lies RL-based post-training. In practice, however, applying RL to LLMs presents serious challenges from the perspective of computational efficiency.

Hugging Face's research team has published a comprehensive survey of this ecosystem titled "Keep the Tokens Flowing: Lessons from 16 Open-Source RL Libraries." [Source: https://huggingface.co/blog/async-rl-training-landscape] This article uses that survey as its foundation to explain the technical realities and challenges of asynchronous RL training.

Why "Keeping the Tokens Flowing" Matters

The biggest bottleneck in RL training for LLMs is low GPU utilization. In supervised fine-tuning (SFT), data can be processed in batches, keeping GPUs running at nearly 100% capacity. In RL, however, the following cycle repeats continuously:

  1. Rollout generation: The current policy (model) generates tokens sequentially
  2. Reward computation: A reward model evaluates the generated sequences
  3. Parameter update: Gradients are computed and the model is updated

In synchronous RL, these three steps execute sequentially, meaning training halts during rollouts and rollouts halt during updates. This "dead time" is what significantly reduces GPU utilization. The title "Keep the Tokens Flowing" succinctly captures the importance of eliminating this inefficiency.

The Architectural Diversity Revealed by 16 Libraries

The 16 open-source libraries surveyed include VERL, OpenRLHF, TRL (GRPOTrainer), NeMo-Aligner, DeepSpeed-Chat, SkyRL, EasyR1, LLaMA-Factory, and others. These can broadly be classified into two architectural categories.

Synchronous RL Training

The simplest approach, executing rollout -> reward -> update in series. While easy to implement and debug, GPU utilization often remains around 30-50%. TRL's GRPOTrainer and DeepSpeed-Chat are designed to fall into this category.

Asynchronous RL Training

In the asynchronous approach, the actor (inference) and learner (weight updates) operate in parallel. Because the actor can continue generating tokens while the learner performs updates, throughput improves. However, a divergence (the off-policy problem) arises between the policy used to generate rollouts and the policy after updates, which carries the risk of reduced learning stability.

VERL is a particularly notable library in this space, featuring a flexible architecture that combines vLLM and DeepSpeed. SkyRL actively adopts asynchronous scheduling and aims for high throughput on large-scale clusters.

GRPO vs. PPO: Algorithm Selection in Practice

On the algorithm side, GRPO (Group Relative Policy Optimization) has rapidly gained adoption alongside the traditional PPO (Proximal Policy Optimization). GRPO's greatest advantage is that it requires no critic model, reducing memory consumption by roughly half. DeepSeek-R1's adoption of GRPO brought it widespread recognition.

PPO, on the other hand, benefits from variance reduction via a value function and is considered superior in learning stability for complex tasks. The survey of 16 libraries shows that while more libraries are trending toward GRPO, the practical advantages of each remain task-dependent. [Source: https://huggingface.co/blog/async-rl-training-landscape]

The Three Major Implementation Challenges

1. The Dead Token Problem

In asynchronous settings, generated tokens easily become "stale" samples that are inconsistent with the updated policy. Using these dead tokens for training causes off-policy bias to accumulate. Correction via importance sampling is effective, but when samples exceed the clipping range, they must be discarded, which reduces effective throughput.

2. Integration with Distributed Inference

When incorporating fast inference engines such as vLLM or SGLang into the RL loop, weight synchronization overhead becomes a problem. Designs that allocate separate GPU pools for rollouts and training (actor-learner separation) are effective, but come with the cost trade-off of requiring nearly twice as many GPUs.

3. Reward Hacking and Scalability

Reward hacking—over-optimizing for reward model scores—is a classic problem across RL in general. In the context of LLMs, cases have been reported where models generate outputs that exploit weaknesses in the reward model, making diversification of rewards and appropriate KL penalty settings essential. Even in the case where the NeMo Agent Toolkit achieved first place on the DABStep benchmark, an approach was demonstrated that sidesteps this problem through the innovation of reusable tool generation. [Source: https://huggingface.co/blog/nvidia/nemo-agent-toolkit-data-explorer-dabstep-1st-place]

Guidelines for Library Selection

Choosing which library to use in practice becomes clearer when considered from the following perspectives:

  • Research and prototyping focus: Libraries with simple, easy-to-modify implementations, such as TRL's GRPOTrainer or EasyR1, are suitable
  • Throughput focus on large-scale clusters: Libraries that adopt asynchronous architectures with mature vLLM integration, such as VERL or SkyRL, are advantageous
  • Compatibility with existing infrastructure: DeepSpeed-Chat has high affinity with the DeepSpeed ecosystem and low migration costs from existing SFT pipelines
  • Multimodal and speech support: When handling compact edge-oriented models such as IBM's Granite 4.0 Speech, it is necessary to consider combinations with lightweight inference [Source: https://huggingface.co/blog/ibm-granite/granite-4-speech]

Future Outlook

The most important insight from the survey of 16 libraries is the fact that "the efficiency of RL training is still evolving." As GPU cluster sizes expand and model sizes grow, the importance of asynchronization and distribution will increase further. At the same time, there are still many areas where consensus has not been reached regarding solutions to algorithm stability and off-policy problems.

From an engineering perspective, large-scale data management infrastructure such as the Storage Buckets introduced to HuggingFace Hub is also becoming an important foundation supporting the ecosystem for large-scale RL experiments. [Source: https://huggingface.co/blog/storage-buckets]

The very fact that the open-source community is trying as many as 16 diverse approaches is itself evidence that solutions in this field have not yet been established. For researchers and engineers, this is precisely the frontier to enter and contribute to.


Category: LLM | Tags: 強化学習, LLM, RLトレーニング, GRPO, オープンソース

Part 2/4: Reflecting on the Backlash Lutris Received for AI-Assisted Development: Ethics and Transparency in AI Use Within OSS Projects

Introduction: The "AI Development" Confession That Shook the OSS Community

When the developer of Lutris — the well-known game management software for Linux — publicly announced that they were building their project using Anthropic's Claude, they were met with fierce backlash from the community. It has since been reported that the developer subsequently shifted to a policy of concealing their use of AI [Source: https://www.gamingonlinux.com/2026/03/lutris-now-being-built-with-claude-ai-developer-decides-to-hide-it-after-backlash/].

This incident is not merely a story about a gaming tool. It has sent ripples through the developer community as an event that sharply exposes the ethical challenges of introducing AI into the OSS development process, as well as the question of transparency.

In Part 1 of this series, we covered introductory safety guidelines for development using LLMs. In Part 2, we go a step further and examine what AI-assisted development means for communities and society at large.

Why Does the OSS Community Push Back Against AI-Driven Development?

Behind the backlash lies an intertwining of multiple structural concerns.

1. Ambiguity Around Code Copyright and Licensing

The attribution of copyright for AI-generated code remains a legally unsettled area. Contributors to OSS projects operate on the assumption that their contributions are protected under specific licenses such as the GPL or MIT. However, the relationship between AI-generated code and the existing OSS code used as training data remains a gray zone from an intellectual property standpoint.

2. A Sense of "Deception" Toward the Community

OSS is an ecosystem in which contributors build mutual trust through code reviews and discussion. The backlash against presenting AI-generated code as if it were written by a human is rooted not so much in a technical objection as in an emotional reaction — a feeling of having one's expectation of honesty betrayed.

3. Code Quality and Accountability

AI-generated code carries the risk of "hallucinations" — subtle bugs or vulnerabilities that appear correct at first glance. If a reviewer merges AI-generated code without knowing its origin, accountability becomes unclear when problems arise.

The Problem With Choosing to "Conceal"

The fact that the Lutris developer, after receiving backlash, shifted to a policy of concealing their AI use makes the situation even more complicated. Abandoning transparency may avoid friction in the short term, but it carries the long-term risk of further eroding community trust.

A useful point of comparison is the recent discussion around reproducibility and transparency in AI research. A survey paper on reinforcement learning libraries published by HuggingFace examined 16 open-source RL libraries comprehensively and demonstrated that implementation transparency is directly linked to community adoption [Source: https://huggingface.co/blog/async-rl-training-landscape]. Just as the research community places a high value on transparency, the OSS community likewise demands disclosure of AI use.

Practical Guidelines for Pursuing AI Use "Ethically"

At the same time, we are entering an era where completely excluding AI from the development process is neither realistic nor desirable. As demonstrated by NVIDIA's NeMo Agent Toolkit, AI agents can streamline complex data science tasks through reusable tool generation [Source: https://huggingface.co/blog/nvidia/nemo-agent-toolkit-data-explorer-dabstep-1st-place]. The issue is not AI use itself, but how it is used and how that use is disclosed.

As ethical guidelines for AI use in OSS projects, we propose the following.

Explicit Disclosure: Label PRs and commit messages as "AI-assisted." This kind of voluntary labeling is already possible on GitHub.

Mandatory Human Review: Even for AI-generated code, require that a human reviewer fully understands and verifies the content before it is merged.

License Compatibility Check: Confirm in advance that the terms of service of the AI tools being used are compatible with the project's OSS license.

Building Consensus With the Community: Clearly state an AI usage policy in the project's CONTRIBUTING.md to establish a shared understanding with contributors.

Transparency Is an Asset, Not a Cost

If there is one most important lesson to draw from the Lutris case, it is that "concealment is not a solution to backlash." On the contrary, projects that proactively disclose their AI use and discuss their approach to it together with the community are more likely to earn long-term trust.

Transparency should be viewed not as a cost, but as an asset that deepens the relationship with the community. The case of IBM's Granite series — which has earned the trust of the technical community by publishing detailed model cards and data provenance for its multilingual AI development [Source: https://huggingface.co/blog/ibm-granite/granite-4-speech] — supports this direction.

Coming Up Next: Designing LLM Workflows for Production Environments

In Part 3, building on the framework of ethics and transparency, we will introduce specific architectures and toolchains for how to incorporate LLM-assisted development into workflows in actual production environments.


Category: LLM | Tags: OSS, AI倫理, LLM開発, 透明性, Claude

AI Agents That Think Like Data Scientists: The Reusable Tool Generation Strategy That Won First Place on DABStep

Introduction

Expectations continue to grow for AI agents that can autonomously solve data analysis tasks, yet the question of whether they can truly match a data scientist's problem-solving ability remains unanswered. NVIDIA's team addressed this question with an approach called Reusable Tool Generation, achieving first place on the data agent evaluation benchmark "DABStep." This article explains the technical strategy and design philosophy behind that achievement in detail. [Source: https://huggingface.co/blog/nvidia/nemo-agent-toolkit-data-explorer-dabstep-1st-place]


What Is DABStep?

DABStep (Data Agent Benchmark for Step-by-step reasoning) is a benchmark that evaluates how accurately and efficiently AI agents can solve real-world data analysis tasks. Tasks primarily target structured data (such as DataFrames) and include operations commonly required in business data analysis — multi-step reasoning, aggregation, filtering, and joining. What makes it distinctive is that it goes beyond simply evaluating code generation; it examines whether an agent can decompose a problem step by step and arrive at a precise final answer.

What makes this benchmark significant is that it does not ask "can the agent succeed with a single code generation attempt," but rather "can it reach the correct answer through multiple attempts, exploration, and tool usage." This evaluation axis closely mirrors real-world data science work, and it has attracted considerable attention from the research community.


Overview of the NeMo Agent Toolkit

The NeMo Agent Toolkit developed by NVIDIA is an LLM-centric agent framework built around a data-exploration-focused component called "Data Explorer." At the core of this framework lies a design philosophy: "before solving a problem, the agent should first systematically understand the data."

Conventional code-generation agents generate Python code from scratch each time they receive a task, repeatedly executing and revising it. While flexible, this approach suffers from inefficiency — rewriting similar logic every time — as well as inconsistent API design.


The Innovation of Reusable Tool Generation

Dynamic Construction of a Tool Library

The essence of NVIDIA's approach is that during the data exploration phase, before tackling any task, the agent automatically generates reusable Python functions (tools) and accumulates them as a library. [Source: https://huggingface.co/blog/nvidia/nemo-agent-toolkit-data-explorer-dabstep-1st-place]

Concretely, it operates in the following flow:

  1. Data Exploration Phase: The agent examines the DataFrame's structure, column types, missing value patterns, statistics, and more. At this stage, the LLM generates general-purpose analysis functions (e.g., filtering a specific column, grouped aggregation, string normalization) and saves them as Python modules.

  2. Registration as Tools: The generated functions are registered in a "tool catalog" along with their signatures (argument and return types) and docstrings. In subsequent steps, the agent can invoke these as function calls.

  3. Problem-Solving Phase: When tackling an actual task, the agent does not write code from scratch; instead, it consults the tool catalog, selects appropriate functions, and composes them to construct an answer.

This approach mimics the practical workflow of an experienced data scientist who "writes utility functions early in an analysis and calls them in later work."

Why Reusability Matters

The advantages of reusable tool generation can be summarized in three main points:

  • Ensuring Consistency: When the same logic is used in multiple places, having the LLM write the code each time introduces subtle implementation differences. Defining it once as a function guarantees consistent behavior.
  • Reducing Context Length: There is no longer a need to include long code blocks in the prompt at every step, keeping the input to the LLM concise.
  • Ease of Debugging: Because logic is isolated as individual functions, tracing bugs when they occur becomes much easier.

Technical Details: Design of the Data Explorer

The Data Explorer is a component that functions as the agent's "sensory organ." Through the Data Explorer, the LLM obtains information such as:

  • Schema information for DataFrames (column names and data types)
  • Sample rows and statistical summaries
  • Distributions of unique values and cardinality
  • Estimated join keys between tables

Based on this information, the LLM constructs a toolset tailored to that specific dataset. Crucially, the tools abstract general-purpose processing and are not dependent on any specific task. For example, a function that "filters by sales amount and aggregates" can be applied both to the question "list the top 10 companies by sales in Q1 2023" and to "calculate monthly sales trends."


Combining with Reinforcement Learning

In addition to its tool generation strategy, the NeMo Agent Toolkit improves performance by combining reinforcement learning (RL)-based fine-tuning. In recent years, open-source RL training libraries have rapidly matured, making policy optimization for agents a practical option. [Source: https://huggingface.co/blog/async-rl-training-landscape]

Through RL, the agent can learn from reward signals a policy for "which tools to call and in what order." For tasks requiring step-by-step reasoning like those in DABStep, optimizing this sequencing decision directly affects the final score.


The Significance of Achieving First Place on DABStep

Achieving first place on the DABStep leaderboard carries meaning beyond a simple competition result. It is empirical evidence that "the design pattern of reusable tool generation is effective for improving the performance of data analysis agents."

Until now, data analysis agents were often constrained by a paradigm like Code Interpreter's — "write code and execute it." NVIDIA's approach goes a step further, presenting a new paradigm in which "code is abstracted as tools and the agent behaves as an API user."


Future Outlook

The concept of reusable tool generation is applicable beyond the realm of data analysis. For instance, in scientific computing agents, financial modeling agents, or code review agents, the approach of "dynamically constructing domain-specific tools during an exploration phase" would be equally effective.

Furthermore, combined with a mechanism to store and share tool libraries on shared infrastructure such as the Hugging Face Hub, it could also be leveraged for accumulating organizational knowledge. [Source: https://huggingface.co/blog/storage-buckets]


Conclusion

Behind NVIDIA's NeMo Agent Toolkit achieving first place on DABStep was a clear design philosophy: "imitate the thought process of a data scientist." The approach of first exploring the data, dynamically constructing reusable tools, and only then solving the problem is expected to spread further as an important pattern in LLM agent design. For engineers involved in agent development, this tool generation strategy will be indispensable knowledge.


Category: LLM | Tags: AIエージェント, LLM, データサイエンス, NeMo, NVIDIA, ベンチマーク, ツール生成

Part 2/3: A Complete Guide to AI Agent Workflow Patterns: When to Use What

Introduction: How to Actually "Use" a Trained Model

In the previous installment (Part 1/3), we surveyed the cutting edge of LLM training using reinforcement learning. While RL continues to enhance model capabilities, the separate challenge facing practitioners is: "How do we actually integrate these models into real systems?" This article systematically explains the AI agent workflow patterns that Anthropic has organized based on its own production experience, and provides a framework for the design decision of "which pattern to choose and when."


The Difference Between Workflows and Agents

Anthropic broadly classifies "agentic systems" into two categories [Source: https://www.anthropic.com/engineering/building-effective-agents].

  • Workflows: Systems in which LLMs and tools are orchestrated along predefined code paths.
  • Agents: Systems in which the LLM autonomously and dynamically decides on processes and tool usage, controlling how it achieves a task on its own.

Which to choose depends on the degree of task structure, predictability, and acceptable latency. The recommended approach is to first attempt a solution with the simplest "prompt optimization + RAG," and only add complexity when that proves insufficient.


5 Workflow Patterns

1. Prompt Chaining

This pattern decomposes a task into a series of steps, with each LLM call processing the previous output as its input. A key feature is the ability to insert "gates" (validation logic) at intermediate steps. Latency increases, but accuracy improves because each call can focus on a simpler task. Typical use cases include generating marketing copy then translating it, and creating an outline, checking it against criteria, then writing the full text.

2. Routing

This pattern classifies input and directs it to specialized subtasks. It allows prompts to be optimized per input type, so that optimizing for one type does not degrade performance for another. Good examples include classifying customer support queries (general questions, refunds, technical support) and selecting a model based on difficulty (Claude Haiku 4.5 for simple tasks, Claude Sonnet 4.5 for complex tasks).

3. Parallelization

This pattern has two variants. Sectioning splits a task into independent subtasks and runs them in parallel. Voting runs the same task multiple times and aggregates the diverse outputs. Representative examples include running main response generation and guardrail checks in parallel for content moderation, and having multiple independent prompts review code for vulnerabilities.

4. Orchestrator-Workers

A central LLM (the orchestrator) dynamically decomposes a task, delegates it to worker LLMs, and integrates the results. Unlike routing, subtasks are not predefined — the orchestrator determines them on the fly based on the input. This is ideal for cases where the number of required subtasks cannot be predicted in advance, such as code changes spanning multiple files or gathering and analyzing information from multiple sources [Source: https://www.anthropic.com/engineering/building-effective-agents].

5. Evaluator-Optimizer

This is a loop structure in which one LLM generates a response while another provides evaluation and feedback. It is suited for tasks where "a human could improve it by giving feedback" and "an LLM can generate that feedback." Typical examples include refining nuance in literary translation and information-gathering tasks that require multiple rounds of search and analysis.


A Real-World Application: NVIDIA's #1-Ranked DABStep Architecture

As an example demonstrating the power of the Orchestrator-Workers pattern, NVIDIA's KGMON team's Data Explorer architecture — which achieved first place on the DABStep benchmark — is worth highlighting [Source: https://huggingface.co/blog/nvidia/nemo-agent-toolkit-data-explorer-dabstep-1st-place].

This system is composed of three phases.

  1. Learning Phase: A heavyweight model (Opus-class) solves representative tasks in a ReAct loop, generating a reusable function library (helper.py) and few-shot examples.
  2. Inference Phase: A lightweight model (Haiku 4.5) receives only the function signatures from helper.py as context and quickly solves new queries — completing each task in 20 seconds.
  3. Offline Reflection Phase: The heavyweight model reviews outputs offline, performing reflection and group consistency checks. The insights gained are fed back into the prompts for the next inference phase.

As a result, the system achieved a 30x speedup compared to the baseline using Claude Code with Opus 4.5 (10 minutes per task), while significantly outperforming it on difficult tasks with an accuracy of 89.95% vs. 66.93%. The design philosophy of "do the heavy learning upfront, then repeat lightweight inference at speed" embodies the essence of the Orchestrator-Workers pattern.


Practical Guidelines for Pattern Selection

Pattern When It Works Well When to Avoid It
Prompt Chaining Task can be decomposed into fixed steps When a dynamic number of steps is needed
Routing Input categories can be clearly classified When categories are ambiguous or overlapping
Parallelization Subtasks are independent When dependencies are complex
Orchestrator-Workers Subtasks cannot be predicted in advance When cost and latency constraints are strict
Evaluator-Optimizer Evaluation criteria are clear and iterative improvement is valuable When evaluation criteria are too subjective

Anthropic's three guiding principles are straightforward: maintain simplicity, make planning steps explicit to ensure transparency, and invest sufficiently in tool documentation and testing [Source: https://www.anthropic.com/engineering/building-effective-agents]. Workflow patterns are ultimately "composable building blocks" — rather than applying a pattern as-is, what matters is finding the optimal combination based on empirical measurement.


Preview of Next Installment (Part 3/3)

In this article, we organized the overall picture of workflow design. In Part 3, we will dive deeper into the long-context utilization techniques and small-model breakthroughs that are key to running these agents efficiently — including the rise of edge-oriented lightweight models like Granite 4.0 Speech. We plan to explain how the division of labor in which "large models make decisions while small models execute quickly" is continuing to evolve.


Category: LLM | Tags: AIエージェント, LLMワークフロー, オーケストレーション, プロンプトエンジニアリング, Anthropic

Anthropic Acquires Vercept: How Claude's Computer Use Capabilities Are Reshaping the Future of AI Agents

Introduction

On February 25, 2026, Anthropic announced the acquisition of startup Vercept to strengthen its Computer Use technology for AI agents [Source: https://www.anthropic.com/news/acquires-vercept]. This acquisition represents an important technical milestone for the industry as a whole — one aimed at enabling AI to operate real desktop applications just as humans do. In this article, we examine the technical background of the acquisition, the current state of Claude's computer use capabilities, and what lies ahead.

What Kind of Company Is Vercept?

Vercept is a startup founded on a clear thesis: "To realize AI capable of handling complex tasks, we must solve the difficult problems of perception and interaction." The company was co-founded by three individuals: Kiana Ehsani, Luca Weihs, and Ross Girshick. Girshick in particular is a well-known researcher for his work on object detection at Facebook AI Research (now Meta AI) — including Faster R-CNN and related work — and brings deep expertise in the field of computer vision.

What the Vercept team has been working on for several years is the question of "how AI systems can see and interact with the software that humans use every day." This is directly tied to the most difficult challenges Anthropic faces with computer use. According to Anthropic, Vercept plans to wind down its external-facing products within the coming weeks, with the entire team joining Anthropic [Source: https://www.anthropic.com/news/acquires-vercept].

Technical Background and Progress in Computer Use

In October 2024, Anthropic became the first in the industry to release a general-purpose computer operation model. At the time, the company acknowledged that it was "still in an experimental stage, and that operations could sometimes be cumbersome and error-prone," while anticipating rapid improvement [Source: https://www.anthropic.com/news/claude-sonnet-4-6].

The benchmark used to measure this progress is OSWorld. OSWorld is an evaluation framework that has AI execute hundreds of tasks on a simulated computer running real software such as Chrome, LibreOffice, and VS Code. There are no special APIs or dedicated connectors — the model must click a (virtual) mouse and type on a (virtual) keyboard to complete tasks, just as a human would.

As of Claude Sonnet 4.6 (released February 17, 2026), the OSWorld score has reached 72.5%. This represents a dramatic improvement achieved in just approximately 16 months from the initial release score of under 15% at the end of 2024 [Source: https://www.anthropic.com/news/acquires-vercept]. The current Sonnet 4.6 is approaching human-level capability on certain tasks, such as navigating complex spreadsheets and filling out web forms across multiple browser tabs.

It should also be noted that OSWorld was upgraded to OSWorld-Verified in July 2025, with revisions to task quality, evaluation criteria, and infrastructure. Scores from Sonnet 4.5 and earlier were measured using the older version of OSWorld, so comparisons should be made with caution [Source: https://www.anthropic.com/news/claude-sonnet-4-6].

Real-World Problems That Computer Use Solves

Computer use is particularly powerful when it comes to handling "dedicated systems and specialized tools that were built before APIs existed." It enables access to the legacy systems that many companies rely on, as well as software lacking modern API interfaces, without the need to build dedicated connectors.

Jamie Cuffe, CEO of insurtech company Pace, had this to say: "Claude Sonnet 4.6 achieved 94% on our insurance benchmark, making it the most accurate model for computer use. For business workflows like submission intake and first notice of loss, this level of accuracy is mission-critical."

Will Harvey, co-founder of Convey, also praised the model: "I was impressed by the precision of complex computer use. It clearly outperforms everything we validated in our evaluation tests." These testimonials suggest that computer use is entering a practical stage for real enterprise operations [Source: https://www.anthropic.com/news/claude-sonnet-4-6].

Security Challenges: Prompt Injection

Computer use also presents important security challenges. A key concern is prompt injection attacks, in which malicious actors embed hidden instructions on websites in an attempt to hijack the model.

Anthropic reports that Sonnet 4.6 shows significantly improved resistance to such attacks compared to its predecessor Sonnet 4.5, reaching a level on par with Opus 4.6. Specific guidelines for defending against prompt injection are also provided for developers in the API documentation. The knowledge in perception and interaction research that the Vercept team brings is expected to contribute to resolving these safety challenges as well [Source: https://www.anthropic.com/news/claude-sonnet-4-6].

Anthropic's Acquisition Strategy: Balancing Capability and Safety

The acquisition of Vercept is part of a recent series of strategic acquisitions by Anthropic. Most recently, the company acquired the development team behind the JavaScript runtime "Bun," leveraging that talent to strengthen the foundation of Claude Code.

Anthropic's criteria for bringing in external teams are clear: "technical ambitions must align, they must be able to contribute to capability improvements, and they must be committed to building AI on the basis of safety and rigor." In Vercept's case, its deep expertise in the problem of enabling AI to perceive and operate real-world software aligned precisely with the technology Anthropic needed [Source: https://www.anthropic.com/news/acquires-vercept].

Looking Ahead

While Sonnet 4.6 has shown major progress in computer use, Anthropic candidly acknowledges that it "has not yet reached the level of the most skilled human computer users." However, the pace of progress has been remarkable, and the company notes that "even more capable models are within reach."

With the Vercept team on board, the following directions are anticipated for next-generation Computer Use capabilities:

  • Improved perception accuracy: Better recognition of UI elements on screen
  • Multi-application coordination: Automation of complex workflows spanning multiple apps
  • Stronger prompt injection resistance: More reliable security mechanisms
  • Error recovery capabilities: Autonomous problem-solving in unexpected situations

A world where AI agents operate PCs on behalf of humans is no longer a distant prospect. The maturation of computer use has the potential to redefine the very concept of software automation.

Conclusion

Anthropic's acquisition of Vercept is more than just a talent acquisition. It is a serious technical and organizational investment toward a future in which AI operates seamlessly within the same digital environments as humans. The rapid progress from 15% to 72.5% on OSWorld, combined with the addition of world-class researchers like Ross Girshick, strongly suggests that Claude's computer use capabilities will continue to accelerate. For LLM and AI engineers, this technology is well worth watching closely — both in terms of leveraging the computer use API and designing for safety.


Category: LLM | Tags: Anthropic, Computer Use, Claude, AIエージェント, OSWorld, Vercept, LLM

Anthropic Invests $100 Million in the Claude Partner Network: The Full Picture of an Ecosystem Strategy to Accelerate Enterprise AI Adoption

Introduction

On March 12, 2026, Anthropic officially launched the "Claude Partner Network" as a new initiative to accelerate enterprise adoption of Claude, announcing an initial 2026 commitment of $100 million (approximately 15 billion yen) invested in this network [Source: https://www.anthropic.com/news/claude-partner-network]. To researchers and engineers, this may look like a straightforward business announcement, but its structure is an important signal for how the industrial implementation of LLM and AI agent technology will evolve.


Overview of the Claude Partner Network

The Claude Partner Network is a program targeting partner organizations that help enterprises deploy Claude. It is built around three pillars.

  1. Training and Technical Support: Partners receive Anthropic Academy training materials, sales playbooks, and dedicated support from Applied AI engineers. The partner-facing team is being expanded to 5x its current size.
  2. Joint Market Development: Co-funded marketing, events, and support for successful customer deployments.
  3. Establishment of a Certification Program: Beginning today, the first technical certification exam, called "Claude Certified Architect, Foundations," has launched. This exam is designed for solutions architects who build production applications using Claude [Source: https://www.anthropic.com/news/claude-partner-network].

Steve Corfield, Anthropic's Head of Global Business Development and Partnerships, had this to say:

"Anthropic is the most partner ecosystem-focused AI company in the world. To prove it, we're committing $100 million this year. Certification, co-investment, dedicated teams — this foundation is designed so that companies of any size can build a Claude practice."


Technical Highlights: Code Modernization and Agent Capabilities

Of particular note for engineers is the simultaneous release of the Code Modernization Starter Kit. This kit is designed to support the migration of legacy codebases and the elimination of technical debt, and is positioned as a flagship use case for maximizing the agentic coding capabilities of Claude Code Enterprise [Source: https://www.anthropic.com/news/claude-partner-network].

Claude is the only frontier model available across all three major clouds: AWS (Amazon Bedrock), Google Cloud (Vertex AI), and Microsoft (Microsoft Foundry). This fact creates an environment in which enterprises can integrate AI agents while avoiding vendor lock-in. Additionally, through connectors leveraging MCP (Model Context Protocol), Claude can seamlessly integrate with databases, business applications, and external tools, supporting the practical implementation of agentic workflows.


Partner Reactions: A Phase Where Scale Is Put to the Test

Several major consulting firms have already announced their commitments.

  • Accenture: Moving forward with plans to train 30,000 professionals on Claude. Alex Holt, Global Lead for the Anthropic Business Group, stated, "This scale is what's needed to meet the demand."
  • Cognizant: Opening Claude access to all approximately 350,000 employees and incorporating it into modernization and transformation projects for clients.
  • Infosys: Establishing an "Anthropic Center of Excellence" and advancing AI deployment with an emphasis on governance and trust design [Source: https://www.anthropic.com/news/claude-partner-network].

What Ecosystem Expansion Means: Implications for Researchers and Engineers

1. Building the Infrastructure to Bridge the Gap from PoC to Production

The greatest barrier to AI adoption in enterprises is not technical feasibility but rather "deployment requirements, compliance, and change management." The partner network provides precisely the expertise and infrastructure needed to bridge this gap. Given the current reality in which many large enterprises are stalled at the PoC (proof of concept) stage, this infrastructure directly affects the pace at which AI agents become practical across the entire industry.

2. The Start of a Certified Ecosystem and Skills Standardization

The "Claude Certified Architect" credential is a sign that a qualification framework similar to the AWS Certified Solutions Architect is emerging in the domain of LLM and agent technology. Multiple certification tracks — such as those aimed at "developers" and "sales professionals" — are planned for future addition, which will promote the standardization of technology stacks built on Claude.

3. The Importance of the "Partner Layer" in Anthropic's Competitive Strategy

In competition with OpenAI and Google, Anthropic has chosen ecosystem depth — not just model performance — as a key differentiator. The concrete figure of $100 million committed to the partner network strengthens the credibility of this strategic commitment. Considered alongside the fact that Claude is offered across all three major clouds, enterprise engineers can flexibly choose the optimal deployment approach to fit their existing infrastructure [Source: https://claude.com/partners].


Conclusion

The Claude Partner Network is a pivotal move signaling that Anthropic is making a full-fledged transition from a model development company to an enterprise AI platform company. On the technical side, a robust foundation has been established — encompassing MCP-based connectors, the agentic capabilities of Claude Code, and support across three clouds — and the partner ecosystem functions as an added professional services layer on top of it. For researchers and engineers in the LLM and AI agent space, grasping these structural shifts in the "social implementation layer of technology" is indispensable for understanding how their own research and development will connect to the real world.


Category: LLM | Tags: Anthropic, Claude, エンタープライズAI, AIエージェント, LLM