Mobile Study: 2025

Sunday, September 28, 2025

GitHub’s Spec Kit: Grounding AI Coding with Software Engineering Best Practices

Microsoft and GitHub have made AI an essential part of modern software development. With GitHub Copilot now integrated into both Visual Studio and Visual Studio Code, developers can access AI-assisted code completion, intelligent coding agents, and Model Context Protocol servers directly from their editors.

These tools can significantly accelerate development, but without proper structure, they can also encourage what’s often called "vibe coding"—writing code rapidly without sufficient planning, which can result in unnecessary features and overly complex solutions.

Introducing Spec Kit

To address this challenge, GitHub recently released Spec Kit, an open-source tool designed to bring structure and software engineering discipline to AI-assisted development.

Spec Kit goes beyond basic AI coding tools. It provides a command-line environment that integrates with GitHub Copilot and other AI agents to guide developers through the entire software development lifecycle—from initial specification to working prototype.

The goal is simple: build smarter, not just faster.

“The issue isn’t the coding agent’s coding ability, but our approach. We treat coding agents like search engines when we should be treating them more like literal-minded pair programmers.”
— Den Delimarsky, GitHub Principal Product Manager

How Spec Kit Works

Spec Kit is built to complement tools like Copilot while embedding traditional software development principles. It begins by helping you structure a Git repository and then provides a framework to guide your AI assistant through structured, intentional development.

With a focus on clarity and correctness, Spec Kit reduces the chances of AI-generated errors or hallucinations by prompting for clarification when needed. It allows developers to remain in control of the process while benefiting from the productivity gains of AI.

Setting Up Spec Kit in Visual Studio Code

Spec Kit supports both Windows and Unix-like environments. Here's a quick setup overview:

Install Astral uv: A Rust-based Python project management tool that handles environments and dependencies.
Download and Run Spec Kit: Use a script to get started, either as a one-time setup or a permanent installation.
Launch Visual Studio Code: Run it inside WSL or your preferred environment. Navigate to the project folder to begin development.

Once installed, Spec Kit scaffolds your project and sets up integration with your selected AI coding assistant.

Spec Kit Workflow Overview

1. Constitution

Begin by defining a "constitution"—a high-level set of principles that guide your project. These could include requirements like writing unit tests, adhering to specific architectural patterns, or optimizing for performance.

2. Specification (/specify)

Define what you’re building. This spec should include a detailed description of the application, its purpose, and the technologies involved. The spec evolves as your project grows, supporting new features and changing requirements.

3. Technology Plan (/plan)

Select the stack and services you'll use. This might start with simple tools (e.g., SQLite during development) and scale to more robust solutions (e.g., Azure SQL in production). Plans can be updated throughout the development cycle.

4. Task Breakdown (/tasks)

Based on your specification and plan, Spec Kit breaks the work into tasks. These cover front-end and back-end components, business logic, storage integration, and more—similar to a traditional project management breakdown.

5. Implementation (/implement)

Using test-driven development principles, Copilot helps generate code, write tests, and iterate through multiple passes. The system includes built-in prompts to flag incomplete or ambiguous requirements with [NEEDS CLARIFICATION] markers, encouraging human oversight where necessary.
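As a rough illustration of how those markers can feed a human review step, the sketch below scans specification text for [NEEDS CLARIFICATION] markers and lists the lines that still need a decision. It is not part of Spec Kit: the function name find_open_questions, the sample text, and the "note after a colon" marker form are all assumptions made for this example.

```python
import re

# Matches "[NEEDS CLARIFICATION]" and a variant that carries a note after a colon.
# The colon-plus-note form is an assumption made for this example.
MARKER = re.compile(r"\[NEEDS CLARIFICATION(?::\s*(?P<note>[^\]]*))?\]")


def find_open_questions(spec_text):
    """Return (line_number, note) pairs for every marker found in spec_text."""
    found = []
    for number, line in enumerate(spec_text.splitlines(), start=1):
        for match in MARKER.finditer(line):
            found.append((number, (match.group("note") or "").strip()))
    return found


# Hypothetical spec fragment used only to exercise the scanner.
sample_spec = """\
Users can sign in with email.
Session length: [NEEDS CLARIFICATION: how long before timeout?]
Export format: [NEEDS CLARIFICATION]
"""

for line_number, note in find_open_questions(sample_spec):
    print(f"line {line_number}: {note or '(no note given)'}")
```

A check like this could run before implementation starts, so that unresolved questions are answered by a person instead of being guessed by the agent.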

Why Spec Kit Matters

Spec Kit offers a middle ground between freeform AI-generated code and the structured demands of production-quality software. By grounding AI development in proven practices, it helps teams:

Minimize errors and hallucinations
Enforce architectural consistency
Promote test-driven development
Maintain control and oversight

This ensures that AI tools work alongside developers rather than replacing intentional design with quick code snippets.

Final Thoughts

AI coding agents like GitHub Copilot can dramatically boost productivity, but they need structure to deliver reliable, maintainable code. Spec Kit fills that gap by introducing engineering discipline into AI-assisted workflows.

Whether you're a solo developer or part of a larger team, Spec Kit helps ensure that AI remains a powerful assistant—not a shortcut that leads to technical debt.


NVIDIA Open Sources Audio2Face: Real-Time Facial Animation Powered by AI

Big news from NVIDIA! We're open-sourcing Audio2Face, our AI-powered real-time facial animation model. This cutting-edge tech brings 3D avatars to life — from games to virtual assistants — with stunningly realistic lip-sync and emotional expression.

What Is Audio2Face?

Audio2Face uses generative AI to animate faces in real time from just audio input. Whether you're building a video game NPC or a virtual customer service rep, this tech helps characters speak and emote like real people.

🔊 It analyzes audio features like intonation and phonemes, then turns them into facial animations and expressions — all in real time or offline.

Why It Matters

Until now, creating realistic character animation took tons of time and manual work. Audio2Face changes that — and now that it’s open source, any developer can integrate it into their workflow.

What’s Included?

Here’s what NVIDIA is offering as part of the open source release:

Package	Use
Audio2Face SDK	Everything you need to animate faces from audio — locally or in the cloud
Autodesk Maya Plugin (v2.0)	Add AI-driven facial animation to Maya projects
Unreal Engine Plugin (v2.5)	Real-time integration for UE5.5 and 5.6
Training Framework (v1.0)	Train and fine-tune your own facial animation models

Also included: example datasets, pretrained models, and emotional expression tools.

Real-World Adoption

Audio2Face isn’t just theory — it’s powering production pipelines today:

🎮 Reallusion integrated it with iClone and Character Creator to simplify animation workflows.

🧑‍🚀 Survios used it in Alien: Rogue Incursion Evolved Edition to accelerate lip-syncing and improve immersion.

☢️ The Farm 51 brought it into Chernobylite 2: Exclusion Zone, enabling detailed, emotion-rich animations that weren’t possible before.

More Developer Updates from NVIDIA

📦 RTX Kit – Improved neural rendering tools including texture compression, global illumination, and real-time ray tracing.

💻 NVIDIA vGPU – Activision revamped its dev pipeline, replacing 100 servers with just 6 RTX-powered machines. This resulted in:

82% smaller footprint
72% lower power use
250,000+ daily tasks across 3,000 devs

📹 Watch: Activision's GPU-Powered Dev Pipeline

🛠️ Nsight Tools – New profiling tools help developers debug ray tracing, optimize shaders, and manage VRAM performance.

Join the Community

Want to start building with Audio2Face? Join us:

🔗 NVIDIA Developer Program (select “Gaming”)
💬 Join our Discord
📱 Follow us on X (Twitter), LinkedIn, YouTube

Final Thoughts

We’re excited to see what the community creates with Audio2Face. Whether you’re working on AAA games or indie projects, this technology unlocks new levels of realism and creative freedom.

🚀 Let’s build the future of digital characters — together

Wednesday, September 24, 2025

Modernizing Java Projects with GitHub Copilot Agent Mode: A Step-by-Step Guide

Modernizing legacy Java applications can feel like trying to fix a plane mid-flight — especially when you're juggling outdated dependencies, deprecated APIs, or prepping for a cloud migration. But there’s good news: GitHub Copilot agent mode is here to turn that headache into a streamlined, guided workflow.

In this post, you'll learn how to use GitHub Copilot agent mode with the App Modernization for Java extension in VS Code to automatically upgrade, fix, test, and prepare your Java apps for the cloud.

Bonus: It’s not just for Java — .NET devs using Visual Studio can enjoy a similar guided experience!

What is GitHub Copilot Agent Mode?

Think of Copilot agent mode as an AI-powered junior developer that doesn’t just suggest code — it understands your goals and carries out multi-step tasks with you.

Instead of typing every little instruction, you can give it a high-level prompt like:

"Upgrade this Java app to Java 21, fix deprecated APIs, and get it cloud-ready."

And just like that, Copilot will:

Analyze your code
Build an upgrade plan
Apply changes
Fix build errors
Suggest secure dependencies
Run tests
Even help deploy to Azure

All inside VS Code.

What You’ll Need Before You Start

To follow along:

Visual Studio Code
GitHub Copilot license (Pro, Pro+, Business, or Enterprise)
GitHub Copilot App Modernization – Java extension
A legacy Java project (built with Maven or Gradle, JDK 8+)

Step-by-Step: Modernizing a Java App with Copilot Agent Mode

Step 1: Open Your Java Project in VS Code

Use your own project or clone a sample:


git clone https://github.com/your-org/your-legacy-java-app.git
cd your-legacy-java-app
code .

Make sure it's Git-initialized and has a working test suite (unit tests preferred).

Step 2: Start an Agent Session

Open the Copilot chat sidebar in VS Code
Launch a new Agent Mode session
Select: "GitHub Copilot App Modernization – Upgrade for Java"

Paste this prompt to kick things off:

"Using Java upgrade tools, upgrade this project to Java 21. Analyze deprecated APIs, update Gradle dependencies, and propose a safe, testable migration plan."

Step 3: Let Copilot Analyze & Plan

Copilot will:

Scan your JDK usage
Review build.gradle or pom.xml
Detect deprecated APIs
Flag security vulnerabilities (CVEs)
Create an upgrade plan (you can edit this!)

Example output:
A structured upgrade plan in markdown with goals, target JDK version, framework upgrades, and next steps.

Step 4: Apply Upgrades & Resolve Errors

Once you approve the plan, Copilot will:

Update Java syntax and imports
Apply OpenRewrite transformations
Enter a build-test-fix loop until the app compiles and passes tests

You’ll also get:

A change log
Commit history
API/dependency diff
A complete summary report

Example Code Change


// Before (deprecated)
View view = this.resolver.resolveViewName("intro", new Locale("EN"));

// After (Java 21)
View view = this.resolver.resolveViewName("intro", Locale.of("EN"));

Step 5: Make Your App Azure-Ready

After upgrading, it's time to prepare for the cloud. From the Copilot extension panel:

Click “Migrate to Azure”
Run an App Assessment to identify issues
Configure your deployment target (e.g., AKS) via assessment-config.yaml

Copilot will:

Highlight Azure readiness gaps
Propose fixes (e.g., migrate from on-prem auth to Microsoft Entra ID)
Update dependencies and configs
Write documentation
Validate everything

Sample changes include:

build.gradle updates (adding Entra ID support)
New application.properties entries
Custom Spring Security config for Entra ID
Markdown docs for dev teams

Step 6: Run Tests & Validate Everything

Once your migration is done, Copilot helps run and validate tests:

Run manually:


./mvnw test     # for Maven
./gradlew test  # for Gradle

If any test fails, Copilot can help debug or suggest new tests.

CVE Scanning

Copilot also performs automatic CVE scans and suggests secure dependency replacements — essential for maintaining compliance.

Example:
No known CVEs found for spring-cloud-azure-starter-active-directory:5.22.0

Step 7: Deploy to Azure (Automatically!)

Once everything checks out, deploy with one click:

Provision infrastructure (or use existing)
Deploy your app to Azure
Get full logs, status, and monitoring setup

Deployment report includes:

6 Azure resources provisioned
Auto-scaling enabled
Monitoring via App Insights & Log Analytics
Managed identity secured deployment
A full deployment record for documentation

Java Modernization Complete

With GitHub Copilot agent mode and the app modernization extension, you can:

Modernize legacy Java apps
Automatically fix code issues
Validate with tests
Scan for CVEs
Migrate to the cloud
Deploy to Azure

All inside a guided, chat-driven experience in VS Code.

Try It Today

Whether you’re upgrading Java or migrating .NET, Copilot agent mode is ready to help you:

Analyze large codebases
Plan upgrades and migrations
Execute changes safely
Cut hours of manual effort

Learn more in the GitHub Copilot Agent Mode doc

Tuesday, September 23, 2025

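Accelerating Quantum Materials Breakthroughs: The Impact of SCIGEN, an AI Tool Developed at MIT

MIT researchers have developed a new technique that uses generative AI to create novel materials that could be useful for quantum computing.

It is called SCIGEN. With it, next-generation materials that have been hard to discover, such as superconductors and quantum spin liquids, suddenly look far more attainable.

Can Generative AI Design Materials? The Challenge

In recent years, generative AI models developed by large companies such as Google, Microsoft, and Meta have been used to design new materials. Working on the same principle as text-to-image generation, these models have produced tens of millions of new material candidates.

However, they have hit a limit when it comes to creating materials with quantum properties such as superconductivity or unusual magnetic states.
These properties depend on very delicate patterns in the atomic structure, and existing AI models could not take those patterns into account.

Example: quantum spin liquids, materials that could revolutionize quantum computing, show how hard this is. After a decade of work, only about a dozen candidates have been identified.

The Solution: SCIGEN, Structure-Constrained AI Generation

SCIGEN (short for Structural Constraint Integration in GENerative model), developed by the MIT research team, is a tool that lets users impose geometric "structural rules" on a generative AI model.

This allows the AI to deliberately generate atomic structures that give rise to quantum properties, such as:

Kagome lattices (a geometry of overlapping, upside-down triangles)
Lieb lattices (a highly symmetric, specialized square structure)
Archimedean lattices (2D lattices tiled with different polygons)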
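
To make one of these geometries concrete, the sketch below builds a small patch of a 2D kagome lattice and counts nearest neighbors. It uses the standard textbook construction (a triangular lattice with a three-site basis) and is only an illustration of the geometry, not code from SCIGEN. The function names and the 4-by-4 patch size are assumptions chosen for the example.

```python
import math


def kagome_sites(nx, ny, a=1.0):
    """Return 2D coordinates of kagome-lattice sites for nx-by-ny unit cells."""
    a1 = (a, 0.0)
    a2 = (a / 2.0, a * math.sqrt(3.0) / 2.0)
    # Three sites per cell: the cell origin and the midpoints of a1 and a2.
    basis = [(0.0, 0.0), (a1[0] / 2.0, a1[1] / 2.0), (a2[0] / 2.0, a2[1] / 2.0)]
    sites = []
    for i in range(nx):
        for j in range(ny):
            ox = i * a1[0] + j * a2[0]
            oy = i * a1[1] + j * a2[1]
            sites.extend((ox + bx, oy + by) for bx, by in basis)
    return sites


def neighbour_count(sites, index, bond, tol=1e-9):
    """Count sites whose distance from sites[index] equals the bond length."""
    x0, y0 = sites[index]
    return sum(
        1
        for k, (x, y) in enumerate(sites)
        if k != index and abs(math.hypot(x - x0, y - y0) - bond) < tol
    )


sites = kagome_sites(4, 4)
counts = [neighbour_count(sites, k, bond=0.5) for k in range(len(sites))]
print(len(sites), max(counts))
```

Interior sites have four nearest neighbors, which is the corner-sharing-triangle pattern that distinguishes a kagome lattice; sites on the edge of the finite patch have fewer.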

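🔬 "What we are looking for is not 10 million stable materials. It is the one material that changes the world."
── Professor Mingda Li, MIT

Two New Materials Actually Discovered

SCIGEN can be plugged into existing generative AI models (for example, DiffCSP) with little effort.
The research team used SCIGEN to generate more than 10 million candidate materials, and from those they:

narrowed the set to 1 million structurally stable candidates,
ran detailed supercomputer analysis on a further 26,000, and
synthesized and verified two new materials (TiPdBi and TiPbSb).

As predicted, these new materials show unusual magnetic properties, and they may offer important clues toward practical quantum computing.

How SCIGEN Works, Briefly

A generative AI model (a diffusion model) generates material structures stochastically, guided by its training data.
SCIGEN acts as a "structural rulebook" on top of this: it filters out structures that break the rules, so the generated materials fit the target geometry from the start.

This makes it possible to steer the AI toward goals such as:

Generating only materials with a specific lattice structure
Producing candidates based on design rules that favor quantum effects
Narrowing to stable materials that can be synthesized in experiments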
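
The rulebook idea can be pictured with a toy iterative sampler that re-imposes a geometric constraint after every step, similar in spirit to masking in image inpainting. This is a sketch under heavy assumptions, not SCIGEN's code or DiffCSP's API: the "denoiser" is a stand-in that simply shrinks coordinates, and constrained_sample, the noise schedule, and the triangle motif are hypothetical.

```python
import random


def constrained_sample(constraint, n_free, steps=50, seed=0):
    """Toy iterative sampler that re-imposes a geometric constraint at every step.

    constraint: list of (x, y) positions that the first atoms must occupy.
    n_free:     number of additional atoms the sampler may place freely.
    """
    rng = random.Random(seed)
    n_fixed = len(constraint)
    # Start every atom from pure Gaussian noise.
    atoms = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_fixed + n_free)]
    for t in range(steps, 0, -1):
        # Stand-in for a trained denoiser: pull each atom slightly toward the origin.
        atoms = [(0.9 * x, 0.9 * y) for x, y in atoms]
        # Constraint step: overwrite the constrained atoms with the target motif,
        # noised to the level that remains after this step (zero at the last step).
        sigma = (t - 1) / steps
        for k, (cx, cy) in enumerate(constraint):
            atoms[k] = (cx + sigma * rng.gauss(0, 1), cy + sigma * rng.gauss(0, 1))
    return atoms


# One triangle of a kagome-like motif as the constraint (illustrative values).
triangle = [(0.0, 0.0), (0.5, 0.0), (0.25, 0.4330127018922193)]
result = constrained_sample(triangle, n_free=2)
print(result[:3] == triangle)
```

The constrained atoms end exactly on the motif, while the remaining atoms are left to the sampler. In a real system the free atoms would be placed by a trained model conditioned on the constrained ones.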

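Significance and Outlook

Discovering next-generation quantum materials such as quantum spin liquids remains very difficult and time-consuming.
Because SCIGEN can generate structurally promising materials both in large numbers and in line with a goal, it can greatly improve the efficiency of experimental research.

The MIT team is also considering adding capabilities such as:

Rules for chemical constraints and functional properties
Prompt optimization and integration with composite AI systems
Discovery algorithms specialized for new lattice structures

🧪 "Pursuing stability alone will not produce a breakthrough. What we need is to open the door to new possibilities."
── Ryotaro Okabe, first author of the paper (PhD student, MIT)

Expectations for Practical Use

Professor Steve May of Drexel University offered this assessment:

"SCIGEN is an innovative tool that accelerates the discovery of previously unexplored materials needed for next-generation electronic, magnetic, and optical technologies."

In materials science, a single discovery can change the future.
SCIGEN may become the lighthouse that guides the search for that world-changing material.

📄 Original article (MIT News)

New tool makes generative AI models more likely to create breakthrough materials – MIT News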

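Microsoft's Agent Lightning Opens Up a Next-Generation Training Paradigm for AI Agents

AI agents are no longer science fiction. From generating code and calling tools to carrying out complex multi-turn dialogues and even end-to-end software development, AI agents now perform real tasks in fields such as finance, gaming, and software development.

Training AI agents, however, has remained a major challenge.

The Challenge: Is Conventional Reinforcement Learning a Poor Fit for AI Agents?

Conventional reinforcement learning (RL) has succeeded in areas such as games and robot control, but it has been ill-suited to training AI agents in complex, dynamic environments.
There are three main reasons:

High development cost: training an existing AI agent with RL requires substantial code changes.
Poor scalability: the RL method has to be customized for each task, so it does not generalize.
Underused data: the rich interaction data produced at run time is hard to use for training.

To break out of this situation, Microsoft Research developed Agent Lightning.

What Is Agent Lightning?

Figure: Agent Lightning overview (source: Microsoft Research)

Agent Lightning is a flexible, extensible framework that enables efficient RL-based training for any AI agent.
Its defining feature is the complete decoupling of agent execution from training.

As a result, an agent can be trained as-is, without changing its logic.

Technical Design: LightningRL and the Training-Agent Architecture

Agent Lightning is built on two core components.

1. LightningRL: a new RL method that trains by decomposition

In reinforcement learning, training data is extracted from the trajectories (traces) the agent generates, and the model is trained on that data.
LightningRL reformulates complex multi-step agent operation as a single RL problem, which makes existing RL algorithms (PPO, DPO, and others) reusable.
A credit assignment module then distributes the reward appropriately across the individual steps.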
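
The decomposition and credit assignment described above can be sketched in a few lines. This is a minimal illustration, not the Agent Lightning API: the Transition class, the assign_credit function, and the sample Text-to-SQL run are hypothetical, and giving every step the episode's final reward is simply the most basic scheme one could choose.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    """One LLM call inside an agent run: what went in, what came out, and its reward."""
    prompt: str
    response: str
    reward: float = 0.0


def assign_credit(calls, final_reward):
    """Turn one multi-step trajectory into independent single-step training samples.

    calls: list of (prompt, response) pairs in the order the agent made them.
    Every step receives the episode's final reward (the simplest possible scheme).
    """
    return [Transition(prompt, response, final_reward) for prompt, response in calls]


# A hypothetical three-call Text-to-SQL run that ended with a correct query.
trajectory = [
    ("Write SQL for: total sales by region", "SELECT region, SUM(amount) FROM sales"),
    ("Check this query for errors", "Missing GROUP BY region"),
    ("Rewrite the query", "SELECT region, SUM(amount) FROM sales GROUP BY region"),
]
samples = assign_credit(trajectory, final_reward=1.0)
for sample in samples:
    print(sample.reward, "|", sample.prompt)
```

Once each call is an independent (input, output, reward) sample, a single-step RL algorithm can consume the samples without knowing how the agent was orchestrated.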

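2. Training-Agent architecture: separating front end and back end to improve development efficiency

Lightning Server: the center of the training process. It handles GPU management, model updates, and so on.
Lightning Client: runs the agent and collects data. It can be adopted without changing existing code.

With this structure, agent developers can focus on agent design and logic without worrying about the complicated setup of the training infrastructure.

Experimental Results: Effectiveness Demonstrated Across a Range of Tasks

Agent Lightning has been evaluated on realistic tasks such as the following.

● Text-to-SQL (LangChain)

In a complex workflow where three agents (SQL generation, checking, and rewriting) cooperate, the SQL-generation and rewriting agents were selectively trained. Rewards improved steadily, and the multi-step, tool-using process was successfully optimized.

● RAG (OpenAI Agents SDK)

On a retrieval-augmented generation task as well, Agent Lightning delivered sustained performance gains through training, showing that it can adapt to realistic RAG scenarios.

● Math QA with tool use (AutoGen)

On problems solved with a calculator tool, Agent Lightning improved both tool-call accuracy and answer correctness, showing that it also handles tasks where integration with external tools is essential.

Outlook: Beyond RL, Toward Broader Optimization

Agent Lightning is expected to evolve in the following directions:

Extension to prompt optimization and component-oriented optimization (introducing the concept of a Component of Interest, CoI)
Integration with advanced RL techniques such as long-horizon credit assignment and off-policy learning
Support for disaggregated system architectures for LLM optimization (separating inference, training, and execution)

Looking ahead, making full use of the execution data Agent Lightning collects is expected to greatly accelerate the autonomous improvement of AI agents.

Summary: Standardized Training to Drive the Evolution of AI Agents

Until now, training AI agents has assumed custom development. Agent Lightning makes standardized training possible.

Training without code changes
Easy combination of reinforcement learning and tool use
Straightforward integration with diverse algorithms and system configurations

As AI agents spread across many areas of society, Agent Lightning is likely to become an important foundational technology.

📄 Paper:
https://arxiv.org/abs/2508.03680

🔗 Official project page:
https://www.microsoft.com/en-us/research/project/agent-lightning/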

Sunday, September 14, 2025

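Example: Switching from Nondeterministic Mode to Deterministic Mode

Prerequisites

A Python + PyTorch environment (GPU available, CUDA enabled).
The thinking-machines-lab/batch_invariant_ops library is installed.
If the model uses a fast attention backend such as FlashAttention, the attention path should ideally support a fixed strategy (fixed splits, fixed chunks, a unified KV-cache layout, and so on).

Setup

First, prepare to use batch_invariant_ops.


# Clone the repository and install it in editable mode
git clone https://github.com/thinking-machines-lab/batch_invariant_ops.git
cd batch_invariant_ops
pip install -e .

Or, if the package is published on PyPI:


pip install batch_invariant_ops

Sample Code: Switching Between Nondeterministic and Deterministic Modes

The example below runs inference on a simple Transformer model (or vLLM and the like) in standard mode and in batch-invariant mode, and compares whether the outputs are consistent.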


import torch
from batch_invariant_ops import set_batch_invariant_mode
# Hypothetical model library that uses FlashAttention; replace with your own
from your_model_lib import TransformerModel, FlashAttentionBackend


def inference(model, input_ids, attention_mask):
    """Run a single forward pass without tracking gradients."""
    with torch.no_grad():
        logits = model(input_ids=input_ids, attention_mask=attention_mask)
    return logits


def test_determinism(input_ids, attention_mask, runs=5, batch_mode=False):
    """
    Run the same input several times and check whether the outputs match bit for bit.
    batch_mode=False: standard (nondeterministic) mode
    batch_mode=True:  batch-invariant mode enabled
    """
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = TransformerModel(backend=FlashAttentionBackend()).to(device)
    model.eval()  # disable dropout so it cannot cause spurious differences
    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)

    outputs = []
    for _ in range(runs):
        if batch_mode:
            # Enable batch-invariant kernels for this forward pass
            with set_batch_invariant_mode():
                out = inference(model, input_ids, attention_mask)
        else:
            out = inference(model, input_ids, attention_mask)
        outputs.append(out.cpu())

    # torch.equal is True only for identical shapes and exactly equal values
    first = outputs[0]
    all_same = all(torch.equal(first, o) for o in outputs[1:])
    return all_same, outputs


if __name__ == "__main__":
    # Build dummy input (for example, batch size 2 and sequence length 128)
    batch_size = 2
    seq_len = 128
    vocab_size = 30522  # example value
    input_ids = torch.randint(low=0, high=vocab_size, size=(batch_size, seq_len), dtype=torch.long)
    attention_mask = torch.ones((batch_size, seq_len), dtype=torch.long)

    # Test in standard mode
    same_std, outs_std = test_determinism(input_ids, attention_mask, runs=3, batch_mode=False)
    print(f"Standard mode deterministic? {same_std}")

    # Test in batch-invariant mode
    same_bi, outs_bi = test_determinism(input_ids, attention_mask, runs=3, batch_mode=True)
    print(f"Batch-invariant mode deterministic? {same_bi}")

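Caveats and Supplementary Implementation Notes

To use this switch at a practical level, keep the following points in mind.

Handling Attention / FlashAttention

FlashAttention and similar backends often apply chunking, Split-KV, kernel-level optimizations, and the like during the attention computation. When these optimization strategies change dynamically with batch size, sequence length, KV-cache state, and so on, they become a source of nondeterminism.
In deterministic mode, the design needs to fix the chunk/split size, keep the internal KV-cache layout consistent, and so on.

For example:


# Pseudocode sketch: adapting the attention call of a FlashAttention-style backend.
# FlashAttentionBackend, attend(), and kv_cache are hypothetical names, not a real library API.
import torch
from your_model_lib import FlashAttentionBackend


class FixedAttention(FlashAttentionBackend):
    def forward(self, query, key, value, mask=None):
        # Use an explicit, fixed chunk size and split strategy
        chunk_size = 64  # example: fixed
        # Whether or not a KV cache exists, combine the cache and the
        # current input into one unified layout before attending
        kv_cache = getattr(self, 'kv_cache', None)
        if kv_cache is not None:
            key = torch.cat([kv_cache['key'], key], dim=1)
            value = torch.cat([kv_cache['value'], value], dim=1)
        # Process in fixed-size chunks
        # Note: details differ depending on the library and GPU kernel configuration
        return super().attend(query, key, value, mask, chunk_size=chunk_size)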

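Other Operators (RMSNorm, MatMul, mean/reduction, etc.)

The batch_invariant_ops library provides a mechanism for replacing standard matrix and reduction operations such as torch.mm(), torch.addmm(), torch.log_softmax(), and torch.mean() with batch-invariant kernels.
Example: with torch.mean(), the order of the reduction along the batch or feature axis can normally depend on batch size and similar factors, so deterministic mode either fixes that order or never splits the reduction (or splits it only at a constant size).

Testing and Verification

After introducing deterministic mode, the following tests are worth running:

Repeated-inference test: run the same input many times (for example, 10 to 100) and check that the output logits match bit for bit.
Batch-size test: change the input batch size and check that the output for each individual element still matches (for example, batch size 1 versus batch size N).
Sequence-length and KV-cache test: confirm that outputs do not change across attention prefill, decode, and cache-off versus cache-on.
Cross-GPU test: run the same model, code, and libraries on different cards and hardware configurations to check reproducibility.
Performance measurement: compare speed, memory use, and latency between standard and deterministic modes to understand the size of the trade-off.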
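
The first two tests in the list can be wrapped in a small, framework-independent harness. The sketch below is an assumption-laden illustration: run_fn stands for any callable that maps a batch (a list of rows) to one float per row, toy_model is a hypothetical stand-in for a real model, and the comparison is done on raw 64-bit patterns so that "equal" means bit-for-bit equal.

```python
import struct


def bits(value):
    """Exact 64-bit pattern of a float, so comparisons are bitwise, not approximate."""
    return struct.pack("<d", value)


def repeated_run_test(run_fn, batch, runs=10):
    """True if run_fn returns bit-identical outputs for the same batch on every run."""
    reference = [bits(v) for v in run_fn(batch)]
    return all([bits(v) for v in run_fn(batch)] == reference for _ in range(runs - 1))


def batch_size_test(run_fn, batch):
    """True if each element's output is the same alone (batch size 1) and in the full batch."""
    together = run_fn(batch)
    return all(
        bits(run_fn([row])[0]) == bits(together[k]) for k, row in enumerate(batch)
    )


def toy_model(batch):
    """Hypothetical stand-in for a model: one float output per input row."""
    return [sum(row) for row in batch]


batch = [[0.1, 0.2, 0.3], [1.5, 2.5, 3.5]]
print(repeated_run_test(toy_model, batch), batch_size_test(toy_model, batch))
```

With a PyTorch model, run_fn would wrap the forward pass, and the per-element comparison would typically use torch.equal on the logits for each batch element.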

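Trade-offs and Practical Decision Criteria

Enabling deterministic mode can lower GPU core utilization, especially for small batches and short sequences, and is likely to increase latency.
In ordinary model serving, response diversity or speed sometimes matters more, so it is best to offer an option to switch between standard and deterministic modes depending on the use case.
The larger the model and the longer the input, the more nondeterminism accumulates, so the benefit of deterministic mode is easier to see, but so is the overhead.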

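Technical Points from the Paper/Blog Post "Defeating Nondeterminism in LLM Inference"

Overcoming Nondeterminism in LLM Inference

The nondeterminism discussed by Thinking Machines Lab (TML) goes beyond the familiar story of GPU parallelism plus floating-point error: the key is a set of design choices centered on the concept of batch invariance. Details follow.

Background and Problem Definition

Even when an LLM (large language model) is given the same prompt with temperature=0, a setting that should yield deterministic responses, the response can differ from call to call in practice and in API serving environments.
The conventional hypothesis (the "concurrency + floating point" hypothesis): floating-point arithmetic is non-associative, and the order of computation in GPU/parallel execution fluctuates (which thread or core finishes when, and so on), so different runs produce tiny errors that affect the final output.

TML argues that this hypothesis is correct but cannot be called the root cause of lost determinism, and that the most important factor behind nondeterminism is kernel design that depends on batch size and batching conditions.

What Batch Invariance Means

Batch invariance means:

If the input (a prompt, for example) is the same, the output for each input should be bit-for-bit identical even when the batch composition differs (batch size, number of concurrent requests, presence or order of other requests, and so on).

In addition:

On the server side, requests are routinely grouped and processed together in batches, sized according to server load and request arrival, in order to serve many users efficiently. As a result, the batch size varies from run to run.
Many GPU kernels (matrix multiplication, attention, and others) are designed to switch their internal computation strategy according to batch size (changing the split algorithm, the units used such as Tensor Cores, the optimization path, and so on), and this is the actual source of the nondeterminism.

Concrete Mechanism Behind the Nondeterminism

The problem arises through the following chain:

Server load fluctuates → the number of requests to process concurrently changes
Batch size fluctuates → a given input may be processed inside a large batch or a small one
The kernel switches strategy
- For small batches, it uses strategies that raise core utilization and reduce idle time.
- For large batches or particular shapes (matrix dimensions), it uses a different algorithm or optimization path (for example split processing, Split-K/Split-KV, or FlashAttention modes).
The order of operations and the internal reduction path change → because floating-point arithmetic is non-associative, a different order produces different rounding errors
Differences accumulate and propagate → in a model with many layers, such as a Transformer, these tiny differences grow layer by layer and produce visible differences in the final token generation and logit outputs.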
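
The chain above can be reproduced in miniature on a CPU. The sketch below is a toy model of the mechanism, not a real GPU kernel: adaptive_kernel, invariant_kernel, the batch-size threshold of 4, and the input values are all assumptions chosen so that the rounding difference is easy to see.

```python
def sum_in_order(values):
    """Reduce left to right in a single pass."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_split(values, chunk=2):
    """Reduce fixed-size chunks first, then combine the partial sums."""
    partials = [sum_in_order(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return sum_in_order(partials)


def adaptive_kernel(batch):
    """Toy kernel that picks its reduction strategy from the batch size."""
    strategy = sum_split if len(batch) < 4 else sum_in_order
    return [strategy(row) for row in batch]


def invariant_kernel(batch):
    """Toy kernel that uses one reduction strategy whatever the batch size."""
    return [sum_split(row) for row in batch]


row = [1e16, 1.0, -1e16, 1.0]
filler = [0.0, 0.0, 0.0, 0.0]

# Same row, processed alone and then inside a batch of four.
alone = adaptive_kernel([row])[0]
batched = adaptive_kernel([row, filler, filler, filler])[0]
print(alone == batched)

# The fixed-strategy kernel returns the same value for both batch sizes.
print(invariant_kernel([row])[0] == invariant_kernel([row, filler, filler, filler])[0])
```

Nothing in this toy is random or concurrent. The result for the same row changes only because the batch around it changed, which is the sense in which a kernel fails to be batch-invariant.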

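Countermeasure: Batch-Invariant Kernel Design

To suppress this nondeterminism, TML proposes and validates the following design changes and implementation strategies:

Operation (kernel)	Problem	Solution
RMSNorm	In the reduce step of the normalization, the order of operations changes depending on how batch elements are handled (whether the work is split, whether multiple cores share it). With small batches, a strategy such as split reduction may be used, which takes a different computation path.	Use a data-parallel strategy to preserve batch invariance: whatever the batch size, handle each batch element the same way and enforce an implementation that keeps the reduction order constant.
Matrix multiplication (MatMul)	Different algorithms may be selected depending on the batch dimension and sizes (M, N, K, and so on), for example not using Tensor Cores for small matrices, or using Split-K in some cases and not others. This changes parts of the operation order.	Use a single kernel configuration for all matrix shapes: fix the algorithm and tile size in advance and provide code optimized for that configuration. Performance drops, but experiments show batch invariance can be maintained at a cost of about 20%.
Attention	Besides matrix multiplication, attention involves variables beyond batch composition: sequence length, chunk handling (chunked prefill, prefix caching, and so on), and KV-cache handling. These trigger different strategies and layouts and change the computation path, especially the reduction.	- When a KV cache is used, always merge the cache and the current data into one consistent memory layout before running attention. - Use fixed-size splits for chunking and splitting strategies (Split-KV, FlashDecode, and so on) to reduce variation in how work is split across batch sizes and sequence lengths. - Use carefully designed attention kernels.

Implementation Aids: Tools and Libraries

TML has released a library called batch_invariant_ops that makes it relatively easy to switch an existing PyTorch model into batch-invariant mode.
- The library introduces batch-invariant kernels by replacing the existing kernels.
- It includes a mechanism that uses torch.Library to swap existing operations (matmul, addmm, log_softmax, and so on) for batch-invariant versions.
Experiment: in inference with vLLM, sampling the same prompt 1,000 times under the same conditions produced identical output every time in the mode that suppresses nondeterminism.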

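Trade-offs and Limitations

The approach is technically promising, but the following caveats and limitations apply:

Performance loss (throughput/latency)
Using fixed strategies, limiting optimization paths, and reducing freedom in splitting all narrow the choice of computation paths, so efficiency drops for certain input sizes and batch compositions. The reported figure is a performance loss of about 20% for operations such as MatMul.
Effects on memory use and core utilization
Especially when small batches or chunked processing are forced onto a fixed strategy, GPU cores may spend more time idle. This is a trade-off between optimal resource use and guaranteed determinism.
Difficulty of controlling the whole system
In production, many factors prevent deliberate control of batch size: fluctuating request arrival, server load, interference from other users' requests, and so on. Guaranteeing full determinism requires system design and operations that account for those external factors.
Requirements differ by use case
- For research, auditing, and regulatory uses, bit-for-bit determinism matters.
- For interactive chat and similar uses, diversity and creativity can be more valuable, and in some cases it is preferable to leave some variation in place.
- It also relates to on-policy versus off-policy consistency in RL (reinforcement learning), so designs that keep training and inference consistent are important.

Design Checklist for Practical Use

Points for engineers and researchers who serve or build LLMs to check in order to suppress nondeterminism and secure determinism:

Controllability and consistency of batch size
  - Find out whether the batch changes dynamically with load at the inference endpoint.
  - Introduce a batch-invariant mode or setting where needed.
Design and implementation of the main kernels
  - Check whether the reduction parts of RMSNorm, MatMul, and attention are batch-invariant.
  - Investigate what strategies the kernels in your libraries and frameworks (PyTorch, cuBLAS, FlashAttention, and so on) use,
    for example which strategy is chosen for small matrices, the conditions for switching Tensor Cores, and the behavior when split-K or split-KV is used.
Handling of the KV cache and chunk splitting (prefill versus decode, and so on)
  - Check whether cached content and current input are always processed in the same internal layout,
    and make sure different layouts or different mask handling (mask/full/partial) do not change the computation path.
Library and framework support
  - Check whether an auxiliary library such as TML's batch_invariant_ops, or a comparable effort, is in use.
  - Check whether the framework itself provides a deterministic mode.
Performance measurement and evaluating the compromise
  - Measure the speed and throughput loss in deterministic mode.
  - Check the differences across batch compositions and sequence lengths on real production workloads.
  - Decide how to balance output determinism against latency and cost.

Summary

TML's research is technically significant because it shows that variation in LLM responses need not be treated as an unavoidable side effect: it can be largely controlled through system and kernel design choices. From an engineering standpoint, treating batch invariance as a design parameter and building the corresponding implementation strategy into the design from the start will be key to improving the quality (reproducibility, reliability, auditability, and so on) of future LLM inference services.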

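Former OpenAI Researchers Pin Down Why AI Responses Differ Every Time

Introduction: Fluctuating AI Responses, and the Mystery Explained

Thinking Machines Lab, a startup whose team includes former OpenAI executives, has traced a long-standing mystery back to its root: why AI returns slightly different responses to the same input. The main cause, they say, is not GPU parallelism, as commonly believed, but a lack of batch invariance. Their paper, "Defeating Nondeterminism in LLM Inference," makes the case clearly.

The Conventional Hypothesis: GPUs Plus the Non-Associativity of Floating-Point Arithmetic

The most widely accepted explanation for nondeterministic AI responses has been the following hypothesis:

GPUs perform computation in parallel across many cores.
The order of operations changes with subtle factors such as when each core finishes.
Floating-point arithmetic is non-associative: (a + b) + c and a + (b + c) do not always give the same result, and the difference in order shows up as error.

Given this combination, the accepted explanation has been that the same prompt can produce subtly different output depending on the environment, in particular on differences in parallel execution and operation order inside the GPU.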
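
The non-associativity in that hypothesis is easy to observe directly. The short sketch below uses ordinary Python floats (IEEE 754 doubles); the specific values are just convenient examples.

```python
# Same three numbers, two groupings.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)
print(repr(left), repr(right))

# With very different magnitudes the effect is larger: a small term added to a
# huge one can be rounded away entirely, depending on the grouping.
absorbed = (1e16 + 1.0) + 1.0
kept = 1e16 + (1.0 + 1.0)
print(absorbed == kept)
```

Both comparisons come out unequal, even though each pair is mathematically the same sum. Any change in the order of a reduction can therefore change the low-order bits of the result.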

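Thinking Machines Lab's Finding: The Real Cause Is a Lack of Batch Invariance

Horace He and colleagues, however, point out that some phenomena cannot be explained by this hypothesis alone. For example, experiments show that repeating a matrix multiplication on a GPU many times with the same data yields deterministic results, identical at the bit level. This shows that "GPU + floating point" may contribute, but on its own it does not explain why responses always fluctuate.

Their answer is this:

Fluctuations in server-side batch size change the internal strategy of the GPU kernels used during inference. Combined with the non-associativity of floating-point arithmetic, this produces bit-level differences in the output.

This lack of batch invariance is the real culprit behind the nondeterminism.

What Is Batch Invariance?

Batch invariance refers to the following property:

Even when the size of the batch (the set of inputs processed together) changes during inference,
and regardless of how many inputs the batch contains or what other work (such as other users' requests) is being processed at the same time,
each input is always processed through the same computation strategy,
so that the same input always yields a bit-for-bit identical response.

Thinking Machines Lab points out that many current GPU kernel implementations do not satisfy batch invariance, because the internal computation steps change with many factors: whether the batch is large or small, whether the request is grouped with others, the state of the KV cache, and so on.

Toward a Solution: Redesigning Three Core Operations

To overcome nondeterminism, Horace He and colleagues propose redesigning the following core operations as batch-invariant kernels. The goal is to secure determinism while keeping the performance loss to a minimum.

Operation	Problem	Countermeasure
RMSNorm (normalization)	With small batches, strategies such as split reduction change the processing order, which causes inconsistencies in the reduction step.	Always use a data-parallel strategy (or similar) so that the same computation order and strategy apply regardless of batch size.
Matrix multiplication (MatMul)	Depending on the batch, Tensor Cores may or may not be used, or the split strategy (Split-K and so on) may differ, which changes the result.	Compile one kernel configuration in advance for all input shapes (matrix sizes) and keep using it at every batch size. Performance drops, but in experiments the degradation was only about 20%.
Attention	The processing path changes with sequence length, presence of a KV cache, chunking (prefill versus decoding), and so on; the nondeterminism comes from the resulting change in the reduction path.	Always bring the cache and the current data into a unified layout before using the KV cache. Adopt fixed-size splits so that chunking and split strategies do not depend heavily on batch composition.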

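Significance of the Finding and Future Impact

The implications of this research are substantial.

Better reproducibility and reliability: AI that returns the same output for the same input is highly valuable in research and in industry, especially in heavily regulated fields such as finance and healthcare.
Affinity with reinforcement learning (RL): with less variation in inference results, RL reward design and training stability improve, and heuristic corrections may be reduced.
Path to products: Thinking Machines Lab has indicated that it intends to build this technology into upcoming products, so the research is expected to reach practical use and not remain theory.

Remaining Challenges and Caveats

The road to determinism is not entirely rosy, however. The following trade-offs and open questions remain:

Lower performance (speed/efficiency)
  Full batch invariance may require restricting today's optimized kernel strategies or giving up familiar fast GPU optimizations. Experiments report a performance loss of about 20%.
Operational complexity
  Server load and real-time fluctuations in batch size are hard to control, and reproducing exactly the same batch composition is often impractical.
Balance with creativity
  Slight variation in AI responses can also produce diversity and creativity. Whether fully deterministic responses are always desirable depends on the use case.
Community validation and adoption
  Thinking Machines Lab's report deserves close attention, but the key questions are whether the approach can be reproduced on other LLM implementations (different hardware, different libraries) and how large the performance drop is in those settings.

Closing: Making AI as Predictable as Possible, Not Capricious

This research from Thinking Machines Lab is groundbreaking in that it treats the phenomenon of AI giving a slightly different answer each time as a technically solvable problem, not as a specification or constraint to be accepted. Amid the push for reliable, reproducible AI, it could become a major milestone.

As users and developers, we will be watching how this finding is implemented and how it reduces variation in responses.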
