LLM Council Architecture

§01

Request Flow

Every turn passes through authentication and rate limiting, then fans out to three models in parallel. The moderator merges their answers, and the synthesised response is stored in the database and returned.

flowchart TD
    A[("👤 User Request\nPOST /sessions/:id/turns")] --> B
    B["🛡 Cloudflare Worker\nAuth + Rate Limit\nHMAC · 30 req/min"] --> C
    C["⚙ Cloudflare AI Gateway\nUnified Billing\nCF_AIG_TOKEN"] --> D & E & F
    D["GPT-5-mini\nOpenAI\nmax_completion_tokens 4096 · temp 1"] --> G
    E["Gemini 3.1 Pro\nGoogle\nmax_tokens 4096 · temp 0.2"] --> G
    F["Claude Sonnet 4.6\nAnthropic\nmax_tokens 4096 · temp 0.2"] --> G
    G["🎙 Moderator\nGemini 2.5 Flash\n14-field JSON · 8192 tokens"] --> H
    H[("🗄 Supabase PostgreSQL\ncouncil_turns\nON CONFLICT DO NOTHING")] --> I
    I[("📤 JSON Response\ncomposed_answer + expert_opinions[]")]

    style A fill:#a8552e,color:#fff,stroke:#7a352b
    style B fill:#3a5a40,color:#fff,stroke:#2a4a30
    style C fill:#9c7b30,color:#fff,stroke:#7a5a20
    style D fill:#3a5a40,color:#fff,stroke:#2a4a30
    style E fill:#9c7b30,color:#fff,stroke:#7a5a20
    style F fill:#7a352b,color:#fff,stroke:#5a2520
    style G fill:#a8552e,color:#fff,stroke:#7a352b
    style H fill:#4a3a1a,color:#ede4d0,stroke:#9c7b30
    style I fill:#a8552e,color:#fff,stroke:#7a352b

Client · User Request

POST /api/v1/llm-council/sessions/:id/turns

↓

Cloudflare Worker · Auth + Rate Limit

HMAC token · RL_COUNCIL_TURN 30/min

↓

Cloudflare AI Gateway · Unified Billing

CF_AIG_TOKEN · cf-aig-authorization

↓

Expert 1 · OpenAI · GPT-5-mini

max_completion_tokens 4096 · temp 1

Expert 2 · Google · Gemini 3.1 Pro

max_tokens 4096 · temp 0.2

Expert 3 · Anthropic · Claude Sonnet 4.6

max_tokens 4096 · temp 0.2

↓

Moderator · Gemini 2.5 Flash

max_tokens 8192 · 14-field JSON

↓

Supabase PostgreSQL · council_turns

INSERT ... ON CONFLICT DO NOTHING

↓

Response to Client · JSON

composed_answer + expert_opinions[]
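The rate-limit step in this flow could look roughly like the sketch below. It assumes RL_COUNCIL_TURN is a Cloudflare Workers rate limiting binding whose limit({ key }) call resolves to { success }, and that the client IP comes from the CF-Connecting-IP header. The helper name checkTurnRateLimit and the 429 response body are hypothetical and not taken from the Worker source.

```javascript
// Sketch only: assumes env.RL_COUNCIL_TURN is a Workers rate limiting binding
// configured for 30 requests per 60 seconds.
async function checkTurnRateLimit(request, env) {
  // Assumption: the limiter is keyed by client IP.
  const ip = request.headers.get("CF-Connecting-IP") ?? "unknown";
  const { success } = await env.RL_COUNCIL_TURN.limit({ key: ip });
  if (success) return null; // under the limit: continue to the expert calls

  // Over the limit: short-circuit before any model is invoked.
  return new Response(JSON.stringify({ error: "rate_limited" }), {
    status: 429,
    headers: { "Content-Type": "application/json" },
  });
}
```

Returning null on success lets the caller write `const blocked = await checkTurnRateLimit(request, env); if (blocked) return blocked;` ahead of the expensive expert calls.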

§02

Three Expert Models

Each expert receives the same question plus the conversation history and analyses the case independently from its own specialty. All three run concurrently via Promise.all(), so total latency equals that of the slowest model rather than the sum of all three.

Expert 1 · OpenAI

GPT-5-mini

openai/gpt-5-mini

Procedural rights and due process
Statutory interpretation of the Migration Act 1958
Jurisdictional error identification
Binding precedent weight and citation
Federal Court review pathways

max_completion_tokens: 4096

temperature: 1 (required for reasoning models)

Expert 2 · Google

Gemini 3.1 Pro

google-ai-studio/gemini-3.1-pro-preview

Refugee law and international protection standards
Country information and credibility assessment
Convention ground analysis (race/religion/PSG)
Complementary protection pathways
Procedural fairness in RSD interviews

max_tokens: 4096

temperature: 0.2

Expert 3 · Anthropic

Claude Sonnet 4.6

anthropic/claude-sonnet-4-6

Visa subclass eligibility and criteria mapping
Tribunal review grounds and AAT/ART jurisdiction
Character and health requirement analysis
Merits review success factors
Representation strategy recommendations

max_tokens: 4096

temperature: 0.2

§03

Moderator Output Schema

Gemini 2.5 Flash reads all three complete expert responses and the conversation so far, then produces a strict 14-field JSON object. The moderator resolves disagreements, flags uncertainty, and cites the strongest reasoning.
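A strict contract like this is usually checked before the result is stored. The following is a minimal sketch of such a check, using the field names, types, and enum values from the moderator_output schema in this section. The function name validateModeratorOutput and the error-message format are hypothetical.

```javascript
// Hypothetical validator for the 14-field moderator_output object.
const ENUMS = {
  outcome_prediction: ["likely_success", "likely_failure", "uncertain"],
  expert_consensus: ["full", "partial", "disputed"],
  urgency_level: ["critical", "high", "medium", "low"],
};
const STRING_FIELDS = ["composed_answer", "dissenting_views", "disclaimer"];
const STRING_ARRAY_FIELDS = [
  "key_legal_issues", "risk_factors", "positive_factors", "recommended_actions",
  "relevant_visa_subclasses", "case_precedents", "follow_up_questions",
];

// Returns a list of problems; an empty list means the object matches the schema.
function validateModeratorOutput(out) {
  if (out === null || typeof out !== "object") return ["output is not an object"];
  const errors = [];
  for (const f of STRING_FIELDS) {
    if (typeof out[f] !== "string") errors.push(`${f}: expected string`);
  }
  for (const f of STRING_ARRAY_FIELDS) {
    const ok = Array.isArray(out[f]) && out[f].every((x) => typeof x === "string");
    if (!ok) errors.push(`${f}: expected string[]`);
  }
  for (const [f, allowed] of Object.entries(ENUMS)) {
    if (!allowed.includes(out[f])) errors.push(`${f}: expected ${allowed.join(" | ")}`);
  }
  const c = out.confidence_score;
  if (typeof c !== "number" || !(c >= 0 && c <= 1)) {
    errors.push("confidence_score: expected number in [0, 1]");
  }
  return errors;
}
```

Collecting every error instead of throwing on the first makes it easier to log exactly how a malformed moderator reply deviated from the schema.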

moderator_output · stored in council_turns.moderator_output (jsonb) · max 8192 tokens

"composed_answer" · string · Primary synthesised response to the user's question

"outcome_prediction" · enum string · likely_success | likely_failure | uncertain

"confidence_score" · number 0–1 · Moderator confidence in the composed answer

"key_legal_issues" · string[] · Primary legal questions raised by this case

"risk_factors" · string[] · Factors that could weaken the applicant's case

"positive_factors" · string[] · Factors that strengthen the applicant's position

"recommended_actions" · string[] · Concrete next steps for the applicant

"relevant_visa_subclasses" · string[] · Applicable visa subclasses (e.g. 866, 785, 790)

"case_precedents" · string[] · Relevant case citations from expert analyses

"expert_consensus" · enum string · full | partial | disputed

"dissenting_views" · string · Summary of expert disagreements, if any

"urgency_level" · enum string · critical | high | medium | low

"disclaimer" · string · Standard "not legal advice" disclaimer

"follow_up_questions" · string[] · Clarifying questions to gather more context

§04

Conversation History: Decision D2

buildHistoryMessages(prevTurns) injects each prior turn's moderator composed_answer as a role: "assistant" message instead of repeating the three raw expert outputs. This collapses three expert outputs into one coherent summary per turn and saves roughly 70% of tokens on multi-turn conversations.

Each expert sees the conversation as interleaved user/assistant turns, which preserves context at minimal token cost.

Turn 1 · role: user

What are my chances for a Protection visa?

Turn 1 · role: assistant (moderator composed_answer)

Based on your AATA case history, protection visa prospects depend on… [synthesised from 3 experts]

Turn 2 · role: user (current)

What about the complementary protection pathway?

Why D2 matters: if all three raw expert outputs were injected, each expert would see 3× the assistant-side history tokens per turn. D2 substitutes the single moderator summary for all expert outputs, keeping the same context at about one third of that cost.

// workers/llm-council/runner.js
function buildHistoryMessages(prevTurns) {
  const msgs = [];
  for (const turn of prevTurns) {
    msgs.push({ role: "user", content: turn.user_message });
    if (turn.moderator_output?.composed_answer) {
      msgs.push({
        role: "assistant",
        // Only moderator summary, not 3 raw outputs
        content: turn.moderator_output.composed_answer
      });
    }
  }
  return msgs;
}

// All three experts run concurrently
const historyMsgs = buildHistoryMessages(prevTurns);
const [e1, e2, e3] = await Promise.all([
  runExpert(env, EXPERT_1_MODEL, EXPERT_1_SYSTEM, historyMsgs, userMsg),
  runExpert(env, EXPERT_2_MODEL, EXPERT_2_SYSTEM, historyMsgs, userMsg),
  runExpert(env, EXPERT_3_MODEL, EXPERT_3_SYSTEM, historyMsgs, userMsg),
]);

// Then moderator synthesises e1, e2, e3
const modResult = await runModerator(
  env, e1, e2, e3, historyMsgs, userMsg
);
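The body of runExpert() is not shown, so how it combines the system prompt, history, and current message is an assumption. One plausible assembly, using a hypothetical helper named buildExpertMessages and a placeholder system prompt, is sketched here with the two-turn example from this section.

```javascript
// Hypothetical helper: the real runExpert() internals are not part of this page.
function buildExpertMessages(systemPrompt, historyMsgs, userMsg) {
  return [
    { role: "system", content: systemPrompt },
    ...historyMsgs,                     // alternating user / assistant turns
    { role: "user", content: userMsg }, // the current turn always goes last
  ];
}

const history = [
  { role: "user", content: "What are my chances for a Protection visa?" },
  { role: "assistant", content: "Based on your AATA case history, ..." },
];

const messages = buildExpertMessages(
  "You are a refugee-law expert.", // placeholder system prompt
  history,
  "What about the complementary protection pathway?"
);

console.log(messages.map((m) => m.role).join(" > "));
// prints "system > user > assistant > user"
```

Because the history array is shared and only spread into a new array, the same historyMsgs value can be passed to all three experts without copying.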

§05

API Endpoints

Six Worker-native endpoints. Session-bound routes require an X-Session-Token header, which is verified with a single HMAC check and needs no database lookup.

POST /api/v1/llm-council/sessions

Create a new Council session. Returns session_id + HMAC token. Accepts optional case_id to pre-load case context.

Auth: JWT Bearer (Telegram login)

POST /api/v1/llm-council/sessions/:id/turns

Add a user turn. Triggers the full 3-expert + moderator pipeline. Rate-limited via RL_COUNCIL_TURN (30 req/min per IP).

Auth: X-Session-Token

GET /api/v1/llm-council/sessions/:id

Retrieve a full session with all turns, including the three expert opinions and the moderator output for each turn.

Auth: X-Session-Token

GET /api/v1/llm-council/sessions

List all sessions for the authenticated user. Returns session metadata: id, created_at, turn count, last message snippet.

Auth: JWT Bearer

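DELETE /api/v1/llm-council/sessions/:id

Delete a session. The session's council_turns rows are deleted with it (described here as a Supabase RLS cascade policy; in PostgreSQL the cascade itself normally comes from an ON DELETE CASCADE foreign key, with RLS governing who may delete).

Auth: X-Session-Token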

POST /api/v1/llm-council/run (Legacy)

Stateless single-turn analysis. No session storage. Kept for backward compatibility with the older frontend; new code should use the sessions API.

Auth: JWT Bearer

§06

Design Decisions

Six architectural choices shaped the Council. Each arose from a concrete constraint (latency, token cost, billing complexity, or security) and was settled by weighing trade-offs rather than by guesswork.

D1 · Parallelism

Three experts run concurrently

All three expert calls are wrapped in a single Promise.all(), so total latency equals the slowest single model, not the sum of all three. Run sequentially, a turn would take roughly 3× longer; GPT-5-mini alone can take 8–12 s on a long prompt.

Outcome: p95 turn latency ~15 s vs ~40 s sequential
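The latency argument can be reproduced with timers standing in for model calls. In this sketch the delays are made-up values and fakeExpert is a hypothetical stand-in, so the printed numbers only illustrate the max-versus-sum behaviour and are not the measurements quoted above.

```javascript
// Stand-in for a model call: resolves after `ms` milliseconds.
const fakeExpert = (name, ms) =>
  new Promise((resolve) => setTimeout(() => resolve({ name, ms }), ms));

// Measures how long an async function takes, in milliseconds.
async function timeIt(fn) {
  const start = Date.now();
  await fn();
  return Date.now() - start;
}

async function compare() {
  const delays = [["e1", 120], ["e2", 80], ["e3", 100]]; // illustrative latencies

  // Concurrent: all timers start together, so we wait for the slowest (~120 ms).
  const parallel = await timeIt(() =>
    Promise.all(delays.map(([name, ms]) => fakeExpert(name, ms)))
  );

  // Sequential: each call starts after the previous one ends (~300 ms total).
  const sequential = await timeIt(async () => {
    for (const [name, ms] of delays) await fakeExpert(name, ms);
  });

  return { parallel, sequential };
}

compare().then(({ parallel, sequential }) =>
  console.log(`parallel ~${parallel} ms, sequential ~${sequential} ms`)
);
```

One design consequence worth keeping in mind: Promise.all() rejects as soon as any input promise rejects, so a single failed expert call fails the whole turn unless each call handles its own errors.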

D2 · History Injection

Only the moderator's answer goes into history

Instead of feeding all three raw expert outputs into each expert's next-turn context, only the moderator's composed_answer is injected as the role: "assistant" turn, compressing three outputs into one summary.

Token savings: ~70% on multi-turn conversations
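The savings figure can be sanity-checked with rough arithmetic. The token sizes below are illustrative assumptions, not measurements from the Council; the model simply compares injecting three expert outputs per prior turn against injecting one moderator summary, with the user message kept in both cases.

```javascript
// Rough history-size model; every number passed in is illustrative.
function historyTokens(turns, { user, expert, moderator }) {
  return {
    rawExperts: turns * (user + 3 * expert),   // inject all three expert outputs
    moderatorOnly: turns * (user + moderator), // D2: inject composed_answer only
  };
}

// Fraction of history tokens saved by D2 relative to raw injection.
function savings(turns, sizes) {
  const { rawExperts, moderatorOnly } = historyTokens(turns, sizes);
  return 1 - moderatorOnly / rawExperts;
}

// Example: 5 prior turns, 100-token questions, 1000-token expert answers,
// and a 900-token moderator summary.
const pct = savings(5, { user: 100, expert: 1000, moderator: 900 });
console.log(`${(pct * 100).toFixed(1)}% fewer history tokens`);
// prints "67.7% fewer history tokens"
```

Under this model the ratio does not depend on the number of turns. It moves with the relative sizes of the moderator summary and the expert answers, so a shorter summary pushes the saving above the figure in this example.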

D3 · Unified Billing

All models via Cloudflare AI Gateway with a single token

A single CF_AIG_TOKEN authenticates to OpenAI, Anthropic, and Google AI Studio via the compat endpoint. Credits and billing are tracked in one dashboard, and no per-provider API keys are stored as Worker secrets.

Provider prefix: openai/ · anthropic/ · google-ai-studio/
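A request through the gateway might be built as in the sketch below. The URL shape is an assumption based on the AI Gateway OpenAI-compatible endpoint, and ACCOUNT_ID and GATEWAY_ID are hypothetical environment names. Only CF_AIG_TOKEN, the cf-aig-authorization header, and the provider-prefixed model names come from this page. The function builds the request without sending it.

```javascript
// Sketch: builds (but does not send) a chat request for the gateway's
// OpenAI-compatible endpoint. ACCOUNT_ID / GATEWAY_ID are placeholder names.
function buildGatewayRequest(env, model, messages, params = {}) {
  const url =
    `https://gateway.ai.cloudflare.com/v1/${env.ACCOUNT_ID}/${env.GATEWAY_ID}` +
    "/compat/chat/completions"; // assumed URL shape
  return {
    url,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // One gateway token instead of per-provider API keys.
        "cf-aig-authorization": `Bearer ${env.CF_AIG_TOKEN}`,
      },
      // The provider is selected by the model prefix, e.g. "anthropic/...".
      body: JSON.stringify({ model, messages, ...params }),
    },
  };
}
```

A caller would then run `const { url, init } = buildGatewayRequest(env, "anthropic/claude-sonnet-4-6", messages, { max_tokens: 4096, temperature: 0.2 })` followed by `fetch(url, init)`.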

D4 · GPT-5 Reasoning Model

Special-cased params for GPT-5 / o-series reasoning models

isGpt5ReasoningModel() checks the model name. On a match, the request uses max_completion_tokens instead of max_tokens and forces temperature: 1. OpenAI's o-series and gpt-5 models reject any other temperature with an HTTP 400 error.

Detection: model name contains "gpt-5", "o1", "o3", or "o4"
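The detection rule and parameter switch described in D4 could be written as follows. The body of isGpt5ReasoningModel() is a guess based on the stated substring rule, and buildSamplingParams is a hypothetical helper name.

```javascript
const REASONING_TAGS = ["gpt-5", "o1", "o3", "o4"];

// Assumed implementation of the substring rule stated in D4.
function isGpt5ReasoningModel(model) {
  const name = model.toLowerCase();
  return REASONING_TAGS.some((tag) => name.includes(tag));
}

// Hypothetical helper: picks the token-limit key and temperature per model.
function buildSamplingParams(model, maxTokens = 4096, temperature = 0.2) {
  return isGpt5ReasoningModel(model)
    ? { max_completion_tokens: maxTokens, temperature: 1 } // temperature forced to 1
    : { max_tokens: maxTokens, temperature };
}

console.log(buildSamplingParams("openai/gpt-5-mini"));
// { max_completion_tokens: 4096, temperature: 1 }
console.log(buildSamplingParams("anthropic/claude-sonnet-4-6"));
// { max_tokens: 4096, temperature: 0.2 }
```

A plain substring check is simple but loose: any future model name that happens to contain "o1", "o3", or "o4" would also match, so a stricter pattern may be worth considering if the model list grows.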

D5 · Storage Idempotency

ON CONFLICT DO NOTHING on turn inserts makes retries safe

addTurn() assigns the client-visible turn ID (nanoid21) before the expensive LLM calls, then writes with INSERT ... ON CONFLICT DO NOTHING on the council_turns primary key. If the Worker times out and retries, the duplicate key is ignored, so no duplicate rows are created.

ID pre-assigned before LLM calls: retry-safe by design
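The retry behaviour can be illustrated without a database. In the sketch below an in-memory Map stands in for council_turns and mimics ON CONFLICT DO NOTHING. The TurnStore class, the addTurn signature, and the way the turn ID is passed in by the caller are all assumptions, since the page does not show how the ID is carried across retries.

```javascript
// In-memory stand-in for council_turns, keyed by primary key.
// Mirrors: INSERT INTO council_turns (...) VALUES (...) ON CONFLICT DO NOTHING
class TurnStore {
  constructor() {
    this.rows = new Map();
  }
  insertIfAbsent(row) {
    if (this.rows.has(row.id)) return false; // conflict: keep the first row
    this.rows.set(row.id, row);
    return true;
  }
}

// Hypothetical turn handler: the ID is fixed before the expensive work,
// so a retry that reuses the same ID cannot add a second row.
async function addTurn(store, turnId, userMessage, runCouncil) {
  const moderatorOutput = await runCouncil(userMessage);
  const inserted = store.insertIfAbsent({
    id: turnId,
    user_message: userMessage,
    moderator_output: moderatorOutput,
  });
  return { turnId, inserted };
}
```

Calling addTurn twice with the same turnId leaves one row in the store, and the second call reports inserted: false, which is the property that makes Worker retries harmless.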

D6 · Session Authentication

HMAC token, not JWT, for turn-level auth

Session tokens are nanoid(21) + "." + HMAC-SHA256(id, secret). Verification is one HMAC check with no Supabase lookup. The user's JWT is only required to create a session; subsequent turns use the cheaper session token.

Verification: 1 HMAC · no round-trip to Supabase
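A token of this shape can be produced and checked with Web Crypto, which Cloudflare Workers expose as crypto.subtle. The sketch below assumes hex encoding for the MAC and assumes the token's ID part must equal the session ID in the route; neither detail is stated on this page, and the function names are hypothetical.

```javascript
// Web Crypto is global in Workers and recent Node.js; the require() fallback
// only exists so the sketch also runs on older Node versions.
const subtle = globalThis.crypto?.subtle ?? require("node:crypto").webcrypto.subtle;
const enc = new TextEncoder();

const toHex = (buf) =>
  [...new Uint8Array(buf)].map((b) => b.toString(16).padStart(2, "0")).join("");
const fromHex = (hex) =>
  new Uint8Array((hex.match(/../g) ?? []).map((h) => parseInt(h, 16)));

async function hmacKey(secret) {
  return subtle.importKey(
    "raw", enc.encode(secret), { name: "HMAC", hash: "SHA-256" }, false,
    ["sign", "verify"]
  );
}

// token = id + "." + hex(HMAC-SHA256(id, secret)); hex encoding is an assumption.
async function signSessionToken(sessionId, secret) {
  const mac = await subtle.sign("HMAC", await hmacKey(secret), enc.encode(sessionId));
  return `${sessionId}.${toHex(mac)}`;
}

async function verifySessionToken(token, sessionId, secret) {
  const dot = token.lastIndexOf(".");
  if (dot < 0) return false;
  const id = token.slice(0, dot);
  const macHex = token.slice(dot + 1);
  // Reject tokens for another session or with a malformed MAC.
  if (id !== sessionId || !/^[0-9a-f]{64}$/.test(macHex)) return false;
  // subtle.verify recomputes the MAC and performs the comparison itself.
  return subtle.verify("HMAC", await hmacKey(secret), fromHex(macHex), enc.encode(id));
}
```

Verification touches only the secret and the token itself, which is why no database round-trip is needed. The trade-off is that such a token cannot be revoked individually without rotating the secret or adding a lookup.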
