Three expert LLMs analyse immigration cases in parallel, and a moderator synthesises their findings into a 14-field structured verdict JSON.
Every turn passes through authentication and rate limiting, then fans out to three parallel model calls. The moderator synthesises the results, which are stored in the database and returned.
flowchart TD
A[("👤 User Request\nPOST /sessions/:id/turns")] --> B
B["🛡 Cloudflare Worker\nAuth + Rate Limit\nHMAC · 30 req/min"] --> C
C["⚙ Cloudflare AI Gateway\nUnified Billing\nCF_AIG_TOKEN"] --> D & E & F
D["GPT-5-mini\nOpenAI\nmax_completion_tokens 4096 · temp 1"] --> G
E["Gemini 3.1 Pro\nGoogle\nmax_tokens 4096 · temp 0.2"] --> G
F["Claude Sonnet 4.6\nAnthropic\nmax_tokens 4096 · temp 0.2"] --> G
G["🎙 Moderator\nGemini 2.5 Flash\n14-field JSON · 8192 tokens"] --> H
H[("🗄 Supabase PostgreSQL\ncouncil_turns\nON CONFLICT DO NOTHING")] --> I
I[("📤 JSON Response\ncomposed_answer + expert_opinions[]")]
style A fill:#a8552e,color:#fff,stroke:#7a352b
style B fill:#3a5a40,color:#fff,stroke:#2a4a30
style C fill:#9c7b30,color:#fff,stroke:#7a5a20
style D fill:#3a5a40,color:#fff,stroke:#2a4a30
style E fill:#9c7b30,color:#fff,stroke:#7a5a20
style F fill:#7a352b,color:#fff,stroke:#5a2520
style G fill:#a8552e,color:#fff,stroke:#7a352b
style H fill:#4a3a1a,color:#ede4d0,stroke:#9c7b30
style I fill:#a8552e,color:#fff,stroke:#7a352b
Each expert receives the same question and conversation history and analyses the case independently from its own area of expertise. All three run concurrently via Promise.all(), so total latency equals that of the slowest model rather than the sum of all three.
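The fan-out described above can be sketched as follows. This is a minimal illustration, not the Council's actual code: `Expert`, `fakeExpert`, and `runExperts` are hypothetical names, and the fake experts simply sleep instead of calling a model.

```typescript
type Expert = { name: string; call: (question: string) => Promise<string> };
type ExpertOpinion = { expert: string; text: string };

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for a real model call: waits `ms`, then answers.
function fakeExpert(name: string, ms: number): Expert {
  return {
    name,
    call: async (question) => {
      await sleep(ms);
      return `${name} on: ${question}`;
    },
  };
}

// Starts every expert call immediately and waits for all of them.
// The result array keeps the order of `experts`, regardless of which finishes first.
async function runExperts(experts: Expert[], question: string): Promise<ExpertOpinion[]> {
  return Promise.all(
    experts.map(async (e) => ({ expert: e.name, text: await e.call(question) })),
  );
}
```

Note that Promise.all() rejects as soon as any one call rejects. Whether the Council tolerates a missing expert is not stated here; Promise.allSettled() would be the usual alternative if it should.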
Gemini 2.5 Flash reads all three expert responses and the conversation so far, then produces a strict 14-field JSON. The moderator resolves disagreements, flags uncertainty, and cites the strongest reasoning.
buildHistoryMessages(prevTurns) injects each prior moderator composed_answer as a role: "assistant" turn instead of repeating the three raw expert outputs. This collapses three expert outputs into one coherent summary per turn, cutting token cost on multi-turn conversations by roughly 70%.
Each expert sees the conversation as interleaved user/assistant turns, which preserves context at minimal token cost.
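A minimal sketch of what such a history builder might look like. Only the function name, `prevTurns`, `composed_answer`, and the `role: "assistant"` convention come from the text above; the `PrevTurn` shape and the `user_message` field are assumptions made for illustration.

```typescript
type ChatMessage = { role: "user" | "assistant"; content: string };

// Hypothetical shape of a stored turn; only composed_answer is named in the text.
type PrevTurn = { user_message: string; composed_answer: string };

// Emits one user message and one assistant message per prior turn,
// so the raw expert outputs never re-enter the context window.
function buildHistoryMessages(prevTurns: PrevTurn[]): ChatMessage[] {
  return prevTurns.flatMap((turn): ChatMessage[] => [
    { role: "user", content: turn.user_message },
    { role: "assistant", content: turn.composed_answer },
  ]);
}
```

With this shape, a session with N prior turns contributes 2N messages to each expert's prompt, rather than the 4N it would carry if every expert output were replayed.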
There are six Worker-native endpoints. Session-bound routes require an X-Session-Token header, which is verified with a single HMAC computation and no database lookup.
Create a new Council session. Returns session_id and an HMAC token. Accepts an optional case_id to pre-load case context.
Add a user turn. Triggers the full three-expert plus moderator pipeline. Rate-limited via RL_COUNCIL_TURN (30 requests per minute per IP).
Delete a session. The delete cascades to the session's council_turns rows via an ON DELETE CASCADE foreign key (row-level security policies do not cascade deletes).
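The per-IP limit on the add-turn endpoint could be enforced along these lines. This sketch assumes RL_COUNCIL_TURN is a Cloudflare Workers Rate Limiting binding, whose `limit({ key })` call resolves to `{ success }`. The `enforceTurnRateLimit` helper, the local `RateLimitBinding` interface, and the error body are illustrative, and the 30 requests per minute threshold would live in the Worker's configuration rather than in code.

```typescript
// Local description of the binding's surface, declared here so the sketch is self-contained.
interface RateLimitBinding {
  limit(options: { key: string }): Promise<{ success: boolean }>;
}

interface Env {
  RL_COUNCIL_TURN: RateLimitBinding;
}

// Returns null when the request may proceed, or a 429 response when the caller is over the limit.
async function enforceTurnRateLimit(request: Request, env: Env): Promise<Response | null> {
  // Cloudflare sets CF-Connecting-IP on incoming requests; fall back to a shared bucket otherwise.
  const ip = request.headers.get("CF-Connecting-IP") ?? "unknown";
  const { success } = await env.RL_COUNCIL_TURN.limit({ key: ip });
  if (success) return null;
  return new Response(JSON.stringify({ error: "rate_limited" }), {
    status: 429,
    headers: { "Content-Type": "application/json" },
  });
}
```

Running this check before any model call means a rejected request costs no tokens.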
Six architectural choices shaped the Council. Each arose from a concrete constraint (latency, token cost, billing complexity, or security) and is the settled outcome of weighing trade-offs, not guesswork.
All three expert calls are wrapped in a single Promise.all(). Total latency equals the slowest single model, not the sum of all three. Run sequentially, a turn would take roughly 3× longer, and a single GPT-5-mini call on a long prompt can take 8–12 seconds by itself.
Instead of feeding all three expert outputs into each expert's next-turn context, only the moderator's composed_answer is injected as the role: "assistant" turn, compressing three outputs into one summary.
A single CF_AIG_TOKEN authenticates to OpenAI, Anthropic, and Google AI Studio via the compat endpoint. Billing is tracked in one dashboard, and no per-provider API keys are stored as Worker secrets.
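A request through the gateway might be built roughly like this. Treat the details as assumptions to check against Cloudflare's AI Gateway documentation: the URL shape, the `cf-aig-authorization` header, and the `provider/model` naming are what the OpenAI-compatible endpoint is understood to use, while `buildGatewayRequest` and the account and gateway identifiers are purely illustrative.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Builds (but does not send) a chat-completions request for the gateway's compat endpoint.
// accountId and gatewayId are placeholders; token would be the CF_AIG_TOKEN secret.
function buildGatewayRequest(
  accountId: string,
  gatewayId: string,
  token: string,
  model: string, // assumed "provider/model" form, e.g. "openai/gpt-5-mini"
  messages: ChatMessage[],
): Request {
  const url = `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/compat/chat/completions`;
  return new Request(url, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Assumed gateway auth header; no per-provider key appears anywhere in the request.
      "cf-aig-authorization": `Bearer ${token}`,
    },
    body: JSON.stringify({ model, messages }),
  });
}
```

Because only the `model` string changes between providers, the same builder can serve all three experts and the moderator.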
isGpt5ReasoningModel() checks the model name. If it matches, the request uses max_completion_tokens instead of max_tokens and forces temperature: 1. OpenAI's o-series and gpt-5 reasoning models reject any other temperature with an HTTP 400 error.
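A sketch of how that branch could look. The function name comes from the text above. The matching rule (a `gpt-5` or o-series prefix, with an optional `openai/` provider prefix) and the `buildSamplingParams` helper are assumptions, since the actual name check is not shown.

```typescript
type SamplingParams =
  | { max_completion_tokens: number; temperature: 1 }
  | { max_tokens: number; temperature: number };

// Assumed rule: names starting with "gpt-5" or an o-series prefix such as "o1", "o3", "o4-mini".
function isGpt5ReasoningModel(model: string): boolean {
  const bare = model.replace(/^openai\//i, "");
  return /^(gpt-5|o\d)/i.test(bare);
}

// Picks the token-limit field and temperature that the target model accepts.
function buildSamplingParams(model: string, maxTokens: number, temperature: number): SamplingParams {
  if (isGpt5ReasoningModel(model)) {
    // Reasoning models: the requested temperature is ignored and pinned to 1.
    return { max_completion_tokens: maxTokens, temperature: 1 };
  }
  return { max_tokens: maxTokens, temperature };
}
```

Keeping the check in one helper means the expert configuration can still declare a uniform temperature such as 0.2 without triggering a 400 from OpenAI.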
addTurn() assigns the client-visible turn ID (nanoid21) before the expensive LLM calls, then writes the row with INSERT ... ON CONFLICT DO NOTHING on the council_turns primary key. When a Worker times out and retries, the repeated primary key cannot create a duplicate row.
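The idempotency argument can be illustrated with an in-memory stand-in for the table. Everything here except the `addTurn` and `council_turns` names is hypothetical: the `TurnStore` class, the column names, and the `runPipeline` callback exist only to show why reusing a pre-assigned ID makes retries harmless.

```typescript
type TurnRow = { id: string; session_id: string; composed_answer: string };

// In-memory stand-in for council_turns. The SQL it imitates would be along the lines of:
//   INSERT INTO council_turns (id, session_id, composed_answer)
//   VALUES ($1, $2, $3) ON CONFLICT (id) DO NOTHING
class TurnStore {
  private rows = new Map<string, TurnRow>();

  // Returns true only when a new row was written; a conflicting id is silently ignored.
  insertIfAbsent(row: TurnRow): boolean {
    if (this.rows.has(row.id)) return false;
    this.rows.set(row.id, row);
    return true;
  }

  get size(): number {
    return this.rows.size;
  }
}

// turnId is allocated by the caller before the pipeline runs and reused on every retry.
async function addTurn(
  store: TurnStore,
  sessionId: string,
  turnId: string,
  runPipeline: () => Promise<string>,
): Promise<boolean> {
  const composedAnswer = await runPipeline();
  return store.insertIfAbsent({ id: turnId, session_id: sessionId, composed_answer: composedAnswer });
}
```

A retry still repeats the model calls in this sketch; the guarantee is only that the table ends up with a single row per turn ID.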
Session tokens are nanoid(21) + "." + HMAC-SHA256(id, secret). Verification is a single HMAC computation with no Supabase lookup. The user's JWT is only required to create a session; subsequent turns use the cheaper session token.
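A minimal sketch of issuing and verifying such a token with the Web Crypto API, which Workers expose as the global `crypto`. The helper names are hypothetical, `randomId` is a stand-in for nanoid(21) rather than the real library, and hex encoding of the signature is an assumption, since the token's encoding is not specified above.

```typescript
const encoder = new TextEncoder();

// Stand-in for nanoid(21): 21 URL-safe characters from a CSPRNG (never contains ".").
function randomId(length = 21): string {
  const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_-";
  const bytes = crypto.getRandomValues(new Uint8Array(length));
  return Array.from(bytes, (b) => alphabet[b & 63]).join("");
}

function hmacKey(secret: string): Promise<CryptoKey> {
  return crypto.subtle.importKey(
    "raw", encoder.encode(secret), { name: "HMAC", hash: "SHA-256" }, false, ["sign", "verify"],
  );
}

const toHex = (buf: ArrayBuffer): string =>
  Array.from(new Uint8Array(buf), (b) => b.toString(16).padStart(2, "0")).join("");

function fromHex(hex: string): Uint8Array | null {
  if (hex.length % 2 !== 0 || /[^0-9a-f]/i.test(hex)) return null;
  const out = new Uint8Array(hex.length / 2);
  for (let i = 0; i < out.length; i++) out[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  return out;
}

// Issues "<id>.<hex HMAC-SHA256 of id>".
async function createSessionToken(secret: string): Promise<{ sessionId: string; token: string }> {
  const sessionId = randomId();
  const sig = await crypto.subtle.sign("HMAC", await hmacKey(secret), encoder.encode(sessionId));
  return { sessionId, token: `${sessionId}.${toHex(sig)}` };
}

// Returns the session id when the signature checks out, otherwise null. No database access.
async function verifySessionToken(token: string, secret: string): Promise<string | null> {
  const dot = token.indexOf(".");
  if (dot <= 0) return null;
  const sessionId = token.slice(0, dot);
  const sig = fromHex(token.slice(dot + 1));
  if (!sig) return null;
  // subtle.verify recomputes the MAC and compares it without a hand-rolled string comparison.
  const ok = await crypto.subtle.verify("HMAC", await hmacKey(secret), sig, encoder.encode(sessionId));
  return ok ? sessionId : null;
}
```

Since the token is self-validating, a handler still has to confirm that the verified ID matches the `:id` in the route before acting on it.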