TTFT Optimization
TTFT Optimization Design
Goal: Reduce perceived TTFT from >9s to ~1s by streaming thinking tokens (first thinking token arrives at ~1.1s with full prompt). Separately, reduce total generation latency (first text token) to < 5s.
Status: Thinking streaming implemented via AcaClaw gateway patch (
patches/openclaw-thinking-stream/apply.mjs). Gateway now forwardsreasoning_contentdeltas astype: "thinking"content blocks. Perceived TTFT is now the first thinking token (~533ms direct, ~3.2s gateway).Verified by
node scripts/test-ttft.mjs(Direct API vs Gateway comparison in thinking mode).
Table of Contents
- Problem Statement
- Current Measurements
- Thinking Mode Impact
- Root Cause Analysis
- System Prompt Breakdown
- Gateway Overhead Decomposition
- Optimization Strategies
- Implementation Plan
- Success Criteria
Problem Statement
Users experience >9 seconds of blank waiting after sending a chat message before seeing any response. The DashScope API actually returns the first thinking token in ~1.1s (with full 15K prompt), but OpenClaw does not stream thinking deltas — so the user sees nothing until text generation begins, 3–17s after thinking starts.
Two separate problems:
- Thinking tokens not streamed — the #1 issue. The model’s reasoning output is discarded during streaming, making the user wait for the full thinking phase to complete.
- High total generation latency — 15,300-token system prompt adds ~6.2s of prefill time on every turn, compounded by non-functional prompt caching.
Current test results (qwen3.5-plus via DashScope, verified 2026-04-17):
| Test | First thinking token | First text token | Status |
|---|---|---|---|
| Raw API (9 tokens, no thinking) | N/A | 522–726 ms | PASS |
| Raw API (9 tokens, thinking) | 0.50–0.69s | 2.87–4.46s | First thinking: PASS |
| Raw API (~15K tokens) | 0.60–0.69s | 0.60–0.68s (no thinking) / 2.28–17.74s (thinking) | First thinking: PASS |
| Gateway chat (~15K tokens) | N/A (not streamed) | 8,990 ms (median, 34 sessions) | FAIL |
Thinking-mode comparison (verified 2026-04-18, node scripts/test-ttft.mjs --runs=5)
Direct API vs Gateway, both with enable_thinking=true, same prompt (“25+36=? just answer”):
| Metric | Direct API (median) | Gateway (median) | Overhead |
|---|---|---|---|
| TTFT (thinking) | 533ms | 3.23s | +2.70s (506%) |
| TTFT (text) | 4.59s | 3.97s | −617ms |
| Total | 4.60s | 4.04s | −566ms |
After Phase 1 prompt trimming (22 tools denied, verified 2026-04-18)
| Metric | Direct API (median) | Gateway (median) | Overhead |
|---|---|---|---|
| TTFT (thinking) | 546ms | 2.61s | +2.06s (378%) |
| TTFT (text) | 4.56s | 3.65s | −915ms |
| Total | 4.72s | 3.70s | −1.01s |
| Path | Thinking range | Text range | Total range |
|---|---|---|---|
| Direct API | 459ms–774ms | 4.49s–4.70s | 4.52s–4.73s |
| Gateway | 2.27s–5.66s | 3.17s–6.56s | 3.23s–6.62s |
Phase 1 impact: Denying 15 additional tools (22 total) saved 620ms on thinking TTFT (3.23s → 2.61s, −19%). Gateway overhead dropped from 2.70s to 2.06s.
Key finding: The raw DashScope API delivers the first thinking token in ~1.1s with 9,200 input tokens (0.55s with minimal prompt). Even with 15K tokens through the gateway, first thinking would arrive in ~1.4s — well within acceptable TTFT. The problem is entirely in OpenClaw’s WebSocket layer not forwarding these tokens.
Key finding: DashScope prompt caching is NOT WORKING — all sessions show 0 cached tokens.
Current Measurements
Token Usage (first message in fresh session)
| Token Type | Count | Source |
|---|---|---|
| Input tokens | 14,997–15,433 | System prompt + user message |
| Cached tokens | 0 (always) | DashScope caching NOT WORKING |
| Output tokens | 115–665 | Model response |
Measured across 34 sessions. Input tokens are remarkably stable (~15,300 ± 200) because the system prompt is static.
Latency Pipeline
Total TTFT: median ~8,990 ms (range 3,670–18,030 ms, 34 sessions)
├── Gateway overhead: ~260 ms (3%) ← measured by profiler
│ ├── WS connect + message_received: ~33 ms
│ ├── Session creation + before_agent_reply: ~75 ms
│ ├── Model resolve + agent start: ~90 ms
│ └── Prompt build + llm_input: ~62 ms
│
├── DashScope API (9,200 tokens, no tools, thinking enabled):
│ ├── First thinking token: ~1,100 ms
│ └── First text token: ~2,300 ms (gap = ~1,200 ms thinking)
│
├── Tool schema overhead: ~6,200 extra tokens
│ └── Additional prefill: ~290 ms (est., 0.047 ms/token)
│
└── *** Variable thinking: ~2,450 ms (est.) *** ← unaccounted
└── Model thinks longer with full tool context vs raw API
└── Highly variable: 1–17s depending on query complexity
└── NOT streamed to user — appears as blank wait time
TTFT Decomposition (measured via profiler + isolation test)
| Component | Median | Range | % of total |
|---|---|---|---|
| Gateway overhead | 260 ms | 250–330 ms | 3% |
| DashScope first thinking token | 1,100 ms | 768–1,368 ms | 12% |
| Think → text gap (thinking gen) | 1,200 ms | 1,150–1,620 ms | 13% |
| Tool schema processing (est.) | 290 ms | — | 3% |
| Unaccounted (extra thinking + variance) | 6,140 ms | — | 68% |
| Total (first TEXT token) | 8,990 ms | 3,670–18,030 ms |
However, thinking tokens ARE output — if they were streamed, the user would see the first thinking token in ~1.1s. See Thinking Mode Impact.
Key correction: Earlier analysis attributed ~3s to “gateway overhead.” Profiling revealed the gateway takes only ~260ms. The 3s difference between raw API and gateway was caused by (1) tool schemas adding ~6,200 extra input tokens, and (2) the model thinking longer with the full tool context.
Note: The 160% variance in gateway TTFT at the same token count (3.67s–18.03s across 34 sessions) is primarily caused by variable-length thinking and DashScope server-side instability, not gateway processing.
Gateway Overhead Profiling
Key finding: Gateway overhead is only ~260ms — verified by gateway profiler (removed; data preserved below). All hooks from
chat.sendtollm_inputcomplete in 260ms. The remaining 5+ seconds is DashScope API time (prefill + thinking).
Profiler timeline (3 runs, consistent):
+0ms chat.send (WebSocket)
+33ms message_received, before_dispatch
+107ms session.start, before_agent_reply
+200ms before_model_resolve, before_agent_start
+260ms before_prompt_build, llm_input (API request sent)
──── 5,200ms gap ──── (DashScope processing)
+5,300ms FIRST WS DELTA (first text token received)
| Gateway step | Time | Notes |
|---|---|---|
| WS → message dispatch | 33ms | WebSocket + routing |
| Session creation | 75ms | JSONL write, session lookup |
| Model resolve + agent start | 90ms | Config resolution |
| Prompt build + API call | 62ms | SOUL.md, workspace, skills, tool schemas |
| Total gateway | ~260ms |
The 5.2s gap is entirely DashScope API (prefill of ~15,400 tokens + thinking generation). Gateway adds only 3% to total TTFT.
Implication: There is no significant optimization opportunity in gateway code. The two priorities are:
- Stream thinking tokens — perceived TTFT drops from ~5s to ~1.1s (first thinking token)
- Reduce input tokens — fewer tokens = faster DashScope processing (but only ~0.047ms/token)
Thinking Mode Impact
Key finding: Thinking tokens are valid output and should count toward TTFT. The DashScope API returns the first thinking token in ~0.5–0.7s even with a 15K-token prompt — excellent latency. The problem is that OpenClaw does not stream thinking deltas via WebSocket, so the user sees nothing until the first text token arrives 3–17s later.
The problem
qwen3.5-plus is a reasoning model that generates “thinking” tokens before producing text. With enable_thinking=true (the default), the DashScope API streams two content types:
reasoning_content— thinking tokens (first one arrives in ~0.5–0.7s)content— text tokens (arrives after thinking finishes, 2–17s later)
The DashScope API does stream thinking tokens — they arrive at the same low latency as text tokens (~0.5s TTFT). However, OpenClaw silently discards them and does not forward thinking deltas via WebSocket. The gateway only emits chat event deltas with type: "text" content, never type: "thinking".
This means AcaClaw’s thinking UI (collapsible block) only shows thinking content from the final message, not during streaming. The user waits 3–17s seeing nothing.
AcaClaw UI fix (implemented)
Two bugs were fixed in ui/src/views/chat.ts:
- Field name mismatch: The session stores thinking content in
{type: "thinking", thinking: "..."}but the UI was readingc.text(always empty). Fixed to readc.thinking ?? c.text. - History reload on completion: After a
finalchat event, the UI now reloads history viachat.historyto capture thinking content from the session. This makes thinking visible immediately after the response completes.
Result: Thinking content now displays correctly in the collapsible block after the response finishes.
AcaClaw gateway patch (implemented)
Instead of waiting for an upstream OpenClaw fix, AcaClaw patches the gateway directly:
- Patch:
patches/openclaw-thinking-stream/apply.mjs— addsonReasoningStreamcallback to the webchat reply handler ingateway-cli-CWpalJNJ.js - Mechanism: The patch intercepts
reasoning_contentdeltas from the LLM provider and broadcasts them as{type: "thinking"}content blocks via the WebSocketchatevent - Throttle: 100ms debounce to avoid flooding the WebSocket
- Config:
reasoningDefault: "stream"set on all 6 agents inconfig/openclaw-defaults.json - UI:
<details>auto-opens with?openduring streaming, collapses when done
Result: Full real-time thinking streaming. First thinking token visible in the UI within ~3.2s via gateway (vs ~533ms direct API). The 2.7s gap is gateway overhead (system prompt, tool schemas, WebSocket routing).
Measured impact (raw DashScope API, historical data)
| Metric | enable_thinking=false |
enable_thinking=true |
|---|---|---|
| First ANY token (TTFT if thinking is streamed) | 522–726ms | 0.49–0.69s |
| First TEXT token (TTFT if thinking is hidden) | 522–726ms | 2.28–17.74s |
| Thinking chars generated | 0 | 158–1,148 |
| Total response time (minimal prompt) | 0.75–1.37s | 3.03–4.64s |
| Total response time (15K prompt) | 2.18–3.56s | 4.18–19.13s |
Key insight: The first thinking token arrives in 0.49–0.69s — almost identical to the non-thinking TTFT. If OpenClaw streamed these tokens, users would see visible output (the thinking process) within 0.5–0.7s regardless of prompt size. The thinking output also helps users understand the model’s reasoning.
The 160% TTFT variance (3.67–18.03s across 34 sessions) comes from variable thinking length (158–1,148 chars) and server-side load, but this only affects time-to-first-text-token. If thinking is visible, the perceived TTFT is stable at ~0.5–0.7s.
Secondary issue: thinkingDefault config bug
OpenClaw’s thinkingDefault: "disabled" setting does not actually disable thinking at the API level:
- OpenClaw records
thinkingLevel: offin the session JSONL - However, the stored JSONL still contains thinking blocks (200–350 chars) even with
thinkingLevel: off - The DashScope API
enable_thinkingparameter is apparently not being forwarded - All 34 sessions across all time show
thinkingLevel: offand ALL contain thinking content (158–1,148 chars)
Session | thinkingLevel | Thinking chars | Output tokens | TTFT
57eb6533 | off | 257 | 135 | 6.53s ← config was "disabled"
162e8f4b | off | 217 | 111 | 3.78s ← config was "disabled"
618ae919 | off | 221 | 110 | 5.90s ← config was "adaptive"
b7b02635 | off | 536 | 302 | 5.57s ← config was "adaptive"
This is a lower-priority issue: if thinking tokens are streamed (the P0 fix), thinking is actually desirable — it provides useful reasoning output. Fixing this bug is still useful for cases where users want to disable thinking to reduce total generation time and output token cost.
TTFT with thinking streamed vs hidden
Current (thinking NOT streamed):
Total TTFT (to first visible content): median ~8,990 ms
├── Gateway overhead: ~260 ms (dispatch, session, model resolve, prompt build)
├── DashScope prefill (15K tokens): ~1,100 ms
├── Thinking generation: ~4,560 ms (median, range 1–17s)
└── User sees NOTHING during this entire time
With thinking streamed (upstream fix):
TTFT (to first visible thinking token): ~1,100 ms ← prefill + first thinking token
├── Gateway overhead: ~260 ms
└── DashScope first thinking token: ~1,100 ms after request
└── User sees thinking output while model reasons
Streaming thinking tokens would reduce perceived TTFT from 9.0s → ~1.1s — an 8× improvement — without changing the model, prompt, or generation behavior.
Thinking streaming status: IMPLEMENTED
Priority: P0 — Stream thinking deltas via WebSocket — DONE via AcaClaw gateway patch.
| Action | Perceived TTFT | Actual TTFT | Status |
|---|---|---|---|
| Stream thinking deltas (gateway patch) | ~3.2s (gateway) / ~533ms (direct) | unchanged | ✓ IMPLEMENTED |
| Keep thinking hidden (before fix) | ~8,990 ms | ~8,990 ms | — |
The 3.2s gateway perceived TTFT (vs 533ms direct) is due to system prompt + tool schema overhead, not the patch. See Thinking-mode comparison.
Secondary issue: thinkingDefault config bug
Root Cause Analysis
The system prompt is ~15,300 tokens (37,106 chars at 2.42 chars/token for the system prompt, plus ~6,200 tokens in tool schemas) — sent to the LLM on every chat turn. Gateway profiling confirms the gateway itself adds only ~260ms. The vast majority of TTFT is DashScope API time: token prefill (~1.1s to first thinking token) plus thinking generation (1–17s, highly variable).
The dominant factors are:
- Thinking tokens not streamed — the model starts thinking within ~1.1s, but OpenClaw only forwards text tokens. The user waits 5–18s seeing nothing. If thinking were streamed, perceived TTFT drops to ~1.1s.
- Variable thinking time (~68% of TTFT) — the model’s reasoning phase is unpredictable (1–17s), especially with complex prompts containing many tool schemas.
- DashScope caching failure — if caching worked, token prefill would drop, reducing time-to-first-thinking-token.
- Token processing (~15% of TTFT) — 15,300 tokens at ~0.047ms/token. Modest impact.
- Gateway overhead is negligible (~3% of TTFT) — only 260ms. No optimization needed here.
The prompt is assembled by OpenClaw’s agent runner (not AcaClaw) and contains:
- OpenClaw base (~14,010 tokens — 92%)
- Tool definitions (JSON schemas for ~20+ built-in tools)
- Skills (mandatory block: ~3,820 tokens)
- System rules (safety, formatting, behavior)
- Heartbeats, group chats, messaging rules
- AcaClaw plugins (~1,290 tokens — 8%)
- Agent SOUL.md identity (~738 tokens)
- Workspace context (~470 tokens)
- Computing environment context (~330 tokens)
- ⚠ SOUL.md injected twice (see below) — wastes ~450 tokens
System Prompt Breakdown
Verified token budget (~15,300 tokens total, 37,106 chars)
Token ratio calibration: 37,106 chars ÷ 15,300 tokens = 2.42 chars/token (Qwen tokenizer). Data from raw LLM input capture (
~/.acaclaw/logs/llm-input-YYYY-MM-DD.jsonl).
| Component | Chars | Tokens (est.) | % of Total | Source | Can Trim? |
|---|---|---|---|---|---|
| Skills (mandatory) | 9,243 | ~3,820 | 25% | OpenClaw core | Yes — skill filter |
| Tooling + call style | 4,716 | ~1,949 | 13% | OpenClaw core | Yes — deny unused tools |
| Heartbeats / groups / proactive | 2,887 | ~1,193 | 8% | OpenClaw core | Partially (upstream) |
| IDENTITY / USER / BOOTSTRAP | 2,478 | ~1,024 | 7% | OpenClaw core | Partially |
| Memory / sessions | 1,767 | ~730 | 5% | OpenClaw core | No |
| Reply / messaging | 1,647 | ~681 | 4% | OpenClaw core | Partially |
| TOOLS.md (local notes) | 746 | ~308 | 2% | OpenClaw core | Yes — trim |
| Other OpenClaw (safety, docs, etc.) | 10,426 | ~4,305 | 28% | OpenClaw core | Partially |
| SOUL.md (main agent) | 1,786 | ~738 | 5% | AcaClaw | Partially — compress |
| Workspace context | 1,137 | ~470 | 3% | AcaClaw | Yes — summarize |
| Computing environment | 798 | ~330 | 2% | AcaClaw | Yes — omit if unused |
| ⚠ SOUL.md DUPLICATE | ~1,098 | ~450 | 3% | AcaClaw (bug) | Yes — remove |
| User message | 141 | ~58 | <1% | User | No |
| Session history | 0 (fresh) | 0 | 0% | OpenClaw core | Compaction |
| TOTAL | 37,106 | ~15,300 | 100% |
⚠ SOUL.md is injected TWICE: Once in the agent personality section and again in the workspace files section (
/Users/.../AcaClaw/SOUL.md). This wastes ~450 tokens. Removing the duplicate would save ~1.3% of input tokens.
AcaClaw custom tools registered (8 total)
| Tool | Plugin | Can Deny? |
|---|---|---|
workspace_info |
acaclaw-workspace | No — essential |
env_status |
acaclaw-academic-env | Yes |
backup_restore |
acaclaw-backup | Yes |
backup_list |
acaclaw-backup | Yes |
compat_check |
acaclaw-compat-checker | Yes |
event_log |
acaclaw-logger | Yes |
security_audit |
acaclaw-security | Yes |
security_status |
acaclaw-security | Yes |
Key observations
- ~92% of input tokens come from OpenClaw itself (skills, tool schemas, system rules, heartbeats, etc.). AcaClaw only adds ~8% (~1,290 tokens).
- SOUL.md is injected twice — agent personality section + workspace files. Wastes ~450 tokens.
- Skills mandatory block is the single largest section — 9,243 chars, ~3,820 tokens (25% of total).
- DashScope prompt caching is NOT working — 0 cached tokens across all 34 measured sessions.
- DashScope API has 160% TTFT variance at the same token count (3.67s–18.03s across 34 sessions), indicating server-side instability.
- Denying 7 AcaClaw tools would save ~1,400–1,750 tokens (~0.4s) but won’t reach the 5s target alone.
Optimization Strategies
P0 — Prompt Trimming
Expected reduction: 3,000–4,500 tokens → TTFT savings: ~1,350–2,025 ms
These changes reduce token count without code changes.
1. Skill-aware agent filtering
Only include skills relevant to the active agent. Currently all installed skills are included for every agent.
// openclaw.json — already supported by OpenClaw
{
"agents": {
"list": [
{
"id": "main",
"skillFilter": ["literature-review", "pubmed-edirect"]
}
]
}
}
Impact: If 6 skills × ~800 tokens each = 4,800 tokens → filter to 2 skills = 1,600 tokens. Saves ~3,200 tokens.
2. Deny AcaClaw utility tools
Deny 7 AcaClaw-registered tools that aren’t needed during chat:
{
"tools": {
"deny": [
"gateway", "cron", "sessions_spawn", "sessions_send",
"mcp_install", "mcp_uninstall", "config_set",
"backup_restore", "backup_list", "compat_check",
"event_log", "security_audit", "security_status",
"env_status"
]
}
}
Impact: 7 AcaClaw tools × ~200–250 tokens each = saves ~1,400–1,750 tokens (~0.6–0.8s).
3. Deny OpenClaw tools not needed for academic use
Deny OpenClaw built-in tools that academic agents rarely use:
{
"tools": {
"deny": [
"lsp_diagnostics", "lsp_references", "lsp_hover",
"notebook_edit", "notebook_read"
]
}
}
Impact: 5 tools × ~300–500 tokens = saves ~1,500–2,500 tokens (~0.7–1.1s).
4. Compress agent SOUL.md
Current main SOUL.md is 1,786 chars (~447 tokens). Can be tightened:
<!-- Before: 447 tokens -->
You are AcaClaw's main research assistant. You help researchers with
literature review, data analysis, scientific writing, and computational
experiments. You have deep expertise in...
<!-- After: 200 tokens -->
AcaClaw main assistant. Expertise: literature review, data analysis,
scientific writing, computational experiments.
Impact: ~247 tokens per agent. Modest.
5. Workspace context compression
Instead of sending the full directory tree, send a summary:
Workspace: ~/AcaClaw (12 files, 3 dirs)
vs:
Workspace: ~/AcaClaw
├── documents/
│ ├── drafts/
│ │ └── aptamer_drugs_report.docx
│ └── papers/
│ └── ...
├── output/
│ └── aptamer_drugs_2026_review.md
└── ...
Impact: ~200–500 tokens saved.
P1 — Prompt Caching
Potential improvement: 2–4× on warm cache → TTFT ~2,500–3,500 ms
DashScope (current provider) — VERIFIED NOT WORKING
DashScope (Alibaba Cloud Model Studio) claims context caching for Qwen models via OpenAI-compatible API. However, testing shows 0 cached tokens across all 34 sessions:
$ node scripts/test-ttft.mjs --history
Caching: NONE — 0 cached tokens across all sessions
Possible reasons:
- DashScope may require explicit cache API calls (not automatic prefix matching)
- The
compatible-mode/v1endpoint may not support caching - System prompt may change slightly between requests (timestamps, workspace state)
Action items:
Verify DashScope returns→ VERIFIED: always 0cached_tokensin the usage response- Investigate DashScope’s explicit caching API (context_cache_id parameter)
- Test with a provider that has verified prompt caching (OpenRouter, Anthropic)
- Ensure AcaClaw sends identical system prompts — check for per-request variance in workspace context
Other providers
| Provider | Cold TTFT | Warm TTFT | Speedup |
|---|---|---|---|
| OpenRouter (MiniMax M2.7) | 9,579 ms | 2,247 ms | 4.3× |
| Anthropic (Claude) | ~8,000 ms | ~2,000 ms | 4× |
| DashScope (qwen3.5-plus) | 8,990 ms | N/A (caching broken) | N/A |
Deterministic session IDs (already implemented)
AcaClaw uses deterministic session IDs (agent:main:web:main) for default tabs, keeping the cache warm across page reloads. This only helps if the provider supports prefix caching AND the user stays on the same session.
P2 — Lazy Tool Loading
Expected reduction: 2,000–4,000 tokens → TTFT savings: ~1,500–3,000 ms
Instead of including ALL tool schemas in the system prompt, use a two-pass approach:
Pass 1 — Lightweight tool index
Send only tool names and one-line descriptions (~50 tokens per tool instead of ~300):
Available tools: web_search (search the web), web_fetch (fetch a URL),
read (read a file), write (write a file), edit (edit a file),
bash (run a command), ...
Pass 2 — Full schema on demand
When the model calls a tool, include the full JSON schema for that tool in the next turn. This is a change to OpenClaw’s agent runner (upstream contribution or plugin hook).
Trade-off: The model may make slightly worse tool-calling decisions without full schemas, especially for complex tools with many parameters.
Alternative — categories: Group tools into categories and include full schemas only for the category the user’s query implies:
| Query mentions | Load tools |
|---|---|
| “search”, “find papers” | web_search, web_fetch |
| “write code”, “run” | bash, write, edit |
| “read file”, “show” | read, list |
P3 — Model Selection
Potential: 2–5× TTFT improvement
Different models have different TTFT characteristics:
| Model | Provider | Raw TTFT (9 tokens) | Expected Gateway TTFT |
|---|---|---|---|
| qwen3.5-plus | DashScope | 522–726 ms | 8,990 ms (measured) |
| qwen-plus | DashScope | ~1,500 ms | ~7,500 ms (est.) |
| qwen-turbo | DashScope | ~500 ms | ~3,000 ms (est.) |
| claude-3.5-haiku | Anthropic | ~300 ms | ~2,000 ms (est.) |
| gpt-4o-mini | OpenAI | ~400 ms | ~2,500 ms (est.) |
Strategy: Use a faster model for the heartbeat/routing phase, and optionally escalate to a larger model only for complex queries:
{
"agents": {
"defaults": {
"model": "modelstudio/qwen-turbo",
"heartbeat": {
"model": "modelstudio/qwen-turbo"
}
}
}
}
Trade-off: Smaller/faster models may have lower reasoning quality.
P4 — Streaming UX
Perceived improvement, not actual TTFT reduction
Typing indicator
Show a typing animation immediately after sending:
User: search aptamer drugs in recent 5 years
Assistant: ⠋ Thinking... ← appears instantly
This doesn’t reduce actual TTFT but reduces perceived wait time.
Stream thinking tokens — ✅ IMPLEMENTED
Thinking streaming is now implemented via AcaClaw’s gateway patch (patches/openclaw-thinking-stream/apply.mjs). The gateway forwards reasoning_content deltas as type: "thinking" content blocks at 100ms throttle. The UI renders thinking in a collapsible <details> block that auto-opens during streaming.
Measured result (5 runs, node scripts/test-ttft.mjs --runs=5):
- Direct API first thinking token: 533ms (range 490ms–1.14s)
- Gateway first thinking token: 3.23s (range 2.57s–8.57s)
- Gateway overhead: +2.7s (system prompt + tool schemas + WebSocket routing)
This is the most impactful TTFT optimization: perceived TTFT dropped from ~9.0s (no thinking) to ~3.2s (first thinking token visible). Further reduction requires trimming the system prompt (Phase 1) or fixing prompt caching (Phase 2).
Implementation Plan
Phase 0: Stream thinking tokens — ✅ IMPLEMENTED
| Action | Perceived TTFT | Status |
|---|---|---|
| Stream thinking deltas via gateway patch | ~3.2s (gateway) / ~533ms (direct) | ✅ Done |
Fix thinkingDefault: disabled to pass enable_thinking=false |
incidental fix | Open |
Perceived TTFT after Phase 0: ~3.2s via gateway (first thinking token). Direct API achieves ~533ms.
Phase 1: Quick wins — PARTIALLY DONE
| Action | Token Savings | TTFT Impact | Status |
|---|---|---|---|
| Deny 15 additional tools (22 total) | ~3,750 | −620 ms | ✅ Done |
Add skillFilter to main agent |
~1,000 | −450 ms | N/A (not supported by OpenClaw) |
| Compress SOUL.md | ~250 | −110 ms | Not started |
| Total applied | ~3,750 | ~620 ms |
Expected cold-cache TTFT after Phase 1: ~6,800 ms (still above 5s)
Note: These savings are modest because token processing is only ~15% of total TTFT. Gateway profiling confirmed that gateway code overhead is only ~260ms — the dominant cost is DashScope API time (prefill + thinking). DashScope prompt caching is NOT WORKING. Cannot rely on warm cache until fixed.
Phase 1.5: Gateway overhead (confirmed negligible)
Gateway profiling (see Gateway Overhead Profiling) confirmed that all gateway code (dispatch → session → model resolve → prompt build → llm_input) completes in ~260ms. No code-level optimization of the gateway itself is needed — the 5.2s gap between llm_input and first text delta is 100% DashScope API time (token prefill + thinking generation).
No action items for this phase.
Phase 2: Address DashScope variance (provider-side)
| Action | TTFT Impact | Trade-off |
|---|---|---|
| Investigate DashScope explicit cache API | −3,000–5,000 ms (warm) | May need API changes |
| Test Anthropic/OpenRouter with prompt caching | −5,000 ms (warm) | Different model |
| Use qwen-turbo for simple queries | −3,000 ms | Lower reasoning |
| Total (switching provider) | −5,000 ms | Different model/cost |
Expected TTFT with working cache: ~3,500–5,000 ms (at or below target)
Phase 3: Prompt compression (code changes, needs upstream)
| Action | Token Savings | TTFT Impact |
|---|---|---|
| Lazy tool loading (upstream PR) | ~3,000 | −1,350 ms |
| Compress workspace context | ~200 | −90 ms |
| Total | ~3,200 | ~1,440 ms |
Expected TTFT after Phase 1+3: ~5,360 ms (cold cache, single provider)
Priority assessment
The biggest wins are (in order):
Stream thinking tokens (Phase 0)— ✅ DONE. Perceived TTFT dropped from ~9.0s to ~3.2s (gateway) via AcaClaw gateway patch. Direct API achieves ~533ms.Trim prompt / deny unused tools (Phase 1)— ✅ PARTIALLY DONE. 22 tools denied, saving 620ms (3.23s → 2.61s thinking TTFT).skillFilternot supported by OpenClaw.- Fix prompt caching (Phase 2) — If DashScope caching worked, prefill time for the 15,300-token prompt would drop significantly. Currently 0 cached tokens across all sessions.
- Lazy tool loading (Phase 3, upstream) — Removes ~3,000 tool-schema tokens, saving ~140ms of prefill. Requires upstream PR.
Gateway overhead is NOT a bottleneck (Phase 1.5) — Profiling confirmed only ~260ms. No code-level optimization needed.
Success Criteria
| Metric | Threshold | Test | Current |
|---|---|---|---|
| Gateway TTFT — first thinking token | < 4,000 ms | node scripts/test-ttft.mjs |
2,610 ms ✓ (after Phase 1) |
| Gateway TTFT — first text token | < 5,000 ms | Same test | 3,650 ms ✓ |
| Gateway overhead (pure) | < 500 ms | Gateway profiler (historical) | 260 ms ✓ |
| Direct API TTFT — first thinking token | < 1,000 ms | node scripts/test-ttft.mjs |
546 ms ✓ |
| Direct API TTFT — first text token | < 5,000 ms | Same test | 4,560 ms ✓ |
| Gateway thinking overhead | < 3,000 ms | node scripts/test-ttft.mjs (overhead) |
2,060 ms ✓ |
| Thinking streamed via WebSocket | type: "thinking" in deltas |
Check chat event deltas | ✓ — via gateway patch |
| Prompt caching active | cacheRead > 0 | Provider dashboard | 0 ✗ |
| Shell script TTFT | < 5,000 ms | bash scripts/test-chat-latency.sh |
10,000+ ms ✗ |
All TTFT tests use the 5-second threshold. TTFT > 5s = test failure.
Appendix: Measured Data
Token usage from real session
Session: 15d01efe-af32-41ac-824e-12e8a54edd07
Model: modelstudio/qwen3.5-plus
Query: "search aptamer drugs in recent 5 years and write a report in word"
First response usage:
input: 15,291 tokens
cacheRead: 0 tokens
cacheWrite: 0 tokens
output: 590 tokens
TTFT comparison: Direct API vs Gateway (thinking mode, 2026-04-18)
Direct API (enable_thinking=true, median of 5 runs, after Phase 1):
First thinking token: 546ms (range 459ms–774ms)
First text token: 4,560ms (range 4.49s–4.70s)
Total: 4,720ms (range 4.52s–4.73s)
Gateway (enable_thinking=true, median of 5 runs, after Phase 1):
First thinking token: 2,610ms (range 2.27s–5.66s)
First text token: 3,650ms (range 3.17s–6.56s)
Total: 3,700ms (range 3.23s–6.62s)
Overhead (gateway − direct):
Thinking TTFT: +2,060ms (378% of direct)
Text TTFT: −915ms (gateway text faster — shorter thinking chains)
Total: −1,014ms (gateway total faster)
Historical comparison: raw API vs gateway (pre-thinking-streaming)
Layer A — Raw API (no prompt): 679 ms (16 tokens, median of 3 runs)
Layer B — Raw API + system prompt: 1,110 ms (9,203 tokens, median of 3 runs)
Layer C — Gateway (full stack): 4,420 ms (~15,400 tokens, median of 3 runs)
Gateway historical: 8,990 ms (~15,300 tokens, median of 34 sessions)
Token overhead (B − A): 431ms for 9,187 extra tokens = 0.047ms/token
Token overhead (C − A): 3,741ms for ~15,384 extra tokens
Gateway profiling (historical, 3 runs):
chat.send → llm_input: ~260ms (gateway code overhead)
llm_input → first delta: ~5,200ms (DashScope API: prefill + thinking)
Total gateway TTFT: ~5,460ms
Conclusion: The apparent "3,030ms gateway overhead" from isolation test
(C − B − token processing) was a misattribution. The extra time in Layer C
vs Layer B is due to:
1. ~6,200 additional tool-schema tokens (15,400 − 9,200)
2. Model thinking longer with full tool context
3. Actual gateway code overhead: only ~260ms
Thinking vs non-thinking (raw DashScope API)
Minimal prompt (9 tokens):
thinking=false: 0.55–0.70s first text | total 0.75–1.37s
thinking=true: 2.87–4.46s first text | total 3.03–4.64s
Large prompt (~15K tokens):
thinking=false: 0.60–0.68s first text | total 2.18–3.56s
thinking=true: 2.28–17.74s first text | total 4.18–19.13s
Existing optimizations already in AcaClaw
- Deterministic session IDs (
agent:main:web:main) — keeps provider cache warm - Auth mode
"none"— eliminates auth overhead (~5ms saved) - Plugin load — startup only, not per-request