Kimi K2.6 — Capacity under realistic Fabric load

8× H200 · vLLM 0.21 · Eagle 3 spec decode · INT4 weights · --max-num-seqs 128 · empirically-driven think-time distribution

TL;DR — Unit economics for an "unlimited" enterprise / sovereign plan

All prices in USD

Capacity → subscriber-count derivation

  1. Empirical bench (below) shows an 8× H200 node sustains ~100 simultaneously active users at a "good to excellent" experience under the real Fabric usage pattern.
  2. Real data: the peak hour in the last 14 days saw 8 simultaneously active users out of 30 subscribers → ~27% peak concurrency ratio.
  3. Conservative pricing choice: assume 50% peak concurrency, so 2 subscribers per concurrent seat.
  4. Per-node subscriber cap scales with hardware throughput at SLO floor. H200=200 (measured); B200≈700 (FP4 TP=4 × 2 replicas, Blackwell-native config); B300≈740 (modest gain over B200 since K2.6 already fits in B200 HBM).
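The derivation in the list above reduces to two divisions. The sketch below reproduces it from the figures quoted in the list; the function and variable names are illustrative, not taken from the bench code. The last figure shows only how much headroom the 50% assumption leaves over the observed ratio.

```python
from fractions import Fraction

def peak_concurrency_ratio(peak_active: int, subscribers: int) -> Fraction:
    """Share of subscribers that were active at the busiest moment."""
    return Fraction(peak_active, subscribers)

def subscriber_cap(concurrent_ceiling: int, assumed_ratio: Fraction) -> int:
    """Subscribers a node can carry if at most assumed_ratio of them are active at once."""
    return int(concurrent_ceiling / assumed_ratio)

# Figures from the list above: 8 active out of 30 subscribers, 100-user measured ceiling.
observed = peak_concurrency_ratio(8, 30)
cap_conservative = subscriber_cap(100, Fraction(1, 2))  # the 50% pricing assumption
cap_at_observed = subscriber_cap(100, observed)         # what the raw ~27% would imply

print(f"observed peak ratio: {float(observed):.0%}")
print(f"cap at 50% assumed concurrency: {cap_conservative}")
print(f"cap at observed concurrency:    {cap_at_observed}")
```

Exact fractions are used so the integer truncation in `subscriber_cap` is not affected by floating-point rounding.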
Hardware: 8× H200 SXM (141 GB HBM3e) · Config: Kimi K2.6 INT4 TP=8 + Eagle 3 spec decode + Fabric system prompt (~11k tok) · Measured ceiling: 100 concurrent → 200 subscribers/node
| H200 list price (USD) | Node cost / month (USD), 8 GPU × 730 h | Per-subscriber cost (USD), ÷ 200 subscribers | Per-sub price (USD) at 100% markup (2× cost) | Per-sub price (USD) at 200% markup (3× cost) |
|---|---|---|---|---|
| $3 / GPU-hour | $17,520 | $87.60 | $175.20 / mo | $262.80 / mo |
| $4 / GPU-hour | $23,360 | $116.80 | $233.60 / mo | $350.40 / mo |
| $5 / GPU-hour | $29,200 | $146.00 | $292.00 / mo | $438.00 / mo |
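Every figure in the pricing rows follows from the GPU-hour rate. As a rough sketch (constant and key names are illustrative), the node cost is rate × 8 GPUs × 730 h, the per-subscriber cost divides that by 200, and the two price columns are the per-subscriber cost multiplied by 2 and by 3:

```python
GPUS_PER_NODE = 8
HOURS_PER_MONTH = 730
SUBSCRIBERS_PER_NODE = 200  # measured H200 cap from the derivation above

def unit_economics(rate_per_gpu_hour: float) -> dict:
    """Monthly node cost and per-subscriber cost/price for one GPU-hour rate (USD)."""
    node_cost = rate_per_gpu_hour * GPUS_PER_NODE * HOURS_PER_MONTH
    per_sub_cost = node_cost / SUBSCRIBERS_PER_NODE
    return {
        "node_cost": node_cost,
        "per_sub_cost": per_sub_cost,
        "price_2x_cost": per_sub_cost * 2,  # first price column
        "price_3x_cost": per_sub_cost * 3,  # second price column
    }

for rate in (3, 4, 5):
    row = unit_economics(rate)
    print(f"${rate}/GPU-h: node ${row['node_cost']:,.0f}, "
          f"cost ${row['per_sub_cost']:.2f}, "
          f"price ${row['price_2x_cost']:.2f} / ${row['price_3x_cost']:.2f}")
```

Changing `SUBSCRIBERS_PER_NODE` is the quickest way to see how sensitive the floor price is to the concurrency assumption.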

Hardware throughput comparison (raw — no spec decode, no idle)

From SemiAnalysis InferenceX benchmark: Kimi K2.5, ISL=8192, OSL=1024, vLLM, INT4, TP=8, no speculative decoding. Per-user decode tok/s at fixed concurrency.

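| GPU | C=4 | C=8 | C=16 | C=32 | C=64 | vs H200 @ C=32 |
|---|---|---|---|---|---|---|
| H200 (Hopper) | 91.2 | 70.3 | 46.6 | 28.9 | 16.7 | 1.00× |
| B200 (Blackwell) | 104.4 | 81.1 | 59.2 | 40.5 | 26.0 | 1.40× |
| B300 (Blackwell Ultra) | 111.1 | 84.4 | 60.6 | 42.0 | 26.8 | 1.45× |
| MI355x (AMD) | 41.7 | 34.7 | 21.9 | 16.3 | 10.7 | 0.56× |
| MI325x (AMD) | 46.0 | 35.2 | 25.8 | 14.7 | 9.9 | 0.51× |
| MI300x (AMD) | 43.8 | 31.7 | 22.7 | 12.9 | 8.6 | 0.45× |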
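The last column is each GPU's per-user decode rate at C=32 divided by the H200 figure at the same concurrency. A minimal sketch of that calculation, with the values transcribed from the rows above and a hypothetical dictionary layout:

```python
# Per-user decode tok/s keyed by concurrency C, transcribed from the comparison above.
DECODE_TPS = {
    "H200":   {4: 91.2,  8: 70.3, 16: 46.6, 32: 28.9, 64: 16.7},
    "B200":   {4: 104.4, 8: 81.1, 16: 59.2, 32: 40.5, 64: 26.0},
    "B300":   {4: 111.1, 8: 84.4, 16: 60.6, 32: 42.0, 64: 26.8},
    "MI355x": {4: 41.7,  8: 34.7, 16: 21.9, 32: 16.3, 64: 10.7},
    "MI325x": {4: 46.0,  8: 35.2, 16: 25.8, 32: 14.7, 64: 9.9},
    "MI300x": {4: 43.8,  8: 31.7, 16: 22.7, 32: 12.9, 64: 8.6},
}

def relative_throughput(gpu: str, concurrency: int = 32, baseline: str = "H200") -> float:
    """Ratio of a GPU's per-user decode rate to the baseline at the same concurrency."""
    return DECODE_TPS[gpu][concurrency] / DECODE_TPS[baseline][concurrency]

for gpu in DECODE_TPS:
    print(f"{gpu:7s} {relative_throughput(gpu):.2f}x at C=32, "
          f"{relative_throughput(gpu, 64):.2f}x at C=64")
```

The ratio is not constant across concurrency levels (B200 comes out near 1.56× at C=64), so an extrapolation built on the C=32 column inherits that choice of operating point.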
⚠ Important caveats — read before quoting any price
  • H200 is measured. B200 and B300 are extrapolated using SemiAnalysis InferenceX throughput ratios (1.40× and 1.45× respectively at C=32). We have NOT actually benchmarked Kimi K2.6 on either Blackwell SKU. Eagle 3.1 MLA + tokenspeed_mla backend should give an additional Blackwell-only boost we haven't quantified.
  • This is the Kimi node cost only. The all-in "unlimited Fabric" subscription must amortize fabric-large (H100), fabric-medium (GH200), fabric-small (A40s), voice (Omni+Parakeet+Orpheus), plus platform/auth/support/sales overhead. The per-seat price above is a floor.
  • "Concurrent active user" is closed-loop synthetic. Modeled "users" follow last-14-days empirical Fabric think-time (dominated by sub-5-second agentic-loop steps). Heavier real workloads (longer outputs, deeper tool chains, larger context) consume more per-seat capacity. Re-measure if usage patterns shift.
  • The 27% peak-concurrency figure comes from N=30 subscribers. Small-sample inference; not statistically robust. Re-baseline once subscriber count grows past ~200 before locking the 2× concurrency multiplier into financial models.
  • Capacity assumes only Kimi runs on the node. Any colocation reduces per-subscriber capacity proportionally.
  • GPU prices are list / commodity-provider estimates. AWS/GCP/Azure typically charge 1.5–2× more than commodity (CoreWeave, Lambda, RunPod, etc.). Self-owned hardware capex / depreciation has a different cost model entirely.

Capacity verdict

All users 🟢 Excellent: ≤ 40 simultaneous users (at 60, 55 of 60 users are still 🟢)
Median user still 🟢 Excellent: ~ 100 simultaneous users
SLO knee (17% drop below 30 tok/s): ~ 150 simultaneous users
25% unacceptable: @ 200 simultaneous users
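The percentages in this verdict can be cross-checked against the per-user SLO counts reported in the per-level summary further down. A small sketch, with the counts transcribed from that summary and an illustrative data layout:

```python
# (excellent, meets_expectations, unacceptable) user counts per concurrency level,
# transcribed from the "Per-user SLO" column of the per-level summary.
SLO_COUNTS = {
    40:  (40, 0, 0),
    60:  (55, 5, 0),
    100: (53, 47, 0),
    150: (1, 124, 25),
    200: (0, 151, 49),
}

def unacceptable_share(n_users: int) -> float:
    """Fraction of users whose median decode rate fell below the 30 tok/s floor."""
    excellent, meets, unacceptable = SLO_COUNTS[n_users]
    return unacceptable / (excellent + meets + unacceptable)

def all_excellent(n_users: int) -> bool:
    """True when every user at this level was rated Excellent."""
    excellent, meets, unacceptable = SLO_COUNTS[n_users]
    return meets == 0 and unacceptable == 0

for n in SLO_COUNTS:
    print(f"N={n}: {unacceptable_share(n):.1%} unacceptable, all excellent: {all_excellent(n)}")
```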

Service-level objective

Each user is evaluated on its per-user decode tok/s p50 across all turns:
  • 🟢 Excellent: > 50 tok/s
  • 🟡 Meets expectations: 30 – 50 tok/s
  • 🔴 Unacceptable: < 30 tok/s
A request is also marked failed on HTTP non-200 or any transport error.
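As a sketch of how these rules could be applied, the snippet below classifies one user from a list of per-turn decode rates and flags failed requests. The function names and tier labels are hypothetical, and it assumes the boundary values 30 and 50 tok/s fall in the middle tier, as the thresholds above imply.

```python
from statistics import median
from typing import Optional, Sequence

def classify_user(turn_decode_tps: Sequence[float]) -> str:
    """Rate one user by the median (p50) of their per-turn decode tok/s."""
    p50 = median(turn_decode_tps)
    if p50 > 50:
        return "excellent"
    if p50 >= 30:
        return "meets_expectations"
    return "unacceptable"

def request_failed(http_status: Optional[int], transport_error: bool) -> bool:
    """A request fails on any transport error or any non-200 HTTP status."""
    return transport_error or http_status != 200

print(classify_user([63.5, 83.9, 104.7]))  # median 83.9
print(classify_user([28.0, 31.6, 35.0]))   # median 31.6
print(request_failed(200, False), request_failed(502, False), request_failed(None, True))
```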

Per-level summary

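| N concurrent users | Rig | Turns | OK | Fail | Thru (req/s) | TTFT p50 | TTFT p95 | Turn p50 | Turn p95 | Turn p99 | TPS p50 | Per-user SLO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Mac→tunnel | 64 | 64 | 0 | 0.35 | 0.31s | 0.76s | 2.50s | 3.86s | 4.29s | 176 | 5/5 🟢 |
| 10 | Mac→tunnel | 72 | 69 | 3 | 0.38 | 0.34s | 1.30s | 2.65s | 5.24s | 5.34s | 166 | 10/10 🟢 |
| 20 | Mac→tunnel | 220 | 217 | 3 | 1.19 | 0.37s | 2.37s | 3.66s | 7.23s | 7.81s | 117 | 20/20 🟢 |
| 40 | Mac→tunnel | 458 | 451 | 7 | 2.36 | 0.39s | 4.28s | 4.94s | 10.78s | 11.80s | 84 | 40/40 🟢 |
| 60 | H200-local | 550 | 550 | 0 | 3.06 | 0.32s | 4.93s | 5.76s | 14.14s | 14.98s | 70 | 55 🟢 5 🟡 |
| 100 | H200-local | 751 | 751 | 0 | 4.17 | 0.43s | 6.94s | 8.06s | 18.66s | 24.53s | 51 | 53 🟢 47 🟡 |
| 150 | H200-local | 1064 | 1061 | 3 | 5.89 | 1.13s | 12.44s | 13.36s | 24.79s | 30.91s | 32 | 1 🟢 124 🟡 25 🔴 |
| 200 | H200-local | 1127 | 1124 | 3 | 6.24 | 6.81s | 22.25s | 19.64s | 32.67s | 39.22s | 31.6 | 151 🟡 49 🔴 |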
Each level ran 180 s. N=5–40 from Mac via SSH tunnel (real users cascading to fabric-large during run). N=60–200 ran on the H200 box itself (localhost) for cleanest numbers.
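For readers unfamiliar with closed-loop load generation, the sketch below shows the general shape of one concurrency level: each virtual user sends a turn, waits for it to finish, sleeps for a think time, and repeats until the level's deadline. This is not the actual bench harness. `fake_turn` stands in for a real HTTP call to the model endpoint, and all names are hypothetical.

```python
import asyncio
import time
from typing import Awaitable, Callable, List, Tuple

async def virtual_user(send_turn: Callable[[], Awaitable[bool]],
                       think_time: Callable[[], float],
                       deadline: float,
                       results: List[Tuple[bool, float]]) -> None:
    """Closed loop: the next request is only issued after the previous one returns."""
    while time.monotonic() < deadline:
        started = time.monotonic()
        ok = await send_turn()
        results.append((ok, time.monotonic() - started))
        await asyncio.sleep(think_time())  # may overshoot the deadline; the loop then exits

async def run_level(n_users: int, duration_s: float,
                    send_turn: Callable[[], Awaitable[bool]],
                    think_time: Callable[[], float]) -> List[Tuple[bool, float]]:
    """Run n_users virtual users concurrently for roughly duration_s seconds."""
    deadline = time.monotonic() + duration_s
    results: List[Tuple[bool, float]] = []
    await asyncio.gather(*(virtual_user(send_turn, think_time, deadline, results)
                           for _ in range(n_users)))
    return results

async def fake_turn() -> bool:
    """Placeholder for a real request; pretends every turn takes 1 ms and succeeds."""
    await asyncio.sleep(0.001)
    return True

demo = asyncio.run(run_level(5, 0.05, fake_turn, lambda: 0.001))
print(f"{len(demo)} turns, {sum(1 for ok, _ in demo if not ok)} failures")
```

In a real run, `duration_s` would be 180 and `think_time` would draw from the empirical distribution described under Methodology.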

Per-user breakdown at N=40 (40 concurrent users)

Every single user got 🟢 Excellent service. Min user 63.5 tok/s, median 83.9, max 104.7.

Failures, honestly

The bench measures two kinds of "failure":

| Rig | Turns | Failures | Rate | Cause |
|---|---|---|---|---|
| Mac → SSH tunnel → CF → LiteLLM → vLLM | 814 | 13 | 1.60% | Mostly Mac↔H200 SSH tunnel; some HTTP keep-alive |
| H200 → localhost vLLM (no tunnel, no network) | 3492 | 6 | 0.17% | vLLM's uvicorn closes idle HTTP keep-alive after 5 s |

Root cause of the H200-local failures: VLLM_HTTP_TIMEOUT_KEEP_ALIVE=5 (uvicorn default). Any aiohttp keep-alive connection idle for >5 s gets closed server-side. Empirical Fabric think-time has ~25% of gaps over 13 s, so this fires on long-idle users.
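The share of think-time gaps that outlive a given keep-alive window can be estimated from the bucketed distribution listed under Methodology. The sketch below assumes gaps are uniformly spread within each bucket (the same assumption the sampler makes) and normalizes the weights, since the published percentages do not sum to exactly 100%. Names are illustrative.

```python
# (low_s, high_s, weight_percent) buckets from the think-time distribution in Methodology.
BUCKETS = [
    (0.5, 5, 57.0), (5, 15, 18.8), (15, 30, 11.3), (30, 60, 5.4),
    (60, 180, 4.1), (180, 600, 1.4), (600, 1800, 1.3),
]

def share_of_gaps_over(threshold_s: float) -> float:
    """Estimated fraction of gaps longer than threshold_s, uniform within each bucket."""
    total = sum(weight for _, _, weight in BUCKETS)
    over = 0.0
    for low, high, weight in BUCKETS:
        if threshold_s <= low:
            over += weight                                      # whole bucket is above
        elif threshold_s < high:
            over += weight * (high - threshold_s) / (high - low)  # partial bucket
    return over / total

print(f"gaps over 13 s: {share_of_gaps_over(13):.1%}")
```

Under these assumptions the estimate for 13 s lands in the same range as the ~25% quoted above. The same function can be evaluated at any other timeout to see how many gaps would still exceed it.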

What this means for production: LiteLLM is configured with num_retries: 1, so any transport disconnect is transparently retried on a fresh connection. Assuming the retry fails independently of the first attempt, the effective production failure rate is ≈ 0.17% × 0.17% ≈ 0.0003%, well under the 0.1% target.
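The retry arithmetic can be reproduced from the raw counts in the failure breakdown. This is a back-of-envelope sketch with illustrative names. It treats each attempt as an independent trial, which is an assumption: correlated failures, such as a whole-node outage, would not be reduced by a retry.

```python
def failure_rate(failures: int, turns: int) -> float:
    """Observed per-request failure probability."""
    return failures / turns

def rate_after_retries(p_fail: float, num_retries: int) -> float:
    """Probability that the first attempt and every retry all fail (independence assumed)."""
    return p_fail ** (num_retries + 1)

tunnel = failure_rate(13, 814)   # Mac → tunnel rig
local = failure_rate(6, 3492)    # H200-local rig
effective = rate_after_retries(local, num_retries=1)

print(f"tunnel rig: {tunnel:.2%}, H200-local: {local:.2%}")
print(f"with one retry: {effective:.4%} (target: 0.1%)")
```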

Hardening applied after bench completed:

  1. VLLM_HTTP_TIMEOUT_KEEP_ALIVE=75 set on the vllm-kimi26 container — server-side keep-alive bumped from 5 s → 75 s, eliminates idle-disconnect for any reasonable think time.
  2. -o TCPKeepAlive=yes added to the Buzz → H200 h200-tunnel.service systemd unit — extra resilience against middlebox idle timeouts on the tunnel itself.

Re-measurement of the failure rate with these fixes in place is pending (would require restarting the isolated bench, which would knock real users off Kimi again).

Methodology

Workload

Think-time distribution — empirically derived from real Fabric traffic

Pulled from LiteLLM_SpendLogs on Server A: 14 days of real Fabric traffic, 32 real users, 48,939 intra-session gaps, filtered to fabric-xlarge + fabric-large only (excludes title-gen, summarization, etc.).

| Gap bucket | Weight | Interpretation |
|---|---|---|
| 0.5 – 5 s | 57.0% | interrupt / agentic loop step |
| 5 – 15 s | 18.8% | rapid reply |
| 15 – 30 s | 11.3% | quick reply |
| 30 – 60 s | 5.4% | normal think |
| 60 – 180 s | 4.1% | actual thinking |
| 180 – 600 s | 1.4% | slow / context switch |
| 600 – 1800 s | 1.3% | idle-ish |

Per-turn think time is sampled from this distribution (weighted bucket → uniform within bucket). The dominant mode (57% under 5 s) reflects Fabric's agentic loop firing sequential model calls — one human action triggers many sub-requests in seconds.
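The two-stage sampling described above (pick a bucket by weight, then draw uniformly inside it) could look roughly like this. The names are illustrative rather than taken from the bench code. `random.choices` accepts unnormalized weights, which matters here because the published percentages sum to slightly under 100%.

```python
import random

# (low_s, high_s, weight_percent) from the gap-bucket distribution above.
BUCKETS = [
    (0.5, 5, 57.0), (5, 15, 18.8), (15, 30, 11.3), (30, 60, 5.4),
    (60, 180, 4.1), (180, 600, 1.4), (600, 1800, 1.3),
]
WEIGHTS = [weight for _, _, weight in BUCKETS]

def sample_think_time(rng: random.Random) -> float:
    """Weighted bucket choice, then a uniform draw within that bucket (seconds)."""
    low, high, _ = rng.choices(BUCKETS, weights=WEIGHTS, k=1)[0]
    return rng.uniform(low, high)

rng = random.Random(0)  # seeded only to make the demo repeatable
samples = [sample_think_time(rng) for _ in range(20_000)]
under_5s = sum(1 for s in samples if s < 5) / len(samples)
print(f"share of sampled gaps under 5 s: {under_5s:.1%}")
```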

Test isolation

Runtime config

vllm/vllm-openai:latest
  --model /models/Kimi-K2.6 (INT4)
  --tensor-parallel-size 8
  --max-num-seqs 128
  --max-model-len 262144
  --gpu-memory-utilization 0.92
  --enable-prefix-caching
  --no-async-scheduling          # defense vs LMCache #2356 (LMCache no longer in stack)
  --speculative-config '{"model":"/models/kimi-k2.6-eagle3","method":"eagle3","num_speculative_tokens":3}'
  --enable-auto-tool-choice --tool-call-parser kimi_k2 --reasoning-parser kimi_k2
  --init   --restart unless-stopped  --label autoheal=true
  HEALTHCHECK curl /health every 30 s, 15 min start-period
  GPU KV cache pool: 734,480 tokens
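A back-of-envelope reading of the KV pool figure against the other settings above. This is only a sketch: it ignores block-size rounding and any speculative-decoding overhead, and it assumes the ~11k-token Fabric system prompt is held once in the pool and shared through prefix caching. Names are illustrative.

```python
KV_POOL_TOKENS = 734_480       # "GPU KV cache pool" reported above
MAX_NUM_SEQS = 128             # --max-num-seqs
MAX_MODEL_LEN = 262_144        # --max-model-len
SYSTEM_PROMPT_TOKENS = 11_000  # "~11k tok" system prompt, assumed shared across sequences

def per_seq_budget(pool: int, seqs: int, shared_prefix: int = 0) -> int:
    """Average KV tokens available per sequence when all slots are busy."""
    return (pool - shared_prefix) // seqs

unshared = per_seq_budget(KV_POOL_TOKENS, MAX_NUM_SEQS)
shared = per_seq_budget(KV_POOL_TOKENS, MAX_NUM_SEQS, SYSTEM_PROMPT_TOKENS)
full_context_seqs = KV_POOL_TOKENS / MAX_MODEL_LEN

print(f"per-sequence budget, nothing shared: {unshared} tokens")
print(f"per-sequence budget, prompt shared:  {shared} tokens (beyond the prompt)")
print(f"full-length contexts that fit:       {full_context_seqs:.2f}")
```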

Limitations + caveats

Generated · Fabric Inference Stack · Kimi K2.6 on 8×H200