TL;DR — Unit economics for an "unlimited" enterprise / sovereign plan
All prices in USD
Capacity → subscriber-count derivation
Empirical bench (below) shows H200 node sustains ~100 simultaneous active users at "good to excellent" experience with real Fabric usage pattern.
Real data: peak hour in last 14 days saw 8 simultaneously-active users out of 30 subscribers → ~27% peak concurrency ratio.
Conservative pricing choice: assume 50% peak concurrency, so 2 subscribers per concurrent seat.
Per-node subscriber cap scales with hardware throughput at SLO floor. H200=200 (measured); B200≈700 (FP4 TP=4 × 2 replicas, Blackwell-native config); B300≈740 (modest gain over B200 since K2.6 already fits in B200 HBM).
Hardware: 8× B200 (Blackwell, 192 GB HBM3e) · Recommended config: Kimi K2.6 FP4 TP=4 × 2 replicas per node (Blackwell-native FP4 path) + Eagle 3.1 MLA (Blackwell-only) + Fabric system prompt
· Extrapolated ceiling: ~350 concurrent → ~700 subscribers/node (3.5× H200 per InferenceX FP4 TP=4 data at SLO floor)
B200 list price (USD)
Node cost / month (USD) (8 GPU × 730 h)
Per-subscriber cost (USD) (÷ 700 subscribers)
Per-sub price (USD) at 100% margin
Per-sub price (USD) at 200% margin
$6 / GPU-hour
$35,040
$50.06
$100.11 / mo
$150.17 / mo
$8 / GPU-hour
$46,720
$66.74
$133.49 / mo
$200.23 / mo
$10 / GPU-hour
$58,400
$83.43
$166.86 / mo
$250.29 / mo
Why FP4 TP=4 × 2 replicas wins on Blackwell: InferenceX shows B200 FP4 TP=4 sustains ~26 users/replica at the 50 tok/s SLO floor; 2 replicas per 8-GPU node = ~52 raw users/node (vs H200's ~15). With Eagle 3 + idle + prefix cache (our measured 7× boost over raw InferenceX), real capacity ≈ 350 concurrent.
Conservative fallback config (FP4 TP=8 single replica, comparable to H200's TP=8) gives ~140 concurrent → ~280 subscribers/node, yielding 2× higher prices.
Price range assumes B200 ≈ 2× H200 list (consistent with deploybase.ai); AWS/GCP can be 1.5–2× higher.
Hardware: 8× B300 (Blackwell Ultra, 288 GB HBM3e) · Recommended config: Kimi K2.6 FP4 TP=4 × 2 replicas + Eagle 3.1 MLA + Fabric system prompt
· Extrapolated ceiling: ~370 concurrent → ~740 subscribers/node (3.7× H200 per InferenceX FP4 TP=4 data — small gain over B200 because Kimi K2.6 INT4/FP4 fits in B200's 192 GB; B300's extra HBM is mostly idle here)
B300 list price (USD)
Node cost / month (USD) (8 GPU × 730 h)
Per-subscriber cost (USD) (÷ 740 subscribers)
Per-sub price (USD) at 100% margin
Per-sub price (USD) at 200% margin
$7.20 / GPU-hour
$42,048
$56.82
$113.64 / mo
$170.46 / mo
$9.60 / GPU-hour
$56,064
$75.76
$151.52 / mo
$227.28 / mo
$12 / GPU-hour
$70,080
$94.70
$189.41 / mo
$284.11 / mo
Price assumes B300 ≈ 1.2× B200 list. B200 is the better $/subscriber pick for Kimi K2.6 — B300's 50% more memory is wasted on a model that fits in 192 GB. B300 is the right choice if you upgrade to a larger model (K3, DeepSeek-V4-class) or run multi-tenant with longer context.
Hardware throughput comparison (raw — no spec decode, no idle)
From SemiAnalysis InferenceX benchmark: Kimi K2.5, ISL=8192, OSL=1024, vLLM, INT4, TP=8, no speculative decoding. Per-user decode tok/s at fixed concurrency.
GPU
C=4
C=8
C=16
C=32
C=64
vs H200 @ C=32
H200 (Hopper)
91.2
70.3
46.6
28.9
16.7
1.00×
B200 (Blackwell)
104.4
81.1
59.2
40.5
26.0
1.40×
B300 (Blackwell Ultra)
111.1
84.4
60.6
42.0
26.8
1.45×
MI355x (AMD)
41.7
34.7
21.9
16.3
10.7
0.56×
MI325x (AMD)
46.0
35.2
25.8
14.7
9.9
0.51×
MI300x (AMD)
43.8
31.7
22.7
12.9
8.6
0.45×
⚠ Important caveats — read before quoting any price
H200 is measured. B200 and B300 are extrapolated using SemiAnalysis InferenceX throughput ratios (1.40× and 1.45× respectively at C=32). We have NOT actually benchmarked Kimi K2.6 on either Blackwell SKU. Eagle 3.1 MLA + tokenspeed_mla backend should give an additional Blackwell-only boost we haven't quantified.
This is the Kimi node cost only. The all-in "unlimited Fabric" subscription must amortize fabric-large (H100), fabric-medium (GH200), fabric-small (A40s), voice (Omni+Parakeet+Orpheus), plus platform/auth/support/sales overhead. The per-seat price above is a floor.
"Concurrent active user" is closed-loop synthetic. Modeled "users" follow last-14-days empirical Fabric think-time (dominated by sub-5-second agentic-loop steps). Heavier real workloads (longer outputs, deeper tool chains, larger context) consume more per-seat capacity. Re-measure if usage patterns shift.
The 27% peak-concurrency figure comes from N=30 subscribers. Small-sample inference; not statistically robust. Re-baseline once subscriber count grows past ~200 before locking the 2× concurrency multiplier into financial models.
Capacity assumes only Kimi runs on the node. Any colocation reduces per-subscriber capacity proportionally.
GPU prices are list / commodity-provider estimates. AWS/GCP/Azure typically charge 1.5–2× more than commodity (CoreWeave, Lambda, RunPod, etc.). Self-owned hardware capex / depreciation has a different cost model entirely.
Capacity verdict
All users 🟢 Excellent
≤ 60 simultaneous users
Median user still 🟢 Excellent
~ 100 simultaneous users
SLO knee (17% drop below 30 tok/s)
~ 150 simultaneous users
25% unacceptable @
200 simultaneous users
Service-level objective
Each user is evaluated on its per-user decode tok/s p50 across all turns:
A request is also marked failed on HTTP non-200 or any transport error.
Per-level summary
N concurrent users
Rig
Turns
OK
Fail
Thru (req/s)
TTFT p50
TTFT p95
Turn p50
Turn p95
Turn p99
TPS p50
Per-user SLO
5
Mac→tunnel
64
64
0
0.35
0.31s
0.76s
2.50s
3.86s
4.29s
176
5/5 🟢
10
Mac→tunnel
72
69
3
0.38
0.34s
1.30s
2.65s
5.24s
5.34s
166
10/10 🟢
20
Mac→tunnel
220
217
3
1.19
0.37s
2.37s
3.66s
7.23s
7.81s
117
20/20 🟢
40
Mac→tunnel
458
451
7
2.36
0.39s
4.28s
4.94s
10.78s
11.80s
84
40/40 🟢
60
H200-local
550
550
0
3.06
0.32s
4.93s
5.76s
14.14s
14.98s
70
55 🟢5 🟡
100
H200-local
751
751
0
4.17
0.43s
6.94s
8.06s
18.66s
24.53s
51
53 🟢47 🟡
150
H200-local
1064
1061
3
5.89
1.13s
12.44s
13.36s
24.79s
30.91s
32
1 🟢124 🟡25 🔴
200
H200-local
1127
1124
3
6.24
6.81s
22.25s
19.64s
32.67s
39.22s
31.6
151 🟡49 🔴
Each level ran 180 s. N=5–40 from Mac via SSH tunnel (real users cascading to fabric-large during run). N=60–200 ran on the H200 box itself (localhost) for cleanest numbers.
Per-user breakdown at N=40 (40 concurrent users)
Every single user got 🟢 Excellent service. Min user 63.5 tok/s, median 83.9, max 104.7.
Failures, honestly
The bench measures two kinds of "failure":
HTTP non-200 — model rejected the request. Count across all runs: 0.
Transport disconnect (TCP reset / aiohttp "Server disconnected") before any HTTP response. 19 across all 4306 turns ≈ 0.44%.
Rig
Turns
Failures
Rate
Cause
Mac → SSH tunnel → CF → LiteLLM → vLLM
814
13
1.60%
Mostly Mac↔H200 SSH tunnel; some HTTP keep-alive
H200 → localhost vLLM (no tunnel, no network)
3492
6
0.17%
vLLM's uvicorn closes idle HTTP keep-alive after 5 s
Root cause of the H200-local failures: VLLM_HTTP_TIMEOUT_KEEP_ALIVE=5 (uvicorn default). Any aiohttp keep-alive connection idle for >5 s gets closed server-side. Empirical Fabric think-time has ~25% of gaps over 13 s, so this fires on long-idle users.
What this means for production: LiteLLM is configured with num_retries: 1 — any transport disconnect transparently retries on a fresh connection. Effective production failure rate ≈ 0.17% × 0.17% = ~0.0003%, well under the 0.1% target.
Hardening applied after bench completed:
✅ VLLM_HTTP_TIMEOUT_KEEP_ALIVE=75 set on the vllm-kimi26 container — server-side keep-alive bumped from 5 s → 75 s, eliminates idle-disconnect for any reasonable think time.
✅ -o TCPKeepAlive=yes added to the Buzz → H200 h200-tunnel.service systemd unit — extra resilience against middlebox idle timeouts on the tunnel itself.
Re-measurement of the failure rate with these fixes in place is pending (would require restarting the isolated bench, which would knock real users off Kimi again).
Methodology
Workload
Shared system prompt: Fabric's actual manual_chat.txt + coding_agent.txt concatenated (~11k tokens) — same prefix the real product sends.
Random programming questions from a pool of 20 realistic asks (algorithms, refactors, language idioms, system design).
Multi-turn: each "user" runs a closed-loop conversation; conversation resets every 6 turns. Each turn appends the prior assistant response to grow the context.
Output cap: max_tokens=384 per turn.
Think-time distribution — empirically derived from real Fabric traffic
Pulled from LiteLLM_SpendLogs on Server A: 14 days of real Fabric traffic, 32 real users, 48,939 intra-session gaps,
filtered to fabric-xlarge + fabric-large only (excludes title-gen, summarization, etc.).
Gap bucket
Weight
Interpretation
0.5 – 5 s
57.0%
interrupt / agentic loop step
5 – 15 s
18.8%
rapid reply
15 – 30 s
11.3%
quick reply
30 – 60 s
5.4%
normal think
60 – 180 s
4.1%
actual thinking
180 – 600 s
1.4%
slow / context switch
600 – 1800 s
1.3%
idle-ish
Per-turn think time is sampled from this distribution (weighted bucket → uniform within bucket).
The dominant mode (57% under 5 s) reflects Fabric's agentic loop firing sequential model calls — one human action triggers many sub-requests in seconds.
Test isolation
Buzz → H200 SSH tunnel stopped (systemctl stop h200-tunnel.service) for the duration.
Real-user requests for fabric-xlarge cascaded to qwen3.5-397b via the LiteLLM fallback chain (verified before bench started).
N=5–40 ran from Mac via a separate SSH tunnel; N=60+ ran directly on the H200 box itself for cleanest numbers.
"User" = a closed-loop synthetic worker, not a real Fabric user. Real users have heterogeneous prompt sizes, varying tool-call depths, and bursts.
Output capped at 384 tok/turn. Real conversations sometimes generate 1–2k tokens; longer outputs eat more compute per turn (TPS unchanged but latency stretches).
Empirical distribution conflates two behaviors: human think time + agentic-loop inner-loop. The 57% <5 s bucket is mostly the latter.
Cold prefix-cache numbers would be worse. The empirical workload shares prefix heavily; first request per session takes ~10× the TTFT.
Numbers are "right-now" capacity — assumes no other models on the box. fabric-large stays on H100; fabric-medium on GH200; nothing competes with Kimi on H200.
Generated · Fabric Inference Stack · Kimi K2.6 on 8×H200