Kimi K2.6 — Capacity under realistic Fabric load

8× H200 · vLLM 0.21 · Eagle 3 spec decode · INT4 weights · --max-num-seqs 128 · empirically-driven think-time distribution

TL;DR — Unit economics for an "unlimited" enterprise / sovereign plan

All prices in USD

Capacity → subscriber-count derivation

Empirical bench (below) shows that one 8× H200 node sustains ~100 simultaneous active users at a "good to excellent" experience under the real Fabric usage pattern.
Real data: the peak hour in the last 14 days saw 8 simultaneously active users out of 30 subscribers, a ~27% peak concurrency ratio.
Conservative pricing choice: assume 50% peak concurrency, so 2 subscribers per concurrent seat.
Per-node subscriber cap scales with hardware throughput at SLO floor. H200=200 (measured); B200≈700 (FP4 TP=4 × 2 replicas, Blackwell-native config); B300≈740 (modest gain over B200 since K2.6 already fits in B200 HBM).
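The derivation above reduces to one division: the measured concurrent ceiling divided by the assumed peak-concurrency ratio. A minimal sketch follows; the function and variable names are illustrative and do not come from the bench harness.

```python
import math

def subscribers_per_node(concurrent_ceiling: int, peak_concurrency: float) -> int:
    """Subscribers one node can carry when at most `peak_concurrency`
    of them are active at the same moment."""
    if not 0 < peak_concurrency <= 1:
        raise ValueError("peak_concurrency must be in (0, 1]")
    # Small epsilon guards against float results such as 374.99999999999994.
    return math.floor(concurrent_ceiling / peak_concurrency + 1e-9)

observed_ratio = 8 / 30  # peak hour: 8 active users out of 30 subscribers
cap_conservative = subscribers_per_node(100, 0.50)           # pricing assumption
cap_if_observed = subscribers_per_node(100, observed_ratio)  # if ~27% held at scale

print(f"observed peak concurrency: {observed_ratio:.0%}")
print(f"subscribers/node at 50% concurrency: {cap_conservative}")
print(f"subscribers/node at observed ratio:  {cap_if_observed}")
```

The gap between the two caps is the headroom bought by pricing at 50% rather than at the observed ratio.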

Hardware: 8× H200 SXM (141 GB HBM3e) · Config: Kimi K2.6 INT4 TP=8 + Eagle 3 spec decode + Fabric system prompt (~11k tok) · Measured ceiling: 100 concurrent → 200 subscribers/node

H200 list price (USD)	Node cost / month (USD) (8 GPU × 730 h)	Per-subscriber cost (USD) (÷ 200 subscribers)	Per-sub price (USD) at 50% gross margin	Per-sub price (USD) at 66.7% gross margin
$3 / GPU-hour	$17,520	$87.60	$175.20 / mo	$262.80 / mo
$4 / GPU-hour	$23,360	$116.80	$233.60 / mo	$350.40 / mo
$5 / GPU-hour	$29,200	$146.00	$292.00 / mo	$438.00 / mo

B200 list price (USD)	Node cost / month (USD) (8 GPU × 730 h)	Per-subscriber cost (USD) (÷ 700 subscribers)	Per-sub price (USD) at 50% gross margin	Per-sub price (USD) at 66.7% gross margin
$6 / GPU-hour	$35,040	$50.06	$100.11 / mo	$150.17 / mo
$8 / GPU-hour	$46,720	$66.74	$133.49 / mo	$200.23 / mo
$10 / GPU-hour	$58,400	$83.43	$166.86 / mo	$250.29 / mo

B300 list price (USD)	Node cost / month (USD) (8 GPU × 730 h)	Per-subscriber cost (USD) (÷ 740 subscribers)	Per-sub price (USD) at 50% gross margin	Per-sub price (USD) at 66.7% gross margin
$7.20 / GPU-hour	$42,048	$56.82	$113.64 / mo	$170.46 / mo
$9.60 / GPU-hour	$56,064	$75.76	$151.52 / mo	$227.28 / mo
$12 / GPU-hour	$70,080	$94.70	$189.41 / mo	$284.11 / mo
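The three tables above follow one formula: node cost = GPU-hour price × 8 GPUs × 730 h, divided by the per-node subscriber cap, then marked up to the target gross margin. The sketch below reproduces it, treating the 66.7% tier as a gross margin of exactly 2/3 (price = 3× cost), which matches the table values. The function name is illustrative.

```python
HOURS_PER_MONTH = 730
GPUS_PER_NODE = 8

def per_sub_pricing(gpu_hour_usd: float, subs_per_node: int, gross_margin: float):
    """Return (node cost / month, cost per subscriber, price per subscriber) in USD."""
    node_cost = gpu_hour_usd * GPUS_PER_NODE * HOURS_PER_MONTH
    cost_per_sub = node_cost / subs_per_node
    # Gross margin = (price - cost) / price, so price = cost / (1 - margin).
    price = cost_per_sub / (1.0 - gross_margin)
    return node_cost, cost_per_sub, price

# Lowest list price per SKU from the tables, with the per-node subscriber caps.
for label, rate, subs in [("H200", 3.0, 200), ("B200", 6.0, 700), ("B300", 7.20, 740)]:
    for margin in (0.5, 2 / 3):
        node_cost, cost, price = per_sub_pricing(rate, subs, margin)
        print(f"{label} ${rate:.2f}/GPU-h, GM {margin:.1%}: "
              f"node ${node_cost:,.0f}/mo, cost ${cost:.2f}/sub, price ${price:.2f}/sub")
```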

Where it lands per-tier (real InferenceX numbers, GB200 NVL72, FP4)

Operating point	TP	Concurrent users	tok/s/user	TTFT p50	SLO tier
"Interactive sweet spot"	24	60	92.5	516 ms	🟢 Excellent
"Excellent floor"	28	512	48.8	1473 ms	🟡 Meets (borderline)
"High throughput"	28	1024	46.2	9433 ms	🟡 Meets (TTFT bad)
"Max throughput"	40	4096	36.3	38003 ms	🟡 Meets (TTFT broken)

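Per-subscriber economics (using "Interactive sweet spot", the only operating point that keeps Fabric agentic UX viable)

B200 list price (USD)	NVL72 rack cost / month (USD) (72 GPU × 730 h)	Per-subscriber cost (USD) (÷ 120 subs at 60 concurrent × 2)	Per-sub price (USD) at 50% gross margin	Per-sub price (USD) at 66.7% gross margin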
$6 / GPU-hour	$315,360	$2,628.00	$5,256.00 / mo	$7,884.00 / mo
$8 / GPU-hour	$420,480	$3,504.00	$7,008.00 / mo	$10,512.00 / mo
$10 / GPU-hour	$525,600	$4,380.00	$8,760.00 / mo	$13,140.00 / mo

Hardware throughput comparison (raw — no spec decode, no idle)

From SemiAnalysis InferenceX benchmark: Kimi K2.5, ISL=8192, OSL=1024, vLLM, INT4, TP=8, no speculative decoding. Per-user decode tok/s at fixed concurrency.

GPU	C=4	C=8	C=16	C=32	C=64	vs H200 @ C=32
H200 (Hopper, 8-GPU node)	91.2	70.3	46.6	28.9	16.7	1.00×
B200 (Blackwell, 8-GPU node)	104.4	81.1	59.2	40.5	26.0	1.40×
B300 (Blackwell Ultra, 8-GPU node)	111.1	84.4	60.6	42.0	26.8	1.45×
GB200 NVL72 (FP4, TP=24, sweet spot)	140.8 (C=4)	129.1 (C=8)	114.7 (C=16)	96.9 (C=32)	92.5 (C=60)	3.35× (per-user, low-C only)
MI355x (AMD)	41.7	34.7	21.9	16.3	10.7	0.56×
MI325x (AMD)	46.0	35.2	25.8	14.7	9.9	0.51×
MI300x (AMD)	43.8	31.7	22.7	12.9	8.6	0.45×
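The "vs H200 @ C=32" column is each GPU's per-user decode rate at C=32 divided by the H200 figure. A short sketch that recomputes it from the table values (the dictionary name is illustrative):

```python
# Per-user decode tok/s at C=32, copied from the table above.
decode_tps_c32 = {
    "H200": 28.9,
    "B200": 40.5,
    "B300": 42.0,
    "GB200 NVL72": 96.9,
    "MI355x": 16.3,
    "MI325x": 14.7,
    "MI300x": 12.9,
}

baseline = decode_tps_c32["H200"]
ratios = {gpu: round(tps / baseline, 2) for gpu, tps in decode_tps_c32.items()}

for gpu, ratio in ratios.items():
    print(f"{gpu:12s} {ratio:.2f}x")
```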

⚠ Important caveats — read before quoting any price

H200 is measured. B200 and B300 are extrapolated using SemiAnalysis InferenceX throughput ratios (1.40× and 1.45× respectively at C=32). We have NOT actually benchmarked Kimi K2.6 on either Blackwell SKU. Eagle 3.1 MLA + tokenspeed_mla backend should give an additional Blackwell-only boost we haven't quantified.
This is the Kimi node cost only. The all-in "unlimited Fabric" subscription must amortize fabric-large (H100), fabric-medium (GH200), fabric-small (A40s), voice (Omni+Parakeet+Orpheus), plus platform/auth/support/sales overhead. The per-seat price above is a floor.
"Concurrent active user" is closed-loop synthetic. Modeled "users" follow last-14-days empirical Fabric think-time (dominated by sub-5-second agentic-loop steps). Heavier real workloads (longer outputs, deeper tool chains, larger context) consume more per-seat capacity. Re-measure if usage patterns shift.
The 27% peak-concurrency figure comes from N=30 subscribers. Small-sample inference; not statistically robust. Re-baseline once subscriber count grows past ~200 before locking the 2× concurrency multiplier into financial models.
Capacity assumes only Kimi runs on the node. Any colocation reduces per-subscriber capacity proportionally.
GPU prices are list / commodity-provider estimates. AWS/GCP/Azure typically charge 1.5–2× more than commodity (CoreWeave, Lambda, RunPod, etc.). Self-owned hardware capex / depreciation has a different cost model entirely.

Company P&L — net margin at full headcount

Operating model · CAD costs, USD revenue

The per-subscriber prices above reflect gross margin on compute only. This section layers the fixed cost of running the company on top to show the net margin of the business.

Staff (10 × CAD $150k): $125k/mo CAD
Other opex (fixed): $15k/mo CAD
Total fixed nut: $140k/mo CAD · $1.68M/yr
Fixed nut in USD: ≈ $102k/mo @ 1.37 CAD/USD

The relationship

Net margin = Gross margin − (Fixed cost ÷ Revenue)

Net margin rises toward the compute gross-margin ceiling (50% or 66.7%) as you scale — the CAD $140k/mo fixed cost is a drag that shrinks as a share of revenue. Break-even revenue = Fixed ÷ gross margin.

Break-even — subscribers & nodes just to cover staff + opex

Pricing tier	Per-sub gross profit (USD)	Break-even subscribers	Subs / node	Nodes needed
H200 @ 50% GM ($175/sub)	$87.60	~1,170	200	6 (5.85)
H200 @ 66.7% GM ($263/sub)	$175.20	~585	200	3 (2.93)
B200 @ 50% GM ($100/sub)	$50.06	~2,040	700	3 (2.92)
B200 @ 66.7% GM ($150/sub)	$100.11	~1,020	700	2 (1.46)

H200 breaks even at fewer subscribers (each pays ~1.75× more) but needs more nodes, since each node holds only 200 seats. B200 needs more subscribers but a smaller hardware footprint, and is the better $/subscriber pick.
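The break-even rows can be recomputed from the fixed cost and the per-subscriber gross profit. This is a sketch under the document's assumed 1.37 CAD/USD; names are illustrative. It prints exact counts, whereas the table rounds subscriber counts to the nearest ~10.

```python
import math

FIXED_USD = 140_000 / 1.37  # CAD 140k/mo at the assumed 1.37 CAD/USD

def break_even(gross_profit_per_sub: float, subs_per_node: int):
    """Return (subscribers needed, exact node count, whole nodes to provision)."""
    subs = FIXED_USD / gross_profit_per_sub
    nodes_exact = subs / subs_per_node
    return math.ceil(subs), nodes_exact, math.ceil(nodes_exact)

# (per-sub gross profit in USD, subscribers per node), from the table above.
tiers = {
    "H200 @ 50% GM": (87.60, 200),
    "H200 @ 66.7% GM": (175.20, 200),
    "B200 @ 50% GM": (50.06, 700),
    "B200 @ 66.7% GM": (100.11, 700),
}

results = {name: break_even(*args) for name, args in tiers.items()}
for name, (subs, nodes_exact, nodes) in results.items():
    print(f"{name:16s} {subs:5d} subs  {nodes_exact:.2f} nodes -> provision {nodes}")
```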

Net profit & net margin at scale — B200 @ 50% gross margin

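$100.11/sub revenue · $50.06/sub gross profit · 700 subs/node · node counts rounded up to whole units (fractional count in parentheses); net profit uses the smooth per-subscriber gross profit, not whole-node cost.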

Subscribers	Nodes (B200)	Revenue / mo (USD)	Net profit / mo (USD)	Net margin
1,000	2 (1.4)	$100,110	−$52,130	loss
2,040	3 (2.9)	$204,224	~$0	0%
3,000	5 (4.3)	$300,330	+$47,980	16%
5,000	8 (7.1)	$500,550	+$148,100	30%
10,000	15 (14.3)	$1,001,100	+$398,380	40%

At the richer 66.7% gross-margin tier ($150.17/sub) the curve is steeper: break-even at ~1,020 subs, net margin ~53% at 5,000 subs and ~60% at 10,000 (66.7% minus fixed cost as a share of the higher revenue).
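The B200 @ 50% table can be reproduced, and compared against whole-node billing, with the sketch below. It assumes the $6/GPU-hour B200 price and the 1.37 CAD/USD rate; names are illustrative. Results differ from the table by a few tens of dollars because the price is rounded to $100.11 here.

```python
import math

PRICE_USD = 100.11           # B200 @ 50% gross margin, $6/GPU-hour
NODE_COST_USD = 35_040       # 8 GPUs x 730 h x $6
SUBS_PER_NODE = 700
FIXED_USD = 140_000 / 1.37   # CAD 140k/mo at the assumed FX rate

def monthly_net(subs: int, whole_nodes: bool = False):
    """Return (revenue, net profit, net margin) for one month."""
    revenue = subs * PRICE_USD
    nodes = subs / SUBS_PER_NODE
    if whole_nodes:
        nodes = math.ceil(nodes)  # a partly filled node still costs a full node
    profit = revenue - nodes * NODE_COST_USD - FIXED_USD
    return revenue, profit, profit / revenue

for subs in (1_000, 3_000, 5_000, 10_000):
    _, smooth, smooth_margin = monthly_net(subs)
    _, stepped, stepped_margin = monthly_net(subs, whole_nodes=True)
    print(f"{subs:6d} subs: smooth {smooth:>10,.0f} ({smooth_margin:.0%})  "
          f"whole nodes {stepped:>10,.0f} ({stepped_margin:.0%})")
```

The `whole_nodes=True` column is the more conservative reading: it charges for idle capacity on the last, partly filled node.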

Reading the numbers

A 10-person team is a CAD $1.68M/yr fixed nut. You need roughly 600–2,000 paying subscribers just to cover it, depending on hardware and tier.
Net margin never quite reaches the gross-margin ceiling: at 5,000 B200 subs you're at ~30% (50% tier) / ~53% (66.7% tier).
Node counts round up: a partly-filled node still costs a full node, so real economics lag the smooth curve until each node fills. At 3,000 subs you pay for 5 B200 nodes but bill only ~4.3 — that idle ~0.7 node (~$24k/mo at $6/GPU-hr) eats into net profit until you grow into it.
FX is an assumed 1.37 CAD/USD — adjust if the rate moves. Net profit shown in USD; revenue is naturally USD (subscriber billing), fixed costs are CAD payroll/opex.

Alternate scenario — usage-based pricing (Kimi published rates)

Real usage · 31 Kimi users · 2026-05-09 → 06-20 · normalized to monthly

Instead of a flat subscription, charge per token at Kimi K2.6 published rates (= Moonshot): $0.19/M cache-hit input · $0.95/M cache-miss input · $4.00/M output. Modeled against actual per-user token consumption for openai/kimi-k2.6 (9.85B input : 79.4M output = 124:1). Because input dwarfs output, the cache-hit ratio is the dominant lever on revenue.

Cache-hit ratio: 75%
Blended input price: $0.38 /M (0.75 × $0.19 + 0.25 × $0.95)

⚠ The backend cache-hit number is currently unreliable for these models (fabric-large reports high cache hits on repeated turns — suspected correct; fabric-xlarge reports low — suspected a measurement bug). Until that's resolved, treat this as a tunable assumption. Agentic coding resends a large shared context every turn, so the true ratio is likely high (70–90%).
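Since the cache-hit ratio is a tunable assumption, it helps to make the blended input price explicit. The sketch below uses the published rates quoted above; the function names and the example user's token volumes are hypothetical.

```python
HIT_USD_PER_M = 0.19     # cache-hit input, per million tokens
MISS_USD_PER_M = 0.95    # cache-miss input, per million tokens
OUTPUT_USD_PER_M = 4.00  # output, per million tokens

def blended_input_price(hit_ratio: float) -> float:
    """Weighted input price per million tokens for a given cache-hit ratio."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be in [0, 1]")
    return hit_ratio * HIT_USD_PER_M + (1.0 - hit_ratio) * MISS_USD_PER_M

def usage_bill(input_mtok: float, output_mtok: float, hit_ratio: float) -> float:
    """Monthly bill in USD for token volumes given in millions of tokens."""
    return input_mtok * blended_input_price(hit_ratio) + output_mtok * OUTPUT_USD_PER_M

# Hypothetical user: 100M input tokens and 1M output tokens in a month.
for ratio in (0.0, 0.5, 0.75, 0.9):
    print(f"hit ratio {ratio:.0%}: input ${blended_input_price(ratio):.3f}/M, "
          f"bill ${usage_bill(100, 1, ratio):.2f}")
```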

[Interactive panel: total income / mo · mean / user · median / user · top user / mo; values not captured]

[Figure: CDF, % of users earning ≤ $X/mo]

What this tells you

Income per user is a brutal power law. The median user earns a fraction of the mean — a couple of whales (top 2 users) drive most of the revenue, while the long tail of light users would pay very little.
At any realistic cache-hit ratio, the median user's usage-based bill is well below the $100–175/mo flat subscription. Usage pricing captures the whales' true cost but collapses revenue from light users — the opposite risk profile to all-you-can-eat.
Total monthly income across all 31 users ranges from ~$2.0k (90% cache hit) to ~$6.7k (0% cache hit) — a 3.3× swing entirely driven by the cache-hit ratio.

Recommended model — $25 prepaid credit pack

Low barrier · usage-scaled · whale-safe

Start using Fabric today for $25. A credit pack buys token usage at Kimi rates ($0.19/M cache-hit · $0.95/M cache-miss in · $4/M out); when you run out you buy more, and credits reset monthly. The base pack is anchored at the median user, so most people never think about topping up, while heavy users scale naturally up to ~$1,000/mo. The figures below depend on the pack price and on the cache-hit ratio assumed above (75%).

Base pack price: $25

[Interactive panel: never top up (≤ base) · book revenue / mo · floor "breakage" uplift · top user / mo; values not captured]

Why this wins

Low barrier: $25 entry — no flat-rate sticker shock, widest possible funnel.
The floor captures the long tail: light users pay the full pack but consume less, so book revenue runs above pure pay-as-you-go (the "breakage" KPI) — only if credits expire monthly, not roll over.
Whale-safe: heavy users pay for exactly what they burn (up to ~$1k) — no pooling bleed from a flat plan.
High margin per credit: credits are priced at Kimi retail rates but served from self-hosted capacity that costs less, so every credit carries the gross margin from the pricing section above.
Caveat: these 31 are mostly internal/partner users. Public signups skew lighter, which makes the $25 floor more valuable — re-baseline once real customers arrive.
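The pack mechanics can be sketched as a floor on each user's bill. The usage figures below are hypothetical, not the 31-user data set. The sketch also assumes top-ups are billed at exact usage rather than in whole packs, and that unused credits expire monthly.

```python
def pack_billing(usage_usd, base_pack: float = 25.0):
    """Summarize a month of prepaid-pack billing.

    usage_usd: each user's usage valued at Kimi rates (USD).
    Every user pays at least `base_pack`; heavier users pay their usage.
    """
    billed = [max(usage, base_pack) for usage in usage_usd]
    pay_as_you_go = sum(usage_usd)
    book_revenue = sum(billed)
    never_top_up = sum(1 for usage in usage_usd if usage <= base_pack)
    return {
        "book_revenue": book_revenue,
        "breakage_uplift": book_revenue - pay_as_you_go,
        "never_top_up_share": never_top_up / len(usage_usd),
        "top_user": max(billed),
    }

# Hypothetical monthly usage: a light long tail plus one whale.
hypothetical_usage = [3.0, 10.0, 25.0, 60.0, 900.0]
summary = pack_billing(hypothetical_usage)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The breakage uplift disappears if credits roll over, which is why the expiry rule matters to the revenue figure.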

Capacity verdict

Nearly all users 🟢 Excellent (55 of 60): ≤ 60 simultaneous users
Median user still 🟢 Excellent: ~100 simultaneous users
SLO knee (17% drop below 30 tok/s): ~150 simultaneous users
25% unacceptable: 200 simultaneous users

Service-level objective

Each user is evaluated on its per-user decode tok/s p50 across all turns:

Tier	Per-user decode tok/s (p50)
🟢 Excellent	> 50
🟡 Meets expectations	30 – 50
🔴 Unacceptable	< 30

A request is also marked failed on HTTP non-200 or any transport error.
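A minimal sketch of the per-user classification, assuming the p50 is the median of a user's per-turn decode rates and that the boundary values 30 and 50 tok/s fall in the "meets" tier. Function names are illustrative, not taken from the bench harness.

```python
from statistics import median
from typing import Optional, Sequence

def slo_tier(turn_tps: Sequence[float]) -> str:
    """Classify one user from the decode tok/s of each of their turns."""
    p50 = median(turn_tps)
    if p50 > 50:
        return "excellent"
    if p50 >= 30:
        return "meets"
    return "unacceptable"

def request_failed(http_status: Optional[int], transport_error: bool) -> bool:
    """A request fails on any transport error or any non-200 response."""
    return transport_error or http_status != 200

print(slo_tier([63.5, 83.9, 104.7]))
print(slo_tier([28.0, 31.0, 45.0]))
print(request_failed(200, False), request_failed(None, True))
```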

Per-level summary

N concurrent users	Rig	Turns	OK	Fail	Thru (req/s)	TTFT p50	TTFT p95	Turn p50	Turn p95	Turn p99	TPS p50	Per-user SLO
5	Mac→tunnel	64	64	0	0.35	0.31s	0.76s	2.50s	3.86s	4.29s	176	5/5 🟢
10	Mac→tunnel	72	69	3	0.38	0.34s	1.30s	2.65s	5.24s	5.34s	166	10/10 🟢
20	Mac→tunnel	220	217	3	1.19	0.37s	2.37s	3.66s	7.23s	7.81s	117	20/20 🟢
40	Mac→tunnel	458	451	7	2.36	0.39s	4.28s	4.94s	10.78s	11.80s	84	40/40 🟢
60	H200-local	550	550	0	3.06	0.32s	4.93s	5.76s	14.14s	14.98s	70	55 🟢 5 🟡
100	H200-local	751	751	0	4.17	0.43s	6.94s	8.06s	18.66s	24.53s	51	53 🟢 47 🟡
150	H200-local	1064	1061	3	5.89	1.13s	12.44s	13.36s	24.79s	30.91s	32	1 🟢 124 🟡 25 🔴
200	H200-local	1127	1124	3	6.24	6.81s	22.25s	19.64s	32.67s	39.22s	31.6	151 🟡 49 🔴

Each level ran 180 s. N=5–40 from Mac via SSH tunnel (real users cascading to fabric-large during run). N=60–200 ran on the H200 box itself (localhost) for cleanest numbers.

Per-user breakdown at N=40 (40 concurrent users)

Every single user got 🟢 Excellent service. Min user 63.5 tok/s, median 83.9, max 104.7.

Failures, honestly

The bench measures two kinds of "failure":

HTTP non-200 — model rejected the request. Count across all runs: 0.
Transport disconnect (TCP reset / aiohttp "Server disconnected") before any HTTP response. 19 across all 4306 turns ≈ 0.44%.

Rig	Turns	Failures	Rate	Cause
Mac → SSH tunnel → CF → LiteLLM → vLLM	814	13	1.60%	Mostly Mac↔H200 SSH tunnel; some HTTP keep-alive
H200 → localhost vLLM (no tunnel, no network)	3492	6	0.17%	vLLM's uvicorn closes idle HTTP keep-alive after 5 s

Root cause of the H200-local failures: VLLM_HTTP_TIMEOUT_KEEP_ALIVE=5 (uvicorn default). Any aiohttp keep-alive connection idle for >5 s gets closed server-side. Empirical Fabric think-time has ~25% of gaps over 13 s, so this fires on long-idle users.

What this means for production: LiteLLM is configured with num_retries: 1, so any transport disconnect is transparently retried on a fresh connection. Assuming the retry fails independently of the first attempt, the effective production failure rate ≈ 0.17% × 0.17% = ~0.0003%, well under the 0.1% target.
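The retry arithmetic can be written out as a sketch. It treats each attempt as an independent trial, which is an assumption: correlated failures, such as a server restart, would make the real rate higher.

```python
def failure_rate_with_retries(p_single: float, num_retries: int) -> float:
    """Probability that the first attempt and every retry all fail,
    assuming attempts fail independently with probability p_single."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    return p_single ** (num_retries + 1)

p_local = 6 / 3492  # H200-local rig: 6 transport failures in 3492 turns
p_effective = failure_rate_with_retries(p_local, num_retries=1)

print(f"single attempt: {p_local:.2%}")
print(f"with 1 retry:   {p_effective:.5%}")
print(f"under 0.1% target: {p_effective < 0.001}")
```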

Hardening applied after bench completed:

✅ VLLM_HTTP_TIMEOUT_KEEP_ALIVE=75 set on the vllm-kimi26 container — server-side keep-alive bumped from 5 s → 75 s, eliminates idle-disconnect for any reasonable think time.
✅ -o TCPKeepAlive=yes added to the Buzz → H200 h200-tunnel.service systemd unit — extra resilience against middlebox idle timeouts on the tunnel itself.

Re-measurement of the failure rate with these fixes in place is pending (would require restarting the isolated bench, which would knock real users off Kimi again).

Methodology

Workload

Shared system prompt: Fabric's actual manual_chat.txt + coding_agent.txt concatenated (~11k tokens) — same prefix the real product sends.
Random programming questions from a pool of 20 realistic asks (algorithms, refactors, language idioms, system design).
Multi-turn: each "user" runs a closed-loop conversation; conversation resets every 6 turns. Each turn appends the prior assistant response to grow the context.
Output cap: max_tokens=384 per turn.

Think-time distribution — empirically derived from real Fabric traffic

Pulled from LiteLLM_SpendLogs on Server A: 14 days of real Fabric traffic, 32 real users, 48,939 intra-session gaps, filtered to fabric-xlarge + fabric-large only (excludes title-gen, summarization, etc.).

Gap bucket	Weight	Interpretation
0.5 – 5 s	57.0%	interrupt / agentic loop step
5 – 15 s	18.8%	rapid reply
15 – 30 s	11.3%	quick reply
30 – 60 s	5.4%	normal think
60 – 180 s	4.1%	actual thinking
180 – 600 s	1.4%	slow / context switch
600 – 1800 s	1.3%	idle-ish

Per-turn think time is sampled from this distribution (weighted bucket → uniform within bucket). The dominant mode (57% under 5 s) reflects Fabric's agentic loop firing sequential model calls — one human action triggers many sub-requests in seconds.
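The two-stage sampling described above (weighted bucket, then uniform within the bucket) can be sketched with the standard library. The listed weights sum to 99.3%; `random.choices` accepts relative weights, so the sketch effectively renormalizes them. Names are illustrative and not taken from the bench harness.

```python
import random

# (low seconds, high seconds, weight %) from the gap-bucket table above.
THINK_TIME_BUCKETS = [
    (0.5, 5, 57.0),
    (5, 15, 18.8),
    (15, 30, 11.3),
    (30, 60, 5.4),
    (60, 180, 4.1),
    (180, 600, 1.4),
    (600, 1800, 1.3),
]
WEIGHTS = [weight for _, _, weight in THINK_TIME_BUCKETS]

def sample_think_time(rng: random.Random) -> float:
    """Pick a bucket by weight, then draw uniformly inside it (seconds)."""
    low, high, _ = rng.choices(THINK_TIME_BUCKETS, weights=WEIGHTS, k=1)[0]
    return rng.uniform(low, high)

rng = random.Random(0)  # fixed seed so a bench run is repeatable
samples = [sample_think_time(rng) for _ in range(20_000)]
share_under_5s = sum(s < 5 for s in samples) / len(samples)
print(f"share of gaps under 5 s: {share_under_5s:.1%}")
```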

Test isolation

Buzz → H200 SSH tunnel stopped (systemctl stop h200-tunnel.service) for the duration.
Real-user requests for fabric-xlarge cascaded to qwen3.5-397b via the LiteLLM fallback chain (verified before bench started).
N=5–40 ran from Mac via a separate SSH tunnel; N=60+ ran directly on the H200 box itself for cleanest numbers.

Runtime config

vllm/vllm-openai:latest
  --model /models/Kimi-K2.6 (INT4)
  --tensor-parallel-size 8
  --max-num-seqs 128
  --max-model-len 262144
  --gpu-memory-utilization 0.92
  --enable-prefix-caching
  --no-async-scheduling          # defense vs LMCache #2356 (LMCache no longer in stack)
  --speculative-config '{"model":"/models/kimi-k2.6-eagle3","method":"eagle3","num_speculative_tokens":3}'
  --enable-auto-tool-choice --tool-call-parser kimi_k2 --reasoning-parser kimi_k2
  --init   --restart unless-stopped  --label autoheal=true
  HEALTHCHECK curl /health every 30 s, 15 min start-period
  GPU KV cache pool: 734,480 tokens

Limitations + caveats

"User" = a closed-loop synthetic worker, not a real Fabric user. Real users have heterogeneous prompt sizes, varying tool-call depths, and bursts.
Output capped at 384 tok/turn. Real conversations sometimes generate 1–2k tokens; longer outputs eat more compute per turn (TPS unchanged but latency stretches).
Empirical distribution conflates two behaviors: human think time + agentic-loop inner-loop. The 57% <5 s bucket is mostly the latter.
Cold prefix-cache numbers would be worse. The empirical workload shares prefix heavily; first request per session takes ~10× the TTFT.
Numbers are "right-now" capacity — assumes no other models on the box. fabric-large stays on H100; fabric-medium on GH200; nothing competes with Kimi on H200.

Generated · Fabric Inference Stack · Kimi K2.6 on 8×H200
