Back to browse
NVFP4 on Desktop Blackwell – 122B MoE on a Single RTX PRO 6000 31 tok/s

NVFP4 on Desktop Blackwell – 122B MoE on a Single RTX PRO 6000 31 tok/s

by jcartu·Mar 9, 2026·2 points·0 comments

AI Analysis

●●●BangerWizardryDark Horse

Bypasses NVIDIA's artificial FP4 lock—122B MoE on single desktop GPU at 31 tok/s.

Strengths
  • Identifies dispatch logic lock, not hardware limitation—SM120 same as datacenter chips
  • Real benchmarks: 31 tok/s, 89GB VRAM, piecewise CUDA graphs working today
  • CUTLASS 4.2+ kernels already exist—this is Python-level capability check workaround
Weaknesses
  • Requires nightly vLLM and specific GPU (RTX PRO 6000/Blackwell)
  • NVIDIA could patch this dispatch bypass in future driver releases
Category
Target Audience

ML engineers running large models on consumer hardware

Similar To

vLLM · llama.cpp · TensorRT-LLM

Post Description

Qwen 3.5 122B-A10B (MoE, ~10B active parameters) running in native NVFP4 on a single RTX PRO 6000 Blackwell GPU. 31 tokens/sec, 89GB VRAM, piecewise CUDA graphs. No multi-GPU, no cloud.

Why this matters: NVIDIA's TRT-LLM explicitly blocks desktop Blackwell from FP4 — the error literally says "FP4 Gemm not supported before Blackwell, nor GeForce Blackwell." The RTX 5090, PRO 6000, and DGX Spark all use SM120 — same FP4 tensor cores as the B100/B200 datacenter chips (SM100). The lock is artificial product segmentation, not a hardware limitation.

CUTLASS 4.2+ already ships SM120 FP4 kernels. They're compiled into vLLM. The problem is purely dispatch logic — Python-level capability checks that only recognize SM100, not SM120.

Setup (vLLM 0.17.0, stable pip install):

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model Sehyo/Qwen3.5-122B-A10B-NVFP4 --port 8100 --max-model-len 4096 --gpu-memory-utilization 0.85 --compilation-config '{"cudagraph_mode": "piecewise"}'

Key gotchas: (1) Do NOT pass --quantization flag, model uses compressed-tensors format and vLLM auto-detects. (2) Full CUDA graphs OOM — use piecewise mode (31 tok/s vs 12 tok/s eager). (3) Python 3.14 breaks numba, stick with 3.13.

Results: 31 tok/s on 1 GPU vs 54 tok/s on 2 GPUs with Q8_0 llama.cpp. Half the hardware, ~60% the speed, ~98% the quality.

The broader point: SM120 and SM100 share the same FP4 tensor core architecture. CUTLASS has the kernels. The frameworks just need to route SM120 to them. A 122B MoE model on a single desktop GPU at 31 tok/s was datacenter-only six months ago.

Relevant issues: vLLM #33416, SGLang #18954, CUTLASS #2800. We're submitting a PR (~10 lines of Python).

Model: https://huggingface.co/Sehyo/Qwen3.5-122B-A10B-NVFP4

Similar Projects

AI/ML●●●Banger

SwiftLM – Qwen Chat on iPhone, 100B+ Moe on M5 Pro 64GB (Native Swift)

Native Swift inference with SSD streaming runs 100B MoE models without kernel panics.

WizardryNiche Gem
aegis_camera
122mo ago