OS Megakernel that match M5 Max Tok/w at 2x the Throughput on RTX 3090
Single CUDA dispatch beats M5 Max efficiency on a 2020 GPU — llama.cpp extracts 40% of available performance.

Bypasses NVIDIA's artificial FP4 lock—122B MoE on single desktop GPU at 31 tok/s.
ML engineers running large models on consumer hardware
vLLM · llama.cpp · TensorRT-LLM
Why this matters: NVIDIA's TRT-LLM explicitly blocks desktop Blackwell from FP4 — the error literally says "FP4 Gemm not supported before Blackwell, nor GeForce Blackwell." The RTX 5090, PRO 6000, and DGX Spark all use SM120 — same FP4 tensor cores as the B100/B200 datacenter chips (SM100). The lock is artificial product segmentation, not a hardware limitation.
CUTLASS 4.2+ already ships SM120 FP4 kernels. They're compiled into vLLM. The problem is purely dispatch logic — Python-level capability checks that only recognize SM100, not SM120.
Setup (vLLM 0.17.0, stable pip install):
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model Sehyo/Qwen3.5-122B-A10B-NVFP4 --port 8100 --max-model-len 4096 --gpu-memory-utilization 0.85 --compilation-config '{"cudagraph_mode": "piecewise"}'
Key gotchas: (1) Do NOT pass --quantization flag, model uses compressed-tensors format and vLLM auto-detects. (2) Full CUDA graphs OOM — use piecewise mode (31 tok/s vs 12 tok/s eager). (3) Python 3.14 breaks numba, stick with 3.13.
Results: 31 tok/s on 1 GPU vs 54 tok/s on 2 GPUs with Q8_0 llama.cpp. Half the hardware, ~60% the speed, ~98% the quality.
The broader point: SM120 and SM100 share the same FP4 tensor core architecture. CUTLASS has the kernels. The frameworks just need to route SM120 to them. A 122B MoE model on a single desktop GPU at 31 tok/s was datacenter-only six months ago.
Relevant issues: vLLM #33416, SGLang #18954, CUTLASS #2800. We're submitting a PR (~10 lines of Python).
Single CUDA dispatch beats M5 Max efficiency on a 2020 GPU — llama.cpp extracts 40% of available performance.
Clever hardware hack but this is a config guide, not a shipped tool.
First documented RTX 5090 ComfyUI setup without Docker, using triton-windows instead of xformers.
Native Swift inference with SSD streaming runs 100B MoE models without kernel panics.
Runs 19.5GB Qwen3.5 on 12GB RAM iPhone via memory swapping.
Yet another price tracker, but only for RTX cards on Amazon.