Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU

Author: xaskasdf

by xaskasdf · Feb 21, 2026 · 395 points · 101 comments

AI Analysis

●● Solid · Wizardry · Dark Horse

33x speedup over mmap for 70B on RTX 3090, but still 0.2 tok/s vs vLLM's 30+ tok/s.
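To put those figures in perspective, here is a small back-of-the-envelope calculation. It uses only the numbers quoted above (0.2 tok/s, 33x, and 30 tok/s as the lower bound of the vLLM figure); the variable names are illustrative, and the vLLM number is taken as quoted, with its hardware setup unspecified.

```python
# Figures quoted in the summary above; nothing here is measured independently.
reported_tok_per_s = 0.2      # NVMe-to-GPU pipeline, 70B on an RTX 3090
speedup_over_mmap = 33        # claimed speedup over the mmap baseline
vllm_tok_per_s = 30           # lower bound of the quoted "30+ tok/s"

# Implied mmap baseline throughput and per-token latencies.
baseline_tok_per_s = reported_tok_per_s / speedup_over_mmap
seconds_per_token = 1 / reported_tok_per_s
baseline_seconds_per_token = 1 / baseline_tok_per_s

# Remaining gap to the quoted vLLM throughput.
vllm_gap = vllm_tok_per_s / reported_tok_per_s

print(f"mmap baseline: {baseline_tok_per_s:.4f} tok/s "
      f"(~{baseline_seconds_per_token:.0f} s per token)")
print(f"NVMe-to-GPU:   {reported_tok_per_s} tok/s "
      f"(~{seconds_per_token:.0f} s per token)")
print(f"gap to vLLM:   ~{vllm_gap:.0f}x")
```

In other words, the project turns roughly minutes per token into about five seconds per token, while still sitting about two orders of magnitude below the quoted vLLM throughput.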

Strengths

• NVMe-to-GPU DMA pipeline eliminates CPU bottleneck for memory-bound inference
• 3-tier adaptive caching (VRAM/pinned RAM/NVMe) auto-optimizes per hardware constraints
• Zero external dependencies beyond CUDA; ships with GGUF quantization formats
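The 3-tier caching idea can be illustrated with a minimal sketch. This is not the project's code: the greedy policy, the `TierBudget` type, the `plan_tiers` function, and all sizes below are assumptions chosen for illustration. It only shows the general shape of the decision, namely keeping as many layers as fit in VRAM, spilling the next ones to pinned host RAM, and streaming the rest from NVMe.

```python
from collections import Counter
from dataclasses import dataclass

MIB = 1024 ** 2
GIB = 1024 ** 3


@dataclass
class TierBudget:
    """Hypothetical memory budgets, in bytes, for the two fast tiers."""
    vram_bytes: int
    pinned_ram_bytes: int


def plan_tiers(layer_sizes, budget):
    """Greedily assign each layer to the fastest tier that still has room.

    Returns one label per layer: "vram", "pinned_ram", or "nvme".
    Layers that fit in neither fast tier are streamed from NVMe.
    """
    placement = []
    vram_left = budget.vram_bytes
    ram_left = budget.pinned_ram_bytes
    for size in layer_sizes:
        if size <= vram_left:
            placement.append("vram")
            vram_left -= size
        elif size <= ram_left:
            placement.append("pinned_ram")
            ram_left -= size
        else:
            placement.append("nvme")
    return placement


# Illustrative numbers only: 80 equally sized layers of 512 MiB each,
# with 20 GiB of VRAM and 16 GiB of pinned RAM set aside for weights.
layer_sizes = [512 * MIB] * 80
budget = TierBudget(vram_bytes=20 * GIB, pinned_ram_bytes=16 * GIB)

placement = plan_tiers(layer_sizes, budget)
counts = Counter(placement)
print(dict(counts))
```

A real implementation would also have to account for activations, the KV cache, and transfer overlap, so the budgets here stand in for whatever memory is left after those are reserved.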

Weaknesses

• Throughput (0.2 tok/s) remains impractical vs production inference servers like vLLM
• Linux-only, requiring kernel 6.17+, gcc-14, and CUDA 13.1: a narrow compatibility envelope

Post Description

Hi everyone, I'm kinda involved in retrogaming, and during some experiments I ran into the following question: "Would it be possible to run transformer models bypassing the CPU/RAM, connecting the GPU to the NVMe?"

This is the result of that question and some weekend vibecoding (the linked library repository is in the README as well). It seems to work, even on consumer GPUs, though it should work better on professional ones.