GitHub Repository

High-efficiency LLM inference engine in C++/CUDA. Run Llama 70B on RTX 3090.

461 stars · C++

Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU

by xaskasdf · Feb 21, 2026 · 395 points · 101 comments

AI Analysis

●● Solid · Wizardry · Dark Horse

33x speedup over mmap for 70B on RTX 3090, but still 0.2 tok/s vs vLLM's 30+ tok/s.

Strengths
  • NVMe-to-GPU DMA pipeline eliminates CPU bottleneck for memory-bound inference
  • 3-tier adaptive caching (VRAM/pinned RAM/NVMe) auto-optimizes per hardware constraints
  • Zero external dependencies beyond CUDA; ships with GGUF quantization formats
Weaknesses
  • Throughput (0.2 tok/s) remains impractical vs production inference servers like vLLM
  • Linux-only with kernel 6.17+, gcc-14, CUDA 13.1—narrow compatibility envelope
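The 3-tier caching listed under Strengths (VRAM, pinned RAM, NVMe) can be pictured with a minimal sketch. This is not the project's code: the `Tier` enum, the `place_layers` function, and the greedy "fastest tier with room" policy are all assumptions made for illustration, and the real engine may place weights quite differently.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical tier labels; the names are illustrative, not the project's.
enum class Tier { VRAM, PinnedRAM, NVMe };

// Assumed policy: walk the layers in order and put each one in the
// fastest tier that still has room, falling through to NVMe.
std::vector<Tier> place_layers(const std::vector<std::uint64_t>& layer_bytes,
                               std::uint64_t vram_budget,
                               std::uint64_t pinned_budget) {
    std::vector<Tier> placement;
    placement.reserve(layer_bytes.size());
    for (std::uint64_t size : layer_bytes) {
        if (size <= vram_budget) {
            vram_budget -= size;        // layer stays resident on the GPU
            placement.push_back(Tier::VRAM);
        } else if (size <= pinned_budget) {
            pinned_budget -= size;      // layer is staged in page-locked host memory
            placement.push_back(Tier::PinnedRAM);
        } else {
            placement.push_back(Tier::NVMe);  // layer is streamed from disk on demand
        }
    }
    return placement;
}
```

Under this assumed policy, shrinking either budget simply pushes more layers down to the slower tiers, which is one way a placement could adapt to different hardware.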
Category
Target Audience

ML engineers optimizing inference on resource-constrained hardware, retro-gaming modders

Similar To

vLLM · llama.cpp · NVIDIA TensorRT

Post Description

Hi everyone, I'm kinda involved in retrogaming, and during some experiments I ran into the following question: "Would it be possible to run transformer models bypassing the CPU/RAM, connecting the GPU to the NVMe?"

This is the result of that question and some weekend vibecoding (the library repository is linked in the readme as well). It seems to work, even on consumer GPUs; it should work better on professional ones, though.

Similar Projects

AI/ML · ●●● Banger

Llama CPU Benchmarks

Proves speculative decoding slows down 4B models on 4-core CPUs despite marketing claims.

Big Brain · Dark Horse
muthuishere
2023d ago