GitHub Repository

Fast LLM speculative inference server for consumer hardware.

2,432 stars · C++

OS Megakernel that matches M5 Max Tok/w at 2x the Throughput on RTX 3090

by GreenGames·Apr 8, 2026·6 points·1 comment

AI Analysis

●●● Banger · Wizardry · Big Brain

Single CUDA dispatch beats M5 Max efficiency on a 2020 GPU — llama.cpp extracts 40% of available performance.

Strengths
  • First megakernel for hybrid DeltaNet/Attention architecture on consumer GPUs
  • 3.4x faster prefill and 1.55x faster decode than llama.cpp on identical hardware
  • Power sweep shows sweet spot at 220W with 95% speed at 30% less power
Weaknesses
  • Only validated on Qwen3.5-0.8B — unclear if approach generalizes to larger models
  • Requires custom kernel — won't work with standard inference frameworks
Target Audience

ML engineers, inference optimization specialists

Similar To

Hazy Research Megakernel · llama.cpp · vLLM

Post Description

Hey there, we fused all 24 layers of Qwen3.5-0.8B (a hybrid DeltaNet + Attention model) into a single CUDA kernel launch and open-sourced it for everyone to try.

On an RTX 3090 power-limited to 220W:
- 411 tok/s vs 229 tok/s on M5 Max (1.8x)
- 1.87 tok/J, beating M5 Max efficiency
- 1.55x faster decode than llama.cpp on the same GPU
- 3.4x faster prefill

The RTX 3090 launched in 2020. Everyone calls it power-hungry. It isn't; the software is.

The conventional wisdom: NVIDIA is fast but thirsty. Apple Silicon is slow but sips power. Pick a side.

With stock frameworks, the numbers back that up:

Setup | tok/s | Power | tok/J
RTX 3090 (llama.cpp) | 267 | 350W | 0.76
M5 Max (LM Studio) | 229 | ~130W | 1.76

Case closed. Except the 3090 has 936 GB/s of bandwidth and 142 TFLOPS of FP16 compute, and llama.cpp extracts 267 tok/s out of it. That ratio is absurd.
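One rough way to see why that ratio looks off is a memory-bandwidth ceiling for decode. The sketch below assumes each generated token streams the full set of weights from VRAM once, and takes ~1.5 GB as the weight footprint, borrowed from the VRAM requirement listed further down rather than from a measured weight size. Both are assumptions, not benchmark numbers.

```python
# Back-of-envelope decode ceiling for a bandwidth-bound model.
# Assumption (not from the benchmark): every token reads all weights once,
# and the weights occupy about 1.5 GB (the post's VRAM requirement).
BANDWIDTH_GB_S = 936.0      # RTX 3090 memory bandwidth quoted in the post
WEIGHT_GB = 1.5             # assumed weight footprint
LLAMA_CPP_TOK_S = 267.0     # llama.cpp decode speed quoted in the post

ceiling_tok_s = BANDWIDTH_GB_S / WEIGHT_GB
fraction_used = LLAMA_CPP_TOK_S / ceiling_tok_s

print(f"bandwidth ceiling: {ceiling_tok_s:.0f} tok/s")
print(f"llama.cpp reaches {fraction_used:.0%} of that ceiling")
```

Under those assumptions the stock framework lands well under half of what the memory system could deliver, which is the gap the rest of the post goes after.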

Traditional inference dispatches one kernel per operation. For 24 layers, that's roughly 100 launches per token. Every boundary means:
- Return control to the CPU
- Dispatch the next kernel
- Re-fetch weights from global memory
- Synchronize threads
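To put a number on those boundaries, the sketch below divides the per-token time difference between llama.cpp (267 tok/s) and the fused kernel (413 tok/s, from the results further down) by the roughly 100 launches per token. It assumes, generously, that all of the saved time comes from removed kernel boundaries, so treat the result as an upper-bound estimate rather than a measurement.

```python
# Rough per-boundary cost implied by the post's decode numbers.
# Assumption: the whole speedup is attributed to removed kernel boundaries.
LAUNCHES_PER_TOKEN = 100      # "roughly 100 launches per token"
BASELINE_TOK_S = 267.0        # llama.cpp decode
FUSED_TOK_S = 413.0           # megakernel decode

baseline_ms = 1000.0 / BASELINE_TOK_S   # time per token, one kernel per op
fused_ms = 1000.0 / FUSED_TOK_S         # time per token, single dispatch
saved_us_per_boundary = (baseline_ms - fused_ms) * 1000.0 / LAUNCHES_PER_TOKEN

print(f"per token: {baseline_ms:.2f} ms -> {fused_ms:.2f} ms")
print(f"implied cost per boundary: {saved_us_per_boundary:.1f} us")
```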

Why had nobody done this yet? Qwen3.5-0.8B isn't a vanilla transformer. It alternates:
- 18 DeltaNet layers: linear attention with a learned recurrence
- 6 Full Attention layers: standard MHA

This hybrid pattern is where frontier models are heading: Qwen3-Next, Kimi Linear, all of them. DeltaNet scales linearly with context length instead of quadratically.
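The linear scaling comes from carrying a fixed-size state instead of a growing key/value cache. Here is a minimal sketch of a delta-rule recurrence, assuming the simplified ungated form S ← S + β (v − S k) kᵀ. The function names and toy data are hypothetical, and this is not the fused kernel's code.

```python
# Simplified (ungated) delta-rule recurrence in plain Python.
# Illustration only: not the repository's kernel, and the gating used by
# production DeltaNet variants is left out.

def delta_rule_step(state, k, v, beta):
    """Update the d_v x d_k state in place: S += beta * (v - S k) k^T."""
    for i, row in enumerate(state):
        predicted = sum(s * kj for s, kj in zip(row, k))   # (S k)[i]
        error = beta * (v[i] - predicted)
        for j, kj in enumerate(k):
            row[j] += error * kj

def read(state, q):
    """Output for a query: o = S q."""
    return [sum(s * qj for s, qj in zip(row, q)) for row in state]

d_k, d_v = 4, 2
state = [[0.0] * d_k for _ in range(d_v)]   # fixed size, independent of context length

# Hypothetical toy stream of (key, value) pairs; keys are unit vectors.
tokens = [
    ([1.0, 0.0, 0.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0, 0.0, 0.0], [3.0, 4.0]),
    ([1.0, 0.0, 0.0, 0.0], [5.0, 6.0]),   # overwrites the value stored under the first key
]
for k, v in tokens:
    delta_rule_step(state, k, v, beta=1.0)

print(read(state, [1.0, 0.0, 0.0, 0.0]))
print(read(state, [0.0, 1.0, 0.0, 0.0]))
```

Each token costs one pass over the state regardless of how many tokens came before, so total work grows linearly with sequence length.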

It's new, and nobody has shipped a fused kernel for it. MLX doesn't have DeltaNet kernels at all. llama.cpp supports it generically. Everyone else is waiting. The 267 tok/s wasn't a hardware ceiling, it was the software ceiling for a brand-new architecture.

We wrote a single CUDA kernel that runs the entire forward pass in one dispatch. Data stays in registers and shared memory as it flows through the network. Zero CPU round-trips, zero redundant memory fetches.

- 82 blocks x 512 threads, all SMs occupied
- BF16 weights and activations, FP32 accumulation
- DeltaNet recurrence runs in warp-cooperative FP32 registers
- Full attention fuses QKV, RoPE, causal softmax, and output projection
- Cooperative grid sync replaces kernel launches between layers
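The launch geometry in that list works out as follows. This is plain arithmetic, assuming one resident block per SM; the SM count of 82 is the RTX 3090's published spec and is not stated in the post itself.

```python
# Launch geometry from the list above, as plain arithmetic.
# Assumption: one resident block per SM; the RTX 3090 has 82 SMs.
BLOCKS = 82
THREADS_PER_BLOCK = 512
WARP_SIZE = 32                # CUDA warp width
SM_COUNT = 82

total_threads = BLOCKS * THREADS_PER_BLOCK
warps_per_block = THREADS_PER_BLOCK // WARP_SIZE
blocks_per_sm = BLOCKS / SM_COUNT

print(total_threads, warps_per_block, blocks_per_sm)
```

A grid-wide sync only works when every block is resident at the same time, which is why a persistent kernel of this kind typically sizes its grid to the SM count.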

Results on the same RTX 3090, same model, same weights:

Setup | Prefill (pp520) | Decode (tg128)
Megakernel | 37,800 tok/s | 413 tok/s
llama.cpp BF16 | 11,247 tok/s | 267 tok/s
PyTorch + HF | 7,578 tok/s | 108 tok/s
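The headline speedups can be recomputed directly from those results. A small sketch using only the numbers quoted above (the helper name `speedup` is just for illustration):

```python
# Speedup ratios recomputed from the results quoted above.
results = {
    "Megakernel":     {"prefill": 37800, "decode": 413},
    "llama.cpp BF16": {"prefill": 11247, "decode": 267},
    "PyTorch + HF":   {"prefill": 7578,  "decode": 108},
}

def speedup(metric, baseline):
    """Megakernel throughput divided by the baseline's, for one metric."""
    return results["Megakernel"][metric] / results[baseline][metric]

for baseline in ("llama.cpp BF16", "PyTorch + HF"):
    print(f"vs {baseline}: prefill {speedup('prefill', baseline):.2f}x, "
          f"decode {speedup('decode', baseline):.2f}x")
```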

Then we turned the power down

Fewer wasted cycles means less heat, so we swept nvidia-smi -pl:

Power limit | Clock | Draw | tok/s | tok/J | Notes
420W (stock) | 1980 MHz | 314W | 433 | 1.38 | baseline
300W | 1935 MHz | 299W | 432 | 1.44 | -5% power, 99.8% speed
220W | 1635 MHz | 220W | 411 | 1.87 | -30% power, 95% speed
150W | 405 MHz | 150W | 194 | 1.29 | clock cliff, too aggressive

At 220W we hit the sweet spot: 95% of the throughput for 70% of the power. Tighter execution converts almost directly into saved watts. Measurement: NVML energy counters for NVIDIA, powermetrics for Apple Silicon, matching Hazy Research's Intelligence Per Watt methodology. Accelerator power only, not wall draw.
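The efficiency column is throughput divided by measured draw, since a watt is one joule per second. The sketch below recomputes it from the sweep figures quoted above. It only reworks the post's numbers and does not read any power counters itself.

```python
# Recompute efficiency (tok/J = tok/s divided by watts) for the power sweep.
# Rows are (power limit, measured draw in W, tok/s) as quoted in the post.
sweep = [
    ("420W (stock)", 314, 433),
    ("300W",         299, 432),
    ("220W",         220, 411),
    ("150W",         150, 194),
]

base_draw, base_speed = sweep[0][1], sweep[0][2]
for limit, draw_w, tok_s in sweep:
    tok_per_joule = tok_s / draw_w          # (tok/s) / (J/s) = tok/J
    print(f"{limit:>13}: {tok_per_joule:.2f} tok/J, "
          f"{draw_w / base_draw:.0%} power, {tok_s / base_speed:.1%} speed")

best = max(sweep, key=lambda row: row[2] / row[1])
print("most efficient setting:", best[0])
```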

Without the megakernel the 3090 barely edges out a laptop chip. With it, a five-year-old GPU beats Apple's latest on throughput, matches it on efficiency, and costs a quarter as much. The NVIDIA vs Apple efficiency gap isn't silicon. It's software.

Try it

git clone https://github.com/Luce-Org/luce-megakernel.git
cd luce-megakernel
pip install -e .
python bench_pp_tg.py

Requires: NVIDIA Ampere+ (tested on 3090), CUDA 12+, PyTorch 2.0+, ~1.5GB VRAM.

Code is open source (MIT): https://github.com/Luce-Org/luce-megakernel

Let us know if you have any feedback

Similar Projects

AI/ML · ●●● Banger

iPhone ANE holds LLM tok/s while MLX and LiteRT thermal-throttle

LiteRT beats MLX on Gemma memory while CoreML sips power on the Neural Engine.

Dark Horse · Big Brain · Solve My Problem
mlboy
1010d ago