FlashQwen – A from-scratch CUDA inference engine for Qwen3
Another inference engine when vLLM and llama.cpp already dominate.
GPT-2-style LLM built from scratch in C/CUDA with hand-written backprop, BPE tokenizer, FlashAttention, pretraining, and SFT.
Hand-written FlashAttention and full gradient checks in pure CUDA with no PyTorch.
ML engineers and students learning transformer internals
nanoGPT · llm.c · Karpathy's implementations
Wavelet-based attention-free architecture beats GPT-2 Medium with 80x less training data.
Karpathy's microgpt in C99, proving that tiny coordinated models beat single large models on logic.
Build a LLaMA-style model from scratch with zero ML prerequisites or math.
GPT-2 inference in pure C#, allocating zero bytes per token, beats ONNX Runtime.
Loads real Meta and OpenAI weights, not just training from scratch.