ZSE – Single-file LLM engine with dual INT4 kernels
INT4 inference engine beats llama.cpp on VRAM, but competing against established tools.

Comprehensive inference survey from CUDA to Kubernetes, but it's a book, not a tool.
Machine learning engineers, inference specialists, AI infrastructure builders
Papers with Code · NVIDIA documentation · Hugging Face course materials
To make it easier for more engineers to learn about inference, I wrote a book. It surveys the dozens of technologies that work together to make inference possible, introduces the primary techniques for inference optimization, and comments on how those techniques apply across various modalities.
This book is completely free to download digitally. I'll have print copies with me at various conferences, and they will be available to purchase once Amazon approves my account.
I hope you find Inference Engineering useful! I'm around to answer any questions.
Cross-chip agent knowledge sharing beats CoreML by 6× on Apple Silicon.
3.59ms for 100 LoRA adapters with zero HBM writes—genuine GPU wizardry.
336× faster tree model inference; compiles sklearn/XGBoost to C99, serves like Ollama.
SF Signal flowchart as a web survey, but book discovery tools already solve this better.
400-page Codex CLI manual covering MCP and hooks before official docs catch up.