We built Talos – a full CNN inference engine running on silicon
Strips away PyTorch's flexibility entirely and implements full CNN inference as deterministic hardware logic in SystemVerilog.

CNN inference fully hardcoded as silicon logic, not software optimized for hardware.
ML engineers, hardware designers, inference optimization specialists
NVIDIA TensorRT · Google TPU · Xilinx Vitis AI
Zero-cycle matrix multiplication in combinational logic on a Lattice ECP5 is genuinely wild.
Clever ML+hardware co-design, but a blog post without open-source code, benchmarks, or deployment examples.
Open-source logic synthesis running on FPGAs when Yosys dominates the space.
The repo does one practical thing well: it quantifies the real-world impact of Apple Silicon's unified memory on analytics by running six TPC-H queries plus a GPU-favorable QX, and it ships the raw charts and code. It is specific and empirical: you get MLX vs NumPy vs DuckDB numbers and PNGs, not hand-wavy claims. It is narrowly scoped to M4 hardware and small-ish scales, though, so its conclusions are useful for experimentation rather than sweeping generalization.
Custom Metal shaders beat llama.cpp and MLX: 1.67x faster on M4 Max.