GitHub Repository

AI-generated x86-64 assembly vs GCC -O3 on production kernels. 4.8-6.3x on base64, verified with 300K fuzz iterations.

2 stars · Python

AI-optimized x86-64 assembly vs. GCC -O3 on three production kernels

Author: cod-e

by cod-e · Feb 15, 2026 · 1 point · 1 comment


AI Analysis

●● Solid · Wizardry · Big Brain

PSHUFB nibble trick beats GCC's lookup table by 4.8–6.3x on base64; reproducible fuzz methodology.

Strengths

• Differential fuzzing (300K iterations, zero mismatches) validates correctness rigorously rather than by hand-waving.
• Real-world kernels (base64, LZ4, SipHash) from production codebases, not toy examples.
• SSSE3 pshufb insight (gathering via shuffle instead of a table lookup) is a genuine algorithmic win.
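
The differential-fuzzing idea mentioned above can be sketched roughly as follows. This is a minimal illustration and not the project's actual harness: `candidate_decode` is a hypothetical stand-in for the optimized kernel, Python's standard `base64.b64decode` plays the reference, and the function name `differential_fuzz`, the seed, and the input-length cap are all assumptions.

```python
import base64
import random

ALPHABET = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
DECODE_TABLE = {ch: i for i, ch in enumerate(ALPHABET)}


def candidate_decode(data):
    """Hypothetical stand-in for the optimized kernel: a plain table decoder."""
    out = bytearray()
    acc = 0   # bit accumulator
    bits = 0  # number of valid bits currently held in acc
    for ch in data.rstrip(b"="):
        acc = (acc << 6) | DECODE_TABLE[ch]
        bits += 6
        if bits >= 8:
            bits -= 8
            out.append((acc >> bits) & 0xFF)
            acc &= (1 << bits) - 1  # keep only the bits not yet emitted
    return bytes(out)


def differential_fuzz(reference, candidate, iterations, seed=0, max_len=64):
    """Feed identical random inputs to both decoders and collect disagreements."""
    rng = random.Random(seed)  # fixed seed so a failing run can be replayed
    mismatches = []
    for _ in range(iterations):
        length = rng.randint(0, max_len)
        raw = bytes(rng.getrandbits(8) for _ in range(length))
        encoded = base64.b64encode(raw)
        if reference(encoded) != candidate(encoded):
            mismatches.append(encoded)
    return mismatches


mismatches = differential_fuzz(base64.b64decode, candidate_decode, 1000)
print("mismatches:", len(mismatches))
```

A real harness would presumably call the compiled C and assembly kernels instead of two Python functions and would also feed malformed inputs, but the shape is the same: identical input, two implementations, any disagreement is a failure.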

Weaknesses

• Niche audience: only systems programmers optimizing hot paths benefit from hand-rolled asm.
• No tool/framework for reproducible AI-asm generation; results are a blog post, not a usable product.

Post Description

Show HN: AI-generated assembly vs GCC -O3 on real codebases (300K fuzz, 0 failures)

Three kernels extracted from real open source projects, optimized with AI-generated x86-64 assembly, verified with 100K differential fuzz iterations each:

| Kernel | AI strategy | Speedup | Verdict |
|---|---|---|---|
| Base64 decode | SSSE3 pshufb table-free lookup | 4.8–6.3x | AI wins |
| LZ4 fast decode | SSE 16-byte match copy | ~1.05x | AI wins (marginal) |
| Redis SipHash | Reordered SIPROUND scheduling | 0.97x | GCC wins |

The base64 win: GCC can't auto-vectorize a 256-byte lookup table (it's a gather pattern). The AI replaces it with a pshufb nibble trick: 16 parallel lookups in one instruction, zero table accesses. 1.8 GB/s → 11.6 GB/s.

The SipHash loss: on pure ALU kernels (adds, rotates, XORs), GCC's scheduler is already near-optimal.

300K total fuzz iterations, zero mismatches. Every result is one command to reproduce.
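
The post does not show the generated assembly, so the following is only a sketch of one well-known form of the pshufb nibble trick, emulated in pure Python. It assumes the kernel resembles the high-nibble offset lookup used in public SIMD base64 decoders; the names `pshufb`, `OFFSET_LUT`, and `translate16` are illustrative, and validation of invalid characters is omitted. The idea is that the high nibble of each ASCII byte identifies its alphabet range, so a 16-entry register lookup yields the offset to add, with no memory table access.

```python
ALPHABET = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def pshufb(table, idx):
    """Emulate SSSE3 PSHUFB on 16 byte lanes.

    Each output lane is table[idx & 0x0F], or 0 if the index's top bit is set.
    """
    return [0 if i & 0x80 else table[i & 0x0F] for i in idx]


# Offsets (mod 256) selected by the high nibble of the input character:
#   slot 1: '/'      -> +16  (reached via the compare-and-add step below)
#   slot 2: '+'      -> +19
#   slot 3: '0'-'9'  -> +4
#   slot 4, 5: 'A'-'Z' -> -65 (191 mod 256)
#   slot 6, 7: 'a'-'z' -> -71 (185 mod 256)
OFFSET_LUT = [0, 16, 19, 4, 191, 191, 185, 185, 0, 0, 0, 0, 0, 0, 0, 0]


def translate16(chunk):
    """Map 16 valid base64 characters to their 6-bit values, lane by lane."""
    assert len(chunk) == 16
    lanes = list(chunk)
    hi = [c >> 4 for c in lanes]                        # high nibble per lane
    slash = [0xFF if c == 0x2F else 0 for c in lanes]   # like PCMPEQB with '/'
    idx = [(h + s) & 0xFF for h, s in zip(hi, slash)]   # like PADDB: '/' -> slot 1
    offsets = pshufb(OFFSET_LUT, idx)                   # 16 lookups in one shuffle
    return [(c + o) & 0xFF for c, o in zip(lanes, offsets)]


# Check the emulation against the alphabet, 16 characters at a time.
for start in range(0, 64, 16):
    assert translate16(ALPHABET[start:start + 16]) == list(range(start, start + 16))
```

In actual SSSE3 code each list comprehension would correspond to a single instruction over a 128-bit register, which is where the "16 parallel lookups" come from; packing the 6-bit values into output bytes is a separate step not shown here.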