GitHub Repository

High-performance, deterministic DLinear implementation in Chisel. Sub-5ns latency on 7nm. Target: HFT, Line-rate Networking, and Aerospace RTOS.

0 stars · C++

Direct to silicon DLinear AI accelerator on the Sky130 open-source node

by NotJustBinary·Mar 5, 2026·2 points·2 comments

AI Analysis

●●●● Gem · Wizardry · Bold Bet · Zero to One

Sub-5ns DLinear in silicon. Balanced adder trees, 4-stage pipeline, zero-delay bit shifts.

Strengths
  • Genuine hardware innovation: eliminates instruction layer by turning model into pure dataflow circuit; verified on Sky130.
  • Deep optimization work (binary tree balancing, retiming, hold-violation fixes) shows real chip design mastery.
  • Deterministic latency and 1 prediction/cycle throughput solve actual HFT/aerospace constraints that software can't meet.
Weaknesses
  • Sky130 (130nm) is a mature node; the 7nm estimates are unverified, and custom silicon is a high barrier that puts it out of reach for most.
  • No tape-out, fabrication, or real-world deployment data; repo is early-stage (0 stars).
Category
Target Audience

Hardware engineers, HFT traders, aerospace/real-time systems engineers, academic researchers in AI accelerator design

Post Description

Hi HN, I have always been interested in and inspired by the idea of speed in computing, which raises a logical question: why do we use general-purpose processors for tasks that require minimal latency and predictability? So a couple of months ago I thought: can I take a DLinear-type time series model (DLinear is a simple but effective tool for time series analysis) and implement it directly in silicon using the open-source Sky130 PDK?
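For readers who have not met DLinear, here is a minimal Python sketch of the idea: split the input window into a trend and a seasonal remainder, pass each through one linear layer, and add the results. This is not the code from the repo. The function name, the weight vectors, and the use of plain integers are assumptions for illustration, and the trend is simplified to a single whole-window mean, whereas the published DLinear uses a sliding moving average that yields a trend series.

```python
# Hypothetical integer sketch of a DLinear-style forecast for one output step.
# Assumptions (not from the post): the trend is one whole-window mean, each
# branch has a single weight vector, and Python ints stand in for fixed-point.

WINDOW = 64  # matches the 64-tap window mentioned in the post


def dlinear_step(window, w_trend, w_seasonal):
    """Return one prediction from a window of 64 integer samples."""
    assert len(window) == len(w_trend) == len(w_seasonal) == WINDOW
    # Trend: mean of the window, rounded down (64 is a power of two).
    trend = sum(window) // WINDOW
    # Seasonal remainder: what is left after removing the trend.
    seasonal = [x - trend for x in window]
    # Each branch is a single linear layer (a dot product); outputs are summed.
    trend_out = sum(w * trend for w in w_trend)
    seasonal_out = sum(w * s for w, s in zip(w_seasonal, seasonal))
    return trend_out + seasonal_out
```

Because the whole model is a fixed chain of multiplies and adds with no data-dependent branching, it maps naturally onto a dataflow circuit with a constant latency.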

Well, it wasn't as easy as it seemed. My first attempt at direct synthesis was a nightmare: OpenLane (the RTL-to-GDSII flow I used) reported a setup slack of -7.88ns. Essentially, the signal could not propagate through the 64-tap window within one 10ns clock cycle at 100MHz. I spent weeks refactoring the architecture in Chisel and moved to a fully unrolled, 4-stage pipeline. The hardest part was balancing the binary adder tree; I had to ensure that the 6 levels of addition (64 inputs, halved at each level) didn't bottleneck the entire chip. I also realized that I could cheat a bit: instead of using a resource-heavy divider for the moving average, I used a static bit-shift (>> 6), which divides by 64. In hardware, that’s just re-wiring, which costs zero nanoseconds and zero gates. The final result is an 86,443-cell design that is LVS/DRC clean.
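The adder tree and the shift trick can be modeled in a few lines of Python. This is a behavioral sketch only, not the Chisel from the repo; the function names are made up for illustration, and Python's unbounded integers hide the bit-width growth that real hardware has to handle at each level.

```python
# Behavioral model of a balanced binary adder tree plus a shift-based average.
# Hypothetical helper names; the actual design is written in Chisel.

def adder_tree_sum(values):
    """Sum a non-empty sequence pairwise, level by level.

    Returns (total, levels), where levels is the depth of the tree.
    """
    level = list(values)
    depth = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd-sized levels so every adder has two inputs
        # One tree level: each adder combines two neighbouring partial sums.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        depth += 1
    return level[0], depth


def moving_average_64(window):
    """Average 64 samples using a shift instead of a divider."""
    total, _ = adder_tree_sum(window)
    # >> 6 equals floor division by 64; in hardware it is just wire selection.
    return total >> 6
```

With 64 inputs the loop runs exactly 6 times, which matches the 6 levels of addition described above. A linear chain of adders would need 63 additions in sequence, so the tree is what keeps the combinational depth short enough to split across pipeline stages.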

Of course, it currently runs at 100 MHz on a 130 nanometer process, but in theory it should be possible to reach 1.2GHz on a 7 nanometer process, which would reduce the latency to about 3.3 nanoseconds. (Yes, I did not try ASAP7 in OpenLane; the project is too controversial and I was not sure it would give realistic results.) I think we are approaching the point where software-defined interfaces are becoming too slow for line-rate networking or high-speed control loops. All GDSII layouts and Surfer waveforms can be found in the repo.
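The timing figures can be checked with simple arithmetic. The sketch below assumes that latency is the number of pipeline stages times the clock period, and that setup slack is the clock period minus the critical-path delay, ignoring clock uncertainty and other margins. Both are simplifying assumptions, not statements about how OpenLane computes its reports, and the function names are hypothetical.

```python
# Back-of-the-envelope timing arithmetic for the numbers quoted in the post.
# Assumes latency = stages * period and slack = period - critical-path delay.

def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def pipeline_latency_ns(stages, freq_mhz):
    """Input-to-output latency of a pipeline with the given stage count."""
    return stages * period_ns(freq_mhz)


def critical_path_ns(freq_mhz, setup_slack_ns):
    """Critical-path delay implied by a reported setup slack."""
    return period_ns(freq_mhz) - setup_slack_ns


print(pipeline_latency_ns(4, 100))    # 4 stages at 100 MHz on Sky130
print(pipeline_latency_ns(4, 1200))   # 4 stages at the projected 1.2 GHz
print(critical_path_ns(100, -7.88))   # path delay behind the -7.88 ns slack
```

Under these assumptions the 4-stage pipeline takes 40 ns at 100 MHz and about 3.3 ns at 1.2 GHz, and the first synthesis attempt had a critical path of roughly 17.9 ns, which is why a single 10 ns cycle was not enough.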

If anyone is interested, I'd love feedback from the community on the architecture. There are still some synchronization problems in the chip that can cause power consumption to jump suddenly while it is running, which sends the chip into reboot mode. Or maybe someone knows a more elegant method for handling the summation of large windows in Chisel.
