GitHub Repository

The Turkish Sieve Methodology: Deterministic Computation of Twin and Cousin Prime Pairs Using an N/6 Bit Data Structure

11 stars

TurkishSieve CPU/GPU prime sieve found errors in Nicely's tables

by bilgisoft · Feb 24, 2026 · 2 points · 0 comments

AI Analysis

●●● Banger · Wizardry · Dark Horse

Found bugs in 30-year-old twin prime data; RTX 5090 hits 1.1T candidates/sec.

Strengths

• Concrete discovery: reports off-by-one discrepancies in Nicely's historical tables, cross-checked against primesieve.
• Genuine algorithmic insight: the N/6-bit structure is 6x more memory-efficient than a classical one-bit-per-integer sieve, and 2x more than an N/3 wheel sieve.
• Hardware-friendly: replaces modular arithmetic with integer addition for GPU parallelization.

Weaknesses

• Niche audience: prime computation research has limited commercial or developer adoption.
• Limited to specific hardware (RTX 5090 benchmarks); unclear how portable this is across GPU generations.

Post Description

While benchmarking my GPU-accelerated sieve engine, Turkish Sieve Engine (TSE), I discovered several inconsistencies in Dr. Thomas Nicely’s famous twin prime research (the work that led to the discovery of the Pentium FDIV bug).

The Discovery: My deterministic engine matches primesieve perfectly, but Nicely’s historical data (hosted at Lynchburg) shows persistent +1 errors in several cumulative counts:

0 to 30: Shows 5 twins (Actual: 4)

0 to 600: Shows 27 twins (Actual: 26)

0 to 30M: Shows 152,892 twins (Actual: 152,891)

It appears to be a systematic off-by-one error or a segment-boundary issue in the legacy code used decades ago.
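
The two smallest rows can be reproduced with a few lines of Python. This is a minimal sketch, not the TSE code: the function names are hypothetical, and it uses a plain Sieve of Eratosthenes. It counts twin pairs (p, p+2) under two conventions, one requiring both members to be within the bound and one requiring only the smaller member to be. For the bounds 30 and 600 the two conventions differ by exactly one, because the pairs (29, 31) and (599, 601) straddle the bound. Whether that convention difference accounts for the table entries is not established here, and the 30M row is not checked.

```python
def prime_flags(n):
    """Sieve of Eratosthenes: flags[i] is 1 when i is prime, for 0 <= i <= n."""
    flags = bytearray([1]) * (n + 1)
    flags[0:2] = b"\x00\x00"          # 0 and 1 are not prime
    p = 2
    while p * p <= n:
        if flags[p]:
            for m in range(p * p, n + 1, p):
                flags[m] = 0
        p += 1
    return flags


def count_twin_pairs(limit, both_members_in_range):
    """Count pairs (p, p+2) of primes with p <= limit.

    If both_members_in_range is True, pairs with p + 2 > limit are skipped.
    """
    flags = prime_flags(limit + 2)    # +2 so that p + 2 is always indexable
    count = 0
    for p in range(3, limit + 1):
        if flags[p] and flags[p + 2]:
            if both_members_in_range and p + 2 > limit:
                continue              # pair straddles the upper bound
            count += 1
    return count


for limit in (30, 600):
    print(limit, count_twin_pairs(limit, True), count_twin_pairs(limit, False))
# prints:
# 30 4 5
# 600 26 27
```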

Performance & Methodology: TSE reaches the speeds below by using an N/6-bit data structure (one bit per six integers), which needs 6x less memory than a classical sieve that stores one bit per integer.

Peak Throughput: 1.136 trillion candidates/sec (on an RTX 5090).

Efficiency: Scanned a range of 10^14 numbers (twin and cousin primes) in ~6 minutes.

Hardware-Friendly: No modular arithmetic; uses simple integer additions (n <- n+p) optimized for CUDA warps.
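
The "addition instead of modulo" idea can be illustrated on the CPU. The sketch below is an assumption about the general technique, not the TSE CUDA kernel: `primes_by_addition` marks the multiples of each prime by stepping a running value with `n += p`, while `primes_by_modulo` is a trial-division reference included only for comparison. Both function names are hypothetical.

```python
import math


def primes_by_addition(limit):
    """Sieve that marks composites using only the step n <- n + p."""
    composite = bytearray(limit + 1)
    primes = []
    for p in range(2, limit + 1):
        if composite[p]:
            continue
        primes.append(p)
        n = p + p                 # first multiple to strike, reached by addition
        while n <= limit:
            composite[n] = 1
            n += p                # n <- n + p, no division or modulo
    return primes


def primes_by_modulo(limit):
    """Reference implementation: trial division with the % operator."""
    return [
        n
        for n in range(2, limit + 1)
        if all(n % d for d in range(2, math.isqrt(n) + 1))
    ]


print(primes_by_addition(30))
# prints: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

The inner loop of the addition version contains one add, one compare, and one store, which is the kind of uniform work that maps well onto many parallel threads.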

Technical Deep Dive: The N/6 indexing scheme exploits the fact that every twin pair (p, p+2) and cousin pair (p, p+4) with p > 3 falls in fixed residue classes modulo 6, so redundant candidates are eliminated before they reach VRAM. This allowed me to process 100 trillion numbers using only 1.1 GB of VRAM.
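
One way to picture an N/6 index is to keep a single flag per k, where index k stands for the twin candidate pair (6k-1, 6k+1). The sketch below is an illustration under that assumption and is not taken from the TSE source: the function names are hypothetical, it uses one byte per flag where a real engine would pack bits, and it computes the starting index for each prime once with a modular inverse (`pow(6, -1, p)`, Python 3.8+) before striking with additions only.

```python
import math


def small_primes(n):
    """Plain sieve returning all primes <= n."""
    flags = bytearray([1]) * (n + 1)
    flags[0:2] = b"\x00\x00"
    for p in range(2, math.isqrt(n) + 1):
        if flags[p]:
            for m in range(p * p, n + 1, p):
                flags[m] = 0
    return [i for i in range(n + 1) if flags[i]]


def twin_pairs_by_k(limit):
    """Twin pairs (6k-1, 6k+1) with 6k+1 <= limit, using one flag per k.

    The pair (3, 5) does not have this form and is not returned.
    """
    if limit < 7:
        return []
    k_max = (limit - 1) // 6              # largest k with 6k+1 <= limit
    alive = bytearray([1]) * (k_max + 1)  # about limit/6 flags
    alive[0] = 0                          # k = 0 would be (-1, 1)
    for p in small_primes(math.isqrt(limit)):
        if p < 5:
            continue                      # 2 and 3 never divide 6k-1 or 6k+1
        inv6 = pow(6, -1, p)              # 6 * inv6 == 1 (mod p)
        for r in (inv6, p - inv6):        # p divides 6k-1 or 6k+1 when k == r (mod p)
            # Skip the first hit when it is the prime p itself.
            k = r + p if p in (6 * r - 1, 6 * r + 1) else r
            while k <= k_max:
                alive[k] = 0
                k += p                    # strike by addition only
    return [(6 * k - 1, 6 * k + 1) for k in range(1, k_max + 1) if alive[k]]


print(twin_pairs_by_k(100))
# prints: [(5, 7), (11, 13), (17, 19), (29, 31), (41, 43), (59, 61), (71, 73)]
```

Cousin pairs with p > 3 have the form (6k+1, 6k+5), so the same indexing idea applies with different residues. Note that 10^14 / 6 bits is roughly 2 TB, so the 1.1 GB figure presumably describes a working set processed in segments; the post does not say how the range is segmented.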

GitHub: https://github.com/bilgisofttr/TurkishSieve

Zenodo (Methodology): https://zenodo.org/records/18038661

I'd love to hear your thoughts on the CUDA kernel optimization and the historical discrepancy I found!