GitHub Repository

A python library for efficient KNN search within metric spaces using multiple distance functions.

36 stars · C++

PyNear – exact and approximate KNN, faster than Faiss

by pcael·Mar 29, 2026·2 points·0 comments

AI Analysis

●● Solid · Big Brain · Niche Gem

VP-Tree + SIMD beats Faiss 39× on exact L2, 257× on binary search.

Strengths
  • MIH binary indexing splits 512-bit descriptors into sub-tables for 257× speedup at 100% Recall@10.
  • Drop-in scikit-learn adapters mean one-line migration from existing KNN workflows.
  • No compiled dependencies beyond NumPy, and pickles serialize cleanly without native libs.
Weaknesses
  • No GPU acceleration, while Faiss dominates production with CUDA support.
  • 22 GitHub stars suggest limited community validation compared to established alternatives.
Target Audience

ML engineers and data scientists building similarity search

Similar To

Faiss · Annoy · scikit-learn

Post Description

PyNear is a Python KNN library built around Vantage-Point Trees with a C++ SIMD core. I've been working on it for a while and just shipped v2.2 with two new approximate binary indices. Benchmarks surprised me so I wanted to share.

* Where it beats Faiss:

- Exact L2 search — VP-Trees prune aggressively using the triangle inequality. At d=512, N=500k: 2.2 ms vs Faiss IndexFlatL2's 85 ms (39×). At low dimensionality (d≤16) it's 2–4× faster.

- Approximate binary search — This one was unexpected. The new MIHBinaryIndex (Multi-Index Hashing) splits 512-bit descriptors into 8 sub-tables of 64-bit keys. By the pigeonhole principle, any true neighbour within Hamming radius 8 must match at least one sub-table exactly or with 1 bit flip — so each query is just 520 hash lookups instead of a linear scan. At N=1M, d=512: 0.037 ms vs Faiss IndexBinaryFlat's 9.5 ms (257×), with 100% Recall@10.

- Faiss's approximate binary index (IndexBinaryIVF) turned out to have an O(N²) bug in its add() path — 34 minutes to build at N=1M. So in practice Faiss can't do approximate binary search at scale.
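
The triangle-inequality pruning described in the exact L2 bullet can be illustrated with a toy VP-Tree in pure Python. This is only a sketch under stated assumptions, not PyNear's C++ implementation: the names `VPNode`, `build` and `knn` are hypothetical, the vantage point is simply the first item of each subset, and there is no SIMD.

```python
import heapq
import math


def l2(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class VPNode:
    def __init__(self, point, index, radius, inside, outside):
        self.point = point      # vantage point
        self.index = index      # position of the vantage point in the input
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree with d(vp, x) <= radius
        self.outside = outside  # subtree with d(vp, x) > radius


def build(items):
    """items: list of (index, point) pairs. Returns the root node or None."""
    if not items:
        return None
    index, vp = items[0]  # assumption: first item serves as vantage point
    rest = [(l2(vp, p), i, p) for i, p in items[1:]]
    if not rest:
        return VPNode(vp, index, 0.0, None, None)
    rest.sort(key=lambda t: t[0])
    radius = rest[len(rest) // 2][0]
    inside = [(i, p) for d, i, p in rest if d <= radius]
    outside = [(i, p) for d, i, p in rest if d > radius]
    return VPNode(vp, index, radius, build(inside), build(outside))


def knn(root, query, k):
    """Return the k nearest (distance, index) pairs, sorted by distance."""
    heap = []  # max-heap emulated with negated distances

    def tau():
        # Current k-th best distance; infinite until k results are held.
        return -heap[0][0] if len(heap) == k else math.inf

    def visit(node):
        if node is None:
            return
        d = l2(query, node.point)
        if len(heap) < k:
            heapq.heappush(heap, (-d, node.index))
        elif d < tau():
            heapq.heapreplace(heap, (-d, node.index))
        if d <= node.radius:
            visit(node.inside)
            # Outside points x satisfy d(q, x) > radius - d, so they can
            # only improve the result if d + tau > radius.
            if d + tau() > node.radius:
                visit(node.outside)
        else:
            visit(node.outside)
            # Inside points x satisfy d(q, x) >= d - radius.
            if d - tau() <= node.radius:
                visit(node.inside)

    visit(root)
    return sorted((-nd, i) for nd, i in heap)
```

Calling `build(list(enumerate(points)))` and then `knn(root, query, k)` returns the same neighbours as a brute-force scan; the speedup comes from whole subtrees being skipped once the current k-th best distance is small relative to a node's radius.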

* Where Faiss still wins:

- Approximate float search at very large N (≥500k) and very high d: their compiled BLAS K-Means is faster than ours for big clustering jobs. If you're doing CLIP or LLM embedding retrieval at scale, Faiss IVF is still the right tool.

Other things PyNear does that Faiss doesn't:

- Pure Python install (NumPy only, no compiled native lib to manage)
- Pickle serialization out of the box
- L1, L∞, and Hamming exact search with the same API
- Drop-in scikit-learn adapter (same fit/predict/kneighbors interface)
- BKTree for Hamming range/threshold queries

The binary approximate story is the most practically interesting to me — binary descriptors (ORB, BRIEF, AKAZE) are always high-dimensional and always approximate in practice, and it turns out MIH is a much better fit for that problem than IVF.
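
The Multi-Index Hashing scheme described earlier in the post (512-bit descriptors, 8 sub-tables of 64-bit keys, probes at 0 or 1 bit flips) can be sketched in pure Python. This is a toy illustration under assumptions, not the actual `MIHBinaryIndex` code: descriptors are stored as Python integers, and `build_tables`, `probes` and `search` are hypothetical names.

```python
BITS = 512               # descriptor length, as in the post
TABLES = 8               # number of sub-tables
CHUNK = BITS // TABLES   # 64-bit key per sub-table
MASK = (1 << CHUNK) - 1


def chunks(code):
    """Split an integer code into TABLES keys of CHUNK bits each."""
    return [(code >> (t * CHUNK)) & MASK for t in range(TABLES)]


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


def build_tables(codes):
    """One hash table per chunk position, mapping key -> list of code ids."""
    tables = [dict() for _ in range(TABLES)]
    for idx, code in enumerate(codes):
        for t, key in enumerate(chunks(code)):
            tables[t].setdefault(key, []).append(idx)
    return tables


def probes(key):
    """The key itself plus every key one bit away: 1 + CHUNK values."""
    yield key
    for bit in range(CHUNK):
        yield key ^ (1 << bit)


def search(tables, codes, query, k, radius=8):
    """Return up to k (distance, id) pairs within `radius`, plus lookup count."""
    candidates = set()
    lookups = 0
    for t, key in enumerate(chunks(query)):
        for probe in probes(key):
            lookups += 1
            candidates.update(tables[t].get(probe, ()))
    # Verify candidates with the full 512-bit Hamming distance.
    scored = sorted((hamming(codes[i], query), i) for i in candidates)
    hits = [(d, i) for d, i in scored if d <= radius][:k]
    return hits, lookups  # lookups is TABLES * (1 + CHUNK) = 8 * 65 = 520
```

If a stored code differs from the query in at most 8 bits, those bits are spread over 8 chunks, so at least one chunk differs in at most 1 bit and one of its 65 probes is guaranteed to hit. Neighbours farther than the radius may be missed, which is what makes the index approximate.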

GitHub: https://github.com/pablocael/pynear

Benchmark report (PDF): https://github.com/pablocael/pynear/blob/main/docs/benchmarks.pdf
