DELN – An interactive atlas of AI training datasets


by yshunnar·Jun 23, 2026·2 points·0 comments

AI Analysis

●● Solid · Niche Gem · Big Brain

A force-directed dataset atlas with gap detection, a feature Hugging Face doesn't offer.

Strengths
  • Five similarity metrics (token overlap, Jensen-Shannon divergence (JSD), embedding similarity, Vendi score, knowledge graph) for nuanced comparison.
  • Red-ring gap detection visually highlights areas where data coverage is missing.
  • Covers 173T tokens across 39 datasets, with a specific token count for each.
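
To make the first two metrics in the list above concrete, here is a minimal Python sketch of token overlap and JSD between two token samples. It is not DELN's implementation: the function names, the Jaccard definition of overlap, and the unigram-frequency inputs are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def token_distribution(tokens):
    """Turn a list of tokens into a unigram probability distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: n / total for tok, n in counts.items()}


def jaccard_overlap(tokens_a, tokens_b):
    """Token overlap as Jaccard similarity of the two vocabularies (an assumed definition)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    total = 0.0
    for tok in set(p) | set(q):
        pi, qi = p.get(tok, 0.0), q.get(tok, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution M = (P + Q) / 2
        if pi > 0:
            total += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            total += 0.5 * qi * math.log2(qi / mi)
    return total


# Hypothetical toy samples standing in for two datasets.
sample_a = "the cat sat on the mat".split()
sample_b = "the dog sat on the log".split()
print(jaccard_overlap(sample_a, sample_b))
print(js_divergence(token_distribution(sample_a), token_distribution(sample_b)))
```

With base-2 logarithms the divergence is bounded between 0 and 1, which makes it convenient as an edge weight in a force-directed layout; whether DELN uses it that way is not stated in the listing.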
Weaknesses
  • Dataset comparison tools already exist in various forms (Hugging Face, benchmark trackers).
  • Niche audience: it only matters if you're actively auditing training data.
Target Audience

ML engineers, AI researchers, data scientists

Similar To

Hugging Face Datasets · Papers With Code

Post Description

This is a map of major web crawl datasets. You can see how they relate and influence one another. Play with the dropdown to see new relationships.
