Describe a research topic, get a daily-updated ArXiv/S2 dataset

by dangerlego5·Jun 23, 2026·2 points·0 comments

AI Analysis

●●● Banger · Solve My Problem · Big Brain · Slick

Cross-source dedup with pgvector at a 0.92 similarity cutoff beats manual scraping workflows.

Strengths
  • pgvector similarity dedup at 0.92 cutoff merges same paper across sources
  • Quality scoring from citation signals filters noise before training
  • Scheduled refreshes keep datasets current without rerunning pipelines
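
The similarity dedup described in the strengths above can be sketched in plain Python. This is only an illustration, not the project's implementation: it assumes the 0.92 cutoff is a cosine similarity over paper embeddings, and the record fields (`title`, `embedding`, `sources`) and the `dedup` helper are hypothetical names. In pgvector the `<=>` operator returns cosine distance, so under this assumption a 0.92 similarity cutoff would correspond to a distance of at most 0.08.

```python
import math

SIMILARITY_CUTOFF = 0.92  # cutoff quoted in the listing; cosine metric is an assumption


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedup(records, cutoff=SIMILARITY_CUTOFF):
    """Greedy dedup: fold a record into the first kept record it matches.

    Hypothetical helper. A match means cosine similarity >= cutoff; the
    matching record's sources are merged instead of adding a new row.
    """
    kept = []
    for rec in records:
        for existing in kept:
            if cosine_similarity(rec["embedding"], existing["embedding"]) >= cutoff:
                for source in rec["sources"]:
                    if source not in existing["sources"]:
                        existing["sources"].append(source)
                break
        else:
            # No kept record was similar enough, so this is a new paper.
            kept.append({
                "title": rec["title"],
                "embedding": list(rec["embedding"]),
                "sources": list(rec["sources"]),
            })
    return kept


# Toy embeddings: the first two stand in for the same paper seen in two sources.
records = [
    {"title": "Paper A (arXiv copy)", "embedding": [1.0, 0.0, 0.0], "sources": ["arxiv"]},
    {"title": "Paper A (S2 copy)", "embedding": [0.99, 0.05, 0.0], "sources": ["s2"]},
    {"title": "Paper B", "embedding": [0.0, 1.0, 0.0], "sources": ["arxiv"]},
]
merged = dedup(records)
for row in merged:
    print(row["title"], row["sources"])
```

The pairwise loop is quadratic and only suits a toy example; a database-side version would presumably lean on a pgvector index for the nearest-neighbour lookup instead.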
Weaknesses
  • Parquet export and HuggingFace push are marked "coming soon" and are not live yet
  • Quality score methodology could use more transparency on weighting
Category
Target Audience

ML researchers, data scientists building fine-tuning datasets

Similar To

HuggingFace Datasets · Semantic Scholar API · Papers With Code

Similar Projects