ArXiv metadata as Parquet files (2.99M papers, 1.44GB, 417 files)

by tamnd·Mar 24, 2026·4 points·0 comments

AI Analysis

Mid · Cozy · Niche Gem

Pre-cleaned ArXiv metadata in Parquet saves hours of ETL pipeline work.

Strengths
  • CC0 license allows unrestricted commercial and research use without attribution.
  • Partitioned Parquet format enables efficient querying with Pandas or Spark.
Weaknesses
  • Metadata only; full-text PDFs are not included, which limits NLP applications.
  • Existing Kaggle and Semantic Scholar datasets offer similar value with more features.
Category
Target Audience

Data scientists and ML researchers analyzing academic literature

Similar To

Kaggle ArXiv Dataset · Semantic Scholar Open Research Corpus · ArXiv API

Similar Projects