518K Vietnamese legal documents (1924–2026)

by th1nhng0 · Mar 22, 2026 · 3 points · 0 comments

AI Analysis

●● Solid · Niche Gem · Dark Horse

518k Vietnamese legal documents fill a massive gap in Southeast Asian NLP datasets.

Strengths
  • Parquet format and CC BY 4.0 license make it immediately usable.
  • Covers a century of legislation, providing historical context often missing.
Weaknesses
  • Limited to Vietnamese, reducing global appeal for general LLM pretraining.
  • No built-in retrieval interface; users must build their own RAG pipeline.
Category
Target Audience

NLP researchers focusing on Southeast Asian languages

Similar To

LexGLUE · CaseHOLD · Hugging Face Datasets

Post Description

I scraped and open-sourced a corpus of 518,255 Vietnamese legal documents — laws, decrees, circulars, decisions — spanning a century of legislation. Metadata + full Markdown text, ~3.6 GB parquet, CC BY 4.0. Vietnamese legal text is nearly absent from existing NLP datasets despite Vietnam having one of the more prolific legislative systems in Southeast Asia.
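
Since the corpus ships as plain Parquet with no search layer, a first retrieval pass has to be built by the user. The following is a minimal sketch of lexical TF-IDF ranking in pure Python, not the author's tooling. The column names `title` and `text` and the sample rows are invented for illustration, because the post does not state the schema. In practice the rows would come from the Parquet files, for example via `pandas.read_parquet`.

```python
import math
from collections import Counter

# Hypothetical rows: the real column names are not given in the post,
# so "title" and "text" are assumptions, and these records are made up.
SAMPLE_DOCS = [
    {"title": "Decree A", "text": "decree on land use tax rates"},
    {"title": "Circular B", "text": "circular guiding enterprise income tax"},
    {"title": "Decision C", "text": "decision on forest protection zones"},
]


def tokenize(text):
    """Naive tokenizer: lowercase, split on whitespace.

    For Vietnamese this yields syllables rather than words, so a real
    pipeline would likely want a proper word segmenter.
    """
    return text.lower().split()


def build_index(docs):
    """Return per-document term counts and smoothed inverse document frequencies."""
    counts = [Counter(tokenize(d["text"])) for d in docs]
    # Each term is counted once per document that contains it.
    df = Counter(term for c in counts for term in c)
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + freq)) + 1.0 for t, freq in df.items()}
    return counts, idf


def search(query, docs, counts, idf, top_k=3):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    terms = tokenize(query)
    scored = []
    for doc, c in zip(docs, counts):
        score = sum(c[t] * idf.get(t, 0.0) for t in terms)
        if score > 0:
            scored.append((score, doc["title"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:top_k]]


counts, idf = build_index(SAMPLE_DOCS)
results = search("income tax", SAMPLE_DOCS, counts, idf)
print(results)
```

This covers only the retrieval half of a RAG setup; the top-ranked documents would still need to be passed to a language model, and a corpus of this size would call for an on-disk or vector index rather than in-memory counters.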

Similar Projects

AI/ML · ●● Solid

LexReviewer – Because "Chat with PDF" is broken for legal workflows

LangGraph agent adapts search strategy per query, but LLMs still hallucinate in contracts.

Solve My Problem · Big Brain
sherebanuk
113mo ago