NRC nuclear licensing RAG pipeline and regulatory embeddings dataset

Author: davenporten

by davenporten · Apr 13, 2026 · 2 points · 0 comments

AI Analysis

●●● Banger · Niche Gem · Solve My Problem

Possibly the first public NRC regulatory embeddings dataset: 37,734 chunks ready to load into ChromaDB or Pinecone.

Strengths

•Covers the regulatory corpus a COL applicant works from: NUREG-0800, 10 CFR Parts 20/50/51/52/72/73/100, and Regulatory Guides
•Pre-embedded with OpenAI text-embedding-3-small for immediate vector store integration
•The author is not aware of any comparable public dataset before this release

Weaknesses

•Narrow applicability limited to nuclear regulatory and compliance AI use cases
•Accompanying RAG pipeline code remains incomplete after the author was beaten to market and dropped the planned SaaS business

Post Description

I've been building an AI system to automate parts of the NRC Combined License (COL) process: gap analysis against the Standard Review Plan (NUREG-0800), strength scoring of the Final Safety Analysis Report (FSAR), and prediction of Requests for Additional Information (RAIs) using vector similarity to historical NRC requests. I intended this as a SaaS business, but was ultimately beaten to market.
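A minimal sketch of the RAI-prediction idea, not the code from the repository: rank historical RAIs by cosine similarity to the embedding of one FSAR section. The function names, the `(rai_id, embedding)` layout, and the toy 3-dimensional vectors are all assumptions made for illustration; real embeddings would come from text-embedding-3-small.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def similar_rais(section_embedding, historical_rais, top_k=3):
    """Return the top_k (rai_id, score) pairs, most similar first.

    historical_rais is a list of (rai_id, embedding) pairs; this layout
    is hypothetical and may not match the published dataset.
    """
    scored = [
        (rai_id, cosine_similarity(section_embedding, embedding))
        for rai_id, embedding in historical_rais
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (assumption).
toy_rais = [
    ("RAI-A", [1.0, 0.0, 0.0]),
    ("RAI-B", [0.0, 1.0, 0.0]),
    ("RAI-C", [0.7, 0.7, 0.0]),
]
top = similar_rais([1.0, 0.1, 0.0], toy_rais, top_k=2)
print(top)
```

A brute-force scan like this is fine for experimenting with a few thousand vectors; a vector store does the same ranking with an index once the corpus grows.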

I think the most interesting artifact is the dataset: 37,734 chunks of NRC regulatory documents (NUREG-0800, 10 CFR Parts 20/50/51/52/72/73/100, and Regulatory Guides) embedded with OpenAI text-embedding-3-small. It covers the full regulatory corpus an applicant would need for a COL submission. I'm not aware of anything like this being publicly available before.

The embeddings are ready to load directly into ChromaDB, Pinecone, or any other vector store. If you're doing nuclear AI, regulatory NLP, or just want a large real-world RAG dataset to experiment with, it should be useful.
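As a rough sketch of what loading could look like with ChromaDB: the record layout below (`id`, `text`, `embedding`, `source` keys) and the collection name are assumptions, since the actual file format is defined by the repository. The 1536 dimension is the default output size of text-embedding-3-small and assumes the vectors were not shortened.

```python
from itertools import islice

EMBEDDING_DIM = 1536  # default size for text-embedding-3-small (assumed unshortened)


def to_chroma_batches(records, batch_size=1000, dim=EMBEDDING_DIM):
    """Yield keyword-argument dicts shaped for chromadb's Collection.add().

    Each record is assumed to be a dict with "id", "text", "embedding",
    and "source" keys; the published files may use different names.
    """
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        for rec in chunk:
            if len(rec["embedding"]) != dim:
                raise ValueError(f"{rec['id']}: unexpected embedding length")
        yield {
            "ids": [rec["id"] for rec in chunk],
            "embeddings": [rec["embedding"] for rec in chunk],
            "documents": [rec["text"] for rec in chunk],
            "metadatas": [{"source": rec["source"]} for rec in chunk],
        }


def load_into_chroma(records, collection_name="nrc_regulatory"):
    """Insert precomputed embeddings into an in-memory Chroma collection.

    collection_name is a hypothetical choice, not one taken from the repo.
    """
    import chromadb  # imported lazily so the batching helper runs without it

    client = chromadb.Client()
    collection = client.get_or_create_collection(name=collection_name)
    for batch in to_chroma_batches(records):
        collection.add(**batch)
    return collection
```

Because the vectors are supplied explicitly, no embedding calls are made at load time. With 37,734 chunks and a batch size of 1000, this makes 38 `add()` calls.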

Here's the full codebase if you're interested: https://github.com/Davenporten/nrc-licensing-rag