Local Document Parsing for Agents
LlamaIndex open-sources their parser core, but LlamaParse cloud still handles complex layouts.

Student-built extraction API competing directly with established players like LlamaParse.
Developers building RAG pipelines or document processing apps
LlamaParse · Unstructured.io · Markitdown
I am a student developer graduating from high school this year, so I still have a lot to learn. I am building this project to help pay for tuition this year and to learn along the way. Any feedback, advice, or questions are greatly appreciated: I will try to respond to comments, or you can email me at [email protected]
Thanks, bollethegoalie
Offline Ollama + OCR keeps your documents private when cloud APIs won't.
Canonical OOXML parsing beats HTML conversion by preserving document semantics and layout fidelity.
ProofPudding returns extraction results with explicit links back to the exact page and source text, supports native and scanned PDFs plus DOCX and images, and ships Python and TypeScript SDKs, which is handy for agents that need auditable facts. It’s a pragmatic product (per-extraction pricing and confidence scores are nice), but the market is crowded; I want clarity on the underlying models, real-world accuracy numbers, and how it compares to Document AI and Textract in edge cases.
PDF-to-Markdown for LLMs when JinaAI and Firecrawl already exist.
Beats PyPDF and MarkItDown on accuracy without needing GPUs or cloud APIs.