L88 – A Local RAG System on 8GB VRAM (Need Architecture Feedback)
Agentic RAG with a self-evaluator loop, but the evaluator and generator share one model due to VRAM constraints.
Researchers, data scientists, developers building local LLM applications with privacy requirements
LM Studio · Ollama + LangChain · LlamaIndex (local)
I’ve been working on a project called L88, a local RAG system. I focused on the UI/UX first, so the retrieval and model architecture still need proper refinement.
Repo: https://github.com/Hundred-Trillion/L88-Full
I’m running this on 8GB of VRAM and a strong CPU with 128GB of RAM. Embeddings and preprocessing run on the CPU, and the main model runs on the GPU. One limitation I ran into is that my evaluator and generator ended up being the same LLM due to compute constraints, which largely defeats the purpose of evaluation, since the model is grading its own output.
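To make the evaluator/generator problem concrete, here is a minimal sketch of a self-evaluation loop where the two roles are separate callables. It is not the code in the repo, and every name in it (`self_eval_loop`, `LoopResult`, the PASS/FAIL prompt format) is a hypothetical placeholder. The point is only that if the roles are injected separately, the evaluator could later be swapped for a different model (for example a smaller one running on CPU) without touching the loop.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical alias: any function that maps a prompt string to generated text,
# e.g. a thin wrapper around a local model. Not tied to any specific library.
LLM = Callable[[str], str]


@dataclass
class LoopResult:
    answer: str
    attempts: int
    accepted: bool


def self_eval_loop(
    question: str,
    context: List[str],
    generator: LLM,
    evaluator: LLM,
    max_attempts: int = 3,
) -> LoopResult:
    """Generate an answer, ask the evaluator for a verdict, retry on rejection."""
    joined = "\n".join(context)
    feedback = ""  # evaluator's last rejection, fed back to the generator
    answer = ""
    for attempt in range(1, max_attempts + 1):
        gen_prompt = (
            f"Context:\n{joined}\n\nQuestion: {question}\n{feedback}Answer:"
        )
        answer = generator(gen_prompt)

        # Assumed verdict protocol: the evaluator replies starting with PASS or FAIL.
        eval_prompt = (
            "Reply PASS if the answer is fully supported by the context, "
            "otherwise reply FAIL and give a reason.\n"
            f"Context:\n{joined}\n\nAnswer: {answer}\nVerdict:"
        )
        verdict = evaluator(eval_prompt).strip()
        if verdict.upper().startswith("PASS"):
            return LoopResult(answer, attempt, True)
        feedback = f"Previous attempt was rejected: {verdict}\n"

    # Out of retries: return the last answer, flagged as not accepted.
    return LoopResult(answer, max_attempts, False)
```

With this shape, passing the same callable for both `generator` and `evaluator` reproduces my current setup, and passing two different callables is the split I am asking about.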
I’d really appreciate feedback on:
Better architecture ideas for small-VRAM RAG
Splitting evaluator/generator roles effectively
Improving the LangGraph pipeline
Any bugs or design smells you notice
Ways to optimize the system for local hardware
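For the small-VRAM question, a rough weight-memory estimate helps frame what can sit on the GPU at once. The sketch below uses the standard approximation (parameters × bits per weight / 8). The model sizes, the 4-bit figure, and the 1.5 GB overhead for KV cache and activations are illustrative assumptions, not measurements from L88; real quantized formats usually need somewhat more than their nominal bit width.

```python
from typing import Iterable, Tuple


def weight_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes): params * bits / 8."""
    return n_params_billion * bits_per_weight / 8


def fits(
    vram_gb: float,
    models: Iterable[Tuple[float, float]],
    overhead_gb: float = 1.5,  # assumed allowance for KV cache and activations
) -> bool:
    """True if all models' weights plus the assumed overhead fit in VRAM."""
    total = sum(weight_gb(params, bits) for params, bits in models)
    return total + overhead_gb <= vram_gb


# Hypothetical pairings on an 8 GB card, as (billions of params, bits per weight):
print(weight_gb(7, 4))                 # one 7B model at 4-bit
print(fits(8, [(7, 4), (1.5, 4)]))     # 7B generator + 1.5B evaluator
print(fits(8, [(7, 4), (7, 4)]))       # two separate 7B models
```

Under these assumptions, a second full-size model does not fit next to the first, while a much smaller evaluator might, which is why I am curious whether a small dedicated evaluator (on GPU or CPU) is a sensible direction.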
I’m 18 and still learning a lot about proper LLM architecture, so any technical critique or suggestions would help me grow as a developer. If you check out the repo or leave feedback, it would mean a lot. I’m trying to build a solid foundation and reputation through real projects.
Thanks!
Self-correcting LangGraph RAG with local LLM, hybrid retrieval, and role-based multi-user workspace.
Structured memory layers for agents—but vector search already solves this problem.
OpenClaw orchestration with MCP support, but agent management is crowded.
OpenClaw control plane + 15 providers, but orchestration dashboards are crowded.
MetaAgent rewrites Python code and tools in Docker to evolve task performance.