Architecture question: running an LLM as core infrastructure

Author: senza1dio

by senza1dio·Mar 14, 2026·1 point·0 comments


AI Analysis

● Mid · Ship It · Niche Gem

Industrial-grade chatbot with circuit breakers, but in the end still just a customer service bot.

Strengths

• Real production deployment with token budgeting and timeout guards for safety
• Semantic caching with pgvector reduces redundant LLM calls and costs
• Parallel tool execution with loop detection prevents infinite agent loops

Weaknesses

• Fundamentally a chatbot interface, with no differentiation from Intercom or Drift
• Framing the post as an architecture question suggests unfinished thinking, not a polished product

Post Description

I've been experimenting with running an LLM not as a chatbot but as the core runtime of a business system, and I'm curious how others approach this.

The idea is that the model doesn't just answer questions but orchestrates tools and interacts with real application logic.
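To make "orchestrates tools" concrete, here is a minimal sketch of one step of such a runtime: the model's requested tool calls for a single turn are dispatched concurrently, each under its own timeout. It is only an illustration, not the actual implementation. The tool names (`get_order_status`, `get_stock`), the `{"name": ..., "args": ...}` call format, and the `run_tool_calls` helper are all hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable

# Hypothetical tools standing in for real application logic.
async def get_order_status(order_id: str) -> dict:
    await asyncio.sleep(0.01)  # placeholder for a database or API call
    return {"order_id": order_id, "status": "shipped"}

async def get_stock(sku: str) -> dict:
    await asyncio.sleep(0.01)
    return {"sku": sku, "in_stock": 3}

# Registry mapping tool names (as the model would emit them) to functions.
TOOLS: dict[str, Callable[..., Awaitable[Any]]] = {
    "get_order_status": get_order_status,
    "get_stock": get_stock,
}

async def run_tool_calls(calls: list[dict], timeout_s: float = 2.0) -> list[dict]:
    """Run all tool calls from one model turn concurrently.

    Errors are returned as data rather than raised, so they can be fed
    back to the model as tool results.
    """
    async def run_one(call: dict) -> dict:
        fn = TOOLS.get(call["name"])
        if fn is None:
            return {"name": call["name"], "error": "unknown tool"}
        try:
            result = await asyncio.wait_for(fn(**call["args"]), timeout=timeout_s)
            return {"name": call["name"], "result": result}
        except asyncio.TimeoutError:
            return {"name": call["name"], "error": "timeout"}
        except Exception as exc:  # a failing tool must not crash the turn
            return {"name": call["name"], "error": str(exc)}

    # gather preserves the order of the input calls in its result list.
    return await asyncio.gather(*(run_one(c) for c in calls))
```

A turn would then look like `asyncio.run(run_tool_calls([{"name": "get_stock", "args": {"sku": "X1"}}]))`, with the returned list appended to the conversation before the next model call.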

The architecture I'm currently testing includes:

Runtime

• tool orchestration
• parallel tool execution
• loop detection
• circuit breaker / timeout guards
• token budgeting

Context

• context compression
• dynamic token ceiling

Caching

• deterministic LLM response cache
• semantic cache using pgvector

Memory

• short-term session memory
• longer-term semantic memory

Evaluation

• prompt evaluation set to test tool reasoning and failures

I'm trying to figure out which parts are actually necessary in production and which ones are over-engineering.
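For reference, the runtime guards in the list above can be quite small. The following is a rough sketch under my own assumptions, not a description of the system in the post: `RunGuard` combines a per-run token budget with loop detection (the same tool called with the same arguments too many times), and `CircuitBreaker` stops calling a dependency after repeated failures. All class names, thresholds, and defaults are hypothetical.

```python
import json
import time
from collections import Counter
from typing import Callable, Optional

class BudgetExceeded(Exception):
    pass

class LoopDetected(Exception):
    pass

class RunGuard:
    """Per-run token budget plus detection of repeated identical tool calls."""

    def __init__(self, max_tokens: int, max_repeats: int = 2):
        self.max_tokens = max_tokens
        self.max_repeats = max_repeats
        self.used = 0
        self.seen: Counter = Counter()

    def charge(self, tokens: int) -> None:
        self.used += tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(f"{self.used} > {self.max_tokens} tokens")

    def check_call(self, name: str, args: dict) -> None:
        # Canonical JSON so argument order does not hide a repeat.
        key = name + ":" + json.dumps(args, sort_keys=True)
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            raise LoopDetected(key)

class CircuitBreaker:
    """Opens after N consecutive failures, allows a retry after a cooldown."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after_s:
            # Cooldown elapsed: close again, but one more failure reopens.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

In an agent loop, `guard.check_call(...)` would run before each tool dispatch and `guard.charge(...)` after each model response, so a run fails fast instead of spinning. Whether the breaker is kept per tool or per upstream provider is a design choice this sketch leaves open.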

For people building LLM systems beyond simple chat interfaces:

• How do you handle tool orchestration?
• Do you implement memory layers or just rely on context?
• Are semantic caches worth it in practice?

Curious to hear how others structure this.
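On the caching question, here is a small in-memory sketch of the two tiers mentioned above: an exact-match ("deterministic") cache keyed on a hash of the request, with a similarity lookup as fallback. It is an assumption-laden stand-in, not the pgvector setup itself. `TwoTierCache`, `exact_key`, the 0.92 threshold, and the caller-supplied `embed` function are all hypothetical. With pgvector, the linear scan would presumably become a nearest-neighbour query ordered by its cosine distance operator (`<=>`).

```python
import hashlib
import json
import math
from typing import Callable, Optional, Sequence

def exact_key(model: str, prompt: str, temperature: float = 0.0) -> str:
    """Hash of the normalized request; identical requests share a key."""
    payload = json.dumps({"m": model, "p": prompt.strip(), "t": temperature},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class TwoTierCache:
    def __init__(self, embed: Callable[[str], Sequence[float]],
                 threshold: float = 0.92):
        self.embed = embed          # hypothetical embedding function
        self.threshold = threshold  # minimum similarity for a semantic hit
        self.exact: dict[str, str] = {}
        self.semantic: list[tuple[str, Sequence[float], str]] = []

    def get(self, model: str, prompt: str) -> Optional[str]:
        # Tier 1: exact match, no embedding call needed.
        hit = self.exact.get(exact_key(model, prompt))
        if hit is not None:
            return hit
        # Tier 2: most similar stored prompt for the same model.
        vec = self.embed(prompt)
        best_score, best_resp = 0.0, None
        for m, stored_vec, resp in self.semantic:
            if m != model:
                continue
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best_resp = score, resp
        return best_resp if best_score >= self.threshold else None

    def put(self, model: str, prompt: str, response: str) -> None:
        self.exact[exact_key(model, prompt)] = response
        self.semantic.append((model, self.embed(prompt), response))
```

The threshold is the part that decides whether a semantic cache is "worth it": set too low, it returns answers to questions that only look similar, so it probably needs tuning against the same evaluation set used for prompts.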