AI Cost Firewall – OpenAI-compatible gateway with semantic caching
LLM gateway with Redis + Qdrant caching, but LiteLLM does this.
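A semantic cache answers a new prompt with a stored response when that prompt is close enough, in embedding space, to one already answered. That is where the cost savings come from. Below is a minimal in-memory sketch, not AI Cost Firewall's code. It makes three assumptions: a toy word-count `embed` function stands in for a real embedding model, a plain Python list stands in for Redis and Qdrant, and the 0.9 similarity threshold is arbitrary.

```python
import math
import re
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy embedding (hypothetical): word counts stand in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Linear-scan cache; a vector database such as Qdrant would replace the list."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold  # minimum similarity that counts as a hit (assumed value)
        self.entries: List[Tuple[Counter, str]] = []

    def get(self, prompt: str) -> Optional[str]:
        """Return the response cached for the most similar prompt, or None on a miss."""
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))


cache = SemanticCache()
cache.put("What is the capital of France?", "Paris")
print(cache.get("what is the capital of france"))  # → Paris (same words, different casing)
print(cache.get("Explain quantum tunnelling"))     # → None (no overlap)
```

A real deployment would also scope entries by model and sampling parameters and expire them. Without that scoping, a hit could return an answer that was generated under different settings.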
Zero-trust LLM gateway. OpenAI-compatible proxy with semantic routing and load balancing across OpenAI, Anthropic, Ollama, vLLM, and any compatible backend. Identity-based access, virtual API keys, and end-to-end encryption via OpenZiti.
Zero-trust networking via zrok beats LiteLLM when your GPUs sit behind NAT.
Teams deploying self-hosted LLMs across distributed infrastructure
LiteLLM · OpenRouter · vLLM
It does the things you'd expect from this kind of gateway: semantic routing via a three-layer cascade (keyword heuristics, embedding similarity, LLM classifier) that picks the best model when clients omit the model field, and weighted round-robin load balancing across local inference servers with health checks and failover.
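The routing and load-balancing behavior described above can be sketched as follows. This is a hypothetical illustration, not the llm-gateway code. The keyword table, model names, backend URLs, the caller-supplied `embedding_layer` and `llm_classifier` functions, and the 0.8 similarity cutoff are all assumptions. Each cascade layer either answers confidently or defers to the next, more expensive layer. The balancer uses smooth weighted round-robin and simply skips backends a health check has marked unhealthy, which is what gives failover.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical layer-1 rules: substring -> model name.
KEYWORD_RULES = {"traceback": "code-model", "def ": "code-model", "translate": "general-model"}


def keyword_layer(prompt: str) -> Optional[str]:
    """Layer 1: cheap keyword heuristics; returns None when no rule matches."""
    lowered = prompt.lower()
    for keyword, model in KEYWORD_RULES.items():
        if keyword in lowered:
            return model
    return None


def route(prompt: str,
          embedding_layer: Callable[[str], Tuple[str, float]],
          llm_classifier: Callable[[str], str],
          min_similarity: float = 0.8) -> str:
    """Three-layer cascade: each layer runs only if the cheaper one before it is unsure."""
    model = keyword_layer(prompt)
    if model is not None:
        return model
    # Layer 2: returns the model attached to the most similar labeled example,
    # plus that similarity score.
    model, similarity = embedding_layer(prompt)
    if similarity >= min_similarity:
        return model
    # Layer 3: fall back to asking an LLM to classify the prompt.
    return llm_classifier(prompt)


@dataclass
class Backend:
    url: str
    weight: int
    healthy: bool = True  # flipped by a periodic health check in a real gateway
    current: int = 0      # running score used by smooth weighted round-robin


def pick_backend(backends: List[Backend]) -> Backend:
    """Smooth weighted round-robin that skips unhealthy backends (failover)."""
    candidates = [b for b in backends if b.healthy]
    if not candidates:
        raise RuntimeError("no healthy backends available")
    total = sum(b.weight for b in candidates)
    for b in candidates:
        b.current += b.weight
    chosen = max(candidates, key=lambda b: b.current)  # ties go to the earlier backend
    chosen.current -= total
    return chosen


pool = [Backend("http://gpu-a:8000", weight=3), Backend("http://gpu-b:8000", weight=1)]
print([pick_backend(pool).url for _ in range(4)])  # gpu-a three times, gpu-b once
pool[0].healthy = False
print(pick_backend(pool).url)  # http://gpu-b:8000 (failover)
print(route("Why does this traceback mention KeyError?",
            embedding_layer=lambda p: ("general-model", 0.4),
            llm_classifier=lambda p: "reasoning-model"))  # code-model (layer 1)
```

The smooth variant interleaves picks (a, a, b, a) instead of sending bursts of consecutive requests to the heaviest backend.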
The part I think is most interesting is the network layer. The gateway and backends communicate over zrok/OpenZiti overlay networks. That lets you reach a GPU box behind NAT, expose the gateway to clients, and put components anywhere with internet connectivity, even behind firewalls, with no port forwarding and no VPN. Zero-trust in both directions. Most LLM proxies solve the API translation problem. This one also solves the network problem.
Apache 2.0. https://github.com/openziti/llm-gateway
I work for NetFoundry, which sponsors the OpenZiti project this is built on.
Drop-in OpenAI API gateway with failover; LiteLLM does this, but this one has a dashboard.
Go gateway with circuit breakers, but auth isn't production-ready yet.
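A circuit breaker keeps a gateway from hammering a failing provider. After a run of consecutive failures it "opens" and rejects calls immediately. Once a cooldown has passed, it lets calls through again on a trial basis ("half-open"). It closes on the next success and reopens on the next failure. The project above is written in Go and its internals are not described here, so this is a generic Python sketch of the pattern. The class names, the threshold of 3 failures, and the 30-second cooldown are illustrative assumptions. The sketch is not thread-safe, and it admits every call while half-open rather than a single probe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Closed -> open after N consecutive failures; half-open once the cooldown elapses."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("upstream temporarily disabled")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial while half-open, or too many failures while closed, (re)opens it.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        self.opened_at = None
        return result
```

In a gateway you would typically keep one breaker per upstream provider. On `CircuitOpenError`, you would move on to the next backend instead of waiting for a timeout.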
Stripped-down Portkey fork handling protocol translation for 77 providers without enterprise bloat.
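Protocol translation mostly means reshaping request bodies between provider schemas. As one hedged example, not code from the Portkey fork, the function below maps an OpenAI-style chat completions body onto the shape of Anthropic's Messages API. That API takes the system prompt as a top-level `system` field rather than as a message, and it requires `max_tokens`. The function name and the 1024-token default are assumptions. A real translator also handles tool calls, images, stop sequences, model-name mapping, and streaming events.

```python
from typing import Any, Dict, List


def openai_to_anthropic(body: Dict[str, Any], default_max_tokens: int = 1024) -> Dict[str, Any]:
    """Map an OpenAI chat completions request body onto Anthropic's Messages API shape.

    Hypothetical helper: handles plain-text system/user/assistant messages only.
    """
    system_parts: List[str] = []
    messages: List[Dict[str, Any]] = []
    for msg in body["messages"]:
        role = msg["role"]
        if role == "system":
            system_parts.append(msg["content"])  # becomes the top-level "system" field
        elif role in ("user", "assistant"):
            messages.append({"role": role, "content": msg["content"]})
        else:
            raise ValueError(f"unsupported role: {role}")

    translated: Dict[str, Any] = {
        "model": body["model"],  # a real gateway would map model names per provider
        "messages": messages,
        # Anthropic requires max_tokens; OpenAI lets clients omit it.
        "max_tokens": body.get("max_tokens", default_max_tokens),
    }
    if system_parts:
        translated["system"] = "\n\n".join(system_parts)
    for key in ("temperature", "top_p", "stream"):  # parameters both APIs accept
        if key in body:
            translated[key] = body[key]
    return translated
```

Even shared parameters can differ in range. OpenAI accepts `temperature` up to 2 and Anthropic up to 1, which is the kind of provider edge case a translation layer has to decide how to handle.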
Runs as a single binary with embedded SQLite and zero-config startup. It acts as a transparent, provider-agnostic proxy that logs the model, token counts, latency, cost, and API key hashes, with full body capture left opt-in. It also proxies streaming responses in real time and exposes stable JSON analytics endpoints. That makes it a practical, instrumentable way to get reproducible, audit-ready traces of real LLM traffic, though its long-term value depends on how it handles provider edge cases and SDK compatibility.
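The logging model described above records metadata always, stores keys only as hashes, and captures bodies only when asked. It can be sketched with Python's standard `sqlite3` module. This illustrates the idea under assumptions and is not the project's schema: the table and column names, the truncated SHA-256 key fingerprint, the `capture_bodies` flag, and the logged values are all hypothetical placeholders. Hashing lets you group and audit traffic per key without ever writing the secret itself to disk.

```python
import hashlib
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS requests (
    id INTEGER PRIMARY KEY,
    model TEXT NOT NULL,
    prompt_tokens INTEGER,
    completion_tokens INTEGER,
    latency_ms REAL,
    cost_usd REAL,
    api_key_hash TEXT NOT NULL,
    request_body TEXT,   -- NULL unless body capture is enabled
    response_body TEXT
)
"""


def key_fingerprint(api_key: str) -> str:
    """Stable, non-reversible identifier for an API key (first 16 hex chars of SHA-256)."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()[:16]


def log_request(db: sqlite3.Connection, *, model: str, prompt_tokens: int,
                completion_tokens: int, latency_ms: float, cost_usd: float,
                api_key: str, request_body: Optional[str] = None,
                response_body: Optional[str] = None, capture_bodies: bool = False) -> None:
    """Insert one traffic record; bodies are dropped unless capture_bodies is True."""
    if not capture_bodies:
        request_body = response_body = None
    db.execute(
        "INSERT INTO requests (model, prompt_tokens, completion_tokens, latency_ms,"
        " cost_usd, api_key_hash, request_body, response_body)"
        " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (model, prompt_tokens, completion_tokens, latency_ms, cost_usd,
         key_fingerprint(api_key), request_body, response_body),
    )
    db.commit()


db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
# Placeholder values, not real pricing or traffic.
log_request(db, model="example-model", prompt_tokens=12, completion_tokens=30,
            latency_ms=412.5, cost_usd=0.0001, api_key="sk-example",
            request_body='{"messages": []}')
# Per-key token totals, the kind of query a JSON analytics endpoint might serve.
print(db.execute("SELECT api_key_hash, SUM(prompt_tokens + completion_tokens)"
                 " FROM requests GROUP BY api_key_hash").fetchall())
```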
Distributed LLM inference over P2P instead of centralized APIs, but early-stage and unproven.