AI load balancer and API translator
Unified API gateway for Ollama + vLLM with real-time GPU telemetry and drain mode.
htop for your LLM inference cluster
htop for vLLM clusters without the Prometheus overhead.
ML Engineers and Infrastructure Engineers running LLM inference clusters
Grafana · k9s · Datadog
llmtop is a real-time terminal dashboard for LLM inference workers. It scrapes the Prometheus /metrics endpoints that vLLM, SGLang, and LMCache already expose and shows everything in one view: KV cache usage, queue depth, TTFT/ITL latencies (P50/P99 from histogram buckets), token throughput, prefix cache hit rates. Color-coded — red means go fix it.
```
brew install InfraWhisperer/tap/llmtop
# or
go install github.com/InfraWhisperer/llmtop/cmd/llmtop@latest
```
Single binary, no Prometheus server needed, no Grafana, no config. Just run llmtop and it auto-discovers local workers.
Written in Go with Bubbletea. Working on Kubernetes pod auto-discovery and a GPU metrics view next.
RDMA-backed distributed KV cache cuts prefill latency 3.1× where vLLM's built-in caching maxes out.
Read-only GPU waste scanner finds 20-40% cluster spend waste without agents or sidecars.
Finally one CLI for Ollama, llama.cpp, and vLLM instead of three separate tools.
One-command GPU waste scanner when Kubecost requires full Prometheus setup.
Build vLLM from scratch with PagedAttention kernels when llama.cpp already exists.