GitHub Repository

htop for your LLM inference cluster

15 stars · Go

Llmtop – Htop for LLM Inference Clusters (vLLM, SGLang, NIM, Ollama)

by rpotluri · Mar 18, 2026 · 5 points · 0 comments

AI Analysis

●●● Banger · Niche Gem · Solve My Problem · Ship It

htop for vLLM clusters without the Prometheus overhead.

Strengths
  • Single-binary auto-discovery takes far less setup than Grafana for local clusters.
  • Surfaces KV cache usage and prefix cache hit rates, which are critical LLM-specific metrics.
  • Supports 10+ backends including NVIDIA Dynamo out of the box.
Weaknesses
  • Limited to backends that expose Prometheus metrics; no custom metric ingestion.
  • Kubernetes auto-discovery is still in progress, according to the README.
Target Audience

ML Engineers and Infrastructure Engineers running LLM inference clusters

Similar To

Grafana · k9s · Datadog

Post Description

I work on inference scheduling — KV cache-aware routing, load balancing across GPU workers, that kind of thing. I wanted something like k9s but for my inference stack. Nothing existed, so I built it.

llmtop is a real-time terminal dashboard for LLM inference workers. It scrapes the Prometheus /metrics endpoints that vLLM, SGLang, and LMCache already expose and shows everything in one view: KV cache usage, queue depth, TTFT/ITL latencies (P50/P99 from histogram buckets), token throughput, prefix cache hit rates. Color-coded — red means go fix it.

```
brew install InfraWhisperer/tap/llmtop
```

Or:

```
go install github.com/InfraWhisperer/llmtop/cmd/llmtop@latest
```

Single binary, no Prometheus server needed, no Grafana, no config. Just run llmtop and it auto-discovers local workers.

Written in Go with Bubbletea. Working on Kubernetes pod auto-discovery and a GPU metrics view next.

Similar Projects