GitHub Repository

htop for your LLM inference cluster

15 stars · Go

Llmtop – Htop for LLM Inference Clusters (vLLM, SGLang, Nim, Ollama,)

Author: rpotluri

by rpotluri·Mar 18, 2026·5 points·0 comments


AI Analysis

●●● Banger · Niche Gem · Solve My Problem · Ship It

htop for vLLM clusters without the Prometheus overhead.

Strengths

• A single binary with auto-discovery beats Grafana's setup time for local clusters.
• Surfaces KV cache usage and prefix cache hit rates, which are critical LLM-specific metrics.
• Supports 10+ backends, including NVIDIA Dynamo, out of the box.

Weaknesses

• Limited to backends that expose Prometheus metrics; no custom metric ingestion.
• Kubernetes auto-discovery is still in progress, according to the README.

Post Description

I work on inference scheduling — KV cache-aware routing, load balancing across GPU workers, that kind of thing. I wanted something like k9s but for my inference stack. Nothing existed, so I built it.

llmtop is a real-time terminal dashboard for LLM inference workers. It scrapes the Prometheus /metrics endpoints that vLLM, SGLang, and LMCache already expose and shows everything in one view: KV cache usage, queue depth, TTFT/ITL latencies (P50/P99 from histogram buckets), token throughput, prefix cache hit rates. Color-coded — red means go fix it.

```
brew install InfraWhisperer/tap/llmtop
# Or:
go install github.com/InfraWhisperer/llmtop/cmd/llmtop@latest
```

Single binary, no Prometheus server needed, no Grafana, no config. Just run llmtop and it auto-discovers local workers.

Written in Go with Bubbletea. Working on Kubernetes pod auto-discovery and a GPU metrics view next.

Similar Projects

Infrastructure · ●●● Banger

AI load balancer and API translator

Unified API gateway for Ollama + vLLM with real-time GPU telemetry and drain mode.

Big Brain · Solve My Problem · Slick

sheneman42

103mo ago

AI/ML · ●●● Banger

SiMM – Distributed KV Cache for the Long-Context and Agent Era

RDMA-backed distributed KV cache cuts prefill latency 3.1× where vLLM's built-in caching maxes out.

Wizardry · Big Brain · Niche Gem

SherryWong

113mo ago

Infrastructure · ●● Solid

Piqc – GPU waste scanner for LLM inference clusters

Read-only GPU waste scanner finds 20-40% cluster spend waste without agents or sidecars.

Solve My Problem · Slick

paralleliq

3010d ago

Developer Tools · ●●● Banger

A single CLI to manage llama.cpp/vLLM/Ollama models

Finally one CLI for Ollama, llama.cpp, and vLLM instead of three separate tools.

Solve My Problem · Slick

everlier

213mo ago

AI/ML · ●● Solid

Piqc – An open-source GPU waste scanner for LLM inference clusters

One-command GPU waste scanner when Kubecost requires full Prometheus setup.

Solve My Problem · Niche Gem

samhoss93

118d ago

Education · ●● Solid

Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

Build vLLM from scratch with PagedAttention kernels when llama.cpp already exists.

Big Brain · Niche Gem

yu3zhou4

2051814d ago