GitHub Repository

Open-source, local-first, read-only AI SRE: clusters alert storms, investigates root cause over your live systems, proposes human-gated fixes.

1 star · Python

Nightwatch, The open-source, read-only AI SRE

by egorferber·Jun 7, 2026·2 points·1 comment

AI Analysis

●● Solid · Big Brain · Ship It

Read-only AI agent architecture prevents production accidents during incident response.

Strengths
  • Local-first owl agents dial outbound only, no inbound exposure
  • Alert storm grouping reduces 50 pages to one confirmed incident
  • Human-gated fix proposals with risk ranking and blast radius
Weaknesses
  • AI SRE investigation competes with Datadog, PagerDuty, FireHydrant
  • Read-only limitation means manual execution of proposed fixes
Target Audience

SREs, DevOps engineers, platform teams

Similar To

Datadog · PagerDuty · FireHydrant

Post Description

Nightwatch is a local-first, read-only layer on top of your monitoring. It groups alert storms into incidents, flags noisy checks, and has an agent that can investigate your live systems for you. For example, you can jump from an incident directly into the agent.

The reason for this weekend project is that we had a Kubernetes upgrade that went wrong. At some point a rollback wasn't possible anymore, so it had to be fixed live during the night while several problems piled up at once. We run a lot of different systems, on-prem and several Kubernetes clusters, and in a situation like that you spend most of the time just figuring out what is actually broken and where.

So I thought it would be pretty cool to have eyes in the dark in each system that can talk to your "brain".

So the idea is to put a baby owl into each environment. Each owl runs where the systems live, keeps that environment's credentials local, and only dials outbound to a central brain, so there is no inbound hole into prod. It exposes a set of read-only skills, and the agent uses them to gather evidence and form a root-cause hypothesis, so the on-call engineer gets a head start instead of starting from zero.
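A minimal Python sketch of the read-only skill idea, under stated assumptions: the names `Skill`, `Owl`, `register`, and `handle`, and the shape of the request dict, are hypothetical and not taken from Nightwatch's code. It only illustrates refusing anything that is not read-only; the outbound connection to the brain is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Skill:
    """A named capability an owl offers to the brain (hypothetical shape)."""
    name: str
    run: Callable[[dict], str]
    read_only: bool = True


class Owl:
    """Holds the skills for one environment and refuses anything that mutates."""

    def __init__(self, env: str) -> None:
        self.env = env
        self._skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        # Reject mutating skills at registration time, so they can never be called.
        if not skill.read_only:
            raise ValueError(f"skill {skill.name!r} is not read-only")
        self._skills[skill.name] = skill

    def handle(self, request: dict) -> dict:
        # A request is assumed to look like {"skill": "...", "args": {...}}.
        skill = self._skills.get(request.get("skill", ""))
        if skill is None:
            return {"ok": False, "env": self.env, "error": "unknown skill"}
        output = skill.run(request.get("args", {}))
        return {"ok": True, "env": self.env, "output": output}


owl = Owl("k8s-staging")
owl.register(Skill("echo_namespace", lambda args: f"namespace={args.get('ns', 'default')}"))
result = owl.handle({"skill": "echo_namespace", "args": {"ns": "payments"}})
```

In this sketch the credentials would live inside each skill's `run` function, so only the skill's text output ever leaves the environment.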

Read-only for now: I don't trust it near prod yet, and honestly neither should you.

Local-first for easy self-hosting and to keep credentials on your side. The clustering and recommendations run fully offline with no LLM at all. The agent needs a tool-calling LLM; you can point it at a remote one, or self-host one (Ollama etc.) if you want to stay fully offline.
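The post does not say how the offline clustering works. As one illustration of grouping an alert storm without any LLM, here is a sketch that splits alerts into incidents by time gap and counts how often each check fires. `Alert`, `group_alerts`, `noisy_checks`, the 300-second window, and the threshold are all assumptions for the example, not Nightwatch's algorithm.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    ts: float      # seconds since some reference point
    service: str
    check: str


def group_alerts(alerts: List[Alert], window: float = 300.0) -> List[List[Alert]]:
    """Start a new incident whenever the gap to the previous alert exceeds `window`."""
    incidents: List[List[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.ts):
        if incidents and alert.ts - incidents[-1][-1].ts <= window:
            incidents[-1].append(alert)
        else:
            incidents.append([alert])
    return incidents


def noisy_checks(alerts: List[Alert], threshold: int = 3) -> List[str]:
    """Return the names of checks that fired at least `threshold` times, sorted."""
    counts = Counter(a.check for a in alerts)
    return sorted(name for name, n in counts.items() if n >= threshold)


alerts = [
    Alert(0, "api", "http_5xx"),
    Alert(60, "db", "conn_refused"),
    Alert(120, "api", "http_5xx"),
    Alert(1000, "cache", "latency"),
]
incidents = group_alerts(alerts)
```

With these sample alerts, the first three fall within the window of each other and form one incident, while the alert at `ts=1000` starts a second one.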

For non-self-hosters: before every remote LLM call, Nightwatch irreversibly strips real secrets and swaps identifiers like IPs, hostnames, and paths for reversible placeholders. The model only sees masked data, and the real values are restored only in the proposed commands and tool calls.
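A rough sketch of that masking step, assuming simple regular expressions and a hypothetical `Masker` class (the patterns, the placeholder format, and the method names are illustrative, not Nightwatch's implementation). Secrets are replaced one-way, while IP addresses get placeholders that can be mapped back.

```python
import re

# Illustrative patterns only; a real tool would need far broader coverage.
SECRET_RE = re.compile(r"(?i)\b(password|token|api_key)=\S+")
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


class Masker:
    def __init__(self) -> None:
        self._forward = {}   # real value -> placeholder
        self._reverse = {}   # placeholder -> real value

    def _placeholder(self, kind: str, value: str) -> str:
        # Reuse the same placeholder for a repeated value so the model
        # can still tell that two mentions refer to the same host.
        if value not in self._forward:
            ph = f"<{kind}_{len(self._forward) + 1}>"
            self._forward[value] = ph
            self._reverse[ph] = value
        return self._forward[value]

    def mask(self, text: str) -> str:
        # Secrets are dropped for good: no mapping is stored for them.
        text = SECRET_RE.sub(lambda m: f"{m.group(1)}=<REDACTED>", text)
        # Identifiers are swapped for placeholders that unmask() can reverse.
        return IP_RE.sub(lambda m: self._placeholder("IP", m.group(0)), text)

    def unmask(self, text: str) -> str:
        for ph, value in self._reverse.items():
            text = text.replace(ph, value)
        return text


masker = Masker()
masked = masker.mask("db at 10.0.0.5 password=hunter2")
command = masker.unmask("ping -c 1 <IP_1>")
```

Here `masked` is what would go to the remote model, and `command` shows a model-proposed command with the real address restored locally.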

Would love it if you tried it on your systems.

Similar Projects

Infrastructure · ●● Solid

RunbookAI – Stop scrolling dashboards at 3 a.m., let AI investigate

The project converts on-call triage into a hypothesis-driven agent that forms and prunes hypotheses, fetches evidence from CloudWatch/Kubernetes and your runbooks, and surfaces an investigation plus approval-gated remediation steps. I like the npx demo, read-only-by-default K8s stance, and built-in audit trail; the obvious caveat is its dependence on proprietary LLM keys and the ops work needed before trusting any mutating actions in production.

Solve My Problem · Niche Gem · Wizardry
EmTekker
103mo ago