GitHub Repository

Risk-driven chaos experiment scheduler. Ranks which microservice to break next using service topology and incident history. 9.8x faster weakness discovery vs random selection.

7 stars · Python

ChaosRank – which microservice should you break next?

by Medinz01·Mar 6, 2026·1 point·0 comments

AI Analysis

●●● Banger · Big Brain · Solve My Problem

Replaces gut-feel chaos picks with principled risk ranking: 9.8x faster weakness discovery.

Strengths
  • Novel scoring blend: PageRank centrality + incident-weighted fragility captures both structure and failure patterns
  • Rigorous evaluation on DeathStarBench; methodology is transparent and reproducible
  • Solves real prioritization gap—teams genuinely do pick chaos targets by intuition
Weaknesses
  • Evaluation limited to simulation; no production system case studies yet
  • Requires clean incident data export and trace collection (integration friction for adoption)
Target Audience

Chaos engineering teams, SREs managing complex microservice systems

Similar To

Gremlin · LitmusChaos

Post Description

I built ChaosRank after noticing that most chaos engineering teams pick targets by gut feel or random selection. A payment service with 15 downstream dependents isn't the same risk as a logging sidecar, but tools like LitmusChaos treat them identically.

ChaosRank takes your Jaeger trace export and incident history CSV and produces a ranked list of services to target, with a suggested fault type and confidence level for each.
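A minimal sketch of how a service dependency graph could be derived from a Jaeger JSON export. This is not ChaosRank's actual parser, which the post does not show. It assumes the layout of the Jaeger UI/API export (a top-level `data` list of traces, each with `spans` and a `processes` map), and the function name `service_edges` is hypothetical. Only `CHILD_OF` references are counted, which is a simplification.

```python
from collections import Counter


def service_edges(jaeger_export):
    """Count caller -> callee service edges in a Jaeger JSON export."""
    edges = Counter()
    for trace in jaeger_export.get("data", []):
        processes = trace.get("processes", {})
        spans = trace.get("spans", [])
        # Map each span ID to the service that emitted it.
        span_service = {
            span["spanID"]: processes[span["processID"]]["serviceName"]
            for span in spans
        }
        for span in spans:
            callee = span_service[span["spanID"]]
            for ref in span.get("references", []):
                if ref.get("refType") != "CHILD_OF":
                    continue
                caller = span_service.get(ref["spanID"])
                # Skip parents missing from the export and intra-service spans.
                if caller is not None and caller != callee:
                    edges[(caller, callee)] += 1
    return edges


# Tiny hand-written trace: gateway -> payment -> db, with one internal payment span.
sample = {
    "data": [{
        "traceID": "t1",
        "spans": [
            {"spanID": "a", "processID": "p1", "references": []},
            {"spanID": "b", "processID": "p2",
             "references": [{"refType": "CHILD_OF", "traceID": "t1", "spanID": "a"}]},
            {"spanID": "c", "processID": "p2",
             "references": [{"refType": "CHILD_OF", "traceID": "t1", "spanID": "b"}]},
            {"spanID": "d", "processID": "p3",
             "references": [{"refType": "CHILD_OF", "traceID": "t1", "spanID": "c"}]},
        ],
        "processes": {
            "p1": {"serviceName": "gateway"},
            "p2": {"serviceName": "payment"},
            "p3": {"serviceName": "db"},
        },
    }]
}

edges = service_edges(sample)
for (caller, callee), count in sorted(edges.items()):
    print(f"{caller} -> {callee}: {count}")
```

The edge counts double as call frequencies, which a ranking tool could use as edge weights or as a traffic estimate.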

The risk score combines two signals:
- Blast radius: blended PageRank + in-degree centrality on the dependency graph (captures both deep chains and shallow-wide hubs)
- Fragility: per-incident traffic-normalized severity with exponential decay (normalization order matters: post-hoc normalization produces ranking inversions at high traffic differentials)
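The two signals can be sketched in plain Python. This is an illustration under stated assumptions, not ChaosRank's implementation. The post does not give the blend weights, the decay rate, the PageRank direction, or how the two signals are combined, so the values here are hypothetical: edges point caller -> callee, both blends use 0.5, the half-life is 30 days, and the final score is a weighted sum.

```python
import math


def pagerank(graph, damping=0.85, iters=100):
    """Power-iteration PageRank on a dict of caller -> list of callees."""
    nodes = sorted(set(graph) | {v for vs in graph.values() for v in vs})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Rank held by services with no outgoing calls is spread evenly.
        dangling = sum(rank[v] for v in nodes if not graph.get(v))
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            outs = graph.get(u, [])
            for v in outs:
                new[v] += damping * rank[u] / len(outs)
        rank = new
    return rank


def in_degree(graph):
    """Number of distinct callers per service."""
    deg = {v: 0 for v in set(graph) | {v for vs in graph.values() for v in vs}}
    for outs in graph.values():
        for v in set(outs):
            deg[v] += 1
    return deg


def normalize(scores):
    top = max(scores.values()) or 1.0
    return {k: v / top for k, v in scores.items()}


def blast_radius(graph, blend=0.5):  # blend weight is an assumption
    pr, deg = normalize(pagerank(graph)), normalize(in_degree(graph))
    return {v: blend * pr[v] + (1 - blend) * deg[v] for v in pr}


def fragility(incidents, half_life_days=30.0):  # half-life is an assumption
    """incidents: (service, severity, requests_per_second, age_days) tuples."""
    decay = math.log(2) / half_life_days
    scores = {}
    for service, severity, rps, age_days in incidents:
        # Normalize each incident by its own traffic before summing.
        per_incident = severity / max(rps, 1.0)
        scores[service] = scores.get(service, 0.0) + per_incident * math.exp(-decay * age_days)
    return scores


def risk_ranking(graph, incidents, weight=0.5):  # weighted sum is an assumption
    blast = blast_radius(graph)
    frag = fragility(incidents)
    frag = normalize({v: frag.get(v, 0.0) for v in blast})
    risk = {v: weight * blast[v] + (1 - weight) * frag[v] for v in blast}
    return sorted(risk, key=risk.get, reverse=True)


# Hypothetical topology and incident history, for illustration only.
graph = {"gateway": ["payment", "search", "logger"],
         "search": ["payment"],
         "payment": ["db"]}
incidents = [("payment", 3, 200, 5), ("logger", 1, 50, 60)]
ranking = risk_ranking(graph, incidents)
print(ranking)
```

In this toy graph `db` has the highest PageRank (end of the deepest chain) while `payment` has the highest in-degree, which shows why a blend of the two can rank differently from either alone.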

Evaluated on the DeathStarBench social-network topology (31 services) from the UIUC/FIRM dataset (OSDI 2020). ChaosRank found seeded weaknesses in 1 experiment on average, versus 9.8 for random selection, across 20 trials.
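A metric like "experiments until the first seeded weakness is found" could be computed along these lines. This is a sketch under assumptions, not the benchmark code. The post does not say which services were seeded or how the random baseline was sampled, so the service names, the weak set, and the fixed RNG seed below are all hypothetical.

```python
import random


def experiments_to_first_hit(order, weak_services):
    """1-based position of the first weak service in an experiment order."""
    for position, service in enumerate(order, start=1):
        if service in weak_services:
            return position
    raise ValueError("no weak service appears in the order")


def random_baseline(services, weak_services, trials=20, seed=0):
    """Mean experiments-to-first-hit over shuffled orders (random selection)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        order = list(services)
        rng.shuffle(order)
        total += experiments_to_first_hit(order, weak_services)
    return total / trials


# Hypothetical setup: 31 services, two of them seeded with weaknesses.
services = [f"svc-{i}" for i in range(31)]
weak = {"svc-7", "svc-19"}

# A ranking that places a weak service first needs exactly one experiment.
ranked = ["svc-7"] + [s for s in services if s != "svc-7"]
ranked_cost = experiments_to_first_hit(ranked, weak)
baseline_cost = random_baseline(services, weak)
print(ranked_cost, baseline_cost)
```

The ratio `baseline_cost / ranked_cost` is the kind of speedup figure the post reports. Its value depends on how many weaknesses are seeded and on the number of trials.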

Output formats: Rich terminal table, JSON, and LitmusChaos ChaosEngine YAML (pipeable directly to kubectl apply).

To try it without your own traces — sample data is included:

    pip install chaosrank-cli
    git clone https://github.com/Medinz01/chaosrank
    cd chaosrank
    chaosrank rank \
      --traces benchmarks/real_traces/social_network.json \
      --incidents benchmarks/real_traces/social_network_incidents.csv

Known limitations: async dependencies (Kafka, SQS) don't appear in trace spans, so blast radius is underestimated for event-driven architectures. Only Jaeger JSON is supported for now; OTel OTLP is next.

Happy to discuss the algorithm design, particularly the PageRank direction choice and why per-incident normalization matters.
