GitHub Repository

Risk-driven chaos experiment scheduler. Ranks which microservice to break next using service topology and incident history. 9.8x faster weakness discovery vs random selection.

8 stars · Python

ChaosRank – which microservice should you break next?

by Medinz01·Mar 6, 2026·1 point·0 comments

AI Analysis

●●● Banger · Big Brain · Solve My Problem

Replaces gut-feel chaos picks with principled risk ranking: 9.8x faster weakness discovery.

Strengths

• Novel scoring blend: PageRank centrality + incident-weighted fragility captures both structure and failure patterns
• Rigorous evaluation on DeathStarBench; methodology is transparent and reproducible
• Solves a real prioritization gap: teams genuinely do pick chaos targets by intuition

Weaknesses

• Evaluation limited to simulation; no production system case studies yet
• Requires clean incident data export and trace collection (integration friction for adoption)

Post Description

I built ChaosRank after noticing that most chaos engineering teams pick targets by gut feel or random selection. A payment service with 15 downstream dependents isn't the same risk as a logging sidecar, but tools like LitmusChaos treat them identically.

ChaosRank takes your Jaeger trace export and incident history CSV and produces a ranked list of services to target, with a suggested fault type and confidence level for each.
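As a rough illustration of the first step, here is a minimal sketch of turning a Jaeger JSON export into caller -> callee service edges. It assumes the standard Jaeger export layout (`data`, `spans`, `processes`, `references` with `CHILD_OF`); the function name `dependency_edges` and the sample trace are hypothetical and are not taken from the ChaosRank code.

```python
from collections import Counter


def dependency_edges(jaeger_export):
    """Count caller -> callee service edges in a Jaeger JSON export."""
    edges = Counter()
    for trace in jaeger_export.get("data", []):
        processes = trace.get("processes", {})
        spans = trace.get("spans", [])
        # Map each span ID to the service that emitted it.
        service_of = {
            span["spanID"]: processes[span["processID"]]["serviceName"]
            for span in spans
        }
        for span in spans:
            callee = service_of[span["spanID"]]
            for ref in span.get("references", []):
                if ref.get("refType") != "CHILD_OF":
                    continue
                caller = service_of.get(ref["spanID"])
                # Skip missing parents and spans internal to one service.
                if caller is not None and caller != callee:
                    edges[(caller, callee)] += 1
    return edges


# Hypothetical one-trace export: frontend -> payment -> db.
sample = {
    "data": [{
        "traceID": "t1",
        "processes": {
            "p1": {"serviceName": "frontend"},
            "p2": {"serviceName": "payment"},
            "p3": {"serviceName": "db"},
        },
        "spans": [
            {"spanID": "s1", "processID": "p1", "references": []},
            {"spanID": "s2", "processID": "p2",
             "references": [{"refType": "CHILD_OF", "traceID": "t1", "spanID": "s1"}]},
            {"spanID": "s3", "processID": "p2",
             "references": [{"refType": "CHILD_OF", "traceID": "t1", "spanID": "s2"}]},
            {"spanID": "s4", "processID": "p3",
             "references": [{"refType": "CHILD_OF", "traceID": "t1", "spanID": "s3"}]},
        ],
    }]
}

edges = dependency_edges(sample)
for (caller, callee), count in sorted(edges.items()):
    print(f"{caller} -> {callee}: {count}")
```

Only synchronous parent/child spans produce edges here, which is why queue-based dependencies would be invisible to this kind of extraction.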

The risk score combines two signals:
- Blast radius: blended PageRank + in-degree centrality on the dependency graph (captures both deep chains and shallow-wide hubs)
- Fragility: per-incident traffic-normalized severity with exponential decay (normalization order matters; post-hoc normalization produces ranking inversions at high traffic differentials)
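The two signals can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the ChaosRank implementation: the edge direction (caller -> callee, so rank accumulates on services that many others call), the blend weight `alpha`, the 30-day half-life, the equal-weight sum in `risk_scores`, and all service names and incident numbers are hypothetical.

```python
import math


def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (src, dst) edges."""
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out[v])
        nxt = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            for dst in out[src]:
                nxt[dst] += damping * rank[src] / len(out[src])
        rank = nxt
    return rank


def blast_radius(nodes, edges, alpha=0.5):
    """Blend max-normalized PageRank with max-normalized in-degree."""
    pr = pagerank(nodes, edges)
    indeg = {v: 0 for v in nodes}
    for _, dst in set(edges):
        indeg[dst] += 1
    max_pr = max(pr.values())
    max_in = max(indeg.values()) or 1
    return {v: alpha * pr[v] / max_pr + (1 - alpha) * indeg[v] / max_in
            for v in nodes}


def fragility(incidents, half_life_days=30.0):
    """Sum severity / traffic per incident, decayed by incident age.

    incidents: list of (severity, requests_at_incident, age_days).
    Each incident is normalized by its own traffic before summing.
    """
    decay = math.log(2) / half_life_days
    return sum(sev / max(req, 1.0) * math.exp(-decay * age)
               for sev, req, age in incidents)


def risk_scores(nodes, edges, incidents_by_service):
    """Assumed combination: equal-weight sum of the two normalized signals."""
    blast = blast_radius(nodes, edges)
    frag = {v: fragility(incidents_by_service.get(v, [])) for v in nodes}
    max_frag = max(frag.values()) or 1.0
    return {v: 0.5 * blast[v] + 0.5 * frag[v] / max_frag for v in nodes}


# Hypothetical topology and incident history.
nodes = ["gateway", "orders", "payment", "logger"]
edges = [("gateway", "orders"), ("gateway", "payment"),
         ("orders", "payment"), ("orders", "logger")]
incidents = {"payment": [(3, 1000, 5)], "logger": [(1, 1000, 60)]}

scores = risk_scores(nodes, edges, incidents)
ranked = sorted(nodes, key=scores.get, reverse=True)
print(ranked)
```

Dividing each incident's severity by the traffic at the time of that incident is what "per-incident" normalization means in this sketch; the post-hoc alternative would sum raw severities first and divide by aggregate traffic afterwards.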

Evaluated on the DeathStarBench social-network topology (31 services) from the UIUC/FIRM dataset (OSDI 2020). Across 20 trials, ChaosRank found the seeded weaknesses in 1 experiment on average, versus 9.8 for random selection.
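The metric can be illustrated with a small simulation of "experiments until the first seeded weakness is hit". This is a sketch of the general idea only; the benchmark's actual protocol is not described in the post, and the service names, the set of weak services, and the ranked order below are hypothetical.

```python
import random


def experiments_to_first_hit(order, weak):
    """Return the 1-based position of the first weak service in `order`."""
    for i, service in enumerate(order, start=1):
        if service in weak:
            return i
    return None


def mean_random_baseline(services, weak, trials=20, seed=0):
    """Average experiments-to-first-hit over random target orderings."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        order = services[:]
        rng.shuffle(order)
        total += experiments_to_first_hit(order, weak)
    return total / trials


# Hypothetical setup: 31 services, two of them seeded as weak.
services = [f"svc-{i}" for i in range(31)]
weak = {"svc-3", "svc-17"}

# Hypothetical ranking that places a weak service first.
ranked = ["svc-17"] + [s for s in services if s != "svc-17"]

ranked_hits = experiments_to_first_hit(ranked, weak)
baseline = mean_random_baseline(services, weak)
print(f"ranked: {ranked_hits}, random mean: {baseline:.1f}")
```

The speedup figure is then the ratio of the random-selection mean to the ranked result.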

Output formats: Rich terminal table, JSON, and LitmusChaos ChaosEngine YAML (pipeable directly to kubectl apply).
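For readers unfamiliar with the last format, here is a sketch of what a LitmusChaos ChaosEngine manifest for one ranked service generally looks like, built as a Python dict and printed as JSON (which `kubectl apply -f -` also accepts). The exact manifest ChaosRank emits is not shown in the post; the `app=<service>` label, the `default` namespace, the `pod-delete-sa` service account, and the helper name `chaos_engine` are assumptions for illustration.

```python
import json


def chaos_engine(service, namespace="default", experiment="pod-delete"):
    """Build a ChaosEngine manifest targeting one Deployment (illustrative)."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": f"{service}-chaos", "namespace": namespace},
        "spec": {
            "engineState": "active",
            "appinfo": {
                "appns": namespace,
                # Assumes the workload is labeled app=<service>.
                "applabel": f"app={service}",
                "appkind": "deployment",
            },
            # Assumed service account; it must already exist in the cluster.
            "chaosServiceAccount": "pod-delete-sa",
            "experiments": [{
                "name": experiment,
                "spec": {"components": {"env": [
                    {"name": "TOTAL_CHAOS_DURATION", "value": "30"},
                ]}},
            }],
        },
    }


manifest = chaos_engine("payment")
print(json.dumps(manifest, indent=2))
```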

To try it without your own traces, use the included sample data:

    pip install chaosrank-cli
    git clone https://github.com/Medinz01/chaosrank
    cd chaosrank
    chaosrank rank \
      --traces benchmarks/real_traces/social_network.json \
      --incidents benchmarks/real_traces/social_network_incidents.csv

Known limitations: async dependencies (Kafka, SQS) don't appear in trace spans, so blast radius is underestimated for event-driven architectures. Only Jaeger JSON is supported for now; OTel OTLP is next.

Happy to discuss the algorithm design, particularly the PageRank direction choice and why per-incident normalization matters.