GitHub Repository

Crash-safe distributed job execution with fencing tokens, lease recovery and deterministic failure validation.

5 stars · Python

Crash-safe job queue – lease-expiry race and fencing fix

by kritibehl·Feb 23, 2026·1 point·1 comment

AI Analysis

●● Solid · Big Brain · Niche Gem

Fencing tokens plus lease-expiry races caught with a deterministic test harness: correctness, not just convenience.

Strengths
  • Lease-expiry race condition is genuinely subtle and well-articulated with structured traces.
  • Fencing token pattern (monotonic token per lease) is sound stale-writer protection.
  • Deterministic failure harness forces adversarial timing—validates claims, doesn't hand-wave.
Weaknesses
  • Job queues are a crowded category (BullMQ, Temporal, pg-boss); no killer differentiator vs PostgreSQL-native competitors.
  • Limited scope: POST examples, no multi-worker benchmarks, latency, or throughput comparisons to existing solutions.
Target Audience

Backend engineers, SREs, distributed systems practitioners building resilient job processing systems

Similar To

pg-boss · BullMQ · Temporal

Post Description

Most lease-based job queues look correct until you test them adversarially.

I built Faultline, a PostgreSQL-backed distributed job execution engine using:

- Lease-based claims
- Retry scheduling
- Idempotent side effects via a ledger table
- A deterministic race reproduction harness

The interesting part wasn’t the happy path. It was the lease-expiry race.

Setup:

- Lease TTL: 1s
- Worker A sleeps 2.5s (forces expiry)
- Barrier enforces deterministic ordering
- Worker B reclaims the job

Structured trace:

    {"event": "lease_acquired", "job_id": "...", "token": 1, "forced": true}
    {"event": "execution_started", "job_id": "...", "token": 1}
    {"event": "lease_acquired", "job_id": "...", "token": 2, "forced": true}
    {"event": "execution_started", "job_id": "...", "token": 2}
    {"event": "stale_write_blocked", "job_id": "...", "stale_token": 1, "current_token": 2, "reason": "token_mismatch"}
    {"event": "worker_exit", "reason": "stale"}
    {"event": "worker_exit", "reason": "success"}

Worker A believed it still owned the lease. Worker B legitimately reclaimed it.

Without fencing, Worker A's stale write could still go through.

UNIQUE(job_id) alone is insufficient — it prevents duplicate rows but does not encode lease epoch ownership.
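To make that concrete, here is a minimal sketch of a ledger keyed on `job_id` only. It uses Python's standard-library `sqlite3` as a stand-in for PostgreSQL, and the `ledger` schema and `record` helper are hypothetical names for illustration, not Faultline's code. Whichever worker writes first wins, even if its lease has already expired:

```python
import sqlite3

# In-memory stand-in for PostgreSQL; the schema is illustrative only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ledger (job_id TEXT UNIQUE, writer TEXT)")

def record(writer):
    """Try to record the side effect for job-1; return True if the row was inserted."""
    try:
        with db:  # commit on success, roll back on error
            db.execute("INSERT INTO ledger VALUES ('job-1', ?)", (writer,))
        return True
    except sqlite3.IntegrityError:
        return False

# The stale worker happens to write first; the constraint cannot tell it is stale.
stale_ok = record("worker-a (expired lease)")
owner_ok = record("worker-b (current lease)")
print(stale_ok, owner_ok)  # True False
```

The constraint did its job (one row per job), but the row belongs to the wrong lease epoch and the legitimate owner is the one rejected.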

The fix:

- Add `fencing_token BIGINT`
- Increment atomically on every lease acquisition
- Bind side effects to `(job_id, fencing_token)`
- Enforce a write gate before mutation
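One way the write gate could look is a conditional insert that only succeeds while the caller's token matches the job's current token. This is a sketch under assumptions: `sqlite3` stands in for PostgreSQL so it runs anywhere, and the `ledger` schema, the `gated_write` helper, and the `effect` column are hypothetical names, not taken from the Faultline repository.

```python
import sqlite3

def make_db():
    # In-memory stand-in for PostgreSQL; table and column names are illustrative.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE jobs (id TEXT PRIMARY KEY, fencing_token INTEGER NOT NULL);
        CREATE TABLE ledger (
            job_id TEXT NOT NULL,
            fencing_token INTEGER NOT NULL,
            effect TEXT NOT NULL,
            UNIQUE (job_id, fencing_token)  -- one side effect per lease epoch
        );
    """)
    return db

def gated_write(db, job_id, token, effect):
    """Insert the side effect only if `token` is still the job's current token."""
    with db:  # commit on success, roll back on error
        cur = db.execute(
            "INSERT INTO ledger (job_id, fencing_token, effect)"
            " SELECT id, fencing_token, ? FROM jobs"
            " WHERE id = ? AND fencing_token = ?",
            (effect, job_id, token),
        )
        return cur.rowcount == 1  # 0 rows means the token was stale

db = make_db()
db.execute("INSERT INTO jobs VALUES ('job-1', 1)")                  # Worker A holds token 1
db.execute("UPDATE jobs SET fencing_token = 2 WHERE id = 'job-1'")  # Worker B reclaims
stale = gated_write(db, "job-1", 1, "charge-card")    # False: token mismatch
current = gated_write(db, "job-1", 2, "charge-card")  # True: current epoch
```

Because the check and the insert are a single statement, there is no window between "verify token" and "write" for another claim to slip into.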

Claim logic:

    UPDATE jobs
    SET state = 'running',
        lease_owner = $1,
        lease_expires_at = NOW() + make_interval(secs => $2),
        fencing_token = fencing_token + 1,
        updated_at = NOW()
    WHERE id = $3
      AND (
        state = 'queued'
        OR (state = 'running' AND lease_expires_at < NOW())
      )
    RETURNING id, fencing_token;
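The behavior of this statement can be sketched in Python. To keep the example runnable without a PostgreSQL server it uses `sqlite3`, a simplified `jobs` schema, and an explicit `now` argument in place of the database's `NOW()`; all of those, and the `claim` helper itself, are assumptions made for illustration.

```python
import sqlite3

def make_db():
    # In-memory stand-in for the PostgreSQL jobs table (simplified schema).
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs ("
        " id TEXT PRIMARY KEY,"
        " state TEXT NOT NULL DEFAULT 'queued',"
        " lease_owner TEXT,"
        " lease_expires_at REAL,"
        " fencing_token INTEGER NOT NULL DEFAULT 0)"
    )
    return db

def claim(db, job_id, worker, ttl_secs, now):
    """Try to claim job_id; return the new fencing token, or None if the lease is held."""
    with db:  # the update and the read-back commit together
        cur = db.execute(
            "UPDATE jobs SET state='running', lease_owner=?, lease_expires_at=?,"
            " fencing_token=fencing_token+1"
            " WHERE id=? AND (state='queued'"
            " OR (state='running' AND lease_expires_at < ?))",
            (worker, now + ttl_secs, job_id, now),
        )
        if cur.rowcount == 0:
            return None  # someone else holds an unexpired lease
        row = db.execute(
            "SELECT fencing_token FROM jobs WHERE id=?", (job_id,)
        ).fetchone()
        return row[0]

db = make_db()
db.execute("INSERT INTO jobs (id) VALUES ('job-1')")
t1 = claim(db, "job-1", "worker-a", 1.0, now=0.0)       # 1: first claim
blocked = claim(db, "job-1", "worker-b", 1.0, now=0.5)  # None: lease still valid
t2 = claim(db, "job-1", "worker-b", 1.0, now=2.5)       # 2: lease expired, reclaimed
```

In PostgreSQL the `RETURNING` clause removes the need for the read-back query, and `NOW()` replaces the `now` parameter so no worker clock is involved.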

Lease validity depends solely on DB time (`NOW()`); workers never use local clocks for correctness.

Guarantees under forced expiry + reclaim:

- No duplicate side effects
- No stale worker mutation
- Deterministic reproduction of the race
- DB-enforced epoch ownership via `(job_id, fencing_token)`

The harness forces this race deterministically via barrier gating and forced TTL expiry.
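A rough illustration of barrier gating, not the actual harness: the sketch below replaces the database with an in-process `FakeJobStore` (a hypothetical class, with a lock playing the role of the DB's atomicity) and uses two `threading.Barrier` objects to pin the interleaving, so Worker A always attempts its write after Worker B has reclaimed.

```python
import threading

class FakeJobStore:
    """In-process stand-in for the jobs row and ledger (illustrative only)."""
    def __init__(self):
        self._lock = threading.Lock()
        self.token = 0
        self.ledger = []

    def claim(self):
        # Forced reclaim: the harness treats any earlier lease as already expired.
        with self._lock:
            self.token += 1
            return self.token

    def write(self, token, effect):
        # Write gate: only the current token may record a side effect.
        with self._lock:
            if token != self.token:
                return False
            self.ledger.append((token, effect))
            return True

def run_race():
    store = FakeJobStore()
    events = []
    a_claimed = threading.Barrier(2, timeout=5)  # passes once A holds token 1
    b_done = threading.Barrier(2, timeout=5)     # passes once B has reclaimed and written

    def worker_a():
        token = store.claim()
        events.append(("lease_acquired", "A", token))
        a_claimed.wait()
        b_done.wait()  # stands in for A sleeping past its TTL
        ok = store.write(token, "effect-from-A")
        events.append(("write_ok" if ok else "stale_write_blocked", "A", token))

    def worker_b():
        a_claimed.wait()
        token = store.claim()
        events.append(("lease_acquired", "B", token))
        ok = store.write(token, "effect-from-B")
        events.append(("write_ok" if ok else "stale_write_blocked", "B", token))
        b_done.wait()

    threads = [threading.Thread(target=worker_a), threading.Thread(target=worker_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    return events, store.ledger

events, ledger = run_race()
for event in events:
    print(event)
```

The barriers remove the scheduler from the picture: every run yields the same event order, which is what makes the stale-write case assertable in a test rather than something that shows up occasionally.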

Curious how others handle fencing under lease-based execution — specifically how teams handle fencing token overflow at scale and whether renewal logic changes the fencing guarantee.
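On the overflow question, a back-of-the-envelope calculation helps size the problem. It assumes a signed 64-bit `BIGINT` starting near zero and a sustained acquisition rate on a single job row (the counter is per job); the function name and the rate are hypothetical.

```python
BIGINT_MAX = 2**63 - 1  # largest value of a signed 64-bit BIGINT
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def years_until_overflow(acquisitions_per_second):
    """Years before a per-job counter starting at 0 would exceed BIGINT_MAX."""
    return BIGINT_MAX / acquisitions_per_second / SECONDS_PER_YEAR

# Even at one million lease acquisitions per second on one job,
# the counter lasts on the order of 292,000 years.
years = years_until_overflow(1_000_000)
print(round(years))
```

Under those assumptions overflow is not a practical concern for `BIGINT`; it would matter only with a much narrower column type.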

Similar Projects