Viscacha - A crash-safe, zero-infra job system for funcs/AI pipelines
Zero-Redis job queue that handles AI pipeline retries better than Celery.
Crash-safe distributed job execution with fencing tokens, lease recovery and deterministic failure validation.
Fencing tokens plus lease-expiry races caught by a deterministic test harness: correctness, not just convenience.
Backend engineers, SREs, distributed systems practitioners building resilient job processing systems
pgboss · BullMQ · Temporal
I built Faultline, a PostgreSQL-backed distributed job execution engine using:
- Lease-based claims
- Retry scheduling
- Idempotent side effects via a ledger table
- A deterministic race reproduction harness
The interesting part wasn’t the happy path. It was the lease-expiry race.
Setup:
- Lease TTL: 1s
- Worker A sleeps 2.5s (forces expiry)
- Barrier enforces deterministic ordering
- Worker B reclaims the job
Structured trace:

    {"event": "lease_acquired", "job_id": "...", "token": 1, "forced": true}
    {"event": "execution_started", "job_id": "...", "token": 1}
    {"event": "lease_acquired", "job_id": "...", "token": 2, "forced": true}
    {"event": "execution_started", "job_id": "...", "token": 2}
    {"event": "stale_write_blocked", "job_id": "...", "stale_token": 1, "current_token": 2, "reason": "token_mismatch"}
    {"event": "worker_exit", "reason": "stale"}
    {"event": "worker_exit", "reason": "success"}
Worker A believed it still owned the lease. Worker B legitimately reclaimed it.
Without fencing, Worker A's stale write would have gone through after it had lost the lease.
UNIQUE(job_id) alone is insufficient: it prevents duplicate rows but does not encode which lease epoch owns the write.
The fix:
- Add `fencing_token BIGINT`
- Increment atomically on every lease acquisition
- Bind side effects to `(job_id, fencing_token)`
- Enforce a write gate before mutation
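As a rough illustration of the last two bullets, here is a minimal Python sketch of a write gate. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and the `side_effects` table, its columns, and the `gated_write` helper are hypothetical names chosen for illustration, not Faultline's actual schema or API. The gate check and the insert happen in a single statement, so a stale token matches no row and writes nothing.

```python
import sqlite3


def make_ledger_db():
    """Build a toy schema: one job whose current fencing token is 2."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE jobs (
            id INTEGER PRIMARY KEY,
            fencing_token INTEGER NOT NULL
        );
        -- Hypothetical ledger: one side effect per (job_id, fencing_token).
        CREATE TABLE side_effects (
            job_id INTEGER NOT NULL,
            fencing_token INTEGER NOT NULL,
            payload TEXT NOT NULL,
            PRIMARY KEY (job_id, fencing_token)
        );
        INSERT INTO jobs (id, fencing_token) VALUES (1, 2);
        """
    )
    return conn


def gated_write(conn, job_id, token, payload):
    """Record a side effect only if `token` is still the job's current token.

    Returns True if a ledger row was written, False if the write was
    blocked (stale token) or was a duplicate for the same epoch.
    """
    cur = conn.execute(
        """
        INSERT OR IGNORE INTO side_effects (job_id, fencing_token, payload)
        SELECT id, fencing_token, ?
          FROM jobs
         WHERE id = ? AND fencing_token = ?
        """,
        (payload, job_id, token),
    )
    conn.commit()
    return cur.rowcount == 1
```

With the job at token 2, `gated_write(conn, 1, 1, "stale")` is rejected, `gated_write(conn, 1, 2, "fresh")` succeeds, and repeating the second call is a no-op because of the composite primary key.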
Claim logic:

    UPDATE jobs
    SET state = 'running',
        lease_owner = $1,
        lease_expires_at = NOW() + make_interval(secs => $2),
        fencing_token = fencing_token + 1,
        updated_at = NOW()
    WHERE id = $3
      AND (
        state = 'queued'
        OR (state = 'running' AND lease_expires_at < NOW())
      )
    RETURNING id, fencing_token;
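The same claim shape can be exercised locally. This is a sketch under stated assumptions: SQLite stands in for PostgreSQL, the `now` argument simulates the database clock (in the real query that role belongs to `NOW()`, never to a worker's clock), and `make_jobs_db` and `claim` are illustrative names. It reads the token back with a SELECT in the same transaction rather than relying on RETURNING.

```python
import sqlite3


def make_jobs_db():
    """Toy jobs table with a single queued job (id=1)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE jobs (
            id INTEGER PRIMARY KEY,
            state TEXT NOT NULL DEFAULT 'queued',
            lease_owner TEXT,
            lease_expires_at REAL,
            fencing_token INTEGER NOT NULL DEFAULT 0
        )
        """
    )
    conn.execute("INSERT INTO jobs (id) VALUES (1)")
    conn.commit()
    return conn


def claim(conn, job_id, owner, ttl_s, now):
    """Try to claim a job; return the new fencing token, or None.

    `now` is a simulated database clock (assumption for this sketch).
    """
    cur = conn.execute(
        """
        UPDATE jobs
           SET state = 'running',
               lease_owner = ?,
               lease_expires_at = ? + ?,
               fencing_token = fencing_token + 1
         WHERE id = ?
           AND (state = 'queued'
                OR (state = 'running' AND lease_expires_at < ?))
        """,
        (owner, now, ttl_s, job_id, now),
    )
    if cur.rowcount == 0:
        # Job is held under a live lease (or does not exist): nothing claimed.
        conn.rollback()
        return None
    # Same transaction as the UPDATE, so this sees the incremented token.
    token = conn.execute(
        "SELECT fencing_token FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()[0]
    conn.commit()
    return token
```

With a 1s TTL, a first claim at `now=0.0` yields token 1, a second claim at `now=0.5` is refused because the lease is still live, and a claim at `now=2.5` reclaims the job with token 2.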
Lease validity depends solely on DB time (`NOW()`); workers never use local clocks for correctness.
Guarantees under forced expiry + reclaim:
- No duplicate side effects
- No stale worker mutation
- Deterministic reproduction of the race
- DB-enforced epoch ownership via `(job_id, fencing_token)`
The harness forces this race deterministically via barrier gating and forced TTL expiry.
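To give a feel for the barrier gating, here is a small self-contained Python sketch of the same race. It is not the Faultline harness: `JobStore` is a hypothetical in-memory stand-in for the jobs table, its `now` field simulates the database clock so no real sleeping is needed, and two `threading.Barrier` objects pin the interleaving (A claims, the lease expires, B reclaims and writes, then A attempts its stale write).

```python
import threading


class JobStore:
    """In-memory stand-in for the jobs table (hypothetical, for illustration)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.now = 0.0          # simulated DB clock, advanced by the harness
        self.expires_at = None  # None means the job is still queued
        self.token = 0
        self.effects = []       # committed side effects as (token, worker)

    def advance(self, to):
        with self._lock:
            self.now = to

    def claim(self, ttl):
        with self._lock:
            if self.expires_at is not None and self.expires_at >= self.now:
                return None     # lease still live
            self.token += 1
            self.expires_at = self.now + ttl
            return self.token

    def gated_write(self, token, worker):
        with self._lock:
            if token != self.token:
                return False    # stale write blocked
            self.effects.append((token, worker))
            return True


def run_race():
    store = JobStore()
    trace = []
    a_claimed = threading.Barrier(2, timeout=5)  # A holds token 1
    b_claimed = threading.Barrier(2, timeout=5)  # B holds token 2 and has written

    def worker_a():
        token = store.claim(ttl=1.0)
        a_claimed.wait()
        b_claimed.wait()        # stands in for the 2.5s sleep past the TTL
        ok = store.gated_write(token, "A")
        trace.append(("A", token, ok))

    def worker_b():
        a_claimed.wait()
        store.advance(2.5)      # force lease expiry on the simulated clock
        token = store.claim(ttl=1.0)
        ok = store.gated_write(token, "B")
        trace.append(("B", token, ok))
        b_claimed.wait()

    threads = [threading.Thread(target=worker_a), threading.Thread(target=worker_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return trace, store.effects
```

Because both barriers must be passed in order, every run produces the same outcome: B's write with token 2 lands, and A's write with token 1 is rejected.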
Curious how others handle fencing under lease-based execution, specifically how teams handle fencing token overflow at scale and whether renewal logic changes the fencing guarantee.
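On the overflow question, a back-of-the-envelope check suggests a signed 64-bit counter is unlikely to wrap in practice. The sketch assumes the token is a PostgreSQL BIGINT (maximum 2^63 - 1) incremented once per lease acquisition on a single job row; the claim rate passed in is an arbitrary, deliberately extreme assumption, not a measured figure.

```python
BIGINT_MAX = 2**63 - 1                   # upper bound of a signed 64-bit integer
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_until_overflow(claims_per_second):
    """Years before one job's token would exceed BIGINT at a constant claim rate."""
    return BIGINT_MAX / claims_per_second / SECONDS_PER_YEAR


# Even at one million lease acquisitions per second on the same job row,
# the counter lasts on the order of hundreds of thousands of years.
print(round(years_until_overflow(1_000_000)))
```

Since the token is per job rather than global, the relevant rate is how often a single job is reclaimed, which is normally far lower than this.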
SQLite for local dev without Redis, but Trigger.dev and Inngest already own this space.
Ditches Redis entirely by leveraging Postgres LISTEN/NOTIFY for instant wakeups.
TUI and web dashboard for ARQ when the library has no built-in monitoring.
BullMQ replacement hitting 48k jobs/s via 1-RTT ops and Rust native bindings.
Job queue that eliminates Redis/RabbitMQ by storing state in a single JSON file with CAS.