Zagora, Distributed fine-tuning platform on mixed GPUs over internet

by miyamotomusashi·Mar 1, 2026·1 point·0 comments

AI Analysis

● Mid · Big Brain · Bold Bet

Pipeline parallelism for mixed GPUs over the internet, but unproven against established frameworks.

Strengths

•Novel approach to heterogeneous training (adaptive layer assignment by GPU capability)
•Supports both managed and BYOC deployment modes
•Handles worker crashes via checkpoint-based recovery without full resync

Weaknesses

•Acknowledged limitations remain unaddressed; the lack of full-parameter fine-tuning support is a critical gap
•No public benchmarks comparing against vLLM, torchrun, or Ray on the same hardware setup

Post Description

I built Zagora, a distributed fine-tuning platform that turns fragmented or mixed GPUs into a unified training cluster over a standard internet connection (1 Gbps).

The problem:

Most distributed training assumes homogeneous GPUs and high-bandwidth interconnects (NVLink/InfiniBand). On heterogeneous fleets over standard internet, tensor/data parallel approaches become communication-bound and fragile.
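To make the communication-bound claim concrete, here is a back-of-envelope sketch in Python. Every number in it is an illustrative assumption, not a Zagora measurement: a 7B-parameter model with 2-byte gradients, 4 workers, a ring all-reduce (per-worker traffic of 2(N-1)/N times the payload), and a stage-boundary activation of shape 1 x 2048 x 4096 in 2-byte precision. The helper names are hypothetical, and link latency is ignored.

```python
def transfer_seconds(num_bytes: float, link_gbps: float = 1.0) -> float:
    """Time to push num_bytes through a link of link_gbps gigabits per second."""
    return num_bytes * 8 / (link_gbps * 1e9)


def ring_allreduce_bytes(payload_bytes: float, workers: int) -> float:
    """Bytes each worker sends in a ring all-reduce of payload_bytes."""
    return 2 * (workers - 1) / workers * payload_bytes


def boundary_activation_bytes(micro_batch: int, seq_len: int, hidden: int,
                              bytes_per_value: int = 2) -> int:
    """Size of one activation tensor handed from one pipeline stage to the next."""
    return micro_batch * seq_len * hidden * bytes_per_value


if __name__ == "__main__":
    # Assumed workload: 7e9 parameters, 2 bytes per gradient value, 4 workers.
    grad_bytes = 7e9 * 2
    sync = transfer_seconds(ring_allreduce_bytes(grad_bytes, workers=4))

    # Assumed boundary tensor: micro-batch 1, sequence 2048, hidden size 4096.
    act = transfer_seconds(boundary_activation_bytes(1, 2048, 4096))

    print(f"full gradient all-reduce per step: {sync:.1f} s")
    print(f"one boundary activation transfer:  {act:.3f} s")
```

Under these assumptions the full gradient sync costs roughly 168 seconds per step on a 1 Gbps link, while one boundary activation takes about 0.13 seconds. Syncing only adapter gradients would be far cheaper than the full-model figure, so treat this as an upper-bound illustration of why parameter-sized traffic does not fit a commodity link.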

What Zagora does under the hood:

- Uses pipeline-style parallelism instead of heavy tensor synchronization.

- Passes only boundary activations between stages rather than full parameter sync.

- Assigns layers proportionally to GPU capability to reduce straggler idle time.

- Uses checkpoint-based recovery to tolerate worker crashes.

- Supports adapter-based fine-tuning (e.g., QLoRA) to reduce memory pressure.
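The proportional layer assignment described above could look something like the following Python sketch. This is not Zagora's scheduler: the post does not say how capability is measured, so the `capability` scores (for example, a relative throughput number per GPU) and the function name `assign_layers` are assumptions. The sketch splits layers into contiguous pipeline stages using a largest-remainder rounding so that the counts always sum to the total.

```python
from typing import List, Tuple


def assign_layers(num_layers: int, capability: List[float]) -> List[Tuple[int, int]]:
    """Split layers 0..num_layers-1 into contiguous stages, one per worker.

    Each worker's share is proportional to its (assumed) capability score.
    Returns a list of half-open (start, end) layer ranges in worker order.
    """
    if num_layers <= 0 or not capability or any(c <= 0 for c in capability):
        raise ValueError("need a positive layer count and positive capability scores")

    total = sum(capability)
    quotas = [num_layers * c / total for c in capability]  # ideal fractional share
    counts = [int(q) for q in quotas]                      # round down first

    # Hand the leftover layers to the workers with the largest fractional part.
    leftover = num_layers - sum(counts)
    by_remainder = sorted(range(len(capability)),
                          key=lambda i: quotas[i] - counts[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1

    # Turn per-worker counts into contiguous layer ranges.
    # Note: a very weak worker can end up with an empty range (start == end).
    stages = []
    start = 0
    for n in counts:
        stages.append((start, start + n))
        start += n
    return stages


if __name__ == "__main__":
    # Hypothetical fleet: the middle GPU is scored as twice as capable as the others.
    print(assign_layers(32, [1.0, 2.0, 1.0]))
```

For the hypothetical fleet in the example, the middle worker receives layers 8 through 23 and the other two receive 8 layers each. A production scheduler would also have to respect per-GPU memory limits and link speeds, which this sketch leaves out.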

Zagora currently supports managed runs (we provision GPUs in-region) and a BYOC mode where users run workers on their own infrastructure.

Limitations:

- Full-parameter fine-tuning is not supported yet.

- It won't beat an NVLink cluster on raw throughput.

- Cross-region training is still latency-sensitive.

- Scheduling across heterogeneous nodes is still an ongoing tuning problem.