Zagora, Distributed fine-tuning platform on mixed GPUs over internet

by miyamotomusashi·Mar 1, 2026·1 point·0 comments

AI Analysis

Mid · Big Brain · Bold Bet

Pipeline parallelism for mixed GPUs over the internet, but unproven against established frameworks.

Strengths
  • Novel approach to heterogeneous training (adaptive layer assignment by GPU capability)
  • Supports both managed and BYOC deployment modes
  • Handles worker crashes via checkpoint-based recovery without full resync
Weaknesses
  • Listed limitations are not addressed: the lack of full-parameter fine-tuning support is a critical gap
  • No public benchmarks against vLLM, torchrun, or Ray on the same hardware setup
Target Audience

ML researchers and organizations with mixed GPU fleets lacking NVLink/InfiniBand

Similar To

Ray Tune · DeepSpeed · vLLM

Post Description

I built Zagora, a distributed fine-tuning platform that turns fragmented or mixed GPUs into a unified training cluster over a standard internet connection (1 Gbps).

The problem:

Most distributed training assumes homogeneous GPUs and high-bandwidth interconnects (NVLink/InfiniBand). On heterogeneous fleets over standard internet, tensor/data parallel approaches become communication-bound and fragile.
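
To put a rough number on "communication-bound", here is a back-of-the-envelope sketch in Python. It assumes full-parameter data-parallel training with a bandwidth-optimal ring all-reduce and ignores latency, compression, and any overlap with compute. The function name and the 7B-parameter, 4-worker figures are illustrative assumptions, not measurements from Zagora.

```python
def ring_allreduce_seconds(num_params, bytes_per_param, num_workers, link_gbps):
    """Estimate the time for one gradient all-reduce over a slow link.

    Assumes a ring all-reduce, where each worker sends about
    2 * (N - 1) / N times the gradient payload per step.
    """
    payload_bytes = num_params * bytes_per_param
    sent_bytes = 2 * (num_workers - 1) / num_workers * payload_bytes
    link_bytes_per_s = link_gbps * 1e9 / 8  # Gbps -> bytes per second
    return sent_bytes / link_bytes_per_s


# Hypothetical setup: 7B parameters, 2-byte gradients, 4 workers, 1 Gbps links.
seconds = ring_allreduce_seconds(7e9, 2, 4, 1.0)
print(f"{seconds:.0f} s per gradient sync")
```

Under these assumptions a single gradient sync takes roughly 168 seconds, which is why per-step parameter synchronization is impractical at this bandwidth.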

What Zagora does under the hood:

- Uses pipeline-style parallelism instead of heavy tensor synchronization.

- Passes only boundary activations between stages rather than full parameter sync.

- Assigns layers proportionally to GPU capability to reduce straggler idle time.

- Uses checkpoint-based recovery to tolerate worker crashes.

- Supports adapter-based fine-tuning (e.g., QLoRA) to reduce memory pressure.
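
The proportional layer assignment above can be sketched as follows. This is a minimal illustration, not Zagora's actual scheduler: the function name, the relative capability scores, and the largest-remainder rounding are all assumptions, and a real scheduler would also need to respect per-GPU memory limits.

```python
def assign_layers(num_layers, capabilities):
    """Split layers into contiguous pipeline stages, sized by GPU capability.

    `capabilities` holds hypothetical relative throughput scores, one per
    worker. Returns a list of (start, end) layer ranges, end exclusive.
    """
    total = sum(capabilities)
    ideal = [num_layers * c / total for c in capabilities]
    counts = [int(x) for x in ideal]  # round down first

    # Hand the leftover layers to the workers with the largest remainders.
    leftover = num_layers - sum(counts)
    by_remainder = sorted(
        range(len(ideal)), key=lambda i: ideal[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1

    # Convert per-worker counts into contiguous layer ranges.
    ranges, start = [], 0
    for n in counts:
        ranges.append((start, start + n))
        start += n
    return ranges


# Hypothetical fleet: one fast GPU, one mid-range GPU, one slow GPU.
stages = assign_layers(32, [3, 2, 1])
print(stages)  # [(0, 16), (16, 27), (27, 32)]
```

Keeping each stage contiguous means only the activations at stage boundaries cross the network. Note that a very weak worker can end up with zero layers under this scheme, in which case it would simply be left out of the pipeline.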

Zagora currently supports managed runs (we provision GPUs in-region) and a BYOC mode where users run workers on their own infrastructure.

Limitations:

- Full-parameter fine-tuning is not supported yet.

- It won't beat an NVLink cluster on raw throughput.

- Cross-region training is still latency-sensitive.

- Heterogeneous node scheduling is an ongoing tuning problem.

IMPORTANT:

I'm currently running jobs manually, so it may take some time before training starts. However, I will run every submitted job.

Link: app.zagora.ai

I'd be interested in feedback from people who've worked on distributed training at scale.

Happy to answer technical questions.
