Quantifying the "Setup Tax" in Cloud vs WAN Clusters for Iterative Fine-Tuning

I’m researching the cost-efficiency of Pipeline Parallelism (PP) over consumer WAN links versus traditional data-center (H100) training for iterative fine-tuning tasks.

The conventional wisdom is that WAN latency makes distributed training impossible. However, my preliminary benchmarks on a prototype cluster suggest a specific crossover point where consumer hardware becomes cheaper for short, iterative runs (<5 hours).

The thesis: while H100s have superior throughput, the setup tax (downloading ~140 GB of weights + installing drivers) runs about 45 minutes per cold start, and that fixed cost amortizes poorly over short jobs.

  • H100 Cluster: High setup cost + High hourly rate.

  • WAN Swarm (4090s): Near-zero setup (weights cached) + Lower hourly rate + High latency penalty (~1.6x slower). A rough break-even sketch follows below.
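To make the crossover concrete, here is a simple break-even model in Python. The dollar rates are illustrative placeholders (not my actual pricing); only the 45-minute setup and the 1.6x slowdown come from the numbers above, so plug in your own figures.

```python
# Rough break-even model for the "setup tax" thesis above.
# All rates are hypothetical placeholders -- substitute your own pricing.

def total_cost(compute_hours, hourly_rate, setup_hours=0.0, slowdown=1.0):
    """Billed cost: you pay the hourly rate for setup time plus (slowed-down) compute."""
    return hourly_rate * (setup_hours + compute_hours * slowdown)

def breakeven_hours(h100_rate, wan_rate, setup_hours, wan_slowdown):
    """Compute hours at which the H100 cluster becomes cheaper than the WAN swarm.

    Solves h100_rate * (setup + T) = wan_rate * slowdown * T for T.
    Returns None if the WAN swarm stays cheaper for any job length.
    """
    denom = wan_rate * wan_slowdown - h100_rate
    if denom <= 0:
        return None
    return h100_rate * setup_hours / denom

if __name__ == "__main__":
    H100_RATE = 8.0     # $/hr for the H100 cluster (placeholder)
    WAN_RATE = 6.0      # $/hr for the 4090 swarm (placeholder)
    SETUP_HOURS = 0.75  # ~45 min cold start (weights + drivers)
    SLOWDOWN = 1.6      # WAN latency penalty from the bullets above

    for t in (1, 2, 5, 10):
        h100 = total_cost(t, H100_RATE, setup_hours=SETUP_HOURS)
        wan = total_cost(t, WAN_RATE, slowdown=SLOWDOWN)
        print(f"{t:>2} compute-hr job:  H100 ${h100:6.2f}   WAN ${wan:6.2f}")

    print("Break-even compute hours:", breakeven_hours(H100_RATE, WAN_RATE, SETUP_HOURS, SLOWDOWN))
```

With these placeholder rates the break-even lands at 3.75 compute hours, which is roughly the shape of the <5 hour window I'm seeing; the exact point obviously shifts with real pricing.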

My question: Has anyone here successfully implemented Pipeline Parallelism (splitting layers sequentially across nodes, similar to the Petals architecture) over standard fiber connections? I’m looking to compare latency logs for activation passing between stages.
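For the latency comparison, here is the kind of minimal probe I have in mind: two processes, one per machine, echoing a placeholder activation tensor back and forth with torch.distributed point-to-point ops over the gloo (TCP) backend. The tensor shape and launch details are assumptions, not my production setup; launch one process per peer with torchrun, or set RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT yourself.

```python
# Two-stage activation round-trip probe, a sketch only (not a training loop).
# Rank 0 plays pipeline stage 0, rank 1 plays stage 1; gloo runs over plain TCP,
# so it works across a WAN as long as the master address/port are reachable.
import time

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="gloo")  # reads RANK/WORLD_SIZE/MASTER_* from env
    rank = dist.get_rank()

    # Placeholder activation: batch 1 x seq 2048 x hidden 8192, fp32 (~64 MB per hop)
    activation = torch.randn(1, 2048, 8192)

    for step in range(10):
        if rank == 0:
            t0 = time.perf_counter()
            dist.send(activation, dst=1)   # push activations to the next stage
            dist.recv(activation, src=1)   # block until the echo comes back
            rtt = time.perf_counter() - t0
            print(f"step {step}: activation round trip {rtt * 1000:.1f} ms")
        else:
            dist.recv(activation, src=0)   # receive activations from the previous stage
            dist.send(activation, dst=0)   # echo them back (stands in for the backward pass)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Sweeping the tensor size up and down separates the fixed WAN round-trip latency from the bandwidth-bound transfer time, which is the split I care about when comparing logs.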

If you are working on decentralized training topology, I’d love to coordinate benchmarks.
