Part 9: What Breaks at Scale
Checkpointing, Failures, and Fault Tolerance
Training on 1,000 GPUs for three weeks? You will experience hardware failures. It’s not a question of if, but when. At scale, fault tolerance isn’t optional—it’s essential.
This part covers checkpointing, failure modes, and how to build training systems that survive inevitable hardware failures.
Why Should a Leader Care?
Large-scale training runs for days or weeks on thousands of GPUs. At that scale, hardware failures are not exceptions — they’re expected.
When engineers talk about:
“We lost 8 hours of training due to a node failure”
“We’re checkpointing every 30 minutes”
“Preemption caused a recovery”
...they’re describing the reality of operating at scale. Understanding fault tolerance helps you:
Appreciate infrastructure complexity
Understand unexpected costs and delays
Evaluate reliability tradeoffs
The One Concept: At Scale, Failure Is Certain
If one GPU has a 99.9% daily uptime (pretty good!), what’s the probability of no failures across 1,000 GPUs for a day?
P(no failures) = 0.999^1000 ≈ 0.37 (37%)
That means a 63% chance of at least one failure per day.
For a 30-day training run on 1,000 GPUs, the probability of zero failures is essentially 0%.
You will have failures. The question is how to handle them.
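The same math in a few lines of Python, if you want to play with the assumptions (the 99.9% per-GPU daily reliability is the assumed figure from above, and failures are treated as independent):

```python
# Probability of a failure-free day/month, assuming independent GPU failures.
p_gpu_ok_per_day = 0.999     # assumed per-GPU daily reliability
num_gpus = 1_000

p_clean_day = p_gpu_ok_per_day ** num_gpus    # ≈ 0.37
p_clean_month = p_clean_day ** 30             # 30-day run

print(f"P(no failures in a day):   {p_clean_day:.2f}")
print(f"P(no failures in 30 days): {p_clean_month:.1e}")   # ≈ 1e-13, effectively zero
```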
Checkpointing
Checkpoint = A snapshot of training state saved to persistent storage.
What gets saved:
Model weights: The current trained parameters
Optimizer state: Momentum, variance (for Adam)
Learning rate schedule position
Training step number: Where you are in training
Random number generator state: For reproducibility
Data loader state: Which examples you’ve processed
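In PyTorch terms, a checkpoint is often little more than a dictionary of those pieces saved with torch.save. A minimal sketch (the key names and the sampler_state argument are illustrative, not a fixed standard):

```python
import torch

def build_checkpoint(model, optimizer, scheduler, step, sampler_state):
    # One common layout for training state; key names are illustrative.
    return {
        "model": model.state_dict(),                 # trained parameters
        "optimizer": optimizer.state_dict(),         # Adam momentum and variance
        "scheduler": scheduler.state_dict(),         # learning-rate schedule position
        "step": step,                                # training step number
        "rng_state": torch.random.get_rng_state(),   # random number generator state
        "sampler": sampler_state,                    # data loader progress
    }

# torch.save(build_checkpoint(...), "checkpoint.pt")
```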
For a large model trained with the Adam optimizer, checkpoints add up quickly:
70B model checkpoint (illustrative breakdown):
Weights (FP16): 140 GB
Gradients (FP16): 140 GB (not always saved)
Optimizer state (FP16): 280 GB (Adam stores 2 values per param; ~560 GB if kept in full FP32)
Total: ~560 GB
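The arithmetic behind those numbers is just bytes-per-value times parameter count (precisions as assumed in the breakdown above):

```python
params = 70e9                         # 70B parameters

weights   = params * 2                # FP16: 2 bytes per value
gradients = params * 2                # FP16
adam      = params * 2 * 2            # 2 values per param, 2 bytes each (FP16 here)

total_gb = (weights + gradients + adam) / 1e9
print(f"~{total_gb:.0f} GB")          # ≈ 560 GB (≈ 840 GB with FP32 Adam state)
```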
Checkpoint Frequency Tradeoff
Checkpoint too rarely (every 6 hours):
Failure = lose up to 6 hours of training
For 1,024 GPUs at $3/hr each: up to $18K wasted compute
Checkpoint too often (every 5 minutes):
Saving 560 GB to distributed storage: ~60-120 seconds
At 5-minute intervals, 20-40% of time spent checkpointing
Significantly slower training
Typical approach: Checkpoint every 15-60 minutes. Balance recovery cost against checkpoint overhead (~2-5% of total time).
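A quick back-of-envelope check on those overhead figures (the 90-second save time is an assumption; plug in your own):

```python
# Fraction of wall-clock time lost to synchronous checkpointing.
save_seconds = 90   # assumed time to write one ~560 GB checkpoint

for interval_minutes in (5, 15, 30, 60, 120):
    overhead = save_seconds / (interval_minutes * 60)
    print(f"checkpoint every {interval_minutes:>3} min -> {overhead:.1%} overhead")
```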
Asynchronous Checkpointing
Advanced systems checkpoint without pausing training:
Snapshot state to host memory (~10 seconds to copy from GPU to CPU RAM)
Resume training on GPU immediately
Write host memory to disk in background (~60-120 seconds)
Benefits:
GPU pause time: ~10 seconds (the copy from GPU to host RAM)
Disk write happens in parallel with training
Effective overhead: ~2-3%, even at checkpoint frequencies where synchronous saves would cost 10-15%
Complexity: Requires double buffering in host memory (extra RAM is not free) and careful state management.
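A minimal sketch of the pattern in PyTorch (the function name and file path are illustrative; a production system would also handle sharding, retries, and atomic renames):

```python
import threading
import torch

def _to_cpu(obj):
    # Recursively copy any tensors in a (possibly nested) state dict to host RAM.
    if torch.is_tensor(obj):
        return obj.detach().to("cpu", copy=True)
    if isinstance(obj, dict):
        return {k: _to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(_to_cpu(v) for v in obj)
    return obj

def async_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    # 1) Snapshot: copy training state from GPU to host memory.
    #    This is the only part that blocks training (seconds, not minutes).
    snapshot = {
        "model": _to_cpu(model.state_dict()),
        "optimizer": _to_cpu(optimizer.state_dict()),
        "step": step,
    }
    # 2) Write: flush the host-memory snapshot to disk in a background thread
    #    while training resumes on the GPU.
    writer = threading.Thread(target=torch.save, args=(snapshot, path), daemon=True)
    writer.start()
    return writer   # caller should join() before taking the next snapshot
```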
Types of Failures
Hardware Failures
GPU failures:
Memory bit flips (ECC corrects most, but not all)
Overheating → thermal throttling or shutdown
Silent data corruption (rare but catastrophic)
Network failures:
Cable unplugged
Switch port dies
Packet corruption exceeds error correction
Node failures:
Power supply dies
Memory (host RAM) corrupts
OS crashes
Storage failures:
Disk dies, corrupting checkpoints
Distributed filesystem becomes unavailable
Frequency at scale: With 1,000+ GPUs, expect multiple hardware failures per week.
MTBF (Mean Time Between Failures): Industry estimate for GPUs is ~50,000 hours (5.7 years) per GPU. With 1,000 GPUs, expect a failure every ~2 days.
Software Failures
Out of memory:
Batch too large
Memory leak in training code
Activation memory explosion
Numerical issues:
NaN (Not a Number): Result of 0/0, √-1, etc.
Inf (Infinity): Result of overflow
Once NaN appears, it propagates through all subsequent operations
Bugs:
Training code errors
Framework bugs (PyTorch, JAX)
CUDA driver issues
Preemption (Cloud)
On cloud platforms, your GPUs might be taken away:
Spot/Preemptible instances:
50-70% cheaper than on-demand
Can be reclaimed with 30-120 seconds notice
Expect preemption anywhere from every few hours to every few days
Maintenance:
Hardware needs firmware updates
Scheduled downtime
You treat preemption like a failure: save state, restore when capacity returns.
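A minimal sketch of treating preemption as a planned failure. It assumes the platform delivers the warning as a SIGTERM signal (some clouds instead expose a metadata endpoint you have to poll); the get_state callback stands in for whatever state you already checkpoint:

```python
import signal
import sys
import torch

def install_preemption_handler(get_state, path="preempt_checkpoint.pt"):
    # get_state() should return whatever training state you normally checkpoint.
    def _handler(signum, frame):
        torch.save(get_state(), path)   # save before the instance disappears
        sys.exit(0)                     # exit cleanly so the job can be rescheduled

    signal.signal(signal.SIGTERM, _handler)

# Usage (illustrative):
# install_preemption_handler(lambda: {"model": model.state_dict(),
#                                     "optimizer": optimizer.state_dict(),
#                                     "step": step})
```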
Recovery Strategies
Restart from Checkpoint
The standard approach:
Detect failure (heartbeat timeout, error signal)
Kill the failed training job
Load most recent checkpoint from storage
Restart training from that point
Time breakdown (typical):
Detect failure: 1-5 minutes (heartbeat timeout)
Terminate job: 1 minute
Reload checkpoint: 5-10 minutes (560 GB from S3)
Resume training: immediate
Total recovery time: ~10-20 minutes
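Steps 3 and 4, reloading the checkpoint and resuming, look roughly like this in PyTorch (the file path and key names mirror the checkpoint sketch above and are illustrative):

```python
import torch

def resume_from_checkpoint(model, optimizer, scheduler, path="checkpoint.pt"):
    ckpt = torch.load(path, map_location="cpu")

    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])    # learning-rate schedule position
    torch.random.set_rng_state(ckpt["rng_state"])   # reproducibility

    return ckpt["step"] + 1   # resume training from the next step
```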
Data parallelism: Easy recovery. All GPUs have the same model. Restart with N-1 GPUs or wait for replacement.
Pipeline/Tensor parallelism: Harder recovery. Specific GPUs hold specific model parts. Need exact GPU count and topology to resume.
Redundancy Strategies
Hot spares:
Keep 5-10% extra GPUs on standby
When one fails, swap in the spare
Expensive but minimizes downtime
Common for critical training runs
Elastic training:
Automatically adjust parallelism when GPUs come and go
GPU fails → reduce DP degree → continue
New GPU joins → increase DP degree → absorb capacity
Complex to implement, rarely used in practice
Health Monitoring
Large training runs need continuous monitoring:
Loss tracking:
Monitor for sudden spikes or NaN
Example: Loss goes from 2.3 to 1e10 → stop immediately, investigate
GPU metrics:
Temperature (>80°C sustained = problem)
Utilization (sudden drop from 90% to 30% = issue)
Memory errors (ECC correction count)
Throughput tracking:
Samples/second should be stable
10%+ drop = investigate (failing hardware, network congestion)
Example alert: “GPU 847 running 30% slower than peers for 10 minutes” → likely thermal throttling, possibly a GPU about to fail.
Automated recovery: Some systems auto-stop training on NaN detection, revert to previous checkpoint.
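A minimal sketch of that kind of guard inside a training loop (the spike threshold is an arbitrary illustrative value; real systems also compare against a running average of recent losses):

```python
import math

def check_loss(loss_value, step, spike_threshold=1e4):
    # loss_value is a plain float, e.g. loss.item() from the training step.
    if math.isnan(loss_value) or math.isinf(loss_value):
        raise RuntimeError(
            f"step {step}: loss is {loss_value}; halt and restore a checkpoint from before the blow-up"
        )
    if loss_value > spike_threshold:
        raise RuntimeError(
            f"step {step}: loss spiked to {loss_value:.3g}; stop and investigate"
        )
```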
The Cost of Downtime
Back-of-envelope calculation for a large training run:
1,024 H100 GPUs at $3/GPU/hour = $3,072/hour
4-week training run = 672 hours = $2.1 million
1 hour of lost training = $3,072 in compute
But the real cost is opportunity:
Training run delayed by 1 day → product launch delayed by 1 day
If racing a competitor, delays compound
Miss a conference paper deadline → 3-6 month delay to next venue
Checkpoint overhead vs. recovery cost:
Checkpoint every 30 min: ~3% overhead, lose max 30 min on failure
Checkpoint every 2 hours: ~1% overhead, lose max 2 hours on failure
At ~$3K/hour for the cluster, each failure under the 2-hour policy can cost up to an extra ~$4,600 in lost work compared to the 30-minute policy, and at this scale failures arrive every few days
Optimal checkpoint frequency balances:
Expected failure rate
Checkpoint time cost
Recovery time cost
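There is a classical rule of thumb for this balance, the Young/Daly approximation: checkpoint roughly every sqrt(2 × checkpoint_cost × MTBF). A sketch with assumed numbers for the 1,000-GPU example:

```python
import math

# Young/Daly rule of thumb; all numbers below are assumptions, not measurements.
checkpoint_seconds = 90              # time training is blocked per checkpoint
mtbf_hours_per_gpu = 50_000          # per-GPU mean time between failures
num_gpus = 1_000

cluster_mtbf_seconds = mtbf_hours_per_gpu * 3600 / num_gpus     # ~50 hours
optimal_interval = math.sqrt(2 * checkpoint_seconds * cluster_mtbf_seconds)

print(f"optimal checkpoint interval ≈ {optimal_interval / 60:.0f} minutes")
# ≈ 95 minutes with these assumptions; shorter if checkpoints are cheaper
# (asynchronous) or failures are more frequent.
```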
Leader Implications
“We’re checkpointing every 30 minutes”
Standard practice. On failure, lose max 30 minutes of training (~$1,500 for 1,024 GPUs).
“We lost training time due to a node failure”
Expected at scale. Ask: How much time lost? Was it within checkpoint interval? Adjust frequency if needed.
“We’re using spot instances”
50-70% cheaper but expect preemptions every few hours. Requires robust checkpointing and automated restart. Makes sense for non-urgent training.
“We had a NaN explosion”
Numerical instability. Common causes: learning rate too high, bad data, mixed precision issues. Need to restart from earlier checkpoint before NaN appeared.
“The cluster achieved 95% uptime”
Sounds good, but 5% downtime on a 30-day run = 1.5 days lost. At $3K/hour, that’s $108K wasted compute.
“We’re implementing checkpoint sharding”
Splitting checkpoint across multiple files/GPUs. Faster save/load, more resilient to storage failures.
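Conceptually, sharding just means each worker saves (and later reloads) its own slice of the state in parallel. A minimal sketch using torch.distributed for the rank bookkeeping (the file layout is illustrative; with FSDP/ZeRO-style sharding the state dicts are already per-rank shards):

```python
import torch
import torch.distributed as dist

def save_sharded_checkpoint(model, optimizer, step, directory="ckpt"):
    # Every rank writes only its own shard, in parallel, instead of one rank
    # serializing the entire multi-hundred-GB checkpoint.
    rank = dist.get_rank()
    shard = {
        "model": model.state_dict(),          # a shard under FSDP/ZeRO-style sharding
        "optimizer": optimizer.state_dict(),
        "step": step,
    }
    torch.save(shard, f"{directory}/shard_{rank:05d}.pt")
    dist.barrier()   # ensure all shards are on disk before declaring success
```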
Vocabulary Checkpoint
Checkpoint: Saved snapshot of training state (~560 GB for 70B model)
Preemption: Cloud reclaiming allocated resources (spot instances)
NaN/Inf: Not a Number / Infinity — numerical errors that propagate
Hot spare: Standby GPU ready to replace failures
Elastic training: Auto-adjusting parallelism based on available resources
MTBF: Mean Time Between Failures (~50K hours per GPU)
Asynchronous checkpoint: Checkpointing to host RAM while training continues
Heartbeat: Periodic signal confirming a GPU is alive
What’s Next?
We’ve covered the distributed training stack: parallelism, communication, scaling laws, and fault tolerance. Now let’s turn to the software layer. What frameworks do engineers actually use to build this? And what do they do?
Next time: We’ll take a whirlwind tour of ML frameworks — PyTorch, JAX, TensorFlow, and what makes each one different.