Part 8: Batch Size, Learning Rate, and Scaling Laws
Why You Can’t Just Add More GPUs
More GPUs means larger batches and faster training—up to a point. Beyond a critical threshold, adding GPUs gives diminishing returns or even hurts model quality. There are fundamental limits to parallelism.
This part explains batch size, learning rate, scaling laws, and why you can’t always buy your way to faster training.
Why Should a Leader Care?
Adding more hardware doesn’t automatically mean faster or better training. There are fundamental relationships between:
How much compute you use
How big your batches are
How quickly the model learns
Understanding these helps you:
Evaluate “we need 2× more GPUs” requests
Know when more compute has diminishing returns
Understand the budget/performance tradeoffs
The One Concept: You Can’t Just Add More GPUs
More GPUs often means larger batch sizes. But larger batches don’t always help — and can sometimes hurt.
There’s a fundamental relationship between batch size, learning rate, and training quality.
Batch Size and Learning Rate
Batch size: Number of examples processed before updating weights.
Small batch (32): Noisy gradient estimates. Many small updates. More exploration.
Large batch (4096): Accurate gradient estimates. Fewer large updates. Less exploration.
Learning rate: How much to adjust weights per update.
Too small (0.00001): Convergence is painfully slow
Too large (1.0): Training diverges, loss goes to infinity
Just right (0.001): Steady, fast convergence
The relationship: As batch size increases, you often need to increase the learning rate to maintain training dynamics.
The Linear Scaling Rule
A widely-used heuristic:
If you multiply batch size by N, multiply learning rate by N.
Why? With a larger batch, your gradient is more accurate (averaged over more examples). You can afford to take bigger steps without overshooting.
Concrete example:
Starting point: batch=256, learning_rate=0.001
Scale to 8 GPUs with data parallelism: batch=2048 (8× larger)
Linear scaling rule: learning_rate=0.008 (8× larger)
The catch: This rule works up to a point. Beyond a certain batch size, model quality degrades no matter how you tune the learning rate.
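To make the rule concrete, here is a minimal Python sketch of how a training script might compute the scaled learning rate, including the gradual warmup that typically accompanies large-batch recipes. The baseline values come from the example above; the 500-step warmup length is an illustrative assumption.

```python
# Sketch of the linear scaling rule with gradual warmup.
# BASE_BATCH / BASE_LR come from the example above; the 500-step
# warmup is an illustrative assumption, not a universal recommendation.

BASE_BATCH = 256
BASE_LR = 0.001

def scaled_lr(batch_size: int, step: int, warmup_steps: int = 500) -> float:
    """Scale the learning rate linearly with batch size, ramping up
    from the base rate over the first warmup_steps to avoid early divergence."""
    target_lr = BASE_LR * (batch_size / BASE_BATCH)  # linear scaling rule
    if step < warmup_steps:
        return BASE_LR + (target_lr - BASE_LR) * (step / warmup_steps)
    return target_lr

print(scaled_lr(batch_size=2048, step=1000))  # 0.008, the 8x example above
```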
The Critical Batch Size
Every training run has a critical batch size — a threshold beyond which larger batches give diminishing returns.
Below the critical batch size:
2× batch → 2× throughput → same quality
Linear scaling works well
Adding GPUs helps proportionally
Above the critical batch size:
2× batch → <1.5× speedup → often worse quality
Steps shrink less than proportionally, so total examples processed rises
Eventually, quality degrades regardless of tuning
Concrete examples (approximate):
ResNet-50 on ImageNet: critical batch ~8K-16K
BERT pretraining: critical batch ~16K-32K
GPT-3: critical batch estimated ~2M-4M tokens
Why it exists: Gradient noise shrinks as batch size grows. Once the gradient estimate is already accurate, averaging over more examples adds little new information per step. And some noise actually helps the model escape poor local minima and generalize better.
Practical implication: There’s a limit to how much parallelism helps. Beyond the critical batch size, you’re better off training longer rather than wider.
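One way practitioners estimate the critical batch size is the gradient noise scale from McCandlish et al. (2018), which compares gradients measured at two batch sizes. Below is a simplified sketch of that estimator; a production version would average these statistics over many batches during training, and `g_small` / `g_big` would come from actual backward passes.

```python
import numpy as np

def gradient_noise_scale(g_small: np.ndarray, g_big: np.ndarray,
                         b_small: int, b_big: int) -> float:
    """Estimate the gradient noise scale (a proxy for critical batch size)
    from gradients averaged over batches of size b_small and b_big."""
    sq_small = float(np.dot(g_small, g_small))
    sq_big = float(np.dot(g_big, g_big))
    # Unbiased estimates of the true gradient norm^2 and the noise magnitude.
    true_grad_sq = (b_big * sq_big - b_small * sq_small) / (b_big - b_small)
    noise = (sq_small - sq_big) / (1.0 / b_small - 1.0 / b_big)
    # Batches much larger than this ratio mostly re-average noise that is
    # already averaged away, so returns diminish.
    return noise / true_grad_sq
```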
Scaling Laws
In 2020, researchers at OpenAI published influential work on scaling laws: predictable relationships between compute, model size, dataset size, and performance.
The Key Findings
Performance scales predictably with compute
Loss follows a power law:
Loss ∝ Compute^(-α), where α ≈ 0.05-0.1
You can predict final performance from early training (see the sketch after this list)
Optimal allocation exists
Given a compute budget, there’s an optimal split between model size and training tokens
Train too short → undertrained model
Train too long → wasted compute
Bigger models are more sample-efficient
A 70B model learns more per token than a 7B model
Given limited data, use a larger model
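A minimal sketch of the "predict from early training" idea: fit Loss = a × Compute^(-α) in log-log space on early measurements, then extrapolate to the full budget. The data points below are invented for illustration.

```python
import numpy as np

# Hypothetical (compute, loss) measurements from early in a run.
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
loss = np.array([4.2, 3.8, 3.4, 3.1, 2.8])

# A power law loss = a * compute^(-alpha) is a straight line in log-log space.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha, a = -slope, np.exp(intercept)

# Extrapolate to the full budget to decide whether to keep training.
budget = 1e23
print(f"alpha = {alpha:.3f}")
print(f"predicted loss at {budget:.0e} FLOPs: {a * budget ** (-alpha):.2f}")
```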
Chinchilla Scaling Laws
DeepMind’s 2022 research refined OpenAI’s findings:
For compute-optimal training: Model parameters and training tokens should scale roughly equally.
Rule of thumb: Train on ~20 tokens per parameter.
Chinchilla-Optimal Training (training FLOPs ≈ 6 × parameters × tokens):
1B model: ~20B tokens optimal, ~1×10²⁰ FLOPs
7B model: ~140B tokens optimal, ~6×10²¹ FLOPs
70B model: ~1.4T tokens optimal, ~6×10²³ FLOPs
Example calculation: 70B model
Chinchilla-optimal: 1.4 trillion tokens
At 4 million tokens/sec: 1.4T ÷ 4M = 350,000 seconds ≈ 4 days on a large cluster
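A back-of-envelope calculator for these numbers, using the ~20 tokens-per-parameter rule and the FLOPs ≈ 6 × N × D approximation above. The 4M tokens/sec throughput is the example's assumption; substitute whatever your cluster actually sustains.

```python
def chinchilla_budget(params: float, tokens_per_sec: float):
    tokens = 20 * params         # ~20 tokens per parameter
    flops = 6 * params * tokens  # training FLOPs ~= 6 * N * D
    days = tokens / tokens_per_sec / 86400
    return tokens, flops, days

tokens, flops, days = chinchilla_budget(params=70e9, tokens_per_sec=4e6)
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs, {days:.1f} days")
# -> 1.40e+12 tokens, 5.88e+23 FLOPs, 4.1 days
```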
Compute-optimal example: LLaMA 1 65B
Trained on ~1.4T tokens ≈ 21 tokens/param (essentially Chinchilla-optimal)
Overtrained example: LLaMA 2 7B
Trained on ~2T tokens ≈ 285 tokens/param (14× “optimal”)
Why? A smaller model is cheaper to run millions of times during inference
If you train on fewer tokens: You’re “undertrained” — the model could be better with more training.
If you train on more tokens: You’re “overtrained” — compute efficiency drops, but inference is cheaper.
Compute-Optimal vs. Inference-Optimal
There’s a deliberate tension:
Compute-optimal training:
Use a bigger model, train on fewer tokens
Best quality per training FLOP
Example: 70B model on 1.4T tokens
Inference-optimal training:
Use a smaller model, train on more tokens
Cheaper to deploy later
Example: 7B model on 2T tokens
Why inference-optimal matters: Training runs once (one-time cost). Inference runs millions of times (ongoing cost).
Real-world tradeoff:
Training a 70B model: $2M compute cost
Serving 70B model: $50/hr per GPU × 8 GPUs × 24/7 ≈ $3.5M/year
Training a 7B model longer: $500K compute cost
Serving 7B model: $50/hr × 1 GPU × 24/7 ≈ $440K/year
For high-volume deployment, inference costs dominate. Over-training a smaller model saves money long-term.
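A quick sketch of that arithmetic, using the illustrative figures above (the GPU counts, $50/hr rate, and training costs are assumptions for the example, not market quotes):

```python
HOURS_PER_YEAR = 24 * 365

def total_cost(train_cost: float, gpus: int, rate_per_gpu_hr: float,
               years: float) -> float:
    """One-time training cost plus ongoing serving cost."""
    serving_per_year = gpus * rate_per_gpu_hr * HOURS_PER_YEAR
    return train_cost + serving_per_year * years

for years in (1, 3):
    big = total_cost(2_000_000, gpus=8, rate_per_gpu_hr=50, years=years)
    small = total_cost(500_000, gpus=1, rate_per_gpu_hr=50, years=years)
    print(f"{years}y: 70B ${big/1e6:.1f}M vs over-trained 7B ${small/1e6:.2f}M")
```

At one year of deployment the 70B route already costs roughly $5.5M against under $1M for the 7B route, and the gap widens every additional year.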
Practical Tradeoffs
Batch Size Decisions
Faster wall-clock training: Larger batch (up to critical batch size)
Better GPU utilization: Larger batch (fill GPU memory)
Best final model quality: Stay below critical batch size
Easier hyperparameter tuning: Smaller batches (more forgiving)
When More GPUs Don’t Help
Hit critical batch size: 2× GPUs won’t give 2× speedup
Out of training data: Can’t increase batch or train longer
Communication overhead: Spending >40% of time on communication
Learning rate limits: Sweet spot becomes too narrow to find
Leader Implications
“We need to increase the batch size”
Makes training faster (up to a point). May need to adjust learning rate. Ask if they’re approaching the critical batch size.
“Larger batches aren’t helping anymore”
You’ve hit the critical batch size. Adding more GPUs won’t improve wall-clock time or quality.
“We’re following Chinchilla scaling”
Training with ~20 tokens per parameter. Compute-optimal approach. Makes sense if training cost is the primary concern.
“We’re over-training the model”
Training longer than compute-optimal. Creates a smaller, more capable model. Saves inference costs. Makes sense for high-volume deployment.
“We can predict the final loss from early training”
Scaling laws enable this. Can decide at day 2 whether to continue a 30-day run. Saves millions if you can stop early.
“We’re doing a scaling law study”
Training multiple model sizes to predict optimal allocation. Good investment before committing to full-scale training.
Vocabulary Checkpoint
Batch size: Number of examples processed before a weight update
Learning rate: Step size when updating weights
Linear scaling rule: Increase learning rate proportionally with batch size
Critical batch size: Batch size beyond which returns diminish (e.g., ~8K-32K examples for ResNet- or BERT-scale training)
Scaling laws: Predictable relationships between compute/size/data/performance
Compute-optimal: Best training efficiency (per FLOP spent)
Inference-optimal: Optimized for deployment cost (smaller model, longer training)
Chinchilla scaling: ~20 tokens per parameter for compute-optimal training
What’s Next?
We’ve covered how to distribute training and when more compute helps. But what happens when things go wrong? At the scale of thousands of GPUs running for weeks, hardware failures are guaranteed. How do you build systems that survive failures?
Next time: We’ll explore what breaks at scale — fault tolerance, checkpointing, and building resilient training systems.