AsyncGRPO Blog Figures
Interactive charts for the BF16 precision analysis blog post.
Precision & Convergence
AsyncGRPO Convergence by Numerical Precision
FP16, FP32, BF16 configurations on the length-penalty task
Sync GRPO Convergence by Precision & Learning Rate
Standard GRPOTrainer across dtype and learning-rate configurations
Ratio Decomposition (Section 4)
BF16-Aligned Ratio |α| Over Training
Run A (bf16=True) grows to 0.92; Run B (bf16=False) stays at ~0.23
Signal-to-Noise Ratio: |α| / |β|
SNR of the importance-sampling ratio — drops below 1 early in training when precision gap dominates
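The SNR plotted here is a ratio of mean magnitudes of the two ratio components. A minimal sketch, where `alpha` (policy-update component) and `beta` (precision-noise component) are hypothetical toy arrays, not values from the actual runs:

```python
import numpy as np

def snr(alpha: np.ndarray, beta: np.ndarray) -> float:
    """Signal-to-noise ratio of the ratio decomposition: mean|alpha| / mean|beta|."""
    return float(np.mean(np.abs(alpha)) / np.mean(np.abs(beta)))

# Toy values: a small policy-improvement signal next to larger precision noise.
alpha = np.array([0.02, -0.01, 0.03])   # hypothetical policy-update component
beta  = np.array([0.05, -0.04, 0.06])   # hypothetical precision-noise component

print(snr(alpha, beta))  # < 1: noise dominates the importance ratio
```

An SNR below 1 means the importance-sampling ratio is moved more by precision noise than by genuine policy updates.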
Gradient Distortion (Section 5)
Gradient Distortion: Cosine Similarity & Relative L2 Error
cosθ drops from 0.99 to ~0.55 once PPO clipping is included; noise exceeds signal by step 10
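Both distortion metrics in this figure are standard. A minimal sketch with a synthetic gradient; the 1.5× noise scale is illustrative, chosen so cosθ lands near the ~0.55 regime the caption describes:

```python
import numpy as np

def cosine_similarity(g_ref: np.ndarray, g: np.ndarray) -> float:
    """cos(theta) between a reference gradient and a distorted one."""
    return float(np.dot(g_ref, g) / (np.linalg.norm(g_ref) * np.linalg.norm(g)))

def relative_l2_error(g_ref: np.ndarray, g: np.ndarray) -> float:
    """||g - g_ref|| / ||g_ref||; values above 1 mean noise exceeds signal."""
    return float(np.linalg.norm(g - g_ref) / np.linalg.norm(g_ref))

rng = np.random.default_rng(0)
g_ref = rng.normal(size=1000)                 # stand-in "true" gradient
g_noisy = g_ref + 1.5 * rng.normal(size=1000) # noise 1.5x the signal scale
```

With noise of this scale, `relative_l2_error` exceeds 1 and `cosine_similarity` falls to roughly 1/√(1 + 1.5²) ≈ 0.55.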
Intervention Experiments (Section 6)
Intervention Experiments: Reward Convergence
Runs A/B/F/G — removing β from the ratio restores convergence
Intervention Experiments: Deployed Improvement
Per-step model improvement for runs A/B/F/G
Deployed Improvement: bf16=True vs bf16=False
5.5x more effective improvement with bf16=True
Appendix C
Weight-Sync Boundary Crossings per Step
Fraction of FP32 weights crossing a BF16 grid boundary — starts at ~0.96%, decays monotonically
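Whether an FP32 weight crosses a BF16 grid boundary during an update can be checked by comparing its high 16 bits before and after the step. This sketch uses bit truncation rather than round-to-nearest-even, so it approximates the actual BF16 cast:

```python
import numpy as np

def to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Snap float32 values to the bf16 grid by dropping the low 16 mantissa bits.
    (Approximation: real bf16 casts round to nearest even rather than truncate.)"""
    return x.astype(np.float32).view(np.uint32) & 0xFFFF0000

def fraction_crossing(w_before: np.ndarray, w_after: np.ndarray) -> float:
    """Fraction of weights whose bf16 representation changed across an update."""
    return float(np.mean(to_bf16_bits(w_before) != to_bf16_bits(w_after)))

w0 = np.array([1.0, 1.0], dtype=np.float32)
w1 = np.array([1.000001, 1.01], dtype=np.float32)  # one sub-grid update, one crossing
print(fraction_crossing(w0, w1))
```

Near 1.0 the bf16 grid spacing is 2⁻⁸ ≈ 0.0039, so the 1e-6 update stays inside its cell while the 0.01 update crosses into the next one.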
Phantom Clipping (Section 7)
Phantom Clipping: Token-Level Strip Plot
Interactive scatter — toggle β removal to see phantom-clipped tokens collapse back into the green zone
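A phantom-clipped token can be sketched as one whose ratio lands outside the PPO clip range only because of the noise component β: with β removed, the same token is unclipped. The `alpha`/`beta` arrays below are illustrative toy values, with a clip range ε = 0.2 assumed:

```python
import numpy as np

def clipped_mask(log_ratio: np.ndarray, eps: float = 0.2) -> np.ndarray:
    """True where the importance ratio exp(log_ratio) falls outside [1-eps, 1+eps]."""
    r = np.exp(log_ratio)
    return (r < 1 - eps) | (r > 1 + eps)

alpha = np.array([0.05, -0.03, 0.10])  # hypothetical true policy-update component
beta  = np.array([0.20, -0.25, 0.00])  # hypothetical precision-noise component

# Phantom-clipped: clipped with beta included, unclipped once beta is removed.
phantom = clipped_mask(alpha + beta) & ~clipped_mask(alpha)
```

The first two tokens are phantom-clipped (their clipping vanishes when β is removed), which is the "collapse back into the green zone" behavior the toggle demonstrates.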
Phantom Clipping: Loss Structure Experiments
No-clip, detach, detach+center vs standard GRPO baseline