AsyncGRPO Blog Figures
Interactive charts for the BF16 precision analysis blog post.
Precision & Convergence
AsyncGRPO Convergence by Numerical Precision
FP16, FP32, BF16 configurations on the length-penalty task
Sync GRPO Convergence by Precision & Learning Rate
Standard GRPOTrainer across dtype and learning-rate configurations
Ratio Decomposition (Section 4)
BF16-Aligned Ratio |α| Over Training
Run A (bf16=True) grows to 0.92; Run B (bf16=False) stays at ~0.23
Signal-to-Noise Ratio: |α| / |β|
SNR of the importance-sampling ratio — drops below 1 early in training when precision gap dominates
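The SNR plotted here is a ratio of mean magnitudes of the two ratio components. A minimal sketch, where `alpha` (policy-update component) and `beta` (precision-noise component) are hypothetical toy arrays, not values from the actual runs:

```python
import numpy as np

def snr(alpha: np.ndarray, beta: np.ndarray) -> float:
    """Signal-to-noise ratio of the ratio decomposition: mean|alpha| / mean|beta|."""
    return float(np.mean(np.abs(alpha)) / np.mean(np.abs(beta)))

# Toy values: a small policy-improvement signal next to larger precision noise.
alpha = np.array([0.02, -0.01, 0.03])   # hypothetical policy-update component
beta  = np.array([0.05, -0.04, 0.06])   # hypothetical precision-noise component

print(snr(alpha, beta))  # < 1: noise dominates the importance ratio
```

An SNR below 1 means the importance-sampling ratio is moved more by precision noise than by genuine policy updates.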
Gradient Distortion (Section 5)
Gradient Distortion: Cosine Similarity & Relative L2 Error
cosθ drops from 0.99 to ~0.55 once PPO clipping is included; noise exceeds signal by step 10
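Both distortion metrics in this figure are standard. A minimal sketch with a synthetic gradient; the 1.5× noise scale is illustrative, chosen so cosθ lands near the ~0.55 regime the caption describes:

```python
import numpy as np

def cosine_similarity(g_ref: np.ndarray, g: np.ndarray) -> float:
    """cos(theta) between a reference gradient and a distorted one."""
    return float(np.dot(g_ref, g) / (np.linalg.norm(g_ref) * np.linalg.norm(g)))

def relative_l2_error(g_ref: np.ndarray, g: np.ndarray) -> float:
    """||g - g_ref|| / ||g_ref||; values above 1 mean noise exceeds signal."""
    return float(np.linalg.norm(g - g_ref) / np.linalg.norm(g_ref))

rng = np.random.default_rng(0)
g_ref = rng.normal(size=1000)                 # stand-in "true" gradient
g_noisy = g_ref + 1.5 * rng.normal(size=1000) # noise 1.5x the signal scale
```

With noise of this scale, `relative_l2_error` exceeds 1 and `cosine_similarity` falls to roughly 1/√(1 + 1.5²) ≈ 0.55.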
Intervention Experiments (Section 6)
Intervention Experiments: Reward Convergence
Runs A/B/F/G — removing β from the ratio restores convergence
Intervention Experiments: Deployed Improvement
Per-step model improvement for runs A/B/F/G
Deployed Improvement: bf16=True vs bf16=False
5.5x more effective improvement with bf16=True
Appendix C
Weight-Sync Boundary Crossings per Step
Fraction of FP32 weights crossing a BF16 grid boundary — starts at ~0.96%, decays monotonically
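Whether an FP32 weight crosses a BF16 grid boundary during an update can be checked by comparing its high 16 bits before and after the step. This sketch uses bit truncation rather than round-to-nearest-even, so it approximates the actual BF16 cast:

```python
import numpy as np

def to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Snap float32 values to the bf16 grid by dropping the low 16 mantissa bits.
    (Approximation: real bf16 casts round to nearest even rather than truncate.)"""
    return x.astype(np.float32).view(np.uint32) & 0xFFFF0000

def fraction_crossing(w_before: np.ndarray, w_after: np.ndarray) -> float:
    """Fraction of weights whose bf16 representation changed across an update."""
    return float(np.mean(to_bf16_bits(w_before) != to_bf16_bits(w_after)))

w0 = np.array([1.0, 1.0], dtype=np.float32)
w1 = np.array([1.000001, 1.01], dtype=np.float32)  # one sub-grid update, one crossing
print(fraction_crossing(w0, w1))
```

Near 1.0 the bf16 grid spacing is 2⁻⁸ ≈ 0.0039, so the 1e-6 update stays inside its cell while the 0.01 update crosses into the next one.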
Phantom Clipping (Section 7)
Phantom Clipping: Token-Level Strip Plot
Interactive scatter — toggle β removal to see phantom-clipped tokens collapse back into the green zone
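A phantom-clipped token can be sketched as one whose ratio lands outside the PPO clip range only because of the noise component β: with β removed, the same token is unclipped. The `alpha`/`beta` arrays below are illustrative toy values, with a clip range ε = 0.2 assumed:

```python
import numpy as np

def clipped_mask(log_ratio: np.ndarray, eps: float = 0.2) -> np.ndarray:
    """True where the importance ratio exp(log_ratio) falls outside [1-eps, 1+eps]."""
    r = np.exp(log_ratio)
    return (r < 1 - eps) | (r > 1 + eps)

alpha = np.array([0.05, -0.03, 0.10])  # hypothetical true policy-update component
beta  = np.array([0.20, -0.25, 0.00])  # hypothetical precision-noise component

# Phantom-clipped: clipped with beta included, unclipped once beta is removed.
phantom = clipped_mask(alpha + beta) & ~clipped_mask(alpha)
```

The first two tokens are phantom-clipped (their clipping vanishes when β is removed), which is the "collapse back into the green zone" behavior the toggle demonstrates.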
Phantom Clipping: Loss Structure Experiments
No-clip, detach, detach+center vs standard GRPO baseline