Intervention Experiments: Reward Convergence

Intervention Experiments: Reward Convergence

Removing β from the ratio restores convergence. All runs use FP32 backward pass, lr=1e-6, simulated BF16 (QAT).

Smoothing 0.6