Intervention Experiments: Reward Convergence

Removing β from the ratio restores convergence. All runs use FP32 backward pass, lr=1e-6, simulated BF16 (QAT).

0.6