AsyncGRPO Convergence by Numerical Precision

Mean reward on length-penalty task (reward = −len). Optimal policy emits EOS immediately (reward = −1).

0.6