Mean reward on length-penalty task (reward = −len). Optimal policy emits EOS immediately (reward = −1). Sync GRPO (standard TRL GRPOTrainer).