Research note, arXiv:2509.18851, September 23, 2025

NGRPO: Learning from Homogeneous Failures

NGRPO: Negative-enhanced Group Relative Policy Optimization

Gongrui Nan, Siye Chen, Jing Huang, Mengyu Lu, Dexun Wang, Chunmei Xie, Weiqi Xiong, Xianzhou Zeng, Qixuan Zhou, Yadong Li, Xingzhong Xu

NGRPO studies a precise failure mode in GRPO-style RLVR: when every response in a sampled group is wrong, reward normalization produces zero advantage and the model receives no gradient from the hardest prompts. The paper turns that dead zone into an exploration signal by adding a virtual maximum-reward sample to advantage normalization, then stabilizes the extra pressure with asymmetric clipping.

Figure: NGRPO mechanism diagram with a virtual maximum-reward sample. Core mechanism: a virtual max-reward sample changes group statistics so all-wrong groups become trainable.

In Brief

1. The central object is not reward design in general, but the zero-gradient pathology of homogeneous incorrect groups.
2. A single virtual max-reward sample shifts the group mean upward, giving every wrong rollout a calibrated negative advantage.
3. Asymmetric clipping makes positive updates looser and negative updates tighter, balancing exploration with stability.
4. The strongest empirical message is that learning from collective failure lifts both low-k accuracy and high-k exploration on math reasoning.

Problem

GRPO samples a group of responses for the same prompt and normalizes their rewards within the group. This works when a group contains both correct and incorrect trajectories, but when all rewards are identical every reward equals the group mean, so every advantage is zero and the prompt contributes no gradient. Homogeneous correct groups are less concerning; homogeneous incorrect groups are exactly where the model needs pressure to explore. DAPO filters these groups out, while PSR-NSR keeps negative samples with fixed advantages but can collapse when the negative signal is too strong.
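The zero-advantage case can be reproduced in a few lines. This is a minimal sketch, assuming binary 0/1 rewards, a population standard deviation, and a small `eps` in the denominator; the function name `grpo_advantages` and these choices are illustrative and not taken from the paper.

```python
from statistics import fmean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r_i - mean) / (std + eps).

    Assumption: population std and an additive eps; real
    implementations may differ in both details.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed group: one correct rollout, three incorrect ones.
mixed = grpo_advantages([1.0, 0.0, 0.0, 0.0])
# Homogeneous incorrect group: every reward equals the mean.
all_wrong = grpo_advantages([0.0, 0.0, 0.0, 0.0])

print("mixed:    ", mixed)
print("all wrong:", all_wrong)
```

In the mixed group the correct rollout gets a positive advantage and the others a negative one. In the all-wrong group every advantage is zero, so the policy-gradient term for that prompt vanishes.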

Figure: GRPO versus NGRPO advantage distributions across three correctness regimes. Case analysis: NGRPO is strongest when group accuracy is low and milder when the group is mostly correct.

Mechanism

NGRPO adds a virtual sample with reward r_max only for computing group statistics. The real rollouts are still the only optimized samples, but their normalized advantages are now computed against the augmented reward set. In an all-wrong group, this produces a uniform negative advantage instead of zero. In mixed groups, it dampens positive advantages and strengthens negative ones more when group accuracy is low, then becomes less intrusive when the group is already mostly correct.
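The calibration described above can be sketched as follows. The code assumes binary rewards with r_max = 1, a population standard deviation over the augmented set, and a small `eps`; the name `ngrpo_advantages` is hypothetical, and the paper's exact normalization may differ. Under these assumptions an all-wrong group of size G gets a uniform advantage of about -1/sqrt(G).

```python
from statistics import fmean, pstdev

def ngrpo_advantages(rewards, r_max=1.0, eps=1e-6):
    """Advantages normalized against the reward set plus one virtual r_max.

    The virtual sample only enters the mean and std; advantages are
    returned for the real rollouts only. Population std and eps are
    assumptions made for this sketch.
    """
    augmented = list(rewards) + [r_max]
    mean = fmean(augmented)
    std = pstdev(augmented)
    return [(r - mean) / (std + eps) for r in rewards]

groups = {
    "all wrong":      [0.0, 0.0, 0.0, 0.0],
    "mostly wrong":   [1.0, 0.0, 0.0, 0.0],
    "mostly correct": [1.0, 1.0, 1.0, 0.0],
}
for name, rewards in groups.items():
    adv = ngrpo_advantages(rewards)
    # Print per-rollout advantages and their sum for each regime.
    print(f"{name:15s}", [round(a, 3) for a in adv], "sum:", round(sum(adv), 3))
```

Because the virtual sample raises the mean without being optimized itself, the advantages of the real rollouts sum to a negative value in every group that contains at least one wrong answer. An all-correct group still yields zero advantages, since the virtual reward matches every real reward.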

Figure: Pass@k curves comparing NGRPO with PPO, GRPO, DAPO, and PSR-NSR. Performance curves show gains across low-k accuracy and high-k exploration on math benchmarks.

Stability

Because the virtual reward raises the group mean, the advantages of the real rollouts sum to a negative value rather than zero, which increases exploratory pressure. NGRPO therefore changes the PPO-style clipping interval: positive advantages use a wider upper bound, while negative advantages use a tighter lower bound. The paper uses epsilon_pos = 0.24 and epsilon_neg = 0.16, following the intuition that correct trajectories can be amplified more freely while incorrect trajectories should be penalized with tighter control.
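One plausible reading of this clipping rule, written for a single token, is sketched below. The function name `clipped_surrogate` and the exact form of the objective are assumptions based on the standard PPO clipped surrogate; only the values epsilon_pos = 0.24 and epsilon_neg = 0.16 come from the text.

```python
def clipped_surrogate(ratio, advantage, eps_pos=0.24, eps_neg=0.16):
    """PPO-style clipped objective with asymmetric bounds (sketch).

    ratio:     pi_new(token) / pi_old(token)
    advantage: normalized advantage of the rollout
    Under the outer min, the upper bound 1 + eps_pos only binds when
    the advantage is positive, and the lower bound 1 - eps_neg only
    binds when the advantage is negative.
    """
    lower, upper = 1.0 - eps_neg, 1.0 + eps_pos
    clipped_ratio = max(lower, min(ratio, upper))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: gains stop growing once ratio exceeds 1 + eps_pos.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# Negative advantage: the objective is held at the lower bound 1 - eps_neg,
# so pushing the ratio further down earns nothing extra.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
# Negative advantage with ratio above 1: unclipped, as in standard PPO.
print(clipped_surrogate(ratio=1.5, advantage=-1.0))
```

With a symmetric epsilon of 0.2 both bounds would sit at the same distance from 1. Here a correct token's probability ratio can grow to 1.24 before its gradient is cut off, while an incorrect token's ratio stops contributing once it falls below 0.84.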

Figure: Policy entropy trend during NGRPO training. Entropy dynamics support the paper's claim that NGRPO keeps exploration alive without PSR-NSR-style collapse.

Evidence

On Qwen2.5-Math-7B trained on MATH, NGRPO reports the best Pass@k AUC on AIME2025, AMC, and MATH, ahead of PPO, GRPO, PSR-NSR, and DAPO. The AIME2025 AUC reaches 31.28, above DAPO's 30.27 and GRPO's 28.33; AMC reaches 86.09; MATH reaches 90.31. Ablations show that the virtual reward gives a substantial gain, asymmetric clipping adds stability, and keeping homogeneous incorrect groups is necessary for the full effect.
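For readers unfamiliar with the metric, the commonly used unbiased Pass@k estimator can be sketched as below. The sample counts are made up for illustration, and how the paper aggregates the resulting curve into an AUC is not specified in this note, so the sketch stops at the per-k values.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct.

    Equals 1 - C(n - c, k) / C(n, k): the probability that a random
    subset of k samples contains at least one correct one. Requires k <= n.
    """
    if n - c < k:
        return 1.0  # too few wrong samples to fill a subset of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 16 samples, 2 of them correct.
n, c = 16, 2
for k in (1, 2, 4, 8, 16):
    print(f"pass@{k} = {pass_at_k(n, c, k):.3f}")
```

Low k mostly reflects per-sample accuracy, while high k rewards a policy that still produces at least one correct answer among many diverse attempts, which is why the note treats the two ends of the curve as accuracy and exploration respectively.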