Research note, arXiv 2509.18851, September 23, 2025

NGRPO: Learning from Homogeneous Failures

NGRPO: Negative-enhanced Group Relative Policy Optimization

Gongrui Nan, Siye Chen, Jing Huang, Mengyu Lu, Dexun Wang, Chunmei Xie, Weiqi Xiong, Xianzhou Zeng, Qixuan Zhou, Yadong Li, Xingzhong Xu

NGRPO studies a precise failure mode in GRPO-style RLVR: when every response in a sampled group is wrong, reward normalization produces zero advantage, so the model receives no gradient from the hardest prompts. The paper turns that dead zone into an exploration signal by adding a virtual maximum-reward sample to advantage normalization, then stabilizes the extra negative-update pressure with asymmetric clipping.

NGRPO mechanism diagram with virtual maximum-reward sample
Core mechanism: a virtual max-reward sample changes group statistics so all-wrong groups become trainable.

In Brief

  1. The central object is not reward design in general, but the zero-gradient pathology of homogeneous incorrect groups.

  2. A single virtual max-reward sample shifts the group mean upward, giving every wrong rollout a calibrated negative advantage.

  3. Asymmetric clipping makes positive updates looser and negative updates tighter, balancing exploration with stability.

  4. The strongest empirical message is that learning from collective failure improves Pass@k on math reasoning at both low k (accuracy) and high k (exploration).

Problem

GRPO samples a group of responses for the same prompt and normalizes their rewards within the group. This works when a group contains both correct and incorrect trajectories, but every advantage collapses to zero when all rewards are identical. Homogeneous correct groups are less concerning; homogeneous incorrect groups are exactly where the model needs pressure to explore. DAPO filters these groups out, while PSR-NSR keeps negative samples with fixed advantages, but its training can collapse when the negative signal is too strong.
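The zero-advantage case can be checked numerically. The following is a minimal sketch, assuming binary rewards (1 for correct, 0 for wrong), a group of four rollouts, and population standard deviation with a small stabilizer `eps`; the function name `grpo_advantages` and these choices are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group (illustrative sketch).

    eps is an assumed stabilizer that avoids division by zero when all
    rewards in the group are identical.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Mixed group: the correct rollout gets a positive advantage, the rest negative.
mixed = grpo_advantages([1.0, 0.0, 0.0, 0.0])

# Homogeneous groups: every reward equals the mean, so every advantage is zero
# and the group contributes no policy gradient.
all_wrong = grpo_advantages([0.0, 0.0, 0.0, 0.0])
all_right = grpo_advantages([1.0, 1.0, 1.0, 1.0])
```

Under these assumptions `all_wrong` and `all_right` are both all-zero vectors, which is the dead zone described above.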

GRPO versus NGRPO advantage distributions across three correctness regimes
Case analysis: NGRPO is strongest when group accuracy is low and milder when the group is mostly correct.

Mechanism

NGRPO adds a virtual sample with reward r_max, used only for computing group statistics. The real rollouts remain the only optimized samples, but their normalized advantages are now computed against the augmented reward set. In an all-wrong group, this produces a uniform negative advantage instead of zero. In mixed groups, it dampens positive advantages and strengthens negative ones, more so when group accuracy is low, and becomes less intrusive when the group is already mostly correct.
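The calibration step can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: rewards are binary with `r_max = 1.0`, the group has four rollouts, the standard deviation is the population form, and `eps` and the function name `ngrpo_advantages` are hypothetical.

```python
import numpy as np

def ngrpo_advantages(rewards, r_max=1.0, eps=1e-6):
    """Advantages of the real rollouts, normalized against an augmented group.

    The virtual sample (reward r_max) enters only the mean and standard
    deviation; no advantage is returned for it and it is never optimized.
    """
    r = np.asarray(rewards, dtype=float)
    augmented = np.append(r, r_max)  # statistics-only virtual sample
    return (r - augmented.mean()) / (augmented.std() + eps)

# All-wrong group: the augmented mean is 0.2 and the std is 0.4, so each
# rollout gets an advantage of about -0.5 instead of zero.
all_wrong = ngrpo_advantages([0.0, 0.0, 0.0, 0.0])

# Mixed group with one correct rollout out of four.
mixed = ngrpo_advantages([1.0, 0.0, 0.0, 0.0])

# All-correct group: the augmented rewards are still identical, so the
# advantages stay at zero.
all_right = ngrpo_advantages([1.0, 1.0, 1.0, 1.0])
```

With these assumptions, an all-wrong group of size G receives a uniform advantage of roughly -1/sqrt(G). For the mixed group, plain within-group standardization would give about 1.73 for the correct rollout and -0.58 for each wrong one; the augmented statistics give about 1.22 and -0.82.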

Pass@k curves comparing NGRPO with PPO, GRPO, DAPO, and PSR-NSR
Performance curves show gains across low-k accuracy and high-k exploration on math benchmarks.

Stability

Because the virtual sample raises the group mean, the advantages of the real rollouts sum to a negative value whenever at least one rollout falls short of r_max, which increases exploratory pressure. NGRPO therefore changes the PPO-style clipping interval: positive advantages use a wider upper bound, while negative advantages use a tighter lower bound. The paper uses epsilon_pos = 0.24 and epsilon_neg = 0.16, following the intuition that correct trajectories can be amplified more freely while incorrect trajectories should be penalized with tighter control.
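One way to realize this asymmetry is sketched below. It assumes the standard PPO clipped surrogate, min(ratio * A, clip(ratio) * A), with the clip interval set to [1 - epsilon_neg, 1 + epsilon_pos]; in that surrogate only the upper bound binds for positive advantages and only the lower bound binds for negative ones. The function name and the per-token NumPy form are illustrative assumptions, and the paper's exact objective may differ.

```python
import numpy as np

def asymmetric_clip_objective(ratio, advantages, eps_pos=0.24, eps_neg=0.16):
    """PPO-style clipped surrogate with separate bounds (illustrative sketch).

    ratio      : pi_new / pi_old for each token or sample
    advantages : advantages of the same shape
    """
    ratio = np.asarray(ratio, dtype=float)
    adv = np.asarray(advantages, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps_neg, 1.0 + eps_pos)
    # Pessimistic minimum, as in PPO:
    #   adv > 0: the objective is capped once ratio exceeds 1 + eps_pos
    #   adv < 0: the objective is capped once ratio drops below 1 - eps_neg
    return np.minimum(ratio * adv, clipped * adv)

# Four cases: (large ratio, A > 0), (small ratio, A < 0),
#             (large ratio, A < 0), (small ratio, A > 0).
objective = asymmetric_clip_objective(
    ratio=[1.5, 0.5, 1.5, 0.5],
    advantages=[1.0, -1.0, -1.0, 1.0],
)
```

In the first two cases the objective is held at 1.24 and -0.84, so further movement in that direction earns no extra gradient; in the last two cases the unclipped term is already the smaller one and passes through unchanged.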

Policy entropy trend during NGRPO training
Entropy dynamics support the paper's claim that NGRPO keeps exploration alive without PSR-NSR-style collapse.

Evidence

On Qwen2.5-Math-7B trained on MATH, NGRPO reports the best Pass@k AUC on AIME2025, AMC, and MATH, ahead of PPO, GRPO, PSR-NSR, and DAPO. The AIME2025 AUC rises to 31.28, above DAPO's 30.27 and GRPO's 28.33; AMC rises to 86.09; MATH reaches 90.31. Ablations show that the virtual reward gives a substantial gain, asymmetric clipping adds stability, and keeping homogeneous incorrect groups is necessary for the full effect.