Research note · arXiv:2601.22648 · January 30, 2026

UCPO: Uncertainty as a First-Class Learning Signal

UCPO: Uncertainty-Aware Policy Optimization

Xianzhou Zeng, Jing Huang, Chunmei Xie, Gongrui Nan, Siye Chen, Mengyu Lu, Weiqi Xiong, Qixuan Zhou, Junhao Zhang, Qiang Zhu, Yadong Li, Xingzhong Xu

UCPO targets a different blind spot in RL alignment: binary right/wrong rewards can make models overconfident, while a fixed reward for "I don't know" can swing between hallucination suppression and avoidance collapse. The paper reframes uncertainty as a ternary policy state and introduces Ternary Advantage Decoupling (TAD) plus Dynamic Uncertainty Reward Adjustment (DURA) to keep accuracy, hallucination reduction, and honest abstention in balance.

Reward imbalance illustration for uncertainty alignment
Static uncertainty rewards can create overconfidence on easier tasks or avoidance degeneracy on harder tasks.

In Brief

  1. The paper identifies advantage bias as the root of both overconfidence and avoidance degeneracy under fixed uncertainty rewards.

  2. TAD splits deterministic rollouts from uncertain rollouts, so uncertainty is not normalized against ordinary right/wrong rewards.

  3. DURA adjusts the uncertainty gain from group-level right, wrong, and uncertain ratios, making the reward evolve with task difficulty and model capability.

  4. The strongest result is reliability: UCPO usually improves PAQ by converting hallucinations into uncertainty, while keeping F1 from collapsing.

Problem

Naively adding an uncertainty reward to GRPO creates a ternary reward space: right, wrong, and uncertain. With a fixed reward for uncertainty, the sign of the uncertain response's advantage depends on the group's current reward distribution. In easy or high-performance regimes, uncertainty can become a negative-advantage action and the model learns overconfidence. In hard regimes, uncertainty can dominate the gradient and the model learns to avoid reasoning by saying it is unsure.
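The sign flip can be reproduced with a few lines of arithmetic. The sketch below assumes the usual group-normalized advantage (reward minus group mean, divided by group standard deviation) and illustrative reward values of 1.0 for right, 0.0 for wrong, and 0.5 for uncertain. These values and the function name `grpo_advantages` are assumptions made for the example, not taken from the paper.

```python
from statistics import mean, pstdev

# Assumed reward values for illustration only (not from the paper).
REWARD = {"right": 1.0, "wrong": 0.0, "uncertain": 0.5}


def grpo_advantages(labels, eps=1e-8):
    """Group-normalized advantages over all rollouts in one group."""
    rewards = [REWARD[label] for label in labels]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Easy prompt: most rollouts are right, one abstains (last element).
easy = ["right"] * 6 + ["wrong"] + ["uncertain"]
# Hard prompt: most rollouts are wrong, one abstains (last element).
hard = ["right"] + ["wrong"] * 6 + ["uncertain"]

# Same fixed reward, opposite advantage sign for the abstaining rollout.
print("easy group, uncertain advantage:", grpo_advantages(easy)[-1])  # negative
print("hard group, uncertain advantage:", grpo_advantages(hard)[-1])  # positive
```

With the same fixed reward, the abstaining rollout is pushed down in the easy group and pushed up in the hard group, which matches the two failure modes described above.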

Ternary imbalance plots comparing GRPO-UC and UCPO
The ternary plots visualize why global normalization makes uncertainty incentives unstable.

TAD

Ternary Advantage Decoupling separates rollouts into a deterministic channel and an uncertainty channel. The deterministic channel normalizes only right and wrong responses, preserving the learning signal for correctness. The uncertainty channel assigns uncertain rollouts an advantage proportional to the right-sample advantage, scaled by a dynamic gain gamma. This treats uncertainty as a legitimate metacognitive action, not merely a middle reward between correct and wrong.
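A minimal sketch of the two channels, under stated assumptions: right and wrong rollouts receive rewards 1.0 and 0.0, the deterministic channel uses mean/standard-deviation normalization, and a group with no spread among its deterministic rollouts gets zero advantage everywhere. The function name `tad_advantages`, the reward values, and the degenerate-case fallback are choices made for this example, not the paper's specification.

```python
from statistics import mean, pstdev


def tad_advantages(labels, gamma):
    """Decoupled advantages for one group of rollouts.

    labels: list of "right" / "wrong" / "uncertain"
    gamma:  gain applied to the right-sample advantage for uncertain rollouts
    """
    # Deterministic channel: statistics come from right/wrong rollouts only.
    det_rewards = [1.0 if l == "right" else 0.0 for l in labels if l != "uncertain"]
    if det_rewards:
        mu, sigma = mean(det_rewards), pstdev(det_rewards)
    else:
        mu, sigma = 0.0, 0.0

    def det_adv(reward):
        # Assumed fallback: no spread among deterministic rollouts -> no signal.
        return (reward - mu) / sigma if sigma > 0 else 0.0

    right_adv = det_adv(1.0)  # advantage a right rollout gets in this group

    advantages = []
    for label in labels:
        if label == "uncertain":
            # Uncertainty channel: proportional to the right-sample advantage.
            advantages.append(gamma * right_adv)
        else:
            advantages.append(det_adv(1.0 if label == "right" else 0.0))
    return advantages


group = ["right", "right", "wrong", "uncertain"]
print(tad_advantages(group, gamma=0.5))
```

In this sketch, adding more uncertain rollouts to the group leaves the right and wrong advantages unchanged, which is the decoupling the paragraph above describes.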

UCPO framework architecture
UCPO combines TAD and DURA: separate semantic channels plus adaptive gain control.

DURA

Dynamic Uncertainty Reward Adjustment computes gamma from the group's right, wrong, and uncertain ratios. The gain term grows when wrong answers are common and uncertainty is scarce, encouraging the model to replace hallucinations with honest doubt. The suppression term grows as the model becomes more correct while uncertainty remains high, pushing the policy back toward definitive answers when it has enough knowledge.
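This note does not reproduce the paper's formula for gamma, so the following is only a toy stand-in that has the qualitative behavior described above. The product forms for the gain and suppression terms, and the names `group_ratios` and `toy_gamma`, are hypothetical and should not be read as the DURA definition.

```python
from collections import Counter


def group_ratios(labels):
    """Fractions of right, wrong, and uncertain rollouts in one group."""
    if not labels:
        raise ValueError("group must contain at least one rollout")
    counts = Counter(labels)
    n = len(labels)
    return counts["right"] / n, counts["wrong"] / n, counts["uncertain"] / n


def toy_gamma(labels):
    """Hypothetical gain: NOT the paper's formula, only the same qualitative shape."""
    p_right, p_wrong, p_uncertain = group_ratios(labels)
    # Gain grows when wrong answers are common and uncertainty is scarce.
    gain = p_wrong * (1.0 - p_uncertain)
    # Suppression grows when the model is often right yet still abstains a lot.
    suppression = p_right * p_uncertain
    return gain - suppression


hallucinating = ["wrong"] * 6 + ["right"] * 2       # many errors, no abstention
over_hedging = ["right"] * 4 + ["uncertain"] * 4    # capable, but abstains often

print("hallucinating group gamma:", toy_gamma(hallucinating))  # positive
print("over-hedging group gamma:", toy_gamma(over_hedging))    # negative
```

A positive gamma rewards abstention in the group that hallucinates, and a negative gamma discourages it in the group that could answer, so the incentive tracks task difficulty and model capability instead of staying fixed.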

Uncertainty ratio evolution during training
Training dynamics show UCPO avoiding both zero-uncertainty overconfidence and all-uncertainty collapse.

Evidence

Experiments use Qwen3-8B and Llama-3.1-8B-Instruct across math/text reasoning and general tasks. UCPO reports the highest average PAQ in both domains for both models: Qwen3-8B reaches 79.63 PAQ on math/text reasoning and 79.68 on general tasks, while Llama-3.1-8B-Instruct reaches 28.45 and 58.58. The visual analyses show GRPO staying near zero uncertainty, fixed GRPO-UC variants swinging between suppressed uncertainty and 100% uncertainty, and UCPO converging to a more controlled uncertainty ratio.

Aggregated accuracy, hallucination, and uncertainty distributions
Output distributions make the reliability story visible: fewer hallucinations, more calibrated uncertainty.