Daily Papers - August 29, 2025

Total papers: 19

1. Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang

Info: 📅 Published: 2025-08-28 | 👍 Upvotes: 56

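Abstract:

Recent advances highlight the importance of GRPO-based reinforcement learning methods and benchmarks for improving text-to-image (T2I) generation. However, current methods that score generated images with pointwise reward models (RMs) are prone to reward hacking. We find that when score differences between images are amplified after normalization, spurious advantages arise, driving the model to over-optimize for trivial gains and ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images within each group are compared pairwise using a preference RM, and the win rate is used as the reward signal. Extensive experiments show that Pref-GRPO can distinguish subtle differences in image quality, providing more stable gains and mitigating reward hacking. In addition, existing T2I benchmarks are limited by coarse-grained evaluation criteria, which makes comprehensive model assessment difficult. To this end, we introduce UniGenBench, a unified T2I benchmark covering 5 main themes and 20 subthemes, with 600 prompts in total. The benchmark evaluates semantic consistency through 10 primary criteria and 27 sub-criteria, and uses multimodal large language models (MLLMs) for its construction and evaluation. Our benchmark reveals the strengths and weaknesses of open-source and closed-source T2I models and validates the effectiveness of Pref-GRPO.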
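The two reward schemes contrasted in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the `prefer` callable standing in for a preference RM, the toy scores, and the mean/std group normalization (a common GRPO formulation, not specified in the abstract) are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Group normalization (assumed form): (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def win_rate_rewards(items: Sequence[T], prefer: Callable[[T, T], bool]) -> List[float]:
    """Compare every pair in the group once; reward = fraction of comparisons won.

    `prefer(a, b)` is a hypothetical stand-in for a preference RM and
    returns True when `a` is preferred over `b`.
    """
    n = len(items)
    if n < 2:
        raise ValueError("need at least two items to compare")
    wins = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if prefer(items[i], items[j]):
                wins[i] += 1
            else:
                wins[j] += 1
    return [w / (n - 1) for w in wins]


# Toy pointwise scores that differ only in the third decimal place.
pointwise_scores = [0.810, 0.812, 0.811, 0.813]

# After normalization the tiny gaps become advantages of order 1.
pointwise_adv = group_advantages(pointwise_scores)

# Win-rate rewards from a toy comparator (here: the higher toy score wins).
win_rates = win_rate_rewards(pointwise_scores, lambda a, b: a > b)

print(pointwise_adv)
print(win_rates)
```

The first printout shows how normalization turns score gaps of about 0.001 into advantages of order 1, which is the amplification effect the abstract describes. The second shows the win-rate reward, which depends only on the outcomes of the pairwise comparisons and not on the magnitude of any pointwise score.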

2. rStar2-Agent: Agentic Reasoning Technical Report

Authors: Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, Ying Xin, Ziming Miao, Scarlett Li, Fan Yang, Mao Yang


1. Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable

2. rStar2-Agent: Agentic Reasoning Technical Report

3. USO: Unified Style and Subject-Driven Generation via Disentangled and

4. AWorld: Orchestrating the Training Recipe for Agentic AI

5. TCIA: A Task-Centric Instruction Augmentation Method for Instruction

6. Mixture of Contexts for Long Video Generation

7. MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World

8. CogVLA: Cognition-Aligned Vision-Language-Action Model via

9. OneReward: Unified Mask-Guided Image Generation via Multi-Task Human

10. Turning the Spell Around: Lightweight Alignment Amplification via

11. Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability

12. Dress&Dance: Dress up and Dance as You Like It - Technical Preview

13. OnGoal: Tracking and Visualizing Conversational Goals in Multi-Turn

14. Multi-View 3D Point Tracking

15. FakeParts: a New Family of AI-Generated DeepFakes

16. Provable Benefits of In-Tool Learning for Large Language Models

17. Collaborative Multi-Modal Coding for High-Quality 3D Generation

18. ROSE: Remove Objects with Side Effects in Videos

19. Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and
