CL Sep 19, 2025
Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning
Zhiling Ye, Yun Yue, Haowen Wang et al.
Open-ended evaluation is essential for deploying large language models in real-world settings. In studying HealthBench, we observe that using the model itself as a grader and generating rubric-based reward signals substantially improves reasoning performance. Remarkably, the trained model also becomes a stronger grader. Motivated by this, we introduce Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning, a lightweight framework that enables faster and more resource-efficient training while surpassing baselines. Notably, on Qwen3-32B, training on only the 4000-sample HealthBench Easy subset is sufficient to obtain a model that exceeds GPT-5 on HealthBench Hard. Incorporating a small amount of teacher-graded data further enhances performance for less capable models.
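The abstract does not state the reward formula, so the following is only a minimal sketch of how a rubric-based reward with the model as its own grader might be computed. It assumes a HealthBench-style scoring rule (points earned on met criteria divided by the total positive points, clipped to [0, 1]). The names `Criterion`, `rubric_reward`, `self_reward`, and the `judge` callable are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Criterion:
    description: str
    points: float  # positive for desired behavior, negative for undesired


def rubric_reward(criteria: Sequence[Criterion], met: Sequence[bool]) -> float:
    """Assumed scoring rule: earned points / total positive points, clipped to [0, 1]."""
    if len(criteria) != len(met):
        raise ValueError("need one met/not-met judgment per criterion")
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points <= 0:
        return 0.0  # no positive criteria, so nothing can be earned
    earned = sum(c.points for c, m in zip(criteria, met) if m)
    return min(1.0, max(0.0, earned / max_points))


def self_reward(
    response: str,
    criteria: Sequence[Criterion],
    judge: Callable[[str, str], bool],
) -> float:
    """Score a response using `judge`, a stand-in for the policy model acting as grader.

    `judge(response, criterion_description)` returns True if the criterion is met.
    """
    met = [judge(response, c.description) for c in criteria]
    return rubric_reward(criteria, met)
```

In a self-rewarding setup, `judge` would wrap a call to the same model that is being trained; here it is left as a plain callable so the scoring logic can be checked in isolation.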
LG Aug 11, 2025
Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment
Haowen Wang, Yun Yue, Zhiling Ye et al.
Alignment methodologies have emerged as a critical pathway for enhancing the alignment capabilities of language models. While SFT (supervised fine-tuning) accelerates convergence through direct token-level loss intervention, its efficacy is constrained by its reliance on offline policy trajectories. In contrast, RL (reinforcement learning) facilitates exploratory policy optimization but suffers from low sample efficiency and a stringent dependency on high-quality base models. To address these dual challenges, we propose GRAO (Group Relative Alignment Optimization), a unified framework that synergizes the respective strengths of SFT and RL through three key innovations: 1) a multi-sample generation strategy enabling comparative quality assessment via reward feedback; 2) a novel Group Direct Alignment Loss formulation leveraging intra-group relative advantage weighting; 3) reference-aware parameter updates guided by pairwise preference dynamics. Our theoretical analysis establishes GRAO's convergence guarantees and sample efficiency advantages over conventional approaches. Comprehensive evaluations across complex human alignment tasks demonstrate GRAO's superior performance, achieving 57.70%, 17.65%, 7.95%, and 5.18% relative improvements over SFT, DPO, PPO, and GRPO baselines, respectively. This work provides both a theoretically grounded alignment framework and empirical evidence for efficient capability evolution in language models.
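The abstract names intra-group relative advantage weighting but does not give GRAO's loss. As a rough illustration of the general idea only, the sketch below computes group-relative advantages by standardizing the rewards of several responses sampled for the same prompt, which is the normalization commonly associated with GRPO. It is an assumption, not the paper's Group Direct Alignment Loss, and the function name and `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples drawn for the same prompt.

    Each advantage measures how much better or worse a sample is than the
    group average, in units of the group's (population) standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample received the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Samples scoring above the group mean receive positive weights and those below receive negative ones; when all rewards in a group are equal, every advantage is zero and the group contributes no preference signal.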
33.4 HC Mar 16
Where Digital Meets Place: Deriving Strategies for Curating Mixed Reality Exhibitions in Public Spaces
Yawei Zhao, Jiaxin Liang, Hao Li et al.
Mixed Reality (MR) technologies are increasingly being used to enrich exhibitions and public spaces by blending digital content with the physical environment in real time. However, little is known about curatorial strategies for embedding MR exhibitions into public spaces or promoting audience experiences. To explore this, we designed and curated a campus-based MR art exhibition, using contextualism as the fundamental concept. We conducted an interdisciplinary expert focus group alongside exhibition viewing to identify opportunities, challenges, and design strategies from multiple perspectives. In parallel, we conducted user studies with general audiences to examine how curatorial strategies foster experiential qualities. Our findings reveal insights from both experts and general users, along with strategies for curating MR exhibitions, and highlight the foundational role of contextualism in curating MR art exhibitions in urban public spaces.