2.0HCMay 4
Beyond Advocacy: A Design Space for Replication-Related StudiesYiheng Liang, Kim Marriott, Helen C. Purchase
The importance of replication is often discussed and advocated -- not only in the domains of visualization and HCI, but in all scientific areas. When replicating a study, design decisions need to be made with regards which aspects of the original study will remain the same and which will be altered. We present a supporting multi-dimensional design space framework within which such decisions can be identified, categorized, compared and analyzed. The framework treats replication experimental design as a pairwise comparison problem, and represents the design by four practical dimensions defined by three comparison levels. The design space is therefore a framework that can be used for both retrospective characterization and prospective planning. We provide worked examples, and relate our framework to other attempts at describing the scope of replication studies.
74.3LGApr 12
SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive WeightingBinbin Zheng, Xing Ma, Yiheng Liang et al.
On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.