A Unified Evaluation Framework for Multi-Annotator Tendency Learning
The work addresses a gap in evaluating Individual Tendency Learning (ITL) methods that matters to machine learning researchers and practitioners, but its contribution is incremental because it focuses on evaluation rather than on a new learning paradigm.
The paper tackles the lack of an evaluation framework for Individual Tendency Learning (ITL) methods in multi-annotator learning. It proposes a unified framework with two novel metrics, DIC and BAE, that assess how well these methods capture annotator-specific tendencies and provide meaningful behavioral explanations, and it validates the framework through extensive experiments.
Recent work in multi-annotator learning has shifted focus from Consensus-oriented Learning (CoL), which aggregates multiple annotations into a single ground-truth prediction, to Individual Tendency Learning (ITL), which models annotator-specific labeling behavior patterns (i.e., tendencies) to provide explanation analysis for understanding annotator decisions. However, no evaluation framework currently exists to assess whether ITL methods truly capture individual tendencies and provide meaningful behavioral explanations. To address this gap, we propose the first unified evaluation framework with two novel metrics: (1) Difference of Inter-annotator Consistency (DIC) quantifies how well models capture annotator tendencies by comparing predicted inter-annotator similarity structures with the ground-truth structures; (2) Behavior Alignment Explainability (BAE) evaluates how well model explanations reflect annotator behavior and decision relevance by aligning explainability-derived similarity structures with ground-truth labeling similarity structures via Multidimensional Scaling (MDS). Extensive experiments validate the effectiveness of our proposed evaluation framework.
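To make the idea behind DIC concrete, the following is a minimal sketch and not the paper's definition, which the abstract does not spell out. It assumes that inter-annotator similarity is measured as the pairwise label agreement rate, that every annotator labels every sample, and that the two similarity structures are compared by mean absolute difference over annotator pairs. The function names `agreement_matrix` and `dic_sketch` are hypothetical.

```python
import numpy as np


def agreement_matrix(labels):
    """Pairwise agreement rates between annotators.

    labels: array of shape (n_annotators, n_samples) with integer class labels.
    Assumption: every annotator labels every sample (no missing entries).
    """
    labels = np.asarray(labels)
    # Compare each annotator pair on each sample, then average over samples.
    return (labels[:, None, :] == labels[None, :, :]).mean(axis=-1)


def dic_sketch(true_labels, pred_labels):
    """Hypothetical DIC-style score: mean absolute gap between the
    ground-truth and predicted inter-annotator agreement structures.
    Lower means the model preserves who agrees with whom more faithfully.
    """
    a_true = agreement_matrix(true_labels)
    a_pred = agreement_matrix(pred_labels)
    # Use only distinct annotator pairs (strict upper triangle).
    iu = np.triu_indices(a_true.shape[0], k=1)
    return float(np.abs(a_true[iu] - a_pred[iu]).mean())


# Three annotators, four samples (toy data, invented for illustration).
true_labels = np.array([[0, 1, 1, 0],
                        [0, 1, 0, 0],
                        [1, 1, 1, 0]])

# A model that collapses all annotators onto one consensus prediction.
collapsed = np.tile(true_labels[0], (3, 1))

print(dic_sketch(true_labels, true_labels))  # perfect tendency capture
print(dic_sketch(true_labels, collapsed))    # consensus-style collapse
```

Under these assumptions, a model that reproduces each annotator's labels exactly scores zero, while a model that predicts the same consensus label for every annotator is penalized because it erases the disagreement structure present in the ground truth.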