Better Than Their Reputation? On the Reliability of Relevance Assessments with Students
This work addresses the problem of unreliable evaluation data in information retrieval research, particularly when student assessors are used. Its contribution is incremental: it applies existing statistical measures to expose and mitigate assessment disagreement.
The study investigated the reliability of relevance assessments made by students in information retrieval evaluations, finding that inter-assessor agreement was low with Fleiss' Kappa at 0.37 and Krippendorff's Alpha at 0.15, and that filtering unreliable assessments reduced root mean square error by 0.02 to 0.12.
Over the last three years we conducted several information retrieval evaluation series with more than 180 LIS students, who made relevance assessments on the output of three specific retrieval services. In this study we focus not on the retrieval performance of our system but on the relevance assessments and their inter-assessor reliability. To quantify agreement we apply Fleiss' Kappa and Krippendorff's Alpha. On average, Kappa values were 0.37 and Alpha values 0.15. We use the two agreement measures to drop assessments that are too unreliable from our data set. Comparing the unfiltered and the filtered data sets, we find a root mean square error between 0.02 and 0.12. We see this as a clear indicator that disagreement affects the reliability of retrieval evaluations. We suggest either not working with unfiltered results or clearly documenting the disagreement rates.
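To make the two computations concrete, the following is a minimal sketch of Fleiss' Kappa and of a root mean square error between two lists of scores. It is not the authors' implementation. The function names `fleiss_kappa` and `rmse`, the input layout (one row of category counts per assessed document), and the toy data are assumptions made for illustration. Krippendorff's Alpha is left out for brevity, and the threshold used to drop unreliable assessments is not modelled because the abstract does not state it.

```python
import math


def fleiss_kappa(counts):
    """Fleiss' Kappa for a table of per-item category counts.

    counts[i][j] is the number of assessors who assigned item i to
    category j (for example: j = 0 "relevant", j = 1 "not relevant").
    Every item must be judged by the same number of assessors (>= 2).
    """
    if not counts:
        raise ValueError("need at least one item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of assessors")
    n_categories = len(counts[0])

    # Observed agreement: share of agreeing assessor pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement from the overall share of each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(s * s for s in shares)
    if p_expected == 1.0:
        raise ValueError("Kappa is undefined when only one category is used")

    return (p_observed - p_expected) / (1.0 - p_expected)


def rmse(xs, ys):
    """Root mean square error between two equally long lists of scores."""
    if not xs or len(xs) != len(ys):
        raise ValueError("need two non-empty lists of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


if __name__ == "__main__":
    # Hypothetical toy data: 4 documents, 3 assessors, 2 relevance categories.
    judgments = [[3, 0], [2, 1], [0, 3], [1, 2]]
    print("Fleiss' Kappa:", fleiss_kappa(judgments))

    # Hypothetical per-topic scores before and after filtering.
    unfiltered = [0.50, 0.40, 0.30]
    filtered = [0.55, 0.35, 0.30]
    print("RMSE:", rmse(unfiltered, filtered))
```

In a setup like the one described, `fleiss_kappa` would be computed per topic or per assessor group, and `rmse` would compare an evaluation score computed on the unfiltered assessments with the same score computed after filtering.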