Cross-replication Reliability -- An Empirical Approach to Interpreting Inter-rater Reliability
This work addresses the need for better interpretation of inter-rater reliability (IRR) in fields such as psychology and data science, and offers a practical tool for evaluating crowdsourced datasets. The contribution is incremental, since it builds on existing reliability measures.
The authors address the problem of interpreting inter-rater reliability (IRR) by proposing an empirical framework that benchmarks IRR against baseline measures, including a novel cross-replication reliability (xRR) measure based on Cohen's kappa. They apply the framework to a dataset of 4 million human judgments of facial expressions to assess the quality of crowdsourced datasets.
We present a new approach to interpreting IRR that is empirical and contextualized. It is based upon benchmarking IRR against baseline measures in a replication, one of which is a novel cross-replication reliability (xRR) measure based on Cohen's kappa. We call this approach the xRR framework. We open-source a replication dataset of 4 million human judgments of facial expressions and analyze it with the proposed framework. We argue this framework can be used to measure the quality of crowdsourced datasets.
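
To make the idea of a kappa-based cross-replication measure concrete, the following is a minimal sketch, not the authors' implementation. `cohen_kappa` is the standard two-rater statistic for nominal labels. `cross_kappa` is a hypothetical kappa-style analogue: it assumes that xRR compares ratings of the same item across two replications (observed disagreement) against ratings paired regardless of item (expected disagreement). The function names, the input layout (one list of labels per item), and the exact chance correction are assumptions made for illustration; the text above does not specify them.

```python
from collections import Counter
from itertools import product


def cohen_kappa(a, b):
    """Standard Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(a)
    # Observed agreement: fraction of items on which the raters match.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from the two raters' marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


def cross_kappa(rep_x, rep_y):
    """Hypothetical kappa-style agreement between two replications.

    rep_x[i] and rep_y[i] hold the labels given to item i in each
    replication. This layout and formula are assumptions, not taken
    from the source text.
    """
    if len(rep_x) != len(rep_y) or not rep_x:
        raise ValueError("replications must cover the same non-empty item set")
    # Observed disagreement: per item, the share of cross-replication
    # rating pairs that differ, averaged over items.
    per_item = []
    for xs, ys in zip(rep_x, rep_y):
        pairs = list(product(xs, ys))
        if not pairs:
            raise ValueError("every item needs a rating in both replications")
        per_item.append(sum(x != y for x, y in pairs) / len(pairs))
    d_o = sum(per_item) / len(per_item)
    # Expected disagreement: pair every rating in X with every rating in Y,
    # ignoring which item each rating belongs to.
    count_x = Counter(label for xs in rep_x for label in xs)
    count_y = Counter(label for ys in rep_y for label in ys)
    total = sum(count_x.values()) * sum(count_y.values())
    d_e = 1.0 - sum(count_x[k] * count_y[k] for k in count_x) / total
    if d_e == 0.0:
        raise ValueError("undefined when both replications use a single label")
    return 1.0 - d_o / d_e
```

Under these assumptions, the benchmarking described above amounts to computing a within-replication IRR for each replication and reading it alongside the cross-replication value, rather than judging a single kappa against a fixed threshold.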