Baby Bear: Seeking a Just Right Rating Scale for Scalar Annotations
This work addresses the challenge of scalable, cost-effective annotation for tasks such as sentiment analysis and dialogue evaluation; its contribution is incremental, in that it optimizes existing annotation approaches.
The researchers tackled the problem of efficiently assigning scalar ratings to large datasets by comparing iterative Best-Worst Scaling (IBWS) with direct assessment methods, finding that certain direct methods achieve high correlation (e.g., 0.95) with IBWS at lower cost.
Our goal is a mechanism for efficiently assigning scalar ratings to each of a large set of elements, for example, answering "what percent positive or negative is this product review?" When sample sizes are small, prior work has advocated for methods such as Best-Worst Scaling (BWS) as being more robust than direct ordinal annotation ("Likert scales"). Here we first introduce IBWS, which iteratively collects annotations through Best-Worst Scaling, resulting in robustly ranked crowd-sourced data. While effective, IBWS is too expensive for large-scale tasks. Using the results of IBWS as the desired reference outcome, we evaluate various direct assessment methods to determine which is both cost-efficient and correlates best with a large-scale BWS annotation strategy. Finally, we illustrate in the domains of dialogue and sentiment how these annotations can support robust learning-to-rank models.
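To make the comparison concrete, the following is a minimal sketch of how BWS judgments are commonly turned into scalar scores (the standard counting procedure: times chosen best minus times chosen worst, divided by appearances) and then correlated with direct ratings. It is not the IBWS procedure itself, whose iteration scheme is not described here. The item names, the toy judgments, the 0-100 direct ratings, the helper names `bws_scores` and `pearson`, and the choice of Pearson correlation are all assumptions made for illustration.

```python
from collections import defaultdict


def bws_scores(annotations):
    """Counting-based BWS score per item: (#best - #worst) / #appearances.

    `annotations` is an iterable of (items, best, worst) triples, where
    `items` is the tuple shown to the annotator (hypothetical format).
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    seen = defaultdict(int)
    for items, b, w in annotations:
        for item in items:
            seen[item] += 1
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Toy data (made up): three annotators judge the same 4-tuple of reviews.
annotations = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "d"), "a", "c"),
    (("a", "b", "c", "d"), "b", "d"),
]
scores = bws_scores(annotations)

# Hypothetical direct-assessment ratings on a 0-100 scale for the same items.
direct = {"a": 90, "b": 70, "c": 40, "d": 10}

items = sorted(scores)
r = pearson([scores[i] for i in items], [direct[i] for i in items])
print(scores)
print(round(r, 3))
```

A direct assessment method whose ratings correlate highly with the BWS-derived scores, while needing fewer judgments per item, would be the kind of cost-efficient alternative the abstract describes.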