Confidence Interval Estimators for MOS Values
This work addresses the need for reliable interval estimation in QoE research, particularly for small-sample lab studies. Its contribution is incremental, however, since it mainly reviews and adapts existing statistical approaches.
The paper tackles the problem of estimating confidence intervals (CIs) for Mean Opinion Scores (MOS) in subjective quality-of-experience (QoE) studies, where small sample sizes and bounded rating scales challenge traditional methods. It proposes a conservative estimator based on the SOS hypothesis and binomial distributions, and reports that this estimator achieves appropriate coverage and avoids out-of-bounds intervals. Studentized CIs, by contrast, have a positive outlier ratio, meaning that some intervals extend beyond the rating scale, while bootstrapping yields smaller intervals but lower coverage than the proposed estimator.
To quantify QoE, subjects typically provide individual ratings on a given rating scale, which are then aggregated into Mean Opinion Scores (MOS). The goal is to estimate the expected value from the observed sample data. While the sample average provides only a point estimate, a confidence interval (CI) is an interval estimate that contains the expected value with a given confidence level. In subjective studies, the number of subjects performing the test is typically small, especially in lab environments. The rating scales used are bounded and often discrete, like the 5-point ACR rating scale. Therefore, we review statistical approaches from the literature for their applicability to MOS interval estimation in the QoE domain (instead of having only a point estimate, which is the MOS). We provide a conservative estimator based on the SOS hypothesis and binomial distributions and compare its performance (CI width, outlier ratio of CIs violating the rating scale bounds) and coverage probability with well-known CI estimators. We show that the provided CI estimator works very well in practice for MOS interval estimation, while the commonly used studentized CIs suffer from a positive outlier ratio, i.e., CIs beyond the bounds of the rating scale. As an alternative, bootstrapping, i.e., random sampling of the subjective ratings with replacement, is an efficient CI estimator that typically leads to smaller CIs, but lower coverage than the proposed estimator.
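To make the comparison concrete, the following minimal Python sketch computes three intervals for one small set of 5-point ratings. It is an illustration under stated assumptions, not the paper's implementation. The function names, the example ratings, and the resampling settings are hypothetical. The studentized interval expects the t quantile to be supplied by the caller (2.262 is the 97.5% quantile for 9 degrees of freedom). The binomial variant is only one plausible reading of a "binomial-based" estimator: it assumes each rating behaves like low + Binomial(high - low, p) and applies a Wilson score interval to p, which may differ from the estimator the paper actually proposes.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def studentized_ci(ratings, t_crit):
    """Student-t interval: mean +/- t_crit * s / sqrt(n).

    t_crit is the (1 - alpha/2) quantile of Student's t with n - 1
    degrees of freedom, supplied by the caller.
    """
    n = len(ratings)
    m = mean(ratings)
    half = t_crit * stdev(ratings) / math.sqrt(n)
    return m - half, m + half


def bootstrap_ci(ratings, level=0.95, n_resamples=2000, seed=0):
    """Percentile bootstrap: resample the ratings with replacement."""
    rng = random.Random(seed)
    n = len(ratings)
    means = sorted(mean(rng.choices(ratings, k=n)) for _ in range(n_resamples))
    alpha = 1.0 - level
    lo = means[math.floor(alpha / 2 * (n_resamples - 1))]
    hi = means[math.ceil((1 - alpha / 2) * (n_resamples - 1))]
    return lo, hi


def binomial_wilson_ci(ratings, low=1, high=5, level=0.95):
    """ASSUMPTION (not taken from the paper): model each rating as
    low + Binomial(high - low, p), compute a Wilson score interval
    for p, and map it back to the rating scale."""
    trials = len(ratings) * (high - low)
    p_hat = sum(r - low for r in ratings) / trials
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    denom = 1 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials ** 2)
    ) / denom
    span = high - low
    return low + span * (centre - half), low + span * (centre + half)


# Hypothetical ratings from 10 subjects on a 5-point ACR scale.
ratings = [5, 5, 5, 4, 5, 5, 5, 5, 4, 5]

print("studentized:", studentized_ci(ratings, t_crit=2.262))
print("bootstrap:  ", bootstrap_ci(ratings))
print("binomial:   ", binomial_wilson_ci(ratings))
```

With ratings concentrated near the top of the scale, the studentized interval can extend past 5, which is the kind of out-of-bounds outcome the abstract calls an outlier. The bootstrap interval cannot leave the scale because every resampled mean lies between the smallest and largest observed rating, and the Wilson-based interval stays in bounds because the interval for p is confined to [0, 1].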