An Open-Source Implementation of ITU-T Recommendation P.808 with Validation
This work addresses the need among researchers and practitioners for accessible, efficient crowdsourced speech quality testing. Its contribution is incremental in that it builds on existing standards.
The authors address subjective speech quality assessment by providing an open-source implementation of ITU-T Recommendation P.808 on Amazon Mechanical Turk. They extend it to include the DCR and CCR test methods and speed up the test process. Their validation shows MOS values comparable to those from laboratory experiments and quantifies the impact of the reliability improvements.
The ITU-T Recommendation P.808 provides a crowdsourcing approach for conducting a subjective assessment of speech quality using the Absolute Category Rating (ACR) method. We provide an open-source implementation of ITU-T Rec. P.808 that runs on the Amazon Mechanical Turk platform. We extend the implementation to include the Degradation Category Rating (DCR) and Comparison Category Rating (CCR) test methods. We also significantly speed up the test process by integrating the participant qualification step into the main rating task, rather than using a two-stage qualification and rating solution. We provide scripts for creating and executing the subjective test, and for cleansing and analyzing the collected answers, to avoid operational errors. To validate the implementation, we compare the Mean Opinion Scores (MOS) collected through our implementation with MOS values from a standard laboratory experiment conducted according to ITU-T Rec. P.800. We also evaluate the reproducibility of the results of subjective speech quality assessment through crowdsourcing using our implementation. Finally, we quantify the impact of the parts of the system designed to improve reliability: environmental tests, gold and trapping questions, rating patterns, and a headset usage test.
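
The data cleansing and validation steps described above can be illustrated with a minimal sketch. It is not the released implementation: the record fields (`condition`, `rating`, `gold_ok`, `trap_ok`), the function names, and all numbers are hypothetical and chosen only for illustration. The sketch assumes that a submission is discarded when its gold or trapping question was answered incorrectly, that MOS is the mean rating per test condition, and that agreement with laboratory MOS is summarized by a Pearson correlation.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def filter_reliable(submissions):
    """Keep submissions that passed both the gold and the trapping question."""
    return [s for s in submissions if s["gold_ok"] and s["trap_ok"]]


def mos_per_condition(submissions):
    """Mean Opinion Score: average rating for each test condition."""
    by_condition = defaultdict(list)
    for s in submissions:
        by_condition[s["condition"]].append(s["rating"])
    return {cond: mean(ratings) for cond, ratings in by_condition.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy, made-up ratings on the 5-point ACR scale (illustration only).
submissions = [
    {"condition": "c1", "rating": 4, "gold_ok": True, "trap_ok": True},
    {"condition": "c1", "rating": 5, "gold_ok": True, "trap_ok": True},
    {"condition": "c1", "rating": 1, "gold_ok": False, "trap_ok": True},  # fails gold question
    {"condition": "c2", "rating": 2, "gold_ok": True, "trap_ok": True},
    {"condition": "c2", "rating": 3, "gold_ok": True, "trap_ok": True},
    {"condition": "c3", "rating": 3, "gold_ok": True, "trap_ok": True},
    {"condition": "c3", "rating": 4, "gold_ok": True, "trap_ok": False},  # fails trapping question
    {"condition": "c3", "rating": 4, "gold_ok": True, "trap_ok": True},
]

# Hypothetical laboratory MOS values for the same conditions (not real data).
lab_mos = {"c1": 4.4, "c2": 2.3, "c3": 3.6}

reliable = filter_reliable(submissions)
crowd_mos = mos_per_condition(reliable)
conditions = sorted(crowd_mos)
r = pearson([crowd_mos[c] for c in conditions], [lab_mos[c] for c in conditions])
print(crowd_mos, round(r, 3))
```

A real analysis would also apply the other reliability checks named in the abstract (environmental tests, rating patterns, headset usage test) and would typically report confidence intervals alongside each MOS value.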