MM · Feb 4, 2022 · Code
Generalised Score Distribution: A Two-Parameter Discrete Distribution Accurately Describing Responses from Quality of Experience Subjective Experiments
Jakub Nawała, Lucjan Janowski, Bogdan Ćmiel et al.
Subjective responses from Multimedia Quality Assessment (MQA) experiments are conventionally analysed with methods not suitable for the data type these responses represent. Furthermore, obtaining subjective responses is resource intensive. A method allowing reuse of existing responses would thus be beneficial. Applying improper data analysis methods leads to results that are difficult to interpret, which encourages drawing erroneous conclusions. Building upon existing subjective responses saves resources and helps develop machine learning (ML) based visual quality predictors. We show that using a discrete model for analysis of responses from MQA subjective experiments is feasible. We indicate that our proposed Generalised Score Distribution (GSD) properly describes response distributions observed in typical MQA experiments. We highlight the interpretability of GSD parameters and indicate that the GSD outperforms the approach based on the sample empirical distribution when it comes to bootstrapping. We evidence that the GSD outcompetes the state-of-the-art model both in terms of goodness-of-fit and bootstrapping capabilities. To do all of that, we analyse more than one million subjective responses from more than 30 subjective experiments. Furthermore, we make the code implementing the GSD model and related analyses available through our GitHub repository: https://github.com/Qub3k/subjective-exp-consistency-check
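For orientation, the baseline that the abstract compares against (bootstrapping from the sample empirical distribution) can be sketched as follows. This is only an illustrative sketch, not the code from the authors' repository: the function name `bootstrap_mos_ci`, its parameters, and the example ratings are assumptions made for this example.

```python
import numpy as np

def bootstrap_mos_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean opinion score of one stimulus.

    Hypothetical helper: resamples the observed scores with replacement,
    i.e. draws from the sample empirical distribution.
    """
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row is one resample of the same size as the original sample.
    resamples = rng.choice(scores, size=(n_boot, scores.size), replace=True)
    boot_means = resamples.mean(axis=1)
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), lower, upper

# Made-up 5-point ACR ratings for a single stimulus (not data from the paper).
example_scores = [5, 4, 4, 3, 5, 4, 2, 4, 5, 3, 4, 4]
mos, ci_low, ci_high = bootstrap_mos_ci(example_scores)
```

A model-based bootstrap would instead draw resamples from a fitted discrete distribution; the abstract reports that doing so with the GSD works better than this empirical baseline.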
MM · Apr 5, 2020 · Code
A Simple Model for Subject Behavior in Subjective Experiments
Zhi Li, Christos G. Bampis, Lukáš Krasula et al.
In a subjective experiment to evaluate the perceptual audiovisual quality of multimedia and television services, raw opinion scores collected from test subjects are often noisy and unreliable. To produce the final mean opinion scores (MOS), recommendations such as ITU-R BT.500, ITU-T P.910 and ITU-T P.913 standardize post-test screening procedures to clean up the raw opinion scores, using techniques such as subject outlier rejection and bias removal. In this paper, we analyze the prior standardized techniques to demonstrate their weaknesses. As an alternative, we propose a simple model to account for two of the most dominant behaviors of subject inaccuracy: bias and inconsistency. We further show that this model can also effectively deal with inattentive subjects that give random scores. We propose to use maximum likelihood estimation to jointly solve the model parameters, and present two numeric solvers: the first based on the Newton-Raphson method, and the second based on an alternating projection (AP). We show that the AP solver generalizes the ITU-T P.913 post-test screening procedure by weighing a subject's contribution to the true quality score by her consistency (thus, the quality scores estimated can be interpreted as bias-subtracted consistency-weighted MOS). We compare the proposed methods with the standardized techniques using real datasets and synthetic simulations, and demonstrate that the proposed methods are the most valuable when the test conditions are challenging (for example, crowdsourcing and cross-lab studies), offering advantages such as better model-data fit, tighter confidence intervals, better robustness against subject outliers, the absence of hard coded parameters and thresholds, and auxiliary information on test subjects. The code for this work is open-sourced at https://github.com/Netflix/sureal.
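The idea of a bias-subtracted, consistency-weighted MOS can be illustrated with a short alternating-update sketch. It assumes a complete subjects-by-stimuli score matrix and a model in which each score is the stimulus quality plus a per-subject bias plus per-subject Gaussian noise. The function name `ap_solver`, the fixed iteration count, and the synthetic data are assumptions for illustration; this is not the sureal implementation.

```python
import numpy as np

def ap_solver(scores, n_iter=20, eps=1e-8):
    """Alternately update per-subject bias/inconsistency and per-stimulus quality.

    scores: 2-D array, rows = subjects, columns = stimuli (no missing entries).
    Hypothetical sketch; real solvers add a convergence check.
    """
    scores = np.asarray(scores, dtype=float)
    quality = scores.mean(axis=0)  # start from the plain MOS
    for _ in range(n_iter):
        residual = scores - quality                  # subject score minus current quality
        bias = residual.mean(axis=1)                 # systematic offset of each subject
        inconsistency = np.maximum(
            (residual - bias[:, None]).std(axis=1), eps
        )                                            # spread of each subject around their bias
        weights = 1.0 / inconsistency**2             # consistent subjects count more
        quality = (weights[:, None] * (scores - bias[:, None])).sum(axis=0) / weights.sum()
    return quality, bias, inconsistency

# Synthetic data (made up): subject 0 is biased by +1, subject 5 is very noisy.
rng = np.random.default_rng(0)
true_quality = np.linspace(1.0, 5.0, 20)
true_bias = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
true_noise = np.array([0.2, 0.2, 0.2, 0.2, 0.2, 2.0])
scores = (true_quality
          + true_bias[:, None]
          + true_noise[:, None] * rng.standard_normal((6, 20)))
quality, bias, inconsistency = ap_solver(scores)
```

Quality and bias are only identifiable up to a common shift, so the recovered biases should be read relative to one another rather than as absolute values.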
MM · Apr 6, 2021
Subjective Assessment Experiments That Recruit Few Observers With Repetitions (FOWR)
Pablo Perez, Lucjan Janowski, Narciso Garcia et al.
Recent studies have shown that it is possible to characterize subject bias and variance in subjective assessment tests. Apparent differences among subjects can, for the most part, be explained by random factors. Building on that theory, we propose a subjective test design where three to four team members each rate the stimuli multiple times. The results are comparable to a high performing objective metric. This provides a quick and simple way to analyze new technologies and perform pre-tests for subjective assessment.
MM · Sep 28, 2020
Describing Subjective Experiment Consistency by $p$-Value P-P Plot
Jakub Nawała, Lucjan Janowski, Bogdan Ćmiel et al.
There are phenomena that cannot be measured without subjective testing. However, subjective testing is a complex issue with many influencing factors. These interplay to yield either precise or incorrect results. Researchers require a tool to classify the results of a subjective experiment as either consistent or inconsistent. This is necessary in order to decide whether to treat the gathered scores as quality ground truth data. Knowing if subjective scores can be trusted is key to drawing valid conclusions and building functional tools based on those scores (e.g., algorithms assessing the perceived quality of multimedia materials). We provide a tool to classify a subjective experiment (and all its results) as either consistent or inconsistent. Additionally, the tool identifies stimuli having an irregular score distribution. The approach is based on treating subjective scores as a random variable coming from the discrete Generalized Score Distribution (GSD). The GSD, in combination with a bootstrapped G-test of goodness-of-fit, allows one to construct a $p$-value P-P plot that visualizes the experiment's consistency. The tool safeguards researchers from using inconsistent subjective data. In this way, it makes sure that the conclusions they draw and the tools they build are more precise and trustworthy. The proposed approach works in line with expectations drawn solely from the experiment design descriptions of 21 real-life multimedia quality subjective experiments.
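The G-test statistic behind such a goodness-of-fit check is $G = 2 \sum_k O_k \ln(O_k / E_k)$, where $O_k$ are the observed counts per score category and $E_k$ the counts expected under the fitted model. The sketch below computes it and estimates a $p$-value by simulating from fixed model probabilities. It is a simplification under stated assumptions: the probabilities are taken as given (a full procedure would presumably re-fit the model on each simulated sample), and the function names, counts, and probabilities are made up for illustration.

```python
import numpy as np

def g_statistic(observed, probs):
    """G = 2 * sum(O * ln(O / E)); categories with zero observed count add nothing."""
    observed = np.asarray(observed, dtype=float)
    expected = observed.sum() * np.asarray(probs, dtype=float)
    mask = observed > 0
    return 2.0 * np.sum(observed[mask] * np.log(observed[mask] / expected[mask]))

def bootstrap_g_pvalue(observed, probs, n_boot=2000, seed=0):
    """Share of simulated samples whose G is at least as large as the observed G.

    Hypothetical helper: simulates from the given probabilities without re-fitting.
    """
    observed = np.asarray(observed)
    probs = np.asarray(probs, dtype=float)
    rng = np.random.default_rng(seed)
    g_obs = g_statistic(observed, probs)
    simulated = rng.multinomial(int(observed.sum()), probs, size=n_boot)
    g_sim = np.array([g_statistic(row, probs) for row in simulated])
    return g_obs, float(np.mean(g_sim >= g_obs))

# Made-up counts on a 5-point scale and made-up model probabilities.
observed_counts = [1, 3, 8, 9, 3]
model_probs = [0.05, 0.15, 0.30, 0.35, 0.15]
g_value, p_value = bootstrap_g_pvalue(observed_counts, model_probs)
```

Computing one such $p$-value per stimulus and plotting their empirical distribution against the significance level is the general idea of a $p$-value P-P plot.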
ME · Sep 10, 2019
Generalized Score Distribution
Lucjan Janowski, Bogdan Ćmiel, Krzysztof Rusek et al.
A class of discrete probability distributions contains distributions with limited support, i.e. possible argument values are limited to a set of numbers (typically consecutive). Examples of such data are results from subjective experiments utilizing the Absolute Category Rating (ACR) technique, where possible answers (argument values) are $\{1, 2, \cdots, 5\}$ or a typical Likert scale $\{-3, -2, \cdots, 3\}$. An interesting subclass of those distributions are distributions limited to two parameters, describing the mean value and the spread of the answers, and having no more than one change in the probability monotonicity. In this paper we propose a general distribution satisfying those constraints, called the Generalized Score Distribution (GSD). The proposed GSD covers all spreads of the answers, from very small, given by the Bernoulli distribution, to the maximum given by a Beta Binomial distribution. We also show that the GSD correctly describes subjective experiment scores from video quality evaluations with a probability of 99.7\%. A Google Colaboratory website with an implementation of the GSD estimation, simulation, and visualization is provided.
MM · Mar 14, 2019
Notation for Subject Answer Analysis
Lucjan Janowski, Jakub Nawała, Werner Robitza et al.
It is believed that consistent notation helps the research community in many ways. First and foremost, it provides a consistent interface of communication. Subjective experiments described according to uniform rules are easier to understand and analyze. Additionally, a comparison of various results is less complicated. In this publication we describe the notation proposed by the VQEG (Video Quality Experts Group) working group SAM (Statistical Analysis and Methods).