IVMay 1, 2020
Understanding the Perceived Quality of Video PredictionsNagabhushan Somraj, Manoj Surya Kashi, S. P. Arun et al.
The study of video prediction models is believed to be a fundamental approach to representation learning for videos. While a plethora of generative models for predicting the future frame pixel values given the past few frames exist, the quantitative evaluation of the predicted frames has been found to be extremely challenging. In this context, we study the problem of quality assessment of predicted videos. We create the Indian Institute of Science Predicted Videos Quality Assessment (IISc PVQA) Database consisting of 300 videos, obtained by applying different prediction models on different datasets, and accompanying human opinion scores. We collected subjective ratings of quality from 50 human participants for these videos. Our subjective study reveals that human observers were highly consistent in their judgments of quality of predicted videos. We benchmark several popularly used measures for evaluating video prediction and show that they do not adequately correlate with these subjective scores. We introduce two new features to effectively capture the quality of predicted videos, motion-compensated cosine similarities of deep features of predicted frames with past frames, and deep features extracted from rescaled frame differences. We show that our feature design leads to state of the art quality prediction in accordance with human judgments on our IISc PVQA Database. The database and code are publicly available on our project website: https://nagabhushansn95.github.io/publications/2020/pvqa
NCJul 23, 2018
Human peripheral blur is optimal for object recognitionR. T. Pramod, Harish Katti, S. P. Arun
Our vision is sharpest at the center of our gaze and becomes progressively blurry into the periphery. It is widely believed that this high foveal resolution evolved at the expense of peripheral acuity. But what if this sampling scheme is actually optimal for object recognition? To test this hypothesis, we trained deep neural networks on 'foveated' images with high resolution near objects and increasingly sparse sampling into the periphery. Neural networks trained using a blur profile matching the human eye yielded the best performance compared to shallower and steeper blur profiles. Even in humans, categorization accuracy deteriorated only for steeper blur profiles. Thus, our blurry peripheral vision may have evolved to optimize object recognition rather than merely due to wiring constraints.
CVMar 22, 2017
Can you tell where in India I am from? Comparing humans and computers on fine-grained race face classificationHarish Katti, S. P. Arun
Faces form the basis for a rich variety of judgments in humans, yet the underlying features remain poorly understood. Although fine-grained distinctions within a race might more strongly constrain possible facial features used by humans than in case of coarse categories such as race or gender, such fine grained distinctions are relatively less studied. Fine-grained race classification is also interesting because even humans may not be perfectly accurate on these tasks. This allows us to compare errors made by humans and machines, in contrast to standard object detection tasks where human performance is nearly perfect. We have developed a novel face database of close to 1650 diverse Indian faces labeled for fine-grained race (South vs North India) as well as for age, weight, height and gender. We then asked close to 130 human subjects who were instructed to categorize each face as belonging toa Northern or Southern state in India. We then compared human performance on this task with that of computational models trained on the ground-truth labels. Our main results are as follows: (1) Humans are highly consistent (average accuracy: 63.6%), with some faces being consistently classified with > 90% accuracy and others consistently misclassified with < 30% accuracy; (2) Models trained on ground-truth labels showed slightly worse performance (average accuracy: 62%) but showed higher accuracy (72.2%) on faces classified with > 80% accuracy by humans. This was true for models trained on simple spatial and intensity measurements extracted from faces as well as deep neural networks trained on race or gender classification; (3) Using overcomplete banks of features derived from each face part, we found that mouth shape was the single largest contributor towards fine-grained race classification, whereas distances between face parts was the strongest predictor of gender.
CVNov 22, 2016
Deep neural networks can be improved using human-derived contextual expectationsHarish Katti, Marius V. Peelen, S. P. Arun
Real-world objects occur in specific contexts. Such context has been shown to facilitate detection by constraining the locations to search. But can context directly benefit object detection? To do so, context needs to be learned independently from target features. This is impossible in traditional object detection where classifiers are trained on images containing both target features and surrounding context. In contrast, humans can learn context and target features separately, such as when we see highways without cars. Here we show for the first time that human-derived scene expectations can be used to improve object detection performance in machines. To measure contextual expectations, we asked human subjects to indicate the scale, location and likelihood at which cars or people might occur in scenes without these objects. Humans showed highly systematic expectations that we could accurately predict using scene features. This allowed us to predict human expectations on novel scenes without requiring manual annotation. On augmenting deep neural networks with predicted human expectations, we obtained substantial gains in accuracy for detecting cars and people (1-3%) as well as on detecting associated objects (3-20%). In contrast, augmenting deep networks with other conventional features yielded far smaller gains. This improvement was due to relatively poor matches at highly likely locations being correctly labelled as target and conversely strong matches at unlikely locations being correctly rejected as false alarms. Taken together, our results show that augmenting deep neural networks with human-derived context features improves their performance, suggesting that humans learn scene context separately unlike deep networks.