CVDec 30, 2025
Exploring Compositionality in Vision Transformers using Wavelet RepresentationsAkshad Shyam Purushottamdas, Pranav K Nayak, Divya Mehul Rajparia et al.
While insights into the workings of the transformer model have largely emerged by analysing their behaviour on language tasks, this work investigates the representations learnt by the Vision Transformer (ViT) encoder through the lens of compositionality. We introduce a framework, analogous to prior work on measuring compositionality in representation learning, to test for compositionality in the ViT encoder. Crucial to drawing this analogy is the Discrete Wavelet Transform (DWT), which is a simple yet effective tool for obtaining input-dependent primitives in the vision setting. By examining the ability of composed representations to reproduce original image representations, we empirically test the extent to which compositionality is respected in the representation space. Our findings show that primitives from a one-level DWT decomposition produce encoder representations that approximately compose in latent space, offering a new perspective on how ViTs structure information.
CVJun 17, 2024
Inpainting the Gaps: A Novel Framework for Evaluating Explanation Methods in Vision TransformersLokesh Badisa, Sumohana S. Channappayya
The perturbation test remains the go-to evaluation approach for explanation methods in computer vision. This evaluation method has a major drawback of test-time distribution shift due to pixel-masking that is not present in the training set. To overcome this drawback, we propose a novel evaluation framework called \textbf{Inpainting the Gaps (InG)}. Specifically, we propose inpainting parts that constitute partial or complete objects in an image. In this way, one can perform meaningful image perturbations with lower test-time distribution shifts, thereby improving the efficacy of the perturbation test. InG is applied to the PartImageNet dataset to evaluate the performance of popular explanation methods for three training strategies of the Vision Transformer (ViT). Based on this evaluation, we found Beyond Intuition and Generic Attribution to be the two most consistent explanation models. Further, and interestingly, the proposed framework results in higher and more consistent evaluation scores across all the ViT models considered in this work. To the best of our knowledge, InG is the first semi-synthetic framework for the evaluation of ViT explanation methods.
CVJun 11, 2024
Minimizing Energy Costs in Deep Learning Model Training: The Gaussian Sampling ApproachChallapalli Phanindra Revanth, Sumohana S. Channappayya, C Krishna Mohan
Computing the loss gradient via backpropagation consumes considerable energy during deep learning (DL) model training. In this paper, we propose a novel approach to efficiently compute DL models' gradients to mitigate the substantial energy overhead associated with backpropagation. Exploiting the over-parameterized nature of DL models and the smoothness of their loss landscapes, we propose a method called {\em GradSamp} for sampling gradient updates from a Gaussian distribution. Specifically, we update model parameters at a given epoch (chosen periodically or randomly) by perturbing the parameters (element-wise) from the previous epoch with Gaussian ``noise''. The parameters of the Gaussian distribution are estimated using the error between the model parameter values from the two previous epochs. {\em GradSamp} not only streamlines gradient computation but also enables skipping entire epochs, thereby enhancing overall efficiency. We rigorously validate our hypothesis across a diverse set of standard and non-standard CNN and transformer-based models, spanning various computer vision tasks such as image classification, object detection, and image segmentation. Additionally, we explore its efficacy in out-of-distribution scenarios such as Domain Adaptation (DA), Domain Generalization (DG), and decentralized settings like Federated Learning (FL). Our experimental results affirm the effectiveness of {\em GradSamp} in achieving notable energy savings without compromising performance, underscoring its versatility and potential impact in practical DL applications.
IVJan 3, 2022
Lung-Originated Tumor Segmentation from Computed Tomography Scan (LOTUS) BenchmarkParnian Afshar, Arash Mohammadi, Konstantinos N. Plataniotis et al.
Lung cancer is one of the deadliest cancers, and in part its effective diagnosis and treatment depend on the accurate delineation of the tumor. Human-centered segmentation, which is currently the most common approach, is subject to inter-observer variability, and is also time-consuming, considering the fact that only experts are capable of providing annotations. Automatic and semi-automatic tumor segmentation methods have recently shown promising results. However, as different researchers have validated their algorithms using various datasets and performance metrics, reliably evaluating these methods is still an open challenge. The goal of the Lung-Originated Tumor Segmentation from Computed Tomography Scan (LOTUS) Benchmark created through 2018 IEEE Video and Image Processing (VIP) Cup competition, is to provide a unique dataset and pre-defined metrics, so that different researchers can develop and evaluate their methods in a unified fashion. The 2018 VIP Cup started with a global engagement from 42 countries to access the competition data. At the registration stage, there were 129 members clustered into 28 teams from 10 countries, out of which 9 teams made it to the final stage and 6 teams successfully completed all the required tasks. In a nutshell, all the algorithms proposed during the competition, are based on deep learning models combined with a false positive reduction technique. Methods developed by the three finalists show promising results in tumor segmentation, however, more effort should be put into reducing the false positive rate. This competition manuscript presents an overview of the VIP-Cup challenge, along with the proposed algorithms and results.
GRFeb 8, 2020
Deep No-reference Tone Mapped Image Quality AssessmentChandra Sekhar Ravuri, Rajesh Sureddi, Sathya Veera Reddy Dendi et al.
The process of rendering high dynamic range (HDR) images to be viewed on conventional displays is called tone mapping. However, tone mapping introduces distortions in the final image which may lead to visual displeasure. To quantify these distortions, we introduce a novel no-reference quality assessment technique for these tone mapped images. This technique is composed of two stages. In the first stage, we employ a convolutional neural network (CNN) to generate quality aware maps (also known as distortion maps) from tone mapped images by training it with the ground truth distortion maps. In the second stage, we model the normalized image and distortion maps using an Asymmetric Generalized Gaussian Distribution (AGGD). The parameters of the AGGD model are then used to estimate the quality score using support vector regression (SVR). We show that the proposed technique delivers competitive performance relative to the state-of-the-art techniques. The novelty of this work is its ability to visualize various distortions as quality maps (distortion maps), especially in the no-reference setting, and to use these maps as features to estimate the quality score of tone mapped images.
CVNov 8, 2019
Quality Aware Generative Adversarial NetworksParimala Kancharla, Sumohana S. Channappayya
Generative Adversarial Networks (GANs) have become a very popular tool for implicitly learning high-dimensional probability distributions. Several improvements have been made to the original GAN formulation to address some of its shortcomings like mode collapse, convergence issues, entanglement, poor visual quality etc. While a significant effort has been directed towards improving the visual quality of images generated by GANs, it is rather surprising that objective image quality metrics have neither been employed as cost functions nor as regularizers in GAN objective functions. In this work, we show how a distance metric that is a variant of the Structural SIMilarity (SSIM) index (a popular full-reference image quality assessment algorithm), and a novel quality aware discriminator gradient penalty function that is inspired by the Natural Image Quality Evaluator (NIQE, a popular no-reference image quality assessment algorithm) can each be used as excellent regularizers for GAN objective functions. Specifically, we demonstrate state-of-the-art performance using the Wasserstein GAN gradient penalty (WGAN-GP) framework over CIFAR-10, STL10 and CelebA datasets.
MMJul 18, 2018
Streaming Video QoE Modeling and Prediction: A Long Short-Term Memory ApproachNagabhushan Eswara, S Ashique, Anand Panchbhai et al.
HTTP based adaptive video streaming has become a popular choice of streaming due to the reliable transmission and the flexibility offered to adapt to varying network conditions. However, due to rate adaptation in adaptive streaming, the quality of the videos at the client keeps varying with time depending on the end-to-end network conditions. Further, varying network conditions can lead to the video client running out of playback content resulting in rebuffering events. These factors affect the user satisfaction and cause degradation of the user quality of experience (QoE). It is important to quantify the perceptual QoE of the streaming video users and monitor the same in a continuous manner so that the QoE degradation can be minimized. However, the continuous evaluation of QoE is challenging as it is determined by complex dynamic interactions among the QoE influencing factors. Towards this end, we present LSTM-QoE, a recurrent neural network based QoE prediction model using a Long Short-Term Memory (LSTM) network. The LSTM-QoE is a network of cascaded LSTM blocks to capture the nonlinearities and the complex temporal dependencies involved in the time varying QoE. Based on an evaluation over several publicly available continuous QoE databases, we demonstrate that the LSTM-QoE has the capability to model the QoE dynamics effectively. We compare the proposed model with the state-of-the-art QoE prediction models and show that it provides superior performance across these databases. Further, we discuss the state space perspective for the LSTM-QoE and show the efficacy of the state space modeling approaches for QoE prediction.
MMMay 16, 2018
Modeling Continuous Video QoE Evolution: A State Space ApproachNagabhushan Eswara, Hemanth P. Sethuram, Soumen Chakraborty et al.
A rapid increase in the video traffic together with an increasing demand for higher quality videos has put a significant load on content delivery networks in the recent years. Due to the relatively limited delivery infrastructure, the video users in HTTP streaming often encounter dynamically varying quality over time due to rate adaptation, while the delays in video packet arrivals result in rebuffering events. The user quality-of-experience (QoE) degrades and varies with time because of these factors. Thus, it is imperative to monitor the QoE continuously in order to minimize these degradations and deliver an optimized QoE to the users. Towards this end, we propose a nonlinear state space model for efficiently and effectively predicting the user QoE on a continuous time basis. The QoE prediction using the proposed approach relies on a state space that is defined by a set of carefully chosen time varying QoE determining features. An evaluation of the proposed approach conducted on two publicly available continuous QoE databases shows a superior QoE prediction performance over the state-of-the-art QoE modeling approaches. The evaluation results also demonstrate the efficacy of the selected features and the model order employed for predicting the QoE. Finally, we show that the proposed model is completely state controllable and observable, so that the potential of state space modeling approaches can be exploited for further improving QoE prediction.
CVNov 20, 2017
Optical Character Recognition (OCR) for Telugu: Database, Algorithm and ApplicationChandra Prakash Konkimalla, Manikanta Srikar Yellapragada, Trishal Gayam et al.
Telugu is a Dravidian language spoken by more than 80 million people worldwide. The optical character recognition (OCR) of the Telugu script has wide ranging applications including education, health-care, administration etc. The beautiful Telugu script however is very different from Germanic scripts like English and German. This makes the use of transfer learning of Germanic OCR solutions to Telugu a non-trivial task. To address the challenge of OCR for Telugu, we make three contributions in this work: (i) a database of Telugu characters, (ii) a deep learning based OCR algorithm, and (iii) a client server solution for the online deployment of the algorithm. For the benefit of the Telugu people and the research community, we will make our code freely available at https://gayamtrishal.github.io/OCR_Telugu.github.io/
MMApr 26, 2016
Subjective Assessment of H.264 Compressed Stereoscopic VideoManasa K, Balasubramanyam Appina, Sumohana S. Channappayya
The tremendous growth in 3D (stereo) imaging and display technologies has led to stereoscopic content (video and image) becoming increasingly popular. However, both the subjective and the objective evaluation of stereoscopic video content has not kept pace with the rapid growth of the content. Further, the availability of standard stereoscopic video databases is also quite limited. In this work, we attempt to alleviate these shortcomings. We present a stereoscopic video database and its subjective evaluation. We have created a database containing a set of 144 distorted videos. We limit our attention to H.264 compression artifacts. The distorted videos were generated using 6 uncompressed pristine videos of left and right views originally created by Goldmann et al. at EPFL [1]. Further, 19 subjects participated in the subjective assessment task. Based on the subjective study, we have formulated a relation between the 2D and stereoscopic subjective scores as a function of compression rate and depth range. We have also evaluated the performance of popular 2D and 3D image/video quality assessment (I/VQA) algorithms on our database.