CVJun 21, 2025
HIRE: Lightweight High-Resolution Image Feature Enrichment for Multimodal LLMsNikitha SR, Aradhya Neeraj Mathur, Tarun Ram Menta et al.
The integration of high-resolution image features in modern multimodal large language models has demonstrated significant improvements in fine-grained visual understanding tasks, achieving high performance across multiple benchmarks. Since these features are obtained from large image encoders like ViT, they come with a significant increase in computational costs due to multiple calls to these encoders. In this work, we first develop an intuition for feature upsampling as a natural extension of high-resolution feature generation. Through extensive experiments and ablations, we demonstrate how a shallow feature enricher can achieve competitive results with tremendous reductions in training and inference time as well as computational cost, with upto 1.5x saving in FLOPs.
CVOct 30, 2020
LIFI: Towards Linguistically Informed Frame InterpolationAradhya Neeraj Mathur, Devansh Batra, Yaman Kumar et al.
In this work, we explore a new problem of frame interpolation for speech videos. Such content today forms the major form of online communication. We try to solve this problem by using several deep learning video generation algorithms to generate the missing frames. We also provide examples where computer vision models despite showing high performance on conventional non-linguistic metrics fail to accurately produce faithful interpolation of speech. With this motivation, we provide a new set of linguistically-informed metrics specifically targeted to the problem of speech videos interpolation. We also release several datasets to test computer vision video generation models of their speech understanding.