MMNov 12, 2025
MCAD: Multimodal Context-Aware Audio Description Generation For SoccerLipisha Chaudhary, Trisha Mittal, Subhadra Gopalakrishnan et al.
Audio Descriptions (AD) are essential for making visual content accessible to individuals with visual impairments. Recent works have shown a promising step towards automating AD, but they have been limited to describing high-quality movie content using human-annotated ground truth AD in the process. In this work, we present an end-to-end pipeline, MCAD, that extends AD generation beyond movies to the domain of sports, with a focus on soccer games, without relying on ground truth AD. To address the absence of domain-specific AD datasets, we fine-tune a Video Large Language Model on publicly available movie AD datasets so that it learns the narrative structure and conventions of AD. During inference, MCAD incorporates multimodal contextual cues such as player identities, soccer events and actions, and commentary from the game. These cues, combined with input prompts to the fine-tuned VideoLLM, allow the system to produce complete AD text for each video segment. We further introduce a new evaluation metric, ARGE-AD, designed to accurately assess the quality of generated AD. ARGE-AD evaluates the generated AD for the presence of five characteristics: (i) usage of people's names, (ii) mention of actions and events, (iii) appropriate length of AD, (iv) absence of pronouns, and (v) overlap from commentary or subtitles. We present an in-depth analysis of our approach on both movie and soccer datasets. We also validate the use of this metric to quantitatively comment on the quality of generated AD using our metric across domains. Additionally, we contribute audio descriptions for 100 soccer game clips annotated by two AD experts.
CVSep 23, 2024
Analysis of Human Perception in Distinguishing Real and AI-Generated Faces: An Eye-Tracking Based StudyJin Huang, Subhadra Gopalakrishnan, Trisha Mittal et al.
Recent advancements in Artificial Intelligence have led to remarkable improvements in generating realistic human faces. While these advancements demonstrate significant progress in generative models, they also raise concerns about the potential misuse of these generated images. In this study, we investigate how humans perceive and distinguish between real and fake images. We designed a perceptual experiment using eye-tracking technology to analyze how individuals differentiate real faces from those generated by AI. Our analysis of StyleGAN-3 generated images reveals that participants can distinguish real from fake faces with an average accuracy of 76.80%. Additionally, we found that participants scrutinize images more closely when they suspect an image to be fake. We believe this study offers valuable insights into human perception of AI-generated media.
LGFeb 23, 2025
Coreset Selection via LLM-based Concept BottlenecksAkshay Mehra, Trisha Mittal, Subhadra Gopalakrishnan et al.
Coreset Selection (CS) aims to identify a subset of the training dataset that achieves model performance comparable to using the entire dataset. Many state-of-the-art CS methods select coresets using scores whose computation requires training the downstream model on the entire dataset first and recording changes in the model's behavior on samples as it trains (training dynamics). These scores are inefficient to compute and hard to interpret, as they do not indicate whether a sample is difficult to learn in general or only for a specific downstream model. Our work addresses these challenges by proposing a score that computes a sample's difficulty using human-understandable textual attributes (concepts) independent of any downstream model. Specifically, we measure the alignment between a sample's visual features and concept bottlenecks, derived via large language models, by training a linear concept bottleneck layer and computing the sample's difficulty score using it.We then use stratified sampling based on this score to generate a coreset of the dataset.Crucially, our score is efficiently computable without training the downstream model on the full dataset even once, leads to high-performing coresets for various downstream models, and is computable even for an unlabeled dataset. Through experiments on CIFAR-10/100, and ImageNet-1K, we show that our coresets outperform random subsets, even at high pruning rates, and achieve model performance comparable to or better than coresets found by training dynamics-based methods.
CVJan 14, 2025
V-Trans4Style: Visual Transition Recommendation for Video Production Style AdaptationPooja Guhan, Tsung-Wei Huang, Guan-Ming Su et al.
We introduce V-Trans4Style, an innovative algorithm tailored for dynamic video content editing needs. It is designed to adapt videos to different production styles like documentaries, dramas, feature films, or a specific YouTube channel's video-making technique. Our algorithm recommends optimal visual transitions to help achieve this flexibility using a more bottom-up approach. We first employ a transformer-based encoder-decoder network to learn recommending temporally consistent and visually seamless sequences of visual transitions using only the input videos. We then introduce a style conditioning module that leverages this model to iteratively adjust the visual transitions obtained from the decoder through activation maximization. We demonstrate the efficacy of our method through experiments conducted on our newly introduced AutoTransition++ dataset. It is a 6k video version of AutoTransition Dataset that additionally categorizes its videos into different production style categories. Our encoder-decoder model outperforms the state-of-the-art transition recommendation method, achieving improvements of 10% to 80% in Recall@K and mean rank values over baseline. Our style conditioning module results in visual transitions that improve the capture of the desired video production style characteristics by an average of around 12% in comparison to other methods when measured with similarity metrics. We hope that our work serves as a foundation for exploring and understanding video production styles further.