CVNov 8, 2023
CLearViD: Curriculum Learning for Video DescriptionCheng-Yu Chuang, Pooyan Fazli
Video description entails automatically generating coherent natural language sentences that narrate the content of a given video. We introduce CLearViD, a transformer-based model for video description generation that leverages curriculum learning to accomplish this task. In particular, we investigate two curriculum strategies: (1) progressively exposing the model to more challenging samples by gradually applying a Gaussian noise to the video data, and (2) gradually reducing the capacity of the network through dropout during the training process. These methods enable the model to learn more robust and generalizable features. Moreover, CLearViD leverages the Mish activation function, which provides non-linearity and non-monotonicity and helps alleviate the issue of vanishing gradients. Our extensive experiments and ablation studies demonstrate the effectiveness of the proposed model. The results on two datasets, namely ActivityNet Captions and YouCook2, show that CLearViD significantly outperforms existing state-of-the-art models in terms of both accuracy and diversity metrics.
CVNov 17, 2024Code
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?Yolo Yunlong Tang, Junjia Guo, Hang Hua et al.
The advancement of Multimodal Large Language Models (MLLMs) has enabled significant progress in multimodal understanding, expanding their capacity to analyze video content. However, existing evaluation benchmarks for MLLMs primarily focus on abstract video comprehension, lacking a detailed assessment of their ability to understand video compositions, the nuanced interpretation of how visual elements combine and interact within highly compiled video contexts. We introduce VidComposition, a new benchmark specifically designed to evaluate the video composition understanding capabilities of MLLMs using carefully curated compiled videos and cinematic-level annotations. VidComposition includes 982 videos with 1706 multiple-choice questions, covering various compositional aspects such as camera movement, angle, shot size, narrative structure, character actions and emotions, etc. Our comprehensive evaluation of 33 open-source and proprietary MLLMs reveals a significant performance gap between human and model capabilities. This highlights the limitations of current MLLMs in understanding complex, compiled video compositions and offers insights into areas for further improvement. The leaderboard and evaluation code are available at https://yunlong10.github.io/VidComposition/.
CVFeb 27, 2024Code
OSCaR: Object State Captioning and State Change RepresentationNguyen Nguyen, Jing Bi, Ali Vosoughi et al.
The capability of intelligent models to extrapolate and comprehend changes in object states is a crucial yet demanding aspect of AI research, particularly through the lens of human interaction in real-world settings. This task involves describing complex visual environments, identifying active objects, and interpreting their changes as conveyed through language. Traditional methods, which isolate object captioning and state change detection, offer a limited view of dynamic environments. Moreover, relying on a small set of symbolic words to represent changes has restricted the expressiveness of the language. To address these challenges, in this paper, we introduce the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark. OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections. It sets a new testbed for evaluating multimodal large language models (MLLMs). Our experiments demonstrate that while MLLMs show some skill, they lack a full understanding of object state changes. The benchmark includes a fine-tuned model that, despite initial capabilities, requires significant improvements in accuracy and generalization ability for effective understanding of these changes. Our code and dataset are available at https://github.com/nguyennm1024/OSCaR.
HCMar 15
ViDscribe: Multimodal AI for Customizing Audio Description and Question Answering in Online VideosMaryam Cheema, Sina Elahimanesh, Pooyan Fazli et al.
Advances in multimodal large language models enable automatic video narration and question answering (VQA), offering scalable alternatives to labor-intensive, human-authored audio descriptions (ADs) for blind and low vision (BLV) viewers. However, prior AI-driven AD systems rarely adapt to the diverse needs and preferences of BLV individuals across videos and are typically evaluated in controlled, single-session settings. We present ViDscribe, a web-based platform that integrates AI-generated ADs with six types of user customizations and a conversational VQA interface for YouTube videos. Through a longitudinal, in-the-wild study with eight BLV participants, we examine how users engage with customization and VQA features over time. Our results show sustained engagement with both features and that customized ADs improve effectiveness, enjoyment, and immersion compared to default ADs, highlighting the value of personalized, interactive video access for BLV users.
CVDec 4, 2024
VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video UnderstandingChaoyu Li, Eun Woo Im, Pooyan Fazli
Multimodal large language models (MLLMs) have recently shown significant advancements in video understanding, excelling in content reasoning and instruction-following tasks. However, hallucination, where models generate inaccurate or misleading content, remains underexplored in the video domain. Building on the observation that MLLM visual encoders often fail to distinguish visually different yet semantically similar video pairs, we introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding. It consists of 5,002 videos, paired to highlight cases prone to hallucinations. VidHalluc assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. Comprehensive testing shows that most MLLMs are vulnerable to hallucinations across these dimensions. Furthermore, we propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency from DINOv2 to reweight visual features during inference. Our results show that DINO-HEAL consistently improves performance on VidHalluc, achieving an average improvement of 3.02% in mitigating hallucinations across all tasks. Both the VidHalluc benchmark and DINO-HEAL code are available at https://people-robots.github.io/vidhalluc.
CVFeb 27, 2025
VideoA11y: Method and Dataset for Accessible Video DescriptionChaoyu Li, Sid Padmanabhuni, Maryam Cheema et al.
Video descriptions are crucial for blind and low vision (BLV) users to access visual content. However, current artificial intelligence models for generating descriptions often fall short due to limitations in the quality of human annotations within training datasets, resulting in descriptions that do not fully meet BLV users' needs. To address this gap, we introduce VideoA11y, an approach that leverages multimodal large language models (MLLMs) and video accessibility guidelines to generate descriptions tailored for BLV individuals. Using this method, we have curated VideoA11y-40K, the largest and most comprehensive dataset of 40,000 videos described for BLV users. Rigorous experiments across 15 video categories, involving 347 sighted participants, 40 BLV participants, and seven professional describers, showed that VideoA11y descriptions outperform novice human annotations and are comparable to trained human annotations in clarity, accuracy, objectivity, descriptiveness, and user satisfaction. We evaluated models on VideoA11y-40K using both standard and custom metrics, demonstrating that MLLMs fine-tuned on this dataset produce high-quality accessible descriptions. Code and dataset are available at https://people-robots.github.io/VideoA11y.
CVDec 1, 2024
VideoSAVi: Self-Aligned Video Language Models without Human SupervisionYogesh Kulkarni, Pooyan Fazli
Recent advances in video-large language models (Video-LLMs) have led to significant progress in video understanding. Current preference optimization methods often rely on proprietary APIs or human-annotated captions to generate preference data (i.e., pairs of model outputs ranked by quality or alignment with human judgment), which is then used to train models for video-language alignment. This approach is both costly and labor-intensive. To address this limitation, we introduce VideoSAVi (Self-Aligned Video Language Model), a self-training pipeline that enables Video-LLMs to learn from video content without external supervision. Our approach includes a self-critiquing mechanism that identifies reasoning errors in the model's initial responses and generates improved alternatives, creating preference pairs directly from video content. VideoSAVi then applies Direct Preference Optimization (DPO) to iteratively train the model using the preference data, thus enhancing its temporal and spatial reasoning for video understanding. Experiments show that VideoSAVi delivers significant improvements across multiple benchmarks, including a +4.2 percentage point gain on MVBench, +3.9 on PerceptionTest, and +6.8 on the challenging EgoSchema dataset compared to baseline models. Our model-agnostic approach is computationally efficient, requiring only 32 frames, offering a promising direction for self-aligned video understanding without reliance on external models or annotations.
CVAug 5, 2025
AVATAR: Reinforcement Learning to See, Hear, and Reason Over VideoYogesh Kulkarni, Pooyan Fazli
Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps. We introduce AVATAR (Audio-Video Agent for Alignment and Reasoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing advantages by reusing past experiences with greater reward diversity, and (2) Temporal Advantage Shaping (TAS), a novel credit assignment strategy that upweights key reasoning phases during learning. AVATAR achieves strong performance across various benchmarks, outperforming the Qwen2.5-Omni baseline by +5.4on MMVU, +4.9 on OmniBench, and +4.5 on Video-Holmes, while demonstrating over 35% higher sample efficiency.
CVApr 18, 2025
VideoPASTA: 7K Preference Pairs That Matter for Video-LLM AlignmentYogesh Kulkarni, Pooyan Fazli
Video-language models (Video-LLMs) excel at understanding video content but struggle with spatial relationships, temporal ordering, and cross-frame continuity. To address these limitations, we introduce VideoPASTA (Preference Alignment with Spatio-Temporal-Cross Frame Adversaries), a framework that enhances Video-LLMs through targeted preference optimization. VideoPASTA trains models to distinguish accurate video representations from carefully crafted adversarial examples that deliberately violate spatial, temporal, or cross-frame relationships. With only 7,020 preference pairs and Direct Preference Optimization, VideoPASTA enables models to learn robust representations that capture fine-grained spatial details and long-range temporal dynamics. Experiments demonstrate that VideoPASTA is model agnostic and significantly improves performance, for example, achieving gains of up to +3.8 percentage points on LongVideoBench, +4.1 on VideoMME, and +4.0 on MVBench, when applied to various state-of-the-art Video-LLMs. These results demonstrate that targeted alignment, rather than massive pretraining or architectural modifications, effectively addresses core video-language challenges. Notably, VideoPASTA achieves these improvements without any human annotation or captioning, relying solely on 32-frame sampling. This efficiency makes our approach a scalable plug-and-play solution that seamlessly integrates with existing models while preserving their original capabilities.
CVApr 17, 2025
ChartQA-X: Generating Explanations for Visual Chart ReasoningShamanthak Hegde, Pooyan Fazli, Hasti Seifi
The ability to explain complex information from chart images is vital for effective data-driven decision-making. In this work, we address the challenge of generating detailed explanations alongside answering questions about charts. We present ChartQA-X, a comprehensive dataset comprising 30,299 chart samples across four chart types, each paired with contextually relevant questions, answers, and explanations. Explanations are generated and selected based on metrics such as faithfulness, informativeness, coherence, and perplexity. Our human evaluation with 245 participants shows that model-generated explanations in ChartQA-X surpass human-written explanations in accuracy and logic and are comparable in terms of clarity and overall quality. Moreover, models fine-tuned on ChartQA-X show substantial improvements across various metrics, including absolute gains of up to 24.57 points in explanation quality, 18.96 percentage points in question-answering accuracy, and 14.75 percentage points on unseen benchmarks for the same task. By integrating explanatory narratives with answers, our approach enables agents to convey complex visual information more effectively, improving comprehension and greater trust in the generated responses.
CVJan 12
CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory AggregationChaoyu Li, Deeparghya Dutta Barua, Fei Tao et al.
Vision-language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi-step reasoning remains unstable. Repeated sampling over the same input often produces divergent reasoning trajectories and inconsistent final predictions. To address this, we introduce two complementary approaches inspired by test-time scaling: (1) CASHEW, an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces, with explicit visual verification filtering hallucinated steps and grounding reasoning in visual evidence, and (2) CASHEW-RL, a learned variant that internalizes this aggregation behavior within a single model. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence, while adaptively allocating reasoning effort based on task difficulty. This training objective enables robust self-aggregation at inference. Extensive experiments on 13 image understanding, video understanding, and video reasoning benchmarks show significant performance improvements, including gains of up to +23.6 percentage points on ScienceQA and +8.1 percentage points on EgoSchema.
CVNov 23, 2025
EgoVITA: Learning to Plan and Verify for Egocentric Video ReasoningYogesh Kulkarni, Pooyan Fazli
Reasoning about intentions and actions from a first-person (egocentric) perspective remains a fundamental challenge for multimodal large language models (MLLMs). Unlike third-person (exocentric) videos that capture scenes from an outside observer, egocentric videos reflect the actor's continuously changing viewpoint, introducing partial observability, limited field of view, and self-referenced motion. We introduce $\textbf{EgoVITA}$, a reinforcement learning framework that enables MLLMs to reason through structured planning and verification. Built on Group Relative Policy Optimization (GRPO), EgoVITA alternates between two stages: (1) an $\textbf{egocentric planning phase}$, where the model reasons from a first-person viewpoint to predict a step-by-step plan of future actions, and (2) an $\textbf{exocentric verification phase}$, where it switches to a third-person perspective to check the visual and logical consistency of that plan. Through GRPO, the model learns to make plans that are causally predictive of upcoming visual observations, leading to more coherent and visually grounded reasoning. EgoVITA achieves significant gains on egocentric reasoning tasks, outperforming the baseline Qwen2.5-VL-7B by $\mathbf{+7.7}$ on EgoBlind and $\mathbf{+4.4}$ on EgoOrient, while maintaining strong generalization on exocentric video tasks.
CVOct 4, 2025
FrameOracle: Learning What to See and How Much to See in VideosChaoyu Li, Tianzhi Li, Fei Tao et al.
Vision-language models (VLMs) have advanced video understanding, but their performance is limited by the number of input frames they can process. Existing frame sampling strategies, such as uniform or fixed-budget selection, often fail to adapt to variations in information density or task complexity, resulting in inefficiency and information loss. To address this, we present FrameOracle, a lightweight and plug-and-play module that predicts both (1) which frames are most relevant to a given query and (2) how many frames are needed. FrameOracle is trained using a four-stage curriculum, with the first three stages relying on weak proxy signals such as cross-modal similarity. In the final stage, it leverages stronger supervision from a new dataset we introduce, FrameOracle-41K, the first large-scale VideoQA collection to provide keyframe annotations specifying the minimal set of frames required to answer each question. Extensive experiments across five VLMs and six benchmarks demonstrate that FrameOracle reduces 16-frame inputs to an average of 10.4 frames without any loss in accuracy. When starting from 64-frame candidates, it reduces the input to an average of 13.9 frames while improving accuracy by 1.4%, achieving state-of-the-art efficiency-accuracy trade-offs for scalable video understanding.
CVJul 29, 2025
ReGATE: Learning Faster and Better with Fewer Tokens in MLLMsChaoyu Li, Yogesh Kulkarni, Pooyan Fazli
The computational cost of training multimodal large language models (MLLMs) rapidly increases with the number of tokens involved. Existing efficiency methods primarily target inference and rely on token reduction or merging, offering limited benefit during training. In this paper, we propose ReGATE (Reference$-$Guided Adaptive Token Elision), an adaptive token pruning method for accelerating MLLM training. Specifically, ReGATE adopts a teacher-student framework in which the MLLM being trained serves as the student, and a frozen reference large language model (LLM) acts as the teacher. The teacher computes per-token reference losses, which are combined with an exponential moving average (EMA) of the student's own difficulty scores. This adaptive difficulty-based scoring enables the selective processing of crucial tokens while bypassing less informative ones in the forward pass, significantly reducing computational overhead. Experiments demonstrate that ReGATE, when applied to VideoLLaMA2, matches the peak accuracy of standard training on MVBench up to 2$\times$ faster, using only 35% of the tokens. With additional training, it even surpasses the baseline on several multimodal benchmarks, all while reducing the total token count by over 41%. Code and models will be released soon.
HCNov 7, 2021
NarrationBot and InfoBot: A Hybrid System for Automated Video DescriptionShasta Ihorn, Yue-Ting Siu, Aditya Bodi et al.
Video accessibility is crucial for blind and low vision users for equitable engagements in education, employment, and entertainment. Despite the availability of professional and amateur services and tools, most human-generated descriptions are expensive and time consuming. Moreover, the rate of human-generated descriptions cannot match the speed of video production. To overcome the increasing gaps in video accessibility, we developed a hybrid system of two tools to 1) automatically generate descriptions for videos and 2) provide answers or additional descriptions in response to user queries on a video. Results from a mixed-methods study with 26 blind and low vision individuals show that our system significantly improved user comprehension and enjoyment of selected videos when both tools were used in tandem. In addition, participants reported no significant difference in their ability to understand videos when presented with autogenerated descriptions versus human-revised autogenerated descriptions. Our results demonstrate user enthusiasm about the developed system and its promise for providing customized access to videos. We discuss the limitations of the current work and provide recommendations for the future development of automated video description tools.
LGDec 29, 2019
Real-time Policy Distillation in Deep Reinforcement LearningYuxiang Sun, Pooyan Fazli
Policy distillation in deep reinforcement learning provides an effective way to transfer control policies from a larger network to a smaller untrained network without a significant degradation in performance. However, policy distillation is underexplored in deep reinforcement learning, and existing approaches are computationally inefficient, resulting in a long distillation time. In addition, the effectiveness of the distillation process is still limited to the model capacity. We propose a new distillation mechanism, called real-time policy distillation, in which training the teacher model and distilling the policy to the student model occur simultaneously. Accordingly, the teacher's latest policy is transferred to the student model in real time. This reduces the distillation time to half the original time or even less and also makes it possible for extremely small student models to learn skills at the expert level. We evaluated the proposed algorithm in the Atari 2600 domain. The results show that our approach can achieve full distillation in most games, even with compression ratios up to 1.7%.
ROMar 9, 2018
DeepMoTIon: Learning to Navigate Like HumansMahmoud Hamandi, Mike D'Arcy, Pooyan Fazli
We present a novel human-aware navigation approach, where the robot learns to mimic humans to navigate safely in crowds. The presented model, referred to as DeepMoTIon, is trained with pedestrian surveillance data to predict human velocity in the environment. The robot processes LiDAR scans via the trained network to navigate to the target location. We conduct extensive experiments to assess the components of our network and prove their necessity to imitate humans. Our experiments show that DeepMoTIion outperforms all the benchmarks in terms of human imitation, achieving a 24% reduction in time series-based path deviation over the next best approach. In addition, while many other approaches often failed to reach the target, our method reached the target in 100% of the test cases while complying with social norms and ensuring human safety.
ROOct 18, 2017
Setting Up the Beam for Human-Centered Service TasksUtkarsh Patel, Emre Hatay, Mike D'Arcy et al.
We introduce the Beam, a collaborative autonomous mobile service robot, based on SuitableTech's Beam telepresence system. We present a set of enhancements to the telepresence system, including autonomy, human awareness, increased computation and sensing capabilities, and integration with the popular Robot Operating System (ROS) framework. Together, our improvements transform the Beam into a low-cost platform for research on service robots. We examine the Beam on target search and object delivery tasks and demonstrate that the robot achieves a 100% success rate.