CVMay 18Code
Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video ReasoningChengwen Liu, Xiaomin Yu, Zhuoyue Chang et al.
In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and multi-hop reasoning-based verification. To bridge this gap, we construct the first video deep research benchmark, VideoDR. VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains. We evaluate multiple closed-source and open-source multimodal large language models under both the Workflow and Agentic paradigms, and the results show that Agentic is not consistently superior to Workflow: its gains depend on a model's ability to maintain the initial video anchors over long retrieval chains. Further analysis indicates that goal drift and long-horizon consistency are the core bottlenecks. In sum, VideoDR provides a systematic benchmark for studying video agents in open-web settings and reveals the key challenges for next-generation video deep research agents.
CVMay 30, 2025Code
Reinforcing Video Reasoning with Focused ThinkingJisheng Dang, Jingze Wu, Teng Wang et al.
Recent advancements in reinforcement learning, particularly through Group Relative Policy Optimization (GRPO), have significantly improved multimodal large language models for complex reasoning tasks. However, two critical limitations persist: 1) they often produce unfocused, verbose reasoning chains that obscure salient spatiotemporal cues and 2) binary rewarding fails to account for partially correct answers, resulting in high reward variance and inefficient learning. In this paper, we propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employs a token weighting mechanism that prioritizes tokens with high informational density (estimated by intra-group information entropy), suppressing redundant tokens like generic reasoning prefixes. Furthermore, we reformulate RL training by shifting from single-choice to multi-choice QA tasks, where soft rewards enable finer-grained gradient estimation by distinguishing partial correctness. Additionally, we propose question-answer inversion, a data augmentation strategy to generate diverse multi-choice samples from existing benchmarks. Experiments demonstrate state-of-the-art performance on several video reasoning and general understanding benchmarks. Notably, TW-GRPO achieves 50.4\% accuracy on CLEVRER (18.8\% improvement over Video-R1) and 65.8\% on MMVU. Our codes are available at \href{https://github.com/longmalongma/TW-GRPO}.
CVJun 22, 2025Code
MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question AnsweringJisheng Dang, Huilin Song, Junbin Xiao et al.
Grounded Video Question Answering (Grounded VideoQA) requires aligning textual answers with explicit visual evidence. However, modern multimodal models often rely on linguistic priors and spurious correlations, resulting in poorly grounded predictions. In this work, we propose MUPA, a cooperative MUlti-Path Agentic approach that unifies video grounding, question answering, answer reflection and aggregation to tackle Grounded VideoQA. MUPA features three distinct reasoning paths on the interplay of grounding and QA agents in different chronological orders, along with a dedicated reflection agent to judge and aggregate the multi-path results to accomplish consistent QA and grounding. This design markedly improves grounding fidelity without sacrificing answer accuracy. Despite using only 2B parameters, our method outperforms all 7B-scale competitors. When scaled to 7B parameters, MUPA establishes new state-of-the-art results, with Acc@GQA of 30.3% and 47.4% on NExT-GQA and DeVE-QA respectively, demonstrating MUPA' effectiveness towards trustworthy video-language understanding. Our code is available in https://github.com/longmalongma/MUPA.
LGSep 26, 2025Code
FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft LearningYizhou Zhang, Ning Lv, Teng Wang et al.
Group relative policy optimization (GRPO) has demonstrated significant potential in improving the reasoning capabilities of large language models (LLMs) via reinforcement learning. However, its practical deployment is impeded by an excessively slow training process, primarily attributed to the computationally intensive autoregressive generation of multiple responses per query, which makes the generation phase the primary performance bottleneck. Although speculative decoding presents a promising direction for acceleration, its direct application in GRPO achieves limited speedup under high-concurrency training conditions. To overcome this limitation, we propose a concurrency-aware speculative decoding framework that dynamically adjusts the drafting and verification strategy according to real-time concurrency levels, thereby maximizing the acceleration of the generation process. Furthermore, to address performance degradation arising from distributional drift between the evolving target model and the fixed draft model during training, we introduce an online draft learning mechanism that enables the draft model to continuously adapt using feedback signals from the target model. Experimental results across multiple mathematical reasoning datasets and models demonstrate that the proposed method achieves end-to-end speedups of 2.35x to 2.72x, significantly surpassing baseline approaches in efficiency. The code is available at https://github.com/yedaotian9/GRPO_speculative.
AIJun 1, 2025Code
SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed CaptioningJisheng Dang, Yizhou Zhang, Hao Ye et al.
Fine-grained video captioning aims to generate detailed, temporally coherent descriptions of video content. However, existing methods struggle to capture subtle video dynamics and rich detailed information. In this paper, we leverage preference learning to enhance the performance of vision-language models in fine-grained video captioning, while mitigating several limitations inherent to direct preference optimization (DPO). First, we propose a pipeline for constructing preference pairs that leverages the intrinsic properties of VLMs along with partial assistance from large language models, achieving an optimal balance between cost and data quality. Second, we propose Synergistic Preference Optimization (SynPO), a novel optimization method offering significant advantages over DPO and its variants. SynPO prevents negative preferences from dominating the optimization, explicitly preserves the model's language capability to avoid deviation of the optimization objective, and improves training efficiency by eliminating the need for the reference model. We extensively evaluate SynPO not only on video captioning benchmarks (e.g., VDC, VDD, VATEX) but also across well-established NLP tasks, including general language understanding and preference evaluation, using diverse pretrained models. Results demonstrate that SynPO consistently outperforms DPO variants while achieving 20\% improvement in training efficiency. Code is available at https://github.com/longmalongma/SynPO
LGMay 7
On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVRHao Ye, Jisheng Dang, Junfeng Fang et al.
Recent extensive research has demonstrated that the enhanced reasoning capabilities acquired by models through Reinforcement Learning with Verifiable Rewards (RLVR) are primarily concentrated within the rank-1 components. Predicated on this observation, we employed Periodic Rank-1 Substitution and identified a counterintuitive phenomenon: RLVR may exhibit implicit reward overfitting to the training dataset. Specifically, the model can achieve satisfactory performance on the test set even when its rewards remain relatively low during the training process. Furthermore, we characterize three distinct properties of RL training: (1) The effective rank-1 component in RLVR don't maintain other model knowledge except mathematical reasoning capability. (2) RLVR fundamentally functions by optimizing a specific singular spectrum. The distribution of singular values of almost all linear layers in RLVR-trained model behaves like heavy-tailed distribution. (3) the left singular vectors associated with rank-1 components demonstrate a stronger alignment tendency during training, which echoes the discovery that RLVR is optimizing sampling efficiency in essence. Taken together, our findings and analysis further reveal how RLVR shapes model parameters and offer potential insights for improving existing RL paradigms or other training paradigms to implement continual learning.
CVJul 7, 2025Code
Boosting Temporal Sentence Grounding via Causal InferenceKefan Tang, Lihuo He, Jisheng Dang et al.
Temporal Sentence Grounding (TSG) aims to identify relevant moments in an untrimmed video that semantically correspond to a given textual query. Despite existing studies having made substantial progress, they often overlook the issue of spurious correlations between video and textual queries. These spurious correlations arise from two primary factors: (1) inherent biases in the textual data, such as frequent co-occurrences of specific verbs or phrases, and (2) the model's tendency to overfit to salient or repetitive patterns in video content. Such biases mislead the model into associating textual cues with incorrect visual moments, resulting in unreliable predictions and poor generalization to out-of-distribution examples. To overcome these limitations, we propose a novel TSG framework, causal intervention and counterfactual reasoning that utilizes causal inference to eliminate spurious correlations and enhance the model's robustness. Specifically, we first formulate the TSG task from a causal perspective with a structural causal model. Then, to address unobserved confounders reflecting textual biases toward specific verbs or phrases, a textual causal intervention is proposed, utilizing do-calculus to estimate the causal effects. Furthermore, visual counterfactual reasoning is performed by constructing a counterfactual scenario that focuses solely on video features, excluding the query and fused multi-modal features. This allows us to debias the model by isolating and removing the influence of the video from the overall effect. Experiments on public datasets demonstrate the superiority of the proposed method. The code is available at https://github.com/Tangkfan/CICR.
CRDec 31, 2025
Noise-Aware and Dynamically Adaptive Federated Defense Framework for SAR Image Target RecognitionYuchao Hou, Zixuan Zhang, Jie Wang et al.
As a critical application of computational intelligence in remote sensing, deep learning-based synthetic aperture radar (SAR) image target recognition facilitates intelligent perception but typically relies on centralized training, where multi-source SAR data are uploaded to a single server, raising privacy and security concerns. Federated learning (FL) provides an emerging computational intelligence paradigm for SAR image target recognition, enabling cross-site collaboration while preserving local data privacy. However, FL confronts critical security risks, where malicious clients can exploit SAR's multiplicative speckle noise to conceal backdoor triggers, severely challenging the robustness of the computational intelligence model. To address this challenge, we propose NADAFD, a noise-aware and dynamically adaptive federated defense framework that integrates frequency-domain, spatial-domain, and client-behavior analyses to counter SAR-specific backdoor threats. Specifically, we introduce a frequency-domain collaborative inversion mechanism to expose cross-client spectral inconsistencies indicative of hidden backdoor triggers. We further design a noise-aware adversarial training strategy that embeds $Γ$-distributed speckle characteristics into mask-guided adversarial sample generation to enhance robustness against both backdoor attacks and SAR speckle noise. In addition, we present a dynamic health assessment module that tracks client update behaviors across training rounds and adaptively adjusts aggregation weights to mitigate evolving malicious contributions. Experiments on MSTAR and OpenSARShip datasets demonstrate that NADAFD achieves higher accuracy on clean test samples and a lower backdoor attack success rate on triggered inputs than existing federated backdoor defenses for SAR target recognition.
CVFeb 13, 2025
Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion ModelFei Shen, Cong Wang, Junyao Gao et al.
Recent advances in conditional diffusion models have shown promise for generating realistic TalkingFace videos, yet challenges persist in achieving consistent head movement, synchronized facial expressions, and accurate lip synchronization over extended generations. To address these, we introduce the \textbf{M}otion-priors \textbf{C}onditional \textbf{D}iffusion \textbf{M}odel (\textbf{MCDM}), which utilizes both archived and current clip motion priors to enhance motion prediction and ensure temporal consistency. The model consists of three key elements: (1) an archived-clip motion-prior that incorporates historical frames and a reference frame to preserve identity and context; (2) a present-clip motion-prior diffusion model that captures multimodal causality for accurate predictions of head movements, lip sync, and expressions; and (3) a memory-efficient temporal attention mechanism that mitigates error accumulation by dynamically storing and updating motion features. We also release the \textbf{TalkingFace-Wild} dataset, a multilingual collection of over 200 hours of footage across 10 languages. Experimental results demonstrate the effectiveness of MCDM in maintaining identity and motion continuity for long-term TalkingFace generation. Code, models, and datasets will be publicly available.
CVApr 17, 2024
Fact :Teaching MLLMs with Faithful, Concise and Transferable RationalesMinghe Gao, Shuang Chen, Liang Pang et al.
The remarkable performance of Multimodal Large Language Models (MLLMs) has unequivocally demonstrated their proficient understanding capabilities in handling a wide array of visual tasks. Nevertheless, the opaque nature of their black-box reasoning processes persists as an enigma, rendering them uninterpretable and struggling with hallucination. Their ability to execute intricate compositional reasoning tasks is also constrained, culminating in a stagnation of learning progression for these models. In this work, we introduce Fact, a novel paradigm designed to generate multimodal rationales that are faithful, concise, and transferable for teaching MLLMs. This paradigm utilizes verifiable visual programming to generate executable code guaranteeing faithfulness and precision. Subsequently, through a series of operations including pruning, merging, and bridging, the rationale enhances its conciseness. Furthermore, we filter rationales that can be transferred to end-to-end paradigms from programming paradigms to guarantee transferability. Empirical evidence from experiments demonstrates the superiority of our method across models of varying parameter sizes, significantly enhancing their compositional reasoning and generalization ability. Our approach also reduces hallucinations owing to its high correlation between images and text.
CVMar 8
Scale-Aware UAV-to-Satellite Cross-View Geo-Localization: A Semantic Geometric ApproachYibin Ye, Shuo Chen, Kun Wang et al.
Cross-View Geo-Localization (CVGL) between UAV imagery and satellite images plays a crucial role in target localization and UAV self-positioning. However, most existing methods rely on the idealized assumption of scale consistency between UAV queries and satellite galleries, overlooking the severe scale ambiguity commonly encountered in real-world scenarios. This discrepancy leads to field-of-view misalignment and feature mismatch, significantly degrading CVGL robustness. To address this issue, we propose a geometric framework that recovers the absolute metric scale from monocular UAV images using semantic anchors. Specifically, small vehicles (SVs), characterized by relatively stable prior size distributions and high detectability, are exploited as metric references. A Decoupled Stereoscopic Projection Model is introduced to estimate the absolute image scale from these semantic targets. By decomposing vehicle dimensions into radial and tangential components, the model compensates for perspective distortions in 2D detections of 3D vehicles, enabling more accurate scale estimation. To further reduce intra-class size variation and detection noise, a dual-dimension fusion strategy with Interquartile Range (IQR)-based robust aggregation is employed. The estimated global scale is then used as a physical constraint for scale-adaptive satellite image cropping, improving UAV-to-satellite feature alignment. Experiments on augmented DenseUAV and UAV-VisLoc datasets demonstrate that the proposed method significantly improves CVGL robustness under unknown UAV image scales. Additionally, the framework shows strong potential for downstream applications such as passive UAV altitude estimation and 3D model scale recovery.
LGFeb 21, 2025
IPAD: Inverse Prompt for AI Detection - A Robust and Interpretable LLM-Generated Text DetectorZheng Chen, Yushi Feng, Jisheng Dang et al. · pku
Large Language Models (LLMs) have attained human-level fluency in text generation, which complicates the distinguishing between human-written and LLM-generated texts. This increases the risk of misuse and highlights the need for reliable detectors. Yet, existing detectors exhibit poor robustness on out-of-distribution (OOD) data and attacked data, which is critical for real-world scenarios. Also, they struggle to provide interpretable evidence to support their decisions, thus undermining the reliability. In light of these challenges, we propose IPAD (Inverse Prompt for AI Detection), a novel framework consisting of a Prompt Inverter that identifies predicted prompts that could have generated the input text, and two Distinguishers that examine the probability that the input texts align with the predicted prompts. Empirical evaluations demonstrate that IPAD outperforms the strongest baselines by 9.05% (Average Recall) on in-distribution data, 12.93% (AUROC) on out-of-distribution data, and 5.48% (AUROC) on attacked data. IPAD also performs robustly on structured datasets. Furthermore, an interpretability assessment is conducted to illustrate that IPAD enhances the AI detection trustworthiness by allowing users to directly examine the decision-making evidence, which provides interpretable support for its state-of-the-art detection results.