89.7 | CV | Apr 27 | Code
PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling
Xudong Xie, Hao Yan, Liang Yin et al.
Multimodal document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved performance on this task. However, existing methods typically focus on either plain text or a limited number of document images, and struggle to handle long PDF documents with interleaved text and images, especially academic papers. In this paper, we introduce PDF-WuKong, a multimodal large language model (MLLM) designed to enhance multimodal question-answering (QA) for long PDF documents. PDF-WuKong incorporates a sparse sampler that operates on both text and image representations, significantly improving the efficiency and capability of the MLLM. The sparse sampler selects the paragraphs or diagrams most pertinent to user queries. To effectively train and evaluate our model, we construct PaperPDF, a dataset consisting of a broad collection of English and Chinese academic papers. Multiple strategies are proposed to build 1.1 million high-quality QA pairs along with their corresponding evidence sources. Experimental results demonstrate the superiority and high efficiency of our approach over other models on the task of long multimodal document understanding, surpassing proprietary products by an average of 8.6% on F1. Our code and dataset will be released at https://github.com/yh-hust/PDF-Wukong.
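As a rough illustration of the idea of selecting only the paragraphs or diagrams most pertinent to a query, the sketch below ranks document chunks by cosine similarity to a query vector and keeps the top k. It is not the PDF-WuKong sampler, which the abstract describes as trained end-to-end with the MLLM. The function names (`cosine`, `sparse_sample`), the chunk ids, and the toy three-dimensional vectors are all assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sparse_sample(query_vec, chunks, k=2):
    """Return the ids of the k chunks most similar to the query, best first.

    `chunks` maps a hypothetical chunk id (paragraph or figure) to its vector.
    """
    scored = [(cosine(query_vec, vec), cid) for cid, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored[:k]]


# Toy embeddings: text and image chunks are assumed to share one vector space.
chunks = {
    "para_1": [1.0, 0.0, 0.0],
    "para_2": [0.0, 1.0, 0.0],
    "figure_1": [0.9, 0.1, 0.0],
    "figure_2": [0.0, 0.0, 1.0],
}
selected = sparse_sample([1.0, 0.0, 0.0], chunks, k=2)
print(selected)
```

Only the selected chunks would then be passed to the language model, which is where the efficiency gain for long documents comes from in a scheme like this.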
100.0 | CV | Mar 12 | Code
Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously
Yiran Guan, Liang Yin, Dingkang Liang et al.
Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception and lack a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. It supports a thinking-while-watching mechanism, which activates reasoning over incoming video clips during streaming. This design improves timely comprehension and coherent cognition while preserving real-time responsiveness by amortizing LLM reasoning latency over video playback. Furthermore, we introduce a comprehensive post-training pipeline that integrates VST-SFT, which structurally adapts the offline VideoLLM to causal streaming reasoning, and VST-RL, which provides end-to-end improvement through self-exploration in a multi-turn video interaction environment. Additionally, we devise an automated training-data synthesis pipeline that uses video knowledge graphs to generate high-quality streaming QA pairs, with an entity-relation grounded streaming Chain-of-Thought to enforce multi-evidence reasoning and sustained attention to the video stream. Extensive evaluations show that VST-7B performs strongly on online benchmarks, e.g., 79.5% on StreamingBench and 59.3% on OVO-Bench. Meanwhile, VST remains competitive on offline long-form or reasoning benchmarks. Compared with Video-R1, VST responds 15.7 times faster and achieves a +5.4% improvement on VideoHolmes, demonstrating higher efficiency and strong generalization across diverse video understanding tasks. Code, data, and models will be released at https://github.com/1ranGuan/VST.
CV | Jun 3, 2025 | Code
VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning
Hao Yan, Xingchen Liu, Hao Wang et al.
Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance in many reasoning tasks. However, Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. To tackle this issue, we investigate the bottlenecks in current MLLMs and synthesize training data to improve their abstract visual perception. First, we propose VisuRiddles, a benchmark for AVR, featuring tasks meticulously constructed to assess models' reasoning capacities across five core dimensions and two high-level reasoning categories. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions. PRS not only generates valuable training data for abstract graphics but also provides fine-grained perceptual descriptions, which allow supervision over intermediate reasoning stages and thereby improve both training efficacy and model interpretability. Our extensive experimental results on VisuRiddles empirically validate that fine-grained visual perception is the principal bottleneck and that our synthesis framework markedly enhances the performance of contemporary MLLMs on these challenging tasks. Our code and dataset will be released at https://github.com/yh-hust/VisuRiddles.
LG | Feb 5, 2024 | Code
Contrastive Diffuser: Planning Towards High Return States via Contrastive Learning
Yixiang Shan, Zhengbang Zhu, Ting Long et al.
The performance of offline reinforcement learning (RL) is sensitive to the proportion of high-return trajectories in the offline dataset. However, in many simulation environments and real-world scenarios, low-return trajectories greatly outnumber high-return ones, which makes learning an efficient policy challenging. In this paper, we propose a method called Contrastive Diffuser (CDiffuser) to make full use of low-return trajectories and improve the performance of offline RL algorithms. Specifically, CDiffuser groups the states of trajectories in the offline dataset into high-return states and low-return states and treats them as positive and negative samples, respectively. Then, it designs a contrastive mechanism to pull the trajectory of an agent toward high-return states and push it away from low-return states. Through the contrastive mechanism, trajectories with low returns can serve as negative examples for policy learning, guiding the agent to avoid areas associated with low returns and achieve better performance. Experiments on 14 commonly used D4RL benchmarks demonstrate the effectiveness of our proposed method. Our code is publicly available at https://anonymous.4open.science/r/CDiffuser.
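The pull-toward-positives, push-from-negatives idea can be sketched with a generic InfoNCE-style loss. This is an assumption-laden illustration, not the loss used by CDiffuser: the similarity kernel (negative squared Euclidean distance), the `temperature` parameter, the function names, and the toy 2-D states are all chosen here for the example.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two states."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def contrastive_loss(state, positives, negatives, temperature=1.0):
    """InfoNCE-style loss for one state.

    The loss is small when `state` is close to the positive (high-return)
    states and far from the negative (low-return) states.
    """
    pos = sum(math.exp(-sq_dist(state, p) / temperature) for p in positives)
    neg = sum(math.exp(-sq_dist(state, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy 2-D states: high-return states cluster near (1, 1), low-return near (0, 0).
high_return_states = [[1.0, 1.0]]
low_return_states = [[0.0, 0.0]]

loss_near_high = contrastive_loss([0.9, 0.9], high_return_states, low_return_states)
loss_near_low = contrastive_loss([0.1, 0.1], high_return_states, low_return_states)
print(loss_near_high, loss_near_low)
```

Minimizing a term of this shape over the states of a planned trajectory rewards plans that stay near high-return regions, which is how low-return data can still contribute a useful training signal.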
CV | Jul 8, 2025 | Code
LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance
Zhang Li, Biao Yang, Qiang Liu et al.
While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) the Semantic-Enhanced Feature Extractor (SEFE), which improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; (2) Interleaved Local Visual Coupling (ILVC), which autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the <seg> token. To quantify this relationship and the model's potential semantic inferring ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks. Code will be available at https://github.com/echo840/LIRA.
CV | Jun 12, 2025 | Code
MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling
Liang Yin, Xudong Xie, Zhang Li et al.
Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. Moreover, they mostly adopt a customized retrieval strategy and struggle to unify various types of queries to meet diverse retrieval needs. To address these issues, we introduce Multi-query Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text retrieval capability of models, comprising four query types and 16k images. Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset. Notably, MSTAR marginally surpasses the previous state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly outperforms the previous models by an average of 8.5%. The code and datasets are available at https://github.com/yingift/MSTAR.
78.7 | IT | May 9
Sensing-Aided Secure Multicast in Two-Level Rotatable Antenna-Enabled ISAC Systems: Modeling and Optimization
Zequan Wang, Liang Yin, Hao Xu et al.
In physical layer security, the channel state information (CSI) of passive eavesdroppers is usually difficult to obtain, which has motivated sensing-aided secure communication (SASC). However, in secure multicast scenarios, conventional fixed-position antennas (FPAs) provide limited spatial flexibility for simultaneously serving multiple legitimate users and suppressing leakage toward possible eavesdropper directions. Motivated by this, a novel two-level rotatable antenna (RA)-enabled sensing-aided secure multicast scheme is proposed in this paper. In the proposed architecture, array-level and element-wise rotations are jointly exploited with analog beamforming for user enhancement and leakage suppression. To characterize imperfect eavesdropper sensing, the maximum likelihood estimator and the corresponding Cramér-Rao bound (CRB) are derived to quantify the angular estimation accuracy. Based on the derived CRB, a probabilistic angular uncertainty region is constructed. A CRB-aware max-min secrecy-rate problem is then formulated by evaluating the eavesdropper leakage over sampled high-probability directions within this region. The non-convex problem is handled through a tractable lower-bound reformulation based on Jensen's inequality and smooth approximation, followed by an alternating optimization algorithm combining manifold optimization and projected-gradient updates. Simulation results show the effectiveness and robustness of the proposed scheme compared with various benchmarks. Beam patterns further reveal that array-level and element-wise rotations play complementary roles in maintaining strong gains toward legitimate users and forming a low-gain region over the eavesdropper angular uncertainty interval.
IR | Mar 7, 2017
Heterogeneous information network model for equipment-standard system
Liang Yin, Li-Chen Shi, Jun-Yan Zhao et al.
An entity information network describes structural relationships between entities. Thanks to its extensibility and heterogeneity, it is increasingly applied to relationship modeling. In recent years, many studies on entity information network modeling have been proposed, but few of them focus on the equipment-standard system, which is multi-layer, multi-dimensional and multi-scale. To deal efficiently with complex issues in the equipment-standard system, such as standard revision, standard control, and production design, a heterogeneous information network model for the equipment-standard system is proposed in this paper. Three types of entities and six types of relationships are considered in the proposed model. Correspondingly, several different similarity-measuring methods are used in the modeling process. The experiments show that the heterogeneous information network model established in this paper can accurately reflect relationships between entities. The modeling process is also efficient in terms of time consumption.