CVFeb 19Code
EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language ModelsYahong Wang, Juncheng Wu, Zhangkai Ni et al.
Multimodal large language models (MLLMs) incur substantial inference cost due to the processing of hundreds of visual tokens per image. Although token pruning has proven effective for accelerating inference, determining when and where to prune remains largely heuristic. Existing approaches typically rely on static, empirically selected layers, which limit interpretability and transferability across models. In this work, we introduce a matrix-entropy perspective and identify an "Entropy Collapse Layer" (ECL), where the information content of visual representations exhibits a sharp and consistent drop, which provides a principled criterion for selecting the pruning stage. Building on this observation, we propose EntropyPrune, a novel matrix-entropy-guided token pruning framework that quantifies the information value of individual visual tokens and prunes redundant ones without relying on attention maps. Moreover, to enable efficient computation, we exploit the spectral equivalence of dual Gram matrices, reducing the complexity of entropy computation and yielding up to a 64x theoretical speedup. Extensive experiments on diverse multimodal benchmarks demonstrate that EntropyPrune consistently outperforms state-of-the-art pruning methods in both accuracy and efficiency. On LLaVA-1.5-7B, our method achieves a 68.2% reduction in FLOPs while preserving 96.0% of the original performance. Furthermore, EntropyPrune generalizes effectively to high-resolution and video-based models, highlighting the strong robustness and scalability in practical MLLM acceleration. The code will be publicly available at https://github.com/YahongWang1/EntropyPrune.
CVDec 8, 2025Code
All You Need Are Random Visual Tokens? Demystifying Token Pruning in VLLMsYahong Wang, Juncheng Wu, Zhangkai Ni et al.
Vision Large Language Models (VLLMs) incur high computational costs due to their reliance on hundreds of visual tokens to represent images. While token pruning offers a promising solution for accelerating inference, this paper, however, identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by "vanishing token information", where visual tokens progressively lose their salience with increasing network depth. To validate this hypothesis, we quantify a token's information content by measuring the change in the model output probabilities upon its removal. Using this proposed metric, our analysis of the information of visual tokens across layers reveals three key findings: (1) As layers deepen, the information of visual tokens gradually becomes uniform and eventually vanishes at an intermediate layer, which we term as "information horizon", beyond which the visual tokens become redundant; (2) The position of this horizon is not static; it extends deeper for visually intensive tasks, such as Optical Character Recognition (OCR), compared to more general tasks like Visual Question Answering (VQA); (3) This horizon is also strongly correlated with model capacity, as stronger VLLMs (e.g., Qwen2.5-VL) employ deeper visual tokens than weaker models (e.g., LLaVA-1.5). Based on our findings, we show that simple random pruning in deep layers efficiently balances performance and efficiency. Moreover, integrating random pruning consistently enhances existing methods. Using DivPrune with random pruning achieves state-of-the-art results, maintaining 96.9% of Qwen-2.5-VL-7B performance while pruning 50% of visual tokens. The code will be publicly available at https://github.com/YahongWang1/Information-Horizon.
CVMay 28, 2022
Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image CaptioningLongzhen Yang, Yihang Liu, Yitao Peng et al.
Accuracy and Diversity are two essential metrizable manifestations in generating natural and semantically correct captions. Many efforts have been made to enhance one of them with another decayed due to the trade-off gap. In this work, we will show that the inferior standard of accuracy draws from human annotations (leave-one-out) are not appropriate for machine-generated captions. To improve diversity with a solid accuracy performance, we exploited a novel Variational Transformer framework. By introducing the "Invisible Information Prior" and the "Auto-selectable GMM", we instruct the encoder to learn the precise language information and object relation in different scenes for accuracy assurance. By introducing the "Range-Median Reward" baseline, we retain more diverse candidates with higher rewards during the RL-based training process for diversity assurance. Experiments show that our method achieves the simultaneous promotion of accuracy (CIDEr) and diversity (self-CIDEr), up to 1.1 and 4.8 percent. Also, our method got the most similar performance of the semantic retrieval compared to human annotations, with 50.3 (50.6 of human) for R@1(i2t).
CVJan 12, 2023
Hierarchical Dynamic Masks for Visual Explanation of Neural NetworksYitao Peng, Longzhen Yang, Yihang Liu et al.
Saliency methods generating visual explanatory maps representing the importance of image pixels for model classification is a popular technique for explaining neural network decisions. Hierarchical dynamic masks (HDM), a novel explanatory maps generation method, is proposed in this paper to enhance the granularity and comprehensiveness of saliency maps. First, we suggest the dynamic masks (DM), which enables multiple small-sized benchmark mask vectors to roughly learn the critical information in the image through an optimization method. Then the benchmark mask vectors guide the learning of large-sized auxiliary mask vectors so that their superimposed mask can accurately learn fine-grained pixel importance information and reduce the sensitivity to adversarial perturbations. In addition, we construct the HDM by concatenating DM modules. These DM modules are used to find and fuse the regions of interest in the remaining neural network classification decisions in the mask image in a learning-based way. Since HDM forces DM to perform importance analysis in different areas, it makes the fused saliency map more comprehensive. The proposed method outperformed previous approaches significantly in terms of recognition and localization capabilities when tested on natural and medical datasets.
CVAug 13, 2024
Visual Neural Decoding via Improved Visual-EEG Semantic ConsistencyHongzhou Chen, Lianghua He, Yihang Liu et al.
Visual neural decoding refers to the process of extracting and interpreting original visual experiences from human brain activity. Recent advances in metric learning-based EEG visual decoding methods have delivered promising results and demonstrated the feasibility of decoding novel visual categories from brain activity. However, methods that directly map EEG features to the CLIP embedding space may introduce mapping bias and cause semantic inconsistency among features, thereby degrading alignment and impairing decoding performance. To further explore the semantic consistency between visual and neural signals. In this work, we construct a joint semantic space and propose a Visual-EEG Semantic Decouple Framework that explicitly extracts the semantic-related features of these two modalities to facilitate optimal alignment. Specifically, a cross-modal information decoupling module is introduced to guide the extraction of semantic-related information from modalities. Then, by quantifying the mutual information between visual image and EEG features, we observe a strong positive correlation between the decoding performance and the magnitude of mutual information. Furthermore, inspired by the mechanisms of visual object understanding from neuroscience, we propose an intra-class geometric consistency approach during the alignment process. This strategy maps visual samples within the same class to consistent neural patterns, which further enhances the robustness and the performance of EEG visual decoding. Experiments on a large Image-EEG dataset show that our method achieves state-of-the-art results in zero-shot neural decoding tasks.
CVOct 15, 2022
Decoupling Deep Learning for Interpretable Image RecognitionYitao Peng, Yihang Liu, Longzhen Yang et al.
The interpretability of neural networks has recently received extensive attention. Previous prototype-based explainable networks involved prototype activation in both reasoning and interpretation processes, requiring specific explainable structures for the prototype, thus making the network less accurate as it gains interpretability. Therefore, the decoupling prototypical network (DProtoNet) was proposed to avoid this problem. This new model contains encoder, inference, and interpretation modules. As regards the encoder module, unrestricted feature masks were presented to generate expressive features and prototypes. Regarding the inference module, a multi-image prototype learning method was introduced to update prototypes so that the network can learn generalized prototypes. Finally, concerning the interpretation module, a multiple dynamic masks (MDM) decoder was suggested to explain the neural network, which generates heatmaps using the consistent activation of the original image and mask image at the detection nodes of the network. It decouples the inference and interpretation modules of a prototype-based network by avoiding the use of prototype activation to explain the network's decisions in order to simultaneously improve the accuracy and interpretability of the neural network. The multiple public general and medical datasets were tested, and the results confirmed that our method could achieve a 5% improvement in accuracy and state-of-the-art interpretability compared with previous methods.
CVJul 17, 2022
MDM: Multiple Dynamic Masks for Visual Explanation of Neural NetworksYitao Peng, Longzhen Yang, Yihang Liu et al.
The Class Activation Map (CAM) lookup of a neural network tells us to which regions the neural network focuses when it makes a decision. In the past, the CAM search method was dependent upon a specific internal module of the network. It has specific constraints on the structure of the neural network. To make the search of CAM have generality and high performance. We propose a learning-based algorithm, namely Multiple Dynamic Masks (MDM). It is based on a public cognition that only active features of a picture related to classification will affect the classification results of the neural network, and other features will hardly affect the classification results of the network. The mask generated by MDM conforms to the above cognition. It trains mask vectors of different sizes by constraining mask values and activating consistency, then it uses stacking masks of different scale to generate CAM that can balance spatial information and semantic information. Comparing the results of MDM with those of the recent advanced CAM search method, the performance of MDM has reached the state of the art results. We applied the MDM method to the interpretable neural networks ProtoPNet and XProtoNet, which improved the performance of model in the explainable prototype search. Finally, we visualized the CAM generation effect of MDM on neural networks of different architectures, verifying the generality of the MDM method.
CVApr 15, 2025
AFiRe: Anatomy-Driven Self-Supervised Learning for Fine-Grained Representation in Radiographic ImagesYihang Liu, Lianghua He, Ying Wen et al.
Current self-supervised methods, such as contrastive learning, predominantly focus on global discrimination, neglecting the critical fine-grained anatomical details required for accurate radiographic analysis. To address this challenge, we propose an Anatomy-driven self-supervised framework for enhancing Fine-grained Representation in radiographic image analysis (AFiRe). The core idea of AFiRe is to align the anatomical consistency with the unique token-processing characteristics of Vision Transformer. Specifically, AFiRe synergistically performs two self-supervised schemes: (i) Token-wise anatomy-guided contrastive learning, which aligns image tokens based on structural and categorical consistency, thereby enhancing fine-grained spatial-anatomical discrimination; (ii) Pixel-level anomaly-removal restoration, which particularly focuses on local anomalies, thereby refining the learned discrimination with detailed geometrical information. Additionally, we propose Synthetic Lesion Mask to enhance anatomical diversity while preserving intra-consistency, which is typically corrupted by traditional data augmentations, such as Cropping and Affine transformations. Experimental results show that AFiRe: (i) provides robust anatomical discrimination, achieving more cohesive feature clusters compared to state-of-the-art contrastive learning methods; (ii) demonstrates superior generalization, surpassing 7 radiography-specific self-supervised methods in multi-label classification tasks with limited labeling; and (iii) integrates fine-grained information, enabling precise anomaly detection using only image-level annotations.
49.3CVApr 10
M-IDoL: Information Decomposition for Modality-Specific and Diverse Representation Learning in Medical Foundation ModelYihang Liu, Ying Wen, Jiaxiong Yang et al.
Medical foundation models (MFMs) aim to learn universal representations from multimodal medical images that can generalize effectively to diverse downstream clinical tasks. However, most existing MFMs suffer from information ambiguity that blend multimodal representations in a single embedding space, leading to the degradation of modality specificity and diversity. In this paper, we propose M-IDoL, a self-supervised \underline{\textit{M}}FM that introduces Information Decomposition for multimodal representation Learning via two objectives: i) maximize inter-modality entropy by dispersing multimodal representation into separable Mixture-of-Experts (MoE) subspaces to achieve representation specificity across modalities; and ii) minimize intra-modality uncertainty by performing fine-grained semantic discrimination within each MoE subspace to enrich representation diversity per modality. By pre-training on 1.15 million medical images, M-IDoL i) delivers superior generalization across 21 downstream clinical tasks, outperforming 20 foundation models on five imaging modalities (e.g., X-ray, fundus, OCT, dermoscopy and pathology), and ii) learns modality-specific and diverse representations, showing clearer separation of feature cluster across modalities and finer-grained feature discrimination within each modality.
CVSep 30, 2025
Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report GenerationLongzhen Yang, Zhangkai Ni, Ying Wen et al.
Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images, anchored in explicit visual evidence to improve interpretability and facilitate integration into clinical workflows. However, existing methods often rely on separately trained detection modules that require extensive expert annotations, introducing high labeling costs and limiting generalizability due to pathology distribution bias across datasets. To address these challenges, we propose Self-Supervised Anatomical Consistency Learning (SS-ACL) -- a novel and annotation-free framework that aligns generated reports with corresponding anatomical regions using simple textual prompts. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy, organizing entities by spatial location. It recursively reconstructs fine-grained anatomical regions to enforce intra-sample spatial alignment, inherently guiding attention maps toward visually relevant areas prompted by text. To further enhance inter-sample semantic alignment for abnormality recognition, SS-ACL introduces a region-level contrastive learning based on anatomical consistency. These aligned embeddings serve as priors for report generation, enabling attention maps to provide interpretable visual evidence. Extensive experiments demonstrate that SS-ACL, without relying on expert annotations, (i) generates accurate and visually grounded reports -- outperforming state-of-the-art methods by 10\% in lexical accuracy and 25\% in clinical efficacy, and (ii) achieves competitive performance on various downstream visual tasks, surpassing current leading visual foundation models by 8\% in zero-shot visual grounding.
CVDec 4, 2023
FeaInfNet: Diagnosis in Medical Image with Feature-Driven Inference and Visual ExplanationsYitao Peng, Lianghua He, Die Hu et al.
Interpretable deep learning models have received widespread attention in the field of image recognition. Due to the unique multi-instance learning of medical images and the difficulty in identifying decision-making regions, many interpretability models that have been proposed still have problems of insufficient accuracy and interpretability in medical image disease diagnosis. To solve these problems, we propose feature-driven inference network (FeaInfNet). Our first key innovation involves proposing a feature-based network reasoning structure, which is applied to FeaInfNet. The network of this structure compares the similarity of each sub-region image patch with the disease templates and normal templates that may appear in the region, and finally combines the comparison of each sub-region to make the final diagnosis. It simulates the diagnosis process of doctors to make the model interpretable in the reasoning process, while avoiding the misleading caused by the participation of normal areas in reasoning. Secondly, we propose local feature masks (LFM) to extract feature vectors in order to provide global information for these vectors, thus enhancing the expressive ability of the FeaInfNet. Finally, we propose adaptive dynamic masks (Adaptive-DM) to interpret feature vectors and prototypes into human-understandable image patches to provide accurate visual interpretation. We conducted qualitative and quantitative experiments on multiple publicly available medical datasets, including RSNA, iChallenge-PM, Covid-19, ChinaCXRSet, and MontgomerySet. The results of our experiments validate that our method achieves state-of-the-art performance in terms of classification accuracy and interpretability compared to baseline methods in medical image diagnosis. Additional ablation studies verify the effectiveness of each of our proposed components.