CVNov 10, 2025Code
Adaptive Morph-Patch Transformer for Aortic Vessel SegmentationZhenxi Zhang, Fuchen Zheng, Adnan Iltaf et al.
Accurate segmentation of aortic vascular structures is critical for diagnosing and treating cardiovascular diseases.Traditional Transformer-based models have shown promise in this domain by capturing long-range dependencies between vascular features. However, their reliance on fixed-size rectangular patches often influences the integrity of complex vascular structures, leading to suboptimal segmentation accuracy. To address this challenge, we propose the adaptive Morph Patch Transformer (MPT), a novel architecture specifically designed for aortic vascular segmentation. Specifically, MPT introduces an adaptive patch partitioning strategy that dynamically generates morphology-aware patches aligned with complex vascular structures. This strategy can preserve semantic integrity of complex vascular structures within individual patches. Moreover, a Semantic Clustering Attention (SCA) method is proposed to dynamically aggregate features from various patches with similar semantic characteristics. This method enhances the model's capability to segment vessels of varying sizes, preserving the integrity of vascular structures. Extensive experiments on three open-source dataset(AVT, AortaSeg24 and TBAD) demonstrate that MPT achieves state-of-the-art performance, with improvements in segmenting intricate vascular structures.
CLSep 23, 2025Code
TsqLoRA: Towards Sensitivity and Quality Low-Rank Adaptation for Efficient Fine-TuningYu Chen, Yifei Han, Long Zhang et al.
Fine-tuning large pre-trained models for downstream tasks has become a fundamental approach in natural language processing. Fully fine-tuning all model parameters is computationally expensive and memory-intensive, especially in resource-constrained environments. Existing parameter-efficient fine-tuning methods reduce the number of trainable parameters but typically overlook the varying sensitivity of different model layers and the importance of training data. In this work, we propose TsqLoRA, a novel method that integrates data-quality-driven selection with sensitivity-aware low-rank adaptation, consisted of two main components: a quality-aware sampling mechanism for selecting the most informative training data, and a dynamic rank allocation module that adjusts the rank of each layer based on its sensitivity to parameter updates. The experimental results demonstrate that TsqLoRA improves fine-tuning efficiency while maintaining or even improving performance on a variety of NLP tasks. Our code will be available at https://github.com/Benjamin-Ricky/TsqLoRA.
CVApr 7, 2025Code
MSA-UNet3+: Multi-Scale Attention UNet3+ with New Supervised Prototypical Contrastive Loss for Coronary DSA Image SegmentationRayan Merghani Ahmed, Adnan Iltaf, Mohamed Elmanna et al.
Accurate segmentation of coronary Digital Subtraction Angiography images is essential to diagnose and treat coronary artery diseases. Despite advances in deep learning, challenges such as high intra-class variance and class imbalance limit precise vessel delineation. Most existing approaches for coronary DSA segmentation cannot address these issues. Also, existing segmentation network's encoders do not directly generate semantic embeddings, which could enable the decoder to reconstruct segmentation masks effectively from these well-defined features. We propose a Supervised Prototypical Contrastive Loss that fuses supervised and prototypical contrastive learning to enhance coronary DSA image segmentation. The supervised contrastive loss enforces semantic embeddings in the encoder, improving feature differentiation. The prototypical contrastive loss allows the model to focus on the foreground class while alleviating the high intra-class variance and class imbalance problems by concentrating only on the hard-to-classify background samples. We implement the proposed SPCL loss within an MSA-UNet3+: a Multi-Scale Attention-Enhanced UNet3+ architecture. The architecture integrates key components: a Multi-Scale Attention Encoder and a Multi-Scale Dilated Bottleneck designed to enhance multi-scale feature extraction and a Contextual Attention Fusion Module built to keep fine-grained details while improving contextual understanding. Experiments on a private coronary DSA dataset show that MSA-UNet3+ outperforms state-of-the-art methods, achieving the highest Dice coefficient and F1-score and significantly reducing ASD and ACD. The developed framework provides clinicians with precise vessel segmentation, enabling accurate identification of coronary stenosis and supporting informed diagnostic and therapeutic decisions. The code will be released at https://github.com/rayanmerghani/MSA-UNet3plus.
CVMay 11, 2025
Overview of the NLPCC 2025 Shared Task 4: Multi-modal, Multilingual, and Multi-hop Medical Instructional Video Question Answering ChallengeBin Li, Shenxi Liu, Yixuan Weng et al.
Following the successful hosts of the 1-st (NLPCC 2023 Foshan) CMIVQA and the 2-rd (NLPCC 2024 Hangzhou) MMIVQA challenges, this year, a new task has been introduced to further advance research in multi-modal, multilingual, and multi-hop medical instructional question answering (M4IVQA) systems, with a specific focus on medical instructional videos. The M4IVQA challenge focuses on evaluating models that integrate information from medical instructional videos, understand multiple languages, and answer multi-hop questions requiring reasoning over various modalities. This task consists of three tracks: multi-modal, multilingual, and multi-hop Temporal Answer Grounding in Single Video (M4TAGSV), multi-modal, multilingual, and multi-hop Video Corpus Retrieval (M4VCR) and multi-modal, multilingual, and multi-hop Temporal Answer Grounding in Video Corpus (M4TAGVC). Participants in M4IVQA are expected to develop algorithms capable of processing both video and text data, understanding multilingual queries, and providing relevant answers to multi-hop medical questions. We believe the newly introduced M4IVQA challenge will drive innovations in multimodal reasoning systems for healthcare scenarios, ultimately contributing to smarter emergency response systems and more effective medical education platforms in multilingual communities. Our official website is https://cmivqa.github.io/
CVOct 20, 2025
Context-Aware Pseudo-Label Scoring for Zero-Shot Video SummarizationYuanli Wu, Long Zhang, Yue Du et al.
We propose a rubric-guided, pseudo-labeled, and prompt-driven zero-shot video summarization framework that bridges large language models with structured semantic reasoning. A small subset of human annotations is converted into high-confidence pseudo labels and organized into dataset-adaptive rubrics defining clear evaluation dimensions such as thematic relevance, action detail, and narrative progression. During inference, boundary scenes, including the opening and closing segments, are scored independently based on their own descriptions, while intermediate scenes incorporate concise summaries of adjacent segments to assess narrative continuity and redundancy. This design enables the language model to balance local salience with global coherence without any parameter tuning. Across three benchmarks, the proposed method achieves stable and competitive results, with F1 scores of 57.58 on SumMe, 63.05 on TVSum, and 53.79 on QFVS, surpassing zero-shot baselines by +0.85, +0.84, and +0.37, respectively. These outcomes demonstrate that rubric-guided pseudo labeling combined with contextual prompting effectively stabilizes LLM-based scoring and establishes a general, interpretable, and training-free paradigm for both generic and query-focused video summarization.
CLSep 23, 2025
Consistency-Aware Parameter-Preserving Knowledge Editing Framework for Multi-Hop Question AnsweringLingwen Deng, Yifei Han, Long Zhang et al.
Parameter-Preserving Knowledge Editing (PPKE) enables updating models with new or corrected information without retraining or parameter adjustment. Recent PPKE approaches based on knowledge graphs (KG) to extend knowledge editing (KE) capabilities to multi-hop question answering (MHQA). However, these methods often lack consistency, leading to knowledge contamination, unstable updates, and retrieval behaviors that fail to reflect the intended edits. Such inconsistencies undermine the reliability of PPKE in multi-hop reasoning. We present CAPE-KG, Consistency-Aware Parameter-Preserving Editing with Knowledge Graphs, a novel consistency-aware framework for PPKE on MHQA. CAPE-KG ensures KG construction, update, and retrieval are always aligned with the requirements of the MHQA task, maintaining coherent reasoning over both unedited and edited knowledge. Extensive experiments on the MQuAKE benchmark show accuracy improvements in PPKE performance for MHQA, demonstrating the effectiveness of addressing consistency in PPKE.
CVApr 4, 2025
Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention FusionJunkai Zhang, Bin Li, Shoujun Zhou et al.
Medical Visual Question Answering (Med-VQA) answers clinical questions using medical images, aiding diagnosis. Designing the MedVQA system holds profound importance in assisting clinical diagnosis and enhancing diagnostic accuracy. Building upon this foundation, Hierarchical Medical VQA extends Medical VQA by organizing medical questions into a hierarchical structure and making level-specific predictions to handle fine-grained distinctions. Recently, many studies have proposed hierarchical MedVQA tasks and established datasets, However, several issues still remain: (1) imperfect hierarchical modeling leads to poor differentiation between question levels causing semantic fragmentation across hierarchies. (2) Excessive reliance on implicit learning in Transformer-based cross-modal self-attention fusion methods, which obscures crucial local semantic correlations in medical scenarios. To address these issues, this study proposes a HiCA-VQA method, including two modules: Hierarchical Prompting for fine-grained medical questions and Hierarchical Answer Decoders. The hierarchical prompting module pre-aligns hierarchical text prompts with image features to guide the model in focusing on specific image regions according to question types, while the hierarchical decoder performs separate predictions for questions at different levels to improve accuracy across granularities. The framework also incorporates a cross-attention fusion module where images serve as queries and text as key-value pairs. Experiments on the Rad-Restruct benchmark demonstrate that the HiCA-VQA framework better outperforms existing state-of-the-art methods in answering hierarchical fine-grained questions. This study provides an effective pathway for hierarchical visual question answering systems, advancing medical image understanding.