CVAug 12, 2024
Mipmap-GS: Let Gaussians Deform with Scale-specific Mipmap for Anti-aliasing RenderingJiameng Li, Yue Shi, Jiezhang Cao et al.
3D Gaussian Splatting (3DGS) has attracted great attention in novel view synthesis because of its superior rendering efficiency and high fidelity. However, the trained Gaussians suffer from severe zooming degradation due to non-adjustable representation derived from single-scale training. Though some methods attempt to tackle this problem via post-processing techniques such as selective rendering or filtering techniques towards primitives, the scale-specific information is not involved in Gaussians. In this paper, we propose a unified optimization method to make Gaussians adaptive for arbitrary scales by self-adjusting the primitive properties (e.g., color, shape and size) and distribution (e.g., position). Inspired by the mipmap technique, we design pseudo ground-truth for the target scale and propose a scale-consistency guidance loss to inject scale information into 3D Gaussians. Our method is a plug-in module, applicable for any 3DGS models to solve the zoom-in and zoom-out aliasing. Extensive experiments demonstrate the effectiveness of our method. Notably, our method outperforms 3DGS in PSNR by an average of 9.25 dB for zoom-in and 10.40 dB for zoom-out on the NeRF Synthetic dataset.
CVMay 11
EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMsJiameng Li, Minye Wu, Jiezhang Cao et al.
Long-form video understanding remains challenging for Video Large Language Models (VideoLLMs), as the dense frame sampling introduces massive visual tokens while sparse sampling risks missing critical temporal evidence and leading to LLM hallucination. Existing training-free token reduction methods either treat videos equally as static images or rely on segment-level merging heuristics, which weaken fine-grained spatiotemporal modeling and introduce additional overhead. In this paper, we propose EchoPrune, a lightweight and training-free token pruning method that improves temporal resolution under a fixed LLM-side visual token budget. Our core idea is to interpret redundant video tokens as temporal echoes: if a token is well reconstructed from the previous frame, it is merely a temporally redundant echo; otherwise, it may capture new events, motion, or query-relevant visual evidence. Based on this insight, EchoPrune scores visual tokens by (i) query-guided crossmodal relevance and (ii) temporal reconstruction error, measured by correspondence matching and echo matching across consecutive frames. The selected tokens preserve task-relevant cues and temporal novelty while suppressing predictable redundancy, allowing VideoLLMs to observe more frames without increasing the decoding budget. Extensive experiments on LLaVA-OV, Qwen2.5VL, and Qwen3VL across six video understanding benchmarks show that EchoPrune enables VideoLLMs to process up to 20x frames under the same token budget, yielding improved performance (+8.6%) and inference speedup (5.6x for prefilling) on Qwen2.5VL-7B.
CVApr 3
MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMsJiameng Li, Aleksei Tiulpin, Matthew B. Blaschko
For multimodal large language models (MLLMs), visual information is relatively sparse compared with text. As a result, research on visual pruning emerges for efficient inference. Current approaches typically measure token importance based on the attention scores in the visual encoder or in the LLM decoder, then select visual tokens with high attention scores while pruning others. In this paper, we pursue a different and more surgical approach. Instead of relying on mechanism-specific signals, we directly compute Mutual Information (MI) between visual and textual features themselves, prior to their interaction. This allows us to explicitly measure crossmodal dependency at the feature levels. Our MI-Pruner is simple, efficient and non-intrusive, requiring no access to internal attention maps or architectural modifications. Experimental results demonstrate that our approach outperforms previous attention-based pruning methods with minimal latency.
CVNov 24, 2025
CLASH: A Benchmark for Cross-Modal Contradiction DetectionTeodora Popordanoska, Jiameng Li, Matthew B. Blaschko
Contradictory multimodal inputs are common in real-world settings, yet existing benchmarks typically assume input consistency and fail to evaluate cross-modal contradiction detection - a fundamental capability for preventing hallucinations and ensuring reliability. We introduce CLASH, a novel benchmark for multimodal contradiction detection, featuring COCO images paired with contradictory captions containing controlled object-level or attribute-level contradictions. The samples include targeted questions evaluated in both multiple-choice and open-ended formats. The benchmark provides an extensive fine-tuning set filtered through automated quality checks, alongside a smaller human-verified diagnostic set. Our analysis of state-of-the-art models reveals substantial limitations in recognizing cross-modal conflicts, exposing systematic modality biases and category-specific weaknesses. Furthermore, we empirically demonstrate that targeted fine-tuning on CLASH substantially enhances conflict detection capabilities.
CLOct 7, 2025
YpathRAG:A Retrieval-Augmented Generation Framework and Benchmark for PathologyDeshui Yu, Yizhi Wang, Saihui Jin et al.
Large language models (LLMs) excel on general tasks yet still hallucinate in high-barrier domains such as pathology. Prior work often relies on domain fine-tuning, which neither expands the knowledge boundary nor enforces evidence-grounded constraints. We therefore build a pathology vector database covering 28 subfields and 1.53 million paragraphs, and present YpathRAG, a pathology-oriented RAG framework with dual-channel hybrid retrieval (BGE-M3 dense retrieval coupled with vocabulary-guided sparse retrieval) and an LLM-based supportive-evidence judgment module that closes the retrieval-judgment-generation loop. We also release two evaluation benchmarks, YpathR and YpathQA-M. On YpathR, YpathRAG attains Recall@5 of 98.64%, a gain of 23 percentage points over the baseline; on YpathQA-M, a set of the 300 most challenging questions, it increases the accuracies of both general and medical LLMs by 9.0% on average and up to 15.6%. These results demonstrate improved retrieval quality and factual reliability, providing a scalable construction paradigm and interpretable evaluation for pathology-oriented RAG.
CVMay 26, 2025
CARE: Confidence-aware Ratio Estimation for Medical BiomarkersJiameng Li, Teodora Popordanoska, Aleksei Tiulpin et al.
Ratio-based biomarkers -- such as the proportion of necrotic tissue within a tumor -- are widely used in clinical practice to support diagnosis, prognosis, and treatment planning. These biomarkers are typically estimated from soft segmentation outputs by computing region-wise ratios. Despite the high-stakes nature of clinical decision making, existing methods provide only point estimates, offering no measure of uncertainty. In this work, we propose a unified confidence-aware framework for estimating ratio-based biomarkers. Our uncertainty analysis stems from two observations: i) the probability ratio estimator inherently admits a statistical confidence interval regarding local randomness (bias and variance), ii) the segmentation network is not perfectly calibrated. We conduct a systematic analysis of error propagation in the segmentation-to-biomarker pipeline and identify model miscalibration as the dominant source of uncertainty. We leverage tunable parameters to control the confidence level of the derived bounds, allowing adaptation towards clinical practice. Extensive experiments show that our method produces statistically sound confidence intervals, with tunable confidence levels, enabling more trustworthy application of predictive biomarkers in clinical workflows.
IVJun 12, 2024
Unveiling Incomplete Modality Brain Tumor Segmentation: Leveraging Masked Predicted Auto-Encoder and Divergence LearningZhongao Sun, Jiameng Li, Yuhan Wang et al.
Brain tumor segmentation remains a significant challenge, particularly in the context of multi-modal magnetic resonance imaging (MRI) where missing modality images are common in clinical settings, leading to reduced segmentation accuracy. To address this issue, we propose a novel strategy, which is called masked predicted pre-training, enabling robust feature learning from incomplete modality data. Additionally, in the fine-tuning phase, we utilize a knowledge distillation technique to align features between complete and missing modality data, simultaneously enhancing model robustness. Notably, we leverage the Holder pseudo-divergence instead of the KLD for distillation loss, offering improve mathematical interpretability and properties. Extensive experiments on the BRATS2018 and BRATS2020 datasets demonstrate significant performance enhancements compared to existing state-of-the-art methods.