CV | Nov 3, 2025
Positive Semi-definite Latent Factor Grouping-Boosted Cluster-reasoning Instance Disentangled Learning for WSI Representation
Chentao Li, Behzad Bozorgtabar, Yifang Ping et al.
Multiple instance learning (MIL) has been widely used for representing whole-slide pathology images. However, spatial, semantic, and decision entanglements among instances limit its representation capability and interpretability. To address these challenges, we propose a latent factor grouping-boosted cluster-reasoning instance disentangled learning framework for interpretable whole-slide image (WSI) representation, organized in three phases. First, we introduce a novel positive semi-definite latent factor grouping that maps instances into a latent subspace, effectively mitigating spatial entanglement in MIL. Second, to alleviate semantic entanglement, we employ instance probability counterfactual inference and optimization via cluster-reasoning instance disentangling. Finally, we employ a generalized linear weighted decision via instance effect re-weighting to address decision entanglement. Extensive experiments on multicentre datasets demonstrate that our model outperforms all state-of-the-art models. Moreover, it attains pathologist-aligned interpretability through disentangled representations and a transparent decision-making process.
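The abstract does not spell out how the positive semi-definite (PSD) grouping is built, so the following is only a minimal sketch of the PSD property itself, not the authors' method. It assumes a hypothetical factor matrix `factors` (one row per instance) and shows that the Gram matrix G = F Fᵀ always gives a non-negative quadratic form, since xᵀGx = ‖Fᵀx‖². All names here are illustrative.

```python
import random


def gram_matrix(factors):
    """Return G = F F^T for a list of row vectors F (one row per instance)."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in factors] for fi in factors]


def quadratic_form(matrix, x):
    """Compute x^T M x for a square matrix M given as nested lists."""
    n = len(x)
    return sum(x[i] * matrix[i][j] * x[j] for i in range(n) for j in range(n))


random.seed(0)
# Hypothetical latent factors: 5 instances, 3 latent dimensions (assumption).
factors = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(5)]
G = gram_matrix(factors)

# x^T G x equals ||F^T x||^2, so it cannot be negative (up to rounding error).
probes = [[random.gauss(0.0, 1.0) for _ in range(5)] for _ in range(100)]
min_value = min(quadratic_form(G, x) for x in probes)
print("smallest quadratic form over random probes:", min_value)
```

A Gram construction is one common way to guarantee a PSD instance-affinity matrix; whether the paper uses this or another parameterization is not stated in the abstract.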
27.8 | CV | Apr 23 | Code
Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision
Chentao Li, Zirui Gao, Mingze Gao et al.
Egocentric AI agents, such as smart glasses, rely on pointing gestures to resolve referential ambiguities in natural language commands. However, despite advancements in Multimodal Large Language Models (MLLMs), current systems often fail to precisely ground the spatial semantics of pointing. Instead, they rely on spurious correlations with visual proximity or object saliency, a phenomenon we term "Referential Hallucination." To address this gap, we introduce EgoPoint-Bench, a comprehensive question-answering benchmark designed to evaluate and enhance multimodal pointing reasoning in egocentric views. Comprising over 11k high-fidelity simulated and real-world samples, the benchmark spans five evaluation dimensions and three levels of referential complexity. Extensive experiments demonstrate that while state-of-the-art proprietary and open-source models struggle with egocentric pointing, models fine-tuned on our synthetic data achieve significant performance gains and robust sim-to-real generalization. This work highlights the importance of spatially aware supervision and offers a scalable path toward precise egocentric AI assistants. Project page: https://guyyyug.github.io/EgoPoint-Bench/
CV | Feb 16
MacNet: An End-to-End Manifold-Constrained Adaptive Clustering Network for Interpretable Whole Slide Image Classification
Mingrui Ma, Chentao Li, Pan Huang et al.
Whole slide images (WSIs) are the gold standard for pathological diagnosis and sub-typing. Current mainstream two-step frameworks employ offline feature encoders trained without domain-specific knowledge. Among them, attention-based multiple instance learning (MIL) methods are outcome-oriented and offer limited interpretability. Clustering-based approaches can provide an explainable decision-making process but suffer from high-dimensional features and semantically ambiguous centroids. To this end, we propose an end-to-end MIL framework that integrates Grassmann re-embedding and manifold adaptive clustering, where the manifold's geometric structure facilitates robust clustering results. Furthermore, we design a prior-knowledge-guided proxy instance labeling and aggregation strategy to approximate patch labels and focus on pathologically relevant tumor regions. Experiments on multicentre WSI datasets demonstrate that: 1) our cluster-incorporated model achieves superior performance in both grading accuracy and interpretability; 2) end-to-end learning yields better feature representations at an acceptable computational cost.
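For readers unfamiliar with the attention-based MIL baselines mentioned above, here is a minimal sketch of the widely used attention pooling formulation: weights are a softmax over wᵀtanh(V h_k), and the bag embedding is the weighted sum of instance features. This illustrates the baseline family only, not MacNet itself; the parameter names `V` and `w` and the toy values are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(instances, V, w):
    """Pool K instance vectors (dim D) into one bag vector.

    V is an L x D projection matrix and w a length-L vector (both hypothetical,
    normally learned). Returns (bag_embedding, attention_weights).
    """
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wl * t for wl, t in zip(w, hidden)))
    weights = softmax(scores)
    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(weights, instances)) for d in range(dim)]
    return bag, weights


# Toy bag of three 2-D patch features (made-up numbers).
instances = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[0.5, -0.5], [1.0, 1.0]]
w = [1.0, 2.0]
bag, weights = attention_pool(instances, V, w)
print("attention weights:", weights)
print("bag embedding:", bag)
```

The attention weights indicate which patches drove the bag-level prediction, but they carry no explicit semantic grouping, which is the interpretability gap that clustering-based approaches aim to close.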
CV | Feb 16
Prototype Instance-semantic Disentanglement with Low-rank Regularized Subspace Clustering for WSIs Explainable Recognition
Chentao Li, Pan Huang
The tumor region plays a key role in pathological diagnosis. However, tumor tissues are highly similar to precancerous lesions, and non-tumor instances often greatly outnumber tumor instances in whole slide images (WSIs). These issues cause instance-semantic entanglement in multi-instance learning frameworks, degrading both model representation capability and interpretability. To address this, we propose PID-LRSC, an end-to-end prototype instance semantic disentanglement framework with low-rank regularized subspace clustering, which has two components. First, we use secondary instance subspace learning to construct low-rank regularized subspace clustering (LRSC), addressing instance entanglement caused by an excessive proportion of non-tumor instances. Second, we employ enhanced contrastive learning to design prototype instance semantic disentanglement (PID), resolving semantic entanglement caused by the high similarity between tumor and precancerous tissues. Extensive experiments on multicentre pathology datasets indicate that PID-LRSC outperforms other SOTA methods. Overall, PID-LRSC provides clearer instance semantics during decision-making and significantly enhances the reliability of auxiliary diagnostic outcomes.
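The abstract does not define its "enhanced" contrastive objective, so the sketch below uses the standard InfoNCE loss as a stand-in to show why semantic entanglement matters: an instance lying between a tumor prototype and a precancerous prototype incurs a higher loss than one that is clearly separated. The prototype vectors, the temperature, and the function names are illustrative assumptions, not values from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Standard InfoNCE: -log softmax of the positive among all candidates."""
    pos = cosine(anchor, positive) / temperature
    logits = [pos] + [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - pos


# Hypothetical 2-D prototypes for two visually similar tissue classes.
tumor_prototype = [1.0, 0.1]
precancer_prototype = [0.8, 0.6]

# A tumor instance far from the precancerous prototype vs. one close to it.
well_separated = info_nce([1.0, 0.0], tumor_prototype, [precancer_prototype])
entangled = info_nce([0.9, 0.4], tumor_prototype, [precancer_prototype])
print("well separated loss:", well_separated)
print("entangled loss:", entangled)
```

Minimizing such a loss pulls instances toward their own class prototype and away from the similar-looking class, which is the general mechanism that prototype-based disentanglement relies on.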