CVNov 5, 2025
Diffusion-Guided Mask-Consistent Paired Mixing for Endoscopic Image SegmentationPengyu Jie, Wanquan Liu, Rui He et al.
Augmentation for dense prediction typically relies on either sample mixing or generative synthesis. Mixing improves robustness but misaligned masks yield soft label ambiguity. Diffusion synthesis increases apparent diversity but, when trained as common samples, overlooks the structural benefit of mask conditioning and introduces synthetic-real domain shift. We propose a paired, diffusion-guided paradigm that fuses the strengths of both. For each real image, a synthetic counterpart is generated under the same mask and the pair is used as a controllable input for Mask-Consistent Paired Mixing (MCPMix), which mixes only image appearance while supervision always uses the original hard mask. This produces a continuous family of intermediate samples that smoothly bridges synthetic and real appearances under shared geometry, enlarging diversity without compromising pixel-level semantics. To keep learning aligned with real data, Real-Anchored Learnable Annealing (RLA) adaptively adjusts the mixing strength and the loss weight of mixed samples over training, gradually re-anchoring optimization to real data and mitigating distributional bias. Across Kvasir-SEG, PICCOLO, CVC-ClinicDB, a private NPC-LES cohort, and ISIC 2017, the approach achieves state-of-the-art segmentation performance and consistent gains over baselines. The results show that combining label-preserving mixing with diffusion-driven diversity, together with adaptive re-anchoring, yields robust and generalizable endoscopic segmentation.
CVDec 29, 2025
SOFTooth: Semantics-Enhanced Order-Aware Fusion for Tooth Instance SegmentationXiaolan Li, Wanquan Liu, Pengcheng Li et al.
Three-dimensional (3D) tooth instance segmentation remains challenging due to crowded arches, ambiguous tooth-gingiva boundaries, missing teeth, and rare yet clinically important third molars. Native 3D methods relying on geometric cues often suffer from boundary leakage, center drift, and inconsistent tooth identities, especially for minority classes and complex anatomies. Meanwhile, 2D foundation models such as the Segment Anything Model (SAM) provide strong boundary-aware semantics, but directly applying them in 3D is impractical in clinical workflows. To address these issues, we propose SOFTooth, a semantics-enhanced, order-aware 2D-3D fusion framework that leverages frozen 2D semantics without explicit 2D mask supervision. First, a point-wise residual gating module injects occlusal-view SAM embeddings into 3D point features to refine tooth-gingiva and inter-tooth boundaries. Second, a center-guided mask refinement regularizes consistency between instance masks and geometric centroids, reducing center drift. Furthermore, an order-aware Hungarian matching strategy integrates anatomical tooth order and center distance into similarity-based assignment, ensuring coherent labeling even under missing or crowded dentitions. On 3DTeethSeg'22, SOFTooth achieves state-of-the-art overall accuracy and mean IoU, with clear gains on cases involving third molars, demonstrating that rich 2D semantics can be effectively transferred to 3D tooth instance segmentation without 2D fine-tuning.
CVMar 17, 2025
AFR-CLIP: Enhancing Zero-Shot Industrial Anomaly Detection with Stateless-to-Stateful Anomaly Feature RectificationJingyi Yuan, Chenqiang Gao, Pengyu Jie et al.
Recently, zero-shot anomaly detection (ZSAD) has emerged as a pivotal paradigm for industrial inspection and medical diagnostics, detecting defects in novel objects without requiring any target-dataset samples during training. Existing CLIP-based ZSAD methods generate anomaly maps by measuring the cosine similarity between visual and textual features. However, CLIP's alignment with object categories instead of their anomalous states limits its effectiveness for anomaly detection. To address this limitation, we propose AFR-CLIP, a CLIP-based anomaly feature rectification framework. AFR-CLIP first performs image-guided textual rectification, embedding the implicit defect information from the image into a stateless prompt that describes the object category without indicating any anomalous state. The enriched textual embeddings are then compared with two pre-defined stateful (normal or abnormal) embeddings, and their text-on-text similarity yields the anomaly map that highlights defective regions. To further enhance perception to multi-scale features and complex anomalies, we introduce self prompting (SP) and multi-patch feature aggregation (MPFA) modules. Extensive experiments are conducted on eleven anomaly detection benchmarks across industrial and medical domains, demonstrating AFR-CLIP's superiority in ZSAD.