CV | Mar 3, 2025

OFF-CLIP: Improving Normal Detection Confidence in Radiology CLIP with Simple Off-Diagonal Term Auto-Adjustment

Junhyun Park, Chanyu Moon, Donghwan Lee, Kyungsu Kim, Minho Hwang

arXiv:2503.01794v1 | 3.6 | h-index: 3 | MICCAI

Originality: Incremental advance

AI Analysis

This work addresses a specific bottleneck for radiologists in medical imaging by improving zero-shot classification accuracy, though it is an incremental improvement over existing CLIP methods.

The paper tackles the high false-positive and false-negative rates of CLIP-based normal case detection in radiology by proposing OFF-CLIP, which improves normal classification with a 0.61 AUC increase over the state-of-the-art baseline on VinDr-CXR while maintaining abnormal classification performance.

Contrastive Language-Image Pre-Training (CLIP) has enabled zero-shot classification in radiology, reducing reliance on manual annotations. However, conventional contrastive learning struggles with normal case detection due to its strict intra-sample alignment, which disrupts normal sample clustering and leads to high false positives (FPs) and false negatives (FNs). To address these issues, we propose OFF-CLIP, a contrastive learning refinement that improves normal detection by introducing an off-diagonal term loss to enhance normal sample clustering and applying sentence-level text filtering to mitigate FNs by removing misaligned normal statements from abnormal reports. OFF-CLIP can be applied to radiology CLIP models without requiring any architectural modifications. Experimental results show that OFF-CLIP significantly improves normal classification, achieving a 0.61 Area under the curve (AUC) increase on VinDr-CXR over CARZero, the state-of-the-art zero-shot classification baseline, while maintaining or improving abnormal classification performance. Additionally, OFF-CLIP enhances zero-shot grounding by improving pointing game accuracy, confirming better anomaly localization. These results demonstrate OFF-CLIP's effectiveness as a robust and efficient enhancement for medical vision-language models.
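The abstract does not give the exact form of the off-diagonal term loss, so the following is only a minimal sketch of one plausible reading, not the authors' formulation. It assumes a pairwise binary cross-entropy over the image-text similarity matrix in which matched pairs (the diagonal) and all normal-normal pairs (off-diagonal) are treated as positives. The function names, the `is_normal` flag, and the `temperature` value are hypothetical and chosen for illustration.

```python
import numpy as np


def off_diagonal_targets(is_normal):
    """Build a target matrix: 1 on the diagonal and for every normal-normal pair.

    Assumption: normal images and normal reports are treated as mutually
    positive, which is one way to encourage normal samples to cluster.
    """
    is_normal = np.asarray(is_normal, dtype=bool)
    targets = np.eye(len(is_normal))
    normal_pairs = is_normal[:, None] & is_normal[None, :]
    targets[normal_pairs] = 1.0
    return targets


def pairwise_bce_loss(image_emb, text_emb, is_normal, temperature=0.07):
    """Mean binary cross-entropy over all image-text pairs in the batch."""
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature
    targets = off_diagonal_targets(is_normal)
    # Numerically stable BCE with logits:
    # max(x, 0) - x * t + log(1 + exp(-|x|))
    loss = (
        np.maximum(logits, 0)
        - logits * targets
        + np.log1p(np.exp(-np.abs(logits)))
    )
    return loss.mean()
```

Under a standard contrastive objective only the diagonal would be positive, so two normal studies in the same batch would be pushed apart. In this sketch the extra off-diagonal positives remove that penalty for normal-normal pairs, while abnormal samples keep only their own report as a positive.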
