Learning Discriminative Visual-Text Representation for Polyp Re-Identification
This work addresses a domain-specific problem in medical imaging for colorectal cancer diagnosis, offering an incremental improvement by incorporating textual data into existing visual methods.
The paper tackles colonoscopic polyp re-identification by proposing VT-ReID, a method that integrates visual and textual features to improve generalization and reports state-of-the-art results by a clear margin.
Colonoscopic polyp re-identification aims to match a specific polyp in a large gallery captured from different cameras and views, and it plays a key role in the prevention and treatment of colorectal cancer in computer-aided diagnosis. However, traditional methods mainly focus on visual representation learning and neglect the potential of semantic features during training, which can easily lead to poor generalization when the pretrained model is adapted to new scenarios. To address this limitation, we propose a simple but effective training method named VT-ReID, which can remarkably enrich the representation of polyp videos through the interchange of high-level semantic information. Moreover, we design a novel clustering mechanism that introduces prior knowledge from textual data and leverages contrastive learning to promote better separation using abundant unlabeled text data. To the best of our knowledge, this is the first attempt to employ visual-text features with a clustering mechanism for colonoscopic polyp re-identification. Empirical results show that our method significantly outperforms current state-of-the-art methods by a clear margin.
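
To make the idea of contrastive visual-text alignment concrete, the following is a minimal sketch of a generic symmetric InfoNCE objective over paired visual and text embeddings. The abstract does not specify the loss used by VT-ReID, so this should be read as an illustration of the general technique under assumed settings, not as the authors' formulation: the function names, the temperature value, and the toy embeddings are all hypothetical, and the clustering mechanism is not modeled here.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against zero vectors
    return [x / norm for x in vec]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the row maximum for numerical stability
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def visual_text_contrastive_loss(visual, text, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired visual/text embeddings.

    Assumption: visual[i] and text[i] describe the same polyp; every other
    pairing in the batch is treated as a negative. The temperature of 0.07
    is an illustrative default, not a value taken from the paper.
    """
    v = [normalize(x) for x in visual]
    t = [normalize(x) for x in text]
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    transposed = [list(col) for col in zip(*logits)]
    # Average the visual-to-text and text-to-visual directions.
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))

# Toy embeddings (hypothetical): three polyps, each with a matching description.
visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

aligned = visual_text_contrastive_loss(visual, text)
# Rotating the text batch pairs every polyp with the wrong description.
mismatched = visual_text_contrastive_loss(visual, text[1:] + text[:1])
print(f"aligned={aligned:.4f} mismatched={mismatched:.4f}")
```

In this sketch the loss is lower when each visual embedding is paired with its own description than when the pairs are rotated, which is the property a contrastive objective of this kind relies on to separate identities.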