Video Polyp Segmentation: A Deep Learning Perspective
This work addresses the problem of automated polyp detection in colonoscopy videos for medical diagnosis, though it is incremental as it builds on existing segmentation methods with new data and a baseline model.
The authors tackled the lack of large-scale annotated data for video polyp segmentation by introducing the SUN-SEG dataset with 158,690 frames and proposing the PNS+ baseline model, which achieved state-of-the-art performance with real-time inference at 170fps.
We present the first comprehensive video polyp segmentation (VPS) study in the deep learning era. Over the years, developments in VPS are not moving forward with ease due to the lack of large-scale fine-grained segmentation annotations. To address this issue, we first introduce a high-quality frame-by-frame annotated VPS dataset, named SUN-SEG, which contains 158,690 colonoscopy frames from the well-known SUN-database. We provide additional annotations with diverse types, i.e., attribute, object mask, boundary, scribble, and polygon. Second, we design a simple but efficient baseline, dubbed PNS+, consisting of a global encoder, a local encoder, and normalized self-attention (NS) blocks. The global and local encoders receive an anchor frame and multiple successive frames to extract long-term and short-term spatial-temporal representations, which are then progressively updated by two NS blocks. Extensive experiments show that PNS+ achieves the best performance and real-time inference speed (170fps), making it a promising solution for the VPS task. Third, we extensively evaluate 13 representative polyp/object segmentation models on our SUN-SEG dataset and provide attribute-based comparisons. Finally, we discuss several open issues and suggest possible research directions for the VPS community.