EndoGen: Conditional Autoregressive Endoscopic Video Generation
EndoGen addresses the need for dynamic, conditionally guided endoscopic video generation in medical imaging and diagnostics, and represents a novel but incremental advance in the field.
The paper tackles conditional endoscopic video generation, which prior methods did not support. It proposes EndoGen, an autoregressive model with a Spatiotemporal Grid-Frame Patterning strategy and Semantic-Aware Token Masking, and reports high-quality generated content that improves polyp segmentation performance.
Endoscopic video generation is crucial for advancing medical imaging and enhancing diagnostic capabilities. However, prior efforts in this field have either focused on static images, lacking the dynamic context required for practical applications, or have relied on unconditional generation that fails to provide meaningful references for clinicians. Therefore, in this paper, we propose the first conditional endoscopic video generation framework, namely EndoGen. Specifically, we build an autoregressive model with a tailored Spatiotemporal Grid-Frame Patterning (SGP) strategy. It reformulates the generation of multiple frames as a grid-based image generation problem, which effectively capitalizes on the inherent global dependency modeling capabilities of autoregressive architectures. Furthermore, we propose a Semantic-Aware Token Masking (SAT) mechanism, which enhances the model's ability to produce rich and diverse content by selectively focusing on semantically meaningful regions during the generation process. Through extensive experiments, we demonstrate the effectiveness of our framework in generating high-quality, conditionally guided endoscopic content, and show that it improves performance on the downstream task of polyp segmentation. Code is released at https://www.github.com/CUHK-AIM-Group/EndoGen.
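To make the grid-based reformulation behind SGP concrete, the following is a minimal sketch of tiling a short clip into a single grid image and recovering the frames from it. It is an illustration under stated assumptions, not the released implementation: the function names `frames_to_grid` and `grid_to_frames`, the row-major frame ordering, and the grid shape are all hypothetical choices, and the abstract does not specify the layout EndoGen actually uses.

```python
import numpy as np


def frames_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile a clip of shape (T, H, W, C) into one (rows*H, cols*W, C) image.

    Frames are placed in row-major order (an assumption for illustration).
    """
    t, h, w, c = frames.shape
    if t != rows * cols:
        raise ValueError(f"expected {rows * cols} frames, got {t}")
    # (T, H, W, C) -> (rows, cols, H, W, C) -> (rows, H, cols, W, C)
    grid = frames.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c)


def grid_to_frames(grid: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Invert frames_to_grid: split a grid image back into (T, H, W, C)."""
    gh, gw, c = grid.shape
    if gh % rows or gw % cols:
        raise ValueError("grid size must be divisible by rows and cols")
    h, w = gh // rows, gw // cols
    # (rows*H, cols*W, C) -> (rows, H, cols, W, C) -> (rows, cols, H, W, C)
    frames = grid.reshape(rows, h, cols, w, c).transpose(0, 2, 1, 3, 4)
    return frames.reshape(rows * cols, h, w, c)


if __name__ == "__main__":
    # Hypothetical 4-frame clip of 2x3 single-channel frames.
    clip = np.arange(4 * 2 * 3 * 1).reshape(4, 2, 3, 1)
    grid = frames_to_grid(clip, rows=2, cols=2)
    print(grid.shape)
    print(np.array_equal(grid_to_frames(grid, 2, 2), clip))
```

Under this layout, an image-level autoregressive model would see all frames of the clip in one token sequence, which is one way to read the claim that the grid pattern lets the model exploit its global dependency modeling across time as well as space.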