Automatic Report Generation for Histopathology images using pre-trained Vision Transformers
This work addresses a domain-specific problem in medical imaging by enabling report generation from whole slide images, though the contribution is incremental because it builds on existing pre-trained models.
The paper tackles automatic report generation for high-resolution histopathology images with a two-step process: a pre-trained Vision Transformer first encodes patches of the whole slide image, and the same model then serves as the encoder alongside an LSTM decoder that generates the report, yielding a fairly performant and portable mechanism.
Deep learning has been used successfully in histopathology for disease classification, image segmentation, and more. However, combining image and text modalities with current state-of-the-art methods remains difficult because of the high resolution of histopathology images. Automatic report generation for histopathology images is one such challenge. In this work, we show that an existing pre-trained Vision Transformer can be used in a two-step process: it first encodes 4096x4096 patches of the Whole Slide Image (WSI), and it then serves as the encoder, paired with an LSTM decoder, for report generation. The result is a fairly performant and portable report generation mechanism that takes into account the whole high-resolution image rather than only individual patches. We also show that representations from an existing powerful pre-trained hierarchical vision transformer are useful not only for zero-shot classification but also for report generation.
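As a rough illustration of the first step, the sketch below enumerates the 4096x4096 regions that would cover a slide of a given size before each region is passed to the pre-trained encoder. Only the 4096 region size comes from the abstract; the function name `region_grid`, the choice to clip partial regions at the slide border, and the example slide dimensions are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

PATCH_SIZE = 4096  # region size stated in the abstract

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def region_grid(width: int, height: int, patch: int = PATCH_SIZE) -> List[Box]:
    """Return row-major boxes that tile a width x height slide.

    Assumption: regions at the right and bottom borders are clipped to the
    slide bounds. Padding or discarding partial regions are equally
    plausible choices; the abstract does not say which is used.
    """
    if width <= 0 or height <= 0 or patch <= 0:
        raise ValueError("width, height and patch must be positive")

    boxes: List[Box] = []
    for y in range(0, height, patch):
        for x in range(0, width, patch):
            # Clip the box so it never extends past the slide border.
            boxes.append((x, y, min(patch, width - x), min(patch, height - y)))
    return boxes


if __name__ == "__main__":
    # Hypothetical slide dimensions, chosen only to show a clipped border.
    for box in region_grid(10000, 5000):
        print(box)
```

Each box would then be read from the slide and encoded, and the resulting sequence of region embeddings is what a decoder such as an LSTM could consume to produce the report text.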