Juan I. Pisula

h-index: 4
2 papers

CV · Nov 14, 2022
Language models are good pathologists: using attention-based sequence reduction and text-pretrained transformers for efficient WSI classification

Juan I. Pisula, Katarzyna Bozek

In digital pathology, Whole Slide Image (WSI) analysis is usually formulated as a Multiple Instance Learning (MIL) problem. Although transformer-based architectures have been used for WSI classification, these methods require modifications to adapt them to the specific challenges of this type of image data. Among these challenges is the amount of memory and compute required by deep transformer models to process long inputs, such as the thousands of image patches that can compose a WSI at $\times 10$ or $\times 20$ magnification. We introduce \textit{SeqShort}, a multi-head attention-based sequence shortening layer that summarizes each WSI in a short, fixed-size sequence of instances. This allows us to reduce the computational cost of self-attention on long sequences and to include positional information that is unavailable in other MIL approaches. Furthermore, we show that WSI classification performance can be improved when the downstream transformer architecture has been pre-trained on a large corpus of text data and less than 0.1\% of its parameters are fine-tuned. We demonstrate the effectiveness of our method in lymph node metastases classification and cancer subtype classification tasks, without the need to design a WSI-specific transformer or perform in-domain pre-training, while keeping a reduced compute budget and a low number of trainable parameters.
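
The idea of shortening a long patch sequence with multi-head attention can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes that a small set of query vectors (here random, in practice learned) cross-attends over all patch features, and it omits the positional information mentioned in the abstract. The function name `shorten_sequence`, the weight matrices and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shorten_sequence(patches, queries, w_q, w_k, w_v, num_heads):
    """Cross-attend S query vectors over N patch features, returning (S, d)."""
    n, d = patches.shape
    s = queries.shape[0]
    head_dim = d // num_heads
    # Project, then split the feature dimension into heads: (heads, length, head_dim)
    q = (queries @ w_q).reshape(s, num_heads, head_dim).transpose(1, 0, 2)
    k = (patches @ w_k).reshape(n, num_heads, head_dim).transpose(1, 0, 2)
    v = (patches @ w_v).reshape(n, num_heads, head_dim).transpose(1, 0, 2)
    # Attention weights have shape (heads, S, N), so the cost grows linearly in N
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim), axis=-1)
    out = attn @ v  # (heads, S, head_dim)
    # Merge the heads back into a single feature dimension
    return out.transpose(1, 0, 2).reshape(s, d)

rng = np.random.default_rng(0)
d, num_heads, n_patches, short_len = 64, 4, 2000, 128  # hypothetical sizes
patches = rng.normal(size=(n_patches, d))   # stand-in for patch feature vectors
queries = rng.normal(size=(short_len, d))   # assumption: learned in practice
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

short = shorten_sequence(patches, queries, w_q, w_k, w_v, num_heads)
print(short.shape)
```

Under these assumptions the output length depends only on the number of queries, so a downstream transformer would run self-attention over 128 tokens regardless of how many patches the slide contains.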

CV · Mar 8, 2024
Fine-tuning a Multiple Instance Learning Feature Extractor with Masked Context Modelling and Knowledge Distillation

Juan I. Pisula, Katarzyna Bozek

The first step in Multiple Instance Learning (MIL) algorithms for Whole Slide Image (WSI) classification consists of tiling the input image into smaller patches and computing their feature vectors with a pre-trained feature extractor model. Feature extractor models that were pre-trained with supervision on ImageNet have proven to transfer well to this domain; however, this pre-training task does not take into account that visual information in neighboring patches is highly correlated. Based on this observation, we propose to improve downstream MIL classification performance by fine-tuning the feature extractor model using \textit{Masked Context Modelling with Knowledge Distillation}. In this task, the feature extractor model is fine-tuned by predicting masked patches in a larger context window. Since reconstructing the input image would require a powerful image generation model, and our goal is not to generate realistic-looking image patches, we instead predict the feature vectors produced by a larger teacher network. A single epoch of the proposed task suffices to increase the downstream performance of the feature extractor model when used in a MIL scenario. The fine-tuned model can even outperform the downstream performance of the teacher model, while being considerably smaller and requiring a fraction of its compute.
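
The training objective described above can be sketched as a loss computed only at masked positions of a context window. This is an illustration under stated assumptions, not the authors' method: the abstract does not name the loss, so mean squared error is assumed here; the predictor is a trivial placeholder (the mean of the visible features) rather than a trained network; and `masked_distillation_loss`, the window size and the masking ratio are all hypothetical.

```python
import numpy as np

def masked_distillation_loss(student_pred, teacher_feats, mask):
    """Mean squared error against teacher features, at masked positions only."""
    diff = student_pred[mask] - teacher_feats[mask]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
num_patches, dim = 16, 32  # e.g. a 4x4 context window (hypothetical sizes)
teacher_feats = rng.normal(size=(num_patches, dim))  # stand-in for teacher outputs
student_feats = rng.normal(size=(num_patches, dim))  # stand-in for student outputs

# Randomly mask a quarter of the patches in the window (assumed ratio)
mask = np.zeros(num_patches, dtype=bool)
mask[rng.choice(num_patches, size=num_patches // 4, replace=False)] = True

# Placeholder predictor: fill each masked slot with the mean of the visible
# student features. A real setup would use a trained context model here.
pred = student_feats.copy()
pred[mask] = student_feats[~mask].mean(axis=0)

loss = masked_distillation_loss(pred, teacher_feats, mask)
print(loss)
```

Because the target is a teacher feature vector rather than pixels, the loss stays in feature space, which matches the abstract's motivation for avoiding an image generation model.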