Domain Fine-Tuning FinBERT on Finnish Histopathological Reports: Train-Time Signals and Downstream Correlations
For NLP practitioners working with low-resource medical text in Finnish, this paper offers observations on domain fine-tuning dynamics, but the results are preliminary and lack quantitative validation.
The authors fine-tuned Finnish BERT on Finnish histopathological reports and attempted to predict downstream task performance from embedding geometry changes during fine-tuning, but no concrete performance numbers are reported.
In NLP classification tasks where little labeled data exists, domain fine-tuning of transformer models on unlabeled data is an established approach. In this paper we have two aims. (1) We describe our observations from fine-tuning the Finnish BERT model on Finnish medical text data. (2) We report on our attempts to predict the benefit of domain-specific pre-training of Finnish BERT from observing the geometry of embedding changes due to domain fine-tuning. Our driving motivation is the common situation in healthcare AI where we might experience long delays in acquiring datasets, especially with respect to labels.
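As a rough illustration of what "observing the geometry of embedding changes" could mean in practice, the sketch below computes the mean cosine distance between embeddings of the same inputs before and after fine-tuning. This is an assumption made for illustration only: the abstract does not state which geometric measure was used, and the function names (`cosine_similarity`, `mean_embedding_shift`) and the toy vectors are hypothetical, not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_embedding_shift(base_embeddings, tuned_embeddings):
    """Mean cosine distance (1 - cosine similarity) over matched embedding pairs.

    Hypothetical measure: base_embeddings[i] and tuned_embeddings[i] are assumed
    to be embeddings of the same input from the base and the fine-tuned model.
    """
    if len(base_embeddings) != len(tuned_embeddings):
        raise ValueError("embedding lists must have the same length")
    if not base_embeddings:
        raise ValueError("at least one embedding pair is required")
    distances = [
        1.0 - cosine_similarity(b, t)
        for b, t in zip(base_embeddings, tuned_embeddings)
    ]
    return sum(distances) / len(distances)


# Toy 2-dimensional vectors standing in for real model embeddings.
base = [[1.0, 0.0], [0.0, 1.0]]
tuned = [[1.0, 0.0], [1.0, 0.0]]  # second vector rotated by 90 degrees
shift = mean_embedding_shift(base, tuned)
```

With real models, the two lists would hold embeddings of the same inputs produced by the base and the domain fine-tuned checkpoints. Whether a shift of this kind predicts downstream benefit is the open question the paper raises, not something this sketch establishes.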