CL, AI, LG | Dec 20, 2022

Localising In-Domain Adaptation of Transformer-Based Biomedical Language Models

arXiv:2212.10422v3 | 33 citations | h-index: 58
Originality: Synthesis-oriented
AI Analysis

This work addresses the gap in biomedical NLP for less-resourced languages, enabling local medical institutions to leverage language models for improved patient care, though it is incremental in adapting existing methods to new linguistic contexts.

The paper tackles the problem of adapting biomedical language models to less-resourced languages such as Italian, where large-scale in-domain data is often unavailable. It compares two approaches: training on machine-translated English data and training on a native Italian corpus. The results show that data quantity is a harder constraint than data quality for adaptation, but that adding high-quality data can still improve performance when corpora are limited in size.

In the era of digital healthcare, the huge volumes of textual information generated every day in hospitals constitute an essential but underused asset that could be exploited with task-specific, fine-tuned biomedical language representation models, improving patient care and management. For such specialized domains, previous research has shown that fine-tuning models stemming from broad-coverage checkpoints can largely benefit from additional training rounds over large-scale in-domain resources. However, these resources are often out of reach for less-resourced languages like Italian, preventing local medical institutions from employing in-domain adaptation. To reduce this gap, our work investigates two accessible approaches to derive biomedical language models in languages other than English, taking Italian as a concrete use case: one based on neural machine translation of English resources, favoring quantity over quality; the other based on a high-grade, narrow-scoped corpus natively written in Italian, thus preferring quality over quantity. Our study shows that data quantity is a harder constraint than data quality for biomedical adaptation, but the concatenation of high-quality data can improve model performance even when dealing with relatively size-limited corpora. The models published from our investigations have the potential to unlock important research opportunities for Italian hospitals and academia. Finally, the lessons learned from the study offer valuable insights towards building biomedical language models that generalize to other less-resourced languages and different domain settings.
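The concatenation mentioned in the abstract can be pictured with a minimal sketch. It is not the authors' pipeline: the function names, the shuffling step, and the whitespace token count are assumptions made only for illustration. It joins a large machine-translated corpus with a small native one and reports how much of the mix each source contributes.

```python
import random


def build_adaptation_corpus(translated_docs, native_docs, seed=0):
    """Concatenate both sources and shuffle them (hypothetical helper).

    Each document is tagged with its source so the mix can be inspected.
    """
    corpus = [(text, "translated") for text in translated_docs]
    corpus += [(text, "native") for text in native_docs]
    # Shuffling is an assumption: it keeps training batches from being
    # dominated by a single source.
    random.Random(seed).shuffle(corpus)
    return corpus


def source_shares(corpus):
    """Fraction of whitespace-separated tokens contributed by each source."""
    counts = {}
    for text, source in corpus:
        counts[source] = counts.get(source, 0) + len(text.split())
    total = sum(counts.values())
    if total == 0:
        return {source: 0.0 for source in counts}
    return {source: n / total for source, n in counts.items()}


if __name__ == "__main__":
    # Toy placeholder strings, not real corpus data.
    translated = ["testo tradotto automaticamente dall inglese"] * 8
    native = ["testo clinico scritto in italiano"] * 2
    mixed = build_adaptation_corpus(translated, native)
    print(len(mixed), source_shares(mixed))
```

A check like `source_shares` makes the quantity versus quality trade-off visible before any training run: it shows how small a share of the mix the native, higher-quality corpus contributes.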

Code Implementations: 1 repo
