CL, AI
Oct 11, 2022

Enriching Biomedical Knowledge for Low-resource Language Through Large-Scale Translation

arXiv:2210.05598v3, 269 citations, h-index: 27
Originality: Synthesis-oriented
AI Analysis

This paper addresses the shortage of biomedical resources for low-resource language communities. Its contribution is incremental, since it applies existing translation methods to new data.

The paper tackles the lack of biomedical data in low-resource languages such as Vietnamese by using a state-of-the-art translation model to produce both pretraining and supervised data. The resulting model, ViPubmedT5, achieves state-of-the-art results on two biomedical benchmarks, one in summarization and one in acronym disambiguation.

Biomedical data and benchmarks are highly valuable yet very limited in low-resource languages such as Vietnamese. In this paper, we use a state-of-the-art English-Vietnamese translation model to translate and produce both pretraining and supervised data in the biomedical domain. Building on this large-scale translation, we introduce ViPubmedT5, a pretrained encoder-decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus. ViPubmedT5 demonstrates state-of-the-art results on two different biomedical benchmarks in summarization and acronym disambiguation. Further, we release ViMedNLI, a new NLP task in Vietnamese translated from MedNLI using the recently released English-Vietnamese translation model and carefully refined by human experts, with evaluations of existing methods against ViPubmedT5.

Code Implementations: 1 repo
