CL AI Oct 11, 2022

Enriching Biomedical Knowledge for Low-resource Language Through Large-Scale Translation

Long Phan, Tai Dang, Hieu Tran, Trieu H. Trinh, Vy Phan, Lam D. Chau, Minh-Thang Luong

arXiv:2210.05598v323.2269 citations h-index: 27 Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the shortage of biomedical resources for low-resource language communities, though the contribution is incremental: it applies existing translation methods to new data.

The paper tackles the lack of biomedical data in low-resource languages such as Vietnamese by using a state-of-the-art translation model to produce pretraining and supervised data. The resulting model, ViPubmedT5, achieves state-of-the-art results on two biomedical benchmarks, in summarization and acronym disambiguation.

Biomedical data and benchmarks are highly valuable yet very limited in low-resource languages such as Vietnamese. In this paper, we use a state-of-the-art English-Vietnamese translation model to translate and produce both pretraining and supervised data in the biomedical domain. Thanks to this large-scale translation, we introduce ViPubmedT5, a pretrained Encoder-Decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus. ViPubmedT5 demonstrates state-of-the-art results on two different biomedical benchmarks in summarization and acronym disambiguation. Further, we release ViMedNLI, a new NLP task in Vietnamese translated from MedNLI using the recently released En-Vi translation model and carefully refined by human experts, with evaluations of existing methods against ViPubmedT5.
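The translation step described in the abstract can be pictured roughly as the sketch below. It is a minimal illustration under stated assumptions, not the authors' code: `translate_corpus`, `batched`, `TranslatedExample`, and `placeholder_translate` are hypothetical names, and the placeholder only tags each string instead of calling a real English-Vietnamese model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class TranslatedExample:
    source_en: str  # original English text
    target_vi: str  # machine-translated Vietnamese text


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def translate_corpus(
    abstracts: Iterable[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 2,
) -> Iterator[TranslatedExample]:
    """Translate abstracts batch by batch, keeping each source/target pair."""
    for batch in batched(abstracts, batch_size):
        translations = translate_batch(batch)
        if len(translations) != len(batch):
            raise ValueError("translator must return one output per input")
        for en, vi in zip(batch, translations):
            yield TranslatedExample(source_en=en, target_vi=vi)


def placeholder_translate(batch: List[str]) -> List[str]:
    # Hypothetical stand-in for a real En-Vi model: it only tags the text.
    return [f"[vi] {text}" for text in batch]


if __name__ == "__main__":
    corpus = ["Abstract one.", "Abstract two.", "Abstract three."]
    for example in translate_corpus(corpus, placeholder_translate):
        print(example.source_en, "->", example.target_vi)
```

Passing the translator in as a callable keeps the batching logic independent of any particular model, so a real translation model could be swapped in without changing the rest of the sketch.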
