PEACH: A sentence-aligned Parallel English-Arabic Corpus for Healthcare
PEACH addresses the need for high-quality bilingual healthcare data among researchers in contrastive linguistics, translation studies, and NLP, but the contribution is incremental because it is limited to a single domain.
The paper introduces PEACH, a manually aligned parallel English-Arabic corpus of healthcare texts comprising 51,671 parallel sentences, with approximately 590,517 English and 567,707 Arabic word tokens, designed to support research in linguistics, translation, and NLP.
This paper introduces PEACH, a sentence-aligned parallel English-Arabic corpus of healthcare texts encompassing patient information leaflets and educational materials. The corpus contains 51,671 parallel sentences, totaling approximately 590,517 English and 567,707 Arabic word tokens. Average sentence length ranges from 9.52 to 11.83 words. Because it is manually aligned, PEACH is a gold-standard corpus that supports researchers in contrastive linguistics, translation studies, and natural language processing. It can be used to derive bilingual lexicons, adapt large language models for domain-specific machine translation, evaluate user perceptions of machine translation in healthcare, assess the readability and lay-friendliness of patient information leaflets and educational materials, and serve as an educational resource in translation studies. PEACH is publicly accessible.
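As a rough illustration of how the reported figures relate to each other, the following sketch computes sentence counts, token totals, and mean sentence length for a list of aligned sentence pairs. It is a minimal sketch under stated assumptions: the `corpus_stats` helper and the two sample pairs are hypothetical, the corpus's actual file layout is not described here, and whitespace tokenization may differ from whatever tokenizer produced the published counts.

```python
def corpus_stats(pairs):
    """Summarize a list of (english, arabic) aligned sentence pairs.

    Assumption: tokens are counted by whitespace splitting, which may not
    match the tokenization behind the published PEACH figures.
    """
    n = len(pairs)
    en_tokens = sum(len(en.split()) for en, _ in pairs)
    ar_tokens = sum(len(ar.split()) for _, ar in pairs)
    return {
        "sentences": n,
        "en_tokens": en_tokens,
        "ar_tokens": ar_tokens,
        "en_mean_len": en_tokens / n if n else 0.0,
        "ar_mean_len": ar_tokens / n if n else 0.0,
    }


# Hypothetical sample pairs for illustration only; not taken from the corpus.
sample = [
    ("Take one tablet twice a day.", "تناول قرصًا واحدًا مرتين في اليوم."),
    ("Keep out of the reach of children.", "يحفظ بعيدًا عن متناول الأطفال."),
]
stats = corpus_stats(sample)

# Corpus-wide means implied by the totals reported in the abstract.
SENTENCES = 51_671
EN_TOKENS = 590_517
AR_TOKENS = 567_707
overall_en_mean = EN_TOKENS / SENTENCES
overall_ar_mean = AR_TOKENS / SENTENCES

if __name__ == "__main__":
    print(stats)
    print(f"overall EN mean: {overall_en_mean:.2f}")
    print(f"overall AR mean: {overall_ar_mean:.2f}")
```

Dividing the reported token totals by the sentence count gives corpus-wide means that fall inside the 9.52 to 11.83 range stated in the abstract, which is consistent with that range describing averages over subsets of the corpus, although the abstract does not say which subsets.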