CL | Jan 2, 2023

Statistical Machine Translation for Indic Languages

arXiv:2301.00539v1 | 20 citations | h-index: 15 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses machine translation for low-resource Indic languages, but it is incremental as it applies existing SMT methods to new language pairs.

The paper develops bilingual statistical machine translation models for translating between English and fifteen low-resource Indian languages, applies preprocessing to handle dataset noise, and evaluates translation quality with the BLEU, METEOR, and RIBES metrics.

A Machine Translation (MT) system generally aims to automatically render a source language into a target language while retaining the original context, using various Natural Language Processing (NLP) techniques. Among these NLP methods, Statistical Machine Translation (SMT) uses probabilistic and statistical techniques to analyze information and perform the conversion. This paper discusses the development of bilingual SMT models for translating English into fifteen low-resource Indian Languages (ILs) and vice versa. At the outset, all 15 languages are briefly described in relation to our experimental needs. Further, a detailed analysis of the Samanantar and OPUS datasets for model building, along with the standard benchmark dataset (Flores-200) for fine-tuning and testing, is carried out as part of our experiment. Different preprocessing approaches are proposed in this paper to handle noise in the datasets. To create the system, the MOSES open-source SMT toolkit is explored. Distance reordering is utilized with the aim of understanding the rules of grammar and context-dependent adjustments through a phrase reordering categorization framework. In our experiment, the quality of the translation is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
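
The abstract does not spell out the reordering model, so the following is only a minimal sketch of the distance-based distortion penalty commonly used in phrase-based SMT, where each translated phrase is penalized by how far its source span jumps from the end of the previously translated phrase. The function names, the span representation, and the decay value `alpha=0.5` are illustrative assumptions, not the paper's configuration or the MOSES API.

```python
def distortion_distance(prev_end: int, cur_start: int) -> int:
    """Source words skipped (forward or backward) between two phrases.

    prev_end:  index of the last source word of the previous phrase
    cur_start: index of the first source word of the current phrase
    """
    return abs(cur_start - prev_end - 1)


def total_distortion(phrase_spans: list[tuple[int, int]]) -> int:
    """Sum of distortion distances over a sequence of source spans.

    phrase_spans lists inclusive, 0-indexed (start, end) source spans
    in the order in which they are translated (target order).
    """
    total = 0
    prev_end = -1  # position just before the first source word
    for start, end in phrase_spans:
        total += distortion_distance(prev_end, start)
        prev_end = end
    return total


def reordering_cost(phrase_spans: list[tuple[int, int]], alpha: float = 0.5) -> float:
    """Exponentially decaying penalty alpha ** (total distortion).

    alpha is a hypothetical decay parameter in (0, 1]; a real system
    would tune the corresponding feature weight on held-out data.
    """
    return alpha ** total_distortion(phrase_spans)


if __name__ == "__main__":
    monotone = [(0, 1), (2, 2), (3, 4)]   # phrases translated left to right
    reordered = [(2, 2), (0, 1), (3, 4)]  # middle phrase translated first
    print(total_distortion(monotone), reordering_cost(monotone))
    print(total_distortion(reordered), reordering_cost(reordered))
```

A monotone translation incurs no penalty, while any jump in the source order lowers the score, which is why a purely distance-based model tends to favor local reorderings over long-range ones.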

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
