A new hybrid metric for verifying parallel corpora of Arabic-English
This work addresses noise and mistranslations in Arabic-English parallel corpora. The contribution is incremental, since it builds on existing techniques.
The paper tackles the problem of verifying translation quality in Arabic-English parallel corpora by proposing a new hybrid metric that combines a sentence-length technique with a compression-code-length technique. The combined metric identifies satisfactory and unsatisfactory sentence pairs more accurately than either technique alone.
This paper discusses a new metric that has been applied to verify the translation quality of sentence pairs in Arabic-English parallel corpora. The metric combines two techniques, one based on sentence length and the other based on compression code length. Experiments on sample Arabic-English test corpora indicate that combining the two techniques identifies satisfactory and unsatisfactory sentence pairs more accurately than either sentence length or compression code length alone. The proposed method is effective at filtering noise and reducing mistranslations, which greatly improves corpus quality.
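As a rough illustration of how a hybrid score of this kind might be computed, the Python sketch below combines a character-length ratio with a compressed-size ratio. The abstract does not give the actual formulas, compressor, weighting, or decision threshold, so everything here is an assumption: the function names, the use of `zlib` as a stand-in compressor, the equal weighting, and the 0.6 threshold are hypothetical choices, not the authors' method.

```python
import zlib


def length_ratio(src: str, tgt: str) -> float:
    """Shorter-to-longer ratio of sentence lengths in characters (0.0 to 1.0)."""
    a, b = len(src), len(tgt)
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)


def code_length(text: str) -> int:
    """Compressed size in bytes. zlib is an assumed stand-in compressor."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def code_length_ratio(src: str, tgt: str) -> float:
    """Shorter-to-longer ratio of the two compressed sizes (0.0 to 1.0)."""
    a, b = code_length(src), code_length(tgt)
    return min(a, b) / max(a, b)


def hybrid_score(src: str, tgt: str, weight: float = 0.5) -> float:
    """Weighted mix of the two ratios. The linear combination is an assumption."""
    return weight * length_ratio(src, tgt) + (1.0 - weight) * code_length_ratio(src, tgt)


def is_satisfactory(src: str, tgt: str, threshold: float = 0.6, weight: float = 0.5) -> bool:
    """Flag a sentence pair as satisfactory when its score reaches the
    (hypothetical) threshold."""
    return hybrid_score(src, tgt, weight) >= threshold


if __name__ == "__main__":
    pair = ("ذهب الولد إلى المدرسة", "The boy went to school")
    print(hybrid_score(*pair), is_satisfactory(*pair))
```

On very short sentences the fixed overhead of the zlib format dominates the compressed size, so a real experiment would likely need a different compressor, and the weight and threshold would have to be tuned on labeled sentence pairs.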