Learning synchronous context-free grammars with multiple specialised non-terminals for hierarchical phrase-based translation
This work improves translation models for specific language pairs, but it is incremental as it refines an existing approach.
The paper addresses hierarchical phrase-based statistical machine translation, where standard synchronous context-free grammars rely on a single, overloaded non-terminal; splitting that non-terminal into more specialised symbols yields a statistically significant improvement in BLEU score.
Translation models based on hierarchical phrase-based statistical machine translation (HSMT) have outperformed their non-hierarchical phrase-based counterparts for some language pairs. The standard approach to HSMT learns and applies a synchronous context-free grammar with a single non-terminal. The hypothesis behind the grammar refinement algorithm presented in this work is that this single non-terminal is overloaded and insufficiently discriminative, and that an adequate split of it into more specialised symbols could therefore lead to improved models. This paper presents a method to learn synchronous context-free grammars with a very large number of initial non-terminals, which are then grouped via a clustering algorithm. Our experiments show that the resulting smaller set of non-terminals captures the contextual information needed to achieve a statistically significant improvement in BLEU score over the standard HSMT approach.
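To make the grouping step concrete, the following toy sketch clusters a handful of fine-grained non-terminals and relabels a synchronous rule with the resulting smaller symbol set. It is an illustration under stated assumptions, not the method of the paper: the abstract does not say which clustering algorithm or which features are used, so the greedy agglomerative merging, the cosine similarity, the context-count vectors, and all names (`N1` to `N4`, `cluster_nonterminals`, `relabel`) are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_nonterminals(features, k):
    """Greedily merge the two most similar clusters until only k remain.

    `features` maps each initial non-terminal to a hypothetical vector of
    context counts. Returns a mapping from initial to clustered symbols.
    """
    clusters = [([name], list(vec)) for name, vec in sorted(features.items())]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged_names = clusters[i][0] + clusters[j][0]
        merged_vec = [a + b for a, b in zip(clusters[i][1], clusters[j][1])]
        clusters[i] = (merged_names, merged_vec)
        del clusters[j]
    mapping = {}
    for idx, (names, _) in enumerate(clusters):
        for name in names:
            mapping[name] = f"X{idx}"
    return mapping


def relabel(rule, mapping):
    """Rewrite a rule (lhs, source side, target side) with clustered symbols."""
    lhs, src, tgt = rule

    def sub(sym):
        # Terminals are not in the mapping and pass through unchanged.
        return mapping.get(sym, sym)

    return (sub(lhs), tuple(sub(s) for s in src), tuple(sub(t) for t in tgt))


# Hypothetical context counts for four initial non-terminals.
features = {
    "N1": [9, 1, 0],
    "N2": [8, 2, 0],
    "N3": [0, 1, 9],
    "N4": [1, 0, 8],
}
mapping = cluster_nonterminals(features, k=2)
print(relabel(("N2", ("the", "N4"), ("le", "N4")), mapping))
```

With these made-up counts, `N1` and `N2` end up sharing one symbol and `N3` and `N4` another, so the grammar keeps two specialised non-terminals instead of the single generic one used in standard HSMT.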