gENder-IT: An Annotated English-Italian Parallel Challenge Set for Cross-Linguistic Natural Gender Phenomena
gENder-IT provides a domain-specific resource for machine translation researchers and developers working on natural gender phenomena. The contribution is incremental: it introduces a new dataset rather than advancing methods.
The paper addresses the lack of resources for resolving natural gender ambiguities in machine translation. It introduces gENder-IT, an English-Italian parallel challenge set with word-level gender tags and alternative translations, a dataset that targets cross-linguistic differences in gender marking.
Languages differ in whether gender features are present, in the number of gender classes, and in whether and where gender is explicitly marked. These cross-linguistic differences can lead to ambiguities that are difficult to resolve, especially for sentence-level MT systems. Identifying and then resolving such ambiguity is a challenging task for which no specific resources or challenge sets are currently available. In this paper, we introduce gENder-IT, an English-Italian challenge set focusing on the resolution of natural gender phenomena. It provides word-level gender tags on the English source side and, where needed, multiple gender-alternative translations on the Italian target side.
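To make the shape of such an entry concrete, the following is a minimal sketch of how one source sentence with word-level gender tags and its alternative translations might be represented. Everything in it is an assumption for illustration: the tag labels ("F", "M", "AMBIGUOUS"), the class names `TaggedToken` and `ChallengeEntry`, and the example sentence are hypothetical and are not taken from the gENder-IT release, whose actual annotation format may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical tag inventory; the labels used in gENder-IT itself may differ.
GENDER_TAGS = {"F", "M", "AMBIGUOUS"}


@dataclass
class TaggedToken:
    """One English source word with an optional natural gender tag."""
    text: str
    gender: Optional[str] = None  # None for words that carry no gender tag

    def __post_init__(self) -> None:
        if self.gender is not None and self.gender not in GENDER_TAGS:
            raise ValueError(f"unknown gender tag: {self.gender}")


@dataclass
class ChallengeEntry:
    """A tagged English sentence plus one Italian translation per gender reading."""
    source: List[TaggedToken]
    targets: Dict[str, str] = field(default_factory=dict)

    def needs_alternatives(self) -> bool:
        # Alternatives are needed when the source leaves a referent's gender open.
        return any(tok.gender == "AMBIGUOUS" for tok in self.source)


# Illustrative entry: "I" is ambiguous in English, but the Italian
# adjective must be marked as masculine ("stanco") or feminine ("stanca").
entry = ChallengeEntry(
    source=[TaggedToken("I", "AMBIGUOUS"), TaggedToken("am"), TaggedToken("tired")],
    targets={"F": "Sono stanca.", "M": "Sono stanco."},
)
```

In this sketch a sentence-level system has no basis for choosing between the two targets, which is why both are stored; an entry whose source tokens are all unambiguous would carry a single translation.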