Statistical and Neural Methods for Cross-lingual Entity Label Mapping in Knowledge Graphs
This work addresses a data consistency problem for machine translation systems that rely on multilingual knowledge bases, though the contribution appears incremental, since it applies existing alignment techniques to this specific task.
The paper tackles the problem of inconsistent cross-lingual entity labels in knowledge graphs such as Wikidata, which hampers machine translation. It finds that combining word and sentence alignment techniques with a matching algorithm improves label mapping by up to 20 F1-score points, with sentence embedding methods performing best, even across scripts.
Knowledge bases such as Wikidata amass vast amounts of named entity information, such as multilingual labels, which can be extremely useful for various multilingual and cross-lingual applications. However, such labels are not guaranteed to match across languages from an information consistency standpoint, greatly compromising their usefulness for fields such as machine translation. In this work, we investigate the application of word and sentence alignment techniques coupled with a matching algorithm to align cross-lingual entity labels extracted from Wikidata in 10 languages. Our results indicate that mapping between Wikidata's main labels stands to be considerably improved (up to $20$ points in F1-score) by any of the employed methods. We show how methods relying on sentence embeddings outperform all others, even across different scripts. We believe the application of such techniques to measure the similarity of label pairs, coupled with a knowledge base rich in high-quality entity labels, to be an excellent asset to machine translation.
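The general recipe of scoring label pairs by similarity, matching them one-to-one, and evaluating with F1 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the paper's implementation: character trigram cosine similarity stands in for the word and sentence alignment methods (it would not work across scripts, where sentence embeddings would be needed), the greedy matcher stands in for the unspecified matching algorithm, and the function names, the threshold value, and the toy labels are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Bag of character n-grams; a crude stand-in for a sentence embedding."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def greedy_match(src_labels, tgt_labels, threshold=0.3):
    """One-to-one matching: repeatedly take the best-scoring unused pair.

    The threshold is an assumed cut-off below which labels stay unmatched.
    """
    scored = sorted(
        ((cosine(char_ngrams(s), char_ngrams(t)), s, t)
         for s in src_labels for t in tgt_labels),
        reverse=True,
    )
    used_src, used_tgt, pairs = set(), set(), []
    for score, s, t in scored:
        if score < threshold:
            break  # remaining candidates are all below the cut-off
        if s not in used_src and t not in used_tgt:
            pairs.append((s, t))
            used_src.add(s)
            used_tgt.add(t)
    return pairs


def f1_score(predicted, gold):
    """F1 over sets of (source label, target label) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy same-script labels, invented for this example.
    english = ["Lisbon", "Porto"]
    portuguese = ["Porto", "Lisboa"]
    gold = [("Lisbon", "Lisboa"), ("Porto", "Porto")]

    predicted = greedy_match(english, portuguese)
    print(predicted)
    print(f1_score(predicted, gold))
```

Swapping `char_ngrams` and `cosine` for a multilingual sentence encoder, and the greedy matcher for an optimal assignment solver, would leave the overall structure unchanged.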