Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents
This work addresses the need for tools that identify semantic differences between documents, which could benefit applications such as text comparison and analysis. The contribution is incremental, however, because it builds on existing unsupervised methods.
The paper tackles the problem of automatically highlighting token-level semantic differences between related documents. It formulates the problem as a token-level regression task and studies three unsupervised approaches that rely on a masked language model. An approach based on word alignment and sentence-level contrastive learning correlates robustly with gold labels, but all unsupervised approaches still leave a large margin for improvement.
Automatically highlighting words that cause semantic differences between two documents could be useful for a wide range of applications. We formulate recognizing semantic differences (RSD) as a token-level regression task and study three unsupervised approaches that rely on a masked language model. To assess the approaches, we begin with basic English sentences and gradually move to more complex, cross-lingual document pairs. Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation to gold labels. However, all unsupervised approaches still leave a large margin of improvement. Code to reproduce our experiments is available at https://github.com/ZurichNLP/recognizing-semantic-differences
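To make the token-level formulation concrete, the following is a minimal sketch of one way a word-alignment-based difference score could be computed. It is an illustration under stated assumptions, not necessarily the procedure used in the paper: each token of document A is scored by one minus its highest cosine similarity to any token of document B. The function names `cosine` and `difference_scores` are hypothetical, and the toy vectors stand in for token embeddings that would in practice come from a masked language model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def difference_scores(emb_a, emb_b):
    """Score each token of A by 1 - max cosine similarity to any token of B.

    Assumption for illustration: a token with no close counterpart in the
    other document is treated as contributing to a semantic difference.
    """
    if not emb_b:
        # Nothing to align with, so every token counts as fully different.
        return [1.0 for _ in emb_a]
    return [1.0 - max(cosine(u, v) for v in emb_b) for u in emb_a]


# Toy embeddings (hypothetical values, not produced by any real model).
tokens_a = ["the", "cat", "sleeps"]
emb_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
tokens_b = ["the", "dog", "sleeps"]
emb_b = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.6, 0.0, 0.8], [0.0, 0.0, 1.0, 0.0]]

scores = difference_scores(emb_a, emb_b)
for token, score in zip(tokens_a, scores):
    print(f"{token}\t{score:.2f}")
```

In this toy example, "cat" receives the highest score because its closest counterpart in the second sentence ("dog") is only partially similar, while "the" and "sleeps" have exact matches. Scores of this kind could then be compared against gold token-level labels with a correlation measure, as the evaluation described above suggests.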