CL · Jun 9, 2019

Encouraging Paragraph Embeddings to Remember Sentence Identity Improves Classification

arXiv:1906.03656v1 · 31.01090 citations · Has Code

Originality: Incremental advance

AI Analysis

This work shows that a paragraph embedding method fails to capture a basic linguistic property, a finding relevant to NLP researchers. It is an incremental advance because it modifies an existing method rather than proposing a new one.

The paper tackles the problem that the paragraph embedding method of Zhang et al. (2017) cannot reliably identify whether a sentence occurs in the input paragraph. It replaces that method's reconstruction-based objective with a sentence content probe objective in a semi-supervised setting. The authors report improved downstream classification accuracy on benchmark datasets, faster training, and better generalization.

While paragraph embedding models are remarkably effective for downstream classification tasks, what they learn and encode into a single vector remains opaque. In this paper, we investigate a state-of-the-art paragraph embedding method proposed by Zhang et al. (2017) and discover that it cannot reliably tell whether a given sentence occurs in the input paragraph or not. We formulate a sentence content task to probe for this basic linguistic property and find that even a much simpler bag-of-words method has no trouble solving it. This result motivates us to replace the reconstruction-based objective of Zhang et al. (2017) with our sentence content probe objective in a semi-supervised setting. Despite its simplicity, our objective improves over paragraph reconstruction in terms of (1) downstream classification accuracies on benchmark datasets, (2) faster training, and (3) better generalization ability.
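To make the sentence content task concrete, here is a minimal sketch of how such a probe could be set up and solved from bag-of-words information alone. It is an illustration under stated assumptions, not the authors' implementation: the negative-sampling scheme (drawing a sentence from a different paragraph), the toy paragraphs, and the names `tokenize`, `make_probe_pairs`, and `bow_contains` are all hypothetical, and a token-overlap rule stands in for the trained bag-of-words classifier the abstract alludes to.

```python
import random
from collections import Counter


def tokenize(text):
    # Lowercased whitespace tokenization; a real setup would use a proper tokenizer.
    return text.lower().split()


def make_probe_pairs(paragraphs, seed=0):
    """Build (paragraph_index, sentence, label) examples for the probe.

    label 1: the sentence was drawn from that paragraph.
    label 0: the sentence was drawn from a different paragraph
             (an assumed negative-sampling scheme, not taken from the paper).
    """
    rng = random.Random(seed)
    pairs = []
    for i, sentences in enumerate(paragraphs):
        pairs.append((i, rng.choice(sentences), 1))
        other = rng.choice([j for j in range(len(paragraphs)) if j != i])
        pairs.append((i, rng.choice(paragraphs[other]), 0))
    return pairs


def bow_contains(paragraph_sentences, sentence):
    """Predict 1 if the paragraph's bag of words covers every token of the sentence."""
    bag = Counter(tok for s in paragraph_sentences for tok in tokenize(s))
    need = Counter(tokenize(sentence))
    return int(all(bag[tok] >= count for tok, count in need.items()))


# Toy data: each paragraph is a list of sentences (hypothetical examples).
paragraphs = [
    ["The film was slow.", "The acting saved it."],
    ["Service was quick.", "The soup arrived cold."],
    ["Battery life is great.", "The screen scratches easily."],
]

pairs = make_probe_pairs(paragraphs)
correct = sum(bow_contains(paragraphs[i], sent) == label for i, sent, label in pairs)
accuracy = correct / len(pairs)
print(f"probe accuracy of the overlap heuristic: {accuracy:.2f}")
```

The heuristic succeeds here only because the toy paragraphs share few words; on real data, distractor sentences with heavy lexical overlap would require a learned classifier over the paragraph and sentence representations.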
