Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations
This work addresses the need for more granular semantic representations in natural language processing. It offers a new approach for tasks requiring fine-grained text attribution and similarity, though it builds incrementally on existing sentence embedding methods.
The paper tackles the problem of representing fine-grained semantic units within text. It introduces a sub-sentence encoder that learns distinct embeddings for atomic propositions. The encoder is effective in applications such as retrieving supporting facts and recognizing conditional semantic similarity, while keeping inference cost and space complexity comparable to those of sentence encoders.
We introduce the sub-sentence encoder, a contrastively learned contextual embedding model for fine-grained semantic representation of text. In contrast to the standard practice with sentence embeddings, where the meaning of an entire sequence of text is encoded into a fixed-length vector, the sub-sentence encoder learns to produce distinct contextual embeddings corresponding to different atomic propositions, i.e., atomic units of meaning expressed within a text sequence. The sub-sentence embeddings are contrastively learned to recognize (inferred) semantic equivalence between propositions across different text sequences. Our experiments show the effectiveness of sub-sentence encoders in applications such as retrieving supporting facts for fine-grained text attribution or recognizing the conditional semantic similarity between texts. In practice, we demonstrate that sub-sentence encoders have the same level of inference cost and space complexity as sentence encoders.
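The idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how proposition embeddings are extracted or which contrastive objective is used, so everything below is an assumption made for illustration: each proposition is taken to be marked by a binary mask over the tokens of its sentence, its embedding is the masked mean of the contextual token vectors, and training uses an InfoNCE loss with in-batch negatives. The names `pool_propositions` and `info_nce_loss` are hypothetical, and random vectors stand in for the output of a real encoder.

```python
import numpy as np


def pool_propositions(token_embeddings, proposition_masks):
    """Masked mean pooling: one unit-length vector per proposition.

    token_embeddings: (num_tokens, dim) contextual token vectors from a
        single encoder pass (assumed to be given).
    proposition_masks: (num_props, num_tokens) binary masks marking which
        tokens belong to each proposition (assumed to be given).
    """
    masks = np.asarray(proposition_masks, dtype=float)
    counts = masks.sum(axis=1, keepdims=True)
    # Average the token vectors selected by each mask.
    pooled = masks @ token_embeddings / np.maximum(counts, 1.0)
    # L2-normalize so that dot products are cosine similarities.
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.maximum(norms, 1e-12)


def info_nce_loss(anchors, positives, temperature=0.05):
    """InfoNCE with in-batch negatives.

    Row i of `positives` is treated as the proposition equivalent to
    row i of `anchors`; every other row acts as a negative.
    """
    logits = anchors @ positives.T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


# Toy usage: one 6-token "sentence" containing two propositions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))          # stand-in for encoder output
masks = np.array([[1, 1, 1, 0, 0, 0],     # proposition 1: tokens 0-2
                  [0, 0, 0, 1, 1, 1]])    # proposition 2: tokens 3-5
props = pool_propositions(tokens, masks)  # shape (2, 8)
loss = info_nce_loss(props, props)
```

Under these assumptions, one encoder pass over the sentence yields every proposition vector, and each vector has the same dimensionality as a sentence embedding. That is one plausible reading of the claim that inference cost and space complexity stay at the level of sentence encoders.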