IR, Jan 11, 2022

Structure and Semantics Preserving Document Representations

Natraj Raman, Sameena Shah, Manuela Veloso

arXiv:2201.03720v33.73 citations

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of improving document retrieval accuracy by balancing semantics and structure. The contribution appears incremental, since it builds on the existing pre-train/fine-tune paradigm.

The paper tackles the problem of document retrieval by integrating semantic content and structural relationships between documents, proposing a deep metric learning approach with a novel quintuplet loss that outperforms competing methods on multiple datasets.

Retrieving relevant documents from a corpus is typically based on the semantic similarity between the document content and the query text. Including structural relationships between documents can benefit the retrieval mechanism by addressing semantic gaps. However, incorporating these relationships requires tractable mechanisms that balance structure with semantics and take advantage of the prevalent pre-train/fine-tune paradigm. We propose here a holistic approach to learning document representations by integrating intra-document content with inter-document relations. Our deep metric learning solution analyzes the complex neighborhood structure in the relationship network to efficiently sample similar/dissimilar document pairs, and defines a novel quintuplet loss function that simultaneously encourages document pairs that are semantically relevant to be closer and pairs that are structurally unrelated to be far apart in the representation space. Furthermore, the separation margins between the documents are varied flexibly to encode the heterogeneity in relationship strengths. The model is fully fine-tunable and natively supports query projection during inference. We demonstrate that it outperforms competing methods on multiple datasets for document retrieval tasks.
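The abstract does not spell out the quintuplet loss, so the following is only a minimal sketch of one possible reading, not the authors' formulation. It assumes a quintuplet made of an anchor plus a semantic positive, a semantic negative, a structural positive, and a structural negative, and combines two hinge terms over Euclidean distances. The function names, the five-way split, and the linear `margin_from_strength` rule are all hypothetical choices made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hinge(x):
    """Standard hinge: penalize only positive violations."""
    return max(0.0, x)


def margin_from_strength(strength, base=0.5, scale=1.0):
    """Hypothetical rule: a stronger relationship demands a wider margin."""
    return base + scale * strength


def quintuplet_loss(anchor, sem_pos, sem_neg, struct_pos, struct_neg,
                    sem_margin=1.0, struct_margin=1.0):
    """Illustrative quintuplet-style loss (an assumption, not the paper's).

    The semantic term pulls the semantically relevant document closer to
    the anchor than the semantically irrelevant one. The structural term
    pushes the structurally unrelated document farther from the anchor
    than the structurally related one. Each term has its own margin.
    """
    semantic_term = hinge(
        euclidean(anchor, sem_pos) - euclidean(anchor, sem_neg) + sem_margin
    )
    structural_term = hinge(
        euclidean(anchor, struct_pos) - euclidean(anchor, struct_neg) + struct_margin
    )
    return semantic_term + structural_term


# Toy 2-D embeddings: the semantic constraint is already satisfied,
# while the structural one is violated by half a unit.
loss = quintuplet_loss(
    anchor=[0.0, 0.0],
    sem_pos=[1.0, 0.0],
    sem_neg=[3.0, 0.0],
    struct_pos=[0.0, 2.0],
    struct_neg=[0.0, 2.5],
    struct_margin=margin_from_strength(0.5),
)
print(loss)
```

In this toy case only the structural term contributes, because the structural negative is not yet a full margin farther from the anchor than the structural positive. Varying `struct_margin` per pair is one simple way to mimic the flexible, strength-dependent margins the abstract mentions.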
