IR, Aug 10, 2017

Utilizing Embeddings for Ad-hoc Retrieval by Document-to-document Similarity

arXiv:1708.03181v14.01 citations

Originality: Incremental advance

AI Analysis

The paper addresses a specific issue in information retrieval, of interest to users who need more accurate semantic relevance, though it appears incremental because it builds on existing embedding methods.

The paper tackles the problem of 'multiple degrees of similarity' in ad-hoc retrieval by proposing a document-to-document similarity approach using embeddings, which outperforms strong baselines on standard TREC test collections.

Latent semantic representations of words or paragraphs, namely embeddings, have been widely applied to information retrieval (IR). A common approach to utilizing embeddings for IR is to estimate the document-to-query (D2Q) similarity in the embedding space. Because words with similar syntactic usage are usually very close to each other in the embedding space even when they are not semantically similar, the D2Q similarity approach may suffer from the problem of "multiple degrees of similarity". To this end, this paper proposes a novel approach that estimates a semantic relevance score (SEM) based on document-to-document (D2D) similarity of embeddings. Because Word2Vec and Para2Vec generate embeddings from the context of words or paragraphs, the D2D similarity approach turns the task of document ranking into the estimation of similarity between the content of different documents. Experimental results on standard TREC test collections show that our proposed approach outperforms strong baselines.
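
To make the contrast between D2Q and D2D scoring concrete, here is a minimal sketch in plain Python. It is not the paper's SEM formula, which the abstract does not spell out. It assumes that each document and the query already have a fixed-length embedding, that cosine similarity is the similarity measure, and that a small set of reference documents (for example, top-ranked results from a first retrieval pass) is available. The names `d2q_score`, `d2d_score`, and `rank_by_d2d`, as well as the mean aggregation, are hypothetical choices made for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def d2q_score(doc_vec: Vector, query_vec: Vector) -> float:
    """Document-to-query similarity: compare a document directly with the query."""
    return cosine(doc_vec, query_vec)


def d2d_score(doc_vec: Vector, reference_vecs: List[Vector]) -> float:
    """Document-to-document similarity: mean cosine similarity between a
    candidate and a set of reference documents.
    Assumption: the mean is an illustrative aggregation, not the paper's SEM."""
    if not reference_vecs:
        return 0.0
    total = sum(cosine(doc_vec, ref) for ref in reference_vecs)
    return total / len(reference_vecs)


def rank_by_d2d(doc_vecs: Dict[str, Vector], reference_ids: List[str]) -> List[str]:
    """Rank all non-reference documents by their D2D score, best first."""
    refs = [doc_vecs[doc_id] for doc_id in reference_ids]
    scores = {
        doc_id: d2d_score(vec, refs)
        for doc_id, vec in doc_vecs.items()
        if doc_id not in reference_ids
    }
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Toy 2-dimensional embeddings (made up for illustration only).
toy_docs = {
    "d1": [1.0, 0.0],
    "d2": [0.9, 0.1],
    "d3": [0.0, 1.0],
    "d4": [0.8, 0.3],
}
ranking = rank_by_d2d(toy_docs, reference_ids=["d1"])
```

In this sketch the query never enters the D2D score directly; it influences the ranking only through the choice of reference documents, which is one way to read "similarity between the content of different documents".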
