CL | Feb 6, 2024

Sparse Graph Representations for Procedural Instructional Documents

arXiv:2402.03957v1 | 1.0 | h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses the need for more efficient and interpretable document similarity methods in NLP, particularly for procedural instructional documents, though it is incremental as it builds upon existing graph-based approaches.

The paper tackles document similarity computation by proposing a sparse directed graph representation that incorporates sequential information. It achieves comparable results on general datasets and a ten-point improvement on instructional documents with sequential flow.

Computing document similarity is a critical task in various NLP domains, with applications in deduplication, matching, and recommendation. Traditional approaches learn a representation of each document and apply a similarity or distance function over the embeddings. However, individual representations do not efficiently capture pairwise similarities and differences. Graph representations such as the Joint Concept Interaction Graph (JCIG) represent a pair of documents as a joint undirected weighted graph, which gives an interpretable representation of the document pair. However, JCIGs are undirected and do not consider the sequential flow of sentences in documents. We propose two approaches to model document similarity by representing document pairs as a directed and sparse JCIG that incorporates sequential information. We propose two algorithms, inspired by Supergenome Sorting and Hamiltonian Path, that replace the undirected edges with directed edges. Our approach also sparsifies the graph to $O(n)$ edges from JCIG's worst case of $O(n^2)$. We show that our sparse directed graph model architecture, consisting of a Siamese encoder and a GCN, achieves results comparable to the baseline on datasets not containing sequential information and beats the baseline by ten points on an instructional documents dataset containing sequential information.
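To make the sparsification claim concrete, here is a minimal sketch of how a dense undirected concept graph could be reduced to a directed path with $n - 1$ edges. It is not the authors' algorithm: it uses a simple greedy heuristic in the spirit of a Hamiltonian path, and the function name `sparsify_to_directed_path`, the `positions` input (a hypothetical average sentence index per concept), and the `weights` layout are all assumptions made for illustration.

```python
def sparsify_to_directed_path(positions, weights):
    """Greedy, Hamiltonian-path-style sparsification (illustrative only).

    positions: dict mapping node -> float, an assumed proxy for where the
               concept appears in the document (e.g. mean sentence index).
    weights:   dict mapping frozenset({u, v}) -> float, the undirected edge
               weights of the dense graph; missing pairs count as 0.0.

    Returns a list of directed edges (u, v) that visits every node once,
    so the result has n - 1 edges instead of up to n * (n - 1) / 2.
    """
    if not positions:
        return []

    # Start from the concept that appears earliest in the document.
    current = min(positions, key=lambda n: (positions[n], n))
    unvisited = set(positions) - {current}
    directed_edges = []

    while unvisited:
        # Follow the strongest remaining edge; break ties by earlier position.
        nxt = max(
            unvisited,
            key=lambda n: (weights.get(frozenset((current, n)), 0.0), -positions[n]),
        )
        directed_edges.append((current, nxt))
        unvisited.remove(nxt)
        current = nxt

    return directed_edges


if __name__ == "__main__":
    positions = {"a": 0.0, "b": 1.0, "c": 2.0, "d": 3.0}
    weights = {
        frozenset(("a", "b")): 0.9,
        frozenset(("a", "c")): 0.2,
        frozenset(("a", "d")): 0.1,
        frozenset(("b", "c")): 0.8,
        frozenset(("b", "d")): 0.3,
        frozenset(("c", "d")): 0.7,
    }
    print(sparsify_to_directed_path(positions, weights))
```

With four concepts, the dense graph above has six undirected edges, while the returned path has three directed ones. A greedy walk like this gives no optimality guarantee; it only illustrates how imposing an order on the nodes yields both direction and an $O(n)$ edge count.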
