CLFeb 7, 2023
Transformer-based Models for Long-Form Document Matching: Challenges and Empirical AnalysisAkshita Jha, Adithya Samavedhi, Vineeth Rakesh et al.
Recent advances in the area of long document matching have primarily focused on using transformer-based models for long document encoding and matching. There are two primary challenges associated with these models. Firstly, the performance gain provided by transformer-based models comes at a steep cost - both in terms of the required training time and the resource (memory and energy) consumption. The second major limitation is their inability to handle more than a pre-defined input token length at a time. In this work, we empirically demonstrate the effectiveness of simple neural models (such as feed-forward networks, and CNNs) and simple embeddings (like GloVe, and Paragraph Vector) over transformer-based models on the task of document matching. We show that simple models outperform the more complex BERT-based models while taking significantly less training time, energy, and memory. The simple models are also more robust to variations in document length and text perturbations.
IRAug 20, 2021
Supervised Contrastive Learning for Interpretable Long-Form Document MatchingAkshita Jha, Vineeth Rakesh, Jaideep Chandrashekar et al.
Recent advancements in deep learning techniques have transformed the area of semantic text matching. However, most state-of-the-art models are designed to operate with short documents such as tweets, user reviews, comments, etc. These models have fundamental limitations when applied to long-form documents such as scientific papers, legal documents, and patents. When handling such long documents, there are three primary challenges: (i) the presence of different contexts for the same word throughout the document, (ii) small sections of contextually similar text between two documents, but dissimilar text in the remaining parts (this defies the basic understanding of "similarity"), and (iii) the coarse nature of a single global similarity measure which fails to capture the heterogeneity of the document content. In this paper, we describe CoLDE: Contrastive Long Document Encoder - a transformer-based framework that addresses these challenges and allows for interpretable comparisons of long documents. CoLDE uses unique positional embeddings and a multi-headed chunkwise attention layer in conjunction with a supervised contrastive learning framework to capture similarity at three different levels: (i) high-level similarity scores between a pair of documents, (ii) similarity scores between different sections within and across documents, and (iii) similarity scores between different chunks in the same document and across other documents. These fine-grained similarity scores aid in better interpretability. We evaluate CoLDE on three long document datasets namely, ACL Anthology publications, Wikipedia articles, and USPTO patents. Besides outperforming the state-of-the-art methods on the document matching task, CoLDE is also robust to changes in document length and text perturbations and provides interpretable results.
NISep 1, 2017
Inferring Networked Device Categories from Low-Level Activity IndicatorsKyumars Sheykh Esmaili, Jaideep Chandrashekar, Pascal Le Guyadec
We study the problem of inferring the type of a networked device in a home network by leveraging low level traffic activity indicators seen at commodity home gateways. We analyze a dataset of detailed device network activity obtained from 240 subscriber homes of a large European ISP and extract a number of traffic and spatial fingerprints for individual devices. We develop a two level taxonomy to describe devices onto which we map individual devices using a number of heuristics. We leverage the heuristically derived labels to train classifiers that distinguish device classes based on the traffic and spatial fingerprints of a device. Our results show an accuracy level up to 91% for the coarse level category and up to 84% for the fine grained category. By incorporating information from other sources (e.g., MAC OUI), we are able to further improve accuracy to above 97% and 92%, respectively. Finally, we also extract a set of simple and human-readable rules that concisely capture the behaviour of these distinct device categories.