IR | Jun 14, 2022
Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search
Chandan K. Reddy, Lluís Màrquez, Fran Valero et al.
Improving the quality of search results can significantly enhance users' experience and engagement with search engines. Despite recent advances in machine learning and data mining, correctly classifying items for a particular user search query remains a long-standing challenge with considerable room for improvement. This paper introduces the "Shopping Queries Dataset", a large dataset of difficult Amazon search queries and results, publicly released with the aim of fostering research in improving the quality of search results. The dataset contains around 130 thousand unique queries and 2.6 million manually labeled (query, product) relevance judgements. The dataset is multilingual, with queries in English, Japanese, and Spanish. The Shopping Queries Dataset is being used in one of the KDDCup'22 challenges. In this paper, we describe the dataset and present three evaluation tasks along with baseline results: (i) ranking the results list, (ii) classifying product results into relevance categories, and (iii) identifying substitute products for a given query. We anticipate that this data will become the gold standard for future research on product search.
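As a rough illustration of the ranking task (i), the sketch below scores a ranked result list with nDCG. The abstract does not name the metric or any gain values; nDCG, the reading of ESCI as Exact, Substitute, Complement, Irrelevant, and the `GAIN` table are all assumptions made for this example.

```python
import math

# Assumed gain per ESCI label (Exact, Substitute, Complement, Irrelevant).
# These values are illustrative and are not stated in the abstract.
GAIN = {"E": 1.0, "S": 0.1, "C": 0.01, "I": 0.0}


def dcg(labels):
    """Discounted cumulative gain of a ranked list of ESCI labels."""
    return sum(GAIN[label] / math.log2(rank + 2) for rank, label in enumerate(labels))


def ndcg(ranked_labels):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_labels, key=lambda label: GAIN[label], reverse=True)
    best = dcg(ideal)
    return dcg(ranked_labels) / best if best > 0 else 0.0
```

A list that already places exact matches first scores 1.0; pushing an exact match below an irrelevant product lowers the score.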
CL | Jul 26, 2025 | Code
Infogen: Generating Complex Statistical Infographics from Documents
Akash Ghosh, Aparna Garimella, Pritika Ramu et al.
Statistical infographics are powerful tools that simplify complex data into visually engaging and easy-to-understand formats. Despite advancements in AI, particularly with LLMs, existing efforts have been limited to generating simple charts, with no prior work addressing the creation of complex infographics from text-heavy documents that demand a deep understanding of the content. We address this gap by introducing the task of generating statistical infographics composed of multiple sub-charts (e.g., line, bar, pie) that are contextually accurate, insightful, and visually aligned. To achieve this, we define infographic metadata that includes its title and textual insights, along with sub-chart-specific details such as their corresponding data and alignment. We also present Infodat, the first benchmark dataset for text-to-infographic metadata generation, where each sample links a document to its metadata. We propose Infogen, a two-stage framework where fine-tuned LLMs first generate metadata, which is then converted into infographic code. Extensive evaluations on Infodat demonstrate that Infogen achieves state-of-the-art performance, outperforming both closed and open-source LLMs in text-to-statistical infographic generation.
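One way to picture the infographic metadata described above (title, textual insights, and per-sub-chart data and alignment) is as a small nested record. The sketch below is only a guess at such a structure: the class names, field names, and the JSON serialization are hypothetical and are not taken from Infodat.

```python
from dataclasses import asdict, dataclass, field
import json


@dataclass
class SubChart:
    chart_type: str  # e.g. "line", "bar", "pie"
    data: dict       # category label -> value
    alignment: str   # hypothetical placement hint, e.g. "left"


@dataclass
class InfographicMetadata:
    title: str
    insights: list
    sub_charts: list = field(default_factory=list)


# A made-up sample used only to show the shape of the record.
meta = InfographicMetadata(
    title="Example title",
    insights=["Example insight"],
    sub_charts=[SubChart("bar", {"A": 1, "B": 2}, "left")],
)

# asdict() recurses into the nested dataclasses, so the result is plain JSON.
serialized = json.dumps(asdict(meta))
```

A second stage could then turn a record like this into chart code, which mirrors the two-stage split the abstract describes.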
CL | Aug 19, 2025 | Code
Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper
Krishna Garg, Firoz Shaik, Sambaran Bandyopadhyay et al.
As researchers increasingly adopt LLMs as writing assistants, generating high-quality research paper introductions remains both challenging and essential. We introduce Scientific Introduction Generation (SciIG), a task that evaluates LLMs' ability to produce coherent introductions from titles, abstracts, and related works. Curating new datasets from NAACL 2025 and ICLR 2025 papers, we assess five state-of-the-art models, including both open-source (DeepSeek-v3, Gemma-3-12B, LLaMA 4-Maverick, MistralAI Small 3.1) and closed-source GPT-4o systems, across multiple dimensions: lexical overlap, semantic similarity, content coverage, faithfulness, consistency, citation correctness, and narrative quality. Our comprehensive framework combines automated metrics with LLM-as-a-judge evaluations. Results demonstrate LLaMA-4 Maverick's superior performance on most metrics, particularly in semantic similarity and faithfulness. Moreover, three-shot prompting consistently outperforms fewer-shot approaches. These findings provide practical insights into developing effective research writing assistants and set realistic expectations for LLM-assisted academic writing. To foster reproducibility and future research, we will publicly release all code and datasets.
SI | Nov 19, 2018 | Code
Outlier Aware Network Embedding for Attributed Networks
Sambaran Bandyopadhyay, Lokesh N, M. N. Murty
Attributed network embedding has received much interest from the research community, as most networks come with some content in each node, also known as node attributes. Existing attributed network approaches work well when the network is consistent in structure and attributes, and nodes behave as expected. But real-world networks often have anomalous nodes. Typically these outliers, being relatively unexplainable, affect the embeddings of other nodes in the network. As a result, downstream network mining tasks perform poorly in the presence of such outliers. Hence an integrated approach to detect anomalies and reduce their overall effect on the network embedding is required. Towards this end, we propose an unsupervised outlier aware network embedding algorithm (ONE) for attributed networks, which minimizes the effect of the outlier nodes, and hence generates robust network embeddings. We align and jointly optimize the loss functions coming from structure and attributes of the network. To the best of our knowledge, this is the first generic network embedding approach which incorporates the effect of outliers for an attributed network without any supervision. We experimented on publicly available real networks and manually planted different types of outliers to check the performance of the proposed algorithm. Results demonstrate the superiority of our approach in detecting network outliers compared to the state-of-the-art approaches. We also consider different downstream machine learning applications on networks to show the efficiency of ONE as a generic network embedding technique. The source code is made available at https://github.com/sambaranban/ONE.
CL | May 21, 2024
Presentations are not always linear! GNN meets LLM for Document-to-Presentation Transformation with Attribution
Himanshu Maheshwari, Sambaran Bandyopadhyay, Aparna Garimella et al.
Automatically generating a presentation from the text of a long document is a challenging and useful problem. In contrast to a flat summary, a presentation needs to have a better and non-linear narrative, i.e., the content of a slide can come from different and non-contiguous parts of the given document. However, it is difficult to incorporate such non-linear mapping of content to slides and ensure that the content is faithful to the document. LLMs are prone to hallucination and their performance degrades with the length of the input document. Towards this, we propose a novel graph based solution where we learn a graph from the input document and use a combination of graph neural network and LLM to generate a presentation with attribution of content for each slide. We conduct thorough experiments to show the merit of our approach compared to directly using LLMs for this task.
CL | Jun 5, 2025
Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs
Ananth Muppidi, Abhilash Nandy, Sambaran Bandyopadhyay
The performance of large language models in domain-specific tasks necessitates fine-tuning, which is computationally expensive and technically challenging. This paper focuses on parameter-efficient fine-tuning using soft prompting, a promising approach that adapts pre-trained models to downstream tasks by learning a small set of parameters. We propose a novel Input Dependent Soft Prompting technique with a self-Attention Mechanism (ID-SPAM) that generates soft prompts based on the input tokens and attends to different tokens with varying importance. Our method is simple and efficient, keeping the number of trainable parameters small. We show the merits of the proposed approach compared to state-of-the-art techniques on various tasks and show the improved zero-shot domain transfer capability.
CL | May 23, 2025
Taming LLMs with Negative Samples: A Reference-Free Framework to Evaluate Presentation Content with Actionable Feedback
Ananth Muppidi, Tarak Das, Sambaran Bandyopadhyay et al.
The generation of presentation slides automatically is an important problem in the era of generative AI. This paper focuses on evaluating multimodal content in presentation slides that can effectively summarize a document and convey concepts to a broad audience. We introduce a benchmark dataset, RefSlides, consisting of human-made high-quality presentations that span various topics. Next, we propose a set of metrics to characterize different intrinsic properties of the content of a presentation and present REFLEX, an evaluation approach that generates scores and actionable feedback for these metrics. We achieve this by generating negative presentation samples with different degrees of metric-specific perturbations and use them to fine-tune LLMs. This reference-free evaluation technique does not require ground truth presentations during inference. Our extensive automated and human experiments demonstrate that our evaluation approach outperforms classical heuristic-based and state-of-the-art large language model-based evaluations in generating scores and explanations.
CL | Jun 21, 2024
Is This a Bad Table? A Closer Look at the Evaluation of Table Generation from Text
Pritika Ramu, Aparna Garimella, Sambaran Bandyopadhyay
Understanding whether a generated table is of good quality is important for using it to create or edit documents with automatic methods. In this work, we underline that existing measures for table quality evaluation fail to capture the overall semantics of the tables, and sometimes unfairly penalize good tables and reward bad ones. We propose TabEval, a novel table evaluation strategy that captures table semantics by first breaking down a table into a list of natural language atomic statements and then comparing them with ground truth statements using entailment-based measures. To validate our approach, we curate a dataset comprising text descriptions for 1,250 diverse Wikipedia tables, covering a range of topics and structures, in contrast to the limited scope of existing datasets. We compare TabEval with existing metrics using unsupervised and supervised text-to-table generation methods, demonstrating its stronger correlation with human judgments of table quality across four datasets.
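The unroll-then-compare idea can be sketched as follows. This is not the TabEval implementation: the statement template, the assumption that the first column names each row, the exact-match `entails` placeholder (standing in for an entailment model), and the precision/recall summary are all hypothetical choices made for illustration.

```python
def unroll(table):
    """Turn a table (list of row dicts) into one statement per non-key cell."""
    statements = []
    for row in table:
        key_col = next(iter(row))  # assumption: the first column names the row
        for col, val in row.items():
            if col != key_col:
                statements.append(f"The {col} of {row[key_col]} is {val}.")
    return statements


def entails(premises, hypothesis):
    """Placeholder for an entailment model: exact string match only."""
    return hypothesis in premises


def table_scores(pred_table, gold_table):
    """Share of predicted statements supported by gold, and vice versa."""
    pred, gold = unroll(pred_table), unroll(gold_table)
    precision = sum(entails(gold, s) for s in pred) / len(pred) if pred else 0.0
    recall = sum(entails(pred, s) for s in gold) / len(gold) if gold else 0.0
    return precision, recall
```

Because the comparison happens at the statement level, a predicted table whose rows appear in a different order than the reference is not penalized.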
CL | Jun 1, 2024
Enhancing Presentation Slide Generation by LLMs with a Multi-Staged End-to-End Approach
Sambaran Bandyopadhyay, Himanshu Maheshwari, Anandhavelu Natarajan et al.
Generating presentation slides from a long document with multimodal elements such as text and images is an important task. This is time consuming and needs domain expertise if done manually. Existing approaches for generating a rich presentation from a document are often semi-automatic or only put a flat summary into the slides ignoring the importance of a good narrative. In this paper, we address this research gap by proposing a multi-staged end-to-end model which uses a combination of LLM and VLM. We have experimentally shown that compared to applying LLMs directly with state-of-the-art prompting, our proposed multi-staged solution is better in terms of automated metrics and human evaluation.
SE | Dec 1, 2021
Monolith to Microservices: Representing Application Software through Heterogeneous Graph Neural Network
Alex Mathai, Sambaran Bandyopadhyay, Utkarsh Desai et al.
Monolithic software encapsulates all functional capabilities into a single deployable unit. But managing it becomes harder as the demand for new functionalities grows. Microservice architecture is seen as an alternative, as it advocates building an application through a set of loosely coupled small services wherein each service owns a single functional responsibility. But the challenges associated with the separation of functional modules slow down the migration of monolithic code into microservices. In this work, we propose a representation learning based solution to tackle this problem. We use a heterogeneous graph to jointly represent software artifacts (like programs and resources) and the different relationships they share (function calls, inheritance, etc.), and perform a constraint-based clustering through a novel heterogeneous graph neural network. Experimental studies show that our approach is effective on monoliths of different types.
SE | Feb 7, 2021
Graph Neural Network to Dilute Outliers for Refactoring Monolith Application
Utkarsh Desai, Sambaran Bandyopadhyay, Srikanth Tamilselvam
Microservices are becoming the de facto design choice for software architecture. This architecture involves partitioning the software components into finer modules such that the development can happen independently. It also provides natural benefits when deployed on the cloud, since resources can be allocated dynamically to necessary components based on demand. Therefore, enterprises, as part of their journey to the cloud, are increasingly looking to refactor their monolith application into one or more candidate microservices, wherein each service contains a group of software entities (e.g., classes) that are responsible for a common functionality. Graphs are a natural choice to represent a software system. Each software entity can be represented as a node and its dependencies with other entities as links. Therefore, this problem of refactoring can be viewed as a graph based clustering task. In this work, we propose a novel method to adapt the recent advancements in graph neural networks in the context of code to better understand the software and apply them in the clustering task. In that process, we also identify the outliers in the graph which can be directly mapped to top refactor candidates in the software. Our solution is able to improve state-of-the-art performance compared to works from both software engineering and existing graph representation based techniques.
LG | Dec 7, 2020
Dynamic Structure Learning through Graph Neural Network for Forecasting Soil Moisture in Precision Agriculture
Anoushka Vyas, Sambaran Bandyopadhyay
Soil moisture is an important component of precision agriculture as it directly impacts the growth and quality of vegetation. Forecasting soil moisture is essential to schedule the irrigation and optimize the use of water. Physics based soil moisture models need rich features and heavy computation, which does not scale. In recent literature, conventional machine learning models have been applied to this problem. These models are fast and simple, but they often fail to capture the spatio-temporal correlation that soil moisture exhibits over a region. In this work, we propose a novel graph neural network based solution that learns temporal graph structures and forecasts soil moisture in an end-to-end framework. Our solution is able to handle the problem of missing ground truth soil moisture, which is common in practice. We show the merit of our algorithm on real-world soil moisture data.
SI | Nov 28, 2020
Unsupervised Constrained Community Detection via Self-Expressive Graph Neural Network
Sambaran Bandyopadhyay, Vishal Peter
Graph neural networks (GNNs) are able to achieve promising performance on multiple graph downstream tasks such as node classification and link prediction. Comparatively less work has been done to design GNNs which can operate directly for community detection on graphs. Traditionally, GNNs are trained on a semi-supervised or self-supervised loss function and then clustering algorithms are applied to detect communities. However, such decoupled approaches are inherently sub-optimal. Designing an unsupervised loss function to train a GNN and extract communities in an integrated manner is a fundamental challenge. To tackle this problem, we combine the principle of self-expressiveness with the framework of self-supervised graph neural network for unsupervised community detection for the first time in the literature. Our solution is trained in an end-to-end fashion and achieves state-of-the-art community detection performance on multiple publicly available datasets.
SI | Jul 20, 2020
Integrating Network Embedding and Community Outlier Detection via Multiclass Graph Description
Sambaran Bandyopadhyay, Saley Vishal Vivek, M. N. Murty
Network (or graph) embedding is the task of mapping the nodes of a graph to a lower dimensional vector space, such that it preserves the graph properties and facilitates the downstream network mining tasks. Real-world networks often come with (community) outlier nodes, which behave differently from the regular nodes of the community. These outlier nodes can affect the embedding of the regular nodes, if not handled carefully. In this paper, we propose a novel unsupervised graph embedding approach (called DMGD) which integrates outlier and community detection with node embedding. We extend the idea of deep support vector data description to the framework of graph embedding when there are multiple communities present in the given network, and an outlier is characterized relative to its community. We also show the theoretical bounds on the number of outliers detected by DMGD. Our formulation boils down to an interesting minimax game between the outliers, community assignments and the node embedding function. We also propose an efficient algorithm to solve this optimization framework. Experimental results on both synthetic and real-world networks show the merit of our approach compared to state-of-the-art methods.
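The idea of characterizing an outlier relative to its community can be illustrated with a simple distance-to-center score. This is not the DMGD objective or its minimax formulation: it assumes node embeddings and one center per community are already given, and it assigns each node to its nearest center. The function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def outlier_scores(embeddings, centers):
    """For each node, return (nearest community index, squared distance to its center)."""
    result = []
    for vec in embeddings:
        dists = [sq_dist(vec, c) for c in centers]
        k = min(range(len(centers)), key=dists.__getitem__)
        result.append((k, dists[k]))
    return result
```

Under this toy scoring, a node that sits far from the center of even its closest community receives a high score and would be flagged as a community outlier.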
LG | Jul 19, 2020
Robust Hierarchical Graph Classification with Subgraph Attention
Sambaran Bandyopadhyay, Manasvi Aggarwal, M. Narasimha Murty
Graph neural networks have received significant attention for graph representation and classification in the machine learning community. An attention mechanism applied to the neighborhood of a node improves the performance of graph neural networks. Typically, it helps to identify a neighbor node which plays a more important role in determining the label of the node under consideration. But in real-world scenarios, a particular subset of nodes together, but not the individual pairs in the subset, may be important to determine the label of the graph. To address this problem, we introduce the concept of subgraph attention for graphs. On the other hand, hierarchical graph pooling has been shown to be promising in recent literature. But due to the noisy hierarchical structure of real-world graphs, not all the hierarchies of a graph play an equal role in graph classification. Towards this end, we propose a graph classification algorithm called SubGattPool which jointly learns the subgraph attention and employs two different types of hierarchical attention mechanisms to find the important nodes in a hierarchy and the importance of individual hierarchies in a graph. Experimental evaluation with different types of graph classification algorithms shows that SubGattPool is able to improve the state-of-the-art or remains competitive on multiple publicly available graph classification datasets. We conduct further experiments on both synthetic and real-world graph datasets to justify the usefulness of different components of SubGattPool and to show its consistent performance on other downstream tasks.
LG | Jun 8, 2020
Unsupervised Graph Representation by Periphery and Hierarchical Information Maximization
Sambaran Bandyopadhyay, Manasvi Aggarwal, M. Narasimha Murty
Deep representation learning on non-Euclidean data types, such as graphs, has gained significant attention in recent years. The invention of graph neural networks has improved the state-of-the-art for both node and entire graph representation in a vector space. However, for the entire graph representation, most of the existing graph neural networks are trained on a graph classification loss in a supervised way. But obtaining labels for a large number of graphs is expensive for real-world applications. Thus, in this paper, we propose an unsupervised graph neural network to generate a vector representation of an entire graph. For this purpose, we combine the idea of hierarchical graph neural networks and mutual information maximization into a single framework. We also propose and use the concept of periphery representation of a graph and show its usefulness in the proposed algorithm, which is referred to as GraPHmax. We conduct thorough experiments on several real-world graph datasets and compare the performance of GraPHmax with a diverse set of both supervised and unsupervised baseline algorithms. Experimental results show that we are able to improve the state-of-the-art for multiple graph level tasks on several real-world datasets, while remaining competitive on the others.
SI | Feb 9, 2020
Line Hypergraph Convolution Network: Applying Graph Convolution for Hypergraphs
Sambaran Bandyopadhyay, Kishalay Das, M. Narasimha Murty
Network representation learning and node classification in graphs have received significant attention due to the invention of different types of graph neural networks. Graph convolution network (GCN) is a popular semi-supervised technique which aggregates attributes within the neighborhood of each node. Conventional GCNs can be applied to simple graphs where each edge connects only two nodes. But many modern-day applications need to model higher-order relationships in a graph. Hypergraphs are effective data types to handle such complex relationships. In this paper, we propose a novel technique to apply graph convolution on hypergraphs with variable hyperedge sizes. We use the classical concept of the line graph of a hypergraph for the first time in the hypergraph learning literature. Then we propose to use graph convolution on the line graph of a hypergraph. Experimental analysis on multiple real-world network datasets shows the merit of our approach compared to state-of-the-art methods.
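The line graph of a hypergraph has one vertex per hyperedge, with two vertices linked whenever the corresponding hyperedges share a node. The sketch below builds that graph and runs one neighbourhood-averaging step over it. It is a simplified stand-in rather than the paper's model: the line graph is unweighted, and `propagate` has no learned weights or nonlinearity, unlike an actual graph convolution layer.

```python
from itertools import combinations


def line_graph(hyperedges):
    """Adjacency sets of the line graph: one vertex per hyperedge,
    linked whenever two hyperedges share at least one node."""
    adj = {i: set() for i in range(len(hyperedges))}
    for i, j in combinations(range(len(hyperedges)), 2):
        if hyperedges[i] & hyperedges[j]:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def propagate(adj, features):
    """One mean-aggregation step over each vertex and its neighbours
    (a weight-free stand-in for a graph convolution layer)."""
    out = {}
    for v, nbrs in adj.items():
        group = [features[v]] + [features[u] for u in sorted(nbrs)]
        out[v] = [sum(col) / len(group) for col in zip(*group)]
    return out
```

Hyperedges of any size reduce to ordinary vertices in the line graph, which is why a convolution defined for simple graphs can be applied there regardless of hyperedge size.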
SI | Dec 11, 2019
Beyond Node Embedding: A Direct Unsupervised Edge Representation Framework for Homogeneous Networks
Sambaran Bandyopadhyay, Anirban Biswas, M. N. Murty et al.
Network representation learning has traditionally been used to find lower dimensional vector representations of the nodes in a network. However, there are very important edge driven mining tasks of interest to the classical network analysis community, which have mostly been unexplored in the network embedding space. For applications such as link prediction in homogeneous networks, vector representation (i.e., embedding) of an edge is derived heuristically just by using simple aggregations of the embeddings of the end vertices of the edge. Clearly, this method of deriving edge embedding is suboptimal and there is a need for a dedicated unsupervised approach for embedding edges by leveraging edge properties of the network. Towards this end, we propose a novel concept of converting a network to its weighted line graph which is ideally suited to find the embedding of edges of the original network. We further derive a novel algorithm to embed the line graph, by introducing the concept of collective homophily. To the best of our knowledge, this is the first direct unsupervised approach for edge embedding in homogeneous information networks, without relying on the node embeddings. We validate the edge embeddings on three downstream edge mining tasks. Our proposed optimization framework for edge embedding also generates a set of node embeddings, which are not just the aggregation of edges. Further experimental analysis shows the connection of our framework to the concept of node centrality.
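For context, the heuristic baseline that the abstract argues against (deriving an edge embedding by simple aggregation of its two end-vertex embeddings) can be sketched as below. The two operators shown, element-wise average and Hadamard product, are common choices assumed here for illustration; the abstract does not list specific aggregations, and this is not the proposed line-graph method.

```python
def edge_embedding(u_vec, v_vec, op="average"):
    """Heuristic edge embedding built from the embeddings of its end vertices."""
    if op == "average":
        return [(a + b) / 2 for a, b in zip(u_vec, v_vec)]
    if op == "hadamard":
        return [a * b for a, b in zip(u_vec, v_vec)]
    raise ValueError(f"unknown operator: {op}")
```

Such an embedding depends only on the two end vertices, so it cannot reflect properties of the edge itself, which is the limitation a direct edge embedding aims to remove.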