SEMar 16, 2022Code
Exploring Variational Graph Auto-Encoders for Extract Class Refactoring RecommendationPritom Saha Akash, Kevin Chen-Chuan Chang
The code smell is a sign of design and development flaws in a software system that reduces the reusability and maintainability of the system. Refactoring is done as an ongoing practice to remove the code smell from the program code. Among different code smells, the God class or Blob is one of the most common code smells. A god class contains too many responsibilities, violating object-oriented programming design's low coupling and high cohesiveness principles. This paper proposes an automatic approach to extracting a God class into multiple smaller classes with more specific responsibilities. To do this, we first construct a graph of methods (as nodes) for the concerning god class. The edge between any two methods is determined by their structural similarity, and the feature for each method is initialized using different semantic representation methods. Then, the variational graph auto-encoder is used to learn a vector representation for each method. Finally, the learned vectors are used to cluster methods into different groups to be recommended as refactored classes. We assessed the proposed framework using three different class cohesion metrics on sixteen actual God Classes collected from two well-known open-source systems. We also conducted a comparative study of our approach with a similar existing approach and found that the proposed approach generated better results for almost all the God Classes used in the experiment.
CLMar 19, 2022
Domain Representative Keywords Selection: A Probabilistic ApproachPritom Saha Akash, Jie Huang, Kevin Chen-Chuan Chang et al.
We propose a probabilistic approach to select a subset of a \textit{target domain representative keywords} from a candidate set, contrasting with a context domain. Such a task is crucial for many downstream tasks in natural language processing. To contrast the target domain and the context domain, we adapt the \textit{two-component mixture model} concept to generate a distribution of candidate keywords. It provides more importance to the \textit{distinctive} keywords of the target domain than common keywords contrasting with the context domain. To support the \textit{representativeness} of the selected keywords towards the target domain, we introduce an \textit{optimization algorithm} for selecting the subset from the generated candidate distribution. We have shown that the optimization algorithm can be efficiently implemented with a near-optimal approximation guarantee. Finally, extensive experiments on multiple domains demonstrate the superiority of our approach over other baselines for the tasks of keyword summary generation and trending keywords selection.
CLJun 19, 2023
Unsupervised Open-domain Keyphrase GenerationLam Thanh Do, Pritom Saha Akash, Kevin Chen-Chuan Chang
In this work, we study the problem of unsupervised open-domain keyphrase generation, where the objective is a keyphrase generation model that can be built without using human-labeled data and can perform consistently across domains. To solve this problem, we propose a seq2seq model that consists of two modules, namely \textit{phraseness} and \textit{informativeness} module, both of which can be built in an unsupervised and open-domain fashion. The phraseness module generates phrases, while the informativeness module guides the generation towards those that represent the core concepts of the text. We thoroughly evaluate our proposed method using eight benchmark datasets from different domains. Results on in-domain datasets show that our approach achieves state-of-the-art results compared with existing unsupervised models, and overall narrows the gap between supervised and unsupervised methods down to about 16\%. Furthermore, we demonstrate that our model performs consistently across domains, as it overall surpasses the baselines on out-of-domain datasets.
CLOct 16, 2022
Coordinated Topic ModelingPritom Saha Akash, Jie Huang, Kevin Chen-Chuan Chang
We propose a new problem called coordinated topic modeling that imitates human behavior while describing a text corpus. It considers a set of well-defined topics like the axes of a semantic space with a reference representation. It then uses the axes to model a corpus for easily understandable representation. This new task helps represent a corpus more interpretably by reusing existing knowledge and benefits the corpora comparison task. We design ECTM, an embedding-based coordinated topic model that effectively uses the reference representation to capture the target corpus-specific aspects while maintaining each topic's global semantics. In ECTM, we introduce the topic- and document-level supervision with a self-training mechanism to solve the problem. Finally, extensive experiments on multiple domains show the superiority of our model over other baselines.
CLOct 24, 2023
Let the Pretrained Language Models "Imagine" for Short Texts Topic ModelingPritom Saha Akash, Jie Huang, Kevin Chen-Chuan Chang
Topic models are one of the compelling methods for discovering latent semantics in a document collection. However, it assumes that a document has sufficient co-occurrence information to be effective. However, in short texts, co-occurrence information is minimal, which results in feature sparsity in document representation. Therefore, existing topic models (probabilistic or neural) mostly fail to mine patterns from them to generate coherent topics. In this paper, we take a new approach to short-text topic modeling to address the data-sparsity issue by extending short text into longer sequences using existing pre-trained language models (PLMs). Besides, we provide a simple solution extending a neural topic model to reduce the effect of noisy out-of-topics text generation from PLMs. We observe that our model can substantially improve the performance of short-text topic modeling. Extensive experiments on multiple real-world datasets under extreme data sparsity scenarios show that our models can generate high-quality topics outperforming state-of-the-art models.
CLNov 15, 2023
Long-form Question Answering: An Iterative Planning-Retrieval-Generation ApproachPritom Saha Akash, Kashob Kumar Roy, Lucian Popa et al.
Long-form question answering (LFQA) poses a challenge as it involves generating detailed answers in the form of paragraphs, which go beyond simple yes/no responses or short factual answers. While existing QA models excel in questions with concise answers, LFQA requires handling multiple topics and their intricate relationships, demanding comprehensive explanations. Previous attempts at LFQA focused on generating long-form answers by utilizing relevant contexts from a corpus, relying solely on the question itself. However, they overlooked the possibility that the question alone might not provide sufficient information to identify the relevant contexts. Additionally, generating detailed long-form answers often entails aggregating knowledge from diverse sources. To address these limitations, we propose an LFQA model with iterative Planning, Retrieval, and Generation. This iterative process continues until a complete answer is generated for the given question. From an extensive experiment on both an open domain and a technical domain QA dataset, we find that our model outperforms the state-of-the-art models on various textual and factual metrics for the LFQA task.
CLOct 8, 2023
TopicAdapt- An Inter-Corpora Topics Adaptation ApproachPritom Saha Akash, Trisha Das, Kevin Chen-Chuan Chang
Topic models are popular statistical tools for detecting latent semantic topics in a text corpus. They have been utilized in various applications across different fields. However, traditional topic models have some limitations, including insensitivity to user guidance, sensitivity to the amount and quality of data, and the inability to adapt learned topics from one corpus to another. To address these challenges, this paper proposes a neural topic model, TopicAdapt, that can adapt relevant topics from a related source corpus and also discover new topics in a target corpus that are absent in the source corpus. The proposed model offers a promising approach to improve topic modeling performance in practical scenarios. Experiments over multiple datasets from diverse domains show the superiority of the proposed model against the state-of-the-art topic models.
CLNov 12, 2024
Query Optimization for Parametric Knowledge Refinement in Retrieval-Augmented Large Language ModelsYouan Cong, Pritom Saha Akash, Cheng Wang et al.
We introduce the \textit{Extract-Refine-Retrieve-Read} (ERRR) framework, a novel approach designed to bridge the pre-retrieval information gap in Retrieval-Augmented Generation (RAG) systems through query optimization tailored to meet the specific knowledge requirements of Large Language Models (LLMs). Unlike conventional query optimization techniques used in RAG, the ERRR framework begins by extracting parametric knowledge from LLMs, followed by using a specialized query optimizer for refining these queries. This process ensures the retrieval of only the most pertinent information essential for generating accurate responses. Moreover, to enhance flexibility and reduce computational costs, we propose a trainable scheme for our pipeline that utilizes a smaller, tunable model as the query optimizer, which is refined through knowledge distillation from a larger teacher model. Our evaluations on various question-answering (QA) datasets and with different retrieval systems show that ERRR consistently outperforms existing baselines, proving to be a versatile and cost-effective module for improving the utility and accuracy of RAG systems.
CLMay 30, 2025
ERU-KG: Efficient Reference-aligned Unsupervised Keyphrase GenerationLam Thanh Do, Aaditya Bodke, Pritom Saha Akash et al.
Unsupervised keyphrase prediction has gained growing interest in recent years. However, existing methods typically rely on heuristically defined importance scores, which may lead to inaccurate informativeness estimation. In addition, they lack consideration for time efficiency. To solve these problems, we propose ERU-KG, an unsupervised keyphrase generation (UKG) model that consists of an informativeness and a phraseness module. The former estimates the relevance of keyphrase candidates, while the latter generate those candidates. The informativeness module innovates by learning to model informativeness through references (e.g., queries, citation contexts, and titles) and at the term-level, thereby 1) capturing how the key concepts of documents are perceived in different contexts and 2) estimating informativeness of phrases more efficiently by aggregating term informativeness, removing the need for explicit modeling of the candidates. ERU-KG demonstrates its effectiveness on keyphrase generation benchmarks by outperforming unsupervised baselines and achieving on average 89\% of the performance of a supervised model for top 10 predictions. Additionally, to highlight its practical utility, we evaluate the model on text retrieval tasks and show that keyphrases generated by ERU-KG are effective when employed as query and document expansions. Furthermore, inference speed tests reveal that ERU-KG is the fastest among baselines of similar model sizes. Finally, our proposed model can switch between keyphrase generation and extraction by adjusting hyperparameters, catering to diverse application requirements.
CLJun 9, 2025
Understanding Cross-Domain Adaptation in Low-Resource Topic ModelingPritom Saha Akash, Kevin Chen-Chuan Chang
Topic modeling plays a vital role in uncovering hidden semantic structures within text corpora, but existing models struggle in low-resource settings where limited target-domain data leads to unstable and incoherent topic inference. We address this challenge by formally introducing domain adaptation for low-resource topic modeling, where a high-resource source domain informs a low-resource target domain without overwhelming it with irrelevant content. We establish a finite-sample generalization bound showing that effective knowledge transfer depends on robust performance in both domains, minimizing latent-space discrepancy, and preventing overfitting to the data. Guided by these insights, we propose DALTA (Domain-Aligned Latent Topic Adaptation), a new framework that employs a shared encoder for domain-invariant features, specialized decoders for domain-specific nuances, and adversarial alignment to selectively transfer relevant information. Experiments on diverse low-resource datasets demonstrate that DALTA consistently outperforms state-of-the-art methods in terms of topic coherence, stability, and transferability.