Balaji Veeramani

CL
h-index18
9papers
243citations
Novelty49%
AI Score39

9 Papers

CLSep 14, 2023
Generative AI Text Classification using Ensemble LLM Approaches

Harika Abburi, Michael Suesserman, Nirmala Pudota et al.

Large Language Models (LLMs) have shown impressive performance across a variety of Artificial Intelligence (AI) and natural language processing tasks, such as content creation, report generation, etc. However, unregulated malign application of these models can create undesirable consequences such as generation of fake news, plagiarism, etc. As a result, accurate detection of AI-generated language can be crucial in responsible usage of LLMs. In this work, we explore 1) whether a certain body of text is AI generated or written by human, and 2) attribution of a specific language model in generating a body of text. Texts in both English and Spanish are considered. The datasets used in this study are provided as part of the Automated Text Identification (AuTexTification) shared task. For each of the research objectives stated above, we propose an ensemble neural model that generates probabilities from different pre-trained LLMs which are used as features to a Traditional Machine Learning (TML) classifier following it. For the first task of distinguishing between AI and human generated text, our model ranked in fifth and thirteenth place (with macro $F1$ scores of 0.733 and 0.649) for English and Spanish texts, respectively. For the second task on model attribution, our model ranked in first place with macro $F1$ scores of 0.625 and 0.653 for English and Spanish texts, respectively.

CLNov 6, 2023
A Simple yet Efficient Ensemble Approach for AI-generated Text Detection

Harika Abburi, Kalyani Roy, Michael Suesserman et al.

Recent Large Language Models (LLMs) have demonstrated remarkable capabilities in generating text that closely resembles human writing across wide range of styles and genres. However, such capabilities are prone to potential abuse, such as fake news generation, spam email creation, and misuse in academic assignments. Hence, it is essential to build automated approaches capable of distinguishing between artificially generated text and human-authored text. In this paper, we propose a simple yet efficient solution to this problem by ensembling predictions from multiple constituent LLMs. Compared to previous state-of-the-art approaches, which are perplexity-based or uses ensembles with a number of LLMs, our condensed ensembling approach uses only two constituent LLMs to achieve comparable performance. Experiments conducted on four benchmark datasets for generative text classification show performance improvements in the range of 0.5 to 100\% compared to previous state-of-the-art approaches. We also study the influence that the training data from individual LLMs have on model performance. We found that substituting commercially-restrictive Generative Pre-trained Transformer (GPT) data with data generated from other open language models such as Falcon, Large Language Model Meta AI (LLaMA2), and Mosaic Pretrained Transformers (MPT) is a feasible alternative when developing generative text detectors. Furthermore, to demonstrate zero-shot generalization, we experimented with an English essays dataset, and results suggest that our ensembling approach can handle new data effectively.

LGOct 12, 2022
Anomaly Detection via Federated Learning

Marc Vucovich, Amogh Tarcar, Penjo Rebelo et al.

Machine learning has helped advance the field of anomaly detection by incorporating classifiers and autoencoders to decipher between normal and anomalous behavior. Additionally, federated learning has provided a way for a global model to be trained with multiple clients' data without requiring the client to directly share their data. This paper proposes a novel anomaly detector via federated learning to detect malicious network activity on a client's server. In our experiments, we use an autoencoder with a classifier in a federated learning framework to determine if the network activity is benign or malicious. By using our novel min-max scalar and sampling technique, called FedSam, we determined federated learning allows the global model to learn from each client's data and, in turn, provide a means for each client to improve their intrusion detection system's defense against cyber-attacks.

LGSep 29, 2023
A Closer Look at Bearing Fault Classification Approaches

Harika Abburi, Tanya Chaudhary, Haider Ilyas et al.

Rolling bearing fault diagnosis has garnered increased attention in recent years owing to its presence in rotating machinery across various industries, and an ever increasing demand for efficient operations. Prompt detection and accurate prediction of bearing failures can help reduce the likelihood of unexpected machine downtime and enhance maintenance schedules, averting lost productivity. Recent technological advances have enabled monitoring the health of these assets at scale using a variety of sensors, and predicting the failures using modern Machine Learning (ML) approaches including deep learning architectures. Vibration data has been collected using accelerated run-to-failure of overloaded bearings, or by introducing known failure in bearings, under a variety of operating conditions such as rotating speed, load on the bearing, type of bearing fault, and data acquisition frequency. However, in the development of bearing failure classification models using vibration data there is a lack of consensus in the metrics used to evaluate the models, data partitions used to evaluate models, and methods used to generate failure labels in run-to-failure experiments. An understanding of the impact of these choices is important to reliably develop models, and deploy them in practical settings. In this work, we demonstrate the significance of these choices on the performance of the models using publicly-available vibration datasets, and suggest model development considerations for real world scenarios. Our experimental findings demonstrate that assigning vibration data from a given bearing across training and evaluation splits leads to over-optimistic performance estimates, PCA-based approach is able to robustly generate labels for failure classification in run-to-failure experiments, and $F$ scores are more insightful to evaluate the models with unbalanced real-world failure data.

CLFeb 9, 2024Code
EntGPT: Entity Linking with Generative Large Language Models

Yifan Ding, Amrit Poudel, Qingkai Zeng et al.

Entity Linking in natural language processing seeks to match text entities to their corresponding entries in a dictionary or knowledge base. Traditional approaches rely on contextual models, which can be complex, hard to train, and have limited transferability across different domains. Generative large language models like GPT offer a promising alternative but often underperform with naive prompts. In this study, we introduce EntGPT, employing advanced prompt engineering to enhance EL tasks. Our three-step hard-prompting method (EntGPT-P) significantly boosts the micro-F_1 score by up to 36% over vanilla prompts, achieving competitive performance across 10 datasets without supervised fine-tuning. Additionally, our instruction tuning method (EntGPT-I) improves micro-F_1 scores by 2.1% on average in supervised EL tasks and outperforms several baseline models in six Question Answering tasks. Our methods are compatible with both open-source and proprietary LLMs. All data and code are available on GitHub at https://github.com/yifding/In_Context_EL.

CLJan 2, 2025
Citations and Trust in LLM Generated Responses

Yifan Ding, Matthew Facciani, Amrit Poudel et al.

Question answering systems are rapidly advancing, but their opaque nature may impact user trust. We explored trust through an anti-monitoring framework, where trust is predicted to be correlated with presence of citations and inversely related to checking citations. We tested this hypothesis with a live question-answering experiment that presented text responses generated using a commercial Chatbot along with varying citations (zero, one, or five), both relevant and random, and recorded if participants checked the citations and their self-reported trust in the generated responses. We found a significant increase in trust when citations were present, a result that held true even when the citations were random; we also found a significant decrease in trust when participants checked the citations. These results highlight the importance of citations in enhancing trust in AI-generated content.

DCJan 16, 2025
The Streaming Batch Model for Efficient and Fault-Tolerant Heterogeneous Execution

Frank Sifei Luan, Ron Yifeng Wang, Yile Gu et al.

While ML model training and inference are both GPU-intensive, CPU-based data processing is often the bottleneck. Distributed data processing systems based on the batch or stream processing models assume homogeneous resource requirements. They excel at CPU-based computation but either under-utilize heterogeneous resources or impose high overheads on failure and reconfiguration. We introduce the streaming batch model, a hybrid of batch and streaming that enables efficient and fault-tolerant heterogeneous execution. The key idea is to use partitions as the unit of execution to achieve elasticity, but to allow partitions to be dynamically created and streamed between heterogeneous operators for memory-efficient pipelining. We present Ray Data, a streaming batch system that improves throughput on heterogeneous batch inference pipelines by 2.5-12$\times$ compared to traditional batch and stream processing systems. By leveraging heterogeneous clusters, Ray Data improves training throughput for multimodal models such as Stable Diffusion by 31% compared to single-node ML data loaders.

LGNov 17, 2025
Data Value in the Age of Scaling: Understanding LLM Scaling Dynamics Under Real-Synthetic Data Mixtures

Haohui Wang, Jingyuan Qi, Jianpeng Chen et al.

The rapid progress of large language models (LLMs) is fueled by the growing reliance on datasets that blend real and synthetic data. While synthetic data offers scalability and cost-efficiency, it often introduces systematic distributional discrepancies, particularly underrepresenting long-tail knowledge due to truncation effects from data generation mechanisms like top-p sampling, temperature scaling, and finite sampling. These discrepancies pose fundamental challenges in characterizing and evaluating the utility of mixed real-synthetic datasets. In this paper, we identify a three-phase scaling behavior characterized by two breakpoints that reflect transitions in model behavior across learning head and tail knowledge. We further derive an LLM generalization bound designed for real and synthetic mixtures, revealing several key factors that govern their generalization performance. Building on our theoretical findings, we propose an effective yet efficient data valuation method that scales to large-scale datasets. Comprehensive experiments across four tasks, including image classification, sentiment classification, instruction following, and complex reasoning, demonstrate that our method surpasses state-of-the-art baselines in data valuation with significantly low computational cost.

LGMay 1, 2023
EvoluNet: Advancing Dynamic Non-IID Transfer Learning on Graphs

Haohui Wang, Yuzhen Mao, Yujun Yan et al.

Non-IID transfer learning on graphs is crucial in many high-stakes domains. The majority of existing works assume stationary distribution for both source and target domains. However, real-world graphs are intrinsically dynamic, presenting challenges in terms of domain evolution and dynamic discrepancy between source and target domains. To bridge the gap, we shift the problem to the dynamic setting and pose the question: given the label-rich source graphs and the label-scarce target graphs both observed in previous T timestamps, how can we effectively characterize the evolving domain discrepancy and optimize the generalization performance of the target domain at the incoming T+1 timestamp? To answer it, we propose a generalization bound for dynamic non-IID transfer learning on graphs, which implies the generalization performance is dominated by domain evolution and domain discrepancy between source and target graphs. Inspired by the theoretical results, we introduce a novel generic framework named EvoluNet. It leverages a transformer-based temporal encoding module to model temporal information of the evolving domains and then uses a dynamic domain unification module to efficiently learn domain-invariant representations across the source and target domains. Finally, EvoluNet outperforms the state-of-the-art models by up to 12.1%, demonstrating its effectiveness in transferring knowledge from dynamic source graphs to dynamic target graphs.