82.1CLMay 26Code
Uncertainty-Aware Budget Allocation for Adaptive Test-Time ReasoningManh Nguyen, Sunil Gupta, Hung Le
Sampling multiple responses improves language model reasoning, but uniform compute allocation is inefficient: easy questions are over-sampled while hard questions remain under-explored. We propose Uncertainty-Aware Budget Allocation (UAB), a concave integer optimization framework that reallocates a fixed sampling budget based on per-question uncertainty estimated at no additional inference cost. In Phase 1, every question receives one generation; its average negative log-likelihood (ANLL), extracted directly from output log-probabilities, serves as a difficulty signal while the generation contributes to the final vote. In Phase 2, the remaining budget is allocated by a marginal-greedy algorithm that solves a concave coverage-maximization surrogate exactly: uncertain questions receive more sampling budget while confident questions receive fewer additional samples. Evaluated on six open-weight and black-box models spanning 1.5B to 27B parameters and five reasoning benchmarks covering math, logic, and preference tasks, UAB outperforms baselines by up to +3% in average accuracy and up to +5% on individual benchmarks, with the largest gains in low-resource settings, requiring no auxiliary model or additional LLM call. Code is publicly available at https://github.com/manhitv/UAB.
LGNov 13, 2025
Uncertainty-Guided Checkpoint Selection for Reinforcement Finetuning of Large Language ModelsManh Nguyen, Dung Nguyen, Dai Do et al.
Reinforcement learning (RL) finetuning is crucial to aligning large language models (LLMs), but the process is notoriously unstable and exhibits high variance across model checkpoints. In practice, selecting the best checkpoint is challenging: evaluating checkpoints on the validation set during training is computationally expensive and requires a good validation set, while relying on the final checkpoint provides no guarantee of good performance. We introduce an uncertainty-guided approach for checkpoint selection (UGCS) that avoids these pitfalls. Our method identifies hard question-answer pairs using per-sample uncertainty and ranks checkpoints by how well they handle these challenging cases. By averaging the rewards of the top-uncertain samples over a short training window, our method produces a stable and discriminative signal without additional forward passes or significant computation overhead. Experiments across three datasets and three LLMs demonstrate that it consistently identifies checkpoints with stronger generalization, outperforming traditional strategies such as relying on training or validation performance. These results highlight that models solving their hardest tasks with low uncertainty are the most reliable overall.
53.0CLMar 21
Hear Both Sides: Efficient Multi-Agent Debate via Diversity-Aware Message RetentionManh Nguyen, Anh Nguyen, Dung Nguyen et al.
Multi-Agent Debate has emerged as a promising framework for improving the reasoning quality of large language models through iterative inter-agent communication. However, broadcasting all agent messages at every round introduces noise and redundancy that can degrade debate quality and waste computational resources. Current approaches rely on uncertainty estimation to filter low-confidence responses before broadcasting, but this approach is unreliable due to miscalibrated confidence scores and sensitivity to threshold selection. To address this, we propose Diversity-Aware Retention (DAR), a lightweight debate framework that, at each debate round, selects the subset of agent responses that maximally disagree with each other and with the majority vote before broadcasting. Through an explicit index-based retention mechanism, DAR preserves the original messages without modification, ensuring that retained disagreements remain authentic. Experiments on diverse reasoning and question answering benchmarks demonstrate that our selective message propagation consistently improves debate performance, particularly as the number of agents scales, where noise accumulation is most severe. Our results highlight that what agents hear is as important as what agents say in multi-agent reasoning systems.
23.4CLApr 14
Beyond Majority Voting: Efficient Best-Of-N with Radial Consensus ScoreManh Nguyen, Sunil Gupta, Hung Le
Large language models (LLMs) frequently generate multiple candidate responses for a given prompt, yet selecting the most reliable one remains challenging, especially when correctness diverges from surface-level majority agreement. Existing approaches, such as self-consistency, rely on discrete voting, while probability-based methods often fail to capture relationships among candidate answers or tend to underweight high-quality but less frequent responses, and do not fully leverage the geometric structure of answer representations. To address these limitations, we introduce Radial Consensus Score (RCS), a simple, efficient, and training-free method for best-of-N selection. RCS models semantic consensus by computing a weighted Fréchet mean (semantic center) of answer embeddings and ranking candidates by their radial distance to this center. Importantly, RCS provides a general framework that supports multiple weighting schemes, including uniform, frequency-based, and probability-based variants, enabling flexible integration of agreement signals and model confidence while remaining fully applicable in black-box settings. Extensive experiments across seven benchmarks covering short-form QA and long-form reasoning tasks, and five open-weight models, demonstrate that RCS variants consistently outperform strong baselines, with gains becoming more pronounced as the sampling budget increases. RCS also serves as an effective drop-in replacement for majority voting in multi-agent debate and exhibits strong robustness in black-box scenarios. Overall, these results highlight geometric consensus as a scalable and broadly applicable principle for reliable answer selection, extending beyond majority voting to more expressive and robust aggregation in LLM inference.
AIJun 22, 2025Code
Towards Robust Fact-Checking: A Multi-Agent System with Advanced Evidence RetrievalTam Trinh, Manh Nguyen, Truong-Son Hy
The rapid spread of misinformation in the digital era poses significant challenges to public discourse, necessitating robust and scalable fact-checking solutions. Traditional human-led fact-checking methods, while credible, struggle with the volume and velocity of online content, prompting the integration of automated systems powered by Large Language Models (LLMs). However, existing automated approaches often face limitations, such as handling complex claims, ensuring source credibility, and maintaining transparency. This paper proposes a novel multi-agent system for automated fact-checking that enhances accuracy, efficiency, and explainability. The system comprises four specialized agents: an Input Ingestion Agent for claim decomposition, a Query Generation Agent for formulating targeted subqueries, an Evidence Retrieval Agent for sourcing credible evidence, and a Verdict Prediction Agent for synthesizing veracity judgments with human-interpretable explanations. Evaluated on benchmark datasets (FEVEROUS, HOVER, SciFact), the proposed system achieves a 12.3% improvement in Macro F1-score over baseline methods. The system effectively decomposes complex claims, retrieves reliable evidence from trusted sources, and generates transparent explanations for verification decisions. Our approach contributes to the growing field of automated fact-checking by providing a more accurate, efficient, and transparent verification methodology that aligns with human fact-checking practices while maintaining scalability for real-world applications. Our source code is available at https://github.com/HySonLab/FactAgent
LGDec 4, 2025
Distance Is All You Need: Radial Dispersion for Uncertainty Estimation in Large Language ModelsManh Nguyen, Sunil Gupta, Hung Le
Detecting uncertainty in large language models (LLMs) is essential for building reliable systems, yet many existing approaches are overly complex and depend on brittle semantic clustering or access to model internals. We introduce \textbf{Radial Dispersion Score (RDS)}, a simple, training-free, fully model-agnostic uncertainty metric that measures the radial dispersion of sampled generations in embedding space. Specifically, given $N$ sampled generations embedded on the unit hypersphere, RDS computes the total $\ell_1$ distance from the empirical centroid, i.e., the mean embedding, providing a direct geometric signal of semantic variability. A lightweight probability-weighted variant further incorporates the model's own token probabilities when available, outperforming nine recent state-of-the-art baselines. Moreover, RDS naturally extends to effective per-sample uncertainty estimates that complement probability- and consistency-based methods while remaining lightweight for practical use. Across four challenging free-form question-answering datasets and four LLMs, our metrics achieve state-of-the-art hallucination detection and best-of-$N$ performance, while remaining robust and scalable with respect to sample size and embedding choice. These results highlight the practical value of RDS and its contribution toward improving the trustworthiness of LLMs.
LGNov 10, 2025
Probabilities Are All You Need: A Probability-Only Approach to Uncertainty Estimation in Large Language ModelsManh Nguyen, Sunil Gupta, Hung Le
Large Language Models (LLMs) exhibit strong performance across various natural language processing (NLP) tasks but remain vulnerable to hallucinations, generating factually incorrect or misleading outputs. Uncertainty estimation, often using predictive entropy estimation, is key to addressing this issue. However, existing methods often require multiple samples or extra computation to assess semantic entropy. This paper proposes an efficient, training-free uncertainty estimation method that approximates predictive entropy using the responses' top-$K$ probabilities. Moreover, we employ an adaptive mechanism to determine $K$ to enhance flexibility and filter out low-confidence probabilities. Experimental results on three free-form question-answering datasets across several LLMs demonstrate that our method outperforms expensive state-of-the-art baselines, contributing to the broader goal of enhancing LLM trustworthiness.
CLNov 5, 2025
GRAD: Graph-Retrieved Adaptive Decoding for Hallucination MitigationManh Nguyen, Sunil Gupta, Dai Do et al.
Hallucination mitigation remains a persistent challenge for large language models (LLMs), even as model scales grow. Existing approaches often rely on external knowledge sources, such as structured databases or knowledge graphs, accessed through prompting or retrieval. However, prompt-based grounding is fragile and domain-sensitive, while symbolic knowledge integration incurs heavy retrieval and formatting costs. Motivated by knowledge graphs, we introduce Graph-Retrieved Adaptive Decoding (GRAD), a decoding-time method that grounds generation in corpus-derived evidence without retraining. GRAD constructs a sparse token transition graph by accumulating next-token logits across a small retrieved corpus in a single forward pass. During decoding, graph-retrieved logits are max-normalized and adaptively fused with model logits to favor high-evidence continuations while preserving fluency. Across three models and a range of question-answering benchmarks spanning intrinsic, extrinsic hallucination, and factuality tasks, GRAD consistently surpasses baselines, achieving up to 9.7$\%$ higher intrinsic accuracy, 8.6$\%$ lower hallucination rates, and 6.9$\%$ greater correctness compared to greedy decoding, while attaining the highest truth--informativeness product score among all methods. GRAD offers a lightweight, plug-and-play alternative to contrastive decoding and knowledge graph augmentation, demonstrating that statistical evidence from corpus-level token transitions can effectively steer generation toward more truthful and verifiable outputs.
LGAug 7, 2025
SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language ModelsDai Do, Manh Nguyen, Svetha Venkatesh et al.
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). However, such methods require extensive data and compute, making them impractical for smaller models. Current approaches to curriculum learning or data selection are largely heuristic-driven or demand extensive computational resources, limiting their scalability and generalizability. We propose \textbf{SPaRFT}, a self-paced learning framework that enables efficient learning based on the capability of the model being trained through optimizing which data to use and when. First, we apply \emph{cluster-based data reduction} to partition training data by semantics and difficulty, extracting a compact yet diverse subset that reduces redundancy. Then, a \emph{multi-armed bandit} treats data clusters as arms, optimized to allocate training samples based on model current performance. Experiments across multiple reasoning benchmarks show that SPaRFT achieves comparable or better accuracy than state-of-the-art baselines while using up to \(100\times\) fewer samples. Ablation studies and analyses further highlight the importance of both data clustering and adaptive selection. Our results demonstrate that carefully curated, performance-driven training curricula can unlock strong reasoning abilities in LLMs with minimal resources.
LGAug 4, 2025
CAAD: Context-Aware Adaptive Decoding for Truthful Text GenerationManh Nguyen, Sunil Gupta, Hung Le
Ensuring truthfulness in large language models remains a critical challenge for reliable text generation. While supervised fine-tuning and reinforcement learning with human feedback have shown promise, they require substantial amount of annotated data and computational resources, limiting scalability. In contrast, decoding-time interventions offer lightweight alternatives without model retraining. However, existing decoding strategies often face issues like prompt sensitivity, limited generalization, or dependence on internal model states. We propose a context-aware adaptive decoding method that leverages a compact reference grounding space, built from as few as 10 annotated examples and comprising pairs of context embeddings and next token logits from truthful responses, to enable retrieval-based logit shaping during inference. At each decoding step, our method retrieves top-N semantically similar contexts and aggregates their associated next token logits to modify the LLM's logits. Across three open-ended question-answering benchmarks, our approach achieves a 2.8 percent average improvement on TruthfulQA and further outperforms existing baselines on both Biographies and WikiQA. Experimental results also demonstrate cross-task generalization, with TruthfulQA-derived grounding enhancing biography generation. Our model-agnostic, scalable, and efficient method requires only a single generation pass, highlighting the potential of context-aware decoding for factual reliability in LLMs.
LGMar 27, 2024
Preventing Collapse in Contrastive Learning with Orthonormal Prototypes (CLOP)Huanran Li, Manh Nguyen, Daniel Pimentel-Alarcón
Contrastive learning has emerged as a powerful method in deep learning, excelling at learning effective representations through contrasting samples from different distributions. However, neural collapse, where embeddings converge into a lower-dimensional space, poses a significant challenge, especially in semi-supervised and self-supervised setups. In this paper, we first theoretically analyze the effect of large learning rates on contrastive losses that solely rely on the cosine similarity metric, and derive a theoretical bound to mitigate this collapse. {Building on these insights, we propose CLOP, a novel semi-supervised loss function designed to prevent neural collapse by promoting the formation of orthogonal linear subspaces among class embeddings.} Unlike prior approaches that enforce a simplex ETF structure, CLOP focuses on subspace separation, leading to more distinguishable embeddings. Through extensive experiments on real and synthetic datasets, we demonstrate that CLOP enhances performance, providing greater stability across different learning rates and batch sizes.
LGNov 27, 2025
Semi-Supervised Contrastive Learning with Orthonormal PrototypesHuanran Li, Manh Nguyen, Daniel Pimentel-Alarcón
Contrastive learning has emerged as a powerful method in deep learning, excelling at learning effective representations through contrasting samples from different distributions. However, dimensional collapse, where embeddings converge into a lower-dimensional space, poses a significant challenge, especially in semi-supervised and self-supervised setups. In this paper, we first identify a critical learning-rate threshold, beyond which standard contrastive losses converge to collapsed solutions. Building on these insights, we propose CLOP, a novel semi-supervised loss function designed to prevent dimensional collapse by promoting the formation of orthogonal linear subspaces among class embeddings. Through extensive experiments on real and synthetic datasets, we demonstrate that CLOP improves performance in image classification and object detection tasks while also exhibiting greater stability across different learning rates and batch sizes.
LGNov 27, 2025
Nonnegative Matrix Factorization through Cone CollapseManh Nguyen, Daniel Pimentel-Alarcón
Nonnegative matrix factorization (NMF) is a widely used tool for learning parts-based, low-dimensional representations of nonnegative data, with applications in vision, text, and bioinformatics. In clustering applications, orthogonal NMF (ONMF) variants further impose (approximate) orthogonality on the representation matrix so that its rows behave like soft cluster indicators. Existing algorithms, however, are typically derived from optimization viewpoints and do not explicitly exploit the conic geometry induced by NMF: data points lie in a convex cone whose extreme rays encode fundamental directions or "topics". In this work we revisit NMF from this geometric perspective and propose Cone Collapse, an algorithm that starts from the full nonnegative orthant and iteratively shrinks it toward the minimal cone generated by the data. We prove that, under mild assumptions on the data, Cone Collapse terminates in finitely many steps and recovers the minimal generating cone of $\mathbf{X}^\top$ . Building on this basis, we then derive a cone-aware orthogonal NMF model (CC-NMF) by applying uni-orthogonal NMF to the recovered extreme rays. Across 16 benchmark gene-expression, text, and image datasets, CC-NMF consistently matches or outperforms strong NMF baselines-including multiplicative updates, ANLS, projective NMF, ONMF, and sparse NMF-in terms of clustering purity. These results demonstrate that explicitly recovering the data cone can yield both theoretically grounded and empirically strong NMF-based clustering methods.
LGNov 27, 2025
Graph Contrastive Learning via Spectral Graph AlignmentManh Nguyen
Given augmented views of each input graph, contrastive learning methods (e.g., InfoNCE) optimize pairwise alignment of graph embeddings across views while providing no mechanism to control the global structure of the view specific graph-of-graphs built from these embeddings. We introduce SpecMatch-CL, a novel loss function that aligns the view specific graph-of-graphs by minimizing the difference between their normalized Laplacians. Theoretically, we show that under certain assumptions, the difference between normalized Laplacians provides an upper bound not only for the difference between the ideal Perfect Alignment contrastive loss and the current loss, but also for the Uniformly loss. Empirically, SpecMatch-CL establishes new state of the art on eight TU benchmarks under unsupervised learning and semi-supervised learning at low label rates, and yields consistent gains in transfer learning on PPI-306K and ZINC 2M datasets.
IVDec 3, 2020
Flow-based Deformation Guidance for Unpaired Multi-Contrast MRI Image-to-Image TranslationToan Duc Bui, Manh Nguyen, Ngan Le et al.
Image synthesis from corrupted contrasts increases the diversity of diagnostic information available for many neurological diseases. Recently the image-to-image translation has experienced significant levels of interest within medical research, beginning with the successful use of the Generative Adversarial Network (GAN) to the introduction of cyclic constraint extended to multiple domains. However, in current approaches, there is no guarantee that the mapping between the two image domains would be unique or one-to-one. In this paper, we introduce a novel approach to unpaired image-to-image translation based on the invertible architecture. The invertible property of the flow-based architecture assures a cycle-consistency of image-to-image translation without additional loss functions. We utilize the temporal information between consecutive slices to provide more constraints to the optimization for transforming one domain to another in unpaired volumetric medical images. To capture temporal structures in the medical images, we explore the displacement between the consecutive slices using a deformation field. In our approach, the deformation field is used as a guidance to keep the translated slides realistic and consistent across the translation. The experimental results have shown that the synthesized images using our proposed approach are able to archive a competitive performance in terms of mean squared error, peak signal-to-noise ratio, and structural similarity index when compared with the existing deep learning-based methods on three standard datasets, i.e. HCP, MRBrainS13, and Brats2019.