CLAug 6, 2024Code
Dopamin: Transformer-based Comment Classifiers through Domain Post-Training and Multi-level Layer AggregationNam Le Hai, Nghi D. Q. Bui
Code comments provide important information for understanding the source code. They can help developers understand the overall purpose of a function or class, as well as identify bugs and technical debt. However, an overabundance of comments is meaningless and counterproductive. As a result, it is critical to automatically filter out these comments for specific purposes. In this paper, we present Dopamin, a Transformer-based tool for dealing with this issue. Our model excels not only in presenting knowledge sharing of common categories across multiple languages, but also in achieving robust performance in comment classification by improving comment representation. As a result, it outperforms the STACC baseline by 3% on the NLBSE'24 Tool Competition dataset in terms of average F1-score, while maintaining a comparable inference time for practical use. The source code is publicity available at https://github.com/FSoft-AI4Code/Dopamin.
33.8CLMay 2
MemORAI: Memory Organization and Retrieval via Adaptive Graph Intelligence for LLM Conversational AgentsHung Pham Van, Nguyen Manh Hieu, Khang Pham Tran Tuan et al.
Large Language Models (LLMs) lack persistent memory for long-term personalized conversations. Existing graph-based memory systems suffer from information dilution, absent provenance tracking, and uniform retrieval that ignores query context. We introduce MemORAI (Memory Organization and Retrieval via Adaptive Graph Intelligence), a framework that integrates three innovations: selective memory filtering with dual-layer compression to retain user-persona-relevant content, a provenance-enriched multi-relational graph tracking factual origins at the turn level, and query-adaptive subgraph retrieval with Dynamic Weighted PageRank that applies query-conditioned edge weighting. Evaluated on LOCOMO and LongMemEval benchmarks, MemORAI achieves state-of-the-art performance in memory retrieval and personalized response generation, demonstrating that selective storage, enriched representation, and adaptive retrieval are essential for coherent, personalized LLM agents.
SEJun 17, 2024Code
On the Impacts of Contexts on Repository-Level Code GenerationNam Le Hai, Dung Manh Nguyen, Nghi D. Q. Bui
CodeLLMs have gained widespread adoption for code generation tasks, yet their capacity to handle repository-level code generation with complex contextual dependencies remains underexplored. Our work underscores the critical importance of leveraging repository-level contexts to generate executable and functionally correct code. We present RepoExec, a novel benchmark designed to evaluate repository-level code generation, with a focus on three key aspects: executability, functional correctness through comprehensive test case generation, and accurate utilization of cross-file contexts. Our study examines a controlled scenario where developers specify essential code dependencies (contexts), challenging models to integrate them effectively. Additionally, we introduce an instruction-tuned dataset that enhances CodeLLMs' ability to leverage dependencies, along with a new metric, Dependency Invocation Rate (DIR), to quantify context utilization. Experimental results reveal that while pretrained LLMs demonstrate superior performance in terms of correctness, instruction-tuned models excel in context utilization and debugging capabilities. RepoExec offers a comprehensive evaluation framework for assessing code functionality and alignment with developer intent, thereby advancing the development of more reliable CodeLLMs for real-world applications. The dataset and source code are available at https://github.com/FSoft-AI4Code/RepoExec.
CLFeb 24, 2025
CoT2Align: Cross-Chain of Thought Distillation via Optimal Transport Alignment for Language Models with Different TokenizersAnh Duc Le, Tu Vu, Nam Le Hai et al.
Large Language Models (LLMs) achieve state-of-the-art performance across various NLP tasks but face deployment challenges due to high computational costs and memory constraints. Knowledge distillation (KD) is a promising solution, transferring knowledge from large teacher models to smaller student models. However, existing KD methods often assume shared vocabularies and tokenizers, limiting their flexibility. While approaches like Universal Logit Distillation (ULD) and Dual-Space Knowledge Distillation (DSKD) address vocabulary mismatches, they overlook the critical \textbf{reasoning-aware distillation} aspect. To bridge this gap, we propose CoT2Align a universal KD framework that integrates Chain-of-Thought (CoT) augmentation and introduces Cross-CoT Alignment to enhance reasoning transfer. Additionally, we extend Optimal Transport beyond token-wise alignment to a sequence-level and layer-wise alignment approach that adapts to varying sequence lengths while preserving contextual integrity. Comprehensive experiments demonstrate that CoT2Align outperforms existing KD methods across different vocabulary settings, improving reasoning capabilities and robustness in domain-specific tasks.
CLMay 9, 2023
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and GenerationDung Nguyen Manh, Nam Le Hai, Anh T. V. Dau et al.
We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We present methods for thoroughly extracting samples that use both rule-based and deep learning-based methods to ensure that they contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text pairs. Our extensive evaluations on common coding tasks including code generation, code search and code summarization show that when fine-tuning Code Large Language Models on The Vault, such models outperform the same models trained on other datasets such as CodeSearchNet. We also provide detailed analyses of our datasets to assess the effects of various programming languages and docstrings on the performance of such models.