CLAug 20, 2024
Automating Intervention Discovery from Scientific Literature: A Progressive Ontology Prompting and Dual-LLM FrameworkYuting Hu, Dancheng Liu, Qingyun Wang et al.
Identifying effective interventions from the scientific literature is challenging due to the high volume of publications, specialized terminology, and inconsistent reporting formats, making manual curation laborious and prone to oversight. To address this challenge, this paper proposes a novel framework leveraging large language models (LLMs), which integrates a progressive ontology prompting (POP) algorithm with a dual-agent system, named LLM-Duo. On the one hand, the POP algorithm conducts a prioritized breadth-first search (BFS) across a predefined ontology, generating structured prompt templates and action sequences to guide the automatic annotation process. On the other hand, the LLM-Duo system features two specialized LLM agents, an explorer and an evaluator, working collaboratively and adversarially to continuously refine annotation quality. We showcase the real-world applicability of our framework through a case study focused on speech-language intervention discovery. Experimental results show that our approach surpasses advanced baselines, achieving more accurate and comprehensive annotations through a fully automated process. Our approach successfully identified 2,421 interventions from a corpus of 64,177 research articles in the speech-language pathology domain, culminating in the creation of a publicly accessible intervention knowledge base with great potential to benefit the speech-language pathology community.
LGFeb 1, 2025Code
Sub-Sequential Physics-Informed Learning with State Space ModelChenhui Xu, Dancheng Liu, Yuting Hu et al.
Physics-Informed Neural Networks (PINNs) are a kind of deep-learning-based numerical solvers for partial differential equations (PDEs). Existing PINNs often suffer from failure modes of being unable to propagate patterns of initial conditions. We discover that these failure modes are caused by the simplicity bias of neural networks and the mismatch between PDE's continuity and PINN's discrete sampling. We reveal that the State Space Model (SSM) can be a continuous-discrete articulation allowing initial condition propagation, and that simplicity bias can be eliminated by aligning a sequence of moderate granularity. Accordingly, we propose PINNMamba, a novel framework that introduces sub-sequence modeling with SSM. Experimental results show that PINNMamba can reduce errors by up to 86.3\% compared with state-of-the-art architecture. Our code is available at https://github.com/miniHuiHui/PINNMamba.
CLSep 13, 2024
Towards Precision Characterization of Communication Disorders using Models of Perceived Pragmatic SimilarityNigel G. Ward, Andres Segura, Georgina Bugarini et al.
The diagnosis and treatment of individuals with communication disorders offers many opportunities for the application of speech technology, but research so far has not adequately considered: the diversity of conditions, the role of pragmatic deficits, and the challenges of limited data. This paper explores how a general-purpose model of perceived pragmatic similarity may overcome these limitations. It explains how it might support several use cases for clinicians and clients, and presents evidence that a simple model can provide value, and in particular can capture utterance aspects that are relevant to diagnoses of autism and specific language impairment.
LGMay 7, 2024
Robust Implementation of Retrieval-Augmented Generation on Edge-based Computing-in-Memory ArchitecturesRuiyang Qin, Zheyu Yan, Dewen Zeng et al.
Large Language Models (LLMs) deployed on edge devices learn through fine-tuning and updating a certain portion of their parameters. Although such learning methods can be optimized to reduce resource utilization, the overall required resources remain a heavy burden on edge devices. Instead, Retrieval-Augmented Generation (RAG), a resource-efficient LLM learning method, can improve the quality of the LLM-generated content without updating model parameters. However, the RAG-based LLM may involve repetitive searches on the profile data in every user-LLM interaction. This search can lead to significant latency along with the accumulation of user data. Conventional efforts to decrease latency result in restricting the size of saved user data, thus reducing the scalability of RAG as user data continuously grows. It remains an open question: how to free RAG from the constraints of latency and scalability on edge devices? In this paper, we propose a novel framework to accelerate RAG via Computing-in-Memory (CiM) architectures. It accelerates matrix multiplications by performing in-situ computation inside the memory while avoiding the expensive data transfer between the computing unit and memory. Our framework, Robust CiM-backed RAG (RoCR), utilizing a novel contrastive learning-based training method and noise-aware training, can enable RAG to efficiently search profile data with CiM. To the best of our knowledge, this is the first work utilizing CiM to accelerate RAG.
CVJan 25, 2025
Recognize Any Surgical Object: Unleashing the Power of Weakly-Supervised DataJiajie Li, Brian R Quaranto, Chenhui Xu et al.
We present RASO, a foundation model designed to Recognize Any Surgical Object, offering robust open-set recognition capabilities across a broad range of surgical procedures and object classes, in both surgical images and videos. RASO leverages a novel weakly-supervised learning framework that generates tag-image-text pairs automatically from large-scale unannotated surgical lecture videos, significantly reducing the need for manual annotations. Our scalable data generation pipeline gathers 2,200 surgical procedures and produces 3.6 million tag annotations across 2,066 unique surgical tags. Our experiments show that RASO achieves improvements of 2.9 mAP, 4.5 mAP, 10.6 mAP, and 7.2 mAP on four standard surgical benchmarks, respectively, in zero-shot settings, and surpasses state-of-the-art models in supervised surgical action recognition tasks. Code, model, and demo are available at https://ntlm1686.github.io/raso.
SDNov 21, 2024
Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the EdgeRuiyang Qin, Dancheng Liu, Gelei Xu et al.
The combination of Large Language Models (LLM) and Automatic Speech Recognition (ASR), when deployed on edge devices (called edge ASR-LLM), can serve as a powerful personalized assistant to enable audio-based interaction for users. Compared to text-based interaction, edge ASR-LLM allows accessible and natural audio interactions. Unfortunately, existing ASR-LLM models are mainly trained in high-performance computing environments and produce substantial model weights, making them difficult to deploy on edge devices. More importantly, to better serve users' personalized needs, the ASR-LLM must be able to learn from each distinct user, given that audio input often contains highly personalized characteristics that necessitate personalized on-device training. Since individually fine-tuning the ASR or LLM often leads to suboptimal results due to modality-specific limitations, end-to-end training ensures seamless integration of audio features and language understanding (cross-modal alignment), ultimately enabling a more personalized and efficient adaptation on edge devices. However, due to the complex training requirements and substantial computational demands of existing approaches, cross-modal alignment between ASR audio and LLM can be challenging on edge devices. In this work, we propose a resource-efficient cross-modal alignment framework that bridges ASR and LLMs on edge devices to handle personalized audio input. Our framework enables efficient ASR-LLM alignment on resource-constrained devices like NVIDIA Jetson Orin (8GB RAM), achieving 50x training time speedup while improving the alignment quality by more than 50\%. To the best of our knowledge, this is the first work to study efficient ASR-LLM alignment on resource-constrained edge devices.
ROMar 10, 2025
Combating Partial Perception Deficit in Autonomous Driving with Multimodal LLM CommonsenseYuting Hu, Chenhui Xu, Ruiyang Qin et al.
Partial perception deficits can compromise autonomous vehicle safety by disrupting environmental understanding. Current protocols typically respond with immediate stops or minimal-risk maneuvers, worsening traffic flow and lacking flexibility for rare driving scenarios. In this paper, we propose LLM-RCO, a framework leveraging large language models to integrate human-like driving commonsense into autonomous systems facing perception deficits. LLM-RCO features four key modules: hazard inference, short-term motion planner, action condition verifier, and safety constraint generator. These modules interact with the dynamic driving environment, enabling proactive and context-aware control actions to override the original control policy of autonomous agents. To improve safety in such challenging conditions, we construct DriveLM-Deficit, a dataset of 53,895 video clips featuring deficits of safety-critical objects, complete with annotations for LLM-based hazard inference and motion planning fine-tuning. Extensive experiments in adverse driving conditions with the CARLA simulator demonstrate that systems equipped with LLM-RCO significantly improve driving performance, highlighting its potential for enhancing autonomous driving resilience against adverse perception deficits. Our results also show that LLMs fine-tuned with DriveLM-Deficit can enable more proactive movements instead of conservative stops in the context of perception deficits.
CLMay 27, 2025
Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data AugmentationDancheng Liu, Amir Nassereldine, Chenhui Xu et al.
Whisper's robust performance in automatic speech recognition (ASR) is often attributed to its massive 680k-hour training set, an impractical scale for most researchers. In this work, we examine how linguistic and acoustic diversity in training data affect the robustness of the ASR model and reveal that transcription generalization is primarily driven by acoustic variation rather than linguistic richness. We find that targeted acoustic augmentation methods could significantly improve the generalization ability of ASR models, reducing word-error rates by up to 19.24 percent on unseen datasets when training on the 960-hour Librispeech dataset. These findings highlight strategic acoustically focused data augmentation as a promising alternative to massive datasets for building robust ASR models, offering a potential solution to future foundation ASR models when massive human speech data is lacking.
AIMar 5, 2025
Towards Understanding Multi-Round Large Language Model Reasoning: Approximability, Learnability and GeneralizabilityChenhui Xu, Dancheng Liu, Jiajie Li et al.
Recent advancements in cognitive science and multi-round reasoning techniques for Large Language Models (LLMs) suggest that iterative thinking processes improve problem-solving performance in complex tasks. Inspired by this, approaches like Chain-of-Thought, debating, and self-refinement have been applied to auto-regressive LLMs, achieving significant successes in tasks such as mathematical reasoning, commonsense reasoning, and multi-hop question answering. Despite these successes, the theoretical basis for how multi-round reasoning enhances problem-solving abilities remains underexplored. In this work, we investigate the approximation, learnability, and generalization properties of multi-round auto-regressive models. We show that Transformers with finite context windows are universal approximators for steps of Turing-computable functions and can approximate any Turing-computable sequence-to-sequence function through multi-round reasoning. We extend PAC learning to sequence generation and demonstrate that multi-round generation is learnable even when the sequence length exceeds the model's context window. Finally, we examine how generalization error propagates across rounds, and show how the aforementioned approaches can help constrain this error, ensuring outputs stay within an expectation boundary. This work sheds light on the systemic theoretical foundations of multi-round sequence learning and reasoning, emphasizing its role in inference complexity.
LGNov 12, 2024
NVCiM-PT: An NVCiM-assisted Prompt Tuning Framework for Edge LLMsRuiyang Qin, Pengyu Ren, Zheyu Yan et al.
Large Language Models (LLMs) deployed on edge devices, known as edge LLMs, need to continuously fine-tune their model parameters from user-generated data under limited resource constraints. However, most existing learning methods are not applicable for edge LLMs because of their reliance on high resources and low learning capacity. Prompt tuning (PT) has recently emerged as an effective fine-tuning method for edge LLMs by only modifying a small portion of LLM parameters, but it suffers from user domain shifts, resulting in repetitive training and losing resource efficiency. Conventional techniques to address domain shift issues often involve complex neural networks and sophisticated training, which are incompatible for PT for edge LLMs. Therefore, an open research question is how to address domain shift issues for edge LLMs with limited resources. In this paper, we propose a prompt tuning framework for edge LLMs, exploiting the benefits offered by non-volatile computing-in-memory (NVCiM) architectures. We introduce a novel NVCiM-assisted PT framework, where we narrow down the core operations to matrix-matrix multiplication, which can then be accelerated by performing in-situ computation on NVCiM. To the best of our knowledge, this is the first work employing NVCiM to improve the edge LLM PT performance.
CLJun 25, 2024
FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech DataDancheng Liu, Jinjun Xiong
Automatic Speech Recognition (ASR) for adults' speeches has made significant progress by employing deep neural network (DNN) models recently, but improvement in children's speech is still unsatisfactory due to children's speech's distinct characteristics. DNN models pre-trained on adult data often struggle in generalizing children's speeches with fine tuning because of the lack of high-quality aligned children's speeches. When generating datasets, human annotations are not scalable, and existing forced-alignment tools are not usable as they make impractical assumptions about the quality of the input transcriptions. To address these challenges, we propose a new forced-alignment tool, FASA, as a flexible and automatic speech aligner to extract high-quality aligned children's speech data from many of the existing noisy children's speech data. We demonstrate its usage on the CHILDES dataset and show that FASA can improve data quality by 13.6$\times$ over human annotations.
CLJun 21, 2024
Large Language Models have Intrinsic Self-Correction AbilityDancheng Liu, Amir Nassereldine, Ziming Yang et al.
Large language models (LLMs) have attracted significant attention for their exceptional abilities in various natural language processing tasks, but they suffer from hallucinations that will cause performance degradation. One promising solution to improve the LLMs' performance is to ask LLMs to revise their answer after generation, a technique known as self-correction. Among the two types of self-correction, intrinsic self-correction is considered a promising direction because it does not utilize external knowledge. However, recent works doubt the validity of LLM's ability to conduct intrinsic self-correction. In this paper, we present a novel perspective on the intrinsic self-correction capabilities of LLMs through theoretical analyses and empirical experiments. In addition, we identify two critical factors for successful self-correction: zero temperature and fair prompts. Leveraging these factors, we demonstrate that intrinsic self-correction ability is exhibited across multiple existing LLMs. Our findings offer insights into the fundamental theories underlying the self-correction behavior of LLMs and remark on the importance of unbiased prompts and zero temperature settings in harnessing their full potential.
CLJun 21, 2024
PI-Whisper: Designing an Adaptive and Incremental Automatic Speech Recognition System for Edge DevicesAmir Nassereldine, Dancheng Liu, Chenhui Xu et al.
Edge-based automatic speech recognition (ASR) technologies are increasingly prevalent in the development of intelligent and personalized assistants. However, resource-constrained ASR models face significant challenges in adaptivity, incrementality, and inclusivity when faced with a diverse population. To tackle those challenges, we propose PI-Whisper, a novel ASR system that adaptively enhances recognition capabilities by identifying speakers' characteristics in real-time. In this work, we show how the design of PI-Whisper allows for incremental adaptation of new characteristics without the need for repetitive retraining, enhances recognition capabilities, and improves equity and fairness across diverse speaker groups. PI-Whisper demonstrates these advantages by achieving state-of-the-art accuracy, reducing the word error rate (WER) by up to 13.7% relative to baselines while scaling linearly to computing resources.
LGJun 6, 2024
Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge DevicesRuiyang Qin, Dancheng Liu, Chenhui Xu et al.
The scaling laws have become the de facto guidelines for designing large language models (LLMs), but they were studied under the assumption of unlimited computing resources for both training and inference. As LLMs are increasingly used as personalized intelligent assistants, their customization (i.e., learning through fine-tuning) and deployment onto resource-constrained edge devices will become more and more prevalent. An urging but open question is how a resource-constrained computing environment would affect the design choices for a personalized LLM. We study this problem empirically in this work. In particular, we consider the tradeoffs among a number of key design factors and their intertwined impacts on learning efficiency and accuracy. The factors include the learning methods for LLM customization, the amount of personalized data used for learning customization, the types and sizes of LLMs, the compression methods of LLMs, the amount of time afforded to learn, and the difficulty levels of the target use cases. Through extensive experimentation and benchmarking, we draw a number of surprisingly insightful guidelines for deploying LLMs onto resource-constrained devices. For example, an optimal choice between parameter learning and RAG may vary depending on the difficulty of the downstream task, the longer fine-tuning time does not necessarily help the model, and a compressed LLM may be a better choice than an uncompressed LLM to learn from limited personalized data.
CRJan 19, 2024
Ensembler: Protect Collaborative Inference Privacy from Model Inversion Attack via Selective EnsembleDancheng Liu, Chenhui Xu, Jiajie Li et al.
For collaborative inference through a cloud computing platform, it is sometimes essential for the client to shield its sensitive information from the cloud provider. In this paper, we introduce Ensembler, an extensible framework designed to substantially increase the difficulty of conducting model inversion attacks by adversarial parties. Ensembler leverages selective model ensemble on the adversarial server to obfuscate the reconstruction of the client's private information. Our experiments demonstrate that Ensembler can effectively shield input images from reconstruction attacks, even when the client only retains one layer of the network locally. Ensembler significantly outperforms baseline methods by up to 43.5% in structural similarity while only incurring 4.8% time overhead during inference.