Lianming Huang

CV
h-index12
6papers
140citations
Novelty46%
AI Score49

6 Papers

CLJul 18, 2024
Retrieval-Augmented Generation for Natural Language Processing: A Survey

Shangyu Wu, Ying Xiong, Yufei Cui et al.

Large language models (LLMs) have demonstrated great success in various fields, benefiting from their huge amount of parameters that store knowledge. However, LLMs still suffer from several key issues, such as hallucination problems, knowledge update issues, and lacking domain-specific expertise. The appearance of retrieval-augmented generation (RAG), which leverages an external knowledge database to augment LLMs, makes up those drawbacks of LLMs. This paper reviews all significant techniques of RAG, especially in the retriever and the retrieval fusions. Besides, tutorial codes are provided for implementing the representative techniques in RAG. This paper further discusses the RAG update, including RAG with/without knowledge update. Then, we introduce RAG evaluation and benchmarking, as well as the application of RAG in representative NLP tasks and industrial scenarios. Finally, this paper discusses RAG's future directions and challenges for promoting this field's development.

CVNov 25, 2025
DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving

Haibo HU, Lianming Huang, Nan Guan et al.

Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep transformer stacks. We present DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories. Instead of relying on confidence scores, DeeAD terminates inference when predicted trajectories align with lightweight planning priors (e.g., Navigation or Low-precision Planning) within a tolerable deviation (<2m). To improve efficiency, we introduce a multi-hop controller that adaptively skips redundant layers based on the change rate of scores. DeeAD integrates into existing VLA models, such as ORION, without requiring retraining. Experiments on the Bench2Drive benchmark demonstrate up to 28% transformer-layer sparsity and 29% latency reduction, while preserving planning quality and safety.

LGNov 25, 2025
On-Demand Multi-Task Sparsity for Efficient Large-Model Deployment on Edge Devices

Lianming Huang, Haibo Hu, Qiao Li et al.

Sparsity is essential for deploying large models on resource constrained edge platforms. However, optimizing sparsity patterns for individual tasks in isolation ignores the significant I/O overhead incurred during frequent task switching. We introduce an on-demand multi-task sparsity framework specifically designed to minimize switching costs by maximizing parameter reuse. Unlike monolithic approaches, we decompose weights into reusable block-granular units and align sparse structures across tasks to maximize overlap. By dynamically loading only the small differential set of blocks required for the next task, our method effectively mitigates the cold-start latency inherent in traditional monolithic approaches.Experiments on a real-world autonomous driving platform demonstrate that our framework achieves superior switching efficiency, accelerating task switching by over 6.6X on average compared to existing sparsity methods.

ROOct 2, 2025
Nav-EE: Navigation-Guided Early Exiting for Efficient Vision-Language Models in Autonomous Driving

Haibo Hu, Lianming Huang, Xinyu Wang et al.

Vision-Language Models (VLMs) are increasingly applied in autonomous driving for unified perception and reasoning, but high inference latency hinders real-time deployment. Early-exit reduces latency by terminating inference at intermediate layers, yet its task-dependent nature limits generalization across diverse scenarios. We observe that this limitation aligns with autonomous driving: navigation systems can anticipate upcoming contexts (e.g., intersections, traffic lights), indicating which tasks will be required. We propose Nav-EE, a navigation-guided early-exit framework that precomputes task-specific exit layers offline and dynamically applies them online based on navigation priors. Experiments on CODA, Waymo, and BOSCH show that Nav-EE achieves accuracy comparable to full inference while reducing latency by up to 63.9%. Real-vehicle integration with Autoware Universe further demonstrates reduced inference latency (600ms to 300ms), supporting faster decision-making in complex scenarios. These results suggest that coupling navigation foresight with early-exit offers a viable path toward efficient deployment of large models in autonomous systems. Code and data are available at our anonymous repository: https://anonymous.4open.science/r/Nav-EE-BBC4

CVAug 20, 2025
GM-Skip: Metric-Guided Transformer Block Skipping for Efficient Vision-Language Models

Lianming Huang, Haibo Hu, Qiao Li et al.

Transformer-based Vision-Language Models (VLMs) have achieved impressive performance on tasks such as image captioning, object recognition, and visual reasoning, but their high computational cost hinders deployment in latency-sensitive applications like autonomous driving. We introduce GM-Skip, a flexible and metric-adaptive framework for Transformer block skipping that accelerates VLM inference while preserving output quality. GM-Skip features a greedy, metric-guided block selection strategy that uses metric feedback (e.g., accuracy, CIDEr) to identify redundant layers, along with a reverse-order deletion mechanism that preserves early foundational blocks to avoid performance collapse. To support diverse deployment needs, it incorporates a tunable trade-off between sparsity and performance via a score-sparsity balance objective. Experiments across multiple tasks and datasets, including COCO and CODA, show that GM-Skip consistently improves inference speed while maintaining task performance. On the COCO dataset, GM-Skip improves single-object classification accuracy on the Person category from 19.1 percent to 87.3 percent while skipping more than 40 percent of Transformer blocks. In real-world deployment, it achieves up to 45.4 percent latency reduction on single-object detection when integrated into an autonomous vehicle running Autoware.Universe, validating the effectiveness of its skip configurations and confirming its practical value in accelerating real-world inference.

CVJun 4, 2025
AD-EE: Early Exiting for Fast and Reliable Vision-Language Models in Autonomous Driving

Lianming Huang, Haibo Hu, Yufei Cui et al.

With the rapid advancement of autonomous driving, deploying Vision-Language Models (VLMs) to enhance perception and decision-making has become increasingly common. However, the real-time application of VLMs is hindered by high latency and computational overhead, limiting their effectiveness in time-critical driving scenarios. This challenge is particularly evident when VLMs exhibit over-inference, continuing to process unnecessary layers even after confident predictions have been reached. To address this inefficiency, we propose AD-EE, an Early Exit framework that incorporates domain characteristics of autonomous driving and leverages causal inference to identify optimal exit layers. We evaluate our method on large-scale real-world autonomous driving datasets, including Waymo and the corner-case-focused CODA, as well as on a real vehicle running the Autoware Universe platform. Extensive experiments across multiple VLMs show that our method significantly reduces latency, with maximum improvements reaching up to 57.58%, and enhances object detection accuracy, with maximum gains of up to 44%.