CL · Jul 30, 2022
Dynamically Retrieving Knowledge via Query Generation for Informative Dialogue Generation
Zhongtian Hu, Lifang Wang, Yangqi Chen et al.
Knowledge-driven dialog systems have recently made remarkable breakthroughs. Compared with general dialog systems, knowledge-driven dialog systems can generate more informative and knowledgeable responses when given pre-provided knowledge. However, in practical applications, the dialog system cannot be provided with the corresponding knowledge in advance because it cannot know how the conversation will develop. Therefore, to make knowledge-driven dialog systems more practical, it is vital to find a way to retrieve relevant knowledge based on the dialogue history. To solve this problem, we design a knowledge-driven dialog system named DRKQG (Dynamically Retrieving Knowledge via Query Generation for informative dialog response). Specifically, the system is divided into two modules: the query generation module and the dialog generation module. First, a time-aware mechanism is utilized to capture context information, and a query is generated for retrieving knowledge through a search engine. Then, we integrate the copy mechanism and transformers, which allows the response generation module to produce responses derived from the context and retrieved knowledge. Experimental results at LIC2022, the Language and Intelligence Technology Competition, show that our system outperforms the baseline model by a large margin on automatic evaluation metrics, while human evaluation by the Baidu Linguistics team shows that our system achieves impressive results on the Factually Correct and Knowledgeable criteria.
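As a rough illustration of what a copy mechanism does, the sketch below mixes a generator's vocabulary distribution with attention weights over source tokens, in the style of a pointer-generator network. It is not taken from the DRKQG paper: the function name `mix_copy_distribution`, the gate `p_gen`, and the toy numbers are all assumptions made for this example.

```python
def mix_copy_distribution(vocab_dist, attention, source_tokens, p_gen):
    """Blend generation and copy probabilities (pointer-generator style).

    vocab_dist:    dict token -> probability from the decoder's softmax
    attention:     attention weights over source positions (sums to 1)
    source_tokens: tokens at those positions (context or retrieved knowledge)
    p_gen:         probability of generating rather than copying (hypothetical gate)
    """
    # Scale the generator's distribution by the generation gate.
    final = {tok: p_gen * p for tok, p in vocab_dist.items()}
    # Add copy mass: each source token receives its attention weight.
    for tok, weight in zip(source_tokens, attention):
        final[tok] = final.get(tok, 0.0) + (1.0 - p_gen) * weight
    return final


# Toy example: "Paris" is out of vocabulary but appears in the retrieved knowledge.
vocab_dist = {"the": 0.5, "city": 0.5}
mixed = mix_copy_distribution(vocab_dist, [0.75, 0.25], ["Paris", "city"], p_gen=0.5)
```

In this toy case the out-of-vocabulary token "Paris" ends up with non-zero probability purely through copying, which is the property that lets a response reuse wording from retrieved knowledge.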
IR · Jan 11, 2024
UniRQR: A Unified Model for Retrieval Decision, Query, and Response Generation in Internet-Based Knowledge Dialogue Systems
Zhongtian Hu, Yangqi Chen, Meng Zhao et al.
Knowledge-based dialogue systems with internet retrieval have recently attracted considerable attention from researchers. These systems overcome a major limitation of traditional knowledge dialogue systems, where the timeliness of knowledge cannot be assured, and therefore offer greater practical value. Knowledge-based dialogue systems with internet retrieval can typically be divided into three tasks: Retrieval Decision, Query Generation, and Response Generation. However, many studies assume that all conversations require external knowledge to continue, neglecting the critical step of determining when retrieval is necessary. This assumption often leads to an over-dependence on external knowledge, even when it may not be required. Our work addresses this oversight by employing a single unified model facilitated by prompt and multi-task learning approaches. This model not only decides whether retrieval is necessary but also generates retrieval queries and responses. By integrating these functions, our system leverages the full potential of pre-trained models and reduces the complexity and costs associated with deploying multiple models. We conducted extensive experiments to investigate the mutual enhancement among the three tasks in our system. Moreover, the experimental results on the Wizint and Dusinc datasets not only demonstrate that our unified model surpasses the baseline performance for individual tasks, but also reveal that it achieves comparable results when contrasted with SOTA systems that deploy separate, specialized models for each task.
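The three-task flow described above can be sketched as a single model called with different task prompts. This is only an illustration under stated assumptions: the prompt tags `[decide]`, `[query]`, `[respond]`, and `[knowledge]`, the "yes"/"no" decision format, and the `model` and `search` callables are hypothetical and are not specified in the abstract.

```python
def respond(history, model, search):
    """Route one dialogue turn through decision, query, and response steps.

    model:  callable str -> str, the single unified model (hypothetical interface)
    search: callable str -> str, an internet retrieval function (hypothetical)
    """
    # Task 1: decide whether external knowledge is needed at all.
    decision = model("[decide] " + history)
    knowledge = ""
    if decision.strip().lower() == "yes":
        # Task 2: generate a search query only when retrieval is needed.
        query = model("[query] " + history)
        knowledge = search(query)
    # Task 3: generate the response, with or without retrieved knowledge.
    return model("[respond] " + history + " [knowledge] " + knowledge)


# Stand-in model and search engine, used only to exercise the control flow.
calls = []

def fake_model(prompt):
    calls.append(prompt.split(" ", 1)[0])
    if prompt.startswith("[decide]"):
        return "yes" if "weather" in prompt else "no"
    if prompt.startswith("[query]"):
        return "weather today"
    return "reply given: " + prompt.split("[knowledge] ", 1)[1]

def fake_search(query):
    return "sunny" if query == "weather today" else ""

with_retrieval = respond("What is the weather?".lower(), fake_model, fake_search)
calls_with = list(calls)
calls.clear()
without_retrieval = respond("hello there", fake_model, fake_search)
calls_without = list(calls)
```

The point of the sketch is the control flow: chit-chat turns skip the query and search steps entirely, which is how a retrieval decision avoids over-dependence on external knowledge.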
CV · Jul 12, 2025
Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models
Xiao Liang, Di Wang, Zhicheng Jiao et al.
The rapid advancements in Vision Language Models (VLMs) have prompted the development of multi-modal medical assistant systems. Despite this progress, current models still have inherent probabilistic uncertainties, often producing erroneous or unverified responses, an issue with serious implications in medical applications. Existing methods aim to enhance the performance of Medical Vision Language Models (MedVLMs) by adjusting model structure, fine-tuning with high-quality data, or applying preference fine-tuning. However, these training-dependent strategies are costly and still lack sufficient alignment with clinical expertise. To address these issues, we propose an expert-in-the-loop framework named Expert-Controlled Classifier-Free Guidance (Expert-CFG) to align MedVLMs with clinical expertise without additional training. This framework introduces an uncertainty estimation strategy to identify unreliable outputs. It then retrieves relevant references to assist experts in highlighting key terms and applies classifier-free guidance to refine the token embeddings of the MedVLM, ensuring that the adjusted outputs are correct and align with expert highlights. Evaluations across three medical visual question answering benchmarks demonstrate that the proposed Expert-CFG, with 4.2B parameters and limited expert annotations, outperforms state-of-the-art models with 13B parameters. The results demonstrate the feasibility of deploying such a system in resource-limited settings for clinical use.
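For readers unfamiliar with classifier-free guidance, the generic combination rule can be sketched on next-token logits as below. This shows only the standard formula, guided = unconditional + scale × (conditional − unconditional); how Expert-CFG applies guidance to token embeddings and expert highlights is not detailed in the abstract, so the function names and toy logits here are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance: push logits toward the conditioned run.

    scale = 0 returns the unconditional logits, scale = 1 the conditional ones,
    and scale > 1 extrapolates beyond the conditional prediction.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy logits over three candidate tokens (illustrative values only).
uncond = [2.0, 1.0, 0.5]   # model run without the expert-highlighted terms
cond = [1.0, 3.0, 0.5]     # model run conditioned on the highlighted terms
guided = cfg_combine(cond, uncond, scale=1.5)
probs = softmax(guided)
```

With a scale above 1, the token favored by the conditioned run gains probability mass relative to either run alone, which is why guidance can steer outputs without any additional training.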
CV · Jul 9, 2025
CheXPO: Preference Optimization for Chest X-ray VLMs with Counterfactual Rationale
Xiao Liang, Jiawei Hu, Di Wang et al.
Vision-language models (VLMs) are prone to hallucinations that critically compromise reliability in medical applications. While preference optimization can mitigate these hallucinations through clinical feedback, its implementation faces challenges such as clinically irrelevant training samples, imbalanced data distributions, and prohibitive expert annotation costs. To address these challenges, we introduce CheXPO, a Chest X-ray Preference Optimization strategy that combines confidence-similarity joint mining with counterfactual rationale. Our approach begins by synthesizing a unified, fine-grained multi-task chest X-ray visual instruction dataset across different question types for supervised fine-tuning (SFT). We then identify hard examples through token-level confidence analysis of SFT failures and use similarity-based retrieval to expand hard examples for balancing preference sample distributions, while synthetic counterfactual rationales provide fine-grained clinical preferences, eliminating the need for additional expert input. Experiments show that CheXPO achieves 8.93% relative performance gain using only 5% of SFT samples, reaching state-of-the-art performance across diverse clinical tasks and providing a scalable, interpretable solution for real-world radiology applications.
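One simple way to picture token-level confidence analysis of SFT failures is to score each generated answer by the geometric mean of its token probabilities and then rank the incorrect answers by that score. The abstract does not give CheXPO's actual scoring rule or selection criterion, so the scoring function, the sample fields, and the ranking direction below are assumptions made for illustration.

```python
import math


def sequence_confidence(token_probs):
    """Geometric mean of per-token probabilities (an assumed confidence score)."""
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))


def rank_failures(samples):
    """Keep only incorrect predictions, sorted from least to most confident."""
    failures = [s for s in samples if not s["correct"]]
    return sorted(failures, key=lambda s: sequence_confidence(s["token_probs"]))


# Hypothetical SFT evaluation records.
samples = [
    {"id": "a", "correct": True, "token_probs": [0.9, 0.9]},
    {"id": "b", "correct": False, "token_probs": [0.5, 0.5]},
    {"id": "c", "correct": False, "token_probs": [0.9, 0.4]},
]
ranked_ids = [s["id"] for s in rank_failures(samples)]
```

The ranked failures could then serve as seeds for similarity-based retrieval of further hard examples, as the abstract describes.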
CL · Jan 20, 2025
Advancing Multi-Party Dialogue Framework with Speaker-ware Contrastive Learning
Zhongtian Hu, Qi He, Ronghan Li et al.
Multi-party dialogues, common in collaborative scenarios like brainstorming sessions and negotiations, pose significant challenges due to their complexity and diverse speaker roles. Current methods often use graph neural networks to model dialogue context, capturing structural dynamics but heavily relying on annotated graph structures and overlooking individual speaking styles. To address these challenges, we propose CMR, a Contrastive learning-based Multi-party dialogue Response generation framework. CMR employs a two-stage self-supervised contrastive learning framework. First, it captures global differences in speaking styles across individuals. Then, it focuses on intra-conversation comparisons to identify thematic transitions and contextually relevant facts. To the best of our knowledge, this is the first approach that applies contrastive learning in multi-party dialogue generation. Experimental results demonstrate that CMR not only significantly outperforms state-of-the-art models, but also generalizes well to large pre-trained language models, effectively enhancing their capability in handling multi-party conversations.
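As background on the contrastive objective family, the sketch below implements a generic InfoNCE loss over cosine similarities: the anchor is pulled toward one positive and pushed away from negatives. It is not CMR's objective, which the abstract does not specify; the function names, the temperature value, and the two-dimensional toy embeddings are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax of the positive pair's similarity."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Toy embeddings: utterances by the same speaker should sit close together.
same_speaker = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
mismatched = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In a speaker-aware setting, the positive could be another utterance from the same speaker and the negatives utterances from other speakers, though that pairing is an interpretation rather than something the abstract states.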
CL · Jan 20, 2025
Can MLLMs Generalize to Multi-Party dialog? Exploring Multilingual Response Generation in Complex Scenarios
Zhongtian Hu, Yiwen Cui, Ronghan Li et al.
Current multilingual large language models (MLLMs) still focus on simple question-answering formats, often overlooking more complex dialogue scenarios. In other words, the capabilities of multilingual large models have yet to be validated in dialogue tasks with intricate structures. We therefore ask, Q1: How well do LLMs generalize to more complex dialog scenarios? Q2: Can supervised fine-tuning on a high-quality parallel benchmark restore this ability? Q3: Does the "multilingual complementarity" effect survive in this setting? To answer these questions, we introduce XMP, a high-quality parallel Multilingual dataset sourced from Multi-party Podcast dialogues, which is the first parallel dataset focusing on multi-party dialogue scenarios. Most samples in the dataset feature three or more participants discussing a wide range of topics. Through extensive experiments, we find that, R1: MLLMs fail to generalize to multi-party settings; R2: Fine-tuning on XMP improves performance only marginally, with the 70B model achieving at most a 1% absolute gain over its 8B counterpart; R3: Mixing languages during SFT is usually detrimental, with any benefits being marginal and limited to isolated cases in the 70B model.