MM Jul 3, 2024
Contrast then Memorize: Semantic Neighbor Retrieval-Enhanced Inductive Multimodal Knowledge Graph Completion
Yu Zhao, Ying Zhang, Baohang Zhou et al.
Many studies have addressed Multimodal Knowledge Graph Completion (MKGC), which predicts the missing links in MKGs. However, few have studied inductive MKGC (IMKGC), which involves emerging entities unseen during training. Existing inductive approaches focus on learning textual entity representations, neglecting the rich semantic information in the visual modality. Moreover, they focus on aggregating structural neighbors from existing KGs, which are usually limited for emerging entities. In contrast, semantic neighbors are decoupled from the topology linkage and usually imply the true target entity. In this paper, we propose the IMKGC task and a semantic neighbor retrieval-enhanced IMKGC framework, CMR, in which the contrast step brings helpful semantic neighbors close and the memorize step supports semantic neighbor retrieval to enhance inference. Specifically, we first propose a unified cross-modal contrastive learning scheme that simultaneously captures the textual-visual and textual-textual correlations of query-entity pairs in a unified representation space. The contrastive learning increases the similarity of positive query-entity pairs, thereby bringing the representations of helpful semantic neighbors close. Then, we explicitly memorize the knowledge representations to support semantic neighbor retrieval. At test time, we retrieve the nearest semantic neighbors and interpolate them with the query-entity similarity distribution to augment the final prediction. Extensive experiments validate the effectiveness of CMR on three inductive MKGC datasets. Code is available at https://github.com/OreOZhao/CMR.
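The test-time step described above (retrieve the nearest memorized neighbors, then interpolate them with the query-entity similarity distribution) can be sketched roughly as follows. This is a minimal illustration in the spirit of kNN-augmented inference, not the CMR implementation: the function names, the dot-product similarity, the neighbor count `k`, and the interpolation weight `lam` are all assumptions made for the example.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def knn_interpolate(query, entity_embs, memory, k=2, lam=0.5):
    """Blend the model's query-entity distribution with a retrieved-neighbor one.

    memory is a list of (key_embedding, target_entity_index) pairs (hypothetical
    layout); k and lam are illustrative hyperparameters.
    """
    # Model distribution: softmax over query-entity similarities.
    p_model = softmax([dot(query, e) for e in entity_embs])
    # Retrieve the k memorized keys most similar to the query.
    nearest = sorted(memory, key=lambda kv: dot(query, kv[0]), reverse=True)[:k]
    weights = softmax([dot(query, key) for key, _ in nearest])
    # Neighbor distribution: accumulate neighbor weights on their target entities.
    p_knn = [0.0] * len(entity_embs)
    for w, (_, target) in zip(weights, nearest):
        p_knn[target] += w
    # Linear interpolation of the two distributions.
    return [lam * pk + (1 - lam) * pm for pk, pm in zip(p_knn, p_model)]

query = [1.0, 0.0]
entity_embs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
memory = [([1.0, 0.0], 2), ([0.9, 0.1], 2), ([0.0, 1.0], 1)]
final = knn_interpolate(query, entity_embs, memory)
```

In this toy setup the model alone slightly prefers entity 0, while both retrieved neighbors point to entity 2, so the interpolated distribution shifts toward entity 2.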
CV Nov 14, 2025
Hindsight Distillation Reasoning with Knowledge Encouragement Preference for Knowledge-based Visual Question Answering
Yu Zhao, Ying Zhang, Xuhui Sui et al.
Knowledge-based Visual Question Answering (KBVQA) requires incorporating external knowledge beyond cross-modal understanding. Existing KBVQA methods either utilize implicit knowledge in multimodal large language models (MLLMs) via in-context learning or explicit knowledge via retrieval-augmented generation. However, their reasoning processes remain implicit, without explicit multi-step trajectories from MLLMs. To address this gap, we provide a Hindsight Distilled Reasoning (HinD) framework with Knowledge Encouragement Preference Optimization (KEPO), designed to elicit and harness the internal knowledge reasoning ability of MLLMs. First, to tackle the reasoning supervision problem, we propose to emphasize the hindsight wisdom of the MLLM by prompting a frozen 7B-size MLLM to complete the reasoning process between the question and its ground truth answer, constructing Hindsight-Zero training data. We then self-distill Hindsight-Zero into a Chain-of-Thought (CoT) Generator and a Knowledge Generator, enabling the generation of sequential steps and discrete facts. Second, to tackle the misalignment between knowledge correctness and confidence, we optimize the Knowledge Generator with KEPO, preferring under-confident but helpful knowledge over over-confident but unhelpful knowledge. The generated CoT and sampled knowledge are then exploited for answer prediction. Experiments on OK-VQA and A-OKVQA validate the effectiveness of HinD, showing that HinD with reasoning elicited from a 7B-size MLLM achieves superior performance without commercial model APIs or outside knowledge.
CL Sep 9, 2022
Multi-grained Label Refinement Network with Dependency Structures for Joint Intent Detection and Slot Filling
Baohang Zhou, Ying Zhang, Xuhui Sui et al.
Slot filling and intent detection are two fundamental tasks in the field of natural language understanding. Due to the strong correlation between these two tasks, previous studies have modeled them with multi-task learning or designed feature interaction modules to improve the performance of each task. However, none of the existing approaches consider the relevance between the structural information of sentences and the label semantics of the two tasks. The intent and semantic components of an utterance depend on the syntactic elements of the sentence. In this paper, we investigate a multi-grained label refinement network, which utilizes dependency structures and label semantic embeddings. To enhance syntactic representations, we introduce the dependency structures of sentences into our model through a graph attention layer. To capture the semantic dependency between the syntactic information and task labels, we combine the task-specific features with the corresponding label embeddings through an attention mechanism. The experimental results demonstrate that our model achieves competitive performance on two public datasets.
MM Mar 17
Hyperbolic Multimodal Generative Representation Learning for Generalized Zero-Shot Multimodal Information Extraction
Baohang Zhou, Kehui Song, Rize Jin et al.
Multimodal information extraction (MIE) comprises a set of essential tasks aimed at extracting structural information from Web texts paired with images, to facilitate the structural construction of Web-based semantic knowledge. To address the expanding category set, which includes newly emerging entity types or relations on websites, prior research proposed the zero-shot MIE (ZS-MIE) task, which aims to extract unseen structural knowledge with textual and visual modalities. However, ZS-MIE models are limited to recognizing samples that fall within the unseen category set, and they struggle with real-world scenarios that encompass both seen and unseen categories. The shortcomings of existing methods can be ascribed to two main aspects. On the one hand, these methods construct representations of samples and categories within Euclidean space, failing to capture the hierarchical semantic relationships between the two modalities within a sample and their corresponding category prototypes. On the other hand, there is a notable gap in the distribution of semantic similarity between seen and unseen category sets, which impacts the generative capability of the ZS-MIE models. To overcome these shortcomings, we delve into the generalized zero-shot MIE (GZS-MIE) task and propose the hyperbolic multimodal generative representation learning framework (HMGRL). The variational information bottleneck and autoencoder networks are reconstructed in hyperbolic space to model the multi-level hierarchical semantic correlations among samples and prototypes. Furthermore, the proposed model is trained with the unseen samples generated by the decoder, and we introduce the semantic similarity distribution alignment loss to enhance the model's generalization performance. Experimental evaluations on two benchmark datasets underscore the superiority of HMGRL compared to existing baseline methods.
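To give a feel for why hyperbolic space suits hierarchical relationships better than Euclidean space, the sketch below computes the standard geodesic distance in the Poincaré ball model. The abstract does not say which model of hyperbolic space HMGRL uses, so the choice of the Poincaré ball (with curvature -1) and the function name are assumptions made purely for illustration.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball.

    Uses d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((x - y) ** 2 for x, y in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

# Two pairs with the same Euclidean gap (0.1) at different depths of the ball.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
```

The same Euclidean gap corresponds to a much larger hyperbolic distance near the boundary than near the origin, which is the property that leaves room for tree-like structure: general concepts can sit near the origin and fine-grained ones spread out toward the boundary.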
CL Nov 6, 2025
Plan of Knowledge: Retrieval-Augmented Large Language Models for Temporal Knowledge Graph Question Answering
Xinying Qian, Ying Zhang, Yu Zhao et al.
Temporal Knowledge Graph Question Answering (TKGQA) aims to answer time-sensitive questions by leveraging factual information from Temporal Knowledge Graphs (TKGs). While previous studies have employed pre-trained TKG embeddings or graph neural networks to inject temporal knowledge, they fail to fully understand the complex semantic information of time constraints. Recently, Large Language Models (LLMs) have shown remarkable progress, benefiting from their strong semantic understanding and reasoning generalization capabilities. However, their temporal reasoning ability remains limited: LLMs frequently suffer from hallucination and a lack of knowledge. To address these limitations, we propose the Plan of Knowledge (PoK) framework with a contrastive temporal retriever. Specifically, the proposed Plan of Knowledge module decomposes a complex temporal question into a sequence of sub-objectives from the pre-defined tools, serving as intermediate guidance for reasoning exploration. In parallel, we construct a Temporal Knowledge Store (TKS) with a contrastive retrieval framework, enabling the model to selectively retrieve semantically and temporally aligned facts from TKGs. By combining structured planning with temporal knowledge retrieval, PoK effectively enhances the interpretability and factual consistency of temporal reasoning. Extensive experiments on four benchmark TKGQA datasets demonstrate that PoK significantly improves the retrieval precision and reasoning accuracy of LLMs, surpassing the performance of state-of-the-art TKGQA methods by up to 56.0%.
CL Feb 21, 2024
Self-DC: When to Reason and When to Act? Self Divide-and-Conquer for Compositional Unknown Questions
Hongru Wang, Boyang Xue, Baohang Zhou et al.
Previous research has typically concentrated on leveraging the internal knowledge of Large Language Models (LLMs) to answer known questions (i.e., internal reasoning such as generate-then-read). In contrast, for questions that fall outside their known scope, these models rely on external knowledge retrieval to provide accurate responses (i.e., external acting such as retrieve-then-read). However, few previous works consider compositional questions, which consist of several known and unknown sub-questions and therefore require a dynamic combination of the previous two methods (i.e., internal reasoning and external acting) to achieve a better trade-off between effectiveness and efficiency. To this end, we introduce a Self Divide-and-Conquer (Self-DC) framework, accompanied by the first Compositional unknown Question-Answering dataset (CuQA). This framework enables LLMs to adaptively choose between using internal knowledge and retrieving external knowledge as needed, resulting in a better trade-off between effectiveness and efficiency. Experimental results on two datasets demonstrate that Self-DC can achieve comparable or even better performance with far fewer external calls compared with several strong baselines.
CL Feb 26, 2024
UniRetriever: Multi-task Candidates Selection for Various Context-Adaptive Conversational Retrieval
Hongru Wang, Boyang Xue, Baohang Zhou et al.
Conversational retrieval refers to an information retrieval system that operates in an iterative and interactive manner, requiring the retrieval of various external resources, such as persona, knowledge, and even response, to effectively engage with the user and successfully complete the dialogue. However, most previous work trained independent retrievers for each specific resource, resulting in sub-optimal performance and low efficiency. Thus, we propose a multi-task framework that functions as a universal retriever for three dominant retrieval tasks during the conversation: persona selection, knowledge selection, and response selection. To this end, we design a dual-encoder architecture consisting of a context-adaptive dialogue encoder and a candidate encoder, aiming to attend to the relevant context from the long dialogue and retrieve suitable candidates with a simple dot product. Furthermore, we introduce two loss constraints to capture the subtle relationship between dialogue context and different candidates by regarding historically selected candidates as hard negatives. Extensive experiments and analysis establish state-of-the-art retrieval quality both within and outside its training domain, revealing the promising potential and generalization capability of our model to serve as a universal retriever for different candidate selection tasks simultaneously.
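The dual-encoder idea above (score each candidate by a dot product with the encoded dialogue context, and train with historically selected candidates as hard negatives) can be illustrated with a small sketch. The vectors below stand in for encoder outputs, and the softmax contrastive loss is a generic formulation chosen for illustration; the abstract does not specify UniRetriever's two loss constraints, so none of this should be read as the paper's exact objective.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_candidates(context_vec, candidate_vecs):
    # Retrieval: order candidate indices by dot-product score, best first.
    scores = [dot(context_vec, c) for c in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def contrastive_loss(context_vec, positive, negatives):
    """Negative log-likelihood of the positive under a softmax over all scores.

    negatives may include previously selected candidates as hard negatives
    (an illustrative reading of the abstract, not the paper's exact loss).
    """
    scores = [dot(context_vec, positive)] + [dot(context_vec, n) for n in negatives]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[0]

context = [1.0, 0.0]
ranking = rank_candidates(context, [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
hard = contrastive_loss(context, [1.0, 0.0], [[0.9, 0.0]])   # negative close to the positive
easy = contrastive_loss(context, [1.0, 0.0], [[-1.0, 0.0]])  # negative far from the positive
```

A negative that scores almost as high as the positive yields a larger loss than an easy one, which is why hard negatives give a stronger training signal for separating subtly different candidates.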
LG Jun 27, 2025
Hyper-modal Imputation Diffusion Embedding with Dual-Distillation for Federated Multimodal Knowledge Graph Completion
Ying Zhang, Yu Zhao, Xuhui Sui et al.
With increasing requirements for multimodal knowledge privatization, multimodal knowledge graphs in different institutes are usually decentralized, lacking an effective collaboration system with both stronger reasoning ability and transmission safety guarantees. In this paper, we propose the Federated Multimodal Knowledge Graph Completion (FedMKGC) task, which aims at training over federated MKGs to better predict the missing links in clients without sharing sensitive knowledge. We propose a framework named MMFeD3-HidE to address the multimodal uncertain unavailability and multimodal client heterogeneity challenges of FedMKGC. (1) Inside the clients, our proposed Hyper-modal Imputation Diffusion Embedding model (HidE) recovers the complete multimodal distributions from incomplete entity embeddings constrained by available modalities. (2) Among clients, our proposed Multimodal FeDerated Dual Distillation (MMFeD3) transfers knowledge mutually between clients and the server with logit and feature distillation to improve both global convergence and semantic consistency. We propose a FedMKGC benchmark for a comprehensive evaluation, consisting of a general FedMKGC backbone named MMFedE, datasets with heterogeneous multimodal information, and three groups of constructed baselines. Experiments conducted on our benchmark validate the effectiveness, semantic consistency, and convergence robustness of MMFeD3-HidE.
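As a rough illustration of the logit-distillation half of the mutual transfer described above, the sketch below computes the usual temperature-softened KL divergence between a teacher's and a student's logits and applies it in both directions. This is the generic knowledge-distillation loss, not the MMFeD3 objective: the function names, the temperature value, and the toy logits are assumptions, and the feature-distillation term is omitted.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def distill_kl(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax([t / temperature for t in teacher_logits])
    q = softmax([s / temperature for s in student_logits])
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical scores over three candidate tail entities for one query triple.
client_logits = [2.0, 0.5, -1.0]
server_logits = [1.5, 1.0, -0.5]

# Mutual distillation: each side is the teacher for the other in turn.
client_learns_from_server = distill_kl(client_logits, server_logits)
server_learns_from_client = distill_kl(server_logits, client_logits)
```

Only logits cross the client-server boundary in this sketch, which is consistent with the stated goal of collaborating without sharing the underlying sensitive knowledge.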