Yujian Feng

CL · Jun 25, 2023

Chain-of-Thought Prompt Distillation for Multimodal Named Entity Recognition and Multimodal Relation Extraction

Feng Chen, Yujian Feng

Multimodal Named Entity Recognition (MNER) and Multimodal Relation Extraction (MRE) require fundamental reasoning capacity for intricate linguistic and multimodal comprehension. In this study, we explore distilling the reasoning ability of large language models (LLMs) into a more compact student model by generating a chain of thought (CoT), a sequence of intermediate reasoning steps. Specifically, we first show how to elicit such reasoning ability from LLMs through CoT prompts covering multi-grain (noun, sentence, multimodality) and data-augmentation (style, entity, image) dimensions. We then present a novel conditional prompt distillation method to assimilate the commonsense reasoning ability from LLMs, so that the student model can handle text-only inputs without requiring additional image and CoT knowledge. Extensive experiments show that our approach attains state-of-the-art accuracy and offers advantages in interpretability, data efficiency, and cross-domain generalization on MNER and MRE datasets.
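The abstract does not spell out the distillation objective. As a rough illustration only, the sketch below uses the generic temperature-scaled KL-divergence loss from standard knowledge distillation, where a student that sees text-only input is pulled toward a teacher distribution produced with extra (image and CoT) context. The function names, the `temperature` parameter, and the choice of KL divergence are assumptions for illustration, not the paper's conditional prompt distillation method.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def kl_distillation_loss(student_logits, teacher_logits, temperature=2.0, eps=1e-12):
    """Mean KL(teacher || student) over a batch of label distributions.

    student_logits: (batch, num_labels) scores from the text-only student (assumed).
    teacher_logits: (batch, num_labels) scores from the context-enriched teacher (assumed).
    """
    p = softmax(teacher_logits / temperature)  # soft teacher targets
    q = softmax(student_logits / temperature)  # student predictions
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kl.mean())
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge; in practice it would typically be combined with the supervised task loss.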

CV · Sep 18, 2021 · Code

Homogeneous and Heterogeneous Relational Graph for Visible-infrared Person Re-identification

Yujian Feng, Feng Chen, Jian Yu et al.

Visible-infrared person re-identification (VI Re-ID) aims to match person images between the visible and infrared modalities. Existing VI Re-ID methods mainly focus on extracting homogeneous structural relationships in an image, i.e., the relations between local features, while ignoring the heterogeneous correlation of local features in different modalities. The heterogeneous structured relationship is crucial for learning effective identity representations and performing cross-modality matching. In this paper, we model the homogeneous structural relationship with a modality-specific graph within each individual modality and then mine the heterogeneous structural correlation using the modality-specific graphs of the visible and infrared modalities. First, the homogeneous structured graph (HOSG) mines the one-vs.-rest relation between an arbitrary node (local feature) and all the remaining nodes within a visible or infrared image to learn effective identity representations. Second, to find cross-modality identity-consistent correspondence, the heterogeneous graph alignment module (HGAM) further measures the relational edge strength between local node features of the two modalities using a routing search. Third, we propose the cross-modality cross-correlation (CMCC) loss to extract the modality invariance of the feature representations of the visible and infrared graphs. CMCC computes the mutual information between modalities and expels semantic redundancy. Extensive experiments on the SYSU-MM01 and RegDB datasets demonstrate that our method outperforms state-of-the-art methods with a gain of 13.73% and 9.45% in Rank1/mAP. The code is available at https://github.com/fegnyujian/Homogeneous-and-Heterogeneous-Relational-Graph.
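The abstract gives no formula for the CMCC loss. One common way to encourage cross-view invariance while reducing redundancy is a Barlow-Twins-style cross-correlation objective, sketched below for paired visible and infrared features. Whether CMCC takes this form is an assumption; the function name, `off_diag_weight`, and the input layout are hypothetical and chosen only for illustration.

```python
import numpy as np

def cross_correlation_loss(vis, ir, off_diag_weight=5e-3, eps=1e-8):
    """Barlow-Twins-style loss between two modalities (illustrative, not the paper's CMCC).

    vis, ir: (batch, dim) features of the same identities, rows in matching order (assumed).
    """
    n = vis.shape[0]
    # Standardize each feature dimension across the batch.
    vis_n = (vis - vis.mean(axis=0)) / (vis.std(axis=0) + eps)
    ir_n = (ir - ir.mean(axis=0)) / (ir.std(axis=0) + eps)
    # Cross-correlation matrix between modalities, shape (dim, dim).
    c = vis_n.T @ ir_n / n
    diag = np.diag(c)
    on_diag = np.sum((diag - 1.0) ** 2)        # invariance: matching dims should correlate
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # redundancy: other dims should decorrelate
    return float(on_diag + off_diag_weight * off_diag)
```

Under this formulation the loss is near zero when the two modalities produce aligned, decorrelated features, and large when the paired features are unrelated.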

2 Papers