CV, MM · Aug 1, 2023

Relation-Aware Distribution Representation Network for Person Clustering with Multiple Modalities

arXiv:2308.00588v1 · 3 citations · h-index: 28
Originality: Incremental advance
AI Analysis

This work addresses person clustering for tasks such as movie parsing and identity-based movie editing. It offers a new way to handle weakly correlated multi-modal features, though its contribution is an incremental improvement on existing benchmarks.

The paper tackles person clustering with multi-modal clues by proposing a Relation-Aware Distribution representation Network (RAD-Net) that generates modality-agnostic distribution representations. It reports F-score improvements of +6% on VPCD and +8.2% on VoxCeleb2.

Person clustering with multi-modal clues, including faces, bodies, and voices, is critical for various tasks, such as movie parsing and identity-based movie editing. Related methods such as multi-view clustering mainly project multi-modal features into a joint feature space. However, multi-modal clue features are usually only weakly correlated because of the semantic gap arising from modality-specific uniqueness. As a result, these methods are not suitable for person clustering. In this paper, we propose a Relation-Aware Distribution representation Network (RAD-Net) to generate a distribution representation for multi-modal clues. The distribution representation of a clue is a vector consisting of the relations between this clue and all other clues from all modalities, which makes it modality agnostic and well suited to person clustering. Accordingly, we introduce a graph-based method to construct the distribution representation and employ a cyclic update policy to refine it progressively. Our method achieves substantial improvements of +6% and +8.2% in F-score on the Video Person-Clustering Dataset (VPCD) and the VoxCeleb2 multi-view clustering dataset, respectively. Code will be released publicly upon acceptance.
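
To make the idea of a distribution representation concrete, here is a minimal sketch in Python. It is not the authors' RAD-Net, and every name in it is hypothetical. It assumes that relations within a modality can be approximated by cosine similarity, that clues from different modalities are initially related only when they come from the same track, and that a refinement round can be approximated by re-comparing the relation vectors themselves. The abstract specifies none of these choices; they stand in for the learned graph-based construction and cyclic update policy.

```python
import numpy as np

def cosine_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    norms = np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)
    x = x / norms
    return x @ x.T

def initial_relations(features, modality, track):
    """Build an N x N relation matrix; row i is the representation of clue i.

    Assumption: same-modality pairs use cosine similarity of raw features,
    cross-modality pairs are 1 only when both clues share a track.
    """
    n = len(features)
    rel = np.zeros((n, n))
    for m in set(modality):
        idx = [i for i in range(n) if modality[i] == m]
        feats = np.stack([features[i] for i in idx])
        rel[np.ix_(idx, idx)] = cosine_matrix(feats)
    track = np.asarray(track)
    rel[track[:, None] == track[None, :]] = 1.0
    return rel

def refine(rel, rounds=2):
    """Assumed stand-in for a cyclic update: compare relation vectors,
    which have the same length for every modality."""
    for _ in range(rounds):
        rel = cosine_matrix(rel)
    return rel

# Toy data: faces are 2-d, voices are 3-d, so raw features are not comparable.
features = [
    np.array([1.0, 0.0]),       # 0: face, person A, track 0
    np.array([0.0, 1.0]),       # 1: face, person B, track 1
    np.array([0.9, 0.1]),       # 2: face, person A, track 2 (no voice)
    np.array([1.0, 0.0, 0.0]),  # 3: voice, person A, track 0
    np.array([0.0, 1.0, 0.0]),  # 4: voice, person B, track 1
]
modality = ["face", "face", "face", "voice", "voice"]
track = [0, 1, 2, 0, 1]

rel0 = initial_relations(features, modality, track)
rel = refine(rel0)
print(np.round(rel, 3))
```

In this toy setup, face clue 2 has no initial relation to either voice clue. After refinement its relation to the voice of person A (clue 3) is larger than its relation to the voice of person B (clue 4), because its relation vector resembles that of face clue 0, which shares a track with clue 3. This illustrates why a vector of relations can be compared across modalities even when the raw features cannot.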
