LG, AI, CV, SD, AS | Feb 24, 2022

Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph

arXiv:2202.12307v1 | 13 citations
Originality: Highly original
AI Analysis

This paper addresses the problem of disentangling content and style for generative tasks in the speech and image domains. It offers a framework applicable to zero-shot voice conversion, image part discovery, and part-based style transfer.

The paper tackles unsupervised learning of content-style decomposed representations by modeling them as a token-level bipartite graph, achieving state-of-the-art zero-shot voice conversion and top performance in image part discovery tasks.

This paper addresses the unsupervised learning of content-style decomposed representation. We first give a definition of style and then model the content-style representation as a token-level bipartite graph. An unsupervised framework, named Retriever, is proposed to learn such representations. First, a cross-attention module is employed to retrieve permutation invariant (P.I.) information, defined as style, from the input data. Second, a vector quantization (VQ) module is used, together with man-induced constraints, to produce interpretable content tokens. Last, an innovative link attention module serves as the decoder to reconstruct data from the decomposed content and style, with the help of the linking keys. Being modal-agnostic, the proposed Retriever is evaluated in both speech and image domains. The state-of-the-art zero-shot voice conversion performance confirms the disentangling ability of our framework. Top performance is also achieved in the part discovery task for images, verifying the interpretability of our representation. In addition, the vivid part-based style transfer quality demonstrates the potential of Retriever to support various fascinating generative tasks. Project page at https://ydcustc.github.io/retriever-demo/.
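The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `style_queries`, `codebook`, `cross_attention`, and `quantize` are hypothetical, the vectors are random rather than learned, and the learned projections, the man-induced constraints, and the link attention decoder are all omitted. It only shows why attention pooling without positional information gives an output that does not depend on token order (the abstract's definition of style), and how vector quantization maps each token to a discrete content index.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, tokens):
    """Each query attends over all input tokens (keys = values = tokens).

    No positional information is used, so reordering the tokens
    leaves the output unchanged up to floating-point rounding.
    """
    dim = len(tokens[0])
    pooled = []
    for q in queries:
        weights = softmax([dot(q, t) / math.sqrt(dim) for t in tokens])
        pooled.append(
            [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]
        )
    return pooled


def quantize(token, codebook):
    """Return the index of the nearest codebook entry (squared Euclidean)."""
    dists = [sum((a - b) ** 2 for a, b in zip(token, c)) for c in codebook]
    return dists.index(min(dists))


rng = random.Random(0)
dim = 4
# Assumption: random stand-ins for encoder outputs and learned parameters.
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
style_queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
codebook = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]

# Style: pooled by cross-attention, insensitive to token order.
style = cross_attention(style_queries, tokens)
shuffled = tokens[:]
rng.shuffle(shuffled)
style_shuffled = cross_attention(style_queries, shuffled)

# Content: one discrete codebook index per token, order preserved.
content_ids = [quantize(t, codebook) for t in tokens]
content_ids_shuffled = [quantize(t, codebook) for t in shuffled]
```

Comparing `style` with `style_shuffled` entry by entry shows agreement within rounding error, while `content_ids_shuffled` is a reordering of `content_ids`. That contrast is the content-style split the abstract describes: style survives shuffling, content follows the token order.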

Code Implementations | 2 repos
