28.0ASMay 20
Discriminative-Generative Target Speaker Extraction with Decoder-Only Language ModelsBang Zeng, Beilong Tang, Wang Xiang et al.
Target speaker extraction (TSE) aims to recover the speech of a desired speaker from a mixture given a short enrollment utterance, while speech enhancement (SE) focuses on improving speech quality under noisy conditions. Most existing TSE and SE systems are based on discriminative modeling and have shown strong interference suppression ability, but they often remain limited in perceptual quality and naturalness. To address this issue, we first introduce LauraTSE, a generative TSE model built on an autoregressive decoder-only language model. Although generative modeling is promising for quality enhancement, purely generative TSE may suffer from hallucination, content drift, and limited controllability in complex acoustic conditions. We therefore propose a discriminative-generative two-stage framework, where a discriminative front-end first produces target-related representations with strong interference suppression, and a generative back-end then reconstructs high-quality speech in the neural audio codec representation space. This design combines the controllability of discriminative extraction with the reconstruction capability of generative modeling. We further investigate several collaboration strategies for the two-stage framework, including front-end freezing, joint fine-tuning, SI-SDR regularization, and autoregressive/non-autoregressive inference. Experimental results on both TSE and SE benchmarks show that the proposed framework achieves a better balance among perceptual quality, intelligibility, and speaker consistency than purely discriminative or purely generative baselines.
IRNov 3, 2021
GRCN: Graph-Refined Convolutional Network for Multimedia Recommendation with Implicit FeedbackWei Yinwei, Wang Xiang, Nie Liqiang et al.
Reorganizing implicit feedback of users as a user-item interaction graph facilitates the applications of graph convolutional networks (GCNs) in recommendation tasks. In the interaction graph, edges between user and item nodes function as the main element of GCNs to perform information propagation and generate informative representations. Nevertheless, an underlying challenge lies in the quality of interaction graph, since observed interactions with less-interested items occur in implicit feedback (say, a user views micro-videos accidentally). This means that the neighborhoods involved with such false-positive edges will be influenced negatively and the signal on user preference can be severely contaminated. However, existing GCN-based recommender models leave such challenge under-explored, resulting in suboptimal representations and performance. In this work, we focus on adaptively refining the structure of interaction graph to discover and prune potential false-positive edges. Towards this end, we devise a new GCN-based recommender model, \emph{Graph-Refined Convolutional Network} (GRCN), which adjusts the structure of interaction graph adaptively based on status of model training, instead of remaining the fixed structure. In particular, a graph refining layer is designed to identify the noisy edges with the high confidence of being false-positive interactions, and consequently prune them in a soft manner. We then apply a graph convolutional layer on the refined graph to distill informative signals on user preference. Through extensive experiments on three datasets for micro-video recommendation, we validate the rationality and effectiveness of our GRCN. Further in-depth analysis presents how the refined graph benefits the GCN-based recommender model.
IROct 28, 2021
Hierarchical User Intent Graph Network forMultimedia RecommendationWei Yinwei, Wang Xiang, He Xiangnan et al.
In this work, we aim to learn multi-level user intents from the co-interacted patterns of items, so as to obtain high-quality representations of users and items and further enhance the recommendation performance. Towards this end, we develop a novel framework, Hierarchical User Intent Graph Network, which exhibits user intents in a hierarchical graph structure, from the fine-grained to coarse-grained intents. In particular, we get the multi-level user intents by recursively performing two operations: 1) intra-level aggregation, which distills the signal pertinent to user intents from co-interacted item graphs; and 2) inter-level aggregation, which constitutes the supernode in higher levels to model coarser-grained user intents via gathering the nodes' representations in the lower ones. Then, we refine the user and item representations as a distribution over the discovered intents, instead of simple pre-existing features. To demonstrate the effectiveness of our model, we conducted extensive experiments on three public datasets. Our model achieves significant improvements over the state-of-the-art methods, including MMGCN and DisenGCN. Furthermore, by visualizing the item representations, we provide the semantics of user intents.