Juan F. Montesinos

SD
6papers
106citations
Novelty47%
AI Score25

6 Papers

CVApr 5, 2022
VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices

Venkatesh S. Kadandale, Juan F. Montesinos, Gloria Haro

In this paper, we address the problem of lip-voice synchronisation in videos containing human face and voice. Our approach is based on determining if the lips motion and the voice in a video are synchronised or not, depending on their audio-visual correspondence score. We propose an audio-visual cross-modal transformer-based model that outperforms several baseline models in the audio-visual synchronisation task on the standard lip-reading speech benchmark dataset LRS2. While the existing methods focus mainly on lip synchronisation in speech videos, we also consider the special case of the singing voice. The singing voice is a more challenging use case for synchronisation due to sustained vowel sounds. We also investigate the relevance of lip synchronisation models trained on speech datasets in the context of singing voice. Finally, we use the frozen visual features learned by our lip synchronisation model in the singing voice separation task to outperform a baseline audio-visual model which was trained end-to-end. The demos, source code, and the pre-trained models are available on https://ipcv.github.io/VocaLiST/

SDMar 8, 2022
VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer

Juan F. Montesinos, Venkatesh S. Kadandale, Gloria Haro

This paper presents an audio-visual approach for voice separation which produces state-of-the-art results at a low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. Then, both audio and motion features are fed to an audio-visual transformer which produces a fairly good estimation of the isolated target source. In a second stage, the predominant voice is enhanced with an audio-only network. We present different ablation studies and comparison to state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation in the task of singing voice separation. The demos, code, and weights are available in https://ipcv.github.io/VoViT/

SDJun 1, 2023
Speech inpainting: Context-based speech synthesis guided by video

Juan F. Montesinos, Daniel Michelsanti, Gloria Haro et al.

Audio and visual modalities are inherently connected in speech signals: lip movements and facial expressions are correlated with speech sounds. This motivates studies that incorporate the visual modality to enhance an acoustic speech signal or even restore missing audio information. Specifically, this paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing the speech in a corrupted audio segment in a way that it is consistent with the corresponding visual content and the uncorrupted audio context. We present an audio-visual transformer-based deep learning model that leverages visual cues that provide information about the content of the corrupted audio. It outperforms the previous state-of-the-art audio-visual model and audio-only baselines. We also show how visual features extracted with AV-HuBERT, a large audio-visual transformer for speech recognition, are suitable for synthesizing speech.

SDApr 20, 2021
A cappella: Audio-visual Singing Voice Separation

Juan F. Montesinos, Venkatesh S. Kadandale, Gloria Haro

The task of isolating a target singing voice in music videos has useful applications. In this work, we explore the single-channel singing voice separation problem from a multimodal perspective, by jointly learning from audio and visual modalities. To do so, we present Acappella, a dataset spanning around 46 hours of a cappella solo singing videos sourced from YouTube. We also propose an audio-visual convolutional network based on graphs which achieves state-of-the-art singing voice separation results on our dataset and compare it against its audio-only counterpart, U-Net, and a state-of-the-art audio-visual speech separation model. We evaluate the models in the following challenging setups: i) presence of overlapping voices in the audio mixtures, ii) the target voice set to lower volume levels in the mix, and iii) combination of i) and ii). The third one being the most challenging evaluation setup. We demonstrate that our model outperforms the baseline models in the singing voice separation task in the most challenging evaluation setup. The code, the pre-trained models, and the dataset are publicly available at https://ipcv.github.io/Acappella/able at https://ipcv.github.io/Acappella/

ASJun 14, 2020
Solos: A Dataset for Audio-Visual Music Analysis

Juan F. Montesinos, Olga Slizovskaia, Gloria Haro

In this paper, we present a new dataset of music performance videos which can be used for training machine learning methods for multiple tasks such as audio-visual blind source separation and localization, cross-modal correspondences, cross-modal generation and, in general, any audio-visual self-supervised task. These videos, gathered from YouTube, consist of solo musical performances of 13 different instruments. Compared to previously proposed audio-visual datasets, Solos is cleaner since a big amount of its recordings are auditions and manually checked recordings, ensuring there is no background noise nor effects added in the video post-processing. Besides, it is, up to the best of our knowledge, the only dataset that contains the whole set of instruments present in the URMP\cite{URPM} dataset, a high-quality dataset of 44 audio-visual recordings of multi-instrument classical music pieces with individual audio tracks. URMP was intented to be used for source separation, thus, we evaluate the performance on the URMP dataset of two different source-separation models trained on Solos. The dataset is publicly available at https://juanfmontesinos.github.io/Solos/

SDMar 23, 2020
Multi-channel U-Net for Music Source Separation

Venkatesh S. Kadandale, Juan F. Montesinos, Gloria Haro et al.

A fairly straightforward approach for music source separation is to train independent models, wherein each model is dedicated for estimating only a specific source. Training a single model to estimate multiple sources generally does not perform as well as the independent dedicated models. However, Conditioned U-Net (C-U-Net) uses a control mechanism to train a single model for multi-source separation and attempts to achieve a performance comparable to that of the dedicated models. We propose a multi-channel U-Net (M-U-Net) trained using a weighted multi-task loss as an alternative to the C-U-Net. We investigate two weighting strategies for our multi-task loss: 1) Dynamic Weighted Average (DWA), and 2) Energy Based Weighting (EBW). DWA determines the weights by tracking the rate of change of loss of each task during training. EBW aims to neutralize the effect of the training bias arising from the difference in energy levels of each of the sources in a mixture. Our methods provide three-fold advantages compared to C-UNet: 1) Fewer effective training iterations per epoch, 2) Fewer trainable network parameters (no control parameters), and 3) Faster processing at inference. Our methods achieve performance comparable to that of C-U-Net and the dedicated U-Nets at a much lower training cost.