Alexey Ozerov

6papers

188citations

Novelty46%

AI Score24

Ranked #177,660 of 205,806 authors (top 86%)#53,410 in CV (top 91%)

6 Papers

LGOct 6, 2021

ParaDiS: Parallelly Distributable Slimmable Neural Networks

Alexey Ozerov, Anne Lambert, Suresh Kirthi Kumaraswamy

When several limited power devices are available, one of the most efficient ways to make profit of these resources, while reducing the processing latency and communication load, is to run in parallel several neural sub-networks and to fuse the result at the end of processing. However, such a combination of sub-networks must be trained specifically for each particular configuration of devices (characterized by number of devices and their capacities) which may vary over different model deployments and even within the same deployment. In this work we introduce parallelly distributable slimmable (ParaDiS) neural networks that are splittable in parallel among various device configurations without retraining. While inspired by slimmable networks allowing instant adaptation to resources on just one device, ParaDiS networks consist of several multi-device distributable configurations or switches that strongly share the parameters between them. We evaluate ParaDiS framework on MobileNet v1 and ResNet-50 architectures on ImageNet classification task and WDSR architecture for image super-resolution task. We show that ParaDiS switches achieve similar or better accuracy than the individual models, i.e., distributed models of the same structure trained individually. Moreover, we show that, as compared to universally slimmable networks that are not distributable, the accuracy of distributable ParaDiS switches either does not drop at all or drops by a maximum of 1 % only in the worst cases. Finally, once distributed over several devices, ParaDiS outperforms greatly slimmable models.

SDFeb 10, 2021

Self-Supervised VQ-VAE for One-Shot Music Style Transfer

Ondřej Cífka, Alexey Ozerov, Umut Şimşekli et al.

Neural style transfer, allowing to apply the artistic style of one image to another, has become one of the most widely showcased computer vision applications shortly after its introduction. In contrast, related tasks in the music audio domain remained, until recently, largely untackled. While several style conversion methods tailored to musical signals have been proposed, most lack the 'one-shot' capability of classical image style transfer algorithms. On the other hand, the results of existing one-shot audio style transfer methods on musical inputs are not as compelling. In this work, we are specifically interested in the problem of one-shot timbre transfer. We present a novel method for this task, based on an extension of the vector-quantized variational autoencoder (VQ-VAE), along with a simple self-supervised learning strategy designed to obtain disentangled representations of timbre and pitch. We evaluate the method using a set of objective metrics and show that it is able to outperform selected baselines.

ASJul 15, 2020

A survey and an extensive evaluation of popular audio declipping methods

Pavel Záviška, Pavel Rajmic, Alexey Ozerov et al.

Dynamic range limitations in signal processing often lead to clipping, or saturation, in signals. The task of audio declipping is estimating the original audio signal, given its clipped measurements, and has attracted much interest in recent years. Audio declipping algorithms often make assumptions about the underlying signal, such as sparsity or low-rankness, and about the measurement system. In this paper, we provide an extensive review of audio declipping algorithms proposed in the literature. For each algorithm, we present assumptions that are made about the audio signal, the modeling domain, and the optimization algorithm. Furthermore, we provide an extensive numerical evaluation of popular declipping algorithms, on real audio data. We evaluate each algorithm in terms of the Signal-to-Distortion Ratio, and also using perceptual metrics of sound quality. The article is accompanied by a repository containing the evaluated methods.

CVNov 9, 2018

Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision

Sanjeel Parekh, Alexey Ozerov, Slim Essid et al.

We tackle the problem of audiovisual scene analysis for weakly-labeled data. To this end, we build upon our previous audiovisual representation learning framework to perform object classification in noisy acoustic environments and integrate audio source enhancement capability. This is made possible by a novel use of non-negative matrix factorization for the audio modality. Our approach is founded on the multiple instance learning paradigm. Its effectiveness is established through experiments over a challenging dataset of music instrument performance videos. We also show encouraging visual object localization results.

CVApr 19, 2018

Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events

Sanjeel Parekh, Slim Essid, Alexey Ozerov et al.

Audio-visual representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. We show that the learnt representations are useful for classifying events and localizing their characteristic audio-visual elements. The system is trained using only video-level event labels without any timing information. An important feature of our method is its capacity to learn from unsynchronized audio-visual events. We achieve state-of-the-art results on a large-scale dataset of weakly-labeled audio event videos. Visualizations of localized visual regions and audio segments substantiate our system's efficacy, especially when dealing with noisy situations where modality-specific cues appear asynchronously.

SDOct 31, 2017

Audio style transfer

Eric Grinstein, Ngoc Duong, Alexey Ozerov et al.

'Style transfer' among images has recently emerged as a very active research topic, fuelled by the power of convolution neural networks (CNNs), and has become fast a very popular technology in social media. This paper investigates the analogous problem in the audio domain: How to transfer the style of a reference audio signal to a target audio content? We propose a flexible framework for the task, which uses a sound texture model to extract statistics characterizing the reference audio style, followed by an optimization-based audio texture synthesis to modify the target content. In contrast to mainstream optimization-based visual transfer method, the proposed process is initialized by the target content instead of random noise and the optimized loss is only about texture, not structure. These differences proved key for audio style transfer in our experiments. In order to extract features of interest, we investigate different architectures, whether pre-trained on other tasks, as done in image style transfer, or engineered based on the human auditory system. Experimental results on different types of audio signal confirm the potential of the proposed approach.