SDMar 6, 2022Code
HEAR: Holistic Evaluation of Audio RepresentationsJoseph Turian, Jordie Shier, Humair Raj Khan et al. · cmu
What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music. HEAR was launched as a NeurIPS 2021 shared challenge. In the spirit of shared exchange, each participant submitted an audio embedding model following a common API that is general-purpose, open-source, and freely available to use. Twenty-nine models by thirteen external teams were evaluated on nineteen diverse downstream tasks derived from sixteen datasets. Open evaluation code, submitted models and datasets are key contributions, enabling comprehensive and reproducible evaluation, as well as previously impossible longitudinal studies. It still remains an open question whether one single general-purpose audio representation can perform as holistically as the human ear.
CVAug 3, 2022Code
Estimating Visual Information From Audio Through Manifold LearningFabrizio Pedersoli, Dryden Wiebe, Amin Banitalebi et al.
We propose a new framework for extracting visual information about a scene only using audio signals. Audio-based methods can overcome some of the limitations of vision-based methods i.e., they do not require "line-of-sight", are robust to occlusions and changes in illumination, and can function as a backup in case vision/lidar sensors fail. Therefore, audio-based methods can be useful even for applications in which only visual information is of interest Our framework is based on Manifold Learning and consists of two steps. First, we train a Vector-Quantized Variational Auto-Encoder to learn the data manifold of the particular visual modality we are interested in. Second, we train an Audio Transformation network to map multi-channel audio signals to the latent representation of the corresponding visual sample. We show that our method is able to produce meaningful images from audio using a publicly available audio/visual dataset. In particular, we consider the prediction of the following visual modalities from audio: depth and semantic segmentation. We hope the findings of our work can facilitate further research in visual information extraction from audio. Code is available at: https://github.com/ubc-vision/audio_manifold.
SDApr 27, 2021Code
One Billion Audio Sounds from GPU-enabled Modular SynthesisJoseph Turian, Jordie Shier, George Tzanetakis et al.
We release synth1B1, a multi-modal audio corpus consisting of 1 billion 4-second synthesized sounds, paired with the synthesis parameters used to generate them. The dataset is 100x larger than any audio dataset in the literature. We also introduce torchsynth, an open source modular synthesizer that generates the synth1B1 samples on-the-fly at 16200x faster than real-time (714MHz) on a single GPU. Finally, we release two new audio datasets: FM synth timbre and subtractive synth pitch. Using these datasets, we demonstrate new rank-based evaluation criteria for existing audio representations. Finally, we propose a novel approach to synthesizer hyperparameter optimization.
DCMay 19, 2017Code
Espresso: Efficient Forward Propagation for BCNNsFabrizio Pedersoli, George Tzanetakis, Andrea Tagliasacchi
There are many applications scenarios for which the computational performance and memory footprint of the prediction phase of Deep Neural Networks (DNNs) needs to be optimized. Binary Neural Networks (BDNNs) have been shown to be an effective way of achieving this objective. In this paper, we show how Convolutional Neural Networks (CNNs) can be implemented using binary representations. Espresso is a compact, yet powerful library written in C/CUDA that features all the functionalities required for the forward propagation of CNNs, in a binary file less than 400KB, without any external dependencies. Although it is mainly designed to take advantage of massive GPU parallelism, Espresso also provides an equivalent CPU implementation for CNNs. Espresso provides special convolutional and dense layers for BCNNs, leveraging bit-packing and bit-wise computations for efficient execution. These techniques provide a speed-up of matrix-multiplication routines, and at the same time, reduce memory usage when storing parameters and activations. We experimentally show that Espresso is significantly faster than existing implementations of optimized binary neural networks ($\approx$ 2 orders of magnitude). Espresso is released under the Apache 2.0 license and is available at http://github.com/fpeder/espresso.
HCApr 28, 2025
A Real-Time Gesture-Based Control FrameworkMahya Khazaei, Ali Bahrani, George Tzanetakis
We introduce a real-time, human-in-the-loop gesture control framework that can dynamically adapt audio and music based on human movement by analyzing live video input. By creating a responsive connection between visual and auditory stimuli, this system enables dancers and performers to not only respond to music but also influence it through their movements. Designed for live performances, interactive installations, and personal use, it offers an immersive experience where users can shape the music in real time. The framework integrates computer vision and machine learning techniques to track and interpret motion, allowing users to manipulate audio elements such as tempo, pitch, effects, and playback sequence. With ongoing training, it achieves user-independent functionality, requiring as few as 50 to 80 samples to label simple gestures. This framework combines gesture training, cue mapping, and audio manipulation to create a dynamic, interactive experience. Gestures are interpreted as input signals, mapped to sound control commands, and used to naturally adjust music elements, showcasing the seamless interplay between human interaction and machine response.
SDFeb 12, 2020
Deep Autotuner: a Pitch Correcting Network for Singing PerformancesSanna Wager, George Tzanetakis, Cheng-i Wang et al.
We introduce a data-driven approach to automatic pitch correction of solo singing performances. The proposed approach predicts note-wise pitch shifts from the relationship between the respective spectrograms of the singing and accompaniment. This approach differs from commercial systems, where vocal track notes are usually shifted to be centered around pitches in a user-defined score, or mapped to the closest pitch among the twelve equal-tempered scale degrees. The proposed system treats pitch as a continuous value rather than relying on a set of discretized notes found in musical scores, thus allowing for improvisation and harmonization in the singing performance. We train our neural network model using a dataset of 4,702 amateur karaoke performances selected for good intonation. Our model is trained on both incorrect intonation, for which it learns a correction, and intentional pitch variation, which it learns to preserve. The proposed deep neural network with gated recurrent units on top of convolutional layers shows promising performance on the real-world score-free singing pitch correction task of autotuning.
SDFeb 3, 2019
Deep Autotuner: A Data-Driven Approach to Natural-Sounding Pitch Correction for Singing Voice in Karaoke PerformancesSanna Wager, George Tzanetakis, Cheng-i Wang et al.
We describe a machine-learning approach to pitch correcting a solo singing performance in a karaoke setting, where the solo voice and accompaniment are on separate tracks. The proposed approach addresses the situation where no musical score of the vocals nor the accompaniment exists: It predicts the amount of correction from the relationship between the spectral contents of the vocal and accompaniment tracks. Hence, the pitch shift in cents suggested by the model can be used to make the voice sound in tune with the accompaniment. This approach differs from commercially used automatic pitch correction systems, where notes in the vocal tracks are shifted to be centered around notes in a user-defined score or mapped to the closest pitch among the twelve equal-tempered scale degrees. We train the model using a dataset of 4,702 amateur karaoke performances selected for good intonation. We present a Convolutional Gated Recurrent Unit (CGRU) model to accomplish this task. This method can be extended into unsupervised pitch correction of a vocal performance, popularly referred to as autotuning.
LGJul 2, 2013
The Orchive : Data mining a massive bioacoustic archiveSteven Ness, Helena Symonds, Paul Spong et al.
The Orchive is a large collection of over 20,000 hours of audio recordings from the OrcaLab research facility located off the northern tip of Vancouver Island. It contains recorded orca vocalizations from the 1980 to the present time and is one of the largest resources of bioacoustic data in the world. We have developed a web-based interface that allows researchers to listen to these recordings, view waveform and spectral representations of the audio, label clips with annotations, and view the results of machine learning classifiers based on automatic audio features extraction. In this paper we describe such classifiers that discriminate between background noise, orca calls, and the voice notes that are present in most of the tapes. Furthermore we show classification results for individual calls based on a previously existing orca call catalog. We have also experimentally investigated the scalability of classifiers over the entire Orchive.