Jonathan Berger

SD
5papers
92citations
Novelty38%
AI Score22

5 Papers

SDOct 27, 2022
One-Shot Acoustic Matching Of Audio Signals -- Learning to Hear Music In Any Room/ Concert Hall

Prateek Verma, Chris Chafe, Jonathan Berger

The acoustic space in which a sound is created and heard plays an essential role in how that sound is perceived by affording a unique sense of \textit{presence}. Every sound we hear results from successive convolution operations intrinsic to the sound source and external factors such as microphone characteristics and room impulse responses. Typically, researchers use an excitation such as a pistol shot or balloon pop as an impulse signal with which an auralization can be created. The room "impulse" responses convolved with the signal of interest can transform the input sound into the sound played in the acoustic space of interest. Here we propose a novel architecture that can transform a sound of interest into any other acoustic space(room or hall) of interest by using arbitrary audio recorded as a proxy for a balloon pop. The architecture is grounded in simple signal processing ideas to learn residual signals from a learned acoustic signature and the input signal. Our framework allows a neural network to adjust gains of every point in the time-frequency representation, giving sound qualitative and quantitative results.

SDAug 16, 2022
Enhancing Audio Perception of Music By AI Picked Room Acoustics

Prateek Verma, Jonathan Berger

Every sound that we hear is the result of successive convolutional operations (e.g. room acoustics, microphone characteristics, resonant properties of the instrument itself, not to mention characteristics and limitations of the sound reproduction system). In this work we seek to determine the best room in which to perform a particular piece using AI. Additionally, we use room acoustics as a way to enhance the perceptual qualities of a given sound. Historically, rooms (particularly Churches and concert halls) were designed to host and serve specific musical functions. In some cases the architectural acoustical qualities enhanced the music performed there. We try to mimic this, as a first step, by designating room impulse responses that would correlate to producing enhanced sound quality for particular music. A convolutional architecture is first trained to take in an audio sample and mimic the ratings of experts with about 78 % accuracy for various instrument families and notes for perceptual qualities. This gives us a scoring function for any audio sample which can rate the perceptual pleasantness of a note automatically. Now, via a library of about 60,000 synthetic impulse responses mimicking all kinds of room, materials, etc, we use a simple convolution operation, to transform the sound as if it was played in a particular room. The perceptual evaluator is used to rank the musical sounds, and yield the "best room or the concert hall" to play a sound. As a byproduct it can also use room acoustics to turn a poor quality sound into a "good" sound.

SDMay 1, 2021
Audio Transformers

Prateek Verma, Jonathan Berger

Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be optimized for a particular task of interest, over a wide variety of datasets and labels. In fact similar architectures designed for image understanding have proven effective for acoustic scene analysis. Here we propose applying Transformer based architectures without convolutional layers to raw audio signals. On a standard dataset of Free Sound 50K,comprising of 200 categories, our model outperforms convolutional models to produce state of the art results. This is significant as unlike in natural language processing and computer vision, we do not perform unsupervised pre-training for outperforming convolutional architectures. On the same training set, with respect mean aver-age precision benchmarks, we show a significant improvement. We further improve the performance of Transformer architectures by using techniques such as pooling inspired from convolutional net-work designed in the past few years. In addition, we also show how multi-rate signal processing ideas inspired from wavelets, can be applied to the Transformer embeddings to improve the results. We also show how our models learns a non-linear non constant band-width filter-bank, which shows an adaptable time frequency front end representation for the task of audio understanding, different from other tasks e.g. pitch estimation.

SDDec 11, 2019
Learning to Model Aspects of Hearing Perception Using Neural Loss Functions

Prateek Verma, Jonathan Berger

We present a framework to model the perceived quality of audio signals by combining convolutional architectures, with ideas from classical signal processing, and describe an approach to enhancing perceived acoustical quality. We demonstrate the approach by transforming the sound of an inexpensive musical with degraded sound quality to that of a high-quality musical instrument without the need for parallel data which is often hard to collect. We adapt the classical approach of a simple adaptive EQ filtering to the objective criterion learned by a neural architecture and optimize it to get the signal of our interest. Since we learn adaptive masks depending on the signal of interest as opposed to a fixed transformation for all the inputs, we show that shallow neural architectures can achieve the desired result. A simple constraint on the objective and the initialization helps us in avoiding adversarial examples, which otherwise would have produced noisy, unintelligible audio. We believe that the current framework proposed has enormous applications, in a variety of problems where one can learn a loss function depending on the problem, using a neural architecture and optimize it after it has been learned.

SDApr 10, 2019
Neuralogram: A Deep Neural Network Based Representation for Audio Signals

Prateek Verma, Chris Chafe, Jonathan Berger

We propose the Neuralogram -- a deep neural network based representation for understanding audio signals which, as the name suggests, transforms an audio signal to a dense, compact representation based upon embeddings learned via a neural architecture. Through a series of probing signals, we show how our representation can encapsulate pitch, timbre and rhythm-based information, and other attributes. This representation suggests a method for revealing meaningful relationships in arbitrarily long audio signals that are not readily represented by existing algorithms. This has the potential for numerous applications in audio understanding, music recommendation, meta-data extraction to name a few.