NAApr 11, 2018
Designing Gabor windows using convex optimizationNathanaël Perraudin, Nicki Holighaus, Peter L. Søndergaard et al.
Redundant Gabor frames admit an infinite number of dual frames, yet only the canonical dual Gabor system, constructed from the minimal l2-norm dual window, is widely used. This window function however, might lack desirable properties, e.g. good time-frequency concentration, small support or smoothness. We employ convex optimization methods to design dual windows satisfying the Wexler-Raz equations and optimizing various constraints. Numerical experiments suggest that alternate dual windows with considerably improved features can be found.
LGJul 18, 2023
Convex Geometry of ReLU-layers, Injectivity on the Ball and Local ReconstructionDaniel Haider, Martin Ehler, Peter Balazs
The paper uses a frame-theoretic setting to study the injectivity of a ReLU-layer on the closed ball of $\mathbb{R}^n$ and its non-negative part. In particular, the interplay between the radius of the ball and the bias vector is emphasized. Together with a perspective from convex geometry, this leads to a computationally feasible method of verifying the injectivity of a ReLU-layer under reasonable restrictions in terms of an upper bound of the bias vector. Explicit reconstruction formulas are provided, inspired by the duality concept from frame theory. All this gives rise to the possibility of quantifying the invertibility of a ReLU-layer and a concrete reconstruction algorithm for any input vector on the ball.
SDJul 25, 2023
Fitting Auditory Filterbanks with Multiresolution Neural NetworksVincent Lostanlen, Daniel Haider, Han Han et al.
Waveform-based deep learning faces a dilemma between nonparametric and parametric approaches. On one hand, convolutional neural networks (convnets) may approximate any linear time-invariant system; yet, in practice, their frequency responses become more irregular as their receptive fields grow. On the other hand, a parametric model such as LEAF is guaranteed to yield Gabor filters, hence an optimal time-frequency localization; yet, this strong inductive bias comes at the detriment of representational capacity. In this paper, we aim to overcome this dilemma by introducing a neural audio model, named multiresolution neural network (MuReNN). The key idea behind MuReNN is to train separate convolutional operators over the octave subbands of a discrete wavelet transform (DWT). Since the scale of DWT atoms grows exponentially between octaves, the receptive fields of the subsequent learnable convolutions in MuReNN are dilated accordingly. For a given real-world dataset, we fit the magnitude response of MuReNN to that of a well-established auditory filterbank: Gammatone for speech, CQT for music, and third-octave for urban sounds, respectively. This is a form of knowledge distillation (KD), in which the filterbank ''teacher'' is engineered by domain knowledge while the neural network ''student'' is optimized from data. We compare MuReNN to the state of the art in terms of goodness of fit after KD on a hold-out set and in terms of Heisenberg time-frequency localization. Compared to convnets and Gabor convolutions, we find that MuReNN reaches state-of-the-art performance on all three optimization problems.
LGSep 11, 2023
Instabilities in Convnets for Raw AudioDaniel Haider, Vincent Lostanlen, Martin Ehler et al.
What makes waveform-based deep learning so hard? Despite numerous attempts at training convolutional neural networks (convnets) for filterbank design, they often fail to outperform hand-crafted baselines. These baselines are linear time-invariant systems: as such, they can be approximated by convnets with wide receptive fields. Yet, in practice, gradient-based optimization leads to suboptimal approximations. In our article, we approach this phenomenon from the perspective of initialization. We present a theory of large deviations for the energy response of FIR filterbanks with random Gaussian weights. We find that deviations worsen for large filters and locally periodic input signals, which are both typical for audio signal processing applications. Numerical simulations align with our theory and suggest that the condition number of a convolutional layer follows a logarithmic scaling law between the number and length of the filters, which is reminiscent of discrete wavelet bases.
LGSep 30, 2024
(Almost) Smooth Sailing: Towards Numerical Stability of Neural Networks Through Differentiable Regularization of the Condition NumberRossen Nenov, Daniel Haider, Peter Balazs
Maintaining numerical stability in machine learning models is crucial for their reliability and performance. One approach to maintain stability of a network layer is to integrate the condition number of the weight matrix as a regularizing term into the optimization algorithm. However, due to its discontinuous nature and lack of differentiability the condition number is not suitable for a gradient descent approach. This paper introduces a novel regularizer that is provably differentiable almost everywhere and promotes matrices with low condition numbers. In particular, we derive a formula for the gradient of this regularizer which can be easily implemented and integrated into existing optimization algorithms. We show the advantages of this approach for noisy classification and denoising of MNIST images.
SDAug 30, 2024
Hold Me Tight: Stable Encoder-Decoder Design for Speech EnhancementDaniel Haider, Felix Perfler, Vincent Lostanlen et al.
Convolutional layers with 1-D filters are often used as frontend to encode audio signals. Unlike fixed time-frequency representations, they can adapt to the local characteristics of input data. However, 1-D filters on raw audio are hard to train and often suffer from instabilities. In this paper, we address these problems with hybrid solutions, i.e., combining theory-driven and data-driven approaches. First, we preprocess the audio signals via a auditory filterbank, guaranteeing good frequency localization for the learned encoder. Second, we use results from frame theory to define an unsupervised learning objective that encourages energy conservation and perfect reconstruction. Third, we adapt mixed compressed spectral norms as learning objectives to the encoder coefficients. Using these solutions in a low-complexity encoder-mask-decoder model significantly improves the perceptual evaluation of speech quality (PESQ) in speech enhancement.
FADec 1, 2024
Construction of generalized samplets in Banach spacesPeter Balazs, Michael Multerer
Recently, samplets have been introduced as localized discrete signed measures which are tailored to an underlying data set. Samplets exhibit vanishing moments, i.e., their measure integrals vanish for all polynomials up to a certain degree, which allows for feature detection and data compression. In the present article, we extend the different construction steps of samplets to functionals in Banach spaces more general than point evaluations. To obtain stable representations, we assume that these functionals form frames with square-summable coefficients or even Riesz bases with square-summable coefficients. In either case, the corresponding analysis operator is injective and we obtain samplet bases with the desired properties by means of constructing an isometry of the analysis operator's image. Making the assumption that the dual of the Banach space under consideration is imbedded into the space of compactly supported distributions, the multilevel hierarchy for the generalized samplet construction is obtained by spectral clustering of a similarity graph for the functionals' supports. Based on this multilevel hierarchy, generalized samplets exhibit vanishing moments with respect to a given set of primitives within the Banach space. We derive an abstract localization result for the generalized samplet coefficients with respect to the samplets' support sizes and the approximability of the Banach space elements by the chosen primitives. Finally, we present three examples showcasing the generalized samplet framework.
LGJul 8, 2025
Aliasing in Convnets: A Frame-Theoretic PerspectiveDaniel Haider, Vincent Lostanlen, Martin Ehler et al.
Using a stride in a convolutional layer inherently introduces aliasing, which has implications for numerical stability and statistical generalization. While techniques such as the parametrizations via paraunitary systems have been used to promote orthogonal convolution and thus ensure Parseval stability, a general analysis of aliasing and its effects on the stability has not been done in this context. In this article, we adapt a frame-theoretic approach to describe aliasing in convolutional layers with 1D kernels, leading to practical estimates for stability bounds and characterizations of Parseval stability, that are tailored to take short kernel sizes into account. From this, we derive two computationally very efficient optimization objectives that promote Parseval stability via systematically suppressing aliasing. Finally, for layers with random kernels, we derive closed-form expressions for the expected value and variance of the terms that describe the aliasing effects, revealing fundamental insights into the aliasing behavior at initialization.
SDMay 12, 2025
ISAC: An Invertible and Stable Auditory Filter Bank with Customizable Kernels for ML IntegrationDaniel Haider, Felix Perfler, Peter Balazs et al.
This paper introduces ISAC, an invertible and stable, perceptually-motivated filter bank that is specifically designed to be integrated into machine learning paradigms. More precisely, the center frequencies and bandwidths of the filters are chosen to follow a non-linear, auditory frequency scale, the filter kernels have user-defined maximum temporal support and may serve as learnable convolutional kernels, and there exists a corresponding filter bank such that both form a perfect reconstruction pair. ISAC provides a powerful and user-friendly audio front-end suitable for any application, including analysis-synthesis schemes.
LGJun 22, 2024
Injectivity of ReLU-layers: Tools from Frame TheoryDaniel Haider, Martin Ehler, Peter Balazs
Injectivity is the defining property of a mapping that ensures no information is lost and any input can be perfectly reconstructed from its output. By performing hard thresholding, the ReLU function naturally interferes with this property, making the injectivity analysis of ReLU layers in neural networks a challenging yet intriguing task that has not yet been fully solved. This article establishes a frame theoretic perspective to approach this problem. The main objective is to develop a comprehensive characterization of the injectivity behavior of ReLU layers in terms of all three involved ingredients: (i) the weights, (ii) the bias, and (iii) the domain where the data is drawn from. Maintaining a focus on practical applications, we limit our attention to bounded domains and present two methods for numerically approximating a maximal bias for given weights and data domains. These methods provide sufficient conditions for the injectivity of a ReLU layer on those domains and yield a novel practical methodology for studying the information loss in ReLU layers. Finally, we derive explicit reconstruction formulas based on the duality concept from frame theory.
SDFeb 15, 2022
Phase-Based Signal Representations for ScatteringDaniel Haider, Peter Balazs, Nicki Holighaus
The scattering transform is a non-linear signal representation method based on cascaded wavelet transform magnitudes. In this paper we introduce phase scattering, a novel approach where we use phase derivatives in a scattering procedure. We first revisit phase-related concepts for representing time-frequency information of audio signals, in particular, the partial derivatives of the phase in the time-frequency domain. By putting analytical and numerical results in a new light, we set the basis to extend the phase-based representations to higher orders by means of a scattering transform, which leads to well localized signal representations of large-scale structures. All the ideas are introduced in a general way and then applied using the STFT.
SDFeb 15, 2022
Audio Inpainting via $\ell_1$-Minimization and Dictionary LearningShristi Rajbamshi, Georg Tauböck, Peter Balazs et al.
Audio inpainting refers to signal processing techniques that aim at restoring missing or corrupted consecutive samples in audio signals. Prior works have shown that $\ell_1$- minimization with appropriate weighting is capable of solving audio inpainting problems, both for the analysis and the synthesis models. These models assume that audio signals are sparse with respect to some redundant dictionary and exploit that sparsity for inpainting purposes. Remaining within the sparsity framework, we utilize dictionary learning to further increase the sparsity and combine it with weighted $\ell_1$-minimization adapted for audio inpainting to compensate for the loss of energy within the gap after restoration. Our experiments demonstrate that our approach is superior in terms of signal-to-distortion ratio (SDR) and objective difference grade (ODG) compared with its original counterpart.
SDApr 23, 2019
Harmonic-aligned Frame Mask Based on Non-stationary Gabor Transform with Application to Content-dependent Speaker ComparisonFeng Huang, Peter Balazs
We propose harmonic-aligned frame mask for speech signals using non-stationary Gabor transform (NSGT). A frame mask operates on the transfer coefficients of a signal and consequently converts the signal into a counterpart signal. It depicts the difference between the two signals. In preceding studies, frame masks based on regular Gabor transform were applied to single-note instrumental sound analysis. This study extends the frame mask approach to speech signals. For voiced speech, the fundamental frequency is usually changing consecutively over time. We employ NSGT with pitch-dependent and therefore time-varying frequency resolution to attain harmonic alignment in the transform domain and hence yield harmonic-aligned frame masks for speech signals. We propose to apply the harmonic-aligned frame mask to content-dependent speaker comparison. Frame masks, computed from voiced signals of a same vowel but from different speakers, were utilized as similarity measures to compare and distinguish the speaker identities (SID). Results obtained with deep neural networks demonstrate that the proposed frame mask is valid in representing speaker characteristics and shows a potential for SID applications in limited data scenarios.
SDNov 3, 2016
Frame Theory for Signal Processing in PsychoacousticsPeter Balazs, Nicki Holighaus, Thibaud Necciari et al.
This review chapter aims to strengthen the link between frame theory and signal processing tasks in psychoacoustics. On the one side, the basic concepts of frame theory are presented and some proofs are provided to explain those concepts in some detail. The goal is to reveal to hearing scientists how this mathematical theory could be relevant for their research. In particular, we focus on frame theory in a filter bank approach, which is probably the most relevant view-point for audio signal processing. On the other side, basic psychoacoustic concepts are presented to stimulate mathematicians to apply their knowledge in this field.
SDSep 1, 2016
A Non-iterative Method for (Re)Construction of Phase from STFT MagnitudeZdeněk Průša, Peter Balazs, Peter L. Søndergaard
A non-iterative method for the construction of the Short-Time Fourier Transform (STFT) phase from the magnitude is presented. The method is based on the direct relationship between the partial derivatives of the phase and the logarithm of the magnitude of the un-sampled STFT with respect to the Gaussian window. Although the theory holds in the continuous setting only, the experiments show that the algorithm performs well even in the discretized setting (Discrete Gabor transform) with low redundancy using the sampled Gaussian window, the truncated Gaussian window and even other compactly supported windows like the Hann window. Due to the non-iterative nature, the algorithm is very fast and it is suitable for long audio signals. Moreover, solutions of iterative phase reconstruction algorithms can be improved considerably by initializing them with the phase estimate provided by the present algorithm. We present an extensive comparison with the state-of-the-art algorithms in a reproducible manner.
SDJul 22, 2016
Inpainting of long audio segments with similarity graphsNathanael Perraudin, Nicki Holighaus, Piotr Majdak et al.
We present a novel method for the compensation of long duration data loss in audio signals, in particular music. The concealment of such signal defects is based on a graph that encodes signal structure in terms of time-persistent spectral similarity. A suitable candidate segment for the substitution of the lost content is proposed by an intuitive optimization scheme and smoothly inserted into the gap, i.e. the lost or distorted signal region. Extensive listening tests show that the proposed algorithm provides highly promising results when applied to a variety of real-world music signals.
SDJan 25, 2016
A Perceptually Motivated Filter Bank with Perfect Reconstruction for Audio Signal ProcessingThibaud Necciari, Nicki Holighaus, Peter Balazs et al.
Many audio applications rely on filter banks (FBs) to analyze, process, and re-synthesize sounds. To approximate the auditory frequency resolution in the signal chain, some applications rely on perceptually motivated FBs, the gammatone FB being a popular example. However, most perceptually motivated FBs only allow partial signal reconstruction at high redundancies and/or do not have good resistance to sub-channel processing. This paper introduces an oversampled perceptually motivated FB enabling perfect reconstruction, efficient FB design, and adaptable redundancy. The filters are directly constructed in the frequency domain and linearly distributed on a perceptual frequency scale (e.g. ERB, Bark, or Mel scale). The proposed design allows for various filter shapes, uniform or non-uniform FB setting, and large down-sampling factors. For redundancies $\geq$ 3 perfect reconstruction is achieved by computing the canonical dual FB analytically. For lower redundancies perfect reconstruction is achieved using an iterative method. Experiments show performance improvements of the proposed approach when compared to the gammatone FB in terms of reconstruction error and resistance to sub-channel processing, especially at low redundancies.
FANov 21, 2006
Hilbert-Schmidt Operators and Frames - Classification, Approximation by Multipliers and AlgorithmsPeter Balazs
In this paper we deal with the connection of frames with the class of Hilbert Schmidt operators. First we give an easy criteria for operators being in this class using frames. It is the equivalent to the criteria using orthonormal bases. Then we construct Bessel sequences frames and Riesz bases for the class of Hilbert Schmidt operators using the tensor product of such sequences in the original Hilbert space. We investigate how the Hilbert Schmidt inner product of an arbitrary operator and such a rank one operator can be calculated in an efficient way. Finally we give an algorithm to find the best approximation, in the Hilbert Schmidt sense, of arbitrary matrices by operators called frame multipliers, which multiply the analysis coefficents by a fixed symbol followed by a synthesis.
FAOct 25, 2006
Frames and Finite Dimensionality: Frame Transformation, Classification and AlgorithmsPeter Balazs
In this paper we will look at the connection of frames and finite dimensionality. A main focus is to present simple algorithms and make them available online. The main result is a way to 'switch' between different frames, giving an algorithm to calculate the coefficients of one frame given the analysis of another frame using their cross-Gram matrix. This is a canonical extension of the basic transformation matrix used for orthonormal bases (ONB) and is therefore called frame transformation. Furthermore we will summarize basic properties of frames in finite dimensional spaces. We will give basic algorithms to use with frames in finite dimensional spaces, useful for basic numerical experiments. Finally we will give a criteria for finite dimensional spaces using frames. Keywords: frames, discrete expansion, matrices, frame transformation, finite dimension, algorithm; MSC: 41A58, 65F30; 15A03