SD Nov 1, 2025
More Than A Shortcut: A Hyperbolic Approach To Early-Exit Networks
Swapnil Bhosale, Cosmin Frateanu, Camilla Clark et al.
Deploying accurate event detection on resource-constrained devices is challenged by the trade-off between performance and computational cost. While Early-Exit (EE) networks offer a solution through adaptive computation, they often fail to enforce a coherent hierarchical structure, limiting the reliability of their early predictions. To address this, we propose Hyperbolic Early-Exit networks (HypEE), a novel framework that learns EE representations in hyperbolic space. Our core contribution is a hierarchical training objective with a novel entailment loss, which enforces a partial-ordering constraint to ensure that deeper network layers geometrically refine the representations of shallower ones. Experiments on multiple audio event detection tasks and backbone architectures show that HypEE significantly outperforms standard Euclidean EE baselines, especially at the earliest, most computationally critical exits. The learned geometry also provides a principled measure of uncertainty, enabling a novel triggering mechanism that makes the overall system both more efficient and more accurate than conventional EE networks and standard backbone models without early exits.
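As a rough illustration of what "distance in hyperbolic space" means here, the sketch below computes the geodesic distance in the Poincaré ball model with curvature -1. The choice of the Poincaré ball, the curvature, and the function name `poincare_distance` are assumptions for illustration only; the abstract does not say which hyperbolic model or which distance-based quantity HypEE uses.

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball.

    Assumes curvature -1 and that both points have Euclidean norm < 1.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sq_dist = np.sum((x - y) ** 2)
    # Conformal factor terms; clamp to avoid division by zero near the boundary.
    denom = max((1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2)), eps)
    return float(np.arccosh(1.0 + 2.0 * sq_dist / denom))

# Distance from the origin grows without bound as a point nears the boundary.
origin = np.zeros(2)
print(poincare_distance(origin, [0.5, 0.0]))
print(poincare_distance(origin, [0.99, 0.0]))
```

In this model the distance from the origin equals 2 * artanh(||x||), so points close to the boundary are far from the origin even when their Euclidean norms differ only slightly.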
CV May 17, 2025
Learning to Highlight Audio by Watching Movies
Chao Huang, Ruohan Gao, J. M. F. Tsang et al.
Recent years have seen a significant increase in video content creation and consumption. Crafting engaging content requires the careful curation of both visual and audio elements. While visual cue curation, through techniques like optimal viewpoint selection or post-editing, has been central to media production, its natural counterpart, audio, has not undergone equivalent advancements. This often results in a disconnect between visual and acoustic saliency. To bridge this gap, we introduce a novel task: visually-guided acoustic highlighting, which aims to transform audio to deliver appropriate highlighting effects guided by the accompanying video, ultimately creating a more harmonious audio-visual experience. We propose a flexible, transformer-based multimodal framework to solve this task. To train our model, we also introduce a new dataset -- the muddy mix dataset, leveraging the meticulous audio and video crafting found in movies, which provides a form of free supervision. We develop a pseudo-data generation process to simulate poorly mixed audio, mimicking real-world scenarios through a three-step process -- separation, adjustment, and remixing. Our approach consistently outperforms several baselines in both quantitative and subjective evaluation. We also systematically study the impact of different types of contextual guidance and difficulty levels of the dataset. Our project page is here: https://wikichao.github.io/VisAH/.
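The adjustment and remixing steps of the three-step process can be sketched as follows, assuming the separation step has already produced per-source stems. The function name `make_muddy_mix`, the stem names, and the use of fixed per-stem gains in decibels are hypothetical choices for illustration; the abstract does not specify how the adjustment is performed.

```python
import numpy as np

def make_muddy_mix(stems, gains_db):
    """Rescale separated stems and sum them into a degraded mixture.

    stems:    dict mapping a source name to a 1-D audio array (equal lengths).
    gains_db: dict mapping a source name to a gain in dB (default 0 dB).
    """
    first = next(iter(stems.values()))
    mix = np.zeros_like(first, dtype=float)
    for name, audio in stems.items():
        # Convert the dB gain to a linear amplitude factor.
        gain = 10.0 ** (gains_db.get(name, 0.0) / 20.0)
        mix += gain * np.asarray(audio, dtype=float)
    return mix

# Toy example: attenuate speech by 20 dB so that music dominates the mix.
stems = {"speech": np.ones(4), "music": np.ones(4)}
muddy = make_muddy_mix(stems, {"speech": -20.0, "music": 0.0})
print(muddy)
```

In such a setup the original movie mix would serve as the target and the rescaled mixture as the input.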
LG May 30, 2025
Efficient Neural and Numerical Methods for High-Quality Online Speech Spectrogram Inversion via Gradient Theorem
Andres Fernandez, Juan Azcarreta, Cagdas Bilen et al.
Recent work in online speech spectrogram inversion effectively combines Deep Learning with the Gradient Theorem to predict phase derivatives directly from magnitudes. Then, phases are estimated from their derivatives via least squares, resulting in a high-quality reconstruction. In this work, we introduce three innovations that drastically reduce computational cost while maintaining high quality: Firstly, we introduce a novel neural network architecture with just 8k parameters, 30 times smaller than the previous state of the art. Secondly, increasing latency by 1 hop size allows us to further halve the cost of the neural inference step. Thirdly, we observe that the least squares problem features a tridiagonal matrix and propose a linear-complexity solver for the least squares step that leverages tridiagonality and positive-semidefiniteness, achieving a speedup of several orders of magnitude. We release samples online.
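To show why a tridiagonal system admits a linear-complexity solve, here is a minimal sketch of the classical Thomas algorithm. It is a generic textbook method, not the authors' solver, and the function name `solve_tridiagonal` is hypothetical. It assumes no pivoting is needed, which holds for example when the matrix is symmetric positive definite or diagonally dominant.

```python
import numpy as np

def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve A x = rhs in O(n) for a tridiagonal matrix A (Thomas algorithm).

    lower: sub-diagonal of A, length n - 1
    diag:  main diagonal of A, length n
    upper: super-diagonal of A, length n - 1
    rhs:   right-hand side, length n
    """
    n = len(diag)
    if n == 1:
        return np.array([rhs[0] / diag[0]], dtype=float)
    c = np.zeros(n - 1)  # modified super-diagonal
    d = np.zeros(n)      # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination: remove the sub-diagonal one row at a time.
    for i in range(1, n):
        denom = diag[i] - lower[i - 1] * c[i - 1]
        if i < n - 1:
            c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i - 1] * d[i - 1]) / denom
    # Back substitution.
    x = np.zeros(n)
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# Compare against a dense solve on a small symmetric positive definite system.
lower = np.full(4, -1.0)
diag = np.full(5, 2.0)
upper = np.full(4, -1.0)
b = np.arange(1.0, 6.0)
A = np.diag(diag) + np.diag(lower, -1) + np.diag(upper, 1)
print(np.allclose(solve_tridiagonal(lower, diag, upper, b), np.linalg.solve(A, b)))
```

Both passes touch each row once, so the cost grows linearly with the number of unknowns, whereas a dense solve grows cubically.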
AS Oct 18, 2019
A Framework for the Robust Evaluation of Sound Event Detection
Cagdas Bilen, Giacomo Ferroni, Francesco Tuveri et al.
This work defines a new framework for performance evaluation of polyphonic sound event detection (SED) systems, which overcomes the limitations of the conventional collar-based event decisions, event F-scores and event error rates. The proposed framework introduces a definition of event detection that is more robust against labelling subjectivity. It also resorts to polyphonic receiver operating characteristic (ROC) curves to deliver more global insight into system performance than F1-scores, and proposes a reduction of these curves into a single polyphonic sound detection score (PSDS), which allows system comparison independently from operating points (OPs). The presented method also delivers better insight into data biases and classification stability across sound classes. Furthermore, it can be tuned to varying applications in order to match a variety of user experience requirements. The benefits of the proposed approach are demonstrated by re-evaluating the baseline and two of the top-performing systems from DCASE 2019 Task 4.
NA Mar 17, 2014
Balancing Sparsity and Rank Constraints in Quadratic Basis Pursuit
Cagdas Bilen, Gilles Puy, Rémi Gribonval et al.
We investigate methods that simultaneously enforce sparsity and low-rank structure in a matrix, as often employed for sparse phase retrieval or phase calibration problems in compressive sensing. We propose a new approach for analyzing the trade-off between the sparsity and low-rank constraints in these methods, which not only helps to provide guidelines for adjusting the weights between these constraints, but also enables new simulation strategies for evaluating performance. We then provide simulation results for phase retrieval and phase calibration cases, both to demonstrate the consistency of the proposed method with other approaches and to evaluate the change in performance with different weights for the sparsity and low-rank structure constraints.
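For orientation, one common way to combine the two constraints is a convex program of the following shape. This is an assumed, generic formulation from the quadratic basis pursuit literature, and the symbols $X$, $\Phi_i$, $y_i$, and $\lambda$ are illustrative; the abstract does not state the exact objective studied.

```latex
\[
\min_{X \succeq 0} \; \operatorname{tr}(X) + \lambda \, \lVert X \rVert_1
\quad \text{subject to} \quad
y_i = \operatorname{tr}(\Phi_i X), \qquad i = 1, \dots, m .
\]
```

Here the trace term promotes low rank, the entrywise $\ell_1$ norm promotes sparsity, and the weight $\lambda$ sets the balance between the two.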
MM Mar 26, 2012
Compressed Sensing for Moving Imagery in Medical Imaging
Cagdas Bilen, Yao Wang, Ivan Selesnick
Numerous applications in signal processing have benefited from the theory of compressed sensing, which shows that it is possible to reconstruct signals sampled below the Nyquist rate when certain conditions are satisfied. One of these conditions is that there exists a known transform that represents the signal with a sufficiently small number of non-zero coefficients. However, when the signal to be reconstructed is composed of moving images or volumes, it is challenging to form such regularization constraints with traditional transforms such as wavelets. In this paper, we present a motion-compensating prior for such signals that is derived directly from the optical flow constraint and can utilize the motion information during compressed sensing reconstruction. The proposed regularization method can be used in a wide variety of applications involving compressed sensing and images or volumes of moving and deforming objects. It is also shown that it is possible to estimate the signal and the motion jointly or separately. Practical examples from magnetic resonance imaging are presented to demonstrate the benefit of the proposed method.
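For reference, the standard optical flow (brightness constancy) constraint for a 2-D image sequence $I(x, y, t)$ with motion field $(u, v)$ is written below. The notation is the usual textbook one and is an assumption here; the abstract does not give the exact form of the prior derived from it.

```latex
\[
\frac{\partial I}{\partial x}\, u + \frac{\partial I}{\partial y}\, v + \frac{\partial I}{\partial t} = 0 .
\]
```

A prior built on this relation penalizes image sequences whose temporal changes are not explained by the motion field, which is one way motion information can enter a reconstruction.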