AS | AI | SD | Jan 29

Representation-Regularized Convolutional Audio Transformer for Audio Understanding

arXiv:2601.21612v1 | h-index: 15 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses computational bottlenecks in audio understanding for researchers and practitioners, though it is incremental as it builds on existing self-supervised learning frameworks.

The authors address two limitations of self-supervised learning for audio understanding, single-granularity modeling and computational inefficiency, by proposing the Convolutional Audio Transformer (CAT), which achieves competitive performance on AudioSet 20k while converging 5 times faster than existing methods.

Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, limiting their ability to model the diverse temporal and spectral structures inherent in complex audio signals. Furthermore, bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. In this work, we propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges. First, to capture hierarchical audio features, CAT incorporates a Multi-resolution Block that aggregates information across varying granularities. Second, to enhance training efficiency, we introduce a Representation Regularization objective. Drawing inspiration from generative modeling, this auxiliary task guides the student model by aligning its predictions with high-quality semantic representations from frozen, pre-trained external encoders. Experimental results demonstrate that CAT significantly outperforms baselines on audio understanding benchmarks. Notably, it achieves competitive performance on the AudioSet 20k dataset with 5 times faster convergence than existing methods. Code and checkpoints will be released soon at https://github.com/realzhouchushu/CAT.
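
The abstract describes the Representation Regularization objective only at a high level: the student's predictions are aligned with representations from a frozen, pre-trained external encoder, alongside the usual bootstrap objective. The following is a minimal sketch of what such an auxiliary alignment term could look like, not the authors' implementation. The cosine-distance form, the function names, the `reg_weight` value, and the assumption that student and teacher frames are already time-aligned and projected to the same dimension are all illustrative choices that the abstract does not specify.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)


def representation_regularization(student_frames, teacher_frames):
    """Mean cosine distance between student and frozen-teacher frames.

    Assumption (not from the paper): both inputs are lists of per-frame
    feature vectors that are time-aligned and share one dimension.
    The teacher features are treated as constants (no gradient).
    """
    if len(student_frames) != len(teacher_frames):
        raise ValueError("student and teacher must have the same number of frames")
    distances = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_frames, teacher_frames)
    ]
    return sum(distances) / len(distances)


def total_loss(bootstrap_loss, reg_loss, reg_weight=0.5):
    """Combine the main bootstrap loss with the auxiliary term.

    reg_weight is a hypothetical hyperparameter chosen for illustration.
    """
    return bootstrap_loss + reg_weight * reg_loss


if __name__ == "__main__":
    student = [[1.0, 0.0], [0.0, 1.0]]
    teacher = [[1.0, 0.0], [1.0, 0.0]]
    reg = representation_regularization(student, teacher)
    print(total_loss(bootstrap_loss=1.0, reg_loss=reg))
```

In this sketch the auxiliary term is zero when the student's frames point in the same direction as the teacher's and grows as they diverge, which is one plausible way a frozen encoder could guide early training.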
