As Good as It KAN Get: High-Fidelity Audio Representation
This work addresses audio representation and is aimed at multimedia and signal processing researchers, offering incremental improvements through novel architectures.
The study addresses the limited use of implicit neural representations (INRs) for audio by introducing the Kolmogorov-Arnold Network (KAN) as an INR model, which achieved a Log-Spectral Distance of 1.29 and a Perceptual Evaluation of Speech Quality score of 3.57 on 1.5 s audio. It also proposes FewSound, a hypernetwork-based architecture that enhances INR parameter updates and outperforms HyperSound with a 33.3% improvement in MSE and a 60.87% improvement in SI-SNR.
Implicit neural representations (INRs) have gained prominence for efficiently encoding multimedia data, yet their applications to audio signals remain limited. This study introduces the Kolmogorov-Arnold Network (KAN), a novel architecture using learnable activation functions, as an effective INR model for audio representation. KAN demonstrates superior perceptual performance over previous INRs, achieving the lowest Log-Spectral Distance of 1.29 and the highest Perceptual Evaluation of Speech Quality of 3.57 for 1.5 s audio. To extend KAN's utility, we propose FewSound, a hypernetwork-based architecture that enhances INR parameter updates. FewSound outperforms the state-of-the-art HyperSound, with a 33.3% improvement in MSE and a 60.87% improvement in SI-SNR. These results show KAN to be a robust and adaptable audio representation with the potential for scalability and integration into various hypernetwork frameworks. The source code can be accessed at https://github.com/gmum/fewsound.git.
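As a rough illustration of the signal-level metrics cited above, the following sketch computes MSE, SI-SNR, and one common formulation of Log-Spectral Distance with NumPy. The function names, the STFT settings (n_fft=512, hop=128), and the eps constant are assumptions made for illustration and are not taken from the paper or its repository, so values computed this way may not match the reported numbers. PESQ is omitted because it requires a dedicated implementation of the standardized algorithm.

```python
import numpy as np


def mse(reference, estimate):
    """Mean squared error between two equal-length waveforms."""
    return float(np.mean((reference - estimate) ** 2))


def si_snr(reference, estimate, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (higher is better)."""
    # Remove the DC offset so the measure ignores constant shifts.
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to get the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    return float(10.0 * np.log10(ratio))


def log_spectral_distance(reference, estimate, n_fft=512, hop=128, eps=1e-8):
    """Log-Spectral Distance (lower is better), one common formulation.

    Assumed definition: per-frame RMS difference of log10 power spectra,
    averaged over frames. n_fft and hop are illustrative choices.
    """
    window = np.hanning(n_fft)

    def power_spectrum(x):
        n_frames = 1 + (len(x) - n_fft) // hop
        frames = np.stack(
            [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
        )
        return np.abs(np.fft.rfft(frames, axis=1)) ** 2

    log_ref = np.log10(power_spectrum(reference) + eps)
    log_est = np.log10(power_spectrum(estimate) + eps)
    per_frame = np.sqrt(np.mean((log_ref - log_est) ** 2, axis=1))
    return float(np.mean(per_frame))
```

To use the sketch, pass the ground-truth waveform and the waveform reconstructed by the INR as one-dimensional float arrays of equal length. Because SI-SNR projects the estimate onto the reference, rescaling the reconstruction by a constant factor leaves the score essentially unchanged, whereas MSE is sensitive to such rescaling.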