TISDiSS: A Training-Time and Inference-Time Scalable Framework for Discriminative Source Separation
TISDiSS addresses scalability issues for practitioners in speech, music, and audio processing and offers a practical solution for adaptive source separation, although the contribution is incremental, since it integrates existing scaling concepts.
The paper tackles the high training and deployment costs of source separation by proposing TISDiSS, a framework that enables flexible speed-performance trade-offs through dynamic inference repetitions. It achieves state-of-the-art performance on standard speech separation benchmarks with a reduced parameter count.
Source separation is a fundamental task in speech, music, and audio processing, and it also provides cleaner and larger data for training generative models. However, improving separation performance in practice often depends on increasingly large networks, inflating training and deployment costs. Motivated by recent advances in inference-time scaling for generative modeling, we propose Training-Time and Inference-Time Scalable Discriminative Source Separation (TISDiSS), a unified framework that integrates early-split multi-loss supervision, shared-parameter design, and dynamic inference repetitions. TISDiSS enables flexible speed-performance trade-offs by adjusting inference depth without retraining additional models. We further provide systematic analyses of architectural and training choices and show that training with more inference repetitions improves shallow-inference performance, benefiting low-latency applications. Experiments on standard speech separation benchmarks demonstrate state-of-the-art performance with a reduced parameter count, establishing TISDiSS as a scalable and practical framework for adaptive source separation. Code is available at https://github.com/WingSingFung/TISDiSS.
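The interplay between the shared-parameter design, multi-loss supervision, and dynamic inference repetitions can be illustrated with a minimal sketch. It is not the TISDiSS implementation: `run_shared_block`, `multi_loss`, and `toy_block` are hypothetical names, the "block" is a scalar Newton step toward sqrt(2) standing in for a learned separation block, and uniform averaging of the per-repetition losses is an assumption. The early-split component is not modeled. The sketch only shows the control flow: one block is reused for every pass, training supervises every intermediate output, and inference picks the number of passes.

```python
from typing import Callable, List


def run_shared_block(block: Callable[[float], float], x0: float, repeats: int) -> List[float]:
    """Apply the same block `repeats` times and keep every intermediate output."""
    if repeats < 1:
        raise ValueError("repeats must be >= 1")
    outputs = []
    state = x0
    for _ in range(repeats):
        state = block(state)  # same function, same parameters, on every pass
        outputs.append(state)
    return outputs


def multi_loss(outputs: List[float], target: float) -> float:
    """Mean squared error over all intermediate outputs (uniform weights assumed)."""
    return sum((o - target) ** 2 for o in outputs) / len(outputs)


def toy_block(state: float) -> float:
    """Toy refinement step: one Newton iteration toward sqrt(2)."""
    return 0.5 * (state + 2.0 / state)


TARGET = 2.0 ** 0.5

# "Training" view: supervise every repetition, not only the last one.
train_outputs = run_shared_block(toy_block, x0=1.0, repeats=4)
train_loss = multi_loss(train_outputs, TARGET)

# "Inference" view: the same block, with depth chosen at run time.
fast = run_shared_block(toy_block, x0=1.0, repeats=1)[-1]
accurate = run_shared_block(toy_block, x0=1.0, repeats=4)[-1]
print(abs(fast - TARGET), abs(accurate - TARGET))
```

In this toy setting the error shrinks with each extra pass, so a caller can trade latency for accuracy by changing `repeats` alone, without swapping or retraining the block.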