Soft Dynamic Time Warping for Multi-Pitch Estimation and Beyond
The work addresses the challenge of learning from weakly aligned, real-valued sequences in MIR and offers a more elegant and extensible solution than CTC, though the contribution is incremental because it builds on existing DTW and CTC methods.
The paper tackles the problem of learning from weakly aligned data in music information retrieval by proposing soft dynamic time warping (SoftDTW) as an alternative to connectionist temporal classification (CTC). For multi-pitch estimation, it shows that SoftDTW yields results on par with a state-of-the-art multi-label extension of CTC.
Many tasks in music information retrieval (MIR) involve weakly aligned data, where exact temporal correspondences are unknown. The connectionist temporal classification (CTC) loss is a standard technique to learn feature representations based on weakly aligned training data. However, CTC is limited to discrete-valued target sequences and can be difficult to extend to multi-label problems. In this article, we show how soft dynamic time warping (SoftDTW), a differentiable variant of classical DTW, can be used as an alternative to CTC. Using multi-pitch estimation as an example scenario, we show that SoftDTW yields results on par with a state-of-the-art multi-label extension of CTC. In addition to being more elegant in terms of its algorithmic formulation, SoftDTW naturally extends to real-valued target sequences.
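To make the idea of a differentiable DTW concrete, the following is a minimal NumPy sketch of the standard SoftDTW recursion, in which the hard minimum of classical DTW is replaced by a smoothed minimum with temperature gamma. It is not the authors' implementation: the function names, the squared Euclidean local cost, and the toy sequences are assumptions chosen for illustration, and a training setup would use an automatic-differentiation framework rather than plain NumPy.

```python
import numpy as np


def softmin(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = -np.asarray(values, dtype=float) / gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))


def soft_dtw(pred, target, gamma=1.0):
    """SoftDTW cost between two real-valued sequences of shape (length, dim).

    Assumption: squared Euclidean distance as the local cost.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    n, m = len(pred), len(target)

    # Pairwise local costs between every prediction frame and target frame.
    diff = pred[:, None, :] - target[None, :, :]
    cost = (diff ** 2).sum(axis=-1)

    # Accumulated cost matrix with an extra border row and column.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            neighbours = [acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]]
            acc[i, j] = cost[i - 1, j - 1] + softmin(neighbours, gamma)
    return acc[n, m]


# Toy example (hypothetical data): six prediction frames versus a
# three-element target whose timing is unknown, i.e. weakly aligned.
pred = np.array([[0.0], [0.0], [1.0], [1.0], [1.0], [2.0]])
target = np.array([[0.0], [1.0], [2.0]])
print(soft_dtw(pred, target, gamma=0.1))
```

Because the local cost is computed on real-valued vectors, the same function accepts continuous targets without modification, which is the property the abstract contrasts with CTC's restriction to discrete-valued target sequences. As gamma approaches zero, the smoothed minimum approaches the hard minimum and the result approaches the classical DTW cost.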