SDAS May 31

A Lightweight Slot-Attention Framework for Multi-Instrument Multi-Pitch Estimation

arXiv:2606.014607.1
AI Analysis

For researchers in music information retrieval, this work offers a promising but incremental step toward source-aware multi-pitch estimation, with mixed results on source assignment.

This paper proposes a lightweight slot-attention framework for multi-instrument multi-pitch estimation, using permutation-invariant Hungarian matching to decompose a mixture into source-like pitch maps. Experiments show improved instrument family decomposition on URMP, but stem-level prediction remains challenging.

Multi-pitch estimation (MPE) typically predicts which pitches are active in a mixture, but not which instrument or source produced them. This paper investigates a lightweight slot-attention framework for multi-instrument MPE (MI-MPE), where a mixture CQT is mapped to an unordered set of source-like pitch maps. The model uses permutation-invariant Hungarian matching to avoid fixed output semantics and treats the number of slots as an upper bound on the number of active sources. We further study two modular extensions: a self-supervised timbre encoder that provides training-time targets for slot-level timbre embeddings, and a polyphony branch that regularizes the pitch density of mixture- and slot-level predictions. Experiments show that Hungarian matching substantially improves instrument family decomposition on URMP. Stem-level prediction remains more challenging: timbre and polyphony supervision improve selected configurations, but do not consistently resolve source assignment. The results suggest that slot-based architectures are a promising direction for source-aware MPE, while highlighting the need to couple auxiliary musical cues to slot identity more carefully.
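
To make the matching step concrete, here is a minimal sketch of a permutation-invariant slot-to-source loss, under several assumptions not stated in the abstract. Pitch maps are flattened lists of activation probabilities. The per-pair cost is binary cross-entropy. Slots left unmatched are pushed toward a silent (all-zero) map, which is one way to treat the slot count as an upper bound on the number of active sources. For brevity the sketch does not implement the Hungarian algorithm; it searches all assignments by brute force, which yields the same optimal matching for the small slot counts used here. The names `bce` and `match_slots` are hypothetical and are not taken from the paper.

```python
import math
from itertools import permutations

EPS = 1e-7  # clamp to keep log() finite


def bce(pred, target):
    """Mean binary cross-entropy between two flattened pitch maps."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, EPS), 1.0 - EPS)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def match_slots(slot_preds, targets):
    """Return (loss, assignment), where assignment[j] is the slot for target j.

    Assumes len(targets) <= len(slot_preds). Unmatched slots are scored
    against a silent map (an assumption of this sketch).
    """
    num_slots, num_targets = len(slot_preds), len(targets)
    if num_targets > num_slots:
        raise ValueError("more targets than slots")
    # cost[s][j]: cost of explaining target j with slot s
    cost = [[bce(s, t) for t in targets] for s in slot_preds]
    silent = [0.0] * len(slot_preds[0])
    empty_cost = [bce(s, silent) for s in slot_preds]

    best_loss, best_assign = math.inf, None
    # Brute-force stand-in for Hungarian matching: try every injective
    # mapping from targets to slots and keep the cheapest one.
    for assign in permutations(range(num_slots), num_targets):
        used = set(assign)
        loss = sum(cost[s][j] for j, s in enumerate(assign))
        loss += sum(empty_cost[s] for s in range(num_slots) if s not in used)
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss / num_slots, best_assign


# Toy example: 3 slots, 2 active sources, 4 time-pitch bins each.
targets = [[1, 0, 0, 1], [0, 1, 0, 0]]
slots = [
    [0.10, 0.90, 0.10, 0.10],  # resembles target 1
    [0.05, 0.05, 0.05, 0.05],  # nearly silent
    [0.90, 0.10, 0.20, 0.80],  # resembles target 0
]
loss, assignment = match_slots(slots, targets)
```

Because the loss is minimized over assignments, reordering the slots changes the returned assignment but not the loss value, so no slot is tied to a fixed instrument. In practice an implementation would typically compute the same matching with a polynomial-time solver rather than enumerating permutations.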
