Stochastic Clock Attention for Aligning Continuous and Ordered Sequences
This work addresses the need for better alignment models in sequence-to-sequence tasks such as text-to-speech, offering a drop-in replacement for scaled dot-product attention that yields more stable alignments for frame-synchronous targets.
The paper tackles the problem of aligning continuous and ordered sequences in attention mechanisms, where standard methods do not enforce continuity or monotonicity. The result is a novel attention framework based on learned clocks that improves alignment stability and robustness to global time-scaling while matching or improving accuracy in a Transformer text-to-speech testbed.
We formulate an attention mechanism for continuous and ordered sequences that explicitly functions as an alignment model, the component at the core of many sequence-to-sequence tasks. Standard scaled dot-product attention relies on positional encodings and masks but does not enforce continuity or monotonicity, both of which are crucial for frame-synchronous targets. We attach learned nonnegative \emph{clocks} to source and target and model attention as the meeting probability of these clocks; a path-integral derivation yields a closed-form, Gaussian-like scoring rule with an intrinsic bias toward causal, smooth, near-diagonal alignments, without external positional regularizers. The framework supports two complementary regimes: normalized clocks for parallel decoding when a global length is available, and unnormalized clocks for autoregressive decoding. Both are nearly parameter-free, drop-in replacements. In a Transformer text-to-speech testbed, this construction produces more stable alignments and improved robustness to global time-scaling while matching or improving accuracy over scaled dot-product baselines. We hypothesize applicability to other continuous targets, including video and temporal signal modeling.
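To make the clock idea concrete, the following is a minimal sketch, not the paper's derivation. The abstract does not state the closed-form rule, so several pieces here are assumptions: clocks are built as cumulative sums of softplus-transformed rates, the score is a negative squared clock difference divided by a fixed variance, and the names `softplus`, `clock`, `clock_attention`, and `sigma` are hypothetical.

```python
import math


def softplus(x):
    # Numerically stable softplus: maps any real number to a positive rate.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def clock(raw_rates, normalize=True):
    """Accumulate positive rates into a strictly increasing clock.

    Assumption: the clock is a cumulative sum of softplus rates. With
    normalize=True the clock is divided by its final value, which needs the
    full sequence (the "global length is available" regime).
    """
    total, values = 0.0, []
    for r in raw_rates:
        total += softplus(r)
        values.append(total)
    if normalize:
        values = [v / total for v in values]
    return values


def clock_attention(src_raw, tgt_raw, sigma=0.1, normalize=True):
    """Gaussian-like attention over source positions for each target step.

    Assumption: score(j, i) = -(tgt_clock[j] - src_clock[i])^2 / (2 sigma^2),
    followed by a softmax over the source axis.
    """
    src = clock(src_raw, normalize)
    tgt = clock(tgt_raw, normalize)
    weights = []
    for t in tgt:
        scores = [-((t - s) ** 2) / (2.0 * sigma ** 2) for s in src]
        m = max(scores)  # subtract the max for a stable softmax
        exps = [math.exp(a - m) for a in scores]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights


if __name__ == "__main__":
    # 4 source tokens, 8 target frames, all raw rates equal.
    for row in clock_attention([0.0] * 4, [0.0] * 8):
        print(" ".join(f"{w:.2f}" for w in row))
```

Because both clocks are nondecreasing in this sketch, the peak of each attention row can only stay put or move forward along the source as the target index advances, which illustrates the near-diagonal bias the abstract describes. Passing `normalize=False` keeps the raw cumulative sums, a rough stand-in for the autoregressive regime in which the total length is not known in advance.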