AS AI Sep 12, 2024

Super Monotonic Alignment Search

arXiv:2409.07704v21.21 citations h-index: 8 Has Code

Originality Synthesis-oriented

AI Analysis

This work provides a significant speedup for text-to-speech systems using MAS, though it is incremental as it optimizes an existing algorithm without changing its core paradigm.

The paper tackles the high time complexity and CPU execution inefficiencies of Monotonic Alignment Search (MAS) in text-to-speech by implementing GPU acceleration with a Triton kernel and a PyTorch JIT script, resulting in up to 72 times faster performance in extreme-length cases.

Monotonic alignment search (MAS), introduced by Glow-TTS, is one of the most popular algorithms in text-to-speech for estimating unknown alignments between text and speech. Since this algorithm needs to search for the most probable alignment with dynamic programming by caching all possible paths, the time complexity of the algorithm is $O(T \times S)$, where $T$ is the length of the text and $S$ is the length of the speech representation. The authors of Glow-TTS run this algorithm on CPU, and while they mentioned that it is difficult to parallelize, we found that MAS can be parallelized along the text-length dimension and that CPU execution spends an inordinate amount of time on inter-device copy. Therefore, we implemented a Triton kernel and a PyTorch JIT script to accelerate MAS on GPU without inter-device copy. As a result, the Super-MAS Triton kernel is up to 72 times faster in the extreme-length case. The code is available at https://github.com/supertone-inc/super-monotonic-align.
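To make the dynamic program described above concrete, here is a minimal plain-Python sketch of MAS. It is not the Super-MAS Triton kernel or the Glow-TTS implementation. The function name `monotonic_alignment_search`, the input layout `log_p[i][j]` (text token `i`, speech frame `j`), and the tie-breaking rule are assumptions made for illustration. The inner loop over `i` reads only the previous column, which is the property that allows parallelization along the text-length dimension.

```python
NEG_INF = float("-inf")


def monotonic_alignment_search(log_p):
    """Return (path, score) for the best monotonic alignment.

    log_p[i][j] is assumed to be the log-likelihood of aligning text
    token i with speech frame j (hypothetical layout for this sketch).
    path[j] is the text index assigned to speech frame j.
    """
    T = len(log_p)       # text length
    S = len(log_p[0])    # speech length
    if T > S:
        raise ValueError("speech must be at least as long as text")

    # Q[i][j]: best cumulative score of a monotonic path ending at (i, j).
    Q = [[NEG_INF] * S for _ in range(T)]
    Q[0][0] = log_p[0][0]

    for j in range(1, S):
        # Each i depends only on column j - 1, so the iterations of this
        # inner loop are independent and could run in parallel.
        for i in range(min(j + 1, T)):
            stay = Q[i][j - 1]
            advance = Q[i - 1][j - 1] if i > 0 else NEG_INF
            Q[i][j] = max(stay, advance) + log_p[i][j]

    # Backtrack from the last token and last frame to recover the path.
    path = [0] * S
    i = T - 1
    for j in range(S - 1, -1, -1):
        path[j] = i
        # Ties are resolved in favor of advancing (an arbitrary choice here).
        if j > 0 and i > 0 and Q[i - 1][j - 1] >= Q[i][j - 1]:
            i -= 1
    return path, Q[T - 1][S - 1]


example = [
    [0, -5, -5, -5],
    [-5, 0, 0, -5],
    [-5, -5, -5, 0],
]
best_path, best_score = monotonic_alignment_search(example)
```

In this toy example the second token spans two frames, so `best_path` is `[0, 1, 1, 2]`. The table `Q` holds $T \times S$ entries, which matches the $O(T \times S)$ cost stated in the abstract; the outer loop over speech frames remains sequential in this sketch.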
