Effective Context in Neural Speech Models
This work addresses the need to better understand how speech models use context. The contribution is incremental: it builds on existing methods to provide new insights rather than introducing major innovations.
The paper tackles the problem of measuring how much context neural speech models actually use, termed the effective context. It proposes two measurement approaches and uses them to analyze several speech Transformers, finding that effective context tracks the nature of the task in supervised models and remains relatively short in self-supervised models.
Modern neural speech models benefit from having longer context, and many approaches have been proposed to increase the maximum context a model can use. However, few have attempted to measure how much context these models actually use, i.e., the effective context. Here, we propose two approaches to measuring the effective context, and use them to analyze different speech Transformers. For supervised models, we find that the effective context correlates well with the nature of the task, with fundamental frequency tracking, phone classification, and word classification requiring increasing amounts of effective context. For self-supervised models, we find that effective context increases mainly in the early layers, and remains relatively short -- similar to the supervised phone model. Given that these models do not use a long context during prediction, we show that HuBERT can be run in streaming mode without modification to the architecture and without further fine-tuning.
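The abstract does not spell out the two measurement approaches, so the following is only a generic illustration of the idea of effective context, not the authors' method. It assumes a truncation-based probe: the input around a frame is cropped to a symmetric window, and the effective context is taken as the smallest window radius at which the model's output for that frame matches its full-context output. The names `toy_model` and `effective_context`, the tolerance `tol`, and the moving-average stand-in for a speech model are all hypothetical.

```python
import numpy as np


def toy_model(x, radius=3):
    """Hypothetical stand-in for a speech model (an assumption, not HuBERT):
    each output frame is the mean of the input frames within `radius` of it."""
    T = len(x)
    out = np.empty_like(x)
    for t in range(T):
        lo, hi = max(0, t - radius), min(T, t + radius + 1)
        out[t] = x[lo:hi].mean(axis=0)
    return out


def effective_context(model, x, t, tol=1e-6):
    """Return the smallest radius c such that running `model` on the cropped
    input x[t-c : t+c+1] reproduces the full-context output at frame t,
    up to a relative error of `tol`."""
    T = len(x)
    full = model(x)[t]
    for c in range(T):
        lo, hi = max(0, t - c), min(T, t + c + 1)
        # Frame t sits at index t - lo inside the cropped input.
        truncated = model(x[lo:hi])[t - lo]
        err = np.linalg.norm(truncated - full) / (np.linalg.norm(full) + 1e-12)
        if err <= tol:
            return c
    return T - 1


# Random features standing in for 100 frames of 8-dimensional speech input.
rng = np.random.default_rng(0)
x = rng.standard_normal((100, 8))

ec = effective_context(toy_model, x, t=50)
print(ec)
```

For the toy model the probe recovers the window radius that was built in, which is the sanity check one would want before applying a similar probe to a real Transformer. With a real model the output never matches exactly, so the choice of `tol` (or of a task-level metric in its place) becomes the main design decision.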