SD, AS · Oct 25, 2021

Lhotse: a speech data representation library for the modern deep learning ecosystem

arXiv:2110.12561v1 · 48 citations
Originality: Synthesis-oriented
AI Analysis

This library addresses data management challenges for researchers and practitioners in speech processing. It offers a practical tool for combining datasets and handling complex audio features, though its contribution is incremental, since it builds on concepts from existing toolkits such as Kaldi.

Speech data is hard to work with because of diverse codecs, recording lengths, and metadata formats. The authors address this with Lhotse, a library that provides a common JSON description format, corresponding Python classes, and data preparation recipes for over 30 speech corpora. Together these simplify data wrangling and integration with modern deep learning tools such as PyTorch.

Speech data is notoriously difficult to work with due to a variety of codecs, recording lengths, and metadata formats. We present Lhotse, a speech data representation library that draws upon lessons learned from the Kaldi speech recognition toolkit and brings its concepts into the modern deep learning ecosystem. Lhotse provides a common JSON description format with corresponding Python classes and data preparation recipes for over 30 popular speech corpora. Various datasets can be easily combined and re-purposed for different tasks. The library handles multi-channel recordings, long recordings, local and cloud storage, and lazy and on-the-fly operations, among other features. We introduce the Cut and CutSet concepts, which simplify common data wrangling tasks for audio and help incorporate the acoustic context of speech utterances. Finally, we show how Lhotse leverages PyTorch data API abstractions and adapts them to handle speech data for deep learning.
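To make the idea of a cut with acoustic context more concrete, here is a minimal, standard-library-only sketch. It is not Lhotse's actual API: the `Recording`, `Supervision`, and `Cut` dataclasses, their fields, and the `extend`, `cuts_from_supervisions`, and `to_json_lines` helpers are all hypothetical names chosen for illustration. The sketch assumes a cut is simply a time span over a recording that carries the supervisions (for example, transcripts) attached to it, and that such objects can be written out as JSON lines.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Tuple


@dataclass(frozen=True)
class Recording:
    """Hypothetical description of one audio file."""
    id: str
    path: str
    duration: float  # seconds


@dataclass(frozen=True)
class Supervision:
    """Hypothetical label (e.g. a transcript) for a span of a recording."""
    id: str
    recording_id: str
    start: float  # seconds from the start of the recording
    duration: float
    text: str


@dataclass(frozen=True)
class Cut:
    """Hypothetical time span over a recording plus its supervisions."""
    recording: Recording
    start: float
    duration: float
    supervisions: Tuple[Supervision, ...]

    @property
    def end(self) -> float:
        return self.start + self.duration

    def extend(self, context: float) -> "Cut":
        """Widen the span by `context` seconds on both sides,
        clipped to the boundaries of the recording."""
        new_start = max(0.0, self.start - context)
        new_end = min(self.recording.duration, self.end + context)
        return Cut(self.recording, new_start, new_end - new_start, self.supervisions)


def cuts_from_supervisions(
    recording: Recording, supervisions: List[Supervision]
) -> List[Cut]:
    """Create one cut per supervision that belongs to `recording`."""
    return [
        Cut(recording, s.start, s.duration, (s,))
        for s in supervisions
        if s.recording_id == recording.id
    ]


def to_json_lines(cuts: List[Cut]) -> str:
    """Serialize cuts as one JSON object per line."""
    return "\n".join(json.dumps(asdict(c)) for c in cuts)


# Usage: two utterances in a 10-second recording, each padded with
# one second of surrounding audio as acoustic context.
rec = Recording("rec-1", "audio/rec-1.wav", 10.0)
sups = [
    Supervision("utt-1", "rec-1", 0.5, 2.0, "hello"),
    Supervision("utt-2", "rec-1", 8.5, 1.0, "world"),
]
padded = [c.extend(1.0) for c in cuts_from_supervisions(rec, sups)]
print(to_json_lines(padded))
```

In this sketch the padded spans are clipped at the start and end of the recording, so a cut never points outside the audio it refers to. The real library's classes and manifest schema may differ; consult its documentation for the actual interface.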

Code Implementations: 1 repo