SDJun 12, 2025
Discrete Audio Tokens: More Than a Survey!Pooneh Mousavi, Gallil Maimon, Adel Moumen et al.
Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.
ASJul 28, 2021
Don't Separate, Learn to Remix: End-to-End Neural Remixing with Joint OptimizationHaici Yang, Shivani Firodiya, Nicholas J. Bryan et al.
The task of manipulating the level and/or effects of individual instruments to recompose a mixture of recordings, or remixing, is common across a variety of applications such as music production, audio-visual post-production, podcasts, and more. This process, however, traditionally requires access to individual source recordings, restricting the creative process. To work around this, source separation algorithms can separate a mixture into its respective components. Then, a user can adjust their levels and mix them back together. This two-step approach, however, still suffers from audible artifacts and motivates further work. In this work, we learn to remix music directly by re-purposing Conv-TasNet, a well-known source separation model, into two neural remixing architectures. To do this, we use an explicit loss term that directly measures remix quality and jointly optimize it with a separation loss. We evaluate our methods using the Slakh and MUSDB18 datasets and report remixing performance as well as the impact on source separation as a byproduct. Our results suggest that learning-to-remix significantly outperforms a strong separation baseline and is particularly useful for small volume changes.
SDNov 7, 2020
Non-local convolutional neural networks (nlcnn) for speaker recognitionHaici Yang, Hongda Mao, Ruirui Li et al.
Speaker recognition is the process of identifying a speaker based on the voice. The technology has attracted more attention with the recent increase in popularity of smart voice assistants, such as Amazon Alexa. In the past few years, various convolutional neural network (CNN) based speaker recognition algorithms have been proposed and achieved satisfactory performance. However, convolutional operations are building blocks that typically perform on a local neighborhood at a time and thus miss to capture global, long-range interactions at the feature level which are critical for understanding the pattern in a speaker's voice. In this work, we propose to apply Non-local Convolutional Neural Networks (NLCNN) to improve the capability of capturing long-range dependencies at the feature level, therefore improving speaker recognition performance. Specifically, we introduce non-local blocks where the output response of a position is computed as a weighted sum of the input features at all positions. Combining non-local blocks with pre-defined CNN networks, we investigate the effectiveness of NLCNN models. Without extensive tuning, the proposed NLCNN models outperform state-of-the-art speaker recognition algorithms on the public Voxceleb dataset. What's more, we investigate different types of non-local operations applied to the frequency-time domain, time domain, frequency domain and frame-level respectively. Among them, time domain is the most effective one for speaker recognition applications.
DLJun 3, 2020
Mapping the co-evolution of artificial intelligence, robotics, and the internet of things over 20 years (1998-2017)Katy Börner, Olga Scrivner, Leonard E. Cross et al.
Understanding the emergence, co-evolution, and convergence of science and technology (S&T) areas offers competitive intelligence for researchers, managers, policy makers, and others. The resulting data-driven decision support helps set proper research and development (R&D) priorities; develop future S&T investment strategies; monitor key authors, organizations, or countries; perform effective research program assessment; and implement cutting-edge education/training efforts. This paper presents new funding, publication, and scholarly network metrics and visualizations that were validated via expert surveys. The metrics and visualizations exemplify the emergence and convergence of three areas of strategic interest: artificial intelligence (AI), robotics, and internet of things (IoT) over the last 20 years (1998-2017). For 32,716 publications and 4,497 NSF awards, we identify their conceptual space (using the UCSD map of science), geospatial network, and co-evolution landscape. The findings demonstrate how the transition of knowledge (through cross-discipline publications and citations) and the emergence of new concepts (through term bursting) create a tangible potential for interdisciplinary research and new disciplines.
ASFeb 14, 2020
Boosted Locality Sensitive Hashing: Discriminative Binary Codes for Source SeparationSunwoo Kim, Haici Yang, Minje Kim
Speech enhancement tasks have seen significant improvements with the advance of deep learning technology, but with the cost of increased computational complexity. In this study, we propose an adaptive boosting approach to learning locality sensitive hash codes, which represent audio spectra efficiently. We use the learned hash codes for single-channel speech denoising tasks as an alternative to a complex machine learning model, particularly to address the resource-constrained environments. Our adaptive boosting algorithm learns simple logistic regressors as the weak learners. Once trained, their binary classification results transform each spectrum of test noisy speech into a bit string. Simple bitwise operations calculate Hamming distance to find the K-nearest matching frames in the dictionary of training noisy speech spectra, whose associated ideal binary masks are averaged to estimate the denoising mask for that test mixture. Our proposed learning algorithm differs from AdaBoost in the sense that the projections are trained to minimize the distances between the self-similarity matrix of the hash codes and that of the original spectra, rather than the misclassification rate. We evaluate our discriminative hash codes on the TIMIT corpus with various noise types, and show comparative performance to deep learning methods in terms of denoising performance and complexity.