Takuya Hasumi

h-index4

4papers

29citations

Novelty53%

AI Score37

Ranked #94,849 of 194,257 authors (top 49%)#506 in AS (top 35%)

4 Papers

11.3ASJun 12, 2024Code

LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning

Masaya Kawamura, Ryuichi Yamamoto, Yuma Shirahata et al.

We introduce LibriTTS-P, a new corpus based on LibriTTS-R that includes utterance-level descriptions (i.e., prompts) of speaking style and speaker-level prompts of speaker characteristics. We employ a hybrid approach to construct prompt annotations: (1) manual annotations that capture human perceptions of speaker characteristics and (2) synthetic annotations on speaking style. Compared to existing English prompt datasets, our corpus provides more diverse prompt annotations for all speakers of LibriTTS-R. Experimental results for prompt-based controllable TTS demonstrate that the TTS model trained with LibriTTS-P achieves higher naturalness than the model using the conventional dataset. Furthermore, the results for style captioning tasks show that the model utilizing LibriTTS-P generates 2.5 times more accurate words than the model using a conventional dataset. Our corpus, LibriTTS-P, is available at https://github.com/line/LibriTTS-P.

9.7SDJun 23

Aligning MusicLLM with Emotion using Instruction Tuning and Feedback-Driven Alignment

Takuya Hasumi, Welly Naptali

This paper investigates whether music large language models (MusicLLMs) can be aligned for emotion regression. While MusicLLMs have shown strong performance in music information retrieval tasks, their ability to predict arousal and valence scores remains limited, since emotion regression has not been an explicit training objective. To examine whether MusicLLMs can be aligned with emotion, we train MusicLLMs on emotion regression and compare two strategies: instruction tuning and feedback-driven alignment. Our experiments show that task-aware instruction tuning enables MusicLLMs to predict emotion levels to some extent, although the accuracy remains limited. Applying feedback-driven alignment with a verifiable numerical reward substantially improves performance on both arousal and valence over instruction tuning alone. We further show that our approach improves emotion regression performance while maintaining MusicQA capability.

2.3ASJun 4, 2025

BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight Indexing

Masaya Kawamura, Takuya Hasumi, Yuma Shirahata et al.

This paper proposes a highly compact, lightweight text-to-speech (TTS) model for on-device applications. To reduce the model size, the proposed model introduces two techniques. First, we introduce quantization-aware training (QAT), which quantizes model parameters during training to as low as 1.58-bit. In this case, most of 32-bit model parameters are quantized to ternary values {-1, 0, 1}. Second, we propose a method named weight indexing. In this method, we save a group of 1.58-bit weights as a single int8 index. This allows for efficient storage of model parameters, even on hardware that treats values in units of 8-bit. Experimental results demonstrate that the proposed method achieved 83 % reduction in model size, while outperforming the baseline of similar model size without quantization in synthesis quality.

2.3SDSep 2, 2021

Multichannel Audio Source Separation with Independent Deeply Learned Matrix Analysis Using Product of Source Models

Takuya Hasumi, Tomohiko Nakamura, Norihiro Takamune et al.

Independent deeply learned matrix analysis (IDLMA) is one of the state-of-the-art multichannel audio source separation methods using the source power estimation based on deep neural networks (DNNs). The DNN-based power estimation works well for sounds having timbres similar to the DNN training data. However, the sounds to which IDLMA is applied do not always have such timbres, and the timbral mismatch causes the performance degradation of IDLMA. To tackle this problem, we focus on a blind source separation counterpart of IDLMA, independent low-rank matrix analysis. It uses nonnegative matrix factorization (NMF) as the source model, which can capture source spectral components that only appear in the target mixture, using the low-rank structure of the source spectrogram as a clue. We thus extend the DNN-based source model to encompass the NMF-based source model on the basis of the product-of-expert concept, which we call the product of source models (PoSM). For the proposed PoSM-based IDLMA, we derive a computationally efficient parameter estimation algorithm based on an optimization principle called the majorization-minimization algorithm. Experimental evaluations show the effectiveness of the proposed method.