Shusuke Takahashi

SD
h-index27
32papers
747citations
Novelty49%
AI Score56

32 Papers

SDJun 15, 2023
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam et al.

While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded with instructions, which guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also serves human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at https://zenodo.org/record/7880637.

LGMay 16, 2022
SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization

Yuhta Takida, Takashi Shibuya, WeiHsiang Liao et al.

One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook, also known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves some carefully designed heuristics, underlies this issue. In this paper, we propose a new training scheme that extends the standard VAE via novel stochastic dequantization and quantization, called stochastically quantized variational autoencoder (SQ-VAE). In SQ-VAE, we observe a trend that the quantization is stochastic at the initial stage of the training but gradually converges toward a deterministic quantization, which we call self-annealing. Our experiments show that SQ-VAE improves codebook utilization without using common heuristics. Furthermore, we empirically show that SQ-VAE is superior to VAE and VQ-VAE in vision- and speech-related tasks.

ASOct 27, 2022
Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement

Ryosuke Sawata, Naoki Murata, Yuhta Takida et al.

Although deep neural network (DNN)-based speech enhancement (SE) methods outperform the previous non-DNN-based ones, they often degrade the perceptual quality of generated outputs. To tackle this problem, we introduce a DNN-based generative refiner, Diffiner, aiming to improve perceptual speech quality pre-processed by an SE method. We train a diffusion-based generative model by utilizing a dataset consisting of clean speech only. Then, our refiner effectively mixes clean parts newly generated via denoising diffusion restoration into the degraded and distorted parts caused by a preceding SE method, resulting in refined speech. Once our refiner is trained on a set of clean speech, it can be applied to various SE methods without additional training specialized for each SE module. Therefore, our refiner can be a versatile post-processing module w.r.t. SE methods and has high potential in terms of modularity. Experimental results show that our method improved perceptual speech quality regardless of the preceding SE methods used.

SDOct 11, 2022
DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability

Kin Wai Cheuk, Ryosuke Sawata, Toshimitsu Uesaka et al.

In this paper we propose a novel generative approach, DiffRoll, to tackle automatic music transcription (AMT). Instead of treating AMT as a discriminative task in which the model is trained to convert spectrograms into piano rolls, we think of it as a conditional generative task where we train our model to generate realistic looking piano rolls from pure Gaussian noise conditioned on spectrograms. This new AMT formulation enables DiffRoll to transcribe, generate and even inpaint music. Due to the classifier-free nature, DiffRoll is also able to be trained on unpaired datasets where only piano rolls are available. Our experiments show that DiffRoll outperforms its discriminative counterpart by 19 percentage points (ppt.) and our ablation studies also indicate that it outperforms similar existing methods by 4.8 ppt. Source code and demonstration are available https://sony.github.io/DiffRoll/.

SDDec 19, 2025
Do Foundational Audio Encoders Understand Music Structure?

Keisuke Toyama, Zhi Zhong, Akira Takahashi et al.

In music information retrieval (MIR) research, the use of pretrained foundational audio encoders (FAEs) has recently become a trend. FAEs pretrained on large amounts of music and audio data have been shown to improve performance on MIR tasks such as music tagging and automatic music transcription. However, their use for music structure analysis (MSA) remains underexplored: only a small subset of FAEs has been examined for MSA, and the impact of factors such as learning methods, training data, and model context length on MSA performance remains unclear. In this study, we conduct comprehensive experiments on 11 types of FAEs to investigate how these factors affect MSA performance. Our results demonstrate that FAEs using self-supervised learning with masked language modeling on music data are particularly effective for MSA. These findings pave the way for future research in FAE and MSA.

CVFeb 24
Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

Christian Simon, Masato Ishii, Wei-Yao Wang et al.

Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing. To tackle this challenge, we present multimodal hierarchical networks so-called MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation. Our proposed method significantly improves long audio generation up to more than 5 minutes. We also prove that training short and testing long is possible in the video-to-audio generation tasks without training on the longer durations. We show in our experiments that our proposed method could achieve remarkable results on long-video to audio benchmarks, beating prior works in video-to-audio tasks. Moreover, we showcase our model capability in generating more than 5 minutes, while prior video-to-audio methods fall short in generating with long durations.

SDOct 21, 2024Code
OpenMU: Your Swiss Army Knife for Music Understanding

Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao et al.

We present OpenMU-Bench, a large-scale benchmark suite for addressing the data scarcity issue in training multimodal language models to understand music. To construct OpenMU-Bench, we leveraged existing datasets and bootstrapped new annotations. OpenMU-Bench also broadens the scope of music understanding by including lyrics understanding and music tool usage. Using OpenMU-Bench, we trained our music understanding model, OpenMU, with extensive ablations, demonstrating that OpenMU outperforms baseline models such as MU-Llama. Both OpenMU and OpenMU-Bench are open-sourced to facilitate future research in music understanding and to enhance creative music production efficiency.

SDMay 14
Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis

Shuyang Cui, Zhi Zhong, Qiyu Wu et al.

Current methods for creating drum loop audio in digital music production, such as using one-shot samples or resampling, often demand non-trivial efforts of creators. While recent generative models achieve high fidelity and adhere to text, they lack the specific control needed for such a task. Existing symbolic-to-audio research often focuses on single, tonal instruments, leaving the challenge of polyphonic, percussive drum synthesis unaddressed. We address this gap by introducing ``Break-the-Beat!,'' a model capable of rendering a drum MIDI with the timbre of a reference audio. It is built by fine-tuning a pre-trained text-to-audio model with our proposed content encoder and a effective hybrid conditioning mechanism. To enable this, we construct a new dataset of paired target-reference drum audio from existing drum audio datasets. Experiments demonstrate that our model generates high-quality drum audio that follows high-resolution drum MIDI, achieving strong performance across metrics of audio quality, rhythmic alignment, and beat continuity. This offer producers a new, controllable tool for creative production. Demo page: https://ik4sumii.github.io/break-the-beat/

SDOct 10, 2025Code
MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation

Akira Takahashi, Shusuke Takahashi, Yuki Mitsufuji

We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine-tuning, the model retains the ability for original video-to-audio generation. This highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks. Our code is available at https://github.com/sony/mmaudiosep.

ASOct 8, 2020Code
All for One and One for All: Improving Music Separation by Bridging Networks

Ryosuke Sawata, Stefan Uhlich, Shusuke Takahashi et al.

This paper proposes several improvements for music separation with deep neural networks (DNNs), namely a multi-domain loss (MDL) and two combination schemes. First, by using MDL we take advantage of the frequency and time domain representation of audio signals. Next, we utilize the relationship among instruments by jointly considering them. We do this on the one hand by modifying the network architecture and introducing a CrossNet structure. On the other hand, we consider combinations of instrument estimates by using a new combination loss (CL). MDL and CL can easily be applied to many existing DNN-based separation methods as they are merely loss functions which are only used during training and which do not affect the inference step. Experimental results show that the performance of Open-Unmix (UMX), a well-known and state-of-the-art open source library for music separation, can be improved by utilizing our above schemes. Our modifications of UMX are open-sourced together with this paper.

SDMay 1
MMAudioReverbs: Video-Guided Acoustic Modeling for Dereverberation and Room Impulse Response Estimation

Akira Takahashi, Ryosuke Sawata, Shusuke Takahashi et al.

Although recent video-to-audio (V2A) models excelled at synthesizing semantically plausible sounds from visual inputs, they do not explicitly model room-acoustic effects such as reverberation or room impulse responses (RIRs), and thus offer limited controllability over these effects. However, we hypothesize that such V2A models implicitly have semantic knowledge of the relationship between spatial audio and the corresponding vision cues. In this paper, we revisit a V2A model for the sake of the above, and propose the way to utilize the pretrained model as prior for physically grounded room-acoustic processing. Based on one of the state-of-the-art V2A models, MMAudio, we propose MMAudioReverbs that is a unified framework dealing with i) dereverberation and ii) room impulse response (RIR) estimation without network architectural modification, and fine-tuned on a small dataset. Experimental results showed that audio and visual cues respectively have advantage depending on the type of physical room acoustics. It implies that foundation V2A models can be used for physically grounded room-acoustic analysis.

SDMay 1
MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video

Kazuya Tateishi, Akira Takahashi, Atsuo Hiroe et al.

Recent advances in multimodal generation have enabled high-quality audio generation from silent videos. Practical applications, such as sound production, demand not only the generated audio but also explicit sound event labels detailing the type and timing of sounds. One straightforward approach involves applying a standard sound event detection to the generated audio. However, this post-hoc pipeline is inherently limited, as it is prone to error accumulation. To address this limitation, we propose MMAudio-LABEL (LAtent-Based Event Labeling), an event-aware audio generation framework built on a foundational audio generation model as its backbone that jointly generates audio and frame-aligned sound event predictions from silent videos. We evaluate our method on the Greatest Hits dataset for onset detection and 17-class material classification. Our approach improves onset-detection accuracy from 46.7% to 75.0% and material-classification accuracy from 40.6% to 61.0% over baselines. These results suggest that jointly learning audio generation and event prediction enables a more interpretable and practical video-to-audio synthesis.

SDNov 2, 2024
Music Foundation Model as Generic Booster for Music Downstream Tasks

WeiHsiang Liao, Yuhta Takida, Yukara Ikemiya et al.

We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results reveal that the features extracted from foundation models provide valuable enhancements in training downstream task models. This highlights the capability of using features extracted from music foundation models as a booster for downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity. This paves the way for more effective and accessible music processing solutions.

CVMay 23, 2024
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

Shiqi Yang, Zhi Zhong, Mengjie Zhao et al.

In recent years, with the realistic generation results and a wide range of personalized applications, diffusion-based generative models gain huge attention in both visual and audio generation areas. Compared to the considerable advancements of text2image or text2audio generation, research in audio2visual or visual2audio generation has been relatively slow. The recent audio-visual generation methods usually resort to huge large language model or composable diffusion models. Instead of designing another giant model for audio-visual generation, in this paper we take a step back showing a simple and lightweight generative transformer, which is not fully investigated in multi-modal generation, can achieve excellent results on image2audio generation. The transformer operates in the discrete audio and visual Vector-Quantized GAN space, and is trained in the mask denoising manner. After training, the classifier-free guidance could be deployed off-the-shelf achieving better performance, without any extra training or modification. Since the transformer model is modality symmetrical, it could also be directly deployed for audio2image generation and co-generation. In the experiments, we show that our simple method surpasses recent image2audio generation methods. Generated audio samples can be found at https://docs.google.com/presentation/d/1ZtC0SeblKkut4XJcRaDsSTuCRIXB3ypxmSi7HTY3IyQ/

SDMay 22, 2025
SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet

Zhi Zhong, Akira Takahashi, Shuyang Cui et al.

Foley synthesis aims to synthesize high-quality audio that is both semantically and temporally aligned with video frames. Given its broad application in creative industries, the task has gained increasing attention in the research community. To avoid the non-trivial task of training audio generative models from scratch, adapting pretrained audio generative models for video-synchronized foley synthesis presents an attractive direction. ControlNet, a method for adding fine-grained controls to pretrained generative models, has been applied to foley synthesis, but its use has been limited to handcrafted human-readable temporal conditions. In contrast, from-scratch models achieved success by leveraging high-dimensional deep features extracted using pretrained video encoders. We have observed a performance gap between ControlNet-based and from-scratch foley models. To narrow this gap, we propose SpecMaskFoley, a method that steers the pretrained SpecMaskGIT model toward video-synchronized foley synthesis via ControlNet. To unlock the potential of a single ControlNet branch, we resolve the discrepancy between the temporal video features and the time-frequency nature of the pretrained SpecMaskGIT via a frequency-aware temporal feature aligner, eliminating the need for complicated conditioning mechanisms widely used in prior arts. Evaluations on a common foley synthesis benchmark demonstrate that SpecMaskFoley could even outperform strong from-scratch baselines, substantially advancing the development of ControlNet-based foley synthesis models. Demo page: https://zzaudio.github.io/SpecMaskFoley_Demo/

SDJul 16, 2025
Stereo Sound Event Localization and Detection with Onscreen/offscreen Classification

Kazuki Shimada, Archontis Politis, Iran R. Roman et al.

This paper presents the objective, dataset, baseline, and metrics of Task 3 of the DCASE2025 Challenge on sound event localization and detection (SELD). In previous editions, the challenge used four-channel audio formats of first-order Ambisonics (FOA) and microphone array. In contrast, this year's challenge investigates SELD with stereo audio data (termed stereo SELD). This change shifts the focus from more specialized 360° audio and audiovisual scene analysis to more commonplace audio and media scenarios with limited field-of-view (FOV). Due to inherent angular ambiguities in stereo audio data, the task focuses on direction-of-arrival (DOA) estimation in the azimuth plane (left-right axis) along with distance estimation. The challenge remains divided into two tracks: audio-only and audiovisual, with the audiovisual track introducing a new sub-task of onscreen/offscreen event classification necessitated by the limited FOV. This challenge introduces the DCASE2025 Task3 Stereo SELD Dataset, whose stereo audio and perspective video clips are sampled and converted from the STARSS23 recordings. The baseline system is designed to process stereo audio and corresponding video frames as inputs. In addition to the typical SELD event classification and localization, it integrates onscreen/offscreen classification for the audiovisual track. The evaluation metrics have been modified to introduce an onscreen/offscreen accuracy metric, which assesses the models' ability to identify which sound sources are onscreen. In the experimental evaluation, the baseline system performs reasonably well with the stereo audio data.

SDOct 29, 2025
'Studies for': A Human-AI Co-Creative Sound Artwork Using a Real-time Multi-channel Sound Generation Model

Chihiro Nagashima, Akira Takahashi, Zhi Zhong et al.

This paper explores the integration of AI technologies into the artistic workflow through the creation of Studies for, a generative sound installation developed in collaboration with sound artist Evala (https://www.ntticc.or.jp/en/archive/works/studies-for/). The installation employs SpecMaskGIT, a lightweight yet high-quality sound generation AI model, to generate and playback eight-channel sound in real-time, creating an immersive auditory experience over the course of a three-month exhibition. The work is grounded in the concept of a "new form of archive," which aims to preserve the artistic style of an artist while expanding beyond artists' past artworks by continued generation of new sound elements. This speculative approach to archival preservation is facilitated by training the AI model on a dataset consisting of over 200 hours of Evala's past sound artworks. By addressing key requirements in the co-creation of art using AI, this study highlights the value of the following aspects: (1) the necessity of integrating artist feedback, (2) datasets derived from an artist's past works, and (3) ensuring the inclusion of unexpected, novel outputs. In Studies for, the model was designed to reflect the artist's artistic identity while generating new, previously unheard sounds, making it a fitting realization of the concept of "a new form of archive." We propose a Human-AI co-creation framework for effectively incorporating sound generation AI models into the sound art creation process and suggest new possibilities for creating and archiving sound art that extend an artist's work beyond their physical existence. Demo page: https://sony.github.io/studies-for/

AIOct 1, 2025
VIRTUE: Visual-Interactive Text-Image Universal Embedder

Wei-Yao Wang, Kazuya Tateishi, Qiyu Wu et al.

Multimodal representation learning models have demonstrated successful operation across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities to specify regions of interest from users (e.g., point, bounding box, mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions not only would unlock new applications with localized grounding of user intent, which remains unexplored, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel Visual-InteRactive Text-Image Universal Embedder (VIRTUE) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning. In VIRTUE, the segmentation model can process visual prompts that pinpoint specific regions within an image, thereby enabling the embedder to handle complex and ambiguous scenarios more precisely. To evaluate the visual-interaction ability of VIRTUE, we introduce a large-scale Segmentation-and-Scene Caption Retrieval (SCaR) benchmark comprising 1M samples that aims to retrieve the text caption by jointly considering the entity with a specific object and image scene. VIRTUE consistently achieves a state-of-the-art performance with significant improvements across 36 universal MMEB (3.1%-8.5%) and five visual-interactive SCaR (15.2%-20.3%) tasks.

SDSep 26, 2025
Noise-to-Notes: Diffusion-based Generation and Refinement for Automatic Drum Transcription

Michael Yeung, Keisuke Toyama, Toya Teramoto et al.

Automatic drum transcription (ADT) is traditionally formulated as a discriminative task to predict drum events from audio spectrograms. In this work, we redefine ADT as a conditional generative task and introduce Noise-to-Notes (N2N), a framework leveraging diffusion modeling to transform audio-conditioned Gaussian noise into drum events with associated velocities. This generative diffusion approach offers distinct advantages, including a flexible speed-accuracy trade-off and strong inpainting capabilities. However, the generation of binary onset and continuous velocity values presents a challenge for diffusion models, and to overcome this, we introduce an Annealed Pseudo-Huber loss to facilitate effective joint optimization. Finally, to augment low-level spectrogram features, we propose incorporating features extracted from music foundation models (MFMs), which capture high-level semantic information and enhance robustness to out-of-domain drum audio. Experimental results demonstrate that including MFM features significantly improves robustness and N2N establishes a new state-of-the-art performance across multiple ADT benchmarks.

CVAug 1, 2025
TITAN-Guide: Taming Inference-Time AligNment for Guided Text-to-Video Diffusion Models

Christian Simon, Masato Ishii, Akio Hayakawa et al.

In the recent development of conditional diffusion models still require heavy supervised fine-tuning for performing control on a category of tasks. Training-free conditioning via guidance with off-the-shelf models is a favorable alternative to avoid further fine-tuning on the base model. However, the existing training-free guidance frameworks either have heavy memory requirements or offer sub-optimal control due to rough estimation. These shortcomings limit the applicability to control diffusion models that require intense computation, such as Text-to-Video (T2V) diffusion models. In this work, we propose Taming Inference Time Alignment for Guided Text-to-Video Diffusion Model, so-called TITAN-Guide, which overcomes memory space issues, and provides more optimal control in the guidance process compared to the counterparts. In particular, we develop an efficient method for optimizing diffusion latents without backpropagation from a discriminative guiding model. In particular, we study forward gradient descents for guided diffusion tasks with various options on directional directives. In our experiments, we demonstrate the effectiveness of our approach in efficiently managing memory during latent optimization, while previous methods fall short. Our proposed approach not only minimizes memory requirements but also significantly enhances T2V performance across a range of diffusion guidance benchmarks. Code, models, and demo are available at https://titanguide.github.io.

CVJun 4, 2024
MoLA: Motion Generation and Editing with Latent Diffusion Enhanced by Adversarial Training

Kengo Uchida, Takashi Shibuya, Yuhta Takida et al.

In text-to-motion generation, controllability as well as generation quality and speed has become increasingly critical. The controllability challenges include generating a motion of a length that matches the given textual description and editing the generated motions according to control signals, such as the start-end positions and the pelvis trajectory. In this paper, we propose MoLA, which provides fast, high-quality, variable-length motion generation and can also deal with multiple editing tasks in a single framework. Our approach revisits the motion representation used as inputs and outputs in the model, incorporating an activation variable to enable variable-length motion generation. Additionally, we integrate a variational autoencoder and a latent diffusion model, further enhanced through adversarial training, to achieve high-quality and fast generation. Moreover, we apply a training-free guided generation framework to achieve various editing tasks with motion control inputs. We quantitatively show the effectiveness of adversarial learning in text-to-motion generation, and demonstrate the applicability of our editing framework to multiple editing tasks in the motion domain.

SDMay 18, 2023
Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

Hao Shi, Kazuki Shimada, Masato Hirano et al.

Diffusion-based generative speech enhancement (SE) has recently received attention, but reverse diffusion remains time-consuming. One solution is to initialize the reverse diffusion process with enhanced features estimated by a predictive SE system. However, the pipeline structure currently does not consider for a combined use of generative and predictive decoders. The predictive decoder allows us to use the further complementarity between predictive and diffusion-based generative SE. In this paper, we propose a unified system that use jointly generative and predictive decoders across two levels. The encoder encodes both generative and predictive information at the shared encoding level. At the decoded feature level, we fuse the two decoded features by generative and predictive decoders. Specifically, the two SE modules are fused in the initial and final diffusion steps: the initial fusion initializes the diffusion process with the predictive SE to improve convergence, and the final fusion combines the two complementary SE outputs to enhance SE performance. Experiments conducted on the Voice-Bank dataset demonstrate that incorporating predictive information leads to faster decoding and higher PESQ scores compared with other score-based diffusion SE (StoRM and SGMSE+).

ASOct 14, 2021
Multi-ACCDOA: Localizing and Detecting Overlapping Sounds from the Same Class with Auxiliary Duplicating Permutation Invariant Training

Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi et al.

Sound event localization and detection (SELD) involves identifying the direction-of-arrival (DOA) and the event class. The SELD methods with a class-wise output format make the model predict activities of all sound event classes and corresponding locations. The class-wise methods can output activity-coupled Cartesian DOA (ACCDOA) vectors, which enable us to solve a SELD task with a single target using a single network. However, there is still a challenge in detecting the same event class from multiple locations. To overcome this problem while maintaining the advantages of the class-wise format, we extended ACCDOA to a multi one and proposed auxiliary duplicating permutation invariant training (ADPIT). The multi- ACCDOA format (a class- and track-wise output format) enables the model to solve the cases with overlaps from the same class. The class-wise ADPIT scheme enables each track of the multi-ACCDOA format to learn with the same target as the single-ACCDOA format. In evaluations with the DCASE 2021 Task 3 dataset, the model trained with the multi-ACCDOA format and with the class-wise ADPIT detects overlapping events from the same class while maintaining its performance in the other cases. Also, the proposed method performed comparably to state-of-the-art SELD methods with fewer parameters.

SDOct 13, 2021
Spatial Data Augmentation with Simulated Room Impulse Responses for Sound Event Localization and Detection

Yuichiro Koyama, Kazuhide Shigemi, Masafumi Takahashi et al.

Recording and annotating real sound events for a sound event localization and detection (SELD) task is time consuming, and data augmentation techniques are often favored when the amount of data is limited. However, how to augment the spatial information in a dataset, including unlabeled directional interference events, remains an open research question. Furthermore, directional interference events make it difficult to accurately extract spatial characteristics from target sound events. To address this problem, we propose an impulse response simulation framework (IRS) that augments spatial characteristics using simulated room impulse responses (RIR). RIRs corresponding to a microphone array assumed to be placed in various rooms are accurately simulated, and the source signals of the target sound events are extracted from a mixture. The simulated RIRs are then convolved with the extracted source signals to obtain an augmented multi-channel training dataset. Evaluation results obtained using the TAU-NIGENS Spatial Sound Events 2021 dataset show that the IRS contributes to improving the overall SELD performance. Additionally, we conducted an ablation study to discuss the contribution and need for each component within the IRS.

SDOct 13, 2021
Music Source Separation with Deep Equilibrium Models

Yuichiro Koyama, Naoki Murata, Stefan Uhlich et al.

While deep neural network-based music source separation (MSS) is very effective and achieves high performance, its model size is often a problem for practical deployment. Deep implicit architectures such as deep equilibrium models (DEQ) were recently proposed, which can achieve higher performance than their explicit counterparts with limited depth while keeping the number of parameters small. This makes DEQ also attractive for MSS, especially as it was originally applied to sequential modeling tasks in natural language processing and thus should in principle be also suited for MSS. However, an investigation of a good architecture and training scheme for MSS with DEQ is needed as the characteristics of acoustic signals are different from those of natural language data. Hence, in this paper we propose an architecture and training scheme for MSS with DEQ. Starting with the architecture of Open-Unmix (UMX), we replace its sequence model with DEQ. We refer to our proposed method as DEQ-based UMX (DEQ-UMX). Experimental results show that DEQ-UMX performs better than the original UMX while reducing its number of parameters by 30%.

ASOct 12, 2021
Spatial mixup: Directional loudness modification as data augmentation for sound event localization and detection

Ricardo Falcon-Perez, Kazuki Shimada, Yuichiro Koyama et al.

Data augmentation methods have shown great importance in diverse supervised learning problems where labeled data is scarce or costly to obtain. For sound event localization and detection (SELD) tasks several augmentation methods have been proposed, with most borrowing ideas from other domains such as images, speech, or monophonic audio. However, only a few exploit the spatial properties of a full 3D audio scene. We propose Spatial Mixup, as an application of parametric spatial audio effects for data augmentation, which modifies the directional properties of a multi-channel spatial audio signal encoded in the ambisonics domain. Similarly to beamforming, these modifications enhance or suppress signals arriving from certain directions, although the effect is less pronounced. Therefore enabling deep learning models to achieve invariance to small spatial perturbations. The method is evaluated with experiments in the DCASE 2021 Task 3 dataset, where spatial mixup increases performance over a non-augmented baseline, and compares to other well known augmentation methods. Furthermore, combining spatial mixup with other methods greatly improves performance.

ASOct 12, 2021
Improving Character Error Rate Is Not Equal to Having Clean Speech: Speech Enhancement for ASR Systems with Black-box Acoustic Models

Ryosuke Sawata, Yosuke Kashiwagi, Shusuke Takahashi

A deep neural network (DNN)-based speech enhancement (SE) aiming to maximize the performance of an automatic speech recognition (ASR) system is proposed in this paper. In order to optimize the DNN-based SE model in terms of the character error rate (CER), which is one of the metric to evaluate the ASR system and generally non-differentiable, our method uses two DNNs: one for speech processing and one for mimicking the output CERs derived through an acoustic model (AM). Then both of DNNs are alternately optimized in the training phase. Even if the AM is a black-box, e.g., like one provided by a third-party, the proposed method enables the DNN-based SE model to be optimized in terms of the CER since the DNN mimicking the AM is differentiable. Consequently, it becomes feasible to build CER-centric SE model that has no negative effect, e.g., additional calculation cost and changing network architecture, on the inference phase since our method is merely a training scheme for the existing DNN-based methods. Experimental results show that our method improved CER by 8.8% relative derived through a black-box AM although certain noise levels are kept.

ASJun 21, 2021
Ensemble of ACCDOA- and EINV2-based Systems with D3Nets and Impulse Response Simulation for Sound Event Localization and Detection

Kazuki Shimada, Naoya Takahashi, Yuichiro Koyama et al.

This report describes our systems submitted to the DCASE2021 challenge task 3: sound event localization and detection (SELD) with directional interference. Our previous system based on activity-coupled Cartesian direction of arrival (ACCDOA) representation enables us to solve a SELD task with a single target. This ACCDOA-based system with efficient network architecture called RD3Net and data augmentation techniques outperformed state-of-the-art SELD systems in terms of localization and location-dependent detection. Using the ACCDOA-based system as a base, we perform model ensembles by averaging outputs of several systems trained with different conditions such as input features, training folds, and model architectures. We also use the event independent network v2 (EINV2)-based system to increase the diversity of the model ensembles. To generalize the models, we further propose impulse response simulation (IRS), which generates simulated multi-channel signals by convolving simulated room impulse responses (RIRs) with source signals extracted from the original dataset. Our systems significantly improved over the baseline system on the development dataset.

ASJun 4, 2021
Manifold-Aware Deep Clustering: Maximizing Angles between Embedding Vectors Based on Regular Simplex

Keitaro Tanaka, Ryosuke Sawata, Shusuke Takahashi

This paper presents a new deep clustering (DC) method called manifold-aware DC (M-DC) that can enhance hyperspace utilization more effectively than the original DC. The original DC has a limitation in that a pair of two speakers has to be embedded having an orthogonal relationship due to its use of the one-hot vector-based loss function, while our method derives a unique loss function aimed at maximizing the target angle in the hyperspace based on the nature of a regular simplex. Our proposed loss imposes a higher penalty than the original DC when the speaker is assigned incorrectly. The change from DC to M-DC can be easily achieved by rewriting just one term in the loss function of DC, without any other modifications to the network architecture or model parameters. As such, our method has high practicability because it does not affect the original inference part. The experimental results show that the proposed method improves the performances of the original DC and its expansion method.

LGFeb 17, 2021
Preventing Oversmoothing in VAE via Generalized Variance Parameterization

Yuhta Takida, Wei-Hsiang Liao, Chieh-Hsin Lai et al.

Variational autoencoders (VAEs) often suffer from posterior collapse, which is a phenomenon in which the learned latent space becomes uninformative. This is often related to the hyperparameter resembling the data variance. It can be shown that an inappropriate choice of this hyperparameter causes the oversmoothness in the linearly approximated case and can be empirically verified for the general cases. Moreover, determining such appropriate choice becomes infeasible if the data variance is non-uniform or conditional. Therefore, we propose VAE extensions with generalized parameterizations of the data variance and incorporate maximum likelihood estimation into the objective function to adaptively regularize the decoder smoothness. The images generated from proposed VAE extensions show improved Fréchet inception distance (FID) on MNIST and CelebA datasets.

ASOct 29, 2020
ACCDOA: Activity-Coupled Cartesian Direction of Arrival Representation for Sound Event Localization and Detection

Kazuki Shimada, Yuichiro Koyama, Naoya Takahashi et al.

Neural-network (NN)-based methods show high performance in sound event localization and detection (SELD). Conventional NN-based methods use two branches for a sound event detection (SED) target and a direction-of-arrival (DOA) target. The two-branch representation with a single network has to decide how to balance the two objectives during optimization. Using two networks dedicated to each task increases system complexity and network size. To address these problems, we propose an activity-coupled Cartesian DOA (ACCDOA) representation, which assigns a sound event activity to the length of a corresponding Cartesian DOA vector. The ACCDOA representation enables us to solve a SELD task with a single target and has two advantages: avoiding the necessity of balancing the objectives and model size increase. In experimental evaluations with the DCASE 2020 Task 3 dataset, the ACCDOA representation outperformed the two-branch representation in SELD metrics with a smaller network size. The ACCDOA-based SELD system also performed better than state-of-the-art SELD systems in terms of localization and location-dependent detection.

ASJun 22, 2020
Sound Event Localization and Detection Using Activity-Coupled Cartesian DOA Vector and RD3net

Kazuki Shimada, Naoya Takahashi, Shusuke Takahashi et al.

Our systems submitted to the DCASE2020 task~3: Sound Event Localization and Detection (SELD) are described in this report. We consider two systems: a single-stage system that solve sound event localization~(SEL) and sound event detection~(SED) simultaneously, and a two-stage system that first handles the SED and SEL tasks individually and later combines those results. As the single-stage system, we propose a unified training framework that uses an activity-coupled Cartesian DOA vector~(ACCDOA) representation as a single target for both the SED and SEL tasks. To efficiently estimate sound event locations and activities, we further propose RD3Net, which incorporates recurrent and convolution layers with dense skip connections and dilation. To generalize the models, we apply three data augmentation techniques: equalized mixture data augmentation~(EMDA), rotation of first-order Ambisonic~(FOA) singals, and multichannel extension of SpecAugment. Our systems demonstrate a significant improvement over the baseline system.