AS Mar 25
Rethinking Masking Strategies for Masked Prediction-based Audio Self-supervised Learning
Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda et al.
Since the introduction of Masked Autoencoders, various improvements to masking techniques have been explored. In this paper, we rethink masking strategies for audio representation learning using masked prediction-based self-supervised learning (SSL) on general audio spectrograms. While recent informed masking techniques have attracted attention, we observe that they incur substantial computational overhead. Motivated by this observation, we propose dispersion-weighted masking (DWM), a lightweight masking strategy that leverages the spectral sparsity inherent in the frequency structure of audio content. Our experiments show that inverse block masking, commonly used in recent SSL frameworks, improves audio event understanding performance while introducing a trade-off in generalization. The proposed DWM alleviates these limitations and computational complexity, leading to consistent performance improvements. This work provides practical guidance on masking strategy design for masked prediction-based audio representation learning.
MM Apr 12, 2024
Guided Masked Self-Distillation Modeling for Distributed Multimedia Sensor Event Analysis
Masahiro Yasuda, Noboru Harada, Yasunori Ohishi et al.
Observations with distributed sensors are essential in analyzing a series of human and machine activities (referred to as 'events' in this paper) in complex and extensive real-world environments. This is because the information obtained from a single sensor is often missing or fragmented in such an environment; observations from multiple locations and modalities should be integrated to analyze events comprehensively. However, a learning method has yet to be established to extract joint representations that effectively combine such distributed observations. Therefore, we propose Guided Masked sELf-Distillation modeling (Guided-MELD) for inter-sensor relationship modeling. The basic idea of Guided-MELD is to learn to supplement the information from the masked sensor with information from other sensors needed to detect the event. Guided-MELD is expected to enable the system to effectively distill the fragmented or redundant target event information obtained by the sensors without being overly dependent on any specific sensors. To validate the effectiveness of the proposed method in novel tasks of distributed multimedia sensor event analysis, we recorded two new datasets that fit the problem setting: MM-Store and MM-Office. These datasets consist of human activities in a convenience store and an office, recorded using distributed cameras and microphones. Experimental results on these datasets show that the proposed Guided-MELD improves event tagging and detection performance and outperforms conventional inter-sensor relationship modeling methods. Furthermore, the proposed method performed robustly even when sensors were reduced.
AS Jun 1, 2025
CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation Tokenizer
Daiki Takeuchi, Binh Thien Nguyen, Masahiro Yasuda et al.
Automated Audio Captioning (AAC) aims to describe the semantic contexts of general sounds, including acoustic events and scenes, by leveraging effective acoustic features. To enhance performance, an AAC method, EnCLAP, employed discrete tokens from EnCodec as an effective input for fine-tuning the language model BART. However, EnCodec is designed to reconstruct waveforms rather than capture the semantic contexts of general sounds, which AAC should describe. To address this issue, we propose CLAP-ART, an AAC method that utilizes "semantic-rich and discrete" tokens as input. CLAP-ART computes semantic-rich discrete tokens from pre-trained audio representations (AR) through vector quantization. We experimentally confirmed that CLAP-ART outperforms the baseline EnCLAP on two AAC benchmarks, indicating that discrete tokens derived from semantically rich AR are beneficial for AAC.
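The vector-quantization step mentioned in this abstract can be illustrated with a minimal sketch, assuming plain nearest-neighbour lookup in a single codebook. The abstract does not state which quantizer CLAP-ART actually uses, and the function name `quantize`, the codebook, and the feature values below are all hypothetical.

```python
def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector."""
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every codeword.
        dists = [sum((f - c) ** 2 for f, c in zip(frame, code))
                 for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


# Hypothetical 2-D codebook and feature frames, for illustration only.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]
tokens = quantize(frames, codebook)
```

The resulting integer sequence is the kind of discrete input a sequence model such as BART could consume in place of codec tokens.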
AS Feb 18, 2022
Multi-view and Multi-modal Event Detection Utilizing Transformer-based Multi-sensor fusion
Masahiro Yasuda, Yasunori Ohishi, Shoichiro Saito et al.
We tackle a challenging task: multi-view and multi-modal event detection that detects events in a wide-range real environment by utilizing data from distributed cameras and microphones and their weak labels. In this task, distributed sensors are utilized complementarily to capture events that are difficult to capture with a single sensor, such as a series of actions of people moving in an intricate room, or communication between people located far apart in a room. For sensors to cooperate effectively in such a situation, the system should be able to exchange information among sensors and combine information that is useful for identifying events in a complementary manner. For such a mechanism, we propose a Transformer-based multi-sensor fusion (MultiTrans), which combines multi-sensor data on the basis of the relationships between features of different viewpoints and modalities. In experiments using a dataset newly collected for this task, our proposed method using MultiTrans improved the event detection performance and outperformed the compared methods.
AS Feb 18, 2022
Echo-aware Adaptation of Sound Event Localization and Detection in Unknown Environments
Masahiro Yasuda, Yasunori Ohishi, Shoichiro Saito
Our goal is to develop a sound event localization and detection (SELD) system that works robustly in unknown environments. The performance of a SELD system trained on data from known environments degrades in an unknown environment due to environmental effects, such as reverberation and noise, that are not contained in the training data. Previous studies on related tasks have shown that domain adaptation methods are effective when data on the environment in which the system will be used is available, even without labels. However, adaptation to unknown environments remains a difficult task. In this study, we propose echo-aware feature refinement (EAR) for SELD, which suppresses environmental effects at the feature level by using additional spatial cues of the unknown environment obtained through measuring acoustic echoes. FOA-MEIR, an impulse response dataset containing over 100 environments, was recorded to validate the proposed method. Experiments on FOA-MEIR show that EAR effectively improves SELD performance in unknown environments.
AS Feb 17, 2022
Wearable SELD dataset: Dataset for sound event localization and detection using wearable devices around head
Kento Nagatomo, Masahiro Yasuda, Kohei Yatabe et al.
Sound event localization and detection (SELD) is a combined task of identifying a sound event and its direction. Deep neural networks (DNNs) are utilized to associate them with the sound signals observed by a microphone array. Although ambisonic microphones are popular in the SELD literature, they might limit the range of applications due to their predetermined geometry. Some applications (including those for pedestrians that perform SELD while walking) require a wearable microphone array whose geometry can be designed to suit the task. In this paper, for the development of such wearable SELD, we propose a dataset named the Wearable SELD dataset. It consists of data recorded by 24 microphones placed on a head and torso simulator (HATS) with some accessories mimicking wearable devices (glasses, earphones, and headphones). We also provide experimental results of SELD using the proposed dataset and SELDNet to investigate the effect of microphone configuration.
AS Feb 16, 2022
APPLADE: Adjustable Plug-and-play Audio Declipper Combining DNN with Sparse Optimization
Tomoro Tanaka, Kohei Yatabe, Masahiro Yasuda et al.
In this paper, we propose an audio declipping method that takes advantage of both sparse optimization and deep learning. Since sparsity-based audio declipping methods have been developed upon constrained optimization, they are adjustable and well-studied in theory. However, they always uniformly promote sparsity and ignore the individual properties of a signal. Deep neural network (DNN)-based methods can learn the properties of target signals and use them for audio declipping. Still, they cannot perform well if the training data are mismatched and/or constraints in the time domain are not imposed. In the proposed method, we use a DNN in an optimization algorithm. It is inspired by an idea called plug-and-play (PnP) and enables us to promote sparsity based on the learned information of data while considering constraints in the time domain. Our experiments confirmed that the proposed method is stable and robust to mismatches between training and test data.
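To make the "constraints in the time domain" concrete, here is a minimal sketch of the clipping-consistency projection commonly used in sparsity-based declipping. It assumes symmetric hard clipping at a known threshold; the abstract does not spell out APPLADE's exact constraint set, so the function name `project_consistent` and the sample values are illustrative assumptions.

```python
def project_consistent(x, y, theta):
    """Project a candidate signal x onto the set consistent with clipped y.

    Assumes hard clipping at +/- theta:
      - unclipped samples must equal the observation,
      - samples clipped at +theta must stay >= theta,
      - samples clipped at -theta must stay <= -theta.
    """
    out = []
    for xi, yi in zip(x, y):
        if yi >= theta:
            out.append(max(xi, theta))
        elif yi <= -theta:
            out.append(min(xi, -theta))
        else:
            out.append(yi)
    return out


# Hypothetical observation clipped at 0.5 and a candidate estimate.
observed = [0.2, 0.5, -0.5, 0.1]
candidate = [0.3, 0.4, -0.9, 0.0]
projected = project_consistent(candidate, observed, theta=0.5)
```

In an iterative scheme, a step of this kind would alternate with a step that promotes the signal prior, which is where a plug-and-play method substitutes a learned component.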
AS Dec 14, 2020
Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval
Yuma Koizumi, Yasunori Ohishi, Daisuke Niizumi et al.
The goal of audio captioning is to translate input audio into its description using natural language. One of the problems in audio captioning is the lack of training data due to the difficulty in collecting audio-caption pairs by crawling the web. In this study, to overcome this problem, we propose to use a pre-trained large-scale language model. Since audio cannot be fed directly into such a language model, we utilize guidance captions retrieved from a training dataset based on similarities that may exist in different audio. Then, the caption of the audio input is generated by using a pre-trained language model while referring to the guidance captions. Experimental results show that (i) the proposed method succeeded in using a pre-trained language model for audio captioning, and (ii) the oracle performance of the pre-trained model-based caption generator was clearly better than that of the conventional method trained from scratch.
AS Jul 1, 2020
A Transformer-based Audio Captioning Model with Keyword Estimation
Yuma Koizumi, Ryo Masumura, Kyosuke Nishida et al.
One of the problems with automated audio captioning (AAC) is the indeterminacy in word selection corresponding to the audio event/scene. Since one acoustic event/scene can be described with several words, it results in a combinatorial explosion of possible captions and difficulty in training. To solve this problem, we propose a Transformer-based audio-captioning model with keyword estimation called TRACKE. It simultaneously solves the word-selection indeterminacy problem with the main task of AAC while executing the sub-task of acoustic event detection/acoustic scene classification (i.e., keyword estimation). TRACKE estimates keywords, which comprise a word set corresponding to audio events/scenes in the input audio, and generates the caption while referring to the estimated keywords to reduce word-selection indeterminacy. Experimental results on a public AAC dataset indicate that TRACKE achieved state-of-the-art performance and successfully estimated both the caption and its keywords.
AS Jun 10, 2020
Description and Discussion on DCASE2020 Challenge Task2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring
Yuma Koizumi, Yohei Kawaguchi, Keisuke Imoto et al.
In this paper, we present the task description and discuss the results of the DCASE 2020 Challenge Task 2: Unsupervised Detection of Anomalous Sounds for Machine Condition Monitoring. The goal of anomalous sound detection (ASD) is to identify whether the sound emitted from a target machine is normal or anomalous. The main challenge of this task is to detect unknown anomalous sounds under the condition that only normal sound samples have been provided as training data. We have designed this challenge as the first benchmark of ASD research, which includes a large-scale dataset, evaluation metrics, and a simple baseline system. We received 117 submissions from 40 teams, and several novel approaches have been developed as a result of this challenge. On the basis of the analysis of the evaluation results, we discuss two new approaches and their problems.
AS Feb 14, 2020
Sound Event Localization based on Sound Intensity Vector Refined By DNN-Based Denoising and Source Separation
Masahiro Yasuda, Yuma Koizumi, Shoichiro Saito et al.
We propose a direction-of-arrival (DOA) estimation method for Sound Event Localization and Detection (SELD). Direct estimation of DOA using a deep neural network (DNN), i.e., a completely data-driven approach, achieves high accuracy. However, there is a gap in accuracy between DOA estimation for single and overlapping sources because such approaches cannot incorporate physical knowledge. Meanwhile, although the accuracy of physics-based approaches is inferior to that of DNN-based approaches, they are robust to overlapping sources. In this study, we consider a combination of physics-based and DNN-based approaches; the sound intensity vectors (IVs) for physics-based DOA estimation are refined based on DNN-based denoising and source separation. This method enables accurate DOA estimation for both single and overlapping sources using a spherical microphone array. Experimental results show that the proposed method achieves state-of-the-art DOA estimation accuracy on an open SELD dataset.
SD Feb 14, 2020
Sound Event Detection by Multitask Learning of Sound Events and Scenes with Soft Scene Labels
Keisuke Imoto, Noriyuki Tonami, Yuma Koizumi et al.
Sound event detection (SED) and acoustic scene classification (ASC) are major tasks in environmental sound analysis. Considering that sound events and scenes are closely related to each other, some works have addressed joint analyses of sound events and acoustic scenes based on multitask learning (MTL), in which the knowledge of sound events and scenes can help in estimating them mutually. The conventional MTL-based methods utilize one-hot scene labels to train the relationship between sound events and scenes; thus, the conventional methods cannot model the extent to which sound events and scenes are related. However, in the real environment, common sound events may occur in some acoustic scenes; on the other hand, some sound events occur only in a limited acoustic scene. In this paper, we thus propose a new method for SED based on MTL of SED and ASC using the soft labels of acoustic scenes, which enable us to model the extent to which sound events and scenes are related. Experiments conducted using TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets show that the proposed method improves the SED performance by 3.80% in F-score compared with conventional MTL-based SED.
AS Oct 10, 2019
DOA Estimation by DNN-based Denoising and Dereverberation from Sound Intensity Vector
Masahiro Yasuda, Yuma Koizumi, Luca Mazzon et al.
We propose a direction of arrival (DOA) estimation method that combines sound-intensity vector (IV)-based DOA estimation and DNN-based denoising and dereverberation. Since the accuracy of IV-based DOA estimation degrades due to environmental noise and reverberation, two DNNs are used to remove such effects from the observed IVs. DOA is then estimated from the refined IVs based on the physics of wave propagation. Experiments on an open dataset showed that the average DOA error of the proposed method was 0.528 degrees, and it outperformed a conventional IV-based and DNN-based DOA estimation method.
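The final step, estimating DOA from an intensity vector, can be sketched as follows. This is a minimal illustration that assumes First Order Ambisonics time-frequency bins (W, X, Y, Z) with an encoding in which the active intensity vector Re(conj(W)·[X, Y, Z]) points toward the source; sign and normalization conventions vary between formats, the DNN refinement stage is omitted, and all names and values are hypothetical.

```python
import math


def intensity_vector(w, x, y, z):
    """Sum Re(conj(W) * [X, Y, Z]) over the given time-frequency bins."""
    ix = sum((wc.conjugate() * c).real for wc, c in zip(w, x))
    iy = sum((wc.conjugate() * c).real for wc, c in zip(w, y))
    iz = sum((wc.conjugate() * c).real for wc, c in zip(w, z))
    return ix, iy, iz


def doa_from_iv(iv):
    """Convert an intensity vector to (azimuth, elevation) in radians."""
    ix, iy, iz = iv
    azimuth = math.atan2(iy, ix)
    elevation = math.atan2(iz, math.hypot(ix, iy))
    return azimuth, elevation


# Hypothetical noiseless plane wave from azimuth 40 deg, elevation -20 deg.
az, el = math.radians(40.0), math.radians(-20.0)
bins = [1 + 2j, -0.5 + 0.3j, 2 - 1j]
w = bins
x = [s * math.cos(az) * math.cos(el) for s in bins]
y = [s * math.sin(az) * math.cos(el) for s in bins]
z = [s * math.sin(el) for s in bins]
est_az, est_el = doa_from_iv(intensity_vector(w, x, y, z))
```

With noise or reverberation added to the bins, the summed vector is biased away from the true direction, which is the degradation the abstract attributes to environmental effects.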
AS Oct 10, 2019
First Order Ambisonics Domain Spatial Augmentation for DNN-based Direction of Arrival Estimation
Luca Mazzon, Yuma Koizumi, Masahiro Yasuda et al.
In this paper, we propose a novel data augmentation method for training neural networks for Direction of Arrival (DOA) estimation. This method focuses on expanding the representation of the DOA subspace of a dataset. Given some input data, it applies a transformation that changes the DOA information and simulates new, potentially unseen DOA information. Such a transformation is, in general, a combination of a rotation and a reflection. It can be applied thanks to a well-known property of First Order Ambisonics (FOA). The same transformation is also applied to the labels in order to maintain consistency between input data and target labels. Three methods with different levels of generality are proposed for applying this augmentation principle. Experiments are conducted on two different DOA networks. The results of both experiments demonstrate the effectiveness of the novel augmentation strategy, which improves the DOA error by around 40%.
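One simple instance of this principle is a rotation about the vertical axis combined with a left-right reflection. The sketch below assumes an FOA encoding with X ∝ cos(az)cos(el), Y ∝ sin(az)cos(el), Z ∝ sin(el); the paper's three methods are not detailed in the abstract, so the function names and the choice of transformation are assumptions for illustration.

```python
import math


def encode_foa(sample, azimuth, elevation):
    """Encode one mono sample as (W, X, Y, Z) under the assumed convention."""
    return (sample,
            sample * math.cos(azimuth) * math.cos(elevation),
            sample * math.sin(azimuth) * math.cos(elevation),
            sample * math.sin(elevation))


def reflect_y(foa):
    """Mirror the sound field left-right by negating the Y channel."""
    w, x, y, z = foa
    return (w, x, -y, z)


def rotate_z(foa, phi):
    """Rotate the sound field by phi radians about the vertical axis."""
    w, x, y, z = foa
    return (w,
            math.cos(phi) * x - math.sin(phi) * y,
            math.sin(phi) * x + math.cos(phi) * y,
            z)


def transform_label(azimuth, phi, reflect=False):
    """Apply the same reflection-then-rotation to the azimuth label."""
    if reflect:
        azimuth = -azimuth
    # Wrap the result to [-pi, pi).
    return (azimuth + phi + math.pi) % (2 * math.pi) - math.pi


# Hypothetical source at azimuth 30 deg, elevation 10 deg, rotated by 45 deg.
az, el, phi = math.radians(30.0), math.radians(10.0), math.radians(45.0)
augmented = rotate_z(reflect_y(encode_foa(1.0, az, el)), phi)
new_az = transform_label(az, phi, reflect=True)
```

Because the label is transformed with the same operations as the channels, the augmented signal matches what a direct encoding at the new azimuth would produce, while W and the elevation are unchanged.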