Keisuke Imoto

h-index: 15

31 papers

911 citations

Novelty: 34%

AI Score: 44

Ranked #48,137 of 194,257 authors (top 25%); #335 in SD (top 19%)

31 Papers

18.4 SD Jun 13, 2022

Description and Discussion on DCASE 2022 Challenge Task 2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring Applying Domain Generalization Techniques

Kota Dohi, Keisuke Imoto, Noboru Harada et al.

We present the task description and discussion on the results of the DCASE 2022 Challenge Task 2: "Unsupervised anomalous sound detection (ASD) for machine condition monitoring applying domain generalization techniques". Domain shifts are a critical problem for the application of ASD systems. Because domain shifts can change the acoustic characteristics of data, a model trained in a source domain performs poorly for a target domain. In DCASE 2021 Challenge Task 2, we organized an ASD task for handling domain shifts. In this task, it was assumed that the occurrences of domain shifts are known. However, in practice, the domain of each sample may not be given, and the domain shifts can occur implicitly. In 2022 Task 2, we focus on domain generalization techniques that detect anomalies regardless of the domain shifts. Specifically, the domain of each sample is not given in the test data and only one threshold is allowed for all domains. Analysis of 81 submissions from 31 teams revealed two remarkable types of domain generalization techniques: 1) a domain-mixing-based approach that obtains generalized representations and 2) a domain-classification-based approach that explicitly or implicitly classifies different domains to improve detection performance for each domain.
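To illustrate why the single-threshold constraint makes domain shift hard, here is a minimal sketch with made-up anomaly scores. The `detect` and `accuracy` helpers, the scores, and the 0.5 threshold are all assumptions for illustration and are not part of the challenge baseline.

```python
def detect(scores, threshold):
    """Flag a clip as anomalous when its score exceeds the shared threshold."""
    return [s > threshold for s in scores]


def accuracy(predictions, labels):
    """Fraction of clips whose prediction matches the label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical anomaly scores; True marks an anomalous clip.
source = {"scores": [0.1, 0.2, 0.7, 0.8], "labels": [False, False, True, True]}
# Assumed domain shift: every score in the target domain is raised by 0.5.
target = {"scores": [0.6, 0.7, 1.2, 1.3], "labels": [False, False, True, True]}

threshold = 0.5  # assumed to be tuned on the source domain only
source_acc = accuracy(detect(source["scores"], threshold), source["labels"])
target_acc = accuracy(detect(target["scores"], threshold), target["labels"])
print(source_acc, target_acc)
```

In this toy setting the threshold separates the source domain cleanly, while every normal target clip is flagged as anomalous. This is the kind of failure that domain generalization techniques aim to avoid.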

16.7 SD Jul 15

Genre Bias or Aesthetic Perception? Identifying and Mitigating Shortcut Learning in Music Evaluation

Yizhou Zhang, Wangjin Zhou, Yi Zhao et al.

Music aesthetics scoring plays a critical role in applications such as dataset curation, generative model evaluation, and reward modeling for music generation. Recent approaches rely on deep neural networks trained on human-annotated ratings, but these models may exploit spurious correlations rather than capturing perceptually meaningful aesthetics. In this work, we identify a previously underexplored failure mode in music evaluation models: genre-induced shortcut learning. Through a systematic analysis of SongEval, we show that biases in training data lead to strong correlations between genre-related features and predicted scores, causing the model to use them as a proxy for aesthetics. This results in systematic overestimation of pop music and undervaluation of high-quality samples from other genres, leading to predictions that are inconsistent with human preferences. To address this issue, we propose a training objective that jointly reweights hard samples and regularizes group-level performance, encouraging the model to learn genre-invariant representations of musicality. Experimental results demonstrate that our method reduces genre-dependent bias and improves alignment with human preferences, as reflected by gains in both cross-genre and within-genre preference alignment.

6.1 AS Jun 1

Description and Discussion on DCASE 2026 Challenge Task 2: Noise-aware Unsupervised Anomalous Sound Detection for Machine Condition Monitoring

Tomoya Nishida, Noboru Harada, Daiki Takeuchi et al.

This paper presents an overview of DCASE 2026 Challenge Task 2, titled "Noise-aware unsupervised anomalous sound detection (UASD) for machine condition monitoring." The task aims to advance noise-robust anomalous sound detection for machine condition monitoring under the unsupervised setting, where only normal machine sounds are available for training. Reliable detection under noisy conditions is crucial for practical deployment, but previous DCASE Task 2 settings provided limited information about environmental noise, potentially limiting UASD performance in highly noisy situations. To address this limitation, DCASE 2026 allows participants to exploit two-channel audio samples simultaneously captured at locations near and far from the target machine. Since the distant microphone is expected to contain relatively stronger environmental noise and weaker direct machine sounds, it may help distinguish environmental noise components from the target machine sounds. After the challenge submission deadline, challenge results and an analysis of the submitted systems will be added.

4.9 AS May 13

How Much Does Machine Identity Matter in Anomalous Sound Detection at Test Time?

Kevin Wilkinghoff, Keisuke Imoto, Zheng-Hua Tan

Anomalous sound detection (ASD) benchmarks typically assume that the identity of the monitored machine is known at test time and that recordings are evaluated in a machine-wise manner. However, in realistic monitoring scenarios with multiple known machines operating concurrently, test recordings may not be reliably attributable to a specific machine, and requiring machine identity imposes deployment constraints such as dedicated sensors per machine. To reveal performance degradations and method-specific differences in robustness that are hidden under standard machine-wise evaluation, we consider a minimal modification of the ASD evaluation protocol in which test recordings from multiple machines are merged and evaluated jointly without access to machine identity at inference time. Training data and evaluation metrics remain unchanged, and machine identity labels are used only for post hoc evaluation. Experiments with representative ASD methods show that relaxing this assumption reveals performance degradations and method-specific differences in robustness that are hidden under standard machine-wise evaluation, and that these degradations are strongly related to implicit machine identification accuracy.

20.8 AS Aug 9, 2019 Code

ToyADMOS: A Dataset of Miniature-Machine Operating Sounds for Anomalous Sound Detection

Yuma Koizumi, Shoichiro Saito, Hisashi Uematsu et al.

This paper introduces a new dataset called "ToyADMOS" designed for anomaly detection in machine operating sounds (ADMOS). To the best of our knowledge, no large-scale datasets are available for ADMOS, although large-scale datasets have contributed to recent advancements in acoustic signal processing. This is because anomalous sound data are difficult to collect. To build a large-scale dataset for ADMOS, we collected anomalous operating sounds of miniature machines (toys) by deliberately damaging them. The released dataset consists of three sub-datasets for machine-condition inspection, fault diagnosis of machines with geometrically fixed tasks, and fault diagnosis of machines with moving tasks. Each sub-dataset includes over 180 hours of normal machine-operating sounds and over 4,000 samples of anomalous sounds collected with four microphones at a 48-kHz sampling rate. The dataset is freely available for download at https://github.com/YumaKoizumi/ToyADMOS-dataset

9.7 SD Oct 23, 2024 Code

Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation

Junwon Lee, Modan Tailleur, Laurie M. Heller et al.

Despite significant advancements in neural text-to-audio generation, challenges persist in controllability and evaluation. This paper addresses these issues through the Sound Scene Synthesis challenge held as part of the Detection and Classification of Acoustic Scenes and Events 2024. We present an evaluation protocol combining an objective metric, namely Fréchet Audio Distance, with perceptual assessments, utilizing a structured prompt format to enable diverse captions and effective evaluation. Our analysis reveals varying performance across sound categories and model architectures, with larger models generally excelling but innovative lightweight approaches also showing promise. The strong correlation between objective metrics and human ratings validates our evaluation approach. We discuss outcomes in terms of audio quality, controllability, and architectural considerations for text-to-audio synthesizers, providing direction for future research.
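For reference, the Fréchet distance behind this metric is usually computed between two Gaussians fitted to embeddings of reference audio and generated audio. The notation below is introduced here rather than taken from the paper: $\mu_r, \Sigma_r$ are the mean and covariance of the reference embeddings, and $\mu_g, \Sigma_g$ those of the generated embeddings. Which embedding model the challenge used is not stated in the abstract.

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower values mean the generated embeddings are statistically closer to the reference set. The distance is zero when both means and covariances coincide.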

4.9 SD Oct 30, 2024

DOA-Aware Audio-Visual Self-Supervised Learning for Sound Event Localization and Detection

Yoto Fujita, Yoshiaki Bando, Keisuke Imoto et al.

This paper describes sound event localization and detection (SELD) for spatial audio recordings captured by first-order ambisonics (FOA) microphones. In this task, one may train a deep neural network (DNN) using FOA data annotated with the classes and directions of arrival (DOAs) of sound events. However, the performance of this approach is severely bounded by the amount of annotated data. To overcome this limitation, we propose a novel method of pretraining the feature extraction part of the DNN in a self-supervised manner. We use spatial audio-visual recordings abundantly available as virtual reality contents. Assuming that sound objects are concurrently observed by the FOA microphones and the omni-directional camera, we jointly train audio and visual encoders with contrastive learning such that the audio and visual embeddings of the same recording and DOA are made close. A key feature of our method is that the DOA-wise audio embeddings are jointly extracted from the raw audio data, while the DOA-wise visual embeddings are separately extracted from the local visual crops centered on the corresponding DOA. This encourages the latent features of the audio encoder to represent both the classes and DOAs of sound events. The experiment using the 20-hour DCASE2022 Task 3 dataset shows that 100 hours of non-annotated audio-visual recordings reduced the SELD error score from 36.4 pts to 34.9 pts.

4.0 SD Apr 6, 2025

Formula-Supervised Sound Event Detection: Pre-Training Without Real Data

Yuto Shibata, Keitaro Tanaka, Yoshiaki Bando et al.

In this paper, we propose a novel formula-driven supervised learning (FDSL) framework for pre-training an environmental sound analysis model by leveraging acoustic signals parametrically synthesized through formula-driven methods. Specifically, we outline detailed procedures and evaluate their effectiveness for sound event detection (SED). The SED task, which involves estimating the types and timings of sound events, is particularly challenged by the difficulty of acquiring a sufficient quantity of accurately labeled training data. Moreover, it is well known that manually annotated labels often contain noise and are significantly influenced by the subjective judgment of annotators. To address these challenges, we propose a novel pre-training method that utilizes a synthetic dataset, Formula-SED, where acoustic data are generated solely based on mathematical formulas. The proposed method enables large-scale pre-training by using the synthesis parameters applied at each time step as ground truth labels, thereby eliminating label noise and bias. We demonstrate that large-scale pre-training with Formula-SED significantly enhances model accuracy and accelerates training, as evidenced by our results on the DESED dataset used for DCASE2023 Challenge Task 4. The project page is at https://yutoshibata07.github.io/Formula-SED/

5.8 AI Jan 15, 2025

Sound Scene Synthesis at the DCASE 2024 Challenge

Mathieu Lagrange, Junwon Lee, Modan Tailleur et al.

This paper presents Task 7 at the DCASE 2024 Challenge: sound scene synthesis. Recent advances in sound synthesis and generative models have enabled the creation of realistic and diverse audio content. We introduce a standardized evaluation framework for comparing different sound scene synthesis systems, incorporating both objective and subjective metrics. The challenge attracted four submissions, which are evaluated using the Fréchet Audio Distance (FAD) and human perceptual ratings. Our analysis reveals significant insights into the current capabilities and limitations of sound scene synthesis systems, while also highlighting areas for future improvement in this rapidly evolving field.

10.3 AS Jun 11, 2024

Description and Discussion on DCASE 2024 Challenge Task 2: First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring

Tomoya Nishida, Noboru Harada, Daisuke Niizumi et al.

We present the task description of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge Task 2: First-shot unsupervised anomalous sound detection (ASD) for machine condition monitoring. Continuing from last year's DCASE 2023 Challenge Task 2, we organize the task as a first-shot problem under settings that require domain generalization. The main goal of the first-shot problem is to enable rapid deployment of ASD systems for new kinds of machines without the need for machine-specific hyperparameter tuning. This problem setting was realized by (1) giving only one section for each machine type and (2) having completely different machine types for the development and evaluation datasets. For the DCASE 2024 Challenge Task 2, data of completely new machine types were newly collected and provided as the evaluation dataset. In addition, attribute information such as the machine operation conditions was concealed for several machine types to mimic situations where such information is unavailable. We will add challenge results and analysis of the submissions after the challenge submission deadline.

18.1 SD May 13, 2023

Description and Discussion on DCASE 2023 Challenge Task 2: First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring

Kota Dohi, Keisuke Imoto, Noboru Harada et al.

We present the task description of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge Task 2: "First-shot unsupervised anomalous sound detection (ASD) for machine condition monitoring". The main goal is to enable rapid deployment of ASD systems for new kinds of machines without the need for hyperparameter tuning. In past ASD tasks, developed methods tuned hyperparameters for each machine type, as the development and evaluation datasets had the same machine types. However, collecting normal and anomalous data as the development dataset can be infeasible in practice. In 2023 Task 2, we focus on solving the first-shot problem, which is the challenge of training a model on a completely novel machine type. Specifically, (i) each machine type has only one section (a subset of a machine type) and (ii) machine types in the development and evaluation datasets are completely different. Analysis of 86 submissions from 23 teams revealed that the keys to outperforming the baselines were: 1) sampling techniques for dealing with class imbalances across different domains and attributes, 2) generation of synthetic samples for robust detection, and 3) use of multiple large pre-trained models to extract meaningful embeddings for the anomaly detector.

7.3 SD Dec 1, 2021

Environmental Sound Extraction Using Onomatopoeic Words

Yuki Okamoto, Shota Horiguchi, Masaaki Yamamoto et al.

An onomatopoeic word, which is a character sequence that phonetically imitates a sound, is effective in expressing characteristics of sound such as duration, pitch, and timbre. We propose an environmental-sound-extraction method using onomatopoeic words to specify the target sound to be extracted. By this method, we estimate a time-frequency mask from an input mixture spectrogram and an onomatopoeic word using a U-Net architecture, then extract the corresponding target sound by masking the spectrogram. Experimental results indicate that the proposed method can extract only the target sound corresponding to the onomatopoeic word and performs better than conventional methods that use sound-event classes to specify the target sound.
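The final extraction step described above, masking the mixture spectrogram, can be sketched as an element-wise product. In this minimal illustration the mask is simply given as input. In the paper it would come from the U-Net conditioned on the onomatopoeic word, which is not reproduced here. The `apply_mask` helper and all numbers are hypothetical.

```python
def apply_mask(mixture, mask):
    """Element-wise product of a time-frequency mask and a magnitude spectrogram.

    Both arguments are lists of rows (frequency bins) over columns (time frames).
    """
    return [
        [m * x for m, x in zip(mask_row, mix_row)]
        for mask_row, mix_row in zip(mask, mixture)
    ]


# Hypothetical 2x2 magnitude spectrogram of the mixture.
mixture = [[1.0, 2.0], [3.0, 4.0]]
# Hypothetical mask in [0, 1]: 1 keeps a bin, 0 suppresses it.
mask = [[1.0, 0.0], [0.5, 1.0]]

extracted = apply_mask(mixture, mask)
print(extracted)
```

A waveform would then be recovered from the masked spectrogram by an inverse transform, typically reusing the mixture phase. That step is omitted from this sketch.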

7.3 SD Oct 7, 2021

Sound Event Detection Guided by Semantic Contexts of Scenes

Noriyuki Tonami, Keisuke Imoto, Ryotaro Nagase et al.

Some studies have revealed that contexts of scenes (e.g., "home," "office," and "cooking") are advantageous for sound event detection (SED). Mobile devices and sensing technologies give useful information on scenes for SED without the use of acoustic signals. However, conventional methods can employ pre-defined contexts in inference stages but not undefined contexts. This is because one-hot representations of pre-defined scenes are exploited as prior contexts for such conventional methods. To alleviate this problem, we propose scene-informed SED where pre-defined scene-agnostic contexts are available for more accurate SED. In the proposed method, pre-trained large-scale language models are utilized, which enables SED models to employ unseen semantic contexts of scenes in inference stages. Moreover, we investigated the extent to which the semantic representation of scene contexts is useful for SED. Experimental results performed with TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016/2017 datasets show that the proposed method improves micro and macro F-scores by 4.34 and 3.13 percentage points compared with conventional Conformer- and CNN-BiGRU-based SED, respectively.

14.2 AS Jun 8, 2021

Description and Discussion on DCASE 2021 Challenge Task 2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring under Domain Shifted Conditions

Yohei Kawaguchi, Keisuke Imoto, Yuma Koizumi et al.

We present the task description and discussion on the results of the DCASE 2021 Challenge Task 2. In 2020, we organized an unsupervised anomalous sound detection (ASD) task, identifying whether a given sound was normal or anomalous without anomalous training data. In 2021, we organized an advanced unsupervised ASD task under domain-shift conditions, which focuses on the inevitable problem of the practical use of ASD systems. The main challenge of this task is to detect unknown anomalous sounds where the acoustic characteristics of the training and testing samples are different, i.e., domain-shifted. This problem frequently occurs due to changes in seasons, manufactured products, and/or environmental noise. We received 75 submissions from 26 teams, and several novel approaches have been developed in this challenge. On the basis of the analysis of the evaluation results, we found that there are two types of remarkable approaches that the top-5 winning teams adopted: 1) ensemble approaches of "outlier exposure" (OE)-based detectors and "inlier modeling" (IM)-based detectors and 2) approaches based on IM-based detection for features learned in a machine-identification task.

2.3 SD May 5, 2021

Acoustic Scene Classification Using Multichannel Observation with Partially Missing Channels

Keisuke Imoto

Sounds recorded with smartphones or IoT devices often have partially unreliable observations caused by clipping, wind noise, and completely missing parts due to microphone failure and packet loss in data transmission over the network. In this paper, we investigate the impact of the partially missing channels on the performance of acoustic scene classification using multichannel audio recordings, especially for a distributed microphone array. Missing observations cause not only losses of time-frequency and spatial information on sound sources but also a mismatch between a trained model and evaluation data. We thus investigate how a missing channel affects the performance of acoustic scene classification in detail. We also propose simple data augmentation methods for scene classification using multichannel observations with partially missing channels and evaluate the scene classification performance using the data augmentation methods.
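The abstract does not say which augmentation methods were used. As one plausible and purely assumed example, the sketch below zeroes out whole channels at random during training so that the model sees the same kind of missing-channel input it may meet at test time. The `drop_channels` helper and its parameters are hypothetical.

```python
import random


def drop_channels(recording, drop_prob, rng):
    """Zero out each channel independently with probability drop_prob.

    `recording` is a list of channels, each a list of samples. At least one
    channel is always kept so the example never becomes pure silence.
    """
    n = len(recording)
    dropped = [rng.random() < drop_prob for _ in range(n)]
    if all(dropped):
        dropped[rng.randrange(n)] = False  # restore one channel at random
    return [
        [0.0] * len(channel) if is_dropped else list(channel)
        for channel, is_dropped in zip(recording, dropped)
    ]


# Hypothetical 3-channel recording with two samples per channel.
recording = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
augmented = drop_channels(recording, drop_prob=0.3, rng=random.Random(0))
print(augmented)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which helps when comparing runs with and without missing channels.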

5.9 SD Feb 11, 2021

Onoma-to-wave: Environmental sound synthesis from onomatopoeic words

Yuki Okamoto, Keisuke Imoto, Shinnosuke Takamichi et al.

In this paper, we propose a framework for environmental sound synthesis from onomatopoeic words. As one way of expressing an environmental sound, we can use an onomatopoeic word, which is a character sequence for phonetically imitating a sound. An onomatopoeic word is effective for describing diverse sound features. Therefore, using onomatopoeic words for environmental sound synthesis will enable us to generate diverse environmental sounds. To generate diverse sounds, we propose a method based on a sequence-to-sequence framework for synthesizing environmental sounds from onomatopoeic words. We also propose a method of environmental sound synthesis using onomatopoeic words and sound event labels. The use of sound event labels in addition to onomatopoeic words enables us to capture each sound event's feature depending on the input sound event label. Our subjective experiments show that our proposed methods achieve higher diversity and naturalness than conventional methods using sound event labels.

5.9 SD Feb 10, 2021

Sound Event Detection Based on Curriculum Learning Considering Learning Difficulty of Events

Noriyuki Tonami, Keisuke Imoto, Yuki Okamoto et al.

In conventional sound event detection (SED) models, two types of events, namely, those that are present and those that do not occur in an acoustic scene, are regarded as the same type of events. The conventional SED methods cannot effectively exploit the difference between the two types of events. All time frames of sound events that do not occur in an acoustic scene are easily regarded as inactive in the scene, that is, the events are easy-to-train. The time frames of the events that are present in a scene must be classified as active in addition to inactive in the acoustic scene, that is, the events are difficult-to-train. To take advantage of the training difficulty, we apply curriculum learning to SED, where models are trained from easy- to difficult-to-train events. To utilize the curriculum learning, we propose a new objective function for SED, wherein the events are trained from easy- to difficult-to-train events. Experimental results show that the F-score of the proposed method is improved by 10.09 percentage points compared with that of the conventional binary cross entropy-based SED.

15.5 SD Feb 3, 2021

Impact of Sound Duration and Inactive Frames on Sound Event Detection Performance

Keisuke Imoto, Sakiko Mishima, Yumi Arai et al.

In many methods of sound event detection (SED), a segmented time frame is regarded as one data sample for model training. The durations of sound events greatly depend on the sound event class, e.g., the sound event "fan" has a long duration, whereas the sound event "mouse clicking" is instantaneous. Thus, the difference in the duration between sound event classes results in a serious data imbalance in SED. Moreover, most sound events tend to occur occasionally; therefore, there are many more inactive time frames of sound events than active frames. This also causes a severe data imbalance between active and inactive frames. In this paper, we investigate the impact of sound duration and inactive frames on SED performance by introducing four loss functions: simple reweighting loss, inverse frequency loss, asymmetric focal loss, and focal batch Tversky loss. Then, we provide insights into how we tackle this imbalance problem.
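To make the active/inactive imbalance concrete, here is a minimal sketch of an inverse-frequency-weighted binary cross-entropy for a single event class. The weighting rule (total frames divided by twice the class count) and both helper names are assumptions chosen for illustration. The paper's exact loss definitions are not given in the abstract.

```python
import math


def inverse_frequency_weights(labels):
    """Return (w_active, w_inactive), each inversely proportional to its frame count.

    Assumes the clip contains at least one active and one inactive frame.
    """
    total = len(labels)
    n_active = sum(labels)
    n_inactive = total - n_active
    return total / (2 * n_active), total / (2 * n_inactive)


def weighted_bce(probs, labels, w_active, w_inactive, eps=1e-7):
    """Frame-averaged binary cross-entropy with separate active/inactive weights."""
    loss = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if y:
            loss -= w_active * math.log(p)
        else:
            loss -= w_inactive * math.log(1.0 - p)
    return loss / len(labels)


# Hypothetical clip: the event is active in 1 of 10 frames.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
probs = [0.5] * 10  # an uninformative model

w_active, w_inactive = inverse_frequency_weights(labels)
plain = weighted_bce(probs, labels, 1.0, 1.0)
balanced = weighted_bce(probs, labels, w_active, w_inactive)
print(w_active, w_inactive, plain, balanced)
```

With these weights the single active frame contributes as much to the loss as the nine inactive frames combined, so the model is no longer rewarded for predicting "inactive" everywhere.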

6.2 SD Oct 16, 2020

Joint Analysis of Sound Events and Acoustic Scenes Using Multitask Learning

Noriyuki Tonami, Keisuke Imoto, Ryosuke Yamanishi et al.

Sound event detection (SED) and acoustic scene classification (ASC) are important research topics in environmental sound analysis. Many research groups have addressed SED and ASC using neural-network-based methods, such as the convolutional neural network (CNN), recurrent neural network (RNN), and convolutional recurrent neural network (CRNN). The conventional methods address SED and ASC separately even though sound events and acoustic scenes are closely related to each other. For example, in the acoustic scene "office," the sound events "mouse clicking" and "keyboard typing" are likely to occur. Therefore, it is expected that information on sound events and acoustic scenes will be of mutual aid for SED and ASC. In this paper, we propose multitask learning for joint analysis of sound events and acoustic scenes, in which the parts of the networks holding information on sound events and acoustic scenes in common are shared. Experimental results obtained using the TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets indicate that the proposed method improves the performance of SED and ASC by 1.31 and 1.80 percentage points in terms of the F-score, respectively, compared with the conventional CRNN-based method.

3.5 SD Jul 9, 2020

RWCP-SSD-Onomatopoeia: Onomatopoeic Word Dataset for Environmental Sound Synthesis

Yuki Okamoto, Keisuke Imoto, Shinnosuke Takamichi et al.

Environmental sound synthesis is a technique for generating a natural environmental sound. Conventional work on environmental sound synthesis using sound event labels cannot finely control synthesized sounds, for example, the pitch and timbre. We consider that onomatopoeic words can be used for environmental sound synthesis. Onomatopoeic words are effective for explaining the features of sounds. We believe that using onomatopoeic words will enable us to control the fine time-frequency structure of synthesized sounds. However, there is no dataset available for environmental sound synthesis using onomatopoeic words. In this paper, we thus present RWCP-SSD-Onomatopoeia, a dataset consisting of 155,568 onomatopoeic words paired with audio samples for environmental sound synthesis. We also collected self-reported confidence scores and others-reported acceptance scores of onomatopoeic words, to help us investigate the difficulty in the transcription and selection of a suitable word for environmental sound synthesis.

1.9 SD Jun 27, 2020

Sound Event Detection Using Duration Robust Loss Function

Daichi Akiyama, Keisuke Imoto, Noriyuki Tonami et al.

Many methods of sound event detection (SED) based on machine learning regard a segmented time frame as one data sample for model training. However, the durations of sound events vary greatly depending on the sound event class, e.g., the sound event "fan" has a long duration, while the sound event "mouse clicking" is instantaneous. The difference in duration between sound event classes thus causes a serious data imbalance problem in SED. In this paper, we propose a method for SED using a duration robust loss function, which can focus model training on sound events of short duration. In the proposed method, we focus on the relationship between the duration of the sound event and the ease/difficulty of model training. In particular, many sound events of long duration (e.g., sound event "fan") are stationary sounds, which have less variation in their acoustic features and are easy to train on. Meanwhile, some sound events of short duration (e.g., sound event "object impact") have more than one audio pattern, such as attack, decay, and release parts. We thus apply a class-wise reweighting to the binary cross-entropy loss function depending on the ease/difficulty of model training. Evaluation experiments conducted using TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets show that the proposed method respectively improves the detection performance of sound events by 3.15 and 4.37 percentage points in macro- and micro-F-scores compared with a conventional method using the binary cross-entropy loss function.

21.0 AS Jun 10, 2020

Description and Discussion on DCASE2020 Challenge Task2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring

Yuma Koizumi, Yohei Kawaguchi, Keisuke Imoto et al.

In this paper, we present the task description and discuss the results of the DCASE 2020 Challenge Task 2: Unsupervised Detection of Anomalous Sounds for Machine Condition Monitoring. The goal of anomalous sound detection (ASD) is to identify whether the sound emitted from a target machine is normal or anomalous. The main challenge of this task is to detect unknown anomalous sounds under the condition that only normal sound samples have been provided as training data. We have designed this challenge as the first benchmark of ASD research, which includes a large-scale dataset, evaluation metrics, and a simple baseline system. We received 117 submissions from 40 teams, and several novel approaches have been developed as a result of this challenge. On the basis of the analysis of the evaluation results, we discuss two new approaches and their problems.

3.3 AS Apr 25, 2020

Sound Event Detection Utilizing Graph Laplacian Regularization with Event Co-occurrence

Keisuke Imoto, Seisuke Kyochi

Only a limited number of sound event types occur in an acoustic scene, and some sound events tend to co-occur in the scene; for example, the sound events "dishes" and "glass jingling" are likely to co-occur in the acoustic scene "cooking". In this paper, we propose a method of sound event detection using graph Laplacian regularization with sound event co-occurrence taken into account. In the proposed method, the occurrences of sound events are expressed as a graph whose nodes indicate the frequencies of event occurrence and whose edges indicate the sound event co-occurrences. This graph representation is then utilized for the model training of sound event detection, which is optimized under an objective function with a regularization term considering the graph structure of sound event occurrence and co-occurrence. Evaluation experiments using the TUT Sound Events 2016 and 2017 datasets, and the TUT Acoustic Scenes 2016 dataset show that the proposed method improves the performance of sound event detection by 7.9 percentage points compared with the conventional CNN-BiGRU-based detection method in terms of the segment-based F1 score. In particular, the experimental results indicate that the proposed method enables the detection of co-occurring sound events more accurately than the conventional method.
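A common form of graph Laplacian regularization penalizes the quadratic form y^T L y, where L = D - A is built from an adjacency matrix A and y holds per-class activations. The sketch below uses that generic form with a hypothetical three-class co-occurrence graph. The abstract does not specify the paper's exact graph construction or regularization term, so treat every name and number here as an assumption.

```python
def laplacian(adjacency):
    """Combinatorial graph Laplacian L = D - A for a symmetric adjacency matrix."""
    n = len(adjacency)
    return [
        [(sum(adjacency[i]) if i == j else 0.0) - adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]


def laplacian_penalty(lap, y):
    """Quadratic form y^T L y; small when connected nodes have similar values."""
    n = len(y)
    return sum(y[i] * lap[i][j] * y[j] for i in range(n) for j in range(n))


# Hypothetical classes: 0 = "dishes", 1 = "glass jingling", 2 = "car".
# Only the first two co-occur, so only they share an edge.
adjacency = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
lap = laplacian(adjacency)

inconsistent = laplacian_penalty(lap, [0.9, 0.1, 0.5])  # co-occurring events disagree
consistent = laplacian_penalty(lap, [0.9, 0.9, 0.1])    # co-occurring events agree
print(inconsistent, consistent)
```

For a symmetric adjacency matrix the penalty equals the sum over edges of the edge weight times the squared difference of the connected activations. Adding it to the training loss nudges predictions for co-occurring events toward each other.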

11.7 AS Feb 14, 2020

Sound Event Localization based on Sound Intensity Vector Refined By DNN-Based Denoising and Source Separation

Masahiro Yasuda, Yuma Koizumi, Shoichiro Saito et al.

We propose a direction-of-arrival (DOA) estimation method for Sound Event Localization and Detection (SELD). Direct estimation of DOA using a deep neural network (DNN), i.e., a completely data-driven approach, achieves high accuracy. However, there is a gap in accuracy between DOA estimation for single and overlapping sources because such approaches cannot incorporate physical knowledge. Meanwhile, although the accuracy of physics-based approaches is inferior to that of DNN-based approaches, they are robust for overlapping sources. In this study, we consider a combination of physics-based and DNN-based approaches; the sound intensity vectors (IVs) for physics-based DOA estimation are refined based on DNN-based denoising and source separation. This method enables accurate DOA estimation for both single and overlapping sources using a spherical microphone array. Experimental results show that the proposed method achieves state-of-the-art DOA estimation accuracy on an open SELD dataset.
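The physics-based part can be sketched for a single time-frequency bin. Assuming first-order ambisonics channels W, X, Y, Z and an idealized, noise-free plane wave, the intensity vector Re{conj(W) [X, Y, Z]} points toward the source. Channel normalization and sign conventions vary between formats, and the DNN-based refinement from the paper is not modeled. The function names and the test signal are hypothetical.

```python
import math


def intensity_vector(w, x, y, z):
    """Intensity vector for one time-frequency bin: Re{conj(W) * [X, Y, Z]}."""
    wc = w.conjugate()
    return ((wc * x).real, (wc * y).real, (wc * z).real)


def doa_from_intensity(ix, iy, iz):
    """Convert an intensity vector to (azimuth, elevation) in radians."""
    azimuth = math.atan2(iy, ix)
    elevation = math.atan2(iz, math.hypot(ix, iy))
    return azimuth, elevation


# Hypothetical plane wave arriving from azimuth 60 degrees, elevation 30 degrees.
true_az, true_el = math.radians(60.0), math.radians(30.0)
s = complex(0.5, 0.2)  # source spectrum value in this bin (arbitrary)

# Idealized first-order encoding with unit gains (an assumed convention).
w = s
x = s * math.cos(true_az) * math.cos(true_el)
y = s * math.sin(true_az) * math.cos(true_el)
z = s * math.sin(true_el)

est_az, est_el = doa_from_intensity(*intensity_vector(w, x, y, z))
print(math.degrees(est_az), math.degrees(est_el))
```

With noise or overlapping sources the per-bin vectors scatter, which is where denoising and source separation before this step would help.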

16.4 SD Feb 14, 2020

Sound Event Detection by Multitask Learning of Sound Events and Scenes with Soft Scene Labels

Keisuke Imoto, Noriyuki Tonami, Yuma Koizumi et al.

Sound event detection (SED) and acoustic scene classification (ASC) are major tasks in environmental sound analysis. Considering that sound events and scenes are closely related to each other, some works have addressed joint analyses of sound events and acoustic scenes based on multitask learning (MTL), in which the knowledge of sound events and scenes can help in estimating them mutually. The conventional MTL-based methods utilize one-hot scene labels to train the relationship between sound events and scenes; thus, the conventional methods cannot model the extent to which sound events and scenes are related. However, in the real environment, common sound events may occur in some acoustic scenes; on the other hand, some sound events occur only in a limited acoustic scene. In this paper, we thus propose a new method for SED based on MTL of SED and ASC using the soft labels of acoustic scenes, which enable us to model the extent to which sound events and scenes are related. Experiments conducted using TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets show that the proposed method improves the SED performance by 3.80% in F-score compared with conventional MTL-based SED.

3.3ASJan 30, 2020

Graph Cepstrum: Spatial Feature Extracted from Partially Connected Microphones

Keisuke Imoto

In this paper, we propose an effective and robust method of spatial feature extraction for acoustic scene analysis utilizing partially synchronized and/or closely located distributed microphones. In the proposed method, a new cepstrum feature utilizing a graph-based basis transformation to extract spatial information from distributed microphones, while taking into account whether any pairs of microphones are synchronized and/or closely located, is introduced. Specifically, in the proposed graph-based cepstrum, the log-amplitude of a multichannel observation is converted to a feature vector utilizing the inverse graph Fourier transform, which is a method of basis transformation of a signal on a graph. Results of experiments using real environmental sounds show that the proposed graph-based cepstrum robustly extracts spatial information with consideration of the microphone connections. Moreover, the results indicate that the proposed method more robustly classifies acoustic scenes than conventional spatial features when the observed sounds have a large synchronization mismatch between partially synchronized microphone groups.
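
The transform described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's exact formulation: it uses the unnormalized graph Laplacian L = D − A of a hypothetical microphone-connection matrix, and the common convention in which the graph Fourier transform is Uᵀv and its inverse is Uv, where the columns of U are the eigenvectors of L. The function name `graph_cepstrum` and the input layout are illustrative.

```python
import numpy as np

def graph_cepstrum(amplitudes, adjacency):
    """Graph-based cepstrum sketch.

    amplitudes: positive array of shape (n_mics,) or (n_mics, n_frames)
    adjacency:  symmetric (n_mics, n_mics) matrix; nonzero where two
                microphones are treated as connected (assumed input)
    """
    adjacency = np.asarray(adjacency, dtype=float)
    # Unnormalized graph Laplacian L = D - A.
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    # Symmetric L has orthonormal eigenvectors (columns of u).
    _, u = np.linalg.eigh(laplacian)
    log_amp = np.log(np.asarray(amplitudes, dtype=float))
    # Inverse graph Fourier transform under the convention GFT(v) = u.T @ v.
    return u @ log_amp
```

Eigenvector signs, and the basis within repeated eigenvalues, are not unique, so individual feature values can differ between linear-algebra backends; only quantities such as the feature norm are invariant.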

7.3SDAug 27, 2019

Overview of Tasks and Investigation of Subjective Evaluation Methods in Environmental Sound Synthesis and Conversion

Yuki Okamoto, Keisuke Imoto, Tatsuya Komatsu et al.

Synthesizing and converting environmental sounds have the potential for many applications, such as supporting movie and game production and data augmentation for sound event detection and scene classification. Conventional works on synthesizing and converting environmental sounds are based on a physical modeling or concatenative approach. However, there are a limited number of works that have addressed environmental sound synthesis and conversion with statistical generative models; thus, this research area is not yet well organized. In this paper, we review problem definitions, applications, and evaluation methods of environmental sound synthesis and conversion. We then report on environmental sound synthesis using sound event labels, in which we focus on the current performance of statistical environmental sound synthesis and investigate how we should conduct subjective experiments on environmental sound synthesis.

13.6SDApr 27, 2019

Joint Analysis of Acoustic Events and Scenes Based on Multitask Learning

Noriyuki Tonami, Keisuke Imoto, Masahiro Niitsuma et al.

Acoustic event detection and scene classification are major research tasks in environmental sound analysis, and many methods based on neural networks have been proposed. Conventional methods have addressed these tasks separately; however, acoustic events and scenes are closely related to each other. For example, in the acoustic scene `office', the acoustic events `mouse clicking' and `keyboard typing' are likely to occur. In this paper, we propose multitask learning for joint analysis of acoustic events and scenes, which shares the parts of the networks holding information on acoustic events and scenes in common. By integrating the two networks, we expect that information on acoustic scenes will improve the performance of acoustic event detection. Experimental results obtained using TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets indicate that the proposed method improves the performance of acoustic event detection by 10.66 percentage points in terms of the F-score, compared with a conventional method based on a convolutional recurrent neural network.

10.0SDFeb 2, 2019

Sound Event Detection Using Graph Laplacian Regularization Based on Event Co-occurrence

Keisuke Imoto, Seisuke Kyochi

The types of sound events that occur in a situation are limited, and some sound events are likely to co-occur; for instance, ``dishes'' and ``glass jingling.'' In this paper, we propose a technique of sound event detection utilizing graph Laplacian regularization taking the sound event co-occurrence into account. In the proposed method, sound event occurrences are represented as a graph whose nodes indicate the frequency of event occurrence and whose edges indicate the co-occurrence of sound events. This graph representation is then utilized for sound event modeling, which is optimized under an objective function with a regularization term considering the graph structure. Experimental results obtained using TUT Sound Events 2016 development, 2017 development, and TUT Acoustic Scenes 2016 development indicate that the proposed method improves the detection performance of sound events by 7.9 percentage points compared to that of the conventional CNN-BiGRU-based method in terms of the segment-based F1-score. Moreover, the results show that the proposed method can detect co-occurring sound events more accurately than the conventional method.
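
A generic form of graph Laplacian regularization can be sketched as below. The abstract does not give the exact regularization term, so this uses the standard quadratic form yᵀLy = ½ Σᵢⱼ Aᵢⱼ (yᵢ − yⱼ)², with edge weights taken from raw co-occurrence counts. Both function names and the choice of edge weights are assumptions.

```python
import numpy as np

def cooccurrence_adjacency(labels):
    """Co-occurrence counts between events, with a zero diagonal.

    labels: binary array of shape (n_frames, n_events).
    """
    labels = np.asarray(labels, dtype=float)
    co = labels.T @ labels          # co[i, j] = frames where events i and j are both active
    np.fill_diagonal(co, 0.0)       # drop self co-occurrence
    return co

def laplacian_penalty(pred, adjacency):
    """Mean over frames of y^T L y with L = D - A.

    pred: (n_frames, n_events) or (n_events,) predicted event activations.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    pred = np.atleast_2d(np.asarray(pred, dtype=float))
    return float(np.mean(np.einsum("ti,ij,tj->t", pred, laplacian, pred)))
```

Added to a detection loss with a small weight, this penalty is low when frequently co-occurring events receive similar activations and grows when the model predicts one without the other.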

8.6ASNov 9, 2018

Joint Acoustic and Class Inference for Weakly Supervised Sound Event Detection

Sandeep Kothinti, Keisuke Imoto, Debmalya Chakrabarty et al.

Sound event detection is a challenging task, especially for scenes with multiple simultaneous events. While event classification methods tend to be fairly accurate, event localization presents additional challenges, especially when large amounts of labeled data are not available. Task 4 of the DCASE 2018 Challenge presents an event detection task that requires accuracy in both segmentation and recognition of events while providing only weakly labeled training data. Supervised methods can produce accurate event labels but are limited in event segmentation when training data lacks event timestamps. On the other hand, unsupervised methods that model the acoustic properties of the audio can produce accurate event boundaries but are not guided by the characteristics of event classes and sound categories. We present a hybrid approach that combines acoustic-driven event boundary detection and supervised label inference using a deep neural network. This framework leverages the benefits of both unsupervised and supervised methodologies and takes advantage of large amounts of unlabeled data, making it ideal for large-scale weakly labeled event detection. Compared to a baseline system, the proposed approach delivers a 15% absolute improvement in F-score, demonstrating the benefits of the hybrid bottom-up, top-down approach.

6.2SDMay 30, 2018

Acoustic Scene Analysis Using Partially Connected Microphones Based on Graph Cepstrum

Keisuke Imoto

In this paper, we propose an effective and robust method for acoustic scene analysis based on spatial information extracted from partially synchronized and/or closely located distributed microphones. In the proposed method, to extract spatial information from distributed microphones while taking into account whether any pairs of microphones are synchronized and/or closely located, we derive a new cepstrum feature utilizing a graph-based basis transformation. Specifically, in the proposed graph-based cepstrum, the logarithm of the amplitude in a multichannel observation is converted to a feature vector by an inverse graph Fourier transform, which can consider whether any pair of microphones is connected. Our experimental results indicate that the proposed graph-based cepstrum effectively extracts spatial information with consideration of the microphone connections. Moreover, the results show that the proposed method more robustly classifies acoustic scenes than conventional spatial features when the observed sounds have a large synchronization mismatch between partially synchronized microphone groups.