Volkan Kılıç

h-index: 23

6 papers

180 citations

Novelty: 44%

AI Score: 26

Ranked #161,573 of 194,257 authors (top 83%); #1,130 in AS (top 78%)

6 Papers

13.0 · AS · Mar 6, 2022

Leveraging Pre-trained BERT for Audio Captioning

Xubo Liu, Xinhao Mei, Qiushi Huang et al.

Audio captioning aims to describe the content of an audio clip in natural language. Existing audio captioning systems are generally based on an encoder-decoder architecture, in which acoustic information is extracted by an audio encoder and a language decoder then generates the captions. Training an audio captioning system often runs into the problem of data scarcity. Transferring knowledge from pre-trained audio models such as Pre-trained Audio Neural Networks (PANNs) has recently emerged as a useful method to mitigate this issue. However, less attention has been paid to exploiting pre-trained language models for the decoder than for the encoder. BERT is a pre-trained language model that has been extensively used in Natural Language Processing (NLP) tasks. Nevertheless, the potential of BERT as the language decoder for audio captioning has not been investigated. In this study, we demonstrate the efficacy of the pre-trained BERT model for audio captioning. Specifically, we apply PANNs as the encoder and initialize the decoder from the public pre-trained BERT models. We conduct an empirical study on the use of these BERT models for the decoder in the audio captioning model. Our models achieve results competitive with existing audio captioning methods on the AudioCaps dataset.

10.8 · AS · Oct 28, 2022 · Code

Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

Xubo Liu, Qiushi Huang, Xinhao Mei et al.

Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate the visual features into an audio captioning system. Furthermore, to better exploit complementary audio-visual contexts, we propose an audio-visual attention mechanism that adaptively integrates audio and visual context and removes the redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method achieves state-of-the-art results on machine translation metrics.
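The abstract above does not spell out how the adaptive audio-visual attention is computed. As a rough illustration of the general idea of adaptively weighting two modalities, the following minimal sketch blends an audio context vector and a visual context vector with a learned scalar gate. The function name `gated_fusion`, the gate parameters, and the toy vectors are all assumptions made for illustration, not the authors' mechanism.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(audio_ctx, visual_ctx, gate_weights, gate_bias=0.0):
    """Blend two same-length context vectors with a scalar gate.

    Hypothetical sketch: the gate is computed from the concatenation of
    both contexts, so the mix can change from one input to the next.
    """
    if len(audio_ctx) != len(visual_ctx):
        raise ValueError("audio and visual contexts must have the same length")
    concat = list(audio_ctx) + list(visual_ctx)
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated length")
    # Gate near 1 favours the audio context, near 0 favours the visual one.
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [gate * a + (1.0 - gate) * v for a, v in zip(audio_ctx, visual_ctx)]
    return gate, fused


# Toy usage with made-up vectors: zero gate weights give an even blend.
gate, fused = gated_fusion([1.0, 0.0], [0.0, 1.0], [0.0, 0.0, 0.0, 0.0])
print(gate, fused)
```

In a trained model the gate weights would be learned jointly with the rest of the captioning system; here they are fixed by hand only to show the arithmetic.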

4.3 · AS · Mar 7, 2022

Deep Neural Decision Forest for Acoustic Scene Classification

Jianyuan Sun, Xubo Liu, Xinhao Mei et al.

Acoustic scene classification (ASC) aims to classify an audio clip based on the characteristics of the recording environment. In this regard, deep learning based approaches have emerged as a useful tool for ASC problems. Conventional approaches to improving classification accuracy include integrating auxiliary methods such as attention mechanisms, pre-trained models, and ensembles of multiple sub-networks. However, due to the complexity of audio clips captured in different environments, it is difficult for existing deep learning models that use only a single classifier to distinguish their categories without such auxiliary methods. In this paper, we propose a novel approach for ASC using a deep neural decision forest (DNDF). DNDF combines a fixed number of convolutional layers with a decision forest as the final classifier. The decision forest consists of a fixed number of decision tree classifiers, which have been shown to offer better classification performance than a single classifier on some datasets. In particular, the decision forest differs substantially from traditional random forests in that it is stochastic, differentiable, and capable of using back-propagation to update and learn feature representations in the neural network. Experimental results on the DCASE2019 and ESC-50 datasets demonstrate that our proposed DNDF method improves ASC performance in terms of classification accuracy and is competitive with state-of-the-art baselines.
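To make the "stochastic, differentiable" decision forest more concrete, here is a minimal sketch of soft routing in a depth-2 decision tree and averaging over a small forest. It assumes the usual soft-tree formulation in which each internal node sends a sample left with a sigmoid probability and each leaf holds a class distribution. The node scores, which in a DNDF would come from the convolutional layers, are hypothetical hand-picked inputs here, as are all function names.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def leaf_routing_probs(node_scores):
    """Probability of reaching each of the 4 leaves of a depth-2 tree.

    node_scores holds one real-valued score per internal node, ordered
    [root, left child, right child]. Each score becomes a soft
    "go left" probability, so routing is differentiable.
    """
    d_root, d_left, d_right = (sigmoid(s) for s in node_scores)
    return [
        d_root * d_left,                  # left, then left
        d_root * (1.0 - d_left),          # left, then right
        (1.0 - d_root) * d_right,         # right, then left
        (1.0 - d_root) * (1.0 - d_right), # right, then right
    ]


def tree_predict(node_scores, leaf_dists):
    """Class distribution of one tree: routing-weighted mix of leaf distributions."""
    mu = leaf_routing_probs(node_scores)
    n_classes = len(leaf_dists[0])
    return [sum(m * leaf[c] for m, leaf in zip(mu, leaf_dists))
            for c in range(n_classes)]


def forest_predict(trees):
    """Average the class distributions of all trees in the forest."""
    preds = [tree_predict(scores, leaves) for scores, leaves in trees]
    n_classes = len(preds[0])
    return [sum(p[c] for p in preds) / len(preds) for c in range(n_classes)]


# Toy usage: two trees, two classes, made-up scores and leaf distributions.
leaves = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
forest = [([0.0, 0.0, 0.0], leaves), ([2.0, -1.0, 0.5], leaves)]
print(forest_predict(forest))
```

Because every routing probability is a smooth function of the node scores, gradients can flow from the classification loss back into whatever network produces those scores, which is the property the abstract contrasts with traditional random forests.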

5.7 · SD · Oct 10, 2022

Automated Audio Captioning via Fusion of Low- and High- Dimensional Features

Jianyuan Sun, Xubo Liu, Xinhao Mei et al.

Automated audio captioning (AAC) aims to describe the content of an audio clip using simple sentences. Existing AAC methods are based on an encoder-decoder architecture whose success is attributed to the use of a pre-trained CNN10 called PANNs as the encoder to learn rich audio representations. AAC is a highly challenging task because its high-dimensional latent space involves audio from various scenarios. Existing methods use only the high-dimensional representation of PANNs as the input to the decoder. However, the low-dimensional representation, which may retain as much audio information as the high-dimensional representation, may be neglected. In addition, the high-dimensional approach may predict audio captions by learning from existing audio captions, which lacks robustness and efficiency. To deal with these challenges, this paper proposes a new encoder-decoder framework for AAC called the Low- and High-Dimensional Feature Fusion (LHDFF) model, which integrates low- and high-dimensional features. Moreover, in LHDFF, a new PANNs encoder called Residual PANNs (RPANNs) is proposed, which fuses the low-dimensional feature from the intermediate convolution layer output with the high-dimensional feature from the final layer output of PANNs. To fully exploit the information in the fused low- and high-dimensional feature and in the high-dimensional feature, respectively, we propose dual transformer decoder structures that generate captions in parallel. In particular, a probabilistic fusion approach is proposed that improves the overall performance of the system by concentrating on the respective advantages of the two transformer decoders. Experimental results show that LHDFF achieves the best performance on the Clotho and AudioCaps datasets compared with other existing models.
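The abstract does not give the formula behind the probabilistic fusion of the two decoders. One simple way such a fusion could work, shown here purely as an assumption, is a convex combination of the two decoders' next-token probability distributions. The names `fuse_distributions` and `pick_token`, the weight, and the toy vocabulary are hypothetical.

```python
def fuse_distributions(p_a, p_b, weight_a=0.5):
    """Convex combination of two probability distributions over one vocabulary.

    Hypothetical stand-in for a probabilistic fusion step: weight_a is the
    trust placed in decoder A, and (1 - weight_a) the trust in decoder B.
    """
    if len(p_a) != len(p_b):
        raise ValueError("both distributions must cover the same vocabulary")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(p_a, p_b)]


def pick_token(vocab, probs):
    """Greedy choice: return the token with the highest fused probability."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best]


# Toy usage with a made-up three-word vocabulary.
vocab = ["dog", "barks", "rain"]
decoder_a = [0.6, 0.3, 0.1]  # decoder fed the fused low/high-dimensional feature
decoder_b = [0.1, 0.7, 0.2]  # decoder fed the high-dimensional feature only
fused = fuse_distributions(decoder_a, decoder_b, weight_a=0.5)
print(pick_token(vocab, fused))
```

A convex combination keeps the result a valid probability distribution, so the fused output can be fed directly into greedy or beam-search decoding.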

2.9 · SD · Dec 4, 2018

Intensity Particle Flow SMC-PHD Filter For Audio Speaker Tracking

Yang Liu, Wenwu Wang, Volkan Kilic

Non-zero diffusion particle flow Sequential Monte Carlo probability hypothesis density (NPF-SMC-PHD) filtering has recently been introduced for multi-speaker tracking. However, the NPF does not account for missed detections, which play a key role in estimating the number of speakers and their states. To address this limitation, we propose to use intensity particle flow (IPF) in the NPF-SMC-PHD filter. The proposed method, IPF-SMC-PHD, considers the clutter intensity and detection probability, while no data association algorithms are used for the calculation of particle flow. Experiments on the LOCATA (acoustic source Localization and Tracking) dataset with the sequences of task 4 show that our proposed IPF-SMC-PHD filter improves tracking performance in terms of estimation accuracy compared to its baseline counterparts.

2.4 · CV · Mar 17, 2017

Smartphone Based Colorimetric Detection via Machine Learning

Ali Y. Mutlu, Volkan Kılıç, Gizem K. Özdemir et al.

We report the application of machine learning to smartphone based colorimetric detection of pH values. The strip images were used as the training set for Least Squares-Support Vector Machine (LS-SVM) classifier algorithms that were able to successfully classify the distinct pH values. The difference in the obtained image formats was found not to significantly affect the performance of the proposed machine learning approach. Moreover, the influence of the illumination conditions on the perceived color of pH strips was investigated, and further experiments were carried out to study the effect of color change on the learning model. Test results on JPEG, RAW and RAW-corrected image formats captured in different lighting conditions lead to perfect classification accuracy, sensitivity and specificity, which proves that colorimetric detection using machine learning based systems is able to adapt to various experimental conditions and is a great candidate for smartphone based sensing in paper-based colorimetric assays.
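For readers unfamiliar with LS-SVM, the sketch below trains a binary LS-SVM classifier by solving its standard linear system instead of a quadratic program. It is an illustration under stated assumptions, not the paper's pipeline: the RBF kernel, the hyperparameters `gamma` and `sigma`, the binary setting, and the made-up normalized RGB values are all choices made here, and the paper's pH task has more than two classes.

```python
import math


def rbf_kernel(u, v, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def train_lssvm(X, y, gamma=10.0, sigma=1.0):
    """Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1].

    Omega[i][j] = y_i * y_j * K(x_i, x_j); labels y must be +1 or -1.
    """
    n = len(X)
    A = [[0.0] + [float(label) for label in y]]
    for i in range(n):
        row = [float(y[i])]
        for j in range(n):
            omega = y[i] * y[j] * rbf_kernel(X[i], X[j], sigma)
            row.append(omega + (1.0 / gamma if i == j else 0.0))
        A.append(row)
    solution = solve_linear(A, [0.0] + [1.0] * n)
    return solution[0], solution[1:]  # bias b, multipliers alpha


def predict(x, X, y, b, alpha, sigma=1.0):
    """Sign of the kernel expansion sum_i alpha_i * y_i * K(x, x_i) + b."""
    score = sum(a * yi * rbf_kernel(x, xi, sigma)
                for a, yi, xi in zip(alpha, y, X)) + b
    return 1 if score >= 0 else -1


# Made-up normalized RGB colors: +1 = reddish strip, -1 = bluish strip.
X_train = [[0.9, 0.1, 0.1], [0.8, 0.2, 0.1], [0.1, 0.1, 0.9], [0.2, 0.2, 0.8]]
y_train = [1, 1, -1, -1]
bias, alphas = train_lssvm(X_train, y_train)
print(predict([0.85, 0.15, 0.1], X_train, y_train, bias, alphas))
```

Replacing the inequality constraints of a standard SVM with equality constraints is what turns training into a single linear solve. A multi-class problem such as distinct pH values could be handled by combining several binary classifiers, for example one-vs-rest, though the abstract does not say which scheme was used.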