Volkan Kılıç

AS
4 papers
172 citations
Novelty 41%
AI Score 23

4 Papers

AS Mar 6, 2022
Leveraging Pre-trained BERT for Audio Captioning

Xubo Liu, Xinhao Mei, Qiushi Huang et al.

Audio captioning aims at using natural language to describe the content of an audio clip. Existing audio captioning systems are generally based on an encoder-decoder architecture, in which acoustic information is extracted by an audio encoder and then a language decoder is used to generate the captions. Training an audio captioning system often encounters the problem of data scarcity. Transferring knowledge from pre-trained audio models such as Pre-trained Audio Neural Networks (PANNs) has recently emerged as a useful method to mitigate this issue. However, less attention has been paid to exploiting pre-trained language models for the decoder than for the encoder. BERT is a pre-trained language model that has been extensively used in Natural Language Processing (NLP) tasks. Nevertheless, the potential of BERT as the language decoder for audio captioning has not been investigated. In this study, we demonstrate the efficacy of the pre-trained BERT model for audio captioning. Specifically, we apply PANNs as the encoder and initialize the decoder from the public pre-trained BERT models. We conduct an empirical study on the use of these BERT models for the decoder in the audio captioning model. Our models achieve results competitive with existing audio captioning methods on the AudioCaps dataset.

AS Oct 28, 2022
Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

Xubo Liu, Qiushi Huang, Xinhao Mei et al.

Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. Accurately recognizing ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help describe ambiguous-sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate the visual features into an audio captioning system. Furthermore, to better exploit complementary audio-visual contexts, we propose an audio-visual attention mechanism that adaptively integrates audio and visual context and removes the redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method achieves state-of-the-art results on machine translation metrics.
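
The abstract does not specify how the adaptive audio-visual attention is computed, so the following is only a generic illustration of the idea of attending over two modalities and mixing the results with a learned gate. It is not the authors' mechanism. The function names, the gate parameter `w_gate`, and all shapes are assumptions made for this sketch.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, memory):
    """Scaled dot-product attention: query (d,), memory (t, d) -> context (d,)."""
    scores = memory @ query / np.sqrt(query.shape[0])
    weights = softmax(scores)          # one weight per time step / frame
    return weights @ memory            # weighted average of memory rows

def gated_audio_visual_context(query, audio_feats, visual_feats, w_gate):
    """Hypothetical gated fusion of an audio context and a visual context.

    w_gate has shape (d, 3 * d); it is an assumed learnable parameter.
    """
    audio_ctx = attend(query, audio_feats)
    visual_ctx = attend(query, visual_feats)
    gate_in = np.concatenate([query, audio_ctx, visual_ctx])
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ gate_in)))  # elementwise, in (0, 1)
    # A gate near 1 trusts audio; a gate near 0 trusts the visual stream.
    return gate * audio_ctx + (1.0 - gate) * visual_ctx

# Random stand-ins for decoder state, audio frames, and video frames.
rng = np.random.default_rng(0)
d = 8
query = rng.normal(size=d)
audio_feats = rng.normal(size=(20, d))
visual_feats = rng.normal(size=(5, d))
w_gate = rng.normal(size=(d, 3 * d)) * 0.1
fused = gated_audio_visual_context(query, audio_feats, visual_feats, w_gate)
```

In a sketch like this, the gate is what makes the fusion adaptive: it is computed per decoding step, so the mix between modalities can change from word to word.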

AS Mar 7, 2022
Deep Neural Decision Forest for Acoustic Scene Classification

Jianyuan Sun, Xubo Liu, Xinhao Mei et al.

Acoustic scene classification (ASC) aims to classify an audio clip based on the characteristics of the recording environment. In this regard, deep learning based approaches have emerged as a useful tool for ASC problems. Conventional approaches to improving the classification accuracy include integrating auxiliary methods such as attention mechanisms, pre-trained models, and ensembles of multiple sub-networks. However, due to the complexity of audio clips captured from different environments, it is difficult for existing deep learning models that use only a single classifier to distinguish their categories without such auxiliary methods. In this paper, we propose a novel approach for ASC using deep neural decision forest (DNDF). DNDF combines a fixed number of convolutional layers and a decision forest as the final classifier. The decision forest consists of a fixed number of decision tree classifiers, which have been shown to offer better classification performance than a single classifier on some datasets. In particular, the decision forest differs substantially from traditional random forests as it is stochastic, differentiable, and capable of using back-propagation to update and learn feature representations in the neural network. Experimental results on the DCASE2019 and ESC-50 datasets demonstrate that our proposed DNDF method improves the ASC performance in terms of classification accuracy and shows competitive performance as compared with state-of-the-art baselines.
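
To make the "stochastic, differentiable" decision forest more concrete, here is a minimal forward-pass sketch of soft decision trees in NumPy. It assumes the common formulation in which each internal node routes a sample left with a sigmoid probability and each leaf holds a class distribution. The random `features` array stands in for the output of the convolutional layers, and all names, shapes, and tree depths are assumptions of this sketch rather than details from the paper. Training by back-propagation is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def leaf_probabilities(decisions, depth):
    """Probability of reaching each leaf of a complete binary tree.

    decisions: (batch, 2**depth - 1) probabilities of going left at each
    internal node, stored in breadth-first order.
    Returns an array of shape (batch, 2**depth) whose rows sum to 1.
    """
    batch = decisions.shape[0]
    mu = np.ones((batch, 1))  # all probability mass starts at the root
    for level in range(depth):
        start = 2**level - 1
        d = decisions[:, start:start + 2**level]
        # Each node splits its mass between its left and right child.
        mu = np.stack([mu * d, mu * (1.0 - d)], axis=2).reshape(batch, -1)
    return mu

def tree_predict(features, w_split, leaf_logits, depth):
    decisions = sigmoid(features @ w_split)   # soft routing at every node
    mu = leaf_probabilities(decisions, depth)
    pi = softmax(leaf_logits, axis=1)         # class distribution per leaf
    return mu @ pi                            # mix leaf distributions

def forest_predict(features, trees, depth):
    # The forest output is the average of the per-tree class distributions.
    return np.mean([tree_predict(features, w, l, depth) for w, l in trees], axis=0)

rng = np.random.default_rng(0)
depth, n_feat, n_classes, n_trees = 3, 16, 10, 5
features = rng.normal(size=(4, n_feat))  # stand-in for CNN features
trees = [(rng.normal(size=(n_feat, 2**depth - 1)),
          rng.normal(size=(2**depth, n_classes))) for _ in range(n_trees)]
probs = forest_predict(features, trees, depth)
```

Because every step is a smooth function of `w_split` and `leaf_logits`, gradients can flow from the class probabilities back into the feature extractor, which is what distinguishes this kind of forest from a conventional random forest with hard splits.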

CV Mar 17, 2017
Smartphone Based Colorimetric Detection via Machine Learning

Ali Y. Mutlu, Volkan Kılıç, Gizem K. Özdemir et al.

We report the application of machine learning to smartphone based colorimetric detection of pH values. The strip images were used as the training set for Least Squares-Support Vector Machine (LS-SVM) classifier algorithms that were able to successfully classify the distinct pH values. The difference in the obtained image formats was found not to significantly affect the performance of the proposed machine learning approach. Moreover, the influence of the illumination conditions on the perceived color of pH strips was investigated and further experiments were carried out to study the effect of color change on the learning model. Test results on JPEG, RAW and RAW-corrected image formats captured in different lighting conditions show perfect classification accuracy, sensitivity and specificity, which proves that colorimetric detection using machine learning based systems is able to adapt to various experimental conditions and is a great candidate for smartphone based sensing in paper-based colorimetric assays.
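
As an illustration of the LS-SVM classifier named above, the sketch below trains a binary LS-SVM by solving its standard linear system instead of a quadratic program. The data is synthetic: the two color clusters are hypothetical mean-RGB values, not measurements from the paper, and the RBF kernel and hyperparameters (`gamma`, `sigma`) are assumptions. Classifying several pH values would need a multi-class scheme such as one-vs-rest, which is not shown.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # Pairwise squared distances, clipped at 0 to guard against round-off.
    sq = (np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def lssvm_fit(x, y, gamma=10.0, sigma=1.0):
    """Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]."""
    n = x.shape[0]
    omega = np.outer(y, y) * rbf_kernel(x, x, sigma)
    system = np.zeros((n + 1, n + 1))
    system[0, 1:] = y
    system[1:, 0] = y
    system[1:, 1:] = omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    solution = np.linalg.solve(system, rhs)
    return solution[0], solution[1:]  # bias b, multipliers alpha

def lssvm_predict(x_train, y_train, b, alpha, x_new, sigma=1.0):
    k = rbf_kernel(x_new, x_train, sigma)
    return np.sign(k @ (alpha * y_train) + b)

# Hypothetical mean-RGB features for two strip colors (labels +1 and -1).
rng = np.random.default_rng(0)
reddish = np.array([0.9, 0.2, 0.2]) + rng.normal(scale=0.03, size=(10, 3))
bluish = np.array([0.2, 0.2, 0.9]) + rng.normal(scale=0.03, size=(10, 3))
x = np.vstack([reddish, bluish])
y = np.concatenate([np.ones(10), -np.ones(10)])

b, alpha = lssvm_fit(x, y, gamma=10.0, sigma=0.5)
train_pred = lssvm_predict(x, y, b, alpha, x, sigma=0.5)
```

The equality constraints of LS-SVM are what reduce training to a single call to `np.linalg.solve`; the trade-off is that every training point receives a nonzero multiplier, so the model is not sparse like a standard SVM.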