City-Identification of Flickr Videos Using Semantic Acoustic Features
This work addresses the problem of geolocating videos for applications such as multimedia retrieval. It is incremental in that it builds on an existing task, although its audio-only approach is novel.
The paper tackled city-identification of videos using only audio, without images or tags, by developing a method based on semantic acoustic features derived from a taxonomy of urban sounds. It improved state-of-the-art performance on the MediaEval Placing Task dataset, demonstrating a correlation between acoustic information and city-location.
City-identification of videos aims to determine the likelihood of a video belonging to each of a set of cities. In this paper, we present an approach that uses only audio; we do not use any additional modality such as images, user-tags or geo-tags. In this manner, we show the extent to which the city-location of videos correlates with their acoustic information. Success in this task suggests that audio can complement the other modalities. In particular, we present a method to compute and use semantic acoustic features to perform city-identification, and the features provide semantic evidence for the identification. The semantic evidence is given by a taxonomy of urban sounds and expresses the potential presence of these sounds in the city-soundtracks. We used the MediaEval Placing Task set, which contains Flickr videos labeled by city. In addition, we used the UrbanSound8K set, which contains audio clips labeled by sound-type. Our method improved the state-of-the-art performance and provides a novel semantic approach to this task.
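To make the idea of semantic acoustic features concrete, the following is a minimal sketch and not the method of the paper, whose details are not given in this abstract. It assumes that each video is split into segments described by low-level feature vectors, that a sound classifier (here a hypothetical linear model with random weights) scores each segment against the ten UrbanSound8K sound types, and that a video's semantic feature is the average of those scores. The city scoring by distance to per-city centroids, the function names, the feature dimension and all data are illustrative assumptions.

```python
import numpy as np

# The ten sound types of UrbanSound8K.
URBAN_SOUNDS = [
    "air_conditioner", "car_horn", "children_playing", "dog_bark", "drilling",
    "engine_idling", "gun_shot", "jackhammer", "siren", "street_music",
]


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def semantic_features(segments, W, b):
    """Average per-segment sound-type probabilities over one video.

    segments: (n_segments, n_lowlevel) low-level audio features.
    W, b:     weights of a hypothetical linear sound classifier.
    Returns a (n_sounds,) vector; entry k reflects how strongly sound
    type k appears to be present in the soundtrack.
    """
    probs = softmax(segments @ W + b, axis=1)
    return probs.mean(axis=0)


def city_likelihoods(sem, centroids):
    """Score cities by closeness of sem to each city's mean semantic vector.

    Assumption for illustration: a softmax over negative squared distances.
    """
    d2 = ((centroids - sem) ** 2).sum(axis=1)
    return softmax(-d2)


# Synthetic stand-ins for real data (assumed dimensions, random values).
rng = np.random.default_rng(0)
n_lowlevel = 13
W = rng.normal(size=(n_lowlevel, len(URBAN_SOUNDS)))
b = np.zeros(len(URBAN_SOUNDS))
video = rng.normal(size=(20, n_lowlevel))  # 20 segments of one video
centroids = rng.dirichlet(np.ones(len(URBAN_SOUNDS)), size=3)  # 3 cities

sem = semantic_features(video, W, b)
scores = city_likelihoods(sem, centroids)
```

Because each entry of `sem` is tied to a named sound type, the vector can be inspected directly, for example by listing the sound types with the highest values for a given video. This is the sense in which such features can offer semantic evidence for a city prediction.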