SDApr 21
Environmental Sound Deepfake Detection Using Deep-Learning FrameworkLam Pham, Khoi Vu, Dat Tran et al.
In this paper, we propose a deep-learning framework for environmental sound deepfake detection (ESDD) -- the task of identifying whether the sound scene and sound event in an input audio recording is fake or not. To this end, we conducted extensive experiments to explore how individual spectrograms, a wide range of network architectures and pre-trained models, ensemble of spectrograms or network architectures affect the ESDD task performance. The experimental results on the benchmark datasets of EnvSDD and ESDD-Challenge-TestSet indicate that detecting deepfake audio of sound scene and detecting deepfake audio of sound event should be considered as individual tasks. We also indicate that the approach of finetuning a pre-trained model is more effective compared with training a model from scratch for the ESDD task. Eventually, our best model, which was finetuned from the pre-trained WavLM model with the proposed three-stage training strategy, achieve the Accuracy of 0.98, F1 Score of 0.95, AuC of 0.99 on EnvSDD Test subset and the Accuracy of 0.88, F1 Score of 0.77, and AuC of 0.92 on ESDD-Challenge-TestSet dataset.
IVMay 30, 2025
Beyond Pretty Pictures: Combined Single- and Multi-Image Super-resolution for Sentinel-2 ImagesAditya Retnanto, Son Le, Sebastian Mueller et al.
Super-resolution aims to increase the resolution of satellite images by reconstructing high-frequency details, which go beyond naïve upsampling. This has particular relevance for Earth observation missions like Sentinel-2, which offer frequent, regular coverage at no cost; but at coarse resolution. Its pixel footprint is too large to capture small features like houses, streets, or hedge rows. To address this, we present SEN4X, a hybrid super-resolution architecture that combines the advantages of single-image and multi-image techniques. It combines temporal oversampling from repeated Sentinel-2 acquisitions with a learned prior from high-resolution Pléiades Neo data. In doing so, SEN4X upgrades Sentinel-2 imagery to 2.5 m ground sampling distance. We test the super-resolved images on urban land-cover classification in Hanoi, Vietnam. We find that they lead to a significant performance improvement over state-of-the-art super-resolution baselines.
CVJan 3, 2025
Quantitative Gait Analysis from Single RGB Videos Using a Dual-Input Transformer-Based NetworkHiep Dinh, Son Le, My Than et al.
Gait and movement analysis have become a well-established clinical tool for diagnosing health conditions, monitoring disease progression for a wide spectrum of diseases, and to implement and assess treatment, surgery and or rehabilitation interventions. However, quantitative motion assessment remains limited to costly motion capture systems and specialized personnel, restricting its accessibility and broader application. Recent advancements in deep neural networks have enabled quantitative movement analysis using single-camera videos, offering an accessible alternative to conventional motion capture systems. In this paper, we present an efficient approach for clinical gait analysis through a dual-pattern input convolutional Transformer network. The proposed system leverages a dual-input Transformer model to estimate essential gait parameters from single RGB videos captured by a single-view camera. The system demonstrates high accuracy in estimating critical metrics such as the gait deviation index (GDI), knee flexion angle, step length, and walking cadence, validated on a dataset of individuals with movement disorders. Notably, our approach surpasses state-of-the-art methods in various scenarios, using fewer resources and proving highly suitable for clinical application, particularly in resource-constrained environments.