EnvTriCascade: An Environment-Aware Tri-Stage Cascaded Framework for ESDD2 2026 Challenge
For researchers in audio spoofing detection, this work addresses the emerging challenge of component-level manipulation in real-world scenarios, though it is an incremental improvement over existing methods.
The paper tackles component-level audio spoofing detection where speech and environmental sounds can be independently manipulated. The proposed EnvTriCascade framework achieves a Macro-F1 score of 0.8266 on the test set, significantly outperforming the official baseline and ranking second in the ESDD2 Challenge.
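For readers unfamiliar with the reported metric, Macro-F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The following is a minimal sketch of that standard definition. It is not the challenge's official scoring script; the function name `macro_f1` and the integer class labels are illustrative assumptions.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels):
    """Return the unweighted mean of per-class F1 scores.

    `labels` lists every class to score; a class with no true or
    predicted examples contributes an F1 of 0.0 (a common convention).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN), equivalent to the harmonic mean
        # of precision and recall.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Hypothetical three-class toy data, for illustration only.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_score = macro_f1(toy_true, toy_pred, labels=[0, 1, 2])
```

Because every class is weighted equally, a system that neglects one of the five classes is penalized heavily even if that class is rare in the test set.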
Audio deepfake detection (ADD) in real-world scenarios has evolved from speech-only spoofing to more challenging component-level settings, where speech and environmental sounds may be independently manipulated. To tackle this, we propose EnvTriCascade, an Environment-Aware Tri-Stage Cascaded framework for the ESDD2 Challenge. First, a mix-consistency detector provides a binary prior that distinguishes original recordings from manipulated mixtures and is used to calibrate the final decisions. Next, two complementary five-class detectors, built on SSLAM+XLS-R and EAT-large+XLS-R representations, extract robust multi-branch features that are integrated via a cross-branch attention-gated classifier. To enhance robustness against diverse mixing conditions, we incorporate RawBoost augmentation. Trained exclusively on the official CompSpoofV2 dataset, our system achieves a Macro-F1 score of 0.8266 on the test set, significantly outperforming the official baseline and ranking second in the challenge.
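The abstract does not specify how the binary prior calibrates the five-class decisions. One plausible reading, sketched below purely as an assumption, is a Bayesian-style reweighting: the probability of the "original" class is scaled by the detector's estimate that the recording is original, the remaining classes are scaled by its complement, and the result is renormalized. The function name `apply_binary_prior`, the choice of index 0 for the "original" class, and the example probabilities are all hypothetical.

```python
def apply_binary_prior(class_probs, p_original, original_idx=0):
    """Reweight multi-class probabilities with a binary prior.

    class_probs  : probabilities from a five-class detector (sums to 1).
    p_original   : estimated probability that the recording is original
                   (assumed output of a mix-consistency detector).
    original_idx : index of the "original" class (assumption: 0).
    """
    if not 0.0 <= p_original <= 1.0:
        raise ValueError("p_original must lie in [0, 1]")
    weighted = [
        p * (p_original if i == original_idx else 1.0 - p_original)
        for i, p in enumerate(class_probs)
    ]
    total = sum(weighted)
    if total == 0.0:
        # Prior and classifier fully contradict each other; fall back
        # to the uncalibrated probabilities.
        return list(class_probs)
    return [w / total for w in weighted]


# Hypothetical five-class output that narrowly favors "original".
probs = [0.4, 0.3, 0.1, 0.1, 0.1]
calibrated_low = apply_binary_prior(probs, p_original=0.1)
calibrated_high = apply_binary_prior(probs, p_original=0.9)
```

With these illustrative numbers, a low prior moves the decision away from the "original" class and toward the strongest manipulated class, while a high prior reinforces it. The actual system may combine the stages differently, for example through learned fusion or thresholding.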