RoboKA: KAN Informed Multimodal Learning for RoboCall Surveillance System
This work addresses the lack of public datasets for robocall surveillance and proposes a novel method for detecting adversarial robocalls, though all results are reported on synthetic data only.
The authors created Robo-SAr, a synthetic robocall dataset with ~1400 samples across three adversarial axes, and proposed RoboKA, a KAN-based multimodal fusion framework that outperforms baselines in recall and F1-score for robocall surveillance.
Research on robocall surveillance is hindered by limited access to public datasets, a consequence of privacy concerns. In this work, we first curate Robo-SAr, a synthetic robocall dataset designed for robocall surveillance research. Robo-SAr comprises ~200 unwanted and ~1200 legitimate synthetic robocall samples across three realistic adversarial axes: psycholinguistics-manipulated transcripts, emotion-eliciting speech, and cloned voices. We further propose RoboKA, a Kolmogorov-Arnold Network (KAN)-based multimodal fusion framework designed to model the structured nonlinear interactions between acoustic and linguistic cues that characterize diverse adversarial robocall strategies. RoboKA first uses cross-modal contrastive learning to align the latent representations of the two modalities, then feeds the resulting embeddings to a KAN projection head for final classification. We benchmark RoboKA against strong unimodal and multimodal baselines in both in-domain and out-of-domain setups and find that RoboKA surpasses all baselines in recall and F1-score.
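The two-stage pipeline described in the abstract (contrastive alignment of the modalities, followed by a KAN head) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; everything specific in it is an assumption made for illustration. The contrastive objective is taken to be a symmetric InfoNCE loss with a hypothetical temperature of 0.07. The KAN layer uses Gaussian radial basis functions in place of the B-splines used in the original KAN formulation. Fusion is done by simple concatenation. The names `symmetric_info_nce` and `RBFKANLayer` are hypothetical, the embeddings are random placeholders, and only the forward pass is shown.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(z, axis):
    # Numerically stable log-softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def symmetric_info_nce(audio, text, temperature=0.07):
    # Assumed objective: row i of `audio` and row i of `text` are a positive
    # pair; every other row in the batch acts as a negative.
    a, t = l2_normalize(audio), l2_normalize(text)
    logits = a @ t.T / temperature
    idx = np.arange(len(a))
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # audio -> text
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> audio
    return 0.5 * (loss_a2t + loss_t2a)

class RBFKANLayer:
    """KAN-style layer: each input-output edge carries its own learnable
    univariate function, here a weighted sum of Gaussian bumps (an RBF
    stand-in for B-splines)."""

    def __init__(self, in_dim, out_dim, num_basis=8, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        self.centers = np.linspace(-2.0, 2.0, num_basis)
        self.width = self.centers[1] - self.centers[0]
        # coef[i, k, j]: weight of basis k on the edge from input i to output j.
        self.coef = rng.normal(0.0, 0.1, size=(in_dim, num_basis, out_dim))

    def __call__(self, x):
        # x: (batch, in_dim) -> basis activations: (batch, in_dim, num_basis)
        basis = np.exp(-((x[..., None] - self.centers) / self.width) ** 2)
        # y_j = sum_i phi_ij(x_i), with phi_ij(x) = sum_k coef[i, k, j] * basis_k(x)
        return np.einsum("bik,iko->bo", basis, self.coef)

# Placeholder embeddings standing in for acoustic and linguistic encoder outputs.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 16))
text = audio + 0.01 * rng.normal(size=(4, 16))  # nearly aligned pairs
loss_aligned = symmetric_info_nce(audio, text)
loss_random = symmetric_info_nce(audio, rng.normal(size=(4, 16)))

# Assumed fusion: concatenate the normalized embeddings, then apply the KAN head.
fused = np.concatenate([l2_normalize(audio), l2_normalize(text)], axis=1)
head = RBFKANLayer(in_dim=32, out_dim=1, rng=rng)
prob_unwanted = 1.0 / (1.0 + np.exp(-head(fused)))  # sigmoid over the head output
```

In this sketch the contrastive loss is lower for the nearly aligned pairs than for unrelated ones, which is the property the alignment stage relies on. The head replaces the fixed activations of an MLP with per-edge learnable functions. In a trained system both the encoders and the `coef` tensor would be fit by gradient descent, which is omitted here.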