I Hear, Therefore I Trust: A Socio-Technical Investigation of Humans as Synthetic Speech Detectors
For researchers and practitioners in deepfake detection, this work reveals the limitations of human perception as a detection mechanism, highlighting the need for automated tools.
The study found that humans detected fully synthetic speech at below-chance levels and that trust cues had no main effect on accuracy; quality ratings nonetheless discriminated utterance types implicitly.
Automatic deepfake detection has received considerable research attention, yet the socio-technical environment in which humans actually encounter synthetic speech remains poorly understood. We investigate voice deepfake detection as a perceptual and contextual process, presenting a localization task in which 47 participants marked suspected synthetic segments across authentic, fully synthetic, and partially synthetic utterances under three manipulated trust cues: instructional framing, affective priming, and provenance labeling. Participants provided quality ratings on mechanicalness, expressiveness, intelligibility, clarity, calmness, and confidence of evaluation. Utterance class was the primary determinant of detection accuracy and perceptual quality; trust cues produced no main effects but motivated detection behavior. Fully synthetic speech was detected at below-chance levels. Quality ratings tracked utterance type, indicating implicit discrimination where overt detection failed.