Can LLMs Help Localize Fake Words in Partially Fake Speech?
This work addresses the detection of manipulated speech, which matters for security and media integrity, but its contribution is incremental because it focuses on specific datasets and editing patterns.
The paper investigates whether a text-trained large language model (LLM) can localize fake words in partially fake speech, where specific words are edited, and finds that the model often uses editing-style patterns like word-level polarity substitutions as cues for localization.
Large language models (LLMs), trained on large-scale text, have recently attracted significant attention for their strong performance across many tasks. Motivated by this, we investigate whether a text-trained LLM can help localize fake words in partially fake speech, where only specific words within an utterance are edited. We build a speech LLM to perform fake word localization via next token prediction. Experiments and analyses on AV-Deepfake1M and PartialEdit indicate that the model frequently leverages editing-style patterns learned from the training data, particularly the word-level polarity substitutions found in these two databases, as cues for localizing fake words. Although such patterns provide useful information in an in-domain scenario, how to avoid over-reliance on them and improve generalization to unseen editing styles remains an open question.