F-Actor: Controllable Conversational Behaviour in Full-Duplex Models
This work addresses the need for more natural and engaging conversational AI systems by enabling dynamic control of conversational behaviour. The contribution is incremental, since it builds on existing full-duplex and instruction-following methods.
The authors address the limited customization of spoken conversational systems with an open, instruction-following full-duplex model that can control speaker voice, conversation topic, and conversational behaviour. Training requires only 2,000 hours of data, which keeps it feasible under typical academic resource constraints.
Spoken conversational systems require more than accurate speech generation to have human-like conversations: to feel natural and engaging, they must produce conversational behaviour that adapts dynamically to the context. Current spoken conversational systems, however, rarely allow such customization, limiting their naturalness and usability. In this work, we present the first open, instruction-following full-duplex conversational speech model that can be trained efficiently under typical academic resource constraints. By keeping the audio encoder frozen and finetuning only the language model, our model requires just 2,000 hours of data, without relying on large-scale pretraining or multi-stage optimization. The model can follow explicit instructions to control speaker voice, conversation topic, conversational behaviour (e.g., backchanneling and interruptions), and dialogue initiation. We propose a single-stage training protocol and systematically analyze design choices. Both the model and training code will be released to enable reproducible research on controllable full-duplex speech systems.
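As an illustration of the control axes listed above (speaker voice, conversation topic, backchanneling, interruptions, and dialogue initiation), the following is a minimal sketch of how such an instruction might be represented and rendered as text. The class name `ConversationInstruction`, its field names, the three-level `low`/`medium`/`high` scale, and the prompt wording are all hypothetical choices made for this example; the abstract does not specify the instruction format the model actually uses.

```python
from dataclasses import dataclass

# Assumed discretization of behaviour frequency; not taken from the paper.
LEVELS = ("low", "medium", "high")


@dataclass(frozen=True)
class ConversationInstruction:
    """Hypothetical container for the control axes named in the abstract."""

    speaker_voice: str        # assumed: a label identifying the target voice
    topic: str                # conversation topic
    backchannel_rate: str     # assumed scale: one of LEVELS
    interruption_rate: str    # assumed scale: one of LEVELS
    model_speaks_first: bool  # dialogue initiation: does the model open?

    def __post_init__(self) -> None:
        # Reject values outside the assumed scale early.
        for name in ("backchannel_rate", "interruption_rate"):
            value = getattr(self, name)
            if value not in LEVELS:
                raise ValueError(f"{name} must be one of {LEVELS}, got {value!r}")

    def to_prompt(self) -> str:
        """Render the instruction as a plain-text prompt (illustrative wording)."""
        first = "the model" if self.model_speaks_first else "the user"
        return (
            f"Voice: {self.speaker_voice}. "
            f"Topic: {self.topic}. "
            f"Backchannels: {self.backchannel_rate}. "
            f"Interruptions: {self.interruption_rate}. "
            f"First speaker: {first}."
        )


if __name__ == "__main__":
    instruction = ConversationInstruction(
        speaker_voice="voice_a",
        topic="travel plans",
        backchannel_rate="high",
        interruption_rate="low",
        model_speaks_first=True,
    )
    print(instruction.to_prompt())
```

Keeping each behaviour as a separate, validated field makes it straightforward to vary one axis at a time, which is the kind of controlled comparison an evaluation of instruction following would need.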