Detecting In-Person Conversations in Noisy Real-World Environments with Smartwatch Audio and Motion Sensing
This work addresses the challenge of monitoring social interactions for applications in behavioral science or health, but its contribution is incremental because it builds on existing multimodal sensing methods.
The paper tackles the problem of detecting in-person conversations in noisy environments using smartwatch audio and motion data, achieving an 82.0% macro F1-score in the lab setting and 77.2% in the semi-naturalistic setting.
Social interactions play a crucial role in shaping human behavior, relationships, and societies. They encompass various forms of communication, such as verbal conversation, non-verbal gestures, facial expressions, and body language. In this work, we develop a novel computational approach to detect a foundational aspect of human social interactions, in-person verbal conversations, by leveraging audio and inertial data captured with a commodity smartwatch in acoustically challenging scenarios. To evaluate our approach, we conducted a lab study with 11 participants and a semi-naturalistic study with 24 participants. We analyzed machine learning and deep learning models with three different fusion methods, showing the advantages of fusing audio and inertial data to capture not only verbal cues but also non-verbal gestures in conversations. Furthermore, we performed a comprehensive set of evaluations across activities and sampling rates to demonstrate the benefits of multimodal sensing in specific contexts. Overall, our framework achieved an 82.0$\pm$3.0% macro F1-score when detecting conversations in the lab and 77.2$\pm$1.8% in the semi-naturalistic setting.
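For readers unfamiliar with the reported metric, the following is a minimal sketch of how a macro F1-score is computed: the F1-score is calculated separately for each class and the per-class scores are averaged with equal weight, so the minority class counts as much as the majority class. The function name `macro_f1` and the toy labels (1 for conversation, 0 for no conversation) are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)

    # Every class contributes equally, regardless of how many samples it has.
    return sum(f1_scores) / len(f1_scores)


# Toy labels (assumed): 1 = conversation window, 0 = no conversation.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
score = macro_f1(y_true, y_pred)
```

In this toy case the conversation class has an F1 of 2/3 and the non-conversation class has an F1 of 0.8, so the macro average is about 0.733.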