Automatic Identification of Non-Meaningful Body-Movements and What It Reveals About Humans
This work addresses the challenge of interpreting non-verbal cues in communication between public speakers and audiences, though its contribution is incremental because it applies existing methods to a new dataset.
The study tackled the problem of automatically distinguishing meaningful from non-meaningful body movements in public speaking, achieving an AUC of up to 0.82 using linear classifiers on multimodal features. The classifier weights suggest that speakers focus more on lexical features when evaluating themselves, while audiences prioritize prosody.
We present a framework to identify whether a public speaker's body movements are meaningful or non-meaningful ("Mannerisms") in the context of their speech. From a dataset of 84 public speaking videos from 28 individuals, we extract 314 unique body movement patterns (e.g., pacing, gesturing, and shifting body weight). Online workers and the speakers themselves annotated the meaningfulness of the patterns. We extract five types of features from the audio-video recordings: disfluency, prosody, body movement, facial, and lexical. We use linear classifiers to predict the annotations, with AUC up to 0.82. Analysis of the classifier weights reveals that the classifier puts larger weights on the lexical features when predicting self-annotations. In contrast, it puts larger weights on the prosody features when predicting audience annotations. This analysis might provide a subtle hint that public speakers tend to focus more on the verbal features when evaluating their own performance. The audience, on the other hand, tends to focus more on the non-verbal aspects of the speech. The dataset and code associated with this work have been released for peer review and further analysis.
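The two quantities the abstract relies on, AUC and the relative weight a linear classifier places on each feature type, can be illustrated with a minimal sketch. This is not the authors' released code: the function names `roc_auc` and `group_weight_share`, the choice to summarize a group by its share of total absolute weight, and all numeric values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """AUC as the probability that a random positive example is scored
    above a random negative example (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def group_weight_share(weights, groups):
    """Share of the total absolute weight that falls in each feature group.
    This is one possible summary (an assumption), not the paper's stated metric."""
    totals = defaultdict(float)
    for w, g in zip(weights, groups):
        totals[g] += abs(w)
    mass = sum(totals.values())
    return {g: t / mass for g, t in totals.items()}


# Hypothetical toy values: one weight per feature, tagged with its feature type.
groups = ["disfluency", "prosody", "body", "facial", "lexical", "lexical"]
weights = [0.1, -0.2, 0.1, 0.1, 0.3, -0.2]
shares = group_weight_share(weights, groups)

# Hypothetical labels (1 = meaningful) and classifier scores for four patterns.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)

print(f"AUC = {auc:.2f}")
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:>10}: {share:.2f}")
```

Comparing such group shares between a classifier trained on self-annotations and one trained on audience annotations is one way to read the contrast the abstract describes, provided the features are on comparable scales before the weights are compared.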