CV, Sep 16, 2024

Neuromorphic Facial Analysis with Cross-Modal Supervision

Federico Becattini, Luca Cultrera, Lorenzo Berlincioni, Claudio Ferrari, Andrea Leonardo, Alberto Del Bimbo

arXiv:2409.10213v1, 8.76 citations, h-index: 61

Originality: Incremental advance

AI Analysis

This work addresses the detection of facial micro-movements for emotion inference. The advance is incremental: it leverages existing RGB expertise to bridge the domain gap with event data.

The paper tackles the challenge of analyzing subtle facial movements by introducing FACEMORPHIC, a multimodal dataset with synchronized RGB and event camera data, and demonstrates that cross-modal supervision enables effective neuromorphic face analysis without manual annotation, achieving results comparable to RGB-based methods.

Traditional approaches for analyzing RGB frames can provide a fine-grained understanding of a face from different angles by inferring emotions, poses, shapes, and landmarks. However, when it comes to subtle movements, standard RGB cameras might fall behind due to their latency, making it hard to detect micro-movements that carry highly informative cues for inferring the true emotions of a subject. To address this issue, the use of event cameras to analyze faces is gaining increasing interest. Nonetheless, the expertise matured for RGB processing is not directly transferable to neuromorphic data due to a strong domain shift and intrinsic differences in how data is represented. The lack of labeled data can be considered one of the main causes of this gap, yet gathering data is harder in the event domain: it cannot be crawled from the web, and labeling frames must take into account event aggregation rates and the fact that static parts might not be visible in certain frames. In this paper, we first present FACEMORPHIC, a multimodal, temporally synchronized face dataset comprising both RGB videos and event streams. The data is labeled at the video level with facial Action Units and also contains streams collected with a variety of applications in mind, ranging from 3D shape estimation to lip-reading. We then show how temporal synchronization can allow effective neuromorphic face analysis without the need to manually annotate videos: we instead leverage cross-modal supervision, bridging the domain gap by representing face shapes in a 3D space.
