CV
May 2, 2018

Learnable PINs: Cross-Modal Embeddings for Person Identity

arXiv:1805.00833v2
169 citations
Originality: Incremental advance
AI Analysis

The joint face-voice embedding enables applications such as character retrieval in TV dramas. The advance is rated incremental because it builds on existing embedding methods, adding a curriculum learning schedule for hard negative mining.

The paper tackles cross-modal retrieval between faces and voices for person identity. It learns the embedding without identity labels, using cross-modal self-supervision from videos of talking faces, and establishes a benchmark for retrieving identities unseen and unheard during training.

We propose and investigate an identity sensitive joint embedding of face and voice. Such an embedding enables cross-modal retrieval from voice to face and from face to voice. We make the following four contributions: first, we show that the embedding can be learnt from videos of talking faces, without requiring any identity labels, using a form of cross-modal self-supervision; second, we develop a curriculum learning schedule for hard negative mining targeted to this task, that is essential for learning to proceed successfully; third, we demonstrate and evaluate cross-modal retrieval for identities unseen and unheard during training over a number of scenarios and establish a benchmark for this novel task; finally, we show an application of using the joint embedding for automatically retrieving and labelling characters in TV dramas.
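The idea of a curriculum over hard negatives can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a margin-based contrastive loss, a batch in which row i of the face and voice matrices comes from the same video (the positive pair), and a hypothetical hardness parameter `tau` in [0, 1] that a training loop would raise over time. The function names, the margin value, and the rank-based selection rule are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps guards against division by zero)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def select_negatives(faces, voices, tau):
    """For each face i, pick one voice j != i as its negative.

    Assumed rule: candidates are sorted from farthest (easy) to nearest
    (hard), and tau in [0, 1] selects a rank in that ordering, so tau=0
    gives the easiest negative and tau=1 the hardest.
    """
    n = faces.shape[0]
    # Pairwise Euclidean distances, shape (n, n).
    dists = np.linalg.norm(faces[:, None, :] - voices[None, :, :], axis=2)
    neg_idx = np.empty(n, dtype=int)
    for i in range(n):
        candidates = np.array([j for j in range(n) if j != i])
        order = candidates[np.argsort(-dists[i, candidates])]
        rank = int(round(tau * (len(order) - 1)))
        neg_idx[i] = order[rank]
    return neg_idx


def contrastive_loss(faces, voices, neg_idx, margin=0.6):
    """Pull matching face/voice pairs together, push negatives past the margin."""
    pos_d = np.linalg.norm(faces - voices, axis=1)
    neg_d = np.linalg.norm(faces - voices[neg_idx], axis=1)
    return float(np.mean(pos_d ** 2 + np.maximum(0.0, margin - neg_d) ** 2) / 2)
```

In a training loop under these assumptions, `tau` would start near 0 and be increased as the loss plateaus, so the network sees progressively harder negatives. No identity labels are needed: the only supervision is that a face and a voice in the same row co-occur in one video.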

Code Implementations: 1 repo