IR · CL · CV · SI · May 17, 2019

Deep Unified Multimodal Embeddings for Understanding both Content and Users in Social Media Networks

arXiv:1905.07075v3 · 2 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of analyzing multimodal social media data for applications such as content retrieval and user modeling. It is an incremental advance that builds on existing embedding techniques.

The paper proposes a unified multimodal embedding framework that jointly models users, images, and text in a common space. The model outperforms unimodal embedding methods on cross-modal retrieval tasks on Twitter data, and its user embeddings predict user interests on Instagram data better than embeddings learned from unimodal content.

There has been an explosion of multimodal content generated on social media networks in the last few years, which has necessitated a deeper understanding of social media content and user behavior. We present a novel content-independent content-user-reaction model for social multimedia content analysis. Compared to prior works that generally tackle semantic content understanding and user behavior modeling in isolation, we propose a generalized solution to these problems within a unified framework. We embed users, images and text drawn from open social media in a common multimodal geometric space, using a novel loss function designed to cope with distant and disparate modalities, and thereby enable seamless three-way retrieval. Our model not only outperforms unimodal embedding based methods on cross-modal retrieval tasks but also shows improvements stemming from jointly solving the two tasks on Twitter data. We also show that the user embeddings learned within our joint multimodal embedding model are better at predicting user interests compared to those learned with unimodal content on Instagram data. Our framework thus goes beyond the prior practice of using explicit leader-follower link information to establish affiliations by extracting implicit content-centric affiliations from isolated users. We provide qualitative results to show that the user clusters emerging from learned embeddings have consistent semantics and the ability of our model to discover fine-grained semantics from noisy and unstructured data. Our work reveals that social multimodal content is inherently multimodal and possesses a consistent structure because in social networks meaning is created through interactions between users and content.
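
The idea of three-way retrieval in a common space can be illustrated with a minimal sketch. It assumes that users, images, and text have already been mapped to vectors of the same dimension; the toy vectors, the function names `retrieve` and `margin_ranking_loss`, and the margin value are all hypothetical. The loss shown is a generic max-margin ranking loss used as a stand-in, not the novel loss function the abstract refers to.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def retrieve(query, candidates, k=1):
    """Return indices of the k candidates most cosine-similar to the query."""
    sims = l2_normalize(candidates) @ l2_normalize(query[None, :])[0]
    return np.argsort(-sims)[:k]

def margin_ranking_loss(anchor, positive, negative, margin=0.2):
    """Generic hinge loss (an assumption, not the paper's loss): the positive
    should be closer to the anchor than the negative by at least `margin`."""
    a, p, n = (l2_normalize(v[None, :])[0] for v in (anchor, positive, negative))
    return max(0.0, margin - float(a @ p) + float(a @ n))

# Hypothetical 2-D embeddings: row 0 of each modality shares one topic, row 1 another.
images = np.array([[1.0, 0.0], [0.0, 1.0]])
texts = np.array([[0.9, 0.1], [0.1, 0.9]])
users = np.array([[0.8, 0.2], [0.2, 0.8]])

# Any modality can query any other because all live in the same space.
user_to_image = retrieve(users[0], images)   # images nearest to user 0
image_to_text = retrieve(images[1], texts)   # texts nearest to image 1
text_to_user = retrieve(texts[1], users)     # users nearest to text 1

# Zero when the matching image already ranks above the mismatched one.
loss_ok = margin_ranking_loss(users[0], images[0], images[1])
# Positive when the pair is ordered the wrong way round.
loss_bad = margin_ranking_loss(users[0], images[1], images[0])
```

Because every modality shares one space, the same `retrieve` function serves user-to-image, image-to-text, and text-to-user queries without modality-specific code.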

