JTAV: Jointly Learning Social Media Content Representation by Fusing Textual, Acoustic, and Visual Features
This work addresses the need for better content representation in applications such as information retrieval and recommendation systems, but it is incremental in that it extends existing multi-modal fusion methods.
The paper tackles the problem of learning social media content representations by jointly fusing textual, acoustic, and visual features, and reports that the proposed model outperforms state-of-the-art approaches by a large margin.
Learning social media content is the basis of many real-world applications, including information retrieval and recommendation systems. In contrast with previous works that focus mainly on single-modal or bi-modal learning, we propose to learn social media content by jointly fusing textual, acoustic, and visual information (JTAV). We propose effective strategies, namely attBiGRU and DCRNN, to extract fine-grained features of each modality. We also introduce cross-modal fusion and attentive pooling techniques to integrate multi-modal information comprehensively. Extensive experimental evaluation conducted on real-world datasets demonstrates that our proposed model outperforms the state-of-the-art approaches by a large margin.
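The abstract names attentive pooling without giving its form. As a rough illustration only, the sketch below shows a generic dot-product attention pooling over one feature vector per modality. It is not the paper's formulation: the function names, the `query` vector (which would normally be learned), and the toy three-dimensional features are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(features, query):
    """Pool several equal-length feature vectors into one.

    Each vector is scored by its dot product with `query` (hypothetical,
    normally a learned parameter); the scores are normalized with softmax
    and used as weights in a weighted sum of the vectors.
    """
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights


# Toy stand-ins for textual, acoustic, and visual features (assumed values).
text_vec = [1.0, 0.0, 0.0]
audio_vec = [0.0, 1.0, 0.0]
image_vec = [0.0, 0.0, 1.0]

# An all-zero query scores every modality equally, so the weights are uniform.
query = [0.0, 0.0, 0.0]
pooled, weights = attentive_pool([text_vec, audio_vec, image_vec], query)
```

With a non-zero query, modalities whose features align with it receive larger weights, which is the general idea behind letting a model emphasize the most informative modality for a given item.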