CV · Oct 3, 2018

Image and Encoded Text Fusion for Multi-Modal Classification

arXiv:1810.02001v1 · 49 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses multi-modal classification problems for real-world scenarios, but it is incremental as it builds on existing CNNs and fusion methods.

The paper tackles multi-modal classification by fusing encoded text onto images to create information-enriched images, using CNNs for classification, and reports encouraging results on two large-scale datasets compared to individual sources and other fusion strategies.

Multi-modal approaches employ data from multiple input streams, such as the textual and visual domains. Deep neural networks have been successfully employed for these approaches. In this paper, we present a novel multi-modal approach that fuses images and text descriptions to improve multi-modal classification performance in real-world scenarios. The proposed approach embeds encoded text onto an image to obtain an information-enriched image. To learn feature representations of the resulting images, standard Convolutional Neural Networks (CNNs) are employed for the classification task. We demonstrate how a CNN-based pipeline can be used to learn representations from the novel fusion approach. We compare our approach with the individual sources on two large-scale multi-modal classification datasets and obtain encouraging results. Furthermore, we evaluate our approach against two well-known multi-modal strategies, namely early fusion and late fusion.
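
The idea of writing encoded text into an image, and the two baseline strategies named in the abstract, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the hashing encoder `encode_text`, the 8x8 top-left patch layout in `embed_text_on_image`, and the helpers `early_fusion` and `late_fusion` are hypothetical names and choices made only for illustration, and the abstract does not specify the encoder or where the encoding is placed.

```python
import hashlib

import numpy as np


def encode_text(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical stand-in encoder: hashed bag-of-words scaled to [0, 1]."""
    vec = np.zeros(dim, dtype=np.float32)
    for token in text.lower().split():
        # Hash each token to a stable bucket index.
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    if vec.max() > 0:
        vec /= vec.max()
    return vec


def embed_text_on_image(image: np.ndarray, text_vec: np.ndarray,
                        patch: int = 8) -> np.ndarray:
    """Overwrite the top-left patch of an (H, W, C) uint8 image with the encoding.

    The patch position and size are assumptions made for this sketch.
    """
    if text_vec.size != patch * patch:
        raise ValueError("text vector must have patch * patch entries")
    fused = image.copy()
    block = np.round(text_vec.reshape(patch, patch) * 255).astype(np.uint8)
    fused[:patch, :patch, :] = block[:, :, None]  # same values in every channel
    return fused


def early_fusion(image_features: np.ndarray, text_features: np.ndarray) -> np.ndarray:
    """Early fusion baseline: concatenate per-modality features before classifying."""
    return np.concatenate([image_features, text_features], axis=-1)


def late_fusion(image_probs: np.ndarray, text_probs: np.ndarray,
                weight: float = 0.5) -> np.ndarray:
    """Late fusion baseline: weighted average of per-modality class probabilities."""
    return weight * image_probs + (1.0 - weight) * text_probs


if __name__ == "__main__":
    img = np.zeros((32, 32, 3), dtype=np.uint8)
    enriched = embed_text_on_image(img, encode_text("red leather shoes"))
    print(enriched.shape, enriched.dtype)
```

The enriched array keeps the original image shape, so under these assumptions it could be fed to an unmodified image CNN, whereas early and late fusion each require a separate model or feature extractor per modality.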

Code Implementations: 1 repo