After All, Only The Last Neuron Matters: Comparing Multi-modal Fusion Functions for Scene Graph Generation
This work addresses a specific bottleneck in Scene Graph Generation models, offering an incremental improvement for researchers and practitioners in computer vision.
The paper tackles the problem of selecting the best fusion function for combining hidden states in Scene Graph Generation, and finds that the DIST function achieves the highest Recall @ K scores, making it part of the state of the art.
From object segmentation to word vector representations, Scene Graph Generation (SGG) has become a complex task built upon numerous research results. In this paper, we focus on the last module of the SGG model: the fusion function, whose role is to combine three hidden states. We perform an ablation study to compare different implementations. First, we reproduce the state-of-the-art results using the SUM and GATE functions. Then we extend the original solution with additional model-agnostic functions: an adapted version of DIST and a mixture of MFB and GATE. Under the state-of-the-art configuration, DIST achieved the best Recall @ K, which makes it part of the state of the art.
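To make the role of a fusion function concrete, the following minimal Python sketch combines three hidden states in two ways. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the names `h_a`, `h_b`, `h_c`, `fuse_sum`, and `fuse_gate` are hypothetical, the SUM variant is assumed to be an element-wise sum, and the gated variant is one generic sigmoid gating scheme that may differ from the GATE function used in the experiments. DIST and MFB are not sketched because their formulations are not given here.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, used here to build a gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_sum(h_a, h_b, h_c):
    """Assumed SUM-style fusion: element-wise sum of three hidden states."""
    return [a + b + c for a, b, c in zip(h_a, h_b, h_c)]


def fuse_gate(h_a, h_b, h_c):
    """Illustrative gated fusion (an assumption, not the paper's GATE):
    h_a is scaled element-wise by a sigmoid gate computed from h_b + h_c."""
    return [a * sigmoid(b + c) for a, b, c in zip(h_a, h_b, h_c)]


# Placeholder hidden states with three dimensions each (made-up values).
h_a = [0.5, -1.0, 2.0]
h_b = [1.0, 0.0, -0.5]
h_c = [-0.5, 0.0, 0.5]

sum_scores = fuse_sum(h_a, h_b, h_c)
gate_scores = fuse_gate(h_a, h_b, h_c)
```

Both functions take the same three inputs and return a vector of the same size, which is what allows one fusion function to be swapped for another in an ablation without touching the rest of the model.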