CV AI MM Oct 26, 2025

LLM-based Fusion of Multi-modal Features for Commercial Memorability Prediction

arXiv:2510.22829v1 | h-index: 1 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses commercial memorability prediction for media analysis, but it is an incremental advance: it applies existing multimodal fusion and LLM techniques to a specific competition task.

This paper tackles commercial memorability prediction by proposing a multimodal fusion system with a Gemma-3 LLM backbone, which integrates visual and textual features using LLM-generated rationale prompts. The results show the LLM-based system achieves greater robustness and generalization on the test set compared to a gradient boosted trees baseline.

This paper addresses the prediction of commercial (brand) memorability as part of "Subtask 2: Commercial/Ad Memorability" within the "Memorability: Predicting movie and commercial memorability" task at the MediaEval 2025 workshop competition. We propose a multimodal fusion system with a Gemma-3 LLM backbone that integrates pre-computed visual (ViT) and textual (E5) features via multimodal projections. The model is adapted using Low-Rank Adaptation (LoRA). A heavily tuned ensemble of gradient boosted trees serves as a baseline. A key contribution is the use of LLM-generated rationale prompts, grounded in expert-derived aspects of memorability, to guide the fusion model. The results demonstrate that the LLM-based system exhibits greater robustness and generalization performance on the final test set than the baseline. The paper's codebase can be found at https://github.com/dsgt-arc/mediaeval-2025-memorability
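The two mechanisms named in the abstract, projecting pre-computed features into the LLM's embedding space and adapting a frozen weight with LoRA, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the toy dimensions, the helper names (`matmul`, `make_matrix`, `project`, `lora_forward`), and the choice to treat projected features as a concatenated sequence of soft tokens are all assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def make_matrix(rows, cols, rng, scale=0.1):
    """Random Gaussian matrix (stand-in for features or weights)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def project(features, weight):
    """Linear projection: (n, d_in) x (d_in, d_model) -> (n, d_model)."""
    return matmul(features, weight)


def lora_forward(x, w_frozen, lora_a, lora_b, alpha):
    """Compute y = x W + (alpha / r) * x A B; W is frozen, only A and B train."""
    rank = len(lora_a[0])
    scale = alpha / rank
    base = matmul(x, w_frozen)
    delta = matmul(matmul(x, lora_a), lora_b)
    return [[b + scale * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]


rng = random.Random(0)
# Hypothetical toy sizes; real ViT, E5, and Gemma-3 dimensions are far larger.
D_VIT, D_E5, D_MODEL, RANK = 6, 4, 8, 2

visual = make_matrix(3, D_VIT, rng)   # three pre-computed visual feature vectors
textual = make_matrix(1, D_E5, rng)   # one pre-computed text embedding

w_vis = make_matrix(D_VIT, D_MODEL, rng)  # visual projection weights
w_txt = make_matrix(D_E5, D_MODEL, rng)   # textual projection weights

# Assumed fusion: concatenate the projected features along the sequence axis.
fused = project(visual, w_vis) + project(textual, w_txt)

w_frozen = make_matrix(D_MODEL, D_MODEL, rng)     # stands in for a frozen LLM weight
lora_a = make_matrix(D_MODEL, RANK, rng)          # down-projection, random init
lora_b = [[0.0] * D_MODEL for _ in range(RANK)]   # up-projection, zero init

out = lora_forward(fused, w_frozen, lora_a, lora_b, alpha=4.0)
```

Because the up-projection starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only has to update the two small low-rank factors rather than the full weight matrix.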

