CV · May 26

Bounded-Compute Multimodal Regression for Product-Rating Prediction

arXiv:2605.27737 · 69.0 · h-index: 2
Predicted impact: top 44% in CV · last 90 days · Originality: Synthesis-oriented
AI Analysis

Provides a strong baseline for efficient multimodal regression in resource-constrained settings, but the gains are incremental.

The authors adapt a small VLM for product-rating regression by replacing the language head with an MLP and fixing inputs, achieving 0.39 PLCC and 0.40 CES under a strict latency budget.
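For readers unfamiliar with the headline metric, PLCC is the Pearson linear correlation coefficient between predicted and ground-truth ratings. The sketch below is a minimal pure-Python illustration of that standard formula; the function name `plcc` and the sample values in the usage note are illustrative assumptions, not the challenge's official scoring code. CES is not defined in this summary, so it is not reproduced here.

```python
import math


def plcc(predicted, target):
    """Pearson linear correlation coefficient between two equal-length sequences.

    Illustrative helper only; the official challenge scorer may differ in details.
    """
    if len(predicted) != len(target) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_t)
```

A perfectly linear relationship such as `plcc([1, 2, 3], [2, 4, 6])` gives a value of 1 (up to floating-point error), so a reported 0.39 indicates a moderate positive linear association between predictions and true ratings.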

Vision-language models (VLMs) are increasingly attractive for multimodal quality assessment, but their default reliance on autoregressive text generation and dynamic visual processing is poorly matched to scalar regression under strict latency budgets. We present a bounded-compute adaptation of SmolVLM2-256M-Video-Instruct for product-rating prediction in the LoViF 2026 Efficient VLM challenge. Motivated by recent multimodal engagement-prediction results showing that feature-based regression can outperform token-based score generation, we replace the language-modeling head with a lightweight two-layer MLP fed by pooled decoder states, and we enforce deterministic inputs through fixed 384x384 images and truncated metadata. Across controlled ablations, static global image processing slightly outperforms dynamic tiling, and scaling from 100K to 16M training examples substantially improves validation correlation. Under the official held-out evaluation, our 228M-parameter model achieves 0.39 PLCC and 0.40 CES, providing a strong and reproducible baseline for resource-constrained multimodal regression.
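The core architectural change described in the abstract, a two-layer MLP fed by pooled decoder states in place of the language-modeling head, can be sketched roughly as follows. This is a NumPy illustration under stated assumptions, not the authors' implementation: the abstract does not say which pooling is used, so masked mean pooling is assumed here, and the ReLU activation, hidden width, initialization scale, and the names `masked_mean_pool` and `RegressionHead` are all hypothetical.

```python
import numpy as np


def masked_mean_pool(hidden_states, attention_mask):
    """Average decoder states over real (non-padding) tokens.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1 = real token.
    Mean pooling is an assumption; the abstract only says "pooled decoder states".
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


class RegressionHead:
    """Two-layer MLP mapping a pooled state to one scalar rating (hypothetical sizes)."""

    def __init__(self, dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, size=(dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.02, size=(hidden_dim, 1))
        self.b2 = np.zeros(1)

    def __call__(self, pooled):
        hidden = np.maximum(pooled @ self.w1 + self.b1, 0.0)  # ReLU (assumed)
        return (hidden @ self.w2 + self.b2)[:, 0]  # shape: (batch,)
```

Because the head produces the score in a single forward pass rather than through autoregressive decoding, and the inputs have a fixed size, the per-example compute is bounded, which is the property the abstract emphasizes.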

