CV AI Dec 31, 2025

Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control

arXiv:2512.24826v1 3.6 h-index: 2

Originality: Incremental advance

AI Analysis

The paper addresses the dimensional shift that cross-modal AI systems face when processing 3D scenes. The advance appears incremental, since it builds on existing vision-language models and control modules.

The paper tackles the problem of adapting 2D-trained cross-modal systems to 3D multi-object scenes. It introduces a method that uses derivative-free optimization with regret minimization to improve multivariate mutual information estimates, which lets such systems adapt online to occlusions and differentiate features without pretraining or finetuning.

Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control module. We introduce a new method that improves multivariate mutual information estimates by regret minimisation with derivative-free optimisation. Our algorithm enables off-the-shelf cross-modal systems trained on 2D visual inputs to adapt online to object occlusions and differentiate features. The pairing of expressive measures and value-based optimisation assists control of an in-scene camera to learn directly from the noisy outputs of vision-language models. The resulting pipeline improves performance in cross-modal tasks on multi-object 3D scenes without resorting to pretraining or finetuning.
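The abstract does not spell out the algorithm, so the following is only a rough illustration of the general idea of derivative-free, regret-minimising control from noisy scores. It swaps in the standard UCB1 bandit rule over a small set of discrete camera views. The function names, the toy score function, and the per-view quality values are all hypothetical. The sketch does not estimate mutual information and should not be read as the authors' method.

```python
import math
import random


def ucb1_select_view(noisy_score, n_views, n_rounds, seed=0):
    """Choose camera views with UCB1, a derivative-free bandit rule.

    noisy_score(view, rng) -> float in [0, 1] is a hypothetical stand-in
    for a score derived from the outputs of a vision-language model.
    Returns (pull counts per view, running mean score per view).
    """
    rng = random.Random(seed)
    counts = [0] * n_views
    means = [0.0] * n_views
    for t in range(1, n_rounds + 1):
        if t <= n_views:
            # Try every view once so each count is non-zero.
            view = t - 1
        else:
            # Pick the view with the highest optimistic estimate.
            view = max(
                range(n_views),
                key=lambda v: means[v] + math.sqrt(2.0 * math.log(t) / counts[v]),
            )
        reward = noisy_score(view, rng)
        counts[view] += 1
        # Incremental update of the running mean; no gradients are used.
        means[view] += (reward - means[view]) / counts[view]
    return counts, means


def toy_score(view, rng):
    # Assumed toy scene: view 2 is unoccluded, the others are partly occluded.
    true_quality = [0.2, 0.4, 0.8, 0.3]
    noise = rng.uniform(-0.1, 0.1)
    return min(1.0, max(0.0, true_quality[view] + noise))


counts, means = ucb1_select_view(toy_score, n_views=4, n_rounds=500)
best_view = max(range(4), key=lambda v: counts[v])
```

In this toy setting the controller spends most of its rounds on the view with the highest underlying score, using only the noisy scalar feedback and no model pretraining or finetuning.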
