CV · AI · Jan 28

Physically Guided Visual Mass Estimation from a Single RGB Image

arXiv:2601.20303v1 · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses visual mass estimation for robotics and industrial applications, but the advance is incremental: it builds on existing components such as monocular depth estimation and vision-language models.

The paper estimates object mass from a single RGB image by explicitly addressing the ambiguity between volume and density, and it reports consistently outperforming state-of-the-art methods on the image2mass and ABO-500 datasets.

Estimating object mass from visual input is challenging because mass depends jointly on geometric volume and material-dependent density, neither of which is directly observable from RGB appearance. Consequently, mass prediction from pixels is ill-posed and therefore benefits from physically meaningful representations to constrain the space of plausible solutions. We propose a physically structured framework for single-image mass estimation that addresses this ambiguity by aligning visual cues with the physical factors governing mass. From a single RGB image, we recover object-centric three-dimensional geometry via monocular depth estimation to inform volume and extract coarse material semantics using a vision-language model to guide density-related reasoning. These geometry, semantic, and appearance representations are fused through an instance-adaptive gating mechanism, and two physically guided latent factors (volume- and density-related) are predicted through separate regression heads under mass-only supervision. Experiments on image2mass and ABO-500 show that the proposed method consistently outperforms state-of-the-art methods.
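
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the gate is a softmax over one learned score per modality, that the two heads are linear and output log-volume and log-density, and that the factors combine through the physical identity mass = density × volume (so log-mass is the sum of the two head outputs). All names here (`gated_fusion`, `predict_log_mass`, `mass_only_loss`) and the random feature vectors are hypothetical placeholders; in the paper the features come from a depth estimator, a vision-language model, and an appearance encoder.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(weights, vector):
    return sum(w * v for w, v in zip(weights, vector))


def gated_fusion(features, gate_weights):
    """Fuse per-modality feature vectors with instance-adaptive gates.

    features: equal-length vectors, e.g. [geometry, semantic, appearance].
    gate_weights: one scoring vector per modality (assumed gate form).
    """
    scores = [dot(w, f) for w, f in zip(gate_weights, features)]
    gates = softmax(scores)  # depends on this instance's features, sums to 1
    dim = len(features[0])
    fused = [sum(g * f[i] for g, f in zip(gates, features)) for i in range(dim)]
    return fused, gates


def predict_log_mass(fused, w_volume, w_density):
    """Two separate linear heads; log m = log V + log rho (assumed combination)."""
    log_volume = dot(w_volume, fused)
    log_density = dot(w_density, fused)
    return log_volume + log_density, log_volume, log_density


def mass_only_loss(log_mass_pred, mass_true):
    """Squared error in log space; only the mass label is used for supervision."""
    return (log_mass_pred - math.log(mass_true)) ** 2


# Toy usage with random stand-in features and weights.
rng = random.Random(0)
dim = 4
features = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
gate_weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
w_volume = [rng.uniform(-1, 1) for _ in range(dim)]
w_density = [rng.uniform(-1, 1) for _ in range(dim)]

fused, gates = gated_fusion(features, gate_weights)
log_mass, log_volume, log_density = predict_log_mass(fused, w_volume, w_density)
loss = mass_only_loss(log_mass, mass_true=0.5)  # 0.5 kg is an arbitrary label
print(gates, log_mass, loss)
```

Because the loss only sees the sum of the two head outputs, nothing in this sketch forces the individual heads to match true volume or density. That is consistent with the abstract's description of them as volume- and density-related latent factors rather than directly supervised quantities.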

