PortionNet: Distilling 3D Geometric Knowledge for Food Nutrition Estimation
This work addresses accurate food nutrition estimation for smartphone users without depth sensors. The contribution is incremental, since it builds on existing knowledge distillation and 3D reasoning methods.
The paper tackles food nutrition estimation from single images by proposing PortionNet, a cross-modal knowledge distillation framework that learns geometric features from point clouds during training but uses only RGB images at inference. It achieves state-of-the-art performance on MetaFood3D, improving both volume and energy estimation.
Accurate food nutrition estimation from single images is challenging due to the loss of 3D information. While depth-based methods provide reliable geometry, they remain inaccessible on most smartphones because of depth-sensor requirements. To overcome this challenge, we propose PortionNet, a novel cross-modal knowledge distillation framework that learns geometric features from point clouds during training while requiring only RGB images at inference. Our approach employs a dual-mode training strategy where a lightweight adapter network mimics point cloud representations, enabling pseudo-3D reasoning without any specialized hardware requirements. PortionNet achieves state-of-the-art performance on MetaFood3D, outperforming all previous methods in both volume and energy estimation. Cross-dataset evaluation on SimpleFood45 further demonstrates strong generalization in energy estimation.
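The dual-mode training idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the architecture, the loss functions, or their weighting. The sketch assumes a two-layer MLP adapter, a mean-squared-error feature-matching loss, a single linear regression head, and a weight `lam`. All function and parameter names (`adapter`, `head`, `training_losses`, `init_params`, `lam`) are hypothetical, and the feature vectors stand in for the outputs of RGB and point-cloud encoders that are not shown.

```python
import random


def linear(x, weights, bias):
    """Apply y = W x + b to a single feature vector (plain lists of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def init_params(rgb_dim, hidden_dim, geo_dim, seed=0):
    """Small random weights; dimensions are illustrative assumptions."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": mat(hidden_dim, rgb_dim), "b1": [0.0] * hidden_dim,
        "w2": mat(geo_dim, hidden_dim), "b2": [0.0] * geo_dim,
        "w_head": mat(1, rgb_dim + geo_dim), "b_head": [0.0],
    }


def adapter(rgb_feat, params):
    """Assumed lightweight adapter: maps RGB features to pseudo point-cloud features."""
    hidden = relu(linear(rgb_feat, params["w1"], params["b1"]))
    return linear(hidden, params["w2"], params["b2"])


def head(rgb_feat, geo_feat, params):
    """Regress one scalar (e.g. volume) from concatenated RGB and geometric features."""
    return linear(rgb_feat + geo_feat, params["w_head"], params["b_head"])[0]


def training_losses(rgb_feat, pc_feat, target, params, lam=1.0):
    """Compute losses for one sample in both training modes."""
    pseudo = adapter(rgb_feat, params)
    distill = mse(pseudo, pc_feat)              # adapter mimics point-cloud features
    pred_3d = head(rgb_feat, pc_feat, params)   # mode 1: real point-cloud features
    pred_rgb = head(rgb_feat, pseudo, params)   # mode 2: adapter (pseudo-3D) features
    task = (pred_3d - target) ** 2 + (pred_rgb - target) ** 2
    return {"distill": distill, "task": task, "total": task + lam * distill}


def predict_rgb_only(rgb_feat, params):
    """Inference path: no point cloud is needed, only RGB features."""
    return head(rgb_feat, adapter(rgb_feat, params), params)
```

In this sketch the point-cloud features appear only inside `training_losses`; `predict_rgb_only` never touches them, which mirrors the claim that inference requires RGB input alone. A real system would compute gradients of `total` with an autodiff framework and train the encoders jointly, which is omitted here.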