Fengqing Zhu

CV
h-index50
79papers
1,364citations
Novelty43%
AI Score56

79 Papers

41.5CVMay 27
Evaluating the Feasibility of Inferring Dietary Behavior Change Receptivity from Egocentric Images of Eating Environment

Long Li, Yuning Huang, Heather A. Eicher-Miller et al.

Accurately assessing dietary behavior change receptivity is essential for designing effective just-in-time adaptive interventions (JITAIs) that promote healthier eating habits. However, self-report-based assessment of behavior change receptivity is sparse and delayed, limiting its practical use in continuous monitoring. To explore whether passive sensing may help address this challenge, this study conducts a pilot investigation of inferring participants' self-reported behavior change receptivity from egocentric eating images collected by a wearable camera. We use pilot data obtained from free-living eating episodes using the Automatic Ingestion Monitor v2 (AIM-2). The data included egocentric image sequences captured during eating and paired with responses to questions assessing specific dimensions of behavior change receptivity (awareness, interaction capability, and motivation). To examine whether visual information contained any relevancy to these responses, we evaluated a transfer-learning-assisted framework that combines a pre-trained Contrastive Language-Image Pre-Training (CLIP) vision encoder with a lightweight transformer classifier. The model processes eating episode image sequences to extract potential semantic and temporal cues related to behavior change receptivity. Preliminary experimental results show promising improvements over simple baseline models for behavior change receptivity indicators. These early findings suggest that egocentric eating episode images may contain cues related to dietary behavior change receptivity, and warrant further investigation with larger and more comprehensive datasets.

CVJul 6, 2022Code
3DG-STFM: 3D Geometric Guided Student-Teacher Feature Matching

Runyu Mao, Chen Bai, Yatong An et al.

We tackle the essential task of finding dense visual correspondences between a pair of images. This is a challenging problem due to various factors such as poor texture, repetitive patterns, illumination variation, and motion blur in practical scenarios. In contrast to methods that use dense correspondence ground-truths as direct supervision for local feature matching training, we train 3DG-STFM: a multi-modal matching model (Teacher) to enforce the depth consistency under 3D dense correspondence supervision and transfer the knowledge to 2D unimodal matching model (Student). Both teacher and student models consist of two transformer-based matching modules that obtain dense correspondences in a coarse-to-fine manner. The teacher model guides the student model to learn RGB-induced depth information for the matching purpose on both coarse and fine branches. We also evaluate 3DG-STFM on a model compression task. To the best of our knowledge, 3DG-STFM is the first student-teacher learning method for the local feature matching task. The experiments show that our method outperforms state-of-the-art methods on indoor and outdoor camera pose estimations, and homography estimation problems. Code is available at: https://github.com/Ryan-prime/3DG-STFM.

IVAug 27, 2022Code
Lossy Image Compression with Quantized Hierarchical VAEs

Zhihao Duan, Ming Lu, Zhan Ma et al.

Recent research has shown a strong theoretical connection between variational autoencoders (VAEs) and the rate-distortion theory. Motivated by this, we consider the problem of lossy image compression from the perspective of generative modeling. Starting with ResNet VAEs, which are originally designed for data (image) distribution modeling, we redesign their latent variable model using a quantization-aware posterior and prior, enabling easy quantization and entropy coding at test time. Along with improved neural network architecture, we present a powerful and efficient model that outperforms previous methods on natural image lossy compression. Our model compresses images in a coarse-to-fine fashion and supports parallel encoding and decoding, leading to fast execution on GPUs. Code is available at https://github.com/duanzhiihao/lossy-vae.

CVJan 19, 2023
Improving Food Detection For Images From a Wearable Egocentric Camera

Yue Han, Sri Kalyan Yarlagadda, Tonmoy Ghosh et al.

Diet is an important aspect of our health. Good dietary habits can contribute to the prevention of many diseases and improve the overall quality of life. To better understand the relationship between diet and health, image-based dietary assessment systems have been developed to collect dietary information. We introduce the Automatic Ingestion Monitor (AIM), a device that can be attached to one's eye glasses. It provides an automated hands-free approach to capture eating scene images. While AIM has several advantages, images captured by the AIM are sometimes blurry. Blurry images can significantly degrade the performance of food image analysis such as food detection. In this paper, we propose an approach to pre-process images collected by the AIM imaging sensor by rejecting extremely blurry images to improve the performance of food detection.

CVOct 26, 2022
Long-tailed Food Classification

Jiangpeng He, Luotao Lin, Heather Eicher-Miller et al.

Food classification serves as the basic step of image-based dietary assessment to predict the types of foods in each input image. However, food image predictions in a real world scenario are usually long-tail distributed among different food classes, which cause heavy class-imbalance problems and a restricted performance. In addition, none of the existing long-tailed classification methods focus on food data, which can be more challenging due to the lower inter-class and higher intra-class similarity among foods. In this work, we first introduce two new benchmark datasets for long-tailed food classification including Food101-LT and VFN-LT where the number of samples in VFN-LT exhibits the real world long-tailed food distribution. Then we propose a novel 2-Phase framework to address the problem of class-imbalance by (1) undersampling the head classes to remove redundant samples along with maintaining the learned information through knowledge distillation, and (2) oversampling the tail classes by performing visual-aware data augmentation. We show the effectiveness of our method by comparing with existing state-of-the-art long-tailed classification methods and show improved performance on both Food101-LT and VFN-LT benchmarks. The results demonstrate the potential to apply our method to related real life applications.

CVAug 3, 2023
An End-to-end Food Portion Estimation Framework Based on Shape Reconstruction from Monocular Image

Zeman Shao, Gautham Vinod, Jiangpeng He et al.

Dietary assessment is a key contributor to monitoring health status. Existing self-report methods are tedious and time-consuming with substantial biases and errors. Image-based food portion estimation aims to estimate food energy values directly from food images, showing great potential for automated dietary assessment solutions. Existing image-based methods either use a single-view image or incorporate multi-view images and depth information to estimate the food energy, which either has limited performance or creates user burdens. In this paper, we propose an end-to-end deep learning framework for food energy estimation from a monocular image through 3D shape reconstruction. We leverage a generative model to reconstruct the voxel representation of the food object from the input image to recover the missing 3D information. Our method is evaluated on a publicly available food image dataset Nutrition5k, resulting a Mean Absolute Error (MAE) of 40.05 kCal and Mean Absolute Percentage Error (MAPE) of 11.47% for food energy estimation. Our method uses RGB image as the only input at the inference stage and achieves competitive results compared to the existing method requiring both RGB and depth information.

CVJul 1, 2023
Long-Tailed Continual Learning For Visual Food Recognition

Jiangpeng He, Xiaoyan Zhang, Luotao Lin et al.

Deep learning-based food recognition has made significant progress in predicting food types from eating occasion images. However, two key challenges hinder real-world deployment: (1) continuously learning new food classes without forgetting previously learned ones, and (2) handling the long-tailed distribution of food images, where a few common classes and many more rare classes. To address these, food recognition methods should focus on long-tailed continual learning. In this work, We introduce a dataset that encompasses 186 American foods along with comprehensive annotations. We also introduce three new benchmark datasets, VFN186-LT, VFN186-INSULIN and VFN186-T2D, which reflect real-world food consumption for healthy populations, insulin takers and individuals with type 2 diabetes without taking insulin. We propose a novel end-to-end framework that improves the generalization ability for instance-rare food classes using a knowledge distillation-based predictor to avoid misalignment of representation during continual learning. Additionally, we introduce an augmentation technique by integrating class-activation-map (CAM) and CutMix to improve generalization on instance-rare food classes. Our method, evaluated on Food101-LT, VFN-LT, VFN186-LT, VFN186-INSULIN, and VFN186-T2DM, shows significant improvements over existing methods. An ablation study highlights further performance enhancements, demonstrating its potential for real-world food recognition applications.

CVJan 12, 2023
Online Class-Incremental Learning For Real-World Food Image Classification

Siddeshwar Raghavan, Jiangpeng He, Fengqing Zhu

Food image classification is essential for monitoring health and tracking dietary in image-based dietary assessment methods. However, conventional systems often rely on static datasets with fixed classes and uniform distribution. In contrast, real-world food consumption patterns, shaped by cultural, economic, and personal influences, involve dynamic and evolving data. Thus, require the classification system to cope with continuously evolving data. Online Class Incremental Learning (OCIL) addresses the challenge of learning continuously from a single-pass data stream while adapting to the new knowledge and reducing catastrophic forgetting. Experience Replay (ER) based OCIL methods store a small portion of previous data and have shown encouraging performance. However, most existing OCIL works assume that the distribution of encountered data is perfectly balanced, which rarely happens in real-world scenarios. In this work, we explore OCIL for real-world food image classification by first introducing a probabilistic framework to simulate realistic food consumption scenarios. Subsequently, we present an attachable Dynamic Model Update (DMU) module designed for existing ER methods, which enables the selection of relevant images for model training, addressing challenges arising from data repetition and imbalanced sample occurrences inherent in realistic food consumption patterns within the OCIL framework. Our performance evaluation demonstrates significant enhancements compared to established ER methods, showing great potential for lifelong learning in real-world food image classification scenarios. The code of our method is publicly accessible at https://gitlab.com/viper-purdue/OCIL-real-world-food-image-classification

CVJul 12, 2024
MetaFood CVPR 2024 Challenge on Physically Informed 3D Food Reconstruction: Methods and Results

Jiangpeng He, Yuhao Chen, Gautham Vinod et al.

The increasing interest in computer vision applications for nutrition and dietary monitoring has led to the development of advanced 3D reconstruction techniques for food items. However, the scarcity of high-quality data and limited collaboration between industry and academia have constrained progress in this field. Building on recent advancements in 3D reconstruction, we host the MetaFood Workshop and its challenge for Physically Informed 3D Food Reconstruction. This challenge focuses on reconstructing volume-accurate 3D models of food items from 2D images, using a visible checkerboard as a size reference. Participants were tasked with reconstructing 3D models for 20 selected food items of varying difficulty levels: easy, medium, and hard. The easy level provides 200 images, the medium level provides 30 images, and the hard level provides only 1 image for reconstruction. In total, 16 teams submitted results in the final testing phase. The solutions developed in this challenge achieved promising results in 3D food reconstruction, with significant potential for improving portion estimation for dietary assessment and nutritional monitoring. More details about this workshop challenge and access to the dataset can be found at https://sites.google.com/view/cvpr-metafood-2024.

CVSep 26, 2024
Efficient Microscopic Image Instance Segmentation for Food Crystal Quality Control

Xiaoyu Ji, Jan P Allebach, Ali Shakouri et al.

This paper is directed towards the food crystal quality control area for manufacturing, focusing on efficiently predicting food crystal counts and size distributions. Previously, manufacturers used the manual counting method on microscopic images of food liquid products, which requires substantial human effort and suffers from inconsistency issues. Food crystal segmentation is a challenging problem due to the diverse shapes of crystals and their surrounding hard mimics. To address this challenge, we propose an efficient instance segmentation method based on object detection. Experimental results show that the predicted crystal counting accuracy of our method is comparable with existing segmentation methods, while being five times faster. Based on our experiments, we also define objective criteria for separating hard mimics and food crystals, which could benefit manual annotation tasks on similar dataset.

CVApr 12, 2022
Out-Of-Distribution Detection In Unsupervised Continual Learning

Jiangpeng He, Fengqing Zhu

Unsupervised continual learning aims to learn new tasks incrementally without requiring human annotations. However, most existing methods, especially those targeted on image classification, only work in a simplified scenario by assuming all new data belong to new tasks, which is not realistic if the class labels are not provided. Therefore, to perform unsupervised continual learning in real life applications, an out-of-distribution detector is required at beginning to identify whether each new data corresponds to a new task or already learned tasks, which still remains under-explored yet. In this work, we formulate the problem for Out-of-distribution Detection in Unsupervised Continual Learning (OOD-UCL) with the corresponding evaluation protocol. In addition, we propose a novel OOD detection method by correcting the output bias at first and then enhancing the output confidence for in-distribution data based on task discriminativeness, which can be applied directly without modifying the learning procedures and objectives of continual learning. Our method is evaluated on CIFAR-100 dataset by following the proposed evaluation protocol and we show improved performance compared with existing OOD detection methods under the unsupervised continual learning scenario.

CVJul 1, 2023
Single-Stage Heavy-Tailed Food Classification

Jiangpeng He, Fengqing Zhu

Deep learning based food image classification has enabled more accurate nutrition content analysis for image-based dietary assessment by predicting the types of food in eating occasion images. However, there are two major obstacles to apply food classification in real life applications. First, real life food images are usually heavy-tailed distributed, resulting in severe class-imbalance issue. Second, it is challenging to train a single-stage (i.e. end-to-end) framework under heavy-tailed data distribution, which cause the over-predictions towards head classes with rich instances and under-predictions towards tail classes with rare instance. In this work, we address both issues by introducing a novel single-stage heavy-tailed food classification framework. Our method is evaluated on two heavy-tailed food benchmark datasets, Food101-LT and VFN-LT, and achieves the best performance compared to existing work with over 5% improvements for top-1 accuracy.

CVAug 13, 2022
Simulating Personal Food Consumption Patterns using a Modified Markov Chain

Xinyue Pan, Jiangpeng He, Andrew Peng et al.

Food image classification serves as the foundation of image-based dietary assessment to predict food categories. Since there are many different food classes in real life, conventional models cannot achieve sufficiently high accuracy. Personalized classifiers aim to largely improve the accuracy of food image classification for each individual. However, a lack of public personal food consumption data proves to be a challenge for training such models. To address this issue, we propose a novel framework to simulate personal food consumption data patterns, leveraging the use of a modified Markov chain model and self-supervised learning. Our method is capable of creating an accurate future data pattern from a limited amount of initial data, and our simulated data patterns can be closely correlated with the initial data pattern. Furthermore, we use Dynamic Time Warping distance and Kullback-Leibler divergence as metrics to evaluate the effectiveness of our method on the public Food-101 dataset. Our experimental results demonstrate promising performance compared with random simulation and the original Markov chain method.

CVJun 5, 2022
Towards the Creation of a Nutrition and Food Group Based Image Database

Zeman Shao, Jiangpeng He, Ya-Yuan Yu et al.

Food classification is critical to the analysis of nutrients comprising foods reported in dietary assessment. Advances in mobile and wearable sensors, combined with new image based methods, particularly deep learning based approaches, have shown great promise to improve the accuracy of food classification to assess dietary intake. However, these approaches are data-hungry and their performances are heavily reliant on the quantity and quality of the available datasets for training the food classification model. Existing food image datasets are not suitable for fine-grained food classification and the following nutrition analysis as they lack fine-grained and transparently derived food group based identification which are often provided by trained dietitians with expert domain knowledge. In this paper, we propose a framework to create a nutrition and food group based image database that contains both visual and hierarchical food categorization information to enhance links to the nutrient profile of each food. We design a protocol for linking food group based food codes in the U.S. Department of Agriculture's (USDA) Food and Nutrient Database for Dietary Studies (FNDDS) to a food image dataset, and implement a web-based annotation tool for efficient deployment of this protocol.Our proposed method is used to build a nutrition and food group based image database including 16,114 food images representing the 74 most frequently consumed What We Eat in America (WWEIA) food sub-categories in the United States with 1,865 USDA food code matched to a nutrient database, the USDA FNDDS nutrient database.

CVMar 16, 2023
Conditional Synthetic Food Image Generation

Wenjin Fu, Yue Han, Jiangpeng He et al.

Generative Adversarial Networks (GAN) have been widely investigated for image synthesis based on their powerful representation learning ability. In this work, we explore the StyleGAN and its application of synthetic food image generation. Despite the impressive performance of GAN for natural image generation, food images suffer from high intra-class diversity and inter-class similarity, resulting in overfitting and visual artifacts for synthetic images. Therefore, we aim to explore the capability and improve the performance of GAN methods for food image generation. Specifically, we first choose StyleGAN3 as the baseline method to generate synthetic food images and analyze the performance. Then, we identify two issues that can cause performance degradation on food images during the training phase: (1) inter-class feature entanglement during multi-food classes training and (2) loss of high-resolution detail during image downsampling. To address both issues, we propose to train one food category at a time to avoid feature entanglement and leverage image patches cropped from high-resolution datasets to retain fine details. We evaluate our method on the Food-101 dataset and show improved quality of generated synthetic food images compared with the baseline. Finally, we demonstrate the great potential of improving the performance of downstream tasks, such as food image classification by including high-quality synthetic training samples in the data augmentation.

CVMar 16, 2023
Self-Supervised Visual Representation Learning on Food Images

Andrew Peng, Jiangpeng He, Fengqing Zhu

Food image analysis is the groundwork for image-based dietary assessment, which is the process of monitoring what kinds of food and how much energy is consumed using captured food or eating scene images. Existing deep learning-based methods learn the visual representation for downstream tasks based on human annotation of each food image. However, most food images in real life are obtained without labels, and data annotation requires plenty of time and human effort, which is not feasible for real-world applications. To make use of the vast amount of unlabeled images, many existing works focus on unsupervised or self-supervised learning of visual representations directly from unlabeled data. However, none of these existing works focus on food images, which is more challenging than general objects due to its high inter-class similarity and intra-class variance. In this paper, we focus on the implementation and analysis of existing representative self-supervised learning methods on food images. Specifically, we first compare the performance of six selected self-supervised learning models on the Food-101 dataset. Then we analyze the pros and cons of each selected model when training on food data to identify the key factors that can help improve the performance. Finally, we propose several ideas for future work on self-supervised visual representation learning for food images.

CVSep 15, 2023
Personalized Food Image Classification: Benchmark Datasets and New Baseline

Xinyue Pan, Jiangpeng He, Fengqing Zhu

Food image classification is a fundamental step of image-based dietary assessment, enabling automated nutrient analysis from food images. Many current methods employ deep neural networks to train on generic food image datasets that do not reflect the dynamism of real-life food consumption patterns, in which food images appear sequentially over time, reflecting the progression of what an individual consumes. Personalized food classification aims to address this problem by training a deep neural network using food images that reflect the consumption pattern of each individual. However, this problem is under-explored and there is a lack of benchmark datasets with individualized food consumption patterns due to the difficulty in data collection. In this work, we first introduce two benchmark personalized datasets including the Food101-Personal, which is created based on surveys of daily dietary patterns from participants in the real world, and the VFNPersonal, which is developed based on a dietary study. In addition, we propose a new framework for personalized food image classification by leveraging self-supervised learning and temporal image feature information. Our method is evaluated on both benchmark datasets and shows improved performance compared to existing works. The dataset has been made available at: https://skynet.ecn.purdue.edu/~pan161/dataset_personal.html

CVAug 7, 2024
FMiFood: Multi-modal Contrastive Learning for Food Image Classification

Xinyue Pan, Jiangpeng He, Fengqing Zhu

Food image classification is the fundamental step in image-based dietary assessment, which aims to estimate participants' nutrient intake from eating occasion images. A common challenge of food images is the intra-class diversity and inter-class similarity, which can significantly hinder classification performance. To address this issue, we introduce a novel multi-modal contrastive learning framework called FMiFood, which learns more discriminative features by integrating additional contextual information, such as food category text descriptions, to enhance classification accuracy. Specifically, we propose a flexible matching technique that improves the similarity matching between text and image embeddings to focus on multiple key information. Furthermore, we incorporate the classification objectives into the framework and explore the use of GPT-4 to enrich the text descriptions and provide more detailed context. Our method demonstrates improved performance on both the UPMC-101 and VFN datasets compared to existing methods.

CVAug 25, 2022
Image Based Food Energy Estimation With Depth Domain Adaptation

Gautham Vinod, Zeman Shao, Fengqing Zhu

Assessment of dietary intake has primarily relied on self-report instruments, which are prone to measurement errors. Dietary assessment methods have increasingly incorporated technological advances particularly mobile, image based approaches to address some of these limitations and further automation. Mobile, image-based methods can reduce user burden and bias by automatically estimating dietary intake from eating occasion images that are captured by mobile devices. In this paper, we propose an "Energy Density Map" which is a pixel-to-pixel mapping from the RGB image to the energy density of the food. We then incorporate the "Energy Density Map" with an associated depth map that is captured by a depth sensor to estimate the food energy. The proposed method is evaluated on the Nutrition5k dataset. Experimental results show improved results compared to baseline methods with an average error of 13.29 kCal and an average percentage error of 13.57% between the ground-truth and the estimated energy of the food.

CVSep 3, 2024
MetaFood3D: 3D Food Dataset with Nutrition Values

Yuhao Chen, Jiangpeng He, Gautham Vinod et al.

Food computing is both important and challenging in computer vision (CV). It significantly contributes to the development of CV algorithms due to its frequent presence in datasets across various applications, ranging from classification and instance segmentation to 3D reconstruction. The polymorphic shapes and textures of food, coupled with high variation in forms and vast multimodal information, including language descriptions and nutritional data, make food computing a complex and demanding task for modern CV algorithms. 3D food modeling is a new frontier for addressing food related problems, due to its inherent capability to deal with random camera views and its straightforward representation for calculating food portion size. However, the primary hurdle in the development of algorithms for food object analysis is the lack of nutrition values in existing 3D datasets. Moreover, in the broader field of 3D research, there is a critical need for domain-specific test datasets. To bridge the gap between general 3D vision and food computing research, we introduce MetaFood3D. This dataset consists of 743 meticulously scanned and labeled 3D food objects across 131 categories, featuring detailed nutrition information, weight, and food codes linked to a comprehensive nutrition database. Our MetaFood3D dataset emphasizes intra-class diversity and includes rich modalities such as textured mesh files, RGB-D videos, and segmentation masks. Experimental results demonstrate our dataset's strong capabilities in enhancing food portion estimation algorithms, highlight the gap between video captures and 3D scanned data, and showcase the strengths of MetaFood3D in generating synthetic eating occasion data and 3D food objects.

CVDec 9, 2025
Food Image Generation on Multi-Noun Categories

Xinyue Pan, Yuhao Chen, Jiangpeng He et al.

Generating realistic food images for categories with multiple nouns is surprisingly challenging. For instance, the prompt "egg noodle" may result in images that incorrectly contain both eggs and noodles as separate entities. Multi-noun food categories are common in real-world datasets and account for a large portion of entries in benchmarks such as UEC-256. These compound names often cause generative models to misinterpret the semantics, producing unintended ingredients or objects. This is due to insufficient multi-noun category related knowledge in the text encoder and misinterpretation of multi-noun relationships, leading to incorrect spatial layouts. To overcome these challenges, we propose FoCULR (Food Category Understanding and Layout Refinement) which incorporates food domain knowledge and introduces core concepts early in the generation process. Experimental results demonstrate that the integration of these techniques improves image generation performance in the food domain.

CVFeb 13
Implicit-Scale 3D Reconstruction for Multi-Food Volume Estimation from Monocular Images

Yuhao Chen, Gautham Vinod, Siddeshwar Raghavan et al.

We present Implicit-Scale 3D Reconstruction from Monocular Multi-Food Images, a benchmark dataset designed to advance geometry-based food portion estimation in realistic dining scenarios. Existing dietary assessment methods largely rely on single-image analysis or appearance-based inference, including recent vision-language models, which lack explicit geometric reasoning and are sensitive to scale ambiguity. This benchmark reframes food portion estimation as an implicit-scale 3D reconstruction problem under monocular observations. To reflect real-world conditions, explicit physical references and metric annotations are removed; instead, contextual objects such as plates and utensils are provided, requiring algorithms to infer scale from implicit cues and prior knowledge. The dataset emphasizes multi-food scenes with diverse object geometries, frequent occlusions, and complex spatial arrangements. The benchmark was adopted as a challenge at the MetaFood 2025 Workshop, where multiple teams proposed reconstruction-based solutions. Experimental results show that while strong vision--language baselines achieve competitive performance, geometry-based reconstruction methods provide both improved accuracy and greater robustness, with the top-performing approach achieving 0.21 MAPE in volume estimation and 5.7 L1 Chamfer Distance in geometric accuracy.

CVNov 12, 2025
PANDA - Patch And Distribution-Aware Augmentation for Long-Tailed Exemplar-Free Continual Learning

Siddeshwar Raghavan, Jiangpeng He, Fengqing Zhu

Exemplar-Free Continual Learning (EFCL) restricts the storage of previous task data and is highly susceptible to catastrophic forgetting. While pre-trained models (PTMs) are increasingly leveraged for EFCL, existing methods often overlook the inherent imbalance of real-world data distributions. We discovered that real-world data streams commonly exhibit dual-level imbalances, dataset-level distributions combined with extreme or reversed skews within individual tasks, creating both intra-task and inter-task disparities that hinder effective learning and generalization. To address these challenges, we propose PANDA, a Patch-and-Distribution-Aware Augmentation framework that integrates seamlessly with existing PTM-based EFCL methods. PANDA amplifies low-frequency classes by using a CLIP encoder to identify representative regions and transplanting those into frequent-class samples within each task. Furthermore, PANDA incorporates an adaptive balancing strategy that leverages prior task distributions to smooth inter-task imbalances, reducing the overall gap between average samples across tasks and enabling fairer learning with frozen PTMs. Extensive experiments and ablation studies demonstrate PANDA's capability to work with existing PTM-based CL methods, improving accuracy and reducing catastrophic forgetting.

IVMar 27, 2024Code
Theoretical Bound-Guided Hierarchical VAE for Neural Image Codecs

Yichi Zhang, Zhihao Duan, Yuning Huang et al.

Recent studies reveal a significant theoretical link between variational autoencoders (VAEs) and rate-distortion theory, notably in utilizing VAEs to estimate the theoretical upper bound of the information rate-distortion function of images. Such estimated theoretical bounds substantially exceed the performance of existing neural image codecs (NICs). To narrow this gap, we propose a theoretical bound-guided hierarchical VAE (BG-VAE) for NIC. The proposed BG-VAE leverages the theoretical bound to guide the NIC model towards enhanced performance. We implement the BG-VAE using Hierarchical VAEs and demonstrate its effectiveness through extensive experiments. Along with advanced neural network blocks, we provide a versatile, variable-rate NIC that outperforms existing methods when considering both rate-distortion performance and computational complexity. The code is available at BG-VAE.

49.9CVApr 10
Not Your Stereo-Typical Estimator: Combining Vision and Language for Volume Perception

Gautham Vinod, Bruce Coburn, Siddeshwar Raghavan et al.

Accurate volume estimation of objects from visual data is a long-standing challenge in computer vision with significant applications in robotics, logistics, and smart health. Existing methods often rely on complex 3D reconstruction pipelines or struggle with the ambiguity inherent in single-view images. To address these limitations, we introduce a new method that fuses implicit 3D cues from stereo vision with explicit prior knowledge from natural language text. Our approach extracts deep features from a stereo image pair and a descriptive text prompt that contains the object's class and an approximate volume, then integrates them using a simple yet effective projection layer into a unified, multi-modal representation for regression. We conduct extensive experiments on public datasets demonstrating that our text-guided approach significantly outperforms vision-only baselines. Our findings show that leveraging even simple textual priors can effectively guide the volume estimation task, paving the way for more context-aware visual measurement systems. Code: https://gitlab.com/viper-purdue/stereo-typical-estimator.

29.1CVApr 7
DietDelta: A Vision-Language Approach for Dietary Assessment via Before-and-After Images

Gautham Vinod, Siddeshwar Raghavan, Bruce Coburn et al.

Accurate dietary assessment is critical for precision nutrition, yet most image-based methods rely on a single pre-consumption image and provide only coarse, meal-level estimates. These approaches cannot determine what was actually consumed and often require restrictive inputs such as depth sensing, multi-view imagery, or explicit segmentation. In this paper, we propose a simple vision-language framework for food-item-level nutritional analysis using paired before-and-after eating images. Instead of relying on rigid segmentation masks, our method leverages natural language prompts to localize specific food items and estimate their weight directly from a single RGB image. We further estimate food consumption by predicting weight differences between paired images using a two-stage training strategy. We evaluate our method on three publicly available datasets and demonstrate consistent improvements over existing approaches, establishing a strong baseline for before-and-after dietary image analysis.

IVAug 6, 2025Code
OpenDCVCs: A PyTorch Open Source Implementation and Performance Evaluation of the DCVC series Video Codecs

Yichi Zhang, Fengqing Zhu

We present OpenDCVCs, an open-source PyTorch implementation designed to advance reproducible research in learned video compression. OpenDCVCs provides unified and training-ready implementations of four representative Deep Contextual Video Compression (DCVC) models--DCVC, DCVC with Temporal Context Modeling (DCVC-TCM), DCVC with Hybrid Entropy Modeling (DCVC-HEM), and DCVC with Diverse Contexts (DCVC-DC). While the DCVC series achieves substantial bitrate reductions over both classical codecs and advanced learned models, previous public code releases have been limited to evaluation codes, presenting significant barriers to reproducibility, benchmarking, and further development. OpenDCVCs bridges this gap by offering a comprehensive, self-contained framework that supports both end-to-end training and evaluation for all included algorithms. The implementation includes detailed documentation, evaluation protocols, and extensive benchmarking results across diverse datasets, providing a transparent and consistent foundation for comparison and extension. All code and experimental tools are publicly available at https://gitlab.com/viper-purdue/opendcvcs, empowering the community to accelerate research and foster collaboration.

LGMar 2
Temporal Imbalance of Positive and Negative Supervision in Class-Incremental Learning

Jinge Ma, Fengqing Zhu

With the widespread adoption of deep learning in visual tasks, Class-Incremental Learning (CIL) has become an important paradigm for handling dynamically evolving data distributions. However, CIL faces the core challenge of catastrophic forgetting, often manifested as a prediction bias toward new classes. Existing methods mainly attribute this bias to intra-task class imbalance and focus on corrections at the classifier head. In this paper, we highlight an overlooked factor -- temporal imbalance -- as a key cause of this bias. Earlier classes receive stronger negative supervision toward the end of training, leading to asymmetric precision and recall. We establish a temporal supervision model, formally define temporal imbalance, and propose Temporal-Adjusted Loss (TAL), which uses a temporal decay kernel to construct a supervision strength vector and dynamically reweight the negative supervision in cross-entropy loss. Theoretical analysis shows that TAL degenerates to standard cross-entropy under balanced conditions and effectively mitigates prediction bias under imbalance. Extensive experiments demonstrate that TAL significantly reduces forgetting and improves performance on multiple CIL benchmarks, underscoring the importance of temporal modeling for stable long-term learning.

IVJun 2, 2025Code
Flexible Mixed Precision Quantization for Learned Image Compression

Md Adnan Faisal Hossain, Zhihao Duan, Fengqing Zhu

Despite its improvements in coding performance compared to traditional codecs, Learned Image Compression (LIC) suffers from large computational costs for storage and deployment. Model quantization offers an effective solution to reduce the computational complexity of LIC models. However, most existing works perform fixed-precision quantization which suffers from sub-optimal utilization of resources due to the varying sensitivity to quantization of different layers of a neural network. In this paper, we propose a Flexible Mixed Precision Quantization (FMPQ) method that assigns different bit-widths to different layers of the quantized network using the fractional change in rate-distortion loss as the bit-assignment criterion. We also introduce an adaptive search algorithm which reduces the time-complexity of searching for the desired distribution of quantization bit-widths given a fixed model size. Evaluation of our method shows improved BD-Rate performance under similar model size constraints compared to other works on quantization of LIC models. We have made the source code available at gitlab.com/viper-purdue/fmpq.

35.4CVMar 20
Adaptive Greedy Frame Selection for Long Video Understanding

Yuning Huang, Fengqing Zhu

Large vision--language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1~FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.

GRJun 24, 2025Code
ICP-3DGS: SfM-free 3D Gaussian Splatting for Large-scale Unbounded Scenes

Chenhao Zhang, Yezhi Shen, Fengqing Zhu

In recent years, neural rendering methods such as NeRFs and 3D Gaussian Splatting (3DGS) have made significant progress in scene reconstruction and novel view synthesis. However, they heavily rely on preprocessed camera poses and 3D structural priors from structure-from-motion (SfM), which are challenging to obtain in outdoor scenarios. To address this challenge, we propose to incorporate Iterative Closest Point (ICP) with optimization-based refinement to achieve accurate camera pose estimation under large camera movements. Additionally, we introduce a voxel-based scene densification approach to guide the reconstruction in large-scale scenes. Experiments demonstrate that our approach ICP-3DGS outperforms existing methods in both camera pose estimation and novel view synthesis across indoor and outdoor scenes of various scales. Source code is available at https://github.com/Chenhao-Z/ICP-3DGS.

IVJan 28
Leveraging Second-Order Curvature for Efficient Learned Image Compression: Theory and Empirical Evidence

Yichi Zhang, Fengqing Zhu

Training learned image compression (LIC) models entails navigating a challenging optimization landscape defined by the fundamental trade-off between rate and distortion. Standard first-order optimizers, such as SGD and Adam, struggle with \emph{gradient conflicts} arising from competing objectives, leading to slow convergence and suboptimal rate-distortion performance. In this work, we demonstrate that a simple utilization of a second-order quasi-Newton optimizer, \textbf{SOAP}, dramatically improves both training efficiency and final performance across diverse LICs. Our theoretical and empirical analyses reveal that Newton preconditioning inherently resolves the intra-step and inter-step update conflicts intrinsic to the R-D objective, facilitating faster, more stable convergence. Beyond acceleration, we uncover a critical deployability benefit: second-order trained models exhibit significantly fewer activation and latent outliers. This substantially enhances robustness to post-training quantization. Together, these results establish second-order optimization, achievable as a seamless drop-in replacement of the imported optimizer, as a powerful, practical tool for advancing the efficiency and real-world readiness of LICs.

CVFeb 28, 2024
Gradient Reweighting: Towards Imbalanced Class-Incremental Learning

Jiangpeng He, Fengqing Zhu

Class-Incremental Learning (CIL) trains a model to continually recognize new classes from non-stationary data while retaining learned knowledge. A major challenge of CIL arises when applying to real-world data characterized by non-uniform distribution, which introduces a dual imbalance problem involving (i) disparities between stored exemplars of old tasks and new class data (inter-phase imbalance), and (ii) severe class imbalances within each individual task (intra-phase imbalance). We show that this dual imbalance issue causes skewed gradient updates with biased weights in FC layers, thus inducing over/under-fitting and catastrophic forgetting in CIL. Our method addresses it by reweighting the gradients towards balanced optimization and unbiased classifier learning. Additionally, we observe imbalanced forgetting where paradoxically the instance-rich classes suffer higher performance degradation during CIL due to a larger amount of training data becoming unavailable in subsequent learning phases. To tackle this, we further introduce a distribution-aware knowledge distillation loss to mitigate forgetting by aligning output logits proportionally with the distribution of lost training data. We validate our method on CIFAR-100, ImageNetSubset, and Food101 across various evaluation protocols and demonstrate consistent improvements compared to existing works, showing great potential to apply CIL in real-world scenarios with enhanced robustness and effectiveness.

CVApr 18, 2024
Food Portion Estimation via 3D Object Scaling

Gautham Vinod, Jiangpeng He, Zeman Shao et al.

Image-based methods to analyze food images have alleviated the user burden and biases associated with traditional methods. However, accurate portion estimation remains a major challenge due to the loss of 3D information in the 2D representation of foods captured by smartphone cameras or wearable devices. In this paper, we propose a new framework to estimate both food volume and energy from 2D images by leveraging the power of 3D food models and physical reference in the eating scene. Our method estimates the pose of the camera and the food object in the input image and recreates the eating occasion by rendering an image of a 3D model of the food with the estimated poses. We also introduce a new dataset, SimpleFood45, which contains 2D images of 45 food items and associated annotations including food volume, weight, and energy. Our method achieves an average error of 31.10 kCal (17.67%) on this dataset, outperforming existing portion estimation methods. The dataset can be accessed at: https://lorenz.ecn.purdue.edu/~gvinod/simplefood45/ and the code can be accessed at: https://gitlab.com/viper-purdue/monocular-food-volume-3d

CVMay 30, 2025
CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning

Jiangpeng He, Zhihao Duan, Fengqing Zhu

Class-Incremental Learning (CIL) aims to learn new classes sequentially while retaining the knowledge of previously learned classes. Recently, pre-trained models (PTMs) combined with parameter-efficient fine-tuning (PEFT) have shown remarkable performance in rehearsal-free CIL without requiring exemplars from previous tasks. However, existing adapter-based methods, which incorporate lightweight learnable modules into PTMs for CIL, create new adapters for each new task, leading to both parameter redundancy and failure to leverage shared knowledge across tasks. In this work, we propose ContinuaL Low-Rank Adaptation (CL-LoRA), which introduces a novel dual-adapter architecture combining \textbf{task-shared adapters} to learn cross-task knowledge and \textbf{task-specific adapters} to capture unique features of each new task. Specifically, the shared adapters utilize random orthogonal matrices and leverage knowledge distillation with gradient reassignment to preserve essential shared knowledge. In addition, we introduce learnable block-wise weights for task-specific adapters, which mitigate inter-task interference while maintaining the model's plasticity. We demonstrate CL-LoRA consistently achieves promising performance under multiple benchmarks with reduced training and inference computation, establishing a more efficient and scalable paradigm for continual learning with pre-trained models.

LGApr 6, 2024
DELTA: Decoupling Long-Tailed Online Continual Learning

Siddeshwar Raghavan, Jiangpeng He, Fengqing Zhu

A significant challenge in achieving ubiquitous Artificial Intelligence is the limited ability of models to rapidly learn new information in real-world scenarios where data follows long-tailed distributions, all while avoiding forgetting previously acquired knowledge. In this work, we study the under-explored problem of Long-Tailed Online Continual Learning (LTOCL), which aims to learn new tasks from sequentially arriving class-imbalanced data streams. Each data is observed only once for training without knowing the task data distribution. We present DELTA, a decoupled learning approach designed to enhance learning representations and address the substantial imbalance in LTOCL. We enhance the learning process by adapting supervised contrastive learning to attract similar samples and repel dissimilar (out-of-class) samples. Further, by balancing gradients during training using an equalization loss, DELTA significantly enhances learning outcomes and successfully mitigates catastrophic forgetting. Through extensive evaluation, we demonstrate that DELTA improves the capacity for incremental learning, surpassing existing OCL methods. Our results suggest considerable promise for applying OCL in real-world applications.

CVMar 25, 2024
Strategies to Improve Real-World Applicability of Laparoscopic Anatomy Segmentation Models

Fiona R. Kolbinger, Jiangpeng He, Jinge Ma et al.

Accurate identification and localization of anatomical structures of varying size and appearance in laparoscopic imaging are necessary to leverage the potential of computer vision techniques for surgical decision support. Segmentation performance of such models is traditionally reported using metrics of overlap such as IoU. However, imbalanced and unrealistic representation of classes in the training data and suboptimal selection of reported metrics have the potential to skew nominal segmentation performance and thereby ultimately limit clinical translation. In this work, we systematically analyze the impact of class characteristics (i.e., organ size differences), training and test data composition (i.e., representation of positive and negative examples), and modeling parameters (i.e., foreground-to-background class weight) on eight segmentation metrics: accuracy, precision, recall, IoU, F1 score (Dice Similarity Coefficient), specificity, Hausdorff Distance, and Average Symmetric Surface Distance. Our findings support two adjustments to account for data biases in surgical data science: First, training on datasets that are similar to the clinical real-world scenarios in terms of class distribution, and second, class weight adjustments to optimize segmentation model performance with regard to metrics of particular relevance in the respective clinical setting.

IVFeb 27, 2025
Balanced Rate-Distortion Optimization in Learned Image Compression

Yichi Zhang, Zhihao Duan, Yuning Huang et al.

Learned image compression (LIC) using deep learning architectures has seen significant advancements, yet standard rate-distortion (R-D) optimization often encounters imbalanced updates due to diverse gradients of the rate and distortion objectives. This imbalance can lead to suboptimal optimization, where one objective dominates, thereby reducing overall compression efficiency. To address this challenge, we reformulate R-D optimization as a multi-objective optimization (MOO) problem and introduce two balanced R-D optimization strategies that adaptively adjust gradient updates to achieve more equitable improvements in both rate and distortion. The first proposed strategy utilizes a coarse-to-fine gradient descent approach along standard R-D optimization trajectories, making it particularly suitable for training LIC models from scratch. The second proposed strategy analytically addresses the reformulated optimization as a quadratic programming problem with an equality constraint, which is ideal for fine-tuning existing models. Experimental results demonstrate that both proposed methods enhance the R-D performance of LIC models, achieving around a 2\% BD-Rate reduction with acceptable additional training cost, leading to a more balanced and efficient optimization process. Code will be available at https://gitlab.com/viper-purdue/Balanced-RD.

CVNov 14, 2024
MFP3D: Monocular Food Portion Estimation Leveraging 3D Point Clouds

Jinge Ma, Xiaoyan Zhang, Gautham Vinod et al.

Food portion estimation is crucial for monitoring health and tracking dietary intake. Image-based dietary assessment, which involves analyzing eating occasion images using computer vision techniques, is increasingly replacing traditional methods such as 24-hour recalls. However, accurately estimating the nutritional content from images remains challenging due to the loss of 3D information when projecting to the 2D image plane. Existing portion estimation methods are challenging to deploy in real-world scenarios due to their reliance on specific requirements, such as physical reference objects, high-quality depth information, or multi-view images and videos. In this paper, we introduce MFP3D, a new framework for accurate food portion estimation using only a single monocular image. Specifically, MFP3D consists of three key modules: (1) a 3D Reconstruction Module that generates a 3D point cloud representation of the food from the 2D image, (2) a Feature Extraction Module that extracts and concatenates features from both the 3D point cloud and the 2D RGB image, and (3) a Portion Regression Module that employs a deep regression model to estimate the food's volume and energy content based on the extracted features. Our MFP3D is evaluated on MetaFood3D dataset, demonstrating its significant improvement in accurate portion estimation over existing methods.

IVApr 11, 2024
Learning to Classify New Foods Incrementally Via Compressed Exemplars

Justin Yang, Zhihao Duan, Jiangpeng He et al.

Food image classification systems play a crucial role in health monitoring and diet tracking through image-based dietary assessment techniques. However, existing food recognition systems rely on static datasets characterized by a pre-defined fixed number of food classes. This contrasts drastically with the reality of food consumption, which features constantly changing data. Therefore, food image classification systems should adapt to and manage data that continuously evolves. This is where continual learning plays an important role. A challenge in continual learning is catastrophic forgetting, where ML models tend to discard old knowledge upon learning new information. While memory-replay algorithms have shown promise in mitigating this problem by storing old data as exemplars, they are hampered by the limited capacity of memory buffers, leading to an imbalance between new and previously learned data. To address this, our work explores the use of neural image compression to extend buffer size and enhance data diversity. We introduced the concept of continuously learning a neural compression model to adaptively improve the quality of compressed data and optimize the bitrates per pixel (bpp) to store more exemplars. Our extensive experiments, including evaluations on food-specific datasets including Food-101 and VFN-74, as well as the general dataset ImageNet-100, demonstrate improvements in classification accuracy. This progress is pivotal in advancing more realistic food recognition systems that are capable of adapting to continually evolving data. Moreover, the principles and methodologies we've developed hold promise for broader applications, extending their benefits to other domains of continual machine learning systems.

CVJul 9, 2025
Comprehensive Evaluation of Large Multimodal Models for Nutrition Analysis: A New Benchmark Enriched with Contextual Metadata

Bruce Coburn, Jiangpeng He, Megan E. Rollo et al.

Large Multimodal Models (LMMs) are increasingly applied to meal images for nutrition analysis. However, existing work primarily evaluates proprietary models, such as GPT-4. This leaves the broad range of LLMs underexplored. Additionally, the influence of integrating contextual metadata and its interaction with various reasoning modifiers remains largely uncharted. This work investigates how interpreting contextual metadata derived from GPS coordinates (converted to location/venue type), timestamps (transformed into meal/day type), and the food items present can enhance LMM performance in estimating key nutritional values. These values include calories, macronutrients (protein, carbohydrates, fat), and portion sizes. We also introduce \textbf{ACETADA}, a new food-image dataset slated for public release. This open dataset provides nutrition information verified by the dietitian and serves as the foundation for our analysis. Our evaluation across eight LMMs (four open-weight and four closed-weight) first establishes the benefit of contextual metadata integration over straightforward prompting with images alone. We then demonstrate how this incorporation of contextual information enhances the efficacy of reasoning modifiers, such as Chain-of-Thought, Multimodal Chain-of-Thought, Scale Hint, Few-Shot, and Expert Persona. Empirical results show that integrating metadata intelligently, when applied through straightforward prompting strategies, can significantly reduce the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) in predicted nutritional values. This work highlights the potential of context-aware LMMs for improved nutrition analysis.

CVMar 10, 2024
Probing Image Compression For Class-Incremental Learning

Justin Yang, Zhihao Duan, Andrew Peng et al.

Image compression emerges as a pivotal tool in the efficient handling and transmission of digital images. Its ability to substantially reduce file size not only facilitates enhanced data storage capacity but also potentially brings advantages to the development of continual machine learning (ML) systems, which learn new knowledge incrementally from sequential data. Continual ML systems often rely on storing representative samples, also known as exemplars, within a limited memory constraint to maintain the performance on previously learned data. These methods are known as memory replay-based algorithms and have proven effective at mitigating the detrimental effects of catastrophic forgetting. Nonetheless, the limited memory buffer size often falls short of adequately representing the entire data distribution. In this paper, we explore the use of image compression as a strategy to enhance the buffer's capacity, thereby increasing exemplar diversity. However, directly using compressed exemplars introduces domain shift during continual ML, marked by a discrepancy between compressed training data and uncompressed testing data. Additionally, it is essential to determine the appropriate compression algorithm and select the most effective rate for continual ML systems to balance the trade-off between exemplar quality and quantity. To this end, we introduce a new framework to incorporate image compression for continual ML including a pre-processing data compression step and an efficient compression rate/algorithm selection method. We conduct extensive experiments on CIFAR-100 and ImageNet datasets and show that our method significantly improves image classification accuracy in continual ML settings.

CVJan 25
Training-Free Text-to-Image Compositional Food Generation via Prompt Grafting

Xinyue Pan, Yuhao Chen, Fengqing Zhu

Real-world meal images often contain multiple food items, making reliable compositional food image generation important for applications such as image-based dietary assessment, where multi-food data augmentation is needed, and recipe visualization. However, modern text-to-image diffusion models struggle to generate accurate multi-food images due to object entanglement, where adjacent foods (e.g., rice and soup) fuse together because many foods do not have clear boundaries. To address this challenge, we introduce Prompt Grafting (PG), a training-free framework that combines explicit spatial cues in text with implicit layout guidance during sampling. PG runs a two-stage process where a layout prompt first establishes distinct regions and the target prompt is grafted once layout formation stabilizes. The framework enables food entanglement control: users can specify which food items should remain separated or be intentionally mixed by editing the arrangement of layouts. Across two food datasets, our method significantly improves the presence of target objects and provides qualitative evidence of controllable separation.

CVJan 13
Instance camera focus prediction for crystal agglomeration classification

Xiaoyu Ji, Chenhao Zhang, Tyler James Downard et al.

Agglomeration refers to the process of crystal clustering due to interparticle forces. Crystal agglomeration analysis from microscopic images is challenging due to the inherent limitations of two-dimensional imaging. Overlapping crystals may appear connected even when located at different depth layers. Because optical microscopes have a shallow depth of field, crystals that are in-focus and out-of-focus in the same image typically reside on different depth layers and do not constitute true agglomeration. To address this, we first quantified camera focus with an instance camera focus prediction network to predict 2 class focus level that aligns better with visual observations than traditional image processing focus measures. Then an instance segmentation model is combined with the predicted focus level for agglomeration classification. Our proposed method has a higher agglomeration classification and segmentation accuracy than the baseline models on ammonium perchlorate crystal and sugar crystal dataset.

CVFeb 4
Food Portion Estimation: From Pixels to Calories

Gautham Vinod, Fengqing Zhu

Reliance on images for dietary assessment is an important strategy to accurately and conveniently monitor an individual's health, making it a vital mechanism in the prevention and care of chronic diseases and obesity. However, image-based dietary assessment suffers from estimating the three dimensional size of food from 2D image inputs. Many strategies have been devised to overcome this critical limitation such as the use of auxiliary inputs like depth maps, multi-view inputs, or model-based approaches such as template matching. Deep learning also helps bridge the gap by either using monocular images or combinations of the image and the auxillary inputs to precisely predict the output portion from the image input. In this paper, we explore the different strategies employed for accurate portion estimation.

CVJan 27
Size Matters: Reconstructing Real-Scale 3D Models from Monocular Images for Food Portion Estimation

Gautham Vinod, Bruce Coburn, Siddeshwar Raghavan et al.

The rise of chronic diseases related to diet, such as obesity and diabetes, emphasizes the need for accurate monitoring of food intake. While AI-driven dietary assessment has made strides in recent years, the ill-posed nature of recovering size (portion) information from monocular images for accurate estimation of ``how much did you eat?'' is a pressing challenge. Some 3D reconstruction methods have achieved impressive geometric reconstruction but fail to recover the crucial real-world scale of the reconstructed object, limiting its usage in precision nutrition. In this paper, we bridge the gap between 3D computer vision and digital health by proposing a method that recovers a true-to-scale 3D reconstructed object from a monocular image. Our approach leverages rich visual features extracted from models trained on large-scale datasets to estimate the scale of the reconstructed object. This learned scale enables us to convert single-view 3D reconstructions into true-to-life, physically meaningful models. Extensive experiments and ablation studies on two publicly available datasets show that our method consistently outperforms existing techniques, achieving nearly a 30% reduction in mean absolute volume-estimation error, showcasing its potential to enhance the domain of precision nutrition. Code: https://gitlab.com/viper-purdue/size-matters

CVOct 3, 2025
Denoising of Two-Phase Optically Sectioned Structured Illumination Reconstructions Using Encoder-Decoder Networks

Allison Davis, Yezhi Shen, Xiaoyu Ji et al.

Structured illumination (SI) enhances image resolution and contrast by projecting patterned light onto a sample. In two-phase optical-sectioning SI (OS-SI), reduced acquisition time introduces residual artifacts that conventional denoising struggles to suppress. Deep learning offers an alternative to traditional methods; however, supervised training is limited by the lack of clean, optically sectioned ground-truth data. We investigate encoder-decoder networks for artifact reduction in two-phase OS-SI, using synthetic training pairs formed by applying real artifact fields to synthetic images. An asymmetrical denoising autoencoder (DAE) and a U-Net are trained on the synthetic data, then evaluated on real OS-SI images. Both networks improve image clarity, with each excelling against different artifact types. These results demonstrate that synthetic training enables supervised denoising of OS-SI images and highlight the potential of encoder-decoder networks to streamline reconstruction workflows.

CVSep 25, 2025
Unsupervised Defect Detection for Surgical Instruments

Joseph Huang, Yichi Zhang, Jingxi Yu et al.

Ensuring the safety of surgical instruments requires reliable detection of visual defects. However, manual inspection is prone to error, and existing automated defect detection methods, typically trained on natural/industrial images, fail to transfer effectively to the surgical domain. We demonstrate that simply applying or fine-tuning these approaches leads to issues: false positive detections arising from textured backgrounds, poor sensitivity to small, subtle defects, and inadequate capture of instrument-specific features due to domain shift. To address these challenges, we propose a versatile method that adapts unsupervised defect detection methods specifically for surgical instruments. By integrating background masking, a patch-based analysis strategy, and efficient domain adaptation, our method overcomes these limitations, enabling the reliable detection of fine-grained defects in surgical instrument imagery.

CVAug 13, 2025
EntropyGS: An Efficient Entropy Coding on 3D Gaussian Splatting

Yuning Huang, Jiahao Pang, Fengqing Zhu et al.

As an emerging novel view synthesis approach, 3D Gaussian Splatting (3DGS) demonstrates fast training/rendering with superior visual quality. The two tasks of 3DGS, Gaussian creation and view rendering, are typically separated over time or devices, and thus storage/transmission and finally compression of 3DGS Gaussians become necessary. We begin with a correlation and statistical analysis of 3DGS Gaussian attributes. An inspiring finding in this work reveals that spherical harmonic AC attributes precisely follow Laplace distributions, while mixtures of Gaussian distributions can approximate rotation, scaling, and opacity. Additionally, harmonic AC attributes manifest weak correlations with other attributes except for inherited correlations from a color space. A factorized and parameterized entropy coding method, EntropyGS, is hereinafter proposed. During encoding, distribution parameters of each Gaussian attribute are estimated to assist their entropy coding. The quantization for entropy coding is adaptively performed according to Gaussian attribute types. EntropyGS demonstrates about 30x rate reduction on benchmark datasets while maintaining similar rendering quality compared to input 3DGS data, with a fast encoding and decoding time.

CVJul 31, 2025
Confidence-aware agglomeration classification and segmentation of 2D microscopic food crystal images

Xiaoyu Ji, Ali Shakouri, Fengqing Zhu

Food crystal agglomeration is a phenomenon occurs during crystallization which traps water between crystals and affects food product quality. Manual annotation of agglomeration in 2D microscopic images is particularly difficult due to the transparency of water bonding and the limited perspective focusing on a single slide of the imaged sample. To address this challenge, we first propose a supervised baseline model to generate segmentation pseudo-labels for the coarsely labeled classification dataset. Next, an instance classification model that simultaneously performs pixel-wise segmentation is trained. Both models are used in the inference stage to combine their respective strengths in classification and segmentation. To preserve crystal properties, a post processing module is designed and included to both steps. Our method improves true positive agglomeration classification accuracy and size distribution predictions compared to other existing methods. Given the variability in confidence levels of manual annotations, our proposed method is evaluated under two confidence levels and successfully classifies potential agglomerated instances.