CVMar 6
Optimizing 3D Diffusion Models for Medical Imaging via Multi-Scale Reward LearningYueying Tian, Xudong Han, Meng Zhou et al.
Diffusion models have emerged as powerful tools for 3D medical image generation, yet bridging the gap between standard training objectives and clinical relevance remains a challenge. This paper presents a method to enhance 3D diffusion models using Reinforcement Learning (RL) with multi-scale feedback. We first pretrain a 3D diffusion model on MRI volumes to establish a robust generative prior. Subsequently, we fine-tune the model using Proximal Policy Optimization (PPO), guided by a novel reward system that integrates both 2D slice-wise assessments and 3D volumetric analysis. This combination allows the model to simultaneously optimize for local texture details and global structural coherence. We validate our framework on the BraTS 2019 and OASIS-1 datasets. Our results indicate that incorporating RL feedback effectively steers the generation process toward higher quality distributions. Quantitative analysis reveals significant improvements in Fréchet Inception Distance (FID) and, crucially, the synthetic data demonstrates enhanced utility in downstream tumor and disease classification tasks compared to non-optimized baselines.
LGJan 27
DSP-Reg: Domain-Sensitive Parameter Regularization for Robust Domain GeneralizationXudong Han, Senkang Hu, Yihang Tao et al.
Domain Generalization (DG) is a critical area that focuses on developing models capable of performing well on data from unseen distributions, which is essential for real-world applications. Existing approaches primarily concentrate on learning domain-invariant features, which assume that a model robust to variations in the source domains will generalize well to unseen target domains. However, these approaches neglect a deeper analysis at the parameter level, which makes the model hard to explicitly differentiate between parameters sensitive to domain shifts and those robust, potentially hindering its overall ability to generalize. In order to address these limitations, we first build a covariance-based parameter sensitivity analysis framework to quantify the sensitivity of each parameter in a model to domain shifts. By computing the covariance of parameter gradients across multiple source domains, we can identify parameters that are more susceptible to domain variations, which serves as our theoretical foundation. Based on this, we propose Domain-Sensitive Parameter Regularization (DSP-Reg), a principled framework that guides model optimization by a soft regularization technique that encourages the model to rely more on domain-invariant parameters while suppressing those that are domain-specific. This approach provides a more granular control over the model's learning process, leading to improved robustness and generalization to unseen domains. Extensive experiments on benchmarks, such as PACS, VLCS, OfficeHome, and DomainNet, demonstrate that DSP-Reg outperforms state-of-the-art approaches, achieving an average accuracy of 66.7\% and surpassing all baselines.
11.1LGApr 17
From User Recognition to Activity Counting: An Identity-Agnostic Approach to Multi-User WiFi SensingKemal Bayik, Olayinka Ajayi, Daniel Roggen et al.
Wi-Fi Channel State Information (CSI) enables device-free human activity recognition, but existing multi-user approaches assume a fixed set of known users during both training and inference. This closed-set assumption limits deployment, as models trained on a specific user set degrade when applied to new individuals or environments. We reformulate multi-user activity recognition as activity counting, estimating how many users perform each activity type at a given time, without associating actions with specific individuals. We propose a pipeline that converts CSI measurements into spatial projections and extracts features using a pretrained convolutional backbone. Two formulations are evaluated on the WiMANS dataset: a conventional identity-dependent model that assigns activities to fixed user slots, and an identity-agnostic model that estimates scene-level activity composition through regression. Under standard evaluation, the identity-agnostic model achieves a mean absolute error of 0.1081 on a 0-5 count scale. Under unseen-user evaluation, the identity-dependent model's macro-F1 drops from 80.38 to 32.61, while the identity-agnostic model's counting error remains stable. Feature space analysis confirms that identity-agnostic representations are more user-invariant, which explains their stronger generalization. These results suggest that activity counting provides a more practical and generalizable alternative to identity-dependent formulations for multi-user WiFi sensing.
CVMay 24, 2024
ETTrack: Enhanced Temporal Motion Predictor for Multi-Object TrackingXudong Han, Nobuyuki Oishi, Yueying Tian et al.
Many Multi-Object Tracking (MOT) approaches exploit motion information to associate all the detected objects across frames. However, many methods that rely on filtering-based algorithms, such as the Kalman Filter, often work well in linear motion scenarios but struggle to accurately predict the locations of objects undergoing complex and non-linear movements. To tackle these scenarios, we propose a motion-based MOT approach with an enhanced temporal motion predictor, ETTrack. Specifically, the motion predictor integrates a transformer model and a Temporal Convolutional Network (TCN) to capture short-term and long-term motion patterns, and it predicts the future motion of individual objects based on the historical motion information. Additionally, we propose a novel Momentum Correction Loss function that provides additional information regarding the motion direction of objects during training. This allows the motion predictor rapidly adapt to motion variations and more accurately predict future motion. Our experimental results demonstrate that ETTrack achieves a competitive performance compared with state-of-the-art trackers on DanceTrack and SportsMOT, scoring 56.4% and 74.4% in HOTA metrics, respectively.
CVJan 25, 2025
Enhancing Fetal Plane Classification Accuracy with Data Augmentation Using Diffusion ModelsYueying Tian, Elif Ucurum, Xudong Han et al.
Ultrasound imaging is widely used in medical diagnosis, especially for fetal health assessment. However, the availability of high-quality annotated ultrasound images is limited, which restricts the training of machine learning models. In this paper, we investigate the use of diffusion models to generate synthetic ultrasound images to improve the performance on fetal plane classification. We train different classifiers first on synthetic images and then fine-tune them with real images. Extensive experimental results demonstrate that incorporating generated images into training pipelines leads to better classification accuracy than training with real images alone. The findings suggest that generating synthetic data using diffusion models can be a valuable tool in overcoming the challenges of data scarcity in ultrasound medical imaging.
LGAug 18, 2025
Physically Plausible Data Augmentations for Wearable IMU-based Human Activity Recognition Using Physics SimulationNobuyuki Oishi, Philip Birch, Daniel Roggen et al.
The scarcity of high-quality labeled data in sensor-based Human Activity Recognition (HAR) hinders model performance and limits generalization across real-world scenarios. Data augmentation is a key strategy to mitigate this issue by enhancing the diversity of training datasets. Signal Transformation-based Data Augmentation (STDA) techniques have been widely used in HAR. However, these methods are often physically implausible, potentially resulting in augmented data that fails to preserve the original meaning of the activity labels. In this study, we introduce and systematically characterize Physically Plausible Data Augmentation (PPDA) enabled by physics simulation. PPDA leverages human body movement data from motion capture or video-based pose estimation and incorporates various realistic variabilities through physics simulation, including modifying body movements, sensor placements, and hardware-related effects. We compare the performance of PPDAs with traditional STDAs on three public datasets of daily activities and fitness workouts. First, we evaluate each augmentation method individually, directly comparing PPDAs to their STDA counterparts. Next, we assess how combining multiple PPDAs can reduce the need for initial data collection by varying the number of subjects used for training. Experiments show consistent benefits of PPDAs, improving macro F1 scores by an average of 3.7 pp (up to 13 pp) and achieving competitive performance with up to 60% fewer training subjects than STDAs. As the first systematic study of PPDA in sensor-based HAR, these results highlight the advantages of pursuing physical plausibility in data augmentation and the potential of physics simulation for generating synthetic Inertial Measurement Unit data for training deep learning HAR models. This cost-effective and scalable approach therefore helps address the annotation scarcity challenge in HAR.
CVAug 11, 2025
GRASPTrack: Geometry-Reasoned Association via Segmentation and Projection for Multi-Object TrackingXudong Han, Pengcheng Fang, Yueying Tian et al.
Multi-object tracking (MOT) in monocular videos is fundamentally challenged by occlusions and depth ambiguity, issues that conventional tracking-by-detection (TBD) methods struggle to resolve owing to a lack of geometric awareness. To address these limitations, we introduce GRASPTrack, a novel depth-aware MOT framework that integrates monocular depth estimation and instance segmentation into a standard TBD pipeline to generate high-fidelity 3D point clouds from 2D detections, thereby enabling explicit 3D geometric reasoning. These 3D point clouds are then voxelized to enable a precise and robust Voxel-Based 3D Intersection-over-Union (IoU) for spatial association. To further enhance tracking robustness, our approach incorporates Depth-aware Adaptive Noise Compensation, which dynamically adjusts the Kalman filter process noise based on occlusion severity for more reliable state estimation. Additionally, we propose a Depth-enhanced Observation-Centric Momentum, which extends the motion direction consistency from the image plane into 3D space to improve motion-based association cues, particularly for objects with complex trajectories. Extensive experiments on the MOT17, MOT20, and DanceTrack benchmarks demonstrate that our method achieves competitive performance, significantly improving tracking robustness in complex scenes with frequent occlusions and intricate motion patterns.