LG Jan 29
Memorization Control in Diffusion Models from Denoising-centric Perspective
Thuy Phuong Vu, Mai Viet Hoang Do, Minhhuy Le et al.
Controlling memorization in diffusion models is critical for applications that require generated data to closely match the training distribution. Existing approaches mainly focus on data-centric or model-centric modifications, treating the diffusion model as an isolated predictor. In this paper, we study memorization in diffusion models from a denoising-centric perspective. We show that uniform timestep sampling leads to unequal learning contributions across denoising steps due to differences in signal-to-noise ratio, which biases training toward memorization. To address this, we propose a timestep sampling strategy that explicitly controls where learning occurs along the denoising trajectory. By adjusting the width of the confidence interval, our method provides direct control over the memorization-generalization trade-off. Experiments on image and 1D signal generation tasks demonstrate that shifting learning emphasis toward later denoising steps consistently reduces memorization and improves distributional alignment with training data, validating the generality and effectiveness of our approach.
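The abstract does not say which distribution the timestep sampler uses, so the following is only a rough sketch of the general idea, not the authors' method. It assumes a Gaussian over normalized time that is truncated to its central confidence interval. The names `sample_timesteps`, `center`, `sigma`, and `confidence` are hypothetical, and how `center` maps to "later denoising steps" depends on the timestep convention of the training code.

```python
import random
from statistics import NormalDist


def sample_timesteps(batch_size, num_steps=1000, center=0.7,
                     sigma=0.15, confidence=0.95, rng=None):
    """Sample integer timesteps concentrated around `center` (normalized time).

    Assumption (not from the paper): samples come from a Gaussian with mean
    `center` and standard deviation `sigma`, kept only if they fall inside the
    central confidence interval. A smaller `confidence` gives a narrower
    interval and therefore a tighter focus on one part of the trajectory.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    rng = rng or random.Random()
    # Half-width of the interval in units of sigma (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    lo = max(0.0, center - z * sigma)
    hi = min(1.0, center + z * sigma)
    timesteps = []
    while len(timesteps) < batch_size:
        u = rng.gauss(center, sigma)
        if lo <= u <= hi:  # rejection step: keep samples inside the interval
            timesteps.append(min(num_steps - 1, int(u * num_steps)))
    return timesteps
```

In a training loop, a call such as `sample_timesteps(batch_size)` would stand in for the usual uniform draw over `range(num_steps)`; narrowing `confidence` restricts the range of steps that receive gradient updates.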
CV Oct 19, 2025
Region in Context: Text-condition Image editing with Human-like semantic reasoning
Thuy Phuong Vu, Dinh-Cuong Hoang, Minhhuy Le et al.
Recent research has made significant progress in localizing and editing image regions based on text. However, most approaches treat these regions in isolation, relying solely on local cues without accounting for how each part contributes to the overall visual and semantic composition. This often results in inconsistent edits, unnatural transitions, or loss of coherence across the image. In this work, we propose Region in Context, a novel framework for text-conditioned image editing that performs multilevel semantic alignment between vision and language, inspired by the human ability to reason about edits in relation to the whole scene. Our method encourages each region to understand its role within the global image context, enabling precise and harmonized changes. At its core, the framework introduces a dual-level guidance mechanism: regions are represented with full-image context and aligned with detailed region-level descriptions, while the entire image is simultaneously matched to a comprehensive scene-level description generated by a large vision-language model. These descriptions serve as explicit verbal references of the intended content, guiding both local modifications and global structure. Experiments show that it produces more coherent and instruction-aligned results. Code is available at: https://github.com/thuyvuphuong/Region-in-Context.git
RO Aug 22, 2019
Object-RPE: Dense 3D Reconstruction and Pose Estimation with Convolutional Neural Networks for Warehouse Robots
Dinh-Cuong Hoang, Todor Stoyanov, Achim J. Lilienthal
We present an approach for recognizing all objects in a scene and estimating their full pose from an accurate 3D instance-aware semantic reconstruction using an RGB-D camera. Our framework couples convolutional neural networks (CNNs) and a state-of-the-art dense Simultaneous Localisation and Mapping (SLAM) system, ElasticFusion, to achieve both high-quality semantic reconstruction and robust 6D pose estimation for relevant objects. While the main trend in CNN-based 6D pose estimation has been to infer an object's position and orientation from single views of the scene, our approach explores performing pose estimation from multiple viewpoints, under the conjecture that combining multiple predictions can improve the robustness of an object detection system. The resulting system is capable of producing high-quality object-aware semantic reconstructions of room-sized environments, as well as accurately detecting objects and their 6D poses. The developed method has been verified through experimental validation on the YCB-Video dataset and a newly collected warehouse object dataset. Experimental results confirmed that the proposed system achieves improvements over state-of-the-art methods in terms of surface reconstruction and object pose prediction. Our code and video are available at https://sites.google.com/view/object-rpe.
RO Mar 26, 2019
High-quality Instance-aware Semantic 3D Map Using RGB-D Camera
Dinh-Cuong Hoang, Todor Stoyanov, Achim J. Lilienthal
We present a mapping system capable of constructing detailed instance-level semantic models of room-sized indoor environments by means of an RGB-D camera. In this work, we integrate deep-learning-based instance segmentation and classification into a state-of-the-art RGB-D SLAM system. We leverage the pipeline of ElasticFusion [1] as a backbone and propose modifications of the registration cost function. The proposed objective function features a tunable weight for the appearance channel, which can be learned from data. The resulting system is capable of producing accurate semantic maps of room-sized environments, as well as reconstructing highly detailed object-level models. The developed method has been verified through experimental validation on the TUM RGB-D SLAM benchmark and the YCB-Video dataset. Our results confirmed that the proposed system performs favorably in terms of trajectory estimation, surface reconstruction, and segmentation quality in comparison to other state-of-the-art systems.
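The abstract does not give the exact form of the modified objective. As a minimal sketch, assuming the common pattern of a geometric term plus a weighted appearance (photometric) term, the role of the tunable weight can be illustrated as follows. The function name, the sum-of-squares form, and the default weight are illustrative assumptions, and the step that learns the weight from data is not shown.

```python
def registration_cost(geometric_residuals, photometric_residuals,
                      appearance_weight=0.1):
    """Joint registration cost: geometric error plus weighted appearance error.

    Assumption (not from the paper): both terms are plain sums of squared
    residuals, and `appearance_weight` is a placeholder default.
    """
    e_geometric = sum(r * r for r in geometric_residuals)
    e_appearance = sum(r * r for r in photometric_residuals)
    return e_geometric + appearance_weight * e_appearance
```

With `appearance_weight=0` the cost depends on geometry alone; larger values let color agreement pull the alignment more strongly, which is the trade-off a learned weight would tune.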