LGMar 22, 2022
Insights From the NeurIPS 2021 NetHack ChallengeEric Hambro, Sharada Mohanty, Dmitrii Babaev et al. · deepmind, oxford
In this report, we summarize the takeaways from the first NeurIPS 2021 NetHack Challenge. Participants were tasked with developing a program or agent that can win (i.e., 'ascend' in) the popular dungeon-crawler game of NetHack by interacting with the NetHack Learning Environment (NLE), a scalable, procedurally generated, and challenging Gym environment for reinforcement learning (RL). The challenge showcased community-driven progress in AI with many diverse approaches significantly beating the previously best results on NetHack. Furthermore, it served as a direct comparison between neural (e.g., deep RL) and symbolic AI, as well as hybrid systems, demonstrating that on NetHack symbolic bots currently outperform deep RL by a large margin. Lastly, no agent got close to winning the game, illustrating NetHack's suitability as a long-term benchmark for AI research.
CVJan 26, 2023Code
RMSim: Controlled Respiratory Motion Simulation on Static Patient ScansDonghoon Lee, Ellen Yorke, Masoud Zarepisheh et al.
This work aims to generate realistic anatomical deformations from static patient scans. Specifically, we present a method to generate these deformations/augmentations via deep learning driven respiratory motion simulation that provides the ground truth for validating deformable image registration (DIR) algorithms and driving more accurate deep learning based DIR. We present a novel 3D Seq2Seq deep learning respiratory motion simulator (RMSim) that learns from 4D-CT images and predicts future breathing phases given a static CT image. The predicted respiratory patterns, represented by time-varying displacement vector fields (DVFs) at different breathing phases, are modulated through auxiliary inputs of 1D breathing traces so that a larger amplitude in the trace results in more significant predicted deformation. Stacked 3D-ConvLSTMs are used to capture the spatial-temporal respiration patterns. Training loss includes a smoothness loss in the DVF and mean-squared error between the predicted and ground truth phase images. A spatial transformer deforms the static CT with the predicted DVF to generate the predicted phase image. 10-phase 4D-CTs of 140 internal patients were used to train and test RMSim. The trained RMSim was then used to augment a public DIR challenge dataset for training VoxelMorph to show the effectiveness of RMSim-generated deformation augmentation. We validated our RMSim output with both private and public benchmark datasets (healthy and cancer patients). The proposed approach can be used for validating DIR algorithms as well as for patient-specific augmentations to improve deep learning DIR algorithms. The code, pretrained models, and augmented DIR validation datasets will be released at https://github.com/nadeemlab/SeqX2Y.
CVMar 21, 2023
MAGVLT: Masked Generative Vision-and-Language TransformerSungwoong Kim, Daejin Jo, Donghoon Lee et al.
While generative modeling on multimodal image-text data has been actively developed with large-scale paired datasets, there have been limited attempts to generate both image and text data by a single model rather than a generation of one fixed modality conditioned on the other modality. In this paper, we explore a unified generative vision-and-language (VL) model that can produce both images and text sequences. Especially, we propose a generative VL transformer based on the non-autoregressive mask prediction, named MAGVLT, and compare it with an autoregressive generative VL transformer (ARGVLT). In comparison to ARGVLT, the proposed MAGVLT enables bidirectional context encoding, fast decoding by parallel token predictions in an iterative refinement, and extended editing capabilities such as image and text infilling. For rigorous training of our MAGVLT with image-text pairs from scratch, we combine the image-to-text, text-to-image, and joint image-and-text mask prediction tasks. Moreover, we devise two additional tasks based on the step-unrolled mask prediction and the selective prediction on the mixture of two image-text pairs. Experimental results on various downstream generation tasks of VL benchmarks show that our MAGVLT outperforms ARGVLT by a large margin even with significant inference speedup. Particularly, MAGVLT achieves competitive results on both zero-shot image-to-text and text-to-image generation tasks from MS-COCO by one moderate-sized model (fewer than 500M parameters) even without the use of monomodal data and networks.
LGOct 11, 2022
LECO: Learnable Episodic Count for Task-Specific Intrinsic RewardDaejin Jo, Sungwoong Kim, Daniel Wontae Nam et al.
Episodic count has been widely used to design a simple yet effective intrinsic motivation for reinforcement learning with a sparse reward. However, the use of episodic count in a high-dimensional state space as well as over a long episode time requires a thorough state compression and fast hashing, which hinders rigorous exploitation of it in such hard and complex exploration environments. Moreover, the interference from task-irrelevant observations in the episodic count may cause its intrinsic motivation to overlook task-related important changes of states, and the novelty in an episodic manner can lead to repeatedly revisit the familiar states across episodes. In order to resolve these issues, in this paper, we propose a learnable hash-based episodic count, which we name LECO, that efficiently performs as a task-specific intrinsic reward in hard exploration problems. In particular, the proposed intrinsic reward consists of the episodic novelty and the task-specific modulation where the former employs a vector quantized variational autoencoder to automatically obtain the discrete state codes for fast counting while the latter regulates the episodic novelty by learning a modulator to optimize the task-specific extrinsic reward. The proposed LECO specifically enables the automatic transition from exploration to exploitation during reinforcement learning. We experimentally show that in contrast to the previous exploration methods LECO successfully solves hard exploration problems and also scales to large state spaces through the most difficult tasks in MiniGrid and DMLab environments.
AIAug 23, 2024
Enhancing Knowledge Tracing with Concept Map and Response DisentanglementSoonwook Park, Donghoon Lee, Hogun Park
In the rapidly advancing realm of educational technology, it becomes critical to accurately trace and understand student knowledge states. Conventional Knowledge Tracing (KT) models have mainly focused on binary responses (i.e., correct and incorrect answers) to questions. Unfortunately, they largely overlook the essential information in students' actual answer choices, particularly for Multiple Choice Questions (MCQs), which could help reveal each learner's misconceptions or knowledge gaps. To tackle these challenges, we propose the Concept map-driven Response disentanglement method for enhancing Knowledge Tracing (CRKT) model. CRKT benefits KT by directly leveraging answer choices--beyond merely identifying correct or incorrect answers--to distinguish responses with different incorrect choices. We further introduce the novel use of unchosen responses by employing disentangled representations to get insights from options not selected by students. Additionally, CRKT tracks the student's knowledge state at the concept level and encodes the concept map, representing the relationships between them, to better predict unseen concepts. This approach is expected to provide actionable feedback, improving the learning experience. Our comprehensive experiments across multiple datasets demonstrate CRKT's effectiveness, achieving superior performance in prediction accuracy and interpretability over state-of-the-art models.
LGJun 15, 2025
Enhancing Rating-Based Reinforcement Learning to Effectively Leverage Feedback from Large Vision-Language ModelsTung Minh Luu, Younghwan Lee, Donghoon Lee et al.
Designing effective reward functions remains a fundamental challenge in reinforcement learning (RL), as it often requires extensive human effort and domain expertise. While RL from human feedback has been successful in aligning agents with human intent, acquiring high-quality feedback is costly and labor-intensive, limiting its scalability. Recent advancements in foundation models present a promising alternative--leveraging AI-generated feedback to reduce reliance on human supervision in reward learning. Building on this paradigm, we introduce ERL-VLM, an enhanced rating-based RL method that effectively learns reward functions from AI feedback. Unlike prior methods that rely on pairwise comparisons, ERL-VLM queries large vision-language models (VLMs) for absolute ratings of individual trajectories, enabling more expressive feedback and improved sample efficiency. Additionally, we propose key enhancements to rating-based RL, addressing instability issues caused by data imbalance and noisy labels. Through extensive experiments across both low-level and high-level control tasks, we demonstrate that ERL-VLM significantly outperforms existing VLM-based reward generation methods. Our results demonstrate the potential of AI feedback for scaling RL with minimal human intervention, paving the way for more autonomous and efficient reward learning.
CLMar 30, 2025
FeRG-LLM : Feature Engineering by Reason Generation Large Language ModelsJeonghyun Ko, Gyeongyun Park, Donghoon Lee et al.
One of the key tasks in machine learning for tabular data is feature engineering. Although it is vital for improving the performance of models, it demands considerable human expertise and deep domain knowledge, making it labor-intensive endeavor. To address this issue, we propose a novel framework, \textbf{FeRG-LLM} (\textbf{Fe}ature engineering by \textbf{R}eason \textbf{G}eneration \textbf{L}arge \textbf{L}anguage \textbf{M}odels), a large language model designed to automatically perform feature engineering at an 8-billion-parameter scale. We have constructed two-stage conversational dialogues that enable language models to analyze machine learning tasks and discovering new features, exhibiting their Chain-of-Thought (CoT) capabilities. We use these dialogues to fine-tune Llama 3.1 8B model and integrate Direct Preference Optimization (DPO) to receive feedback improving quality of new features and the model's performance. Our experiments show that FeRG-LLM performs comparably to or better than Llama 3.1 70B on most datasets, while using fewer resources and achieving reduced inference time. It outperforms other studies in classification tasks and performs well in regression tasks. Moreover, since it does not rely on cloud-hosted LLMs like GPT-4 with extra API costs when generating features, it can be deployed locally, addressing security concerns.
LGApr 3, 2025
Reward Generation via Large Vision-Language Model in Offline Reinforcement LearningYounghwan Lee, Tung M. Luu, Donghoon Lee et al.
In offline reinforcement learning (RL), learning from fixed datasets presents a promising solution for domains where real-time interaction with the environment is expensive or risky. However, designing dense reward signals for offline dataset requires significant human effort and domain expertise. Reinforcement learning with human feedback (RLHF) has emerged as an alternative, but it remains costly due to the human-in-the-loop process, prompting interest in automated reward generation models. To address this, we propose Reward Generation via Large Vision-Language Models (RG-VLM), which leverages the reasoning capabilities of LVLMs to generate rewards from offline data without human involvement. RG-VLM improves generalization in long-horizon tasks and can be seamlessly integrated with the sparse reward signals to enhance task performance, demonstrating its potential as an auxiliary reward signal.
LGJul 31, 2025
Policy Learning from Large Vision-Language Model Feedback without Reward ModelingTung M. Luu, Donghoon Lee, Younghwan Lee et al.
Offline reinforcement learning (RL) provides a powerful framework for training robotic agents using pre-collected, suboptimal datasets, eliminating the need for costly, time-consuming, and potentially hazardous online interactions. This is particularly useful in safety-critical real-world applications, where online data collection is expensive and impractical. However, existing offline RL algorithms typically require reward labeled data, which introduces an additional bottleneck: reward function design is itself costly, labor-intensive, and requires significant domain expertise. In this paper, we introduce PLARE, a novel approach that leverages large vision-language models (VLMs) to provide guidance signals for agent training. Instead of relying on manually designed reward functions, PLARE queries a VLM for preference labels on pairs of visual trajectory segments based on a language task description. The policy is then trained directly from these preference labels using a supervised contrastive preference learning objective, bypassing the need to learn explicit reward models. Through extensive experiments on robotic manipulation tasks from the MetaWorld, PLARE achieves performance on par with or surpassing existing state-of-the-art VLM-based reward generation methods. Furthermore, we demonstrate the effectiveness of PLARE in real-world manipulation tasks with a physical robot, further validating its practical applicability.
CVMay 23, 2025
Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLMDonghwan Chi, Hyomin Kim, Yoonjin Oh et al.
Recently, multimodal large language models (MLLMs) have emerged as a key approach in achieving artificial general intelligence. In particular, vision-language MLLMs have been developed to generate not only text but also visual outputs from multimodal inputs. This advancement requires efficient image tokens that LLMs can process effectively both in input and output. However, existing image tokenization methods for MLLMs typically capture only global abstract concepts or uniformly segmented image patches, restricting MLLMs' capability to effectively understand or generate detailed visual content, particularly at the object level. To address this limitation, we propose an object-centric visual tokenizer based on Slot Attention specifically for MLLMs. In particular, based on the Q-Former encoder, diffusion decoder, and residual vector quantization, our proposed discretized slot tokens can encode local visual details while maintaining high-level semantics, and also align with textual data to be integrated seamlessly within a unified next-token prediction framework of LLMs. The resulting Slot-MLLM demonstrates significant performance improvements over baselines with previous visual tokenizers across various vision-language tasks that entail local detailed comprehension and generation. Notably, this work is the first demonstration of the feasibility of object-centric slot attention performed with MLLMs and in-the-wild natural images.
LGJul 13, 2021
Automated Learning Rate Scheduler for Large-batch TrainingChiheon Kim, Saehoon Kim, Jongmin Kim et al.
Large-batch training has been essential in leveraging large-scale datasets and models in deep learning. While it is computationally beneficial to use large batch sizes, it often requires a specially designed learning rate (LR) schedule to achieve a comparable level of performance as in smaller batch training. Especially, when the number of training epochs is constrained, the use of a large LR and a warmup strategy is critical in the final performance of large-batch training due to the reduced number of updating steps. In this work, we propose an automated LR scheduling algorithm which is effective for neural network training with a large batch size under the given epoch budget. In specific, the whole schedule consists of two phases: adaptive warmup and predefined decay, where the LR is increased until the training loss no longer decreases and decreased to zero until the end of training. Here, whether the training loss has reached the minimum value is robustly checked with Gaussian process smoothing in an online manner with a low computational burden. Coupled with adaptive stochastic optimizers such as AdamP and LAMB, the proposed scheduler successfully adjusts the LRs without cumbersome hyperparameter tuning and achieves comparable or better performances than tuned baselines on various image classification benchmarks and architectures with a wide range of batch sizes.
CVJun 16, 2021
Deformation Driven Seq2Seq Longitudinal Tumor and Organs-at-Risk Prediction for RadiotherapyDonghoon Lee, Sadegh R Alam, Jue Jiang et al.
Purpose: Radiotherapy presents unique challenges and clinical requirements for longitudinal tumor and organ-at-risk (OAR) prediction during treatment. The challenges include tumor inflammation/edema and radiation-induced changes in organ geometry, whereas the clinical requirements demand flexibility in input/output sequence timepoints to update the predictions on rolling basis and the grounding of all predictions in relationship to the pre-treatment imaging information for response and toxicity assessment in adaptive radiotherapy. Methods: To deal with the aforementioned challenges and to comply with the clinical requirements, we present a novel 3D sequence-to-sequence model based on Convolution Long Short Term Memory (ConvLSTM) that makes use of series of deformation vector fields (DVF) between individual timepoints and reference pre-treatment/planning CTs to predict future anatomical deformations and changes in gross tumor volume as well as critical OARs. High-quality DVF training data is created by employing hyper-parameter optimization on the subset of the training data with DICE coefficient and mutual information metric. We validated our model on two radiotherapy datasets: a publicly available head-and-neck dataset (28 patients with manually contoured pre-, mid-, and post-treatment CTs), and an internal non-small cell lung cancer dataset (63 patients with manually contoured planning CT and 6 weekly CBCTs). Results: The use of DVF representation and skip connections overcomes the blurring issue of ConvLSTM prediction with the traditional image representation. The mean and standard deviation of DICE for predictions of lung GTV at week 4, 5, and 6 were 0.83$\pm$0.09, 0.82$\pm$0.08, and 0.81$\pm$0.10, respectively, and for post-treatment ipsilateral and contralateral parotids, were 0.81$\pm$0.06 and 0.85$\pm$0.02.
LGMay 18, 2020
Entropy-Augmented Entropy-Regularized Reinforcement Learning and a Continuous Path from Policy Gradient to Q-LearningDonghoon Lee
Entropy augmented to reward is known to soften the greedy argmax policy to softmax policy. Entropy augmentation is reformulated and leads to a motivation to introduce an additional entropy term to the objective function in the form of KL-divergence to regularize optimization process. It results in a policy which monotonically improves while interpolating from the current policy to the softmax greedy policy. This policy is used to build a continuously parameterized algorithm which optimize policy and Q-function simultaneously and whose extreme limits correspond to policy gradient and Q-learning, respectively. Experiments show that there can be a performance gain using an intermediate algorithm.
SPSep 5, 2019
A Variational Bayes Approach to Adaptive Radio TomographyDonghoon Lee, Georgios B. Giannakis
Radio tomographic imaging (RTI) is an emerging technology for localization of physical objects in a geographical area covered by wireless networks. With attenuation measurements collected at spatially distributed sensors, RTI capitalizes on spatial loss fields (SLFs) measuring the absorption of radio frequency waves at spatial locations along the propagation path. These SLFs can be utilized for interference management in wireless communication networks, environmental monitoring, and survivor localization after natural disasters such as earthquakes. Key to the success of RTI is to accurately model shadowing as the weighted line integral of the SLF. To learn the SLF exhibiting statistical heterogeneity induced by spatially diverse environments, the present work develops a Bayesian framework entailing a piecewise homogeneous SLF with an underlying hidden Markov random field model. Utilizing variational Bayes techniques, the novel approach yields efficient field estimators at affordable complexity. A data-adaptive sensor selection strategy is also introduced to collect informative measurements for effective reconstruction of the SLF. Numerical tests using synthetic and real datasets demonstrate the capabilities of the proposed approach to radio tomography and channel-gain estimation.
CVMar 15, 2019
Inserting Videos into VideosDonghoon Lee, Tomas Pfister, Ming-Hsuan Yang
In this paper, we introduce a new problem of manipulating a given video by inserting other videos into it. Our main task is, given an object video and a scene video, to insert the object video at a user-specified location in the scene video so that the resulting video looks realistic. We aim to handle different object motions and complex backgrounds without expensive segmentation annotations. As it is difficult to collect training pairs for this problem, we synthesize fake training pairs that can provide helpful supervisory signals when training a neural network with unpaired real data. The proposed network architecture can take both real and fake pairs as input and perform both supervised and unsupervised training in an adversarial learning scheme. To synthesize a realistic video, the network renders each frame based on the current input and previous frames. Within this framework, we observe that injecting noise into previous frames while generating the current frame stabilizes training. We conduct experiments on real-world videos in object tracking and person re-identification benchmark datasets. Experimental results demonstrate that the proposed algorithm is able to synthesize long sequences of realistic videos with a given object video inserted.
CVDec 6, 2018
Context-Aware Synthesis and Placement of Object InstancesDonghoon Lee, Sifei Liu, Jinwei Gu et al.
Learning to insert an object instance into an image in a semantically coherent manner is a challenging and interesting problem. Solving it requires (a) determining a location to place an object in the scene and (b) determining its appearance at the location. Such an object insertion model can potentially facilitate numerous image editing and scene parsing applications. In this paper, we propose an end-to-end trainable neural network for the task of inserting an object instance mask of a specified class into the semantic label map of an image. Our network consists of two generative modules where one determines where the inserted object mask should be (i.e., location and scale) and the other determines what the object mask shape (and pose) should look like. The two modules are connected together via a spatial transformation network and jointly trained. We devise a learning procedure that leverage both supervised and unsupervised data and show our model can insert an object at diverse locations with various appearances. We conduct extensive experimental validations with comparisons to strong baselines to verify the effectiveness of the proposed network.
CVMay 28, 2018
Learning Instance-Aware Object Detection Using Determinantal Point ProcessesNuri Kim, Donghoon Lee, Songhwai Oh
Recent object detectors find instances while categorizing candidate regions. As each region is evaluated independently, the number of candidate regions from a detector is usually larger than the number of objects. Since the final goal of detection is to assign a single detection to each object, a heuristic algorithm, such as non-maximum suppression (NMS), is used to select a single bounding box for an object. While simple heuristic algorithms are effective for stand-alone objects, they can fail to detect overlapped objects. In this paper, we address this issue by training a network to distinguish different objects using the relationship between candidate boxes. We propose an instance-aware detection network (IDNet), which can learn to extract features from candidate regions and measure their similarities. Based on pairwise similarities and detection qualities, the IDNet selects a subset of candidate bounding boxes using instance-aware determinantal point process inference (IDPP). Extensive experiments demonstrate that the proposed algorithm achieves significant improvements for detecting overlapped objects compared to existing state-of-the-art detection methods on the PASCAL VOC and MS COCO datasets.
CVJan 12, 2018
Generative Single Image Reflection SeparationDonghoon Lee, Ming-Hsuan Yang, Songhwai Oh
Single image reflection separation is an ill-posed problem since two scenes, a transmitted scene and a reflected scene, need to be inferred from a single observation. To make the problem tractable, in this work we assume that categories of two scenes are known. It allows us to address the problem by generating both scenes that belong to the categories while their contents are constrained to match with the observed image. A novel network architecture is proposed to render realistic images of both scenes based on adversarial learning. The network can be trained in a weakly supervised manner, i.e., it learns to separate an observed image without corresponding ground truth images of transmission and reflection scenes which are difficult to collect in practice. Experimental results on real and synthetic datasets demonstrate that the proposed algorithm performs favorably against existing methods.
CRJul 29, 2017
Privacy Protection Cache Policy on Hybrid Main MemoryNa-Young Ahn, Donghoon Lee
We firstly suggest privacy protection cache policy applying the duty to delete personal information on a hybrid main memory system. This cache policy includes generating random data and overwriting the random data into the personal information. Proposed cache policy is more economical and effective regarding perfect deletion of data.
CVMar 31, 2017
Unsupervised Holistic Image Generation from Key Local PatchesDonghoon Lee, Sangdoo Yun, Sungjoon Choi et al.
We introduce a new problem of generating an image based on a small number of key local patches without any geometric prior. In this work, key local patches are defined as informative regions of the target object or scene. This is a challenging problem since it requires generating realistic images and predicting locations of parts at the same time. We construct adversarial networks to tackle this problem. A generator network generates a fake image as well as a mask based on the encoder-decoder framework. On the other hand, a discriminator network aims to detect fake images. The network is trained with three losses to consider spatial, appearance, and adversarial information. The spatial loss determines whether the locations of predicted parts are correct. Input patches are restored in the output image without much modification due to the appearance loss. The adversarial loss ensures output images are realistic. The proposed network is trained without supervisory signals since no labels of key parts are required. Experimental results on six datasets demonstrate that the proposed algorithm performs favorably on challenging objects and scenes.