46.2LGApr 15
Robust Adversarial Policy Optimization Under Dynamics UncertaintyMintae Kim, Koushil Sreenath
Reinforcement learning (RL) policies often fail under dynamics that differ from training, a gap not fully addressed by domain randomization or existing adversarial RL methods. Distributionally robust RL provides a formal remedy but still relies on surrogate adversaries to approximate intractable primal problems, leaving blind spots that potentially cause instability and over-conservatism. We propose a dual formulation that directly exposes the robustness-performance trade-off. At the trajectory level, a temperature parameter from the dual problem is approximated with an adversarial network, yielding efficient and stable worst-case rollouts within a divergence bound. At the model level, we employ Boltzmann reweighting over dynamics ensembles, focusing on more adverse environments to the current policy rather than uniform sampling. The two components act independently and complement each other: trajectory-level steering ensures robust rollouts, while model-level sampling provides policy-sensitive coverage of adverse dynamics. The resulting framework, robust adversarial policy optimization (RAPO) outperforms robust RL baselines, improving resilience to uncertainty and generalization to out-of-distribution dynamics while maintaining dual tractability.
61.2LGApr 10
WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement LearningMintae Kim, Koushil Sreenath
Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically assumes a given fixed dataset and does not address how to generate reliable data for transfer. We propose \textit{World Model-based Experience Transfer} (WOMBET), a framework that jointly generates and utilizes prior data. WOMBET learns a world model in the source task and generates offline data via uncertainty-penalized planning, followed by filtering trajectories with high return and low epistemic uncertainty. It then performs online fine-tuning in the target task using adaptive sampling between offline and online data, enabling a stable transition from prior-driven initialization to task-specific adaptation. We show that the uncertainty-penalized objective provides a lower bound on the true return and derive a finite-sample error decomposition capturing distribution mismatch and approximation error. Empirically, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks, demonstrating the benefit of jointly optimizing data generation and transfer.
ROSep 14, 2025Code
RoVerFly: Robust and Versatile Implicit Hybrid Control of Quadrotor-Payload SystemsMintae Kim, Jiaze Cai, Koushil Sreenath
Designing robust controllers for precise trajectory tracking with quadrotors is challenging due to nonlinear dynamics and underactuation, and becomes harder with flexible cable-suspended payloads that add degrees of freedom and hybrid dynamics. Classical model-based methods offer stability guarantees but require extensive tuning and often fail to adapt when the configuration changes-when a payload is added or removed, or when its mass or cable length varies. We present RoVerFly, a unified learning-based control framework where a single reinforcement learning (RL) policy functions as an implicit hybrid controller, managing complex dynamics without explicit mode detection or controller switching. Trained with task and domain randomization, the controller is resilient to disturbances and varying dynamics. It achieves strong zero-shot generalization across payload settings-including no payload as well as varying mass and cable length-without re-tuning, while retaining the interpretability and structure of a feedback tracking controller. Code and supplementary materials are available at https://github.com/mintaeshkim/roverfly.
CLJul 9, 2025Code
ixi-GEN: Efficient Industrial sLLMs through Domain Adaptive Continual PretrainingSeonwu Kim, Yohan Na, Kihun Kim et al.
The emergence of open-source large language models (LLMs) has expanded opportunities for enterprise applications; however, many organizations still lack the infrastructure to deploy and maintain large-scale models. As a result, small LLMs (sLLMs) have become a practical alternative despite inherent performance limitations. While Domain Adaptive Continual Pretraining (DACP) has been explored for domain adaptation, its utility in commercial settings remains under-examined. In this study, we validate the effectiveness of a DACP-based recipe across diverse foundation models and service domains, producing DACP-applied sLLMs (ixi-GEN). Through extensive experiments and real-world evaluations, we demonstrate that ixi-GEN models achieve substantial gains in target-domain performance while preserving general capabilities, offering a cost-efficient and scalable solution for enterprise-level deployment.
IRJan 12
ReinPool: Reinforcement Learning Pooling Multi-Vector Embeddings for Retrieval SystemSungguk Cha, DongWook Kim, Mintae Kim et al.
Multi-vector embedding models have emerged as a powerful paradigm for document retrieval, preserving fine-grained visual and textual details through token-level representations. However, this expressiveness comes at a staggering cost: storing embeddings for every token inflates index sizes by over $1000\times$ compared to single-vector approaches, severely limiting scalability. We introduce \textbf{ReinPool}, a reinforcement learning framework that learns to dynamically filter and pool multi-vector embeddings into compact, retrieval-optimized representations. By training with an inverse retrieval objective and NDCG-based rewards, ReinPool identifies and retains only the most discriminative vectors without requiring manual importance annotations. On the Vidore V2 benchmark across three vision-language embedding models, ReinPool compresses multi-vector representations by $746$--$1249\times$ into single vectors while recovering 76--81\% of full multi-vector retrieval performance. Compared to static mean pooling baselines, ReinPool achieves 22--33\% absolute NDCG@3 improvement, demonstrating that learned selection significantly outperforms heuristic aggregation.
CVJul 31, 2025
Generalized Reinforcement Learning for Retriever-Specific Query Rewriter with Unstructured Real-World DocumentsSungguk Cha, DongWook Kim, Taeseung Hahn et al.
Retrieval-Augmented Generation (RAG) systems rely heavily on effective query formulation to unlock external knowledge, yet optimizing queries for diverse, unstructured real-world documents remains a challenge. We introduce \textbf{RL-QR}, a reinforcement learning framework for retriever-specific query rewriting that eliminates the need for human-annotated datasets and extends applicability to both text-only and multi-modal databases. By synthesizing scenario-question pairs and leveraging Generalized Reward Policy Optimization (GRPO), RL-QR trains query rewriters tailored to specific retrievers, enhancing retrieval performance across varied domains. Experiments on industrial in-house data demonstrate significant improvements, with $\text{RL-QR}_{\text{multi-modal}}$ achieving an 11\% relative gain in NDCG@3 for multi-modal RAG and $\text{RL-QR}_{\text{lexical}}$ yielding a 9\% gain for lexical retrievers. However, challenges persist with semantic and hybrid retrievers, where rewriters failed to improve performance, likely due to training misalignments. Our findings highlight RL-QR's potential to revolutionize query optimization for RAG systems, offering a scalable, annotation-free solution for real-world retrieval tasks, while identifying avenues for further refinement in semantic retrieval contexts.