Jongho Park

LG
h-index44
23papers
488citations
Novelty57%
AI Score59

23 Papers

AIJan 30Code
THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

Seanie Lee, Sangwoo Park, Yumin Choi et al.

Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We propose ThinkSafe, a self-generated alignment framework that restores safety alignment without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. ThinkSafe unlocks this via lightweight refusal steering, guiding the model to generate in-distribution safety reasoning traces. Fine-tuning on these self-generated responses effectively realigns the model while minimizing distribution shift. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency. Notably, it achieves superior safety and comparable reasoning to GRPO, with significantly reduced computational cost. Code, models, and datasets are available at https://github.com/seanie12/ThinkSafe.git.

DSSep 20, 2023
Distribution-Independent Regression for Generalized Linear Models with Oblivious Corruptions

Ilias Diakonikolas, Sushrut Karmalkar, Jongho Park et al.

We demonstrate the first algorithms for the problem of regression for generalized linear models (GLMs) in the presence of additive oblivious noise. We assume we have sample access to examples $(x, y)$ where $y$ is a noisy measurement of $g(w^* \cdot x)$. In particular, \new{the noisy labels are of the form} $y = g(w^* \cdot x) + ξ+ ε$, where $ξ$ is the oblivious noise drawn independently of $x$ \new{and satisfies} $\Pr[ξ= 0] \geq o(1)$, and $ε\sim \mathcal N(0, σ^2)$. Our goal is to accurately recover a \new{parameter vector $w$ such that the} function $g(w \cdot x)$ \new{has} arbitrarily small error when compared to the true values $g(w^* \cdot x)$, rather than the noisy measurements $y$. We present an algorithm that tackles \new{this} problem in its most general distribution-independent setting, where the solution may not \new{even} be identifiable. \new{Our} algorithm returns \new{an accurate estimate of} the solution if it is identifiable, and otherwise returns a small list of candidates, one of which is close to the true solution. Furthermore, we \new{provide} a necessary and sufficient condition for identifiability, which holds in broad settings. \new{Specifically,} the problem is identifiable when the quantile at which $ξ+ ε= 0$ is known, or when the family of hypotheses does not contain candidates that are nearly equal to a translated $g(w^* \cdot x) + A$ for some real number $A$, while also having large error when compared to $g(w^* \cdot x)$. This is the first \new{algorithmic} result for GLM regression \new{with oblivious noise} which can handle more than half the samples being arbitrarily corrupted. Prior work focused largely on the setting of linear regression, and gave algorithms under restrictive assumptions.

NAJun 6, 2019
A Finite Element Approach for the Dual Rudin--Osher--Fatemi Model and Its Nonoverlapping Domain Decomposition Methods

Chang-Ock Lee, Eun-Hee Park, Jongho Park

We consider a finite element discretization for the dual Rudin--Osher--Fatemi model using a Raviart--Thomas basis for $H_0 (\mathrm{div};Ω)$. Since the proposed discretization has splitting property for the energy functional, which is not satisfied for existing finite difference based discretizations, it is more adequate for designing domain decomposition methods. In this paper, a primal domain decomposition method is proposed, which resembles the classical Schur complement method for the second order elliptic problems, and it achieves $O(1/n^2)$ convergence. A primal-dual domain decomposition method based on the method of Lagrange multipliers on the subdomain interfaces is also considered. Local problems of the proposed primal-dual domain decomposition method can be solved in linear convergence rate. Numerical results for the proposed methods are provided.

89.2LGMay 25
Looped Diffusion Language Models

Sanghyun Lee, Chunsan Hong, Seungryong Kim et al.

Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models for language modeling, yet the effective design of transformer architectures for MDMs remains underexplored. In this paper, we show that selectively looping the early-middle transformer layers significantly improves both training efficiency and model performance in MDMs. We call this approach LoopMDM(Looped Masked Diffusion Model), which brings two key benefits: looping layers at training-time yields a depth-scaling effect without adding parameters, while varying the number of loops at inference-time enables flexible compute scaling. Despite the simplicity, the results are striking: across multiple pre-training corpora, LoopMDM matches the performance of same-size MDMs with up to 3.3 fewer training FLOPs, while its final performance outperforms them on various reasoning benchmarks, including up to 8.5 points on GSM8K. It even surpasses deeper non-looped MDMs trained with comparable per-step compute, indicating that selective looping is more effective than naive depth scaling. Furthermore, LoopMDM can scale inference-time compute by increasing the number of loops. Adaptively adjusting the number of loops throughout the sampling process further yields additional gains in compute efficiency while maintaining performance. Lastly, with attention analysis, we provide evidence that looping is effective in MDMs by promoting interactions among masked positions. Our code and weights will be publicly released.

96.3NAApr 1
Parallel multilevel methods for solving the Darcy--Forchheimer model based on a nearly semicoercive formulation

Jongho Park, S. Majid Hassanizadeh

High-velocity fluid flow through porous media is modeled by prescribing a nonlinear relationship between the flow rate and the pressure gradient, called the Darcy--Forchheimer equation. This paper is concerned with the analysis of parallel multilevel methods for solving the Darcy--Forchheimer model. We begin by reformulating the Darcy--Forchheimer model as a nearly semicoercive convex optimization problem via the augmented Lagrangian method. Building on this formulation, we develop a parallel multilevel method, also known as a multilevel additive Schwarz method, within the framework of subspace correction for nearly semicoercive convex problems. The proposed method exhibits robustness with respect to both the nearly semicoercive nature of the problem and the size of the discretized system. To further enhance convergence, we incorporate a backtracking line search scheme and a full approximation scheme. Numerical results validate the theoretical findings and demonstrate the effectiveness and superiority of the proposed approach.

LGOct 19, 2023
Balanced Group Convolution: An Improved Group Convolution Based on Approximability Estimates

Youngkyu Lee, Jongho Park, Chang-Ock Lee

The performance of neural networks has been significantly improved by increasing the number of channels in convolutional layers. However, this increase in performance comes with a higher computational cost, resulting in numerous studies focused on reducing it. One promising approach to address this issue is group convolution, which effectively reduces the computational cost by grouping channels. However, to the best of our knowledge, there has been no theoretical analysis on how well the group convolution approximates the standard convolution. In this paper, we mathematically analyze the approximation of the group convolution to the standard convolution with respect to the number of groups. Furthermore, we propose a novel variant of the group convolution called balanced group convolution, which shows a higher approximation with a small additional computational cost. We provide experimental results that validate our theoretical findings and demonstrate the superior performance of the balanced group convolution over other variants of group convolution.

AIJul 1, 2025Code
Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess

Dongyoon Hwang, Hojoon Lee, Jaegul Choo et al.

While reinforcement learning (RL) for large language models (LLMs) has shown promise in mathematical reasoning, strategic reasoning for LLMs using RL remains largely unexplored. We investigate whether LLMs can develop strategic reasoning capabilities through RL in chess. To this end, we leverage a chess-pretrained action-value network to provide dense reward on the LLM's output move quality, which can be seen as a form of knowledge distillation. Our experiments show that our distillation-based dense rewards often outperform sparse binary rewards. However, surprisingly, all models plateau far below expert levels. We provide SFT and RL ablations on chess reasoning training and find evidence that this limitation stems from a deficit in the pretrained models' internal understanding of chess-a deficit which RL alone may not be able to fully overcome. The code is available at https://github.com/krafton-ai/Chess-R1.

NASep 29, 2025
Parallel subspace correction methods for semicoercive and nearly semicoercive convex optimization with applications to nonlinear PDEs

Young-Ju Lee, Jongho Park

We present new convergence analyses for parallel subspace correction methods for unconstrained semicoercive and nearly semicoercive convex optimization problems, generalizing the theory of singular and nearly singular linear problems to a class of nonlinear problems. Our results demonstrate that the elegant theoretical framework developed for singular and nearly singular linear problems can be extended to unconstrained semicoercive and nearly semicoercive convex optimization problems. For semicoercive problems, we show that the convergence rate can be estimated in terms of a seminorm stable decomposition over the subspaces and the kernel of the problem, aligning with the theory for singular linear problems. For nearly semicoercive problems, we establish a parameter-independent convergence rate, assuming the kernel of the semicoercive part can be decomposed into a sum of local kernels, which aligns with the theory for nearly singular problems. To demonstrate the applicability of our results, we provide convergence analyses of two-level additive Schwarz methods for solving certain nonlinear partial differential equations with Neumann boundary conditions, within the proposed abstract framework.

LGFeb 6, 2024
Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks

Jongho Park, Jaeseung Park, Zheyang Xiong et al.

State-space models (SSMs), such as Mamba (Gu & Dao, 2023), have been proposed as alternatives to Transformer networks in language modeling, by incorporating gating, convolutions, and input-dependent token selection to mitigate the quadratic cost of multi-head attention. Although SSMs exhibit competitive performance, their in-context learning (ICL) capabilities, a remarkable emergent property of modern language models that enables task execution without parameter optimization, remain underexplored compared to Transformers. In this study, we evaluate the ICL performance of SSMs, focusing on Mamba, against Transformer models across various tasks. Our results show that SSMs perform comparably to Transformers in standard regression ICL tasks, while outperforming them in tasks like sparse parity learning. However, SSMs fall short in tasks involving non-standard retrieval functionality. To address these limitations, we introduce a hybrid model, MambaFormer, that combines Mamba with attention blocks, surpassing individual models in tasks where they struggle independently. Our findings suggest that hybrid architectures offer promising avenues for enhancing ICL in language models.

LGNov 4, 2025
Effective Test-Time Scaling of Discrete Diffusion through Iterative Refinement

Sanghyun Lee, Sunwoo Kim, Seungryong Kim et al.

Test-time scaling through reward-guided generation remains largely unexplored for discrete diffusion models despite its potential as a promising alternative. In this work, we introduce Iterative Reward-Guided Refinement (IterRef), a novel test-time scaling method tailored to discrete diffusion that leverages reward-guided noising-denoising transitions to progressively refine misaligned intermediate states. We formalize this process within a Multiple-Try Metropolis (MTM) framework, proving convergence to the reward-aligned distribution. Unlike prior methods that assume the current state is already aligned with the reward distribution and only guide the subsequent transition, our approach explicitly refines each state in situ, progressively steering it toward the optimal intermediate distribution. Across both text and image domains, we evaluate IterRef on diverse discrete diffusion models and observe consistent improvements in reward-guided generation quality. In particular, IterRef achieves striking gains under low compute budgets, far surpassing prior state-of-the-art baselines.

23.8OCApr 3
Randomized subspace correction methods for convex optimization

Boou Jiang, Jongho Park, Jinchao Xu

This paper introduces an abstract framework for randomized subspace correction methods for convex optimization, which unifies and generalizes a broad class of existing algorithms, including domain decomposition, multigrid, and block coordinate descent methods. We provide a convergence rate analysis ranging from minimal assumptions to more practical settings, such as sharpness and strong convexity. While most existing studies on block coordinate descent methods focus on nonoverlapping decompositions and smooth or strongly convex problems, our framework extends to more general settings involving arbitrary space decompositions, inexact local solvers, and problems with weaker smoothness or convexity assumptions. The proposed framework is broadly applicable to convex optimization problems arising in areas such as nonlinear partial differential equations, imaging, and data science.

95.8NAApr 2
Unified analysis of saddle point problems via auxiliary space theory

Jongho Park

We present sharp estimates for the extremal eigenvalues of the Schur complements arising in saddle point problems. These estimates are derived using the auxiliary space theory, in which a given iterative method is interpreted as an equivalent but more elementary iterative method on an auxiliary space, enabling us to obtain sharp convergence estimates. The proposed framework improves or refines several existing results, which can be recovered as corollaries of our results. To demonstrate the versatility of the framework, we present various applications from scientific computing: the augmented Lagrangian method, mixed finite element methods, and nonoverlapping domain decomposition methods. In all these applications, the condition numbers of the corresponding Schur complements can be estimated in a straightforward manner using the proposed framework.

LGNov 4, 2025
Lookahead Unmasking Elicits Accurate Decoding in Diffusion Language Models

Sanghyun Lee, Seungryong Kim, Jongho Park et al.

Masked Diffusion Models (MDMs) as language models generate by iteratively unmasking tokens, yet their performance crucially depends on the inference time order of unmasking. Prevailing heuristics, such as confidence based sampling, are myopic: they optimize locally, fail to leverage extra test-time compute, and let early decoding mistakes cascade. We propose Lookahead Unmasking (LookUM), which addresses these concerns by reformulating sampling as path selection over all possible unmasking orders without the need for an external reward model. Our framework couples (i) a path generator that proposes paths by sampling from pools of unmasking sets with (ii) a verifier that computes the uncertainty of the proposed paths and performs importance sampling to subsequently select the final paths. Empirically, erroneous unmasking measurably inflates sequence level uncertainty, and our method exploits this to avoid error-prone trajectories. We validate our framework across six benchmarks, such as mathematics, planning, and coding, and demonstrate consistent performance improvements. LookUM requires only two to three paths to achieve peak performance, demonstrating remarkably efficient path selection. The consistent improvements on both LLaDA and post-trained LLaDA 1.5 are particularly striking: base LLaDA with LookUM rivals the performance of RL-tuned LLaDA 1.5, while LookUM further enhances LLaDA 1.5 itself showing that uncertainty based verification provides orthogonal benefits to reinforcement learning and underscoring the versatility of our framework. Code will be publicly released.

LGDec 12, 2024
Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries

Junhyuck Kim, Jongho Park, Jaewoong Cho et al.

We introduce Lexico, a novel KV cache compression method that leverages sparse coding with a universal dictionary. Our key finding is that key-value cache in modern LLMs can be accurately approximated using sparse linear combination from a small, input-agnostic dictionary of ~4k atoms, enabling efficient compression across different input prompts, tasks and models. Using orthogonal matching pursuit for sparse approximation, Lexico achieves flexible compression ratios through direct sparsity control. On GSM8K, across multiple model families (Mistral, Llama 3, Qwen2.5), Lexico maintains 90-95% of the original performance while using only 15-25% of the full KV-cache memory, outperforming both quantization and token eviction methods. Notably, Lexico remains effective in low memory regimes where 2-bit quantization fails, achieving up to 1.7x better compression on LongBench and GSM8K while maintaining high accuracy.

LGOct 13, 2025
Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models

Junhyuck Kim, Ethan Ewer, Taehong Moon et al.

While 4-bit quantization has emerged as a memory-optimal choice for non-reasoning models and zero-shot tasks across scales, we show that this universal prescription fails for reasoning models, where the KV cache rather than model size can dominate memory. Through systematic experiments across 1,700 inference scenarios on AIME25 and GPQA-Diamond, we find a scale-dependent trade-off: models with an effective size below 8-bit 4B parameters achieve better accuracy by allocating memory to more weights rather than longer generation, while larger models achieve better accuracy by allocating memory to longer generations. This scale threshold also determines when parallel scaling becomes memory-efficient and whether KV cache eviction outperforms KV quantization. Our findings show that memory optimization for LLMs cannot be scale-agnostic, while providing principled guidelines: for small reasoning models, prioritize model capacity over test-time compute, while for larger ones, maximize test-time compute. Our results suggest that optimizing reasoning models for deployment requires fundamentally different strategies from those established for non-reasoning models.

CVSep 2, 2025
A Diffusion-Based Framework for Configurable and Realistic Multi-Storage Trace Generation

Seohyun Kim, Junyoung Lee, Jongho Park et al.

We propose DiTTO, a novel diffusion-based framework for generating realistic, precisely configurable, and diverse multi-device storage traces. Leveraging advanced diffusion techniques, DiTTO enables the synthesis of high-fidelity continuous traces that capture temporal dynamics and inter-device dependencies with user-defined configurations. Our experimental results demonstrate that DiTTO can generate traces with high fidelity and diversity while aligning closely with guided configurations with only 8% errors.

LGJun 2, 2025
Alignment as Distribution Learning: Your Preference Model is Explicitly a Language Model

Jihun Yun, Juno Kim, Jongho Park et al.

Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, when viewed as `loss + regularization,' the standard RLHF objective lacks theoretical justification and incentivizes degenerate, deterministic solutions, an issue that variants such as Direct Policy Optimization (DPO) also inherit. In this paper, we rethink alignment by framing it as \emph{distribution learning} from pairwise preference feedback by explicitly modeling how information about the target language model bleeds through the preference data. This explicit modeling leads us to propose three principled learning objectives: preference maximum likelihood estimation, preference distillation, and reverse KL minimization. We theoretically show that all three approaches enjoy strong non-asymptotic $O(1/n)$ convergence to the target language model, naturally avoiding degeneracy and reward overfitting. Finally, we empirically demonstrate that our distribution learning framework, especially preference distillation, consistently outperforms or matches the performances of RLHF and DPO across various tasks and models.

LGMay 17, 2023
DualFL: A Duality-based Federated Learning Algorithm with Communication Acceleration in the General Convex Regime

Jongho Park, Jinchao Xu

We propose a new training algorithm, named DualFL (Dualized Federated Learning), for solving distributed optimization problems in federated learning. DualFL achieves communication acceleration for very general convex cost functions, thereby providing a solution to an open theoretical problem in federated learning concerning cost functions that may not be smooth nor strongly convex. We provide a detailed analysis for the local iteration complexity of DualFL to ensure the overall computational efficiency of DualFL. Furthermore, we introduce a completely new approach for the convergence analysis of federated learning based on a dual formulation. This new technique enables concise and elegant analysis, which contrasts the complex calculations used in existing literature on convergence of federated learning algorithms.

CLMay 8, 2023
Prompted LLMs as Chatbot Modules for Long Open-domain Conversation

Gibbeum Lee, Volker Hartmann, Jongho Park et al.

In this paper, we propose MPC (Modular Prompted Chatbot), a new approach for creating high-quality conversational agents without the need for fine-tuning. Our method utilizes pre-trained large language models (LLMs) as individual modules for long-term consistency and flexibility, by using techniques such as few-shot prompting, chain-of-thought (CoT), and external memory. Our human evaluation results show that MPC is on par with fine-tuned chatbot models in open-domain conversations, making it an effective solution for creating consistent and engaging chatbots.

LGOct 11, 2021
Two-level Group Convolution

Youngkyu Lee, Jongho Park, Chang-Ock Lee

Group convolution has been widely used in order to reduce the computation time of convolution, which takes most of the training time of convolutional neural networks. However, it is well known that a large number of groups significantly reduce the performance of group convolution. In this paper, we propose a new convolution methodology called ``two-level'' group convolution that is robust with respect to the increase of the number of groups and suitable for multi-GPU parallel computation. We first observe that the group convolution can be interpreted as a one-level block Jacobi approximation of the standard convolution, which is a popular notion in the field of numerical analysis. In numerical analysis, there have been numerous studies on the two-level method that introduces an intergroup structure that resolves the performance degradation issue without disturbing parallel computation. Motivated by these, we introduce a coarse-level structure which promotes intergroup communication without being a bottleneck in the group convolution. We show that all the additional work induced by the coarse-level structure can be efficiently processed in a distributed memory system. Numerical results that verify the robustness of the proposed method with respect to the number of groups are presented. Moreover, we compare the proposed method to various approaches for group convolution in order to highlight the superiority of the proposed method in terms of execution time, memory efficiency, and performance.

LGSep 10, 2021
ReLU Regression with Massart Noise

Ilias Diakonikolas, Jongho Park, Christos Tzamos

We study the fundamental problem of ReLU regression, where the goal is to fit Rectified Linear Units (ReLUs) to data. This supervised learning task is efficiently solvable in the realizable setting, but is known to be computationally hard with adversarial label noise. In this work, we focus on ReLU regression in the Massart noise model, a natural and well-studied semi-random noise model. In this model, the label of every point is generated according to a function in the class, but an adversary is allowed to change this value arbitrarily with some probability, which is {\em at most} $η< 1/2$. We develop an efficient algorithm that achieves exact parameter recovery in this model under mild anti-concentration assumptions on the underlying distribution. Such assumptions are necessary for exact recovery to be information-theoretically possible. We demonstrate that our algorithm significantly outperforms naive applications of $\ell_1$ and $\ell_2$ regression on both synthetic and real data.

NAMar 16, 2021
Parareal Neural Networks Emulating a Parallel-in-time Algorithm

Chang-Ock Lee, Youngkyu Lee, Jongho Park

As deep neural networks (DNNs) become deeper, the training time increases. In this perspective, multi-GPU parallel computing has become a key tool in accelerating the training of DNNs. In this paper, we introduce a novel methodology to construct a parallel neural network that can utilize multiple GPUs simultaneously from a given DNN. We observe that layers of DNN can be interpreted as the time step of a time-dependent problem and can be parallelized by emulating a parallel-in-time algorithm called parareal. The parareal algorithm consists of fine structures which can be implemented in parallel and a coarse structure which gives suitable approximations to the fine structures. By emulating it, the layers of DNN are torn to form a parallel structure which is connected using a suitable coarse network. We report accelerated and accuracy-preserved results of the proposed methodology applied to VGG-16 and ResNet-1001 on several datasets.

LGFeb 10, 2020
On Robust Mean Estimation under Coordinate-level Corruption

Zifan Liu, Jongho Park, Theodoros Rekatsinas et al.

We study the problem of robust mean estimation and introduce a novel Hamming distance-based measure of distribution shift for coordinate-level corruptions. We show that this measure yields adversary models that capture more realistic corruptions than those used in prior works, and present an information-theoretic analysis of robust mean estimation in these settings. We show that for structured distributions, methods that leverage the structure yield information theoretically more accurate mean estimation. We also focus on practical algorithms for robust mean estimation and study when data cleaning-inspired approaches that first fix corruptions in the input data and then perform robust mean estimation can match the information theoretic bounds of our analysis. We finally demonstrate experimentally that this two-step approach outperforms structure-agnostic robust estimation and provides accurate mean estimation even for high-magnitude corruption.