Jaehun Jung

h-index28

14papers

789citations

Novelty54%

AI Score42

Ranked #62,591 of 194,257 authors (top 32%)#12,201 in CL (top 40%)

14 Papers

28.7CLMay 24, 2022

Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations

Jaehun Jung, Lianhui Qin, Sean Welleck et al.

Despite their impressive capabilities, large pre-trained language models (LMs) struggle with consistent reasoning; recently, prompting LMs to generate explanations that self-guide the inference has emerged as a promising direction to amend this. However, these approaches are fundamentally bounded by the correctness of explanations, which themselves are often noisy and inconsistent. In this work, we develop Maieutic Prompting, which infers a correct answer to a question even from the noisy and inconsistent generations of LM. Maieutic Prompting induces a tree of explanations abductively (e.g. X is true, because ...) and recursively, then frames the inference as a satisfiability problem over these explanations and their logical relations. We test Maieutic Prompting for true/false QA on three challenging benchmarks that require complex commonsense reasoning. Maieutic Prompting achieves up to 20% better accuracy than state-of-the-art prompting methods, and as a fully unsupervised approach, performs competitively with supervised models. We also show that Maieutic Prompting improves robustness in inference while providing interpretable rationales.

33.8CLAug 20, 2025

NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

Aarti Basant, Abhijit Khairnar, Abhijit Paithankar et al. · nvidia

We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.

21.9CLNov 13, 2023Code

STEER: Unified Style Transfer with Expert Reinforcement

Skyler Hallinan, Faeze Brahman, Ximing Lu et al. · uw

While text style transfer has many applications across natural language processing, the core premise of transferring from a single source style is unrealistic in a real-world setting. In this work, we focus on arbitrary style transfer: rewriting a text from an arbitrary, unknown style to a target style. We propose STEER: Unified Style Transfer with Expert Reinforcement, a unified frame-work developed to overcome the challenge of limited parallel data for style transfer. STEER involves automatically generating a corpus of style-transfer pairs using a product of experts during decoding. The generated offline data is then used to pre-train an initial policy before switching to online, off-policy reinforcement learning for further improvements via fine-grained reward signals. STEER is unified and can transfer to multiple target styles from an arbitrary, unknown source style, making it particularly flexible and efficient. Experimental results on a challenging dataset with text from a diverse set of styles demonstrate state-of-the-art results compared to competitive baselines. Remarkably, STEER outperforms the 175B parameter instruction-tuned GPT-3 on overall style transfer quality, despite being 226 times smaller in size. We also show STEER is robust, maintaining its style transfer capabilities on out-of-domain data, and surpassing nearly all baselines across various styles. The success of our method highlights the potential of RL algorithms when augmented with controllable decoding to overcome the challenge of limited data supervision.

31.9LGJul 25, 2024

Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

Jaehun Jung, Faeze Brahman, Yejin Choi

We present a principled approach to provide LLM-based evaluation with a rigorous guarantee of human agreement. We first propose that a reliable evaluation method should not uncritically rely on model preferences for pairwise evaluation, but rather assess the confidence of judge models and selectively decide when to trust its judgement. We then show that under this selective evaluation framework, human agreement can be provably guaranteed -- such that the model evaluation aligns with that of humans to a user-specified agreement level. As part of our framework, we also introduce Simulated Annotators, a novel confidence estimation method that significantly improves judge calibration and thus enables high coverage of evaluated instances. Finally, we propose Cascaded Selective Evaluation, where we use cheaper models as initial judges and escalate to stronger models only when necessary -- again, while still providing a provable guarantee of human agreement. Experimental results show that Cascaded Selective Evaluation guarantees strong alignment with humans, far beyond what LLM judges could achieve without selective evaluation. For example, on a subset of Chatbot Arena where GPT-4 almost never achieves 80% human agreement, our method, even while employing substantially cost-effective models such as Mistral-7B, guarantees over 80% human agreement with almost 80% test coverage.

2.2SDMar 29, 2022

Machine Composition of Korean Music via Topological Data Analysis and Artificial Neural Network

Mai Lan Tran, Dongjin Lee, Jae-Hun Jung

Common AI music composition algorithms based on artificial neural networks are to train a machine by feeding a large number of music pieces and create artificial neural networks that can produce music similar to the input music data. This approach is a blackbox optimization, that is, the underlying composition algorithm is, in general, not known to users. In this paper, we present a way of machine composition that trains a machine the composition principle embedded in the given music data instead of directly feeding music pieces. We propose this approach by using the concept of {\color{black}{Overlap}} matrix proposed in \cite{TPJ}. In \cite{TPJ}, a type of Korean music, so-called the {\it Dodeuri} music such as Suyeonjangjigok has been analyzed using topological data analysis (TDA), particularly using persistent homology. As the raw music data is not suitable for TDA analysis, the music data is first reconstructed as a graph. The node of the graph is defined as a two-dimensional vector composed of the pitch and duration of each music note. The edge between two nodes is created when those nodes appear consecutively in the music flow. Distance is defined based on the frequency of such appearances. Through TDA on the constructed graph, a unique set of cycles is found for the given music. In \cite{TPJ}, the new concept of the {\it {\color{black}{Overlap}} matrix} has been proposed, which visualizes how those cycles are interconnected over the music flow, in a matrix form. In this paper, we explain how we use the {\color{black}{Overlap}} matrix for machine composition. The {\color{black}{Overlap}} matrix makes it possible to compose a new music piece algorithmically and also provide a seed music towards the desired artificial neural network. In this paper, we use the {\it Dodeuri} music and explain detailed steps.

4.6LGJul 8, 2024

A third-order finite difference weighted essentially non-oscillatory scheme with shallow neural network

Kwanghyuk Park, Xinjuan Chen, Dongjin Lee et al.

In this paper, we introduce the finite difference weighted essentially non-oscillatory (WENO) scheme based on the neural network for hyperbolic conservation laws. We employ the supervised learning and design two loss functions, one with the mean squared error and the other with the mean squared logarithmic error, where the WENO3-JS weights are computed as the labels. Each loss function consists of two components where the first component compares the difference between the weights from the neural network and WENO3-JS weights, while the second component matches the output weights of the neural network and the linear weights. The former of the loss function enforces the neural network to follow the WENO properties, implying that there is no need for the post-processing layer. Additionally the latter leads to better performance around discontinuities. As a neural network structure, we choose the shallow neural network (SNN) for computational efficiency with the Delta layer consisting of the normalized undivided differences. These constructed WENO3-SNN schemes show the outperformed results in one-dimensional examples and improved behavior in two-dimensional examples, compared with the simulations from WENO3-JS and WENO3-Z.

16.6CLFeb 13, 2024Code

JAMDEC: Unsupervised Authorship Obfuscation using Constrained Decoding over Small Language Models

Jillian Fisher, Ximing Lu, Jaehun Jung et al. · uw

The permanence of online content combined with the enhanced authorship identification techniques calls for stronger computational methods to protect the identity and privacy of online authorship when needed, e.g., blind reviews for scientific papers, anonymous online reviews, or anonymous interactions in the mental health forums. In this paper, we propose an unsupervised inference-time approach to authorship obfuscation to address the unique challenges of authorship obfuscation: lack of supervision data for diverse authorship and domains, and the need for a sufficient level of revision beyond simple paraphrasing to obfuscate the authorship, all the while preserving the original content and fluency. We introduce JAMDEC, a user-controlled, inference-time algorithm for authorship obfuscation that can be in principle applied to any text and authorship. Our approach builds on small language models such as GPT2-XL in order to help avoid disclosing the original content to proprietary LLM's APIs, while also reducing the performance gap between small and large language models via algorithmic enhancement. The key idea behind our approach is to boost the creative power of smaller language models through constrained decoding, while also allowing for user-specified controls and flexibility. Experimental results demonstrate that our approach based on GPT2-XL outperforms previous state-of-the-art methods based on comparably small models, while performing competitively against GPT3.5 175B, a propriety model that is two orders of magnitudes larger.

5.5CLMar 20, 2024

Information-Theoretic Distillation for Reference-less Summarization

Jaehun Jung, Ximing Lu, Liwei Jiang et al. · allen-ai, uw

The current winning recipe for automatic summarization is using proprietary large-scale language models (LLMs) such as ChatGPT as is, or imitation learning from them as teacher models. While increasingly ubiquitous dependence on such large-scale language models is convenient, there remains an important question of whether small-scale models could have achieved competitive results, if we were to seek an alternative learning method -- that allows for a more cost-efficient, controllable, yet powerful summarizer. We present InfoSumm, a novel framework to distill a powerful summarizer based on the information-theoretic objective for summarization, without relying on either the LLM's capability or human-written references. To achieve this, we first propose a novel formulation of the desiderata of summarization (saliency, faithfulness and brevity) through the lens of mutual information between the original document and the summary. Based on this formulation, we start off from Pythia-2.8B as the teacher model, which is not yet capable of summarization, then self-train the model to optimize for the information-centric measures of ideal summaries. Distilling from the improved teacher, we arrive at a compact but powerful summarizer with only 568M parameters that performs competitively against ChatGPT, without ever relying on ChatGPT's capabilities. Extensive analysis demonstrates that our approach outperforms in-domain supervised models in human evaluation, let alone state-of-the-art unsupervised methods, and wins over ChatGPT in controllable summarization.

11.8CVJun 10, 2025

Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions

David Acuna, Ximing Lu, Jaehun Jung et al. · nvidia, uw

Recent research in vision-language models (VLMs) has centered around the possibility of equipping them with implicit long-form chain-of-thought reasoning -- akin to the success observed in language models -- via distillation and reinforcement learning. But what about the non-reasoning models already trained and deployed across the internet? Should we simply abandon them, or is there hope for a search mechanism that can elicit hidden knowledge and induce long reasoning traces -- without any additional training or supervision? In this paper, we explore this possibility using a Monte Carlo Tree Search (MCTS)-inspired algorithm, which injects subquestion-subanswer pairs into the model's output stream. We show that framing reasoning as a search process -- where subquestions act as latent decisions within a broader inference trajectory -- helps the model "connect the dots" between fragmented knowledge and produce extended reasoning traces in non-reasoning models. We evaluate our method across three benchmarks and observe consistent improvements. Notably, our approach yields a 2% overall improvement on MMMU-PRO, including a significant 9% gain in Liberal Arts.

13.3CLMay 26, 2023

Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing

Jaehun Jung, Peter West, Liwei Jiang et al.

We present Impossible Distillation, a novel framework for paraphrasing and sentence summarization, that distills a high-quality dataset and model from a low-quality teacher that itself cannot perform these tasks. Unlike prior works that rely on an extreme-scale teacher model (e.g., GPT3) or task-specific architecture, we hypothesize and verify the paraphrastic proximity intrinsic to pre-trained LMs (e.g., GPT2), where paraphrases occupy a proximal subspace in the LM distribution. By identifying and distilling generations from these subspaces, Impossible Distillation produces a high-quality dataset and model even from GPT2-scale LMs. We evaluate our method on multiple benchmarks spanning unconstrained / syntax-controlled paraphrase generation and sentence summarization. Our model with 770M parameters consistently outperforms strong baselines, including models distilled from ChatGPT, and sometimes, even ChatGPT itself. Also, we find that our distilled dataset from 1.5B LMs exhibits higher diversity and fidelity than up to 13 times larger datasets.

1.8LGFeb 1, 2022

Weighted Isolation and Random Cut Forest Algorithms for Anomaly Detection

Sijin Yeom, Jae-Hun Jung

Random cut forest (RCF) algorithms have been developed for anomaly detection, particularly in time series data. The RCF algorithm is an improved version of the isolation forest (IF) algorithm. Unlike the IF algorithm, the RCF algorithm can determine whether real-time input contains an anomaly by inserting the input into the constructed tree network. Various RCF algorithms, including Robust RCF (RRCF), have been developed, where the cutting procedure is adaptively chosen probabilistically. The RRCF algorithm demonstrates better performance than the IF algorithm, as dimension cuts are decided based on the geometric range of the data, whereas the IF algorithm randomly chooses dimension cuts. However, the overall data structure is not considered in both IF and RRCF, given that split values are chosen randomly. In this paper, we propose new IF and RCF algorithms, referred to as the weighted IF (WIF) and weighted RCF (WRCF) algorithms, respectively. Their split values are determined by considering the density of the given data. To introduce the WIF and WRCF, we first present a new geometric measure, a density measure, which is crucial for constructing the WIF and WRCF. We provide various mathematical properties of the density measure, accompanied by theorems that support and validate our claims through numerical examples.

4.3SDMar 11, 2021

Topological Data Analysis of Korean Music in Jeongganbo: A Cycle Structure

Mai Lan Tran, Changbom Park, Jae-Hun Jung

Jeongganbo is a unique music representation invented by Sejong the Great. Contrary to the western music notation, the pitch of each note is encrypted and the length is visualized directly in a matrix form in Jeongganbo. We use topological data analysis (TDA) to analyze the Korean music written in Jeongganbo for Suyeonjang, Songuyeo, and Taryong, those well-known pieces played at the palace and among noble community. We are particularly interested in the cycle structure. We first define and determine the node elements of each music, characterized uniquely with its pitch and length. Then we transform the music into a graph and define the distance between the nodes as their adjacent occurrence rate. The graph is used as a point cloud whose homological structure is investigated by measuring the hole structure in each dimension. We identify cycles of each music, match those in Jeongganbo, and show how those cycles are interconnected. The main discovery of this work is that the cycles of Suyeonjang and Songuyeo, categorized as a special type of cyclic music known as Dodeuri, frequently overlap each other when appearing in the music while the cycles found in Taryong, which does not belong to Dodeuri class, appear individually.

5.8LGDec 19, 2020

T-GAP: Learning to Walk across Time for Temporal Knowledge Graph Completion

Jaehun Jung, Jinhong Jung, U Kang

Temporal knowledge graphs (TKGs) inherently reflect the transient nature of real-world knowledge, as opposed to static knowledge graphs. Naturally, automatic TKG completion has drawn much research interests for a more realistic modeling of relational reasoning. However, most of the existing mod-els for TKG completion extend static KG embeddings that donot fully exploit TKG structure, thus lacking in 1) account-ing for temporally relevant events already residing in the lo-cal neighborhood of a query, and 2) path-based inference that facilitates multi-hop reasoning and better interpretability. In this paper, we propose T-GAP, a novel model for TKG completion that maximally utilizes both temporal information and graph structure in its encoder and decoder. T-GAP encodes query-specific substructure of TKG by focusing on the temporal displacement between each event and the query times-tamp, and performs path-based inference by propagating attention through the graph. Our empirical experiments demonstrate that T-GAP not only achieves superior performance against state-of-the-art baselines, but also competently generalizes to queries with unseen timestamps. Through extensive qualitative analyses, we also show that T-GAP enjoys from transparent interpretability, and follows human intuition in its reasoning process.

4.3IMOct 18, 2019

Detection of gravitational waves using topological data analysis and convolutional neural network: An improved approach

Christopher Bresten, Jae-Hun Jung

The gravitational wave detection problem is challenging because the noise is typically overwhelming. Convolutional neural networks (CNNs) have been successfully applied, but require a large training set and the accuracy suffers significantly in the case of low SNR. We propose an improved method that employs a feature extraction step using persistent homology. The resulting method is more resilient to noise, more capable of detecting signals with varied signatures and requires less training. This is a powerful improvement as the detection problem can be computationally intense and is concerned with a relatively large class of wave signatures.