Namgyu Ho

CL
h-index8
10papers
981citations
Novelty47%
AI Score40

10 Papers

CLDec 20, 2022Code
Large Language Models Are Reasoning Teachers

Namgyu Ho, Laura Schmid, Se-Young Yun

Recent works have shown that chain-of-thought (CoT) prompting can elicit language models to solve complex reasoning tasks, step-by-step. However, prompt-based CoT methods are dependent on very large models such as GPT-3 175B which are prohibitive to deploy at scale. In this paper, we use these large models as reasoning teachers to enable complex reasoning in smaller models and reduce model size requirements by several orders of magnitude. We propose Fine-tune-CoT, a method that generates reasoning samples from very large teacher models to fine-tune smaller models. We evaluate our method on a wide range of public models and complex tasks. We find that Fine-tune-CoT enables substantial reasoning capability in small models, far outperforming prompt-based baselines and even the teacher model in many tasks. Additionally, we extend our method by leveraging the teacher model's ability to generate multiple distinct rationales for each original sample. Enriching the fine-tuning data with such diverse reasoning results in a substantial performance boost across datasets, even for very small models. We conduct ablations and sample studies to understand the emergence of reasoning capabilities of student models. Our code implementation and data are available at https://github.com/itsnamgyu/reasoning-teacher.

CLNov 1, 2023Code
HARE: Explainable Hate Speech Detection with Step-by-Step Reasoning

Yongjin Yang, Joonkee Kim, Yujin Kim et al.

With the proliferation of social media, accurate detection of hate speech has become critical to ensure safety online. To combat nuanced forms of hate speech, it is important to identify and thoroughly explain hate speech to help users understand its harmful effects. Recent benchmarks have attempted to tackle this issue by training generative models on free-text annotations of implications in hateful text. However, we find significant reasoning gaps in the existing annotations schemes, which may hinder the supervision of detection models. In this paper, we introduce a hate speech detection framework, HARE, which harnesses the reasoning capabilities of large language models (LLMs) to fill these gaps in explanations of hate speech, thus enabling effective supervision of detection models. Experiments on SBIC and Implicit Hate benchmarks show that our method, using model-generated data, consistently outperforms baselines, using existing free-text human annotations. Analysis demonstrates that our method enhances the explanation quality of trained models and improves generalization to unseen datasets. Our code is available at https://github.com/joonkeekim/hare-hate-speech.git.

LGJun 30, 2022Code
Benchmark Dataset for Precipitation Forecasting by Post-Processing the Numerical Weather Prediction

Taehyeon Kim, Namgyu Ho, Donggyu Kim et al.

Precipitation forecasting is an important scientific challenge that has wide-reaching impacts on society. Historically, this challenge has been tackled using numerical weather prediction (NWP) models, grounded on physics-based simulations. Recently, many works have proposed an alternative approach, using end-to-end deep learning (DL) models to replace physics-based NWP models. While these DL methods show improved performance and computational efficiency, they exhibit limitations in long-term forecasting and lack the explainability. In this work, we present a hybrid NWP-DL workflow to fill the gap between standalone NWP and DL approaches. Under this workflow, the outputs of NWP models are fed into a deep neural network, which post-processes the data to yield a refined precipitation forecast. The deep model is trained with supervision, using Automatic Weather Station (AWS) observations as ground-truth labels. This can achieve the best of both worlds, and can even benefit from future improvements in NWP technology. To facilitate study in this direction, we present a novel dataset focused on the Korean Peninsula, termed KoMet (Korea Meteorological Dataset), comprised of NWP outputs and AWS observations. For the NWP model, the Global Data Assimilation and Prediction Systems-Korea Integrated Model (GDAPS-KIM) is utilized. We provide analysis on a comprehensive set of baseline methods aimed at addressing the challenges of KoMet, including the sparsity of AWS observations and class imbalance. To lower the barrier to entry and encourage further study, we also provide an extensive open-source Python package for data processing and model development. Our benchmark data and code are available at https://github.com/osilab-kaist/KoMet-Benchmark-Dataset.

CLNov 14, 2023
Carpe Diem: On the Evaluation of World Knowledge in Lifelong Language Models

Yujin Kim, Jaehong Yoon, Seonghyeon Ye et al.

The dynamic nature of knowledge in an ever-changing world presents challenges for language models trained on static data; the model in the real world often requires not only acquiring new knowledge but also overwriting outdated information into updated ones. To study the ability of language models for these time-dependent dynamics in human language, we introduce a novel task, EvolvingQA, a temporally evolving question-answering benchmark designed for training and evaluating LMs on an evolving Wikipedia database. The construction of EvolvingQA is automated with our pipeline using large language models. We uncover that existing continual learning baselines suffer from updating and removing outdated knowledge. Our analysis suggests that models fail to rectify knowledge due to small weight gradients. In addition, we elucidate that language models particularly struggle to reflect the change of numerical or temporal information. Our work aims to model the dynamic nature of real-world information, suggesting faithful evaluations of the evolution-adaptability of language models.

CVMay 11, 2022
ReFine: Re-randomization before Fine-tuning for Cross-domain Few-shot Learning

Jaehoon Oh, Sungnyun Kim, Namgyu Ho et al.

Cross-domain few-shot learning (CD-FSL), where there are few target samples under extreme differences between source and target domains, has recently attracted huge attention. Recent studies on CD-FSL generally focus on transfer learning based approaches, where a neural network is pre-trained on popular labeled source domain datasets and then transferred to target domain data. Although the labeled datasets may provide suitable initial parameters for the target data, the domain difference between the source and target might hinder fine-tuning on the target domain. This paper proposes a simple yet powerful method that re-randomizes the parameters fitted on the source domain before adapting to the target data. The re-randomization resets source-specific parameters of the source pre-trained model and thus facilitates fine-tuning on the target domain, improving few-shot performance.

CLFeb 27, 2025Code
Self-Training Elicits Concise Reasoning in Large Language Models

Tergel Munkhbat, Namgyu Ho, Seo Hyun Kim et al.

Chain-of-thought (CoT) reasoning has enabled large language models (LLMs) to utilize additional computation through intermediate tokens to solve complex tasks. However, we posit that typical reasoning traces contain many redundant tokens, incurring extraneous inference costs. Upon examination of the output distribution of current LLMs, we find evidence on their latent ability to reason more concisely, relative to their default behavior. To elicit this capability, we propose simple fine-tuning methods which leverage self-generated concise reasoning paths obtained by best-of-N sampling and few-shot conditioning, in task-specific settings. Our combined method achieves a 30% reduction in output tokens on average, across five model families on GSM8K and MATH, while maintaining average accuracy. By exploiting the fundamental stochasticity and in-context learning capabilities of LLMs, our self-training approach robustly elicits concise reasoning on a wide range of models, including those with extensive post-training. Code is available at https://github.com/TergelMunkhbat/concise-reasoning

CVAug 29, 2023
Cross-Modal Retrieval Meets Inference:Improving Zero-Shot Classification with Cross-Modal Retrieval

Seongha Eom, Namgyu Ho, Jaehoon Oh et al.

Contrastive language-image pre-training (CLIP) has demonstrated remarkable zero-shot classification ability, namely image classification using novel text labels. Existing works have attempted to enhance CLIP by fine-tuning on downstream tasks, but these have inadvertently led to performance degradation on unseen classes, thus harming zero-shot generalization. This paper aims to address this challenge by leveraging readily available image-text pairs from an external dataset for cross-modal guidance during inference. To this end, we propose X-MoRe, a novel inference method comprising two key steps: (1) cross-modal retrieval and (2) modal-confidence-based ensemble. Given a query image, we harness the power of CLIP's cross-modal representations to retrieve relevant textual information from an external image-text pair dataset. Then, we assign higher weights to the more reliable modality between the original query image and retrieved text, contributing to the final prediction. X-MoRe demonstrates robust performance across a diverse set of tasks without the need for additional training, showcasing the effectiveness of utilizing cross-modal features to maximize CLIP's zero-shot ability.

CLJun 9, 2024Code
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

Seungone Kim, Juyoung Suk, Ji Yong Cho et al.

As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria like helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 103 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at https://github.com/prometheus-eval/prometheus-eval/tree/main/BiGGen-Bench.

CLJun 4, 2024Code
Block Transformer: Global-to-Local Language Modeling for Fast Inference

Namgyu Ho, Sangmin Bae, Taehyeon Kim et al.

We introduce the Block Transformer which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention. Self-attention requires the key-value (KV) cache of all previous sequences to be retrieved from memory at every decoding step to retrieve context information, leading to two primary bottlenecks during batch inference. First, there is a significant delay in obtaining the first token, as the information of the entire prompt must first be processed to prefill the KV cache. Second, computation of subsequent tokens is bottlenecked by the high memory I/O demand of fetching the entire KV cache, which grows linearly with sequence length, incurring quadratic memory reads overall. We design the Block Transformer to strategically mitigate these costs, by incorporating coarsity and locality into an integrated global-to-local architecture. At the lower layers, we aggregate tokens into fixed size blocks to apply attention across the entire sequence at coarse-grained detail, to capture the global context while minimizing KV cache overhead. At upper layers, we apply attention within each block to decode individual tokens, to model fine-grained details with a lightweight local KV cache. We pretrain vanilla and Block Transformers from scratch and demonstrate that Block Transformers reach 10--20x inference throughput compared to vanilla transformers with equivalent perplexity and zero-shot task performance. Code is available at https://github.com/itsnamgyu/block-transformer.

LGFeb 1, 2022Code
Understanding Cross-Domain Few-Shot Learning Based on Domain Similarity and Few-Shot Difficulty

Jaehoon Oh, Sungnyun Kim, Namgyu Ho et al.

Cross-domain few-shot learning (CD-FSL) has drawn increasing attention for handling large differences between the source and target domains--an important concern in real-world scenarios. To overcome these large differences, recent works have considered exploiting small-scale unlabeled data from the target domain during the pre-training stage. This data enables self-supervised pre-training on the target domain, in addition to supervised pre-training on the source domain. In this paper, we empirically investigate which pre-training is preferred based on domain similarity and few-shot difficulty of the target domain. We discover that the performance gain of self-supervised pre-training over supervised pre-training becomes large when the target domain is dissimilar to the source domain, or the target domain itself has low few-shot difficulty. We further design two pre-training schemes, mixed-supervised and two-stage learning, that improve performance. In this light, we present six findings for CD-FSL, which are supported by extensive experiments and analyses on three source and eight target benchmark datasets with varying levels of domain similarity and few-shot difficulty. Our code is available at https://github.com/sungnyun/understanding-cdfsl.