CL · Oct 31, 2022 · Code
Pneg: Prompt-based Negative Response Generation for Dialogue Response Selection Task
Nyoungwoo Lee, ChaeHun Park, Ho-Jin Choi et al.
In retrieval-based dialogue systems, a response selection model acts as a ranker to select the most appropriate response among several candidates. However, such selection models tend to rely on context-response content similarity, which makes models vulnerable to adversarial responses that are semantically similar but not relevant to the dialogue context. Recent studies have shown that leveraging these adversarial responses as negative training samples is useful for improving the discriminating power of the selection model. Nevertheless, collecting human-written adversarial responses is expensive, and existing synthesizing methods often have limited scalability. To overcome these limitations, this paper proposes a simple but efficient method for generating adversarial negative responses leveraging a large-scale language model. Experimental results on dialogue selection tasks show that our method outperforms other methods of synthesizing adversarial negative responses. These results suggest that our method can be an effective alternative to human annotators in generating adversarial responses. Our dataset and generation code are available at https://github.com/leenw23/generating-negatives-by-gpt3.
CL · Aug 29, 2022 · Code
Reweighting Strategy based on Synthetic Data Identification for Sentence Similarity
Taehee Kim, ChaeHun Park, Jimin Hong et al.
Semantically meaningful sentence embeddings are important for numerous tasks in natural language processing. To obtain such embeddings, recent studies explored the idea of utilizing synthetically generated data from pretrained language models (PLMs) as a training corpus. However, PLMs often generate sentences that differ substantially from those written by humans. We hypothesize that treating all these synthetic examples equally for training deep neural networks can have an adverse effect on learning semantically meaningful embeddings. To analyze this, we first train a classifier that identifies machine-written sentences, and observe that the linguistic features of the sentences identified as written by a machine are significantly different from those of human-written sentences. Based on this, we propose a novel approach that first trains the classifier to measure the importance of each sentence. The distilled information from the classifier is then used to train a reliable sentence embedding model. Through extensive evaluation on four real-world datasets, we demonstrate that our model trained on synthetic data generalizes well and outperforms the existing baselines. Our implementation is publicly available at https://github.com/ddehun/coling2022_reweighting_sts.
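The reweighting idea in this abstract can be illustrated with a minimal sketch. It assumes each synthetic sentence already has a per-example training loss and an importance weight derived from the machine-written-text classifier (for example, the probability that the sentence looks human-written). The function name, the weight definition, and the normalization are illustrative assumptions, not the authors' implementation.

```python
def reweighted_loss(losses, importance):
    """Importance-weighted mean of per-example losses.

    `importance` holds one non-negative weight per example.
    ASSUMPTION: weights come from a classifier that scores how
    human-like each synthetic sentence is; the paper's exact
    weighting scheme is not given in the abstract.
    """
    if len(losses) != len(importance):
        raise ValueError("losses and importance must have the same length")
    total = sum(importance)
    if total <= 0:
        raise ValueError("importance weights must sum to a positive value")
    return sum(l * w for l, w in zip(losses, importance)) / total


# Hypothetical values: the second example looks machine-written,
# so its large loss contributes little to the batch loss.
batch_losses = [0.2, 1.5, 0.7]
human_likeness = [0.9, 0.1, 0.5]
print(reweighted_loss(batch_losses, human_likeness))
```

With equal weights the function reduces to the ordinary mean loss, which is the "treat all synthetic examples equally" baseline the abstract argues against.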
CL · Sep 22, 2023
Learning to Diversify Neural Text Generation via Degenerative Model
Jimin Hong, ChaeHun Park, Jaegul Choo
Neural language models often fail to generate diverse and informative texts, limiting their applicability in real-world problems. While previous approaches have proposed to address these issues by identifying and penalizing undesirable behaviors (e.g., repetition, overuse of frequent words) from language models, we propose an alternative approach based on an observation: models primarily learn attributes within examples that are likely to cause degeneration problems. Based on this observation, we propose a new approach to prevent degeneration problems by training two models. Specifically, we first train a model that is designed to amplify undesirable patterns. We then enhance the diversity of the second model by focusing on patterns that the first model fails to learn. Extensive experiments on two tasks, namely language modeling and dialogue generation, demonstrate the effectiveness of our approach.
CL · Apr 1, 2024 · Code
PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison
ChaeHun Park, Minseok Choi, Dohyun Lee et al.
Building a reliable and automated evaluation metric is a necessary but challenging problem for open-domain dialogue systems. Recent studies proposed evaluation metrics that assess generated responses by considering their relevance to previous dialogue histories. Although effective, these metrics evaluate individual responses directly rather than considering their relative quality compared to other responses. To handle this, we propose PairEval, a novel dialogue evaluation metric for assessing responses by comparing their quality against responses in different conversations. PairEval is built on top of open-sourced and moderate-size language models, and we make them specialized in pairwise comparison between dialogue responses. Extensive experiments on multiple benchmarks demonstrate that our metric exhibits a higher correlation with human judgments than baseline metrics. We also find that the proposed comparative metric is more robust in detecting common failures from open-domain dialogue systems, including repetition and speaker insensitivity.
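A comparative metric of this kind can be sketched as a win rate against responses drawn from other conversations. In the sketch below, `prefers` stands in for the pairwise judge; the toy word-overlap judge, the function names, and the win-rate aggregation are assumptions for illustration only, since the abstract does not specify how comparisons are aggregated.

```python
def pairwise_score(context, response, comparison_responses, prefers):
    """Fraction of comparison responses that `response` beats.

    `prefers(context, a, b)` returns True when `a` is judged better
    than `b` for `context`. ASSUMPTION: in practice this would be a
    language model specialized for pairwise comparison.
    """
    if not comparison_responses:
        raise ValueError("need at least one comparison response")
    wins = sum(1 for other in comparison_responses
               if prefers(context, response, other))
    return wins / len(comparison_responses)


def toy_prefers(context, a, b):
    """Hypothetical judge: prefer the response sharing more words with the context."""
    ctx = set(context.lower().split())
    overlap = lambda text: len(ctx & set(text.lower().split()))
    return overlap(a) > overlap(b)


context = "do you like hiking in the mountains"
others = ["stock prices fell today",
          "i had pasta for dinner",
          "hiking in the mountains is great fun"]
print(pairwise_score(context, "yes i love hiking", others, toy_prefers))
```

The toy judge rewards surface overlap, which is exactly the kind of shortcut a learned comparator is meant to avoid; it is used here only so the sketch runs without a model.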
CL · Mar 14
LiveWeb-IE: A Benchmark For Online Web Information Extraction
Seungbin Yang, Jihwan Kim, Jaemin Choi et al.
Web information extraction (WIE) is the task of automatically extracting data from web pages, offering high utility for various applications. The evaluation of WIE systems has traditionally relied on benchmarks built from HTML snapshots captured at a single point in time. However, this offline evaluation paradigm fails to account for the temporally evolving nature of the web; consequently, performance on these static benchmarks often fails to generalize to dynamic real-world scenarios. To bridge this gap, we introduce LiveWeb-IE, a new benchmark designed for evaluating WIE systems directly against live websites. Based on trusted and permission-granted websites, we curate natural language queries that require information extraction of various data categories, such as text, images, and hyperlinks. We further design these queries to represent four levels of complexity, based on the number and cardinality of attributes to be extracted, enabling a granular assessment of WIE systems. In addition, we propose Visual Grounding Scraper (VGS), a novel multi-stage agentic framework that mimics human cognitive processes by visually narrowing down web page content to extract desired information. Extensive experiments across diverse backbone models demonstrate the effectiveness and robustness of VGS. We believe that this study lays the foundation for developing practical and robust WIE systems.
CL · Jun 24, 2024 · Code
Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration
ChaeHun Park, Yujin Baek, Jaeseok Kim et al.
To create culturally inclusive vision-language models (VLMs), developing a benchmark that tests their ability to address culturally relevant questions is essential. Existing approaches typically rely on human annotators, making the process labor-intensive and creating a cognitive burden in generating diverse questions. To address this, we propose a semi-automated framework for constructing cultural VLM benchmarks, specifically targeting multiple-choice QA. This framework relies on human-VLM collaboration: VLMs generate questions based on guidelines, a small set of annotated examples, and relevant knowledge, and native speakers then verify the generated questions. We demonstrate the effectiveness of this framework through the creation of K-Viscuit, a dataset focused on Korean culture. Our experiments on this dataset reveal that open-source models lag behind proprietary ones in understanding Korean culture, highlighting key areas for improvement. We also present a series of further analyses, including human evaluation, augmenting VLMs with external knowledge, and evaluation beyond multiple-choice QA. Our dataset is available at https://huggingface.co/datasets/ddehun/k-viscuit.
CL · Jun 18, 2024 · Code
Can Tool-augmented Large Language Models be Aware of Incomplete Conditions?
Seungbin Yang, ChaeHun Park, Taehee Kim et al.
Recent advancements in integrating large language models (LLMs) with tools have allowed the models to interact with real-world environments. However, these tool-augmented LLMs often encounter incomplete scenarios when users provide partial information or the necessary tools are unavailable. Recognizing and managing such scenarios is crucial for LLMs to ensure their reliability, but this exploration remains understudied. This study examines whether LLMs can identify incomplete conditions and appropriately determine when to refrain from using tools. To quantitatively evaluate this capability, we construct a new benchmark dataset where instances are systematically altered to simulate the ambiguous and incomplete conditions common in real-world interactions. Our experiments reveal that even state-of-the-art LLMs often struggle to identify these conditions, attempting to use tools without sufficient information or when the correct tool is unavailable. To better understand these limitations, we conduct a detailed behavioral analysis across various conditions, including implicit evaluation and scenarios where models receive feedback from previous tool invocations. Based on this analysis, we propose a novel prompting-based reasoning strategy that explicitly instructs models to assess the sufficiency of information and the availability of tools. Our proposed approach significantly enhances the models' ability to recognize incomplete conditions, resulting in more informed and contextually appropriate tool-use decisions. We believe our research contributes to advancing the reliability of LLMs, especially in real-world applications where incomplete or ambiguous information is common. Our dataset is available at https://huggingface.co/datasets/ddehun/ICT.
CL · May 8, 2023 · Code
DEnsity: Open-domain Dialogue Evaluation Metric using Density Estimation
ChaeHun Park, Seungil Chad Lee, Daniel Rim et al.
Despite the recent advances in open-domain dialogue systems, building a reliable evaluation metric is still a challenging problem. Recent studies proposed learnable metrics based on classification models trained to distinguish the correct response. However, neural classifiers are known to make overly confident predictions for examples from unseen distributions. We propose DEnsity, which evaluates a response by utilizing density estimation on the feature space derived from a neural classifier. Our metric measures how likely a response would appear in the distribution of human conversations. Moreover, to improve the performance of DEnsity, we utilize contrastive learning to further compress the feature space. Experiments on multiple response evaluation datasets show that DEnsity correlates better with human evaluations than the existing metrics. Our code is available at https://github.com/ddehun/DEnsity.
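Scoring a response by density estimation can be sketched as follows: fit a density to feature vectors of human responses, then score a new response by its log-density. The sketch assumes the features have already been extracted by a classifier and uses a diagonal Gaussian as the density model; both the feature values and the choice of estimator are illustrative assumptions, not necessarily what DEnsity uses.

```python
import math


def fit_diagonal_gaussian(features, eps=1e-6):
    """Per-dimension mean and variance of the reference feature vectors."""
    n, dim = len(features), len(features[0])
    means = [sum(f[j] for f in features) / n for j in range(dim)]
    # eps keeps the variance strictly positive for degenerate dimensions.
    variances = [sum((f[j] - means[j]) ** 2 for f in features) / n + eps
                 for j in range(dim)]
    return means, variances


def log_density(x, means, variances):
    """Log-density of x under the fitted diagonal Gaussian."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, means, variances))


# Hypothetical 2-d features of human responses (stand-ins for classifier features).
human_features = [[0.0, 1.0], [0.2, 0.8], [-0.1, 1.1], [0.1, 0.9]]
means, variances = fit_diagonal_gaussian(human_features)

typical = [0.05, 0.95]   # close to the human feature distribution
atypical = [3.0, -2.0]   # far from it
print(log_density(typical, means, variances) > log_density(atypical, means, variances))
```

A higher log-density means the response looks more like something that would appear in human conversations, which is the quantity the abstract says the metric measures.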
CL · Oct 17, 2024
Breaking Chains: Unraveling the Links in Multi-Hop Knowledge Unlearning
Minseok Choi, ChaeHun Park, Dohyun Lee et al.
Large language models (LLMs) serve as giant information stores, often including personal or copyrighted data, and retraining them from scratch is not a viable option. This has led to the development of various fast, approximate unlearning techniques to selectively remove knowledge from LLMs. Prior research has largely focused on minimizing the probabilities of specific token sequences by reversing the language modeling objective. However, these methods still leave LLMs vulnerable to adversarial attacks that exploit indirect references. In this work, we examine the limitations of current unlearning techniques in effectively erasing a particular type of indirect prompt: multi-hop queries. Our findings reveal that existing methods fail to completely remove multi-hop knowledge when one of the intermediate hops is unlearned. To address this issue, we propose MUNCH, a simple uncertainty-based approach that breaks down multi-hop queries into subquestions and leverages the uncertainty of the unlearned model in final decision-making. Empirical results demonstrate the effectiveness of our framework, and MUNCH can be easily integrated with existing unlearning techniques, making it a flexible and useful solution for enhancing unlearning processes.
CL · Aug 31, 2025
Reward-Weighted Sampling: Enhancing Non-Autoregressive Characteristics in Masked Diffusion LLMs
Daehoon Gwak, Minseo Jung, Junwoo Park et al.
Masked diffusion models (MDMs) offer a promising non-autoregressive alternative for large language modeling. Standard decoding methods for MDMs, such as confidence-based sampling, select tokens independently based on individual token confidences at each diffusion step. However, we observe that this independent token selection often results in generation orders resembling sequential autoregressive processes, limiting the advantages of non-autoregressive modeling. To mitigate this phenomenon, we propose Reward-Weighted Sampling (RWS), a novel decoding strategy that leverages an external reward model to provide a principled global signal during the iterative diffusion process. Specifically, at each diffusion step, RWS evaluates the quality of the entire intermediate sequence and scales token logits accordingly, guiding token selection by integrating global sequence-level coherence. This method selectively increases the confidence of tokens that initially have lower scores, thereby promoting a more non-autoregressive generation order. Furthermore, we provide theoretical justification showing that reward-weighted logit scaling induces beneficial rank reversals in token selection and consistently improves expected reward. Experiments demonstrate that RWS significantly promotes non-autoregressive generation orders, leading to improvements across multiple evaluation metrics. These results highlight the effectiveness of integrating global signals in enhancing both the non-autoregressive properties and overall performance of MDMs.
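The rank-reversal effect mentioned in this abstract can be illustrated with a small sketch of confidence-based position selection. It assumes the reward model's output has already been turned into a single scalar `scale` that multiplies every logit; that mapping, the function names, and the toy logits are assumptions for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidences(position_logits, scale=1.0):
    """Top-token probability at each masked position after logit scaling."""
    return [max(softmax([scale * x for x in logits]))
            for logits in position_logits]


def most_confident_position(position_logits, scale=1.0):
    """Index of the position confidence-based sampling would unmask next."""
    conf = confidences(position_logits, scale)
    return max(range(len(conf)), key=conf.__getitem__)


# Toy logits for two masked positions over a 4-token vocabulary.
position_logits = [
    [2.0, 0.0, 0.0, 0.0],      # one strong token against several weak ones
    [1.0, 0.0, -20.0, -20.0],  # two close candidates
]
print(most_confident_position(position_logits, scale=1.0))
print(most_confident_position(position_logits, scale=2.0))
```

With these toy values, the second position is the most confident without scaling, while scaling the logits by 2 makes the first position the most confident, so the order in which positions are filled changes.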
CL · Nov 21, 2024
Forecasting Future International Events: A Reliable Dataset for Text-Based Event Modeling
Daehoon Gwak, Junwoo Park, Minho Park et al.
Predicting future international events from textual information, such as news articles, has tremendous potential for applications in global policy, strategic decision-making, and geopolitics. However, existing datasets available for this task are often limited in quality, hindering the progress of related research. In this paper, we introduce WORLDREP (WORLD Relationship and Event Prediction), a novel dataset designed to address these limitations by leveraging the advanced reasoning capabilities of large language models (LLMs). Our dataset features high-quality scoring labels generated through advanced prompt modeling and rigorously validated by domain experts in political science. We showcase the quality and utility of WORLDREP for real-world event prediction tasks, demonstrating its effectiveness through extensive experiments and analysis. Furthermore, we publicly release our dataset along with the full automation source code for data collection, labeling, and benchmarking, aiming to support and advance research in text-based event prediction.
CL · Oct 24, 2024
Evaluating Automatic Speech Recognition Systems for Korean Meteorological Experts
ChaeHun Park, Hojun Cho, Jaegul Choo
This paper explores integrating Automatic Speech Recognition (ASR) into natural language query systems to improve weather forecasting efficiency for Korean meteorologists. We address challenges in developing ASR systems for the Korean weather domain, specifically specialized vocabulary and Korean linguistic intricacies. To tackle these issues, we constructed an evaluation dataset of spoken queries recorded by native Korean speakers. Using this dataset, we assessed various configurations of a multilingual ASR model family, identifying performance limitations related to domain-specific terminology. We then implemented a simple text-to-speech-based data augmentation method, which improved the recognition of specialized terms while maintaining general-domain performance. Our contributions include creating a domain-specific dataset, comprehensive ASR model evaluations, and an effective augmentation technique. We believe our work provides a foundation for future advancements in ASR for the Korean weather forecasting domain.
CL · Jun 18, 2024
The Comparative Trap: Pairwise Comparisons Amplifies Biased Preferences of LLM Evaluators
Hawon Jeong, ChaeHun Park, Jimin Hong et al.
As large language models (LLMs) are increasingly used as evaluators for natural language generation tasks, ensuring unbiased assessments is essential. However, LLM evaluators often display biased preferences, such as favoring verbosity and authoritative tones. Our empirical analysis reveals that these biases are exacerbated in pairwise evaluation, where LLMs directly compare two outputs and easily prioritize superficial attributes. In contrast, pointwise evaluation, which assesses outputs independently, is less susceptible to such bias because each output is judged in isolation. To address the limitations of the pairwise evaluation, we introduce a novel evaluation method, PRePair, which integrates pointwise reasoning within a pairwise framework. PRePair effectively alleviates biased preference, improving performance on the adversarial benchmark (LLMBar) while outperforming pointwise evaluation on the standard benchmark (MT-Bench).
CL · Jun 4, 2024
Translation Deserves Better: Analyzing Translation Artifacts in Cross-lingual Visual Question Answering
ChaeHun Park, Koanho Lee, Hyesu Lim et al.
Building a reliable visual question answering (VQA) system across different languages is a challenging problem, primarily due to the lack of abundant samples for training. To address this challenge, recent studies have employed machine translation systems for the cross-lingual VQA task. This involves translating the evaluation samples into a source language (usually English) and using monolingual models (i.e., translate-test). However, our analysis reveals that translated texts contain unique characteristics distinct from human-written ones, referred to as translation artifacts. We find that these artifacts can significantly affect the models, confirmed by extensive experiments across diverse models, languages, and translation processes. In light of this, we present a simple data augmentation strategy that can alleviate the adverse impacts of translation artifacts.
CL · Sep 1, 2021
Evaluating Predictive Uncertainty under Distributional Shift on Dialogue Dataset
Nyoungwoo Lee, ChaeHun Park, Ho-Jin Choi
In open-domain dialogues, predictive uncertainties are mainly evaluated in a domain shift setting to cope with out-of-distribution inputs. However, real-world conversations can contain a broader range of distributionally shifted inputs than out-of-distribution inputs alone. To evaluate this, we first propose two methods, Unknown Word (UW) and Insufficient Context (IC), which enable gradual distributional shifts by corrupting the dialogue dataset. We then investigate the effect of distributional shifts on accuracy and calibration. Our experiments show that the performance of existing uncertainty estimation methods consistently degrades as the shift intensifies. The results suggest that the proposed methods could be useful for evaluating the calibration of dialogue systems under distributional shifts.
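Gradual corruption of a dialogue can be sketched with two small helpers. The abstract names the two methods but does not spell out the corruption rules, so the behaviors below (replacing tokens with an unknown-word placeholder at a given rate, and keeping only the most recent turns) are assumptions chosen to match the method names, and the `[UNK]` placeholder and function names are hypothetical.

```python
import random


def unknown_word(tokens, rate, rng, unk="[UNK]"):
    """Replace each token with `unk` with probability `rate`.

    ASSUMPTION: a higher rate corresponds to a stronger shift.
    """
    return [unk if rng.random() < rate else tok for tok in tokens]


def insufficient_context(turns, keep):
    """Keep only the last `keep` dialogue turns (fewer turns = stronger shift)."""
    return list(turns[-keep:]) if keep > 0 else []


rng = random.Random(0)  # seeded so the corruption is reproducible
turns = ["hi there", "hello how are you", "fine thanks and you"]
for rate in (0.0, 0.3, 0.6):
    print(rate, unknown_word(turns[-1].split(), rate, rng))
print(insufficient_context(turns, keep=1))
```

Sweeping `rate` upward or `keep` downward yields a series of increasingly shifted test sets on which accuracy and calibration can be compared.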
IR · May 3, 2021
Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation
Soyeong Jeong, Jinheon Baek, ChaeHun Park et al.
One of the challenges in information retrieval (IR) is the vocabulary mismatch problem, which happens when the terms between queries and documents are lexically different but semantically similar. While recent work has proposed to expand the queries or documents by enriching their representations with additional relevant terms to address this challenge, they usually require a large volume of query-document pairs to train an expansion model. In this paper, we propose an Unsupervised Document Expansion with Generation (UDEG) framework with a pre-trained language model, which generates diverse supplementary sentences for the original document without using labels on query-document pairs for training. For generating sentences, we further stochastically perturb their embeddings to generate more diverse sentences for document expansion. We validate our framework on two standard IR benchmark datasets. The results show that our framework significantly outperforms relevant expansion baselines for IR.
CL · Feb 19, 2021
An Empirical Study on Measuring the Similarity of Sentential Arguments with Language Model Domain Adaptation
ChaeHun Park, Sangwoo Seo
Measuring the similarity between two different sentential arguments is an important task in argument mining. However, one of the challenges in this field is that the dataset must be annotated using expertise in a variety of topics, making supervised learning with labeled data expensive. In this paper, we investigated whether this problem could be alleviated through transfer learning. We first adapted a pretrained language model to a domain of interest using self-supervised learning. Then, we fine-tuned the model on the task of measuring the similarity between sentences taken from different domains. Our approach improves correlation with human-annotated similarity scores compared to competitive baseline models on the Argument Facet Similarity dataset in an unsupervised setting. Moreover, we achieve comparable performance to a fully supervised baseline model by using only about 60% of the labeled data samples. We believe that our work suggests the possibility of a generalized argument clustering model for various argumentative topics.