CLApr 1, 2023
When Crowd Meets Persona: Creating a Large-Scale Open-Domain Persona Dialogue CorpusWon Ik Cho, Yoon Kyung Lee, Seoyeon Bae et al.
Building a natural language dataset requires caution since word semantics is vulnerable to subtle text change or the definition of the annotated concept. Such a tendency can be seen in generative tasks like question-answering and dialogue generation and also in tasks that create a categorization-based corpus, like topic classification or sentiment analysis. Open-domain conversations involve two or more crowdworkers freely conversing about any topic, and collecting such data is particularly difficult for two reasons: 1) the dataset should be ``crafted" rather than ``obtained" due to privacy concerns, and 2) paid creation of such dialogues may differ from how crowdworkers behave in real-world settings. In this study, we tackle these issues when creating a large-scale open-domain persona dialogue corpus, where persona implies that the conversation is performed by several actors with a fixed persona and user-side workers from an unspecified crowd.
CYMay 23
Position: Adopting AI in Practice Does Not Guarantee the Productivity BoostWon Ik Cho, Seong-hun Kim, Geunhye Kim
This position paper argues that adopting AI in organizational practice does not guarantee productivity gains, because human and environmental factors critically moderate the relationship between AI deployment and realized productivity improvements. Following the advent of high-performance generative models, AI use has been rapidly encouraged in some sectors while being restricted in others. Most practitioners assume that AI brings productivity boosts owing to enhanced technical capabilities, but regardless of apparent performance advances in AI technology, human and environmental factors of the organization may substantially attenuate -- or even negate -- the effective productivity benefits. We identify five key moderating factors: human resource composition, baseline capability of individuals, learning curve of practitioners, incentives for fair use, and flexibility of objectives. Drawing on the partial equilibrium model of Gries and Naudé (2022), we argue that existing economic frameworks may inadvertently overlook these factors. We revise the existing framework to redefine effective organizational determinants and shed light on practical implications including industry and education, responding to alternative views and calling for action of stakeholders.
CLApr 6, 2022
DAGAM: Data Augmentation with Generation And ModificationByeong-Cheol Jo, Tak-Sung Heo, Yeongjoon Park et al.
Text classification is a representative downstream task of natural language processing, and has exhibited excellent performance since the advent of pre-trained language models based on Transformer architecture. However, in pre-trained language models, under-fitting often occurs due to the size of the model being very large compared to the amount of available training data. Along with significant importance of data collection in modern machine learning paradigm, studies have been actively conducted for natural language data augmentation. In light of this, we introduce three data augmentation schemes that help reduce underfitting problems of large-scale language models. Primarily we use a generation model for data augmentation, which is defined as Data Augmentation with Generation (DAG). Next, we augment data using text modification techniques such as corruption and word order change (Data Augmentation with Modification, DAM). Finally, we propose Data Augmentation with Generation And Modification (DAGAM), which combines DAG and DAM techniques for a boosted performance. We conduct data augmentation for six benchmark datasets of text classification task, and verify the usefulness of DAG, DAM, and DAGAM through BERT-based fine-tuning and evaluation, deriving better results compared to the performance with original datasets.
CLOct 13, 2024Code
Single Ground Truth Is Not Enough: Adding Flexibility to Aspect-Based Sentiment Analysis EvaluationSoyoung Yang, Hojun Cho, Jiyoung Lee et al.
Aspect-based sentiment analysis (ABSA) is a challenging task of extracting sentiments along with their corresponding aspects and opinion terms from the text. The inherent subjectivity of span annotation makes variability in the surface forms of extracted terms, complicating the evaluation process. Traditional evaluation methods often constrain ground truths (GT) to a single term, potentially misrepresenting the accuracy of semantically valid predictions that differ in surface form. To address this limitation, we propose a novel and fully automated pipeline that expands existing evaluation sets by adding alternative valid terms for aspect and opinion. Our approach facilitates an equitable assessment of language models by accommodating multiple-answer candidates, resulting in enhanced human agreement compared to single-answer test sets (achieving up to a 10\%p improvement in Kendall's Tau score). Experimental results demonstrate that our expanded evaluation set helps uncover the capabilities of large language models (LLMs) in ABSA tasks, which is concealed by the single-answer GT sets. Consequently, our work contributes to the development of a flexible evaluation framework for ABSA by embracing diverse surface forms to span extraction tasks in a cost-effective and reproducible manner. Our code and dataset is open at https://github.com/dudrrm/zoom-in-n-out-absa.
CLMay 20, 2021Code
KLUE: Korean Language Understanding EvaluationSungjoon Park, Jihyung Moon, Sungdong Kim et al.
We introduce Korean Language Understanding Evaluation (KLUE) benchmark. KLUE is a collection of 8 Korean natural language understanding (NLU) tasks, including Topic Classification, SemanticTextual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking. We build all of the tasks from scratch from diverse source corpora while respecting copyrights, to ensure accessibility for anyone without any restrictions. With ethical considerations in mind, we carefully design annotation protocols. Along with the benchmark tasks and data, we provide suitable evaluation metrics and fine-tuning recipes for pretrained language models for each task. We furthermore release the pretrained language models (PLM), KLUE-BERT and KLUE-RoBERTa, to help reproducing baseline models on KLUE and thereby facilitate future research. We make a few interesting observations from the preliminary experiments using the proposed KLUE benchmark suite, already demonstrating the usefulness of this new benchmark suite. First, we find KLUE-RoBERTa-large outperforms other baselines, including multilingual PLMs and existing open-source Korean PLMs. Second, we see minimal degradation in performance even when we replace personally identifiable information from the pretraining corpus, suggesting that privacy and NLU capability are not at odds with each other. Lastly, we find that using BPE tokenization in combination with morpheme-level pre-tokenization is effective in tasks involving morpheme-level tagging, detection and generation. In addition to accelerating Korean NLP research, our comprehensive documentation on creating KLUE will facilitate creating similar resources for other languages in the future. KLUE is available at https://klue-benchmark.com.
CLDec 31, 2020Code
Open Korean Corpora: A Practical ReportWon Ik Cho, Sangwhan Moon, Youngsook Song
Korean is often referred to as a low-resource language in the research community. While this claim is partially true, it is also because the availability of resources is inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development, then further iterate through a list of current open datasets for different types of tasks. We then propose a direction on how open-source dataset construction and releases should be done for less-resourced languages to promote research.
CLJan 8
From National Curricula to Cultural Awareness: Constructing Open-Ended Culture-Specific Question Answering DatasetHaneul Yoo, Won Ik Cho, Geunhye Kim et al.
Large language models (LLMs) achieve strong performance on many tasks, but their progress remains uneven across languages and cultures, often reflecting values latent in English-centric training data. To enable practical cultural alignment, we propose a scalable approach that leverages national social studies curricula as a foundation for culture-aware supervision. We introduce CuCu, an automated multi-agent LLM framework that transforms national textbook curricula into open-ended, culture-specific question-answer pairs. Applying CuCu to the Korean national social studies curriculum, we construct KCaQA, comprising 34.1k open-ended QA pairs. Our quantitative and qualitative analyses suggest that KCaQA covers culture-specific topics and produces responses grounded in local sociocultural contexts.
CLJan 29, 2025
RICoTA: Red-teaming of In-the-wild Conversation with Test AttemptsEujeong Choi, Younghun Jeong, Soomin Kim et al.
User interactions with conversational agents (CAs) evolve in the era of heavily guardrailed large language models (LLMs). As users push beyond programmed boundaries to explore and build relationships with these systems, there is a growing concern regarding the potential for unauthorized access or manipulation, commonly referred to as "jailbreaking." Moreover, with CAs that possess highly human-like qualities, users show a tendency toward initiating intimate sexual interactions or attempting to tame their chatbots. To capture and reflect these in-the-wild interactions into chatbot designs, we propose RICoTA, a Korean red teaming dataset that consists of 609 prompts challenging LLMs with in-the-wild user-made dialogues capturing jailbreak attempts. We utilize user-chatbot conversations that were self-posted on a Korean Reddit-like community, containing specific testing and gaming intentions with a social chatbot. With these prompts, we aim to evaluate LLMs' ability to identify the type of conversation and users' testing purposes to derive chatbot design implications for mitigating jailbreaking risks. Our dataset will be made publicly available via GitHub.
CLJan 10, 2025
Hermit Kingdom Through the Lens of Multiple Perspectives: A Case Study of LLM Hallucination on North KoreaEunjung Cho, Won Ik Cho, Soomin Seo
Hallucination in large language models (LLMs) remains a significant challenge for their safe deployment, particularly due to its potential to spread misinformation. Most existing solutions address this challenge by focusing on aligning the models with credible sources or by improving how models communicate their confidence (or lack thereof) in their outputs. While these measures may be effective in most contexts, they may fall short in scenarios requiring more nuanced approaches, especially in situations where access to accurate data is limited or determining credible sources is challenging. In this study, we take North Korea - a country characterised by an extreme lack of reliable sources and the prevalence of sensationalist falsehoods - as a case study. We explore and evaluate how some of the best-performing multilingual LLMs and specific language-based models generate information about North Korea in three languages spoken in countries with significant geo-political interests: English (United States, United Kingdom), Korean (South Korea), and Mandarin Chinese (China). Our findings reveal significant differences, suggesting that the choice of model and language can lead to vastly different understandings of North Korea, which has important implications given the global security challenges the country poses.
CLApr 17, 2024
Evaluating Span Extraction in Generative Paradigm: A Reflection on Aspect-Based Sentiment AnalysisSoyoung Yang, Won Ik Cho
In the era of rapid evolution of generative language models within the realm of natural language processing, there is an imperative call to revisit and reformulate evaluation methodologies, especially in the domain of aspect-based sentiment analysis (ABSA). This paper addresses the emerging challenges introduced by the generative paradigm, which has moderately blurred traditional boundaries between understanding and generation tasks. Building upon prevailing practices in the field, we analyze the advantages and shortcomings associated with the prevalent ABSA evaluation paradigms. Through an in-depth examination, supplemented by illustrative examples, we highlight the intricacies involved in aligning generative outputs with other evaluative metrics, specifically those derived from other tasks, including question answering. While we steer clear of advocating for a singular and definitive metric, our contribution lies in paving the path for a comprehensive guideline tailored for ABSA evaluations in this generative paradigm. In this position paper, we aim to provide practitioners with profound reflections, offering insights and directions that can aid in navigating this evolving landscape, ensuring evaluations that are both accurate and reflective of generative capabilities.
CLFeb 25, 2022
APEACH: Attacking Pejorative Expressions with Analysis on Crowd-Generated Hate Speech Evaluation DatasetsKichang Yang, Wonjun Jang, Won Ik Cho
In hate speech detection, developing training and evaluation datasets across various domains is the critical issue. Whereas, major approaches crawl social media texts and hire crowd-workers to annotate the data. Following this convention often restricts the scope of pejorative expressions to a single domain lacking generalization. Sometimes domain overlap between training corpus and evaluation set overestimate the prediction performance when pretraining language models on low-data language. To alleviate these problems in Korean, we propose APEACH that asks unspecified users to generate hate speech examples followed by minimal post-labeling. We find that APEACH can collect useful datasets that are less sensitive to the lexical overlaps between the pretraining corpus and the evaluation set, thereby properly measuring the model performance.
CLJul 6, 2021
Kosp2e: Korean Speech to English Translation CorpusWon Ik Cho, Seok Min Kim, Hyunchang Cho et al.
Most speech-to-text (S2T) translation studies use English speech as a source, which makes it difficult for non-English speakers to take advantage of the S2T technologies. For some languages, this problem was tackled through corpus construction, but the farther linguistically from English or the more under-resourced, this deficiency and underrepresentedness becomes more significant. In this paper, we introduce kosp2e (read as `kospi'), a corpus that allows Korean speech to be translated into English text in an end-to-end manner. We adopt open license speech recognition corpus, translation corpus, and spoken language corpora to make our dataset freely available to the public, and check the performance through the pipeline and training-based approaches. Using pipeline and various end-to-end schemes, we obtain the highest BLEU of 21.3 and 18.0 for each based on the English hypothesis, validating the feasibility of our data. We plan to supplement annotations for other target languages through community contributions in the future.
CLMar 24, 2021
StyleKQC: A Style-Variant Paraphrase Corpus for Korean Questions and CommandsWon Ik Cho, Sangwhan Moon, Jong In Kim et al.
Paraphrasing is often performed with less concern for controlled style conversion. Especially for questions and commands, style-variant paraphrasing can be crucial in tone and manner, which also matters with industrial applications such as dialog systems. In this paper, we attack this issue with a corpus construction scheme that simultaneously considers the core content and style of directives, namely intent and formality, for the Korean language. Utilizing manually generated natural language queries on six daily topics, we expand the corpus to formal and informal sentences by human rewriting and transferring. We verify the validity and industrial applicability of our approach by checking the adequate classification and inference performance that fit with conventional fine-tuning approaches, at the same time proposing a supervised formality transfer task.
CLMay 26, 2020
BEEP! Korean Corpus of Online News Comments for Toxic Speech DetectionJihyung Moon, Won Ik Cho, Junbum Lee
Toxic comments in online platforms are an unavoidable social issue under the cloak of anonymity. Hate speech detection has been actively done for languages such as English, German, or Italian, where manually labeled corpus has been released. In this work, we first present 9.4K manually labeled entertainment news comments for identifying Korean toxic speech, collected from a widely used online news platform in Korea. The comments are annotated regarding social bias and hate speech since both aspects are correlated. The inter-annotator agreement Krippendorff's alpha score is 0.492 and 0.496, respectively. We provide benchmarks using CharCNN, BiLSTM, and BERT, where BERT achieves the highest score on all tasks. The models generally display better performance on bias identification, since the hate speech detection is a more subjective issue. Additionally, when BERT is trained with bias label for hate speech detection, the prediction score increases, implying that bias and hate are intertwined. We make our dataset publicly available and open competitions with the corpus and benchmarks.
CLMay 17, 2020
Speech to Text Adaptation: Towards an Efficient Cross-Modal DistillationWon Ik Cho, Donghyun Kwak, Ji Won Yoon et al.
Speech is one of the most effective means of communication and is full of information that helps the transmission of utterer's thoughts. However, mainly due to the cumbersome processing of acoustic features, phoneme or word posterior probability has frequently been discarded in understanding the natural language. Thus, some recent spoken language understanding (SLU) modules have utilized end-to-end structures that preserve the uncertainty information. This further reduces the propagation of speech recognition error and guarantees computational efficiency. We claim that in this process, the speech comprehension can benefit from the inference of massive pre-trained language models (LMs). We transfer the knowledge from a concrete Transformer-based text LM to an SLU module which can face a data shortage, based on recent cross-modal distillation methodologies. We demonstrate the validity of our proposal upon the performance on Fluent Speech Command, an English SLU benchmark. Thereby, we experimentally verify our hypothesis that the knowledge could be shared from the top layer of the LM to a fully speech-based module, in which the abstracted speech is expected to meet the semantic representation.
CLDec 1, 2019
Machines Getting with the Program: Understanding Intent Arguments of Non-Canonical DirectivesWon Ik Cho, Young Ki Moon, Sangwhan Moon et al.
Modern dialog managers face the challenge of having to fulfill human-level conversational skills as part of common user expectations, including but not limited to discourse with no clear objective. Along with these requirements, agents are expected to extrapolate intent from the user's dialogue even when subjected to non-canonical forms of speech. This depends on the agent's comprehension of paraphrased forms of such utterances. Especially in low-resource languages, the lack of data is a bottleneck that prevents advancements of the comprehension performance for these types of agents. In this regard, here we demonstrate the necessity of extracting the intent argument of non-canonical directives in a natural language format, which may yield more accurate parsing, and suggest guidelines for building a parallel corpus for this purpose. Following the guidelines, we construct a Korean corpus of 50K instances of question/command-intent pairs, including the labels for classification of the utterance type. We also propose a method for mitigating class imbalance, demonstrating the potential applications of the corpus generation method and its multilingual extensibility.
CLOct 21, 2019
Text Matters but Speech Influences: A Computational Analysis of Syntactic Ambiguity ResolutionWon Ik Cho, Jeonghwa Cho, Woo Hyun Kang et al.
Analyzing how human beings resolve syntactic ambiguity has long been an issue of interest in the field of linguistics. It is, at the same time, one of the most challenging issues for spoken language understanding (SLU) systems as well. As syntactic ambiguity is intertwined with issues regarding prosody and semantics, the computational approach toward speech intention identification is expected to benefit from the observations of the human language processing mechanism. In this regard, we address the task with attentive recurrent neural networks that exploit acoustic and textual features simultaneously and reveal how the modalities interact with each other to derive sentence meaning. Utilizing a speech corpus recorded on Korean scripts of syntactically ambiguous utterances, we revealed that co-attention frameworks, namely multi-hop attention and cross-attention, show significantly superior performance in disambiguating speech intention. With further analysis, we demonstrate that the computational models reflect the internal relationship between auditory and linguistic processes.
CLMay 31, 2019
Investigating an Effective Character-level Embedding in Korean Sentence ClassificationWon Ik Cho, Seok Min Kim, Nam Soo Kim
Different from the writing systems of many Romance and Germanic languages, some languages or language families show complex conjunct forms in character composition. For such cases where the conjuncts consist of the components representing consonant(s) and vowel, various character encoding schemes can be adopted beyond merely making up a one-hot vector. However, there has been little work done on intra-language comparison regarding performances using each representation. In this study, utilizing the Korean language which is character-rich and agglutinative, we investigate an encoding scheme that is the most effective among Jamo-level one-hot, character-level one-hot, character-level dense, and character-level multi-hot. Classification performance with each scheme is evaluated on two corpora: one on binary sentiment analysis of movie reviews, and the other on multi-class identification of intention types. The result displays that the character-level features show higher performance in general, although the Jamo-level features may show compatibility with the attention-based models if guaranteed adequate parameter set size.
CLMay 28, 2019
On Measuring Gender Bias in Translation of Gender-neutral PronounsWon Ik Cho, Ji Won Kim, Seok Min Kim et al.
Ethics regarding social bias has recently thrown striking issues in natural language processing. Especially for gender-related topics, the need for a system that reduces the model bias has grown in areas such as image captioning, content recommendation, and automated employment. However, detection and evaluation of gender bias in the machine translation systems are not yet thoroughly investigated, for the task being cross-lingual and challenging to define. In this paper, we propose a scheme for making up a test set that evaluates the gender bias in a machine translation system, with Korean, a language with gender-neutral pronouns. Three word/phrase sets are primarily constructed, each incorporating positive/negative expressions or occupations; all the terms are gender-independent or at least not biased to one side severely. Then, additional sentence lists are constructed concerning formality of the pronouns and politeness of the sentences. With the generated sentence set of size 4,236 in total, we evaluate gender bias in conventional machine translation systems utilizing the proposed measure, which is termed here as translation gender bias index (TGBI). The corpus and the code for evaluation is available on-line.
CLNov 10, 2018
Speech Intention Understanding in a Head-final Language: A Disambiguation Utilizing Intonation-dependencyWon Ik Cho, Hyeon Seung Lee, Ji Won Yoon et al.
For a large portion of real-life utterances, the intention cannot be solely decided by either their semantic or syntactic characteristics. Although not all the sociolinguistic and pragmatic information can be digitized, at least phonetic features are indispensable in understanding the spoken language. Especially in head-final languages such as Korean, sentence-final prosody has great importance in identifying the speaker's intention. This paper suggests a system which identifies the inherent intention of a spoken utterance given its transcript, in some cases using auxiliary acoustic features. The main point here is a separate distinction for cases where discrimination of intention requires an acoustic cue. Thus, the proposed classification system decides whether the given utterance is a fragment, statement, question, command, or a rhetorical question/command, utilizing the intonation-dependency coming from the head-finality. Based on an intuitive understanding of the Korean language that is engaged in the data annotation, we construct a network which identifies the intention of a speech, and validate its utility with the test sentences. The system, if combined with up-to-date speech recognizers, is expected to be flexibly inserted into various language understanding modules.
CLOct 31, 2018
Giving Space to Your Message: Assistive Word Segmentation for the Electronic Typing of Digital MinoritiesWon Ik Cho, Sung Jun Cheon, Woo Hyun Kang et al.
For readability and disambiguation of the written text, appropriate word segmentation is recommended for documentation, and it also holds for the digitized texts. If the language is agglutinative while far from scriptio continua, for instance in the Korean language, the problem becomes more significant. However, some device users these days find it challenging to communicate via key stroking, not only for handicap but also for being unskilled. In this study, we propose a real-time assistive technology that utilizes an automatic word segmentation, designed for digital minorities who are not familiar with electronic typing. We propose a data-driven system trained upon a spoken Korean language corpus with various non-canonical expressions and dialects, guaranteeing the comprehension of contextual information. Through quantitative and qualitative comparison with other text processing toolkits, we show the reliability of the proposed system and its fit with colloquial and non-normalized texts, which fulfills the aim of supportive technology.
CLOct 10, 2018
Extracting Arguments from Korean Question and Command: An Annotated Corpus for Structured ParaphrasingWon Ik Cho, Young Ki Moon, Woo Hyun Kang et al.
Intention identification is a core issue in dialog management. However, due to the non-canonicality of the spoken language, it is difficult to extract the content automatically from the conversation-style utterances. This is much more challenging for languages like Korean and Japanese since the agglutination between morphemes make it difficult for the machines to parse the sentence and understand the intention. To suggest a guideline for this problem, and to merge the issue flexibly with the neural paraphrasing systems introduced recently, we propose a structured annotation scheme for Korean question/commands and the resulting corpus which are widely applicable to the field of argument mining. The scheme and dataset are expected to help machines understand the intention of natural language and grasp the core meaning of conversation-style instructions.