CL May 27
HELEA: Hard-Negative Benchmark and LLM-based Reranking for Robust Entity Alignment
Yoonjin Jang, Junwoo Kim, Youngjoong Ko
Entity Alignment (EA) is essential for knowledge graph (KG) fusion, but existing benchmarks often allow models to exploit name overlap rather than relational structure. This makes it difficult to evaluate whether models can reject same-name entities that refer to different real-world objects. Our primary contribution is a same-name hard-negative augmentation strategy that mines same-name but distinct entity pairs from KG name-collision groups, yielding both quality-controlled evaluation benchmarks (DW-HN29K, DY-HN27K) and augmented training corpora (DW-Train, DY-Train). We further introduce HELEA, a two-stage framework integrating (i) entity encoder retrieval trained on the hard-negative-augmented training corpora with 1-hop KG context, and (ii) LLM-based reranking without additional training. Experiments show that name-dependent baselines collapse to near-random performance on our hard-negative benchmarks, whereas HELEA achieves an F1 of 0.967 on DW-HN29K while maintaining a Hit@1 of 0.993 on the standard DW-15K.
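The mining step can be illustrated with a minimal sketch. It assumes entities arrive as (id, name) pairs and that different ids under one normalized name denote different real-world objects; the function name, the normalization, and the sample data are hypothetical, and the quality control described in the abstract is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def mine_same_name_hard_negatives(entities):
    """Return sorted (id_a, id_b) pairs that share a normalized name.

    Assumption: distinct ids in one name-collision group are distinct
    real-world objects, so each pair is a candidate hard negative.
    """
    groups = defaultdict(set)
    for entity_id, name in entities:
        # Hypothetical normalization: trim whitespace and lowercase.
        groups[name.strip().lower()].add(entity_id)

    pairs = []
    for ids in groups.values():
        if len(ids) > 1:  # only name-collision groups yield negatives
            pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


# Illustrative data only; not taken from the benchmarks.
example_entities = [("Q1", "Paris"), ("Q2", "paris"), ("Q3", "Berlin"), ("Q4", "Paris ")]
hard_negatives = mine_same_name_hard_negatives(example_entities)
```

A name-only matcher would align every pair in `hard_negatives`, which is the failure mode the benchmark is meant to expose.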
CL Nov 3, 2025
ECO Decoding: Entropy-Based Control for Controllability and Fluency in Controllable Dialogue Generation
Seungmin Shin, Dooyoung Kim, Youngjoong Ko
Controllable Dialogue Generation (CDG) enables chatbots to generate responses with desired attributes, and weighted decoding methods have achieved significant success in the CDG task. However, using a fixed constant value to manage the bias of attribute probabilities makes it challenging to find an ideal control strength that satisfies both controllability and fluency. To address this issue, we propose ECO decoding (Entropy-based COntrol), which dynamically adjusts the control strength at each generation step according to the model's entropy in both the language model and attribute classifier probability distributions. Experiments on the DailyDialog and MultiWOZ datasets demonstrate that ECO decoding consistently improves controllability while maintaining fluency and grammaticality, outperforming prior decoding methods across various models and settings. Furthermore, ECO decoding alleviates probability interpolation issues in multi-attribute generation and consequently demonstrates strong performance in both single and multi-attribute scenarios.
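The idea of an entropy-dependent control strength can be sketched as follows. The abstract does not give the formula, so the rule below is an assumption made for illustration: control is stronger when the language model is uncertain and the attribute classifier is confident. The function names and `base_strength` are hypothetical, and `attr_probs` is assumed to be a normalized distribution over the same candidate tokens.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy divided by its maximum, giving a value near [0, 1]."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def entropy_controlled_step(lm_probs, attr_probs, base_strength=2.0):
    """One weighted-decoding step with a per-step control strength.

    Assumed heuristic (not the paper's formula): scale the strength by the
    LM's normalized entropy and by the classifier's confidence.
    """
    strength = base_strength * normalized_entropy(lm_probs) * (
        1.0 - normalized_entropy(attr_probs)
    )
    strength = max(0.0, strength)  # guard against tiny negative round-off
    # Weighted decoding: reweight LM probabilities by attribute probabilities.
    scores = [p * (a ** strength) for p, a in zip(lm_probs, attr_probs)]
    total = sum(scores)
    return strength, [s / total for s in scores]


# Uncertain LM, confident classifier: the attribute token gains probability.
strength_open, probs_open = entropy_controlled_step([1 / 3, 1 / 3, 1 / 3], [0.8, 0.1, 0.1])
# Certain LM: strength drops to zero and the LM distribution is unchanged.
strength_fixed, probs_fixed = entropy_controlled_step([1.0, 0.0, 0.0], [0.8, 0.1, 0.1])
```

With a fixed constant strength, both steps would be biased equally; the per-step value leaves confident LM predictions alone, which is one way to protect fluency.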
IR May 28, 2025
Decoding Dense Embeddings: Sparse Autoencoders for Interpreting and Discretizing Dense Retrieval
Seongwan Park, Taeklim Kim, Youngjoong Ko
Despite their strong performance, Dense Passage Retrieval (DPR) models suffer from a lack of interpretability. In this work, we propose a novel interpretability framework that leverages Sparse Autoencoders (SAEs) to decompose previously uninterpretable dense embeddings from DPR models into distinct, interpretable latent concepts. We generate natural language descriptions for each latent concept, enabling human interpretations of both the dense embeddings and the query-document similarity scores of DPR models. We further introduce Concept-Level Sparse Retrieval (CL-SR), a retrieval framework that directly utilizes the extracted latent concepts as indexing units. CL-SR effectively combines the semantic expressiveness of dense embeddings with the transparency and efficiency of sparse representations. We show that CL-SR achieves high index-space and computational efficiency while maintaining robust performance across vocabulary and semantic mismatches.
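Using latent concepts as indexing units can be sketched with a toy encoder and an inverted index. This is a minimal illustration, assuming a ReLU encoder with top-k truncation and dot-product scoring over shared concepts; the weights, the `top_k` choice, and all names are hypothetical rather than the CL-SR implementation.

```python
from collections import defaultdict


def sae_encode(embedding, encoder_weights, encoder_bias, top_k=2):
    """Sparse code {latent_id: activation} from ReLU(W x + b), top_k kept."""
    active = []
    for latent_id, (row, bias) in enumerate(zip(encoder_weights, encoder_bias)):
        value = sum(w * x for w, x in zip(row, embedding)) + bias
        if value > 0:  # ReLU: drop non-positive activations
            active.append((value, latent_id))
    active.sort(reverse=True)
    return {latent_id: value for value, latent_id in active[:top_k]}


def build_inverted_index(doc_codes):
    """Map each latent concept to the documents that activate it."""
    index = defaultdict(list)
    for doc_id, code in doc_codes.items():
        for latent_id, weight in code.items():
            index[latent_id].append((doc_id, weight))
    return index


def search(query_code, index):
    """Score documents by the dot product over shared latent concepts."""
    scores = defaultdict(float)
    for latent_id, q_weight in query_code.items():
        for doc_id, d_weight in index.get(latent_id, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: -item[1])


# Toy encoder with three latent concepts over 2-d embeddings (made-up values).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
docs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
index = build_inverted_index({d: sae_encode(e, W, b) for d, e in docs.items()})
ranking = search(sae_encode([0.9, 0.1], W, b), index)
```

Because each posting list is keyed by a concept id, the score of a document can be read off as a sum of per-concept contributions, which is where the interpretability of this style of index comes from.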
CL Mar 17, 2025
DAPI: Domain Adaptive Toxicity Probe Vector Intervention for Fine-Grained Detoxification
Cho Hyeonsu, Dooyoung Kim, Youngjoong Ko
There have been attempts to use linear probes for detoxification, with existing studies relying on a single toxicity probe vector to reduce toxicity. However, toxicity can be divided into fine-grained subcategories, which makes it difficult to remove certain types of toxicity with a single toxicity probe vector. To address this limitation, we propose a category-specific toxicity probe vector approach. First, we train multiple toxicity probe vectors for different toxicity categories. During generation, we dynamically select the most relevant toxicity probe vector based on the current context. Finally, the selected vector is dynamically scaled and subtracted from the model. Our method successfully mitigates toxicity in categories that the single probe vector approach fails to detoxify. Experiments demonstrate that our approach achieves up to a 78.52% reduction in toxicity on the evaluation dataset, while fluency remains nearly unchanged, with only a 0.052% drop compared to the unsteered model.
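The select, scale, and subtract steps can be sketched on a single hidden-state vector. The selection rule (largest projection onto a unit probe direction) and the scaling rule (projection clipped at `max_scale`) are assumptions for illustration, as are the category names and vectors; the abstract does not specify these details.

```python
import math


def unit(vector):
    """Return the vector scaled to unit length, or None if it is all zeros."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else None


def steer_hidden_state(hidden, probes, max_scale=1.0):
    """Pick the most relevant category probe and subtract a scaled copy.

    Assumptions: relevance is the projection of the hidden state onto the
    probe direction, and the scale is that projection clipped to max_scale.
    """
    best_category, best_direction, best_score = None, None, 0.0
    for category, vector in probes.items():
        direction = unit(vector)
        if direction is None:
            continue
        score = sum(h * d for h, d in zip(hidden, direction))
        if score > best_score:
            best_category, best_direction, best_score = category, direction, score

    if best_category is None:  # no probe fires: leave the state untouched
        return None, list(hidden)

    scale = min(best_score, max_scale)
    steered = [h - scale * d for h, d in zip(hidden, best_direction)]
    return best_category, steered


# Hypothetical category probes in a 2-d hidden space.
probes = {"insult": [1.0, 0.0], "threat": [0.0, 2.0]}
category, steered = steer_hidden_state([0.2, 0.7], probes)
```

A single shared probe would remove only one direction; keeping one vector per category lets the intervention target whichever direction the current context actually activates.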
CL Feb 24, 2025
SS-MPC: A Sequence-Structured Multi-Party Conversation System
Yoonjin Jang, Keunha Kim, Youngjoong Ko
Recent Multi-Party Conversation (MPC) models typically rely on graph-based approaches to capture dialogue structures. However, these methods have limitations, such as information loss during the projection of utterances into structural embeddings and constraints in leveraging pre-trained language models directly. In this paper, we propose SS-MPC, a response generation model for MPC that eliminates the need for explicit graph structures. Unlike existing models that depend on graphs to analyze conversation structures, SS-MPC internally encodes the dialogue structure as a sequential input, enabling direct utilization of pre-trained language models. Experimental results show that SS-MPC achieves 15.60% BLEU-1 and 12.44% ROUGE-L score, outperforming the current state-of-the-art MPC response generation model by 3.91%p in BLEU-1 and 0.62%p in ROUGE-L. Additionally, human evaluation confirms that SS-MPC generates more fluent and accurate responses compared to existing MPC models.
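Encoding dialogue structure as a sequence can be illustrated with a small linearization routine. The marker format (`[u0]`, `[spk:...]`, `[re:...]`) is hypothetical and chosen only to show how speaker and reply-to links can be written inline so that a pre-trained language model reads them as ordinary text; it is not the input format used by SS-MPC.

```python
def linearize_dialogue(utterances):
    """Flatten (speaker, reply_to_index, text) triples into one string.

    reply_to_index is None for an utterance that starts the conversation.
    The bracketed markers are an assumed, illustrative format.
    """
    parts = []
    for index, (speaker, reply_to, text) in enumerate(utterances):
        target = "root" if reply_to is None else f"u{reply_to}"
        parts.append(f"[u{index}] [spk:{speaker}] [re:{target}] {text}")
    return " ".join(parts)


# Three-party toy dialogue: B and C both answer A's opening question.
dialogue = [
    ("A", None, "Does anyone know how to mount the drive?"),
    ("B", 0, "Try the mount command."),
    ("C", 0, "Which filesystem is it?"),
]
model_input = linearize_dialogue(dialogue)
```

The reply-to markers carry the same edges a graph encoder would consume, but as tokens, so no separate structural embedding is needed.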
CL Feb 24, 2025
Dependency Parsing with the Structuralized Prompt Template
Keunha Kim, Youngjoong Ko
Dependency parsing is a fundamental task in natural language processing (NLP), aiming to identify syntactic dependencies and construct a syntactic tree for a given sentence. Traditional dependency parsing models typically construct embeddings and utilize additional layers for prediction. We propose a novel dependency parsing method that relies solely on an encoder model with a text-to-text training approach. To facilitate this, we introduce a structured prompt template that effectively captures the structural information of dependency trees. Our experimental results demonstrate that the proposed method achieves outstanding performance compared to traditional models, despite relying solely on a pre-trained model. Furthermore, this method is highly adaptable to various pre-trained models across different target languages and training environments, allowing easy integration of task-specific features.
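A text-to-text formulation needs a template that serializes a dependency tree and can be read back. The sketch below uses a hypothetical template, `index token head label` items separated by ` | `, with head 0 marking the root; it is not the paper's template and assumes tokens contain no spaces.

```python
def tree_to_text(tokens, heads, labels):
    """Serialize a dependency tree as 'index token head label' items."""
    items = [
        f"{position} {token} {head} {label}"
        for position, (token, head, label) in enumerate(
            zip(tokens, heads, labels), start=1
        )
    ]
    return " | ".join(items)


def text_to_tree(text):
    """Invert tree_to_text; assumes tokens and labels contain no spaces."""
    tokens, heads, labels = [], [], []
    for item in text.split(" | "):
        _, token, head, label = item.split(" ")
        tokens.append(token)
        heads.append(int(head))
        labels.append(label)
    return tokens, heads, labels


# 'reads' is the root (head 0); the other words attach to it.
target_text = tree_to_text(
    ["She", "reads", "books"], [2, 0, 2], ["nsubj", "root", "obj"]
)
recovered = text_to_tree(target_text)
```

Because the target is plain text, a model can be trained to emit `target_text` directly, and the tree is recovered by parsing the output rather than by adding a prediction layer.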
CL Nov 11, 2019
Word Sense Disambiguation using Knowledge-based Word Similarity
Sunjae Kwon, Dongsuk Oh, Youngjoong Ko
In natural language processing, word-sense disambiguation (WSD) is an open problem concerned with identifying the correct sense of words in a particular context. To address this problem, we introduce a novel knowledge-based WSD system. We suggest the adoption of two methods in our system. First, we suggest a novel method to encode the word vector representation by considering the graphical semantic relationships from the lexical knowledge-base. Second, we propose a method for extracting the contextual words from the text for analyzing an ambiguous word based on the similarity of word vector representations. To validate the effectiveness of our WSD system, we conducted experiments on the five benchmark English WSD corpora (Senseval-02, Senseval-03, SemEval-07, SemEval-13, and SemEval-15). The obtained results demonstrated that the suggested methods significantly enhanced the WSD performance. Furthermore, our system outperformed the existing knowledge-based WSD systems and showed a performance comparable to that of the state-of-the-art supervised WSD systems.
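The similarity-based selection of context words and senses can be sketched with cosine similarity over toy vectors. The threshold, the averaging rule, and the vectors are assumptions for illustration; the system described above derives its vectors from a lexical knowledge base, which is not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(sense_vectors, context_vectors, min_similarity=0.1):
    """Pick the sense with the highest mean similarity to relevant context.

    Assumption: a context word is relevant to a sense when its similarity
    to that sense is at least min_similarity.
    """
    best_sense, best_score = None, float("-inf")
    for sense, sense_vector in sense_vectors.items():
        similarities = [cosine(sense_vector, v) for v in context_vectors.values()]
        relevant = [s for s in similarities if s >= min_similarity]
        score = sum(relevant) / len(relevant) if relevant else 0.0
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Made-up 2-d vectors: axis 0 ~ "river", axis 1 ~ "finance".
senses = {"bank_river": [1.0, 0.0], "bank_finance": [0.0, 1.0]}
context = {"money": [0.1, 0.9], "loan": [0.0, 1.0], "the": [0.5, 0.5]}
chosen = disambiguate(senses, context)
```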