Grace Byun

CL
h-index6
6papers
69citations
Novelty47%
AI Score47

6 Papers

CLMay 28, 2025Code
Measuring Sycophancy of Language Models in Multi-turn Dialogues

Jiseung Hong, Grace Byun, Seungone Kim et al.

Large Language Models (LLMs) are expected to provide helpful and harmless responses, yet they often exhibit sycophancy--conforming to user beliefs regardless of factual accuracy or ethical soundness. Prior research on sycophancy has primarily focused on single-turn factual correctness, overlooking the dynamics of real-world interactions. In this work, we introduce SYCON Bench, a novel benchmark for evaluating sycophantic behavior in multi-turn, free-form conversational settings. Our benchmark measures how quickly a model conforms to the user (Turn of Flip) and how frequently it shifts its stance under sustained user pressure (Number of Flip). Applying SYCON Bench to 17 LLMs across three real-world scenarios, we find that sycophancy remains a prevalent failure mode. Our analysis shows that alignment tuning amplifies sycophantic behavior, whereas model scaling and reasoning optimization strengthen the model's ability to resist undesirable user views. Reasoning models generally outperform instruction-tuned models but often fail when they over-index on logical exposition instead of directly addressing the user's underlying beliefs. Finally, we evaluate four additional prompting strategies and demonstrate that adopting a third-person perspective reduces sycophancy by up to 63.8% in debate scenario. We release our code and data at https://github.com/JiseungHong/SYCON-Bench.

CLApr 18, 2025Code
D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model

Grace Byun, Jinho D. Choi

Evaluating generative models with open-ended generation is challenging due to inconsistencies in response formats. Multiple-choice (MC) evaluation mitigates this issue, but generating high-quality distractors is time-consuming and labor-intensive. We introduce D-GEN, the first open-source distractor generator model that transforms open-ended data into an MC format. To evaluate distractor quality, we propose two novel methods: (1) ranking alignment, ensuring generated distractors retain the discriminatory power of ground-truth distractors, and (2) entropy analysis, comparing model confidence distributions. Our results show that D-GEN preserves ranking consistency (Spearman's rho 0.99, Kendall's tau 0.94) and closely matches the entropy distribution of ground-truth distractors. Human evaluation further confirms the fluency, coherence, distractiveness, and incorrectness. Our work advances robust and efficient distractor generation with automated evaluation, setting a new standard for MC evaluation.

CLApr 18, 2025Code
Secure Multifaceted-RAG for Enterprise: Hybrid Knowledge Retrieval with Security Filtering

Grace Byun, Shinsun Lee, Nayoung Choi et al.

Existing Retrieval-Augmented Generation (RAG) systems face challenges in enterprise settings due to limited retrieval scope and data security risks. When relevant internal documents are unavailable, the system struggles to generate accurate and complete responses. Additionally, using closed-source Large Language Models (LLMs) raises concerns about exposing proprietary information. To address these issues, we propose the Secure Multifaceted-RAG (SecMulti-RAG) framework, which retrieves not only from internal documents but also from two supplementary sources: pre-generated expert knowledge for anticipated queries and on-demand external LLM-generated knowledge. To mitigate security risks, we adopt a local open-source generator and selectively utilize external LLMs only when prompts are deemed safe by a filtering mechanism. This approach enhances completeness, prevents data leakage, and reduces costs. In our evaluation on a report generation task in the automotive industry, SecMulti-RAG significantly outperforms traditional RAG - achieving 79.3 to 91.9 percent win rates across correctness, richness, and helpfulness in LLM-based evaluation, and 56.3 to 70.4 percent in human evaluation. This highlights SecMulti-RAG as a practical and secure solution for enterprise RAG.

CLNov 13, 2025
LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation

Grace Byun, Swati Rajwal, Jinho D. Choi

Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes and receive project reports from 14 teams. LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o achieves strong correlation with human graders (up to 0.98) and exact score agreement in 55\% of quiz cases. For project reports, it also shows strong overall alignment with human grading, while exhibiting some variability in scoring technical, open-ended responses. We release all code and sample data to support further research on LLMs in educational assessment. This work highlights both the potential and limitations of LLM-based grading systems and contributes to advancing automated grading in real-world academic settings.

AIFeb 26, 2025
Reference-Aligned Retrieval-Augmented Question Answering over Heterogeneous Proprietary Documents

Nayoung Choi, Grace Byun, Andrew Chung et al.

Proprietary corporate documents contain rich domain-specific knowledge, but their overwhelming volume and disorganized structure make it difficult even for employees to access the right information when needed. For example, in the automotive industry, vehicle crash-collision tests, each costing hundreds of thousands of dollars, produce highly detailed documentation. However, retrieving relevant content during decision-making remains time-consuming due to the scale and complexity of the material. While Retrieval-Augmented Generation (RAG)-based Question Answering (QA) systems offer a promising solution, building an internal RAG-QA system poses several challenges: (1) handling heterogeneous multi-modal data sources, (2) preserving data confidentiality, and (3) enabling traceability between each piece of information in the generated answer and its original source document. To address these, we propose a RAG-QA framework for internal enterprise use, consisting of: (1) a data pipeline that converts raw multi-modal documents into a structured corpus and QA pairs, (2) a fully on-premise, privacy-preserving architecture, and (3) a lightweight reference matcher that links answer segments to supporting content. Applied to the automotive domain, our system improves factual correctness (+1.79, +1.94), informativeness (+1.33, +1.16), and helpfulness (+1.08, +1.67) over a non-RAG baseline, based on 1-5 scale ratings from both human and LLM judge.

CLOct 27, 2025
CRADLE Bench: A Clinician-Annotated Benchmark for Multi-Faceted Mental Health Crisis and Safety Risk Detection

Grace Byun, Rebecca Lipschutz, Sean T. Minton et al.

Detecting mental health crisis situations such as suicide ideation, rape, domestic violence, child abuse, and sexual harassment is a critical yet underexplored challenge for language models. When such situations arise during user--model interactions, models must reliably flag them, as failure to do so can have serious consequences. In this work, we introduce CRADLE BENCH, a benchmark for multi-faceted crisis detection. Unlike previous efforts that focus on a limited set of crisis types, our benchmark covers seven types defined in line with clinical standards and is the first to incorporate temporal labels. Our benchmark provides 600 clinician-annotated evaluation examples and 420 development examples, together with a training corpus of around 4K examples automatically labeled using a majority-vote ensemble of multiple language models, which significantly outperforms single-model annotation. We further fine-tune six crisis detection models on subsets defined by consensus and unanimous ensemble agreement, providing complementary models trained under different agreement criteria.