Bruce W. Lee

CL
h-index26
17papers
2,683citations
Novelty44%
AI Score44

17 Papers

LGSep 6, 2024Code
Programming Refusal with Conditional Activation Steering

Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy et al. · ibm-research

LLMs have shown remarkable capabilities, but precisely controlling their response behavior remains challenging. Existing activation steering methods alter LLM behavior indiscriminately, limiting their practical applicability in settings where selective responses are essential, such as content moderation or domain-specific assistants. In this paper, we propose Conditional Activation Steering (CAST), which analyzes LLM activation patterns during inference to selectively apply or withhold activation steering based on the input context. Our method is based on the observation that different categories of prompts activate distinct patterns in the model's hidden states. Using CAST, one can systematically control LLM behavior with rules like "if input is about hate speech or adult content, then refuse" or "if input is not about legal advice, then refuse." This allows for selective modification of responses to specific content while maintaining normal responses to other content, all without requiring weight optimization. We release an open-source implementation of our framework at github.com/IBM/activation-steering .

CLOct 14, 2023
Instruction Tuning with Human Curriculum

Bruce W. Lee, Hyunsoo Cho, Kang Min Yoo

In this work, we (1) introduce Curriculum Instruction Tuning, (2) explore the potential advantages of employing diverse curriculum strategies, and (3) delineate a synthetic instruction-response generation framework that complements our theoretical approach. Distinct from the existing instruction tuning dataset, our generation pipeline is systematically structured to emulate the sequential and orderly characteristic of human learning. Additionally, we describe a methodology for generating instruction-response datasets that extensively span the various stages of human education, from middle school through the graduate level, utilizing educational subject catalogs. Before training, we meticulously organize the instruction data to ensure that questions escalate in difficulty regarding (A) the subject matter and (B) the intricacy of the instructions. The findings of our study reveal that substantial improvements in performance can be achieved through the mere application of curriculum ordering to instruction data (achieving gains of +4.76 on TruthfulQA, +2.98 on MMLU, +2.8 on OpenbookQA, and +1.28 on ARC-hard) compared to random shuffling. This enhancement is achieved without incurring additional computational expenses. Through comprehensive experimentation, we observe that the advantages of our proposed method are consistently evident across nine benchmarks.

LGFeb 25
Training Agents to Self-Report Misbehavior

Bruce W. Lee, Chen Yueh-Han, Tomek Korbak · berkeley

Frontier AI agents may pursue hidden goals while concealing their pursuit from oversight. Alignment training aims to prevent such behavior by reinforcing the correct goals, but alignment may not always succeed and can lead to unwanted side effects. We propose self-incrimination training, which instead trains agents to produce a visible signal when they covertly misbehave. We train GPT-4.1 and Gemini-2.0 agents to call a report_scheming() tool when behaving deceptively and measure their ability to cause harm undetected in out-of-distribution environments. Self-incrimination significantly reduces the undetected successful attack rate, outperforming matched-capability monitors and alignment baselines while preserving instruction hierarchy and incurring minimal safety tax on general capabilities. Unlike blackbox monitoring, self-incrimination performance is consistent across tasks regardless of how suspicious the misbehavior appears externally. The trained behavior persists under adversarial prompt optimization and generalizes to settings where agents pursue misaligned goals themselves rather than being instructed to misbehave. Our results suggest self-incrimination offers a viable path for reducing frontier misalignment risk, one that neither assumes misbehavior can be prevented nor that it can be reliably classified from the outside.

CLFeb 25, 2023
Prompt-based Learning for Text Readability Assessment

Bruce W. Lee, Jason Hyung-Jong Lee

We propose the novel adaptation of a pre-trained seq2seq model for readability assessment. We prove that a seq2seq model - T5 or BART - can be adapted to discern which text is more difficult from two given texts (pairwise). As an exploratory study to prompt-learn a neural network for text readability in a text-to-text manner, we report useful tips for future work in seq2seq training and ranking-based approach to readability assessment. Specifically, we test nine input-output formats/prefixes and show that they can significantly influence the final model performance. Also, we argue that the combination of text-to-text training and pairwise ranking setup 1) enables leveraging multiple parallel text simplification data for teaching readability and 2) trains a neural model for the general concept of readability (therefore, better cross-domain generalization). At last, we report a 99.6% pairwise classification accuracy on Newsela and a 98.7% for OneStopEnglish, through a joint training approach.

CLJan 8, 2023
Traditional Readability Formulas Compared for English

Bruce W. Lee, Jason Hyung-Jong Lee

Traditional English readability formulas, or equations, were largely developed in the 20th century. Nonetheless, many researchers still rely on them for various NLP applications. This phenomenon is presumably due to the convenience and straightforwardness of readability formulas. In this work, we contribute to the NLP community by 1. introducing New English Readability Formula (NERF), 2. recalibrating the coefficients of old readability formulas (Flesch-Kincaid Grade Level, Fog Index, SMOG Index, Coleman-Liau Index, and Automated Readability Index), 3. evaluating the readability formulas, for use in text simplification studies and medical texts, and 4. developing a Python-based program for the wide application to various NLP projects.

CLFeb 17, 2024Code
Language Models Don't Learn the Physical Manifestation of Language

Bruce W. Lee, JaeHyuk Lim

We argue that language-only models don't learn the physical manifestation of language. We present an empirical investigation of visual-auditory properties of language through a series of tasks, termed H-Test. These tasks highlight a fundamental gap between human linguistic understanding and the sensory-deprived linguistic understanding of LLMs. In support of our hypothesis, 1. deliberate reasoning (Chain-of-Thought), 2. few-shot examples, or 3. stronger LLM from the same model family (LLaMA 2 13B -> LLaMA 2 70B) has no significant effect on H-Test performance. We bring in the philosophical case of Mary, who learns about the world in a sensory-deprived environment as a useful conceptual framework to understand how language-only models learn about the world (Jackson, 1986). Our experiments show that some of the strongest proprietary LLMs stay near random chance baseline accuracy of 50%, highlighting the limitations of linguistic knowledge acquired in the absence of sensory experience. Our code and data are available at <github.com/brucewlee/h-test>.

CLAug 16, 2024
When Prompting Fails to Sway: Inertia in Moral and Value Judgments of Large Language Models

Bruce W. Lee, Yeongheon Lee, Hyunsoo Cho

Large Language Models (LLMs) exhibit non-deterministic behavior, and prompting has emerged as a primary method for steering their outputs toward desired directions. One popular strategy involves assigning a specific "persona" to the model to induce more varied and context-sensitive responses, akin to the diversity found in human perspectives. However, contrary to the expectation that persona-based prompting would yield a wide range of opinions, our experiments demonstrate that LLMs maintain consistent value orientations. In particular, we observe a persistent inertia in their responses, where certain moral and value dimensions, especially harm avoidance and fairness, remain distinctly skewed in one direction despite varied persona settings. To investigate this phenomenon systematically, use role-play at scale, which combines randomized, diverse persona prompts with a macroscopic trend analysis of model outputs. Our findings highlight the strong internal biases and value preferences in LLMs, underscoring the need for careful scrutiny and potential adjustment of these models to ensure balanced and equitable applications.

CLJul 7, 2023
A Side-by-side Comparison of Transformers for English Implicit Discourse Relation Classification

Bruce W. Lee, BongSeok Yang, Jason Hyung-Jong Lee

Though discourse parsing can help multiple NLP fields, there has been no wide language model search done on implicit discourse relation classification. This hinders researchers from fully utilizing public-available models in discourse analysis. This work is a straightforward, fine-tuned discourse performance comparison of seven pre-trained language models. We use PDTB-3, a popular discourse relation annotated dataset. Through our model search, we raise SOTA to 0.671 ACC and obtain novel observations. Some are contrary to what has been reported before (Shi and Demberg, 2019b), that sentence-level pre-training objectives (NSP, SBO, SOP) generally fail to produce the best performing model for implicit discourse relation classification. Counterintuitively, similar-sized PLMs with MLM and full attention led to better performance.

CLMay 14, 2022
Auto-Select Reading Passages in English Assessment Tests?

Bruce W. Lee, Jason H. Lee

We show a method to auto-select reading passages in English assessment tests and share some key insights that can be helpful in related fields. In specifics, we prove that finding a similar passage (to a passage that already appeared in the test) can give a suitable passage for test development. In the process, we create a simple database-tagger-filter algorithm and perform a human evaluation. However, 1. the textual features, that we analyzed, lack coverage, and 2. we fail to find meaningful correlations between each feature and suitability score. Lastly, we describe the future developments to improve automated reading passage selection.

AIAug 17, 2024
Measuring Agreeableness Bias in Multimodal Models

Jaehyuk Lim, Bruce W. Lee

This paper examines a phenomenon in multimodal language models where pre-marked options in question images can significantly influence model responses. Our study employs a systematic methodology to investigate this effect: we present models with images of multiple-choice questions, which they initially answer correctly, then expose the same model to versions with pre-marked options. Our findings reveal a significant shift in the models' responses towards the pre-marked option, even when it contradicts their answers in the neutral settings. Comprehensive evaluations demonstrate that this agreeableness bias is a consistent and quantifiable behavior across various model architectures. These results show potential limitations in the reliability of these models when processing images with pre-marked options, raising important questions about their application in critical decision-making contexts where such visual cues might be present.

CLMay 25, 2023Code
LFTK: Handcrafted Features in Computational Linguistics

Bruce W. Lee, Jason Hyung-Jong Lee

Past research has identified a rich set of handcrafted linguistic features that can potentially assist various tasks. However, their extensive number makes it difficult to effectively select and utilize existing handcrafted features. Coupled with the problem of inconsistent implementation across research works, there has been no categorization scheme or generally-accepted feature names. This creates unwanted confusion. Also, most existing handcrafted feature extraction libraries are not open-source or not actively maintained. As a result, a researcher often has to build such an extraction system from the ground up. We collect and categorize more than 220 popular handcrafted features grounded on past literature. Then, we conduct a correlation analysis study on several task-specific datasets and report the potential use cases of each feature. Lastly, we devise a multilingual handcrafted linguistic feature extraction system in a systematically expandable manner. We open-source our system for public access to a rich set of pre-implemented handcrafted features. Our system is coined LFTK and is the largest of its kind. Find it at github.com/brucewlee/lftk.

LGFeb 12, 2025
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Mantas Mazeika, Xuwang Yin, Rishub Tamirisa et al.

As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.

CLApr 2, 2024
HyperCLOVA X Technical Report

Kang Min Yoo, Jaegeun Han, Sookyo In et al.

We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs.

CLMay 25, 2023
Linguistic Properties of Truthful Response

Bruce W. Lee, Benedict Florance Arockiaraj, Helen Jin

We investigate the phenomenon of an LLM's untruthful response using a large set of 220 handcrafted linguistic features. We focus on GPT-3 models and find that the linguistic profiles of responses are similar across model sizes. That is, how varying-sized LLMs respond to given prompts stays similar on the linguistic properties level. We expand upon this finding by training support vector machines that rely only upon the stylistic components of model responses to classify the truthfulness of statements. Though the dataset size limits our current findings, we show the possibility that truthfulness detection is possible without evaluating the content itself. But at the same time, the limited scope of our experiments must be taken into account in interpreting the results.

CLSep 25, 2021
Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features

Bruce W. Lee, Yoo Sung Jang, Jason Hyung-Jong Lee

We report two essential improvements in readability assessment: 1. three novel features in advanced semantics and 2. the timely evidence that traditional ML models (e.g. Random Forest, using handcrafted features) can combine with transformers (e.g. RoBERTa) to augment model performance. First, we explore suitable transformers and traditional ML models. Then, we extract 255 handcrafted linguistic features using self-developed extraction software. Finally, we assemble those to create several hybrid models, achieving state-of-the-art (SOTA) accuracy on popular datasets in readability assessment. The use of handcrafted features help model performance on smaller datasets. Notably, our RoBERTA-RF-T1 hybrid achieves the near-perfect classification accuracy of 99%, a 20.3% increase from the previous SOTA.

CLOct 26, 2020
LXPER Index 2.0: Improving Text Readability Assessment Model for L2 English Students in Korea

Bruce W. Lee, Jason Lee

Developing a text readability assessment model specifically for texts in a foreign English Language Training (ELT) curriculum has never had much attention in the field of Natural Language Processing. Hence, most developed models show extremely low accuracy for L2 English texts, up to the point where not many even serve as a fair comparison. In this paper, we investigate a text readability assessment model for L2 English learners in Korea. In accordance, we improve and expand the Text Corpus of the Korean ELT curriculum (CoKEC-text). Each text is labeled with its target grade level. We train our model with CoKEC-text and significantly improve the accuracy of readability assessment for texts in the Korean ELT curriculum.

CLAug 1, 2020
LXPER Index: a curriculum-specific text readability assessment model for EFL students in Korea

Bruce W. Lee, Jason Hyung-Jong Lee

Automatic readability assessment is one of the most important applications of Natural Language Processing (NLP) in education. Since automatic readability assessment allows the fast selection of appropriate reading material for readers at all levels of proficiency, it can be particularly useful for the English education of English as Foreign Language (EFL) students around the world. Most readability assessment models are developed for the native readers of English and have low accuracy for texts in the non-native English Language Training (ELT) curriculum. We introduce LXPER Index, which is a readability assessment model for non-native EFL readers in the ELT curriculum of Korea. Our experiments show that our new model, trained with CoKEC-text (Text Corpus of the Korean ELT Curriculum), significantly improves the accuracy of automatic readability assessment for texts in the Korean ELT curriculum.