CLJul 27, 2022
RealTime QA: What's the Answer Right Now?Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi et al. · allen-ai, uw
We introduce REALTIME QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis (weekly in this version). REALTIME QA inquires about the current world, and QA systems need to answer questions about novel events or information. It therefore challenges static, conventional assumptions in open-domain QA datasets and pursues instantaneous applications. We build strong baseline models upon large pretrained language models, including GPT-3 and T5. Our benchmark is an ongoing effort, and this paper presents real-time evaluation results over the past year. Our experimental results show that GPT-3 can often properly update its generation results, based on newly-retrieved documents, highlighting the importance of up-to-date information retrieval. Nonetheless, we find that GPT-3 tends to return outdated answers when retrieved documents do not provide sufficient information to find an answer. This suggests an important avenue for future research: can an open-domain QA system identify such unanswerable cases and communicate with the user or even the retrieval module to modify the retrieval results? We hope that REALTIME QA will spur progress in instantaneous applications of question answering and beyond.
42.9CLJun 2
Language Models Compare Quantities Using Number-specific and Unit-specific HeuristicsMutsumi Sasaki, Go kamoda, Ryosuke Takahashi et al.
Quantities with measurement units, such as 110 cm and 1.2 m, require language models (LMs) to combine a numeral with a symbolic unit scale. Here, we study how LMs compare such quantities in controlled settings spanning several unit systems. We find that accuracy degrades near the comparison boundary, where small changes in value determine the correct answer. The resulting errors are systematic: linear surrogate models predict LM preferences from numerical-difference and unit-scale-difference cues, and causal interventions on subspaces aligned with these variables shift model's output. The results suggest that LMs compare quantities through a bag of heuristics over numerals and units, rather than first converting both expressions to an exact shared-scale representation.
CLMay 23, 2022
Context Limitations Make Neural Language Models More Human-LikeTatsuki Kuribayashi, Yohei Oseki, Ana Brassard et al.
Language models (LMs) have been used in cognitive modeling as well as engineering studies -- they compute information-theoretic complexity metrics that simulate humans' cognitive load during reading. This study highlights a limitation of modern neural LMs as the model of choice for this purpose: there is a discrepancy between their context access capacities and that of humans. Our results showed that constraining the LMs' context access improved their simulation of human reading behavior. We also showed that LM-human gaps in context access were associated with specific syntactic constructions; incorporating syntactic biases into LMs' context access might enhance their cognitive plausibility.
CLFeb 1, 2023
Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention MapsGoro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi et al.
Transformers are ubiquitous in wide tasks. Interpreting their internals is a pivotal goal. Nevertheless, their particular components, feed-forward (FF) blocks, have typically been less analyzed despite their substantial parameter amounts. We analyze the input contextualization effects of FF blocks by rendering them in the attention maps as a human-friendly visualization scheme. Our experiments with both masked- and causal-language models reveal that FF networks modify the input contextualization to emphasize specific types of linguistic compositions. In addition, FF and its surrounding components tend to cancel out each other's effects, suggesting potential redundancy in the processing of the Transformer layer.
CLMay 23, 2022
Towards Automated Document Revision: Grammatical Error Correction, Fluency Edits, and BeyondMasato Mita, Keisuke Sakaguchi, Masato Hagiwara et al.
Natural language processing technology has rapidly improved automated grammatical error correction tasks, and the community begins to explore document-level revision as one of the next challenges. To go beyond sentence-level automated grammatical error correction to NLP-based document-level revision assistant, there are two major obstacles: (1) there are few public corpora with document-level revisions being annotated by professional editors, and (2) it is not feasible to elicit all possible references and evaluate the quality of revision with such references because there are infinite possibilities of revision. This paper tackles these challenges. First, we introduce a new document-revision corpus, TETRA, where professional editors revised academic papers sampled from the ACL anthology which contain few trivial grammatical errors that enable us to focus more on document- and paragraph-level edits such as coherence and consistency. Second, we explore reference-less and interpretable methods for meta-evaluation that can detect quality improvements by document revision. We show the uniqueness of TETRA compared with existing document revision corpora and demonstrate that a fine-tuned pre-trained language model can discriminate the quality of documents after revision even when the difference is subtle. This promising result will encourage the community to further explore automated document revision models and metrics in future.
CLMar 25, 2023
Analyzing the Performance of GPT-3.5 and GPT-4 in Grammatical Error CorrectionSteven Coyne, Keisuke Sakaguchi, Diana Galvan-Sosa et al.
GPT-3 and GPT-4 models are powerful, achieving high performance on a variety of Natural Language Processing tasks. However, there is a relative lack of detailed published analysis of their performance on the task of grammatical error correction (GEC). To address this, we perform experiments testing the capabilities of a GPT-3.5 model (text-davinci-003) and a GPT-4 model (gpt-4-0314) on major GEC benchmarks. We compare the performance of different prompts in both zero-shot and few-shot settings, analyzing intriguing or problematic outputs encountered with different prompt formats. We report the performance of our best prompt on the BEA-2019 and JFLEG datasets, finding that the GPT models can perform well in a sentence-level revision setting, with GPT-4 achieving a new high score on the JFLEG benchmark. Through human evaluation experiments, we compare the GPT models' corrections to source, human reference, and baseline GEC system sentences and observe differences in editing strategies and how they are scored by human raters.
CLJun 16, 2022
Balancing Cost and Quality: An Exploration of Human-in-the-loop Frameworks for Automated Short Answer ScoringHiroaki Funayama, Tasuku Sato, Yuichiroh Matsubayashi et al.
Short answer scoring (SAS) is the task of grading short text written by a learner. In recent years, deep-learning-based approaches have substantially improved the performance of SAS models, but how to guarantee high-quality predictions still remains a critical issue when applying such models to the education field. Towards guaranteeing high-quality predictions, we present the first study of exploring the use of human-in-the-loop framework for minimizing the grading cost while guaranteeing the grading quality by allowing a SAS model to share the grading task with a human grader. Specifically, by introducing a confidence estimation method for indicating the reliability of the model predictions, one can guarantee the scoring quality by utilizing only predictions with high reliability for the scoring results and casting predictions with low reliability to human graders. In our experiments, we investigate the feasibility of the proposed framework using multiple confidence estimation methods and multiple SAS datasets. We find that our human-in-the-loop framework allows automatic scoring models and human graders to achieve the target scoring quality.
CLFeb 15, 2023
Do Deep Neural Networks Capture Compositionality in Arithmetic Reasoning?Keito Kudo, Yoichi Aoki, Tatsuki Kuribayashi et al.
Compositionality is a pivotal property of symbolic reasoning. However, how well recent neural models capture compositionality remains underexplored in the symbolic reasoning tasks. This study empirically addresses this question by systematically examining recently published pre-trained seq2seq models with a carefully controlled dataset of multi-hop arithmetic symbolic reasoning. We introduce a skill tree on compositionality in arithmetic symbolic reasoning that defines the hierarchical levels of complexity along with three compositionality dimensions: systematicity, productivity, and substitutivity. Our experiments revealed that among the three types of composition, the models struggled most with systematicity, performing poorly even with relatively simple compositions. That difficulty was not resolved even after training the models with intermediate reasoning steps.
CLJul 28, 2023
Teach Me How to Improve My Argumentation Skills: A Survey on Feedback in ArgumentationCamélia Guerraoui, Paul Reisert, Naoya Inoue et al.
The use of argumentation in education has been shown to improve critical thinking skills for end-users such as students, and computational models for argumentation have been developed to assist in this process. Although these models are useful for evaluating the quality of an argument, they oftentimes cannot explain why a particular argument is considered poor or not, which makes it difficult to provide constructive feedback to users to strengthen their critical thinking skills. In this survey, we aim to explore the different dimensions of feedback (Richness, Visualization, Interactivity, and Personalization) provided by the current computational models for argumentation, and the possibility of enhancing the power of explanations of such models, ultimately helping learners improve their critical thinking skills.
CLSep 20, 2022
Target-Guided Open-Domain Conversation PlanningYosuke Kishinami, Reina Akama, Shiki Sato et al.
Prior studies addressing target-oriented conversational tasks lack a crucial notion that has been intensively studied in the context of goal-oriented artificial intelligence agents, namely, planning. In this study, we propose the task of Target-Guided Open-Domain Conversation Planning (TGCP) task to evaluate whether neural conversational agents have goal-oriented conversation planning abilities. Using the TGCP task, we investigate the conversation planning abilities of existing retrieval models and recent strong generative models. The experimental results reveal the challenges facing current technology.
CLAug 26, 2024
Reducing the Cost: Cross-Prompt Pre-Finetuning for Short Answer ScoringHiroaki Funayama, Yuya Asazuma, Yuichiroh Matsubayashi et al.
Automated Short Answer Scoring (SAS) is the task of automatically scoring a given input to a prompt based on rubrics and reference answers. Although SAS is useful in real-world applications, both rubrics and reference answers differ between prompts, thus requiring a need to acquire new data and train a model for each new prompt. Such requirements are costly, especially for schools and online courses where resources are limited and only a few prompts are used. In this work, we attempt to reduce this cost through a two-phase approach: train a model on existing rubrics and answers with gold score signals and finetune it on a new prompt. Specifically, given that scoring rubrics and reference answers differ for each prompt, we utilize key phrases, or representative expressions that the answer should contain to increase scores, and train a SAS model to learn the relationship between key phrases and answers using already annotated prompts (i.e., cross-prompts). Our experimental results show that finetuning on existing cross-prompt data with key phrases significantly improves scoring accuracy, especially when the training data is limited. Finally, our extensive analysis shows that it is crucial to design the model so that it can learn the task's general property.
LGMay 24, 2022
Diverse Lottery Tickets Boost Ensemble from a Single Pretrained ModelSosuke Kobayashi, Shun Kiyono, Jun Suzuki et al.
Ensembling is a popular method used to improve performance as a last resort. However, ensembling multiple models finetuned from a single pretrained model has been not very effective; this could be due to the lack of diversity among ensemble members. This paper proposes Multi-Ticket Ensemble, which finetunes different subnetworks of a single pretrained model and ensembles them. We empirically demonstrated that winning-ticket subnetworks produced more diverse predictions than dense networks, and their ensemble outperformed the standard ensemble on some tasks.
CLNov 2, 2022
Cross-stitching Text and Knowledge Graph Encoders for Distantly Supervised Relation ExtractionQin Dai, Benjamin Heinzerling, Kentaro Inui
Bi-encoder architectures for distantly-supervised relation extraction are designed to make use of the complementary information found in text and knowledge graphs (KG). However, current architectures suffer from two drawbacks. They either do not allow any sharing between the text encoder and the KG encoder at all, or, in case of models with KG-to-text attention, only share information in one direction. Here, we introduce cross-stitch bi-encoders, which allow full interaction between the text encoder and the KG encoder via a cross-stitch mechanism. The cross-stitch mechanism allows sharing and updating representations between the two encoders at any layer, with the amount of sharing being dynamically controlled via cross-attention-based gates. Experimental results on two relation extraction benchmarks from two different domains show that enabling full interaction between the two encoders yields strong improvements.
CLAug 2, 2023
Chat Translation Error Detection for Assisting Cross-lingual CommunicationsYunmeng Li, Jun Suzuki, Makoto Morishita et al.
In this paper, we describe the development of a communication support system that detects erroneous translations to facilitate crosslingual communications due to the limitations of current machine chat translation methods. We trained an error detector as the baseline of the system and constructed a new Japanese-English bilingual chat corpus, BPersona-chat, which comprises multiturn colloquial chats augmented with crowdsourced quality ratings. The error detector can serve as an encouraging foundation for more advanced erroneous translation detection systems.
LGJan 17, 2023
Tracing and Manipulating Intermediate Values in Neural Math Problem SolversYuta Matsumoto, Benjamin Heinzerling, Masashi Yoshikawa et al.
How language models process complex input that requires multiple steps of inference is not well understood. Previous research has shown that information about intermediate values of these inputs can be extracted from the activations of the models, but it is unclear where that information is encoded and whether that information is indeed used during inference. We introduce a method for analyzing how a Transformer model processes these inputs by focusing on simple arithmetic problems and their intermediate values. To trace where information about intermediate values is encoded, we measure the correlation between intermediate values and the activations of the model using principal component analysis (PCA). Then, we perform a causal intervention by manipulating model weights. This intervention shows that the weights identified via tracing are not merely correlated with intermediate values, but causally related to model predictions. Our findings show that the model has a locality to certain intermediate values, and this is useful for enhancing the interpretability of the models.
CLApr 4, 2022
LPAttack: A Feasible Annotation Scheme for Capturing Logic Pattern of Attacks in ArgumentsFarjana Sultana Mim, Naoya Inoue, Shoichi Naito et al.
In argumentative discourse, persuasion is often achieved by refuting or attacking others arguments. Attacking is not always straightforward and often comprise complex rhetorical moves such that arguers might agree with a logic of an argument while attacking another logic. Moreover, arguer might neither deny nor agree with any logics of an argument, instead ignore them and attack the main stance of the argument by providing new logics and presupposing that the new logics have more value or importance than the logics present in the attacked argument. However, no existing studies in the computational argumentation capture such complex rhetorical moves in attacks or the presuppositions or value judgements in them. In order to address this gap, we introduce LPAttack, a novel annotation scheme that captures the common modes and complex rhetorical moves in attacks along with the implicit presuppositions and value judgements in them. Our annotation study shows moderate inter-annotator agreement, indicating that human annotation for the proposed scheme is feasible. We publicly release our annotated corpus and the annotation guidelines.
AIFeb 16, 2023
Empirical Investigation of Neural Symbolic Reasoning StrategiesYoichi Aoki, Keito Kudo, Tatsuki Kuribayashi et al.
Neural reasoning accuracy improves when generating intermediate reasoning steps. However, the source of this improvement is yet unclear. Here, we investigate and factorize the benefit of generating intermediate steps for symbolic reasoning. Specifically, we decompose the reasoning strategy w.r.t. step granularity and chaining strategy. With a purely symbolic numerical reasoning dataset (e.g., A=1, B=3, C=A+3, C?), we found that the choice of reasoning strategies significantly affects the performance, with the gap becoming even larger as the extrapolation length becomes longer. Surprisingly, we also found that certain configurations lead to nearly perfect performance, even in the case of length extrapolation. Our results indicate the importance of further exploring effective strategies for neural reasoning models.
CLAug 4, 2022
N-best Response-based Analysis of Contradiction-awareness in Neural Response Generation ModelsShiki Sato, Reina Akama, Hiroki Ouchi et al.
Avoiding the generation of responses that contradict the preceding context is a significant challenge in dialogue response generation. One feasible method is post-processing, such as filtering out contradicting responses from a resulting n-best response list. In this scenario, the quality of the n-best list considerably affects the occurrence of contradictions because the final response is chosen from this n-best list. This study quantitatively analyzes the contextual contradiction-awareness of neural response generation models using the consistency of the n-best lists. Particularly, we used polar questions as stimulus inputs for concise and quantitative analyses. Our tests illustrate the contradiction-awareness of recent neural response generation models and methodologies, followed by a discussion of their properties and limitations.
CLSep 9, 2024
Representational Analysis of Binding in Language ModelsQin Dai, Benjamin Heinzerling, Kentaro Inui
Entity tracking is essential for complex reasoning. To perform in-context entity tracking, language models (LMs) must bind an entity to its attribute (e.g., bind a container to its content) to recall attribute for a given entity. For example, given a context mentioning ``The coffee is in Box Z, the stone is in Box M, the map is in Box H'', to infer ``Box Z contains the coffee'' later, LMs must bind ``Box Z'' to ``coffee''. To explain the binding behaviour of LMs, existing research introduces a Binding ID mechanism and states that LMs use a abstract concept called Binding ID (BI) to internally mark entity-attribute pairs. However, they have not captured the Ordering ID (OI) from entity activations that directly determines the binding behaviour. In this work, we provide a novel view of the BI mechanism by localizing OI and proving the causality between OI and binding behaviour. Specifically, by leveraging dimension reduction methods (e.g., PCA), we discover that there exists a low-rank subspace in the activations of LMs, that primarily encodes the order (i.e., OI) of entity and attribute. Moreover, we also discover the causal effect of OI on binding that when editing representations along the OI encoding direction, LMs tend to bind a given entity to other attributes accordingly. For example, by patching activations along the OI encoding direction we can make the LM to infer ``Box Z contains the stone'' and ``Box Z contains the map''.
CLAug 29, 2024
MQM-Chat: Multidimensional Quality Metrics for Chat TranslationYunmeng Li, Jun Suzuki, Makoto Morishita et al.
The complexities of chats pose significant challenges for machine translation models. Recognizing the need for a precise evaluation metric to address the issues of chat translation, this study introduces Multidimensional Quality Metrics for Chat Translation (MQM-Chat). Through the experiments of five models using MQM-Chat, we observed that all models generated certain fundamental errors, while each of them has different shortcomings, such as omission, overly correcting ambiguous source content, and buzzword issues, resulting in the loss of stylized information. Our findings underscore the effectiveness of MQM-Chat in evaluating chat translation, emphasizing the importance of stylized content and dialogue consistency for future studies.
85.4CLApr 13
Hidden Failures in Robustness: Why Supervised Uncertainty Quantification Needs Better EvaluationJoe Stacey, Hadas Orgad, Kentaro Inui et al.
Recent work has shown that the hidden states of large language models contain signals useful for uncertainty estimation and hallucination detection, motivating a growing interest in efficient probe-based approaches. Yet it remains unclear how robust existing methods are, and which probe designs provide uncertainty estimates that are reliable under distribution shift. We present a systematic study of supervised uncertainty probes across models, tasks, and OOD settings, training over 2,000 probes while varying the representation layer, feature type, and token aggregation strategy. Our evaluation highlights poor robustness in current methods, particularly in the case of long-form generations. We also find that probe robustness is driven less by architecture and more by the probe inputs. Middle-layer representations generalise more reliably than final-layer hidden states, and aggregating across response tokens is consistently more robust than relying on single-token features. These differences are often largely invisible in-distribution but become more important under distribution shift. Informed by our evaluation, we explore a simple hybrid back-off strategy for improving robustness, arguing that better evaluation is a prerequisite for building more robust probes.
CLOct 24, 2023
Contrastive Learning-based Sentence Encoders Implicitly Weight Informative WordsHiroto Kurita, Goro Kobayashi, Sho Yokoi et al.
The performance of sentence encoders can be significantly improved through the simple practice of fine-tuning using contrastive loss. A natural question arises: what characteristics do models acquire during contrastive learning? This paper theoretically and experimentally shows that contrastive-based sentence encoders implicitly weight words based on information-theoretic quantities; that is, more informative words receive greater weight, while others receive less. The theory states that, in the lower bound of the optimal value of the contrastive learning objective, the norm of word embedding reflects the information gain associated with the distribution of surrounding words. We also conduct comprehensive experiments using various models, multiple datasets, two methods to measure the implicit weighting of models (Integrated Gradients and SHAP), and two information-theoretic quantities (information gain and self-information). The results provide empirical evidence that contrastive fine-tuning emphasizes informative words.
CLOct 26, 2023
Test-time Augmentation for Factual ProbingGo Kamoda, Benjamin Heinzerling, Keisuke Sakaguchi et al.
Factual probing is a method that uses prompts to test if a language model "knows" certain world knowledge facts. A problem in factual probing is that small changes to the prompt can lead to large changes in model output. Previous work aimed to alleviate this problem by optimizing prompts via text mining or fine-tuning. However, such approaches are relation-specific and do not generalize to unseen relation types. Here, we propose to use test-time augmentation (TTA) as a relation-agnostic method for reducing sensitivity to prompt variations by automatically augmenting and ensembling prompts at test time. Experiments show improved model calibration, i.e., with TTA, model confidence better reflects prediction accuracy. Improvements in prediction accuracy are observed for some models, but for other models, TTA leads to degradation. Error analysis identifies the difficulty of producing high-quality prompt variations as the main challenge for TTA.
CLAug 28, 2024
An Investigation of Warning Erroneous Chat Translations in Cross-lingual CommunicationYunmeng Li, Jun Suzuki, Makoto Morishita et al.
Machine translation models are still inappropriate for translating chats, despite the popularity of translation software and plug-in applications. The complexity of dialogues poses significant challenges and can hinder crosslingual communication. Instead of pursuing a flawless translation system, a more practical approach would be to issue warning messages about potential mistranslations to reduce confusion. However, it is still unclear how individuals perceive these warning messages and whether they benefit the crowd. This paper tackles to investigate this question and demonstrates the warning messages' contribution to making chat translation systems effective.
CLMar 15, 2024Code
Monotonic Representation of Numeric Properties in Language ModelsBenjamin Heinzerling, Kentaro Inui
Language models (LMs) can express factual knowledge involving numeric properties such as Karl Popper was born in 1902. However, how this information is encoded in the model's internal representations is not understood well. Here, we introduce a simple method for finding and editing representations of numeric properties such as an entity's birth year. Empirically, we find low-dimensional subspaces that encode numeric properties monotonically, in an interpretable and editable fashion. When editing representations along directions in these subspaces, LM output changes accordingly. For example, by patching activations along a "birthyear" direction we can make the LM express an increasingly late birthyear: Karl Popper was born in 1929, Karl Popper was born in 1957, Karl Popper was born in 1968. Property-encoding directions exist across several numeric properties in all models under consideration, suggesting the possibility that monotonic representation of numeric properties consistently emerges during LM pretraining. Code: https://github.com/bheinzerling/numeric-property-repr
26.2CLApr 21
Cell-Based Representation of Relational Binding in Language ModelsQin Dai, Benjamin Heinzerling, Kentaro Inui
Understanding a discourse requires tracking entities and the relations that hold between them. While Large Language Models (LLMs) perform well on relational reasoning, the mechanism by which they bind entities, relations, and attributes remains unclear. We study discourse-level relational binding and show that LLMs encode it via a Cell-based Binding Representation (CBR): a low-dimensional linear subspace in which each ``cell'' corresponds to an entity--relation index pair, and bound attributes are retrieved from the corresponding cell during inference. Using controlled multi-sentence data annotated with entity and relation indices, we identify the CBR subspace by decoding these indices from attribute-token activations with Partial Least Squares regression. Across domains and two model families, the indices are linearly decodable and form a grid-like geometry in the projected space. We further find that context-specific CBR representations are related by translation vectors in activation space, enabling cross-context transfer. Finally, activation patching shows that manipulating this subspace systematically changes relational predictions and that perturbing it disrupts performance, providing causal evidence that LLMs rely on CBR for relational binding.
CLMay 8, 2024Code
ACORN: Aspect-wise Commonsense Reasoning Explanation EvaluationAna Brassard, Benjamin Heinzerling, Keito Kudo et al.
Evaluating the quality of free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to evaluate how LLMs rate explanations. We observed that larger models outputted labels that maintained or increased the inter-annotator agreement, suggesting that they are within the expected variance between human raters. However, their correlation with majority-voted human ratings varied across different quality aspects, indicating that they are not a complete replacement. In turn, using LLMs as a supplement to a smaller group of human raters in some cases improved the correlation with the original majority labels. However, the effect was limited to cases where human raters were scarce, and an additional human rater had a more pronounced effect in all cases. Overall, we recommend against using LLMs as a complete replacement for human raters but encourage using them in configurations that end with targeted human involvement. Data available here: https://github.com/a-brassard/ACORN
LGFeb 25
WaveSSM: Multiscale State-Space Models for Non-stationary Signal AttentionRuben Solozabal, Velibor Bojkovic, Hilal Alquabeh et al.
State-space models (SSMs) have emerged as a powerful foundation for long-range sequence modeling, with the HiPPO framework showing that continuous-time projection operators can be used to derive stable, memory-efficient dynamical systems that encode the past history of the input signal. However, existing projection-based SSMs often rely on polynomial bases with global temporal support, whose inductive biases are poorly matched to signals exhibiting localized or transient structure. In this work, we introduce \emph{WaveSSM}, a collection of SSMs constructed over wavelet frames. Our key observation is that wavelet frames yield a localized support on the temporal dimension, useful for tasks requiring precise localization. Empirically, we show that on equal conditions, \textit{WaveSSM} outperforms orthogonal counterparts as S4 on real-world datasets with transient dynamics, including physiological signals on the PTB-XL dataset and raw audio on Speech Commands.
CLMar 3
Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language ModelsHaruto Yoshida, Keito Kudo, Yoichi Aoki et al.
Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle with understanding relationships between elements, particularly those represented by nodes and directed edges (e.g., arrows and lines). To investigate the underlying causes of this limitation, we probe the internal representation of LVLMs using a carefully constructed synthetic diagram dataset based on directed graphs. Our probing experiments reveal that edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens in the language model. In contrast, node information and global structural features are already linearly encoded in individual hidden states of the vision encoder. These findings suggest that the stage at which linearly separable representations are formed varies depending on the type of visual information. In particular, the delayed emergence of edge representations may help explain why LVLMs struggle with relational understanding, such as interpreting edge directions, which require more abstract, compositionally integrated processes.
ROMay 1, 2025Code
SmallPlan: Leverage Small Language Models for Sequential Path Planning with Simulation-Powered, LLM-Guided DistillationQuang P. M. Pham, Khoi T. N. Nguyen, Nhi H. Doan et al.
Efficient path planning in robotics, particularly within large-scale, complex environments, remains a significant hurdle. While Large Language Models (LLMs) offer strong reasoning capabilities, their high computational cost and limited adaptability hinder real-time deployment on edge devices. We present SmallPlan - a novel framework leveraging LLMs as teacher models to train lightweight Small Language Models (SLMs) for high-level path planning tasks. In SmallPlan, the SLMs provide optimal action sequences to navigate across scene graphs that compactly represent full-scaled 3D scenes. The SLMs are trained in a simulation-powered, interleaved manner with LLM-guided supervised fine-tuning (SFT) and reinforcement learning (RL). This strategy not only enables SLMs to successfully complete navigation tasks but also makes them aware of important factors like distance travel, providing more efficient path planning. Through experiments, we demonstrate that the fine-tuned SLMs perform competitively with larger models like GPT-4o on sequential path planning, without suffering from hallucination and overfitting. SmallPlan is resource-efficient, making it well-suited for edge-device deployment and advancing practical autonomous robotics. Our source code is available here: https://github.com/quangpham2006/SmallPlan
CLJan 23, 2025Code
RECALL: Library-Like Behavior In Language Models is Enhanced by Self-Referencing Causal CyclesMunachiso Nwadike, Zangir Iklassov, Toluwani Aremu et al.
We introduce the concept of the self-referencing causal cycle (abbreviated RECALL) - a mechanism that enables large language models (LLMs) to bypass the limitations of unidirectional causality, which underlies a phenomenon known as the reversal curse. When an LLM is prompted with sequential data, it often fails to recall preceding context. For example, when we ask an LLM to recall the line preceding "O say does that star-spangled banner yet wave" in the U.S. National Anthem, it often fails to correctly return "Gave proof through the night that our flag was still there" - this is due to the reversal curse. It occurs because language models such as ChatGPT and Llama generate text based on preceding tokens, requiring facts to be learned and reproduced in a consistent token order. While the reversal curse is often viewed as a limitation, we offer evidence of an alternative view: it is not always an obstacle in practice. We find that RECALL is driven by what we designate as cycle tokens - sequences that connect different parts of the training data, enabling recall of preceding tokens from succeeding ones. Through rigorous probabilistic formalization and controlled experiments, we demonstrate how the cycles they induce influence a model's ability to reproduce information. To facilitate reproducibility, we provide our code and experimental details at https://anonymous.4open.science/r/remember-B0B8/.
CLJan 18, 2022Code
COPA-SSE: Semi-structured Explanations for Commonsense ReasoningAna Brassard, Benjamin Heinzerling, Pride Kavumba et al.
We present Semi-Structured Explanations for COPA (COPA-SSE), a new crowdsourced dataset of 9,747 semi-structured, English common sense explanations for Choice of Plausible Alternatives (COPA) questions. The explanations are formatted as a set of triple-like common sense statements with ConceptNet relations but freely written concepts. This semi-structured format strikes a balance between the high quality but low coverage of structured data and the lower quality but high coverage of free-form crowdsourcing. Each explanation also includes a set of human-given quality ratings. With their familiar format, the explanations are geared towards commonsense reasoners operating on knowledge graphs and serve as a starting point for ongoing work on improving such systems. The dataset is available at https://github.com/a-brassard/copa-sse.
CLJun 2, 2021Code
SyGNS: A Systematic Generalization Testbed Based on Natural Language SemanticsHitomi Yanaka, Koji Mineshima, Kentaro Inui
Recently, deep neural networks (DNNs) have achieved great success in semantically challenging NLP tasks, yet it remains unclear whether DNN models can capture compositional meanings, those aspects of meaning that have been long studied in formal semantics. To investigate this issue, we propose a Systematic Generalization testbed based on Natural language Semantics (SyGNS), whose challenge is to map natural language sentences to multiple forms of scoped meaning representations, designed to account for various semantic phenomena. Using SyGNS, we test whether neural networks can systematically parse sentences involving novel combinations of logical expressions such as quantifiers and negation. Experiments show that Transformer and GRU models can generalize to unseen combinations of quantifiers, negations, and modifiers that are similar to given training instances in form, but not to the others. We also find that the generalization performance to unseen combinations is better when the form of meaning representations is simpler. The data and code for SyGNS are publicly available at https://github.com/verypluming/SyGNS.
CLFeb 12, 2021Code
Two Training Strategies for Improving Relation Extraction over Universal GraphQin Dai, Naoya Inoue, Ryo Takahashi et al.
This paper explores how the Distantly Supervised Relation Extraction (DS-RE) can benefit from the use of a Universal Graph (UG), the combination of a Knowledge Graph (KG) and a large-scale text collection. A straightforward extension of a current state-of-the-art neural model for DS-RE with a UG may lead to degradation in performance. We first report that this degradation is associated with the difficulty in learning a UG and then propose two training strategies: (1) Path Type Adaptive Pretraining, which sequentially trains the model with different types of UG paths so as to prevent the reliance on a single type of UG path; and (2) Complexity Ranking Guided Attention mechanism, which restricts the attention span according to the complexity of a UG path so as to force the model to extract features not only from simple UG paths but also from complex ones. Experimental results on both biomedical and NYT10 datasets prove the robustness of our methods and achieve a new state-of-the-art result on the NYT10 dataset. The code and datasets used in this paper are available at https://github.com/baodaiqin/UGDSRE.
CLJan 26, 2021Code
Exploring Transitivity in Neural NLI Models through VeridicalityHitomi Yanaka, Koji Mineshima, Kentaro Inui
Despite the recent success of deep neural networks in natural language processing, the extent to which they can demonstrate human-like generalization capacities for natural language understanding remains unclear. We explore this issue in the domain of natural language inference (NLI), focusing on the transitivity of inference relations, a fundamental property for systematically drawing inferences. A model capturing transitivity can compose basic inference patterns and draw new inferences. We introduce an analysis method using synthetic and naturalistic NLI datasets involving clause-embedding verbs to evaluate whether models can perform transitivity inferences composed of veridical inferences and arbitrary inference types. We find that current NLI models do not perform consistently well on transitivity inference tasks, suggesting that they lack the generalization capacity for drawing composite inferences from provided training examples. The data and code for our analysis are publicly available at https://github.com/verypluming/transitivity.
CLMay 3, 2020Code
Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error CorrectionMasahiro Kaneko, Masato Mita, Shun Kiyono et al.
This paper investigates how to effectively incorporate a pre-trained masked language model (MLM), such as BERT, into an encoder-decoder (EncDec) model for grammatical error correction (GEC). The answer to this question is not as straightforward as one might expect because the previous common methods for incorporating a MLM into an EncDec model have potential drawbacks when applied to GEC. For example, the distribution of the inputs to a GEC model can be considerably different (erroneous, clumsy, etc.) from that of the corpora used for pre-training MLMs; however, this issue is not addressed in the previous methods. Our experiments show that our proposed method, where we first fine-tune a MLM with a given GEC corpus and then use the output of the fine-tuned MLM as additional features in the GEC model, maximizes the benefit of the MLM. The best-performing model achieves state-of-the-art performances on the BEA-2019 and CoNLL-2014 benchmarks. Our code is publicly available at: https://github.com/kanekomasahiro/bert-gec.
CLApr 30, 2020Code
Word Rotator's DistanceSho Yokoi, Ryo Takahashi, Reina Akama et al.
A key principle in assessing textual similarity is measuring the degree of semantic overlap between two texts by considering the word alignment. Such alignment-based approaches are intuitive and interpretable; however, they are empirically inferior to the simple cosine similarity between general-purpose sentence vectors. To address this issue, we focus on and demonstrate the fact that the norm of word vectors is a good proxy for word importance, and their angle is a good proxy for word similarity. Alignment-based approaches do not distinguish them, whereas sentence-vector approaches automatically use the norm as the word importance. Accordingly, we propose a method that first decouples word vectors into their norm and direction, and then computes alignment-based similarity using earth mover's distance (i.e., optimal transport cost), which we refer to as word rotator's distance. Besides, we find how to grow the norm and direction of word vectors (vector converter), which is a new systematic approach derived from sentence-vector estimation methods. On several textual similarity datasets, the combination of these simple proposed methods outperformed not only alignment-based approaches but also strong baselines. The source code is available at https://github.com/eumesy/wrd
CLSep 5, 2019Code
TEASPN: Framework and Protocol for Integrated Writing Assistance EnvironmentsMasato Hagiwara, Takumi Ito, Tatsuki Kuribayashi et al.
Language technologies play a key role in assisting people with their writing. Although there has been steady progress in e.g., grammatical error correction (GEC), human writers are yet to benefit from this progress due to the high development cost of integrating with writing software. We propose TEASPN, a protocol and an open-source framework for achieving integrated writing assistance environments. The protocol standardizes the way writing software communicates with servers that implement such technologies, allowing developers and researchers to integrate the latest developments in natural language processing (NLP) with low cost. As a result, users can enjoy the integrated experience in their favorite writing software. The results from experiments with human participants show that users use a wide range of technologies and rate their writing experience favorably, allowing them to write more fluent text.
LGMay 24, 2018Code
Interpretable and Compositional Relation Learning by Joint Training with an AutoencoderRyo Takahashi, Ran Tian, Kentaro Inui
Embedding models for entities and relations are extremely useful for recovering missing facts in a knowledge base. Intuitively, a relation can be modeled by a matrix mapping entity vectors. However, relations reside on low dimension sub-manifolds in the parameter space of arbitrary matrices---for one reason, composition of two relations $\boldsymbol{M}_1,\boldsymbol{M}_2$ may match a third $\boldsymbol{M}_3$ (e.g. composition of relations currency_of_country and country_of_film usually matches currency_of_film_budget), which imposes compositional constraints to be satisfied by the parameters (i.e. $\boldsymbol{M}_1\cdot \boldsymbol{M}_2\approx \boldsymbol{M}_3$). In this paper we investigate a dimension reduction technique by training relations jointly with an autoencoder, which is expected to better capture compositional constraints. We achieve state-of-the-art on Knowledge Base Completion tasks with strongly improved Mean Rank, and show that joint training with an autoencoder leads to interpretable sparse codings of relations, helps discovering compositional constraints and benefits from compositional training. Our source code is released at github.com/tianran/glimvec.
CLJan 23
Sycophancy Hides Linearly in the Attention HeadsRifo Genadi, Munachiso Nwadike, Nurdaulet Mukhituly et al.
We find that correct-to-incorrect sycophancy signals are most linearly separable within multi-head attention activations. Motivated by the linear representation hypothesis, we train linear probes across the residual stream, multilayer perceptron (MLP), and attention layers to analyze where these signals emerge. Although separability appears in the residual stream and MLPs, steering using these probes is most effective in a sparse subset of middle-layer attention heads. Using TruthfulQA as the base dataset, we find that probes trained on it transfer effectively to other factual QA benchmarks. Furthermore, comparing our discovered direction to previously identified "truthful" directions reveals limited overlap, suggesting that factual accuracy, and deference resistance, arise from related but distinct mechanisms. Attention-pattern analysis further indicates that the influential heads attend disproportionately to expressions of user doubt, contributing to sycophantic shifts. Overall, these findings suggest that sycophancy can be mitigated through simple, targeted linear interventions that exploit the internal geometry of attention activations.
70.8AIMay 4
Measuring AI Reasoning: A Guide for ResearchersMunachiso Samuel Nwadike, Zangir Iklassov, Kareem Ali et al.
In this paper, we offer a guide for researchers on evaluating reasoning in language models, building the case that reasoning should be assessed through evidence of adaptive, multi-step search rather than final-answer accuracy alone. Under an evaluation-oriented definition, reasoning requires selecting intermediate steps and halting according to input-dependent conditions, which we formalize as a search-like procedure. We show that single forward passes in scalable architectures are structurally limited in their ability to realize such variable-depth computation, motivating intermediate decoding and externalized reasoning traces as appropriate evaluation interfaces. Central to our argument is that final-answer accuracy alone is an insufficient measure of reasoning, because it provides little ability to diagnose or debug the underlying processes that produce individual solutions in frontier models. We therefore argue for a shift toward process-based evaluation, in which reasoning is assessed through the faithfulness and validity of intermediate reasoning traces as first-class evaluation targets.
29.7CLApr 30
Why Mean Pooling Works: Quantifying Second-Order Collapse in Text EmbeddingsTomomasa Hara, Hiroto Kurita, Masaaki Imaizumi et al.
For constructing text embeddings, mean pooling, which averages token embeddings, is the standard approach. This paper examines whether mean pooling actually works well in real models. First, we note that mean pooling can collapse information beyond the first-order statistics of the token embeddings, such as second-order statistics that capture their spatial structure, potentially mapping distinct token embedding distributions to similar text embeddings. Motivated by this concern, we propose a simple metric to quantify such a collapse induced by mean pooling. Then, using this metric, we empirically measure how often this collapse occurs in actual models and texts, and find that modern text encoders are robust to this collapse. In particular, contrastive fine-tuned text encoders tend to be less prone to the collapse than their pretrained backbone models. We also find that the robustness of these text encoders lies in the concentration of token embeddings within each text. In addition, we find that robustness to the collapse, as quantified by our proposed metric, correlates with downstream task performance. Overall, our findings offer a new perspective on why modern text encoders remain effective despite relying on seemingly coarse mean pooling.
CLOct 17, 2024
Repetition Neurons: How Do Language Models Produce Repetitions?Tatsuya Hiraoka, Kentaro Inui
This paper introduces repetition neurons, regarded as skill neurons responsible for the repetition problem in text generation tasks. These neurons are progressively activated more strongly as repetition continues, indicating that they perceive repetition as a task to copy the previous context repeatedly, similar to in-context learning. We identify these repetition neurons by comparing activation values before and after the onset of repetition in texts generated by recent pre-trained language models. We analyze the repetition neurons in three English and one Japanese pre-trained language models and observe similar patterns across them.
CLFeb 3, 2025
Large Language Models Are Human-Like InternallyTatsuki Kuribayashi, Yohei Oseki, Souhaib Ben Taieb et al.
Recent cognitive modeling studies have reported that larger language models (LMs) exhibit a poorer fit to human reading behavior (Oh and Schuler, 2023b; Shain et al., 2024; Kuribayashi et al., 2024), leading to claims of their cognitive implausibility. In this paper, we revisit this argument through the lens of mechanistic interpretability and argue that prior conclusions were skewed by an exclusive focus on the final layers of LMs. Our analysis reveals that next-word probabilities derived from internal layers of larger LMs align with human sentence processing data as well as, or better than, those from smaller LMs. This alignment holds consistently across behavioral (self-paced reading times, gaze durations, MAZE task processing times) and neurophysiological (N400 brain potentials) measures, challenging earlier mixed results and suggesting that the cognitive plausibility of larger LMs has been underestimated. Furthermore, we first identify an intriguing relationship between LM layers and human measures: earlier layers correspond more closely with fast gaze durations, while later layers better align with relatively slower signals such as N400 potentials and MAZE processing times. Our work opens new avenues for interdisciplinary research at the intersection of mechanistic interpretability and cognitive modeling.
28.5CLApr 9
Linear Representations of Hierarchical Concepts in Language ModelsMasaki Sakata, Benjamin Heinzerling, Takumi Ito et al.
We investigate how and to what extent hierarchical relations (e.g., Japan $\subset$ Eastern Asia $\subset$ Asia) are encoded in the internal representations of language models. Building on Linear Relational Concepts, we train linear transformations specific to each hierarchical depth and semantic domain, and characterize representational differences associated with hierarchical relations by comparing these transformations. Going beyond prior work on the representational geometry of hierarchies in LMs, our analysis covers multi-token entities and cross-layer representations. Across multiple domains we learn such transformations and evaluate in-domain generalization to unseen data and cross-domain transfer. Experiments show that, within a domain, hierarchical relations can be linearly recovered from model representations. We then analyze how hierarchical information is encoded in representation space. We find that it is encoded in a relatively low-dimensional subspace and that this subspace tends to be domain-specific. Our main result is that hierarchy representation is highly similar across these domain-specific subspaces. Overall, we find that all models considered in our experiments encode concept hierarchies in the form of highly interpretable linear representations.
CLOct 17, 2024
The Geometry of Numerical Reasoning: Language Models Compare Numeric Properties in Linear SubspacesAhmed Oumar El-Shangiti, Tatsuya Hiraoka, Hilal AlQuabeh et al.
This paper investigates whether large language models (LLMs) utilize numerical attributes encoded in a low-dimensional subspace of the embedding space when answering questions involving numeric comparisons, e.g., Was Cristiano born before Messi? We first identified, using partial least squares regression, these subspaces, which effectively encode the numerical attributes associated with the entities in comparison prompts. Further, we demonstrate causality, by intervening in these subspaces to manipulate hidden states, thereby altering the LLM's comparison outcomes. Experiments conducted on three different LLMs showed that our results hold across different numerical attributes, indicating that LLMs utilize the linearly encoded information for numerical reasoning.
CLApr 23, 2025
How Individual Traits and Language Styles Shape Preferences In Open-ended User-LLM Interaction: A Preliminary StudyRendi Chevi, Kentaro Inui, Thamar Solorio et al.
What makes an interaction with the LLM more preferable for the user? While it is intuitive to assume that information accuracy in the LLM's responses would be one of the influential variables, recent studies have found that inaccurate LLM's responses could still be preferable when they are perceived to be more authoritative, certain, well-articulated, or simply verbose. These variables interestingly fall under the broader category of language style, implying that the style in the LLM's responses might meaningfully influence users' preferences. This hypothesized dynamic could have double-edged consequences: enhancing the overall user experience while simultaneously increasing their susceptibility to risks such as LLM's misinformation or hallucinations. In this short paper, we present our preliminary studies in exploring this subject. Through a series of exploratory and experimental user studies, we found that LLM's language style does indeed influence user's preferences, but how and which language styles influence the preference varied across different user populations, and more interestingly, moderated by the user's very own individual traits. As a preliminary work, the findings in our studies should be interpreted with caution, particularly given the limitations in our samples, which still need wider demographic diversity and larger sample sizes. Our future directions will first aim to address these limitations, which would enable a more comprehensive joint effect analysis between the language style, individual traits, and preferences, and further investigate the potential causal relationship between and beyond these variables.
CLJan 27, 2025
Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without InferenceGo Kamoda, Benjamin Heinzerling, Tatsuro Inaba et al.
According to the stages-of-inference hypothesis, early layers of language models map their subword-tokenized input, which does not necessarily correspond to a linguistically meaningful segmentation, to more meaningful representations that form the model's "inner vocabulary". Prior analysis of this detokenization stage has predominantly relied on probing and interventions such as path patching, which involve selecting particular inputs, choosing a subset of components that will be patched, and then observing changes in model behavior. Here, we show that several important aspects of the detokenization stage can be understood purely by analyzing model weights, without performing any model inference steps. Specifically, we introduce an analytical decomposition of first-layer attention in GPT-2. Our decomposition yields interpretable terms that quantify the relative contributions of position-related, token-related, and mixed effects. By focusing on terms in this decomposition, we discover weight-based explanations of attention bias toward close tokens and attention for detokenization.
CLMar 9, 2025
How a Bilingual LM Becomes Bilingual: Tracing Internal Representations with Sparse AutoencodersTatsuro Inaba, Go Kamoda, Kentaro Inui et al.
This study explores how bilingual language models develop complex internal representations. We employ sparse autoencoders to analyze internal representations of bilingual language models with a focus on the effects of training steps, layers, and model sizes. Our analysis shows that language models first learn languages separately, and then gradually form bilingual alignments, particularly in the mid layers. We also found that this bilingual tendency is stronger in larger models. Building on these findings, we demonstrate the critical role of bilingual representations in model performance by employing a novel method that integrates decomposed representations from a fully trained model into a mid-training model. Our results provide insights into how language models acquire bilingual capabilities.
CLApr 17, 2024
To Drop or Not to Drop? Predicting Argument Ellipsis Judgments: A Case Study in JapaneseYukiko Ishizuki, Tatsuki Kuribayashi, Yuichiroh Matsubayashi et al.
Speakers sometimes omit certain arguments of a predicate in a sentence; such omission is especially frequent in pro-drop languages. This study addresses a question about ellipsis -- what can explain the native speakers' ellipsis decisions? -- motivated by the interest in human discourse processing and writing assistance for this choice. To this end, we first collect large-scale human annotations of whether and why a particular argument should be omitted across over 2,000 data points in the balanced corpus of Japanese, a prototypical pro-drop language. The data indicate that native speakers overall share common criteria for such judgments and further clarify their quantitative characteristics, e.g., the distribution of related linguistic factors in the balanced corpus. Furthermore, the performance of the language model-based argument ellipsis judgment model is examined, and the gap between the systems' prediction and human judgments in specific linguistic aspects is revealed. We hope our fundamental resource encourages further studies on natural human ellipsis judgment.