CLOct 13, 2022
Mass-Editing Memory in a TransformerKevin Meng, Arnab Sen Sharma, Alex Andonian et al. · mit
Recent work has shown exciting promise in updating large language models with new memories, so as to replace obsolete information or add specialized knowledge. However, this line of work is predominantly limited to updating single associations. We develop MEMIT, a method for directly updating a language model with many memories, demonstrating experimentally that it can scale up to thousands of associations for GPT-J (6B) and GPT-NeoX (20B), exceeding prior work by orders of magnitude. Our code and data are at https://memit.baulab.info.
CLAug 17, 2023
Linearity of Relation Decoding in Transformer Language ModelsEvan Hernandez, Arnab Sen Sharma, Tal Haklay et al. · microsoft-research, mit
Much of the knowledge encoded in transformer language models (LMs) may be expressed in terms of relations: relations between words and their synonyms, entities and their attributes, etc. We show that, for a subset of relations, this computation is well-approximated by a single linear transformation on the subject representation. Linear relation representations may be obtained by constructing a first-order approximation to the LM from a single prompt, and they exist for a variety of factual, commonsense, and linguistic relations. However, we also identify many cases in which LM predictions capture relational knowledge accurately, but this knowledge is not linearly encoded in their representations. Our results thus reveal a simple, interpretable, but heterogeneously deployed knowledge representation strategy in transformer LMs.
CLJun 1, 2022Code
BD-SHS: A Benchmark Dataset for Learning to Detect Online Bangla Hate Speech in Different Social ContextsNauros Romim, Mosahed Ahmed, Md. Saiful Islam et al.
Social media platforms and online streaming services have spawned a new breed of Hate Speech (HS). Due to the massive amount of user-generated content on these sites, modern machine learning techniques are found to be feasible and cost-effective to tackle this problem. However, linguistically diverse datasets covering different social contexts in which offensive language is typically used are required to train generalizable models. In this paper, we identify the shortcomings of existing Bangla HS datasets and introduce a large manually labeled dataset BD-SHS that includes HS in different social contexts. The labeling criteria were prepared following a hierarchical annotation process, which is the first of its kind in Bangla HS to the best of our knowledge. The dataset includes more than 50,200 offensive comments crawled from online social networking sites and is at least 60% larger than any existing Bangla HS datasets. We present the benchmark result of our dataset by training different NLP models resulting in the best one achieving an F1-score of 91.0%. In our experiments, we found that a word embedding trained exclusively using 1.47 million comments from social media and streaming sites consistently resulted in better modeling of HS detection in comparison to other pre-trained embeddings. Our dataset and all accompanying codes is publicly available at github.com/naurosromim/hate-speech-dataset-for-Bengali-social-media
LGJul 18, 2024Code
NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model InternalsJaden Fiotto-Kaufman, Alexander R. Loftus, Eric Todd et al.
We introduce NNsight and NDIF, technologies that work in tandem to enable scientific study of the representations and computations learned by very large neural networks. NNsight is an open-source system that extends PyTorch to introduce deferred remote execution. The National Deep Inference Fabric (NDIF) is a scalable inference service that executes NNsight requests, allowing users to share GPU resources and pretrained models. These technologies are enabled by the Intervention Graph, an architecture developed to decouple experimental design from model runtime. Together, this framework provides transparent and efficient access to the internals of deep neural networks such as very large language models (LLMs) without imposing the cost or complexity of hosting customized models individually. We conduct a quantitative survey of the machine learning literature that reveals a growing gap in the study of the internals of large-scale AI. We demonstrate the design and use of our framework to address this gap by enabling a range of research methods on huge models. Finally, we conduct benchmarks to compare performance with previous approaches. Code, documentation, and tutorials are available at https://nnsight.net/.
CLOct 23, 2023
Function Vectors in Large Language ModelsEric Todd, Millicent L. Li, Arnab Sen Sharma et al.
We report the presence of a simple neural mechanism that represents an input-output function as a vector within autoregressive transformer language models (LMs). Using causal mediation analysis on a diverse range of in-context-learning (ICL) tasks, we find that a small number attention heads transport a compact representation of the demonstrated task, which we call a function vector (FV). FVs are robust to changes in context, i.e., they trigger execution of the task on inputs such as zero-shot and natural text settings that do not resemble the ICL contexts from which they are collected. We test FVs across a range of tasks, models, and layers and find strong causal effects across settings in middle layers. We investigate the internal structure of FVs and find while that they often contain information that encodes the output space of the function, this information alone is not sufficient to reconstruct an FV. Finally, we test semantic vector composition in FVs, and find that to some extent they can be summed to create vectors that trigger new complex tasks. Our findings show that compact, causal internal vector representations of function abstractions can be explicitly extracted from LLMs. Our code and data are available at https://functions.baulab.info.
LGAug 2, 2024
The Quest for the Right Mediator: Surveying Mechanistic Interpretability Through the Lens of Causal Mediation AnalysisAaron Mueller, Jannik Brinkmann, Millicent Li et al.
Interpretability provides a toolset for understanding how and why neural networks behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not share theoretical foundations, making it difficult to measure progress and compare the pros and cons of different techniques. Furthermore, while mechanistic understanding is frequently discussed, the basic causal units underlying these mechanisms are often not explicitly defined. In this article, we propose a perspective on interpretability research grounded in causal mediation analysis. Specifically, we describe the history and current state of interpretability taxonomized according to the types of causal units (mediators) employed, as well as methods used to search over mediators. We discuss the pros and cons of each mediator, providing insights as to when particular kinds of mediators and search methods are most appropriate. We argue that this framing yields a more cohesive narrative of the field and helps researchers select appropriate methods based on their research objective. Our analysis yields actionable recommendations for future work, including the discovery of new mediators and the development of standardized evaluations tailored to these goals.
AIOct 30, 2025
LLMs Process Lists With General Filter HeadsArnab Sen Sharma, Giordano Rogers, Natalie Shapira et al.
We investigate the mechanisms underlying a range of list-processing tasks in LLMs, and we find that LLMs have learned to encode a compact, causal representation of a general filtering operation that mirrors the generic "filter" function of functional programming. Using causal mediation analysis on a diverse set of list-processing tasks, we find that a small number of attention heads, which we dub filter heads, encode a compact representation of the filtering predicate in their query states at certain tokens. We demonstrate that this predicate representation is general and portable: it can be extracted and reapplied to execute the same filtering operation on different collections, presented in different formats, languages, or even in tasks. However, we also identify situations where transformer LMs can exploit a different strategy for filtering: eagerly evaluating if an item satisfies the predicate and storing this intermediate result as a flag directly in the item representations. Our results reveal that transformer LMs can develop human-interpretable implementations of abstract computational operations that generalize in ways that are surprisingly similar to strategies used in traditional functional programming patterns.
CLApr 4, 2024
Locating and Editing Factual Associations in MambaArnab Sen Sharma, David Atkinson, David Bau
We investigate the mechanisms of factual recall in the Mamba state space model. Our work is inspired by previous findings in autoregressive transformer language models suggesting that their knowledge recall is localized to particular modules at specific token locations; we therefore ask whether factual recall in Mamba can be similarly localized. To investigate this, we conduct four lines of experiments on Mamba. First, we apply causal tracing or interchange interventions to localize key components inside Mamba that are responsible for recalling facts, revealing that specific components within middle layers show strong causal effects at the last token of the subject, while the causal effect of intervening on later layers is most pronounced at the last token of the prompt, matching previous findings on autoregressive transformers. Second, we show that rank-one model editing methods can successfully insert facts at specific locations, again resembling findings on transformer LMs. Third, we examine the linearity of Mamba's representations of factual relations. Finally we adapt attention-knockout techniques to Mamba in order to dissect information flow during factual recall. We compare Mamba directly to a similar-sized autoregressive transformer LM and conclude that despite significant differences in architectural approach, when it comes to factual recall, the two architectures share many similarities.
CLMay 20, 2025
Language Models use Lookbacks to Track BeliefsNikhil Prakash, Natalie Shapira, Arnab Sen Sharma et al.
How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze LMs' ability to reason about characters' beliefs using causal mediation and abstraction. We construct a dataset, CausalToM, consisting of simple stories where two characters independently change the state of two objects, potentially unaware of each other's actions. Our investigation uncovers a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary. The LM binds each character-object-state triple together by co-locating their reference information, represented as Ordering IDs (OIs), in low-rank subspaces of the state token's residual stream. When asked about a character's beliefs regarding the state of an object, the binding lookback retrieves the correct state OI and then the answer lookback retrieves the corresponding state token. When we introduce text specifying that one character is (not) visible to the other, we find that the LM first generates a visibility ID encoding the relation between the observing and the observed character OIs. In a visibility lookback, this ID is used to retrieve information about the observed character and update the observing character's beliefs. Our work provides insights into belief tracking mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.
CLFeb 18, 2025
Elucidating Mechanisms of Demographic Bias in LLMs for HealthcareHiba Ahsan, Arnab Sen Sharma, Silvio Amir et al.
We know from prior work that LLMs encode social biases, and that this manifests in clinical tasks. In this work we adopt tools from mechanistic interpretability to unveil sociodemographic representations and biases within LLMs in the context of healthcare. Specifically, we ask: Can we identify activations within LLMs that encode sociodemographic information (e.g., gender, race)? We find that gender information is highly localized in MLP layers and can be reliably manipulated at inference time via patching. Such interventions can surgically alter generated clinical vignettes for specific conditions, and also influence downstream clinical predictions which correlate with gender, e.g., patient risk of depression. We find that representation of patient race is somewhat more distributed, but can also be intervened upon, to a degree. To our knowledge, this is the first application of mechanistic interpretability methods to LLMs for healthcare.
CLDec 17, 2025
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation ExplainersAdam Karvonen, James Chua, Clément Dumas et al.
Large language model (LLM) activations are notoriously difficult to understand, with most existing techniques using complex, specialized methods for interpreting them. Recent work has proposed a simpler approach known as LatentQA: training LLMs to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language. However, prior work has focused on narrow task settings for both training and evaluation. In this paper, we instead take a generalist perspective. We evaluate LatentQA-trained models, which we call Activation Oracles (AOs), in far out-of-distribution settings and examine how performance scales with training data diversity. We find that AOs can recover information fine-tuned into a model (e.g., biographical knowledge or malign propensities) that does not appear in the input text, despite never being trained with activations from a fine-tuned model. Our main evaluations are four downstream tasks where we can compare to prior white- and black-box techniques. We find that even narrowly-trained LatentQA models can generalize well, and that adding additional training datasets (such as classification tasks and a self-supervised context prediction task) yields consistent further improvements. Our best AOs match or exceed white-box baselines on all four tasks and the best overall baseline on 3 of 4. These results suggest that diversified training to answer natural-language queries imparts a general capability to verbalize information about LLM activations.
CLDec 3, 2021
HS-BAN: A Benchmark Dataset of Social Media Comments for Hate Speech Detection in BanglaNauros Romim, Mosahed Ahmed, Md Saiful Islam et al.
In this paper, we present HS-BAN, a binary class hate speech (HS) dataset in Bangla language consisting of more than 50,000 labeled comments, including 40.17% hate and rest are non hate speech. While preparing the dataset a strict and detailed annotation guideline was followed to reduce human annotation bias. The HS dataset was also preprocessed linguistically to extract different types of slang currently people write using symbols, acronyms, or alternative spellings. These slang words were further categorized into traditional and non-traditional slang lists and included in the results of this paper. We explored traditional linguistic features and neural network-based methods to develop a benchmark system for hate speech detection for the Bangla language. Our experimental results show that existing word embedding models trained with informal texts perform better than those trained with formal text. Our benchmark shows that a Bi-LSTM model on top of the FastText informal word embedding achieved 86.78% F1-score. We will make the dataset available for public use.
IROct 13, 2021
Presenting a Larger Up-to-date Movie Dataset and Investigating the Effects of Pre-released Attributes on Gross RevenueArnab Sen Sharma, Tirtha Roy, Sadique Ahmmod Rifat et al.
Movie-making has become one of the most costly and risky endeavors in the entertainment industry. Continuous change in the preference of the audience makes it harder to predict what kind of movie will be financially successful at the box office. So, it is no wonder that cautious, intelligent stakeholders and large production houses will always want to know the probable revenue that will be generated by a movie before making an investment. Researchers have been working on finding an optimal strategy to help investors in making the right decisions. But the lack of a large, up-to-date dataset makes their work harder. In this work, we introduce an up-to-date, richer, and larger dataset that we have prepared by scraping IMDb for researchers and data analysts to work with. The compiled dataset contains the summery data of 7.5 million titles and detail information of more than 200K movies. Additionally, we perform different statistical analysis approaches on our dataset to find out how a movie's revenue is affected by different pre-released attributes such as budget, runtime, release month, content rating, genre etc. In our analysis, we have found that having a star cast/director has a positive impact on generated revenue. We introduce a novel approach for calculating the star power of a movie. Based on our analysis we select a set of attributes as features and train different machine learning algorithms to predict a movie's expected revenue. Based on generated revenue, we classified the movies in 10 categories and achieved a one-class-away accuracy rate of almost 60% (bingo accuracy of 30%). All the generated datasets and analysis codes are available online. We also made the source codes of our scraper bots public, so that researchers interested in extending this work can easily modify these bots as they need and prepare their own up-to-date datasets.
IRNov 19, 2019
Automatic Detection of Satire in Bangla Documents: A CNN Approach Based on Hybrid Feature Extraction ModelArnab Sen Sharma, Maruf Ahmed Mridul, Md Saiful Islam
Widespread of satirical news in online communities is an ongoing trend. The nature of satires is so inherently ambiguous that sometimes it's too hard even for humans to understand whether it's actually satire or not. So, research interest has grown in this field. The purpose of this research is to detect Bangla satirical news spread in online news portals as well as social media. In this paper, we propose a hybrid technique for extracting features from text documents combining Word2Vec and TF-IDF. Using our proposed feature extraction technique, with standard CNN architecture we could detect whether a Bangla text document is satire or not with an accuracy of more than 96%.