CLAug 27, 2025
Scalable and consistent few-shot classification of survey responses using text embeddingsJonas Timmann Mjaaland, Markus Fleten Kreutzer, Halvor Tyseng et al.
Qualitative analysis of open-ended survey responses is a commonly-used research method in the social sciences, but traditional coding approaches are often time-consuming and prone to inconsistency. Existing solutions from Natural Language Processing such as supervised classifiers, topic modeling techniques, and generative large language models have limited applicability in qualitative analysis, since they demand extensive labeled data, disrupt established qualitative workflows, and/or yield variable results. In this paper, we introduce a text embedding-based classification framework that requires only a handful of examples per category and fits well with standard qualitative workflows. When benchmarked against human analysis of a conceptual physics survey consisting of 2899 open-ended responses, our framework achieves a Cohen's Kappa ranging from 0.74 to 0.83 as compared to expert human coders in an exhaustive coding scheme. We further show how performance of this framework improves with fine-tuning of the text embedding model, and how the method can be used to audit previously-analyzed datasets. These findings demonstrate that text embedding-assisted coding can flexibly scale to thousands of responses without sacrificing interpretability, opening avenues for deductive qualitative analysis at scale.
LGSep 29, 2025
Towards Understanding the Shape of Representations in Protein Language ModelsKosio Beshkov, Anders Malthe-Sørenssen
While protein language models (PLMs) are one of the most promising avenues of research for future de novo protein design, the way in which they transform sequences to hidden representations, as well as the information encoded in such representations is yet to be fully understood. Several works have attempted to propose interpretability tools for PLMs, but they have focused on understanding how individual sequences are transformed by such models. Therefore, the way in which PLMs transform the whole space of sequences along with their relations is still unknown. In this work we attempt to understand this transformed space of sequences by identifying protein structure and representation with square-root velocity (SRV) representations and graph filtrations. Both approaches naturally lead to a metric space in which pairs of proteins or protein representations can be compared with each other. We analyze different types of proteins from the SCOP dataset and show that the Karcher mean and effective dimension of the SRV shape space follow a non-linear pattern as a function of the layers in ESM2 models of different sizes. Furthermore, we use graph filtrations as a tool to study the context lengths at which models encode the structural features of proteins. We find that PLMs preferentially encode immediate as well as local relations between residues, but start to degrade for larger context lengths. The most structurally faithful encoding tends to occur close to, but before the last layer of the models, indicating that training a folding model ontop of these layers might lead to improved folding performance.
NEFeb 5, 2024
Exploring Biologically Inspired Mechanisms of Adversarial RobustnessKonstantin Holzhausen, Mia Merlid, Håkon Olav Torvik et al.
Backpropagation-optimized artificial neural networks, while precise, lack robustness, leading to unforeseen behaviors that affect their safety. Biological neural systems do solve some of these issues already. Unlike artificial models, biological neurons adjust connectivity based on neighboring cell activity. Understanding the biological mechanisms of robustness can pave the way towards building trust worthy and safe systems. Robustness in neural representations is hypothesized to correlate with the smoothness of the encoding manifold. Recent work suggests power law covariance spectra, which were observed studying the primary visual cortex of mice, to be indicative of a balanced trade-off between accuracy and robustness in representations. Here, we show that unsupervised local learning models with winner takes all dynamics learn such power law representations, providing upcoming studies a mechanistic model with that characteristic. Our research aims to understand the interplay between geometry, spectral properties, robustness, and expressivity in neural representations. Hence, we study the link between representation smoothness and spectrum by using weight, Jacobian and spectral regularization while assessing performance and adversarial robustness. Our work serves as a foundation for future research into the mechanisms underlying power law spectra and optimally smooth encodings in both biological and artificial systems. The insights gained may elucidate the mechanisms that realize robust neural networks in mammalian brains and inform the development of more stable and reliable artificial systems.
COMP-PHJan 5, 2024
Tailoring Frictional Properties of Surfaces Using Diffusion ModelsEven Marius Nordhagen, Henrik Andersen Sveinsson, Anders Malthe-Sørenssen
This Letter introduces an approach for precisely designing surface friction properties using a conditional generative machine learning model, specifically a diffusion denoising probabilistic model (DDPM). We created a dataset of synthetic surfaces with frictional properties determined by molecular dynamics simulations, which trained the DDPM to predict surface structures from desired frictional outcomes. Unlike traditional trial-and-error and numerical optimization methods, our approach directly yields surface designs meeting specified frictional criteria with high accuracy and efficiency. This advancement in material surface engineering demonstrates the potential of machine learning in reducing the iterative nature of surface design processes. Our findings not only provide a new pathway for precise surface property tailoring but also suggest broader applications in material science where surface characteristics are critical.