IV · Apr 11, 2022 · Code
Negligible effect of brain MRI data preprocessing for tumor segmentation
Ekaterina Kondrateva, Polina Druzhinina, Alexandra Dalechina et al.
Magnetic resonance imaging (MRI) data is heterogeneous due to differences in device manufacturers, scanning protocols, and inter-subject variability. A conventional way to mitigate MR image heterogeneity is to apply preprocessing transformations such as anatomy alignment, voxel resampling, signal intensity equalization, image denoising, and localization of regions of interest. Although a preprocessing pipeline standardizes image appearance, its influence on the quality of image segmentation and on other downstream tasks in deep neural networks has never been rigorously studied. We conduct experiments on three publicly available datasets and evaluate the effect of different preprocessing steps in intra- and inter-dataset training scenarios. Our results demonstrate that most popular standardization steps add no value to the network performance; moreover, preprocessing can hamper model performance. We suggest that image intensity normalization approaches do not contribute to model accuracy because of the reduction of signal variance with image standardization. Finally, we show that the contribution of skull-stripping in data preprocessing is almost negligible if measured in terms of estimated tumor volume. We show that the only essential transformation for accurate deep learning analysis is the unification of voxel spacing across the dataset. In contrast, inter-subject anatomy alignment in the form of non-rigid atlas registration is not necessary, and intensity equalization steps (denoising, bias-field correction, and histogram matching) do not improve models' performance. The study code is accessible online at https://github.com/MedImAIR/brain-mri-processing-pipeline
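To make the "unification of voxel spacing" step concrete, here is a minimal sketch that resamples a 3D volume to a common spacing using nearest-neighbor lookup in NumPy. It is an illustration only, not the authors' pipeline: the function name `resample_to_spacing`, its parameters, and the 1 mm isotropic default are assumptions, and a real pipeline would typically use a dedicated medical imaging library with higher-order interpolation.

```python
import numpy as np

def resample_to_spacing(volume, spacing, target_spacing=(1.0, 1.0, 1.0)):
    """Nearest-neighbor resampling of a 3D volume to a target voxel spacing.

    Hypothetical helper for illustration: `spacing` and `target_spacing`
    are per-axis voxel sizes (e.g. in millimetres).
    """
    spacing = np.asarray(spacing, dtype=float)
    target = np.asarray(target_spacing, dtype=float)

    # The physical extent stays the same, so the voxel count scales by spacing / target.
    new_shape = np.maximum(
        1, np.round(np.array(volume.shape) * spacing / target)
    ).astype(int)

    # For every output voxel centre, find the nearest input voxel along each axis.
    indices = []
    for n_out, n_in, s_in, s_out in zip(new_shape, volume.shape, spacing, target):
        coords = (np.arange(n_out) + 0.5) * s_out / s_in - 0.5
        indices.append(np.clip(np.round(coords).astype(int), 0, n_in - 1))

    return volume[np.ix_(*indices)]

# Usage: a scan with 2 mm slices is brought to 1 mm isotropic spacing.
scan = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)
resampled = resample_to_spacing(scan, spacing=(1.0, 1.0, 2.0))
```

Applying the same target spacing to every scan in a dataset gives the network inputs in which one voxel always covers the same physical volume.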
CL · Mar 24, 2025 · Code
I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
Andrey Galichin, Alexey Dontsov, Polina Druzhinina et al.
Recent LLMs like DeepSeek-R1 have demonstrated state-of-the-art performance by integrating deep thinking and complex reasoning during generation. However, the internal mechanisms behind these reasoning processes remain unexplored. We observe reasoning LLMs consistently use vocabulary associated with human reasoning processes. We hypothesize these words correspond to specific reasoning moments within the models' internal mechanisms. To test this hypothesis, we employ Sparse Autoencoders (SAEs), a technique for sparse decomposition of neural network activations into human-interpretable features. We introduce ReasonScore, an automatic metric to identify active SAE features during these reasoning moments. We perform manual and automatic interpretation of the features detected by our metric, and find those with activation patterns matching uncertainty, exploratory thinking, and reflection. Through steering experiments, we demonstrate that amplifying these features increases performance on reasoning-intensive benchmarks (+2.2%) while producing longer reasoning traces (+20.5%). Using the model diffing technique, we provide evidence that these features are present only in models with reasoning capabilities. Our work provides the first step towards a mechanistic understanding of reasoning in LLMs. Code available at https://github.com/AIRI-Institute/SAE-Reasoning
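As a rough illustration of the two operations this abstract relies on, the sketch below shows a generic sparse autoencoder encoding step and a simple form of feature steering in NumPy. It is not the paper's implementation and does not reproduce ReasonScore: the function names, the ReLU encoder, and the additive steering rule are assumptions chosen because they are a common way to set this up.

```python
import numpy as np

def sae_encode(activations, w_enc, b_enc):
    """Map model activations to sparse, non-negative feature activations.

    Assumed form: ReLU(x @ W_enc + b_enc), a common SAE encoder.
    """
    return np.maximum(0.0, activations @ w_enc + b_enc)

def steer(activations, w_dec, feature_idx, strength):
    """Amplify one SAE feature by adding its decoder direction to the activations.

    `w_dec` has shape (n_features, d_model); `strength` is a hypothetical
    scalar controlling how strongly the feature is pushed.
    """
    direction = w_dec[feature_idx]
    return activations + strength * direction

# Usage with toy shapes: 2 token positions, model width 4, 3 SAE features.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
w_enc = rng.normal(size=(4, 3))
b_enc = np.zeros(3)
w_dec = rng.normal(size=(3, 4))

features = sae_encode(x, w_enc, b_enc)
x_steered = steer(x, w_dec, feature_idx=1, strength=2.0)
```

In a steering experiment of the kind described, the modified activations would be written back into the model's forward pass at the layer the SAE was trained on.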
CL · Feb 20, 2025 · Code
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
Anton Razzhigaev, Matvey Mikhalchuk, Temurbek Rahmatullaev et al.
We introduce methods to quantify how Large Language Models (LLMs) encode and store contextual information, revealing that tokens often seen as minor (e.g., determiners, punctuation) carry surprisingly high context. Notably, removing these tokens -- especially stopwords, articles, and commas -- consistently degrades performance on MMLU and BABILong-4k, even when only irrelevant tokens are removed. Our analysis also shows a strong correlation between contextualization and linearity, where linearity measures how closely the transformation from one layer's embeddings to the next can be approximated by a single linear mapping. These findings underscore the hidden importance of filler tokens in maintaining context. For further exploration, we present LLM-Microscope, an open-source toolkit that assesses token-level nonlinearity, evaluates contextual memory, visualizes intermediate layer contributions (via an adapted Logit Lens), and measures the intrinsic dimensionality of representations. This toolkit illuminates how seemingly trivial tokens can be critical for long-range understanding.
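One way to read the linearity measure described above is as the quality of a least-squares fit from one layer's embeddings to the next. The sketch below follows that reading in NumPy; the exact definition used by LLM-Microscope is not given in this abstract, so the centering, the Frobenius normalization, and the name `linearity_score` are assumptions.

```python
import numpy as np

def linearity_score(layer_a, layer_b):
    """Score in [0, 1]: how well layer_b is approximated by a linear map of layer_a.

    Both inputs have shape (n_tokens, d_model) and are assumed to be non-constant.
    A score near 1 means a single linear mapping explains almost all of layer_b.
    """
    # Centre both sets of embeddings so a constant offset does not count as nonlinearity.
    a = layer_a - layer_a.mean(axis=0, keepdims=True)
    b = layer_b - layer_b.mean(axis=0, keepdims=True)

    # Normalize to unit Frobenius norm so the residual is scale-independent.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)

    # Best linear map in the least-squares sense.
    mapping, *_ = np.linalg.lstsq(a, b, rcond=None)
    residual = np.linalg.norm(a @ mapping - b) ** 2
    return 1.0 - residual

# Usage: an exactly affine layer transition scores close to 1.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 8))
next_emb = emb @ rng.normal(size=(8, 8)) + 1.0
score = linearity_score(emb, next_emb)
```

Because the target is normalized to unit norm, the squared residual is the fraction of its variation that the linear map fails to explain, which keeps the score comparable across layers.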
CL · Dec 9, 2025
MindShift: Analyzing Language Models' Reactions to Psychological Prompts
Anton Vasiliuk, Irina Abdullaeva, Polina Druzhinina et al.
Large language models (LLMs) hold the potential to absorb and reflect personality traits and attitudes specified by users. In our study, we investigated this potential using robust psychometric measures. We adapted the most studied test in psychological literature, namely the Minnesota Multiphasic Personality Inventory (MMPI), and examined LLMs' behavior to identify traits. To assess LLMs' sensitivity to prompts and their psychological biases, we created personality-oriented prompts, crafting a detailed set of personas that vary in trait intensity. This enables us to measure how well LLMs follow these roles. Our study introduces MindShift, a benchmark for evaluating LLMs' psychological adaptability. The results highlight a consistent improvement in LLMs' role perception, attributed to advancements in training datasets and alignment techniques. Additionally, we observe significant differences in responses to psychometric assessments across different model types and families, suggesting variability in their ability to emulate human-like personality traits. MindShift prompts and code for LLM evaluation will be publicly available.
CL · Mar 5, 2025
Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
Kristian Kuznetsov, Laida Kushnareva, Polina Druzhinina et al.
Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving this goal. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAE) to extract features from the Gemma-2-2b residual stream. We identify both interpretable and efficient features, analyzing their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts.
AI · Feb 11, 2025
Universal Adversarial Attack on Aligned Multimodal LLMs
Temurbek Rahmatullaev, Polina Druzhinina, Nikita Kurdiukov et al.
We propose a universal adversarial attack on multimodal Large Language Models (LLMs) that leverages a single optimized image to override alignment safeguards across diverse queries and even multiple models. By backpropagating through the vision encoder and language head, we craft a synthetic image that forces the model to respond with a targeted phrase (e.g., "Sure, here it is") or otherwise unsafe content -- even for harmful prompts. In experiments on the SafeBench and MM-SafetyBench benchmarks, our method achieves higher attack success rates than existing baselines, including text-only universal prompts (e.g., up to 81% on certain models). We further demonstrate cross-model universality by training on several multimodal LLMs simultaneously. Additionally, a multi-answer variant of our approach produces more natural-sounding (yet still malicious) responses. These findings underscore critical vulnerabilities in current multimodal alignment and call for more robust adversarial defenses. We will release code and datasets under the Apache-2.0 license. Warning: some content generated by Multimodal LLMs in this paper may be offensive.