Aria Nourbakhsh

CL
h-index5
6papers
1,225citations
Novelty38%
AI Score51

6 Papers

34.3LGMay 28
CSULoRA: Closest Safe Update Low-Rank Adaptation

Oleksandr Marchenko Breneur, Adelaide Danilov, Aria Nourbakhsh et al.

Low-rank adaptation has become a standard method for parameter-efficient fine-tuning of large language models, but even small amounts of unsafe or adversarial fine-tuning data can substantially weaken the safety behavior of aligned models. Existing safety-preserving LoRA methods often rely on hard interventions such as projection, pruning, thresholding, or additional training objectives. While these methods can suppress unsafe update directions, they may also remove task-relevant information or require extra tuning. We introduce CSULoRA, a post-hoc method for correcting trained LoRA adapters through closest safe update estimation. CSULoRA estimates a safety-aligned subspace from the weight displacement between a safety-aligned model and its corresponding base checkpoint. It then decomposes each LoRA update into fully aligned, partially aligned, and off-subspace components. Instead of discarding components outside the estimated safety subspace, CSULoRA solves a closed-form penalized minimum-change problem that preserves the fully aligned component while smoothly attenuating potentially unsafe directions according to their relative energy. In adversarial fine-tuning experiments, CSULoRA substantially reduces attack success rate while preserving most of the utility gains obtained from standard LoRA fine-tuning.

29.5CLApr 13
AEyeDE: An Attention-Based Attribution Framework for AI-Generated Text Detection

Aria Nourbakhsh, Adelaide Danilov, Christoph Schommer et al.

Detecting AI-generated text is becoming increasingly challenging as modern language models approach human-level fluency and can evade detectors that rely on surface statistics or likelihood-based signals. We propose \textsc{AEyeDE}, an attribution-driven approach to human-AI authorship detection that leverages model attention as a discriminative signal. Specifically, we extract attention-based attribution matrices for both human- and AI-generated text using a \emph{proxy} Transformer model with white-box access and train a lightweight Convolutional Neural Network to learn representations from these attribution maps. Across encoder-decoder translation settings, our method consistently outperforms a text-only baseline. In decoder-only settings, it performs strongly in generator-specific detection, remains competitive on standard benchmarks, and shows robustness under cross-dataset transfer and alternative-spelling perturbations. We further show that attention maps exhibit recurring local structures whose relative frequencies differ consistently between human- and AI-generated text across datasets and proxy models. These findings suggest that attention-based attribution maps provide a complementary and interpretable signal for AI-generated text detection. We will make the code publicly available to support future research.

31.8CLMar 11
Evaluating Explainable AI Attribution Methods in Neural Machine Translation via Attention-Guided Knowledge Distillation

Aria Nourbakhsh, Salima Lamsiyah, Adelaide Danilov et al.

The study of the attribution of input features to the output of neural network models is an active area of research. While numerous Explainable AI (XAI) techniques have been proposed to interpret these models, the systematic and automated evaluation of these methods in sequence-to-sequence (seq2seq) models is less explored. This paper introduces a new approach for evaluating explainability methods in transformer-based seq2seq models. We use teacher-derived attribution maps as a structured side signal to guide a student model, and quantify the utility of different attribution methods through the student's ability to simulate targets. Using the Inseq library, we extract attribution scores over source-target sequence pairs and inject these scores into the attention mechanism of a student transformer model under four composition operators (addition, multiplication, averaging, and replacement). Across three language pairs (de-en, fr-en, ar-en) and attributions from Marian-MT and mBART models, Attention, Value Zeroing, and Layer Gradient $\times$ Activation consistently yield the largest gains in BLEU (and corresponding improvements in chrF) relative to baselines. In contrast, other gradient-based methods (Saliency, Integrated Gradients, DeepLIFT, Input $\times$ Gradient, GradientShap) lead to smaller and less consistent improvements. These results suggest that different attribution methods capture distinct signals and that attention-derived attributions better capture alignment between source and target representations in seq2seq models. Finally, we introduce an Attributor transformer that, given a source-target pair, learns to reconstruct the teacher's attribution map. Our findings demonstrate that the more accurately the Attributor can reproduce attribution maps, the more useful an injection of those maps is for the downstream task. The source code can be found on GitHub.

LGJul 26, 2025
Cluster Purge Loss: Structuring Transformer Embeddings for Equivalent Mutants Detection

Adelaide Danilov, Aria Nourbakhsh, Christoph Schommer

Recent pre-trained transformer models achieve superior performance in various code processing objectives. However, although effective at optimizing decision boundaries, common approaches for fine-tuning them for downstream classification tasks - distance-based methods or training an additional classification head - often fail to thoroughly structure the embedding space to reflect nuanced intra-class semantic relationships. Equivalent code mutant detection is one of these tasks, where the quality of the embedding space is crucial to the performance of the models. We introduce a novel framework that integrates cross-entropy loss with a deep metric learning objective, termed Cluster Purge Loss. This objective, unlike conventional approaches, concentrates on adjusting fine-grained differences within each class, encouraging the separation of instances based on semantical equivalency to the class center using dynamically adjusted borders. Employing UniXCoder as the base model, our approach demonstrates state-of-the-art performance in the domain of equivalent mutant detection and produces a more interpretable embedding space.

CLJan 14, 2022
The Dark Side of the Language: Pre-trained Transformers in the DarkNet

Leonardo Ranaldi, Aria Nourbakhsh, Arianna Patrizi et al.

Pre-trained Transformers are challenging human performances in many NLP tasks. The massive datasets used for pre-training seem to be the key to their success on existing tasks. In this paper, we explore how a range of pre-trained Natural Language Understanding models perform on definitely unseen sentences provided by classification tasks over a DarkNet corpus. Surprisingly, results show that syntactic and lexical neural networks perform on par with pre-trained Transformers even after fine-tuning. Only after what we call extreme domain adaptation, that is, retraining with the masked language model task on all the novel corpus, pre-trained Transformers reach their standard high results. This suggests that huge pre-training corpora may give Transformers unexpected help since they are exposed to many of the possible sentences.

CLAug 23, 2019
Toward Dialogue Modeling: A Semantic Annotation Scheme for Questions and Answers

Maria-Andrea Cruz-Blandón, Gosse Minnema, Aria Nourbakhsh et al.

The present study proposes an annotation scheme for classifying the content and discourse contribution of question-answer pairs. We propose detailed guidelines for using the scheme and apply them to dialogues in English, Spanish, and Dutch. Finally, we report on initial machine learning experiments for automatic annotation.