Ameen Ali

LG
h-index10
11papers
332citations
Novelty54%
AI Score51

11 Papers

LGJun 2, 2023
Centered Self-Attention Layers

Ameen Ali, Tomer Galanti, Lior Wolf · meta-ai

The self-attention mechanism in transformers and the message-passing mechanism in graph neural networks are repeatedly applied within deep learning architectures. We show that this application inevitably leads to oversmoothing, i.e., to similar representations at the deeper layers for different tokens in transformers and different nodes in graph neural networks. Based on our analysis, we present a correction term to the aggregating operator of these mechanisms. Empirically, this simple term eliminates much of the oversmoothing problem in visual transformers, obtaining performance in weakly supervised segmentation that surpasses elaborate baseline methods that introduce multiple auxiliary networks and training phrases. In graph neural networks, the correction term enables the training of very deep architectures more effectively than many recent solutions to the same problem.

LGMar 3, 2024Code
The Hidden Attention of Mamba Models

Ameen Ali, Itamar Zimerman, Lior Wolf

The Mamba layer offers an efficient selective state space model (SSM) that is highly effective in modeling multiple domains, including NLP, long-range sequence processing, and computer vision. Selective SSMs are viewed as dual models, in which one trains in parallel on the entire sequence via an IO-aware parallel scan, and deploys in an autoregressive manner. We add a third view and show that such models can be viewed as attention-driven models. This new perspective enables us to empirically and theoretically compare the underlying mechanisms to that of the self-attention layers in transformers and allows us to peer inside the inner workings of the Mamba model with explainability methods. Our code is publicly available.

CLApr 2
Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models

Liran Ringel, Ameen Ali, Yaniv Romano

Discrete diffusion language models (dLLMs) accelerate text generation by unmasking multiple tokens in parallel. However, parallel decoding introduces a distributional mismatch: it approximates the joint conditional using a fully factorized product of per-token marginals, which degrades output quality when selected tokens are strongly dependent. We propose DEMASK (DEpendency-guided unMASKing), a lightweight dependency predictor that attaches to the final hidden states of a dLLM. In a single forward pass, it estimates pairwise conditional influences between masked positions. Using these predictions, a greedy selection algorithm identifies positions with bounded cumulative dependency for simultaneous unmasking. Under a sub-additivity assumption, we prove this bounds the total variation distance between our parallel sampling and the model's joint. Empirically, DEMASK achieves 1.7-2.2$\times$ speedup on Dream-7B while matching or improving accuracy compared to confidence-based and KL-based baselines.

CVNov 15, 2025
Suppressing VLM Hallucinations with Spectral Representation Filtering

Ameen Ali, Tamim Zoabi, Lior Wolf

Vision-language models (VLMs) frequently produce hallucinations in the form of descriptions of objects, attributes, or relations that do not exist in the image due to over-reliance on language priors and imprecise cross-modal grounding. We introduce Spectral Representation Filtering (SRF), a lightweight, training-free method to suppress such hallucinations by analyzing and correcting the covariance structure of the model's representations. SRF identifies low-rank hallucination modes through eigendecomposition of the covariance of the differences between features collected for truthful and hallucinatory captions, revealing structured biases in the feature space. A soft spectral filter then attenuates these modes in the feed-forward projection weights of deeper vLLM layers, equalizing feature variance while preserving semantic fidelity. Unlike decoding or retraining-based approaches, SRF operates entirely post-hoc, incurs zero inference overhead, and requires no architectural modifications. Across three families of VLMs (LLaVA-1.5, MiniGPT-4, and mPLUG-Owl2), SRF consistently reduces hallucination rates on MSCOCO, POPE-VQA, and other visual tasks benchmarks, achieving state-of-the-art faithfulness without degrading caption quality.

CVOct 21, 2021Code
Video and Text Matching with Conditioned Embeddings

Ameen Ali, Idan Schwartz, Tamir Hazan et al.

We present a method for matching a text sentence from a given corpus to a given video clip and vice versa. Traditionally video and text matching is done by learning a shared embedding space and the encoding of one modality is independent of the other. In this work, we encode the dataset data in a way that takes into account the query's relevant information. The power of the method is demonstrated to arise from pooling the interaction data between words and frames. Since the encoding of the video clip depends on the sentence compared to it, the representation needs to be recomputed for each potential match. To this end, we propose an efficient shallow neural network. Its training employs a hierarchical triplet loss that is extendable to paragraph/video matching. The method is simple, provides explainability, and achieves state-of-the-art results for both sentence-clip and video-text by a sizable margin across five different datasets: ActivityNet, DiDeMo, YouCook2, MSR-VTT, and LSMDC. We also show that our conditioned representation can be transferred to video-guided machine translation, where we improved the current results on VATEX. Source code is available at https://github.com/AmeenAli/VideoMatch.

CLJul 12, 2025
Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models

Ameen Ali, Shahar Katz, Lior Wolf et al.

Large language models (LLMs) often develop learned mechanisms specialized to specific datasets, such as reliance on domain-specific correlations, which yield high-confidence predictions without generalizable reasoning. While beneficial in one setting, these dataset-specific mechanisms typically degrade performance when models encounter novel tasks or distributions. In this work, we introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms in transformer-based LLMs. Our method employs Integrated Gradients to quantify each neuron's influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance without supporting robust, transferable reasoning. Selectively pruning these neurons compels the model to depend on generalizable representations. Evaluated across multiple-choice benchmarks, our pruning-based fine-tuning significantly enhances performance, surpassing prior (non-pruning) adaptation methods.

LGDec 16, 2023
Degree-based stratification of nodes in Graph Neural Networks

Ameen Ali, Hakan Cevikalp, Lior Wolf

Despite much research, Graph Neural Networks (GNNs) still do not display the favorable scaling properties of other deep neural networks such as Convolutional Neural Networks and Transformers. Previous work has identified issues such as oversmoothing of the latent representation and have suggested solutions such as skip connections and sophisticated normalization schemes. Here, we propose a different approach that is based on a stratification of the graph nodes. We provide motivation that the nodes in a graph can be stratified into those with a low degree and those with a high degree and that the two groups are likely to behave differently. Based on this motivation, we modify the Graph Neural Network (GNN) architecture so that the weight matrices are learned, separately, for the nodes in each group. This simple-to-implement modification seems to improve performance across datasets and GNN methods. To verify that this increase in performance is not only due to the added capacity, we also perform the same modification for random splits of the nodes, which does not lead to any improvement.

LGFeb 15, 2022
XAI for Transformers: Better Explanations through Conservative Propagation

Ameen Ali, Thomas Schnake, Oliver Eberle et al.

Transformers have become an important workhorse of machine learning, with numerous applications. This necessitates the development of reliable methods for increasing their transparency. Multiple interpretability methods, often based on gradient information, have been proposed. We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction. We identify Attention Heads and LayerNorm as main reasons for such unreliable explanations and propose a more stable way for propagation through these layers. Our proposal, which can be seen as a proper extension of the well-established LRP method to Transformers, is shown both theoretically and empirically to overcome the deficiency of a simple gradient-based approach, and achieves state-of-the-art explanation performance on a broad range of Transformer models and datasets.

IVMar 25, 2021
Explainability Guided Multi-Site COVID-19 CT Classification

Ameen Ali, Tal Shaharabany, Lior Wolf

Radiologist examination of chest CT is an effective way for screening COVID-19 cases. In this work, we overcome three challenges in the automation of this process: (i) the limited number of supervised positive cases, (ii) the lack of region-based supervision, and (iii) the variability across acquisition sites. These challenges are met by incorporating a recent augmentation solution called SnapMix, by a new patch embedding technique, and by performing a test-time stability analysis. The three techniques are complementary and are all based on utilizing the heatmaps produced by the Class Activation Mapping (CAM) explainability method. Compared to the current state of the art, we obtain an increase of five percent in the F1 score on a site with a relatively high number of cases, and a gap twice as large for a site with much fewer training images.

LGMar 22, 2021
Weakly Supervised Recovery of Semantic Attributes

Ameen Ali, Tomer Galanti, Evgeniy Zheltonozhskiy et al.

We consider the problem of the extraction of semantic attributes, supervised only with classification labels. For example, when learning to classify images of birds into species, we would like to observe the emergence of features that zoologists use to classify birds. To tackle this problem, we propose training a neural network with discrete features in the last layer, which is followed by two heads: a multi-layered perceptron (MLP) and a decision tree. Since decision trees utilize simple binary decision stumps we expect those discrete features to obtain semantic meaning. We present a theoretical analysis as well as a practical method for learning in the intersection of two hypothesis classes. Our results on multiple benchmarks show an improved ability to extract a set of features that are highly correlated with the set of unseen attributes.

CVDec 3, 2020
Visualization of Supervised and Self-Supervised Neural Networks via Attribution Guided Factorization

Shir Gur, Ameen Ali, Lior Wolf

Neural network visualization techniques mark image locations by their relevancy to the network's classification. Existing methods are effective in highlighting the regions that affect the resulting classification the most. However, as we show, these methods are limited in their ability to identify the support for alternative classifications, an effect we name {\em the saliency bias} hypothesis. In this work, we integrate two lines of research: gradient-based methods and attribution-based methods, and develop an algorithm that provides per-class explainability. The algorithm back-projects the per pixel local influence, in a manner that is guided by the local attributions, while correcting for salient features that would otherwise bias the explanation. In an extensive battery of experiments, we demonstrate the ability of our methods to class-specific visualization, and not just the predicted label. Remarkably, the method obtains state of the art results in benchmarks that are commonly applied to gradient-based methods as well as in those that are employed mostly for evaluating attribution methods. Using a new unsupervised procedure, our method is also successful in demonstrating that self-supervised methods learn semantic information.