Dmitry Krotov

h-index11

18papers

1,260citations

Novelty59%

AI Score45

Ranked #44,627 of 194,257 authors (top 23%)#10,320 in LG (top 26%)

18 Papers

29.7LGFeb 14, 2023Code

Energy Transformer

Benjamin Hoover, Yuchen Liang, Bao Pham et al. · gatech, ibm-research

Our work combines aspects of three promising paradigms in machine learning, namely, attention mechanism, energy-based models, and associative memory. Attention is the power-house driving modern deep learning successes, but it lacks clear theoretical foundations. Energy-based models allow a principled approach to discriminative and generative tasks, but the design of the energy functional is not straightforward. At the same time, Dense Associative Memory models or Modern Hopfield Networks have a well-established theoretical foundation, and allow an intuitive design of the energy function. We propose a novel architecture, called the Energy Transformer (or ET for short), that uses a sequence of attention layers that are purposely designed to minimize a specifically engineered energy function, which is responsible for representing the relationships between the tokens. In this work, we introduce the theoretical foundations of ET, explore its empirical capabilities using the image completion task, and obtain strong quantitative results on the graph anomaly detection and graph classification tasks.

14.3LGSep 28, 2023

Memory in Plain Sight: Surveying the Uncanny Resemblances of Associative Memories and Diffusion Models

Benjamin Hoover, Hendrik Strobelt, Dmitry Krotov et al. · gatech, ibm-research

The generative process of Diffusion Models (DMs) has recently set state-of-the-art on many AI generation benchmarks. Though the generative process is traditionally understood as an "iterative denoiser", there is no universally accepted language to describe it. We introduce a novel perspective to describe DMs using the mathematical language of memory retrieval from the field of energy-based Associative Memories (AMs), making efforts to keep our presentation approachable to newcomers to both of these fields. Unifying these two fields provides insight that DMs can be seen as a particular kind of AM where Lyapunov stability guarantees are bypassed by intelligently engineering the dynamics (i.e., the noise and step size schedules) of the denoising process. Finally, we present a growing body of evidence that records DMs exhibiting empirical behavior we would expect from AMs, and conclude by discussing research opportunities that are revealed by understanding DMs as a form of energy-based memory.

19.8NEJun 7, 2023Code

Long Sequence Hopfield Memory

Hamza Tahir Chaudhry, Jacob A. Zavatone-Veth, Dmitry Krotov et al. · harvard

Sequence memory is an essential attribute of natural and artificial intelligence that enables agents to encode, store, and retrieve complex sequences of stimuli and actions. Computational models of sequence memory have been proposed where recurrent Hopfield-like neural networks are trained with temporally asymmetric Hebbian rules. However, these networks suffer from limited sequence capacity (maximal length of the stored sequence) due to interference between the memories. Inspired by recent work on Dense Associative Memories, we expand the sequence capacity of these models by introducing a nonlinear interaction term, enhancing separation between the patterns. We derive novel scaling laws for sequence capacity with respect to network size, significantly outperforming existing scaling laws for models based on traditional Hopfield networks, and verify these theoretical results with numerical simulation. Moreover, we introduce a generalized pseudoinverse rule to recall sequences of highly correlated patterns. Finally, we extend this model to store sequences with variable timing between states' transitions and describe a biologically-plausible implementation, with connections to motor neuroscience.

14.9LGJun 5, 2023Code

End-to-end Differentiable Clustering with Associative Memories

Bishwajit Saha, Dmitry Krotov, Mohammed J. Zaki et al.

Clustering is a widely used unsupervised learning technique involving an intensive discrete optimization problem. Associative Memory models or AMs are differentiable neural networks defining a recursive dynamical system, which have been integrated with various deep learning architectures. We uncover a novel connection between the AM dynamics and the inherent discrete assignment necessary in clustering to propose a novel unconstrained continuous relaxation of the discrete clustering problem, enabling end-to-end differentiable clustering with AM, dubbed ClAM. Leveraging the pattern completion ability of AMs, we further develop a novel self-supervised clustering loss. Our evaluations on varied datasets demonstrate that ClAM benefits from the self-supervision, and significantly improves upon both the traditional Lloyd's k-means algorithm, and more recent continuous clustering relaxations (by upto 60% in terms of the Silhouette Coefficient).

18.2NEMar 20, 2023Code

Sparse Distributed Memory is a Continual Learner

Trenton Bricken, Xander Davies, Deepak Singh et al.

Continual learning is a problem for artificial neural networks that their biological counterparts are adept at solving. Building on work using Sparse Distributed Memory (SDM) to connect a core neural circuit with the powerful Transformer model, we create a modified Multi-Layered Perceptron (MLP) that is a strong continual learner. We find that every component of our MLP variant translated from biology is necessary for continual learning. Our solution is also free from any memory replay or task information, and introduces novel methods to train sparse networks that may be broadly applicable.

27.2CLFeb 1, 2025Code

M+: Extending MemoryLLM with Scalable Long-Term Memory

Yu Wang, Dmitry Krotov, Yuanzhe Hu et al.

Equipping large language models (LLMs) with latent-space memory has attracted increasing attention as they can extend the context window of existing language models. However, retaining information from the distant past remains a challenge. For example, MemoryLLM (Wang et al., 2024a), as a representative work with latent-space memory, compresses past information into hidden states across all layers, forming a memory pool of 1B parameters. While effective for sequence lengths up to 16k tokens, it struggles to retain knowledge beyond 20k tokens. In this work, we address this limitation by introducing M+, a memory-augmented model based on MemoryLLM that significantly enhances long-term information retention. M+ integrates a long-term memory mechanism with a co-trained retriever, dynamically retrieving relevant information during text generation. We evaluate M+ on diverse benchmarks, including long-context understanding and knowledge retention tasks. Experimental results show that M+ significantly outperforms MemoryLLM and recent strong baselines, extending knowledge retention from under 20k to over 160k tokens with similar GPU memory overhead. We open-source our code at https://github.com/wangyu-ustc/MemoryLLM

1.8LGAug 30, 2022

Associative Learning for Network Embedding

Yuchen Liang, Dmitry Krotov, Mohammed J. Zaki

The network embedding task is to represent the node in the network as a low-dimensional vector while incorporating the topological and structural information. Most existing approaches solve this problem by factorizing a proximity matrix, either directly or implicitly. In this work, we introduce a network embedding method from a new perspective, which leverages Modern Hopfield Networks (MHN) for associative learning. Our network learns associations between the content of each node and that node's neighbors. These associations serve as memories in the MHN. The recurrent dynamics of the network make it possible to recover the masked node, given that node's neighbors. Our proposed method is evaluated on different downstream tasks such as node classification and linkage prediction. The results show competitive performance compared to the common matrix factorization techniques and deep learning based methods.

12.6CLFeb 21, 2024

CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory

Zexue He, Leonid Karlinsky, Donghyun Kim et al.

Large Language Models (LLMs) struggle to handle long input sequences due to high memory and runtime costs. Memory-augmented models have emerged as a promising solution to this problem, but current methods are hindered by limited memory capacity and require costly re-training to integrate with a new LLM. In this work, we introduce an associative memory module which can be coupled to any pre-trained (frozen) attention-based LLM without re-training, enabling it to handle arbitrarily long input sequences. Unlike previous methods, our associative memory module consolidates representations of individual tokens into a non-parametric distribution model, dynamically managed by properly balancing the novelty and recency of the incoming data. By retrieving information from this consolidated associative memory, the base LLM can achieve significant (up to 29.7% on Arxiv) perplexity reduction in long-context modeling compared to other baselines evaluated on standard benchmarks. This architecture, which we call CAMELoT (Consolidated Associative Memory Enhanced Long Transformer), demonstrates superior performance even with a tiny context window of 128 tokens, and also enables improved in-context learning with a much larger set of demonstrations.

15.0LGOct 31, 2024Code

Dense Associative Memory Through the Lens of Random Features

Benjamin Hoover, Duen Horng Chau, Hendrik Strobelt et al. · gatech, ibm-research

Dense Associative Memories are high storage capacity variants of the Hopfield networks that are capable of storing a large number of memory patterns in the weights of the network of a given size. Their common formulations typically require storing each pattern in a separate set of synaptic weights, which leads to the increase of the number of synaptic weights when new patterns are introduced. In this work we propose an alternative formulation of this class of models using random features, commonly used in kernel methods. In this formulation the number of network's parameters remains fixed. At the same time, new memories can be added to the network by modifying existing weights. We show that this novel network closely approximates the energy function and dynamics of conventional Dense Associative Memories and shares their desirable computational properties.

10.8FLU-DYNJan 2, 2025

Operator Learning for Reconstructing Flow Fields from Sparse Measurements: an Energy Transformer Approach

Qian Zhang, Dmitry Krotov, George Em Karniadakis

Machine learning methods have shown great success in various scientific areas, including fluid mechanics. However, reconstruction problems, where full velocity fields must be recovered from partial observations, remain challenging. In this paper, we propose a novel operator learning framework for solving reconstruction problems by using the Energy Transformer (ET), an architecture inspired by associative memory models. We formulate reconstruction as a mapping from incomplete observed data to full reconstructed fields. The method is validated on three fluid mechanics examples using diverse types of data: (1) unsteady 2D vortex street in flow past a cylinder using simulation data; (2) high-speed under-expanded impinging supersonic jets impingement using Schlieren imaging; and (3) 3D turbulent jet flow using particle tracking. The results demonstrate the ability of ET to accurately reconstruct complex flow fields from highly incomplete data (90\% missing), even for noisy experimental measurements, with fast training and inference on a single GPU. This work provides a promising new direction for tackling reconstruction problems in fluid mechanics and other areas in mechanics, geophysics, weather prediction, and beyond.

15.7LGJun 12, 2025

Dense Associative Memory with Epanechnikov Energy

Benjamin Hoover, Zhaoyang Shi, Krishnakumar Balasubramanian et al.

We propose a novel energy function for Dense Associative Memory (DenseAM) networks, the log-sum-ReLU (LSR), inspired by optimal kernel density estimation. Unlike the common log-sum-exponential (LSE) function, LSR is based on the Epanechnikov kernel and enables exact memory retrieval with exponential capacity without requiring exponential separation functions. Moreover, it introduces abundant additional \emph{emergent} local minima while preserving perfect pattern recovery -- a characteristic previously unseen in DenseAM literature. Empirical results show that LSR energy has significantly more local minima (memories) that have comparable log-likelihood to LSE-based models. Analysis of LSR's emergent memories on image datasets reveals a degree of creativity and novelty, hinting at this method's potential for both large-scale memory storage and generative tasks.

25.2NEJul 14, 2021

Hierarchical Associative Memory

Dmitry Krotov

Dense Associative Memories or Modern Hopfield Networks have many appealing properties of associative memory. They can do pattern completion, store a large number of memories, and can be described using a recurrent neural network with a degree of biological plausibility and rich feedback between the neurons. At the same time, up until now all the models of this class have had only one hidden layer, and have only been formulated with densely connected network architectures, two aspects that hinder their machine learning applications. This paper tackles this gap and describes a fully recurrent model of associative memory with an arbitrary large number of layers, some of which can be locally connected (convolutional), and a corresponding energy function that decreases on the dynamical trajectory of the neurons' activations. The memories of the full network are dynamically "assembled" using primitives encoded in the synaptic weights of the lower layers, with the "assembling rules" encoded in the synaptic weights of the higher layers. In addition to the bottom-up propagation of information, typical of commonly used feedforward neural networks, the model described has rich top-down feedback from higher layers that help the lower-layer neurons to decide on their response to the input stimuli.

3.4CLJan 18, 2021Code

Can a Fruit Fly Learn Word Embeddings?

Yuchen Liang, Chaitanya K. Ryali, Benjamin Hoover et al.

The mushroom body of the fruit fly brain is one of the best studied systems in neuroscience. At its core it consists of a population of Kenyon cells, which receive inputs from multiple sensory modalities. These cells are inhibited by the anterior paired lateral neuron, thus creating a sparse high dimensional representation of the inputs. In this work we study a mathematical formalization of this network motif and apply it to learning the correlational structure between words and their context in a corpus of unstructured text, a common natural language processing (NLP) task. We show that this network can learn semantic representations of words and can generate both static and context-dependent word embeddings. Unlike conventional methods (e.g., BERT, GloVe) that use dense representations for word embedding, our algorithm encodes semantic meaning of words and their context in the form of sparse binary hash codes. The quality of the learned representations is evaluated on word similarity analysis, word-sense disambiguation, and document classification. It is shown that not only can the fruit fly network motif achieve performance comparable to existing methods in NLP, but, additionally, it uses only a fraction of the computational resources (shorter training time and smaller memory footprint).

27.5NCAug 16, 2020

Large Associative Memory Problem in Neurobiology and Machine Learning

Dmitry Krotov, John Hopfield

Dense Associative Memories or modern Hopfield networks permit storage and reliable retrieval of an exponentially large (in the dimension of feature space) number of memories. At the same time, their naive implementation is non-biological, since it seemingly requires the existence of many-body synaptic junctions between the neurons. We show that these models are effective descriptions of a more microscopic (written in terms of biological degrees of freedom) theory that has additional (hidden) neurons and only requires two-body interactions between them. For this reason our proposed microscopic theory is a valid model of large associative memory with a degree of biological plausibility. The dynamics of our network and its reduced dimensional equivalent both minimize energy (Lyapunov) functions. When certain dynamical variables (hidden neurons) are integrated out from our microscopic theory, one can recover many of the models that were previously discussed in the literature, e.g. the model presented in "Hopfield Networks is All You Need" paper. We also provide an alternative derivation of the energy function and the update rule proposed in the aforementioned paper and clarify the relationships between various models of this class.

11.1LGJan 14, 2020

Bio-Inspired Hashing for Unsupervised Similarity Search

Chaitanya K. Ryali, John J. Hopfield, Leopold Grinberg et al.

The fruit fly Drosophila's olfactory circuit has inspired a new locality sensitive hashing (LSH) algorithm, FlyHash. In contrast with classical LSH algorithms that produce low dimensional hash codes, FlyHash produces sparse high-dimensional hash codes and has also been shown to have superior empirical performance compared to classical LSH algorithms in similarity search. However, FlyHash uses random projections and cannot learn from data. Building on inspiration from FlyHash and the ubiquity of sparse expansive representations in neurobiology, our work proposes a novel hashing algorithm BioHash that produces sparse high dimensional hash codes in a data-driven manner. We show that BioHash outperforms previously published benchmarks for various hashing methods. Since our learning algorithm is based on a local and biologically plausible synaptic plasticity rule, our work provides evidence for the proposal that LSH might be a computational reason for the abundance of sparse expansive motifs in a variety of biological systems. We also propose a convolutional variant BioConvHash that further improves performance. From the perspective of computer science, BioHash and BioConvHash are fast, scalable and yield compressed binary representations that are useful for similarity search.

9.4CVAug 14, 2019

Local Unsupervised Learning for Image Analysis

Leopold Grinberg, John Hopfield, Dmitry Krotov

Local Hebbian learning is believed to be inferior in performance to end-to-end training using a backpropagation algorithm. We question this popular belief by designing a local algorithm that can learn convolutional filters at scale on large image datasets. These filters combined with patch normalization and very steep non-linearities result in a good classification accuracy for shallow networks trained locally, as opposed to end-to-end. The filters learned by our algorithm contain both orientation selective units and unoriented color units, resembling the responses of pyramidal neurons located in the cytochrome oxidase 'interblob' and 'blob' regions in the primary visual cortex of primates. It is shown that convolutional networks with patch normalization significantly outperform standard convolutional networks on the task of recovering the original classes when shadows are superimposed on top of standard CIFAR-10 images. Patch normalization approximates the retinal adaptation to the mean light intensity, important for human vision. We also demonstrate a successful transfer of learned representations between CIFAR-10 and ImageNet 32x32 datasets. All these results taken together hint at the possibility that local unsupervised training might be a powerful tool for learning general representations (without specifying the task) directly from unlabeled data.

20.7LGJun 26, 2018

Unsupervised Learning by Competing Hidden Units

Dmitry Krotov, John Hopfield

It is widely believed that the backpropagation algorithm is essential for learning good feature detectors in early layers of artificial neural networks, so that these detectors are useful for the task performed by the higher layers of that neural network. At the same time, the traditional form of backpropagation is biologically implausible. In the present paper we propose an unusual learning rule, which has a degree of biological plausibility, and which is motivated by Hebb's idea that change of the synapse strength should be local - i.e. should depend only on the activities of the pre and post synaptic neurons. We design a learning algorithm that utilizes global inhibition in the hidden layer, and is capable of learning early feature detectors in a completely unsupervised way. These learned lower layer feature detectors can be used to train higher layer weights in a usual supervised way so that the performance of the full network is comparable to the performance of standard feedforward networks trained end-to-end with a backpropagation algorithm.

36.7NEJun 3, 2016

Dense Associative Memory for Pattern Recognition

Dmitry Krotov, John J Hopfield

A model of associative memory is studied, which stores and reliably retrieves many more patterns than the number of neurons in the network. We propose a simple duality between this dense associative memory and neural networks commonly used in deep learning. On the associative memory side of this duality, a family of models that smoothly interpolates between two limiting cases can be constructed. One limit is referred to as the feature-matching mode of pattern recognition, and the other one as the prototype regime. On the deep learning side of the duality, this family corresponds to feedforward neural networks with one hidden layer and various activation functions, which transmit the activities of the visible neurons to the hidden layer. This family of activation functions includes logistics, rectified linear units, and rectified polynomials of higher degrees. The proposed duality makes it possible to apply energy-based intuition from associative memory to analyze computational properties of neural networks with unusual activation functions - the higher rectified polynomials which until now have not been used in deep learning. The utility of the dense memories is illustrated for two test cases: the logical gate XOR and the recognition of handwritten digits from the MNIST data set.