LGAug 23, 2022
Cardinality-Regularized Hawkes-Granger ModelTsuyoshi Idé, Georgios Kollias, Dzung T. Phan et al.
We propose a new sparse Granger-causal learning framework for temporal event data. We focus on a specific class of point processes called the Hawkes process. We begin by pointing out that most of the existing sparse causal learning algorithms for the Hawkes process suffer from a singularity in maximum likelihood estimation. As a result, their sparse solutions can appear only as numerical artifacts. In this paper, we propose a mathematically well-defined sparse causal learning framework based on a cardinality-regularized Hawkes process, which remedies the pathological issues of existing approaches. We leverage the proposed algorithm for the task of instance-wise causal event analysis, where sparsity plays a critical role. We validate the proposed framework with two real use-cases, one from the power grid and the other from the cloud data center management domain.
CLJul 1, 2024
Needle in the Haystack for Memory Based Large Language ModelsElliot Nelson, Georgios Kollias, Payel Das et al.
Current large language models (LLMs) often perform poorly on simple fact retrieval tasks. Here we investigate if coupling a dynamically adaptable external memory to a LLM can alleviate this problem. For this purpose, we test Larimar, a recently proposed language model architecture which uses an external associative memory, on long-context recall tasks including passkey and needle-in-the-haystack tests. We demonstrate that the external memory of Larimar, which allows fast write and read of an episode of text samples, can be used at test time to handle contexts much longer than those seen during training. We further show that the latent readouts from the memory (to which long contexts are written) control the decoder towards generating correct outputs, with the memory stored off of the GPU. Compared to existing transformer-based LLM architectures for long-context recall tasks that use larger parameter counts or modified attention mechanisms, a relatively smaller size Larimar is able to maintain strong performance without any task-specific training or training on longer contexts.
LGMar 18, 2024Code
Larimar: Large Language Models with Episodic Memory ControlPayel Das, Subhajit Chaudhury, Elliot Nelson et al.
Efficient and accurate updating of knowledge stored in Large Language Models (LLMs) is one of the most pressing research challenges today. This paper presents Larimar - a novel, brain-inspired architecture for enhancing LLMs with a distributed episodic memory. Larimar's memory allows for dynamic, one-shot updates of knowledge without the need for computationally expensive re-training or fine-tuning. Experimental results on multiple fact editing benchmarks demonstrate that Larimar attains accuracy comparable to most competitive baselines, even in the challenging sequential editing setup, but also excels in speed - yielding speed-ups of 8-10x depending on the base LLM - as well as flexibility due to the proposed architecture being simple, LLM-agnostic, and hence general. We further provide mechanisms for selective fact forgetting, information leakage prevention, and input context length generalization with Larimar and show their effectiveness. Our code is available at https://github.com/IBM/larimar
CLJul 23, 2024
Generation Constraint Scaling Can Mitigate HallucinationGeorgios Kollias, Payel Das, Subhajit Chaudhury
Addressing the issue of hallucinations in large language models (LLMs) is a critical challenge. As the cognitive mechanisms of hallucination have been related to memory, here we explore hallucination for LLM that is enabled with explicit memory mechanisms. We empirically demonstrate that by simply scaling the readout vector that constrains generation in a memory-augmented LLM decoder, hallucination mitigation can be achieved in a training-free manner. Our method is geometry-inspired and outperforms a state-of-the-art LLM editing method on the task of generation of Wikipedia-like biography entries both in terms of generation quality and runtime complexity.
CLApr 8, 2025Code
Multi-Sense Embeddings for Language Models and Knowledge DistillationQitong Wang, Mohammed J. Zaki, Georgios Kollias et al.
Transformer-based large language models (LLMs) rely on contextual embeddings which generate different (continuous) representations for the same token depending on its surrounding context. Nonetheless, words and tokens typically have a limited number of senses (or meanings). We propose multi-sense embeddings as a drop-in replacement for each token in order to capture the range of their uses in a language. To construct a sense embedding dictionary, we apply a clustering algorithm to embeddings generated by an LLM and consider the cluster centers as representative sense embeddings. In addition, we propose a novel knowledge distillation method that leverages the sense dictionary to learn a smaller student model that mimics the senses from the much larger base LLM model, offering significant space and inference time savings, while maintaining competitive performance. Via thorough experiments on various benchmarks, we showcase the effectiveness of our sense embeddings and knowledge distillation approach. We share our code at https://github.com/Qitong-Wang/SenseDict
CLFeb 25
Decoder-based Sense Knowledge DistillationQitong Wang, Mohammed J. Zaki, Georgios Kollias et al.
Large language models (LLMs) learn contextual embeddings that capture rich semantic information, yet they often overlook structured lexical knowledge such as word senses and relationships. Prior work has shown that incorporating sense dictionaries can improve knowledge distillation for encoder models, but their application to decoder as generative models remains challenging. In this paper, we introduce Decoder-based Sense Knowledge Distillation (DSKD), a framework that integrates lexical resources into the training of decoder-style LLMs without requiring dictionary lookup at inference time. Extensive experiments on diverse benchmarks demonstrate that DSKD significantly enhances knowledge distillation performance for decoders, enabling generative models to inherit structured semantics while maintaining efficient training.
LGFeb 6, 2024
Learning Granger Causality from Instance-wise Self-attentive Hawkes ProcessesDongxia Wu, Tsuyoshi Idé, Aurélie Lozano et al.
We address the problem of learning Granger causality from asynchronous, interdependent, multi-type event sequences. In particular, we are interested in discovering instance-level causal structures in an unsupervised manner. Instance-level causality identifies causal relationships among individual events, providing more fine-grained information for decision-making. Existing work in the literature either requires strong assumptions, such as linearity in the intensity function, or heuristically defined model parameters that do not necessarily meet the requirements of Granger causality. We propose Instance-wise Self-Attentive Hawkes Processes (ISAHP), a novel deep learning framework that can directly infer the Granger causality at the event instance level. ISAHP is the first neural point process model that meets the requirements of Granger causality. It leverages the self-attention mechanism of the transformer to align with the principles of Granger causality. We empirically demonstrate that ISAHP is capable of discovering complex instance-level causal structures that cannot be handled by classical models. We also show that ISAHP achieves state-of-the-art performance in proxy tasks involving type-level causal discovery and instance-level event type prediction.
LGFeb 28, 2024
NeuroPrune: A Neuro-inspired Topological Sparse Training Algorithm for Large Language ModelsAmit Dhurandhar, Tejaswini Pedapati, Ronny Luss et al.
Transformer-based Language Models have become ubiquitous in Natural Language Processing (NLP) due to their impressive performance on various tasks. However, expensive training as well as inference remains a significant impediment to their widespread applicability. While enforcing sparsity at various levels of the model architecture has found promise in addressing scaling and efficiency issues, there remains a disconnect between how sparsity affects network topology. Inspired by brain neuronal networks, we explore sparsity approaches through the lens of network topology. Specifically, we exploit mechanisms seen in biological networks, such as preferential attachment and redundant synapse pruning, and show that principled, model-agnostic sparsity approaches are performant and efficient across diverse NLP tasks, spanning both classification (such as natural language inference) and generation (summarization, machine translation), despite our sole objective not being optimizing performance. NeuroPrune is competitive with (or sometimes superior to) baselines on performance and can be up to $10$x faster in terms of training time for a given level of sparsity, simultaneously exhibiting measurable improvements in inference time in many cases.
CLMar 10, 2025
Can Memory-Augmented Language Models Generalize on Reasoning-in-a-Haystack Tasks?Payel Das, Ching-Yun Ko, Sihui Dai et al.
Large language models often expose their brittleness in reasoning tasks, especially while executing long chains of reasoning over context. We propose MemReasoner, a new and simple memory-augmented LLM architecture, in which the memory learns the relative order of facts in context, and enables hopping over them, while the decoder selectively attends to the memory. MemReasoner is trained end-to-end, with optional supporting fact supervision of varying degrees. We train MemReasoner, along with existing memory-augmented transformer models and a state-space model, on two distinct synthetic multi-hop reasoning tasks. Experiments performed under a variety of challenging scenarios, including the presence of long distractor text or target answer changes in test set, show strong generalization of MemReasoner on both single- and two-hop tasks. This generalization of MemReasoner is achieved using none-to-weak supporting fact supervision (using none and 1\% of supporting facts for one- and two-hop tasks, respectively). In contrast, baseline models overall struggle to generalize and benefit far less from using full supporting fact supervision. The results highlight the importance of explicit memory mechanisms, combined with additional weak supervision, for improving large language model's context processing ability toward reasoning tasks.
CLFeb 20, 2025
EpMAN: Episodic Memory AttentioN for Generalizing to Longer ContextsSubhajit Chaudhury, Payel Das, Sarathkrishna Swaminathan et al.
Recent advances in Large Language Models (LLMs) have yielded impressive successes on many language tasks. However, efficient processing of long contexts using LLMs remains a significant challenge. We introduce \textbf{EpMAN} -- a method for processing long contexts in an \textit{episodic memory} module while \textit{holistically attending to} semantically relevant context chunks. The output of \textit{episodic attention} is then used to reweigh the decoder's self-attention to the stored KV cache of the context during training and generation. When an LLM decoder is trained using \textbf{EpMAN}, its performance on multiple challenging single-hop long-context recall and question-answering benchmarks is found to be stronger and more robust across the range from 16k to 256k tokens than baseline decoders trained with self-attention, and popular retrieval-augmented generation frameworks.
LGFeb 25, 2022
Directed Graph Auto-EncodersGeorgios Kollias, Vasileios Kalantzis, Tsuyoshi Idé et al.
We introduce a new class of auto-encoders for directed graphs, motivated by a direct extension of the Weisfeiler-Leman algorithm to pairs of node labels. The proposed model learns pairs of interpretable latent representations for the nodes of directed graphs, and uses parameterized graph convolutional network (GCN) layers for its encoder and an asymmetric inner product decoder. Parameters in the encoder control the weighting of representations exchanged between neighboring nodes. We demonstrate the ability of the proposed model to learn meaningful latent embeddings and achieve superior performance on the directed link prediction task on several popular network datasets.
NAOct 13, 2020
Projection techniques to update the truncated SVD of evolving matricesVassilis Kalantzis, Georgios Kollias, Shashanka Ubaru et al.
This paper considers the problem of updating the rank-k truncated Singular Value Decomposition (SVD) of matrices subject to the addition of new rows and/or columns over time. Such matrix problems represent an important computational kernel in applications such as Latent Semantic Indexing and Recommender Systems. Nonetheless, the proposed framework is purely algebraic and targets general updating problems. The algorithm presented in this paper undertakes a projection view-point and focuses on building a pair of subspaces which approximate the linear span of the sought singular vectors of the updated matrix. We discuss and analyze two different choices to form the projection subspaces. Results on matrices from real applications suggest that the proposed algorithm can lead to higher accuracy, especially for the singular triplets associated with the largest modulus singular values. Several practical details and key differences with other approaches are also discussed.
LGMay 23, 2019
Accelerating Physics-Based Simulations Using Neural Network Proxies: An Application in Oil Reservoir ModelingJiri Navratil, Alan King, Jesus Rios et al.
We develop a proxy model based on deep learning methods to accelerate the simulations of oil reservoirs--by three orders of magnitude--compared to industry-strength physics-based PDE solvers. This paper describes a new architectural approach to this task, accompanied by a thorough experimental evaluation on a publicly available reservoir model. We demonstrate that in a practical setting a speedup of more than 2000X can be achieved with an average sequence error of about 10\% relative to the oil-field simulator. The proxy model is contrasted with a high-quality physics-based acceleration baseline and is shown to outperform it by several orders of magnitude. We believe the outcomes presented here are extremely promising and offer a valuable benchmark for continuing research in oil field development optimization. Due to its domain-agnostic architecture, the presented approach can be extended to many applications beyond the field of oil and gas exploration.
SISep 21, 2018
Low rank methods for multiple network alignmentHuda Nassar, Georgios Kollias, Ananth Grama et al.
Multiple network alignment is the problem of identifying similar and related regions in a given set of networks. While there are a large number of effective techniques for pairwise problems with two networks that scale in terms of edges, these cannot be readily extended to align multiple networks as the computational complexity will tend to grow exponentially with the number of networks.In this paper we introduce a new multiple network alignment algorithm and framework that is effective at aligning thousands of networks with thousands of nodes. The key enabling technique of our algorithm is identifying an exact and easy to compute low-rank tensor structure inside of a principled heuristic procedure for pairwise network alignment called IsoRank. This can be combined with a new algorithm for $k$-dimensional matching problems on low-rank tensors to produce the alignment. We demonstrate results on synthetic and real-world problems that show our technique (i) is as good or better in terms of quality as existing methods, when they work on small problems, while running considerably faster and (ii) is able to scale to aligning a number of networks unreachable by current methods. We show in this paper that our method is the realistic choice for aligning multiple networks when no prior information is present.
LGJun 1, 2018
Provably convergent acceleration in factored gradient descent with applications in matrix sensingTayo Ajayi, David Mildebrath, Anastasios Kyrillidis et al.
We present theoretical results on the convergence of \emph{non-convex} accelerated gradient descent in matrix factorization models with $\ell_2$-norm loss. The purpose of this work is to study the effects of acceleration in non-convex settings, where provable convergence with acceleration should not be considered a \emph{de facto} property. The technique is applied to matrix sensing problems, for the estimation of a rank $r$ optimal solution $X^\star \in \mathbb{R}^{n \times n}$. Our contributions can be summarized as follows. $i)$ We show that acceleration in factored gradient descent converges at a linear rate; this fact is novel for non-convex matrix factorization settings, under common assumptions. $ii)$ Our proof technique requires the acceleration parameter to be carefully selected, based on the properties of the problem, such as the condition number of $X^\star$ and the condition number of objective function. $iii)$ Currently, our proof leads to the same dependence on the condition number(s) in the contraction parameter, similar to recent results on non-accelerated algorithms. $iv)$ Acceleration is observed in practice, both in synthetic examples and in two real applications: neuronal multi-unit activities recovery from single electrode recordings, and quantum state tomography on quantum computing simulators.
DCJan 11, 2018
MXNET-MPI: Embedding MPI parallelism in Parameter Server Task Model for scaling Deep LearningAmith R Mamidala, Georgios Kollias, Chris Ward et al.
Existing Deep Learning frameworks exclusively use either Parameter Server(PS) approach or MPI parallelism. In this paper, we discuss the drawbacks of such approaches and propose a generic framework supporting both PS and MPI programming paradigms, co-existing at the same time. The key advantage of the new model is to embed the scaling benefits of MPI parallelism into the loosely coupled PS task model. Apart from providing a practical usage model of MPI in cloud, such framework allows for novel communication avoiding algorithms that do parameter averaging in Stochastic Gradient Descent(SGD) approaches. We show how MPI and PS models can synergestically apply algorithms such as Elastic SGD to improve the rate of convergence against existing approaches. These new algorithms directly help scaling SGD clusterwide. Further, we also optimize the critical component of the framework, namely global aggregation or allreduce using a novel concept of tensor collectives. These treat a group of vectors on a node as a single object allowing for the existing single vector algorithms to be directly applicable. We back our claims with sufficient emperical evidence using large scale ImageNet 1K data. Our framework is built upon MXNET but the design is generic and can be adapted to other popular DL infrastructures.