Rajgopal Kannan

LG
h-index72
58papers
1,787citations
Novelty55%
AI Score58

58 Papers

ARMar 10, 2022
Model-Architecture Co-Design for High Performance Temporal GNN Inference on FPGA

Hongkuan Zhou, Bingyi Zhang, Rajgopal Kannan et al.

Temporal Graph Neural Networks (TGNNs) are powerful models to capture temporal, structural, and contextual information on temporal graphs. The generated temporal node embeddings outperform other methods in many downstream tasks. Real-world applications require high performance inference on real-time streaming dynamic graphs. However, these models usually rely on complex attention mechanisms to capture relationships between temporal neighbors. In addition, maintaining vertex memory suffers from intrinsic temporal data dependency that hinders task-level parallelism, making it inefficient on general-purpose processors. In this work, we present a novel model-architecture co-design for inference in memory-based TGNNs on FPGAs. The key modeling optimizations we propose include a light-weight method to compute attention scores and a related temporal neighbor pruning strategy to further reduce computation and memory accesses. These are holistically coupled with key hardware optimizations that leverage FPGA hardware. We replace the temporal sampler with an on-chip FIFO based hardware sampler and the time encoder with a look-up-table. We train our simplified models using knowledge distillation to ensure similar accuracy vis-á-vis the original model. Taking advantage of the model optimizations, we propose a principled hardware architecture using batching, pipelining, and prefetching techniques to further improve the performance. We also propose a hardware mechanism to ensure the chronological vertex updating without sacrificing the computation parallelism. We evaluate the performance of the proposed hardware accelerator on three real-world datasets.

ARMay 1, 2022
Fine-Grained Address Segmentation for Attention-Based Variable-Degree Prefetching

Pengmiao Zhang, Ajitesh Srivastava, Anant V. Nori et al.

Machine learning algorithms have shown potential to improve prefetching performance by accurately predicting future memory accesses. Existing approaches are based on the modeling of text prediction, considering prefetching as a classification problem for sequence prediction. However, the vast and sparse memory address space leads to large vocabulary, which makes this modeling impractical. The number and order of outputs for multiple cache line prefetching are also fundamentally different from text prediction. We propose TransFetch, a novel way to model prefetching. To reduce vocabulary size, we use fine-grained address segmentation as input. To predict unordered sets of future addresses, we use delta bitmaps for multiple outputs. We apply an attention-based network to learn the mapping between input and output. Prediction experiments demonstrate that address segmentation achieves 26% - 36% higher F1-score than delta inputs and 15% - 24% higher F1-score than page & offset inputs for SPEC 2006, SPEC 2017, and GAP benchmarks. Simulation results show that TransFetch achieves 38.75% IPC improvement compared with no prefetching, outperforming the best-performing rule-based prefetcher BOP by 10.44%, and ML-based prefetcher Voyager by 6.64%.

ARJan 4, 2023
Accurate, Low-latency, Efficient SAR Automatic Target Recognition on FPGA

Bingyi Zhang, Rajgopal Kannan, Viktor Prasanna et al.

Synthetic aperture radar (SAR) automatic target recognition (ATR) is the key technique for remote-sensing image recognition. The state-of-the-art convolutional neural networks (CNNs) for SAR ATR suffer from \emph{high computation cost} and \emph{large memory footprint}, making them unsuitable to be deployed on resource-limited platforms, such as small/micro satellites. In this paper, we propose a comprehensive GNN-based model-architecture {co-design} on FPGA to address the above issues. \emph{Model design}: we design a novel graph neural network (GNN) for SAR ATR. The proposed GNN model incorporates GraphSAGE layer operators and attention mechanism, achieving comparable accuracy as the state-of-the-art work with near $1/100$ computation cost. Then, we propose a pruning approach including weight pruning and input pruning. While weight pruning through lasso regression reduces most parameters without accuracy drop, input pruning eliminates most input pixels with negligible accuracy drop. \emph{Architecture design}: to fully unleash the computation parallelism within the proposed model, we develop a novel unified hardware architecture that can execute various computation kernels (feature aggregation, feature transformation, graph pooling). The proposed hardware design adopts the Scatter-Gather paradigm to efficiently handle the irregular computation {patterns} of various computation kernels. We deploy the proposed design on an embedded FPGA (AMD Xilinx ZCU104) and evaluate the performance using MSTAR dataset. Compared with the state-of-the-art CNNs, the proposed GNN achieves comparable accuracy with $1/3258$ computation cost and $1/83$ model size. Compared with the state-of-the-art CPU/GPU, our FPGA accelerator achieves $14.8\times$/$2.5\times$ speedup (latency) and is $62\times$/$39\times$ more energy efficient.

LGApr 19, 2022
Label Efficient Regularization and Propagation for Graph Node Classification

Tian Xie, Rajgopal Kannan, C. -C. Jay Kuo

An enhanced label propagation (LP) method called GraphHop was proposed recently. It outperforms graph convolutional networks (GCNs) in the semi-supervised node classification task on various networks. Although the performance of GraphHop was explained intuitively with joint node attribute and label signal smoothening, its rigorous mathematical treatment is lacking. In this paper, we propose a label efficient regularization and propagation (LERP) framework for graph node classification, and present an alternate optimization procedure for its solution. Furthermore, we show that GraphHop only offers an approximate solution to this framework and has two drawbacks. First, it includes all nodes in the classifier training without taking the reliability of pseudo-labeled nodes into account in the label update step. Second, it provides a rough approximation to the optimum of a subproblem in the label aggregation step. Based on the LERP framework, we propose a new method, named the LERP method, to solve these two shortcomings. LERP determines reliable pseudo-labels adaptively during the alternate optimization and provides a better approximation to the optimum with computational efficiency. Theoretical convergence of LERP is guaranteed. Extensive experiments are conducted to demonstrate the effectiveness and efficiency of LERP. That is, LERP outperforms all benchmarking methods, including GraphHop, consistently on five test datasets and an object recognition task at extremely low label rates (i.e., 1, 2, 4, 8, 16, and 20 labeled samples per class).

ARMay 29, 2022
TransforMAP: Transformer for Memory Access Prediction

Pengmiao Zhang, Ajitesh Srivastava, Anant V. Nori et al.

Data Prefetching is a technique that can hide memory latency by fetching data before it is needed by a program. Prefetching relies on accurate memory access prediction, to which task machine learning based methods are increasingly applied. Unlike previous approaches that learn from deltas or offsets and perform one access prediction, we develop TransforMAP, based on the powerful Transformer model, that can learn from the whole address space and perform multiple cache line predictions. We propose to use the binary of memory addresses as model input, which avoids information loss and saves a token table in hardware. We design a block index bitmap to collect unordered future page offsets under the current page address as learning labels. As a result, our model can learn temporal patterns as well as spatial patterns within a page. In a practical implementation, this approach has the potential to hide prediction latency because it prefetches multiple cache lines likely to be used in a long horizon. We show that our approach achieves 35.67% MPKI improvement and 20.55% IPC improvement in simulation, higher than state-of-the-art Best-Offset prefetcher and ISB prefetcher.

DSApr 3, 2018
Optimal Net-Load Balancing in Smart Grids with High PV Penetration

Sanmukh R. Kuppannagari, Rajgopal Kannan, Viktor K. Prasanna

Mitigating Supply-Demand mismatch is critical for smooth power grid operation. Traditionally, load curtailment techniques such as Demand Response (DR) have been used for this purpose. However, these cannot be the only component of a net-load balancing framework for Smart Grids with high PV penetration. These grids can sometimes exhibit supply surplus causing over-voltages. Supply curtailment techniques such as Volt-Var Optimizations are complex and computationally expensive. This increases the complexity of net-load balancing systems used by the grid operator and limits their scalability. Recently new technologies have been developed that enable the rapid and selective connection of PV modules of an installation to the grid. Taking advantage of these advancements, we develop a unified optimal net-load balancing framework which performs both load and solar curtailment. We show that when the available curtailment values are discrete, this problem is NP-hard and develop bounded approximation algorithms for minimizing the curtailment cost. Our algorithms produce fast solutions, given the tight timing constraints required for grid operation. We also incorporate the notion of fairness to ensure that curtailment is evenly distributed among all the nodes. Finally, we develop an online algorithm which performs net-load balancing using only data available for the current interval. Using both theoretical analysis and practical evaluations, we show that our net-load balancing algorithms provide solutions which are close to optimal in a small amount of time.

CVApr 23Code
Latent Denoising Improves Visual Alignment in Large Multimodal Models

Dhruv Parikh, Jacob Fein-Ashley, Rajgopal Kannan et al.

Large Multimodal Models (LMMs) such as LLaVA are typically trained with an autoregressive language modeling objective, providing only indirect supervision to visual tokens. This often yields weak internal visual representations and brittle behavior under distribution shift. Inspired by recent progress on latent denoising for learning high-quality visual tokenizers, we show that the same principle provides an effective form of visual supervision for improving internal visual feature alignment and multimodal understanding in LMMs. We propose a latent denoising framework that corrupts projected visual tokens using a saliency-aware mixture of masking and Gaussian noising. The LMM is trained to denoise these corrupted tokens by recovering clean teacher patch features from hidden states at a selected intermediate LLM layer using a decoder. To prevent representation collapse, our framework also preserves the teacher's intra-image similarity structure and applies intra-image contrastive patch distillation. During inference, corruption and auxiliary heads are disabled, introducing no additional inference-time overhead. Across a broad suite of standard multimodal benchmarks, our method consistently improves visual understanding and reasoning over strong baselines, and yields clear gains on compositional robustness benchmarks (e.g., NaturalBench). Moreover, under ImageNet-C-style non-adversarial common corruptions applied to benchmark images, our method maintains higher accuracy and exhibits reduced degradation at both moderate and severe corruption levels. Our code is available at https://github.com/dhruvashp/latent-denoising-for-lmms.

LGSep 13, 2023
Characterizing Speed Performance of Multi-Agent Reinforcement Learning

Samuel Wiggins, Yuan Meng, Rajgopal Kannan et al.

Multi-Agent Reinforcement Learning (MARL) has achieved significant success in large-scale AI systems and big-data applications such as smart grids, surveillance, etc. Existing advancements in MARL algorithms focus on improving the rewards obtained by introducing various mechanisms for inter-agent cooperation. However, these optimizations are usually compute- and memory-intensive, thus leading to suboptimal speed performance in end-to-end training time. In this work, we analyze the speed performance (i.e., latency-bounded throughput) as the key metric in MARL implementations. Specifically, we first introduce a taxonomy of MARL algorithms from an acceleration perspective categorized by (1) training scheme and (2) communication method. Using our taxonomy, we identify three state-of-the-art MARL algorithms - Multi-Agent Deep Deterministic Policy Gradient (MADDPG), Target-oriented Multi-agent Communication and Cooperation (ToM2C), and Networked Multi-Agent RL (NeurComm) - as target benchmark algorithms, and provide a systematic analysis of their performance bottlenecks on a homogeneous multi-core CPU platform. We justify the need for MARL latency-bounded throughput to be a key performance metric in future literature while also addressing opportunities for parallelization and acceleration.

DCJul 17, 2022
Towards Programmable Memory Controller for Tensor Decomposition

Sasindu Wijeratne, Ta-Yang Wang, Rajgopal Kannan et al.

Tensor decomposition has become an essential tool in many data science applications. Sparse Matricized Tensor Times Khatri-Rao Product (MTTKRP) is the pivotal kernel in tensor decomposition algorithms that decompose higher-order real-world large tensors into multiple matrices. Accelerating MTTKRP can speed up the tensor decomposition process immensely. Sparse MTTKRP is a challenging kernel to accelerate due to its irregular memory access characteristics. Implementing accelerators on Field Programmable Gate Array (FPGA) for kernels such as MTTKRP is attractive due to the energy efficiency and the inherent parallelism of FPGA. This paper explores the opportunities, key challenges, and an approach for designing a custom memory controller on FPGA for MTTKRP while exploring the parameter space of such a custom memory controller.

LGApr 19Code
Do LLM-derived graph priors improve multi-agent coordination?

Nikunj Gupta, Rajgopal Kannan, Viktor Prasanna

Multi-agent reinforcement learning (MARL) is crucial for AI systems that operate collaboratively in distributed and adversarial settings, particularly in multi-domain operations (MDO). A central challenge in cooperative MARL is determining how agents should coordinate: existing approaches must either hand-specify graph topology, rely on proximity-based heuristics, or learn structure entirely from environment interaction; all of which are brittle, semantically uninformed, or data-intensive. We investigate whether large language models (LLMs) can generate useful coordination graph priors for MARL by using minimal natural language descriptions of agent observations to infer latent coordination patterns. These priors are integrated into MARL algorithms via graph convolutional layers within a graph neural network (GNN)-based pipeline, and evaluated on four cooperative scenarios from the Multi-Agent Particle Environment (MPE) benchmark against baselines spanning the full spectrum of coordination modeling, from independent learners to state-of-the-art graph-based methods. We further ablate across five compact open-source LLMs to assess the sensitivity of prior quality to model choice. Our results provide the first quantitative evidence that LLM-derived graph priors can enhance coordination and adaptability in dynamic multi-agent environments, and demonstrate that models as small as 1.5B parameters are sufficient for effective prior generation.

LGDec 10, 2022
Phases, Modalities, Temporal and Spatial Locality: Domain Specific ML Prefetcher for Accelerating Graph Analytics

Pengmiao Zhang, Rajgopal Kannan, Viktor K. Prasanna

Memory performance is a bottleneck in graph analytics acceleration. Existing Machine Learning (ML) prefetchers struggle with phase transitions and irregular memory accesses in graph processing. We propose MPGraph, an ML-based Prefetcher for Graph analytics using domain specific models. MPGraph introduces three novel optimizations: soft detection for phase transitions, phase-specific multi-modality models for access delta and page predictions, and chain spatio-temporal prefetching (CSTP) for prefetch control. Our transition detector achieves 34.17-82.15% higher precision compared with Kolmogorov-Smirnov Windowing and decision tree. Our predictors achieve 6.80-16.02% higher F1-score for delta and 11.68-15.41% higher accuracy-at-10 for page prediction compared with LSTM and vanilla attention models. Using CSTP, MPGraph achieves 12.52-21.23% IPC improvement, outperforming state-of-the-art non-ML prefetcher BO by 7.58-12.03% and ML-based prefetchers Voyager and TransFetch by 3.27-4.58%. For practical implementation, we demonstrate MPGraph using compressed models with reduced latency shows significantly superior accuracy and coverage compared with BO, leading to 3.58% higher IPC improvement.

LGSep 11, 2024Code
Mixture of Scope Experts at Test: Generalizing Deeper Graph Neural Networks with Shallow Variants

Gangda Deng, Hongkuan Zhou, Rajgopal Kannan et al.

Heterophilous graphs, where dissimilar nodes tend to connect, pose a challenge for graph neural networks (GNNs). Increasing the GNN depth can expand the scope (i.e., receptive field), potentially finding homophily from the higher-order neighborhoods. However, GNNs suffer from performance degradation as depth increases. Despite having better expressivity, state-of-the-art deeper GNNs achieve only marginal improvements compared to their shallow variants. Through theoretical and empirical analysis, we systematically demonstrate a shift in GNN generalization preferences across nodes with different homophily levels as depth increases. This creates a disparity in generalization patterns between GNN models with varying depth. Based on these findings, we propose to improve deeper GNN generalization while maintaining high expressivity by Mixture of scope experts at test (Moscat). Experimental results show that Moscat works flexibly with various GNNs across a wide range of datasets while significantly improving accuracy. Our code is available at (https://github.com/Hydrapse/moscat).

CLNov 13, 2025Code
HierRouter: Coordinated Routing of Specialized Large Language Models via Reinforcement Learning

Nikunj Gupta, Bill Guo, Rajgopal Kannan et al.

Large Language Models (LLMs) deliver state-of-the-art performance across many tasks but impose high computational and memory costs, limiting their deployment in resource-constrained or real-time settings. To address this, we propose HierRouter, a hierarchical routing approach that dynamically assembles inference pipelines from a pool of specialized, lightweight language models. Formulated as a finite-horizon Markov Decision Process (MDP), our approach trains a Proximal Policy Optimization (PPO)-based reinforcement learning agent to iteratively select which models to invoke at each stage of multi-hop inference. The agent conditions on the evolving context and accumulated cost to make context-aware routing decisions. Experiments with three open-source candidate LLMs across six benchmarks, including QA, code generation, and mathematical reasoning, show that HierRouter improves response quality by up to 2.4x compared to using individual models independently, while incurring only a minimal additional inference cost on average. These results highlight the promise of hierarchical routing for cost-efficient, high-performance LLM inference. All codes can be found here https://github.com/ Nikunj-Gupta/hierouter.

LGNov 11, 2025Code
TIGER-MARL: Enhancing Multi-Agent Reinforcement Learning with Temporal Information through Graph-based Embeddings and Representations

Nikunj Gupta, Ludwika Twardecka, James Zachary Hare et al.

In this paper, we propose capturing and utilizing \textit{Temporal Information through Graph-based Embeddings and Representations} or \textbf{TIGER} to enhance multi-agent reinforcement learning (MARL). We explicitly model how inter-agent coordination structures evolve over time. While most MARL approaches rely on static or per-step relational graphs, they overlook the temporal evolution of interactions that naturally arise as agents adapt, move, or reorganize cooperation strategies. Capturing such evolving dependencies is key to achieving robust and adaptive coordination. To this end, TIGER constructs dynamic temporal graphs of MARL agents, connecting their current and historical interactions. It then employs a temporal attention-based encoder to aggregate information across these structural and temporal neighborhoods, yielding time-aware agent embeddings that guide cooperative policy learning. Through extensive experiments on two coordination-intensive benchmarks, we show that TIGER consistently outperforms diverse value-decomposition and graph-based MARL baselines in task performance and sample efficiency. Furthermore, we conduct comprehensive ablation studies to isolate the impact of key design parameters in TIGER, revealing how structural and temporal factors can jointly shape effective policy learning in MARL. All codes can be found here: https://github.com/Nikunj-Gupta/tiger-marl.

SEMar 13
EvoClaw: Evaluating AI Agents on Continuous Software Evolution

Gangda Deng, Zhaoling Chen, Zhongming Yu et al.

With AI agents increasingly deployed as long-running systems, it becomes essential to autonomously construct and continuously evolve customized software to enable interaction within dynamic environments. Yet, existing benchmarks evaluate agents on isolated, one-off coding tasks, neglecting the temporal dependencies and technical debt inherent in real-world software evolution. To bridge this gap, we introduce DeepCommit, an agentic pipeline that reconstructs verifiable Milestone DAGs from noisy commit logs, where milestones are defined as semantically cohesive development goals. These executable sequences enable EvoClaw, a novel benchmark that requires agents to sustain system integrity and limit error accumulation, dimensions of long-term software evolution largely missing from current benchmarks. Our evaluation of 12 frontier models across 4 agent frameworks reveals a critical vulnerability: overall performance scores drop significantly from $>$80% on isolated tasks to at most 38% in continuous settings, exposing agents' profound struggle with long-term maintenance and error propagation.

ARMar 6
A Persistent-State Dataflow Accelerator for Memory-Bound Linear Attention Decode on FPGA

Neelesh Gupta, Peter Wang, Rajgopal Kannan et al.

Gated DeltaNet (GDN) is a linear attention mechanism that replaces the growing KV cache with a fixed-size recurrent state. Hybrid LLMs like Qwen3-Next use 75% GDN layers and achieve competitive accuracy to attention-only models. However, at batch-1, GDN decode is memory-bound on GPUs since the full recurrent state must be round-tripped through HBM every token. We show that this bottleneck is architectural, not algorithmic, as all subquadratic sequence models exhibit arithmetic intensities below 1 FLOP/B at decode time, making them more memory-bound than standard Transformers. We present an FPGA accelerator that eliminates this bottleneck by holding the full 2 MB recurrent state persistently in on-chip BRAM, converting the workload from memory-bound to compute-bound. Our design fuses the GDN recurrence into a five-phase pipelined datapath that performs only one read and one write pass over each state matrix per token, exploits Grouped Value Attention for paired-head parallelism, and overlaps preparation, computation, and output storage via dataflow pipelining. We explore four design points on an AMD Alveo U55C using Vitis HLS, varying head-level parallelism from 2 to 16 value-heads per iteration. Our fastest configuration achieves 63 $μ$s per token, 4.5$\times$ faster than the GPU reference on NVIDIA H100 PCIe. Post-implementation power analysis reports 9.96 W on-chip, yielding up to 60$\times$ greater energy efficiency per token decoded.

CVAug 31, 2024
Studying the Effects of Self-Attention on SAR Automatic Target Recognition

Jacob Fein-Ashley, Rajgopal Kannan, Viktor Prasanna

Attention mechanisms are critically important in the advancement of synthetic aperture radar (SAR) automatic target recognition (ATR) systems. Traditional SAR ATR models often struggle with the noisy nature of the SAR data, frequently learning from background noise rather than the most relevant image features. Attention mechanisms address this limitation by focusing on crucial image components, such as the shadows and small parts of a vehicle, which are crucial for accurate target classification. By dynamically prioritizing these significant features, attention-based models can efficiently characterize the entire image with a few pixels, thus enhancing recognition performance. This capability allows for the discrimination of targets from background clutter, leading to more practical and robust SAR ATR models. We show that attention modules increase top-1 accuracy, improve input robustness, and are qualitatively more explainable on the MSTAR dataset.

LGFeb 19
Action-Graph Policies: Learning Action Co-dependencies in Multi-Agent Reinforcement Learning

Nikunj Gupta, James Zachary Hare, Jesse Milzman et al.

Coordinating actions is the most fundamental form of cooperation in multi-agent reinforcement learning (MARL). Successful decentralized decision-making often depends not only on good individual actions, but on selecting compatible actions across agents to synchronize behavior, avoid conflicts, and satisfy global constraints. In this paper, we propose Action Graph Policies (AGP), that model dependencies among agents' available action choices. It constructs, what we call, \textit{coordination contexts}, that enable agents to condition their decisions on global action dependencies. Theoretically, we show that AGPs induce a strictly more expressive joint policy compared to fully independent policies and can realize coordinated joint actions that are provably more optimal than greedy execution even from centralized value-decomposition methods. Empirically, we show that AGP achieves 80-95\% success on canonical coordination tasks with partial observability and anti-coordination penalties, where other MARL methods reach only 10-25\%. We further demonstrate that AGP consistently outperforms these baselines in diverse multi-agent environments.

LGFeb 6, 2025Code
Deep Meta Coordination Graphs for Multi-agent Reinforcement Learning

Nikunj Gupta, James Zachary Hare, Rajgopal Kannan et al.

This paper presents deep meta coordination graphs (DMCG) for learning cooperative policies in multi-agent reinforcement learning (MARL). Coordination graph formulations encode local interactions and accordingly factorize the joint value function of all agents to improve efficiency in MARL. However, existing approaches rely solely on pairwise relations between agents, which potentially oversimplifies complex multi-agent interactions. DMCG goes beyond these simple direct interactions by also capturing useful higher-order and indirect relationships among agents. It generates novel graph structures accommodating multiple types of interactions and arbitrary lengths of multi-hop connections in coordination graphs to model such interactions. It then employs a graph convolutional network module to learn powerful representations in an end-to-end manner. We demonstrate its effectiveness in multiple coordination problems in MARL where other state-of-the-art methods can suffer from sample inefficiency or fail entirely. All codes can be found here: https://github.com/Nikunj-Gupta/dmcg-marl.

CVMay 11
Can Graphs Help Vision SSMs See Better?

Dhruv Parikh, Anvitha Ramachandran, Haoyang Fan et al.

Vision state space models inherit the efficiency and long-range modeling ability of Mamba-style selective scans. However, their performance depends critically on the representation of two-dimensional visual features as one-dimensional token sequences. Existing scan operators range from predefined geometric traversals to dynamic coordinate-based samplers that reroute tokens through predicted offsets and interpolation. While effective, these mechanisms primarily adapt paths or sampling locations, rather than explicitly modeling which local patches should exchange information before global state-space mixing. This motivates a simple question: \emph{can graphs help vision state space models see better?} We introduce \textbf{GraphScan}, a graph-induced dynamic scanning operator for Vision SSMs. For each token, GraphScan constructs a spatially bounded local graph, learns feature-conditioned affinities with relative positional bias, and produces the output token by one-step message passing over its semantic neighborhood. The resulting tokens are locally grounded before being processed by the selective SSM for global aggregation. GraphScan preserves token count and linear scaling in image size, while replacing coordinate-conditioned interpolation with feature-conditioned semantic routing. Integrated into a hierarchical backbone, \textbf{GraphScan-Mamba} achieves state-of-the-art performance among Vision SSMs across image classification, object detection, instance segmentation, and semantic segmentation, with modest computational overhead. Our analysis further shows that GraphScan induces interpretable displacement fields over the token lattice, providing a semantic and spatially grounded view of dynamic scanning. These results suggest that future Vision SSMs should treat scanning not merely as geometric serialization, but as learned local semantic routing before global state-space modeling.

CVFeb 1Code
ConsensusDrop: Fusing Visual and Cross-Modal Saliency for Efficient Vision Language Models

Dhruv Parikh, Haoyang Fan, Rajgopal Kannan et al.

Vision-Language Models (VLMs) are expensive because the LLM processes hundreds of largely redundant visual tokens. Existing token reduction methods typically exploit \textit{either} vision-encoder saliency (broad but query-agnostic) \textit{or} LLM cross-attention (query-aware but sparse and costly). We show that neither signal alone is sufficient: fusing them consistently improves performance compared to unimodal visual token selection (ranking). However, making such fusion practical is non-trivial: cross-modal saliency is usually only available \emph{inside} the LLM (too late for efficient pre-LLM pruning), and the two signals are inherently asymmetric, so naive fusion underutilizes their complementary strengths. We propose \textbf{ConsensusDrop}, a training-free framework that derives a \emph{consensus} ranking by reconciling vision encoder saliency with query-aware cross-attention, retaining the most informative tokens while compressing the remainder via encoder-guided token merging. Across LLaVA-1.5/NeXT, Video-LLaVA, and other open-source VLMs, ConsensusDrop consistently outperforms prior pruning methods under identical token budgets and delivers a stronger accuracy-efficiency Pareto frontier -- preserving near-baseline accuracy even at aggressive token reductions while reducing TTFT and KV cache footprint. Our code will be open-sourced.

LGOct 21, 2025Code
Training Diverse Graph Experts for Ensembles: A Systematic Empirical Study

Gangda Deng, Yuxin Yang, Ömer Faruk Akgül et al.

Graph Neural Networks (GNNs) have become essential tools for learning on relational data, yet the performance of a single GNN is often limited by the heterogeneity present in real-world graphs. Recent advances in Mixture-of-Experts (MoE) frameworks demonstrate that assembling multiple, explicitly diverse GNNs with distinct generalization patterns can significantly improve performance. In this work, we present the first systematic empirical study of expert-level diversification techniques for GNN ensembles. Evaluating 20 diversification strategies -- including random re-initialization, hyperparameter tuning, architectural variation, directionality modeling, and training data partitioning -- across 14 node classification benchmarks, we construct and analyze over 200 ensemble variants. Our comprehensive evaluation examines each technique in terms of expert diversity, complementarity, and ensemble performance. We also uncovers mechanistic insights into training maximally diverse experts. These findings provide actionable guidance for expert training and the design of effective MoE frameworks on graph data. Our code is available at https://github.com/Hydrapse/bench-gnn-diversification.

LGSep 25, 2025Code
Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just What They Say

Jacob Fein-Ashley, Dhruv Parikh, Rajgopal Kannan et al.

Open-source Large Language Models (LLMs) increasingly specialize by domain (e.g., math, code, general reasoning), motivating systems that leverage complementary strengths across models. Prior multi-LLM approaches either (i) route a query to one or a few experts and generate independently, (ii) aggregate outputs from each model via costly multi-turn exchanges, or (iii) fuse weights into a single model-typically requiring architectural homogeneity. We introduce Mixture of Thoughts (MoT), a simple method for latent-level collaboration among heterogeneous experts under a global routing scheme. For each query, a lightweight router selects top-$K$ experts and designates a primary expert; uniformly placed interaction layers project hidden states into a shared latent space where the primary expert performs cross-attention over its active (selected) peers. Pre-trained experts remain frozen; only the router and the lightweight interaction layers are trained with a novel joint training objective that improves both the expert selection and inter-expert collaboration. Across five in-distribution (ID) and three out-of-distribution (OOD) benchmarks, MoT surpasses the current routing and aggregation-based state-of-the-art, Avengers, by $+0.38\%$ and $+2.92\%$, respectively. Further, MoT significantly outperforms the best-performing single model. It achieves this with single-pass inference, runtime comparable to routing baselines, and none of the overheads of iterative aggregation. MoT offers a simple latent-space mechanism for combining heterogeneous LLMs, a practical step toward broader multi-LLM collaboration. Our code is publicly available at https://github.com/jacobfa/mot.

LGOct 5, 2020Code
Accurate, Efficient and Scalable Training of Graph Neural Networks

Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava et al.

Graph Neural Networks (GNNs) are powerful deep learning models to generate node embeddings on graphs. When applying deep GNNs on large graphs, it is still challenging to perform training in an efficient and scalable way. We propose a novel parallel training framework. Through sampling small subgraphs as minibatches, we reduce training workload by orders of magnitude compared with state-of-the-art minibatch methods. We then parallelize the key computation steps on tightly-coupled shared memory systems. For graph sampling, we exploit parallelism within and across sampler instances, and propose an efficient data structure supporting concurrent accesses from samplers. The parallel sampler theoretically achieves near-linear speedup with respect to number of processing units. For feature propagation within subgraphs, we improve cache utilization and reduce DRAM traffic by data partitioning. Our partitioning is a 2-approximation strategy for minimizing the communication cost compared to the optimal. We further develop a runtime scheduler to reorder the training operations and adjust the minibatch subgraphs to improve parallel performance. Finally, we generalize the above parallelization strategies to support multiple types of GNN models and graph samplers. The proposed training outperforms the state-of-the-art in scalability, efficiency and accuracy simultaneously. On a 40-core Xeon platform, we achieve 60x speedup (with AVX) in the sampling step and 20x speedup in the feature propagation step, compared to the serial implementation. Our algorithm enables fast training of deeper GNNs, as demonstrated by orders of magnitude speedup compared to the Tensorflow implementation. We open-source our code at https://github.com/GraphSAINT/GraphSAINT.

LGMay 8
SACHI: Structured Agent Coordination via Holistic Information Integration in Multi-Agent Reinforcement Learning

Nikunj Gupta, James Zachary Hare, Jesse Milzman et al.

Cooperative multi-agent reinforcement learning agents that act on partial local observations face a fundamental information bottleneck: the knowledge needed to select jointly optimal actions is scattered across the team, yet each agent must commit to a decision without access to its teammates' observations, intentions, or chosen actions. Existing methods either ignore this bottleneck, compress it into a scalar mixing signal, or route around it with learned communication channels. Framing action coordination as a problem of structured information integration among agents, we propose \textit{structured agent coordination via holistic information integration}, or SACHI, in which graph transformer convolutions over an inter-agent coordination graph enrich each agent's representation with receiver-sensitive, content-dependent signals from teammates prior to action selection. We evaluate SACHI across five cooperative tasks spanning spatial, communicative, and adversarial coordination challenges against twelve baselines. SACHI consistently matches or outperforms the best baseline on every task, and rigorous aggregate statistical analyses, including normalized metrics with bootstrap confidence intervals, Friedman ranking, and performance profiling, confirm that this advantage is statistically significant, robust across environments, and not attributable to increased model capacity. Parameter-matched ablations further trace the source of the gains to a single architectural property: the degree of content-dependence in the message-passing operator.

CLMay 7
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

Ömer Faruk Akgül, Rajgopal Kannan, Willie Neiswanger et al.

Reinforcement learning has become the standard for improving reasoning in large language models, yet evidence increasingly suggests that RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains. In this work, we ask: if RL merely steers the model toward paths it already knows, is the RL optimization loop itself necessary? Through token-level analysis across multiple model families and RL algorithms, we find that RL's beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points where the model is uncertain which branch to take. Only 1--3\% of token positions are affected, the promoted token always lies within the base model's top-5 alternatives, and targeted corrections at those few positions causally recover a large fraction of RL's accuracy gain, while random corrections fail. The base model's own entropy identifies these positions without any RL-trained model, and the entire correction is low-dimensional, representable in a tiny fraction of model parameters. These findings reframe reasoning improvement as sparse policy selection, not capability acquisition. We translate this insight into ReasonMaxxer, a minimal RL-free method that applies contrastive loss only at entropy-gated decision points, using a few hundred base-model rollouts and no online generation. Across three model families, six scales, and six math reasoning benchmarks, ReasonMaxxer matches or exceeds full RL performance while requiring only tens of problems and minutes of single-GPU training, a reduction in training cost of roughly three orders of magnitude.

LGApr 30
TypeBandit: Type-Level Context Allocation and Reweighting for Effective Attribute Completion in Heterogeneous Graph Neural Networks

Ta-Yang Wang, Rajgopal Kannan, Viktor Prasanna

Heterogeneous graphs are widely used to model multi-relational systems, but missing node attributes remain a major bottleneck for downstream learning. In this paper, we identify and formalize type-dependent information asymmetry: the phenomenon that different node types provide substantially different levels of useful signal for attribute completion. Motivated by this observation, we propose TypeBandit, a lightweight, model-agnostic methodology for heterogeneous attribute completion. TypeBandit combines topology-aware initialization, type-level bandit sampling, and joint representation learning. It allocates a finite global sampling budget across node types, samples representative nodes within each type, and uses the resulting sampled type summaries as shared contextual signals during representation construction. By operating at the type level rather than over each target node's local neighborhood, TypeBandit keeps the adaptive state compact and practical for large heterogeneous graphs. A key advantage of TypeBandit is architectural flexibility. Rather than requiring a new heterogeneous graph neural network architecture, TypeBandit acts as a type-aware front end for representative heterogeneous GNN backbones, including R-GCN, HetGNN, HGT, and SimpleHGN. We further introduce a hybrid pretraining scheme that combines structural degree priors with feature propagation, yielding a more reliable initializer than degree-only pretraining. Under a fixed-split protocol on DBLP, IMDB, and ACM, TypeBandit provides dataset-dependent but practically meaningful gains. Additional ablation, stability, efficiency, semantic-propagation, and sampled OGBN-MAG experiments support TypeBandit as a practical strategy for heterogeneous attribute completion when type-specific information is unevenly distributed and sampling resources are limited.

CVJan 18, 2025
ClusterViG: Efficient Globally Aware Vision GNNs via Image Partitioning

Dhruv Parikh, Jacob Fein-Ashley, Tian Ye et al.

Convolutional Neural Networks (CNN) and Vision Transformers (ViT) have dominated the field of Computer Vision (CV). Graph Neural Networks (GNN) have performed remarkably well across diverse domains because they can represent complex relationships via unstructured graphs. However, the applicability of GNNs for visual tasks was unexplored till the introduction of Vision GNNs (ViG). Despite the success of ViGs, their performance is severely bottlenecked due to the expensive $k$-Nearest Neighbors ($k$-NN) based graph construction. Recent works addressing this bottleneck impose constraints on the flexibility of GNNs to build unstructured graphs, undermining their core advantage while introducing additional inefficiencies. To address these issues, in this paper, we propose a novel method called Dynamic Efficient Graph Convolution (DEGC) for designing efficient and globally aware ViGs. DEGC partitions the input image and constructs graphs in parallel for each partition, improving graph construction efficiency. Further, DEGC integrates local intra-graph and global inter-graph feature learning, enabling enhanced global context awareness. Using DEGC as a building block, we propose a novel CNN-GNN architecture, ClusterViG, for CV tasks. Extensive experiments indicate that ClusterViG reduces end-to-end inference latency for vision tasks by up to $5\times$ when compared against a suite of models such as ViG, ViHGNN, PVG, and GreedyViG, with a similar model parameter count. Additionally, ClusterViG reaches state-of-the-art performance on image classification, object detection, and instance segmentation tasks, demonstrating the effectiveness of the proposed globally aware learning strategy. Finally, input partitioning performed by DEGC enables ClusterViG to be trained efficiently on higher-resolution images, underscoring the scalability of our approach.

CVDec 5, 2023
Realistic Scatterer Based Adversarial Attacks on SAR Image Classifiers

Tian Ye, Rajgopal Kannan, Viktor Prasanna et al.

Adversarial attacks have highlighted the vulnerability of classifiers based on machine learning for Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) tasks. An adversarial attack perturbs SAR images of on-ground targets such that the classifiers are misled into making incorrect predictions. However, many existing attacking techniques rely on arbitrary manipulation of SAR images while overlooking the feasibility of executing the attacks on real-world SAR imagery. Instead, adversarial attacks should be able to be implemented by physical actions, for example, placing additional false objects as scatterers around the on-ground target to perturb the SAR image and fool the SAR ATR. In this paper, we propose the On-Target Scatterer Attack (OTSA), a scatterer-based physical adversarial attack. To ensure the feasibility of its physical execution, we enforce a constraint on the positioning of the scatterers. Specifically, we restrict the scatterers to be placed only on the target instead of in the shadow regions or the background. To achieve this, we introduce a positioning score based on Gaussian kernels and formulate an optimization problem for our OTSA attack. Using a gradient ascent method to solve the optimization problem, the OTSA can generate a vector of parameters describing the positions, shapes, sizes and amplitudes of the scatterers to guide the physical execution of the attack that will mislead SAR image classifiers. The experimental results show that our attack obtains significantly higher success rates under the positioning constraint compared with the existing method.

DCMar 21, 2024
Accelerating ViT Inference on FPGA through Static and Dynamic Pruning

Dhruv Parikh, Shouyi Li, Bingyi Zhang et al.

Vision Transformers (ViTs) have achieved state-of-the-art accuracy on various computer vision tasks. However, their high computational complexity prevents them from being applied to many real-world applications. Weight and token pruning are two well-known methods for reducing complexity: weight pruning reduces the model size and associated computational demands, while token pruning further dynamically reduces the computation based on the input. Combining these two techniques should significantly reduce computation complexity and model size; however, naively integrating them results in irregular computation patterns, leading to significant accuracy drops and difficulties in hardware acceleration. Addressing the above challenges, we propose a comprehensive algorithm-hardware codesign for accelerating ViT on FPGA through simultaneous pruning -combining static weight pruning and dynamic token pruning. For algorithm design, we systematically combine a hardware-aware structured block-pruning method for pruning model parameters and a dynamic token pruning method for removing unimportant token vectors. Moreover, we design a novel training algorithm to recover the model's accuracy. For hardware design, we develop a novel hardware accelerator for executing the pruned model. The proposed hardware design employs multi-level parallelism with load balancing strategy to efficiently deal with the irregular computation pattern led by the two pruning approaches. Moreover, we develop an efficient hardware mechanism for efficiently executing the on-the-fly token pruning.

LGFeb 21, 2024
PaCKD: Pattern-Clustered Knowledge Distillation for Compressing Memory Access Prediction Models

Neelesh Gupta, Pengmiao Zhang, Rajgopal Kannan et al.

Deep neural networks (DNNs) have proven to be effective models for accurate Memory Access Prediction (MAP), a critical task in mitigating memory latency through data prefetching. However, existing DNN-based MAP models suffer from the challenges such as significant physical storage space and poor inference latency, primarily due to their large number of parameters. These limitations render them impractical for deployment in real-world scenarios. In this paper, we propose PaCKD, a Pattern-Clustered Knowledge Distillation approach to compress MAP models while maintaining the prediction performance. The PaCKD approach encompasses three steps: clustering memory access sequences into distinct partitions involving similar patterns, training large pattern-specific teacher models for memory access prediction for each partition, and training a single lightweight student model by distilling the knowledge from the trained pattern-specific teachers. We evaluate our approach on LSTM, MLP-Mixer, and ResNet models, as they exhibit diverse structures and are widely used for image classification tasks in order to test their effectiveness in four widely used graph applications. Compared to the teacher models with 5.406M parameters and an F1-score of 0.4626, our student models achieve a 552$\times$ model size compression while maintaining an F1-score of 0.4538 (with a 1.92% performance drop). Our approach yields an 8.70% higher result compared to student models trained with standard knowledge distillation and an 8.88% higher result compared to student models trained without any form of knowledge distillation.

CVFeb 1, 2024
A Single Graph Convolution Is All You Need: Efficient Grayscale Image Classification

Jacob Fein-Ashley, Sachini Wickramasinghe, Bingyi Zhang et al.

Image classifiers for domain-specific tasks like Synthetic Aperture Radar Automatic Target Recognition (SAR ATR) and chest X-ray classification often rely on convolutional neural networks (CNNs). These networks, while powerful, experience high latency due to the number of operations they perform, which can be problematic in real-time applications. Many image classification models are designed to work with both RGB and grayscale datasets, but classifiers that operate solely on grayscale images are less common. Grayscale image classification has critical applications in fields such as medical imaging and SAR ATR. In response, we present a novel grayscale image classification approach using a vectorized view of images. By leveraging the lightweight nature of Multi-Layer Perceptrons (MLPs), we treat images as vectors, simplifying the problem to grayscale image classification. Our approach incorporates a single graph convolutional layer in a batch-wise manner, enhancing accuracy and reducing performance variance. Additionally, we develop a customized accelerator on FPGA for our model, incorporating several optimizations to improve performance. Experimental results on benchmark grayscale image datasets demonstrate the effectiveness of our approach, achieving significantly lower latency (up to $16\times$ less on MSTAR) and competitive or superior performance compared to state-of-the-art models for SAR ATR and medical image classification.

CVDec 12, 2023
Benchmarking Deep Learning Classifiers for SAR Automatic Target Recognition

Jacob Fein-Ashley, Tian Ye, Rajgopal Kannan et al.

Synthetic Aperture Radar SAR Automatic Target Recognition ATR is a key technique of remote-sensing image recognition which can be supported by deep neural networks The existing works of SAR ATR mostly focus on improving the accuracy of the target recognition while ignoring the systems performance in terms of speed and storage which is critical to real-world applications of SAR ATR For decision-makers aiming to identify a proper deep learning model to deploy in a SAR ATR system it is important to understand the performance of different candidate deep learning models and determine the best model accordingly This paper comprehensively benchmarks several advanced deep learning models for SAR ATR with multiple distinct SAR imagery datasets Specifically we train and test five SAR image classifiers based on Residual Neural Networks ResNet18 ResNet34 ResNet50 Graph Neural Network GNN and Vision Transformer for Small-Sized Datasets (SS-ViT) We select three datasets MSTAR GBSAR and SynthWakeSAR that offer heterogeneity We evaluate and compare the five classifiers concerning their classification accuracy runtime performance in terms of inference throughput and analytical performance in terms of number of parameters number of layers model size and number of operations Experimental results show that the GNN classifier outperforms with respect to throughput and latency However it is also shown that no clear model winner emerges from all of our chosen metrics and a one model rules all case is doubtful in the domain of SAR ATR

CVApr 6, 2024
VTR: An Optimized Vision Transformer for SAR ATR Acceleration on FPGA

Sachini Wickramasinghe, Dhruv Parikh, Bingyi Zhang et al.

Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) is a key technique used in military applications like remote-sensing image recognition. Vision Transformers (ViTs) are the current state-of-the-art in various computer vision applications, outperforming their CNN counterparts. However, using ViTs for SAR ATR applications is challenging due to (1) standard ViTs require extensive training data to generalize well due to their low locality; the standard SAR datasets, however, have a limited number of labeled training data which reduces the learning capability of ViTs; (2) ViTs have a high parameter count and are computation intensive which makes their deployment on resource-constrained SAR platforms difficult. In this work, we develop a lightweight ViT model that can be trained directly on small datasets without any pre-training by utilizing the Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) modules. We directly train this model on SAR datasets which have limited training samples to evaluate its effectiveness for SAR ATR applications. We evaluate our proposed model, that we call VTR (ViT for SAR ATR), on three widely used SAR datasets: MSTAR, SynthWakeSAR, and GBSAR. Further, we propose a novel FPGA accelerator for VTR, in order to enable deployment for real-time SAR ATR applications.

CVApr 4, 2024
FACTUAL: A Novel Framework for Contrastive Learning Based Robust SAR Image Classification

Xu Wang, Tian Ye, Rajgopal Kannan et al.

Deep Learning (DL) Models for Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR), while delivering improved performance, have been shown to be quite vulnerable to adversarial attacks. Existing works improve robustness by training models on adversarial samples. However, by focusing mostly on attacks that manipulate images randomly, they neglect the real-world feasibility of such attacks. In this paper, we propose FACTUAL, a novel Contrastive Learning framework for Adversarial Training and robust SAR classification. FACTUAL consists of two components: (1) Differing from existing works, a novel perturbation scheme that incorporates realistic physical adversarial attacks (such as OTSA) to build a supervised adversarial pre-training network. This network utilizes class labels for clustering clean and perturbed images together into a more informative feature space. (2) A linear classifier cascaded after the encoder to use the computed representations to predict the target labels. By pre-training and fine-tuning our model on both clean and adversarial samples, we show that our model achieves high prediction accuracy on both cases. Our model achieves 99.7% accuracy on clean samples, and 89.6% on perturbed samples, both outperforming previous state-of-the-art methods.

CVJan 5, 2024
PAHD: Perception-Action based Human Decision Making using Explainable Graph Neural Networks on SAR Images

Sasindu Wijeratne, Bingyi Zhang, Rajgopal Kannan et al.

Synthetic Aperture Radar (SAR) images are commonly utilized in military applications for automatic target recognition (ATR). Machine learning (ML) methods, such as Convolutional Neural Networks (CNN) and Graph Neural Networks (GNN), are frequently used to identify ground-based objects, including battle tanks, personnel carriers, and missile launchers. Determining the vehicle class, such as the BRDM2 tank, BMP2 tank, BTR60 tank, and BTR70 tank, is crucial, as it can help determine whether the target object is an ally or an enemy. While the ML algorithm provides feedback on the recognized target, the final decision is left to the commanding officers. Therefore, providing detailed information alongside the identified target can significantly impact their actions. This detailed information includes the SAR image features that contributed to the classification, the classification confidence, and the probability of the identified object being classified as a different object type or class. We propose a GNN-based ATR framework that provides the final classified class and outputs the detailed information mentioned above. This is the first study to provide a detailed analysis of the classification class, making final decisions more straightforward. Moreover, our GNN framework achieves an overall accuracy of 99.2\% when evaluated on the MSTAR dataset, improving over previous state-of-the-art GNN methods.

LGFeb 8, 2024
TASER: Temporal Adaptive Sampling for Fast and Accurate Dynamic Graph Representation Learning

Gangda Deng, Hongkuan Zhou, Hanqing Zeng et al.

Recently, Temporal Graph Neural Networks (TGNNs) have demonstrated state-of-the-art performance in various high-impact applications, including fraud detection and content recommendation. Despite the success of TGNNs, they are prone to the prevalent noise found in real-world dynamic graphs like time-deprecated links and skewed interaction distribution. The noise causes two critical issues that significantly compromise the accuracy of TGNNs: (1) models are supervised by inferior interactions, and (2) noisy input induces high variance in the aggregated messages. However, current TGNN denoising techniques do not consider the diverse and dynamic noise pattern of each node. In addition, they also suffer from the excessive mini-batch generation overheads caused by traversing more neighbors. We believe the remedy for fast and accurate TGNNs lies in temporal adaptive sampling. In this work, we propose TASER, the first adaptive sampling method for TGNNs optimized for accuracy, efficiency, and scalability. TASER adapts its mini-batch selection based on training dynamics and temporal neighbor selection based on the contextual, structural, and temporal properties of past interactions. To alleviate the bottleneck in mini-batch generation, TASER implements a pure GPU-based temporal neighbor finder and a dedicated GPU feature cache. We evaluate the performance of TASER using two state-of-the-art backbone TGNNs. On five popular datasets, TASER outperforms the corresponding baselines by an average of 2.3% in Mean Reciprocal Rank (MRR) while achieving an average of 5.1x speedup in training time.

LGFeb 25, 2025
SPECTRE: An FFT-Based Efficient Drop-In Replacement to Self-Attention for Long Contexts

Jacob Fein-Ashley, Neelesh Gupta, Rajgopal Kannan et al.

Long-context transformers face significant efficiency challenges due to the quadratic cost of self-attention. However, many modern applications-from multi-turn dialogue to high-resolution vision-require contexts spanning tens of thousands of tokens. We introduce SPECTRE, a method that replaces each attention head with a fast real FFT, a content-adaptive spectral gate, and an inverse FFT, reducing per-layer complexity from $\mathcal{O}(L^{2})$ to $O(L\log L)$ while preserving the surrounding architecture. We extend this efficiency to autoregressive generation through our Prefix-FFT cache and enhance local feature representation with an optional wavelet module that adds negligible computational overhead. Our experiments demonstrate that SPECTRE operates up to 7$\times$ faster than FlashAttention-2 on 128k-token contexts while matching or exceeding baseline performance on PG-19 language modeling and ImageNet-1k classification tasks. SPECTRE achieves these improvements by adding fewer than 6\% parameters to the base model, making hundred-kilotoken context processing feasible on commodity GPUs without specialized hardware.

CVMar 27, 2024
Uncertainty-Aware SAR ATR: Defending Against Adversarial Attacks via Bayesian Neural Networks

Tian Ye, Rajgopal Kannan, Viktor Prasanna et al.

Adversarial attacks have demonstrated the vulnerability of Machine Learning (ML) image classifiers in Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) systems. An adversarial attack can deceive the classifier into making incorrect predictions by perturbing the input SAR images, for example, with a few scatterers attached to the on-ground objects. Therefore, it is critical to develop robust SAR ATR systems that can detect potential adversarial attacks by leveraging the inherent uncertainty in ML classifiers, thereby effectively alerting human decision-makers. In this paper, we propose a novel uncertainty-aware SAR ATR for detecting adversarial attacks. Specifically, we leverage the capability of Bayesian Neural Networks (BNNs) in performing image classification with quantified epistemic uncertainty to measure the confidence for each input SAR image. By evaluating the uncertainty, our method alerts when the input SAR image is likely to be adversarially generated. Simultaneously, we also generate visual explanations that reveal the specific regions in the SAR image where the adversarial scatterers are likely to to be present, thus aiding human decision-making with hints of evidence of adversarial attacks. Experiments on the MSTAR dataset demonstrate that our approach can identify over 80% adversarial SAR images with fewer than 20% false alarms, and our visual explanations can identify up to over 90% of scatterers in an adversarial SAR image.

LGDec 28, 2024
Towards Ideal Temporal Graph Neural Networks: Evaluations and Conclusions after 10,000 GPU Hours

Yuxin Yang, Hongkuan Zhou, Rajgopal Kannan et al.

Temporal Graph Neural Networks (TGNNs) have emerged as powerful tools for modeling dynamic interactions across various domains. The design space of TGNNs is notably complex, given the unique challenges in runtime efficiency and scalability raised by the evolving nature of temporal graphs. We contend that many of the existing works on TGNN modeling inadequately explore the design space, leading to suboptimal designs. Viewing TGNN models through a performance-focused lens often obstructs a deeper understanding of the advantages and disadvantages of each technique. Specifically, benchmarking efforts inherently evaluate models in their original designs and implementations, resulting in unclear accuracy comparisons and misleading runtime. To address these shortcomings, we propose a practical comparative evaluation framework that performs a design space search across well-known TGNN modules based on a unified, optimized code implementation. Using our framework, we make the first efforts towards addressing three critical questions in TGNN design, spending over 10,000 GPU hours: (1) investigating the efficiency of TGNN module designs, (2) analyzing how the effectiveness of these modules correlates with dataset patterns, and (3) exploring the interplay between multiple modules. Key outcomes of this directed investigative approach include demonstrating that the most recent neighbor sampling and attention aggregator outperform uniform neighbor sampling and MLP-Mixer aggregator; Assessing static node memory as an effective node memory alternative, and showing that the choice between static or dynamic node memory should be based on the repetition patterns in the dataset. Our in-depth analysis of the interplay between TGNN modules and dataset patterns should provide a deeper insight into TGNN performance along with potential research directions for designing more general and effective TGNNs.

DCApr 10, 2024
GCV-Turbo: End-to-end Acceleration of GNN-based Computer Vision Tasks on FPGA

Bingyi Zhang, Rajgopal Kannan, Carl Busart et al.

Graph neural networks (GNNs) have recently empowered various novel computer vision (CV) tasks. In GNN-based CV tasks, a combination of CNN layers and GNN layers or only GNN layers are employed. This paper introduces GCV-Turbo, a domain-specific accelerator on FPGA for end-to-end acceleration of GNN-based CV tasks. GCV-Turbo consists of two key components: (1) a \emph{novel} hardware architecture optimized for the computation kernels in both CNNs and GNNs using the same set of computation resources. (2) a PyTorch-compatible compiler that takes a user-defined model as input, performs end-to-end optimization for the computation graph of a given GNN-based CV task, and produces optimized code for hardware execution. The hardware architecture and the compiler work synergistically to support a variety of GNN-based CV tasks. We implement GCV-Turbo on a state-of-the-art FPGA and evaluate its performance across six representative GNN-based CV tasks with diverse input data modalities (e.g., image, human skeleton, point cloud). Compared with state-of-the-art CPU (GPU) implementations, GCV-Turbo achieves an average latency reduction of $68.4\times$ ($4.1\times$) on these six GNN-based CV tasks. Moreover, GCV-Turbo supports the execution of the standalone CNNs or GNNs, achieving performance comparable to that of state-of-the-art CNN (GNN) accelerators for widely used CNN-only (GNN-only) models.

CLDec 5, 2025
LYNX: Learning Dynamic Exits for Confidence-Controlled Reasoning

Ömer Faruk Akgül, Yusuf Hakan Kalaycı, Rajgopal Kannan et al.

Large reasoning models achieve strong performance on complex tasks by generating extended chains of thought, but they often "overthink": continuing to reason long after they have enough information to answer correctly. This wastes inference-time compute and can hurt accuracy. Existing attempts to stop early either manipulate decoding with extra sampling and heuristics, rely on auxiliary verifier models, or operate only as post-hoc analysis pipelines without formal guarantees. We introduce LYNX, an online early-exit mechanism that turns a model's own hidden-state awareness into confidence-controlled stopping decisions. LYNX attaches exit decisions to naturally occurring reasoning cues (e.g., "hmm", "wait") during generation, trains a lightweight probe on hidden states at those cue tokens using supervision from forced exits, and wraps the resulting scores in split conformal prediction to obtain distribution-free control over premature exits. Crucially, we train and calibrate this probe once on a generic mathematical corpus and reuse it unchanged across benchmarks, decoding temperatures, and even non-mathematical tasks. Across three model families spanning 1.5B to 32B parameters, a single mathematically trained probe per base model yields strong accuracy--efficiency tradeoffs. On GSM8K, LYNX matches or improves baseline accuracy while reducing tokens by 40--65\%; on MATH-500 it improves accuracy by up to 12 points with roughly 35--60\% fewer tokens; on AIME 2024 it recovers baseline accuracy with more than 50\% token savings; and on CommonsenseQA, a non-math benchmark, it transfers zero-shot with modest accuracy gains and up to 70\% fewer tokens. Compared to state-of-the-art early-exit methods, LYNX offers competitive or superior Pareto frontiers while remaining fully online, requiring no proxy models at inference, and providing explicit, user-tunable confidence guarantees.

LGMay 23, 2025
RECIPE-TKG: From Sparse History to Structured Reasoning for LLM-based Temporal Knowledge Graph Completion

Ömer Faruk Akgül, Feiyu Zhu, Yuxin Yang et al.

Temporal Knowledge Graphs (TKGs) represent dynamic facts as timestamped relations between entities. TKG completion involves forecasting missing or future links, requiring models to reason over time-evolving structure. While LLMs show promise for this task, existing approaches often overemphasize supervised fine-tuning and struggle particularly when historical evidence is limited or missing. We introduce RECIPE-TKG, a lightweight and data-efficient framework designed to improve accuracy and generalization in settings with sparse historical context. It combines (1) rule-based multi-hop retrieval for structurally diverse history, (2) contrastive fine-tuning of lightweight adapters to encode relational semantics, and (3) test-time semantic filtering to iteratively refine generations based on embedding similarity. Experiments on four TKG benchmarks show that RECIPE-TKG outperforms previous LLM-based approaches, achieving up to 30.6\% relative improvement in Hits@10. Moreover, our proposed framework produces more semantically coherent predictions, even for the samples with limited historical context.

LGDec 23, 2024
Contextual Feedback Loops: Amplifying Deep Reasoning with Iterative Top-Down Feedback

Jacob Fein-Ashley, Rajgopal Kannan, Viktor Prasanna

Conventional deep networks rely on one-way backpropagation that overlooks reconciling high-level predictions with lower-level representations. We propose \emph{Contextual Feedback Loops} (CFLs), a lightweight mechanism that re-injects top-down context into earlier layers for iterative refinement. Concretely, CFLs map the network's prediction to a compact \emph{context vector}, which is fused back into each layer via gating adapters. Unrolled over multiple feedback steps, CFLs unify feed-forward and feedback-driven inference, letting top-level outputs continually refine lower-level features. Despite minimal overhead, CFLs yield consistent gains on tasks including CIFAR-10, ImageNet-1k, SpeechCommands, and GLUE SST-2. Moreover, by a Banach Fixed Point argument under mild Lipschitz conditions, these updates converge stably. Overall, CFLs show that even modest top-down feedback can substantially improve deep models, aligning with cognitive theories of iterative perception.

LGNov 27, 2024
Adversarial Training in Low-Label Regimes with Margin-Based Interpolation

Tian Ye, Rajgopal Kannan, Viktor Prasanna

Adversarial training has emerged as an effective approach to train robust neural network models that are resistant to adversarial attacks, even in low-label regimes where labeled data is scarce. In this paper, we introduce a novel semi-supervised adversarial training approach that enhances both robustness and natural accuracy by generating effective adversarial examples. Our method begins by applying linear interpolation between clean and adversarial examples to create interpolated adversarial examples that cross decision boundaries by a controlled margin. This sample-aware strategy tailors adversarial examples to the characteristics of each data point, enabling the model to learn from the most informative perturbations. Additionally, we propose a global epsilon scheduling strategy that progressively adjusts the upper bound of perturbation strengths during training. The combination of these strategies allows the model to develop increasingly complex decision boundaries with better robustness and natural accuracy. Empirical evaluations show that our approach effectively enhances performance against various adversarial attacks, such as PGD and AutoAttack.

LGOct 17, 2024
Conformal Prediction for Federated Graph Neural Networks with Missing Neighbor Information

Ömer Faruk Akgül, Rajgopal Kannan, Viktor Prasanna

Graphs play a crucial role in data mining and machine learning, representing real-world objects and interactions. As graph datasets grow, managing large, decentralized subgraphs becomes essential, particularly within federated learning frameworks. These frameworks face significant challenges, including missing neighbor information, which can compromise model reliability in safety-critical settings. Deployment of federated learning models trained in such settings necessitates quantifying the uncertainty of the models. This study extends the applicability of Conformal Prediction (CP), a well-established method for uncertainty quantification, to federated graph learning. We specifically tackle the missing links issue in distributed subgraphs to minimize its adverse effects on CP set sizes. We discuss data dependencies across the distributed subgraphs and establish conditions for CP validity and precise test-time coverage. We introduce a Variational Autoencoder-based approach for reconstructing missing neighbors to mitigate the negative impact of missing data. Empirical evaluations on real-world datasets demonstrate the efficacy of our approach, yielding smaller prediction sets while ensuring coverage guarantees.

NEDec 23, 2023
Attention, Distillation, and Tabularization: Towards Practical Neural Network-Based Prefetching

Pengmiao Zhang, Neelesh Gupta, Rajgopal Kannan et al.

Attention-based Neural Networks (NN) have demonstrated their effectiveness in accurate memory access prediction, an essential step in data prefetching. However, the substantial computational overheads associated with these models result in high inference latency, limiting their feasibility as practical prefetchers. To close the gap, we propose a new approach based on tabularization that significantly reduces model complexity and inference latency without sacrificing prediction accuracy. Our novel tabularization methodology takes as input a distilled, yet highly accurate attention-based model for memory access prediction and efficiently converts its expensive matrix multiplications into a hierarchy of fast table lookups. As an exemplar of the above approach, we develop DART, a prefetcher comprised of a simple hierarchy of tables. With a modest 0.09 drop in F1-score, DART reduces 99.99% of arithmetic operations from the large attention-based model and 91.83% from the distilled model. DART accelerates the large model inference by 170x and the distilled model by 9.4x. DART has comparable latency and storage costs as state-of-the-art rule-based prefetcher BO but surpasses it by 6.1% in IPC improvement. DART outperforms state-of-the-art NN-based prefetchers TransFetch by 33.1% and Voyager by 37.2% in terms of IPC improvement, primarily due to its low prefetching latency.

CVMay 11, 2023
Graph Neural Network for Accurate and Low-complexity SAR ATR

Bingyi Zhang, Sasindu Wijeratne, Rajgopal Kannan et al.

Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) is the key technique for remote sensing image recognition. The state-of-the-art works exploit the deep convolutional neural networks (CNNs) for SAR ATR, leading to high computation costs. These deep CNN models are unsuitable to be deployed on resource-limited platforms. In this work, we propose a graph neural network (GNN) model to achieve accurate and low-latency SAR ATR. We transform the input SAR image into the graph representation. The proposed GNN model consists of a stack of GNN layers that operates on the input graph to perform target classification. Unlike the state-of-the-art CNNs, which need heavy convolution operations, the proposed GNN model has low computation complexity and achieves comparable high accuracy. The GNN-based approach enables our proposed \emph{input pruning} strategy. By filtering out the irrelevant vertices in the input graph, we can reduce the computation complexity. Moreover, we propose the \emph{model pruning} strategy to sparsify the model weight matrices which further reduces the computation complexity. We evaluate the proposed GNN model on the MSTAR dataset and ship discrimination dataset. The evaluation results show that the proposed GNN model achieves 99.38\% and 99.7\% classification accuracy on the above two datasets, respectively. The proposed pruning strategies can prune 98.6\% input vertices and 97\% weight entries with negligible accuracy loss. Compared with the state-of-the-art CNNs, the proposed GNN model has only 1/3000 computation cost and 1/80 model size.

LGJan 19, 2022
Decoupling the Depth and Scope of Graph Neural Networks

Hanqing Zeng, Muhan Zhang, Yinglong Xia et al.

State-of-the-art Graph Neural Networks (GNNs) have limited scalability with respect to the graph and model sizes. On large graphs, increasing the model depth often means exponential expansion of the scope (i.e., receptive field). Beyond just a few layers, two fundamental challenges emerge: 1. degraded expressivity due to oversmoothing, and 2. expensive computation due to neighborhood explosion. We propose a design principle to decouple the depth and scope of GNNs -- to generate representation of a target entity (i.e., a node or an edge), we first extract a localized subgraph as the bounded-size scope, and then apply a GNN of arbitrary depth on top of the subgraph. A properly extracted subgraph consists of a small number of critical neighbors, while excluding irrelevant ones. The GNN, no matter how deep it is, smooths the local neighborhood into informative representation rather than oversmoothing the global graph into "white noise". Theoretically, decoupling improves the GNN expressive power from the perspectives of graph signal processing (GCN), function approximation (GraphSAGE) and topological learning (GIN). Empirically, on seven graphs (with up to 110M nodes) and six backbone GNN architectures, our design achieves significant accuracy improvement with orders of magnitude reduction in computation and hardware cost.

ARSep 18, 2021
Reconfigurable Low-latency Memory System for Sparse Matricized Tensor Times Khatri-Rao Product on FPGA

Sasindu Wijeratne, Rajgopal Kannan, Viktor Prasanna

Tensor decomposition has become an essential tool in many applications in various domains, including machine learning. Sparse Matricized Tensor Times Khatri-Rao Product (MTTKRP) is one of the most computationally expensive kernels in tensor computations. Despite having significant computational parallelism, MTTKRP is a challenging kernel to optimize due to its irregular memory access characteristics. This paper focuses on a multi-faceted memory system, which explores the spatial and temporal locality of the data structures of MTTKRP. Further, users can reconfigure our design depending on the behavior of the compute units used in the FPGA accelerator. Our system efficiently accesses all the MTTKRP data structures while reducing the total memory access time, using a distributed cache and Direct Memory Access (DMA) subsystem. Moreover, our work improves the memory access time by 3.5x compared with commercial memory controller IPs. Also, our system shows 2x and 1.26x speedups compared with cache-only and DMA-only memory systems, respectively.