86.9LGJun 3
Reinforcement Learning from Rich Feedback with Distributional DAggerRishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad
Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a blackbox expert and whose sequence-level gradient {conduct rich credit assignment by propagating} future expert-student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen-Shannon fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase probability on worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
78.0CVApr 23Code
Latent Denoising Improves Visual Alignment in Large Multimodal ModelsDhruv Parikh, Jacob Fein-Ashley, Rajgopal Kannan et al.
Large Multimodal Models (LMMs) such as LLaVA are typically trained with an autoregressive language modeling objective, providing only indirect supervision to visual tokens. This often yields weak internal visual representations and brittle behavior under distribution shift. Inspired by recent progress on latent denoising for learning high-quality visual tokenizers, we show that the same principle provides an effective form of visual supervision for improving internal visual feature alignment and multimodal understanding in LMMs. We propose a latent denoising framework that corrupts projected visual tokens using a saliency-aware mixture of masking and Gaussian noising. The LMM is trained to denoise these corrupted tokens by recovering clean teacher patch features from hidden states at a selected intermediate LLM layer using a decoder. To prevent representation collapse, our framework also preserves the teacher's intra-image similarity structure and applies intra-image contrastive patch distillation. During inference, corruption and auxiliary heads are disabled, introducing no additional inference-time overhead. Across a broad suite of standard multimodal benchmarks, our method consistently improves visual understanding and reasoning over strong baselines, and yields clear gains on compositional robustness benchmarks (e.g., NaturalBench). Moreover, under ImageNet-C-style non-adversarial common corruptions applied to benchmark images, our method maintains higher accuracy and exhibits reduced degradation at both moderate and severe corruption levels. Our code is available at https://github.com/dhruvashp/latent-denoising-for-lmms.
CVNov 14, 2025Code
Bridging Hidden States in Vision-Language ModelsBenjamin Fein-Ashley, Jacob Fein-Ashley
Vision-Language Models (VLMs) are a new family of models that align image content with natural language. Existing approaches typically fuse either (a) early: by mixing tokens/features inside the encoders, or (b) late: by comparing pooled embeddings. Many methods also tie fusion to an autoregressive decoder. However, the hidden states of both modalities already carry rich, modality-specific structure (spatial layout in vision; syntax and semantics in text), so directly aligning these states is a natural way to match what the two modalities "think". We propose a lightweight fusion module: a few cross-only, bidirectional attention layers placed near the top of both encoders. Each layer projects the vision and text encoder hidden-state sequences into a shared space, attends across modalities, and sends gated residual updates back, with simple stabilizers to improve alignment. The encoders remain non-causal and strong for understanding, while generation stays cleanly decoupled via an optional decoder. Across standard retrieval, VQA, and visual reasoning benchmarks, BRIDGE outperforms comparable VLMs while preserving the bi-encoder efficiency of contrastive models. We make our code publicly available at https://github.com/jfeinashley/BRIDGE.
CVAug 31, 2024
Studying the Effects of Self-Attention on SAR Automatic Target RecognitionJacob Fein-Ashley, Rajgopal Kannan, Viktor Prasanna
Attention mechanisms are critically important in the advancement of synthetic aperture radar (SAR) automatic target recognition (ATR) systems. Traditional SAR ATR models often struggle with the noisy nature of the SAR data, frequently learning from background noise rather than the most relevant image features. Attention mechanisms address this limitation by focusing on crucial image components, such as the shadows and small parts of a vehicle, which are crucial for accurate target classification. By dynamically prioritizing these significant features, attention-based models can efficiently characterize the entire image with a few pixels, thus enhancing recognition performance. This capability allows for the discrimination of targets from background clutter, leading to more practical and robust SAR ATR models. We show that attention modules increase top-1 accuracy, improve input robustness, and are qualitatively more explainable on the MSTAR dataset.
CVSep 25, 2024
HVT: A Comprehensive Vision Framework for Learning in Non-Euclidean SpaceJacob Fein-Ashley, Ethan Feng, Minh Pham
Data representation in non-Euclidean spaces has proven effective for capturing hierarchical and complex relationships in real-world datasets. Hyperbolic spaces, in particular, provide efficient embeddings for hierarchical structures. This paper introduces the Hyperbolic Vision Transformer (HVT), a novel extension of the Vision Transformer (ViT) that integrates hyperbolic geometry. While traditional ViTs operate in Euclidean space, our method enhances the self-attention mechanism by leveraging hyperbolic distance and Möbius transformations. This enables more effective modeling of hierarchical and relational dependencies in image data. We present rigorous mathematical formulations, showing how hyperbolic geometry can be incorporated into attention layers, feed-forward networks, and optimization. We offer improved performance for image classification using the ImageNet dataset.
99.2LGMay 12
Solve the Loop: Attractor Models for Language and ReasoningJacob Fein-Ashley, Paria Rashidinejad
Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to optimize and deploy, and constrained to small, fixed recurrence depths. We introduce Attractor Models, in which a backbone module first proposes output embeddings, then an attractor module refines them by solving for the fixed point, with gradients obtained through implicit differentiation. Thus, training memory remains constant in effective depth, and iterations are chosen adaptively by convergence. Empirically, Attractor Models outperform existing models across two regimes, large-scale language-model pretraining and reasoning with tiny models. In language modeling, Attractor Models deliver a Pareto improvement over standard Transformers and stable looped models across sizes, improving perplexity by up to 46.6% and downstream accuracy by up to 19.7% while reducing training cost. Notably, a 770M Attractor Model outperforms a 1.3B Transformer trained on twice as many tokens. On challenging reasoning tasks, we show that our model with only 27M parameters and approximately 1000 examples achieves 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard, scaling favorably where frontier models like Claude and GPT o3, fail completely, and specialized recursive reasoners collapse at larger sizes. Lastly, we show that Attractor Models exhibit a novel phenomenon, which we call equilibrium internalization: fixed-point training places the model's initial output embedding near equilibrium, allowing the solver to be removed at inference time with little degradation. Together, these results suggest that Attractor Models make iterative refinement scalable by turning recurrence into a computation the model can learn to internalize.
LGSep 25, 2025Code
Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just What They SayJacob Fein-Ashley, Dhruv Parikh, Rajgopal Kannan et al.
Open-source Large Language Models (LLMs) increasingly specialize by domain (e.g., math, code, general reasoning), motivating systems that leverage complementary strengths across models. Prior multi-LLM approaches either (i) route a query to one or a few experts and generate independently, (ii) aggregate outputs from each model via costly multi-turn exchanges, or (iii) fuse weights into a single model-typically requiring architectural homogeneity. We introduce Mixture of Thoughts (MoT), a simple method for latent-level collaboration among heterogeneous experts under a global routing scheme. For each query, a lightweight router selects top-$K$ experts and designates a primary expert; uniformly placed interaction layers project hidden states into a shared latent space where the primary expert performs cross-attention over its active (selected) peers. Pre-trained experts remain frozen; only the router and the lightweight interaction layers are trained with a novel joint training objective that improves both the expert selection and inter-expert collaboration. Across five in-distribution (ID) and three out-of-distribution (OOD) benchmarks, MoT surpasses the current routing and aggregation-based state-of-the-art, Avengers, by $+0.38\%$ and $+2.92\%$, respectively. Further, MoT significantly outperforms the best-performing single model. It achieves this with single-pass inference, runtime comparable to routing baselines, and none of the overheads of iterative aggregation. MoT offers a simple latent-space mechanism for combining heterogeneous LLMs, a practical step toward broader multi-LLM collaboration. Our code is publicly available at https://github.com/jacobfa/mot.
CVJan 18, 2025
ClusterViG: Efficient Globally Aware Vision GNNs via Image PartitioningDhruv Parikh, Jacob Fein-Ashley, Tian Ye et al.
Convolutional Neural Networks (CNN) and Vision Transformers (ViT) have dominated the field of Computer Vision (CV). Graph Neural Networks (GNN) have performed remarkably well across diverse domains because they can represent complex relationships via unstructured graphs. However, the applicability of GNNs for visual tasks was unexplored till the introduction of Vision GNNs (ViG). Despite the success of ViGs, their performance is severely bottlenecked due to the expensive $k$-Nearest Neighbors ($k$-NN) based graph construction. Recent works addressing this bottleneck impose constraints on the flexibility of GNNs to build unstructured graphs, undermining their core advantage while introducing additional inefficiencies. To address these issues, in this paper, we propose a novel method called Dynamic Efficient Graph Convolution (DEGC) for designing efficient and globally aware ViGs. DEGC partitions the input image and constructs graphs in parallel for each partition, improving graph construction efficiency. Further, DEGC integrates local intra-graph and global inter-graph feature learning, enabling enhanced global context awareness. Using DEGC as a building block, we propose a novel CNN-GNN architecture, ClusterViG, for CV tasks. Extensive experiments indicate that ClusterViG reduces end-to-end inference latency for vision tasks by up to $5\times$ when compared against a suite of models such as ViG, ViHGNN, PVG, and GreedyViG, with a similar model parameter count. Additionally, ClusterViG reaches state-of-the-art performance on image classification, object detection, and instance segmentation tasks, demonstrating the effectiveness of the proposed globally aware learning strategy. Finally, input partitioning performed by DEGC enables ClusterViG to be trained efficiently on higher-resolution images, underscoring the scalability of our approach.
CVFeb 1, 2024
A Single Graph Convolution Is All You Need: Efficient Grayscale Image ClassificationJacob Fein-Ashley, Sachini Wickramasinghe, Bingyi Zhang et al.
Image classifiers for domain-specific tasks like Synthetic Aperture Radar Automatic Target Recognition (SAR ATR) and chest X-ray classification often rely on convolutional neural networks (CNNs). These networks, while powerful, experience high latency due to the number of operations they perform, which can be problematic in real-time applications. Many image classification models are designed to work with both RGB and grayscale datasets, but classifiers that operate solely on grayscale images are less common. Grayscale image classification has critical applications in fields such as medical imaging and SAR ATR. In response, we present a novel grayscale image classification approach using a vectorized view of images. By leveraging the lightweight nature of Multi-Layer Perceptrons (MLPs), we treat images as vectors, simplifying the problem to grayscale image classification. Our approach incorporates a single graph convolutional layer in a batch-wise manner, enhancing accuracy and reducing performance variance. Additionally, we develop a customized accelerator on FPGA for our model, incorporating several optimizations to improve performance. Experimental results on benchmark grayscale image datasets demonstrate the effectiveness of our approach, achieving significantly lower latency (up to $16\times$ less on MSTAR) and competitive or superior performance compared to state-of-the-art models for SAR ATR and medical image classification.
CVDec 12, 2023
Benchmarking Deep Learning Classifiers for SAR Automatic Target RecognitionJacob Fein-Ashley, Tian Ye, Rajgopal Kannan et al.
Synthetic Aperture Radar SAR Automatic Target Recognition ATR is a key technique of remote-sensing image recognition which can be supported by deep neural networks The existing works of SAR ATR mostly focus on improving the accuracy of the target recognition while ignoring the systems performance in terms of speed and storage which is critical to real-world applications of SAR ATR For decision-makers aiming to identify a proper deep learning model to deploy in a SAR ATR system it is important to understand the performance of different candidate deep learning models and determine the best model accordingly This paper comprehensively benchmarks several advanced deep learning models for SAR ATR with multiple distinct SAR imagery datasets Specifically we train and test five SAR image classifiers based on Residual Neural Networks ResNet18 ResNet34 ResNet50 Graph Neural Network GNN and Vision Transformer for Small-Sized Datasets (SS-ViT) We select three datasets MSTAR GBSAR and SynthWakeSAR that offer heterogeneity We evaluate and compare the five classifiers concerning their classification accuracy runtime performance in terms of inference throughput and analytical performance in terms of number of parameters number of layers model size and number of operations Experimental results show that the GNN classifier outperforms with respect to throughput and latency However it is also shown that no clear model winner emerges from all of our chosen metrics and a one model rules all case is doubtful in the domain of SAR ATR
LGFeb 25, 2025
SPECTRE: An FFT-Based Efficient Drop-In Replacement to Self-Attention for Long ContextsJacob Fein-Ashley, Neelesh Gupta, Rajgopal Kannan et al.
Long-context transformers face significant efficiency challenges due to the quadratic cost of self-attention. However, many modern applications-from multi-turn dialogue to high-resolution vision-require contexts spanning tens of thousands of tokens. We introduce SPECTRE, a method that replaces each attention head with a fast real FFT, a content-adaptive spectral gate, and an inverse FFT, reducing per-layer complexity from $\mathcal{O}(L^{2})$ to $O(L\log L)$ while preserving the surrounding architecture. We extend this efficiency to autoregressive generation through our Prefix-FFT cache and enhance local feature representation with an optional wavelet module that adds negligible computational overhead. Our experiments demonstrate that SPECTRE operates up to 7$\times$ faster than FlashAttention-2 on 128k-token contexts while matching or exceeding baseline performance on PG-19 language modeling and ImageNet-1k classification tasks. SPECTRE achieves these improvements by adding fewer than 6\% parameters to the base model, making hundred-kilotoken context processing feasible on commodity GPUs without specialized hardware.
LGFeb 8, 2025
Flowing Through Layers: A Continuous Dynamical Systems Perspective on TransformersJacob Fein-Ashley
We show that the standard discrete update rule of transformer layers can be naturally interpreted as a forward Euler discretization of a continuous dynamical system. Our Transformer Flow Approximation Theorem demonstrates that, under standard Lipschitz continuity assumptions, token representations converge uniformly to the unique solution of an ODE as the number of layers grows. Moreover, if the underlying mapping satisfies a one-sided Lipschitz condition with a negative constant, the resulting dynamics are contractive, causing perturbations to decay exponentially across layers. Beyond clarifying the empirical stability and expressivity of transformer models, these insights link transformer updates to a broader iterative reasoning framework, suggesting new avenues for accelerated convergence and architectural innovations inspired by dynamical systems theory.
CVDec 2, 2024
Diffusion Models with Anisotropic Gaussian Splatting for Image InpaintingJacob Fein-Ashley, Benjamin Fein-Ashley
Image inpainting is a fundamental task in computer vision, aiming to restore missing or corrupted regions in images realistically. While recent deep learning approaches have significantly advanced the state-of-the-art, challenges remain in maintaining structural continuity and generating coherent textures, particularly in large missing areas. Diffusion models have shown promise in generating high-fidelity images but often lack the structural guidance necessary for realistic inpainting. We propose a novel inpainting method that combines diffusion models with anisotropic Gaussian splatting to capture both local structures and global context effectively. By modeling missing regions using anisotropic Gaussian functions that adapt to local image gradients, our approach provides structural guidance to the diffusion-based inpainting network. The Gaussian splat maps are integrated into the diffusion process, enhancing the model's ability to generate high-fidelity and structurally coherent inpainting results. Extensive experiments demonstrate that our method outperforms state-of-the-art techniques, producing visually plausible results with enhanced structural integrity and texture realism.
LGFeb 6, 2025
Iterate to Accelerate: A Unified Framework for Iterative Reasoning and Feedback ConvergenceJacob Fein-Ashley
We introduce a unified framework for iterative reasoning that leverages non-Euclidean geometry via Bregman divergences, higher-order operator averaging, and adaptive feedback mechanisms. Our analysis establishes that, under mild smoothness and contractivity assumptions, a generalized update scheme not only unifies classical methods such as mirror descent and dynamic programming but also captures modern chain-of-thought reasoning processes in large language models. In particular, we prove that our accelerated iterative update achieves an $O(1/t^2)$ convergence rate in the absence of persistent perturbations, and we further demonstrate that feedback (iterative) architectures are necessary to approximate certain fixed-point functions efficiently. These theoretical insights bridge classical acceleration techniques with contemporary applications in neural computation and optimization.
LGFeb 17, 2025
Linear Diffusion NetworksJacob Fein-Ashley
We present Linear Diffusion Networks (LDNs), a novel architecture that reinterprets sequential data processing as a unified diffusion process. Our model integrates adaptive diffusion modules with localized nonlinear updates and a diffusion-inspired attention mechanism. This design enables efficient global information propagation while preserving fine-grained temporal details. LDN overcomes the limitations of conventional recurrent and transformer models by allowing full parallelization across time steps and supporting robust multi-scale temporal representations. Experiments on benchmark sequence modeling tasks demonstrate that LDN delivers competitive performance across ImageNet and LRA tasks.
LGDec 23, 2024
Contextual Feedback Loops: Amplifying Deep Reasoning with Iterative Top-Down FeedbackJacob Fein-Ashley, Rajgopal Kannan, Viktor Prasanna
Conventional deep networks rely on one-way backpropagation that overlooks reconciling high-level predictions with lower-level representations. We propose \emph{Contextual Feedback Loops} (CFLs), a lightweight mechanism that re-injects top-down context into earlier layers for iterative refinement. Concretely, CFLs map the network's prediction to a compact \emph{context vector}, which is fused back into each layer via gating adapters. Unrolled over multiple feedback steps, CFLs unify feed-forward and feedback-driven inference, letting top-level outputs continually refine lower-level features. Despite minimal overhead, CFLs yield consistent gains on tasks including CIFAR-10, ImageNet-1k, SpeechCommands, and GLUE SST-2. Moreover, by a Banach Fixed Point argument under mild Lipschitz conditions, these updates converge stably. Overall, CFLs show that even modest top-down feedback can substantially improve deep models, aligning with cognitive theories of iterative perception.
CPApr 17, 2024
A Comparison of Traditional and Deep Learning Methods for Parameter Estimation of the Ornstein-Uhlenbeck ProcessJacob Fein-Ashley
We consider the Ornstein-Uhlenbeck (OU) process, a stochastic process widely used in finance, physics, and biology. Parameter estimation of the OU process is a challenging problem. Thus, we review traditional tracking methods and compare them with novel applications of deep learning to estimate the parameters of the OU process. We use a multi-layer perceptron to estimate the parameters of the OU process and compare its performance with traditional parameter estimation methods, such as the Kalman filter and maximum likelihood estimation. We find that the multi-layer perceptron can accurately estimate the parameters of the OU process given a large dataset of observed trajectories and, on average, outperforms traditional parameter estimation methods.