LG Apr 15
LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning
Sumeet Ramesh Motwani, Daniel Nichols, Charles London et al.
As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.
CV Jun 30, 2023
Topological Data Analysis Guided Segment Anything Model Prompt Optimization for Zero-Shot Segmentation in Biological Imaging
Ruben Glatt, Shusen Liu
Emerging foundation models in machine learning are models trained on vast amounts of data that have been shown to generalize well to new tasks. Often these models can be prompted with multi-modal inputs ranging from natural language descriptions to images to point clouds. In this paper, we propose topological data analysis (TDA) guided prompt optimization for the Segment Anything Model (SAM) and show preliminary results in the biological image segmentation domain. Our approach replaces the standard grid search used in the original implementation and selects point locations based on their topological significance. Our results show that the TDA-optimized point cloud is much better suited for finding small objects and, despite the extra step, massively reduces computational complexity in scenarios that require many segmentations.
CE Mar 28
Sparse Autoencoders as a Steering Basis for Phase Synchronization in Graph-Based CFD Surrogates
Yeping Hu, Ruben Glatt, Shusen Liu
Graph-based surrogate models provide fast alternatives to high-fidelity CFD solvers, but their opaque latent spaces and limited controllability restrict use in safety-critical settings. A key failure mode in oscillatory flows is phase drift, where predictions remain qualitatively correct but gradually lose temporal alignment with observations, limiting use in digital twins and closed-loop control. Correcting this through retraining is expensive and impractical during deployment. We ask whether phase drift can instead be corrected post hoc by manipulating the latent space of a frozen surrogate. We propose a phase-steering framework for pretrained graph-based CFD models that combines the right representation with the right intervention mechanism. To obtain a disentangled representation for effective steering, we use sparse autoencoders (SAEs) on frozen MeshGraphNet embeddings. To steer dynamics, we move beyond static per-feature interventions such as scaling or clamping, and introduce a temporally coherent, phase-aware method. Specifically, we identify oscillatory feature pairs with Hilbert analysis, project spatial fields into low-rank temporal coefficients via SVD, and apply smooth time-varying rotations to advance or delay periodic modes while preserving amplitude-phase structure. Using a representation-agnostic setup, we compare SAE-based steering with PCA and raw embedding spaces under the same intervention pipeline. Results show that sparse, disentangled representations outperform dense or entangled ones, while static interventions fail in this dynamical setting. Overall, this work shows that latent-space steering can be extended from semantic domains to time-dependent physical systems when interventions respect the underlying dynamics, and that the same sparse features used for interpretability can also serve as physically meaningful control axes.
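The "smooth time-varying rotation" idea in the abstract above can be illustrated on a toy example. The following is a minimal sketch, not the authors' implementation: it assumes an oscillatory pair has already been identified and reduced to two temporal coefficient series, and it uses a smoothstep ramp as one possible smooth schedule. The names `smooth_ramp`, `rotate_pair`, `max_shift`, `t_start`, and `t_end` are hypothetical.

```python
import math

def smooth_ramp(t, t_start, t_end):
    """Smoothstep ramp: 0 before t_start, 1 after t_end, smooth in between."""
    if t <= t_start:
        return 0.0
    if t >= t_end:
        return 1.0
    s = (t - t_start) / (t_end - t_start)
    return s * s * (3.0 - 2.0 * s)

def rotate_pair(a, b, times, max_shift, t_start, t_end):
    """Rotate the (a, b) coefficient pair by a smoothly time-varying angle.

    A 2D rotation changes the phase of the pair but leaves its
    amplitude sqrt(a^2 + b^2) unchanged at every time step.
    """
    out_a, out_b = [], []
    for a_t, b_t, t in zip(a, b, times):
        theta = max_shift * smooth_ramp(t, t_start, t_end)
        c, s = math.cos(theta), math.sin(theta)
        out_a.append(c * a_t - s * b_t)
        out_b.append(s * a_t + c * b_t)
    return out_a, out_b

# Toy oscillatory pair (assumed data): unit-amplitude cosine and sine.
times = [0.05 * k for k in range(200)]
omega = 2.0
a = [math.cos(omega * t) for t in times]
b = [math.sin(omega * t) for t in times]

# Advance the phase by pi/4, ramped in between t = 2 and t = 6.
a_new, b_new = rotate_pair(a, b, times, max_shift=math.pi / 4,
                           t_start=2.0, t_end=6.0)
```

Because the intervention is a rotation, the pair's amplitude is preserved while its phase is advanced (positive `max_shift`) or delayed (negative `max_shift`); a static scaling or clamping of a single feature has no such property.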
NE Oct 29, 2021
Symbolic Regression via Neural-Guided Genetic Programming Population Seeding
T. Nathan Mundhenk, Mikel Landajuela, Ruben Glatt et al.
Symbolic regression is the process of identifying mathematical expressions that fit observed output from a black-box process. It is a discrete optimization problem generally believed to be NP-hard. Prior approaches to solving the problem include neural-guided search (e.g., using reinforcement learning) and genetic programming. In this work, we introduce a hybrid neural-guided/genetic programming approach to symbolic regression and other combinatorial optimization problems. We propose a neural-guided component used to seed the starting population of a random-restart genetic programming component, gradually learning better starting populations. On a number of common benchmark tasks to recover underlying expressions from a dataset, our method recovers 65% more expressions than a recently published top-performing model using the same experimental setup. We demonstrate that running many genetic programming generations without interdependence on the neural-guided component performs better for symbolic regression than alternative formulations where the two are more strongly coupled. Finally, we introduce a new set of 22 symbolic regression benchmark problems with increased difficulty over existing benchmarks. Source code is provided at www.github.com/brendenpetersen/deep-symbolic-optimization.
CV Oct 2, 2025
Oracle-RLAIF: An Improved Fine-Tuning Framework for Multi-modal Video Models through Reinforcement Learning from Ranking Feedback
Derek Shi, Ruben Glatt, Christine Klymko et al.
Recent advances in large video-language models (VLMs) rely on extensive fine-tuning techniques that strengthen alignment between textual and visual comprehension. Leading pipelines typically pair supervised fine-tuning (SFT) with reinforcement learning from preference data to enhance video comprehension. However, as VLMs scale in parameter size, so does the cost of gathering enough human feedback. To make fine-tuning more cost-effective, recent frameworks explore reinforcement learning with AI feedback (RLAIF), which replaces human preference with AI as a judge. Current RLAIF frameworks rely on a specialized reward model trained with video narratives to create calibrated scalar rewards -- an expensive and restrictive pipeline. We propose Oracle-RLAIF, a novel framework that replaces the trained reward model with a more general Oracle ranker, a drop-in model that ranks candidate responses rather than scoring them. Alongside Oracle-RLAIF, we introduce $GRPO_{rank}$, a novel rank-based loss function based on Group Relative Policy Optimization (GRPO) that directly optimizes ordinal feedback with rank-aware advantages. Empirically, we demonstrate that Oracle-RLAIF consistently outperforms leading VLMs using existing fine-tuning methods when evaluated across various video comprehension benchmarks. Oracle-RLAIF paves the path to creating flexible and data-efficient frameworks for aligning large multi-modal video models with reinforcement learning from rank rather than score.
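To make the contrast between ranking and scoring concrete, here is a minimal sketch of how ordinal feedback could be turned into group-relative advantages. It is an illustration only and is not the $GRPO_{rank}$ loss from the paper, whose exact form the abstract does not give. It assumes a simple linear rank-to-score mapping followed by the within-group mean and standard deviation normalization that GRPO applies to rewards; the function name `rank_advantages` is hypothetical.

```python
import statistics

def rank_advantages(ranks):
    """Convert ranks within one group (1 = best) into normalized advantages.

    Assumption: rank r among n candidates maps to the score n - r, so
    better-ranked responses get higher scores. Scores are then centered
    and scaled within the group, as GRPO does with scalar rewards.
    """
    n = len(ranks)
    scores = [float(n - r) for r in ranks]
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0.0:
        # All candidates tied: no preference signal in this group.
        return [0.0] * n
    return [(s - mean) / std for s in scores]

# Three candidate responses; the ranker prefers the second, then the first.
advantages = rank_advantages([2, 1, 3])
```

Only the ordering of candidates enters the computation, so the ranker never needs to produce calibrated scalar rewards.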
IR Jul 23, 2025
VERIRAG: Healthcare Claim Verification via Statistical Audit in Retrieval-Augmented Generation
Shubham Mohole, Hongjun Choi, Shusen Liu et al.
Retrieval-augmented generation (RAG) systems are increasingly adopted in clinical decision support, yet they remain methodologically blind: they retrieve evidence but cannot vet its scientific quality. A paper claiming "Antioxidant proteins decreased after alloferon treatment" and a rigorous multi-laboratory replication study will be treated as equally credible, even if the former lacked scientific rigor or was even retracted. To address this challenge, we introduce VERIRAG, a framework that makes three notable contributions: (i) the Veritable, an 11-point checklist that evaluates each source for methodological rigor, including data integrity and statistical validity; (ii) a Hard-to-Vary (HV) Score, a quantitative aggregator that weights evidence by its quality and diversity; and (iii) a Dynamic Acceptance Threshold, which calibrates the required evidence based on how extraordinary a claim is. Across four datasets (comprising retracted, conflicting, comprehensive, and settled science corpora), the VERIRAG approach consistently outperforms all baselines, achieving absolute F1 scores ranging from 0.53 to 0.65, representing a 10 to 14 point improvement over the next-best method in each respective dataset. We will release all materials necessary for reproducing our results.
LG Jun 29, 2024
Enhancing Accuracy and Parameter-Efficiency of Neural Representations for Network Parameterization
Hongjun Choi, Jayaraman J. Thiagarajan, Ruben Glatt et al.
In this work, we investigate the fundamental trade-off between accuracy and parameter efficiency in the parameterization of neural network weights using predictor networks. We present a surprising finding that, when recovering the original model accuracy is the sole objective, it can be achieved effectively through the weight reconstruction objective alone. Additionally, we explore the underlying factors for improving weight reconstruction under parameter-efficiency constraints, and propose a novel training scheme that decouples the reconstruction objective from auxiliary objectives such as knowledge distillation, which leads to significant improvements compared to state-of-the-art approaches. Finally, these results pave the way for more practical scenarios, where one needs to achieve improvements on both model accuracy and predictor network parameter-efficiency simultaneously.
LG Jul 19, 2021
Improving exploration in policy gradient search: Application to symbolic optimization
Mikel Landajuela, Brenden K. Petersen, Soo K. Kim et al.
Many machine learning strategies designed to automate mathematical tasks leverage neural networks to search large combinatorial spaces of mathematical symbols. In contrast to traditional evolutionary approaches, using a neural network at the core of the search allows learning higher-level symbolic patterns, providing an informed direction to guide the search. When no labeled data is available, such networks can still be trained using reinforcement learning. However, we demonstrate that this approach can suffer from an early commitment phenomenon and from initialization bias, both of which limit exploration. We present two exploration methods to tackle these issues, building upon ideas of entropy regularization and distribution initialization. We show that these techniques can improve the performance, increase sample efficiency, and lower the complexity of solutions for the task of symbolic regression.
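For readers unfamiliar with entropy regularization, the idea the abstract above builds on, the following minimal sketch shows the standard form: an entropy bonus subtracted from a policy-gradient loss for a single categorical action. It is a generic textbook formulation, not the two exploration methods proposed in the paper; the names `regularized_loss` and `beta` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def regularized_loss(logits, action, advantage, beta):
    """Policy-gradient loss for one sampled action, minus an entropy bonus.

    Minimizing this loss raises the log-probability of actions with
    positive advantage while rewarding high-entropy (more exploratory)
    policies in proportion to beta.
    """
    probs = softmax(logits)
    pg_term = -advantage * math.log(probs[action])
    return pg_term - beta * entropy(probs)

# Uniform policy over four symbols: entropy equals log(4).
uniform_entropy = entropy(softmax([0.0, 0.0, 0.0, 0.0]))
```

A larger `beta` penalizes policies that collapse onto a few symbols early, which is the kind of premature commitment the abstract describes.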
MA Feb 1, 2021
Hybrid Information-driven Multi-agent Reinforcement Learning
William A. Dawson, Ruben Glatt, Edward Rusu et al.
Information-theoretic sensor management approaches are an ideal solution to state estimation problems when considering the optimal control of multi-agent systems; however, they are too computationally intensive for large state spaces, especially given the limited computational resources typical of large-scale distributed multi-agent systems. Reinforcement learning (RL) is a promising alternative that can find approximate solutions to distributed optimal control problems while taking into account the resource constraints inherent in many systems of distributed agents. However, RL training can be prohibitively inefficient, especially in low-information environments where agents receive little to no feedback in large portions of the state space. We propose a hybrid information-driven multi-agent reinforcement learning (MARL) approach that utilizes information-theoretic models as heuristics to help the agents navigate large sparse state spaces, coupled with information-based rewards in an RL framework to learn higher-level policies. This paper presents our ongoing work towards this objective. Our preliminary findings show that such an approach can result in a system of agents that are approximately three orders of magnitude more efficient at exploring a sparse state space than naive baseline metrics. While the work is still in its early stages, it provides a promising direction for future research.