Eshed Gal

LG
h-index49
7papers
3citations
Novelty62%
AI Score54

7 Papers

LGMar 13
Probabilistic Gaussian Homotopy: A Probability-Space Continuation Framework for Nonconvex Optimization

Eshed Gal, Samy Wu Fung, Eldad Haber

We introduce Probabilistic Gaussian Homotopy (PGH), a probability-space continuation framework for nonconvex optimization. Unlike classical Gaussian homotopy, which smooths the objective and uniformly averages gradients, PGH deforms the associated Boltzmann distribution and induces Boltzmann-weighted aggregation of perturbed gradients, which exponentially biases descent directions toward low-energy regions. We show that PGH corresponds to a log-sum-exp (soft-min) homotopy that smooths a nonconvex objective at scale $λ>0$ and recovers the original objective as $λ\to 0$, yielding a posterior-mean generalization of the Moreau envelope, and we derive a dynamical system governing minimizer evolution along an annealed homotopy path. This establishes a principled connection between Gaussian continuation, Bayesian denoising, and diffusion-style smoothing. We further propose Probabilistic Gaussian Homotopy Optimization (PGHO), a practical stochastic algorithm based on Monte Carlo gradient estimation, and demonstrate strong performance on high-dimensional nonconvex benchmarks and sparse recovery problems where classical gradient methods and objective-space smoothing frequently fail.

LGMar 2
Preconditioned Score and Flow Matching

Shadab Ahamed, Eshed Gal, Simon Ghyselincks et al.

Flow matching and score-based diffusion train vector fields under intermediate distributions $p_t$, whose geometry can strongly affect their optimization. We show that the covariance $Σ_t$ of $p_t$ governs optimization bias: when $Σ_t$ is ill-conditioned, and gradient-based training rapidly fits high-variance directions while systematically under-optimizing low-variance modes, leading to learning that plateaus at suboptimal weights. We formalize this effect in analytically tractable settings and propose reversible, label-conditional \emph{preconditioning} maps that reshape the geometry of $p_t$ by improving the conditioning of $Σ_t$ without altering the underlying generative model. Rather than accelerating early convergence, preconditioning primarily mitigates optimization stagnation by enabling continued progress along previously suppressed directions. Across MNIST latent flow matching, and additional high-resolution datasets, we empirically track conditioning diagnostics and distributional metrics and show that preconditioning consistently yields better-trained models by avoiding suboptimal plateaus.

LGMar 14
PDE-SSM: A Spectral State Space Approach to Spatial Mixing in Diffusion Transformers

Eshed Gal, Moshe Eliasof, Siddharth Rout et al.

The success of vision transformers-especially for generative modeling-is limited by the quadratic cost and weak spatial inductive bias of self-attention. We propose PDE-SSM, a spatial state-space block that replaces attention with a learnable convection-diffusion-reaction partial differential equation. This operator encodes a strong spatial prior by modeling information flow via physically grounded dynamics rather than all-to-all token interactions. Solving the PDE in the Fourier domain yields global coupling with near-linear complexity of $O(N \log N)$, delivering a principled and scalable alternative to attention. We integrate PDE-SSM into a flow-matching generative model to obtain the PDE-based Diffusion Transformer PDE-SSM-DiT. Empirically, PDE-SSM-DiT matches or exceeds the performance of state-of-the-art Diffusion Transformers while substantially reducing compute. Our results show that, analogous to 1D settings where SSMs supplant attention, multi-dimensional PDE operators provide an efficient, inductive-bias-rich foundation for next-generation vision models.

LGMay 7
Conservative Flows: A New Paradigm of Generative Models

Eshed Gal, Md Shahriar Rahim Siddiqui, Moshe Eliasof et al.

Modern generative modeling is dominated by transport from a noise prior to data. We propose an alternative paradigm in which generation is performed by a discrete stochastic dynamics that leaves the data distribution invariant, initialized from data-supported states rather than from noise. The framework can utilize any pretrained flow model. We develop two probability-preserving sampling mechanisms, a corrected Langevin dynamics with a Metropolis adjustment and a predictor-corrector flow, that operate directly on existing checkpoints. We validate the framework on a synthetic Swiss-roll target, ImageNet-256 and Oxford Flowers-102, where our samplers consistently improve over the original generation procedures.

LGMay 7
Target-Aware Data Augmentation for SAT Prediction

Eshed Gal, Uri Ascher, Eldad Haber

Learning-based approaches to NP-hard problems have shown increasing promise, but their progress is fundamentally constrained by the high cost of generating labeled training data. In domains such as Boolean satisfiability (SAT), standard pipelines rely on solver-in-the-loop labeling, which scales poorly with problem size and limits the amount of usable supervision. This bottleneck hinders the broader goal of leveraging machine learning to capture structure in hard combinatorial problems. In this work, we propose a target-aware, solver-free data generation framework for SAT that produces correctly labeled SAT and UNSAT instances by construction, eliminating the need for expensive solver calls. Our method aligns generated instances with the structural properties of a target benchmark, making synthetic data effective for downstream learning. We further develop a linear-programming-aware graph neural network (LPGNN) architecture that incorporates constraint-violation residuals into message passing, enabling the model to exploit underlying optimization structure. Together, these contributions support a data-centric paradigm for learning on NP-hard problems, where scalable, task-aligned data generation is as critical as model design. Our approach yields orders-of-magnitude speedups in data generation, demonstrating that benchmark-aligned synthetic data can effectively augment solver-labeled datasets for GNN-based SAT prediction.

LGMar 25, 2025
Towards Efficient Training of Graph Neural Networks: A Multiscale Approach

Eshed Gal, Moshe Eliasof, Carola-Bibiane Schönlieb et al.

Graph Neural Networks (GNNs) have become powerful tools for learning from graph-structured data, finding applications across diverse domains. However, as graph sizes and connectivity increase, standard GNN training methods face significant computational and memory challenges, limiting their scalability and efficiency. In this paper, we present a novel framework for efficient multiscale training of GNNs. Our approach leverages hierarchical graph representations and subgraphs, enabling the integration of information across multiple scales and resolutions. By utilizing coarser graph abstractions and subgraphs, each with fewer nodes and edges, we significantly reduce computational overhead during training. Building on this framework, we propose a suite of scalable training strategies, including coarse-to-fine learning, subgraph-to-full-graph transfer, and multiscale gradient computation. We also provide some theoretical analysis of our methods and demonstrate their effectiveness across various datasets and learning tasks. Our results show that multiscale training can substantially accelerate GNN training for large scale problems while maintaining, or even improving, predictive performance.

CLNov 27, 2025
Reversing Large Language Models for Efficient Training and Fine-Tuning

Eshed Gal, Moshe Eliasof, Javier Turek et al.

Large Language Models (LLMs) are known for their expensive and time-consuming training. Thus, oftentimes, LLMs are fine-tuned to address a specific task, given the pretrained weights of a pre-trained LLM considered a foundation model. In this work, we introduce memory-efficient, reversible architectures for LLMs, inspired by symmetric and symplectic differential equations, and investigate their theoretical properties. Different from standard, baseline architectures that store all intermediate activations, the proposed models use time-reversible dynamics to retrieve hidden states during backpropagation, relieving the need to store activations. This property allows for a drastic reduction in memory consumption, allowing for the processing of larger batch sizes for the same available memory, thereby offering improved throughput. In addition, we propose an efficient method for converting existing, non-reversible LLMs into reversible architectures through fine-tuning, rendering our approach practical for exploiting existing pre-trained models. Our results show comparable or improved performance on several datasets and benchmarks, on several LLMs, building a scalable and efficient path towards reducing the memory and computational costs associated with both training from scratch and fine-tuning of LLMs.