LGOct 29, 2025
Training Across Reservoirs: Using Numerical Differentiation To Couple Trainable Networks With Black-Box ReservoirsAndrew Clark, Jack Moursounidis, Osmaan Rasouli et al.
We introduce Bounded Numerical Differentiation (BOND), a perturbative method for estimating partial derivatives across network structures with inaccessible computational graphs. BOND demonstrates improved accuracy and scalability from existing perturbative methods, enabling new explorations of trainable architectures that integrate black-box functions. We observe that these black-box functions, realized in our experiments as fixed, untrained networks, can enhance model performance without increasing the number of trainable parameters. This improvement is achieved without extensive optimization of the architecture or properties of the black-box function itself. Our findings highlight the potential of leveraging fixed, non-trainable modules to expand model capacity, suggesting a path toward combining analogue and digital devices as a mechanism for scaling networks.
LGAug 1, 2025
FeatureCuts: Feature Selection for Large Data by Optimizing the CutoffAndy Hu, Devika Prasad, Luiz Pizzato et al.
In machine learning, the process of feature selection involves finding a reduced subset of features that captures most of the information required to train an accurate and efficient model. This work presents FeatureCuts, a novel feature selection algorithm that adaptively selects the optimal feature cutoff after performing filter ranking. Evaluated on 14 publicly available datasets and one industry dataset, FeatureCuts achieved, on average, 15 percentage points more feature reduction and up to 99.6% less computation time while maintaining model performance, compared to existing state-of-the-art methods. When the selected features are used in a wrapper method such as Particle Swarm Optimization (PSO), it enables 25 percentage points more feature reduction, requires 66% less computation time, and maintains model performance when compared to PSO alone. The minimal overhead of FeatureCuts makes it scalable for large datasets typically seen in enterprise applications.
CLJul 10, 2025
Your Absorbing Discrete Diffusion Secretly Models the Bayesian PosteriorCooper Doyle
Discrete diffusion language models learn to reconstruct text from randomly masked inputs, yet under mild assumptions their denoiser already implements the exact Bayesian posterior over the original tokens. We prove that the expected denoiser output under the forward corruption distribution recovers the true posterior, and that a simple Monte Carlo estimator converges to this posterior at rate O(1/sqrt(K)) with finite-sample concentration bounds. Building on this insight, we introduce an inference-time ensemble that runs K independent denoising passes and aggregates both posterior means and variances without any extra training. On WikiText-2, our MC-marginal sampler recovers the analytic lambda-DCE zero-shot perplexity (approximately 39) to within a few points at K=128, and its per-token variance shows a strong rank correlation with reconstruction error (Spearman rho = 0.996). This cost-proportional procedure yields calibrated uncertainty estimates and a direct trade-off between compute and posterior fidelity in discrete diffusion LMs.
LGJul 4, 2025
Disentangling Doubt in Deep Causal AICooper Doyle
Accurate individual treatment-effect estimation in high-stakes applications demands both reliable point predictions and interpretable uncertainty quantification. We propose a factorized Monte Carlo Dropout framework for deep twin-network models that splits total predictive variance into representation uncertainty (sigma_rep) in the shared encoder and prediction uncertainty (sigma_pred) in the outcome heads. Across three synthetic covariate-shift regimes, our intervals are well-calibrated (ECE < 0.03) and satisfy sigma_rep^2 + sigma_pred^2 ~ sigma_tot^2. Additionally, we observe a crossover: head uncertainty leads on in-distribution data, but representation uncertainty dominates under shift. Finally, on a real-world twins cohort with induced multivariate shifts, only sigma_rep spikes on out-of-distribution samples (delta sigma ~ 0.0002) and becomes the primary error predictor (rho_rep <= 0.89), while sigma_pred remains flat. This module-level decomposition offers a practical diagnostic for detecting and interpreting uncertainty sources in deep causal-effect models.
LGJun 28, 2025
Low-rank variational dropout: Uncertainty and rank selection in adaptersCooper Doyle
Parameter-efficient fine-tuning (PEFT) methods such as LoRA adapt large language models by inserting low-rank adapters, but they leave open two key questions: how to give the adapted model calibrated uncertainty, and how to choose the adapter rank. Existing approaches to uncertainty are typically post-hoc, while rank selection is manual and task-specific. BayesLoRA revisits variational dropout in the LoRA setting and shows that the natural unit of stochasticity is not individual weights but entire ranks of the adapter. By placing rank-wise variational distributions over adapter components, BayesLoRA defines a posterior that (i) yields calibrated predictions through adapter-only Monte Carlo sampling and (ii) prunes redundant ranks automatically via an ARD-style KL term. Theoretical analysis shows that this rank-parameterized posterior localizes uncertainty to the adapted subspace and explains amplification under distribution shift. Empirically, BayesLoRA improves calibration while at the same time producing lighter, faster adapters, removing the need to tune ranks by hand. This dual role of uncertainty estimation and uncertainty-driven pruning suggests BayesLoRA may offer a practical default for reliable and efficient PEFT.