George D. Montañez

h-index7

11papers

136citations

Novelty53%

AI Score41

Ranked #69,802 of 194,257 authors (top 36%)#957 in ML (top 28%)

11 Papers

4.4CLMar 11

The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs?

Mar GonzÃ lez I CatalÃ, Haitz Sáez de Ocáriz Borde, George D. Montañez et al.

Recent work uses entropy-based signals at multiple representation levels to study reasoning in large language models, but the field remains largely empirical. A central unresolved puzzle is why internal entropy dynamics, defined under the predictive distribution of a model, correlate so robustly with external correctness given by the ground-truth answer. In this paper, we argue that this correlation arises because autoregressive models reason correctly when they accumulate information about the true answer via answer-informative prefixes. We formalize this intuition via the Stepwise Informativeness Assumption (SIA), which states that reasoning prefixes accumulate answer-relevant information in expectation as generation progresses. We show that SIA naturally emerges from maximum-likelihood optimization on human reasoning traces and is reinforced by standard fine-tuning and reinforcement-learning pipelines. We then derive observable signatures of SIA linking conditional answer entropy dynamics to correctness. We empirically test SIA across multiple reasoning benchmarks (GSM8K, ARC, SVAMP) and a diverse set of open-weight LLMs (Gemma-2, LLaMA-3.2, Qwen-2.5, DeepSeek and Olmo variants), showing that training induces it and that correct traces exhibit characteristic conditional answer entropy patterns.

3.8CRApr 8, 2021

Permutation Encoding for Text Steganography: A Short Tutorial

George D. Montanez

We explore a method of encoding secret messages using factoradic numbering of permuted lists of text or numeric elements. Encoding and decoding methods are provided, with code, and key aspects of the correctness of the methods are formally proven. The method of encoding is simple and provides a working example of using textual and numeric lists as a stenagographic channel. Given the ubiquity of lists, such channels are already present but are often unused.

4.4LGFeb 4, 2021

Undecidability of Underfitting in Learning Algorithms

Sonia Sehra, David Flores, George D. Montanez

Using recent machine learning results that present an information-theoretic perspective on underfitting and overfitting, we prove that deciding whether an encodable learning algorithm will always underfit a dataset, even if given unlimited training time, is undecidable. We discuss the importance of this result and potential topics for further research, including information-theoretic and probabilistic strategies for bounding learning algorithm fit.

5.0LGOct 12, 2020

An Information-Theoretic Perspective on Overfitting and Underfitting

Daniel Bashir, George D. Montanez, Sonia Sehra et al.

We present an information-theoretic framework for understanding overfitting and underfitting in machine learning and prove the formal undecidability of determining whether an arbitrary classification algorithm will overfit a dataset. Measuring algorithm capacity via the information transferred from datasets to models, we consider mismatches between algorithm capacities and datasets to provide a signature for when a model can overfit or underfit a dataset. We present results upper-bounding algorithm capacity, establish its relationship to quantities in the algorithmic search framework for machine learning, and relate our work to recent information-theoretic approaches to generalization.

4.9MLJun 23, 2020

Limits of Transfer Learning

Jake Williams, Abel Tadesse, Tyler Sam et al.

Transfer learning involves taking information and insight from one problem domain and applying it to a new problem domain. Although widely used in practice, theory for transfer learning remains less well-developed. To address this, we prove several novel results related to transfer learning, showing the need to carefully select which sets of information to transfer and the need for dependence between transferred information and target problems. Furthermore, we prove how the degree of probabilistic change in an algorithm using transfer learning places an upper bound on the amount of improvement possible. These results build on the algorithmic search framework for machine learning, allowing the results to apply to a wide range of learning problems using transfer.

3.8MLJan 3, 2020

Decomposable Probability-of-Success Metrics in Algorithmic Search

Tyler Sam, Jake Williams, Abel Tadesse et al.

Previous studies have used a specific success metric within an algorithmic search framework to prove machine learning impossibility results. However, this specific success metric prevents us from applying these results on other forms of machine learning, e.g. transfer learning. We define decomposable metrics as a category of success metrics for search problems which can be expressed as a linear operation on a probability distribution to solve this issue. Using an arbitrary decomposable metric to measure the success of a search, we demonstrate theorems which bound success in various ways, generalizing several existing results in the literature.

2.7LGDec 23, 2019

The Labeling Distribution Matrix (LDM): A Tool for Estimating Machine Learning Algorithm Capacity

Pedro Sandoval Segura, Julius Lauw, Daniel Bashir et al.

Algorithm performance in supervised learning is a combination of memorization, generalization, and luck. By estimating how much information an algorithm can memorize from a dataset, we can set a lower bound on the amount of performance due to other factors such as generalization and luck. With this goal in mind, we introduce the Labeling Distribution Matrix (LDM) as a tool for estimating the capacity of learning algorithms. The method attempts to characterize the diversity of possible outputs by an algorithm for different training datasets, using this to measure algorithm flexibility and responsiveness to data. We test the method on several supervised learning algorithms, and find that while the results are not conclusive, the LDM does allow us to gain potentially valuable insight into the prediction behavior of algorithms. We also introduce the Label Recorder as an additional tool for estimating algorithm capacity, with more promising initial results.

5.7MLNov 9, 2019

The Bias-Expressivity Trade-off

Julius Lauw, Dominique Macias, Akshay Trikha et al.

Learning algorithms need bias to generalize and perform better than random guessing. We examine the flexibility (expressivity) of biased algorithms. An expressive algorithm can adapt to changing training data, altering its outcome based on changes in its input. We measure expressivity by using an information-theoretic notion of entropy on algorithm outcome distributions, demonstrating a trade-off between bias and expressivity. To the degree an algorithm is biased is the degree to which it can outperform uniform random sampling, but is also the degree to which is becomes inflexible. We derive bounds relating bias to expressivity, proving the necessary trade-offs inherent in trying to create strongly performing yet flexible algorithms.

7.7LGJul 13, 2019

The Futility of Bias-Free Learning and Search

George D. Montanez, Jonathan Hayase, Julius Lauw et al.

Building on the view of machine learning as search, we demonstrate the necessity of bias in learning, quantifying the role of bias (measured relative to a collection of possible datasets, or more generally, information resources) in increasing the probability of success. For a given degree of bias towards a fixed target, we show that the proportion of favorable information resources is strictly bounded from above. Furthermore, we demonstrate that bias is a conserved quantity, such that no algorithm can be favorably biased towards many distinct targets simultaneously. Thus bias encodes trade-offs. The probability of success for a task can also be measured geometrically, as the angle of agreement between what holds for the actual task and what is assumed by the algorithm, represented in its bias. Lastly, finding a favorably biasing distribution over a fixed set of information resources is provably difficult, unless the set of resources itself is already favorable with respect to the given task and algorithm.

7.1MLSep 28, 2016

The Famine of Forte: Few Search Problems Greatly Favor Your Algorithm

George D. Montanez

Casting machine learning as a type of search, we demonstrate that the proportion of problems that are favorable for a fixed algorithm is strictly bounded, such that no single algorithm can perform well over a large fraction of them. Our results explain why we must either continue to develop new learning methods year after year or move towards highly parameterized models that are both flexible and sensitive to their hyperparameters. We further give an upper bound on the expected performance for a search algorithm as a function of the mutual information between the target and the information resource (e.g., training dataset), proving the importance of certain types of dependence for machine learning. Lastly, we show that the expected per-query probability of success for an algorithm is mathematically equivalent to a single-query probability of success under a distribution (called a search strategy), and prove that the proportion of favorable strategies is also strictly bounded. Thus, whether one holds fixed the search algorithm and considers all possible problems or one fixes the search problem and looks at all possible search strategies, favorable matches are exceedingly rare. The forte (strength) of any algorithm is quantifiably restricted.

2.8MLJun 8, 2015

The LICORS Cabinet: Nonparametric Algorithms for Spatio-temporal Prediction

George D. Montanez, Cosma Rohilla Shalizi

Spatio-temporal data is intrinsically high dimensional, so unsupervised modeling is only feasible if we can exploit structure in the process. When the dynamics are local in both space and time, this structure can be exploited by splitting the global field into many lower-dimensional "light cones". We review light cone decompositions for predictive state reconstruction, introducing three simple light cone algorithms. These methods allow for tractable inference of spatio-temporal data, such as full-frame video. The algorithms make few assumptions on the underlying process yet have good predictive performance and can provide distributions over spatio-temporal data, enabling sophisticated probabilistic inference.