Identifying Information from Observations with Uncertainty and Novelty
This work provides a theoretical framework for handling uncertainty and novelty in ML tasks, but it is incremental as it builds on existing answers without introducing a new paradigm.
The paper formalizes the concept of identifiable information to address questions about uncertainty and novelty in machine learning from observations, connecting it to model identifiability, sample complexity, and asymptotic statistics for various data-generating processes.
A machine learning tasks from observations must encounter and process uncertainty and novelty, especially when it is to maintain performance when observing new information and to choose the hypothesis that best fits the current observations. In this context, some key questions arise: what and how much information did the observations provide, how much information is required to identify the data-generating process, how many observations remain to get that information, and how does a predictor determine that it has observed novel information? This paper strengthens existing answers to these questions by formalizing the notion of identifiable information that arises from the language used to express the relationship between distinct states. Model identifiability and sample complexity are defined via computation of an indicator function over a set of hypotheses, bridging algorithmic and probabilistic information. Their properties and asymptotic statistics are described for data-generating processes ranging from deterministic processes to ergodic stationary stochastic processes. This connects the notion of identifying information in finite steps with asymptotic statistics and PAC-learning. The indicator function's computation naturally formalizes novel information and its identification from observations with respect to a hypothesis set. We also proved that computable PAC-Bayes learners' sample complexity distribution is determined by its moments in terms of the prior probability distribution over a fixed finite hypothesis set.