LGMay 31, 2022Code
Principle of Relevant Information for Graph SparsificationShujian Yu, Francesco Alesiani, Wenzhe Yin et al.
Graph sparsification aims to reduce the number of edges of a graph while maintaining its structural properties. In this paper, we propose the first general and effective information-theoretic formulation of graph sparsification, by taking inspiration from the Principle of Relevant Information (PRI). To this end, we extend the PRI from a standard scalar random variable setting to structured data (i.e., graphs). Our Graph-PRI objective is achieved by operating on the graph Laplacian, made possible by expressing the graph Laplacian of a subgraph in terms of a sparse edge selection vector $\mathbf{w}$. We provide both theoretical and empirical justifications on the validity of our Graph-PRI approach. We also analyze its analytical solutions in a few special cases. We finally present three representative real-world applications, namely graph sparsification, graph regularized multi-task learning, and medical imaging-derived brain network classification, to demonstrate the effectiveness, the versatility and the enhanced interpretability of our approach over prevalent sparsification techniques. Code of Graph-PRI is available at https://github.com/SJYuCNEL/PRI-Graphs
LGDec 9, 2022
The Normalized Cross Density Functional: A Framework to Quantify Statistical Dependence for Random ProcessesBo Hu, Jose C. Principe
This paper presents a novel approach to measuring statistical dependence between two random processes (r.p.) using a positive-definite function called the Normalized Cross Density (NCD). NCD is derived directly from the probability density functions of two r.p. and constructs a data-dependent Hilbert space, the Normalized Cross-Density Hilbert Space (NCD-HS). By Mercer's Theorem, the NCD norm can be decomposed into its eigenspectrum, which we name the Multivariate Statistical Dependence (MSD) measure, and their sum, the Total Dependence Measure (TSD). Hence, the NCD-HS eigenfunctions serve as a novel embedded feature space, suitable for quantifying r.p. statistical dependence. In order to apply NCD directly to r.p. realizations, we introduce an architecture with two multiple-output neural networks, a cost function, and an algorithm named the Functional Maximal Correlation Algorithm (FMCA). With FMCA, the two networks learn concurrently by approximating each other's outputs, extending the Alternating Conditional Expectation (ACE) for multivariate functions. We mathematically prove that FMCA learns the dominant eigenvalues and eigenfunctions of NCD directly from realizations. Preliminary results with synthetic data and medium-sized image datasets corroborate the theory. Different strategies for applying NCD are proposed and discussed, demonstrating the method's versatility and stability beyond supervised learning. Specifically, when the two r.p. are high-dimensional real-world images and a white uniform noise process, FMCA learns factorial codes, i.e., the occurrence of a code guarantees that a specific training set image was present, which is important for feature learning.
CVNov 3, 2022
Quantifying Model Uncertainty for Semantic Segmentation using Operators in the RKHSRishabh Singh, Jose C. Principe
Deep learning models for semantic segmentation are prone to poor performance in real-world applications due to the highly challenging nature of the task. Model uncertainty quantification (UQ) is one way to address this issue of lack of model trustworthiness by enabling the practitioner to know how much to trust a segmentation output. Current UQ methods in this application domain are mainly restricted to Bayesian based methods which are computationally expensive and are only able to extract central moments of uncertainty thereby limiting the quality of their uncertainty estimates. We present a simple framework for high-resolution predictive uncertainty quantification of semantic segmentation models that leverages a multi-moment functional definition of uncertainty associated with the model's feature space in the reproducing kernel Hilbert space (RKHS). The multiple uncertainty functionals extracted from this framework are defined by the local density dynamics of the model's feature space and hence automatically align themselves at the tail-regions of the intrinsic probability density function of the feature space (where uncertainty is the highest) in such a way that the successively higher order moments quantify the more uncertain regions. This leads to a significantly more accurate view of model uncertainty than conventional Bayesian methods. Moreover, the extraction of such moments is done in a single-shot computation making it much faster than Bayesian and ensemble approaches (that involve a high number of forward stochastic passes of the model to quantify its uncertainty). We demonstrate these advantages through experimental evaluations of our framework implemented over four different state-of-the-art model architectures that are trained and evaluated on two benchmark road-scene segmentation datasets (Camvid and Cityscapes).
LGDec 20, 2022
Adapting the Exploration Rate for Value-of-Information-Based Reinforcement LearningIsaac J. Sledge, Jose C. Principe
In this paper, we consider the problem of adjusting the exploration rate when using value-of-information-based exploration. We do this by converting the value-of-information optimization into a problem of finding equilibria of a flow for a changing exploration rate. We then develop an efficient path-following scheme for converging to these equilibria and hence uncovering optimal action-selection policies. Under this scheme, the exploration rate is automatically adapted according to the agent's experiences. Global convergence is theoretically assured. We first evaluate our exploration-rate adaptation on the Nintendo GameBoy games Centipede and Millipede. We demonstrate aspects of the search process, like that it yields a hierarchy of state abstractions. We also show that our approach returns better policies in fewer episodes than conventional search strategies relying on heuristic, annealing-based exploration-rate adjustments. We then illustrate that these trends hold for deep, value-of-information-based agents that learn to play ten simple games and over forty more complicated games for the Nintendo GameBoy system. Performance either near or well above the level of human play is observed.
LGApr 27, 2024Code
Cauchy-Schwarz Divergence Information Bottleneck for RegressionShujian Yu, Xi Yu, Sigurd Løkse et al.
The information bottleneck (IB) approach is popular to improve the generalization, robustness and explainability of deep neural networks. Essentially, it aims to find a minimum sufficient representation $\mathbf{t}$ by striking a trade-off between a compression term $I(\mathbf{x};\mathbf{t})$ and a prediction term $I(y;\mathbf{t})$, where $I(\cdot;\cdot)$ refers to the mutual information (MI). MI is for the IB for the most part expressed in terms of the Kullback-Leibler (KL) divergence, which in the regression case corresponds to prediction based on mean squared error (MSE) loss with Gaussian assumption and compression approximated by variational inference. In this paper, we study the IB principle for the regression problem and develop a new way to parameterize the IB with deep neural networks by exploiting favorable properties of the Cauchy-Schwarz (CS) divergence. By doing so, we move away from MSE-based regression and ease estimation by avoiding variational approximations or distributional assumptions. We investigate the improved generalization ability of our proposed CS-IB and demonstrate strong adversarial robustness guarantees. We demonstrate its superior performance on six real-world regression tasks over other popular deep IB approaches. We additionally observe that the solutions discovered by CS-IB always achieve the best trade-off between prediction accuracy and compression ratio in the information plane. The code is available at \url{https://github.com/SJYuCNEL/Cauchy-Schwarz-Information-Bottleneck}.
36.3CVMar 25
MLE-UVAD: Minimal Latent Entropy Autoencoder for Fully Unsupervised Video Anomaly DetectionYuang Geng, Junkai Zhou, Kang Yang et al.
In this paper, we address the challenging problem of single-scene, fully unsupervised video anomaly detection (VAD), where raw videos containing both normal and abnormal events are used directly for training and testing without any labels. This differs sharply from prior work that either requires extensive labeling (fully or weakly supervised) or depends on normal-only videos (one-class classification), which are vulnerable to distribution shifts and contamination. We propose an entropy-guided autoencoder that detects anomalies through reconstruction error by reconstructing normal frames well while making anomalies reconstruct poorly. The key idea is to combine the standard reconstruction loss with a novel Minimal Latent Entropy (MLE) loss in the autoencoder. Reconstruction loss alone maps normal and abnormal inputs to distinct latent clusters due to their inherent differences, but also risks reconstructing anomalies too well to detect. Therefore, MLE loss addresses this by minimizing the entropy of latent embeddings, encouraging them to concentrate around high-density regions. Since normal frames dominate the raw video, sparse anomalous embeddings are pulled into the normal cluster, so the decoder emphasizes normal patterns and produces poor reconstructions for anomalies. This dual-loss design produces a clear reconstruction gap that enables effective anomaly detection. Extensive experiments on two widely used benchmarks and a challenging self-collected driving dataset demonstrate that our method achieves robust and superior performance over baselines.
IVFeb 7, 2022Code
Deep Deterministic Independent Component Analysis for Hyperspectral UnmixingHongming Li, Shujian Yu, Jose C. Principe
We develop a new neural network based independent component analysis (ICA) method by directly minimizing the dependence amongst all extracted components. Using the matrix-based R{é}nyi's $α$-order entropy functional, our network can be directly optimized by stochastic gradient descent (SGD), without any variational approximation or adversarial training. As a solid application, we evaluate our ICA in the problem of hyperspectral unmixing (HU) and refute a statement that "\emph{ICA does not play a role in unmixing hyperspectral data}", which was initially suggested by \cite{nascimento2005does}. Code and additional remarks of our DDICA is available at https://github.com/hongmingli1995/DDICA.
ITSep 24, 2021Code
Estimating Rényi's $α$-Cross-Entropies in a Matrix-Based WayIsaac J. Sledge, Jose C. Principe
Conventional information-theoretic quantities assume access to probability distributions. Estimating such distributions is not trivial. Here, we consider function-based formulations of cross entropy that sidesteps this a priori estimation requirement. We propose three measures of Rényi's $α$-cross-entropies in the setting of reproducing-kernel Hilbert spaces. Each measure has its appeals. We prove that we can estimate these measures in an unbiased, non-parametric, and minimax-optimal way. We do this via sample-constructed Gram matrices. This yields matrix-based estimators of Rényi's $α$-cross-entropies. These estimators satisfy all of the axioms that Rényi established for divergences. Our cross-entropies can thus be used for assessing distributional differences. They are also appropriate for handling high-dimensional distributions, since the convergence rate of our estimator is independent of the sample dimensionality. Python code for implementing these measures can be found at https://github.com/isledge/MBRCE
LGJan 31, 2021Code
Deep Deterministic Information Bottleneck with Matrix-based Entropy FunctionalXi Yu, Shujian Yu, Jose C. Principe
We introduce the matrix-based Renyi's $α$-order entropy functional to parameterize Tishby et al. information bottleneck (IB) principle with a neural network. We term our methodology Deep Deterministic Information Bottleneck (DIB), as it avoids variational inference and distribution assumption. We show that deep neural networks trained with DIB outperform the variational objective counterpart and those that are trained with other forms of regularization, in terms of generalization performance and robustness to adversarial attack.Code available at https://github.com/yuxi120407/DIB
LGDec 5, 2024
ELEMENT: Episodic and Lifelong Exploration via Maximum EntropyHongming Li, Shujian Yu, Bin Liu et al.
This paper proposes \emph{Episodic and Lifelong Exploration via Maximum ENTropy} (ELEMENT), a novel, multiscale, intrinsically motivated reinforcement learning (RL) framework that is able to explore environments without using any extrinsic reward and transfer effectively the learned skills to downstream tasks. We advance the state of the art in three ways. First, we propose a multiscale entropy optimization to take care of the fact that previous maximum state entropy, for lifelong exploration with millions of state observations, suffers from vanishing rewards and becomes very expensive computationally across iterations. Therefore, we add an episodic maximum entropy over each episode to speedup the search further. Second, we propose a novel intrinsic reward for episodic entropy maximization named \emph{average episodic state entropy} which provides the optimal solution for a theoretical upper bound of the episodic state entropy objective. Third, to speed the lifelong entropy maximization, we propose a $k$ nearest neighbors ($k$NN) graph to organize the estimation of the entropy and updating processes that reduces the computation substantially. Our ELEMENT significantly outperforms state-of-the-art intrinsic rewards in both episodic and lifelong setups. Moreover, it can be exploited in task-agnostic pre-training, collecting data for offline reinforcement learning, etc.
LGNov 17, 2025
Contrastive Entropy Bounds for Density and Conditional Density DecompositionBo Hu, Jose C. Principe
This paper studies the interpretability of neural network features from a Bayesian Gaussian view, where optimizing a cost is reaching a probabilistic bound; learning a model approximates a density that makes the bound tight and the cost optimal, often with a Gaussian mixture density. The two examples are Mixture Density Networks (MDNs) using the bound for the marginal and autoencoders using the conditional bound. It is a known result, not only for autoencoders, that minimizing the error between inputs and outputs maximizes the dependence between inputs and the middle. We use Hilbert space and decomposition to address cases where a multiple-output network produces multiple centers defining a Gaussian mixture. Our first finding is that an autoencoder's objective is equivalent to maximizing the trace of a Gaussian operator, the sum of eigenvalues under bases orthonormal w.r.t. the data and model distributions. This suggests that, when a one-to-one correspondence as needed in autoencoders is unnecessary, we can instead maximize the nuclear norm of this operator, the sum of singular values, to maximize overall rank rather than trace. Thus the trace of a Gaussian operator can be used to train autoencoders, and its nuclear norm can be used as divergence to train MDNs. Our second test uses inner products and norms in a Hilbert space to define bounds and costs. Such bounds often have an extra norm compared to KL-based bounds, which increases sample diversity and prevents the trivial solution where a multiple-output network produces the same constant, at the cost of requiring a sample batch to estimate and optimize. We propose an encoder-mixture-decoder architecture whose decoder is multiple-output, producing multiple centers per sample, potentially tightening the bound. Assuming the data are small-variance Gaussian mixtures, this upper bound can be tracked and analyzed quantitatively.
LGAug 1, 2025
A Simple and Effective Method for Uncertainty Quantification and OOD DetectionYaxin Ma, Benjamin Colburn, Jose C. Principe
Bayesian neural networks and deep ensemble methods have been proposed for uncertainty quantification; however, they are computationally intensive and require large storage. By utilizing a single deterministic model, we can solve the above issue. We propose an effective method based on feature space density to quantify uncertainty for distributional shifts and out-of-distribution (OOD) detection. Specifically, we leverage the information potential field derived from kernel density estimation to approximate the feature space density of the training set. By comparing this density with the feature space representation of test samples, we can effectively determine whether a distributional shift has occurred. Experiments were conducted on a 2D synthetic dataset (Two Moons and Three Spirals) as well as an OOD detection task (CIFAR-10 vs. SVHN). The results demonstrate that our method outperforms baseline models.
CVJan 20, 2024
Weakly-Supervised Semantic Segmentation of Circular-Scan, Synthetic-Aperture-Sonar ImageryIsaac J. Sledge, Dominic M. Byrne, Jonathan L. King et al.
We propose a weakly-supervised framework for the semantic segmentation of circular-scan synthetic-aperture-sonar (CSAS) imagery. The first part of our framework is trained in a supervised manner, on image-level labels, to uncover a set of semi-sparse, spatially-discriminative regions in each image. The classification uncertainty of each region is then evaluated. Those areas with the lowest uncertainties are then chosen to be weakly labeled segmentation seeds, at the pixel level, for the second part of the framework. Each of the seed extents are progressively resized according to an unsupervised, information-theoretic loss with structured-prediction regularizers. This reshaping process uses multi-scale, adaptively-weighted features to delineate class-specific transitions in local image content. Content-addressable memories are inserted at various parts of our framework so that it can leverage features from previously seen images to improve segmentation performance for related images. We evaluate our weakly-supervised framework using real-world CSAS imagery that contains over ten seafloor classes and ten target classes. We show that our framework performs comparably to nine fully-supervised deep networks. Our framework also outperforms eleven of the best weakly-supervised deep networks. We achieve state-of-the-art performance when pre-training on natural imagery. The average absolute performance gap to the next-best weakly-supervised network is well over ten percent for both natural imagery and sonar imagery. This gap is found to be statistically significant.
LGDec 19, 2023
An Alternate View on Optimal Filtering in an RKHSBenjamin Colburn, Jose C. Principe, Luis G. Sanchez Giraldo
Kernel Adaptive Filtering (KAF) are mathematically principled methods which search for a function in a Reproducing Kernel Hilbert Space. While they work well for tasks such as time series prediction and system identification they are plagued by a linear relationship between number of training samples and model size, hampering their use on the very large data sets common in today's data saturated world. Previous methods try to solve this issue by sparsification. We describe a novel view of optimal filtering which may provide a route towards solutions in a RKHS which do not necessarily have this linear growth in model size. We do this by defining a RKHS in which the time structure of a stochastic process is still present. Using correntropy [11], an extension of the idea of a covariance function, we create a time based functional which describes some potentially nonlinear desired mapping function. This form of a solution may provide a fruitful line of research for creating more efficient representations of functionals in a RKHS, while theoretically providing computational complexity in the test set similar to Wiener solution.
LGOct 12, 2021
Information Theoretic Structured Generative ModelingBo Hu, Shujian Yu, Jose C. Principe
Rényi's information provides a theoretical foundation for tractable and data-efficient non-parametric density estimation, based on pair-wise evaluations in a reproducing kernel Hilbert space (RKHS). This paper extends this framework to parametric probabilistic modeling, motivated by the fact that Rényi's information can be estimated in closed-form for Gaussian mixtures. Based on this special connection, a novel generative model framework called the structured generative model (SGM) is proposed that makes straightforward optimization possible, because costs are scale-invariant, avoiding high gradient variance while imposing less restrictions on absolute continuity, which is a huge advantage in parametric information theoretic optimization. The implementation employs a single neural network driven by an orthonormal input appended to a single white noise source adapted to learn an infinite Gaussian mixture model (IMoG), which provides an empirically tractable model distribution in low dimensions. To train SGM, we provide three novel variational cost functions, based on Rényi's second-order entropy and divergence, to implement minimization of cross-entropy, minimization of variational representations of $f$-divergence, and maximization of the evidence lower bound (conditional probability). We test the framework for estimation of mutual information and compare the results with the mutual information neural estimation (MINE), for density estimation, for conditional probability estimation in Markov models as well as for training adversarial networks. Our preliminary results show that SGM significantly improves MINE estimation in terms of data efficiency and variance, conventional and variational Gaussian mixture models, as well as the performance of generative adversarial networks.
LGSep 22, 2021
A Physics inspired Functional Operator for Model Uncertainty Quantification in the RKHSRishabh Singh, Jose C. Principe
Accurate uncertainty quantification of model predictions is a crucial problem in machine learning. Existing Bayesian methods, being highly iterative, are expensive to implement and often fail to accurately capture a model's true posterior because of their tendency to select only central moments. We propose a fast single-shot uncertainty quantification framework where, instead of working with the conventional Bayesian definition of model weight probability density function (PDF), we utilize physics inspired functional operators over the projection of model weights in a reproducing kernel Hilbert space (RKHS) to quantify their uncertainty at each model output. The RKHS projection of model weights yields a potential field based interpretation of model weight PDF which consequently allows the definition of a functional operator, inspired by perturbation theory in physics, that performs a moment decomposition of the model weight PDF (the potential field) at a specific model output to quantify its uncertainty. We call this representation of the model weight PDF as the quantum information potential field (QIPF) of the weights. The extracted moments from this approach automatically decompose the weight PDF in the local neighborhood of the specified model output and determine, with great sensitivity, the local heterogeneity of the weight PDF around a given prediction. These moments therefore provide sharper estimates of predictive uncertainty than central stochastic moments of Bayesian methods. Experiments evaluating the error detection capability of different uncertainty quantification methods on covariate shifted test data show our approach to be more precise and better calibrated than baseline methods, while being faster to compute.
CVJul 22, 2021
External-Memory Networks for Low-Shot Learning of Targets in Forward-Looking-Sonar ImageryIsaac J. Sledge, Christopher D. Toole, Joseph A. Maestri et al.
We propose a memory-based framework for real-time, data-efficient target analysis in forward-looking-sonar (FLS) imagery. Our framework relies on first removing non-discriminative details from the imagery using a small-scale DenseNet-inspired network. Doing so simplifies ensuing analyses and permits generalizing from few labeled examples. We then cascade the filtered imagery into a novel NeuralRAM-based convolutional matching network, NRMN, for low-shot target recognition. We employ a small-scale FlowNet, LFN to align and register FLS imagery across local temporal scales. LFN enables target label consensus voting across images and generally improves target detection and recognition rates. We evaluate our framework using real-world FLS imagery with multiple broad target classes that have high intra-class variability and rich sub-class structure. We show that few-shot learning, with anywhere from ten to thirty class-specific exemplars, performs similarly to supervised deep networks trained on hundreds of samples per class. Effective zero-shot learning is also possible. High performance is realized from the inductive-transfer properties of NRMNs when distractor elements are removed.
ITJul 5, 2021
An Information-Theoretic Approach for Automatically Determining the Number of States when Aggregating Markov ChainsIsaac J. Sledge, Jose C. Principe
A fundamental problem when aggregating Markov chains is the specification of the number of state groups. Too few state groups may fail to sufficiently capture the pertinent dynamics of the original, high-order Markov chain. Too many state groups may lead to a non-parsimonious, reduced-order Markov chain whose complexity rivals that of the original. In this paper, we show that an augmented value-of-information-based approach to aggregating Markov chains facilitates the determination of the number of state groups. The optimal state-group count coincides with the case where the complexity of the reduced-order chain is balanced against the mutual dependence between the original- and reduced-order chain dynamics.
LGApr 19, 2021
Labels, Information, and Computation: Efficient Learning Using Sufficient LabelsShiyu Duan, Spencer Chang, Jose C. Principe
In supervised learning, obtaining a large set of fully-labeled training data is expensive. We show that we do not always need full label information on every single training example to train a competent classifier. Specifically, inspired by the principle of sufficiency in statistics, we present a statistic (a summary) of the fully-labeled training set that captures almost all the relevant information for classification but at the same time is easier to obtain directly. We call this statistic "sufficiently-labeled data" and prove its sufficiency and efficiency for finding the optimal hidden representations, on which competent classifier heads can be trained using as few as a single randomly-chosen fully-labeled example per class. Sufficiently-labeled data can be obtained from annotators directly without collecting the fully-labeled data first. And we prove that it is easier to directly obtain sufficiently-labeled data than obtaining fully-labeled data. Furthermore, sufficiently-labeled data is naturally more secure since it stores relative, instead of absolute, information. Extensive experimental results are provided to support our theory.
LGMar 2, 2021
A Kernel Framework to Quantify a Model's Local Predictive Uncertainty under Data Distributional ShiftsRishabh Singh, Jose C. Principe
Traditional Bayesian approaches for model uncertainty quantification rely on notoriously difficult processes of marginalization over each network parameter to estimate its probability density function (PDF). Our hypothesis is that internal layer outputs of a trained neural network contain all of the information related to both its mapping function (quantified by its weights) as well as the input data distribution. We therefore propose a framework for predictive uncertainty quantification of a trained neural network that explicitly estimates the PDF of its raw prediction space (before activation), p(y'|x,w), which we refer to as the model PDF, in a Gaussian reproducing kernel Hilbert space (RKHS). The Gaussian RKHS provides a localized density estimate of p(y'|x,w), which further enables us to utilize gradient based formulations of quantum physics to decompose the model PDF in terms of multiple local uncertainty moments that provide much greater resolution of the PDF than the central moments characterized by Bayesian methods. This provides the framework with a better ability to detect distributional shifts in test data away from the training data PDF learned by the model. We evaluate the framework against existing uncertainty quantification methods on benchmark datasets that have been corrupted using common perturbation techniques. The kernel framework is observed to provide model uncertainty estimates with much greater precision based on the ability to detect model prediction errors.
LGFeb 24, 2021
Annotating Motion Primitives for Simplifying Action Search in Reinforcement LearningIsaac J. Sledge, Darshan W. Bryner, Jose C. Principe
Reinforcement learning in large-scale environments is challenging due to the many possible actions that can be taken in specific situations. We have previously developed a means of constraining, and hence speeding up, the search process through the use of motion primitives; motion primitives are sequences of pre-specified actions taken across a state series. As a byproduct of this work, we have found that if the motion primitives' motions and actions are labeled, then the search can be sped up further. Since motion primitives may initially lack such details, we propose a theoretically viewpoint-insensitive and speed-insensitive means of automatically annotating the underlying motions and actions. We do this through a differential-geometric, spatio-temporal kinematics descriptor, which analyzes how the poses of entities in two motion sequences change over time. We use this descriptor in conjunction with a weighted-nearest-neighbor classifier to label the primitives using a limited set of training examples. In our experiments, we achieve high motion and action annotation rates for human-action-derived primitives with as few as one training sample. We also demonstrate that reinforcement learning using accurately labeled trajectories leads to high-performing policies more quickly than standard reinforcement learning techniques. This is partly because motion primitives encode prior domain knowledge and preempt the need to re-discover that knowledge during training. It is also because agents can leverage the labels to systematically ignore action classes that do not facilitate task objectives, thereby reducing the action space.
LGJan 25, 2021
Measuring Dependence with Matrix-based Entropy FunctionalShujian Yu, Francesco Alesiani, Xi Yu et al.
Measuring the dependence of data plays a central role in statistics and machine learning. In this work, we summarize and generalize the main idea of existing information-theoretic dependence measures into a higher-level perspective by the Shearer's inequality. Based on our generalization, we then propose two measures, namely the matrix-based normalized total correlation ($T_α^*$) and the matrix-based normalized dual total correlation ($D_α^*$), to quantify the dependence of multiple variables in arbitrary dimensional space, without explicit estimation of the underlying data distributions. We show that our measures are differentiable and statistically more powerful than prevalent ones. We also show the impact of our measures in four different machine learning problems, namely the gene regulatory network inference, the robust machine learning under covariate shift and non-Gaussian noises, the subspace outlier detection, and the understanding of the learning dynamics of convolutional neural networks (CNNs), to demonstrate their utilities, advantages, as well as implications to those problems. Code of our dependence measure is available at: https://bit.ly/AAAI-dependence
AIJan 18, 2021
Faster Convergence in Deep-Predictive-Coding Networks to Learn Deeper RepresentationsIsaac J. Sledge, Jose C. Principe
Deep-predictive-coding networks (DPCNs) are hierarchical, generative models. They rely on feed-forward and feed-back connections to modulate latent feature representations of stimuli in a dynamic and context-sensitive manner. A crucial element of DPCNs is a forward-backward inference procedure to uncover sparse, invariant features. However, this inference is a major computational bottleneck. It severely limits the network depth due to learning stagnation. Here, we prove why this bottleneck occurs. We then propose a new forward-inference strategy based on accelerated proximal gradients. This strategy has faster theoretical convergence guarantees than the one used for DPCNs. It overcomes learning stagnation. We also demonstrate that it permits constructing deep and wide predictive-coding networks. Such convolutional networks implement receptive fields that capture well the entire classes of objects on which the networks are trained. This improves the feature representations compared with our lab's previous non-convolutional and convolutional DPCNs. It yields unsupervised object recognition that surpass convolutional autoencoders and are on par with convolutional networks trained in a supervised manner.
CVJan 10, 2021
Target Detection and Segmentation in Circular-Scan Synthetic-Aperture-Sonar Images using Semi-Supervised Convolutional Encoder-DecodersIsaac J. Sledge, Matthew S. Emigh, Jonathan L. King et al.
We propose a framework for saliency-based, multi-target detection and segmentation of circular-scan, synthetic-aperture-sonar (CSAS) imagery. Our framework relies on a multi-branch, convolutional encoder-decoder network (MB-CEDN). The encoder portion of the MB-CEDN extracts visual contrast features from CSAS images. These features are fed into dual decoders that perform pixel-level segmentation to mask targets. Each decoder provides different perspectives as to what constitutes a salient target. These opinions are aggregated and cascaded into a deep-parsing network to refine the segmentation. We evaluate our framework using real-world CSAS imagery consisting of five broad target classes. We compare against existing approaches from the computer-vision literature. We show that our framework outperforms supervised, deep-saliency networks designed for natural imagery. It greatly outperforms unsupervised saliency approaches developed for natural imagery. This illustrates that natural-image-based models may need to be altered to be effective for this imaging-sonar modality.
LGJan 9, 2021
Training Deep Architectures Without End-to-End Backpropagation: A Survey on the Provably Optimal MethodsShiyu Duan, Jose C. Principe
This tutorial paper surveys provably optimal alternatives to end-to-end backpropagation (E2EBP) -- the de facto standard for training deep architectures. Modular training refers to strictly local training without both the forward and the backward pass, i.e., dividing a deep architecture into several nonoverlapping modules and training them separately without any end-to-end operation. Between the fully global E2EBP and the strictly local modular training, there are weakly modular hybrids performing training without the backward pass only. These alternatives can match or surpass the performance of E2EBP on challenging datasets such as ImageNet, and are gaining increasing attention primarily because they offer practical advantages over E2EBP, which will be enumerated herein. In particular, they allow for greater modularity and transparency in deep learning workflows, aligning deep learning with the mainstream computer science engineering that heavily exploits modularization for scalability. Modular training has also revealed novel insights about learning and has further implications on other important research domains. Specifically, it induces natural and effective solutions to some important practical problems such as data efficiency and transferability estimation.
LGOct 18, 2020
Unsupervised Foveal Vision Neural Networks with Top-Down AttentionRyan Burt, Nina N. Thigpen, Andreas Keil et al.
Deep learning architectures are an extremely powerful tool for recognizing and classifying images. However, they require supervised learning and normally work on vectors the size of image pixels and produce the best results when trained on millions of object images. To help mitigate these issues, we propose the fusion of bottom-up saliency and top-down attention employing only unsupervised learning techniques, which helps the object recognition module to focus on relevant data and learn important features that can later be fine-tuned for a specific task. In addition, by utilizing only relevant portions of the data, the training speed can be greatly improved. We test the performance of the proposed Gamma saliency technique on the Toronto and CAT2000 databases, and the foveated vision in the Street View House Numbers (SVHN) database. The results in foveated vision show that Gamma saliency is comparable to the best and computationally faster. The results in SVHN show that our unsupervised cognitive architecture is comparable to fully supervised methods and that the Gamma saliency also improves CNN performance if desired. We also develop a topdown attention mechanism based on the Gamma saliency applied to the top layer of CNNs to improve scene understanding in multi-object images or images with strong background clutter. When we compare the results with human observers in an image dataset of animals occluded in natural scenes, we show that topdown attention is capable of disambiguating object from background and improves system performance beyond the level of human observers.
LGJul 13, 2020
PRI-VAE: Principle-of-Relevant-Information Variational AutoencodersYanjun Li, Shujian Yu, Jose C. Principe et al.
Although substantial efforts have been made to learn disentangled representations under the variational autoencoder (VAE) framework, the fundamental properties to the dynamics of learning of most VAE models still remain unknown and under-investigated. In this work, we first propose a novel learning objective, termed the principle-of-relevant-information variational autoencoder (PRI-VAE), to learn disentangled representations. We then present an information-theoretic perspective to analyze existing VAE models by inspecting the evolution of some critical information-theoretic quantities across training epochs. Our observations unveil some fundamental properties associated with VAEs. Empirical results also demonstrate the effectiveness of PRI-VAE on four benchmark data sets.
LGMay 5, 2020
Measuring the Discrepancy between Conditional Distributions: Methods, Properties and ApplicationsShujian Yu, Ammar Shaker, Francesco Alesiani et al.
We propose a simple yet powerful test statistic to quantify the discrepancy between two conditional distributions. The new statistic avoids the explicit estimation of the underlying distributions in highdimensional space and it operates on the cone of symmetric positive semidefinite (SPS) matrix using the Bregman matrix divergence. Moreover, it inherits the merits of the correntropy function to explicitly incorporate high-order statistics in the data. We present the properties of our new statistic and illustrate its connections to prior art. We finally show the applications of our new statistic on three different machine learning problems, namely the multi-task learning over graphs, the concept drift detection, and the information-theoretic feature selection, to demonstrate its utility and advantage. Code of our statistic is available at https://bit.ly/BregmanCorrentropy.
LGJan 30, 2020
Towards a Kernel based Uncertainty Decomposition Framework for Data and ModelsRishabh Singh, Jose C. Principe
This paper introduces a new framework for quantifying predictive uncertainty for both data and models that relies on projecting the data into a Gaussian reproducing kernel Hilbert space (RKHS) and transforming the data probability density function (PDF) in a way that quantifies the flow of its gradient as a topological potential field quantified at all points in the sample space. This enables the decomposition of the PDF gradient flow by formulating it as a moment decomposition problem using operators from quantum physics, specifically the Schrodinger's formulation. We experimentally show that the higher order modes systematically cluster the different tail regions of the PDF, thereby providing unprecedented discriminative resolution of data regions having high epistemic uncertainty. In essence, this approach decomposes local realizations of the data PDF in terms of uncertainty moments. We apply this framework as a surrogate tool for predictive uncertainty quantification of point-prediction neural network models, overcoming various limitations of conventional Bayesian based uncertainty quantification methods. Experimental comparisons with some established methods illustrate performance advantages exhibited by our framework.
LGJan 1, 2020
Fast Estimation of Information Theoretic Learning Descriptors using Explicit Inner Product SpacesKan Li, Jose C. Principe
Kernel methods form a theoretically-grounded, powerful and versatile framework to solve nonlinear problems in signal processing and machine learning. The standard approach relies on the \emph{kernel trick} to perform pairwise evaluations of a kernel function, leading to scalability issues for large datasets due to its linear and superlinear growth with respect to the training data. Recently, we proposed \emph{no-trick} (NT) kernel adaptive filtering (KAF) that leverages explicit feature space mappings using data-independent basis with constant complexity. The inner product defined by the feature mapping corresponds to a positive-definite finite-rank kernel that induces a finite-dimensional reproducing kernel Hilbert space (RKHS). Information theoretic learning (ITL) is a framework where information theory descriptors based on non-parametric estimator of Renyi entropy replace conventional second-order statistics for the design of adaptive systems. An RKHS for ITL defined on a space of probability density functions simplifies statistical inference for supervised or unsupervised learning. ITL criteria take into account the higher-order statistical behavior of the systems and signals as desired. However, this comes at a cost of increased computational complexity. In this paper, we extend the NT kernel concept to ITL for improved information extraction from the signal without compromising scalability. Specifically, we focus on a family of fast, scalable, and accurate estimators for ITL using explicit inner product space (EIPS) kernels. We demonstrate the superior performance of EIPS-ITL estimators and combined NT-KAF using EIPS-ITL cost functions through experiments.
LGDec 10, 2019
No-Trick (Treat) Kernel Adaptive Filtering using Deterministic FeaturesKan Li, Jose C. Principe
Kernel methods form a powerful, versatile, and theoretically-grounded unifying framework to solve nonlinear problems in signal processing and machine learning. The standard approach relies on the kernel trick to perform pairwise evaluations of a kernel function, which leads to scalability issues for large datasets due to its linear and superlinear growth with respect to the training data. A popular approach to tackle this problem, known as random Fourier features (RFFs), samples from a distribution to obtain the data-independent basis of a higher finite-dimensional feature space, where its dot product approximates the kernel function. Recently, deterministic, rather than random construction has been shown to outperform RFFs, by approximating the kernel in the frequency domain using Gaussian quadrature. In this paper, we view the dot product of these explicit mappings not as an approximation, but as an equivalent positive-definite kernel that induces a new finite-dimensional reproducing kernel Hilbert space (RKHS). This opens the door to no-trick (NT) online kernel adaptive filtering (KAF) that is scalable and robust. Random features are prone to large variances in performance, especially for smaller dimensions. Here, we focus on deterministic feature-map construction based on polynomial-exact solutions and show their superiority over random constructions. Without loss of generality, we apply this approach to classical adaptive filtering algorithms and validate the methodology to show that deterministic features are faster to generate and outperform state-of-the-art kernel methods based on random Fourier features.
SPNov 24, 2019
Functional Bayesian FilterKan Li, Jose C. Principe
We present a general nonlinear Bayesian filter for high-dimensional state estimation using the theory of reproducing kernel Hilbert space (RKHS). Applying kernel method and the representer theorem to perform linear quadratic estimation in a functional space, we derive a Bayesian recursive state estimator for a general nonlinear dynamical system in the original input space. Unlike existing nonlinear extensions of Kalman filter where the system dynamics are assumed known, the state-space representation for the Functional Bayesian Filter (FBF) is completely learned from measurement data in the form of an infinite impulse response (IIR) filter or recurrent network in the RKHS, with universal approximation property. Using positive definite kernel function satisfying Mercer's conditions to compute and evolve information quantities, the FBF exploits both the statistical and time-domain information about the signal, extracts higher-order moments, and preserves the properties of covariances without the ill effects due to conventional arithmetic operations. This novel kernel adaptive filtering algorithm is applied to recurrent network training, chaotic time-series estimation and cooperative filtering using Gaussian and non-Gaussian noises, and inverse kinematics modeling. Simulation results show FBF outperforms existing Kalman-based algorithms.
IVNov 8, 2019
Algorithmic Design and Implementation of Unobtrusive Multistatic Serial LiDAR ImageChi Ding, Zheng Cao, Matthew S. Emigh et al.
To fully understand interactions between marine hydrokinetic (MHK) equipment and marine animals, a fast and effective monitoring system is required to capture relevant information whenever underwater animals appear. A new automated underwater imaging system composed of LiDAR (Light Detection and Ranging) imaging hardware and a scene understanding software module named Unobtrusive Multistatic Serial LiDAR Imager (UMSLI) to supervise the presence of animals near turbines. UMSLI integrates the front end LiDAR hardware and a series of software modules to achieve image preprocessing, detection, tracking, segmentation and classification in a hierarchical manner.
LGJul 13, 2019
Multiscale Principle of Relevant Information for Hyperspectral Image ClassificationYantao Wei, Shujian Yu, Luis Sanchez Giraldo et al.
This paper proposes a novel architecture, termed multiscale principle of relevant information (MPRI), to learn discriminative spectral-spatial features for hyperspectral image (HSI) classification. MPRI inherits the merits of the principle of relevant information (PRI) to effectively extract multiscale information embedded in the given data, and also takes advantage of the multilayer structure to learn representations in a coarse-to-fine manner. Specifically, MPRI performs spectral-spatial pixel characterization (using PRI) and feature dimensionality reduction (using regularized linear discriminant analysis) iteratively and successively. Extensive experiments on three benchmark data sets demonstrate that MPRI outperforms existing state-of-the-art methods (including deep learning based ones) qualitatively and quantitatively, especially in the scenario of limited training samples. Code of MPRI is available at \url{http://bit.ly/MPRI_HSI}.
MLApr 13, 2019
Maximum Correntropy Criterion with Variable CenterBadong Chen, Xin Wang, Yingsong Li et al.
Correntropy is a local similarity measure defined in kernel space and the maximum correntropy criterion (MCC) has been successfully applied in many areas of signal processing and machine learning in recent years. The kernel function in correntropy is usually restricted to the Gaussian function with center located at zero. However, zero-mean Gaussian function may not be a good choice for many practical applications. In this study, we propose an extended version of correntropy, whose center can locate at any position. Accordingly, we propose a new optimization criterion called maximum correntropy criterion with variable center (MCC-VC). We also propose an efficient approach to optimize the kernel width and center location in MCC-VC. Simulation results of regression with linear in parameters (LIP) models confirm the desirable performance of the new method.
LGJan 22, 2019
An Exact Reformulation of Feature-Vector-based Radial-Basis-Function Networks for Graph-based ObservationsIsaac J. Sledge, Jose C. Principe
Radial-basis-function networks are traditionally defined for sets of vector-based observations. In this short paper, we reformulate such networks so that they can be applied to adjacency-matrix representations of weighted, directed graphs that represent the relationships between object pairs. We re-state the sum-of-squares objective function so that it is purely dependent on entries from the adjacency matrix. From this objective function, we derive a gradient descent update for the network weights. We also derive a gradient update that simulates the repositioning of the radial basis prototypes and changes in the radial basis prototype parameters. An important property of our radial basis function networks is that they are guaranteed to yield the same responses as conventional radial-basis networks trained on a corresponding vector realization of the relationships encoded by the adjacency-matrix. Such a vector realization only needs to provably exist for this property to hold, which occurs whenever the relationships correspond to distances from some arbitrary metric applied to a latent set of vectors. We therefore completely avoid needing to actually construct vectorial realizations via multi-dimensional scaling, which ensures that the underlying relationships are totally preserved.
SPDec 31, 2018
Theory and Algorithms for Pulse Signal ProcessingGabriel Nallathambi, Jose C. Principe
The integrate and fire converter transforms an analog signal into train of biphasic pulses. The pulse train has information encoded in the timing and polarity of pulses. While it has been shown that any finite bandwidth analog signal can be reconstructed from these pulse trains with an error as small as desired, there is a need for fundamental signal processing techniques to operate directly on pulse trains without signal reconstruction. In this paper, the feasibility of performing online the signal processing operations of addition, multiplication, and convolution of analog signals using their pulses train representations is explored. Theoretical framework to perform signal processing with pulse trains imposing minimal restrictions is derived, and algorithms for online implementation of the operators are developed. Performance of the algorithms in processing simulated data is studied. An application of noise subtraction and representation of relevant features of interest in electrocardiogram signal is demonstrated with mean pulse rate less than 20 pulses per second.
CVNov 29, 2018
Simple stopping criteria for information theoretic feature selectionShujian Yu, Jose C. Principe
Feature selection aims to select the smallest feature subset that yields the minimum generalization error. In the rich literature in feature selection, information theory-based approaches seek a subset of features such that the mutual information between the selected features and the class labels is maximized. Despite the simplicity of this objective, there still remain several open problems in optimization. These include, for example, the automatic determination of the optimal subset size (i.e., the number of features) or a stopping criterion if the greedy searching strategy is adopted. In this paper, we suggest two stopping criteria by just monitoring the conditional mutual information (CMI) among groups of variables. Using the recently developed multivariate matrix-based Renyi's α-entropy functional, which can be directly estimated from data samples, we showed that the CMI among groups of variables can be easily computed without any decomposition or approximation, hence making our criteria easy to implement and seamlessly integrated into any existing information theoretic feature selection methods with a greedy search strategy.
ITAug 23, 2018
Multivariate Extension of Matrix-based Renyi's α-order Entropy FunctionalShujian Yu, Luis Gonzalo Sanchez Giraldo, Robert Jenssen et al.
The matrix-based Renyi's α-order entropy functional was recently introduced using the normalized eigenspectrum of a Hermitian matrix of the projected data in a reproducing kernel Hilbert space (RKHS). However, the current theory in the matrix-based Renyi's α-order entropy functional only defines the entropy of a single variable or mutual information between two random variables. In information theory and machine learning communities, one is also frequently interested in multivariate information quantities, such as the multivariate joint entropy and different interactive quantities among multiple variables. In this paper, we first define the matrix-based Renyi's α-order joint entropy among multiple variables. We then show how this definition can ease the estimation of various information quantities that measure the interactions among multiple variables, such as interactive information and total correlation. We finally present an application to feature selection to show how our definition provides a simple yet powerful way to estimate a widely-acknowledged intractable quantity from data. A real example on hyperspectral image (HSI) band selection is also provided.
LGJun 25, 2018
Request-and-Reverify: Hierarchical Hypothesis Testing for Concept Drift Detection with Expensive LabelsShujian Yu, Xiaoyang Wang, Jose C. Principe
One important assumption underlying common classification models is the stationarity of the data. However, in real-world streaming applications, the data concept indicated by the joint distribution of feature and label is not stationary but drifting over time. Concept drift detection aims to detect such drifts and adapt the model so as to mitigate any deterioration in the model's predictive performance. Unfortunately, most existing concept drift detection methods rely on a strong and over-optimistic condition that the true labels are available immediately for all already classified instances. In this paper, a novel Hierarchical Hypothesis Testing framework with Request-and-Reverify strategy is developed to detect concept drifts by requesting labels only when necessary. Two methods, namely Hierarchical Hypothesis Testing with Classification Uncertainty (HHT-CU) and Hierarchical Hypothesis Testing with Attribute-wise "Goodness-of-fit" (HHT-AG), are proposed respectively under the novel framework. In experiments with benchmark datasets, our methods demonstrate overwhelming advantages over state-of-the-art unsupervised drift detectors. More importantly, our methods even outperform DDM (the widely used supervised drift detector) when we use significantly fewer labels.
LGApr 18, 2018
Understanding Convolutional Neural Networks with Information Theory: An Initial ExplorationShujian Yu, Kristoffer Wickstrøm, Robert Jenssen et al.
The matrix-based Renyi's α-entropy functional and its multivariate extension were recently developed in terms of the normalized eigenspectrum of a Hermitian matrix of the projected data in a reproducing kernel Hilbert space (RKHS). However, the utility and possible applications of these new estimators are rather new and mostly unknown to practitioners. In this paper, we first show that our estimators enable straightforward measurement of information flow in realistic convolutional neural networks (CNN) without any approximation. Then, we introduce the partial information decomposition (PID) framework and develop three quantities to analyze the synergy and redundancy in convolutional layer representations. Our results validate two fundamental data processing inequalities and reveal some fundamental properties concerning the training of CNN.
LGMar 30, 2018
Understanding Autoencoders with Information Theoretic ConceptsShujian Yu, Jose C. Principe
Despite their great success in practical applications, there is still a lack of theoretical and systematic methods to analyze deep neural networks. In this paper, we illustrate an advanced information theoretic methodology to understand the dynamics of learning and the design of autoencoders, a special type of deep learning architectures that resembles a communication channel. By generalizing the information plane to any cost function, and inspecting the roles and dynamics of different layers using layer-wise information quantities, we emphasize the role that mutual information plays in quantifying learning from data. We further suggest and also experimentally validate, for mean square error training, three fundamental properties regarding the layer-wise flow of information and intrinsic dimensionality of the bottleneck layer, using respectively the data processing inequality and the identification of a bifurcation point in the information plane that is controlled by the given data. Our observations have a direct impact on the optimal design of autoencoders, the design of alternative feedforward training methods, and even in the problem of generalization.
AIFeb 5, 2018
Guided Policy Exploration for Markov Decision Processes using an Uncertainty-Based Value-of-Information CriterionIsaac J. Sledge, Matthew S. Emigh, Jose C. Principe
Reinforcement learning in environments with many action-state pairs is challenging. At issue is the number of episodes needed to thoroughly search the policy space. Most conventional heuristics address this search problem in a stochastic manner. This can leave large portions of the policy space unvisited during the early training stages. In this paper, we propose an uncertainty-based, information-theoretic approach for performing guided stochastic searches that more effectively cover the policy space. Our approach is based on the value of information, a criterion that provides the optimal trade-off between expected costs and the granularity of the search process. The value of information yields a stochastic routine for choosing actions during learning that can explore the policy space in a coarse to fine manner. We augment this criterion with a state-transition uncertainty factor, which guides the search process into previously unexplored regions of the policy space.
LGFeb 1, 2018
Augmented Space Linear ModelZhengda Qin, Badong Chen, Nanning Zheng et al.
The linear model uses the space defined by the input to project the target or desired signal and find the optimal set of model parameters. When the problem is nonlinear, the adaption requires nonlinear models for good performance, but it becomes slower and more cumbersome. In this paper, we propose a linear model called Augmented Space Linear Model (ASLM), which uses the full joint space of input and desired signal as the projection space and approaches the performance of nonlinear models. This new algorithm takes advantage of the linear solution, and corrects the estimate for the current testing phase input with the error assigned to the input space neighborhood in the training phase. This algorithm can solve the nonlinear problem with the computational efficiency of linear methods, which can be regarded as a trade off between accuracy and computational complexity. Making full use of the training data, the proposed augmented space model may provide a new way to improve many modeling tasks.
AIOct 28, 2017
Partitioning Relational Matrices of Similarities or Dissimilarities using the Value of InformationIsaac J. Sledge, Jose C. Principe
In this paper, we provide an approach to clustering relational matrices whose entries correspond to either similarities or dissimilarities between objects. Our approach is based on the value of information, a parameterized, information-theoretic criterion that measures the change in costs associated with changes in information. Optimizing the value of information yields a deterministic annealing style of clustering with many benefits. For instance, investigators avoid needing to a priori specify the number of clusters, as the partitions naturally undergo phase changes, during the annealing process, whereby the number of clusters changes in a data-driven fashion. The global-best partition can also often be identified.
AIOct 8, 2017
An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed BanditsIsaac J. Sledge, Jose C. Principe
In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to an optimal regret that is logarithmic with respect to the number of episodes.
CVAug 4, 2017
Associations among Image Assessments as Cost Functions in Linear Decomposition: MSE, SSIM, and Correlation CoefficientJianji Wang, Nanning Zheng, Badong Chen et al.
The traditional methods of image assessment, such as mean squared error (MSE), signal-to-noise ratio (SNR), and Peak signal-to-noise ratio (PSNR), are all based on the absolute error of images. Pearson's inner-product correlation coefficient (PCC) is also usually used to measure the similarity between images. Structural similarity (SSIM) index is another important measurement which has been shown to be more effective in the human vision system (HVS). Although there are many essential differences among these image assessments, some important associations among them as cost functions in linear decomposition are discussed in this paper. Firstly, the selected bases from a basis set for a target vector are the same in the linear decomposition schemes with different cost functions MSE, SSIM, and PCC. Moreover, for a target vector, the ratio of the corresponding affine parameters in the MSE-based linear decomposition scheme and the SSIM-based scheme is a constant, which is just the value of PCC between the target vector and its estimated vector.
MLMar 23, 2017
Robustness of Maximum Correntropy Estimation Against Large OutliersBadong Chen, Lei Xing, Haiquan Zhao et al.
The maximum correntropy criterion (MCC) has recently been successfully applied in robust regression, classification and adaptive filtering, where the correntropy is maximized instead of minimizing the well-known mean square error (MSE) to improve the robustness with respect to outliers (or impulsive noises). Considerable efforts have been devoted to develop various robust adaptive algorithms under MCC, but so far little insight has been gained as to how the optimal solution will be affected by outliers. In this work, we study this problem in the context of parameter estimation for a simple linear errors-in-variables (EIV) model where all variables are scalar. Under certain conditions, we derive an upper bound on the absolute value of the estimation error and show that the optimal solution under MCC can be very close to the true value of the unknown parameter even with outliers (whose values can be arbitrarily large) in both input and output variables. Illustrative examples are presented to verify and clarify the theory.
LGFeb 28, 2017
Analysis of Agent Expertise in Ms. Pac-Man using Value-of-Information-based PoliciesIsaac J. Sledge, Jose C. Principe
Conventional reinforcement learning methods for Markov decision processes rely on weakly-guided, stochastic searches to drive the learning process. It can therefore be difficult to predict what agent behaviors might emerge. In this paper, we consider an information-theoretic cost function for performing constrained stochastic searches that promote the formation of risk-averse to risk-favoring behaviors. This cost function is the value of information, which provides the optimal trade-off between the expected return of a policy and the policy's complexity; policy complexity is measured by number of bits and controlled by a single hyperparameter on the cost function. As the policy complexity is reduced, the agents will increasingly eschew risky actions. This reduces the potential for high accrued rewards. As the policy complexity increases, the agents will take actions, regardless of the risk, that can raise the long-term rewards. The obtainable reward depends on a single, tunable hyperparameter that regulates the degree of policy complexity. We evaluate the performance of value-of-information-based policies on a stochastic version of Ms. Pac-Man. A major component of this paper is the demonstration that ranges of policy complexity values yield different game-play styles and explaining why this occurs. We also show that our reinforcement-learning search mechanism is more efficient than the others we utilize. This result implies that the value of information theory is appropriate for framing the exploitation-exploration trade-off in reinforcement learning.
MLAug 1, 2016
Kernel Risk-Sensitive Loss: Definition, Properties and Application to Robust Adaptive FilteringBadong Chen, Lei Xing, Bin Xu et al.
Nonlinear similarity measures defined in kernel space, such as correntropy, can extract higher-order statistics of data and offer potentially significant performance improvement over their linear counterparts especially in non-Gaussian signal processing and machine learning. In this work, we propose a new similarity measure in kernel space, called the kernel risk-sensitive loss (KRSL), and provide some important properties. We apply the KRSL to adaptive filtering and investigate the robustness, and then develop the MKRSL algorithm and analyze the mean square convergence performance. Compared with correntropy, the KRSL can offer a more efficient performance surface, thereby enabling a gradient based method to achieve faster convergence speed and higher accuracy while still maintaining the robustness to outliers. Theoretical analysis results and superior performance of the new algorithm are confirmed by simulation.