LGApr 27, 2024
On the Rashomon ratio of infinite hypothesis setsEvzenie Coupkova, Mireille Boutin
Given a classification problem and a family of classifiers, the Rashomon ratio measures the proportion of classifiers that yield less than a given loss. Previous work has explored the advantage of a large Rashomon ratio in the case of a finite family of classifiers. Here we consider the more general case of an infinite family. We show that a large Rashomon ratio guarantees that choosing the classifier with the best empirical accuracy among a random subset of the family, which is likely to improve generalizability, will not increase the empirical loss too much. We quantify the Rashomon ratio in two examples involving infinite classifier families in order to illustrate situations in which it is large. In the first example, we estimate the Rashomon ratio of the classification of normally distributed classes using an affine classifier. In the second, we obtain a lower bound for the Rashomon ratio of a classification problem with a modified Gram matrix when the classifier family consists of two-layer ReLU neural networks. In general, we show that the Rashomon ratio can be estimated using a training dataset along with random samples from the classifier family and we provide guarantees that such an estimation is close to the true value of the Rashomon ratio.
LGAug 11, 2021
Approximation and generalization properties of the random projection classification methodMireille Boutin, Evzenie Coupkova
The generalization gap of a classifier is related to the complexity of the set of functions among which the classifier is chosen. We study a family of low-complexity classifiers consisting of thresholding a random one-dimensional feature. The feature is obtained by projecting the data on a random line after embedding it into a higher-dimensional space parametrized by monomials of order up to k. More specifically, the extended data is projected n-times and the best classifier among those n, based on its performance on training data, is chosen. We show that this type of classifier is extremely flexible as, given full knowledge of the class conditional densities, under mild conditions, the error of these classifiers would converge to the optimal (Bayes) error as k and n go to infinity. We also bound the generalization gap of the random classifiers. In general, these bounds are better than those for any classifier with VC dimension greater than O(ln n). In particular, the bounds imply that, unless the number of projections n is extremely large, the generalization gap of the random projection approach is significantly smaller than that of a linear classifier in the extended space. Thus, for certain classification problems (e.g., those with a large Rashomon ratio), there is a potntially large gain in generalization properties by selecting parameters at random, rather than selecting the best one amongst the class.
MLAug 21, 2020
Clustering small datasets in high-dimension by random projectionAlden Bradford, Tarun Yellamraju, Mireille Boutin
Datasets in high-dimension do not typically form clusters in their original space; the issue is worse when the number of points in the dataset is small. We propose a low-computation method to find statistically significant clustering structures in a small dataset. The method proceeds by projecting the data on a random line and seeking binary clusterings in the resulting one-dimensional data. Non-linear separations are obtained by extending the feature space using monomials of higher degrees in the original features. The statistical validity of the clustering structures obtained is tested in the projected one-dimensional space, thus bypassing the challenge of statistical validation in high-dimension. Projecting on a random line is an extreme dimension reduction technique that has previously been used successfully as part of a hierarchical clustering method for high-dimensional data. Our experiments show that with this simplified framework, statistically significant clustering structures can be found with as few as 100-200 points, depending on the dataset. The different structures uncovered are found to persist as more points are added to the dataset.
MLSep 14, 2019
A new model for natural groupings in high-dimensional dataMireille Boutin, Evzenie Coupkova
Clustering aims to divide a set of points into groups. The current paradigm assumes that the grouping is well-defined (unique) given the probability model from which the data is drawn. Yet, recent experiments have uncovered several high-dimensional datasets that form different binary groupings after projecting the data to randomly chosen one-dimensional subspaces. This paper describes a probability model for the data that could explain this phenomenon. It is a simple model to serve as a proof of concept for understanding the geometry of high-dimensional data. We start by building a rescaled multivariate Bernouilli model (stretched hypercube) so to create several overlapping grouping structures in the data. The size of each scaling parameter is related to the likelihood of uncovering the corresponding grouping by random 1D projection. Clusters in the original space are then created by adding noise to this cluster-free model. In high dimension, these clusters would hardly be observable given a sample set from the distribution because of the curse of dimensionality, but the binary groupings are clear. Our construction makes it clear that one needs to make a distinction between "groupings" and "clusters" in the original space. It also highlights the need to interpret any clustering found in projected data as merely one among potentially many other groupings in a dataset.
CVAug 21, 2018
Three Efficient, Low-Complexity Algorithms for Automatic Color TrappingHaiyin Wang, Mireille Boutin, Jeffrey Trask et al.
Color separations (most often cyan, magenta, yellow, and black) are commonly used in printing to reproduce multi-color images. For mechanical reasons, these color separations are generally not perfectly aligned with respect to each other when they are rendered by their respective imaging stations. This phenomenon, called color plane misregistration, causes gap and halo artifacts in the printed image. Color trapping is an image processing technique that aims to reduce these artifacts by modifying the susceptible edge boundaries to create small, unnoticeable overlaps between the color planes. We propose three low-complexity algorithms for automatic color trapping which hide the effects of small color plane mis-registrations. Our algorithms are designed for software or embedded firmware implementation. The trapping method they follow is based on a hardware-friendly technique proposed by J. Trask (JTHBCT03) which is too computationally expensive for software or firmware implementation. The first two algorithms are based on the use of look-up tables (LUTs). The first LUT-based algorithm corrects all registration errors of one pixel in extent and reduces several cases of misregistration errors of two pixels in extent using only 727 Kbytes of storage space. This algorithm is particularly attractive for implementation in the embedded firmware of low-cost formatter-based printers. The second LUT-based algorithm corrects all types of misregistration errors of up to two pixels in extent using 3.7 Mbytes of storage space. The third algorithm is a hybrid one that combines LUTs and feature extraction to minimize the storage requirements (724 Kbytes) while still correcting all misregistration errors of up to two pixels in extent. This algorithm is suitable for both embedded firmware implementation on low-cost formatter-based printers and software implementation on host-based printers.
CLJul 17, 2018
A Hand-Held Multimedia Translation and Interpretation System with Application to Diet ManagementAlbert Parra, Andrew W. Haddad, Mireille Boutin et al.
We propose a network independent, hand-held system to translate and disambiguate foreign restaurant menu items in real-time. The system is based on the use of a portable multimedia device, such as a smartphones or a PDA. An accurate and fast translation is obtained using a Machine Translation engine and a context-specific corpora to which we apply two pre-processing steps, called translation standardization and $n$-gram consolidation. The phrase-table generated is orders of magnitude lighter than the ones commonly used in market applications, thus making translations computationally less expensive, and decreasing the battery usage. Translation ambiguities are mitigated using multimedia information including images of dishes and ingredients, along with ingredient lists. We implemented a prototype of our system on an iPod Touch Second Generation for English speakers traveling in Spain. Our tests indicate that our translation method yields higher accuracy than translation engines such as Google Translate, and does so almost instantaneously. The memory requirements of the application, including the database of images, are also well within the limits of the device. By combining it with a database of nutritional information, our proposed system can be used to help individuals who follow a medical diet maintain this diet while traveling.
CVJul 17, 2018
Photo-unrealistic Image Enhancement for Subject Placement in Outdoor PhotographyChristian Tendyck, Andrew Haddad, Mireille Boutin
Camera display reflections are an issue in bright light situations, as they may prevent users from correctly positioning the subject in the picture. We propose a software solution to this problem, which consists in modifying the image in the viewer, in real time. In our solution, the user is seeing a posterized image which roughly represents the contour of the objects. Five enhancement methods are compared in a user study. Our results indicate that the problem considered is a valid one, as users had problems locating landmarks nearly 37% of the time under sunny conditions, and that our proposed enhancement method using contrasting colors is a practical solution to that problem.
MLJun 13, 2018
Pattern Dependence Detection using n-TARP ClusteringTarun Yellamraju, Mireille Boutin
Consider an experiment involving a potentially small number of subjects. Some random variables are observed on each subject: a high-dimensional one called the "observed" random variable, and a one-dimensional one called the "outcome" random variable. We are interested in the dependencies between the observed random variable and the outcome random variable. We propose a method to quantify and validate the dependencies of the outcome random variable on the various patterns contained in the observed random variable. Different degrees of relationship are explored (linear, quadratic, cubic, ...). This work is motivated by the need to analyze educational data, which often involves high-dimensional data representing a small number of students. Thus our implementation is designed for a small number of subjects; however, it can be easily modified to handle a very large dataset. As an illustration, the proposed method is used to study the influence of certain skills on the course grade of students in a signal processing class. A valid dependency of the grade on the different skill patterns is observed in the data.
MLJun 13, 2018
Benchmarks for Image Classification and Other High-dimensional Pattern Recognition ProblemsTarun Yellamraju, Jonas Hepp, Mireille Boutin
A good classification method should yield more accurate results than simple heuristics. But there are classification problems, especially high-dimensional ones like the ones based on image/video data, for which simple heuristics can work quite accurately; the structure of the data in such problems is easy to uncover without any sophisticated or computationally expensive method. On the other hand, some problems have a structure that can only be found with sophisticated pattern recognition methods. We are interested in quantifying the difficulty of a given high-dimensional pattern recognition problem. We consider the case where the patterns come from two pre-determined classes and where the objects are represented by points in a high-dimensional vector space. However, the framework we propose is extendable to an arbitrarily large number of classes. We propose classification benchmarks based on simple random projection heuristics. Our benchmarks are 2D curves parameterized by the classification error and computational cost of these simple heuristics. Each curve divides the plane into a "positive- gain" and a "negative-gain" region. The latter contains methods that are ill-suited for the given classification problem. The former is divided into two by the curve asymptote; methods that lie in the small region under the curve but right of the asymptote merely provide a computational gain but no structural advantage over the random heuristics. We prove that the curve asymptotes are optimal (i.e. at Bayes error) in some cases, and thus no sophisticated method can provide a structural advantage over the random heuristics. Such classification problems, an example of which we present in our numerical experiments, provide poor ground for testing new pattern classification methods.
MATH-PHSep 21, 2012
The Pascal Triangle of a Discrete Image: Definition, Properties and Application to Shape AnalysisMireille Boutin, Shanshan Huang
We define the Pascal triangle of a discrete (gray scale) image as a pyramidal arrangement of complex-valued moments and we explore its geometric significance. In particular, we show that the entries of row k of this triangle correspond to the Fourier series coefficients of the moment of order k of the Radon transform of the image. Group actions on the plane can be naturally prolonged onto the entries of the Pascal triangle. We study the prolongation of some common group actions, such as rotations and reflections, and we propose simple tests for detecting equivalences and self-equivalences under these group actions. The motivating application of this work is the problem of characterizing the geometry of objects on images, for example by detecting approximate symmetries.