ML, May 5
Imbalanced Classification under Capacity Constraints
Daniel Fraiman, Ricardo Fraiman
In many classification settings, the class of primary interest is underrepresented, leading to imbalanced data problems that arise in applications such as rare disease detection and fraud identification. In these contexts, identifying a potential positive instance typically triggers costly follow-up actions, such as medical imaging or detailed transaction inspection, which are subject to limited operational capacity. Motivated by this setting, we consider classification problems where data may arrive sequentially and decisions must be made under constraints on the number of instances that can be selected for further analysis. We propose a classification framework that explicitly controls the rate of positive predictions, enforcing a user-defined bound on the proportion of observations classified as belonging to the minority class while maximizing detection performance. The approach can be implemented using standard learning methods and naturally extends to online settings, where decisions are taken in real time. We show that incorporating capacity constraints leads to substantial improvements over classical approaches, including resampling techniques such as SMOTE, which do not directly control the selection rate.
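The idea of bounding the proportion of positive predictions can be illustrated with a minimal sketch. It assumes a scoring model already exists and simply thresholds its scores at an empirical quantile; this is one simple way to enforce a selection-rate bound, not necessarily the authors' procedure. The names `capacity_threshold`, `select`, and `capacity` are hypothetical, and the uniform random scores are stand-ins for real model outputs.

```python
import numpy as np

def capacity_threshold(calibration_scores, capacity):
    """Return a score cutoff such that about a `capacity` fraction of the
    calibration scores lie above it (hypothetical helper)."""
    return float(np.quantile(calibration_scores, 1.0 - capacity))

def select(scores, threshold):
    """Flag an instance for follow-up only if its score exceeds the cutoff."""
    return np.asarray(scores) > threshold

# Stand-in scores; in practice these would come from any fitted classifier.
rng = np.random.default_rng(0)
calibration_scores = rng.random(10_000)
threshold = capacity_threshold(calibration_scores, capacity=0.05)

# New observations can be scored one at a time and compared with the
# fixed cutoff, which is what makes the rule usable sequentially.
new_scores = rng.random(10_000)
flags = select(new_scores, threshold)
selection_rate = float(flags.mean())
```

If the new scores follow the same distribution as the calibration scores, `selection_rate` stays close to the chosen capacity; under distribution shift the cutoff would need to be re-estimated.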
ML, May 22, 2018
On semi-supervised learning
Alejandro Cholaquidis, Ricardo Fraiman, Mariela Sued
Semi-supervised learning deals with the problem of how, if possible, to take advantage of a huge amount of unclassified data to perform classification in situations where, typically, there is little labeled data. Even though this is not always possible (it depends on how useful knowing the distribution of the unlabeled data is for inferring the labels), several algorithms have been proposed recently. A new algorithm is proposed that, under almost necessary conditions, asymptotically attains the performance of the best theoretical rule as the amount of unlabeled data tends to infinity. The set of necessary assumptions, although reasonable, shows that semi-supervised classification only works for very well-conditioned problems. The focus is on understanding when and why semi-supervised learning works when the size of the initial training sample remains fixed and the asymptotics are in the size of the unlabeled data. The performance of the algorithm is assessed on the well-known "Isolet" real data set of phonemes, where a strong dependence on the choice of the initial training sample is shown.
ST, Sep 17, 2017
Semi-supervised learning
Alejandro Cholaquidis, Ricardo Fraiman, Mariela Sued
Semi-supervised learning deals with the problem of how, if possible, to take advantage of a huge amount of unclassified data to perform classification in situations where, typically, labelled data are scarce. Even though this is not always possible (it depends on how useful knowing the distribution of the unlabelled data is for inferring the labels), several algorithms have been proposed recently. A new algorithm is proposed that, under almost necessary conditions, asymptotically attains the performance of the best theoretical rule as the amount of unlabelled data tends to infinity. The set of necessary assumptions, although reasonable, shows that semi-supervised classification only works for very well-conditioned problems.
ST, Sep 4, 2015
A nonlinear aggregation type classifier
Alejandro Cholaquidis, Ricardo Fraiman, Juan Kalemkerian et al.
We introduce a nonlinear aggregation type classifier for functional data defined on a separable and complete metric space. The new rule is built from a collection of $M$ arbitrary training classifiers. If the classifiers are consistent, then so is the aggregation rule. Moreover, asymptotically the aggregation rule behaves as well as the best of the $M$ classifiers. The results of a small simulation are reported for both high-dimensional and functional data, and a real data example is analyzed.
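The abstract does not spell out the aggregation rule, so the following is only a rough sketch of one consensus-style way to combine $M$ classifiers nonlinearly, offered as an assumption rather than the authors' construction. A new point is labeled by majority vote over the training points on which every base classifier outputs the same label as it does at the new point. The function name `aggregate_predict`, the binary 0/1 labels, the fallback rule, and the toy threshold classifiers are all hypothetical.

```python
import numpy as np

def aggregate_predict(classifiers, X_train, y_train, x_new):
    """Consensus-style aggregation (illustrative assumption): majority vote
    among training points where all base classifiers agree with their own
    prediction at x_new. Labels are assumed to be 0 or 1."""
    votes_new = np.array([clf(x_new) for clf in classifiers])
    votes_train = np.array([[clf(x) for clf in classifiers] for x in X_train])
    # A training point counts only if all M classifiers match at both points.
    agree = np.all(votes_train == votes_new, axis=1)
    if not agree.any():
        # Fallback (assumption): plain majority vote of the base classifiers.
        return int(votes_new.mean() >= 0.5)
    return int(np.asarray(y_train)[agree].mean() >= 0.5)

# Toy example on the real line with three threshold classifiers.
classifiers = [
    lambda x: int(x > 0.0),
    lambda x: int(x > -0.75),
    lambda x: int(x > 0.75),
]
X_train = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_right = aggregate_predict(classifiers, X_train, y_train, 1.5)
pred_left = aggregate_predict(classifiers, X_train, y_train, -1.5)
```

Because the base classifiers are used only through their outputs, the inputs can be any objects (curves, vectors, and so on), as long as each classifier is callable on them.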