LG Aug 13, 2022
A Novel Regularization Approach to Fair ML
Norman Matloff, Wenxi Zhang
A number of methods have been introduced for the fair ML issue, most of them complex and many of them very specific to the underlying ML methodology. Here we introduce a new approach that is simple, easily explained, and potentially applicable to a number of standard ML algorithms. Explicitly Deweighted Features (EDF) reduces the impact of each feature among the proxies of sensitive variables, allowing a different amount of deweighting to be applied to each such feature. The user specifies the deweighting hyperparameters, to achieve a given point in the Utility/Fairness tradeoff spectrum. We also introduce a new, simple criterion for evaluating the degree of protection afforded by any fair ML method.
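As a rough illustration of per-feature deweighting, the sketch below fits a linear model with a separate L2 penalty on each coefficient. Reading EDF as a ridge-style penalty of this kind is an assumption made for illustration, not a statement of the paper's exact formulation; the function name `deweighted_ridge`, the simulated data, and the penalty values are all hypothetical.

```python
import numpy as np

def deweighted_ridge(X, y, penalties):
    """Least squares with one L2 penalty per feature (no intercept).

    Solves (X'X + diag(penalties)) b = X'y. A larger penalty shrinks
    the corresponding coefficient more strongly toward zero.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    D = np.diag(np.asarray(penalties, dtype=float))
    return np.linalg.solve(X.T @ X + D, X.T @ y)

# Simulated data (hypothetical): `proxy` is correlated with a sensitive
# variable `s`, which itself is excluded from the feature set.
rng = np.random.default_rng(0)
n = 500
s = rng.normal(size=n)
proxy = s + 0.5 * rng.normal(size=n)
other = rng.normal(size=n)
y = proxy + other + 0.1 * rng.normal(size=n)
X = np.column_stack([proxy, other])

b_plain = deweighted_ridge(X, y, [0.0, 0.0])     # ordinary least squares
b_edf = deweighted_ridge(X, y, [5000.0, 0.0])    # deweight only the proxy
```

Here each penalty plays the role of a user-chosen deweighting hyperparameter: raising the penalty on the proxy trades predictive utility for a weaker dependence on it.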
LG Oct 13, 2022
Walk a Mile in Their Shoes: a New Fairness Criterion for Machine Learning
Norman Matloff
The old empathetic adage, "Walk a mile in their shoes," asks that one imagine the difficulties others may face. This suggests a new ML counterfactual fairness criterion, based on a group level: How would members of a nonprotected group fare if their group were subject to conditions in some protected group? Instead of asking what sentence a particular Caucasian convict would receive if he were Black, take that notion to entire groups; e.g. how would the average sentence for all White convicts change if they were Black, but with their same White characteristics, e.g. same number of prior convictions? We frame the problem and study it empirically, for different datasets. Our approach is also a solution to the problem of covariate correlation with sensitive attributes.
LG Jun 13, 2018
Polynomial Regression As an Alternative to Neural Nets
Xi Cheng, Bohdan Khomtchouk, Norman Matloff et al.
Despite the success of neural networks (NNs), there is still a concern among many over their "black box" nature. Why do they work? Here we present a simple analytic argument that NNs are in fact essentially polynomial regression models. This view will have various implications for NNs, e.g. providing an explanation for why convergence problems arise in NNs, and it gives rough guidance on avoiding overfitting. In addition, we use this phenomenon to predict and confirm a multicollinearity property of NNs not previously reported in the literature. Most importantly, given this loose correspondence, one may choose to routinely use polynomial models instead of NNs, thus avoiding some major problems of the latter, such as having to set many tuning parameters and dealing with convergence issues. We present a number of empirical results; in each case, the accuracy of the polynomial approach matches or exceeds that of NN approaches. A many-featured, open-source software package, polyreg, is available.
LG Nov 13, 2024
TowerDebias: A Novel Unfairness Removal Method Based on the Tower Property
Norman Matloff, Aditya Mittal
Decision-making processes have increasingly come to rely on sophisticated machine learning tools, raising critical concerns about the fairness of their predictions with respect to sensitive groups. The widespread adoption of commercial "black-box" models necessitates careful consideration of their legal and ethical implications for consumers. When users interact with such black-box models, a key challenge arises: how can the influence of sensitive attributes, such as race or gender, be mitigated or removed from their predictions? We propose towerDebias (tDB), a novel post-processing method designed to reduce the influence of sensitive attributes in predictions made by black-box models. Our tDB approach leverages the Tower Property from probability theory to improve prediction fairness without requiring retraining of the original model. This method is highly versatile, as it requires no prior knowledge of the original algorithm's internal structure and is adaptable to a diverse range of applications. We present a formal fairness improvement theorem for tDB and showcase its effectiveness in both regression and classification tasks using multiple real-world datasets.
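The Tower Property states that E[Y | X] = E[ E[Y | X, S] | X ], so averaging predictions that depend on both X and a sensitive attribute S over cases with similar X yields a prediction that depends on X alone. The sketch below approximates that inner average with k nearest neighbors in X. The neighbor-averaging scheme, the function name `tower_debias`, and the toy data are assumptions for illustration, not the tDB implementation.

```python
def tower_debias(x_new, X_train, preds_train, k):
    """Average black-box predictions over the k training rows whose
    non-sensitive features are closest to x_new.

    X_train holds only the non-sensitive features; preds_train holds the
    black-box predictions, which may have depended on S as well.
    """
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    order = sorted(range(len(X_train)), key=lambda i: sqdist(X_train[i], x_new))
    nearest = order[:k]
    return sum(preds_train[i] for i in nearest) / len(nearest)

# Toy data (hypothetical): the black box predicts x + 2*s, so S shifts
# the prediction by 2 for otherwise identical cases.
X_train = [[0.0], [0.0], [1.0], [1.0]]
S_train = [0, 1, 0, 1]
preds_train = [x[0] + 2 * s for x, s in zip(X_train, S_train)]

debiased_at_0 = tower_debias([0.0], X_train, preds_train, k=2)
debiased_at_1 = tower_debias([1.0], X_train, preds_train, k=2)
```

Because only the stored predictions are used, the original model is never retrained or inspected, which matches the post-processing setting described in the abstract.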
ME Nov 6, 2024
dsld: A Socially Relevant Tool for Teaching Statistics
Aditya Mittal, Taha Abdullah, Arjun Ashok et al.
The growing influence of data science in statistics education requires tools that make key concepts accessible through real-world applications. We introduce "Data Science Looks At Discrimination" (dsld), an R package that provides a comprehensive set of analytical and graphical methods for examining issues of discrimination involving attributes such as race, gender, and age. By positioning fairness analysis as a teaching tool, the package enables instructors to demonstrate confounder effects, model bias, and related topics through applied examples. An accompanying 80-page Quarto book guides students and legal professionals in understanding these principles and applying them to real data. We describe the implementation of the package functions and illustrate their use with examples. Python interfaces are also available.
HC Sep 3, 2017
Top-Frequency Parallel Coordinates Plots
Vincent Yang, Harrison Nguyen, Norman Matloff et al.
Parallel coordinates plotting is one of the most popular methods for multivariate visualization. However, when applied to larger data sets, there tends to be a "black screen problem," with the screen becoming so cluttered and full that patterns are difficult or impossible to discern. Xie and Matloff (2014) proposed remedying this problem by plotting only the most frequently-appearing patterns, with frequency defined in terms of nonparametrically estimated multivariate density. This approach displays "typical" patterns, which may reveal important insights for the data. However, this remedy does not cover variables that are discrete or categorical. An alternate method, still frequency-based, is presented here for such cases. We discretize all continuous variables, retaining the discrete/categorical ones, and plot the patterns having the highest counts in the dataset. In addition, we propose some novel approaches to handling missing values in parallel coordinates settings.
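The counting step described above (discretize the continuous variables, keep the discrete or categorical ones, and retain the highest-count patterns) can be sketched as follows. Equal-width binning, the function names, and the sample rows are assumptions chosen for illustration; the abstract does not specify the binning scheme, and the plotting itself is omitted.

```python
from collections import Counter

def discretize(value, lo, hi, nbins):
    """Map a continuous value to an equal-width bin index in [0, nbins - 1]."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * nbins)
    return min(idx, nbins - 1)  # the maximum value falls in the last bin

def top_patterns(rows, continuous_cols, nbins, m):
    """Return the m most frequent row patterns after discretization."""
    ranges = {
        j: (min(r[j] for r in rows), max(r[j] for r in rows))
        for j in continuous_cols
    }
    patterns = []
    for r in rows:
        pattern = []
        for j, v in enumerate(r):
            if j in ranges:
                lo, hi = ranges[j]
                pattern.append(discretize(v, lo, hi, nbins))
            else:
                pattern.append(v)  # discrete/categorical values kept as-is
        patterns.append(tuple(pattern))
    return Counter(patterns).most_common(m)

# Hypothetical data: column 0 is continuous, column 1 is categorical.
rows = [(1.0, "a"), (1.2, "a"), (9.0, "b"), (9.5, "b"), (9.9, "b"), (5.0, "a")]
top = top_patterns(rows, continuous_cols=[0], nbins=3, m=2)
```

Only the patterns in `top` would then be drawn as lines across the parallel axes, which is what keeps the plot from filling in.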
ML Oct 6, 2015
Improved Estimation of Class Prior Probabilities through Unlabeled Data
Norman Matloff
Work in the classification literature has shown that in computing a classification function, one need not know the class membership of all observations in the training set; the unlabeled observations still provide information on the marginal distribution of the feature set, and can thus contribute to increased classification accuracy for future observations. The present paper will show that this scheme can also be used for the estimation of class prior probabilities, which would be very useful in applications in which it is difficult or expensive to determine class membership. Both parametric and nonparametric estimators are developed. Asymptotic distributions of the estimators are derived, and it is proven that the use of the unlabeled observations does reduce asymptotic variance. This methodology is also extended to the estimation of subclass probabilities.