Fair Column Subset Selection
This work addresses fairness in data representation for applications involving partitioned groups, though it is incremental as it builds on existing column subset selection methods.
The paper tackles the problem of fair column subset selection by extending it to a setting with two groups, aiming to minimize the maximum reconstruction error for both groups relative to their best rank-k approximations, and it presents an approximation algorithm that guarantees a solution within 1.5 times the optimal size.
The problem of column subset selection asks for a subset of columns from an input matrix such that the matrix can be reconstructed as accurately as possible within the span of the selected columns. A natural extension is to consider a setting where the matrix rows are partitioned into two groups, and the goal is to choose a subset of columns that minimizes the maximum reconstruction error of both groups, relative to their respective best rank-k approximation. Extending the known results of column subset selection to this fair setting is not straightforward: in certain scenarios it is unavoidable to choose columns separately for each group, resulting in double the expected column count. We propose a deterministic leverage-score sampling strategy for the fair setting and show that sampling a column subset of minimum size becomes NP-hard in the presence of two groups. Despite these negative results, we give an approximation algorithm that guarantees a solution within 1.5 times the optimal solution size. We also present practical heuristic algorithms based on rank-revealing QR factorization. Finally, we validate our methods through an extensive set of experiments using real-world data.