ML | Sep 13, 2019
Recommendation or Discrimination?: Quantifying Distribution Parity in Information Retrieval Systems
Rinat Khaziev, Bryce Casavant, Pearce Washabaugh et al.
Information retrieval (IR) systems often leverage query data to suggest relevant items to users. This introduces the possibility of unfairness if the query (i.e., input) and the resulting recommendations unintentionally correlate with latent factors that are protected variables (e.g., race, gender, and age). For instance, a visual search system for fashion recommendations may pick up on features of the human models rather than fashion garments when generating recommendations. In this work, we introduce a statistical test for "distribution parity" in the top-K IR results, which assesses whether a given set of recommendations is fair with respect to a specific protected variable. We evaluate our test using both simulated and empirical results. First, using artificially biased recommendations, we demonstrate the trade-off between statistically detectable bias and the size of the search catalog. Second, we apply our test to a visual search system for fashion garments, specifically testing for recommendation bias based on the skin tone of fashion models. Our distribution parity test can help ensure that IR systems' results are fair and produce a good experience for all users.
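As a rough illustration of what a distribution parity check on top-K results could look like, the sketch below compares the protected-attribute mix of a top-K list against the mix of the whole catalog. The abstract does not specify the test statistic, so everything here is an assumption: the chi-square statistic, the Monte Carlo p-value, the choice of the catalog as the baseline, and the function names `chi_square_stat` and `parity_p_value` are hypothetical and are not the authors' method.

```python
import random
from collections import Counter


def chi_square_stat(observed, expected_props, k):
    """Chi-square distance between observed group counts and expected proportions."""
    stat = 0.0
    for group, p in expected_props.items():
        expected = p * k
        stat += (observed.get(group, 0) - expected) ** 2 / expected
    return stat


def parity_p_value(top_k_groups, catalog_groups, n_sim=2000, seed=0):
    """Monte Carlo p-value for H0: the top-K group mix matches the catalog mix.

    top_k_groups:   protected-attribute label of each item in the top-K list
    catalog_groups: protected-attribute label of each item in the catalog
    Assumes len(top_k_groups) <= len(catalog_groups).
    """
    rng = random.Random(seed)
    k = len(top_k_groups)
    n = len(catalog_groups)
    props = {g: c / n for g, c in Counter(catalog_groups).items()}
    observed_stat = chi_square_stat(Counter(top_k_groups), props, k)

    # Simulate "fair" top-K lists by drawing K items uniformly from the catalog.
    hits = 0
    for _ in range(n_sim):
        sample = rng.sample(catalog_groups, k)
        if chi_square_stat(Counter(sample), props, k) >= observed_stat:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_sim + 1)


catalog = ["a"] * 500 + ["b"] * 500
skewed_p = parity_p_value(["a"] * 20, catalog)
balanced_p = parity_p_value(["a"] * 10 + ["b"] * 10, catalog)
```

A small p-value for a given query suggests that its top-K results deviate from the catalog's group mix by more than chance would explain. Under these assumptions, a smaller K gives each simulated list fewer items, which makes the same degree of skew harder to detect.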
IR | Sep 5, 2019
Assessing Fashion Recommendations: A Multifaceted Offline Evaluation Approach
Jake Sherman, Chinmay Shukla, Rhonda Textor et al.
Fashion is a unique domain for developing recommender systems (RS). Personalization is critical to fashion users. As a result, highly accurate recommendations are not sufficient unless they are also specific to each user. Moreover, fashion data is characterized by a large majority of new users, so a recommendation strategy that performs well only for users with prior interaction history is a poor fit for the fashion problem. Critical to addressing these issues in fashion recommendation is an evaluation strategy that: 1) includes multiple metrics that are relevant to fashion, and 2) is performed within segments of users with different interaction histories. Here, we present our multifaceted offline strategy for evaluating fashion RS. Using our proposed evaluation methodology, we compare the performance of three different algorithms: a most popular (MP) items strategy, a collaborative filtering (CF) strategy, and a content-based (CB) strategy. We demonstrate that only by considering the performance of these algorithms across multiple metrics and user segments can we determine the extent to which each algorithm is likely to fulfill fashion users' needs.
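The idea of scoring a recommender on several metrics within separate user segments can be sketched as follows. This is a minimal illustration only. The abstract names neither the metrics nor the segment boundaries, so the recall@K and precision@K metrics, the cut-offs in `segment_of`, and the input record layout are all assumptions made for the example.

```python
from collections import defaultdict


def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    return len(set(recommended[:k]) & set(relevant)) / len(relevant)


def precision_at_k(recommended, relevant, k):
    """Fraction of the k recommendation slots filled by a relevant item."""
    return len(set(recommended[:k]) & set(relevant)) / k


def segment_of(history_len):
    """Hypothetical segments by number of prior interactions."""
    if history_len == 0:
        return "new"
    return "light" if history_len < 5 else "heavy"


def evaluate_by_segment(users, k=10):
    """Average each metric separately within each user segment.

    users: iterable of dicts with keys 'history_len', 'recommended', 'relevant'.
    Returns {segment: {metric_name: mean_score}}.
    """
    metrics = {"recall": recall_at_k, "precision": precision_at_k}
    scores = defaultdict(lambda: defaultdict(list))
    for u in users:
        seg = segment_of(u["history_len"])
        for name, fn in metrics.items():
            scores[seg][name].append(fn(u["recommended"], u["relevant"], k))
    return {
        seg: {name: sum(vals) / len(vals) for name, vals in by_metric.items()}
        for seg, by_metric in scores.items()
    }


users = [
    {"history_len": 0, "recommended": [1, 2, 3], "relevant": [3, 4]},
    {"history_len": 7, "recommended": [5, 6], "relevant": [5]},
]
report = evaluate_by_segment(users, k=3)
```

Running the same function on the output of each algorithm (for example MP, CF, and CB) would give one such report per algorithm. Comparing the reports segment by segment shows where a strategy that looks strong on average underperforms for new users.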