LG EM ML Aug 3, 2019

The Use of Binary Choice Forests to Model and Estimate Discrete Choices

Ningyuan Chen, Guillermo Gallego, Zhuodong Tang

arXiv:1908.01109v62.72 citations

Originality: Incremental advance

AI Analysis

The approach gives retail managers an interpretable and flexible way to model customer choice behavior, though the advance is incremental because it adapts random forests to a known domain.

The authors address the problem of estimating discrete choice models in retailing, where existing methods are either too inflexible or hard to interpret. They propose a binary choice forest approach that predicts choice probabilities consistently, avoids misspecification, and can outperform existing methods in the reported experiments.

Problem definition. In retailing, discrete choice models (DCMs) are commonly used to capture the choice behavior of customers when offered an assortment of products. When estimating DCMs using transaction data, flexible models (such as machine learning models or nonparametric models) are typically not interpretable and hard to estimate, while tractable models (such as the multinomial logit model) tend to misspecify the complex behavior represented in the data.

Methodology/results. In this study, we use a forest of binary decision trees to represent DCMs. This approach is based on random forests, a popular machine learning algorithm. The resulting model is interpretable: the decision trees can explain the decision-making process of customers during the purchase. We show that our approach can predict the choice probability of any DCM consistently and thus never suffers from misspecification. Moreover, our algorithm predicts assortments unseen in the training data. The mechanism and errors can be theoretically analyzed. We also prove that the random forest can recover preference rankings of customers thanks to splitting criteria such as the Gini index and information gain ratio.

Managerial implications. The framework has unique practical advantages. It can capture customers' behavioral patterns such as irrationality or sequential searches when purchasing a product. It handles nonstandard formats of training data that result from aggregation. It can measure product importance based on how frequently a random customer would make decisions depending on the presence of the product. It can also incorporate price information and customer features. Our numerical experiments using synthetic and real data show that using random forests to estimate customer choices can outperform existing methods.
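
To make the idea of a forest of binary decision trees more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes each assortment is encoded as a 0/1 tuple (1 means the product is offered), that every split asks whether one product is present, and that splits are chosen by the Gini index. All names (`NO_PURCHASE`, `grow_tree`, `fit_forest`, `choice_probabilities`, `ranked_choice`) and the synthetic single-ranking customer are hypothetical, and the sketch uses bootstrap sampling only, considering every product at each split.

```python
import random
from collections import Counter
from itertools import product

NO_PURCHASE = -1  # hypothetical label for the no-purchase option


def gini(labels):
    """Gini impurity of a list of chosen-product labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow_tree(rows):
    """Grow a binary tree from rows of (assortment, choice).

    Internal nodes are tuples (product_index, subtree_if_absent,
    subtree_if_present); leaves are integer choice labels.
    """
    labels = [choice for _, choice in rows]
    if len(set(labels)) == 1:
        return labels[0]  # pure leaf
    best = None
    for j in range(len(rows[0][0])):
        absent = [r for r in rows if r[0][j] == 0]
        present = [r for r in rows if r[0][j] == 1]
        if not absent or not present:
            continue  # splitting on product j would not partition the rows
        score = (len(absent) * gini([c for _, c in absent])
                 + len(present) * gini([c for _, c in present])) / len(rows)
        if best is None or score < best[0]:
            best = (score, j, absent, present)
    if best is None:
        # All remaining assortments are identical: fall back to the majority.
        return Counter(labels).most_common(1)[0][0]
    _, j, absent, present = best
    return (j, grow_tree(absent), grow_tree(present))


def tree_predict(node, assortment):
    """Follow presence/absence splits until a leaf label is reached."""
    while isinstance(node, tuple):
        j, if_absent, if_present = node
        node = if_present if assortment[j] else if_absent
    return node


def fit_forest(rows, n_trees=20, seed=0):
    """Fit each tree on a bootstrap resample of the transactions."""
    rng = random.Random(seed)
    return [grow_tree([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]


def choice_probabilities(forest, assortment):
    """Estimate choice probabilities as the fraction of trees voting for each label."""
    votes = Counter(tree_predict(tree, assortment) for tree in forest)
    return {label: count / len(forest) for label, count in votes.items()}


# Synthetic data (an assumption for illustration): one customer type who
# prefers product 2 over 0 over 1 and buys the best-ranked product offered.
ranking = [2, 0, 1]


def ranked_choice(assortment):
    for p in ranking:
        if assortment[p]:
            return p
    return NO_PURCHASE


rows = [(a, ranked_choice(a)) for a in product((0, 1), repeat=3) for _ in range(30)]
forest = fit_forest(rows)
probs = choice_probabilities(forest, (1, 1, 0))
print(probs)
```

Reading a fitted tree from the root down shows which products' presence drives the decision, which is the sense in which such a model is interpretable. Note that this sketch does nothing to stop a tree from predicting a product that is absent from an unseen assortment, and it omits prices and customer features.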
