Beyond NDCG: behavioral testing of recommender systems with RecList
RecList addresses the need of developers and researchers for more nuanced, real-world testing of recommender systems. The contribution is incremental, since it builds on existing behavioral testing concepts.
The authors address the problem of evaluating recommender systems beyond traditional metrics such as NDCG. They propose RecList, a behavioral testing methodology that organizes recommender systems by use case and provides a plug-and-play testing procedure. They use it to analyze known algorithms and black-box commercial systems, and they release it as an open-source package.
As with most machine learning systems, recommender systems are typically evaluated through performance metrics computed over held-out data points. However, real-world behavior is undoubtedly nuanced: ad hoc error analysis and deployment-specific tests must be employed to ensure the desired quality in actual deployments. In this paper, we propose RecList, a behavioral-based testing methodology. RecList organizes recommender systems by use case and introduces a general plug-and-play procedure to scale up behavioral testing. We demonstrate its capabilities by analyzing known algorithms and black-box commercial systems, and we release RecList as an open-source, extensible package for the community.
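To make the contrast between a single held-out metric and a behavioral test concrete, here is a minimal sketch in plain Python. It does not use the RecList package or its API. The function names (`hit_rate_at_k`, `hit_rate_by_slice`), the toy predictions, and the head/tail split are all hypothetical, chosen only to show how an aggregate score can hide a failure on one slice of the data.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def hit_rate_at_k(predictions: Sequence[List[str]],
                  targets: Sequence[str], k: int) -> float:
    """Fraction of test cases whose target appears in the top-k predictions."""
    if not targets:
        return 0.0
    hits = sum(1 for preds, target in zip(predictions, targets)
               if target in preds[:k])
    return hits / len(targets)


def hit_rate_by_slice(predictions: Sequence[List[str]],
                      targets: Sequence[str],
                      slice_of: Callable[[str], str],
                      k: int) -> Dict[str, float]:
    """Same metric, reported separately for each slice of target items."""
    buckets = defaultdict(lambda: [0, 0])  # slice name -> [hits, total]
    for preds, target in zip(predictions, targets):
        bucket = buckets[slice_of(target)]
        bucket[0] += int(target in preds[:k])
        bucket[1] += 1
    return {name: hits / total for name, (hits, total) in buckets.items()}


# Toy data (hypothetical): four test cases, each with a ranked prediction list.
predictions = [["a", "b"], ["a", "c"], ["b", "a"], ["a", "b"]]
targets = ["a", "a", "z", "b"]

# Assumption for illustration: items "a" and "b" are popular ("head") items.
popular_items = {"a", "b"}


def popularity_slice(item: str) -> str:
    return "head" if item in popular_items else "tail"


overall = hit_rate_at_k(predictions, targets, k=2)
per_slice = hit_rate_by_slice(predictions, targets, popularity_slice, k=2)

print(overall)    # 0.75
print(per_slice)  # {'head': 1.0, 'tail': 0.0}
```

The aggregate hit rate of 0.75 looks acceptable, yet the per-slice view shows that the toy model never retrieves the less popular item. A behavioral test would turn that observation into an explicit check, for example asserting a minimum hit rate on the tail slice.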