Feature Selection and Model Comparison on Microsoft Learning-to-Rank Data Sets
This work addresses the core problem of information retrieval for search engines, but it is incremental as it applies standard methods to an existing dataset.
The study tackled feature selection and model comparison for learning-to-rank in search engines, finding that not all 137 features are useful and that boosted trees and random forests achieve the best prediction performance on the MSLR-WEB dataset.
With the rapid growth of the Internet, search engines (e.g., Google, Bing, Yahoo!) are used by billions of users each day. The main function of a search engine is to locate the webpages most relevant to a user's request. This report focuses on the core problem of information retrieval: how to learn the relevance between a document (very often a webpage) and a query given by the user. Our analysis consists of two parts: 1) we use standard statistical methods to select important features among the 137 candidates provided by information retrieval researchers at Microsoft. We find that not all the features are useful, and we interpret the top-selected features; 2) we provide prediction baselines on the real-world dataset MSLR-WEB using various learning algorithms. We find that boosted trees and random forests generally achieve the best prediction performance. This agrees with the mainstream view in the information retrieval community that tree-based algorithms outperform the other candidates for this problem.
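The abstract does not say which statistical methods are used for feature selection. As a minimal sketch of what a univariate screening step could look like, the Python below ranks features by the absolute Pearson correlation between each feature column and a graded relevance label. The choice of Pearson correlation, the helper names `pearson` and `rank_features`, and the three-feature toy data are all assumptions made for illustration; they are not taken from the report or from MSLR-WEB.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(rows, labels):
    """Return (feature_index, |correlation with label|) pairs, strongest first."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((j, abs(pearson(column, labels))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: feature 0 drives the label, feature 1 is pure noise,
# and feature 2 is constant (so it carries no information at all).
rng = random.Random(0)
rows = []
labels = []
for _ in range(200):
    signal = rng.random()
    noise = rng.random()
    rows.append([signal, noise, 1.0])
    labels.append(round(4 * signal))  # assumed graded relevance in 0..4

ranking = rank_features(rows, labels)
for index, score in ranking:
    print(f"feature {index}: |r| = {score:.3f}")
```

On this toy data the informative feature is ranked first and the constant feature last. A screen of this kind only captures linear, one-feature-at-a-time relationships, so it is at best a first pass before model-based selection.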