Reinforcement learning from comparisons: Three alternatives is enough, two is not
This addresses a fundamental issue in decision-making and ranking systems where intransitive preferences arise, offering a reliable method for applications like recommendation systems or social choice.
The paper tackled the problem of selecting the best alternative using pairwise comparisons that may be intransitive, and proved that a reinforcement learning model converges to the optimal solution when reinforcing the winner among three random alternatives, while a simpler two-alternative model can cycle and fail to converge.
The paper deals with the problem of finding the best alternatives on the basis of pairwise comparisons when these comparisons need not be transitive. In this setting, we study a reinforcement urn model. We prove convergence to the optimal solution when reinforcement of a winning alternative occurs each time after considering three random alternatives. The simpler process, which reinforces the winner of a random pair does not always converges: it may cycle.