IR CL · Apr 4, 2024

Do Large Language Models Rank Fairly? An Empirical Study on the Fairness of LLMs as Rankers

Yuan Wang, Xuyang Wu, Hsin-Tai Wu, Zhiqiang Tao, Yi Fang

arXiv:2404.03192v226.837 citations · h-index: 6 · NAACL

Originality: Synthesis-oriented

AI Analysis

This work addresses fairness concerns for users and content creators in search systems, but it is incremental: it builds on existing datasets and methods without introducing new solutions.

The paper examines the fairness of Large Language Models (LLMs) used as rankers in information retrieval. It finds that LLMs exhibit biases in how they represent binary protected attributes such as gender and geographic location, and that fairness metrics show disparities relative to traditional ranking models.

The integration of Large Language Models (LLMs) in information retrieval has prompted a critical reevaluation of fairness in text-ranking models. LLMs, such as GPT models and Llama2, have shown effectiveness in natural language understanding tasks, and prior works (e.g., RankGPT) have demonstrated that LLMs outperform traditional ranking models on the ranking task. However, their fairness remains largely unexplored. This paper presents an empirical study evaluating these LLMs using the TREC Fair Ranking dataset, focusing on the representation of binary protected attributes such as gender and geographic location, which are historically underrepresented in search outcomes. Our analysis examines how these LLMs handle queries and documents related to these attributes, aiming to uncover biases in their ranking algorithms. We assess fairness from both user and content perspectives, contributing an empirical benchmark for evaluating LLMs as fair rankers.

