CY AI DL IR SI SOC-PH

May 29, 2025

Whose Name Comes Up? Auditing LLM-Based Scholar Recommendations

Daniele Barolo, Chiara Valentin, Fariba Karimi, Luis Galárraga, Gonzalo G. Méndez, Lisette Espín-Noboa

arXiv:2506.00074v23.32 citations

h-index: 2

Originality: Synthesis-oriented

AI Analysis

This work addresses reliability and equity issues in LLM-based scholar recommendations for academic and research applications, but it is incremental: it audits existing models without proposing new methods.

This paper evaluated six open-weight LLMs on recommending physics experts across five tasks, revealing inconsistencies and biases such as favoring senior scholars and replicating gender and ethnic imbalances, with mixtral-8x7b showing the most stable outputs and llama3.1-70b the highest variability.

This paper evaluates the performance of six open-weight LLMs (llama3-8b, llama3.1-8b, gemma2-9b, mixtral-8x7b, llama3-70b, llama3.1-70b) in recommending experts in physics across five tasks: top-k experts by field, influential scientists by discipline, epoch, seniority, and scholar counterparts. The evaluation examines consistency, factuality, and biases related to gender, ethnicity, academic popularity, and scholar similarity. Using ground-truth data from the American Physical Society and OpenAlex, we establish scholarly benchmarks by comparing model outputs to real-world academic records. Our analysis reveals inconsistencies and biases across all models. mixtral-8x7b produces the most stable outputs, while llama3.1-70b shows the highest variability. Many models exhibit duplication, and some, particularly gemma2-9b and llama3.1-8b, struggle with formatting errors. LLMs generally recommend real scientists, but accuracy drops in field-, epoch-, and seniority-specific queries, consistently favoring senior scholars. Representation biases persist, replicating gender imbalances (reflecting male predominance), under-representing Asian scientists, and over-representing White scholars. Despite some diversity in institutional and collaboration networks, models favor highly cited and productive scholars, reinforcing the rich-get-richer effect while offering limited geographical representation. These findings highlight the need to improve LLMs for more reliable and equitable scholarly recommendations.
