CLSep 19, 2024

Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards

Furkan Şahinuç, Thy Thy Tran, Yulia Grishina, Yufang Hou, Bei Chen, Iryna Gurevych

arXiv:2409.12656v115.726 citationsh-index: 16Has Code

Originality Incremental advance

AI Analysis

This addresses the challenge of maintaining up-to-date leaderboards for researchers and practitioners, though it is incremental as it builds on existing automation efforts.

The paper tackles the problem of manually constructing scientific leaderboards by introducing SciLead, a manually-curated dataset, and an LLM-based framework for automated construction, achieving correct identification of task-dataset-metric triples but struggling with result extraction.

Scientific leaderboards are standardized ranking systems that facilitate evaluating and comparing competitive methods. Typically, a leaderboard is defined by a task, dataset, and evaluation metric (TDM) triple, allowing objective performance assessment and fostering innovation through benchmarking. However, the exponential increase in publications has made it infeasible to construct and maintain these leaderboards manually. Automatic leaderboard construction has emerged as a solution to reduce manual labor. Existing datasets for this task are based on the community-contributed leaderboards without additional curation. Our analysis shows that a large portion of these leaderboards are incomplete, and some of them contain incorrect information. In this work, we present SciLead, a manually-curated Scientific Leaderboard dataset that overcomes the aforementioned problems. Building on this dataset, we propose three experimental settings that simulate real-world scenarios where TDM triples are fully defined, partially defined, or undefined during leaderboard construction. While previous research has only explored the first setting, the latter two are more representative of real-world applications. To address these diverse settings, we develop a comprehensive LLM-based framework for constructing leaderboards. Our experiments and analysis reveal that various LLMs often correctly identify TDM triples while struggling to extract result values from publications. We make our code and data publicly available.

View on arXiv PDF Code

Similar