CL | Feb 29, 2024

FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition

MILA
arXiv:2403.00126v2 | 1 citation | h-index: 6 | EMNLP
Originality: Incremental advance
AI Analysis

This work addresses the need of researchers and developers for more interpretable, fine-grained evaluation of LLMs, though it is an incremental improvement on existing evaluation paradigms.

The paper tackles the evaluation of large language models (LLMs) by proposing FAC$^2$E, a framework that dissociates language and cognitive skills into fine-grained capabilities and evaluates them through intermediate reasoning steps. Using the framework, the authors identify a common shortfall in knowledge utilization across models and report that a simple knowledge-enhanced method mitigates it and improves performance.

Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks. However, such a paradigm fails to comprehensively differentiate fine-grained language and cognitive skills, leaving LLMs' capabilities insufficiently interpreted. In this paper, we present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation. Specifically, we formulate LLM evaluation in a multi-dimensional and explainable manner by dissociating the language-related capabilities from the cognition-related ones. In addition, by extracting the intermediate reasoning from LLMs, we further break down the process of applying a specific capability into three sub-steps: recalling relevant knowledge, utilizing knowledge, and solving problems. Finally, FAC$^2$E evaluates each sub-step of each fine-grained capability, providing a two-faceted diagnosis for LLMs. Utilizing FAC$^2$E, we identify a common shortfall in knowledge utilization among models and propose a straightforward, knowledge-enhanced method to mitigate this issue. Our results not only showcase promising performance enhancements but also highlight a direction for future LLM advancements.
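
The three-sub-step breakdown described in the abstract can be pictured as a small score table with one row per capability and one column per sub-step. The following is a minimal sketch under stated assumptions, not the paper's implementation: the `StepScores` record, the helper functions, the capability names, and the score values are all hypothetical and chosen only for illustration. The placeholder numbers are not results from the paper; they are set so that knowledge utilization comes out lowest, mirroring the shortfall the abstract reports.

```python
from dataclasses import dataclass
from statistics import mean

# The three sub-steps named in the abstract.
SUB_STEPS = ("recall_knowledge", "utilize_knowledge", "solve_problem")


@dataclass(frozen=True)
class StepScores:
    """Hypothetical record: one score in [0, 1] per sub-step of a capability."""
    capability: str   # illustrative name, not taken from the paper
    dimension: str    # "language" or "cognition"
    recall_knowledge: float
    utilize_knowledge: float
    solve_problem: float


def per_step_average(records):
    """Average each sub-step over all capabilities."""
    return {step: mean(getattr(r, step) for r in records) for step in SUB_STEPS}


def per_dimension_average(records):
    """Average all sub-step scores within each dimension."""
    grouped = {}
    for r in records:
        grouped.setdefault(r.dimension, []).extend(getattr(r, s) for s in SUB_STEPS)
    return {dim: mean(vals) for dim, vals in grouped.items()}


def weakest_step(records):
    """Return the sub-step with the lowest average score."""
    averages = per_step_average(records)
    return min(averages, key=averages.get)


# Placeholder values for illustration only, NOT results from the paper.
records = [
    StepScores("grammaticality", "language", 0.75, 0.5, 0.75),
    StepScores("commonsense_reasoning", "cognition", 0.5, 0.25, 0.5),
]

print(per_step_average(records))
print(per_dimension_average(records))
print(weakest_step(records))
```

Keeping the sub-step scores separate, rather than collapsing them into one number per task, is what lets a diagnosis point at a specific stage such as knowledge utilization instead of only at an overall score.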

