CL AI May 8, 2023

ANALOGICAL -- A Novel Benchmark for Long Text Analogy Evaluation in Large Language Models

Thilini Wijesiriwardene, Ruwan Wickramarachchi, Bimal G. Gajera, Shreeyash Mukul Gowaikar, Chandan Gupta, Aman Chadha, Aishwarya Naresh Reganti, Amit Sheth, Amitava Das

arXiv:2305.05050v3 4.917 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses a gap in the intrinsic evaluation of LLMs by providing a domain-specific benchmark for analogy tasks, but the contribution is incremental because it builds on existing analogy concepts.

The authors address the understudied problem of evaluating whether large language models can draw analogies between long texts, and introduce the ANALOGICAL benchmark with six complexity levels. Their evaluation of eight LLMs found that identifying analogies becomes harder as complexity increases, though the abstract reports no concrete numbers.

Over the past decade, analogies, in the form of word-level analogies, have played a significant role as an intrinsic measure of evaluating the quality of word embedding methods such as word2vec. Modern large language models (LLMs), however, are primarily evaluated on extrinsic measures based on benchmarks such as GLUE and SuperGLUE, and there are only a few investigations on whether LLMs can draw analogies between long texts. In this paper, we present ANALOGICAL, a new benchmark to intrinsically evaluate LLMs across a taxonomy of analogies of long text with six levels of complexity -- (i) word, (ii) word vs. sentence, (iii) syntactic, (iv) negation, (v) entailment, and (vi) metaphor. Using thirteen datasets and three different distance measures, we evaluate the abilities of eight LLMs in identifying analogical pairs in the semantic vector space. Our evaluation finds that it is increasingly challenging for LLMs to identify analogies when going up the analogy taxonomy.
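As a rough illustration of what "identifying analogical pairs in the semantic vector space" can look like, here is a minimal sketch that scores pairs of embedding vectors with two distance measures. The abstract does not name the three measures used, so cosine and Euclidean distance are assumptions chosen for illustration, and the vectors are made-up toy values, not model outputs or benchmark data. The function names are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean_pair_distance(pairs, distance):
    """Average distance over a list of (vector, vector) pairs."""
    return sum(distance(u, v) for u, v in pairs) / len(pairs)

# Toy embeddings (assumed values): each tuple stands for one analogical pair,
# e.g. a sentence and its paraphrase encoded by some LLM.
toy_pairs = [
    ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
    ([0.0, 1.0, 0.0], [0.1, 0.8, 0.1]),
]

for name, fn in [("cosine", cosine_distance), ("euclidean", euclidean_distance)]:
    print(name, round(mean_pair_distance(toy_pairs, fn), 4))
```

Under this reading, a lower mean distance for a given level of the taxonomy would suggest that a model places the two texts of an analogical pair closer together. How the benchmark actually aggregates and normalizes distances is not stated in the abstract.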
