CL, AI | Mar 2, 2024

ParallelPARC: A Scalable Pipeline for Generating Natural-Language Analogies

Oren Sultan, Yonatan Bitton, Ron Yosef, Dafna Shahaf

arXiv:2403.01139v417.737 citations | h-index: 26 | Has Code | NAACL

Originality: Incremental advance

AI Analysis

The paper addresses the shortage of analogy data available to AI researchers, though the advance is incremental: it builds on existing LLM capabilities for dataset generation.

The authors tackle the lack of large-scale datasets for complex analogies by developing ParallelPARC, a pipeline that uses LLMs to generate paragraph-based analogies. On the resulting ProPara-Logy dataset, humans outperformed the best models by about 13% in recognition tasks.

Analogy-making is central to human cognition, allowing us to adapt to novel situations -- an ability that current AI systems still lack. Most analogy datasets today focus on simple analogies (e.g., word analogies); datasets including complex types of analogies are typically manually curated and very small. We believe that this holds back progress in computational analogy. In this work, we design a data generation pipeline, ParallelPARC (Parallel Paragraph Creator), leveraging state-of-the-art Large Language Models (LLMs) to create complex, paragraph-based analogies, as well as distractors, both simple and challenging. We demonstrate our pipeline and create ProPara-Logy, a dataset of analogies between scientific processes. We publish a gold-set, validated by humans, and a silver-set, generated automatically. We test LLMs' and humans' analogy recognition in binary and multiple-choice settings, and find that humans outperform the best models (~13% gap) after light supervision. We demonstrate that our silver-set is useful for training models. Lastly, we show that challenging distractors confuse LLMs, but not humans. We hope our pipeline will encourage research in this emerging field.
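
As a rough illustration of the two evaluation settings named in the abstract, the sketch below scores a binary task (is a candidate paragraph analogous to the base paragraph?) and a multiple-choice task (which of several candidates is the analogous one?). The data layout, the function names `binary_accuracy` and `multiple_choice_accuracy`, and the toy predictions are assumptions made for illustration; they are not taken from the paper or its code release.

```python
from typing import Sequence


def _check(predicted: Sequence, gold: Sequence) -> None:
    """Reject mismatched or empty inputs before scoring."""
    if len(predicted) != len(gold):
        raise ValueError("predictions and gold labels must have the same length")
    if len(gold) == 0:
        raise ValueError("cannot score an empty set")


def binary_accuracy(predicted: Sequence[bool], gold: Sequence[bool]) -> float:
    """Binary setting: each item is one (base, candidate) pair labelled
    True if the candidate is an analogy of the base, False otherwise."""
    _check(predicted, gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def multiple_choice_accuracy(chosen: Sequence[int], gold: Sequence[int]) -> float:
    """Multiple-choice setting: each item is one base paragraph plus several
    candidates; the values are the index of the chosen / correct candidate."""
    _check(chosen, gold)
    return sum(c == g for c, g in zip(chosen, gold)) / len(gold)


if __name__ == "__main__":
    # Toy, hand-written predictions (not results from the paper).
    print(binary_accuracy([True, False, True, True], [True, False, False, True]))
    print(multiple_choice_accuracy([0, 2, 1], [0, 2, 3]))
```

Both settings reduce to plain accuracy here; the only difference is what a single prediction represents (a yes/no judgement versus the index of the selected candidate).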
