CL, Oct 28, 2025

MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference

arXiv:2510.24295v1
h-index: 13
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient generalization testing in NLP, though it is an incremental advance because it builds on existing benchmarks.

The authors tackled the problem of evaluating language models' robustness in natural language inference by proposing MERGE, an automated method to generate high-quality variants of NLI problems through word replacements that preserve reasoning. Their results showed that NLI models perform 4-20% worse on these variants, indicating low generalizability.

In recent years, many generalization benchmarks have shown language models' lack of robustness in natural language inference (NLI). However, manually creating new benchmarks is costly, while automatically generating high-quality ones, even by modifying existing benchmarks, is extremely difficult. In this paper, we propose a methodology for automatically generating high-quality variants of original NLI problems by replacing open-class words, while crucially preserving their underlying reasoning. We dub our generalization test MERGE (Minimal Expression-Replacements GEneralization), which evaluates the correctness of models' predictions across reasoning-preserving variants of the original problem. Our results show that NLI models perform 4-20% worse on variants, suggesting low generalizability even on such minimally altered problems. We also analyse how the word class of the replacements, word probability, and plausibility influence NLI models' performance.
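
The replace-and-re-evaluate idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`make_variants`, `variant_accuracy`), the hand-picked replacement words, and the two toy predictors are all hypothetical, and the sketch simply assumes that swapping the same word into both premise and hypothesis keeps the gold label unchanged.

```python
import re
from typing import Callable, List, Tuple

Problem = Tuple[str, str]  # (premise, hypothesis)


def make_variants(premise: str, hypothesis: str,
                  target: str, candidates: List[str]) -> List[Problem]:
    """Replace each whole-word occurrence of `target` in BOTH sentences.

    Hypothetical helper: assumes every candidate preserves the reasoning
    (and hence the gold label) of the original problem.
    """
    pattern = re.compile(rf"\b{re.escape(target)}\b")
    variants = []
    for cand in candidates:
        # A function replacement avoids backslash interpretation in `cand`.
        new_premise = pattern.sub(lambda _m: cand, premise)
        new_hypothesis = pattern.sub(lambda _m: cand, hypothesis)
        variants.append((new_premise, new_hypothesis))
    return variants


def variant_accuracy(predict: Callable[[str, str], str],
                     variants: List[Problem], gold: str) -> float:
    """Fraction of variants on which the prediction equals the gold label."""
    if not variants:
        raise ValueError("at least one variant is required")
    correct = sum(predict(p, h) == gold for p, h in variants)
    return correct / len(variants)


# Toy predictors (assumptions for illustration, not real NLI models).
def overlap_predictor(premise: str, hypothesis: str) -> str:
    """Predict entailment when every hypothesis word occurs in the premise."""
    covered = set(hypothesis.lower().split()) <= set(premise.lower().split())
    return "entailment" if covered else "neutral"


def memorizing_predictor(premise: str, hypothesis: str) -> str:
    """Predict entailment only when the originally seen word is present."""
    return "entailment" if "guitar" in hypothesis else "neutral"


variants = make_variants(
    "A man is playing a guitar on stage",
    "A man is playing a guitar",
    target="guitar",
    candidates=["violin", "banjo"],
)
robust_score = variant_accuracy(overlap_predictor, variants, "entailment")
brittle_score = variant_accuracy(memorizing_predictor, variants, "entailment")
```

Under these assumptions, a predictor that relies on the sentence structure keeps its accuracy on the variants, while one that has latched onto the original word does not. A gap of this kind between original and variant performance is what the abstract's 4-20% figure reports.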

