CL AI LG Apr 22, 2025

FinNLI: Novel Dataset for Multi-Genre Financial Natural Language Inference Benchmarking

Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Charese H. Smiley

arXiv:2504.16188v117.012 citations h-index: 6 NAACL

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for robust financial reasoning benchmarks for NLP researchers and practitioners. Its contribution is incremental, since it focuses on dataset creation rather than novel methods.

The authors address the evaluation of natural language inference on financial text by introducing FinNLI, a benchmark of 21,304 premise-hypothesis pairs. They find that domain shift degrades general-domain NLI models: the best baselines reach only 74.57% Macro F1 for PLMs and 78.62% for LLMs, which underlines the dataset's difficulty. Instruction-tuned financial LLMs also perform poorly, which suggests limited generalizability.

We introduce FinNLI, a benchmark dataset for Financial Natural Language Inference (FinNLI) across diverse financial texts like SEC Filings, Annual Reports, and Earnings Call transcripts. Our dataset framework ensures diverse premise-hypothesis pairs while minimizing spurious correlations. FinNLI comprises 21,304 pairs, including a high-quality test set of 3,304 instances annotated by finance experts. Evaluations show that domain shift significantly degrades general-domain NLI performance. The highest Macro F1 scores for pre-trained (PLMs) and large language models (LLMs) baselines are 74.57% and 78.62%, respectively, highlighting the dataset's difficulty. Surprisingly, instruction-tuned financial LLMs perform poorly, suggesting limited generalizability. FinNLI exposes weaknesses in current LLMs for financial reasoning, indicating room for improvement.
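The scores above are reported as Macro F1, which averages per-class F1 with equal weight so that a rare label counts as much as a frequent one. The sketch below illustrates that metric only; it is not the authors' evaluation code. The three-way label set (entailment, contradiction, neutral) is the conventional NLI scheme and is assumed here, and the toy premise-hypothesis labels are hypothetical.

```python
from collections import Counter

# Assumed label set: the conventional three-way NLI scheme.
LABELS = ("entailment", "contradiction", "neutral")


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 over the given labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[g] += 1  # missed the true label g
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, not drawn from FinNLI.
gold = ["entailment", "entailment", "contradiction", "neutral"]
pred = ["entailment", "contradiction", "contradiction", "neutral"]
print(f"Macro F1: {macro_f1(gold, pred):.4f}")
```

Because each class contributes equally, a model that ignores one label is penalized heavily even when its overall accuracy looks acceptable.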
