LG CL · Nov 28, 2025

Measuring What LLMs Think They Do: SHAP Faithfulness and Deployability on Financial Tabular Classification

Saeed AlMarri, Mathieu Ravaut, Kristof Juhasz, Gautier Marti, Hamdan Al Ahbabi, Ibrahim Elfadel

arXiv:2512.00163v1 · 7.11 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses reliability concerns for LLMs in high-stakes financial applications such as risk assessment, though its contribution in highlighting explainability issues is incremental.

The study evaluated LLMs on financial classification tasks and found that their self-explanations of feature impact diverge from their SHAP values, and that their SHAP values differ notably from LightGBM's. This indicates limitations for LLMs as standalone classifiers, but also potential if explainability mechanisms improve.

Large Language Models (LLMs) have attracted significant attention for classification tasks, offering a flexible alternative to trusted classical machine learning models like LightGBM through zero-shot prompting. However, their reliability for structured tabular data remains unclear, particularly in high-stakes applications like financial risk assessment. Our study systematically evaluates LLMs and generates their SHAP values on financial classification tasks. Our analysis shows a divergence between LLMs' self-explanations of feature impact and their SHAP values, as well as notable differences between LLM and LightGBM SHAP values. These findings highlight the limitations of LLMs as standalone classifiers for structured financial modeling, but also instill optimism that improved explainability mechanisms coupled with few-shot prompting will make LLMs usable in risk-sensitive domains.
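
The abstract does not say how the SHAP values are computed or how divergence is measured, so the following is only a minimal sketch of the general idea, not the authors' pipeline. It computes exact Shapley values for a black-box classifier by replacing "absent" features with a baseline row, then compares the resulting importance ranking with a self-reported one using Spearman rank correlation. Everything named here is an assumption made for illustration: `toy_predict` is a hypothetical stand-in for an LLM's predicted probability, and the feature names, baseline, and self-reported scores are invented.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one row x; absent features take baseline values."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                without_i = [x[j] if j in present else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def importance_ranks(values):
    """Rank features by descending absolute value (1 = most important, no tie handling)."""
    order = sorted(range(len(values)), key=lambda k: -abs(values[k]))
    ranks = [0] * len(values)
    for position, k in enumerate(order, start=1):
        ranks[k] = position
    return ranks


def spearman(a, b):
    """Spearman rank correlation between two importance vectors (assumes no ties)."""
    ra, rb = importance_ranks(a), importance_ranks(b)
    n = len(a)
    d2 = sum((p - q) ** 2 for p, q in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))


def toy_predict(row):
    """Hypothetical stand-in for an LLM's predicted default probability."""
    income, debt_ratio, age = row
    return 0.5 - 0.3 * income + 0.4 * debt_ratio + 0.0 * age


x = [1.0, 1.0, 1.0]          # instance to explain (illustrative values)
baseline = [0.0, 0.0, 0.0]   # reference row used for "absent" features
phi = shapley_values(toy_predict, x, baseline)

# Hypothetical self-explanation: the model claims age matters most
self_reported = [0.2, 0.1, 0.9]
rho = spearman(phi, self_reported)
print("Shapley values:", phi)
print("Spearman correlation with self-report:", rho)
```

Exact enumeration costs on the order of 2^n model calls per feature, so it is only practical for a handful of features. With an LLM behind `predict`, a sampling-based approximation would be the more realistic choice.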
