Argument Quality Annotation and Gender Bias Detection in Financial Communication through Large Language Models
This work addresses the need for reliable, bias-aware annotation methods in financial communication, which matter for investment decisions and public trust. The contribution is incremental, however, since it applies existing models to a new domain.
The paper tackled the problem of assessing argument quality in financial communications by evaluating three large language models (GPT-4o, Llama 3.1, and Gemma 2) on the FinArgQuality dataset, finding that LLM-based annotations achieved higher inter-annotator agreement than human counterparts but exhibited varying degrees of gender bias.
Financial arguments play a critical role in shaping investment decisions and public trust in financial institutions. Nevertheless, assessing their quality remains poorly studied in the literature. In this paper, we examine the capabilities of three state-of-the-art LLMs (GPT-4o, Llama 3.1, and Gemma 2) in annotating argument quality within financial communications, using the FinArgQuality dataset. Our contributions are twofold. First, we evaluate the consistency of LLM-generated annotations across multiple runs and benchmark them against human annotations. Second, we introduce an adversarial attack designed to inject gender bias, in order to analyse how the models respond and to assess their fairness and robustness. Both experiments are conducted across three temperature settings to assess their influence on annotation stability and alignment with human labels. Our findings reveal that LLM-based annotations achieve higher inter-annotator agreement than human counterparts, though the models still exhibit varying degrees of gender bias. We provide a multifaceted analysis of these outcomes and offer practical recommendations to guide future research toward more reliable, cost-effective, and bias-aware annotation methodologies.
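The two measurements described above, agreement across repeated annotation runs and sensitivity to an injected gender cue, can be illustrated with a minimal sketch. The abstract does not state which agreement statistic or bias measure the paper uses, so Cohen's kappa averaged over run pairs and a simple label flip rate are assumptions made here for illustration. The function names (`cohen_kappa`, `mean_pairwise_kappa`, `flip_rate`) and the toy labels are hypothetical and are not taken from the paper or from FinArgQuality.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(runs):
    """Average Cohen's kappa over all pairs of annotation runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


def flip_rate(original, perturbed):
    """Fraction of items whose label changes after a perturbation."""
    if len(original) != len(perturbed) or not original:
        raise ValueError("label sequences must be non-empty and equal length")
    return sum(x != y for x, y in zip(original, perturbed)) / len(original)


# Toy data (invented): three runs of one model over six arguments,
# with quality labels 0, 1, or 2.
runs = [
    [2, 1, 0, 2, 1, 1],
    [2, 1, 0, 2, 1, 0],
    [2, 1, 1, 2, 1, 1],
]
# Labels for the same six arguments after a gender cue is injected
# into the prompt (assumed setup, not the paper's exact attack).
perturbed = [2, 0, 0, 2, 1, 1]

print("mean pairwise kappa:", mean_pairwise_kappa(runs))
print("flip rate under perturbation:", flip_rate(runs[0], perturbed))
```

Under these assumptions, repeating the computation for each temperature setting would show whether higher temperatures lower run-to-run agreement or raise the flip rate. A chance-corrected statistic is used instead of raw agreement because skewed label distributions can inflate raw agreement.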