CL · Nov 11, 2024

Greenback Bears and Fiscal Hawks: Finance is a Jungle and Text Embeddings Must Adapt

arXiv:2411.07142v1 · 24 citations · h-index: 5 · EMNLP
Originality: Synthesis-oriented
AI Analysis

This paper addresses the poor performance of general-purpose text embeddings on financial text, a problem for both researchers and practitioners. The contribution is incremental: it adapts existing methods to a specific domain.

The paper tackles the challenge of specialized terminology in financial documents by developing BAM embeddings, which achieve 62.8% Recall@1 on a held-out test set, compared with 39.2% for the best general-purpose embedding from OpenAI, and raise question answering accuracy on FinanceBench by 8%.

Financial documents are filled with specialized terminology, arcane jargon, and curious acronyms that pose challenges for general-purpose text embeddings. Yet few text embeddings specialized for finance have been reported in the literature, perhaps in part due to a lack of public datasets and benchmarks. We present BAM embeddings, a set of text embeddings finetuned on a carefully constructed dataset of 14.3M query-passage pairs. Demonstrating the benefits of domain-specific training, BAM embeddings achieve Recall@1 of 62.8% on a held-out test set, vs. only 39.2% for the best general-purpose text embedding from OpenAI. Further, BAM embeddings increase question answering accuracy by 8% on FinanceBench and show increased sensitivity to the finance-specific elements found in detailed, forward-looking, and company- and date-specific queries. To support further research, we describe our approach in detail and quantify the importance of hard negative mining and dataset scale.
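
For readers unfamiliar with the Recall@1 metric quoted above, the following is a minimal sketch of how Recall@k is commonly computed for embedding-based retrieval. It is not the authors' evaluation code. The function names (`cosine`, `recall_at_k`), the use of cosine similarity, and the toy two-dimensional vectors are all assumptions made for illustration; the paper's test set and embedding model are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(query_vecs, passage_vecs, gold_indices, k=1):
    """Fraction of queries whose gold passage is among the top-k retrieved.

    gold_indices[i] is the index in passage_vecs of the passage that
    answers query i (hypothetical labeling, one gold passage per query).
    """
    hits = 0
    for query, gold in zip(query_vecs, gold_indices):
        scores = [cosine(query, passage) for passage in passage_vecs]
        # Rank passage indices from most to least similar to the query.
        ranked = sorted(range(len(passage_vecs)),
                        key=lambda i: scores[i], reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy data (illustrative only): three passages and three queries.
passages = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
gold = [0, 1, 2]

print(recall_at_k(queries, passages, gold, k=1))
print(recall_at_k(queries, passages, gold, k=2))
```

In this toy setup the third query ranks passage 0 above its gold passage 2, so it counts as a miss at k=1 but a hit at k=2. A Recall@1 of 62.8% means the correct passage was ranked first for 62.8% of test queries.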
