Taxonomy-Aligned Risk Extraction from 10-K Filings with Autonomous Improvement Using LLMs
This work addresses the need for automated, scalable risk analysis in finance and other domains. Its contribution is incremental, since it builds on existing LLM and embedding methods.
The researchers tackle the problem of extracting structured risk factors from corporate 10-K filings while adhering to a predefined taxonomy. In a case study, autonomous taxonomy maintenance achieves a 104.7% improvement in embedding separation, and the extracted profiles show that same-industry companies have 63% higher risk profile similarity than cross-industry pairs.
We present a methodology for extracting structured risk factors from corporate 10-K filings while maintaining adherence to a predefined hierarchical taxonomy. Our three-stage pipeline combines LLM extraction with supporting quotes, embedding-based semantic mapping to taxonomy categories, and LLM-as-a-judge validation that filters spurious assignments. To evaluate our approach, we extract 10,688 risk factors from S&P 500 companies and examine risk profile similarity across industry clusters. Beyond extraction, we introduce autonomous taxonomy maintenance, in which an AI agent analyzes evaluation feedback to identify problematic categories, diagnose failure patterns, and propose refinements, achieving a 104.7% improvement in embedding separation in a case study. External validation confirms that the taxonomy captures economically meaningful structure: same-industry companies exhibit 63% higher risk profile similarity than cross-industry pairs (Cohen's d = 1.06, AUC = 0.82, p < 0.001). The methodology generalizes to any domain requiring taxonomy-aligned extraction from unstructured text, with autonomous improvement enabling continuous quality maintenance and enhancement as systems process more documents.
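As an illustration only, the sketch below shows one way the embedding-based mapping stage and the effect-size statistic could look. It is not the authors' implementation. The function names (`cosine`, `map_to_taxonomy`, `cohens_d`), the `min_sim` threshold, the category labels, and the toy three-dimensional vectors are all hypothetical. The pooled-standard-deviation form of Cohen's d is an assumption, since the abstract does not state which variant was used.

```python
import math
from statistics import mean, variance


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_to_taxonomy(risk_vec, category_vecs, min_sim=0.5):
    """Return (category, similarity) for the closest taxonomy category.

    The category is None when the best similarity is below min_sim.
    min_sim is a hypothetical cutoff, not a value from the paper.
    """
    best_name, best_sim = None, -1.0
    for name, vec in category_vecs.items():
        sim = cosine(risk_vec, vec)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim < min_sim:
        return None, best_sim
    return best_name, best_sim


def cohens_d(xs, ys):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    nx, ny = len(xs), len(ys)
    pooled = math.sqrt(
        ((nx - 1) * variance(xs) + (ny - 1) * variance(ys)) / (nx + ny - 2)
    )
    return (mean(xs) - mean(ys)) / pooled


# Toy vectors stand in for real embeddings of category descriptions.
categories = {
    "cybersecurity": [1.0, 0.0, 0.0],
    "supply_chain": [0.0, 1.0, 0.0],
}

assigned, sim = map_to_taxonomy([0.9, 0.1, 0.0], categories)
unassigned, _ = map_to_taxonomy([0.0, 0.0, 1.0], categories)
```

In this sketch, a risk factor whose best match falls below the threshold is left unassigned rather than forced into a category. In the pipeline described above, candidate assignments would additionally pass through the LLM-as-a-judge validation stage, which is not modeled here. `cohens_d` could be applied to lists of same-industry and cross-industry pairwise similarities to obtain an effect size of the kind reported in the abstract.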