Spencer Mateega

CL
h-index2
5papers
22citations
Novelty30%
AI Score44

5 Papers

LGJan 30, 2025Code
FinanceQA: A Benchmark for Evaluating Financial Analysis Capabilities of Large Language Models

Spencer Mateega, Carlos Georgescu, Danny Tang

FinanceQA is a testing suite that evaluates LLMs' performance on complex numerical financial analysis tasks that mirror real-world investment work. Despite recent advances, current LLMs fail to meet the strict accuracy requirements of financial institutions, with models failing approximately 60% of realistic tasks that mimic on-the-job analyses at hedge funds, private equity firms, investment banks, and other financial institutions. The primary challenges include hand-spreading metrics, adhering to standard accounting and corporate valuation conventions, and performing analysis under incomplete information - particularly in multi-step tasks requiring assumption generation. This performance gap highlights the disconnect between existing LLM capabilities and the demands of professional financial analysis that are inadequately tested by current testing architectures. Results show that higher-quality training data is needed to support such tasks, which we experiment with using OpenAI's fine-tuning API. FinanceQA is publicly released at [this https URL](https://huggingface.co/datasets/AfterQuery/FinanceQA).

CLAug 28, 2025Code
UI-Bench: A Benchmark for Evaluating Design Capabilities of AI Text-to-App Tools

Sam Jung, Agustin Garcinuno, Spencer Mateega

AI text-to-app tools promise high quality applications and websites in minutes, yet no public benchmark rigorously verifies those claims. We introduce UI-Bench, the first large-scale benchmark that evaluates visual excellence across competing AI text-to-app tools through expert pairwise comparison. Spanning 10 tools, 30 prompts, 300 generated sites, and 4,000+ expert judgments, UI-Bench ranks systems with a TrueSkill-derived model that yields calibrated confidence intervals. UI-Bench establishes a reproducible standard for advancing AI-driven web design. We release (i) the complete prompt set, (ii) an open-source evaluation framework, and (iii) a public leaderboard. The generated sites rated by participants will be released soon. View the UI-Bench leaderboard at https://uibench.ai/leaderboard.

CRMay 26, 2025Code
VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation

Ethan TS. Liu, Austin Wang, Spencer Mateega et al.

Ensuring that large language models (LLMs) can effectively assess, detect, explain, and remediate software vulnerabilities is critical for building robust and secure software systems. We introduce VADER, a human-evaluated benchmark designed explicitly to assess LLM performance across four key vulnerability-handling dimensions: assessment, detection, explanation, and remediation. VADER comprises 174 real-world software vulnerabilities, each carefully curated from GitHub repositories and annotated by security experts. For each vulnerability case, models are tasked with identifying the flaw, classifying it using Common Weakness Enumeration (CWE), explaining its underlying cause, proposing a patch, and formulating a test plan. Using a one-shot prompting strategy, we benchmark six state-of-the-art LLMs (Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1, GPT-4.5, Grok 3 Beta, and o3) on VADER, and human security experts evaluated each response according to a rigorous scoring rubric emphasizing remediation (quality of the code fix, 50%), explanation (20%), and classification and test plan (30%) according to a standardized rubric. Our results show that current state-of-the-art LLMs achieve only moderate success on VADER - OpenAI's o3 attained 54.7% accuracy overall, with others in the 49-54% range, indicating ample room for improvement. Notably, remediation quality is strongly correlated (Pearson r > 0.97) with accurate classification and test plans, suggesting that models that effectively categorize vulnerabilities also tend to fix them well. VADER's comprehensive dataset, detailed evaluation rubrics, scoring tools, and visualized results with confidence intervals are publicly released, providing the community with an interpretable, reproducible benchmark to advance vulnerability-aware LLMs. All code and data are available at: https://github.com/AfterQuery/vader

SEJan 28
IDE-Bench: Evaluating Large Language Models as IDE Agents on Real-World Software Engineering Tasks

Spencer Mateega, Jeff Yang, Tiana Costello et al.

IDE-Bench is a comprehensive framework for evaluating AI IDE agents on real-world software engineering tasks through an IDE-native tool interface. We present a Dockerized test harness that goes beyond raw terminal execution, granting models a structured tool ecosystem that represents AI-native IDEs like Cursor and Windsurf. By providing high-level abstractions for codebase search, structured file editing, and tools for testing full-stack applications, IDE-Bench evaluates an agent's ability to act as a true engineering collaborator. For evaluation and to prevent training data contamination, we created 80 tasks across eight never-published repositories spanning C/C++, Java, and MERN stacks, representing modern tech stack production scenarios, including feature implementation, bug fixing, refactoring, and performance optimization that mirror daily developer workflows in private codebases. Our benchmark is the first to systematically correlate agent-reported intent with successful project-level modifications in a multi-language, full-stack environment on completely uncontaminated code.

CLDec 13, 2025
Market-Bench: Evaluating Large Language Models on Introductory Quantitative Trading and Market Dynamics

Abhay Srivastava, Sam Jung, Spencer Mateega

We introduce MARKET-BENCH, a benchmark that evaluates large language models (LLMs) on introductory quantitative trading tasks by asking them to construct executable backtesters from natural language strategy descriptions and market assumptions. Each instance specifies one of three canonical strategies: scheduled trading on Microsoft (NASDAQ: MSFT), pairs trading on Coca-Cola (NASDAQ: KO) and Pepsi (NASDAQ: PEP), or delta hedging on MSFT. Models must produce code whose profit and loss (P and L), drawdown, and position paths match a verifiable reference implementation. We assess thirteen state-of-the-art models using a multi-round evaluation that separates structural reliability (whether the backtest runs) from numerical accuracy (mean absolute error of the backtest metrics), assigning failed outputs a duplicated-metrics baseline MAE. While most models reliably execute the simplest strategy (average executable passes of 4.08 out of 5 rounds), errors vary by orders of magnitude across models and tasks. Gemini 3 Pro and Claude 4.5 Sonnet combine strong reliability with low error on simpler strategies. GPT-5.2 achieves strong overall performance with perfect executability. GPT-5.1 Codex-Max achieves the lowest best-run error on the easiest task. Qwen3 Max attains perfect executability yet sometimes produces inaccurate profit and loss paths. These results show that current LLMs can scaffold basic trading infrastructure but still struggle to reason robustly about prices, inventory, and risk. We release MARKET-BENCH and a public leaderboard at https://marketbench.ai.