Langston Nashold

h-index2

3papers

7citations

Novelty33%

AI Score39

Ranked #79,290 of 194,257 authors (top 41%)#4,930 in AI (top 39%)

3 Papers

8.6AIMar 27

FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?

Nikil Ravi, Kexing Ying, Vasilii Nesterov et al.

We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level. Each task pairs a natural-language problem with a Lean~4 formal statement, and a model must output a Lean proof accepted by the Lean 4 checker. FormalProofBench targets advanced undergraduate and graduate mathematics, with problems drawn from qualifying exams and standard textbooks across topics including analysis, algebra, probability, and logic. We evaluate a range of frontier models with an agentic harness, and find that the best-performing foundation model achieves 33.5% accuracy, with performance dropping rapidly after that. In addition to the accuracy numbers, we also provide empirical analysis of tool-use, failure modes, cost and latency, thereby providing a thorough evaluation of the formal-theorem proving abilities of frontier models.

11.5SEMar 4

Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

Hung Tran, Langston Nashold, Rayan Krishnan et al.

Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves only 58.0% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level agreement). Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.

1.2LGJul 16, 2020Code

Using LSTM and SARIMA Models to Forecast Cluster CPU Usage

Langston Nashold, Rayan Krishnan

As large scale cloud computing centers become more popular than individual servers, predicting future resource demand need has become an important problem. Forecasting resource need allows public cloud providers to proactively allocate or deallocate resources for cloud services. This work seeks to predict one resource, CPU usage, over both a short term and long term time scale. To gain insight into the model characteristics that best support specific tasks, we consider two vastly different architectures: the historically relevant SARIMA model and the more modern neural network, LSTM model. We apply these models to Azure data resampled to 20 minutes per data point with the goal of predicting usage over the next hour for the short-term task and for the next three days for the long-term task. The SARIMA model outperformed the LSTM for the long term prediction task, but performed poorer on the short term task. Furthermore, the LSTM model was more robust, whereas the SARIMA model relied on the data meeting certain assumptions about seasonality.