90.4 · AI · Apr 16 · Code
COMPOSITE-Stem
Kyle Waters, Lucas Nuzzi, Tadhg Looram et al.
AI agents hold growing promise for accelerating scientific discovery; yet, a lack of frontier evaluations hinders adoption into real workflows. Expert-written benchmarks have proven effective at measuring AI reasoning, but most at this stage have become saturated and only measure performance on constrained outputs. To help address this gap, we introduce COMPOSITE-STEM, a benchmark of 70 expert-written tasks in physics, biology, chemistry, and mathematics, curated by doctoral-level researchers. Our benchmark combines exact-match grading and criterion-based rubrics with an LLM-as-a-jury grading protocol, allowing more flexible assessment of scientifically meaningful outputs. Using an adapted multimodal Terminus-2 agent harness within the Harbor agentic evaluation framework, we evaluate four frontier models. The top-performing model achieves 21%, demonstrating that COMPOSITE-STEM captures capabilities beyond current agent reach. All tasks are open-sourced with contributor permission to support reproducibility and to promote additional research towards AI's acceleration of scientific progress in these domains.
LG · Jan 24, 2025
Humanity's Last Exam
Long Phan, Alice Gatti, Ziwen Han et al. · amazon-science, apple-ml
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
90.7 · CC · Apr 28
Symmetric Parameterised Holants on Hypergraphs: Towards a Classification for Parameterised VCSPs
Panagiotis Aivasiliotis, Andreas Göbel, Marc Roth
We study the complexity of the parameterised counting constraint satisfaction problem: given a set of constraints over a set of variables and a positive integer $k$, how many ways are there to assign $k$ variables to 1 (and the others to 0) such that all constraints are satisfied. Existing work has so far exclusively focused on restricted settings such as finding and counting homomorphisms between relational structures due to Grohe (JACM 2007) and Dalmau and Jonsson (TCS 2004), or the case of finite constraint languages due to Creignou and Vollmer (SAT 2012), and Bulatov and Marx (SICOMP 2014). In this work, we tackle a more general setting of Valued Parameterised Counting Constraint Satisfaction Problems (VCSPs) with infinite constraint languages. In this setting we are able to model significantly more general problems such as (weighted) parameterised factor problems on hypergraphs and counting weight-$k$ solutions of systems of linear equations, not captured by existing complexity classifications. We express parameterised VCSPs as parameterised Holant problems on uniform hypergraphs, and we establish complete and explicit complexity dichotomy theorems. For resolving the $\mathrm{P}$ vs. $\#\mathrm{P}$ question, we mainly rely on hypergraph gadgets, the existence of which we prove using properties of degree sequences necessary for realisability in uniform hypergraphs. For the $\mathrm{FPT}$ vs. $\#\mathrm{W}[1]$ question, we build upon the recently established combinatorial toolkit for parameterised holants on the special case of graphs by Aivasiliotis et al. (ICALP 2025) and also rely on an extension of the framework of the homomorphism basis due to Curticapean, Dell and Marx (STOC 2017) to uniform hypergraphs. As a technical highlight, we also employ Curticapean's "CFI Filters" (SODA 2024) to establish polynomial-time algorithms for isolating vectors in the homomorphism basis.
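To make the counting problem in the abstract's first sentence concrete, here is a minimal brute-force sketch in Python. It is not the paper's method: it enumerates all weight-$k$ assignments and therefore takes time on the order of $\binom{n}{k}$, which is exactly the cost the classification results aim to characterise or avoid. The function name `count_weight_k` and the encoding of a constraint as a (scope, predicate) pair are assumptions made for illustration. The sketch covers only the unweighted case; a valued variant would sum products of constraint weights instead of counting.

```python
from itertools import combinations
from typing import Callable, Sequence, Tuple

# Hypothetical encoding: a constraint is a scope (variable indices)
# plus a predicate over the 0/1 values of those variables.
Constraint = Tuple[Sequence[int], Callable[..., bool]]


def count_weight_k(n: int, k: int, constraints: Sequence[Constraint]) -> int:
    """Count 0/1 assignments to n variables with exactly k ones
    that satisfy every constraint (naive enumeration)."""
    count = 0
    for ones in combinations(range(n), k):
        chosen = set(ones)
        assignment = [1 if i in chosen else 0 for i in range(n)]
        # An assignment counts only if all constraints hold on their scopes.
        if all(pred(*(assignment[v] for v in scope)) for scope, pred in constraints):
            count += 1
    return count


# Example: weight-2 solutions of the linear system over GF(2)
#   x0 + x1 = 1,  x2 + x3 = 1
parity_one = lambda a, b: (a + b) % 2 == 1
system = [((0, 1), parity_one), ((2, 3), parity_one)]
solutions = count_weight_k(4, 2, system)
```

In the example, a weight-2 solution must set exactly one of `x0`, `x1` and exactly one of `x2`, `x3` to 1, so `solutions` is 4.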