Johannes Schmitt

CL
h-index89
3papers
350citations
Novelty40%
AI Score40

3 Papers

LGJan 24, 2025
Humanity's Last Exam

Long Phan, Alice Gatti, Ziwen Han et al. · amazon-science, apple-ml

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.

CLSep 30, 2025
IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation

Johannes Schmitt, Gergely Bérczi, Jasper Dekoninck et al.

As the mathematical capabilities of large language models (LLMs) improve, it becomes increasingly important to evaluate their performance on research-level tasks at the frontier of mathematical knowledge. However, existing benchmarks are limited, as they focus solely on final-answer questions or high-school competition problems. To address this gap, we introduce IMProofBench, a private benchmark consisting of 39 peer-reviewed problems developed by expert mathematicians. Each problem requires a detailed proof and is paired with subproblems that have final answers, supporting both an evaluation of mathematical reasoning capabilities by human experts and a large-scale quantitative analysis through automated grading. Furthermore, unlike prior benchmarks, the evaluation setup simulates a realistic research environment: models operate in an agentic framework with tools like web search for literature review and mathematical software such as SageMath. Our results show that current LLMs can succeed at the more accessible research-level questions, but still encounter significant difficulties on more challenging problems. Quantitatively, Grok-4 achieves the highest accuracy of 52% on final-answer subproblems, while GPT-5 obtains the best performance for proof generation, achieving a fully correct solution for 22% of problems. IMProofBench will continue to evolve as a dynamic benchmark in collaboration with the mathematical community, ensuring its relevance for evaluating the next generation of LLMs.

SEMay 21, 2015
Semantic Degrees for Industrie 4.0

Chih-Hong Cheng, Tuncay Guelfirat, Christian Messinger et al.

Under the context of Industrie 4.0 (I4.0), future production systems provide balanced operations between manufacturing flexibility and efficiency, realized in an autonomous, horizontal, and decentralized item-level production control framework. Structured interoperability via precise formulations on an appropriate degree is crucial to achieve engineering efficiency in the system life cycle. However, selecting the degree of formalization can be challenging, as it crucially depends on the desired common understanding (semantic degree) between multiple parties. In this paper, we categorize different semantic degrees and map a set of technologies in industrial automation to their associated degrees. Furthermore, we created guidelines to assist engineers selecting appropriate semantic degrees in their design. We applied these guidelines on publically available scenarios to examine the validity of the approach, and identified semantic elements over internally developed use cases targeting semantically-enabled plug-and-produce.