LG · Jan 24, 2025
Humanity's Last Exam
Long Phan, Alice Gatti, Ziwen Han et al. · amazon-science, apple-ml
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
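As a rough illustration of what "accuracy and calibration" can mean for automatically graded questions, the sketch below computes plain accuracy and a binned RMS calibration error from (stated confidence, graded correctness) pairs. The abstract does not say how calibration is measured, so the metric, the bin count, and the function names `accuracy` and `rms_calibration_error` are assumptions made for illustration only.

```python
import math


def accuracy(records):
    """Fraction of answers graded correct.

    `records` is a list of (confidence, correct) pairs, where confidence is
    the model's stated probability in [0, 1] and correct is a bool.
    """
    return sum(1 for _, correct in records if correct) / len(records)


def rms_calibration_error(records, n_bins=10):
    """Binned root-mean-square gap between stated confidence and accuracy.

    Assumption: equal-width confidence bins, weighted by bin size. This is
    one common choice, not necessarily the one used by the benchmark.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, correct in records:
        # Confidence 1.0 is clamped into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))

    total = len(records)
    weighted_sq_gap = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        bin_acc = sum(1 for _, ok in members if ok) / len(members)
        weighted_sq_gap += (len(members) / total) * (mean_conf - bin_acc) ** 2
    return math.sqrt(weighted_sq_gap)


if __name__ == "__main__":
    # Hypothetical graded outputs: an overconfident model.
    graded = [(0.9, True), (0.9, False), (0.1, False), (0.1, False)]
    print("accuracy:", accuracy(graded))
    print("RMS calibration error:", rms_calibration_error(graded))
```

A model that is often wrong while reporting high confidence scores poorly on both numbers, which is the pattern the abstract describes.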
CR · May 12
HySecTwin: A Knowledge-Driven Digital Twin Framework Augmented with Hybrid Reasoning for Cyber-Physical Systems
David Holmes, Ahmad Moshin, Surya Nepal et al.
Existing Digital Twin (DT) approaches often lack semantic reasoning capabilities for effective cybersecurity modelling in Cyber-Physical Systems (CPS). This paper presents HySecTwin, a knowledge-driven digital twin architecture that places automated reasoning at the core of real-time threat detection. HySecTwin incorporates semantic modelling to transform heterogeneous CPS telemetry, device attributes, and operational relationships into machine-interpretable representations, combined with an embedded reasoning engine operating over contextualized system states. Unlike opaque detection methods, the framework integrates deterministic rule-based inference with hybrid fuzzy reasoning to generate explicit, interpretable, and auditable security assessments from live device telemetry. This enables context-aware monitoring of complex CPS environments while preserving transparency and trust. Experimental evaluation using a representative CPS testbed and MITRE ATT&CK campaign-inspired attack scenarios demonstrates sub-millisecond twin synchronization latency and up to 21.5% faster threat detection compared with deterministic reasoning alone. The results show that semantic modelling, semantic enrichment, and hybrid reasoning improve explainability and resilience without extra system overhead. HySecTwin provides a lightweight, containerized, and extensible framework for secure-by-design digital twin deployments in mission-critical infrastructures.
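To make the idea of combining deterministic rules with fuzzy reasoning concrete, here is a minimal sketch of an assessment function that first checks a crisp rule and then falls back to a fuzzy score, recording the reason for each decision. It is not HySecTwin's implementation: the telemetry fields (`failed_logins`, `packets_per_s`), the thresholds, the ramp membership function, and the 0.5 alert cutoff are all hypothetical choices for illustration.

```python
def ramp(x, low, high):
    """Fuzzy membership rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def assess(telemetry):
    """Return an alert decision together with human-readable reasons.

    `telemetry` is a dict with hypothetical keys `failed_logins` and
    `packets_per_s`; all thresholds below are illustrative assumptions.
    """
    # Deterministic rule: a crisp condition that fires unconditionally.
    if telemetry["failed_logins"] >= 10:
        return {
            "alert": True,
            "score": 1.0,
            "reasons": ["rule: failed_logins >= 10"],
        }

    # Fuzzy part: degrees of membership in "many failures" and "high traffic".
    many_failures = ramp(telemetry["failed_logins"], 2, 10)
    high_traffic = ramp(telemetry["packets_per_s"], 500, 2000)

    # Fuzzy AND, taken here as the minimum of the two memberships.
    score = min(many_failures, high_traffic)
    reasons = [
        f"fuzzy: many_failures={many_failures:.2f}, "
        f"high_traffic={high_traffic:.2f}, score={score:.2f}"
    ]
    return {"alert": score >= 0.5, "score": score, "reasons": reasons}


if __name__ == "__main__":
    print(assess({"failed_logins": 12, "packets_per_s": 100}))
    print(assess({"failed_logins": 6, "packets_per_s": 1250}))
    print(assess({"failed_logins": 1, "packets_per_s": 3000}))
```

In this toy setup the second call raises an alert even though the crisp rule has not fired, and the `reasons` list keeps the decision auditable.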
LG · Apr 19, 2023
Points of non-linearity of functions generated by random neural networks
David Holmes
We consider functions from the real numbers to the real numbers, output by a neural network with 1 hidden activation layer, arbitrary width, and ReLU activation function. We assume that the parameters of the neural network are chosen uniformly at random with respect to various probability distributions, and compute the expected distribution of the points of non-linearity. We use these results to explain why the network may be biased towards outputting functions with simpler geometry, and why certain functions with low information-theoretic complexity are nonetheless hard for a neural network to approximate.
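For a concrete picture of the setting, a one-hidden-layer ReLU network on the real line can be written as f(x) = sum_i a_i * max(0, w_i * x + b_i), and unit i can only bend the graph at x = -b_i / w_i. The sketch below samples such a network and lists those candidate points of non-linearity. The Gaussian sampling and the function names are assumptions for illustration; the abstract considers various distributions and does not specify this one.

```python
import random


def random_relu_network(width, rng):
    """Sample (w, b, a) per hidden unit: input weight, bias, output weight.

    Assumption: independent standard normal parameters, chosen only for
    illustration.
    """
    return [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1))
            for _ in range(width)]


def evaluate(units, x):
    """Compute f(x) = sum_i a_i * max(0, w_i * x + b_i)."""
    return sum(a * max(0.0, w * x + b) for w, b, a in units)


def breakpoints(units):
    """Sorted candidate points of non-linearity of f.

    Unit i switches between inactive and active where w_i * x + b_i = 0,
    i.e. at x = -b_i / w_i. Units with w_i = 0 or a_i = 0 contribute no bend.
    """
    return sorted(-b / w for w, b, a in units if w != 0 and a != 0)


if __name__ == "__main__":
    rng = random.Random(0)
    net = random_relu_network(width=8, rng=rng)
    print("breakpoints:", breakpoints(net))
```

Between two consecutive breakpoints the function is affine, so the distribution of these points governs how much geometric complexity a randomly initialised network tends to produce.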
CL · Sep 30, 2025
IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation
Johannes Schmitt, Gergely Bérczi, Jasper Dekoninck et al.
As the mathematical capabilities of large language models (LLMs) improve, it becomes increasingly important to evaluate their performance on research-level tasks at the frontier of mathematical knowledge. However, existing benchmarks are limited, as they focus solely on final-answer questions or high-school competition problems. To address this gap, we introduce IMProofBench, a private benchmark consisting of 39 peer-reviewed problems developed by expert mathematicians. Each problem requires a detailed proof and is paired with subproblems that have final answers, supporting both an evaluation of mathematical reasoning capabilities by human experts and a large-scale quantitative analysis through automated grading. Furthermore, unlike prior benchmarks, the evaluation setup simulates a realistic research environment: models operate in an agentic framework with tools like web search for literature review and mathematical software such as SageMath. Our results show that current LLMs can succeed at the more accessible research-level questions, but still encounter significant difficulties on more challenging problems. Quantitatively, Grok-4 achieves the highest accuracy of 52% on final-answer subproblems, while GPT-5 obtains the best performance for proof generation, achieving a fully correct solution for 22% of problems. IMProofBench will continue to evolve as a dynamic benchmark in collaboration with the mathematical community, ensuring its relevance for evaluating the next generation of LLMs.