Steven Kelk

h-index20

4papers

21citations

Novelty59%

AI Score43

Ranked #53,838 of 194,257 authors (top 28%)#3,225 in AI (top 26%)

4 Papers

7.8DSMar 21

Split-or-decompose: Improved FPT branching algorithms for maximum agreement forests

David Mestel, Steven Chaplick, Steven Kelk et al.

Phylogenetic trees are leaf-labelled trees used to model the evolution of species. In practice it is not uncommon to obtain two topologically distinct trees for the same set of species, and this motivates the use of distance measures to quantify dissimilarity. A well-known measure is the maximum agreement forest (MAF): a minimum-size partition of the leaf labels which splits both trees into the same set of disjoint, leaf-labelled subtrees (up to isomorphism after suppressing degree-2 vertices). Computing such a MAF is NP-hard and so considerable effort has been invested in finding FPT algorithms, parameterised by $k$, the number of components of a MAF. The state of the art has been unchanged since 2015, with running times of $O^*(3^k)$ for unrooted trees and $O^*(2.3431^k)$ for rooted trees. In this work we present improved algorithms for both the unrooted and rooted cases, with runtimes $O^*(2.846^k)$ and $O^*(2.3391^k)$ respectively. The key to our improvement is a novel branching strategy in which we show that any overlapping components obtained on the way to a MAF can be `split' by a branching rule with favourable branching factor, and then the problem can be decomposed into disjoint subproblems to be solved separately. We expect that this technique may be more widely applicable to other problems in algorithmic phylogenetics.

16.3CLSep 30, 2025

IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation

Johannes Schmitt, Gergely Bérczi, Jasper Dekoninck et al.

As the mathematical capabilities of large language models (LLMs) improve, it becomes increasingly important to evaluate their performance on research-level tasks at the frontier of mathematical knowledge. However, existing benchmarks are limited, as they focus solely on final-answer questions or high-school competition problems. To address this gap, we introduce IMProofBench, a private benchmark consisting of 39 peer-reviewed problems developed by expert mathematicians. Each problem requires a detailed proof and is paired with subproblems that have final answers, supporting both an evaluation of mathematical reasoning capabilities by human experts and a large-scale quantitative analysis through automated grading. Furthermore, unlike prior benchmarks, the evaluation setup simulates a realistic research environment: models operate in an agentic framework with tools like web search for literature review and mathematical software such as SageMath. Our results show that current LLMs can succeed at the more accessible research-level questions, but still encounter significant difficulties on more challenging problems. Quantitatively, Grok-4 achieves the highest accuracy of 52% on final-answer subproblems, while GPT-5 obtains the best performance for proof generation, achieving a fully correct solution for 22% of problems. IMProofBench will continue to evolve as a dynamic benchmark in collaboration with the mathematical community, ensuring its relevance for evaluating the next generation of LLMs.

8.5AIMay 31, 2019

Foundations of Digital Archæoludology

Cameron Browne, Dennis J. N. J. Soemers, Éric Piette et al.

Digital Archaeoludology (DAL) is a new field of study involving the analysis and reconstruction of ancient games from incomplete descriptions and archaeological evidence using modern computational techniques. The aim is to provide digital tools and methods to help game historians and other researchers better understand traditional games, their development throughout recorded human history, and their relationship to the development of human culture and mathematical knowledge. This work is being explored in the ERC-funded Digital Ludeme Project. The aim of this inaugural international research meeting on DAL is to gather together leading experts in relevant disciplines - computer science, artificial intelligence, machine learning, computational phylogenetics, mathematics, history, archaeology, anthropology, etc. - to discuss the key themes and establish the foundations for this new field of research, so that it may continue beyond the lifetime of its initiating project.

1.0LGJan 26, 2019

Discovery of Important Subsequences in Electrocardiogram Beats Using the Nearest Neighbour Algorithm

Ricards Marcinkevics, Steven Kelk, Carlo Galuzzi et al.

The classification of time series data is a well-studied problem with numerous practical applications, such as medical diagnosis and speech recognition. A popular and effective approach is to classify new time series in the same way as their nearest neighbours, whereby proximity is defined using Dynamic Time Warping (DTW) distance, a measure analogous to sequence alignment in bioinformatics. However, practitioners are not only interested in accurate classification, they are also interested in why a time series is classified a certain way. To this end, we introduce here the problem of finding a minimum length subsequence of a time series, the removal of which changes the outcome of the classification under the nearest neighbour algorithm with DTW distance. Informally, such a subsequence is expected to be relevant for the classification and can be helpful for practitioners in interpreting the outcome. We describe a simple but optimized implementation for detecting these subsequences and define an accompanying measure to quantify the relevance of every time point in the time series for the classification. In tests on electrocardiogram data we show that the algorithm allows discovery of important subsequences and can be helpful in detecting abnormalities in cardiac rhythms distinguishing sick from healthy patients.