Jonathan Roberts

CV
h-index52
17papers
758citations
Novelty37%
AI Score42

17 Papers

CVNov 24, 2023Code
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs

Jonathan Roberts, Timo Lüddecke, Rehan Sheikh et al. · cambridge

Multimodal large language models (MLLMs) have shown remarkable capabilities across a broad range of tasks but their knowledge and abilities in the geographic and geospatial domains are yet to be explored, despite potential wide-ranging benefits to navigation, environmental research, urban development, and disaster response. We conduct a series of experiments exploring various vision capabilities of MLLMs within these domains, particularly focusing on the frontier model GPT-4V, and benchmark its performance against open-source counterparts. Our methodology involves challenging these models with a small-scale geographic benchmark consisting of a suite of visual tasks, testing their abilities across a spectrum of complexity. The analysis uncovers not only where such models excel, including instances where they outperform humans, but also where they falter, providing a balanced view of their capabilities in the geographic domain. To enable the comparison and evaluation of future models, our benchmark will be publicly released.

CVApr 23, 2023
SATIN: A Multi-Task Metadataset for Classifying Satellite Imagery using Vision-Language Models

Jonathan Roberts, Kai Han, Samuel Albanie · cambridge

Interpreting remote sensing imagery enables numerous downstream applications ranging from land-use planning to deforestation monitoring. Robustly classifying this data is challenging due to the Earth's geographic diversity. While many distinct satellite and aerial image classification datasets exist, there is yet to be a benchmark curated that suitably covers this diversity. In this work, we introduce SATellite ImageNet (SATIN), a metadataset curated from 27 existing remotely sensed datasets, and comprehensively evaluate the zero-shot transfer classification capabilities of a broad range of vision-language (VL) models on SATIN. We find SATIN to be a challenging benchmark-the strongest method we evaluate achieves a classification accuracy of 52.0%. We provide a $\href{https://satinbenchmark.github.io}{\text{public leaderboard}}$ to guide and track the progress of VL models in this important domain.

CVAug 21, 2024
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

Jonathan Roberts, Kai Han, Samuel Albanie · cambridge

Large multimodal models (LMMs) have exhibited proficiencies across many visual tasks. Although numerous well-known benchmarks exist to evaluate model performance, they increasingly have insufficient headroom. As such, there is a pressing need for a new generation of benchmarks challenging enough for the next generation of LMMs. One area that LMMs show potential is graph analysis, specifically, the tasks an analyst might typically perform when interpreting figures such as estimating the mean, intercepts or correlations of functions and data series. In this work, we introduce GRAB, a graph analysis benchmark, fit for current and future frontier LMMs. Our benchmark is predominantly synthetic, ensuring high-quality, noise-free questions. GRAB is comprised of 3284 questions, covering five tasks and 23 graph properties. We evaluate 20 LMMs on GRAB, finding it to be a challenging benchmark, with the highest performing model attaining a score of just 21.0%. Finally, we conduct various ablations to investigate where the models succeed and struggle. We release GRAB and a lightweight GRAB-Lite to encourage progress in this important, growing domain.

CLJan 16
How Long Is a Piece of String? A Brief Empirical Analysis of Tokenizers

Jonathan Roberts, Kai Han, Samuel Albanie · cambridge

Frontier LLMs are increasingly utilised across academia, society and industry. A commonly used unit for comparing models, their inputs and outputs, and estimating inference pricing is the token. In general, tokens are used as a stable currency, assumed to be broadly consistent across tokenizers and contexts, enabling direct comparisons. However, tokenization varies significantly across models and domains of text, making naive interpretation of token counts problematic. We quantify this variation by providing a comprehensive empirical analysis of tokenization, exploring the compression of sequences to tokens across different distributions of textual data. Our analysis challenges commonly held heuristics about token lengths, finding them to be overly simplistic. We hope the insights of our study add clarity and intuition toward tokenization in contemporary LLMs.

CLNov 7, 2024Code
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

Jonathan Roberts, Kai Han, Samuel Albanie · cambridge

As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window. Strikingly, we find that many models are remarkably threadsafe: capable of simultaneously following multiple threads without significant loss in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows. Our study also highlights the important point that token counts from different tokenizers should not be directly compared -- they often correspond to substantially different numbers of written characters. We release our code and long-context experimental data.

LGJan 24, 2025
Humanity's Last Exam

Long Phan, Alice Gatti, Ziwen Han et al. · amazon-science, apple-ml

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.

AIAug 29, 2013Code
Learning-Based Procedural Content Generation

Jonathan Roberts, Ke Chen

Procedural content generation (PCG) has recently become one of the hottest topics in computational intelligence and AI game researches. Among a variety of PCG techniques, search-based approaches overwhelmingly dominate PCG development at present. While SBPCG leads to promising results and successful applications, it poses a number of challenges ranging from representation to evaluation of the content being generated. In this paper, we present an alternative yet generic PCG framework, named learning-based procedure content generation (LBPCG), to provide potential solutions to several challenging problems in existing PCG techniques. By exploring and exploiting information gained in game development and public beta test via data-driven learning, our framework can generate robust content adaptable to end-user or target players on-line with minimal interruption to their experience. Furthermore, we develop enabling techniques to implement the various models required in our framework. For a proof of concept, we have developed a prototype based on the classic open source first-person shooter game, Quake. Simulation results suggest that our framework is promising in generating quality content.

CVMay 14, 2024
SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation

Jonathan Roberts, Kai Han, Neil Houlsby et al. · cambridge

Large multimodal models (LMMs) have proven flexible and generalisable across many tasks and fields. Although they have strong potential to aid scientific research, their capabilities in this domain are not well characterised. A key aspect of scientific research is the ability to understand and interpret figures, which serve as a rich, compressed source of complex information. In this work, we present SciFIBench, a scientific figure interpretation benchmark consisting of 2000 questions split between two tasks across 8 categories. The questions are curated from arXiv paper figures and captions, using adversarial filtering to find hard negatives and human verification for quality control. We evaluate 28 LMMs on SciFIBench, finding it to be a challenging benchmark. Finally, we investigate the alignment and reasoning faithfulness of the LMMs on augmented question sets from our benchmark. We release SciFIBench to encourage progress in this domain.

CVFeb 13, 2025
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models

Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma et al. · cambridge, oxford

Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench-a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.

CLDec 18, 2024
GAMEBoT: Transparent Assessment of LLM Reasoning in Games

Wenye Lin, Jonathan Roberts, Yunhan Yang et al. · cambridge

Large Language Models (LLMs) are increasingly deployed in real-world applications that demand complex reasoning. To track progress, robust benchmarks are required to evaluate their capabilities beyond superficial pattern recognition. However, current LLM reasoning benchmarks often face challenges such as insufficient interpretability, performance saturation or data contamination. To address these challenges, we introduce GAMEBoT, a gaming arena designed for rigorous and transparent assessment of LLM reasoning capabilities. GAMEBoT decomposes complex reasoning in games into predefined modular subproblems. This decomposition allows us to design a suite of Chain-of-Thought (CoT) prompts that leverage domain knowledge to guide LLMs in addressing these subproblems before action selection. Furthermore, we develop a suite of rule-based algorithms to generate ground truth for these subproblems, enabling rigorous validation of the LLMs' intermediate reasoning steps. This approach facilitates evaluation of both the quality of final actions and the accuracy of the underlying reasoning process. GAMEBoT also naturally alleviates the risk of data contamination through dynamic games and head-to-head LLM competitions. We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics. Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts. Project page: https://visual-ai.github.io/gamebot

CLMay 30, 2023
GPT4GEO: How a Language Model Sees the World's Geography

Jonathan Roberts, Timo Lüddecke, Sowmen Das et al.

Large language models (LLMs) have shown remarkable capabilities across a broad range of tasks involving question answering and the generation of coherent text and code. Comprehensively understanding the strengths and weaknesses of LLMs is beneficial for safety, downstream applications and improving performance. In this work, we investigate the degree to which GPT-4 has acquired factual geographic knowledge and is capable of using this knowledge for interpretative reasoning, which is especially important for applications that involve geographic data, such as geospatial analysis, supply chain management, and disaster response. To this end, we design and conduct a series of diverse experiments, starting from factual tasks such as location, distance and elevation estimation to more complex questions such as generating country outlines and travel networks, route finding under constraints and supply chain analysis. We provide a broad characterisation of what GPT-4 (without plugins or Internet access) knows about the world, highlighting both potentially surprising capabilities but also limitations.

ROJun 10, 2021
3D Semantic Mapping from Arthroscopy using Out-of-distribution Pose and Depth and In-distribution Segmentation Training

Yaqub Jonmohamadi, Shahnewaz Ali, Fengbei Liu et al.

Minimally invasive surgery (MIS) has many documented advantages, but the surgeon's limited visual contact with the scene can be problematic. Hence, systems that can help surgeons navigate, such as a method that can produce a 3D semantic map, can compensate for the limitation above. In theory, we can borrow 3D semantic mapping techniques developed for robotics, but this requires finding solutions to the following challenges in MIS: 1) semantic segmentation, 2) depth estimation, and 3) pose estimation. In this paper, we propose the first 3D semantic mapping system from knee arthroscopy that solves the three challenges above. Using out-of-distribution non-human datasets, where pose could be labeled, we jointly train depth+pose estimators using selfsupervised and supervised losses. Using an in-distribution human knee dataset, we train a fully-supervised semantic segmentation system to label arthroscopic image pixels into femur, ACL, and meniscus. Taking testing images from human knees, we combine the results from these two systems to automatically create 3D semantic maps of the human knee. The result of this work opens the pathway to the generation of intraoperative 3D semantic mapping, registration with pre-operative data, and robotic-assisted arthroscopy

IVMar 3, 2021
Arthroscopic Multi-Spectral Scene Segmentation Using Deep Learning

Shahnewaz Ali, Yaqub Jonmohamadi, Yu Takeda et al.

Knee arthroscopy is a minimally invasive surgical (MIS) procedure which is performed to treat knee-joint ailment. Lack of visual information of the surgical site obtained from miniaturized cameras make this surgical procedure more complex. Knee cavity is a very confined space; therefore, surgical scenes are captured at close proximity. Insignificant context of knee atlas often makes them unrecognizable as a consequence unintentional tissue damage often occurred and shows a long learning curve to train new surgeons. Automatic context awareness through labeling of the surgical site can be an alternative to mitigate these drawbacks. However, from the previous studies, it is confirmed that the surgical site exhibits several limitations, among others, lack of discriminative contextual information such as texture and features which drastically limits this vision task. Additionally, poor imaging conditions and lack of accurate ground-truth labels are also limiting the accuracy. To mitigate these limitations of knee arthroscopy, in this work we proposed a scene segmentation method that successfully segments multi structures.

HCJul 30, 2020
Composition and Configuration Patterns in Multiple-View Visualizations

Xi Chen, Wei Zeng, Yanna Lin et al.

Multiple-view visualization (MV) is a layout design technique often employed to help users see a large number of data attributes and values in a single cohesive representation. Because of its generalizability, the MV design has been widely adopted by the visualization community to help users examine and interact with large, complex, and high-dimensional data. However, although ubiquitous, there has been little work to categorize and analyze MVs in order to better understand its design space. As a result, there has been little to no guideline in how to use the MV design effectively. In this paper, we present an in-depth study of how MVs are designed in practice. We focus on two fundamental measures of multiple-view patterns: composition, which quantifies what view types and how many are there; and configuration, which characterizes spatial arrangement of view layouts in the display space. We build a new dataset containing 360 images of MVs collected from IEEE VIS, EuroVis, and PacificVis publications 2011 to 2019, and make fine-grained annotations of view types and layouts for these visualization images. From this data we conduct composition and configuration analyses using quantitative metrics of term frequency and layout topology. We identify common practices around MVs, including relationship of view types, popular view layouts, and correlation between view types and layouts. We combine the findings into a MV recommendation system, providing interactive tools to explore the design space, and support example-based design.

ROSep 6, 2019
Real-time Joint Motion Analysis and Instrument Tracking for Robot-Assisted Orthopaedic Surgery

Mario Strydom, Artur Banach, Liao Wu et al.

Robotic-assisted orthopaedic surgeries demand accurate, automated leg manipulation for improved spatial accuracy to reduce iatrogenic damage. In this study, we propose novel rigid body designs and an optical tracking volume setup for tracking of the femur, tibia and surgical instruments. Anatomical points inside the leg are measured using Computed Tomography with an accuracy of 0.3mm. Combined with kinematic modelling, we can express these points relative to any frame and across joints to sub-millimetre accuracy. It enables the setup of vectors on the mechanical axes of the femur and tibia for kinematic analysis. Cadaveric experiments are used to verify the tracking of internal anatomies and joint motion analysis. The proposed integrated solution is a first step in the automation of leg manipulation and can be used as a ground-truth for future robot-assisted orthopaedic research.

ROMar 6, 2019
Optimal Dexterity for a Snake-like Surgical Manipulator using Patient-specific Task-space Constraints in a Computational Design Algorithm

Andrew Razjigaev, Ajay K. Pandey, Jonathan Roberts et al.

Tendon-driven snake-like arms have been used to create highly dexterous continuum robots so that they can bend around anatomical obstacles to access clinical targets. In this paper, we propose a design algorithm for developing patient-specific surgical continuum manipulators optimized for oriental dexterity constrained by task-space obstacles. The algorithm uses a sampling-based approach to finding the dexterity distribution in the workspace discretized by voxels. The oriental dexterity measured in the region of interest in the task-space formed a fitness function to be optimized through differential evolution. This was implemented in the design of a tendon-driven manipulator for knee arthroscopy. The results showed a feasible design that achieves significantly better dexterity than a rigid tool. This highlights the potential of the proposed method to be used in the process of designing dexterous surgical manipulators in the field.

ROFeb 1, 2019
Geometric interpretation of the general POE model for a serial-link robot via conversion into D-H parameterization

Liao Wu, Ross Crawford, Jonathan Roberts

While Product of Exponentials (POE) formula has been gaining increasing popularity in modeling the kinematics of a serial-link robot, the Denavit-Hartenberg (D-H) notation is still the most widely used due to its intuitive and concise geometric interpretation of the robot. This paper has developed an analytical solution to automatically convert a POE model into a D-H model for a robot with revolute, prismatic, and helical joints, which are the complete set of three basic one degree of freedom lower pair joints for constructing a serial-link robot. The conversion algorithm developed can be used in applications such as calibration where it is necessary to convert the D-H model to the POE model for identification and then back to the D-H model for compensation. The equivalence of the two models proved in this paper also benefits the analysis of the identifiability of the kinematic parameters. It is found that the maximum number of identifiable parameters in a general POE model is 5h+4r +2t +n+6 where h, r, t, and n stand for the number of helical, revolute, prismatic, and general joints, respectively. It is also suggested that the identifiability of the base frame and the tool frame in the D-H model is restricted rather than the arbitrary six parameters as assumed previously.