Maxwell Shepherd

CV
h-index 45
3 papers
336 citations
Novelty 37%
AI Score 46

3 Papers

LG · Jan 24, 2025
Humanity's Last Exam

Long Phan, Alice Gatti, Ziwen Han et al. · amazon-science, apple-ml

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.

6.3 · CV · Apr 18
Frozen Vision Transformers for Dense Prediction on Small Datasets: A Case Study in Arrow Localization

Maxwell Shepherd

We present a system for automated detection, localization, and scoring of arrow punctures on 40 cm indoor archery target faces, trained on only 48 annotated photographs (5,084 punctures). Our pipeline combines three components: a color-based canonical rectification stage that maps perspective-distorted photographs into a standardized coordinate system where pixel distances correspond to known physical measurements; a frozen self-supervised vision transformer (DINOv3 ViT-L/16) paired with AnyUp guided feature upsampling to recover sub-millimeter spatial precision from 32 × 32 patch tokens; and lightweight CenterNet-style detection heads for arrow-center heatmap prediction. Only 3.8 M of 308 M total parameters are trainable. Across three cross-validation folds, we achieve a mean F1 score of 0.893 ± 0.011 and a mean localization error of 1.41 ± 0.06 mm, comparable to or better than prior fully-supervised approaches that require substantially more training data. An ablation study shows that the CenterNet offset regression head, typically essential for sub-pixel refinement, provides negligible detection improvement while degrading localization in our setting. This suggests that guided feature upsampling already resolves the spatial precision lost through patch tokenization. On downstream archery metrics, the system recovers per-image average arrow scores with a median error of 1.8% and group centroid positions to within a median of 4.00 mm. These results demonstrate that frozen foundation models with minimal task-specific adaptation offer a practical paradigm for dense prediction in small-data regimes.
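
To make the two headline metrics concrete, here is a minimal sketch of how an F1 score and a mean localization error in millimeters could be computed once detections live in the rectified coordinate system. It is not the paper's evaluation protocol: the function name `match_and_score`, the greedy nearest-pair matching, the `mm_per_px` scale, and the `tol_mm` tolerance are all assumptions made for illustration.

```python
import math


def match_and_score(pred, truth, mm_per_px, tol_mm):
    """Greedily match predicted to ground-truth centers.

    pred, truth: lists of (x, y) pixel coordinates in the rectified image.
    mm_per_px:   hypothetical physical scale of the rectified image.
    tol_mm:      hypothetical maximum distance for a match to count.
    Returns (f1, mean_error_mm).
    """
    # Collect every candidate pair that falls within the tolerance.
    pairs = []
    for i, (px, py) in enumerate(pred):
        for j, (tx, ty) in enumerate(truth):
            d_mm = math.hypot(px - tx, py - ty) * mm_per_px
            if d_mm <= tol_mm:
                pairs.append((d_mm, i, j))
    pairs.sort()  # closest pairs are matched first

    used_pred, used_truth, errors = set(), set(), []
    for d_mm, i, j in pairs:
        if i not in used_pred and j not in used_truth:
            used_pred.add(i)
            used_truth.add(j)
            errors.append(d_mm)

    tp = len(errors)
    fp = len(pred) - tp   # predictions with no ground-truth partner
    fn = len(truth) - tp  # ground-truth punctures that were missed
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    mean_error_mm = sum(errors) / tp if tp else float("nan")
    return f1, mean_error_mm


# Toy data: two good detections, one false positive, one miss.
pred = [(100, 100), (203, 200), (400, 400)]
truth = [(100, 104), (200, 200), (300, 300)]
f1, err = match_and_score(pred, truth, mm_per_px=0.5, tol_mm=5.0)
print(round(f1, 3), round(err, 2))
```

Because rectification fixes the physical scale, the pixel-to-millimeter conversion in this sketch is a single multiplication; a real evaluation would also need to specify how ties and duplicate detections are handled.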

2.2 · SI · Mar 24
Concurrent Streaming, Viewer Transfers, and Audience Loyalty in a Creator Ecosystem: A Minute-Level Longitudinal Study

Maxwell Shepherd

Live streaming platforms host interconnected communities of content creators whose audiences overlap and interact in ways that are poorly understood at fine temporal resolution. We present a descriptive longitudinal study of audience behavior within a creator ecosystem, analyzing 2.9 million minute-by-minute viewership observations across 7,762 livestreams from 18 affiliated channels over 3.3 years. We find that (1) concurrent streaming is associated with substantial raw per-stream audience decreases (14,377 to 6,057 viewers as concurrent stream count rises from 1 to 9), though hour-of-day controls reduce the residualized correlation to ρ = -0.165, indicating that scheduling confounds account for much of the observed drop; (2) algorithmically detected viewer transfer events achieve a median efficiency of approximately 50% across 3,243 candidate events; and (3) audience loyalty metrics (stability, competition resistance, retention, and floor ratio) vary substantially across creators within the same organization, with competition resistance ranging from 0.36 to 1.00, indicating that audience exclusivity is a creator-level rather than organization-level property. These findings provide practical benchmarks for creator organizations making scheduling, cross-promotion, and talent management decisions.
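
The idea behind a residualized correlation with hour-of-day controls can be sketched as follows: subtract each hour's mean from both variables, then correlate what remains. This is an illustration under stated assumptions, not the study's procedure. The helper names `residualize` and `pearson` are hypothetical, the toy data are invented, and Pearson correlation is used for simplicity (the abstract's ρ may denote a rank correlation).

```python
import math
from collections import defaultdict


def residualize(values, groups):
    """Subtract each group's mean from its members (here: hour of day)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented toy data: prime-time hours have few concurrent streams and
# large audiences; off-peak hours have many streams and small audiences.
hours = [20, 20, 20, 20, 4, 4, 4, 4]
concurrent = [1, 2, 1, 2, 8, 9, 8, 9]
viewers = [14000, 14000, 15000, 15000, 6000, 6000, 7000, 7000]

raw = pearson(concurrent, viewers)
controlled = pearson(residualize(concurrent, hours),
                     residualize(viewers, hours))
print(round(raw, 3), round(controlled, 3))
```

In this toy setup the strong raw negative correlation comes entirely from the hour of day, so it vanishes after residualization; in the study a weaker but nonzero association remains after the controls.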