Matthew Butler

HC
h-index20
7papers
93citations
Novelty31%
AI Score39

7 Papers

CLDec 19, 2025Code
WRAVAL -- WRiting Assist eVALuation

Gabriel Benedict, Matthew Butler, Naved Merchant et al.

The emergence of Large Language Models (LLMs) has shifted language model evaluation toward reasoning and problem-solving tasks as measures of general intelligence. Small Language Models (SLMs) -- defined here as models under 10B parameters -- typically score 3-4 times lower than LLMs on these metrics. However, we demonstrate that these evaluations fail to capture SLMs' effectiveness in common industrial applications, such as tone modification tasks (e.g., funny, serious, professional). We propose an evaluation framework specifically designed to highlight SLMs' capabilities in non-reasoning tasks where predefined evaluation datasets don't exist. Our framework combines novel approaches in data generation, prompt-tuning, and LLM-based evaluation to demonstrate the potential of task-specific finetuning. This work provides practitioners with tools to effectively benchmark both SLMs and LLMs for practical applications, particularly in edge and private computing scenarios. Our implementation is available at: https://github.com/amazon-science/wraval.

MMMar 28, 2025
Using AI to Summarize US Presidential Campaign TV Advertisement Videos, 1952-2012

Adam Breuer, Bryce J. Dietrich, Michael H. Crespin et al.

This paper introduces the largest and most comprehensive dataset of US presidential campaign television advertisements, available in digital format. The dataset also includes machine-searchable transcripts and high-quality summaries designed to facilitate a variety of academic research. To date, there has been great interest in collecting and analyzing US presidential campaign advertisements, but the need for manual procurement and annotation led many to rely on smaller subsets. We design a large-scale parallelized, AI-based analysis pipeline that automates the laborious process of preparing, transcribing, and summarizing videos. We then apply this methodology to the 9,707 presidential ads from the Julian P. Kanter Political Commercial Archive. We conduct extensive human evaluations to show that these transcripts and summaries match the quality of manually generated alternatives. We illustrate the value of this data by including an application that tracks the genesis and evolution of current focal issue areas over seven decades of presidential elections. Our analysis pipeline and codebase also show how to use LLM-based tools to obtain high-quality summaries for other video datasets.

CRMar 13
FraudFox: Adaptable Fraud Detection in the Real World

Matthew Butler, Yi Fan, Christos Faloutsos

The proposed method (FraudFox) provides solutions to adversarial attacks in a resource constrained environment. We focus on questions like the following: How suspicious is `Smith', trying to buy \$500 shoes, on Monday 3am? How to merge the risk scores, from a handful of risk-assessment modules (`oracles') in an adversarial environment? More importantly, given historical data (orders, prices, and what-happened afterwards), and business goals/restrictions, which transactions, like the `Smith' transaction above, which ones should we `pass', versus send to human investigators? The business restrictions could be: `at most $x$ investigations are feasible', or `at most \$$y$ lost due to fraud'. These are the two research problems we focus on, in this work. One approach to address the first problem (`oracle-weighting'), is by using Extended Kalman Filters with dynamic importance weights, to automatically and continuously update our weights for each 'oracle'. For the second problem, we show how to derive an optimal decision surface, and how to compute the Pareto optimal set, to allow what-if questions. An important consideration is adaptation: Fraudsters will change their behavior, according to our past decisions; thus, we need to adapt accordingly. The resulting system, \method, is scalable, adaptable to changing fraudster behavior, effective, and already in \textbf{production} at Amazon. FraudFox augments a fraud prevention sub-system and has led to significant performance gains.

HCFeb 2, 2021
Technology Developments in Touch-Based Accessible Graphics: A Systematic Review of Research 2010-2020

Matthew Butler, Leona Holloway, Samuel Reinders et al.

This paper presents a systematic literature review of 292 publications from 97 unique venues on touch-based graphics for people who are blind or have low vision, from 2010 to mid-2020. It is the first review of its kind on touch-based accessible graphics. It is timely because it allows us to assess the impact of new technologies such as commodity 3D printing and low-cost electronics on the production and presentation of accessible graphics. As expected our review shows an increase in publications from 2014 that we can attribute to these developments. It also reveals the need to: broaden application areas, especially to the workplace; broaden end-user participation throughout the full design process; and conduct more in situ evaluation. This work is linked to an online living resource to be shared with the wider community.

HCSep 1, 2020
"Hey Model!" - Natural User Interactions and Agency in Accessible Interactive 3D Models

Samuel Reinders, Matthew Butler, Kim Marriott

While developments in 3D printing have opened up opportunities for improved access to graphical information for people who are blind or have low vision (BLV), they can provide only limited detailed and contextual information. Interactive 3D printed models (I3Ms) that provide audio labels and/or a conversational agent interface potentially overcome this limitation. We conducted a Wizard-of-Oz exploratory study to uncover the multi-modal interaction techniques that BLV people would like to use when exploring I3Ms, and investigated their attitudes towards different levels of model agency. These findings informed the creation of an I3M prototype of the solar system. A second user study with this model revealed a hierarchy of interaction, with BLV users preferring tactile exploration, followed by touch gestures to trigger audio labels, and then natural language to fill in knowledge gaps and confirm understanding.

HCMar 31, 2020
Tactile Presentation of Network Data: Text, Matrix or Diagram?

Yalong Yang, Kim Marriott, Matthew Butler et al.

Visualisations are commonly used to understand social, biological and other kinds of networks. Currently, we do not know how to effectively present network data to people who are blind or have low-vision (BLV). We ran a controlled study with 8 BLV participants comparing four tactile representations: organic node-link diagram, grid node-link diagram, adjacency matrix and braille list. We found that the node-link representations were preferred and more effective for path following and cluster identification while the matrix and list were better for adjacency tasks. This is broadly in line with findings for the corresponding visual representations.

DSOct 18, 2012
Creating a level playing field for all symbols in a discretization

Matthew Butler, Dimitar Kazakov

In time series analysis research there is a strong interest in discrete representations of real valued data streams. One approach that emerged over a decade ago and is still considered state-of-the-art is the Symbolic Aggregate Approximation algorithm. This discretization algorithm was the first symbolic approach that mapped a real-valued time series to a symbolic representation that was guaranteed to lower-bound Euclidean distance. The interest of this paper concerns the SAX assumption of data being highly Gaussian and the use of the standard normal curve to choose partitions to discretize the data. Though not necessarily, but generally, and certainly in its canonical form, the SAX approach chooses partitions on the standard normal curve that would produce an equal probability for each symbol in a finite alphabet to occur. This procedure is generally valid as a time series is normalized before the rest of the SAX algorithm is applied. However there exists a caveat to this assumption of equi-probability due to the intermediate step of Piecewise Aggregate Approximation (PAA). What we will show in this paper is that when PAA is applied the distribution of the data is indeed altered, resulting in a shrinking standard deviation that is proportional to the number of points used to create a segment of the PAA representation and the degree of auto-correlation within the series. Data that exhibits statistically significant auto-correlation is less affected by this shrinking distribution. As the standard deviation of the data contracts, the mean remains the same, however the distribution is no longer standard normal and therefore the partitions based on the standard normal curve are no longer valid for the assumption of equal probability.