CY Jan 15, 2023
Bike Frames: Understanding the Implicit Portrayal of Cyclists in the News
Xingmeng Zhao, Dan Schumacher, Sashank Nalluri et al.
Increasing cycling for transportation or recreation can boost health and reduce the environmental impacts of vehicles. However, news agencies' ideologies and reporting styles often influence public perception of cycling. For example, if news agencies over-report cycling accidents, people may come to perceive cycling as "dangerous," reducing the number of people who opt to cycle. Additionally, a decline in cycling can result in less government funding for safe infrastructure. In this paper, we develop a method for detecting how cyclists are perceived in news headlines. We introduce a new dataset called "Bike Frames" to accomplish this. The dataset consists of 31,480 news headlines and 1,500 annotations. Our focus is on analyzing 11,385 headlines from the United States. We also introduce the BikeFrame Chain-of-Code framework to predict cyclist perception, identify accident-related headlines, and determine fault. This framework uses pseudocode for precise logic and integrates news agency bias analysis for improved predictions over traditional chain-of-thought reasoning in large language models. Our method substantially outperforms other methods, and most importantly, we find that incorporating news bias information substantially impacts performance, improving the average F1 from .739 to .815. Finally, we perform a comprehensive case study on US-based news headlines, finding reporting differences between news agencies and cycling-specific websites as well as differences in reporting depending on the gender of cyclists. WARNING: This paper contains descriptions of accidents and death.
CL Mar 12
Prompting Underestimates LLM Capability for Time Series Classification
Dan Schumacher, Erfan Nourbakhsh, Rocky Slavin et al.
Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time series classification, raising doubts about whether they encode meaningful temporal structure. We show that this conclusion reflects limitations of prompt-based generation rather than the model's representational capacity by directly comparing prompt outputs with linear probes over the same internal representations. While zero-shot prompting performs near chance, linear probes improve average F1 from 0.15-0.26 to 0.61-0.67, often matching or exceeding specialized time series models. Layer-wise analyses further show that class-discriminative time series information emerges in early transformer layers and is amplified by visual and multimodal inputs. Together, these results demonstrate a systematic mismatch between what LLMs internally represent and what prompt-based evaluation reveals, leading current evaluations to underestimate their time series understanding.
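For readers unfamiliar with the term, a linear probe is a linear classifier trained on frozen representations. The sketch below illustrates the general idea with a softmax classifier fitted by gradient descent. It is not the paper's setup: the synthetic vectors stand in for hidden states, and the function names, dimensions, learning rate, and epoch count are all assumptions chosen for illustration.

```python
import math
import random

def scores(x, weights, biases):
    """Linear scores: one dot product plus bias per class."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, biases)]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_linear_probe(features, labels, num_classes, lr=0.2, epochs=100):
    """Fit a softmax classifier on frozen features with full-batch gradient descent."""
    n, dim = len(features), len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(scores(x, weights, biases))
            for c in range(num_classes):
                err = (probs[c] - (1.0 if c == y else 0.0)) / n
                grad_b[c] += err
                for j in range(dim):
                    grad_w[c][j] += err * x[j]
        for c in range(num_classes):
            biases[c] -= lr * grad_b[c]
            for j in range(dim):
                weights[c][j] -= lr * grad_w[c][j]
    return weights, biases

def predict(x, weights, biases):
    s = scores(x, weights, biases)
    return s.index(max(s))

# Synthetic stand-in for frozen hidden states (assumption, not real model output):
# each class has a fixed center, and samples are the center plus unit Gaussian noise.
NUM_CLASSES, DIM = 3, 6
centers = [[3.0 if j in (c, c + 3) else 0.0 for j in range(DIM)]
           for c in range(NUM_CLASSES)]

def make_split(rng, per_class):
    feats, labs = [], []
    for c in range(NUM_CLASSES):
        for _ in range(per_class):
            feats.append([mu + rng.gauss(0.0, 1.0) for mu in centers[c]])
            labs.append(c)
    return feats, labs

rng = random.Random(0)
train_x, train_y = make_split(rng, per_class=80)
test_x, test_y = make_split(rng, per_class=20)

weights, biases = train_linear_probe(train_x, train_y, NUM_CLASSES)
correct = sum(predict(x, weights, biases) == y for x, y in zip(test_x, test_y))
accuracy = correct / len(test_y)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

In a real probing experiment the feature vectors would be extracted from a chosen layer of the model, with the model's weights left untouched; only the probe's weights are trained.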
CL Apr 2, 2024
Team UTSA-NLP at SemEval 2024 Task 5: Prompt Ensembling for Argument Reasoning in Civil Procedures with GPT4
Dan Schumacher, Anthony Rios
In this paper, we present our system for SemEval 2024 Task 5, the Legal Argument Reasoning Task in Civil Procedure challenge. Legal argument reasoning is an essential skill that all law students must master. Moreover, it is important to develop natural language processing solutions that can reason about a question given terse domain-specific contextual information. Our system explores a prompt-based solution using GPT4 to reason over legal arguments. We also evaluate an ensemble of prompting strategies, including chain-of-thought reasoning and in-context learning. Overall, our system achieves a Macro F1 of .8095 on the validation dataset and .7315 (5th out of 21 teams) on the final test set. Code for this project is available at https://github.com/danschumac1/CivilPromptReasoningGPT4.
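The abstract does not say how the outputs of the different prompting strategies are combined. One common option is majority voting, sketched below together with a macro F1 computation. The voting rule, the function names, the prompt names, and the toy labels are all assumptions made for illustration; they are not taken from the paper or its repository, and the numbers are not the paper's results.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label predicted by the most prompts (an odd voter count avoids ties)."""
    return Counter(votes).most_common(1)[0][0]

def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over every label seen in gold or predicted."""
    labels = sorted(set(gold) | set(predicted))
    per_class = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)

# Toy binary labels (1 = candidate answer is correct). Entirely made up.
gold = [1, 0, 1, 0]
prompt_outputs = {
    "zero_shot": [1, 1, 0, 0],
    "chain_of_thought": [1, 0, 1, 1],
    "few_shot": [0, 0, 1, 0],
}

# Vote question by question across the three hypothetical prompts.
ensemble = [majority_vote(votes) for votes in zip(*prompt_outputs.values())]

for name, preds in prompt_outputs.items():
    print(name, round(macro_f1(gold, preds), 3))
print("ensemble", round(macro_f1(gold, ensemble), 3))
```

In this toy case each prompt errs on a different question, so the vote recovers every gold label; real ensembles gain less when the prompts make correlated mistakes.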
CV Sep 27, 2024
Charting the Future: Using Chart Question-Answering for Scalable Evaluation of LLM-Driven Data Visualizations
James Ford, Xingmeng Zhao, Dan Schumacher et al.
We propose a novel framework that leverages Visual Question Answering (VQA) models to automate the evaluation of LLM-generated data visualizations. Traditional evaluation methods often rely on human judgment, which is costly and unscalable, or focus solely on data accuracy, neglecting the effectiveness of visual communication. By employing VQA models, we assess data representation quality and the general communicative clarity of charts. Experiments were conducted using two leading VQA benchmark datasets, ChartQA and PlotQA, with visualizations generated by OpenAI's GPT-3.5 Turbo and Meta's Llama 3.1 70B-Instruct models. Our results indicate that LLM-generated charts do not match the accuracy of the original non-LLM-generated charts based on VQA performance measures. Moreover, while our results demonstrate that few-shot prompting significantly boosts the accuracy of chart generation, considerable progress remains to be made before LLMs can fully match the precision of human-generated graphs. This underscores the importance of our work, which expedites the research process by enabling rapid iteration without the need for human annotation, thus accelerating advancements in this field.
CL Oct 16, 2025
Speculative Model Risk in Healthcare AI: Using Storytelling to Surface Unintended Harms
Xingmeng Zhao, Dan Schumacher, Veronica Rammouz et al.
Artificial intelligence (AI) is rapidly transforming healthcare, enabling fast development of tools like stress monitors, wellness trackers, and mental health chatbots. However, rapid and low-barrier development can introduce risks of bias, privacy violations, and unequal access, especially when systems ignore real-world contexts and diverse user needs. Many recent methods use AI to detect risks automatically, but this can reduce human engagement in understanding how harms arise and who they affect. We present a human-centered framework that generates user stories and supports multi-agent discussions to help people think creatively about potential benefits and harms before deployment. In a user study, participants who read stories recognized a broader range of harms, distributing their responses more evenly across all 13 harm types. In contrast, those who did not read stories focused primarily on privacy and well-being (58.3%). Our findings show that storytelling helped participants speculate about a broader range of harms and benefits and think more creatively about AI's impact on users.
CL Jun 27, 2024
RASTeR: Robust, Agentic, and Structured Temporal Reasoning
Dan Schumacher, Fatemeh Haji, Tara Grey et al.
Temporal question answering (TQA) remains a challenge for large language models (LLMs), particularly when retrieved content may be irrelevant, outdated, or temporally inconsistent. This is especially critical in applications like clinical event ordering and policy tracking, which require reliable temporal reasoning even under noisy or outdated information. To address this challenge, we introduce RASTeR (Robust, Agentic, and Structured Temporal Reasoning), a prompting framework that separates context evaluation from answer generation. RASTeR first assesses the relevance and temporal coherence of the retrieved context, then constructs a temporal knowledge graph (TKG) to better facilitate reasoning. When inconsistencies are detected, RASTeR selectively corrects or discards context before generating an answer. Across multiple datasets and LLMs, RASTeR consistently improves robustness. (Some TQA work defines robustness as handling diverse temporal phenomena; here, we define it as the ability to answer correctly despite suboptimal context.) We further validate our approach through a "needle-in-the-haystack" study, in which relevant context is buried among distractors. With forty distractors, RASTeR achieves 75% accuracy, over 12% ahead of the runner-up.