Priya Pitre

CL
h-index6
3papers
3citations
Novelty52%
AI Score42

3 Papers

LGJan 30
Agentic Framework for Epidemiological Modeling

Rituparna Datta, Zihan Guan, Baltazar Espinoza et al.

Epidemic modeling is essential for public health planning, yet traditional approaches rely on fixed model classes that require manual redesign as pathogens, policies, and scenario assumptions evolve. We introduce EPIAGENT, an agentic framework that automatically synthesizes, calibrates, verifies, and refines epidemiological simulators by modeling disease progression as an iterative program synthesis problem. A central design choice is an explicit epidemiological flow graph intermediate representation that links scenario specifications to model structure and enables strong, modular correctness checks before code is generated. Verified flow graphs are then compiled into mechanistic models supporting interpretable parameter learning under physical and epidemiological constraints. Evaluation on epidemiological scenario case studies demonstrates that EPIAGENT captures complex growth dynamics and produces epidemiologically consistent counterfactual projections across varying vaccination and immune escape assumptions. Our results show that the agentic feedback loop prevents degeneration and significantly accelerates convergence toward valid models by mimicking professional expert workflows.

CLSep 29, 2025Code
BeyondBench: Benchmark-Free Evaluation of Reasoning in Language Models

Gaurav Srivastava, Aafiya Hussain, Zhenyu Bi et al.

Evaluating language models fairly is becoming harder as static benchmarks available on the internet risk contamination by training data. This makes it unclear whether models are truly reasoning or just recalling answers. In this paper, we introduce BeyondBench, an evaluation framework that avoids this problem by using algorithmic problem generation. Unlike traditional benchmarks that risk contamination from internet-scale training data, BeyondBench creates mathematically grounded problems on the fly, ensuring each test remains fresh and uncontaminated. Our framework covers 44 algorithmic tasks with a total of 117 variations, grouped into three difficulty levels: the Easy Suite (29 tasks) for basic arithmetic and statistics, the Medium Suite (5 tasks, 49 variations) for sequence patterns and reasoning, and the Hard Suite (10 tasks, 68 variations) tackling NP-complete and constraint satisfaction problems. Each task generates problems from a combinatorial space larger than 10^15 unique instances, with solutions verified deterministically by mathematical proofs. We evaluated 101 language models, including 85 open-source and 16 closed-source models, spanning sizes from 0.5B to 141B parameters and multiple quantization schemes. Our results show consistent reasoning deficiencies across model families, with performance degrading sharply as problem complexity increases from polynomial to exponential. In our Hard Suite evaluations, models such as Gemini-2.5-pro, Llama-3.3-70B, and Qwen2.5-72B achieved average accuracies of 56.38%, 26.91%, and 33.60%, respectively. Moreover, we observe that performance drops drastically without tool usage, with GPT-5, GPT-5-mini, and GPT-5-nano showing a decline of 16.81%, 28.05%, and 47.59% accuracy on the hard suite. Our leaderboard is publicly available at https://ctrl-gaurav.github.io/BeyondBench/

HCJun 4, 2024
ArguMentor: Augmenting User Experiences with Counter-Perspectives

Priya Pitre, Kurt Luther

We encounter arguments everyday in the form of social media posts, presidential debates, news articles, and even advertisements. A ubiquitous, influential example is the opinion piece (op-ed). Opinion pieces can provide valuable perspectives, but they often represent only one side of a story, which can make readers susceptible to confirmation bias and echo chambers. Exposure to different perspectives can help readers overcome these obstacles and form more robust, nuanced views on important societal issues. We designed ArguMentor, a human-AI collaboration system that highlights claims in opinion pieces, identifies counter-arguments for them using a LLM, and generates a context-based summary of based on current events. It further enhances user understanding through additional features like a Q\&A bot (that answers user questions pertaining to the text), DebateMe (an agent that users can argue any side of the piece with) and highlighting (where users can highlight a word or passage to get its definition or context). Our evaluation on news op-eds shows that participants can generate more arguments and counter-arguments and display higher critical thinking skills after engaging with the system. Further discussion highlights a more general need for this kind of a system.